## PLINK file creation - batch 8

To run SNeP (for long-term Ne estimates) and an alternative to STRUCTURE, I need the PLINK `.ped` and `.map` files. This notebook contains all the steps required to generate them from a final, filtered genepop file. For the PLINK file formats, I referenced the [GWASpi webpage](http://www.gwaspi.org/?page_id=145) and [this file format reference](https://www.cog-genomics.org/plink2/formats#map).

<br>
<br>
Steps for PLINK file creation: 
1. Align final loci to a genome using Bowtie 2. --- completed in [this notebook](https://github.com/mfisher5/PCod-Korea-repo/blob/master/notebooks/Batch%208%20-%20Alignment%20to%20Atlantic%20cod%20Genome.ipynb)
2. Use Bowtie 2 alignment output to create a `.map` file
3. Use PGD Spider to create a `.ped` file from genepop

<br>

### Creation of .map file

A standard `.map` file consists of 4 columns: 
- chromosome 
- marker ID 
- genetic distance (i.e. position in morgans or centimorgans)
- physical position (i.e. base pair coordinate)

The `.sam` file that is output from Bowtie 2 consists (after the header lines) of 11 mandatory columns:
- query template name
- flag
- ref sequence name
- leftmost mapping position
- mapping quality
- CIGAR string
- reference name of mate
- position of mate
- template length
- segment sequence
- ASCII of Phred-scaled base quality + 33


So to create a `.map` file, I need to extract and reorder the following columns from the `.sam` file (column numbers are 0-based, matching the Python list indices used in the code):
1. Column [2] = "chromosome" (in case of Atlantic cod genome, a linkage group)
2. Column [0] = "marker ID"
3. *not included* = "genetic distance"; will use "dummy" value of 0
4. Column [3] = "physical position"
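
As a worked example of this column mapping, the sketch below converts a single SAM alignment record into a `.map` line. The record itself and the helper name `sam_to_map_line` are made up for illustration, and it assumes query names of the form `Locus_<ID>`.

```python
def sam_to_map_line(sam_line):
    """Convert one SAM alignment record into a tab-delimited .map line."""
    fields = sam_line.rstrip("\n").split("\t")
    chromosome = fields[2]               # reference sequence name (linkage group)
    marker_id = fields[0].split("_")[1]  # "Locus_1234" -> "1234" (assumed naming)
    position = fields[3]                 # leftmost mapping position
    # Genetic distance is not available, so a placeholder of 0 is written.
    return "\t".join([chromosome, marker_id, "0", position])


# Hypothetical record: locus 1234 aligned to linkage group LG03 at position 456789.
example = "\t".join(["Locus_1234", "0", "LG03", "456789", "42", "5M",
                     "*", "0", "0", "ACGTA", "IIIII"])
map_line = sam_to_map_line(example)
print(map_line)
```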

In [1]:
cd ../analyses

/mnt/hgfs/PCod-Korea-repo/analyses


In [8]:
## open files
infile = open("batch_8_filtered_bowtieACod_filtered10MQ.sam", "r")
outfile = open("batch_8_filtered_bowtieACod_filtered10MQ.map", "w")

## write content of .sam file to .map output file, skipping the header lines (which start with "@")
## --- note that I added "Locus" in front of my loci names for the bowtie alignment, and now need to strip that out
n_loci = 0

for line in infile:
    if line.startswith("@"):
        continue
    linelist = line.strip().split()
    outfile.write(linelist[2] + "\t" + linelist[0].split("_")[1] + "\t0\t" + linelist[3] + "\n")
    n_loci += 1

infile.close()
outfile.close()

print "Wrote ", n_loci, " loci into new .map file."

Wrote  4031  loci into new .map file.
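
A quick sanity check on the new `.map` file can be sketched as follows: confirm that every line has exactly 4 fields and tally the markers per linkage group. The function name `summarize_map` and the demo lines are hypothetical. With the real file, the open file handle could be passed in place of the list.

```python
from collections import Counter


def summarize_map(lines):
    """Check that every .map line has 4 fields; count markers per chromosome."""
    per_chrom = Counter()
    for line_number, line in enumerate(lines, start=1):
        fields = line.split()
        if not fields:
            continue  # ignore blank lines
        if len(fields) != 4:
            raise ValueError("line %d has %d fields, expected 4"
                             % (line_number, len(fields)))
        per_chrom[fields[0]] += 1
    return per_chrom


# Hypothetical .map content, used here only to demonstrate the check.
demo_lines = ["LG01\t101\t0\t5000\n", "LG01\t102\t0\t9100\n", "LG02\t205\t0\t330\n"]
counts = summarize_map(demo_lines)
print(sorted(counts.items()))
```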


<br>
<br>

### Creation of .ped file

The columns of a `.ped` file are:
- family ID
- sample ID
- paternal ID
- maternal ID
- sex
- phenotype value of individual
- genotypes (space or tab delimited; each allele gets its own column)
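
To make this layout concrete, here is a minimal sketch that splits one `.ped` line into its six leading fields and pairs up the allele columns, one pair per locus. The sample line and the function name `parse_ped_line` are invented for illustration.

```python
def parse_ped_line(line):
    """Split a .ped line into its six leading fields and per-locus genotypes."""
    fields = line.split()
    if len(fields) < 6:
        raise ValueError("a .ped line needs at least 6 leading fields")
    alleles = fields[6:]
    if len(alleles) % 2 != 0:
        raise ValueError("odd number of allele columns")
    # Pair consecutive allele columns: one (allele1, allele2) tuple per locus.
    genotypes = list(zip(alleles[0::2], alleles[1::2]))
    return {
        "family_id": fields[0],
        "sample_id": fields[1],
        "paternal_id": fields[2],
        "maternal_id": fields[3],
        "sex": fields[4],
        "phenotype": fields[5],
        "genotypes": genotypes,
    }


# Hypothetical individual with three loci; "0 0" marks a missing genotype.
example = "POP1 IND_01 0 0 0 -9 A A C T 0 0"
result = parse_ped_line(example)
print(result["sample_id"], result["genotypes"])
```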

I don't actually have any family ID information, nor will I be adding in the sex of each individual (I have this, but only as metadata in the original sampling spreadsheets).

I made the `.ped` file using PGD Spider. My `.spid` file for that conversion is: 

____________________________

`# spid-file generated: Tue Nov 28 14:45:11 PST 2017`

`# GENEPOP Parser questions`
<br>
PARSER_FORMAT=GENEPOP

`# Enter the size of the repeated motif (same for all loci: one number; different: comma separated list (e.g.: 2,2,3,2):`
<br>
GENEPOP_PARSER_REPEAT_SIZE_QUESTION=

`# Select the type of the data:`
<br>
GENEPOP_PARSER_DATA_TYPE_QUESTION=SNP

`# How are Microsat alleles coded?`
<br>
GENEPOP_PARSER_MICROSAT_CODING_QUESTION=REPEATS

`# PED Writer questions`
<br>
WRITER_FORMAT=PED

`# Save MAP file`
<br>
PED_WRITER_MAP_FILE_QUESTION=

`# Replacement character for allele encoded as 0 (0 encodes for missing data in PED):`
<br>
PED_WRITER_ZERO_CHAR_QUESTION=

`# Specify the locus/locus combination you want to write to the PED file:`
<br>
PED_WRITER_LOCUS_COMBINATION_QUESTION=

`# Do you want to save an additional MAP file with loci information?`
<br>
PED_WRITER_MAP_QUESTION=false
_______________________________


**NOTE:** when PGD Spider makes the `.ped` file out of a genepop file, it puts the population ID into the `family ID` column. I replaced it with `0` using the code below.

In [1]:
pwd

u'/mnt/hgfs/PCod-Korea-repo/notebooks'

In [2]:
cd ../analyses

/mnt/hgfs/PCod-Korea-repo/analyses


In [5]:
infile = open("batch_8_filtered_Admixture_input.ped", "r")
outfile = open("batch_8_filtered_Admixture_input_nofam.ped", "w")

for line in infile:
    linelist = line.strip().split()
    newline = " ".join(linelist[1:])
    outfile.write("0 " + newline + "\n")
infile.close()
outfile.close()
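
Because PLINK reads the `.ped` and `.map` files together, each `.ped` line should hold the 6 leading fields plus two allele columns for every marker listed in the `.map`. The sketch below checks that relationship. The function name `check_ped_against_map` and the demo data are hypothetical. It only compares counts, so it would not catch markers listed in a different order in the two files.

```python
def check_ped_against_map(ped_lines, map_lines):
    """Return the sample count after confirming that each .ped line has
    6 + 2 * (number of markers in the .map) fields."""
    n_markers = sum(1 for line in map_lines if line.strip())
    expected = 6 + 2 * n_markers
    n_samples = 0
    for line_number, line in enumerate(ped_lines, start=1):
        fields = line.split()
        if not fields:
            continue  # ignore blank lines
        if len(fields) != expected:
            raise ValueError("ped line %d has %d fields, expected %d"
                             % (line_number, len(fields), expected))
        n_samples += 1
    return n_samples


# Hypothetical data: two markers and two individuals.
demo_map = ["LG01\t101\t0\t5000\n", "LG02\t205\t0\t330\n"]
demo_ped = ["0 IND_01 0 0 0 -9 A A C T\n", "0 IND_02 0 0 0 -9 A G 0 0\n"]
n_samples = check_ped_against_map(demo_ped, demo_map)
print(n_samples)
```

With the real files, the two open file handles could be passed in place of the demo lists.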

<br>
<br>

### Binary PLINK files