## NeON - long term estimates of Ne
<br>
<br>
Run on Linux (Ubuntu VM). NeON is an R package; for instructions on running it in R, refer to [this R script]().
<br>


<br>


<br>
#### 12/6/2017
### Installing NeON

According to the NeON Tutorial, in order to run NeON you will need to:

1. Download PLINK v1.07 from [this site](http://zzz.bwh.harvard.edu/plink/download.shtml), unzip it, and put the binary files into the working folder. If you are working on an Ubuntu VM, use the `Linux (x86_64)` build.
2. Download the NeON package with tutorial from [this site](http://www.unife.it/dipartimento/biologia-evoluzione/ricerca/evoluzione-e-genetica/software), unzip it, and put the files and directories into the working folder.
3. In the terminal, navigate to the working directory.
4. Install the add-on R package with `R CMD INSTALL NeON_1.0.tar.gz`.

You can now open NeON in R. 
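
Before installing, it can help to confirm that the downloads actually landed in the working folder. The following is a minimal Python 3 sketch (the notebook cells themselves are Python 2); the helper name `missing_files` is hypothetical, and the two required names are simply taken from the directory listing in this notebook.

```python
import os

# Names taken from this notebook's directory listing; adjust as needed.
REQUIRED = ["plink", "NeON_1.0.tar.gz"]

def missing_files(workdir, required=REQUIRED):
    """Return the required entries that are not present in workdir."""
    return [name for name in required
            if not os.path.exists(os.path.join(workdir, name))]

missing = missing_files(".")
if missing:
    print("Missing from working directory: " + ", ".join(missing))
else:
    print("All required files found")
```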

My working directory file structure: 

In [1]:
pwd

u'/mnt/hgfs/PCod-Korea-repo/notebooks'

In [2]:
cd ../analyses/Ne/NeON/

/mnt/hgfs/PCod-Korea-repo/analyses/Ne/NeON


In [3]:
ls

[0m[01;32mbatch_8_filtered_Admixture_input_ATalleles.ped[0m*     [01;32mNeON_1.0.tar.gz[0m*
[01;32mbatch_8_filtered_Admixture_input_ATallelesSNeP.ld[0m*  [01;32mNeON-manual.pdf[0m*
[01;32mbatch_8_filtered_Admixture_input_nofam.ped[0m*         [34;42mNeON_TUTORIAL_INSTALL[0m/
[01;32mbatch_8_filtered_Admixture_input.ped[0m*               [01;32mNeON Tutorial.pdf[0m*
[01;32mbatch_8_filtered_bowtieACod_filteredMQ10_dist.map[0m*  [01;32mNeON_tutorial.R[0m*
[01;32mbatch_8_filtered_bowtieACod_filteredMQ10.map[0m*       [34;42mpcod-data[0m/
[01;32mCOPYING.txt[0m*                                        [01;32mplink[0m*
[34;42mdata[0m/                                               [01;32mREADME.txt[0m*
[34;42mgenetic_map[0m/                                        [01;32mtest.map[0m*
[01;32mgPLINK.jar[0m*                                         [01;32mtest.ped[0m*


Note that the `plink` executable and the NeON files are in the working directory, while data and genetic maps are in their own folders.

<br>
<br>

### Make PLINK files: .bed, .fam, .bim, .ld

#### STEP ONE: alignment and genepop filtering

**(A)** Align your final, filtered loci to a genome, then filter by mapping quality to keep only unique alignments.

See [this notebook](https://github.com/mfisher5/PCod-Korea-repo/blob/master/notebooks/Batch%208%20-%20Alignment%20to%20Atlantic%20cod%20Genome.ipynb) for alignment instructions. 


<br>

**(B)** Filter your genepop file so that it only includes those loci. 



In [4]:
pwd

u'/mnt/hgfs/PCod-Korea-repo/analyses/Ne/NeON'

In [5]:
cd ../../../

/mnt/hgfs/PCod-Korea-repo


In [11]:
infile = open("analyses/Ne/NeON/batch_8_filtered_bowtieACod_filteredMQ10.map", "r")

loci_to_keep = []
for line in infile:
    loci_to_keep.append(line.strip().split()[1])
infile.close()
print "You have ", len(loci_to_keep), " loci"

genfile = open("stacks_b8_wgenome/batch_8_filteredMAF_filteredIndivids30_filteredLoci_filteredHWE_filteredRepsC.txt", "r")
outfile = open("analyses/Ne/NeON/batch_8_filtered_bowtieACod_filteredMQ10_genepop.gen", "w")

# copy the header line to the new file
header = genfile.readline()
outfile.write(header)

# copy the names of the loci to keep
loci_ordered = [] # loci to keep, in the order they appear in the genepop file
keep_genos = [] # one flag per genepop locus: 1 = keep, 0 = skip in each individual
for line in genfile:
    if "pop" not in line and "Pop" not in line: 
        locus = line.strip()
        if locus in loci_to_keep: 
            loci_ordered.append(locus)
            outfile.write(line)
            keep_genos.append(1)
        else:
            keep_genos.append(0)
    else:
        outfile.write(line) #write in first "pop"
        break

# copy the genotypes at the retained loci for each individual
for line in genfile:
    linelist = line.strip().split()
    if len(linelist) < 2:
        outfile.write(line)
    else:
        sampleID = linelist[0]
        genotypes = linelist[1:]
        genotypes_keep = []
        for i in range(0, len(genotypes)):
            if keep_genos[i] == 1:
                genotypes_keep.append(genotypes[i])
        outfile.write(sampleID + "\t" + "\t".join(genotypes_keep) + "\n")
        print "Loci retained in individual: ", len(genotypes_keep)
genfile.close()
outfile.close()


You have  4031  loci
Loci retained in individual:  4031
Loci retained in individual:  4031
Loci retained in individual:  4031
Loci retained in individual:  4031
Loci retained in individual:  4031
Loci retained in individual:  4031
Loci retained in individual:  4031
Loci retained in individual:  4031
Loci retained in individual:  4031
Loci retained in individual:  4031
Loci retained in individual:  4031
Loci retained in individual:  4031
Loci retained in individual:  4031
Loci retained in individual:  4031
Loci retained in individual:  4031
Loci retained in individual:  4031
Loci retained in individual:  4031
Loci retained in individual:  4031
Loci retained in individual:  4031
Loci retained in individual:  4031
Loci retained in individual:  4031
Loci retained in individual:  4031
Loci retained in individual:  4031
Loci retained in individual:  4031
Loci retained in individual:  4031
Loci retained in individual:  4031
Loci retained in individual:  4031
Loci retained in individual:  4031

#### STEP TWO: create .ped and .map files

For full set of directions, see the notebook [batch 8 - PLINK files]().

I made a `.ped` file from my new, filtered genepop file. *Note that you must replace numerical alleles with base-pair letters. I arbitrarily chose "A" for "01" and "T" for "02".*
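
The allele replacement described above can be sketched as follows. This is a Python 3 illustration, not the code used to build the actual `.ped` file: the function names are hypothetical, and it assumes two-digit genepop allele codes (four-digit genotypes) with `0000` as missing data, which is written as `0 0` because `0` is PLINK's default missing-genotype code.

```python
# Arbitrary mapping of genepop allele codes to bases ("A" for "01",
# "T" for "02", as chosen in this notebook); "00" is missing data.
ALLELE_TO_BASE = {"01": "A", "02": "T", "00": "0"}

def genepop_to_ped_alleles(genotype):
    """Convert one four-digit genepop genotype, e.g. "0102", to "A T"."""
    if len(genotype) != 4:
        raise ValueError("expected a four-digit genotype, got %r" % genotype)
    first, second = genotype[:2], genotype[2:]
    return ALLELE_TO_BASE[first] + " " + ALLELE_TO_BASE[second]

def genepop_line_to_ped_alleles(genotypes):
    """Convert a list of genepop genotypes to the allele part of a .ped row."""
    return " ".join(genepop_to_ped_alleles(g) for g in genotypes)

print(genepop_line_to_ped_alleles(["0101", "0102", "0000"]))
# prints: A A A T 0 0
```

A genotype with any allele code other than `00`, `01`, or `02` raises a `KeyError`, which is a deliberate choice here so that unexpected codes are not silently recoded.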

I also need to redo my `.map` file so that the order of its loci matches the order of the loci in the `.ped` file (the `.ped` file has no locus headers, so PLINK can only name loci correctly from the order given in the `.map` file). I can do this using the list object `loci_ordered` from above.

In [12]:
outmap = open("analyses/Ne/NeON/batch_8_filtered_bowtieACod_filteredMQ10_ordered.map", "w")
lines_copied = 0
for locus in loci_ordered:
    inmap = open("analyses/Ne/NeON/batch_8_filtered_bowtieACod_filteredMQ10.map", "r")
    for line in inmap:
        if line.strip().split()[1] == locus:
            outmap.write(line)
            lines_copied += 1
    inmap.close()
outmap.close()
print "copied over ", lines_copied, " lines"

copied over  4031  lines
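
The cell above re-reads the `.map` file once per locus. A possible alternative, sketched here in Python 3 with a hypothetical helper name and made-up example rows, indexes the `.map` lines by locus ID once and then writes them out in the desired order. One difference to be aware of: if a locus ID appeared more than once in the `.map` file, this version would keep only the last line for it.

```python
def reorder_map_lines(map_lines, loci_ordered):
    """Return .map lines rearranged to follow loci_ordered.

    Each .map line is whitespace-delimited with the locus ID in column 2.
    Loci missing from the .map file are skipped.
    """
    by_locus = {}
    for line in map_lines:
        fields = line.split()
        if len(fields) >= 2:
            by_locus[fields[1]] = line
    return [by_locus[locus] for locus in loci_ordered if locus in by_locus]

# Made-up rows for illustration only.
example_map = [
    "LG01\t101\t0\t5000\n",
    "LG02\t202\t0\t1200\n",
    "LG03\t303\t0\t880\n",
]
for line in reorder_map_lines(example_map, ["303", "101", "202"]):
    print(line, end="")
```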


<br>
<br>

#### STEP THREE: Make .bed / .bim / .fam files using `plink` and the .ped and .map files

More on PLINK binary files [here](http://www.gwaspi.org/?page_id=671)

<br>
Note that this is NOT the same `.bed` file as used for `bedtools`.

In [13]:
cd analyses/Ne/NeON

/mnt/hgfs/PCod-Korea-repo/analyses/Ne/NeON


In [14]:
!mv batch_8_filtered_bowtieACod_filteredMQ10_ne_input_AT.ped pcod-data/batch_8_filtered_bowtieACod_filteredMQ10_ne_input_AT.ped

In [15]:
!mv batch_8_filtered_bowtieACod_filteredMQ10_ordered.map pcod-data/batch_8_filtered_bowtieACod_filteredMQ10_ne_input_AT.map

In [19]:
!./plink --file pcod-data/batch_8_filtered_bowtieACod_filteredMQ10_ne_input_AT --make-bed --out pcod-data/batch_8_filtered_bowtieACod_filteredMQ10


@----------------------------------------------------------@
|        PLINK!       |     v1.07      |   10/Aug/2009     |
|----------------------------------------------------------|
|  (C) 2009 Shaun Purcell, GNU General Public License, v2  |
|----------------------------------------------------------|
|  For documentation, citation & bug-report instructions:  |
|        http://pngu.mgh.harvard.edu/purcell/plink/        |
@----------------------------------------------------------@

Web-based version check ( --noweb to skip )
Recent cached web-check found...Problem connecting to web

Writing this text to log file [ pcod-data/batch_8_filtered_bowtieACod_filteredMQ10.log ]
Analysis started: Wed Dec  6 11:38:07 2017

Options in effect:
	--file pcod-data/batch_8_filtered_bowtieACod_filteredMQ10_ne_input_AT
	--make-bed
	--out pcod-data/batch_8_filtered_bowtieACod_filteredMQ10

4031 (of 4031) markers to be included from [ pcod-data/batch_8_filtered_bowtieACod_filteredMQ10_ne_input_AT.map ]
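
Since the PLINK `.bed` shares its extension with the text BED format used by `bedtools`, one quick way to tell them apart is to look at the first bytes of the file: a PLINK binary `.bed` starts with the two bytes `0x6c 0x1b`, followed by a mode byte (`0x01` for SNP-major, `0x00` for individual-major). The Python 3 sketch below uses a hypothetical helper name and runs on two throwaway files rather than on the real `pcod-data` output.

```python
import os
import tempfile

PLINK_BED_MAGIC = b"\x6c\x1b"  # first two bytes of a PLINK binary .bed

def describe_bed(path):
    """Classify a file as PLINK binary .bed or not from its first 3 bytes."""
    with open(path, "rb") as handle:
        head = handle.read(3)
    if len(head) < 3 or head[:2] != PLINK_BED_MAGIC:
        return "not a PLINK binary .bed"
    if head[2:3] == b"\x01":
        return "PLINK .bed, SNP-major"
    return "PLINK .bed, individual-major"

# Demo on two temporary files (illustrative content only).
with tempfile.TemporaryDirectory() as tmp:
    plink_path = os.path.join(tmp, "example_plink.bed")
    text_path = os.path.join(tmp, "example_text.bed")
    with open(plink_path, "wb") as handle:
        handle.write(b"\x6c\x1b\x01\x00")
    with open(text_path, "w") as handle:
        handle.write("LG01\t100\t200\n")
    print(describe_bed(plink_path))
    print(describe_bed(text_path))
```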

#### STEP FOUR: Create .ld file using `plink`

More on `.ld` files [here](http://zzz.bwh.harvard.edu/plink/ld.shtml)

Be sure to calculate `r2` (with `--r2`) and not `r` (with `--r`).

In [38]:
!./plink --file pcod-data/batch_8_filtered_bowtieACod_filteredMQ10_ne_input_AT --r2 --allow-no-sex


@----------------------------------------------------------@
|        PLINK!       |     v1.07      |   10/Aug/2009     |
|----------------------------------------------------------|
|  (C) 2009 Shaun Purcell, GNU General Public License, v2  |
|----------------------------------------------------------|
|  For documentation, citation & bug-report instructions:  |
|        http://pngu.mgh.harvard.edu/purcell/plink/        |
@----------------------------------------------------------@

Web-based version check ( --noweb to skip )
Recent cached web-check found...Problem connecting to web

Writing this text to log file [ plink.log ]
Analysis started: Wed Dec  6 12:11:08 2017

Options in effect:
	--file pcod-data/batch_8_filtered_bowtieACod_filteredMQ10_ne_input_AT
	--r2
	--allow-no-sex

4031 (of 4031) markers to be included from [ pcod-data/batch_8_filtered_bowtieACod_filteredMQ10_ne_input_AT.map ]
Writing list of these individuals to [ plink.nosex ]
253 individuals read from [ pcod-data/b

In [39]:
!head plink.ld

 CHR_A         BP_A   SNP_A  CHR_B         BP_B   SNP_B           R2 
     0       129609    4135      0       142208   21046     0.219354 
     0       996100    6562      0       996231   17935     0.850749 
     0      1317465    6903      0      1317603   10478     0.249635 
     0      1323218   12353      0      1333072    6576     0.824104 
     0      1535383    3518      0      1535521    3602     0.655982 
     0      1535383    3518      0      1540993   23912     0.529246 
     0      1535521    3602      0      1540993   23912     0.467432 
     0      1610166   16064      0      1610306    9250     0.524646 
     0      1866533   16398      0      1896621   15111     0.557202 
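
To illustrate why `r2` rather than `r` is the quantity to report, the Python 3 sketch below computes a plain Pearson correlation between two loci coded as 0/1/2 allele counts. It is only an illustration with made-up genotypes and a hypothetical function name, and it is not claimed to reproduce PLINK's exact calculation. The sign of `r` depends on which allele is counted at each locus, whereas `r2` does not.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation of two equal-length genotype vectors.

    Genotypes are coded 0/1/2 (copies of one allele); None marks missing
    data, and any pair with a missing value is skipped.
    """
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    n = len(pairs)
    if n < 2:
        raise ValueError("need at least two complete pairs")
    mean_x = sum(a for a, _ in pairs) / n
    mean_y = sum(b for _, b in pairs) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in pairs)
    var_x = sum((a - mean_x) ** 2 for a, _ in pairs)
    var_y = sum((b - mean_y) ** 2 for _, b in pairs)
    if var_x == 0 or var_y == 0:
        raise ValueError("a monomorphic locus has no defined correlation")
    return cov / sqrt(var_x * var_y)

locus_a = [0, 1, 2, 1, 0, 2]
locus_b = [2, 1, 0, 1, 2, 0]  # same pattern, counting the other allele
r = pearson_r(locus_a, locus_b)
print(r, r ** 2)
# prints: -1.0 1.0
```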


In [40]:
!head data/CEU_gen.ld

 CHR_A         BP_A         SNP_A  CHR_B         BP_B         SNP_B           R2 
     1       993492      0.494952      1      1087198      0.640116    0.0103293 
     1       993492      0.494952      1      1120590      0.811253  0.000470105 
     1       993492      0.494952      1      1145994      0.831152    0.0414282 
     1       993492      0.494952      1      1148494      0.833377   0.00544144 
     1       993492      0.494952      1      1201155      0.875979     0.055172 
     1       993492      0.494952      1      1468016        1.0544  0.000109858 
     1       993492      0.494952      1      1490804       1.05928   0.00202653 
     1      1087198      0.640116      1      1120590      0.811253    0.0249491 
     1      1087198      0.640116      1      1145994      0.831152    0.0577768 


<br>
**Why are there no chromosome codes?** According to [this website](), only PLINK 1.9 and later permit contig names instead of human chromosome names, but I can probably add these in manually.

In [41]:
# Create a dictionary of locus name : contig name
mapfile = open("pcod-data/batch_8_filtered_bowtieACod_filteredMQ10_ne_input_AT.map", "r")
locus_contig_dict = {}
for line in mapfile:
    linelist = line.strip().split()
    locus_contig_dict[linelist[1]] = linelist[0]
mapfile.close()

In [65]:
# use the dictionary to add in appropriate contig names for new `.ld` file
infile = open("plink.ld", "r")
outfile = open("pcod-data/batch_8_filtered_bowtieACod_filteredMQ10.ld", "w")

header = infile.readline()
outfile.write(header)

for line in infile:
    linelist = line.strip().split()
    locusA = linelist[2]
    locusB = linelist[5]
    chromA = locus_contig_dict[locusA]
    chromB = locus_contig_dict[locusB]
    linelist[0] = chromA
    linelist[3] = chromB
    outfile.write("     " + linelist[0] + "       " + linelist[1] + "    " + linelist[2] + "      " + linelist[3] 
                  + "       " + linelist[4] + "   " + linelist[5] + "     " + linelist[6] + " \n")
infile.close()
outfile.close()

In [66]:
!head pcod-data/batch_8_filtered_bowtieACod_filteredMQ10.ld

 CHR_A         BP_A   SNP_A  CHR_B         BP_B   SNP_B           R2 
     GmG20150304_scaffold_7973       129609    4135      GmG20150304_scaffold_7973       142208   21046     0.219354 
     LG16       996100    6562      LG16       996231   17935     0.850749 
     LG21       1317465    6903      LG21       1317603   10478     0.249635 
     LG11       1323218    12353      LG11       1333072   6576     0.824104 
     LG13       1535383    3518      LG13       1535521   3602     0.655982 
     LG13       1535383    3518      LG13       1540993   23912     0.529246 
     LG13       1535521    3602      LG13       1540993   23912     0.467432 
     LG20       1610166    16064      LG20       1610306   9250     0.524646 
     LG18       1866533    16398      LG18       1896621   15111     0.557202 


In [67]:
print locus_contig_dict['4135']

GmG20150304_scaffold_7973


<br>
<br>

### Troubleshooting

In [68]:
!head pcod-data/batch_8_filtered_bowtieACod_filteredMQ10.fam

pop_1 PO010715_02 0 0 0 -9
pop_1 PO010715_04 0 0 0 -9
pop_1 PO010715_06.1 0 0 0 -9
pop_1 PO010715_08.1 0 0 0 -9
pop_1 PO010715_10.1 0 0 0 -9
pop_1 PO010715_11.1 0 0 0 -9
pop_1 PO010715_12 0 0 0 -9
pop_1 PO010715_17.1 0 0 0 -9
pop_1 PO010715_19.1 0 0 0 -9
pop_1 PO010715_27.1 0 0 0 -9


In [69]:
!head data/CEU_gen.fam

1334 NA12144 0 0 1 -9
1334 NA12145 0 0 2 -9
1334 NA12146 0 0 1 -9
1334 NA12239 0 0 2 -9
1340 NA06994 0 0 1 -9
1340 NA07000 0 0 2 -9
1340 NA07022 0 0 1 -9
1340 NA07056 0 0 2 -9
1341 NA07034 0 0 1 -9
1341 NA07055 0 0 2 -9


In [70]:
!tail data/CEU_gen.fam

1454 NA12812 0 0 1 -9
1454 NA12813 0 0 2 -9
1454 NA12814 0 0 1 -9
1454 NA12815 0 0 2 -9
1459 NA12872 0 0 1 -9
1459 NA12873 0 0 2 -9
1459 NA12874 0 0 1 -9
1459 NA12875 0 0 2 -9
1463 NA12891 0 0 1 -9
1463 NA12892 0 0 2 -9


In [71]:
!tail pcod-data/batch_8_filtered_bowtieACod_filteredMQ10.fam

pop_9 GEO020414_29 0 0 0 -9
pop_9 GEO020414_3 0 0 0 -9
pop_9 GEO020414_4 0 0 0 -9
pop_9 GEO020414_5 0 0 0 -9
pop_9 GEO020414_6 0 0 0 -9
pop_9 GEO020414_7 0 0 0 -9
pop_9 GEO020414_8_300 0 0 0 -9
pop_9 GEO020414_9_300 0 0 0 -9
pop_9 GEO020414_10_2 0 0 0 -9
pop_9 GEO020414_30_2 0 0 0 -9
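
The `.fam` comparison above can also be summarized programmatically. A `.fam` line holds six whitespace-delimited columns: family ID, individual ID, father, mother, sex (1 = male, 2 = female, anything else = unknown), and phenotype. The Python 3 sketch below uses a hypothetical helper and a few lines copied from the outputs above to count family IDs and sex codes in each file.

```python
from collections import Counter

def summarize_fam(lines):
    """Count family IDs and sex codes in whitespace-delimited .fam lines."""
    families = Counter()
    sexes = Counter()
    for line in lines:
        fields = line.split()
        if len(fields) < 6:
            continue  # skip blank or malformed lines
        families[fields[0]] += 1
        sexes[fields[4]] += 1
    return families, sexes

# A few lines copied from the two .fam files shown above.
pcod_lines = [
    "pop_1 PO010715_02 0 0 0 -9",
    "pop_1 PO010715_04 0 0 0 -9",
    "pop_9 GEO020414_29 0 0 0 -9",
]
ceu_lines = [
    "1334 NA12144 0 0 1 -9",
    "1334 NA12145 0 0 2 -9",
    "1340 NA06994 0 0 1 -9",
]
for label, lines in (("PCod", pcod_lines), ("CEU", ceu_lines)):
    families, sexes = summarize_fam(lines)
    print(label, dict(families), dict(sexes))
```

In these excerpts the two files differ in that the PCod family IDs are population labels and every sex code is `0` (unknown), while the tutorial file uses numeric family IDs and sex codes `1` and `2`.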


In [72]:
!tail pcod-data/batch_8_filtered_bowtieACod_filteredMQ10.ld

     LG14       25158945    13038      LG14       25213698   4621     0.279681 
     LG12       25184524    20174      LG12       25185115   15320     0.204061 
     LG01       26278787    2605      LG01       26278925   9711     0.23956 
     LG14       27100815    8639      LG14       27100953   19422     0.833626 
     LG16       27127755    19437      LG16       27148342   5025     0.380364 
     LG11       27820982    7051      LG11       27865746   17576     0.587917 
     LG16       30500179    18479      LG16       30608740   17384     0.562002 
     LG16       30901231    13014      LG16       30927774   18833     0.302697 
     LG04       33598813    4090      LG04       33904405   24018     0.357029 
     LG04       33904543    8889      LG04       33941419   23250     0.534267 


In [73]:
!tail data/CEU_gen.ld

    22     49436079       75.1956     22     49450558       75.2077     0.164617 
    22     49436079       75.1956     22     49487182       75.2272   0.00178042 
    22     49436079       75.1956     22     49503799       75.2924    0.0139598 
    22     49436079       75.1956     22     49522492        75.338     0.016777 
    22     49450558       75.2077     22     49487182       75.2272     0.452347 
    22     49450558       75.2077     22     49503799       75.2924    0.0114409 
    22     49450558       75.2077     22     49522492        75.338   0.00543686 
    22     49487182       75.2272     22     49503799       75.2924     0.132833 
    22     49487182       75.2272     22     49522492        75.338    0.0307692 
    22     49503799       75.2924     22     49522492        75.338    0.0294198 


<br>
<br>
### Replacing linkage groups with chromosomes

It's possible that `Nestimate` won't work with the PCod linkage group names and needs human-style chromosome numbers to run.

I'm going to create a new `.ld` file that renames the PCod linkage groups to numbers, along with a `.txt` file that maps between the two.


In [74]:
# get a list of all scaffold / lg names
infile = open("pcod-data/batch_8_filtered_bowtieACod_filteredMQ10_ne_input_AT.map", "r")
lg_scaffolds = []
for line in infile:
    lg = line.strip().split()[0]
    if lg not in lg_scaffolds:
        lg_scaffolds.append(lg)
infile.close()

In [75]:
print len(lg_scaffolds)

51


In [76]:
# create dictionary and file that lists lg/ scaffold and corresponding chromosome #
outfile = open("PCod_lg_scaffold_dictionary.txt", "w")
outfile.write("PCod_lg_scaffold\tChr_assigned\n")
lg_scaffolds_dict = {}
for i in range(0, len(lg_scaffolds)):
    outfile.write(lg_scaffolds[i] + "\t" + str(i + 1) + "\n")
    lg_scaffolds_dict[lg_scaffolds[i]] = i + 1
outfile.close()
    

In [77]:
# replace lg / scaffold with corresponding chromosome number in `.ld` file
infile = open("pcod-data/batch_8_filtered_bowtieACod_filteredMQ10.ld", "r")
outfile = open("pcod-data/batch_8_filtered_bowtieACod_filteredMQ10_chr.ld", "w")

header = infile.readline()
outfile.write(header)
for line in infile:
    linelist = line.strip().split()
    chromA = lg_scaffolds_dict[linelist[0]]
    chromB = lg_scaffolds_dict[linelist[3]]
    linelist[0] = str(chromA)
    linelist[3] = str(chromB)
    outfile.write("\t".join(linelist) + "\n")
infile.close()
outfile.close()

In [78]:
!head pcod-data/batch_8_filtered_bowtieACod_filteredMQ10_chr.ld

 CHR_A         BP_A   SNP_A  CHR_B         BP_B   SNP_B           R2 
28	129609	4135	28	142208	21046	0.219354
12	996100	6562	12	996231	17935	0.850749
20	1317465	6903	20	1317603	10478	0.249635
25	1323218	12353	25	1333072	6576	0.824104
7	1535383	3518	7	1535521	3602	0.655982
7	1535383	3518	7	1540993	23912	0.529246
7	1535521	3602	7	1540993	23912	0.467432
18	1610166	16064	18	1610306	9250	0.524646
9	1866533	16398	9	1896621	15111	0.557202
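
Once results come back keyed by the assigned chromosome numbers, the dictionary file can be read in reverse to recover the original linkage group or scaffold names. The Python 3 sketch below assumes the tab-delimited layout written to `PCod_lg_scaffold_dictionary.txt` above (header row, then name and number); the helper name is hypothetical, and the example rows are the pairings visible when comparing the two `head` outputs.

```python
def parse_chromosome_table(lines):
    """Read 'name<TAB>number' rows (after a header) into {number: name}."""
    number_to_name = {}
    for line in lines[1:]:  # skip the header row
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 2:
            continue  # ignore blank or malformed rows
        name, number = fields
        number_to_name[int(number)] = name
    return number_to_name

# Rows as they would appear in the dictionary file for four of the names.
table = [
    "PCod_lg_scaffold\tChr_assigned\n",
    "LG13\t7\n",
    "LG18\t9\n",
    "LG16\t12\n",
    "GmG20150304_scaffold_7973\t28\n",
]
lookup = parse_chromosome_table(table)
print(lookup[28])
# prints: GmG20150304_scaffold_7973
```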


In [79]:
!head data/CEU_gen.ld

 CHR_A         BP_A         SNP_A  CHR_B         BP_B         SNP_B           R2 
     1       993492      0.494952      1      1087198      0.640116    0.0103293 
     1       993492      0.494952      1      1120590      0.811253  0.000470105 
     1       993492      0.494952      1      1145994      0.831152    0.0414282 
     1       993492      0.494952      1      1148494      0.833377   0.00544144 
     1       993492      0.494952      1      1201155      0.875979     0.055172 
     1       993492      0.494952      1      1468016        1.0544  0.000109858 
     1       993492      0.494952      1      1490804       1.05928   0.00202653 
     1      1087198      0.640116      1      1120590      0.811253    0.0249491 
     1      1087198      0.640116      1      1145994      0.831152    0.0577768 


In [80]:
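infile = open("data/CEU_gen.ld", "r")
outfile = open("data/CEU_gen_subset.ld", "w")
# copy the first 1000 lines (header included) of the tutorial file
i = 0
while i < 1000:
    outfile.write(infile.readline())
    i += 1
infile.close()
outfile.close()

In [81]:
# same subset from the shell; `>` overwrites the file instead of appending to it
!head -n 1000 data/CEU_gen.ld > data/CEU_gen_subset.ld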

#### Use only one space between entries on each line

In [82]:
# use the dictionary to add in appropriate contig names for new `.ld` file
infile = open("plink.ld", "r")
outfile = open("pcod-data/batch_8_filtered_bowtieACod_filteredMQ10_spaces.ld", "w")

header = infile.readline()
outfile.write(header)

for line in infile:
    linelist = line.strip().split()
    locusA = linelist[2]
    locusB = linelist[5]
    chromA = locus_contig_dict[locusA]
    chromB = locus_contig_dict[locusB]
    linelist[0] = chromA
    linelist[3] = chromB
    outfile.write(" ".join(linelist) + " \n")
infile.close()
outfile.close()

#### Change tutorial file to tabs and rerun tutorial

In [100]:
infile = open("data/CEU_gen.ld", "r")
outfile = open("data/CEU_gen_tabs.ld", "w")

for line in infile:
    linelist = line.strip().split()
    outfile.write("\t".join(linelist) + "\n")
infile.close()
outfile.close()
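
The delimiter experiments above (padded spaces, single spaces, tabs) can be handled by one small helper. The Python 3 sketch below uses a hypothetical function name and treats the header like any other row, so the whole file ends up with one consistent delimiter. Whether NeON actually requires a particular delimiter is not established here; this only makes it easy to test each variant.

```python
def normalize_delimiter(lines, sep="\t"):
    """Rejoin whitespace-delimited lines with a single separator each."""
    out = []
    for line in lines:
        fields = line.split()
        if fields:  # drop blank lines
            out.append(sep.join(fields) + "\n")
    return out

# Header and first row as printed by `head plink.ld` above.
ld_lines = [
    " CHR_A         BP_A   SNP_A  CHR_B         BP_B   SNP_B           R2 \n",
    "     0       129609    4135      0       142208   21046     0.219354 \n",
]
for line in normalize_delimiter(ld_lines, sep=" "):
    print(line, end="")
```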