### Plotting DAPCs with different size subsets of my genepop file


In the last couple of days working on my project, I hit a roadblock. When I try to visualize my data with a DAPC, the `dudi.pca` function in R keeps throwing an error about NA entries:


``Error in dudi.pca(x, center = center, scale = scale, scannf = FALSE, nf = maxRank) : 
  na entries in table``

I haven't been able to fix it. None of my lab mates have this problem, I've run it on their computers too, and I've searched my genepop file for abnormalities and found nothing.
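One check that could be scripted while searching the file: look for loci that are missing in every individual. This is only a minimal sketch, and whether all-missing loci are what triggers the `dudi.pca` error here is a guess, not something I have confirmed. It assumes a Stacks-style genepop file (title line, then one comma-separated line of loci, then `pop` separators and sample lines of the form `name,` followed by whitespace-separated genotypes) and that missing genotypes are coded as all zeros. The function names and the toy data are made up for illustration.

```python
# Sketch: find loci with no genotype data at all in a genepop file.
# Assumptions (not verified against my real file):
#   - line 1 is a title, line 2 is the comma-separated locus list
#   - "pop" lines separate populations
#   - sample lines look like "name,<whitespace>genotype genotype ..."
#   - a missing genotype is all zeros, e.g. "0000" or "000000"

def missing_per_locus(lines):
    """Return (locus_names, missing_counts, n_individuals)."""
    loci = [name.strip() for name in lines[1].strip().split(",")]
    missing = [0] * len(loci)
    n_individuals = 0
    for line in lines[2:]:
        stripped = line.strip()
        if not stripped or stripped.lower() == "pop":
            continue  # skip blank lines and population separators
        # everything after the first comma is the genotype columns
        genotypes = stripped.split(",", 1)[1].split()
        n_individuals += 1
        for i, genotype in enumerate(genotypes[:len(loci)]):
            if set(genotype) == {"0"}:
                missing[i] += 1
    return loci, missing, n_individuals


def all_missing_loci(lines):
    """Return the names of loci that are missing in every individual."""
    loci, missing, n = missing_per_locus(lines)
    return [locus for locus, count in zip(loci, missing) if n > 0 and count == n]


# toy data, invented for this example
example = [
    "toy genepop title line\n",
    "1_10,2_20,3_30\n",
    "pop\n",
    "ind_01,\t0101\t0000\t0102\n",
    "ind_02,\t0000\t0000\t0202\n",
    "pop\n",
    "ind_03,\t0101\t0000\t0101\n",
]
print(all_missing_loci(example))  # prints ['2_20']
```

For a real file, the list of lines would come from `open(filename).readlines()`, the same way the subsetting script below reads its input.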

Just to get a ballpark of what my data look like, I subsetted the first 50 loci of the genepop file and ran a DAPC, which worked fine. It looked like there was no real differentiation between my cohorts. Then I subsetted to about 200 loci and did see some differentiation. So it occurred to me that it might be useful to keep subsetting and plotting, just to get a ballpark, because I likely won't be able to solve my bigger problems this week (when our project is due!! today!!).

So this morning I wrote a script that subsets a genepop file to its first n loci. The command line arguments are [i] the genepop file you are subsetting and [ii] the number of loci to keep.

Here is the code for the python script:
```
# Natalie Lowell 20161214
# Purpose of script: subset a genepop file to a certain number of loci
# Command line arguments: [1] genepop file and [2] number of loci to subset

import sys

n = int(sys.argv[2]) + 1 # add 1 because the first column of each sample line is the sample name, not a locus

# open your genepop file and read lines into a list of lines
gpfile = open(sys.argv[1], "r") 
lines = gpfile.readlines()
gpfile.close()

# open new file for your output file, the truncated genepop file
newfilename = "genepop_"+str(sys.argv[2])+"_loci.gen"
newfile = open(newfilename, "w")

# write the header line with stacks version and date to the new file
newfile.write(lines[0])

# get header and split into list on commas, for a list of all the loci
header_list = lines[1].strip().split(",")

# grab only the first n, as designated at the command line
retrieve_header = header_list[0:int(sys.argv[2])]


# initiate string for header line w loci
headerstring = "" 

# make a loop to stick the commas back in
for locus in retrieve_header:
	headerstring += locus + ","
	
	
# remove the last comma
headerstring = headerstring[:-1]
headerstring = headerstring + "\n"

# print headerstring # CHECK
	
# write it to the file
newfile.write(headerstring)

# remaining lines = after header w loci
remlines = lines[2:]

# loop: if pop, write pop. if not pop, truncate line to n and add to file
for line in remlines:
	if line.strip().lower() == "pop": # exact match, so sample names containing "pop" are not mistaken for separators
		newfile.write(line)
	else:
		linelist = line.strip().split()
		keep = linelist[0:n] # TESTING THIS LINE
		newline = ""
		print(keep) # CHECK (parentheses so this runs under Python 2 and 3)
		for item in keep:
			newline += item + "\t"
		newline = newline[:-1] # remove final tab
		newline = newline + "\n" # add new line
		newfile.write(newline)

newfile.close()
```

Here is the code for the R script that I use to make a DAPC, adapted from Charlie's code (my lab mate):

```
library(adegenet) # provides read.genepop(), dapc(), and optim.a.score()

# Let's first run a DAPC with all individuals and all loci
data_all_loci <-read.genepop("bigsubset.gen")
names(data_all_loci)
data_all_loci$pop

pop_2005 <- rep("Y2005",20)
pop_2009 <- rep("Y2009",2)
pop_2010 <- rep("Y2010", 38)
pop_2014 <- rep("Y2014",11)
pop_2015 <- rep("Y2015",10)

pop_groups <- as.factor(c(rep("Y2005",20),rep("Y2009",2),rep("Y2010", 38),rep("Y2014",11),rep("Y2015",10)))
pop_labels <- c(pop_2005,pop_2009,pop_2010,pop_2014,pop_2015)
pop_cols <- c("black","dodgerblue","tomato","deepskyblue","red")

dapc_all <- dapc(data_all_loci,data_all_loci$pop,n.pca=465,n.da=5) ##Retain all, then identify optimal number by optim.a.score
test_a_score <- optim.a.score(dapc_all)
dapc_all <- dapc(data_all_loci,data_all_loci$pop,n.pca=40,n.da=5) ##63 PC's is the optimal number

#2D plot
scatter(dapc_all,scree.da=FALSE,cellipse=0,leg=FALSE,label=c("2005","2009","2010","2014","2015"),
        posi.da="right",csub=2,col=pop_cols,cex=1.5,clabel=1,pch=c(12,14,16,18,20),solid=1)
legend(x = -4.5, y = 3,bty='n',legend=c("2005","2009","2010","2014","2015"),pch=c(12,14,16,18,20),col=pop_cols,cex=1.3)
```

In [4]:
cd Desktop

/Users/natalielowell/Desktop


For 200 loci:

In [6]:
!python subset_genepop_nloci.py batch_3_20k.gen 200

For 500 loci:

In [7]:
!python subset_genepop_nloci.py batch_3_20k.gen 500

^ After running the DAPC on my genepop file with 500 loci, I got the same stinking error!

``Error in dudi.pca(x, center = center, scale = scale, scannf = FALSE, nf = maxRank) : 
  na entries in table``
  
But strangely, I didn't get the error with 1000 or 2000 loci.

For 1000 loci:

In [8]:
!python subset_genepop_nloci.py batch_3_20k.gen 1000

For 2000 loci:

In [9]:
!python subset_genepop_nloci.py batch_3_20k.gen 2000

For 5000 loci:

In [10]:
!python subset_genepop_nloci.py batch_3_20k.gen 5000

But! I got it again for 5000 loci!

``Error in dudi.pca(x, center = center, scale = scale, scannf = FALSE, nf = maxRank) : 
  na entries in table``
  
This makes me think that it's not something inherently wrong with a particular value in my genepop file; otherwise, since each larger subset contains all the loci of the smaller ones, it would affect every subset past the first one that failed. So, I'm going to keep subsetting larger numbers of loci so I can hopefully find one that doesn't trigger the error and have something somewhat representative of my data.

For 9000 loci:

In [13]:
!python subset_genepop_nloci.py batch_3_20k.gen 9000
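The repeated calls above could also be done in one pass. Here is a minimal sketch of the same subsetting logic wrapped in a function, so a list of sizes can be looped over. The names `subset_genepop` and `write_subsets` are hypothetical (they are not in my script), and the sketch assumes the same Stacks-style layout the script assumes: a title line, one comma-separated line of loci, then `pop` lines and sample lines whose first whitespace-separated field is the sample name. Unlike my script, it treats a line as a population separator only when the whole line is `pop`.

```python
# Sketch: write several truncated genepop files from one input file.
# subset_genepop and write_subsets are hypothetical helper names.

def subset_genepop(lines, n):
    """Return the genepop lines truncated to the first n loci."""
    out = [lines[0]]  # title line is copied unchanged
    loci = lines[1].strip().split(",")
    out.append(",".join(loci[:n]) + "\n")
    for line in lines[2:]:
        if line.strip().lower() == "pop":
            out.append(line)
        else:
            fields = line.strip().split()
            # first field is the sample name, so keep n + 1 fields
            out.append("\t".join(fields[:n + 1]) + "\n")
    return out


def write_subsets(path, sizes):
    """Write genepop_<n>_loci.gen for each n in sizes; return the file names."""
    with open(path) as handle:
        lines = handle.readlines()
    names = []
    for n in sizes:
        name = "genepop_" + str(n) + "_loci.gen"
        with open(name, "w") as out:
            out.writelines(subset_genepop(lines, n))
        names.append(name)
    return names
```

With this, something like `write_subsets("batch_3_20k.gen", [200, 500, 1000, 2000, 5000, 9000])` would produce the same set of output files as the separate commands.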

And now the DAPC plots, made with the R code above, for each of the subsets that didn't trigger the error:


<br>

#### 200 loci
![image](https://github.com/nclowell/FISH546/blob/master/Cod-Time-Series-Project/Notebooks/images_for_notebooks/200loci.jpeg?raw=true)

<br>
<br>

#### 1000 loci
![image](https://github.com/nclowell/FISH546/blob/master/Cod-Time-Series-Project/Notebooks/images_for_notebooks/1000loci.jpeg?raw=true)

<br>
<br>

#### 2000 loci
![image](https://github.com/nclowell/FISH546/blob/master/Cod-Time-Series-Project/Notebooks/images_for_notebooks/2000loci.jpeg?raw=true)

