Questions that might be interesting:
1. How many SNPs and individuals are involved (abbgen1k.csv)?
2. What fraction of the SNPs (rows) passed the quality filter?
3. How many SNPs are C->T (or G->A)?
4. What are the rates of substitutions by nucleotide type?
5. How many SNPs are singletons (only one copy of the ALT allele)?
6. For how many SNPs does the ALT allele have nonzero frequency among the European individuals? And among the African individuals? Can this support the out-of-Africa hypothesis?

1. How many SNPs and individuals are involved (abbgen1k.csv)?

In [11]:
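x22 <- read.csv("abbgen1k.csv")
# dim() returns rows (SNPs) and columns; the per-individual genotype
# columns start at column 10 (see question 6), the rest are annotation.
dim(x22)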

In [8]:
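# Compare read times: defaults, without factor conversion, and with
# comment-character scanning switched off as well.
system.time(read.csv("abbgen1k.csv"))
system.time(read.csv("abbgen1k.csv", stringsAsFactors=FALSE))
system.time(read.csv("abbgen1k.csv", stringsAsFactors=FALSE, comment.char=""))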

   user  system elapsed 
   0.97    0.01    0.99 

   user  system elapsed 
   0.42    0.03    0.45 

   user  system elapsed 
   0.40    0.03    0.44 

2. What fraction of the SNPs (rows) passed the quality filter?

In [10]:
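# Loop version: count the rows whose FILTER field is "PASS",
# then divide by the number of rows to get the fraction.
cntSNP <- function(x){
    cnt <- 0
    for(i in seq_len(nrow(x))){
        if(x[i,"FILTER"]=="PASS"){
            cnt <- cnt + 1
        }
    }
    cnt
}

cntSNP(x22) / nrow(x22)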

In [14]:
mean(x22$FILTER=="PASS")

3. How many SNPs are C->T (or G->A)?

In [18]:
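# Loop version: test one row at a time.
cntSNP_loop <- function(x, ref, alt){
    cnt <- 0
    for(i in seq_len(nrow(x))){
        # && is the scalar "and" that belongs inside if()
        if(x[i,"REF"]==ref && x[i,"ALT"]==alt){
            cnt <- cnt + 1
        }
    }
    cnt
}

# Vectorized version: compare whole columns at once and sum the TRUEs.
cntSNP <- function(x, ref, alt){
    sum(x$REF==ref & x$ALT==alt)
}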


In [20]:
system.time(cntSNP_loop(x22, "C", "T"))
system.time(cntSNP(x22, "C", "T"))

   user  system elapsed 
   1.07    0.03    1.11 

   user  system elapsed 
      0       0       0 
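
The timings above only count C->T, while the question also mentions G->A, which is the same substitution read from the opposite strand. A minimal sketch of counting both together, run on a small made-up data frame (`toy` and `cntSub` are hypothetical names used for illustration; only the `REF` and `ALT` column names come from the notebook):

```r
# Toy stand-in for x22; only the REF and ALT columns matter here.
toy <- data.frame(
  REF = c("C", "G", "A", "C", "G", "T"),
  ALT = c("T", "A", "G", "T", "C", "C"),
  stringsAsFactors = FALSE
)

# Vectorized count of one REF -> ALT substitution type.
cntSub <- function(x, ref, alt) {
  sum(x$REF == ref & x$ALT == alt)
}

# C->T on one strand is G->A on the other, so add the two counts.
n_ct_ga <- cntSub(toy, "C", "T") + cntSub(toy, "G", "A")
print(n_ct_ga)
```

On the real data the same sum would be `cntSNP(x22, "C", "T") + cntSNP(x22, "G", "A")`.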

4. What are the rates of substitutions by nucleotide type?

In [23]:
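# Build a 4x4 matrix of substitution counts.
# Rows are the REF nucleotide, columns the ALT nucleotide,
# both in the order A, C, G, T.
cntAllSNP <- function(x){
    dna <- c("A","C","G","T")
    
    m <- matrix(0,4,4)
    
    for(i in seq_along(dna)){
        for(j in seq_along(dna)){
            m[i,j] <- cntSNP_loop(x, dna[i], dna[j])
        }
    }
    m
}

cntAllSNP(x22)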

     [,1] [,2] [,3] [,4]
[1,]    0  534 2076  577
[2,]  765    0  684 3288
[3,] 3576  683    0  733
[4,]  534 1925  537    0
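
The matrix above holds counts, whereas the question asks for rates. One possible way to get both in a single pass is `table()` followed by `prop.table()`. The sketch below uses a small made-up data frame (`toy` is hypothetical, as are the variable names `counts` and `rates`); it is an alternative to the nested loop, not the notebook's own method:

```r
# Toy stand-in for x22 with only the REF and ALT columns.
toy <- data.frame(
  REF = c("C", "G", "A", "C", "G", "T", "C", "A"),
  ALT = c("T", "A", "G", "T", "C", "C", "A", "G"),
  stringsAsFactors = FALSE
)

dna <- c("A", "C", "G", "T")

# Fixing the factor levels keeps all four nucleotides in the table,
# even when one of them never occurs in the data.
counts <- table(
  REF = factor(toy$REF, levels = dna),
  ALT = factor(toy$ALT, levels = dna)
)

# Divide every cell by the grand total to turn counts into rates.
rates <- prop.table(counts)

print(counts)
print(round(rates, 3))
```

Rows with multi-character REF or ALT values (for example indels) would become `NA` under these factor levels and be dropped by `table()`, which matches the loop version ignoring them.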


5. How many SNPs are singletons (only one copy of the ALT allele)?

In [30]:
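# A row is a singleton when it carries exactly one ALT copy in total:
# heterozygous genotypes contribute one copy, "1|1" contributes two.
isRowSing <- function(row){
    cnt <- sum(row=="1|0") + sum(row=="0|1") + 2*sum(row=="1|1")
    cnt == 1
}

# Apply the test to every row and count the TRUEs.
sum(apply(x22, 1, isRowSing))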
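
The row-by-row `apply()` call can also be written with whole-matrix comparisons and `rowSums()`. A minimal sketch on made-up genotypes (the data frame `toy`, its sample columns `S1` to `S3`, and the names `altCopies`, `altPerSNP`, `nSingletons` are all hypothetical):

```r
# Toy stand-in for x22: one ID column plus three sample columns.
toy <- data.frame(
  ID = c("rs1", "rs2", "rs3", "rs4"),
  S1 = c("0|1", "1|1", "0|0", "0|0"),
  S2 = c("0|0", "0|0", "0|0", "1|0"),
  S3 = c("0|0", "0|1", "0|0", "0|0"),
  stringsAsFactors = FALSE
)

# Keep only the genotype columns, as a character matrix.
gt <- as.matrix(toy[, c("S1", "S2", "S3")])

# ALT copies per genotype: one for a heterozygote, two for "1|1".
altCopies <- (gt == "0|1") + (gt == "1|0") + 2 * (gt == "1|1")

# Total ALT copies per SNP, then count the SNPs with exactly one.
altPerSNP <- rowSums(altCopies)
nSingletons <- sum(altPerSNP == 1)
print(nSingletons)
```

Selecting the genotype columns explicitly also avoids comparing annotation columns against genotype strings.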

6. For how many SNPs does the ALT allele have nonzero frequency among the European individuals (columns 10:90)? And among the African individuals (columns 91:171)? Can this support the out-of-Africa hypothesis?

In [27]:
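# Number of ALT allele copies in one row of genotypes.
cntRowSing <- function(row){
    cnt <- sum(row=="1|0") + sum(row=="0|1") + 2*sum(row=="1|1")
    cnt
}

# ALT allele count per SNP, restricted to the sample columns in idx.
getFreq <- function(x, idx){
    apply(x[idx], 1, cntRowSing)
}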



In [28]:
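# Count the SNPs whose ALT allele count is above zero in each group
# (summing the counts themselves would give total ALT copies instead).
sum(getFreq(x22, 10:90) > 0) # Euro
sum(getFreq(x22, 91:171) > 0) # Afri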
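
To make the distinction between "total ALT copies" and "number of SNPs with a nonzero ALT count" concrete, here is a small sketch on made-up genotypes. The data frame `toy`, its `EUR*`/`AFR*` columns, and the helper `altCount` are hypothetical; the toy numbers carry no biological meaning.

```r
# Toy stand-in for x22: two "European" and two "African" sample columns.
toy <- data.frame(
  ID   = c("rs1", "rs2", "rs3", "rs4"),
  EUR1 = c("0|1", "0|0", "0|0", "0|0"),
  EUR2 = c("0|0", "0|0", "0|0", "1|1"),
  AFR1 = c("0|0", "1|0", "0|0", "0|1"),
  AFR2 = c("1|1", "0|1", "0|0", "0|0"),
  stringsAsFactors = FALSE
)

# ALT allele copies per SNP within the sample columns given by idx.
altCount <- function(x, idx) {
  gt <- as.matrix(x[, idx, drop = FALSE])
  rowSums((gt == "0|1") + (gt == "1|0") + 2 * (gt == "1|1"))
}

eur <- altCount(toy, 2:3)
afr <- altCount(toy, 4:5)

# Number of SNPs where the ALT allele appears at least once.
nEur <- sum(eur > 0)
nAfr <- sum(afr > 0)

# Total ALT copies, shown only to contrast with the counts above.
cat("SNPs with ALT present: EUR", nEur, "AFR", nAfr, "\n")
cat("Total ALT copies:      EUR", sum(eur), "AFR", sum(afr), "\n")
```

With the real data, `idx` would be `10:90` and `91:171`, and the two `sum(... > 0)` values are the numbers to compare.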