# Data Inspection 

Before we begin working with the two files, let's get a handle on how they're structured.

## `snp_position.txt`

`snp_position.txt` has 984 lines:

In [16]:
wc -l snp_position.txt

     984 snp_position.txt


There are 15 columns in this file:

In [17]:
head -1 snp_position.txt | awk '{print NF}'

15
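
By default `awk` splits on any run of whitespace, so the count above would be inflated if a header field contained a space. Below is a minimal sketch of a stricter check, assuming the file is tab-delimited (as the later `cut -f` and `join -t` calls suggest). The toy file `toy.txt` and its contents are hypothetical.

```shell
# Toy stand-in for a tab-delimited table (hypothetical data).
tmp=$(mktemp -d)
printf 'id\tgene name\tpos\n' >  "$tmp/toy.txt"
printf 's1\tabc\t10\n'        >> "$tmp/toy.txt"
printf 's2\tdef\t20\n'        >> "$tmp/toy.txt"

# Default splitting counts "gene name" as two fields in the header.
default_nf=$(head -1 "$tmp/toy.txt" | awk '{print NF}')

# Split on tabs only and list every distinct field count in the file;
# a single value means every row has the same number of columns.
tab_nf=$(awk -F '\t' '{print NF}' "$tmp/toy.txt" | sort -u)

echo "default: $default_nf, tab-delimited: $tab_nf"
```

Running the second command on the real files would also confirm that every row, not just the header, has the expected number of columns.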


The file is 81K in size:

In [18]:
ls -lh snp_position.txt | awk '{ print $5 }'

81K


## `fang_et_al_genotypes.txt`

`fang_et_al_genotypes.txt` has 2783 lines:

In [19]:
wc -l fang_et_al_genotypes.txt

    2783 fang_et_al_genotypes.txt


There are 986 columns in this file:

In [20]:
head -1 fang_et_al_genotypes.txt | awk '{print NF}'

986


The file is 11M in size:

In [21]:
ls -lh fang_et_al_genotypes.txt | awk '{ print $5 }'

11M


Here is a count of samples in each group (the third column, `Group`):

In [53]:
cut -f 3 fang_et_al_genotypes.txt | sort | uniq -c

   1 Group
  22 TRIPS
  15 ZDIPL
  17 ZLUXR
  10 ZMHUE
 290 ZMMIL
1256 ZMMLR
  27 ZMMMR
 900 ZMPBA
  41 ZMPIL
  34 ZMPJA
  75 ZMXCH
  69 ZMXCP
   6 ZMXIL
   7 ZMXNO
   4 ZMXNT
   9 ZPERR


# Data Processing



We begin by making separate directories for the maize and teosinte data.  We'll also make a `raw` directory for our raw data and move `snp_position.txt` and `fang_et_al_genotypes.txt` into it.

In [26]:
mkdir maize teosinte raw
mv snp_position.txt fang_et_al_genotypes.txt raw

We'll start working with the maize data first.  We begin by extracting all the maize genotype information and saving it to `maize_genotypes.txt`:

In [64]:
cd maize
awk '$3~/ZMMIL|ZMMLR|ZMMMR/' ../raw/fang_et_al_genotypes.txt > maize_genotypes.txt

bash: cd: maize: No such file or directory


As a sanity check, we'll see how many genotypes are in `maize_genotypes.txt`:

In [65]:
wc -l maize_genotypes.txt

    1573 maize_genotypes.txt


There are 1573 genotypes, which matches the combined count of `ZMMIL`, `ZMMLR`, and `ZMMMR` genotypes in `fang_et_al_genotypes.txt` (290 + 1256 + 27 = 1573).
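
The comparison can also be done in the shell. This small sketch sums the three group counts copied from the `uniq -c` output earlier; the variable names are only illustrative.

```shell
# Group counts copied from the `uniq -c` output above.
counts='290 ZMMIL
1256 ZMMLR
27 ZMMMR'

# Sum the first column of the count table.
total=$(printf '%s\n' "$counts" | awk '{sum += $1} END {print sum}')
echo "$total"   # 1573
```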

Before we transpose `maize_genotypes.txt` in preparation for `join`ing it with `snp_position.txt`, we will prepend the header from `fang_et_al_genotypes.txt` to it.  Unfortunately, this requires the creation of an intermediate file, because inserting the header in place with `sed` proved troublesome.

In [66]:
head -1 ../raw/fang_et_al_genotypes.txt | cat - maize_genotypes.txt > temp
mv temp maize_genotypes.txt
wc -l maize_genotypes.txt

    1574 maize_genotypes.txt
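
The header can also be kept during the filtering step itself, which avoids the intermediate file. This is a sketch on hypothetical toy data, assuming tab-delimited input and that group names should match the third column exactly; it is an alternative, not the approach used above.

```shell
tmp=$(mktemp -d)
# Toy genotype table (hypothetical): header plus four samples.
printf 'Sample_ID\tJG_OTU\tGroup\tsnp1\n' >  "$tmp/geno.txt"
printf 'a\tx\tZMMIL\tA/A\n'               >> "$tmp/geno.txt"
printf 'b\tx\tZMPBA\tC/C\n'               >> "$tmp/geno.txt"
printf 'c\tx\tZMMLR\tG/G\n'               >> "$tmp/geno.txt"
printf 'd\tx\tZMMMR\tT/T\n'               >> "$tmp/geno.txt"

# NR==1 keeps the header; the anchored regex keeps only the maize groups.
awk -F '\t' 'NR==1 || $3 ~ /^(ZMMIL|ZMMLR|ZMMMR)$/' "$tmp/geno.txt" > "$tmp/maize.txt"

# Header plus three maize rows.
wc -l < "$tmp/maize.txt"
```

Anchoring the pattern with `^` and `$` guards against a group name that merely contains one of the maize codes.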


We then use the provided `transpose.awk` script to transpose the genotype data:

In [61]:
awk -f ../transpose.awk maize_genotypes.txt > transposed_maize_genotypes.txt
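
`transpose.awk` itself is not reproduced here. As a rough illustration of what such a script does, and not necessarily how the provided one is written, the rows and columns of a tab-delimited table can be swapped as follows. The toy input is hypothetical.

```shell
tmp=$(mktemp -d)
printf 'Sample_ID\tsnp1\tsnp2\n' >  "$tmp/in.txt"
printf 'a\tA/A\tC/C\n'           >> "$tmp/in.txt"
printf 'b\tG/G\tT/T\n'           >> "$tmp/in.txt"

awk -F '\t' '
{
    # Store every cell, indexed by (row, column).
    for (i = 1; i <= NF; i++) cell[NR, i] = $i
    if (NF > ncol) ncol = NF
}
END {
    # Emit each input column as one output row.
    for (i = 1; i <= ncol; i++) {
        line = cell[1, i]
        for (j = 2; j <= NR; j++) line = line "\t" cell[j, i]
        print line
    }
}' "$tmp/in.txt" > "$tmp/out.txt"

cat "$tmp/out.txt"
```

After transposition the first column holds `Sample_ID` followed by the SNP names, which is why that column can later serve as the join key.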

We will now merge `transposed_maize_genotypes.txt` and `snp_position.txt`.  The merge will be performed on the first column of `transposed_maize_genotypes.txt` (headed `Sample_ID`, and holding the SNP names after transposition) and the `SNP_ID` field in `snp_position.txt`.  As directed in the instructions, the merged file's first three columns will be `SNP_ID`, `Chromosome`, and `Position`, followed by all of the remaining columns from `fang_et_al_genotypes.txt`.

In [None]:
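# Sort each input on its first tab-delimited field (the join key) before joining.
join -o 1.1,1.3,1.4 -o "$(echo 2.{2..1574})" -t $'\t' \
    <(sort -t $'\t' -k1,1 ../raw/snp_position.txt) \
    <(sort -t $'\t' -k1,1 transposed_maize_genotypes.txt) \
    > merged_maize_genotypes.txt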

Some explanation:

   * The `-o` argument specifies which columns will be kept in the final merged table.  `1.1` refers to the first column of file 1, `1.3` to the third column of file 1, and so on.  The second `-o` argument requests all the remaining columns from file 2 except the first (since we already specified it with `1.1`).
   * By default `join` splits fields on blanks and writes space-separated output; `-t $'\t'` sets tab as both the input and output delimiter so that the data plays nicely later.
   * `join` requires both input files to be sorted on the join field.  Rather than writing sorted copies to disk, we sort each file on the fly with process substitution, the `<(sort ...)` syntax.
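
The points above can be seen on a small scale with the following sketch. The two toy files and their values are hypothetical, and sorted temporary files are used in place of process substitution so the example also runs in a plain POSIX shell.

```shell
tmp=$(mktemp -d)
tab=$(printf '\t')

# File 1: SNP_ID, an extra column, Chromosome, Position (hypothetical values).
printf 'snp2\tx\t1\t200\n' >  "$tmp/pos.txt"
printf 'snp1\tx\t2\t100\n' >> "$tmp/pos.txt"

# File 2: SNP_ID followed by two genotype columns.
printf 'snp1\tA/A\tC/C\n' >  "$tmp/geno.txt"
printf 'snp2\tG/G\tT/T\n' >> "$tmp/geno.txt"

# Sort both inputs on the join field only.
sort -t "$tab" -k1,1 "$tmp/pos.txt"  > "$tmp/pos.sorted"
sort -t "$tab" -k1,1 "$tmp/geno.txt" > "$tmp/geno.sorted"

# Keep columns 1, 3, 4 of file 1, then columns 2 and 3 of file 2.
join -t "$tab" -o 1.1,1.3,1.4,2.2,2.3 \
    "$tmp/pos.sorted" "$tmp/geno.sorted" > "$tmp/merged.txt"

cat "$tmp/merged.txt"
```

The extra column `x` from file 1 is dropped because it is not listed in `-o`, just as the second column of `snp_position.txt` is dropped in the real merge.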
    
We'll now add the header back into the file:

In [92]:
cat <(head -1 ../raw/snp_position.txt | cut -f 1,3,4 | tr '\n' '\t') \
    <(head -1 transposed_maize_genotypes.txt | cut -f 2-1574) | cat - merged_maize_genotypes.txt > temp 
mv temp merged_maize_genotypes.txt

We'll make two separate directories for the per-chromosome files: one sorted by increasing position and one sorted by decreasing position.

We'll first take a look at `merged_maize_genotypes.txt` to see how many entries there are for each chromosome:

In [94]:
cut -f 2 merged_maize_genotypes.txt | sort | uniq -c

 155 1
  53 10
 127 2
 107 3
  91 4
 122 5
  76 6
  97 7
  62 8
  60 9
   1 Chromosome
   6 multiple
  27 unknown


After we separate each of the entries into separate files by chromosome, we expect there to be 950 total entries for chromosomes 1 through 10, 6 entries with multiple positions, and 27 for unknown positions.  
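
The expected totals follow from the counts listed above. A quick sketch of the arithmetic in the shell, using only numbers copied from that output:

```shell
# Per-chromosome counts for chromosomes 1 through 10, copied from the output above.
total=0
for n in 155 127 107 91 122 76 97 62 60 53; do
    total=$((total + n))
done

# Add the "multiple" and "unknown" rows to get every data row.
all_rows=$((total + 6 + 27))

echo "chromosomes 1-10: $total"      # 950
echo "all data rows:    $all_rows"   # 983, one per non-header line of snp_position.txt
```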

Before we separate the entries by chromosome, let's make a directory to hold all of the files that will be sorted in increasing order:

In [97]:
mkdir sorted_inc_pos

We can grab all the entries for chromosome 1 and store them in `chromosome_1_sort_inc.txt` by using `awk`:

In [112]:
awk -F '\t' '$2==1' merged_maize_genotypes.txt > ./sorted_inc_pos/chromosome_1_sort_inc.txt
wc -l ./sorted_inc_pos/chromosome_1_sort_inc.txt

     155 ./sorted_inc_pos/chromosome_1_sort_inc.txt


We see that there are 155 entries, matching the number we found above when counting entries per chromosome in `merged_maize_genotypes.txt`.

We'll now sort `chromosome_1_sort_inc.txt` on `Position`:

In [113]:
sort -o ./sorted_inc_pos/chromosome_1_sort_inc.txt \
    -nk3 ./sorted_inc_pos/chromosome_1_sort_inc.txt
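
The decreasing-position files mentioned earlier need only one extra sort flag. This is a sketch on hypothetical toy data, assuming tab-delimited input with the position in the third column; the file names are illustrative.

```shell
tmp=$(mktemp -d)
tab=$(printf '\t')
printf 'snp1\t1\t300\n'  >  "$tmp/chr1.txt"
printf 'snp2\t1\t20\n'   >> "$tmp/chr1.txt"
printf 'snp3\t1\t1000\n' >> "$tmp/chr1.txt"

# -k3,3n: numeric sort restricted to the third field, increasing.
sort -t "$tab" -k3,3n  "$tmp/chr1.txt" > "$tmp/chr1_inc.txt"
# Adding r reverses the order, giving decreasing position.
sort -t "$tab" -k3,3nr "$tmp/chr1.txt" > "$tmp/chr1_dec.txt"

cut -f 3 "$tmp/chr1_inc.txt"   # 20, 300, 1000 (one per line)
cut -f 3 "$tmp/chr1_dec.txt"   # 1000, 300, 20 (one per line)
```

Without `n` the positions would be compared as text, which would place `1000` before `20`.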

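The same extraction can be repeated for chromosomes 1 through 10 with a loop:

In [106]:
for i in {1..10}
do
    # Inside awk, chrom is a plain variable; $chrom would instead mean "field number chrom".
    awk -F '\t' -v chrom="$i" '$2==chrom' merged_maize_genotypes.txt \
    > ./sorted_inc_pos/chrom_${i}_inc_pos.txt
done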
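
As an alternative to looping over chromosome numbers, a single `awk` pass can route each row to a file named after its chromosome. This is a sketch on hypothetical toy data: the output naming mirrors the loop above, and rows whose chromosome is not a number (the header, `multiple`, `unknown`) are skipped.

```shell
tmp=$(mktemp -d)
printf 'SNP_ID\tChromosome\tPosition\n' >  "$tmp/merged.txt"
printf 'snp1\t1\t300\n'                 >> "$tmp/merged.txt"
printf 'snp2\t2\t20\n'                  >> "$tmp/merged.txt"
printf 'snp3\t1\t10\n'                  >> "$tmp/merged.txt"
printf 'snp4\tunknown\tunknown\n'       >> "$tmp/merged.txt"

# Write each row with a numeric chromosome to that chromosome's own file.
awk -F '\t' -v dir="$tmp" '
$2 ~ /^[0-9]+$/ {
    out = dir "/chrom_" $2 "_inc_pos.txt"
    print > out
}' "$tmp/merged.txt"

wc -l "$tmp"/chrom_*_inc_pos.txt
```

This reads the merged table once instead of ten times. The resulting files still need to be sorted on position afterwards.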