##### Exam 2 Answers by Javier López Rodríguez  (javier.lopez.rodriguez@alumnos.upm.es)

### Problem 1:  Controls

Write a Python script that proves that the lines of data in Germplasm.tsv and LocusGene.tsv are in the same sequence, based on the AGI Locus Code (ATxGxxxxxx).  (hint: This will help you decide how to load the data into the database)

---------

Both inputs are .tsv files with a header row, so I read them with csv.DictReader.

# Explanation

I iterate through each file once at the start to count its lines, because csv.DictReader has no method that returns the number of rows. A csv.DictReader is an iterator that reads one row at a time, so, unlike a list, it never holds the whole file in memory. Converting it into a list just to call len() would load every row at once, which can be costly in memory for large files, so I avoided doing that.

In [16]:
import csv

filename1 = "LocusGene.tsv"
filename2 = "Germplasm.tsv"

with open(filename1) as file1:
    with open(filename2) as file2:
        
        # getting the length of each file by iterating through it and adding 1 for each line
        # subtracting 1 so that we don't count the header
        length1 = sum(1 for _ in file1) - 1 
        length2 = sum(1 for _ in file2) - 1 
        
        # reset the pointer to the start of each file
        file1.seek(0) 
        file2.seek(0)

        # opening each file with csv.DictReader
        locusgene = csv.DictReader(file1, delimiter="\t", quotechar='"') # default fieldnames because of the header
        germplasm = csv.DictReader(file2, delimiter="\t", quotechar='"') # default fieldnames because of the header
        
        if length1 != length2: # different number of lines
            print("Warning: The two files do not have the same number of lines.")
            print("Only the first {} lines of each file will be compared.".format(min(length1, length2)))
        else:
            print("Both files have the same number of lines: {} (without header).".format(length1))
        
        mismatched_lines = [] # will store the indexes of the mismatched lines, if any
        correct_lines = [] # will store the indexes of the correct lines
        
        for index in range(min(length1, length2)):  # the minimum so that, if different, next() doesn't raise StopIteration on the shorter file
            # getting the next item from the DictReaders, and accessing the value of the key "Locus"
            locus1 = next(locusgene)["Locus"] 
            locus2 = next(germplasm)["Locus"]
            #print(locus1 + " " + locus2) # checking that locus1 and locus2 contain the expected strings
            # checking if they match or mismatch
            if locus1 == locus2: # match
                correct_lines.append(index)
            else: # mismatch
                mismatched_lines.append(index)
        
        if len(mismatched_lines) > 0: # there are mismatches, output them
            print("Warning: There are some mismatches.")
            print("Mismatched lines: " + " ".join(str(i) for i in mismatched_lines))  # indexes are ints, so convert before joining
            print("There were {} lines with matching Locus code.".format(len(correct_lines)))
        elif len(correct_lines) == min(length1, length2): # there are no mismatches and every line checked was a match
            print("No mismatches found. There were {} lines with matching Locus code.".format(len(correct_lines)))
        else: # there are no mismatches but not every line checked was a match -> this should not happen
            print("Error: there were fewer matches than expected. Something went wrong.")


Both files have the same number of lines: 32 (without header).
No mismatches found. There were 32 lines with matching Locus code.
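
As an aside, the same check could probably be done in a single pass, without counting lines first. The sketch below is only an illustration, not the approach used above: it pairs the rows of both readers with `itertools.zip_longest`, so a difference in length shows up as a mismatch against `None`. The function name `compare_locus_columns` and the in-memory sample data are hypothetical; only the `Locus` column name comes from the files.

```python
import csv
import io
from itertools import zip_longest


def compare_locus_columns(handle1, handle2, key="Locus"):
    """Compare one column of two tab-separated files row by row.

    Returns (rows_compared, mismatches), where mismatches is a list of
    (index, value_in_first_file, value_in_second_file) tuples.
    """
    reader1 = csv.DictReader(handle1, delimiter="\t")
    reader2 = csv.DictReader(handle2, delimiter="\t")
    mismatches = []
    total = 0
    # zip_longest pads the shorter reader with None, so a missing row
    # is reported as a mismatch instead of being silently skipped
    for index, (row1, row2) in enumerate(zip_longest(reader1, reader2)):
        total += 1
        locus1 = row1[key] if row1 is not None else None
        locus2 = row2[key] if row2 is not None else None
        if locus1 != locus2:
            mismatches.append((index, locus1, locus2))
    return total, mismatches


# Hypothetical in-memory data standing in for the real .tsv files
sample1 = io.StringIO("Locus\tGene\nLOCUS_A\tgeneA\nLOCUS_B\tgeneB\n")
sample2 = io.StringIO("Locus\tGermplasm\nLOCUS_A\tline1\nLOCUS_B\tline2\n")

total, mismatches = compare_locus_columns(sample1, sample2)
print("Rows compared: {}, mismatches: {}".format(total, mismatches))
```

With real files, the two `io.StringIO` objects would be replaced by the handles returned by `open()`. Both files are still read lazily, one row at a time.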


### Problem 2:  Design and create the database
* It should have two tables - one for each of the two data files.
* The two tables should be linked in a 1:1 relationship
* you may use either sqlMagic or pymysql to build the database

---------

# Explanation

We know that both files contain the same AGI Locus codes in the same positions, and both tables are going to have that field. 

Because the relationship between the two tables is 1:1 and the AGI Locus code uniquely identifies each entry in both tables, I am going to use it as the primary key of both tables. Therefore, the tables won't include additional numeric ids. Queries that involve both tables will join them on the AGI Locus code.
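
To illustrate how a shared primary key links the two tables, here is a minimal sketch. It assumes an in-memory SQLite database (via Python's built-in `sqlite3`) instead of the MySQL server used in this exam, purely so that it runs on its own; the inserted rows are hypothetical placeholders, not values from the data files.

```python
import sqlite3

# In-memory SQLite database, used here only so the sketch runs without a MySQL server
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Same column layout as the tables in this exam, with locus as primary key in both
cur.execute(
    "CREATE TABLE locusgene ("
    "locus VARCHAR(10) NOT NULL PRIMARY KEY, "
    "gene VARCHAR(10) NOT NULL, "
    "protein_length INTEGER NOT NULL)"
)
cur.execute(
    "CREATE TABLE germplasm ("
    "locus VARCHAR(10) NOT NULL PRIMARY KEY, "
    "germplasm VARCHAR(20) NOT NULL, "
    "phenotype VARCHAR(500) NOT NULL, "
    "pubmed INTEGER NOT NULL)"
)

# Hypothetical placeholder rows (not taken from the real files)
cur.execute("INSERT INTO locusgene VALUES (?, ?, ?)", ("LOCUS_A", "geneA", 100))
cur.execute(
    "INSERT INTO germplasm VALUES (?, ?, ?, ?)",
    ("LOCUS_A", "line1", "example phenotype", 12345),
)

# Joining on the shared primary key returns exactly one row per locus
cur.execute(
    "SELECT l.locus, l.gene, g.germplasm "
    "FROM locusgene AS l JOIN germplasm AS g ON l.locus = g.locus"
)
rows = cur.fetchall()
print(rows)

# The primary key also rejects a second germplasm row for the same locus,
# which is what keeps the relationship 1:1 rather than 1:many
try:
    cur.execute(
        "INSERT INTO germplasm VALUES (?, ?, ?, ?)",
        ("LOCUS_A", "line2", "another phenotype", 67890),
    )
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True
print("Duplicate locus rejected:", duplicate_rejected)

conn.close()
```

Note that SQLite uses `?` placeholders, whereas pymysql uses `%s`, so the parameterized statements would need that change to run against MySQL.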




In [None]:
#Connecting to sqlMagic
%load_ext sql
#%config SqlMagic.autocommit=False
%sql mysql+pymysql://root:root@127.0.0.1:3306/mysql

In [2]:
%sql create database examweek2;

 * mysql+pymysql://root:***@127.0.0.1:3306/mysql
1 rows affected.


[]

In [3]:
%sql use examweek2;

 * mysql+pymysql://root:***@127.0.0.1:3306/mysql
0 rows affected.


[]

In [4]:
%sql CREATE TABLE locusgene (locus VARCHAR(10) NOT NULL PRIMARY KEY, \
                             gene VARCHAR(10) NOT NULL, \
                             protein_length INTEGER NOT NULL);
%sql DESCRIBE locusgene;

 * mysql+pymysql://root:***@127.0.0.1:3306/mysql
0 rows affected.
 * mysql+pymysql://root:***@127.0.0.1:3306/mysql
3 rows affected.
 * mysql+pymysql://root:***@127.0.0.1:3306/mysql
0 rows affected.
 * mysql+pymysql://root:***@127.0.0.1:3306/mysql
4 rows affected.


Field,Type,Null,Key,Default,Extra
locus,varchar(10),NO,PRI,,
germplasm,varchar(20),NO,,,
phenotype,varchar(500),NO,,,
pubmed,int(11),NO,,,


In [None]:
%sql CREATE TABLE germplasm (locus VARCHAR(10) NOT NULL PRIMARY KEY, \
                             germplasm VARCHAR(20) NOT NULL, \
                             phenotype VARCHAR(500) NOT NULL, \
                             pubmed INTEGER NOT NULL);
%sql DESCRIBE germplasm;

### Problem 3: Fill the database
Using pymysql, create a Python script that reads the data from these files, and fills the database.  There are a variety of strategies to accomplish this.  I will give all strategies equal credit - do whichever one you are most confident with.

------

# Explanation