##### Exam 2 Answers by Javier López Rodríguez  (javier.lopez.rodriguez@alumnos.upm.es)

### Problem 1:  Controls

Write a Python script that proves that the lines of data in Germplasm.tsv and LocusGene.tsv are in the same sequence, based on the AGI Locus Code (ATxGxxxxxx).  (hint: This will help you decide how to load the data into the database)

---------

#### Explanation:

We are dealing with .tsv files with headers. To read them, I'll use csv.DictReader because it returns each row as a dictionary keyed by the header names, which makes it easy to pick out the Locus column (and I haven't used it before this course, unlike the normal Python input/output, so it helps me practice new things).

After opening both files, I first iterate through each file once to count its lines, because I haven't found a method of csv.DictReader that returns its length. A csv.DictReader object is an iterator, so it doesn't load every line into memory at the same time, unlike a list. Converting it into a list just to use len() would load the whole file into memory, so I avoided doing that.

Then, I reset the pointer to the start and create the csv.DictReader objects for both files.

Because we want to prove that both files have the same locus code on the same line, the number of lines should be the same in both files. I check this (that's why I need to iterate the first time for the lengths). If the lengths aren't the same, the script only compares until the shorter file ends (minimum of length1 and length2).

I create two lists for storing the line number of matches and mismatches (to check at the end if there is any mismatch, how many there are, and if the number of matches and mismatches is what we expected).

Then, in a for loop, I iterate through both DictReaders at the same time, using the zip function. If the number of entries differs, zip stops when the shorter one ends (this is equivalent to iterating until the minimum of length1 and length2).

I go through every pair of entries, store the Locus code (which DictReader makes very easy because each row is a dictionary and there's no need to split/do anything else), and compare both codes. I append the line number to either list, depending on it being a match or a mismatch.

 * Note that zip creates an iterator, not a list. Therefore, a zip of two DictReaders still does not load everything into memory at the same time, and the advantage of using DictReaders instead of turning everything into a list is maintained.

In the end, I check if there are any mismatches (and print them), print the number of matches, and have a third condition just in case the number of matches + mismatches is less than the number of compared lines. The latter is just a precaution in case the code is incorrect, because (I think) it should never happen if the code is correct.
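
As a side note, the length check and the comparison could probably be done in a single pass with `itertools.zip_longest`, which pads the shorter iterator with `None` instead of stopping early. This is only a sketch of that idea, not the approach used in the cell below: the function name `compare_locus_columns` is hypothetical, and the sample rows are invented placeholders standing in for the real .tsv files.

```python
import csv
import io
from itertools import zip_longest

def compare_locus_columns(file1, file2, column="Locus"):
    """Return (number of matches, list of 1-based data line numbers that mismatch)."""
    reader1 = csv.DictReader(file1, delimiter="\t")
    reader2 = csv.DictReader(file2, delimiter="\t")
    matches = 0
    mismatches = []
    # zip_longest pads the shorter reader with None, so extra rows count as mismatches
    for linenumber, (row1, row2) in enumerate(zip_longest(reader1, reader2), start=1):
        if row1 is not None and row2 is not None and row1[column] == row2[column]:
            matches += 1
        else:
            mismatches.append(linenumber)
    return matches, mismatches

# Invented placeholder data (not taken from the real files); the second "file" has one extra row
sample1 = io.StringIO("Locus\tGene\nAT0G00001\tgeneA\nAT0G00002\tgeneB\n")
sample2 = io.StringIO("Locus\tgermplasm\nAT0G00001\tlineA\nAT0G00002\tlineB\nAT0G00003\tlineC\n")

result = compare_locus_columns(sample1, sample2)
print(result)  # (2, [3])
```

Like zip, zip_longest is lazy, so the files are still read one row at a time.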


In [21]:
import csv

filename1 = "LocusGene.tsv"
filename2 = "Germplasm.tsv"

with open(filename1) as file1, open(filename2) as file2:
        
    # getting the length of each file by iterating through each line and adding 1 for each line
    # subtracting 1 so that we don't count the header
    length1 = sum(1 for _ in file1) - 1 
    length2 = sum(1 for _ in file2) - 1 

    # reset the pointer to the start of each file
    file1.seek(0) 
    file2.seek(0)

    # opening each file with csv.DictReader
    locusgene = csv.DictReader(file1, delimiter="\t", quotechar='"') # default fieldnames because of the header
    germplasm = csv.DictReader(file2, delimiter="\t", quotechar='"') # default fieldnames because of the header

    if length1 != length2: # different number of lines
        print("Warning: There are not the same number of lines in both files ({} and {})".format(length1, length2))
        print("Only the first {} lines of each file will be compared.".format(min(length1, length2)))
    else: # equal number of lines
        print("Both files have the same number of lines: {} (without header).".format(length1))

    mismatched_lines = [] # will store the indexes of the mismatched lines, if any
    correct_lines = [] # will store the indexes of the correct lines

    # iterating through every pair of elements
    linenumber = 1 # keeps track of the line number we're in, starts at 1 (because we skip the header using DictReader)
    for entry1, entry2 in zip(locusgene, germplasm): # iterating through both DictReaders at the same time
        locus1, locus2 = entry1["Locus"], entry2["Locus"]
        #print(locus1 + " " + locus2) # checking that locus1 and locus2 contain the expected strings
        # checking if they match or mismatch
        if locus1 == locus2: # match
            correct_lines.append(linenumber)
        else: # mismatch
            mismatched_lines.append(linenumber)
        linenumber += 1 # increment linenumber
        
    if len(mismatched_lines) > 0: # there are mismatches, output them
        print("Warning: There are some mismatches.")
        print("Mismatched lines: " + " ".join(str(line) for line in mismatched_lines)) # join needs strings, not ints
        print("There were {} lines with matching Locus code.".format(len(correct_lines)))
    elif len(correct_lines) == min(length1, length2): # there are no mismatches and every line checked was a match
        print("No mismatches found. There were {} lines with matching Locus code.".format(len(correct_lines)))
    else: # there are no mismatches but not every line checked was a match -> this should never happen
        print("Error: there were fewer matches than expected. Something went wrong.")

# using "with open(...) as ...", we don't need to close the files afterwards, it is done automatically.

Both files have the same number of lines: 32 (without header).
No mismatches found. There were 32 lines with matching Locus code.


This problem can be solved in a simpler way using the pandas library:

In [22]:
import pandas as pd

filename1 = "LocusGene.tsv"
filename2 = "Germplasm.tsv"

# reading the .tsv into pandas dataframes
df1 = pd.read_csv(filename1, sep = "\t")
df2 = pd.read_csv(filename2, sep = "\t")

# renaming the columns so that, when concatenating the columns, they are named differently
df1 = df1.rename(columns = {"Locus": "Locus1"})
df2 = df2.rename(columns = {"Locus": "Locus2"})

# printing number of items (size) of each column
print("Number of items in " + filename1 + " is {}".format(df1["Locus1"].size))
print("Number of items in " + filename2 + " is {}".format(df2["Locus2"].size))

# concatenating both columns so that the following comparison can be made
# if the number of elements is different, pd.concat fills the missing elements of the shorter column with NaN
# so that both columns have the same length
dfconcat = pd.concat([df1["Locus1"], df2["Locus2"]], axis = 1)

# comparing the contents of both columns
# the comparison gives a boolean array, the sum of that array is the number of True elements (number of matches)
# doing this without concatenating both columns first gives an error if the number of elements is different,
# that is why we need to concatenate both columns first into the same data frame
print("Number of matching Locus codes: " + str(sum(dfconcat["Locus1"] == dfconcat["Locus2"])))

Number of items in LocusGene.tsv is 32
Number of items in Germplasm.tsv is 32
Number of matching Locus codes: 32


### Problem 2:  Design and create the database
* It should have two tables - one for each of the two data files.
* The two tables should be linked in a 1:1 relationship
* you may use either sqlMagic or pymysql to build the database

---------

#### Explanation:

I'm using sqlMagic because I find it easier for creating databases and tables.

We know that both files contain the same AGI Locus codes in the same positions, and both tables are going to have that field. 

Because the relationship between the two tables is 1:1 and the AGI Locus code in this case is a unique identifier of each entry in both tables, I am going to use it as the primary key of both tables. Therefore, the tables won't include additional numeric ids. Linking one table with the other in queries that involve both is going to happen via the AGI Locus codes.
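
The shared-primary-key idea can be illustrated with a small, self-contained sketch. It uses Python's built-in sqlite3 module instead of MySQL so that it runs without a server, and it adds a foreign key on the second table, which is an optional extra that is not part of the tables I create in this exam. The inserted rows are invented placeholders.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces foreign keys when this is enabled

# Both tables use the locus code as primary key; no extra numeric id is needed
conn.execute("""CREATE TABLE locusgene (
                    locus VARCHAR(10) NOT NULL PRIMARY KEY,
                    gene VARCHAR(10) NOT NULL,
                    protein_length INTEGER NOT NULL)""")
# Assumption for illustration: germplasm.locus also references locusgene.locus
conn.execute("""CREATE TABLE germplasm (
                    locus VARCHAR(10) NOT NULL PRIMARY KEY
                        REFERENCES locusgene (locus),
                    germplasm VARCHAR(20) NOT NULL,
                    phenotype VARCHAR(500) NOT NULL,
                    pubmed INTEGER NOT NULL)""")

# Invented placeholder rows
conn.execute("INSERT INTO locusgene VALUES ('AT0G00001', 'geneA', 100)")
conn.execute("INSERT INTO germplasm VALUES ('AT0G00001', 'lineA', 'example phenotype', 1)")

# A germplasm row whose locus has no partner in locusgene is rejected
try:
    conn.execute("INSERT INTO germplasm VALUES ('AT0G00002', 'lineB', 'no matching locus', 2)")
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True

print(orphan_rejected)  # True
```

With a constraint like this, locusgene would have to be filled before germplasm. Without it, the link between the tables exists only through the matching locus values.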




In [None]:
#Connecting to sqlMagic
%load_ext sql
#%config SqlMagic.autocommit=False
%sql mysql+pymysql://root:root@127.0.0.1:3306/mysql

In [2]:
%sql create database examweek2;

 * mysql+pymysql://root:***@127.0.0.1:3306/mysql
1 rows affected.


[]

In [3]:
%sql use examweek2;

 * mysql+pymysql://root:***@127.0.0.1:3306/mysql
0 rows affected.


[]

In [4]:
%sql CREATE TABLE locusgene (locus VARCHAR(10) NOT NULL PRIMARY KEY, \
                             gene VARCHAR(10) NOT NULL, \
                             protein_length INTEGER NOT NULL);
%sql DESCRIBE locusgene;

 * mysql+pymysql://root:***@127.0.0.1:3306/mysql
0 rows affected.
 * mysql+pymysql://root:***@127.0.0.1:3306/mysql
3 rows affected.
 * mysql+pymysql://root:***@127.0.0.1:3306/mysql
0 rows affected.
 * mysql+pymysql://root:***@127.0.0.1:3306/mysql
4 rows affected.


Field,Type,Null,Key,Default,Extra
locus,varchar(10),NO,PRI,,
germplasm,varchar(20),NO,,,
phenotype,varchar(500),NO,,,
pubmed,int(11),NO,,,


In [None]:
%sql CREATE TABLE germplasm (locus VARCHAR(10) NOT NULL PRIMARY KEY, \
                             germplasm VARCHAR(20) NOT NULL, \
                             phenotype VARCHAR(500) NOT NULL, \
                             pubmed INTEGER NOT NULL);
%sql DESCRIBE germplasm;

### Problem 3: Fill the database
Using pymysql, create a Python script that reads the data from these files, and fills the database.  There are a variety of strategies to accomplish this.  I will give all strategies equal credit - do whichever one you are most confident with.

------

#### Explanation:

With the design I've chosen for the database, because the relationship is 1:1 and both of them have the same primary key and no additional id, it doesn't matter which table we fill out first.
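
Whichever table is filled first, the values come from a text file and may contain quotes, so it is usually safer to pass them as query parameters than to paste them into the SQL string. A minimal sketch of that idea, using the built-in sqlite3 module and one invented row so that it runs on its own; sqlite3 uses `?` as placeholder, while pymysql uses `%s`.

```python
import csv
import io
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE germplasm (locus TEXT PRIMARY KEY, germplasm TEXT,
                                        phenotype TEXT, pubmed INTEGER)""")

# Invented sample row; note the apostrophe in the phenotype text
sample = io.StringIO("Locus\tgermplasm\tphenotype\tpubmed\n"
                     "AT0G00001\tlineA\tplant's leaves are pale\t1\n")

# One tuple of values per row of the file
rows = [(r["Locus"], r["germplasm"], r["phenotype"], int(r["pubmed"]))
        for r in csv.DictReader(sample, delimiter="\t")]

# The driver fills in the placeholders and takes care of escaping
conn.executemany("INSERT INTO germplasm VALUES (?, ?, ?, ?)", rows)
conn.commit()

stored = conn.execute("SELECT phenotype FROM germplasm").fetchone()[0]
print(stored)  # plant's leaves are pale
```

Building the same statement by string concatenation would produce invalid SQL for this row because of the apostrophe.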

In [None]:
# Importing pymysql.cursors and connecting to the database
import pymysql.cursors

# Connecting to the database examweek2
connection = pymysql.connect(host='localhost',
                             user='root',
                             password='root',
                             db='examweek2', # database name
                             charset='utf8mb4',  
                             cursorclass=pymysql.cursors.DictCursor,
                             autocommit = True) # I'm setting autocommit to True

In [None]:
# Requires an open connection to the database examweek2 (see the previous cell)
import csv

filename1 = "LocusGene.tsv"
filename2 = "Germplasm.tsv"

with open(filename1) as file1, open(filename2) as file2:
    # opening each file with csv.DictReader
    locusgene = csv.DictReader(file1, delimiter="\t", quotechar='"') # default fieldnames because of the header
    germplasm = csv.DictReader(file2, delimiter="\t", quotechar='"') # default fieldnames because of the header
    
    # inserting locusgene entries into the database:
    # the values are passed as parameters (%s) so that pymysql escapes them (e.g. quotes in the text)
    for row in locusgene:
        try:
            with connection.cursor() as cursor:
                sql = """INSERT INTO locusgene (locus, gene, protein_length) 
                         VALUES (%s, %s, %s)"""
                cursor.execute(sql, (row["Locus"], row["Gene"], row["ProteinLength"]))
                # We don't need to store ids because there isn't any auto incremented id
        except pymysql.MySQLError as error:
            print("There was an error inserting {}: {}".format(row["Locus"], error))
            
    # inserting germplasm entries into the database:
    for row in germplasm:
        try:
            with connection.cursor() as cursor:
                sql = """INSERT INTO germplasm (locus, germplasm, phenotype, pubmed) 
                         VALUES (%s, %s, %s, %s)"""
                cursor.execute(sql, (row["Locus"], row["germplasm"], row["phenotype"], row["pubmed"]))
        except pymysql.MySQLError as error:
            print("There was an error inserting {}: {}".format(row["Locus"], error))


### Problem 4: Create reports, written to a file

1. Create a report that shows the full, joined, content of the two database tables (including a header line)

2. Create a joined report that only includes the Genes SKOR and MAA3

3. Create a report that counts the number of entries for each Chromosome (AT1Gxxxxxx to AT5Gxxxxxxx)

4. Create a report that shows the average protein length for the genes on each Chromosome (AT1Gxxxxxx to AT5Gxxxxxxx)

When creating reports 2 and 3, remember the "Don't Repeat Yourself" rule! 

All reports should be written to **the same file**.  You may name the file anything you wish.

---------

#### Explanation:

In [None]:
# Requires to be connected to the database examweek2 (should already be connected from problem 3)
#connection = pymysql.connect(host='localhost',
#                             user='root',
#                             password='root',
#                             db='examweek2', # database name
#                             charset='utf8mb4',  
#                             cursorclass=pymysql.cursors.DictCursor,
#                             autocommit = True) # I'm setting autocommit to True



### Report 1
try:
    with connection.cursor() as cursor:
        # Performs an inner join to read everything from both tables (MySQL does not support FULL OUTER JOIN). 
        # In this case, any join would be equivalent because every locus exists in both tables. 
        sql = """SELECT * FROM locusgene INNER JOIN germplasm 
                 ON germplasm.locus = locusgene.locus;"""
        cursor.execute(sql)
        results = cursor.fetchall()
        for result in results:
            print(result)
            print()
finally:
    print("")

### Report 2
try:
    with connection.cursor() as cursor:
        # Performs an inner join (MySQL does not support FULL OUTER JOIN) and keeps only the genes SKOR and MAA3. 
        # In this case, any join would be equivalent because every locus exists in both tables. 
        sql = """SELECT * FROM locusgene INNER JOIN germplasm 
                 ON germplasm.locus = locusgene.locus 
                 WHERE locusgene.gene = 'SKOR' OR locusgene.gene = 'MAA3' ;"""
        cursor.execute(sql)
        results = cursor.fetchall()
        for result in results:
            print(result)
            print()
finally:
    print("")

### Report 3
## Creating a function:
def count_regex_in_field(regex, fieldname, tablename):
    """
    Counts the number of entries of a table that match a regular expression in one of its fields.
    Requires an open pymysql connection to the database.
    
    Parameters:
    regex: regular expression (in sql format)
    fieldname: the name of the field (column) of the table
    tablename: the name of the table
    
    Returns: the number of entries, or None if there was an error.
    """
    
    try:
        with connection.cursor() as cursor:
            sql = """SELECT COUNT(*) AS 'Number of matches' FROM """ + tablename + """ 
                     WHERE """ + fieldname + """ REGEXP '""" + regex + """'; """
            cursor.execute(sql)
            results = cursor.fetchall()
            count = results[0]["Number of matches"]
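    except pymysql.MySQLError as error: # only database errors; other bugs should not be hidden
        print("There was an error: {}".format(error))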
        count = None
    return count

## Generating the report:
# creates a list of the corresponding regex for the chromosomes 1 to 5 (0 to 4, +1)
chromosome_regexs = ["AT" + str(num + 1) + "G[0-9]{5}" for num in range(5)]

num_of_entries = {} # chromosome number : number of entries
for chr_regex in chromosome_regexs:
    count = count_regex_in_field(regex = chr_regex, fieldname = "locus", tablename = "locusgene")
    num_of_entries[chr_regex[2]] = count # index 2 of the regex is the chromosome number

### Report 4
## Creating a function:
def mean_of_field_where_regex_in_field(mean_fieldname, regex, regex_fieldname, tablename):
    """
    Given a table, calculates the mean of the elements of a field 
    of every entry in which another field matches a regular expression.
    Requires an open pymysql connection to the database.
    
    Parameters:
    mean_fieldname: the name of the field where the mean is going to be computed
    regex: the regular expression to match on field regex_fieldname
    regex_fieldname: the name of the field where the regular expression is going to be matched
    tablename: the name of the table
    
    Returns: the mean of the elements, or None if there was an error.
    """
    # requires an open connection to the database
    try:
        with connection.cursor() as cursor:
            sql = """SELECT AVG(""" + mean_fieldname + """) AS 'Average' FROM """ + tablename + """ 
                     WHERE """ + regex_fieldname + """ REGEXP '""" + regex + """'; """
            cursor.execute(sql)
            results = cursor.fetchall()
            average = results[0]["Average"]
    except pymysql.MySQLError as error: # only database errors; other bugs should not be hidden
        print("There was an error: {}".format(error))
        average = None
    return average

## Generating a report:
# we already have the chromosome_regexs list
#chromosome_regexs = ["AT" + str(num + 1) + "G[0-9]{5}" for num in range(5)]

average_lengths = {} # chromosome number : average protein length
for chr_regex in chromosome_regexs:
    average = mean_of_field_where_regex_in_field(mean_fieldname = "protein_length", regex = chr_regex, 
                                                regex_fieldname = "locus", tablename = "locusgene")
    average_lengths[chr_regex[2]] = average # index 2 of the regex is the chromosome number
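
The code above prints reports 1 and 2 and keeps reports 3 and 4 in dictionaries, but the problem asks for all reports to be written to the same file. The following is only a sketch of how that could look. It uses the built-in sqlite3 module and invented rows so that it runs without the MySQL server, the helper `write_report` and the file name are hypothetical, and it replaces the per-chromosome regex loop with a single GROUP BY on the third character of the locus code (the same query should also work in MySQL, but I haven't run it there).

```python
import os
import sqlite3
import tempfile

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE locusgene (locus TEXT PRIMARY KEY, gene TEXT, protein_length INTEGER)")
# Invented placeholder rows
conn.executemany("INSERT INTO locusgene VALUES (?, ?, ?)",
                 [("AT1G00001", "geneA", 100), ("AT1G00002", "geneB", 300),
                  ("AT2G00001", "geneC", 50)])

def write_report(handle, title, cursor):
    """Write a title, a tab-separated header taken from the cursor, and every row."""
    handle.write(title + "\n")
    handle.write("\t".join(column[0] for column in cursor.description) + "\n")
    for row in cursor:
        handle.write("\t".join(str(value) for value in row) + "\n")
    handle.write("\n")

# The chromosome number is the third character of the locus code (AT1G..., AT2G..., ...)
per_chromosome = """SELECT SUBSTR(locus, 3, 1) AS chromosome,
                           COUNT(*) AS entries,
                           AVG(protein_length) AS average_length
                    FROM locusgene GROUP BY chromosome ORDER BY chromosome"""

report_path = os.path.join(tempfile.gettempdir(), "exam2_report_sketch.tsv")
with open(report_path, "w") as report:
    # Reusing the same function for every report follows the "Don't Repeat Yourself" rule
    write_report(report, "All entries", conn.execute("SELECT * FROM locusgene ORDER BY locus"))
    write_report(report, "Entries and average protein length per chromosome",
                 conn.execute(per_chromosome))

with open(report_path) as report:
    content = report.read()
print(content)
```

Because every report goes through the same function, adding another report is just one more call with a different query.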
    

In [9]:
chromosome_regexs = ["AT" + str(num + 1) + "G[0-9]{5}" for num in range(5)]

In [10]:
print(chromosome_regexs)

['AT1G[0-9]{5}', 'AT2G[0-9]{5}', 'AT3G[0-9]{5}', 'AT4G[0-9]{5}', 'AT5G[0-9]{5}']
