# Exercise 3: Introduction to Dataframes

Author: Laura Gutierrez Funderburk

Created on: April 21 2018

Last modified on: April 21 2018
 
### Abstract

In this notebook I will introduce dataframes and run through a few examples. Please note this is only an introductory exercise. Participants are welcome to learn more about the topic. 

Here are a few resources:

<a href="https://www.datacamp.com/community/tutorials/pandas-tutorial-dataframe-python" target='_Blank'>DataCamp.com</a>

<a href="https://pandas.pydata.org/pandas-docs/stable/dsintro.html" target="_blank">PyData.org</a>

My goal in this exercise is to provide the workshop participants with a taste of what dataframes can do and to provide an opportunity to explore and learn outside this workshop. 

### About Python dataframes

Dataframes are provided by the pandas library. Python dataframes are two-dimensional labelled data structures whose columns may each hold a different data type. Both columns and rows are indexed and can be labelled. 

This builds on what we have been learning so far about list comprehensions and dictionaries, as dataframes can be thought of as dictionary-based <a href="https://docs.scipy.org/doc/numpy/reference/generated/numpy.array.html" target="_blank">numpy arrays</a> $^{[1]}$: each column label acts like a dictionary key, and the column itself behaves like an array.

$^{[1]}$ A numpy array stores elements that all share a single data type (its dtype), for example float, int or str. Mixed Python objects can only be stored together under the generic object dtype.
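
To make the dictionary analogy concrete, here is a minimal sketch. The column names "Trips" and "Hours" and all of the values are made up for illustration; only the traveller names come from this workshop.

```python
import pandas as pd

# Illustrative values only: "Trips" and "Hours" are hypothetical columns
example = {
    "Traveller": ["James", "Sonia", "Vero"],  # strings
    "Trips": [1, 3, 2],                       # integers
    "Hours": [10.5, 42.0, 17.25],             # floats
}

# Each dictionary key becomes a column label
example_df = pd.DataFrame(example)

# Each column keeps its own data type
print(example_df.dtypes)

# A column is retrieved by its label, much like a dictionary value
trips = example_df["Trips"]
print(trips.sum())

# Rows are indexed too, by default with integers starting at 0
print(example_df.loc[1, "Traveller"])
```

Note that the three columns have three different data types, while each individual column holds only one.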

### Warm up Examples

We will start by taking small arrays and defining dataframes from them. We will then use the pandas function read_csv to read from our data.csv file. Finally, with the help of a few functions I have defined, I will showcase how powerful dataframes can be in conjunction with dictionaries and list comprehensions when dealing with output that looks nothing like a table. 

In [None]:
# We begin by importing pandas library
import pandas as pd

In [None]:
# Let us take our celestials example along with the individuals we sent out into space
celestials = ['Moon','Sun','Neptune','Mars','Jupiter','Venus']
# Define space_travellers where each entry is a person
space_travellers = ["James","Sonia","Vero","Tom","Lily","Manny"]

In [None]:
# Define a dataframe using dictionary notation: keys are column labels, values are the column data
space_df = pd.DataFrame({"Celestial Bodies": celestials, "Space Travellers": space_travellers})
print(space_df)

We can also use the pandas library to read CSV files and extract the information we need. For instance, once the data is loaded and printed below, we can inspect it without opening the original file.

We can also isolate specific columns within our file. 

In [None]:
data = pd.read_csv("./DATA/data.csv")
print(data)

In [None]:
# Print specific columns. 
print(data.Cluster_A)
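
There is more than one way to isolate columns. The sketch below uses a small in-memory stand-in for data.csv, since the contents of the real file are not shown here: the column Cluster_B and all of the values are assumptions made for illustration.

```python
import io
import pandas as pd

# Stand-in for ./DATA/data.csv; columns and values are hypothetical
csv_text = "Cluster_A,Cluster_B\n1,4\n2,5\n3,6\n"
demo = pd.read_csv(io.StringIO(csv_text))

# Attribute access works when the column name is a valid Python identifier
col_attr = demo.Cluster_A

# Bracket access works for any column name, including names with spaces
col_bracket = demo["Cluster_A"]

# A list of names selects several columns and returns a dataframe
subset = demo[["Cluster_A", "Cluster_B"]]

print(col_bracket.tolist())
print(subset.shape)
```

Bracket access is the safer habit, because it also works for labels such as "Celestial Bodies" that contain a space.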

### Dataframes and plotting our data

In the analyze.py script I have predefined a number of functions that deal with our output-*.out files. As participants will notice, these are messy files, but with the help of list comprehensions we can transform them into arrays that we can later manipulate using dataframes. 

Below I will define a few functions and showcase how the use of dataframes made the interpretation of output results much easier. 
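
Before running the real functions, here is a rough sketch of the general idea: a list comprehension turns messy text lines into rows, and the rows become a dataframe. The line format below is invented for illustration and is not the actual layout of the output files, and this is not the code used in analyze.py.

```python
import pandas as pd

# Hypothetical raw lines; the real output files may look quite different
raw_lines = [
    "family=PF001 ; nseq=12 ; q=0.81\n",
    "family=PF002 ; nseq=7 ; q=0.64\n",
    "# a comment line we want to skip\n",
]

# List comprehension: drop comment lines, split each line into fields,
# and keep only the text after the "=" sign in each field
rows = [
    [field.split("=")[1].strip() for field in line.split(";")]
    for line in raw_lines
    if not line.startswith("#")
]

# Attach column names when building the dataframe
messy_df = pd.DataFrame(rows, columns=["Family", "Number_Seq", "Q_Score"])
print(messy_df)
```

At this stage every value is still a string, which is why a separate cleaning step is needed before doing any arithmetic or plotting.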

In [None]:
# Run analyze.py script
%run -i analyze.py

In [None]:
# Indicate where to find all files of the form output-*.out
output = "./DATA/OUTPUT_FOR_PLOTTING/"
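# Use the glob library to store the matching file names in a list
# (imported here in case analyze.py has not already brought it into the namespace)
import glob
all_the_files = glob.glob(output + "output-*.out")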
print(all_the_files[0:2])
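
If pattern matching with glob is new to you, the following self-contained sketch may help. It creates a temporary folder with made-up file names, so it does not depend on the workshop data.

```python
import glob
import os
import tempfile

# Create a temporary folder with a few empty files (names are made up)
with tempfile.TemporaryDirectory() as folder:
    for name in ["output-1.out", "output-2.out", "notes.txt"]:
        open(os.path.join(folder, name), "w").close()

    # "*" matches any run of characters, so only output-*.out files are kept
    matches = sorted(glob.glob(os.path.join(folder, "output-*.out")))
    match_names = [os.path.basename(path) for path in matches]

print(match_names)
```

glob returns matches in arbitrary order, so the sketch sorts them to get a predictable result.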

In [None]:
# We define a function that takes a list of file names, as above, and outputs a pair of "dirty" dataframes

def data_to_dataframe(file_array):
    
    """This function will turn all data stored in tables for F1,F2 into dataframes"""
    
    # Store indices of the files that are free from errors
    file_index = select_index(file_array)
    
    # Turn data in F1 into a dataframe and attach column names
    F1_df = pd.DataFrame([extract_family_length_scores(file_array[i], 1) for i in file_index],
                         columns=["Family", "Number_Seq", "Q_Score", "TC_Score", "Cline_Score"])
    
    # Turn data in F2 into a dataframe and attach column names
    F2_df = pd.DataFrame([extract_family_length_scores(file_array[i], 2) for i in file_index],
                         columns=["Family", "Number_Seq", "Q_Score", "TC_Score", "Cline_Score"])
    
    # Return both dataframes as a 2-tuple
    return (F1_df, F2_df)

In [None]:
def clean_data_frame(data_frame_pair):
    
    """This function cleans our dataframes"""
    # Empty list that will hold the cleaned dataframes
    clean_family_dataframes = []
    
    # Loop through both family dataframes
    for i in range(2):
        # Note: this is the same object as the input, so the input is modified in place
        clean_Fi_df = data_frame_pair[i]
        # Remove the trailing '\n' from columns Cline_Score and Family
        clean_Fi_df['Cline_Score'] = clean_Fi_df['Cline_Score'].map(lambda x: x.rstrip('\n'))
        clean_Fi_df['Family'] = clean_Fi_df['Family'].map(lambda x: x.rstrip('\n'))
        # Turn Cline, Q and TC scores into floats (originally they are coded as strings)
        clean_Fi_df['Cline_Score'] = clean_Fi_df['Cline_Score'].apply(lambda x: float(x))
        clean_Fi_df['Q_Score'] = clean_Fi_df['Q_Score'].apply(lambda x: float(x))
        clean_Fi_df['TC_Score'] = clean_Fi_df['TC_Score'].apply(lambda x: float(x))
        # Turn the number of sequences into an integer
        clean_Fi_df['Number_Seq'] = clean_Fi_df['Number_Seq'].apply(lambda x: int(x))
        # Store clean_Fi_df in the list
        clean_family_dataframes.append(clean_Fi_df)
        
    # Return a list with the clean versions of F1, F2
    return clean_family_dataframes
    return clean_family_dataframes
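
The function above cleans each value with map and apply. As an optional alternative, pandas also offers column-wide operations that do the same job. The sketch below is one possible way to write the cleaning step, using made-up values that mimic strings with trailing newlines. It is not a replacement for the workshop code.

```python
import pandas as pd

# Made-up "dirty" values: everything is a string, some end with a newline
dirty = pd.DataFrame({
    "Family": ["PF001\n", "PF002\n"],
    "Number_Seq": ["12", "7"],
    "Q_Score": ["0.81", "0.64"],
    "TC_Score": ["0.50", "0.25"],
    "Cline_Score": ["0.90\n", "0.75\n"],
})

# Work on a copy so the original dataframe is left untouched
clean = dirty.copy()

# Strip trailing newlines with the column-wide string accessor
for column in ["Family", "Cline_Score"]:
    clean[column] = clean[column].str.rstrip("\n")

# Convert several columns at once by passing a dict of target types
clean = clean.astype({
    "Number_Seq": int,
    "Q_Score": float,
    "TC_Score": float,
    "Cline_Score": float,
})

print(clean.dtypes)
```

Copying first is a design choice: it keeps the dirty dataframe available for comparison, whereas the function above modifies its input in place.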


In [None]:
# Apply data_to_dataframe on all output files
data_pair = data_to_dataframe(all_the_files)

# Cleaning data files
clean_data_pair = clean_data_frame(data_pair)

In [None]:
# Dataframes for both clusters
F1_Data_Frame = clean_data_pair[0]
F2_Data_Frame = clean_data_pair[1]

In [None]:
print(F1_Data_Frame)

In [None]:
plot_frequency(F2_Data_Frame,"F2")

In [None]:
plot_number_seq_vs_scores(F2_Data_Frame,"F2")

### Your turn

Open the analyze.py script along with the output-*.out files and discuss with a peer how the use of list comprehensions and dataframes made plotting results possible. 

What other kinds of information can you extract from the F1_Data_Frame and the F2_Data_Frame?
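
As a starting point, here are a few standard pandas operations you could try. The sketch uses a small made-up dataframe with the same column names as F1_Data_Frame, so the numbers are illustrative and are not real results.

```python
import pandas as pd

# Made-up values standing in for F1_Data_Frame
scores = pd.DataFrame({
    "Family": ["PF001", "PF002", "PF003", "PF004"],
    "Number_Seq": [12, 7, 12, 30],
    "Q_Score": [0.8, 0.6, 0.9, 0.4],
    "TC_Score": [0.5, 0.25, 0.75, 0.0],
    "Cline_Score": [0.9, 0.75, 0.95, 0.5],
})

# Summary statistics (count, mean, min, max, ...) for the score columns
summary = scores[["Q_Score", "TC_Score", "Cline_Score"]].describe()
print(summary)

# Keep only the rows whose Q score is above a chosen threshold
high_q = scores[scores["Q_Score"] > 0.7]
print(high_q["Family"].tolist())

# Average Q score for each number of sequences
mean_q_by_size = scores.groupby("Number_Seq")["Q_Score"].mean()
print(mean_q_by_size)

# Family with the highest TC score
best_tc = scores.loc[scores["TC_Score"].idxmax(), "Family"]
print(best_tc)
```

Try replacing scores with F1_Data_Frame or F2_Data_Frame and compare what you find for the two clusters.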