# Exercise 0:  Python Use Review

Author: Laura Gutierrez Funderburk

Created on: April 18 2018

Last modified on: April 20 2018

### Abstract

This notebook is divided into three main areas:

1) Importing libraries

2) Refreshing your memory of for and if statements in Python 3.6

3) Reading files from a given directory and storing the content into a table



### Part I: Importing libraries

This section is straightforward: we will simply use the <span style="color:green">**import**</span> statement.

In [None]:
# We will import the glob package as we want a way to access files within our directory. 
# The package glob does a great job at it. 
import glob

### Part II: For and if statements

Although I assume knowledge of Python for, while and if statements, I will provide a few examples that we will use as we build up our tool kit in the next exercises. 

Suppose we want to create an array (a Python list) and add elements to it iteratively. We can do so as follows.

In [None]:
# Take an existing array
celestials = ['Moon','Sun','Neptune','Mars','Jupiter','Venus']

# Create empty array
planets = []

# Imagine we want to select only those elements in this array that are planets
# By inspection we identify only Moon and Sun do not belong to this category
# We will iterate over each item in the for loop
for item in celestials:
    # If the item is either Moon or Sun, we skip
    if item == "Moon" or item == "Sun":
        continue
    # Otherwise, we will add the item to our empty array
    else:
        planets.append(item)

# Print original array
print("Our celestial array contains " + str(celestials))
print("\n")
# Print array with planets 
print("The planets in our celestial array are " + str(planets))

### Part III: Reading files & storing the content into a table

We will now use a very basic for loop to read and store contents of a file inside a table. 

----> STOP AND THINK: WHY DO WE WANT TO DO THIS?

Opening and reading files is time consuming. If we open and read the file every time we need an entry, the computational time will compound unnecessarily. By storing the content of our file in a table, we ensure that we read the file only once and from there work with the contents as needed.
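The idea of reading once and then working from memory can be sketched as follows. This is only an illustration: the file name `sample.csv`, its contents, and the variable names are made up for the example and are not part of the exercise data.

```python
import os
import tempfile

# Hypothetical file contents, invented purely for illustration.
sample_text = "Cluster_A,Cluster_B\nalpha:1,beta:2\ngamma:3,delta:4\n"

lines_in_memory = []

with tempfile.TemporaryDirectory() as tmp_dir:
    # Write the made-up contents to a temporary file so the example is self-contained.
    sample_path = os.path.join(tmp_dir, "sample.csv")
    with open(sample_path, "w") as f:
        f.write(sample_text)

    # Read the file exactly once, keeping every line in a list.
    with open(sample_path, "r") as f:
        for line in f:
            lines_in_memory.append(line)

# From here on we work with the list only; the file is never reopened.
header = lines_in_memory[0]
last_row = lines_in_memory[-1]
print(len(lines_in_memory), "lines held in memory")
```

Every later lookup, such as `lines_in_memory[0]`, is a plain list access and involves no further file reads.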

In [None]:
# Access all the files in the directory /DATA
data_directory = "./DATA/"

# Use the glob package to store file names in an array
data_files = glob.glob(data_directory + "*.csv")

# Access data.csv (this assumes it is the only .csv file in ./DATA/, since glob returns files in arbitrary order)
data = data_files[0]

In [None]:
# Create an empty array where we will store the content of our file. 
all_the_data = []

# We open the file whose path is stored in the variable data using with open(); 'r' specifies that
# we are reading the file.
with open(data,'r') as f:
    # We use a for loop to iteratively append each line of the file to our table
    for line in f:
        all_the_data.append(line)

# The with statement closes the file automatically once the block ends, so no explicit f.close() is needed.

Let us print the first 5 elements in our table all_the_data.

In [None]:
print(all_the_data[0:5])

We can use the .split() method to dissect and work with this data. For instance, we notice that the columns are separated by commas ',' and that, within each column, the pieces of information are separated by colons ':'. We can then use the .split() method to create an array for each entry so that we can manipulate the content as we need.
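Before applying this to the real data, here is a minimal sketch of the two-level split on a made-up string. The row below is hypothetical and only mimics the comma/colon layout described above; it is not taken from data.csv.

```python
# Hypothetical row, invented for illustration only.
sample_row = "ID001:geneA:speciesX,0.5,0.7,ID002:geneB:speciesY\n"

# Lines read from a file end with a newline character; .strip() removes it.
columns = sample_row.strip().split(",")

# Split the first column further, this time on colons.
first_fields = columns[0].split(":")

print(columns)
print(first_fields)
print(first_fields[0])
```

Note that without the call to `.strip()`, the last column would keep the trailing newline character.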

Let us take the second row and split it into its corresponding columns. 

In [None]:
second_row = all_the_data[1].split(",")
print(second_row)

Say we are interested in extracting the names MZ22523024 and MZ22514750 from the first and fourth entries of our array second_row. We can apply the .split() method once more as follows:

In [None]:
specific_info_one = second_row[0].split(":")
specific_info_two = second_row[3].split(":")
print(specific_info_one[0], specific_info_two[0])

### Your turn:

#### Exercise 0.1 
Complete the following for loop whose purpose is to apply the split method to get an array whose elements are arrays of size 2, each containing the rows of our data.csv file. 
 
Print the first 5 elements in the array columns_in_data discarding the Cluster_A and Cluster_B columns. 

In [None]:
# Exercise
# We use the len() function to get length of array all_the_data
size_of_all_the_data = len(_ _ _)

# We define an empty array
columns_in_data = []

# We run from 1 to size_of_all_the_data (recall we do not want the entries Cluster_A, Cluster_B)
for i in range(1,_ _ _):
    
    # Append the entries in our array columns_in_data
    columns_in_data.append(all_the_data[i].split("_ _ _"))
    
# Print the first 5 entries in columns_in_data
print(_ _ _[0:5])

#### Exercise 0.2

Suppose we are interested in extracting very specific information from each row (discarding the Cluster_A, Cluster_B row).

For example, we know that the two elements in the first row are:

'MZ22523024:ACUA002041:Anopheles_culicifacies.KI425380:7891-8301:+', 'MZ22514750:AGAP012534:Anopheles_gambiae.UNKN:11990889-11991197:

and suppose we are only interested in getting, for each row, the pair [MZ22523024,MZ22514750].

By applying the .split() method, we can dissect and extract the data we want.

Follow the exercise below and complete where necessary. 

In [None]:
# Exercise
# We use the len() function to get length of array columns_in_data
size_of_columns_in_data = len(_ _ _)

# Define an empty array
tabulated_names_cols = []

# Run from 0 to size_of_columns_in_data
for i in range(_ _ _):
    
    # For each i, split into subarrays separated by : and store only the first entry
    tabulated_names_cols.append([columns_in_data[_ _ _][0].split(":")[0],columns_in_data[i][3].split(":")[0]])
    
# Print the first 5 entries in tabulated_names_cols
print(tabulated_names_cols[_ _ _])

### Review

In this exercise, we imported the glob library, used it to access file names within our folder, and read and stored file content into a table, which we then manipulated using for loops.

The reader will notice an unnecessary number of steps.

In the next section, we will make this process more concise by using list comprehensions.
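As a small preview, and only as a sketch, the planet-filtering loop from Part II could be written in a single line with a list comprehension. The variable names are reused from Part II; the rewrite itself is an illustration and not part of the exercise solution.

```python
# Same starting array as in Part II
celestials = ['Moon', 'Sun', 'Neptune', 'Mars', 'Jupiter', 'Venus']

# Keep every item that is neither Moon nor Sun
planets = [item for item in celestials if item not in ("Moon", "Sun")]

print(planets)
```

The comprehension replaces the empty-array definition, the for loop, and the if/else block with one expression.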

In [None]:
# We will import the glob package as we want a way to access files within our directory. 
# The package glob does a great job at it. 
import glob

# Access data.csv in the directory /DATA
data_directory = "./DATA/"
data_files = glob.glob(data_directory + "*.csv")
data = data_files[0]

# Create an empty array where we will store the content of our file. 
all_the_data = []
with open(data,'r') as f:
    # We use a for loop to iteratively append the content of the file into a table
    for line in f:
        all_the_data.append(line)
# No f.close() is needed: the with statement has already closed the file.

# We use the len() function to get length of array all_the_data
size_of_all_the_data = len(all_the_data)

# We run from 1 to size_of_all_the_data (recall we do not want the entries Cluster_A, Cluster_B)
# and dissect each row into an array of entries separated by commas
columns_in_data = []
for i in range(1,size_of_all_the_data):
    columns_in_data.append(all_the_data[i].split(","))

print("First five rows are: \n")
print(columns_in_data[0:5])

print("\n")

# Define an empty array that stores, for each row, a pair containing the first and fourth entry. 
tabulated_names_cols = []
size_of_columns_in_data = len(columns_in_data)
for i in range(size_of_columns_in_data):
    tabulated_names_cols.append([columns_in_data[i][0].split(":")[0],columns_in_data[i][3].split(":")[0]])

print("First five pairs containing first and fourth entry are: \n")    
print(tabulated_names_cols[0:5])