# Week 9 Problem 1

If you are not using the `Assignments` tab on the course JupyterHub server to read this notebook, read [Activating the assignments tab](https://github.com/UI-DataScience/info490-fa16/blob/master/Week2/assignments/README.md).

A few things you should keep in mind when working on assignments:

1. Make sure you fill in any place that says `YOUR CODE HERE`. Do **not** write your answer anywhere other than where it says `YOUR CODE HERE`. Anything you write elsewhere will be removed or overwritten by the autograder.

2. Before you submit your assignment, make sure everything runs as expected. In the menubar, select _Kernel_, then restart the kernel and run all cells (_Restart & Run All_).

3. Do not change the title (i.e. file name) of this notebook.

4. Make sure that you save your work (in the menubar, select _File_ → _Save and Checkpoint_).

5. You are allowed to submit an assignment multiple times, but only the most recent submission will be graded.

In [1]:
from nose.tools import assert_equal
import csv

# Problem 1.

For this problem we will be revisiting [Week 5 Problem 2.1](https://github.com/UI-DataScience/info490-fa16/blob/master/Week5/assignments/Problem_2.ipynb) and [Week 7 Problem 3.1](https://github.com/UI-DataScience/info490-fa16/blob/master/Week7/assignments/Problem_3.ipynb), but this time you will be using the [csv module](https://docs.python.org/3/library/csv.html). Write a function to retrieve the columns of a delimited text file by their index. For example, `get_cols_by_index('foo.txt', cols=[0])` should return the first column of `foo.txt`, and `get_cols_by_index('foo.txt', cols=[0, 1, 4])` should return the first, second, and fifth columns. **`cols` should always be a list, even if it only has one element.** Also be sure to keep **all** rows of the CSV file (do not skip the header row).

Your function should return the result as a *list of tuples*, where each list item is a row of the CSV file and each tuple holds the data from the selected columns. Note that a one-element tuple prints with a trailing comma, which can look like a blank second entry, but its length is still 1:

```
>>> singleton_tuple = tuple([1])
>>> print('singleton tuple:', singleton_tuple)
singleton tuple: (1,)

>>> print('length of tuple:', len(singleton_tuple))
length of tuple: 1

```
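As a rough illustration of how `csv.reader` hands back each row as a list of strings, the sketch below parses a small in-memory sample instead of a file on disk. The sample text and the name `sample` are made up for this example and are not part of the assignment data; `io.StringIO` simply stands in for an open file.

```python
import csv
import io

# Hypothetical two-row sample; io.StringIO stands in for an open file.
sample = io.StringIO('"iata","airport","city"\n"AAA","Example Field","Springfield"\n')

rows = []
for row in csv.reader(sample, delimiter=','):
    # Each row is a list of strings with the surrounding quotes removed.
    # Pick the first and third fields and pack them into a tuple.
    rows.append(tuple(row[i] for i in [0, 2]))

print(rows)
```

Each item of `rows` is a tuple whose length matches the number of requested columns, which is the shape the function in this problem is expected to return.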

In [2]:
def get_cols_by_index(fname, cols, delim=','):
    
    '''
    Gets columns of a delimited text file by index or column number
    
    Params
    -------
    fname: the file name (and path)
    cols : a list of indices for the columns (a list of column numbers)
    delim: the delimiter character to pass to csv.reader
    
    Returns
    -------
    A list of tuples where each element corresponds to a row of data
    from the text file. The tuples should have the same length as cols,
    and each element of the tuple should be a field from the text file.
    
    '''
    
    # YOUR CODE HERE
    # Create an empty list to store the results
    colslist = []
    # newline='' lets the csv module handle line endings itself
    with open(fname, 'r', newline='') as csvfile:
        for row in csv.reader(csvfile, delimiter=delim):
            # Select the requested columns and store them as a tuple
            colslist.append(tuple(row[i] for i in cols))
    return colslist
        


In [3]:
# did you return the correct datatype?
random_answer = get_cols_by_index('/home/data_scientist/data/airports.csv', cols=[1,2])
assert_equal(type(random_answer[0]), tuple)
[assert_equal(type(x), tuple) for x in random_answer]

# is the header still there?
all_cols = ['iata', 'airport', 'city', 'state', 'country', 'lat', 'long']
assert_equal(tuple(all_cols),
            get_cols_by_index('/home/data_scientist/data/airports.csv', delim=',', cols=list(range(7)))[0])

# test with comma and 7 cols
fifth_row = !sed -n 6p '/home/data_scientist/data/airports.csv'
fifth_row = tuple(x.strip('"') for x in fifth_row[0].split(','))
fifth_row_test = get_cols_by_index('/home/data_scientist/data/airports.csv', delim=',', cols=list(range(7)))[5]
assert_equal(fifth_row, fifth_row_test)

# test with comma and 1 col
eighth_row_one_col = !sed -n 9p '/home/data_scientist/data/airports.csv' | cut -d ',' -f2
eighth_row_one_col = tuple(x.strip('"') for x in eighth_row_one_col[0].split(','))
eighth_row_one_col_test = get_cols_by_index('/home/data_scientist/data/airports.csv', delim=',', cols=[1])[8]
assert_equal(eighth_row_one_col, eighth_row_one_col_test)

# test with bar and 2 cols
third_row_two_cols_bar = !sed -n 4p '/home/data_scientist/data/data.csv' | cut -d '|' -f2,4
third_row_two_cols_bar = tuple(x.strip('"') for x in third_row_two_cols_bar[0].split('|'))
third_row_two_cols_bar_test = get_cols_by_index('/home/data_scientist/data/data.csv', delim='|', cols=[1,3])[3]
assert_equal(third_row_two_cols_bar, third_row_two_cols_bar_test)

# Problem 2.

Now you need to get the columns by *name*. We assume that the column names are in the first row of the CSV file, and you do not need to handle files where they are not. Hint: first [find out the indices](https://www.tutorialspoint.com/python/list_index.htm) of the entries of `cols` within the first row. Then either call your function from Problem 1 or reuse that code. As with Problem 1, return **all** rows of the CSV file, including the header.
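The index lookup mentioned in the hint can be sketched with `list.index`. The `header` and `wanted` lists below are hypothetical stand-ins for the first row of a file and the `cols` argument.

```python
# Hypothetical header row, as csv.reader would return it.
header = ['iata', 'airport', 'city', 'state']
# Hypothetical column names to look up.
wanted = ['city', 'iata']

# list.index returns the position of the first matching element
# and raises ValueError if the name is not present.
indices = [header.index(name) for name in wanted]
print(indices)

try:
    header.index('zip')
except ValueError as err:
    print('missing column:', err)
```

The resulting indices keep the order of `wanted`, so the selected columns come back in the order they were requested rather than in file order.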

In [4]:
def get_cols_by_name(fname, cols, delim=','):
    
    '''
    Gets columns of a delimited text file by name. Assumes the first
    row of the text file contains the header or column names.
    
    Params
    -------
    fname: the file name (and path)
    cols : a list of names of the fields (columns) to be returned
    delim: the delimiter character to pass to csv.reader
    
    Returns
    -------
    A list of tuples where each element corresponds to a row of data
    from the text file. The tuples should have the same length as cols,
    and each element of the tuple should be a field from the text file.
    
    '''
    
    # YOUR CODE HERE
    colslist = []
    colsindex = None
    with open(fname, 'r', newline='') as csvfile:
        for row in csv.reader(csvfile, delimiter=delim):
            # The first row is the header: look up each name's index
            if colsindex is None:
                colsindex = [row.index(name) for name in cols]
            # The rest is the same as in Problem 1
            colslist.append(tuple(row[i] for i in colsindex))
    return colslist

In [5]:
# did you return the correct datatype?
random_answer = get_cols_by_name('/home/data_scientist/data/airports.csv', cols=['iata', 'city'])
assert_equal(type(random_answer[0]), tuple)
[assert_equal(type(x), tuple) for x in random_answer]

# is the header still there?
all_cols = ['iata', 'airport', 'city', 'state', 'country', 'lat', 'long']
assert_equal(tuple(all_cols),
            get_cols_by_name('/home/data_scientist/data/airports.csv', delim=',', cols=all_cols)[0])


# test with comma and 1 col
eighth_row_one_col = !sed -n 9p '/home/data_scientist/data/airports.csv' | cut -d ',' -f6
eighth_row_one_col = tuple(x.strip('"') for x in eighth_row_one_col[0].split(','))
eighth_row_one_col_test = get_cols_by_name('/home/data_scientist/data/airports.csv', delim=',', cols=['lat'])[8]
assert_equal(eighth_row_one_col, eighth_row_one_col_test)

# test with bar and 2 cols
third_row_two_cols_bar = !sed -n 4p '/home/data_scientist/data/data.csv' | cut -d '|' -f2,4
third_row_two_cols_bar = tuple(x.strip('"') for x in third_row_two_cols_bar[0].split('|'))
third_row_two_cols_bar_test = get_cols_by_name('/home/data_scientist/data/data.csv', delim='|', cols=['airport', 'state'])[3]
assert_equal(third_row_two_cols_bar, third_row_two_cols_bar_test)

# Problem 3.

The main purpose of this problem is to give you experience creating a function that human beings might use. You now have the tools to read in CSV files using Python by specifying either the names or the indices of the columns you wish to keep.

Now, write a generic function to read in CSV files. You can call your previous functions or write it as a self-contained function without dependencies. Your function should return the data in the same format as the previous functions and should now have the following features:
* If `cols` is a list of integers, assume they are the indices of the columns (see [Stack Overflow](http://stackoverflow.com/questions/13252333/python-check-if-all-elements-of-a-list-are-the-same-type) for help determining type).
* If `cols` is a list of strings, assume they are the names of the columns (see [Stack Overflow](http://stackoverflow.com/questions/13252333/python-check-if-all-elements-of-a-list-are-the-same-type) for help determining type).
* If `cols` is neither a list of strings nor a list of integers, return an **empty list**.

Your function should work for generic delimited text files, which means supporting multiple delimiters (via the `delim` argument). I won't test the case where `cols` is a list of strings but the file has no header row, but note that this is a case your function would have to deal with in a real-world application.
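One possible way to tell the three cases apart is to combine `all` with `isinstance`. The helper `classify_cols` below is hypothetical and only illustrates the check; it is not required by the assignment.

```python
def classify_cols(cols):
    """Hypothetical helper: report how a cols argument would be interpreted."""
    # Anything that is not a list is treated as invalid input.
    if not isinstance(cols, list):
        return 'invalid'
    # Every element is an integer: treat them as column indices.
    if all(isinstance(c, int) for c in cols):
        return 'indices'
    # Every element is a string: treat them as column names.
    if all(isinstance(c, str) for c in cols):
        return 'names'
    # Mixed or unsupported element types.
    return 'invalid'

print(classify_cols([0, 1, 4]))
print(classify_cols(['iata', 'city']))
print(classify_cols([1, 'a']))
```

Two edge cases are worth keeping in mind with this approach: `all` returns `True` for an empty list, so `[]` is classified as indices here, and `bool` is a subclass of `int`, so `[True, False]` would also pass the integer check.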

In [6]:
def my_read_csv(fname, cols, delim = ','):
    
    '''
    Gets columns of a delimited text file by name or index. Assumes 
    the first row of the text file contains the header or column names.
    
    Params
    -------
    fname: the file name (and path)
    cols : a list of names or indices of the fields (columns) to be 
           returned
    delim: the delimiter character to pass to csv.reader
    
    Returns
    -------
    A list of tuples where each element corresponds to a row of data
    from the text file. The tuples should have the same length as cols,
    and each element of the tuple should be a field from the text file.
    
    '''
    
    # YOUR CODE HERE
    colslist = []
    # Anything other than a list is invalid input
    if not isinstance(cols, list):
        return colslist
    by_index = all(isinstance(c, int) for c in cols)
    by_name = all(isinstance(c, str) for c in cols)
    # Neither all integers nor all strings: return an empty list
    if not (by_index or by_name):
        return colslist
    with open(fname, 'r', newline='') as csvfile:
        # Integer columns are used directly as indices
        colsindex = cols if by_index else None
        for row in csv.reader(csvfile, delimiter=delim):
            # For names, the first row is the header: look up the indices
            if colsindex is None:
                colsindex = [row.index(name) for name in cols]
            colslist.append(tuple(row[i] for i in colsindex))
    return colslist
    

In [7]:
# did you return the correct datatype?
random_answer = my_read_csv('/home/data_scientist/data/airports.csv', cols=[1,2])
assert_equal(type(random_answer[0]), tuple)
[assert_equal(type(x), tuple) for x in random_answer]

# is the header still there?
all_cols = ['iata', 'airport', 'city', 'state', 'country', 'lat', 'long']
assert_equal(tuple(all_cols),
            my_read_csv('/home/data_scientist/data/airports.csv', delim=',', cols=all_cols)[0])

# test with comma and 1 col by name
myans1 = !sed -n 21p '/home/data_scientist/data/airports.csv' | cut -d ',' -f6
myans1 = tuple(x.strip('"') for x in myans1[0].split(','))
mytest1 = my_read_csv('/home/data_scientist/data/airports.csv', delim=',', cols=['lat'])[20]
assert_equal(myans1, mytest1)

# test with bar and 2 cols by name
myans2 = !sed -n 11p '/home/data_scientist/data/data.csv' | cut -d '|' -f2,4
myans2 = tuple(x.strip('"') for x in myans2[0].split('|'))
mytest2 = my_read_csv('/home/data_scientist/data/data.csv', delim='|', cols=['airport', 'state'])[10]
assert_equal(myans2, mytest2)

# test with comma and 7 cols by index
myans3 = !sed -n 6p '/home/data_scientist/data/airports.csv'
myans3 = tuple(x.strip('"') for x in myans3[0].split(','))
mytest3 = my_read_csv('/home/data_scientist/data/airports.csv', delim=',', cols=list(range(7)))[5]
assert_equal(myans3, mytest3)

# test with comma and 1 col by index
myans4 = !sed -n 9p '/home/data_scientist/data/airports.csv' | cut -d ',' -f2
myans4 = tuple(x.strip('"') for x in myans4[0].split(','))
mytest4 = my_read_csv('/home/data_scientist/data/airports.csv', delim=',', cols=[1])[8]
assert_equal(myans4, mytest4)


# does the error handling work?
assert_equal([], my_read_csv('/home/data_scientist/data/airports.csv', delim=',', cols=[1, 'a']))