# Example Code from Workshop #1

In [55]:
import numpy as np
import pandas as pd

## Table of Contents:
- [I. Basic Python Exercises](#I.-Basic-Python-Exercises:)
- [II. Basic Numpy Exercises](#II.-Basic-Numpy-Exercises:)
- [III. Basic Pandas Exercises](#III.-Basic-Pandas-Exercises:)

## I. Basic Python Exercises:

### 1. Write and test a function that takes a list of numbers and returns the largest number (without using pre-defined functions). 

In [2]:
def find_max(a_list):
    #we'll use a variable to remember the biggest number we've seen so far as we go through the list
    #at the start, we set the biggest number we've seen so far to be the first element in the list
    max_so_far = a_list[0]
    #let's iterate through the list
    for value in a_list:
        #compare the new value to the biggest number we've seen so far
        #if the new value is bigger, remember the new value
        if value > max_so_far:
            max_so_far = value
    
    #when finished iterating through list return the largest number we've seen
    return max_so_far

In [3]:
#make up a list of numbers
my_list = range(10, 40, 2)
my_list

[10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32, 34, 36, 38]

In [6]:
#test our function find_max
print 'the max of your list is:', find_max(my_list)

the max of your list is: 38
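One case `find_max` does not cover is an empty list, where `a_list[0]` raises an `IndexError`. Below is a minimal sketch of one way to handle it, written in Python 3 syntax. The name `find_max_safe` and the choice to raise a `ValueError` are illustrative assumptions, not part of the workshop code.

```python
def find_max_safe(a_list):
    # Hypothetical variant of find_max: fail with a clear message on empty input.
    if len(a_list) == 0:
        raise ValueError('cannot take the max of an empty list')
    max_so_far = a_list[0]
    # Skip the first element, since it is already our starting value.
    for value in a_list[1:]:
        if value > max_so_far:
            max_so_far = value
    return max_so_far


print(find_max_safe([10, 12, 38, 14]))  # prints 38
```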


Here's a pre-defined function that will do it for you!

In [7]:
print 'the max of your list is:', max(my_list)

the max of your list is: 38


We can even do it with `numpy` arrays and `pandas` objects!

In [10]:
my_array = np.array(my_list)

print 'the max of your array is:', my_array.max()

the max of your array is: 38


In [11]:
my_series = pd.Series(my_array)

print 'the max of your series is:', my_series.max()

the max of your series is: 38


### 2. Write a function that takes a list and does each of the following things
- prints every other item in the list
- prints each element of the list in reverse order
- prints the last 5 elements in the list

In [13]:
def every_other(a_list):
    print a_list[::2] #from beginning to end counting by 2's
    
def reverse(a_list):
    print a_list[::-1] #from end to beginning, counting by -1's (i.e. backwards)
    
def last_five(a_list):
    print a_list[-5:] #from the fifth-to-last element to the end

In [14]:
every_other(my_list)

[10, 14, 18, 22, 26, 30, 34, 38]


In [15]:
reverse(my_list)

[38, 36, 34, 32, 30, 28, 26, 24, 22, 20, 18, 16, 14, 12, 10]


In [16]:
last_five(my_list)

[30, 32, 34, 36, 38]
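All three functions rely on the slice form `a_list[start:stop:step]`, where any of the three parts can be left out. Here is a small sketch in Python 3 syntax that stores the slices instead of printing them directly (the variable names are made up for illustration):

```python
numbers = list(range(10, 40, 2))  # same values as my_list above

every_other = numbers[::2]  # start at index 0, take every 2nd element
backwards = numbers[::-1]   # negative step: walk from the last element to the first
last_five = numbers[-5:]    # negative start: begin 5 elements before the end

print(every_other)  # prints [10, 14, 18, 22, 26, 30, 34, 38]
print(backwards[0])  # prints 38
print(last_five)    # prints [30, 32, 34, 36, 38]
```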


### 3. Write a function that takes a string and returns True if it is a palindrome and False otherwise

In [17]:
#Recall that a palindrome is a string that reads the same forwards and backwards
def palindrome(string):
    #the comparison already evaluates to True or False, so we can return it directly
    return string == string[::-1]

In [18]:
string_1 = 'hello'
string_2 = 'helleh'

palindrome(string_1)

False

In [19]:
palindrome(string_2)

True

### 4. Write a function that zips two lists together

In [22]:
def zip_lists(list_1, list_2):
    #first we should check that the two lists have the same lengths, otherwise we should throw 
    #an error (for now the error handling is that we return an error message)
    if len(list_1) == len(list_2):
        #the standard way to build a list is to start with an empty list, [], then 
        #iteratively append things to it, using .append()
        zipped = []
        for index in range(len(list_1)):
            zipped.append([list_1[index], list_2[index]])
        return zipped
    else:
        return "Lengths don't match!"

In [23]:
list_1 = [1, 2, 3]
list_2 = ['a', 'b', 'c']

zip_lists(list_1, list_2)

[[1, 'a'], [2, 'b'], [3, 'c']]
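The comments in `zip_lists` note that a length mismatch should really raise an error rather than return a message. Below is a sketch of that variant in Python 3 syntax. The name `zip_lists_strict` and the choice of `ValueError` are illustrative assumptions.

```python
def zip_lists_strict(list_1, list_2):
    # Raise instead of returning a message, so callers cannot mistake
    # the error string for a zipped result.
    if len(list_1) != len(list_2):
        raise ValueError("Lengths don't match!")
    zipped = []
    for index in range(len(list_1)):
        zipped.append([list_1[index], list_2[index]])
    return zipped


print(zip_lists_strict([1, 2, 3], ['a', 'b', 'c']))  # prints [[1, 'a'], [2, 'b'], [3, 'c']]
```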

An alternative way to build up a list from scratch is to use a list comprehension. A list comprehension is just syntactic sugar for a for loop (i.e. start with an empty list `[]`, then iteratively use `.append()`). The syntax looks almost exactly like set-builder notation in mathematics.

**Example:** If we want all the even numbers between 0 and 10 (inclusive) we'd write
$$
\{x : x\in \mathbb{N}, 0\leq x \leq 10\text{ and  }2|x\}
$$
In `python` we'd write
```
even_ints = [x for x in range(11) if x % 2 == 0]
```
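To make the equivalence concrete, here is the same list built both ways. This is a sketch in Python 3 syntax, with `even_ints_loop` as an illustrative name.

```python
# Loop version: start with an empty list and append matching values.
even_ints_loop = []
for x in range(11):
    if x % 2 == 0:
        even_ints_loop.append(x)

# Comprehension version: the same loop and condition in one expression.
even_ints = [x for x in range(11) if x % 2 == 0]

print(even_ints_loop == even_ints)  # prints True
print(even_ints)                    # prints [0, 2, 4, 6, 8, 10]
```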

In [24]:
#list comprehension definition of zipped list of two lists
zipped_list = [[list_1[index], list_2[index]] for index in range(len(list_1))]
zipped_list

[[1, 'a'], [2, 'b'], [3, 'c']]

But there is a `python` function that zips things for you!

In [25]:
zipped_list = zip(list_1, list_2)
zipped_list

[(1, 'a'), (2, 'b'), (3, 'c')]
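Two details of the built-in are worth knowing. The output above comes from Python 2. In Python 3, `zip` returns a lazy iterator, so it has to be wrapped in `list()` to display the pairs. Also, unlike `zip_lists`, the built-in does not complain about mismatched lengths: it silently stops at the shorter input. Here is a short sketch, assuming Python 3:

```python
list_1 = [1, 2, 3]
list_2 = ['a', 'b', 'c']

# In Python 3, zip() is lazy; list() materializes the pairs as tuples.
zipped_list = list(zip(list_1, list_2))
print(zipped_list)  # prints [(1, 'a'), (2, 'b'), (3, 'c')]

# With inputs of different lengths, zip() stops at the shorter one.
truncated = list(zip([1, 2, 3], ['a', 'b']))
print(truncated)  # prints [(1, 'a'), (2, 'b')]
```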

---

## II. Basic Numpy Exercises:

### 1. Load the Carbon Monoxide CSV data for Boston

In [54]:
#When running this on your computer, you'll need to change the path to this file!
data = np.loadtxt('../mass_aq_data/boston_year_to_date/boston_co.csv', delimiter=',')

ValueError: could not convert string to float: State Code

Oh no! The code didn't work "out of the box"!

Let's parse the error. The summary of the error from the interpreter is always the last line:
```
ValueError: could not convert string to float: State Code
```
This indicates that `loadtxt` tried to convert a value into a float and failed, because the value is a string that is not a number (here, the column header `State Code`).
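The same failure can be reproduced without the file. By default `loadtxt` tries to turn every field into a float, which is roughly what the built-in `float()` does below. This is a sketch in Python 3 syntax, where `can_be_float` is a hypothetical helper.

```python
def can_be_float(text):
    # Hypothetical helper: report whether float() accepts this string.
    try:
        float(text)
        return True
    except ValueError:
        return False


print(can_be_float('0.67'))        # prints True
print(can_be_float('State Code'))  # prints False: a header, not a number
print(can_be_float('"0.67"'))      # prints False: the quotes get in the way
```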

The way to address this is to look at the data and read the documentation for `loadtxt`. 

It looks like the data is entirely string formatted (even numerals have quotes around them). The documentation for `loadtxt` says that the function assumes by default that the data is floating point; to read the data in as strings, you must say so with the `dtype` parameter.

Let's try it again.

In [29]:
data = np.loadtxt('../mass_aq_data/boston_year_to_date/boston_co.csv', delimiter=',', dtype=str)

Great! Let's check out the shape of our data and see what kind of values are in it.

In [30]:
#print the shape of the data
data.shape

(334, 32)

In [31]:
#print the first row
data[0]

array(['\t\tState Code', 'County Code', 'Tribal Code', 'Site ID',
       'Support Agency Code', 'Location Address', 'City Code',
       'Postal Code', 'Local ID', 'Local Name', 'Urban Area Code',
       'AQCR Code', 'Land Use ID', 'Location Setting ID',
       'Site Established Date', 'Latitude', 'Longitude',
       'Horizontal Method Code', 'Horizontal Datum ID', 'Parameter Code',
       'Parameter Name', 'Substance Occurrence Code', 'Duration Code',
       'Method ID', 'Measure Unit Code', 'Sample Collection Start Date',
       'Sample Collection Start Time', 'Measure Value', 'Measure Unit',
       'Null Data Code', 'Qualifier Code', 'Data Validity Code'], 
      dtype='|S28')

Problem! It looks like the first row of the data is the column headers, but `numpy` is treating these headers like values! What we need to do is to read the data again, skipping the first row!

In [44]:
data = np.loadtxt('../mass_aq_data/boston_year_to_date/boston_co.csv', delimiter=',', dtype=str, skiprows=1)

In [45]:
#print the shape of the data
data.shape

(333, 32)

In [46]:
#print the first row
data[0]

array(['\t\tMA', '"\'025"', '""', '"\'250250042"', '"\'0660"',
       '"HARRISON AV"', '"7000"', '"\'02119"', '"Boston Roxbury"',
       '"BOSTON"', '"1120"', '"\'119"', '"COMMERCIAL"',
       '"URBAN AND CENTER CITY"', '"1998-12-15"', '"42.3294"',
       '"-71.082499999999996"', '"\'015"', '"WGS84"', '"42101"',
       '"Carbon Monoxide"', '"1"', '"1"', '"\'593 "', '"\'007"',
       '"2017-01-01 00:00:00"', '"00:00:00"', '".67"', '"ppm"', '""', '""',
       '""'], 
      dtype='|S28')

**Observation:** The values in the data need cleaning: there are a number of useless characters like tabs and quotation marks!
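As a sketch of what that cleaning could look like, Python's `str.strip` accepts a set of characters to remove from both ends of a string. The sample values are copied from the row above. `clean_value` and the chosen character set are assumptions for illustration (Python 3 syntax).

```python
def clean_value(raw):
    # Remove tabs, spaces, double quotes and apostrophes from both ends.
    return raw.strip('\t "\'')


raw_values = ['\t\tMA', '"\'025"', '"HARRISON AV"', '"\'593 "', '".67"', '""']
cleaned = [clean_value(value) for value in raw_values]
print(cleaned)  # prints ['MA', '025', 'HARRISON AV', '593', '.67', '']
```

Note that only the ends of each string are touched, so the space inside `HARRISON AV` survives.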

### 2. Filter the data for observations taken at the site 'Boston Roxbury'

We know that the column that contains the value 'Boston Roxbury' is labeled 'Local ID'. Unfortunately, a plain `numpy` array carries no column labels, so we have to refer to the column by its position instead of its name.

In [48]:
#The position of the column that stores the location name is 8
filtered_data = data[data[:, 8] == '"Boston Roxbury"']
filtered_data.shape

(147, 32)

In [49]:
filtered_data[0]

array(['\t\tMA', '"\'025"', '""', '"\'250250042"', '"\'0660"',
       '"HARRISON AV"', '"7000"', '"\'02119"', '"Boston Roxbury"',
       '"BOSTON"', '"1120"', '"\'119"', '"COMMERCIAL"',
       '"URBAN AND CENTER CITY"', '"1998-12-15"', '"42.3294"',
       '"-71.082499999999996"', '"\'015"', '"WGS84"', '"42101"',
       '"Carbon Monoxide"', '"1"', '"1"', '"\'593 "', '"\'007"',
       '"2017-01-01 00:00:00"', '"00:00:00"', '".67"', '"ppm"', '""', '""',
       '""'], 
      dtype='|S28')
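One way to avoid hard-coding the position `8` is to look the label up in the header row and reuse the resulting index in the boolean mask. The sketch below uses a tiny made-up array rather than the real file, and assumes `numpy` is installed (Python 3 syntax).

```python
import numpy as np

# Made-up stand-in for the real data: a header list and three data rows.
header = ['State Code', 'Local ID', 'Measure Value']
rows = np.array([
    ['MA', 'Boston Roxbury', '0.67'],
    ['MA', 'Some Other Site', '0.30'],
    ['MA', 'Boston Roxbury', '1.053'],
])

# Find the column position by label instead of counting by hand.
column = header.index('Local ID')

# Boolean mask: True for each row whose 'Local ID' entry matches.
mask = rows[:, column] == 'Boston Roxbury'
filtered = rows[mask]

print(column)          # prints 1
print(filtered.shape)  # prints (2, 3)
```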

---

## III. Basic Pandas Exercises:

### 1. Load the Carbon Monoxide CSV data for Boston

In [50]:
data_frame = pd.read_csv('../mass_aq_data/boston_year_to_date/boston_co.csv')
data_frame.shape

(333, 32)

In [51]:
data_frame.head()

| | State Code | County Code | Tribal Code | Site ID | Support Agency Code | Location Address | City Code | Postal Code | Local ID | Local Name | ... | Duration Code | Method ID | Measure Unit Code | Sample Collection Start Date | Sample Collection Start Time | Measure Value | Measure Unit | Null Data Code | Qualifier Code | Data Validity Code |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | \t\tMA | '025 | | '250250042 | '0660 | HARRISON AV | 7000 | '02119 | Boston Roxbury | BOSTON | ... | 1 | '593 | '007 | 2017-01-01 00:00:00 | 00:00:00 | 0.67 | ppm | | | |
| 1 | \t\t\t\tMA | '025 | | '250250042 | '0660 | HARRISON AV | 7000 | '02119 | Boston Roxbury | BOSTON | ... | 1 | '593 | '007 | 2017-01-02 00:00:00 | 00:00:00 | 1.053 | ppm | | | |
| 2 | \t\t\t\tMA | '025 | | '250250042 | '0660 | HARRISON AV | 7000 | '02119 | Boston Roxbury | BOSTON | ... | 1 | '593 | '007 | 2017-01-03 00:00:00 | 00:00:00 | 0.253 | ppm | | | |
| 3 | \t\t\t\tMA | '025 | | '250250042 | '0660 | HARRISON AV | 7000 | '02119 | Boston Roxbury | BOSTON | ... | 1 | '593 | '007 | 2017-01-04 00:00:00 | 00:00:00 | 0.304 | ppm | | | |
| 4 | \t\t\t\tMA | '025 | | '250250042 | '0660 | HARRISON AV | 7000 | '02119 | Boston Roxbury | BOSTON | ... | 1 | '593 | '007 | 2017-01-05 00:00:00 | 00:00:00 | 0.343 | ppm | | | |


**Observations:** Reading the data using `pandas` is much easier. We don't need to specify the types of the data because `pandas` will guess those types. `pandas` has also removed a number of those extraneous characters in the data for us (although a large number still remain). The nicest thing is that `pandas` correctly guessed that the first row in the data should be interpreted as column headers, and it has automatically made this distinction.
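The header detection and the type guessing are both easy to check on a small in-memory example. The sketch below feeds `read_csv` a made-up three-line CSV through `io.StringIO` (no file needed), assuming `pandas` is installed and using Python 3 syntax.

```python
import io

import pandas as pd

# Made-up CSV text: a header line followed by two data rows.
csv_text = 'Local ID,Measure Value\nBoston Roxbury,0.67\nBoston Roxbury,1.053\n'
sample = pd.read_csv(io.StringIO(csv_text))

# The first line became the column labels, not a data row.
print(list(sample.columns))  # prints ['Local ID', 'Measure Value']
print(sample.shape)          # prints (2, 2)

# 'Measure Value' was parsed as floating point without being told to.
print(sample['Measure Value'].dtype)  # prints float64
```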

### 2. Filter the data for observations taken at the site 'Boston Roxbury'

We know that the column that contains the value 'Boston Roxbury' is labeled 'Local ID'. With `pandas` we can directly refer to this column by its name!

In [52]:
filtered_dataframe = data_frame[data_frame['Local ID'] == 'Boston Roxbury']
filtered_dataframe.shape

(147, 32)

In [53]:
filtered_dataframe.head()

| | State Code | County Code | Tribal Code | Site ID | Support Agency Code | Location Address | City Code | Postal Code | Local ID | Local Name | ... | Duration Code | Method ID | Measure Unit Code | Sample Collection Start Date | Sample Collection Start Time | Measure Value | Measure Unit | Null Data Code | Qualifier Code | Data Validity Code |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | \t\tMA | '025 | | '250250042 | '0660 | HARRISON AV | 7000 | '02119 | Boston Roxbury | BOSTON | ... | 1 | '593 | '007 | 2017-01-01 00:00:00 | 00:00:00 | 0.67 | ppm | | | |
| 1 | \t\t\t\tMA | '025 | | '250250042 | '0660 | HARRISON AV | 7000 | '02119 | Boston Roxbury | BOSTON | ... | 1 | '593 | '007 | 2017-01-02 00:00:00 | 00:00:00 | 1.053 | ppm | | | |
| 2 | \t\t\t\tMA | '025 | | '250250042 | '0660 | HARRISON AV | 7000 | '02119 | Boston Roxbury | BOSTON | ... | 1 | '593 | '007 | 2017-01-03 00:00:00 | 00:00:00 | 0.253 | ppm | | | |
| 3 | \t\t\t\tMA | '025 | | '250250042 | '0660 | HARRISON AV | 7000 | '02119 | Boston Roxbury | BOSTON | ... | 1 | '593 | '007 | 2017-01-04 00:00:00 | 00:00:00 | 0.304 | ppm | | | |
| 4 | \t\t\t\tMA | '025 | | '250250042 | '0660 | HARRISON AV | 7000 | '02119 | Boston Roxbury | BOSTON | ... | 1 | '593 | '007 | 2017-01-05 00:00:00 | 00:00:00 | 0.343 | ppm | | | |
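The leftover tabs and apostrophes visible in the output can be removed column by column with the `.str.strip` accessor. The sketch below works on a small made-up frame that mimics two of the messy columns. It assumes `pandas` is installed, and the choice of characters to strip is an assumption for illustration (Python 3 syntax).

```python
import pandas as pd

# Made-up frame mimicking two of the messy columns shown above.
frame = pd.DataFrame({
    'State Code': ['\t\tMA', '\t\t\t\tMA'],
    'County Code': ["'025", "'025"],
})

# Strip tabs, spaces and apostrophes from both ends of each string column.
for name in ['State Code', 'County Code']:
    frame[name] = frame[name].str.strip("\t '")

print(frame['State Code'].tolist())   # prints ['MA', 'MA']
print(frame['County Code'].tolist())  # prints ['025', '025']
```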
