# Introduction to Data Science - Homework 1
*CS 5963 / MATH 3900, University of Utah, http://datasciencecourse.net/*

Due: Friday, September 2, 11:59pm.

This homework is designed to practice the skills we learned in Lab 1: working with loops, conditions, functions, and the built-in Python data structures. Go through the lab again if you run into trouble.

In this homework we'll do some calculations that are also available in various libraries. For the purpose of this homework, however, **stick to standard python functionality and the math library** and re-implement, e.g., the functionality for calculating the mean of a vector instead of just calling a mean function. 

However, we encourage you to check your results using, e.g., the [NumPy library](http://docs.scipy.org/doc/numpy-1.11.0/reference/routines.statistics.html) and include the checks as a separate code cell. 

## Your Data
Fill out the following information: 

**First Name:** Matthew
**Last Name:** Olson   
**E-mail:** matthew.olson@geog.utah.edu   
**UID:** u1037042  


## Part 1: Vector data

We will first work with a vector of yearly average temperatures from New Haven published [here](https://vincentarelbundock.github.io/Rdatasets/datasets.html). The data is included in this repository in the file `nhtemp.csv`.

The data is stored in the CSV format, which is a simple text file of comma-separated values.
To load the data into a (nested) Python array, we use the [csv](https://docs.python.org/3/library/csv.html) library. The following code reads the file and stores it in a vector:

In [2]:
# import the csv library
import csv
# import the math library we'll use later
import math

# initialize the array
temperature_vector = []

# open the file and append the values of the last column to the array
with open('nhtemp.csv') as csvfile:
    filereader = csv.reader(csvfile, delimiter=',', quotechar='|')
    # remove the first item as it is the title.
    next(filereader)
    for row in filereader:
        # here we append to the array and also cast from string to float
        temperature_vector.append(float(row[2]))
        
# print the vector to see if it worked
print (temperature_vector)

 [49.9, 52.3, 49.4, 51.1, 49.4, 47.9, 49.8, 50.9, 49.3, 51.9, 50.8, 49.6, 49.3, 50.6, 48.4, 50.7, 50.9, 50.6, 51.5, 52.8, 51.8, 51.1, 49.8, 50.2, 50.4, 51.6, 51.8, 50.9, 48.8, 51.7, 51.0, 50.6, 51.7, 51.5, 52.1, 51.3, 51.0, 54.0, 51.4, 52.7, 53.1, 54.6, 52.0, 52.0, 50.9, 52.6, 50.2, 52.6, 51.6, 51.9, 50.5, 50.9, 51.7, 51.4, 51.7, 50.8, 51.9, 51.8, 51.9, 53.0]


We'll use the `temperature_vector` to calculate a couple of standard statistical measures next.

### Task 1.1: Calculate the Mean of a Vector

Write a function that calculates and returns the [arithmetic mean](https://en.wikipedia.org/wiki/Arithmetic_mean) of a vector that you pass into it. 

Pass the temperature vector into this function and print the result. Provide a written interpretation of your results (e.g., "The mean temperature for New Haven for the years 1912 to 1971 is XXX degrees Fahrenheit.")

In [3]:
## your code goes here
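def mean(vector):
    # sum of all elements divided by their count; note that under Python 2
    # `/` truncates if every element is an int (floats are unaffected)
    a = sum(vector)/len(vector)
    return a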
# the call to your function (won't work before you implement it)
print "My Function: ", mean(temperature_vector)

# check work
import numpy as np
print "Error Checking: ", np.mean(temperature_vector)

My Function:  51.16
Error Checking:  51.16


**Your Interpretation:** The mean temperature for New Haven between years 1912 and 1971 is 51.16 degrees Fahrenheit.
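
As a cross-check that does not depend on how the interpreter divides integers, the same calculation can be sketched with an explicit loop and a float accumulator. The name `mean_explicit` and the short sample are illustrative assumptions, not part of the assignment; the sample holds the first five values of `temperature_vector` as printed above.

```python
def mean_explicit(vector):
    """Arithmetic mean: the sum of the elements divided by their count."""
    total = 0.0  # start from a float so the final division never truncates
    for value in vector:
        total += value
    return total / len(vector)

# first five values of temperature_vector, copied from the output above
sample = [49.9, 52.3, 49.4, 51.1, 49.4]
print(mean_explicit(sample))
print(mean_explicit([1, 2]))  # 1.5, even though both inputs are integers
```

Starting the accumulator at `0.0` is one simple way to keep the result a float for integer input as well.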

### Task 1.2: Calculate the Median of a Vector
Write a function that calculates and returns the [median](https://en.wikipedia.org/wiki/Median) of a vector. Pass the temperature vector into this function and print the result. Make sure that your function works for vectors with both an even and an odd number of elements. In case of an even number of elements, use the mean of the two middle values. Provide a written interpretation of your results.

Hint: the [`sorted()`](https://docs.python.org/3/library/functions.html#sorted) function might be helpful for this.

In [4]:
## your code goes here
def median(vector):
    vector = sorted(vector)
    n = len(vector)
    o = n - 1 # odd n: n//2 and o//2 are the same middle index; even n: they are the two middle indices
    med = (vector[n//2] + vector[o//2])/ 2.0
    return med

# the call to your function
print "My Function: ", median(temperature_vector)

# check work
import numpy as np
print "Error Checking: ", np.median(temperature_vector)

My Function:  51.2
Error Checking:  51.2


**Your Interpretation:** The median temperature for New Haven between years 1912 and 1971 is 51.2 degrees Fahrenheit, which is close to the mean.
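
The even and odd cases required by the task can be checked on two tiny lists. The following is a minimal sketch with an explicit branch for each case; `median_sketch` is a hypothetical name chosen so it does not shadow the function above.

```python
def median_sketch(vector):
    """Median: the middle value of the sorted data, or the mean of the two middle values."""
    ordered = sorted(vector)  # sorted() returns a new list, the input is left untouched
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # odd length: exactly one middle element
        return float(ordered[mid])
    # even length: average the two middle elements
    return (ordered[mid - 1] + ordered[mid]) / 2.0

print(median_sketch([3, 1, 2]))     # odd length: 2.0
print(median_sketch([4, 1, 3, 2]))  # even length: 2.5
```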


### Task 1.3: Calculate the Standard Deviation of a Vector

Write a function that calculates and returns the [standard deviation](https://en.wikipedia.org/wiki/Standard_deviation) of a vector. Pass the temperature vector into this function and print the result. Provide a written interpretation of your results.

The standard deviation is the square root of the average of the squared deviations from the mean, i.e.,

$$\sigma = \sqrt{\frac{1}{N} \sum_{i=1}^{N} {{(x_i - \mu)}^2} }$$

where $\mu$ is the mean of the vector. Hint: use your mean function to calculate it.

Hint: the `sqrt()` function from the [`math` library](https://docs.python.org/3/library/math.html) might be helpful for this. If you use a separate file you need to load the library as we did in Part 1 to read in the data. The import looks like this:

In [5]:
import math as m

## your code goes here
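def standard_deviation(vector):
    mu = mean(vector)  # compute the mean once instead of once per element
    s = []
    for i in vector:
        s.append((i - mu)**2)  # squared deviation from the mean
    variance = sum(s)/len(vector)
    return m.sqrt(variance)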

# the call to your function
print "My Function: ", standard_deviation(temperature_vector)

# check work
import numpy as np
print "Error Checking: ", np.std(temperature_vector)

My Function:  1.25501660016
Error Checking:  1.25501660016


**Your Interpretation:** The standard deviation of temperature for New Haven between years 1912 and 1971 is 1.25501660016 degrees Fahrenheit (about 1.26).
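
The formula above can be traced by hand on a small vector. In this sketch, `population_std` and the eight sample values are illustrative assumptions; the values are chosen so that the mean is 5 and the squared deviations sum to 32, giving a variance of 4.

```python
import math

def population_std(vector):
    """Square root of the mean squared deviation (divides by N, as in the formula above)."""
    n = float(len(vector))
    mu = sum(vector) / n
    variance = sum((x - mu) ** 2 for x in vector) / n
    return math.sqrt(variance)

data = [2, 4, 4, 4, 5, 5, 7, 9]
print(population_std(data))  # 2.0
```

Dividing by N rather than N - 1 gives the population standard deviation, which is also what `np.std` computes with its default `ddof=0`.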

### Task 1.4: Histogram

Write a function that takes a vector and an integer `b` and calculates a [histogram](https://en.wikipedia.org/wiki/Histogram) with `b` bins. The function should return an array containing two arrays. The first should be the counts for each bin, the second should contain the borders of the bins.

For `b=5` your output should look like this: 

`[[3, 12, 33, 10, 2], [47.9, 49.24, 50.58, 51.92, 53.26, 54.6]]`

Here, the first array gives the size of these bins, the second defines the bands. I.e., the first band from 47.9-49.24 has 3 entries, the second, from 49.24-50.58 has 12 entries, etc. 

Provide a written interpretation of your results. Comment on whether the histogram is skewed, and if so, in which direction.
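
The borders in the example output follow from splitting the range between the smallest and largest value into `b` equal widths. A minimal sketch, assuming a hypothetical helper `bin_borders` and rounding to two decimals as in the example:

```python
def bin_borders(lo, hi, b):
    """Return the b + 1 borders of b equal-width bins spanning [lo, hi]."""
    width = (hi - lo) / float(b)
    # compute each border from lo directly so rounding errors do not accumulate
    return [round(lo + i * width, 2) for i in range(b + 1)]

# 47.9 and 54.6 are the smallest and largest values in the example borders
print(bin_borders(47.9, 54.6, 5))
```

With these inputs the width is (54.6 - 47.9) / 5 = 1.34, which matches the spacing of the borders shown in the example.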

In [16]:
## your code goes here
def histogram(vector, bins):
    mx = max(vector)
    mn = min(vector)
    binL = (mx - mn)/bins  # width of each bin
    borders = [mn]
    for n in range(bins):
        b = borders[-1]+binL
        borders.append(round(b,2))
    counts = []
    for c in range(bins):
        last = (c == bins - 1)
        # bins are half-open [lower, upper) so a value on a border is counted once;
        # the last bin also includes its upper border so the maximum is counted
        counts.append(len([v for v in vector
                           if borders[c] <= v and (v < borders[c+1] or last)]))
    return  [counts,borders]
    
# the call to your function
print "My Function: ", histogram(temperature_vector,5)
print ''
# check work
import numpy as np
print "Error Checking: ", np.histogram(temperature_vector,5)

 My Function:  [[3, 12, 33, 10, 2], [47.9, 49.24, 50.58, 51.92, 53.26, 54.6]]

Error Checking:  (array([ 3, 12, 33, 10,  2]), array([ 47.9 ,  49.24,  50.58,  51.92,  53.26,  54.6 ]))


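**Your interpretation:** Temperature data for New Haven between years 1912 and 1971 appears almost symmetric. The two lowest bins hold slightly more years (15) than the two highest (12), and the mean (51.16) is slightly below the median (51.2), which suggests a slight left skew.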
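
One common way to put a number on the direction of skew is the moment-based skewness coefficient, the third central moment divided by the second central moment raised to the power 1.5. The sketch below is an illustration rather than part of the assignment, and the name `skewness` is an assumption; a positive result indicates a longer right tail and a negative result a longer left tail.

```python
def skewness(vector):
    """Moment-based skewness g1 = m3 / m2**1.5, using population moments.

    Raises ZeroDivisionError if all values are equal (m2 is 0).
    """
    n = float(len(vector))
    mu = sum(vector) / n
    m2 = sum((x - mu) ** 2 for x in vector) / n  # second central moment (variance)
    m3 = sum((x - mu) ** 3 for x in vector) / n  # third central moment
    return m3 / m2 ** 1.5

print(skewness([1, 2, 3]))            # symmetric data: 0.0
print(skewness([1, 1, 1, 10]) > 0)    # long right tail: True
print(skewness([1, 10, 10, 10]) < 0)  # long left tail: True
```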

### Task 1.5: Filtering
Write a function that takes a vector and returns a vector that contains every other element of the original vector. Print the result of the function as applied to the temperature vector.

Hint: slicing might be helpful here.

In [24]:
## your code goes here
def skip(v):
    v = v[::2]
    return v

# the call to your function
print "My Function: ", skip(temperature_vector)

My Function:  [49.9, 49.4, 49.4, 49.8, 49.3, 50.8, 49.3, 48.4, 50.9, 51.5, 51.8, 49.8, 50.4, 51.8, 48.8, 51.0, 51.7, 52.1, 51.0, 51.4, 53.1, 52.0, 50.9, 50.2, 51.6, 50.5, 51.7, 51.7, 51.9, 51.9]
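
The slice `v[::2]` uses the `start:stop:step` form with only the step given. A short sketch on an illustrative list shows how the start index selects which half of the elements is kept:

```python
letters = ['a', 'b', 'c', 'd', 'e']

print(letters[::2])   # every other element, starting at index 0: ['a', 'c', 'e']
print(letters[1::2])  # every other element, starting at index 1: ['b', 'd']

# slicing builds a new list, so the original is unchanged
print(letters)        # ['a', 'b', 'c', 'd', 'e']
```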


## Part 2: Working with Matrices

For the second part of the homework, we are going to work with matrices. The [dataset we will use](https://www.wunderground.com/history/airport/KSLC/2015/1/1/CustomHistory.html?dayend=31&monthend=12&yearend=2015&req_city=&req_state=&req_statename=&reqdb.zip=&reqdb.magic=&reqdb.wmo=) contains different properties of the weather in Salt Lake City for 2015 (temperature, humidity, sea level, ...). It is stored in the file [`SLC_2015.csv`](SLC_2015.csv) in this repository.

We first read the data from the file and store it in a nested Python array (`temperature_matrix`). A nested Python array is an array in which each element is itself an array. Here is a simple example: 

In [25]:
arr1 = [1,2,3]
arr2 = ['a', 'b', 'c']

nestedArr = [arr1, arr2]
nestedArr

[[1, 2, 3], ['a', 'b', 'c']]
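
Elements of a nested array are reached with two indices: the first selects the inner array, the second selects the element inside it. A brief sketch using the same values as the example above (the variable name `nested` is illustrative):

```python
nested = [[1, 2, 3], ['a', 'b', 'c']]

print(nested[1])     # the second inner array: ['a', 'b', 'c']
print(nested[1][0])  # first element of the second inner array: a
print(len(nested))     # number of inner arrays: 2
print(len(nested[0]))  # length of the first inner array: 3
```

Indices start at 0, so the second inner array is `nested[1]`.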

We provide you with the data import code, which will write the data into the nested list `temperature_matrix`. The list contains one list for each month, each of which contains the mean temperature for every day of that month. 

In [26]:
# initialize the 12 arrays for the months
temperature_matrix = [[] for i in range(12)]

# open the file and append the mean temperature (third column) of each row to its month's array
with open('SLC_2015.csv') as csvfile:
    filereader = csv.reader(csvfile, delimiter=',', quotechar='|')
    # get rid of the header
    next(filereader)
    for row in filereader:
        month = int(row[0].split('/')[0])
        mean_temp = int(row[2])
        temperature_matrix[month-1].append(mean_temp)

print(temperature_matrix)

# the mean temperature on August 23. Note the index offset:
print("Mean temp on August 23: " + str(temperature_matrix[7][22]))


[[15, 19, 26, 28, 37, 38, 38, 36, 35, 31, 39, 36, 35, 30, 31, 31, 37, 44, 40, 35, 31, 31, 31, 33, 42, 41, 44, 42, 36, 40, 39], [39, 49, 50, 50, 53, 57, 60, 53, 55, 45, 43, 47, 46, 48, 43, 40, 38, 44, 47, 44, 39, 33, 31, 35, 44, 35, 37, 36], [40, 37, 34, 33, 39, 43, 45, 45, 46, 50, 54, 50, 51, 56, 62, 63, 61, 53, 47, 53, 57, 54, 52, 47, 42, 48, 56, 62, 53, 57, 63], [46, 44, 44, 54, 60, 50, 52, 46, 49, 53, 58, 50, 57, 56, 33, 44, 50, 54, 56, 56, 60, 61, 61, 59, 51, 46, 50, 57, 65, 63], [63, 71, 68, 67, 62, 59, 58, 57, 49, 53, 59, 68, 65, 65, 53, 48, 56, 58, 55, 59, 58, 58, 55, 57, 62, 59, 61, 61, 64, 71, 76], [80, 68, 69, 68, 69, 70, 66, 73, 77, 78, 72, 74, 75, 76, 81, 77, 78, 83, 83, 78, 81, 78, 78, 83, 82, 84, 87, 88, 91, 89], [87, 87, 87, 89, 79, 79, 76, 75, 73, 72, 77, 79, 81, 77, 80, 80, 79, 74, 74, 73, 76, 77, 75, 78, 78, 84, 77, 66, 70, 76, 79], [80, 79, 69, 76, 82, 74, 76, 69, 72, 79, 83, 81, 83, 88, 83, 79, 77, 72, 74, 76, 81, 74, 76, 84, 85, 78, 77, 80, 85, 82, 75], [82, 83, 82

We will now use the nested array `temperature_matrix` to compute the same metrics as in Part 1.

**Note:** Since the lists in the matrix are of varying lengths (28 to 31 days) many of the standard NumPy functions won't work.

### Task 2.1: Calculate the mean of a whole matrix

Write a function that calculates the mean of a matrix. For this version calculate the mean over all elements in the matrix as if it was one large vector. 
Pass in the matrix with the weather data and return the result. Provide a written interpretation of your results.
Can you use your function from Part 1 and get a valid result?

In [52]:
## your code goes here
def mean_matrix(mat):
    all_days = []
    for month in mat:
        for day in month:
            all_days.append(day)
    return mean(all_days)
    
# the call to your function
print "My Function: ", mean_matrix(temperature_matrix)

My Function:  56


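**Your Interpretation:** The mean temperature in Salt Lake City for 2015 is 56 degrees Fahrenheit. The value is a whole number because the daily temperatures are integers, so the `/` in `mean` performs integer division under Python 2 and drops the fractional part. I can reuse the `mean` function from Part 1 on the flattened list of days, but not directly on the matrix, because `sum` then tries to add an integer to a list and raises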

TypeError: unsupported operand type(s) for +: 'int' and 'list' 
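
The error and the flattening step can be reproduced on a small ragged list. This is a sketch with illustrative names (`ragged`, `flat`), using a list comprehension in place of the nested loops above:

```python
ragged = [[1, 2, 3], [4, 5]]  # inner lists of different lengths, like the months

try:
    sum(ragged)  # sum starts from 0 and then evaluates 0 + [1, 2, 3]
except TypeError as err:
    print("TypeError:", err)

# flatten: visit every row, then every value within that row
flat = [value for row in ragged for value in row]
print(flat)                          # [1, 2, 3, 4, 5]
print(sum(flat) / float(len(flat)))  # 3.0
```

Converting the length to a float keeps the fractional part of the mean for integer data.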


### Task 2.2:  Calculate the mean of each vector of a matrix

Write a function that calculates the mean temperature of each month and returns an array with the means for each column. Provide a written interpretation of your results. Can you use the function you implemented in Part 1 here efficiently? If so, use it.

In [34]:
## your code goes here
def mean_matrix_columns(mat):
    month_means = []
    for month in mat:    
        mm = mean(month)
        month_means.append(mm)
    return month_means

# the call to your function
mean_matrix_columns(temperature_matrix)

[34, 44, 50, 52, 60, 77, 77, 78, 71, 61, 39, 31]

**Your Interpretation:** The mean temperatures for each month in Salt Lake City for 2015, truncated to whole degrees Fahrenheit by the integer division in `mean`, are as follows: [34, 44, 50, 52, 60, 77, 77, 78, 71, 61, 39, 31]

I was able to reuse the `mean` function from Part 1 directly by calling it once per month.

### Task 2.3:  Calculate the median of a whole matrix

Write a function that calculates and returns the median of a matrix over all values (regardless of which row they come from). Provide a written interpretation of your results. Can you use your function from Part 1 and get a valid result?

In [53]:
## your code goes here
def median_matrix(mat):
    all_days = []
    for month in mat:
        for day in month:
            all_days.append(day)
    return median(all_days)

# the call to your function
median_matrix(temperature_matrix)

57.0

**Your Interpretation:** The median temperature in Salt Lake City for 2015 is 57.0 degrees Fahrenheit, which is close to the mean. I can reuse the `median` function from Part 1 on the flattened list of days, but not directly on the matrix: the two middle elements are then lists, `+` concatenates them, and dividing the resulting list by `2.0` raises

TypeError: unsupported operand type(s) for /: 'list' and 'float'

### Task 2.4: Calculate the median of each vector of a matrix

Write a function that calculates the median of each sub-array (i.e., each column in the CSV file) in the matrix and returns an array of medians (one entry for each column in the CSV file). To do so, use the function you implemented in Part 1. Provide a written interpretation of your results. 

In [44]:
## your code goes here
def median_matrix_columns(mat):
    month_med = []
    for month in mat:
        month_med.append(median(month))
    return month_med

# the call to your function
median_matrix_columns(temperature_matrix)

[36.0, 44.0, 51.0, 53.5, 59.0, 78.0, 77.0, 79.0, 73.0, 62.0, 40.0, 32.0]

**Your Interpretation:** The median temperatures for each month in Salt Lake City for 2015 are as follows: [36.0, 44.0, 51.0, 53.5, 59.0, 78.0, 77.0, 79.0, 73.0, 62.0, 40.0, 32.0], which are close to the monthly means.

I was able to reuse the `median` function from Part 1 directly by calling it once per month.

### Task 2.5: Calculate the standard deviation of a whole matrix

Write a function that calculates and returns the standard deviation of a matrix over all values in the matrix (ignoring which column they come from). Can you use your function from Part 1 and get a valid result? Provide a written interpretation of your results. 

In [54]:
## your code goes here
def standard_deviation_matrix(mat):
    all_days = []
    for month in mat:
        for day in month:
            all_days.append(day)
    return standard_deviation(all_days)

# the call to your function
standard_deviation_matrix(temperature_matrix)

17.916472867168917

**Your Interpretation:** The standard deviation of temperature in Salt Lake City for 2015 is about 17.92 degrees Fahrenheit (approximate, since the integer temperatures make the divisions truncate under Python 2). I can reuse the `standard_deviation` function from Part 1 on the flattened list of days, but not directly on the matrix, because its call to `mean` then tries to add an integer to a list and raises

TypeError: unsupported operand type(s) for +: 'int' and 'list'

### Task 2.6: Calculate the standard deviation of each vector of a matrix

Write a function that calculates the standard deviation of each array in the matrix and returns an array of standard deviations (one standard deviation for each column). To do so, use the function you implemented in Part 1.
Pass in the matrix with the temperature data and return the result. Provide a written interpretation of your results - is the standard deviation consistent across the seasons? 

In [51]:
## your code goes here
def standard_deviation_matrix_columns(mat):
    month_std = []
    for month in mat:
        month_std.append(standard_deviation(month))
    return month_std

# the call to your function
standard_deviation_matrix_columns(temperature_matrix)

[6.48074069840786,
 7.3484692283495345,
 8.246211251235321,
 6.928203230275509,
 6.244997998398398,
 6.557438524302,
 5.0,
 4.58257569495584,
 7.54983443527075,
 6.928203230275509,
 8.717797887081348,
 8.94427190999916]

**Your Interpretation:** The standard deviations of temperature for each month in Salt Lake City for 2015, in degrees Fahrenheit, are as follows: 

[6.48074069840786,
 7.3484692283495345,
 8.246211251235321,
 6.928203230275509,
 6.244997998398398,
 6.557438524302,
 5.0,
 4.58257569495584,
 7.54983443527075,
 6.928203230275509,
 8.717797887081348,
 8.94427190999916]

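I was able to reuse the `standard_deviation` function from Part 1 directly by calling it once per month. The spread is not consistent across the seasons: it is smallest in July and August (5.0 and about 4.58) and largest in November and December (about 8.72 and 8.94).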
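
To compare the months programmatically, the positions of the smallest and largest standard deviation can be looked up by index. In this sketch the values are copied from the output above, index 0 corresponds to January as in `temperature_matrix`, and the names `month_names`, `calmest`, and `most_variable` are illustrative assumptions.

```python
# monthly standard deviations copied from the output above (index 0 = January)
month_std = [6.48074069840786, 7.3484692283495345, 8.246211251235321,
             6.928203230275509, 6.244997998398398, 6.557438524302,
             5.0, 4.58257569495584, 7.54983443527075,
             6.928203230275509, 8.717797887081348, 8.94427190999916]
month_names = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
               'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']

# index of the month with the smallest and the largest standard deviation
calmest = min(range(len(month_std)), key=lambda i: month_std[i])
most_variable = max(range(len(month_std)), key=lambda i: month_std[i])

print(month_names[calmest], month_std[calmest])              # Aug 4.58257569495584
print(month_names[most_variable], month_std[most_variable])  # Dec 8.94427190999916
```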