## WEEKEND 2 AGENDA:
  * **An overview of things learned on weekend 1, finish up if necessary**
  * **Intro to effectively working with timeseries and incomplete data**
  * **Learning to plot**
  * **More advanced table manipulations and pipelines**
  * **A few common data transformations for machine learning pipelines**
  * **A brief intro to supervised learning using recommender systems**

**WELCOME TO WEEKEND 2 OF THE INTRO TO DATA SCIENCE WORKSHOP!**

I hope you've learned a few things so far about the power of **Python** and **pandas** when applied to datasets that can't be handled properly with traditional tabular analysis tools like Excel.

### Overview of what we've learned so far (and a refresher on python containers)

## Lists

A list is an ordered, indexable collection of data. Let's say you have collected some price and count data that looks like this:

    prices:
     25.31
     33.23
     20.16
     11.11
     14.33

    counts:
     10
     5
     3
     40
     1

We can create lists of the prices and the counts like this:

In [231]:
prices = [25.31,33.23,20.16,11.11,14.33]
counts = [10,5,3,40,1]
print prices
print counts

[25.31, 33.23, 20.16, 11.11, 14.33]
[10, 5, 3, 40, 1]


Indexing on lists starts at 0 and ends at n-1, where n is the size of the list. You can also index from the back to the front.

In [232]:
print "First price:", prices[0] # get the first price
print "Next to last count:", counts[3] # get the next-to-last count
print prices[-1] # you can also index from the back of the list

First price: 25.31
Next to last count: 40
14.33
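
The indexing rules above can be sketched as follows. This reuses the `prices` list; the names `n`, `first_three` and friends are only illustrative, and the slice at the end is an extra that follows the same counting rules.

```python
prices = [25.31, 33.23, 20.16, 11.11, 14.33]

n = len(prices)              # number of elements in the list
first = prices[0]            # indexing starts at 0
last = prices[n - 1]         # ... and ends at n-1
also_last = prices[-1]       # negative indices count from the back
next_to_last = prices[-2]
first_three = prices[0:3]    # a slice: elements 0, 1 and 2 (the end index is excluded)

print(first_three)
```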


## Tuples

Tuples are another of Python's basic container data types. They are very similar to lists but with one major difference. Tuples are **immutable**. Once data is placed into a tuple, the tuple cannot be changed. You define a tuple as follows:

In [233]:
my_tuple = ("red", "white", "blue")

You can slice and index a tuple exactly as you would a list. Python uses tuples internally, and a tuple can be used as a key in a dictionary, whereas a list cannot.
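
Here is a small sketch of what immutability means in practice, and of the tuple-as-dictionary-key point. The names `my_list` and `flag_counts` are made up for this illustration.

```python
my_tuple = ("red", "white", "blue")
my_list = ["red", "white", "blue"]

my_list[0] = "green"         # lists are mutable, so this works

try:
    my_tuple[0] = "green"    # tuples are immutable, so this raises TypeError
    tuple_changed = True
except TypeError:
    tuple_changed = False

# A tuple can serve as a dictionary key ...
flag_counts = {my_tuple: 1}  # hypothetical dictionary, for illustration only

# ... but a list cannot, because lists are not hashable.
try:
    bad = {my_list: 1}
    list_as_key = True
except TypeError:
    list_as_key = False

print(tuple_changed)
print(list_as_key)
```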

See if you can retrieve the third element of `my_tuple`:

In [234]:
#Your code here

## Dictionaries

A Python dictionary is an unordered collection of key-value pairs.

In [235]:
data_dict = {"experiment": "sergey vs. data science",
        "run": 47,
        "score": 372.756, 
        "values1": [-1.0, -0.5, 0.0, 0.5, 1.0], 
        "values2": [-2.0, -1.0, 0.0, 1.0, 2.0],
        }

This data structure is more convenient than a list because you no longer have to remember that the run number is in the second position; you just refer directly to `"run"`. Here `"run"` is referred to as a key:

In [236]:
print "Keys: ", data_dict.keys()
print "Values: ", data_dict.values()


Keys:  ['values1', 'experiment', 'run', 'score', 'values2']
Values:  [[-1.0, -0.5, 0.0, 0.5, 1.0], 'sergey vs. data science', 47, 372.756, [-2.0, -1.0, 0.0, 1.0, 2.0]]
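
Referring to a value by its key might look like the sketch below. It reuses `data_dict`; the `"notes"` key is hypothetical and deliberately absent, to show what `.get()` does with a missing key.

```python
data_dict = {"experiment": "sergey vs. data science",
             "run": 47,
             "score": 372.756,
             "values1": [-1.0, -0.5, 0.0, 0.5, 1.0],
             "values2": [-2.0, -1.0, 0.0, 1.0, 2.0]}

run_number = data_dict["run"]            # look a value up by its key
data_dict["run"] = run_number + 1        # assign through the key to update it
has_score = "score" in data_dict         # test whether a key exists
missing = data_dict.get("notes", None)   # .get() returns a default instead of raising KeyError

for key in sorted(data_dict):            # sorted() gives a predictable key order
    print("{0}: {1}".format(key, data_dict[key]))
```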


Over the course of the first weekend we learned how to do the following:

In **Lesson 1** we learned:
* There are 2 basic data types in **pandas** - `DataFrame` and `Series` objects
* They can be accessed in similar ways, however, `DataFrame` objects are effectively collections of `Series` objects
* You can query them in a variety of ways: by the labels in their index, or by raw position
* You will love both of these objects to no end by the end of the course

In **Lesson 2** we learned how to:

* Load data correctly into a `DataFrame` object using `pandas.read_csv()`
* View parts of the dataset quickly using `datasetName.head()` and `datasetName.tail()`
* Get the format of every column of the dataset using `datasetName.dtypes`
* Get all of the unique values of a specific column using `datasetName[columnName].unique()`
* Select columns and rows based on specific criteria by generating and applying masks
* Subselect entire portions of a dataset based on criteria
* Delete columns we are not interested in using `del datasetName[columnName]`
* Format columns with time data into a time format using `pandas.to_datetime()` that then allows us to access many aspects of the times using `datasetName[timeColumn].dt`
* Get the overall size and shape of the dataset (excluding the `index`) using `datasetName.shape`
* Get summary statistics about the dataset or about specific columns using `datasetName[optionalColumnNames].describe()`
* Group data based on certain columns using `datasetName.groupby()` and work with groups using:
  * Predefined functions like `groupedDatasetName.size()` directly on the grouped objects
  * Other functions that are passed to `groupedDatasetName.agg(yourFunctionHere)` like `mean` and `std` or functions that you create yourself
  * Filters that remove data you are not interested in and return a new filtered, ungrouped dataset using `groupedDatasetName.filter(yourFilterFunction)` where `yourFilterFunction` can either be an anonymous function (`lambda x: your function condition(s)`) or a standard python function (`def yourFunctionName(yourInputs): return your function condition(s)`)
* Compute correlations and correlation matrices between various numeric columns in a dataset using `datasetName[columnName].corr(otherColumn)` or `datasetName.corr()`.
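
As a quick refresher, several of the Lesson 2 items can be sketched on a toy table. The `sales` data and its column names are invented for this example and are not one of the workshop datasets.

```python
import pandas as pd

# Hypothetical toy data, invented for this sketch
sales = pd.DataFrame({"store": ["a", "a", "b", "b", "b", "c"],
                      "units": [10, 5, 3, 40, 1, 7],
                      "price": [25.31, 33.23, 20.16, 11.11, 14.33, 9.99]})

# Boolean mask: True for rows that satisfy the condition
mask = sales["units"] > 4
big_orders = sales[mask]

# Group by a column, then use a predefined function or pass functions to agg()
group_sizes = sales.groupby("store").size()
unit_stats = sales.groupby("store")["units"].agg(["mean", "std"])

# filter() keeps whole groups for which the function returns True
# and returns an ungrouped DataFrame
multi_row_stores = sales.groupby("store").filter(lambda g: len(g) > 1)

# Correlation between two numeric columns
units_price_corr = sales["units"].corr(sales["price"])

print(big_orders)
print(unit_stats)
print(multi_row_stores)
```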

In **Lesson 3** we learned how to:

* Use some basic `str` functionality to format columns with text and create new columns from the formatting/parsing of the `string` values found within them.
* Convert categorical variables, stored as `string` types, into indicator variables using `get_dummies()`
* Convert numerical variables into categorical variables through discretization using `cut()`
* Join 2 datasets along one or multiple keys using `yourDataset.merge(yourOtherDataset,on=joinKeys)`
* Look for inconsistent data and remove it from our dataset
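
The Lesson 3 tools can be sketched the same way. The `people` and `cities` tables, their columns, and the age bins are all hypothetical choices made for this example.

```python
import pandas as pd

# Hypothetical toy data, invented for this sketch
people = pd.DataFrame({"name": [" alice", "BOB ", "Carol"],
                       "city": ["Ohio", "Nevada", "Ohio"],
                       "age": [23, 41, 67]})
cities = pd.DataFrame({"city": ["Ohio", "Nevada"],
                       "region": ["Midwest", "West"]})

# String methods live under the .str accessor
people["name_clean"] = people["name"].str.strip().str.title()

# Categorical strings -> indicator (dummy) columns
city_dummies = pd.get_dummies(people["city"])

# Numeric -> categorical by binning; the bins are (0, 30], (30, 60], (60, 100]
people["age_group"] = pd.cut(people["age"], bins=[0, 30, 60, 100],
                             labels=["young", "middle", "senior"])

# Join the two tables on a shared key column
joined = people.merge(cities, on="city")

print(joined)
```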

Gaze upon all that you've learned thus far and marvel at your progress!

...

OK, that's enough. Let's learn to work with timeseries and how to plot them.

### Data Frames are effectively dictionaries on steroids.

* tabular data structure
* ordered collection of columns
* each column can be of a different type - numeric, string, boolean
* row and column index

In [237]:
import pandas as pd
import numpy as np

In [238]:
data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002],
        'population': [1.5, 1.7, 3.6, 2.4, 2.9]}
df = pd.DataFrame(data)
print df

   population   state  year
0         1.5    Ohio  2000
1         1.7    Ohio  2001
2         3.6    Ohio  2002
3         2.4  Nevada  2001
4         2.9  Nevada  2002


In [170]:
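#You can access values in a dataframe in a variety of ways
print df.iloc[:,[0,1]] # by integer position, using a list of column positions
print df.iloc[:,0:2] # by integer position, using a slice
print df[["population","state"]] # by column name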

   population   state
0         1.5    Ohio
1         1.7    Ohio
2         3.6    Ohio
3         2.4  Nevada
4         2.9  Nevada
   population   state
0         1.5    Ohio
1         1.7    Ohio
2         3.6    Ohio
3         2.4  Nevada
4         2.9  Nevada
   population   state
0         1.5    Ohio
1         1.7    Ohio
2         3.6    Ohio
3         2.4  Nevada
4         2.9  Nevada


In [160]:
type(df)

pandas.core.frame.DataFrame

#### Each column in a dataframe is a Series, which is a pandas data structure in its own right.
#### There are a variety of notations used to access the columns of a data frame

In [162]:
type(df["population"])
print df.population

0    1.5
1    1.7
2    3.6
3    2.4
4    2.9
Name: population, dtype: float64


### Here's another example dataframe:

In [174]:
dates = pd.date_range('20140101',periods=6)
print dates
type(dates)
dates[2]

df2 = pd.DataFrame(np.random.randn(6,4),index=dates,columns=list('ABCD'))

DatetimeIndex(['2014-01-01', '2014-01-02', '2014-01-03', '2014-01-04',
               '2014-01-05', '2014-01-06'],
              dtype='datetime64[ns]', freq='D', tz=None)


In [175]:
# You can access a dataframe's datatypes like this
df2.dtypes

A    float64
B    float64
C    float64
D    float64
dtype: object

## Viewing Data

In [181]:
df2.head()

| | A | B | C | D |
| --- | --- | --- | --- | --- |
| 2014-01-01 | 0.960801 | 1.171381 | 0.525141 | -0.329779 |
| 2014-01-02 | -0.627231 | -0.079685 | 0.671225 | -0.67453 |
| 2014-01-03 | 0.690754 | 0.396934 | -0.504431 | 1.596217 |
| 2014-01-04 | -0.418247 | 0.13149 | 2.068749 | 0.765621 |
| 2014-01-05 | 0.709363 | 0.391902 | 0.04759 | -0.795258 |


In [182]:
df2.tail()

| | A | B | C | D |
| --- | --- | --- | --- | --- |
| 2014-01-02 | -0.627231 | -0.079685 | 0.671225 | -0.67453 |
| 2014-01-03 | 0.690754 | 0.396934 | -0.504431 | 1.596217 |
| 2014-01-04 | -0.418247 | 0.13149 | 2.068749 | 0.765621 |
| 2014-01-05 | 0.709363 | 0.391902 | 0.04759 | -0.795258 |
| 2014-01-06 | -1.365953 | -1.001444 | 0.969324 | 1.410294 |


In [183]:
df2.index

DatetimeIndex(['2014-01-01', '2014-01-02', '2014-01-03', '2014-01-04',
               '2014-01-05', '2014-01-06'],
              dtype='datetime64[ns]', freq='D', tz=None)

In [185]:
df2.describe()

| | A | B | C | D |
| --- | --- | --- | --- | --- |
| count | 6.0 | 6.0 | 6.0 | 6.0 |
| mean | -0.008419 | 0.16843 | 0.6296 | 0.328761 |
| std | 0.931367 | 0.712684 | 0.874525 | 1.064959 |
| min | -1.365953 | -1.001444 | -0.504431 | -0.795258 |
| 25% | -0.574985 | -0.026891 | 0.166978 | -0.588342 |
| 50% | 0.136253 | 0.261696 | 0.598183 | 0.217921 |
| 75% | 0.704711 | 0.395676 | 0.8948 | 1.249126 |
| max | 0.960801 | 1.171381 | 2.068749 | 1.596217 |


You can sort along a specific column:

In [186]:
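df2.sort_values(by='B') # sort_values replaces the older sort(columns=...), which newer pandas releases no longer provide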

| | A | B | C | D |
| --- | --- | --- | --- | --- |
| 2014-01-06 | -1.365953 | -1.001444 | 0.969324 | 1.410294 |
| 2014-01-02 | -0.627231 | -0.079685 | 0.671225 | -0.67453 |
| 2014-01-04 | -0.418247 | 0.13149 | 2.068749 | 0.765621 |
| 2014-01-05 | 0.709363 | 0.391902 | 0.04759 | -0.795258 |
| 2014-01-03 | 0.690754 | 0.396934 | -0.504431 | 1.596217 |
| 2014-01-01 | 0.960801 | 1.171381 | 0.525141 | -0.329779 |


## Boolean Indexing

In [188]:
df2[df2.A < 0] # Basically a 'where' operation

| | A | B | C | D |
| --- | --- | --- | --- | --- |
| 2014-01-02 | -0.627231 | -0.079685 | 0.671225 | -0.67453 |
| 2014-01-04 | -0.418247 | 0.13149 | 2.068749 | 0.765621 |
| 2014-01-06 | -1.365953 | -1.001444 | 0.969324 | 1.410294 |
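
The expression inside the brackets is itself a boolean Series, and several conditions can be combined. A minimal sketch follows; it rebuilds a frame shaped like `df2` with a fixed random seed, so its values will not match the output above.

```python
import numpy as np
import pandas as pd

np.random.seed(0)  # fixed seed so the sketch is repeatable; values differ from the workshop output
dates = pd.date_range('20140101', periods=6)
df2 = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))

mask = df2.A < 0            # a boolean Series, one True/False per row
negative_a = df2[mask]      # keep only the rows where the mask is True

# Combine conditions with & (and) and | (or); each condition needs its own parentheses
both = df2[(df2.A < 0) & (df2.B > 0)]
either = df2[(df2.A < 0) | (df2.B > 0)]

print(mask)
print(negative_a)
```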


## Missing Data

In [189]:
# Add a column with missing data
df3 = df2.reindex(index=dates[0:4],columns=list(df2.columns) + ['E'])

In [190]:
df3

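| | A | B | C | D | E |
| --- | --- | --- | --- | --- | --- |
| 2014-01-01 | 0.960801 | 1.171381 | 0.525141 | -0.329779 | NaN |
| 2014-01-02 | -0.627231 | -0.079685 | 0.671225 | -0.67453 | NaN |
| 2014-01-03 | 0.690754 | 0.396934 | -0.504431 | 1.596217 | NaN |
| 2014-01-04 | -0.418247 | 0.13149 | 2.068749 | 0.765621 | NaN |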


In [193]:
# find where values are null
pd.isnull(df3)

| | A | B | C | D | E |
| --- | --- | --- | --- | --- | --- |
| 2014-01-01 | False | False | False | False | True |
| 2014-01-02 | False | False | False | False | True |
| 2014-01-03 | False | False | False | False | True |
| 2014-01-04 | False | False | False | False | True |
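
Once you know where the missing values are, two common next steps are dropping them or filling them in. The sketch below uses `dropna()` and `fillna()` on a small hypothetical frame (not the `df3` above), and filling with `0` is just one possible choice.

```python
import numpy as np
import pandas as pd

dates = pd.date_range('20140101', periods=4)
df3 = pd.DataFrame({'A': [1.0, np.nan, 3.0, 4.0],   # hypothetical values
                    'E': [np.nan, np.nan, np.nan, np.nan]},
                   index=dates)

missing_per_column = df3.isnull().sum()          # count the NaN values in each column
no_empty_cols = df3.dropna(axis=1, how='all')    # drop columns that are entirely NaN
complete_rows = no_empty_cols.dropna()           # drop rows that contain any NaN
filled = df3.fillna(0)                           # or replace every NaN with a constant

print(missing_per_column)
print(complete_rows)
```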


## Simple Aggregations

In [198]:
print df2.mean() #mean per-column
print df2.mean(axis=1) #mean per row

A   -0.008419
B    0.168430
C    0.629600
D    0.328761
dtype: float64
2014-01-01    0.581886
2014-01-02   -0.177555
2014-01-03    0.544868
2014-01-04    0.636903
2014-01-05    0.088399
2014-01-06    0.003055
Freq: D, dtype: float64


## Applying functions

In [199]:
df2

| | A | B | C | D |
| --- | --- | --- | --- | --- |
| 2014-01-01 | 0.960801 | 1.171381 | 0.525141 | -0.329779 |
| 2014-01-02 | -0.627231 | -0.079685 | 0.671225 | -0.67453 |
| 2014-01-03 | 0.690754 | 0.396934 | -0.504431 | 1.596217 |
| 2014-01-04 | -0.418247 | 0.13149 | 2.068749 | 0.765621 |
| 2014-01-05 | 0.709363 | 0.391902 | 0.04759 | -0.795258 |
| 2014-01-06 | -1.365953 | -1.001444 | 0.969324 | 1.410294 |


In [200]:
df2.apply(np.cumsum) #you pass the function itself here, without calling it or supplying any parameters; apply calls it on each column

| | A | B | C | D |
| --- | --- | --- | --- | --- |
| 2014-01-01 | 0.960801 | 1.171381 | 0.525141 | -0.329779 |
| 2014-01-02 | 0.33357 | 1.091696 | 1.196366 | -1.004309 |
| 2014-01-03 | 1.024324 | 1.48863 | 0.691935 | 0.591908 |
| 2014-01-04 | 0.606077 | 1.620119 | 2.760684 | 1.357529 |
| 2014-01-05 | 1.31544 | 2.012021 | 2.808274 | 0.562271 |
| 2014-01-06 | -0.050513 | 1.010577 | 3.777598 | 1.972565 |


In [205]:
df2.apply(lambda x: x.max() - x.min()) #Here the output is a series, whereas in the other case it was a dataframe, why?

A    2.326754
B    2.172825
C    2.573181
D    2.391475
dtype: float64
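
One way to think about the question in the comment: `apply` calls the function once per column, and the shape of the result depends on what the function hands back. The sketch below uses a tiny hypothetical frame so the numbers are easy to check by hand.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1.0, 2.0, 3.0],     # hypothetical values
                   'B': [10.0, 20.0, 40.0]})

# np.cumsum maps a column of 3 values to 3 running totals,
# so apply() assembles the columns back into a DataFrame.
running = df.apply(np.cumsum)

# max - min collapses each column to a single number,
# so apply() returns a Series indexed by column name.
ranges = df.apply(lambda x: x.max() - x.min())

print(type(running).__name__)
print(type(ranges).__name__)
```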

### Working with strings

In [207]:
# Built in string methods
s = pd.Series(['A', 'B', 'C', 'Aaba', 'Baca', np.nan, 'CABA', 'dog', 'cat'])
print "Original series: ", s
print "Operating on strings in series: ", s.str.lower()

Original series:  0       A
1       B
2       C
3    Aaba
4    Baca
5     NaN
6    CABA
7     dog
8     cat
dtype: object
Operating on strings in series:  0       a
1       b
2       c
3    aaba
4    baca
5     NaN
6    caba
7     dog
8     cat
dtype: object
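
A few more `.str` methods work the same way, and a boolean result can be used directly as a mask. This is a small sketch on the same series `s`; the choice of methods (`upper`, `len`, `contains`) is just for illustration.

```python
import numpy as np
import pandas as pd

s = pd.Series(['A', 'B', 'C', 'Aaba', 'Baca', np.nan, 'CABA', 'dog', 'cat'])

upper = s.str.upper()        # the missing value stays missing
lengths = s.str.len()        # length of each string
# na=False treats missing values as "no match" so the result is purely True/False
has_a = s.str.lower().str.contains('a', na=False)
with_a = s[has_a]            # use the boolean result as a mask

print(lengths)
print(with_a)
```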


## Grouping

In [208]:
df4 = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar',
                       'foo', 'bar', 'foo', 'foo'],
                       'B' : ['one', 'one', 'two', 'three',
                             'two', 'two', 'one', 'three'],
                       'C' : np.random.randn(8),
                       'D' : np.random.randn(8)})
df4

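| | A | B | C | D |
| --- | --- | --- | --- | --- |
| 0 | foo | one | 1.998232 | 0.721194 |
| 1 | bar | one | 0.733426 | -0.119302 |
| 2 | foo | two | -0.664899 | 2.080191 |
| 3 | bar | three | -0.02739 | 1.339258 |
| 4 | foo | two | 0.585594 | 0.225102 |
| 5 | bar | two | -1.870667 | -1.100231 |
| 6 | foo | one | 1.603587 | 0.451163 |
| 7 | foo | three | 0.926215 | 0.678841 |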


In [209]:
df4.groupby(['A','B']).sum()

| A | B | C | D |
| --- | --- | --- | --- |
| bar | one | 0.733426 | -0.119302 |
| bar | three | -0.02739 | 1.339258 |
| bar | two | -1.870667 | -1.100231 |
| foo | one | 3.601819 | 1.172357 |
| foo | three | 0.926215 | 0.678841 |
| foo | two | -0.079305 | 2.305293 |


## Concatenating pieces

In [216]:
np.random.randn(10,4)

array([[-1.14296105,  0.83337485, -0.16338536,  2.52067546],
       [-0.82269406,  1.2160463 , -0.13902902,  0.58724569],
       [-0.76717535, -0.42665718,  1.41911439,  1.15532216],
       [ 0.73444478, -0.44160454, -1.27873869, -0.19217871],
       [ 0.36245202,  0.75947756,  0.3588593 ,  2.07224147],
       [-0.35176726, -1.69557511, -0.18890609, -0.30754481],
       [-0.12622954,  1.3145355 ,  0.15903419,  1.04072445],
       [-1.37594285,  0.42907178, -0.25153442, -0.20532171],
       [-0.38739266,  1.30997449, -0.43488732, -0.92147005],
       [-1.83588412, -1.8789914 ,  0.62343605,  0.63299324]])

In [217]:
#Concatenating pandas objects together
df5 = pd.DataFrame(np.random.randn(10,4))
df5

| | 0 | 1 | 2 | 3 |
| --- | --- | --- | --- | --- |
| 0 | -2.041897 | -1.209262 | -0.646355 | 0.63701 |
| 1 | -0.076819 | -2.289804 | -0.724763 | -0.374453 |
| 2 | -1.728321 | -1.180203 | 0.626506 | 0.023966 |
| 3 | 0.851832 | 1.062758 | -0.385286 | 0.738367 |
| 4 | -0.924217 | 1.570674 | 0.257635 | -0.729136 |
| 5 | 1.47772 | 0.279673 | -0.822543 | -1.835085 |
| 6 | 0.439104 | 0.112922 | -0.743189 | -0.189266 |
| 7 | -1.569584 | -0.952367 | 0.126604 | 0.859859 |
| 8 | -1.234078 | -0.916723 | 0.562158 | -0.441334 |
| 9 | 1.858314 | -1.020991 | 0.060013 | 1.270094 |


In [218]:
# Break it into pieces
pieces = [df5[:3], df5[3:7],df5[7:]]
pieces

[          0         1         2         3
 0 -2.041897 -1.209262 -0.646355  0.637010
 1 -0.076819 -2.289804 -0.724763 -0.374453
 2 -1.728321 -1.180203  0.626506  0.023966,
           0         1         2         3
 3  0.851832  1.062758 -0.385286  0.738367
 4 -0.924217  1.570674  0.257635 -0.729136
 5  1.477720  0.279673 -0.822543 -1.835085
 6  0.439104  0.112922 -0.743189 -0.189266,
           0         1         2         3
 7 -1.569584 -0.952367  0.126604  0.859859
 8 -1.234078 -0.916723  0.562158 -0.441334
 9  1.858314 -1.020991  0.060013  1.270094]

In [214]:
pd.concat(pieces)

| | 0 | 1 | 2 | 3 |
| --- | --- | --- | --- | --- |
| 0 | 0.419084 | -0.897757 | 0.286845 | 1.024689 |
| 1 | 0.676102 | -0.434695 | -0.13231 | -1.651151 |
| 2 | -1.127066 | -0.257796 | 0.544062 | -0.628465 |
| 3 | 0.018546 | -0.867174 | 0.63735 | -0.459063 |
| 4 | -0.333693 | -1.90827 | -0.963644 | -1.40421 |
| 5 | -0.123689 | -1.973577 | 0.604741 | 0.208593 |
| 6 | -1.142666 | -0.389683 | 0.566603 | 1.631587 |
| 7 | 0.594018 | -1.560414 | -1.557318 | -1.395482 |
| 8 | -1.404587 | 0.40227 | 0.113107 | -0.867286 |
| 9 | -0.518196 | -1.063976 | -1.378926 | -1.348535 |


In [215]:
pd.concat(pieces,axis=1)

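| | 0 | 1 | 2 | 3 | 0 | 1 | 2 | 3 | 0 | 1 | 2 | 3 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0.419084 | -0.897757 | 0.286845 | 1.024689 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 1 | 0.676102 | -0.434695 | -0.13231 | -1.651151 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 2 | -1.127066 | -0.257796 | 0.544062 | -0.628465 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 3 | NaN | NaN | NaN | NaN | 0.018546 | -0.867174 | 0.63735 | -0.459063 | NaN | NaN | NaN | NaN |
| 4 | NaN | NaN | NaN | NaN | -0.333693 | -1.90827 | -0.963644 | -1.40421 | NaN | NaN | NaN | NaN |
| 5 | NaN | NaN | NaN | NaN | -0.123689 | -1.973577 | 0.604741 | 0.208593 | NaN | NaN | NaN | NaN |
| 6 | NaN | NaN | NaN | NaN | -1.142666 | -0.389683 | 0.566603 | 1.631587 | NaN | NaN | NaN | NaN |
| 7 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.594018 | -1.560414 | -1.557318 | -1.395482 |
| 8 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | -1.404587 | 0.40227 | 0.113107 | -0.867286 |
| 9 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | -0.518196 | -1.063976 | -1.378926 | -1.348535 |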
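
The `axis=1` result is mostly empty because `concat` aligns the pieces on their row labels, and these pieces share none. A minimal sketch, using a frame of consecutive integers (a hypothetical stand-in for `df5`) so the alignment is easy to check:

```python
import numpy as np
import pandas as pd

df5 = pd.DataFrame(np.arange(40).reshape(10, 4))   # hypothetical values 0..39
pieces = [df5[:3], df5[3:7], df5[7:]]

# axis=0 (the default): pieces are stacked vertically; the row labels 0..9 are preserved
stacked = pd.concat(pieces)

# axis=1: pieces are placed side by side and aligned on row labels.
# The pieces share no row labels, so each row is filled only in its own block of columns.
side_by_side = pd.concat(pieces, axis=1)

print(stacked.equals(df5))
print(side_by_side.shape)
```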


In [219]:
df5.merge(df5,left_index=True,right_index=True)

| | 0_x | 1_x | 2_x | 3_x | 0_y | 1_y | 2_y | 3_y |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | -2.041897 | -1.209262 | -0.646355 | 0.63701 | -2.041897 | -1.209262 | -0.646355 | 0.63701 |
| 1 | -0.076819 | -2.289804 | -0.724763 | -0.374453 | -0.076819 | -2.289804 | -0.724763 | -0.374453 |
| 2 | -1.728321 | -1.180203 | 0.626506 | 0.023966 | -1.728321 | -1.180203 | 0.626506 | 0.023966 |
| 3 | 0.851832 | 1.062758 | -0.385286 | 0.738367 | 0.851832 | 1.062758 | -0.385286 | 0.738367 |
| 4 | -0.924217 | 1.570674 | 0.257635 | -0.729136 | -0.924217 | 1.570674 | 0.257635 | -0.729136 |
| 5 | 1.47772 | 0.279673 | -0.822543 | -1.835085 | 1.47772 | 0.279673 | -0.822543 | -1.835085 |
| 6 | 0.439104 | 0.112922 | -0.743189 | -0.189266 | 0.439104 | 0.112922 | -0.743189 | -0.189266 |
| 7 | -1.569584 | -0.952367 | 0.126604 | 0.859859 | -1.569584 | -0.952367 | 0.126604 | 0.859859 |
| 8 | -1.234078 | -0.916723 | 0.562158 | -0.441334 | -1.234078 | -0.916723 | 0.562158 | -0.441334 |
| 9 | 1.858314 | -1.020991 | 0.060013 | 1.270094 | 1.858314 | -1.020991 | 0.060013 | 1.270094 |
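
Merging on the index lines rows up by their labels and adds `_x` / `_y` suffixes to column names that appear on both sides. For comparison, the sketch below also merges on a key column; the `left` and `right` tables and the custom suffixes are hypothetical.

```python
import pandas as pd

left = pd.DataFrame({'key': ['a', 'b', 'c'], 'value': [1, 2, 3]})      # hypothetical tables
right = pd.DataFrame({'key': ['a', 'b', 'd'], 'value': [10, 20, 40]})

# Join on row labels: rows 0, 1 and 2 line up; overlapping column names get _x / _y suffixes
by_index = left.merge(right, left_index=True, right_index=True)

# Join on a key column instead: only keys present in both tables survive (inner join)
by_key = left.merge(right, on='key', suffixes=('_left', '_right'))

print(by_index)
print(by_key)
```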


## Reshaping

In [220]:
df4

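| | A | B | C | D |
| --- | --- | --- | --- | --- |
| 0 | foo | one | 1.998232 | 0.721194 |
| 1 | bar | one | 0.733426 | -0.119302 |
| 2 | foo | two | -0.664899 | 2.080191 |
| 3 | bar | three | -0.02739 | 1.339258 |
| 4 | foo | two | 0.585594 | 0.225102 |
| 5 | bar | two | -1.870667 | -1.100231 |
| 6 | foo | one | 1.603587 | 0.451163 |
| 7 | foo | three | 0.926215 | 0.678841 |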


In [225]:
sums = df4.groupby(["A","B"]).sum()
sums

| A | B | C | D |
| --- | --- | --- | --- |
| bar | one | 0.733426 | -0.119302 |
| bar | three | -0.02739 | 1.339258 |
| bar | two | -1.870667 | -1.100231 |
| foo | one | 3.601819 | 1.172357 |
| foo | three | 0.926215 | 0.678841 |
| foo | two | -0.079305 | 2.305293 |


In [226]:
sums.unstack()

| A | C / one | C / three | C / two | D / one | D / three | D / two |
| --- | --- | --- | --- | --- | --- | --- |
| bar | 0.733426 | -0.02739 | -1.870667 | -0.119302 | 1.339258 | -1.100231 |
| foo | 3.601819 | 0.926215 | -0.079305 | 1.172357 | 0.678841 | 2.305293 |


In [227]:
sums.unstack(level=0)

| B | C / bar | C / foo | D / bar | D / foo |
| --- | --- | --- | --- | --- |
| one | 0.733426 | 3.601819 | -0.119302 | 1.172357 |
| three | -0.02739 | 0.926215 | 1.339258 | 0.678841 |
| two | -1.870667 | -0.079305 | -1.100231 | 2.305293 |
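
What `unstack` does is move one level of the row index into the columns. A rough sketch on a smaller, hypothetical table (named `small` here, with hand-picked values) shows how the `level` argument chooses which one:

```python
import pandas as pd

small = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar'],   # hypothetical, smaller than df4
                      'B': ['one', 'one', 'two', 'two'],
                      'C': [1.0, 2.0, 3.0, 4.0]})
sums = small.groupby(['A', 'B']).sum()    # rows are indexed by (A, B) pairs

wide_by_b = sums.unstack()                # default: move the innermost level (B) into the columns
wide_by_a = sums.unstack(level=0)         # level=0: move the outermost level (A) instead

print(wide_by_b)
print(wide_by_a)
```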


In [228]:
df4

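| | A | B | C | D |
| --- | --- | --- | --- | --- |
| 0 | foo | one | 1.998232 | 0.721194 |
| 1 | bar | one | 0.733426 | -0.119302 |
| 2 | foo | two | -0.664899 | 2.080191 |
| 3 | bar | three | -0.02739 | 1.339258 |
| 4 | foo | two | 0.585594 | 0.225102 |
| 5 | bar | two | -1.870667 | -1.100231 |
| 6 | foo | one | 1.603587 | 0.451163 |
| 7 | foo | three | 0.926215 | 0.678841 |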


In [230]:
pd.melt(df4,id_vars=["A","B"],value_vars=["C","D"],var_name="columns_original_name",value_name="columns_original_value")

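| | A | B | columns_original_name | columns_original_value |
| --- | --- | --- | --- | --- |
| 0 | foo | one | C | 1.998232 |
| 1 | bar | one | C | 0.733426 |
| 2 | foo | two | C | -0.664899 |
| 3 | bar | three | C | -0.02739 |
| 4 | foo | two | C | 0.585594 |
| 5 | bar | two | C | -1.870667 |
| 6 | foo | one | C | 1.603587 |
| 7 | foo | three | C | 0.926215 |
| 8 | foo | one | D | 0.721194 |
| 9 | bar | one | D | -0.119302 |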
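
To see what `melt` is doing, it can help to run it on a table small enough to check by eye. The two-row `wide` frame below is hypothetical; the `melt` arguments are the same as in the call above.

```python
import pandas as pd

wide = pd.DataFrame({'A': ['foo', 'bar'],     # hypothetical two-row table
                     'B': ['one', 'two'],
                     'C': [1.0, 2.0],
                     'D': [10.0, 20.0]})

# Each (row, value column) pair becomes its own row,
# so 2 rows x 2 value columns -> 4 rows in the long table.
long = pd.melt(wide, id_vars=['A', 'B'], value_vars=['C', 'D'],
               var_name='columns_original_name',
               value_name='columns_original_value')

print(long)
```

The identifier columns `A` and `B` are repeated for every value column, which is why the long table has one row per original cell of `C` and `D`.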


# Other resources:

Name | Description
--- | ---
[Official Pandas Tutorials](http://pandas.pydata.org/pandas-docs/stable/tutorials.html) | Wes & Company's selection of tutorials and lectures
[Julia Evans Pandas Cookbook](https://github.com/jvns/pandas-cookbook) | Great resource with examples from weather, bikes and 311 calls
[Learn Pandas Tutorials](https://bitbucket.org/hrojas/learn-pandas) | A great series of Pandas tutorials from Dave Rojas
[Research Computing Python Data PYNBs](https://github.com/ResearchComputing/Meetup-Fall-2013/tree/master/python) | A super awesome set of python notebooks from a meetup-based course exclusively devoted to pandas