Ok, now that we have the basics of **IPython Notebooks** down, let's get to work!

As is almost always the case when working with **Python**, we are going to need more than just its basic functionality available to us as we develop our analytical pipelines. 

In order to make this additional functionality available (in particular, to be able to use **pandas**), we will rely on a couple of `import` statements.

Here they are:

In [1]:
import pandas as pd
import numpy as np

The code above did two things:

* Loaded in all of the functionality that **pandas** provides (`import pandas as pd`)
* Loaded in some additional functionality from a different package that **pandas** relies on called **NumPy** (`import numpy as np`)

Importantly, `pd` is now the alias (new name) for the entire `pandas` library and `np` is the alias for the `numpy` library. Instead of having to type `pandas.something` or `numpy.something` to access a given function, you can now just type `pd.something` or `np.something`.

So what exactly is [**pandas**](http://pandas.pydata.org) and why the funny name (we will talk about [**NumPy**](http://www.numpy.org) a bit later)?

**pandas** is a data analysis library written in and for the **Python** programming language. The name is derived from "panel data", an econometrics term for multidimensional datasets, and doubles as a play on "Python data analysis".

It provides open source, easy-to-use data structures and data analysis tools.

We will be using it exclusively for the next two days.

Before we get started with an actual dataset, let's make a dummy dataset and just understand the basics of the two main kinds of objects we will be working with in **pandas**: `Series` and `DataFrame` objects.

Here is an example `Series` stored in a variable we will call `example_series`:

In [2]:
example_series = pd.Series(range(5), index=['a', 'b', 'c', 'd', 'e'])
print example_series

a    0
b    1
c    2
d    3
e    4
dtype: int64


From the **pandas** documentation, a `Series` is "a one-dimensional labeled array capable of holding any data type (integers, strings, floating point numbers, Python objects, etc.)."

This means that it is simply a table with a single column (which, here, doesn't have a name) and an `index`, which is a set of labels that identifies every single row in that `Series`.

In our case, `example_series` contains 5 rows, whose values are the integers from 0-4 (inclusive) and whose index values are the letters a-e (inclusive).

To create a `Series` object you call `pd.Series(data,index)` where `data` is the data you want stored, and `index` is **optional**, so if you don't provide it, it will be made for you:

In [4]:
example_series_no_index_given = pd.Series(range(5))
print example_series_no_index_given

0    0
1    1
2    2
3    3
4    4
dtype: int64


By default, when you don't provide an `index` **pandas** constructs one for you, starting at 0 and ending at the number of rows found in the `Series` minus 1. 
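A quick way to convince yourself of this default behavior is to compare the index of a `Series` built with explicit labels against one built without. This is only a minimal sketch; the variable names are made up for illustration.

```python
import pandas as pd

# Same data, once with explicit labels and once without.
labeled = pd.Series(range(5), index=['a', 'b', 'c', 'd', 'e'])
unlabeled = pd.Series(range(5))

labeled_index = list(labeled.index)      # the labels we supplied
default_index = list(unlabeled.index)    # generated by pandas: 0 .. n-1

print(labeled_index)
print(default_index)
```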

To access just the values or just the index in the `Series` object, you can use the `values` or `index` attribute of the objects you just created:

In [11]:
print example_series.values
print example_series.index
print example_series_no_index_given.index

[0 1 2 3 4]
Index([u'a', u'b', u'c', u'd', u'e'], dtype='object')
Int64Index([0, 1, 2, 3, 4], dtype='int64')


As you can see, the indices of the two objects we just created are different and of different types (one has a dtype of `int64` and the other of `object`).

The `object` dtype is just **pandas**' way of saying this is a type that it knows isn't a number or a `DateTime` (this is for timeseries, we will cover them later).

You can access values in a `Series` by their `index`:

In [14]:
print example_series['a']
print example_series_no_index_given[0]

0
0


Or by their position in the `Series`, including multiple positions at a time:

In [6]:
print "A single value: ", example_series[0]

A single value:  0


When you access multiple rows, you get a series back instead of a single number:

In [7]:
print "A series in return: \n", example_series[0:2]

A series in return: 
a    0
b    1
dtype: int64
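Plain square brackets can mean either "label" or "position", which gets ambiguous once the index itself holds integers. **pandas** also provides two explicit accessors, `.loc` (labels) and `.iloc` (positions). Here is a small sketch that rebuilds the example series so the cell runs on its own:

```python
import pandas as pd

example_series = pd.Series(range(5), index=['a', 'b', 'c', 'd', 'e'])

by_label = example_series.loc['a']          # look up by index label
by_position = example_series.iloc[0]        # look up by integer position
first_two = example_series.iloc[0:2]        # positions 0 and 1 (2 is excluded)
label_slice = example_series.loc['a':'b']   # label slices include both ends

print(first_two)
```

Note the asymmetry: a position slice excludes its end point, while a label slice includes it.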


You can also rearrange the values in a `Series` when you query it:

In [24]:
print "A series rearranged: \n", example_series[['d','a','c']]

A series rearranged: 
d    3
a    0
c    2
dtype: int64


When working with `Series` objects, you can do all sorts of math and selections on them (as long as the values in the object are numbers!):

In [30]:
print "Multiplying every value in the series * 2: \n", example_series * 2,"\n"
print "Get those indices in the Series that have values greater than 1: \n",\
example_series > 1, "\n"
print "Select those values in the Series that have values greater than 1: \n",\
example_series[example_series > 1]

Multiplying every value in the series * 2: 
a    0
b    2
c    4
d    6
e    8
dtype: int64 

Get those indices in the Series that have values greater than 1: 
a    False
b    False
c     True
d     True
e     True
dtype: bool 

Select those values in the Series that have values greater than 1: 
c    2
d    3
e    4
dtype: int64


**Whenever you extract a single column from a `DataFrame` object, or whenever you compute some values on a `DataFrame` object that are only a single column, you will always get a `Series` back in return.**

Conceptually, a `Series` object behaves a lot like a **Python** `dict` object (which you should have practiced with in the pre-work!) where the `keys` are the index values in the `Series` and the `values` of the `dict` are the actual values stored in the `Series`. (Under the hood the values are stored in a **NumPy** array, which is what makes the math above fast.)

This is important to understand for the remainder of the course. If you only get a single column, it's a `Series` (which you can think of as a `dict`). If there are multiple columns together, you get a `DataFrame`.
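To make the `dict` analogy concrete, here is a short sketch that builds a `Series` from a plain `dict` and uses it in dict-like ways. The names are illustrative only.

```python
import pandas as pd

# A plain dict and the Series built from it.
as_dict = {'a': 0, 'b': 1, 'c': 2}
as_series = pd.Series(as_dict)

has_key = 'a' in as_series          # membership tests the index, like dict keys
value_for_b = as_series['b']        # lookup by key
round_trip = as_series.to_dict()    # back to a plain dict

# Unlike a dict, a Series supports math on all of its values at once.
doubled = as_series * 2
print(doubled)
```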

So let's talk about `DataFrame` objects now. 

Here is an example `DataFrame` object:

In [48]:
d = {'one': pd.Series(range(4), index=['a','b','c','e']),
    'two': pd.Series(['aa','oo','ee','ii',"yy"],index=['a','b','c','d','e'])}
example_df = pd.DataFrame(d)
print "An example dataframe:"
example_df

An example dataframe:


   one two
a  0.0  aa
b  1.0  oo
c  2.0  ee
d  NaN  ii
e  3.0  yy


`example_df` is a `DataFrame` that contains 2 columns, `one` and `two`. They have different datatypes and an `index` that is non-numeric:

In [49]:
print "The datatypes for the columns in the DataFrame: \n", example_df.dtypes ,"\n"
print "The index of the DataFrame: \n", example_df.index
print "The values in the DataFrame: \n", example_df.values

The datatypes for the columns in the DataFrame: 
one    float64
two     object
dtype: object 

The index of the DataFrame: 
Index([u'a', u'b', u'c', u'd', u'e'], dtype='object')
The values in the DataFrame: 
[[0.0 'aa']
 [1.0 'oo']
 [2.0 'ee']
 [nan 'ii']
 [3.0 'yy']]


Also, notice that in the case of our example `DataFrame`, one of the elements is labeled `NaN`: the label `d` exists in the index (it came from the second column), but no value was supplied for `d` in the first column. By default, **pandas** automatically fills in `NaN` for that value (this stands for "not a number" and is the default way that it handles nulls). This is also why the column `one` has the type `float64` even though we only put whole numbers into it.
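If you want to find the missing entries programmatically rather than by eye, the `isnull` function gives you a `True`/`False` value per row. A minimal sketch, rebuilding the example dataframe so it runs on its own:

```python
import pandas as pd

example_df = pd.DataFrame({
    'one': pd.Series(range(4), index=['a', 'b', 'c', 'e']),
    'two': pd.Series(['aa', 'oo', 'ee', 'ii', 'yy'],
                     index=['a', 'b', 'c', 'd', 'e']),
})

missing_mask = example_df['one'].isnull()            # True where the value is NaN
missing_labels = list(example_df.index[missing_mask.values])
column_dtype = str(example_df['one'].dtype)          # float, because of the NaN

print(missing_labels)
print(column_dtype)
```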

As an aside, the `u` before each letter in the `index` tells you that the characters are stored as Unicode strings (this is how **Python 2** displays them). Unicode is a common standard that allows one to represent more symbols than just ASCII can handle (things like characters in non-European languages, characters with accents, non-standard symbols, etc.).

You can also access the column names directly:

In [38]:
example_df.columns

Index([u'one', u'two'], dtype='object')

And you can access the values in a column by passing the column name to the dataframe:

In [39]:
example_df["one"]

a     0
b     1
c     2
d   NaN
e     3
Name: one, dtype: float64

You can also access all the values in a set of rows and columns by their index.

To do so, you have to treat the values in the dataframe as part of a 2-d grid and access the specific elements you want directly. If you want the whole row or column, use `:`. 

Here is an example where I simply get all of the values in the first column, just as I did above (remember, in Python indexing starts from 0, not 1):

In [36]:
example_df.ix[:,0]

a     0
b     1
c     2
d   NaN
e     3
Name: one, dtype: float64

And here is how I would access only the first two rows in the second column of the dataframe by either calling the column or via indexing on the values:

In [40]:
print "Calling the specific column: \n", example_df["two"][0:2],"\n"
print "Using pure indexing on the values: \n", example_df.ix[0:2,1]

Calling the specific column: 
a    aa
b    oo
Name: two, dtype: object 

Using pure indexing on the values: 
a    aa
b    oo
Name: two, dtype: object
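A note if you are running a recent version of **pandas**: the `.ix` accessor was deprecated in pandas 0.20 and removed in pandas 1.0. The sketch below shows the same selections using `.iloc` (pure positions) and `.loc` (labels), which exist in both old and new versions. The variable names are just for illustration.

```python
import pandas as pd

example_df = pd.DataFrame({
    'one': pd.Series(range(4), index=['a', 'b', 'c', 'e']),
    'two': pd.Series(['aa', 'oo', 'ee', 'ii', 'yy'],
                     index=['a', 'b', 'c', 'd', 'e']),
})

first_column = example_df.iloc[:, 0]              # all rows, first column
first_two_of_second = example_df.iloc[0:2, 1]     # rows 0-1, second column
same_by_label = example_df.loc['a':'b', 'two']    # label slices include 'b'
from_third_row = example_df.iloc[2:, :]           # third row to the end

print(first_two_of_second)
```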


Also, keep in mind that `0:2` actually means the indices at 0 and 1, excluding 2. 

If you want to go from some index to the end, leave off the end of the slice, as in `2:` (or, equivalently, `2::`, which is the form used below).

So, here is a way to get all of the rows in the first column from the 3rd row on (again, I will show you two ways of doing it):

In [41]:
print "Access via column name: \n", example_df["one"][2::],"\n"
print "Pure indexing: \n", example_df.ix[2::,0]

Access via column name: 
c     2
d   NaN
e     3
Name: one, dtype: float64 

Pure indexing: 
c     2
d   NaN
e     3
Name: one, dtype: float64


And this is how you would get all of the values in every column from the 3rd row down:

In [43]:
print "Calling via access on the dataframe: \n", example_df[2::], "\n"
print example_df.ix[2::,:]

Calling via access on the dataframe: 
   one two
c    2  ee
d  NaN  ii
e    3  yy 

   one two
c    2  ee
d  NaN  ii
e    3  yy


If by now you are starting to grok how you access data via pure data indexing, then you should quickly see that the following two ways to access all the data in our example dataframe are functionally equivalent:

In [24]:
print "This is one way to get the whole dataframe: \n", example_df,"\n"
print "And this one is equivalent: \n", example_df.ix[:,:]

This is one way to get the whole dataframe: 
   one two
a    0  aa
b    1  oo
c    2  ee
d  NaN  ii
e    3  yy 

And this one is equivalent: 
   one two
a    0  aa
b    1  oo
c    2  ee
d  NaN  ii
e    3  yy


Selecting and performing math on columns within a `DataFrame` object works identically to how it does in a `Series`, except you need to be careful that the type of the column you're working on matches the operation you're trying to perform:

In [34]:
example_df["one"] * 2

a     0
b     2
c     4
d   NaN
e     6
Name: one, dtype: float64

If you don't pay attention to the column's type, the behavior you get is sometimes not what you want:

In [46]:
example_df["two"] * 2

a    aaaa
b    oooo
c    eeee
d    iiii
e    yyyy
Name: two, dtype: object

What does all of this mean? How do `DataFrame` and `Series` objects relate to each other? 

A `DataFrame` is essentially a collection of `Series` objects that all share the same index.

Again, straight from the [**pandas documentation**](http://pandas.pydata.org/pandas-docs/stable/dsintro.html#dataframe):

"...`DataFrame` is a 2-dimensional labeled data structure with columns of potentially different types. You can think of it like a spreadsheet or SQL table, or a `dict` of `Series` objects."

So, at bottom, what we will be doing for this workshop is learning how to manipulate `dict`-like **Python** objects in a variety of useful ways.
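The "`dict` of `Series`" description can be checked directly. The following is a small illustrative sketch (the toy columns are made up) showing that a `DataFrame` hands out its columns much like a `dict` hands out its values:

```python
import pandas as pd

columns = {
    'one': pd.Series([0, 1, 2], index=['a', 'b', 'c']),
    'two': pd.Series(['aa', 'oo', 'ee'], index=['a', 'b', 'c']),
}
df = pd.DataFrame(columns)

column_names = list(df.keys())                          # dict-style keys are the column names
one_is_series = isinstance(df['one'], pd.Series)        # each column is a Series
same_index = df['one'].index.equals(df['two'].index)    # all columns share one index

print(column_names)
```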

This is just a basic prelude to get you to understand what we are going to be dealing with.

Just to get some practice with `DataFrame` and `Series` objects, do the following:

1. Get all of the values in the first column of `example_df`
2. Get all of the values in the second column of `example_df`
3. Get all of the values less than 2 in `example_df` and in `example_series`
4. Get the value found in the 4th row of the second column in `example_df`
5. Get the values in every column from the 4th row on in `example_df`
6. Divide every value in `example_series` by 3

In [None]:
##YOUR CODE HERE

If you look at the documentation for `read_csv`, you'll see that it is very large and provides for lots of different functionality. 

As a first pass, we just passed the path to the file in as a `string` to the `read_csv` function, without any other arguments.

Let's take a look at the first few rows and see what we get using the `head` function on our newly loaded dataset `ratingData`.

By default, the `head` function returns the first 5 rows of the `DataFrame` you call it on. You can change that to a larger or smaller number by passing a positive `integer` into `head` as an argument, like so: `ratingData.head(100)`.

The function `tail` does the exact same thing, except with the last records in the dataset.
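Here is a tiny sketch of `head` and `tail` on a made-up 10-row dataset (called `toy` here, which is not part of the workshop data), so you can see exactly which rows come back:

```python
import pandas as pd

toy = pd.DataFrame({'value': range(10)})   # hypothetical 10-row dataset

default_head = toy.head()      # first 5 rows
first_three = toy.head(3)      # first 3 rows
last_two = toy.tail(2)         # last 2 rows

print(first_three)
print(last_two)
```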

In [3]:
ratingData.head()

   1::1193::5::978300760
0   1::661::3::978302109
1   1::914::3::978301968
2  1::3408::4::978300275
3  1::2355::5::978824291
4  1::1197::3::978302268


Ok, well that looks terrible.
Let's diagnose the problems we see and make it unterrible: 

1. Everything is in a single column (so we can't separately look at ratings, user ids, movie ids, or timestamps)!
2. The first row is used as the name of the only column (we call this the **header**), which is no good, as the first record shouldn't be the header, but an actual record.
3. The timestamp is in a format that doesn't really tell us anything useful about when the ratings occurred.

So, let's use some of the additional functionality of `read_csv` to load this dataset in cleanly, and hopefully that will solve problems **1 and 2**.

In [5]:
ratingData = pd.read_csv("./movieData/ratings.dat",sep = "::",names = ['UserID','MovieID','Rating','Timestamp'])



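Don't worry about the `ParserWarning:` message (if you get one): it only tells you that **pandas** fell back to its slower Python-based parser because the separator is more than one character long, and it doesn't affect what we are doing (passing `engine = "python"` as an extra argument silences it). Let's just take a look at the data now. I'll explain exactly what I did below.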

In [4]:
ratingData.head()

   UserID  MovieID  Rating  Timestamp
0       1     1193       5  978300760
1       1      661       3  978302109
2       1      914       3  978301968
3       1     3408       4  978300275
4       1     2355       5  978824291


In [None]:
# try putting a number in between the parentheses () in the .head() function below
#YOUR CODE HERE


MUCH BETTER!

So, what did we just do?

The function `read_csv` has lots of functionality, as I had mentioned (and as you saw when you pulled up its documentation). One of its options is called `sep`, and allows you to provide your own separator for dividing the columns that you have in your dataset. 

The default separator for `read_csv` is the comma (`,`), since `csv` stands for **c**omma **s**eparated **v**alues. 

However, since `::` separated the fields in this dataset, we supplied that (again, as a `string`) as the `sep` argument instead.

A separate argument, `names`, allows you to pass in your own list of names, again as `string` values, to `read_csv` to be treated as the column names (or the **header**) of the dataset. Since we knew what the names for the columns should be, we put them in.
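If you want to experiment with `sep` and `names` without touching the real file, you can feed `read_csv` a few lines held in memory. This is only a sketch: the three lines are copied from the raw output shown earlier, `io.StringIO` stands in for the file, and `engine="python"` is my addition to avoid the `ParserWarning`.

```python
import io
import pandas as pd

# Three lines in the same layout as ratings.dat, held in memory.
raw = u"1::1193::5::978300760\n1::661::3::978302109\n1::914::3::978301968\n"

sample = pd.read_csv(
    io.StringIO(raw),
    sep="::",                                            # multi-character separator
    names=['UserID', 'MovieID', 'Rating', 'Timestamp'],  # our own header
    engine="python",                                     # assumption: named explicitly to avoid the warning
)

print(sample)
```

Because `names` is supplied, the first line is treated as data rather than as a header.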

Now that our dataset looks more reasonable, let's do a couple of brief sanity checks and then fix issue **3.**

Specifically, let's make sure that:

1. All our data is in the format that we expect (everything is an `int`).
2. The ratings range across the values we expect (1-5 and nothing else).

The property `dtypes` is accessible from our `ratingData` object, and tells us the types of the data in all of our columns (which addresses **1.**).

In [77]:
ratingData.dtypes

UserID       int64
MovieID      int64
Rating       int64
Timestamp    int64
dtype: object

Ok, so everything appears to be an `int`, as `int64` is a 64-bit `integer` (positive or negative whole number) data type that can represent very large numbers.

As an important aside about **pandas**, all the values in a given column share a single data type. So, **if even one value were not a whole number (3.5, for example), the values for the entire column would be inferred to be something else (either a `float64`, or an `object` if any of the entries were `strings`).**
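You can watch this type inference happen on three tiny made-up columns. This is a sketch for illustration; the exact integer width can depend on your platform.

```python
import pandas as pd

all_ints = pd.Series([1, 2, 3])            # only whole numbers
one_float = pd.Series([1, 2, 3.5])         # a single non-whole number
one_string = pd.Series([1, 2, 'three'])    # a single string

inferred = [str(s.dtype) for s in (all_ints, one_float, one_string)]
print(inferred)
```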

Now let's address **2.** by looking at all of the unique entries in the `Rating` column using `unique` (we expect there to be 5 unique values, 1-5 inclusive).

In [78]:
ratingData.Rating.unique()

array([5, 3, 4, 2, 1])

As long as your column names are valid **Python** names (no spaces or characters like `!`, `\`, or `/`) and do not clash with the name of something the `DataFrame` already provides, you can simply access the values in a column by doing `dataFrameName.columnName`, where `dataFrameName` is the name of your `DataFrame` object and `columnName` is the name of your column.

The function `unique` is available on any column (a `Series`), and simply returns all of the unique values within our column of choice as a **NumPy** array.

However, if your dataset contains weird column names, you have another way of accessing columns:

In [None]:
ratingData["Rating"].unique()

In [None]:
# try calling the .unique() function by using other column names in the dataset.
# Data science is all about getting to know your data, so get to know the unique values of the columns in the dataset!
#YOUR CODE HERE



In this way of accessing the `Rating` column, we have to pass the name of the column as a `string` (in quotes "").
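The difference between the two access styles shows up as soon as a column name contains a space. In this sketch the `toy` dataframe and its `User ID` column are hypothetical:

```python
import pandas as pd

toy = pd.DataFrame({'Rating': [5, 3, 3, 4], 'User ID': [1, 1, 2, 2]})

via_attribute = toy.Rating.unique()      # works: the name is a valid Python name
via_brackets = toy['Rating'].unique()    # always works
user_ids = toy['User ID'].unique()       # the space makes attribute access impossible

print(via_brackets)
print(user_ids)
```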

What if we wanted to access multiple columns? Here's how you would do that (attribute access does not work for more than one column):

In [80]:
ratingData[["Rating","UserID"]]
##OR
multipleColumns = ["Rating","UserID"]
ratingData[multipleColumns]

   Rating  UserID
0       5       1
1       3       1
2       3       1
3       4       1
4       5       1
5       3       1
6       5       1
7       5       1
8       4       1
9       4       1


In the multi-column case, you must pass the columns you are interested in as a `list` of `string` values (or as a `variable` that points to that list).

In [None]:
# pass in other multiple columns in as column names into the ratingData object
#YOUR CODE HERE



Now, on to fixing issue **3**.

**pandas** has pretty fantastic date conversion functionality, as long as you know the format of the date data you are using. We know the format of our `Timestamp` column, so we are good to go.

As a refresher, it is the number of seconds since the Unix epoch (midnight UTC on January 1, 1970).

The pandas library has a function called `to_datetime` that, when given a column and some optional parameters, converts the timestamps into proper datetime values that display in a nice, human-readable format.

Let's create a new column called `FormattedTimestamp` by formatting our `Timestamp` column:

In [6]:
ratingData['FormattedTimestamp'] = pd.to_datetime(ratingData.Timestamp,unit = 's')

Ok, there is a lot going on here: we are creating a new column `FormattedTimestamp` from a computation on another column. Let's work through it.

To create a new column, you just use the same syntax as when you want to select a column, except you use a name that isn't found in the dataset's column list. 

Everything following the `=` sign in the expression is what you want to put into that new column.

And what we did following the equals sign is:

1. We passed the `Timestamp` column of our `ratingData` dataset as a mandatory parameter.
2. We supplied an optional parameter called `unit`, giving the unit of our data as a `string` (`'s'` for seconds).

(Look at the documentation, and you can see there is lots more stuff you can do with `to_datetime`!).
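To see what the conversion does to individual values, here is a sketch that converts two of the raw timestamps from the first few rows of the dataset. The variable names are illustrative.

```python
import pandas as pd

timestamps = pd.Series([978300760, 978824291])     # seconds since the Unix epoch
converted = pd.to_datetime(timestamps, unit='s')   # now proper datetime values

as_text = [str(t) for t in converted]
print(as_text)
```

The first value lands late on December 31, 2000 and the second in early January 2001.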

Now, we can simply remove the old `Timestamp` column, since we don't need it anymore!

To do so, you simply use **Python**'s `del` statement on the column, selected from the dataframe by name:

In [7]:
print "Columns before removal: ", ratingData.columns
del ratingData['Timestamp']
print "Columns after removal: ", ratingData.columns

Columns before removal:  Index([u'UserID', u'MovieID', u'Rating', u'Timestamp', u'FormattedTimestamp'], dtype='object')
Columns after removal:  Index([u'UserID', u'MovieID', u'Rating', u'FormattedTimestamp'], dtype='object')


The `del` operation simply deletes the column you specify from the given `DataFrame`, changing that `DataFrame` in place.
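If you would rather keep the original `DataFrame` untouched, the `drop` function returns a new `DataFrame` without the column instead of modifying the existing one. A minimal sketch on a made-up two-column dataset (the `columns=` keyword assumes pandas 0.21 or newer):

```python
import pandas as pd

toy = pd.DataFrame({'Rating': [5, 3], 'Timestamp': [978300760, 978302109]})

without_timestamp = toy.drop(columns=['Timestamp'])   # returns a new DataFrame
remaining = list(without_timestamp.columns)
original_untouched = 'Timestamp' in toy.columns       # toy itself is unchanged

print(remaining)
```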

One more thing about dates before we move on: once you've converted them to datetime values as we did above, you can access all kinds of information from each date through the `dt` accessor of the column that stores your dates.

So, as an example, getting the year of every row in our dataset is as simple as:

In [7]:
ratingData.FormattedTimestamp.dt.year

0     2000
1     2000
2     2000
3     2000
4     2001
5     2000
6     2000
7     2000
8     2000
9     2000
10    2001
11    2000
12    2000
13    2000
14    2000
...
1000194    2000
1000195    2000
1000196    2000
1000197    2000
1000198    2000
1000199    2000
1000200    2000
1000201    2000
1000202    2000
1000203    2000
1000204    2000
1000205    2000
1000206    2000
1000207    2000
1000208    2000
Length: 1000209, dtype: int64

Other properties (like month, day, hour, day of week, etc.) are available as well.

Again, just check the [timestamp documentation](http://pandas.pydata.org/pandas-docs/stable/timeseries.html).
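As a sketch of a few of these properties, the example below pulls several components out of two converted timestamps. The series `stamps` is a stand-in for the `FormattedTimestamp` column.

```python
import pandas as pd

stamps = pd.to_datetime(pd.Series([978300760, 978824291]), unit='s')

years = list(stamps.dt.year)
months = list(stamps.dt.month)
hours = list(stamps.dt.hour)
weekdays = list(stamps.dt.dayofweek)   # Monday=0 ... Sunday=6

print(years)
print(weekdays)
```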

In [None]:
#try using the other properties found in the dt accessor: type ratingData.FormattedTimestamp.dt. and press the Tab key
#to see what else is offered
#YOUR CODE HERE




One note about the outputs of these datetime **properties**: when you access one on an entire column, you get a `Series` object returned to you, which is the **pandas** representation of a single column.

`Series` objects are different from `DataFrame` objects (which we've been working with exclusively thus far), but a `Series` can be appended (attached) as a new column to your dataset (which is already a `DataFrame` object) without any problems.

So if you want to store the `year` of every row in a new column called `year` then do:

In [29]:
ratingData["year"] = ratingData.FormattedTimestamp.dt.year

In [None]:
#make another column called "month" and call the appropriate function to store the month of each rating
#do the same thing for "day", just to get a good handle on the kinds of things you can do
#YOUR CODE HERE



Let's move on and get a better feel for our dataset now.

What if we wanted to know the exact shape of our dataset? That is, the *exact* number of rows and columns found in it? (I told you there were ~1,000,000 ratings here, but exactly how many are there?)

We can use another **property** that is available in our `ratingData` object called `shape`:

In [7]:
ratingData.shape

(1000209, 5)

The `shape` property tells us the number of dimensions and the number of values in each dimension in our dataset (we can have 1, 2, or more dimensions in our dataset, after all), as a tuple (a very common datatype in **Python** that has a fixed size and cannot be changed once it is created).

As a brief aside, you saw that I just said **property** and not **function**, right? 

That means this value is available to us as part of the dataset itself: you read it without parentheses and it takes no inputs (in CS parlance, it is read-only). Functions like `unique()`, `head()`, `tail()`, `max()`, `mean()` and all the others, on the other hand, have to be called with parentheses, can take inputs, and compute their result every time you call them.

(Another example property that is available to you and that you've already used is `dtypes`)

Ok, back to `shape`.

The number of dimensions in our dataset is the length of the tuple (the number of commas + 1), and the numbers between commas tell us the number of distinct values in each dimension. So the **length is 2 because all we have are rows and columns.** 

So, we have **1,000,209 distinct values in the first dimension** (here they are our ratings, or **rows**) and **5 distinct values in the second dimension** (here the second dimension is our **columns**: the four we had after replacing `Timestamp` with `FormattedTimestamp`, plus the `year` column we just added; your count will be higher if you also added `month` and `day`).
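Because `shape` is an ordinary tuple, you can unpack it or index into it like any other tuple. Here is a small sketch on a made-up dataset:

```python
import pandas as pd

toy = pd.DataFrame({'UserID': [1, 1, 2], 'Rating': [5, 3, 4]})

n_rows, n_cols = toy.shape            # a 2-tuple: (rows, columns)
n_dims = len(toy.shape)               # 2, because we have rows and columns
column_shape = toy['Rating'].shape    # a Series is 1-dimensional: (rows,)

print(toy.shape)
print(column_shape)
```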

Now that you have a bit of **pandas** functionality at your disposal, you should give me the answer to the following questions by writing a bit of code:

1. How many users are there in the dataset?
2. How many movies are in the dataset?
3. How many unique times are in the dataset?

**Hint: All you need is the `unique()` function and the `shape` property to answer these questions!**

In [None]:
##WRITE YOUR CODE HERE, replace "pass" with your code in all 3 cases
numUsers = pass
numMovies = pass
numTimes = pass

print "There are",numUsers[0],"unique users,",numMovies[0],\
" unique movies, and",numTimes[0],"unique times in this dataset."

Awesome! You've written your first bit of code using **pandas** and have actually answered some useful questions about this dataset! Pats on the back all around, and let's keep exploring!

Now, let's look at the ratings in the dataset to get a feel for their central values and spread:

In [None]:
print "The average rating in the dataset is:",ratingData.Rating.mean()
print "The middle rating in the dataset is:",ratingData.Rating.median()
print "The standard deviation of the ratings is:",ratingData.Rating.std()

So it looks like the people in our dataset don't like to rate movies too low, as the average rating is above 3 (which is the midpoint of a 1-5 scale).

The functions `mean`, `median`, and `std` compute exactly what you expect (the mean, median, and standard deviation {or spread} of a given set of values) and can even be applied to the entire dataset (although in our case, that doesn't make too much sense, as the other numeric columns, such as `UserID` and `MovieID`, are just numeric labels for categories).

In [None]:
print "means: \n", ratingData.mean(),"\n"

In [None]:
print "medians: \n", ratingData.median(), "\n"

In [None]:
print "std. deviations: \n", ratingData.std()

Other descriptive stats can be computed as well (min/max, sum, variance, skew, kurtosis, etc.). Just take a look at the [descriptive statistics](http://pandas.pydata.org/pandas-docs/stable/basics.html#descriptive-statistics) documentation!  
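To check your intuition for what these functions return, here is a sketch on five made-up ratings that are small enough to verify by hand. One detail worth knowing: `std` computes the *sample* standard deviation (it divides by n - 1) unless you pass `ddof=0`.

```python
import pandas as pd

ratings = pd.Series([5, 3, 3, 4, 5])       # hypothetical ratings

average = ratings.mean()                   # (5+3+3+4+5) / 5
middle = ratings.median()                  # middle value of 3, 3, 4, 5, 5
spread = ratings.std()                     # sample standard deviation (ddof=1)
population_spread = ratings.std(ddof=0)    # divide by n instead of n - 1

print(average)
print(spread)
```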

In [None]:
#Look at the descriptive statistics documentation and print a couple more descriptive stats on just the 
#"Rating" column. You get to choose what stats you want to print
#YOUR CODE HERE


Now time for a little trick! 

If you want all these basic statistics and a few more, just use the `describe` function, which works just like the others mentioned above (you can call it on just a single column, or the whole dataset).

In [9]:
ratingData.Rating.describe()

count    1000209.000000
mean           3.581564
std            1.117102
min            1.000000
25%            3.000000
50%            4.000000
75%            4.000000
max            5.000000
Name: Rating, dtype: float64

Now let's learn how to subselect values within dataframes.

Subselection in **pandas** works along the following very common pattern:

1. You create a condition that can be evaluated to either **true** or **false** for every row in the dataset and store the outcome in a variable (this is traditionally called a **mask**).
2. You apply that **mask** onto your `DataFrame` (dataset).

Let's try this subselection and application with the ratings in our dataset by getting only the really low ratings (let's say low ratings are those that are < 3).

We will create a mask that expresses our "crappy ratings" condition, and then apply that mask to our dataset.

In [88]:
crappyRatingMask = ratingData.Rating < 3
crappyRatings = ratingData[crappyRatingMask]

In [91]:
print crappyRatings.Rating.unique() # This is a sanity check to make sure that our filtered data contains only low ratings
crappyRatings.head(20)

[2 1]


     UserID  MovieID  Rating   FormattedTimestamp
67        2     1213       2  2000-12-31 21:34:18
73        2      434       2  2000-12-31 22:02:54
75        2     3107       2  2000-12-31 22:00:02
83        2      902       2  2000-12-31 21:41:45
91        2     3256       2  2000-12-31 21:57:19
114       2     3699       2  2000-12-31 21:46:13
125       2     2427       2  2000-12-31 21:58:33
133       2       95       2  2000-12-31 22:02:23
148       2       21       1  2000-12-31 21:57:19
151       2     1090       2  2000-12-31 21:36:20


Awesome, this worked as expected!
How many of these crappy ratings are there? (Fill in what is in **pass** so that numCrappyRatings returns a number)

In [None]:
numCrappyRatings = pass
print "There were",numCrappyRatings,"ratings below 3."

In [None]:
#make another mask that filters out user ids that are less than 30 and get the size of the resulting dataset
#(its shape)
#YOUR CODE HERE



Now it is time for you to show how much you've learned. Give me the answers to the following questions:

1. What was the average rating in January?
2. What was the average crappy rating in January?

**Hint:** Although there are lots of ways you can tackle these questions (many of which you will learn soon), I suggest you use the following procedure:

1. Create a new column in `ratingData` called "month" that tells you the month of every row.
2. Create a mask from the new column that selects the January records, apply it to `ratingData`, and store the result in `januaryRatings`.
3. Create a mask of crappy ratings from `januaryRatings` and call it `crappyJanuaryRatingsMask`.
4. Apply `crappyJanuaryRatingsMask` to `januaryRatings` and store the result in a new dataset called `crappyJanuaryRatings`.
5. Compute the mean of the `Rating` column for both `januaryRatings` and `crappyJanuaryRatings`.

In [8]:
#YOUR CODE HERE



Ok, now that you know how to do some basic subselection, sorting, and calculations on data, we are going to do something a bit more complicated, and start subdividing our data into groups to be able to answer some more general questions about our dataset.

Once you have this functionality down, you will be able to: 

1. Answer more interesting kinds of questions
2. Answer the questions above using fewer lines of code 

Let's say you wanted to know or do the following:

1. **In what month did users rate the most movies?**
2. **What month had the highest average rating?**
3. **Remove users with too few ratings (let's say < 30) and reanswer these same questions**

Our approach here will be:

* Learn to use the `groupby` functionality of **pandas** to create subgroups of our ratings based on either the `month` the ratings were given or on the `UserID` of the rater.
* Apply an aggregating function to these groups to return:
  * The `size` of each group (since the `size` of each group is either the number of ratings in that `month` or the number of ratings for that `UserID`)
  * The `mean` of the `Rating` column within each group
  * A filtered version of the original dataset so that only the groups that are large enough (when grouping on `UserID`, those users that have made enough ratings) are returned.

The `groupby` function in **pandas** is analogous to the grouping operations you may be familiar with if you've ever used any **SQL** variants.

A generic **SQL** translation of **1.**, for example, would look something like:

```
SELECT month, COUNT(*) AS numRatings
FROM ratingData
GROUP BY month
ORDER BY numRatings DESC
LIMIT 1
```

(If this looks like wizardry, don't worry, I'm just trying to show this to those users that are familiar with SQL and any of its variants; this is the only SQL you will see all weekend!)

Grouping can get very complicated, but as a first pass you can think of it as a way to split your dataset into non-overlapping subsets along any axis (along rows or columns, in our case). 

The values along which you **group** your dataset are traditionally called **keys**, so **each key is unique to its group, and each group has exactly one key associated with it** (although the key for identifying each group can be really complicated).

Once you've grouped your dataset, the `GroupBy` object isn't too useful by itself. It becomes useful when you apply a **transformation** to it and get a new dataset back. 

We typically call this **transformation** an **aggregation**, as we are getting some aggregate value back for each group.

The **aggregation** functions we will be using for questions **1.** and **2.** are `size` and `mean`.
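Before turning to the ratings, here is the split-then-aggregate idea on a tiny made-up dataset. The `team` and `score` columns are hypothetical and chosen so you can check the numbers by hand.

```python
import pandas as pd

toy = pd.DataFrame({
    'team':  ['red', 'blue', 'red', 'blue', 'red'],
    'score': [1, 4, 3, 2, 5],
})

groups = toy.groupby('team')            # split: one group per distinct key
group_sizes = groups.size()             # aggregate: number of rows per group
group_means = groups['score'].mean()    # aggregate one column within each group

print(group_sizes)
print(group_means)
```

Each result is a `Series` whose index holds the group keys (`blue` and `red` here).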

So, enough explaining, let's get to some hacking.

Let's address grouping in the context of answering our first question:

1. **In what month did users rate the most movies?**

In **pandas**, to create groups, you must create a `GroupBy` object (more on what that is later) from your `DataFrame` object (your dataset) by passing the values along which you want to group to the `groupby` function.

We want to **groupby** the **month** column and store it:

In [9]:
monthGroups = ratingData.groupby("month")

In [None]:
#try to print monthGroups
#what do you get?
#YOUR CODE HERE



And then we want to get the number of records (rows) in each group in our `monthGroups` object using the `size` function (which is accessible from this object):

In [31]:
ratingsPerMonth = monthGroups.size()
ratingsPerMonth #this will simply print the result inside of this notebook so we can see it
#to print to the screen outside of the notebook, you would have to say "print ratingsPerMonth"

month
1         23072
2         12128
3          8537
4         19407
5         74278
6         61110
7         97004
8        188674
9         56791
10        45500
11       295461
12       118247
dtype: int64

And just to make it nice and clean, we will simply output the month with the largest number of ratings using the `max` function:

In [17]:
monthWithMostRatings = ratingsPerMonth[ratingsPerMonth==ratingsPerMonth.max()]
monthWithMostRatings #same as above, prints to the notebook without having to say "print monthWithMostRatings"

month
11       295461
dtype: int64
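If you only need the month itself rather than a one-row `Series`, `idxmax` is an alternative worth knowing. This is just a sketch on made-up counts (`toyCounts` is a hypothetical stand-in for `ratingsPerMonth`):

```python
import pandas as pd

# Hypothetical counts standing in for ratingsPerMonth
toyCounts = pd.Series([120, 45, 300, 80],
                      index=pd.Index([1, 2, 3, 4], name="month"))

# Boolean-mask approach used above: returns a Series (label AND value)
viaMask = toyCounts[toyCounts == toyCounts.max()]

# idxmax returns only the index label of the largest value
busiestMonth = toyCounts.idxmax()   # 3
```

One difference to keep in mind: if several months tie for the maximum, the mask returns all of them, while `idxmax` returns only the first.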

Got it? Good!

One more thing before you try it yourself. Once you've created a `GroupBy` object, you don't have to select everything in the object to perform an aggregation; you can subselect a given column within each group (**Hint, Hint!**).

Now you try answering question **2.**:

2. **What was the average rating given to movies in each month of the year?**

Fill in some code for `avgMonthlyRatings` and `highestAvgRating`, removing the `pass` in both cases (remember to use the `monthGroups` object!):

In [11]:
avgMonthlyRatings = pass
print avgMonthlyRatings
highestAvgRating = pass
highestAvgRating

SyntaxError: invalid syntax (<ipython-input-11-32c3ae1eb04a>, line 1)

These aggregations are already implemented as methods on the `GroupBy` object and can be called directly from it, but that is only a shortcut for a more general mechanism.

Conceptually, you can think of **pandas** as handing the function we apply (`mean` or `size` in our case) to another function called `aggregate`, and that function does the heavy lifting. 

So, when we wrote our code to answer question **1.** above like this:

`ratingsPerMonth = monthGroups.size()`

we could equivalently have written something more like this (note that the function itself is passed, without calling it):

``ratingsPerMonth = monthGroups.Rating.aggregate(np.size)``

This means that whenever we want to do more complicated transformations or reductions of our data, we should supply our function(s) of interest to `aggregate` or its shorthand, `agg`.
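Here is a small sketch of that equivalence on made-up data (the `toy` frame and the `ratingRange` helper are hypothetical). Besides function objects, **pandas** also accepts the name of a built-in aggregation as a string, which is what this example uses:

```python
import pandas as pd

# Hypothetical toy dataset
toy = pd.DataFrame({
    "month":  [1, 1, 2, 2, 2],
    "Rating": [4, 5, 3, 2, 4],
})
toyGroups = toy.groupby("month")

# Calling the method directly and going through agg give the same result
direct = toyGroups.Rating.mean()
viaAgg = toyGroups.Rating.agg("mean")   # aggregation given by name
sameResult = direct.equals(viaAgg)      # True

# agg also accepts your own function; it receives one group's values at a time
def ratingRange(values):
    return values.max() - values.min()

ranges = toyGroups.Rating.agg(ratingRange)   # month 1 -> 1, month 2 -> 2
```

The custom function is where `agg` earns its keep: there is no built-in `ratingRange` method on the `GroupBy` object, but `agg` will happily apply it to every group.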

A couple more explanations before we tackle filtering for answering question **3.**

You can pass multiple functions to `aggregate` as a list, if you want multiple aggregations applied to the data, like so:

In [18]:
monthGroups.Rating.agg([np.mean,np.size,np.std])

| month | mean | size | std |
|---|---|---|---|
| 1 | 3.542996 | 23072 | 1.075643 |
| 2 | 3.523664 | 12128 | 1.089588 |
| 3 | 3.456952 | 8537 | 1.07946 |
| 4 | 3.522028 | 19407 | 1.096929 |
| 5 | 3.601847 | 74278 | 1.137355 |
| 6 | 3.616528 | 61110 | 1.110452 |
| 7 | 3.617985 | 97004 | 1.092208 |
| 8 | 3.566315 | 188674 | 1.122584 |
| 9 | 3.602279 | 56791 | 1.113047 |
| 10 | 3.609143 | 45500 | 1.08717 |


In all of these cases, you will notice that we take the functions from the **NumPy** module (using the `np` alias), which was installed when you got **pandas** working on your system (since **NumPy** is a dependency of **pandas** and gives **pandas** all of the math and matrix wrangling functionality it relies on behind the scenes).

**NumPy** is a really powerful matrix and math library in **Python** and has lots of functionality we won't go into here, so if you're interested, head over to their [website](http://www.numpy.org) to learn more!

Now, let's move on to filtering.

Let's go through how to answer question **3.**:

* **Remove users with too few ratings (let's say < 30) and reanswer these same questions**

Here is our pipeline:

1. `groupby ratingData on "UserID"` and call this new `GroupBy` object `userGroups`
2. `filter` so that only groups (users) containing >= 30 ratings are kept in a new `DataFrame`, called `filteredRatings`
3. `groupby filteredRatings on "month"` and call this new `GroupBy` object `filteredMonthGroups`
4. recompute `mean` and `size` statistics on the `Rating` column in `filteredMonthGroups` and store them in a variable called `filteredMonthAggs`
5. use `max` to see whether the months with the largest mean rating and the largest number of ratings have changed

In [19]:
userGroups = ratingData.groupby("UserID")

This first step is very similar to what we did before, except we are grouping on a different column, `UserID`.

Now comes the more challenging part: using `filter`, which involves learning a bit about **anonymous functions**. Here goes:

In [25]:
filteredRatings = userGroups.filter(lambda x: x.Rating.size >= 30)

**AAAAA WHAT IS THAT? lambda? x? What are these things?**

Ok, let's take a breath and work through this...

The `filter` function takes a function as an input and requires that the function you pass to it return either `True` or `False` on a per-group basis. 

It then returns a new `DataFrame` object containing the rows of the original dataset, minus every group that doesn't satisfy the constraints of the function you passed to it (that is, removing those groups for which the function returns `False`).
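A quick sketch of this behavior on a made-up dataset (the `toy` frame, the `hasEnoughRatings` helper, and the threshold of 2 are all hypothetical, chosen so the result is easy to check by eye):

```python
import pandas as pd

# Hypothetical toy dataset: user 1 has 3 ratings, user 2 has 2, user 3 has 1
toy = pd.DataFrame({
    "UserID": [1, 2, 1, 3, 2, 1],
    "Rating": [5, 3, 4, 2, 4, 1],
})

# Called once per group; must return True (keep the group) or False (drop it)
def hasEnoughRatings(group):
    return len(group) >= 2

kept = toy.groupby("UserID").filter(hasEnoughRatings)

keptUsers = sorted(kept.UserID.unique().tolist())   # [1, 2]
keptRows = kept.index.tolist()                      # [0, 1, 2, 4, 5]
```

Every row belonging to user 3 is gone, and the surviving rows are ordinary rows of the original `DataFrame`, not a `GroupBy` object.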

So, why are we using this weird `lambda` thing? 

Well, `lambda` is the keyword for creating an **anonymous function** in **Python**.

**Anonymous functions** are functions that you define right where you need them and that:

1. Usually accomplish some very minimal functionality
2. Don't need the extra syntax that functions usually come with (a `def`, a name, a `return` statement) because of 1., which keeps your code compact (and hopefully clear).

To declare an anonymous function you:

1. type `lambda`, followed by
2. arbitrary names for the parameters that function accepts (x,y, etc.) separated by commas, followed by
3. a colon, which tells **Python** that we are done with specifying the parameters, followed by
4. the expression that defines how the function operates on the parameters. This expression will dictate what the function returns.

So:
`lambda x: x.Rating.size >= 30` means:

1. **This function accepts a single parameter (in our case the group) arbitrarily called x.**
2. **It operates on this parameter, x, by looking up its `Rating` column (which every group has, because `ratingData` has it) and checking whether that column's size property is greater than or equal to 30.**
3. **Because this function is checking a condition, it will return either `True` or `False` for every group you pass to it.**
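If it helps, here is the same idea stripped of all the **pandas** machinery: a minimal plain-Python sketch (all names here are made up for illustration) showing that a `lambda` behaves exactly like a small named function:

```python
# A named function and the equivalent anonymous function
def atLeastThirty(n):
    return n >= 30

atLeastThirtyLambda = lambda n: n >= 30

counts = [12, 30, 45]
namedResults = [atLeastThirty(c) for c in counts]          # [False, True, True]
lambdaResults = [atLeastThirtyLambda(c) for c in counts]   # [False, True, True]

# Two parameters, separated by a comma, then the colon, then the expression
add = lambda x, y: x + y
total = add(2, 3)   # 5
```

In practice you rarely store a `lambda` in a variable like this; the point of an anonymous function is to write it inline, right where it gets passed to something like `filter`.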

If you understand that, you grok **anonymous functions**.

**However, if you don't understand WTF just happened, you can actually write your own function, and just pass it directly to the filter!**

So, instead of:

`filteredRatings = userGroups.filter(lambda x: x.Rating.size >= 30)`

You can do this (they are functionally equivalent!):
```
def filterFewestRatings(dataset):
    return dataset.Rating.size >= 30
    
filteredRatings = userGroups.filter(filterFewestRatings)
```

Here, we've replaced the variable `x` in the anonymous function with a more easily understood variable `dataset` in the named (non-anonymous) function `filterFewestRatings`. This has the same exact functionality as the **anonymous function**, but is clearly more code to write. 

**It's your choice as to how you want to implement small functions like this.**

Let's look at the filtered ratings and see how many ratings and users we eliminated before we recompute our answers:

In [193]:
print "Original number of ratings:",ratingData.shape[0]
print "Filtered number of ratings:",filteredRatings.shape[0]
print "Fraction of ratings eliminated:",(1.0-float(filteredRatings.shape[0])/ratingData.shape[0])

Original number of ratings: 1000209
Filtered number of ratings: 980300
Fraction of ratings eliminated: 0.0199048398885


In [194]:
print "Original number of users:",ratingData.UserID.unique().size
print "Filtered number of users:",filteredRatings.UserID.unique().size
print "Fraction of users eliminated:",(1.0-float(filteredRatings.UserID.unique().size)/ratingData.UserID.unique().size)

Original number of users: 6040
Filtered number of users: 5231
Fraction of users eliminated: 0.133940397351


So we got rid of about 2% of the ratings, but over 13% of the users!

Now that we've filtered, you should be able to do the rest and recompute the answers:

In [200]:
filteredMonthGroups = pass
filteredMonthAggs = pass

(940925, 5)

To hammer all of this home, try to answer the following:

**What was the ID of the movie with the highest average rating in the first half of the year, among those movies that were rated at least 5 times within that timespan?**

Fill in the next cell by replacing the `pass` (or make more cells if you need to) and tell me when you know the answer (but quietly, so you don't tell everyone else!).

In [None]:
pass

A couple more data analysis functions before I set you loose on a completely new dataset!

**pandas** provides some pretty cool basic statistical functionality apart from the really simple functions we've used so far (like `mean`, `std`, `max`, and `size`).

What I'm talking about are statistical functions that compare two variables, specifically  **covariance** and **correlation**. 

Briefly, **covariance** is a way to tell whether two variables tend to move in the same or opposite directions, but because its size depends on the units of the variables, it cannot measure the absolute strength of this relationship (positive covariance means that as one variable increases, the other tends to increase as well; negative covariance means the other tends to decrease). 

**Correlation**, on the other hand, can tell you both whether two variables tend to move with or against each other, and how strong this relationship between them is.
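A tiny sketch of that difference, using made-up height and weight numbers (hypothetical data, constructed so that weight is perfectly linear in height). Changing the units changes the covariance but leaves the correlation alone:

```python
import pandas as pd

# Hypothetical data: weight is an exact linear function of height
heightM = pd.Series([1.5, 1.6, 1.7, 1.8, 1.9])
weightKg = pd.Series([50.0, 58.0, 66.0, 74.0, 82.0])
heightCm = heightM * 100   # same heights, different units

covMeters = heightM.cov(weightKg)   # about 2.0
covCm = heightCm.cov(weightKg)      # about 200.0, 100x larger just from the units

corrMeters = heightM.corr(weightKg)   # about 1.0
corrCm = heightCm.corr(weightKg)      # still about 1.0
```

The covariance only tells us the sign of the relationship here; the correlation of (about) 1.0 tells us the relationship is as strong as a linear relationship can be, regardless of units.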

I will be showing you a fairly contrived example here, but you will be using this functionality in the actual assignment you'll be working on.

I will keep this short and sweet, because I know you are dying to get started on your own!

To correlate two columns using the standard Pearson correlation in **pandas**, simply do something like this, where we correlate `month` with `Rating`:

In [205]:
ratingData.month.corr(ratingData.Rating)

0.00097862025825141258

If you want to get the full correlation matrix among all variables that are of a numeric type, just call `corr` on the whole `DataFrame`. In the version of **pandas** used here, non-numeric columns are excluded from the resulting correlation matrix automatically (newer versions, 2.0 and later, need `numeric_only=True` when non-numeric columns are present):

In [206]:
ratingData.corr()

| | UserID | MovieID | Rating | month |
|---|---|---|---|---|
| UserID | 1.0 | -0.017739 | 0.012303 | -0.633559 |
| MovieID | -0.017739 | 1.0 | -0.064042 | -0.003123 |
| Rating | 0.012303 | -0.064042 | 1.0 | 0.000979 |
| month | -0.633559 | -0.003123 | 0.000979 | 1.0 |


Obviously, this is a contrived example because **the only truly numeric column here is Rating**; all the other columns (except for `FormattedTimestamp`) are numeric labels for categorical items. 

What this means is that there is no way we can actually say that one `UserID` is greater than some other `UserID` because the mapping from the actual user's identity to the `UserID` number associated with them is arbitrary, and the same thing applies to every other numeric column here except for `Rating`. 

And because we can't rank any of the entries in the other columns in any meaningful way, **the correlation between them is meaningless.** 
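You can convince yourself of this with a small sketch on made-up data (the `toy` frame and the relabeling are hypothetical): handing the very same users different, equally arbitrary ID numbers changes the correlation, even though nothing about the users or their ratings changed.

```python
import pandas as pd

# Hypothetical toy dataset: three users, two ratings each
toy = pd.DataFrame({
    "UserID": [1, 1, 2, 2, 3, 3],
    "Rating": [1, 2, 3, 3, 4, 5],
})

# With these labels, UserID and Rating look strongly positively correlated
original = toy.UserID.corr(toy.Rating)

# Relabel the same three users with different, equally arbitrary numbers
relabeled = toy.UserID.map({1: 3, 2: 1, 3: 2})

# Same users, same ratings, but the correlation is now negative
afterRelabel = relabeled.corr(toy.Rating)
```

A statistic that flips sign when you rename things is not telling you anything about the data itself.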

I simply wanted to show you how you would do this, given the right situation (which you will experience shortly!).

To consolidate what you've learned, answer the following questions about this dataset:

1. Which user rated the most movies?
2. Which movie id got the highest average rating?
3. Which movie id had the most varied rating?
   * Of those movies that have been rated at least 5 times?
4. In which month were raters the most generous on average?
5. Which month had the most variation in ratings?
6. Which user had the most variation in ratings?
7. On what day of the week did users rate the most movies?
8. Which hour in the day did users rate the most movies?
9. What hour in the day had the highest average movie rating?
10. Which user gave movies the highest average rating?

OK, now you've worked through a single dataset and learned a little **pandas**.

The next notebook will be a complete assignment that you will have to work through, showcasing what you've learned so far!