In this lesson we are going to learn how to perform a series of transformations in **pandas** quickly and concisely. Recent versions of **pandas** include a lot of very useful functionality, and I want to expose all of you to it.

So, let's get started.

In [1]:
import pandas as pd
import numpy as np
%matplotlib inline
print(pd.__version__)  # show which pandas version this notebook runs on

0.16.2


In the next series of steps, I am going to quickly load all of the movie data into a single `DataFrame` object so that we can play with everything the data has to offer (every rating, the user who made it, the movie name, its genres, etc.).

I am also going to convert all of the genres in the movie data into a usable format so we can search over genre types quickly.

In [6]:
ratingData = pd.read_csv("../../data/movieData/ratings.dat",sep = "::",names = ['UserID','MovieID','Rating','Timestamp'])
movieData = pd.read_table("../../data/movieData/movies.dat",sep="::", names = ["MovieID","Title","Genres"])
userData = pd.read_table("../../data/movieData/users.dat", sep="::", names = ["UserID","Gender","Age","Occupation","Zip-code"])

As always, we first load our three data files and name their columns.

In [7]:
ratingData.Timestamp = pd.to_datetime(ratingData.Timestamp, unit="s")
movieData = pd.concat([movieData,movieData.Genres.str.get_dummies(sep = "|")],axis=1)
data = userData.merge(ratingData.merge(movieData))
del data["Genres"]

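Next, we format them and merge everything into a single mega `DataFrame` object that we are just going to call `data`: the timestamps become datetimes, each genre becomes its own 0/1 indicator column, and the three tables are merged on their shared `MovieID` and `UserID` columns.

Let's take a look at the first few rows of `data`: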

In [8]:
data.head()

Unnamed: 0,UserID,Gender,Age,Occupation,Zip-code,MovieID,Rating,Timestamp,Title,Genres,...,Fantasy,Film-Noir,Horror,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
0,1,F,1,10,48067,1193,5,2000-12-31 22:12:40,One Flew Over the Cuckoo's Nest (1975),Drama,...,0,0,0,0,0,0,0,0,0,0
1,1,F,1,10,48067,661,3,2000-12-31 22:35:09,James and the Giant Peach (1996),Animation|Children's|Musical,...,0,0,0,1,0,0,0,0,0,0
2,1,F,1,10,48067,914,3,2000-12-31 22:32:48,My Fair Lady (1964),Musical|Romance,...,0,0,0,1,0,1,0,0,0,0
3,1,F,1,10,48067,3408,4,2000-12-31 22:04:35,Erin Brockovich (2000),Drama,...,0,0,0,0,0,0,0,0,0,0
4,1,F,1,10,48067,2355,5,2001-01-06 23:38:11,"Bug's Life, A (1998)",Animation|Children's|Comedy,...,0,0,0,0,0,0,0,0,0,0


Here is the first cool fast data manipulation trick I will teach you:

**You can use the `assign` method on `DataFrame` objects to easily create new columns that are transformations of other columns or combinations of columns**

All you have to do is pass the name of the column you want to create as a keyword argument to the `assign` method, and pass either a ready-made column (a `Series`) or an anonymous (lambda) function as the value you want the new column to have.

Here is how you would create a new `Boolean` column called `high_rating` that is set to `True` only when the `Rating` is 4 or greater:

In [24]:
data = data.assign(high_rating = data.Rating >= 4)

This is useful because you can now pass any function you want and create any kind of new column.
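
The cell above passes a ready-made Boolean column. A lambda works too. Here is a minimal sketch on a small made-up frame (the `toy` name and its values are hypothetical, not taken from the movie data):

```python
import pandas as pd

# Hypothetical toy ratings, only to illustrate the call shape.
toy = pd.DataFrame({"Rating": [5, 3, 4, 1]})

# The lambda receives the DataFrame that assign is called on and
# must return the values for the new column.
toy = toy.assign(high_rating=lambda df: df.Rating >= 4)

print(toy)
```

The lambda form is handy in the middle of a chain, where the intermediate `DataFrame` has no name you could refer to.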

Try it yourself:

* Create a column called `morning_rating` that is `True` if the `Timestamp` of the rating is before noon.
* Create a column called `high_morning_rating` that is `True` if both `morning_rating` and `high_rating` are `True`.

In [26]:
##YOUR CODE HERE
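
If you get stuck, here is one possible solution, sketched on a few made-up timestamps. The `toy` frame is hypothetical; on the real data you would call `assign` on `data` instead. It assumes `Timestamp` is already a datetime column, as it is after the `pd.to_datetime` call earlier.

```python
import pandas as pd

# Hypothetical stand-in for a few rows of the ratings data.
toy = pd.DataFrame({
    "Rating": [5, 2, 4],
    "Timestamp": pd.to_datetime(
        ["2000-12-31 05:53:01", "2000-12-31 06:56:03", "2000-12-31 22:12:40"]),
})

toy = toy.assign(high_rating=toy.Rating >= 4)
# .dt.hour is the hour of day (0-23); anything below 12 is before noon.
toy = toy.assign(morning_rating=toy.Timestamp.dt.hour < 12)
# Element-wise "and" of the two Boolean columns.
toy = toy.assign(high_morning_rating=toy.morning_rating & toy.high_rating)
```

The three columns are created in separate `assign` calls so that each call only uses columns that already exist.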

Here is another incredibly useful feature in pandas:

**Use the `query` method to immediately return all of the rows that satisfy a given selection statement, written in something very close to plain English**

If the column you are using for the `query` stores `Boolean` values (`True`/`False`) then a simple call passing that column returns only rows with `True`:

In [28]:
data.query("morning_rating")

Unnamed: 0,UserID,Gender,Age,Occupation,Zip-code,MovieID,Rating,Timestamp,Title,Genres,...,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western,high_rating,morning_rating,high_morning_rating
254,5,M,25,20,55455,3408,3,2000-12-31 05:58:43,Erin Brockovich (2000),Drama,...,0,0,0,0,0,0,0,False,True,False
255,5,M,25,20,55455,2355,5,2000-12-31 05:53:01,"Bug's Life, A (1998)",Animation|Children's|Comedy,...,0,0,0,0,0,0,0,True,True,True
256,5,M,25,20,55455,919,4,2000-12-31 05:37:52,"Wizard of Oz, The (1939)",Adventure|Children's|Drama|Musical,...,1,0,0,0,0,0,0,True,True,True
257,5,M,25,20,55455,3105,2,2000-12-31 07:09:36,Awakenings (1990),Drama,...,0,0,0,0,0,0,0,False,True,False
258,5,M,25,20,55455,1721,1,2000-12-31 06:56:03,Titanic (1997),Drama|Romance,...,0,0,1,0,0,0,0,False,True,False
259,5,M,25,20,55455,2762,3,2000-12-31 06:10:54,"Sixth Sense, The (1999)",Thriller,...,0,0,0,0,1,0,0,False,True,False
260,5,M,25,20,55455,150,2,2000-12-31 06:56:03,Apollo 13 (1995),Drama,...,0,0,0,0,0,0,0,False,True,False
261,5,M,25,20,55455,2692,4,2000-12-31 06:09:37,Run Lola Run (Lola rennt) (1998),Action|Crime|Romance,...,0,0,1,0,0,0,0,True,True,True
262,5,M,25,20,55455,2028,2,2000-12-31 06:27:33,Saving Private Ryan (1998),Action|Drama|War,...,0,0,0,0,0,1,0,False,True,False
263,5,M,25,20,55455,608,4,2000-12-31 06:29:37,Fargo (1996),Crime|Drama|Thriller,...,0,0,0,0,1,0,0,True,True,True


You can pass even more complicated near-English statements. Just make sure that everything you pass is a single `string`.

So if we wanted to know all of the movies that writers (`Occupation` = 20) rated highly in the morning, we could simply `query` as follows:

In [31]:
data.query("Occupation == 20 & high_morning_rating")

Unnamed: 0,UserID,Gender,Age,Occupation,Zip-code,MovieID,Rating,Timestamp,Title,Genres,...,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western,high_rating,morning_rating,high_morning_rating
255,5,M,25,20,55455,2355,5,2000-12-31 05:53:01,"Bug's Life, A (1998)",Animation|Children's|Comedy,...,0,0,0,0,0,0,0,True,True,True
256,5,M,25,20,55455,919,4,2000-12-31 05:37:52,"Wizard of Oz, The (1939)",Adventure|Children's|Drama|Musical,...,1,0,0,0,0,0,0,True,True,True
261,5,M,25,20,55455,2692,4,2000-12-31 06:09:37,Run Lola Run (Lola rennt) (1998),Action|Crime|Romance,...,0,0,1,0,0,0,0,True,True,True
263,5,M,25,20,55455,608,4,2000-12-31 06:29:37,Fargo (1996),Crime|Drama|Thriller,...,0,0,0,0,1,0,0,True,True,True
266,5,M,25,20,55455,1213,5,2000-12-31 06:29:37,GoodFellas (1990),Crime|Drama,...,0,0,0,0,0,0,0,True,True,True
268,5,M,25,20,55455,1610,4,2000-12-31 06:54:05,"Hunt for Red October, The (1990)",Action|Thriller,...,0,0,0,0,1,0,0,True,True,True
269,5,M,25,20,55455,2858,4,2000-12-31 05:43:10,American Beauty (1999),Comedy|Drama,...,0,0,0,0,0,0,0,True,True,True
270,5,M,25,20,55455,515,4,2000-12-31 06:58:11,"Remains of the Day, The (1993)",Drama,...,0,0,0,0,0,0,0,True,True,True
273,5,M,25,20,55455,2427,5,2000-12-31 07:07:30,"Thin Red Line, The (1998)",Action|Drama|War,...,0,0,0,0,0,1,0,True,True,True
274,5,M,25,20,55455,593,4,2000-12-31 06:29:37,"Silence of the Lambs, The (1991)",Drama|Thriller,...,0,0,0,0,1,0,0,True,True,True
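
To make the connection with ordinary Boolean indexing explicit, here is a small sketch on made-up rows (the `toy` frame and the `min_rating` variable are hypothetical). It shows that the query used above selects the same rows as a Boolean mask, and that a Python variable can be referenced inside the string with `@`:

```python
import pandas as pd

# Hypothetical rows, only to illustrate the selection logic.
toy = pd.DataFrame({
    "Occupation": [20, 20, 7, 20],
    "Rating": [5, 2, 4, 4],
    "high_morning_rating": [True, False, True, False],
})

# Boolean-mask version and query version of the same selection.
by_mask = toy[(toy.Occupation == 20) & toy.high_morning_rating]
by_query = toy.query("Occupation == 20 & high_morning_rating")

# A local Python variable can be used inside the string with @.
min_rating = 4
strict = toy.query("Occupation == 20 and Rating >= @min_rating")
```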


The real power of `query` and `assign` shows when you chain them together to answer a seemingly complicated question in just a few lines:

In [47]:
crapMovieCounts = (data.assign(crap_rating=data["Rating"]<=2)
                       .query("crap_rating")
                       .groupby("Title")
                       .size())
# Series.sort is the pandas 0.16 API; from pandas 0.17 on, use sort_values instead.
crapMovieCounts.sort(ascending=False, inplace=True)
crapMovieCounts.head()

Title
Wild Wild West (1999)                               566
Star Wars: Episode I - The Phantom Menace (1999)    467
Blair Witch Project, The (1999)                     434
Mars Attacks! (1996)                                403
Arachnophobia (1990)                                382
dtype: int64

Chaining like this lets you create columns and modify data on the fly without ever changing the original dataset: `assign` returns a new `DataFrame`, so the `crap_rating` column exists only inside the chain and never appears in `data`.
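
You can check this for yourself. The sketch below repeats the same chain on a tiny made-up frame (`toy` and `low_rating` are hypothetical names) and confirms that the helper column never reaches the original:

```python
import pandas as pd

# Hypothetical ratings for three made-up titles.
toy = pd.DataFrame({
    "Title": ["A", "A", "B", "B", "B", "C"],
    "Rating": [1, 2, 5, 1, 4, 2],
})

low_counts = (toy.assign(low_rating=lambda df: df.Rating <= 2)
                 .query("low_rating")
                 .groupby("Title")
                 .size())

# The helper column lived only inside the chain.
print("low_rating" in toy.columns)
print(low_counts)
```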

Now we are actually going to start doing some data science.

Our first real data science method that we are going to explore is called **clustering**. We are going to try to see whether the movies in our dataset tend to cluster into preference groups based on all of the user data we have.

We are only going to work with movies for which we have enough data. A movie with too few ratings is not going to work for us because we can't make very strong statements about movies that few people have seen.

Here is the pipeline we are going to work through:

1. Transform all non-numeric user/movie information into one-hot encoded columns across all individual ratings (like we have done before for genres)
2. Create useful aggregate feature columns from the ratings so that every unique movie in our database is now a single row
3. Perform clustering on movies using K-means clustering with several different numbers of clusters and see what our results look like.
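
Steps 1 and 2 might look roughly like the sketch below, shown on a tiny made-up frame. The `toy` data and the feature names `meanRating`, `numRatings` and `shareFemale` are assumptions chosen for illustration, not the features the lesson ends up using, and step 3 (K-means) is left out.

```python
import pandas as pd

# Hypothetical ratings: one row per individual rating.
toy = pd.DataFrame({
    "Title":  ["A", "A", "A", "B", "B"],
    "Gender": ["F", "M", "M", "F", "F"],
    "Rating": [4, 2, 3, 5, 4],
})

# Step 1: one-hot encode a non-numeric column on every rating row.
dummies = pd.get_dummies(toy.Gender, prefix="Gender").astype(float)
encoded = pd.concat([toy, dummies], axis=1)

# Step 2: collapse to one row per movie with a few aggregate features.
grouped = encoded.groupby("Title")
movieFeatures = pd.DataFrame({
    "meanRating": grouped["Rating"].mean(),
    "numRatings": grouped.size(),
    "shareFemale": grouped["Gender_F"].mean(),
})

print(movieFeatures)
```

Taking the mean of a 0/1 indicator column gives the share of ratings in that category, which is one simple way to turn per-rating user information into a per-movie feature.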

First off, let's keep only those movies that have been rated at least 100 times:

In [4]:
mostReviewedMoviesData = data.groupby("Title").filter(lambda x: x.shape[0]>=100)
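
To see what `groupby(...).filter(...)` does, here is the same pattern on a tiny made-up frame with a smaller threshold (`toy` and `minRatings` are hypothetical names). The lambda is called once per group and returns a single `True`/`False`; every row of a group is kept or dropped together.

```python
import pandas as pd

# Hypothetical ratings: "A" has 3 ratings, "B" has 1, "C" has 2.
toy = pd.DataFrame({
    "Title": ["A", "A", "A", "B", "C", "C"],
    "Rating": [4, 5, 3, 2, 4, 1],
})

minRatings = 2  # stand-in for the threshold of 100 used above
kept = toy.groupby("Title").filter(lambda g: g.shape[0] >= minRatings)

print(kept)
```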

In [None]:
# get the per-gender mean and standard deviation of ratings for each movie
statsByGender = mostReviewedMoviesData.pivot_table("Rating", index="Title", columns="Gender", aggfunc=[np.mean, np.std])

In [5]:
# difference in mean rating between genders (positive: women rated the movie higher)
statsByGender["meanDifference"] = statsByGender["mean"]["F"] - statsByGender["mean"]["M"]
# DataFrame.sort is the pandas 0.16 API; from pandas 0.17 on, use sort_values instead.
statsByGender.sort("meanDifference", ascending = False, inplace=True)
print("Movies women tended to like more than men: \n" + str(statsByGender.head()) + "\n")
print("Movies men tended to like more than women: \n" + str(statsByGender[::-1].head()) + "\n")
statsByGender.sort(("std","F"), ascending = False, inplace=True)
print("Movies women tended to disagree on: \n" + str(statsByGender.head()) + "\n")
print("Movies women tended to agree on: \n" + str(statsByGender[::-1].head()) + "\n")

Movies women tended to like more than men: 
                             mean                 std           meanDifference
Gender                          F         M         F         M               
Title                                                                         
Pet Sematary II (1992)   2.833333  1.858696  1.340560  0.978691       0.974638
Cutthroat Island (1995)  3.200000  2.341270  1.361114  1.160432       0.858730
Dirty Dancing (1987)     3.790378  2.959596  1.105181  1.087738       0.830782
Air Bud (1997)           3.057143  2.233766  1.211291  1.086957       0.823377
Home Alone 3 (1997)      2.486486  1.683761  1.556735  0.934488       0.802726 

Movies men tended to like more than women: 
                                                    mean                 std  \
Gender                                                 F         M         F   
Title                                                                          
Friday the 13th Part V: A New Beginnin

In [14]:
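def split_train_test(df, sample=0.4, testSetColumnName="testSet"):
    # Called once per group: with probability `sample`, flag every
    # row in the group as belonging to the test set.
    if np.random.random() < sample:
        df.loc[:, testSetColumnName] = True
    return df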

In [26]:
data["testSet"] = False
data2 = data.groupby("UserID").apply(split_train_test)
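
The cell above makes one random draw per user, so all of a user's ratings land on the same side of the split. As an alternative sketch, not the approach used in this notebook, the same idea can be written without `apply` by drawing once per user and mapping the result back onto the rows. The `toy` frame, the fixed seed and the `isTestUser` name are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical ratings from three made-up users.
toy = pd.DataFrame({
    "UserID": [1, 1, 2, 2, 2, 3],
    "Rating": [5, 3, 4, 2, 1, 5],
})

rng = np.random.RandomState(0)  # fixed seed so the sketch is repeatable
users = toy.UserID.unique()

# One random draw per user, then broadcast to that user's rows.
isTestUser = pd.Series(rng.random_sample(len(users)) < 0.4, index=users)
toy["testSet"] = toy.UserID.map(isTestUser)

print(toy)
```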

In [31]:
data2.head(100)

Unnamed: 0,UserID,Gender,Age,Occupation,Zip-code,MovieID,Rating,Timestamp,Title,Genres,...,Horror,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western,high_rating,testSet
0,1,F,1,10,48067,1193,5,2000-12-31 22:12:40,One Flew Over the Cuckoo's Nest (1975),Drama,...,0,0,0,0,0,0,0,0,True,False
1,1,F,1,10,48067,661,3,2000-12-31 22:35:09,James and the Giant Peach (1996),Animation|Children's|Musical,...,0,1,0,0,0,0,0,0,False,False
2,1,F,1,10,48067,914,3,2000-12-31 22:32:48,My Fair Lady (1964),Musical|Romance,...,0,1,0,1,0,0,0,0,False,False
3,1,F,1,10,48067,3408,4,2000-12-31 22:04:35,Erin Brockovich (2000),Drama,...,0,0,0,0,0,0,0,0,True,False
4,1,F,1,10,48067,2355,5,2001-01-06 23:38:11,"Bug's Life, A (1998)",Animation|Children's|Comedy,...,0,0,0,0,0,0,0,0,True,False
5,1,F,1,10,48067,1197,3,2000-12-31 22:37:48,"Princess Bride, The (1987)",Action|Adventure|Comedy|Romance,...,0,0,0,1,0,0,0,0,False,False
6,1,F,1,10,48067,1287,5,2000-12-31 22:33:59,Ben-Hur (1959),Action|Adventure|Drama,...,0,0,0,0,0,0,0,0,True,False
7,1,F,1,10,48067,2804,5,2000-12-31 22:11:59,"Christmas Story, A (1983)",Comedy|Drama,...,0,0,0,0,0,0,0,0,True,False
8,1,F,1,10,48067,594,4,2000-12-31 22:37:48,Snow White and the Seven Dwarfs (1937),Animation|Children's|Musical,...,0,1,0,0,0,0,0,0,True,False
9,1,F,1,10,48067,919,4,2000-12-31 22:22:48,"Wizard of Oz, The (1939)",Adventure|Children's|Drama|Musical,...,0,1,0,0,0,0,0,0,True,False
