In this lesson we are going to learn how to perform a series of transformations in **pandas** as quickly and efficiently as possible. Recent versions of **pandas** include a lot of very useful functionality for this, and I want to introduce all of you to it.

So, let's get started.

In [1]:
import pandas as pd
import numpy as np
%matplotlib inline
print(pd.__version__)  # the pandas version this lesson was run with

0.16.2


In the next few steps, I will quickly load all of the movie data into a single `DataFrame` object so that we can play with everything the data has to offer (every rating, the user who made it, the movie title, its genres, etc.).

I will also convert the genres in the movie data into a usable format so that we can search by genre quickly.

In [6]:
# A multi-character separator such as "::" needs the Python parsing engine
ratingData = pd.read_csv("../../data/movieData/ratings.dat", sep="::", engine="python", names=["UserID", "MovieID", "Rating", "Timestamp"])
movieData = pd.read_table("../../data/movieData/movies.dat", sep="::", engine="python", names=["MovieID", "Title", "Genres"])
userData = pd.read_table("../../data/movieData/users.dat", sep="::", engine="python", names=["UserID", "Gender", "Age", "Occupation", "Zip-code"])

First, as always, we load our three data files and label their columns.

In [7]:
ratingData.Timestamp = pd.to_datetime(ratingData.Timestamp, unit="s")
movieData = pd.concat([movieData,movieData.Genres.str.get_dummies(sep = "|")],axis=1)
data = userData.merge(ratingData.merge(movieData))
del data["Genres"]

Next, we format the data and merge everything into a single mega `DataFrame` object that we will simply call `data`. The timestamps become datetimes, and the pipe-delimited `Genres` string is expanded into one indicator column per genre.

Let's take a look at the first few rows of `data`:
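To make the formatting step concrete, here is a minimal sketch of the same three operations (`to_datetime`, `str.get_dummies`, `merge`) on a tiny made-up dataset. The frames `movies` and `ratings` and their contents are hypothetical stand-ins, not the real MovieLens files.

```python
import pandas as pd

# Hypothetical toy data standing in for movies.dat and ratings.dat
movies = pd.DataFrame({
    "MovieID": [1, 2, 3],
    "Title": ["Movie A (1990)", "Movie B (1995)", "Movie C (2000)"],
    "Genres": ["Drama", "Animation|Musical", "Musical|Romance"],
})
ratings = pd.DataFrame({
    "UserID": [10, 10, 11],
    "MovieID": [1, 2, 3],
    "Rating": [5, 3, 4],
    "Timestamp": [978300760, 978302109, 978301968],  # seconds since the epoch
})

# Convert epoch seconds into proper datetimes
ratings["Timestamp"] = pd.to_datetime(ratings["Timestamp"], unit="s")

# Split "A|B|C" strings into one 0/1 indicator column per genre
genre_flags = movies["Genres"].str.get_dummies(sep="|")
movies = pd.concat([movies, genre_flags], axis=1)

# merge() joins on the shared column (MovieID) by default
merged = ratings.merge(movies)
print(merged[["Title", "Rating", "Drama", "Musical"]])
```

With the indicator columns in place, selecting every rating of a musical is just a filter on `merged["Musical"] == 1`.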

In [8]:
data.head()

Unnamed: 0,UserID,Gender,Age,Occupation,Zip-code,MovieID,Rating,Timestamp,Title,Genres,...,Fantasy,Film-Noir,Horror,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
0,1,F,1,10,48067,1193,5,2000-12-31 22:12:40,One Flew Over the Cuckoo's Nest (1975),Drama,...,0,0,0,0,0,0,0,0,0,0
1,1,F,1,10,48067,661,3,2000-12-31 22:35:09,James and the Giant Peach (1996),Animation|Children's|Musical,...,0,0,0,1,0,0,0,0,0,0
2,1,F,1,10,48067,914,3,2000-12-31 22:32:48,My Fair Lady (1964),Musical|Romance,...,0,0,0,1,0,1,0,0,0,0
3,1,F,1,10,48067,3408,4,2000-12-31 22:04:35,Erin Brockovich (2000),Drama,...,0,0,0,0,0,0,0,0,0,0
4,1,F,1,10,48067,2355,5,2001-01-06 23:38:11,"Bug's Life, A (1998)",Animation|Children's|Comedy,...,0,0,0,0,0,0,0,0,0,0


Here is the first cool fast data manipulation trick I will teach you:

**You can use the `assign` method on `DataFrame` objects to easily create new columns that are transformations of other columns or combinations of columns**

All you have to do is pass the name of the column you want to create as a keyword argument to the `assign` method, and pass a function (usually an anonymous lambda) as its value. The function receives the `DataFrame` and returns the values for the new column.

Here is how you would create a new `Boolean` column called `high_rating` that is `True` only when the `Rating` is 4 or greater:

In [15]:
data = data.assign(high_rating = lambda x: x.Rating>=4)

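This is useful because the function can be anything that takes the `DataFrame` and returns a column's worth of values, so you can create any kind of new column. Note that `assign` returns a new `DataFrame` rather than modifying the original, which is why the result is assigned back to `data`.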
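As a small illustration, the sketch below creates two columns in one `assign` call on a made-up frame. The frame `toy` and the column `rating_centered` are hypothetical names used only for this example.

```python
import pandas as pd

# Hypothetical toy frame, not the movie data
toy = pd.DataFrame({"Rating": [5, 3, 4, 1], "Age": [25, 35, 18, 56]})

# Each keyword names a new column; each lambda receives the DataFrame
result = toy.assign(
    high_rating=lambda x: x.Rating >= 4,
    rating_centered=lambda x: x.Rating - x.Rating.mean(),
)

print(result)
# The original frame is left untouched
print(list(toy.columns))
```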

Try it yourself:

* Create a column called `morning_rating` that is `True` when the `Timestamp` of the rating falls before noon.
* Create a column called `high_morning_rating` that is `True` when both `morning_rating` and `high_rating` are `True`.

In [None]:
##YOUR CODE HERE

Here is another incredibly useful feature in pandas:

**Use the `query` method to immediately return all of the rows that satisfy a given selection statement using something very close to plain English**

If the column you are using for the `query` stores `Boolean` values (`True`/`False`), you can pass just the column name as the query string.

In [14]:
data.query("Rating < 2")

Unnamed: 0,UserID,Gender,Age,Occupation,Zip-code,MovieID,Rating,Timestamp,Title,Genres,...,Film-Noir,Horror,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western,high_rating
152,2,M,56,16,70072,21,1,2000-12-31 21:57:19,Get Shorty (1995),Action|Comedy|Drama,...,0,0,0,0,0,0,0,0,0,False
180,2,M,56,16,70072,3893,1,2000-12-31 21:52:15,Nurse Betty (2000),Comedy|Thriller,...,0,0,0,0,0,0,1,0,0,False
217,3,M,25,15,55117,1261,1,2000-12-31 21:21:03,Evil Dead II (Dead By Dawn) (1987),Action|Adventure|Comedy|Horror,...,0,1,0,0,0,0,0,0,0,False
250,4,M,45,7,02460,3527,1,2000-12-31 20:20:08,Predator (1987),Action|Sci-Fi|Thriller,...,0,0,0,0,0,1,1,0,0,False
258,5,M,25,20,55455,1721,1,2000-12-31 06:56:03,Titanic (1997),Drama|Romance,...,0,0,0,0,1,0,0,0,0,False
265,5,M,25,20,55455,2916,1,2000-12-31 06:54:05,Total Recall (1990),Action|Adventure|Sci-Fi|Thriller,...,0,0,0,0,0,1,1,0,0,False
276,5,M,25,20,55455,2717,1,2000-12-31 05:37:52,Ghostbusters II (1989),Comedy|Horror,...,0,1,0,0,0,0,0,0,0,False
281,5,M,25,20,55455,356,1,2000-12-31 05:38:32,Forrest Gump (1994),Comedy|Romance|War,...,0,0,0,0,1,0,0,1,0,False
284,5,M,25,20,55455,733,1,2000-12-31 06:56:03,"Rock, The (1996)",Action|Adventure|Thriller,...,0,0,0,0,0,0,1,0,0,False
301,5,M,25,20,55455,501,1,2000-12-31 06:26:41,Naked (1993),Drama,...,0,0,0,0,0,0,0,0,0,False
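Here is a brief sketch of a few `query` patterns on a made-up frame: a comparison, a Boolean column combined with a second condition, and a Python variable referenced with `@`. The frame `toy` and the variable `threshold` are hypothetical.

```python
import pandas as pd

# Hypothetical toy frame, not the movie data
toy = pd.DataFrame({
    "Title": ["A", "B", "C", "D"],
    "Rating": [1, 5, 4, 2],
    "Gender": ["F", "M", "F", "M"],
    "high_rating": [False, True, True, False],
})

# Simple comparison, equivalent to toy[toy["Rating"] < 2]
low = toy.query("Rating < 2")

# A Boolean column can be used directly and combined with `and`/`or`
high_f = toy.query("high_rating and Gender == 'F'")

# Local Python variables are referenced with an @ prefix
threshold = 4
at_least = toy.query("Rating >= @threshold")

print(low, high_f, at_least, sep="\n\n")
```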


In [13]:
data.query("high_rating").groupby("Title").size()

Title
$1,000,000 Duck (1971)                              11
'Night Mother (1986)                                31
'Til There Was You (1997)                           12
'burbs, The (1989)                                  91
...And Justice for All (1979)                      120
10 Things I Hate About You (1999)                  353
101 Dalmatians (1961)                              329
101 Dalmatians (1996)                              130
12 Angry Men (1957)                                525
13th Warrior, The (1999)                           335
187 (1997)                                          10
2 Days in the Valley (1996)                        128
20 Dates (1998)                                     45
20,000 Leagues Under the Sea (1954)                343
200 Cigarettes (1999)                               55
2001: A Space Odyssey (1968)                      1270
2010 (1984)                                        231
24 7: Twenty Four Seven (1997)                       3
28 D

In [4]:
# keep only those movies that have been reviewed at least 100 times
mostReviewedMoviesData = data.groupby("Title").filter(lambda x: x.shape[0] >= 100)
# get per-gender avg, std of ratings per movie
statsByGender = mostReviewedMoviesData.pivot_table("Rating", index="Title", columns="Gender", aggfunc=[np.mean, np.std])

In [5]:
statsByGender["meanDifference"] = statsByGender["mean"]["F"] - statsByGender["mean"]["M"] # get diff in mean rating between genders
statsByGender.sort("meanDifference", ascending = False, inplace=True)
print "Movies women tended to like more than men: \n", statsByGender.head(), "\n"
print "Movies men tended to like more than women: \n", statsByGender[::-1].head(), "\n"
statsByGender.sort(("std","F"), ascending = False, inplace=True)
print "Movies women tended to disagree on: \n", statsByGender.head(), "\n"
print "Movies women tended to agree on: \n", statsByGender[::-1].head(), "\n"

Movies women tended to like more than men: 
                             mean                 std           meanDifference
Gender                          F         M         F         M               
Title                                                                         
Pet Sematary II (1992)   2.833333  1.858696  1.340560  0.978691       0.974638
Cutthroat Island (1995)  3.200000  2.341270  1.361114  1.160432       0.858730
Dirty Dancing (1987)     3.790378  2.959596  1.105181  1.087738       0.830782
Air Bud (1997)           3.057143  2.233766  1.211291  1.086957       0.823377
Home Alone 3 (1997)      2.486486  1.683761  1.556735  0.934488       0.802726 

Movies men tended to like more than women: 
                                                    mean                 std  \
Gender                                                 F         M         F   
Title                                                                          
Friday the 13th Part V: A New Beginnin
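The two cells above chain `groupby(...).filter`, `pivot_table`, and a sort. The following sketch walks through the same pattern on a tiny made-up frame so that each intermediate result is easy to inspect. The data and the cutoff of 4 reviews are hypothetical, and it uses `sort_values`, which replaced `DataFrame.sort` in later pandas releases.

```python
import pandas as pd

# Hypothetical toy ratings: movie C has only one review
toy = pd.DataFrame({
    "Title":  ["A", "A", "A", "A", "B", "B", "B", "B", "C"],
    "Gender": ["F", "F", "M", "M", "F", "F", "M", "M", "F"],
    "Rating": [5, 4, 3, 2, 2, 3, 4, 5, 1],
})

# filter() keeps every row of each group for which the function returns True
popular = toy.groupby("Title").filter(lambda g: len(g) >= 4)

# One row per movie, one column per gender, mean rating in each cell
stats = popular.pivot_table("Rating", index="Title", columns="Gender", aggfunc="mean")

# Positive values: rated higher by women; negative: rated higher by men
stats["meanDifference"] = stats["F"] - stats["M"]
stats = stats.sort_values("meanDifference", ascending=False)
print(stats)
```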

In [14]:
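def split_train_test(df, sample=0.4, testSetColumnName="testSet"):
    # Called once per user group: with probability `sample`,
    # flag every rating in the group as belonging to the test set.
    if np.random.random() < sample:
        df.loc[:, testSetColumnName] = True
    return df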

In [26]:
data["testSet"] = False
data2 = data.groupby("UserID").apply(split_train_test)
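The cell above assigns each user's ratings to the test set as a whole, with probability `sample`. As an alternative sketch (not the approach used in this lesson), the same per-user split can be obtained by drawing one random number per user and mapping it back onto the rows. The toy frame, the seed, and the names `rng` and `in_test` are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy ratings for four users
ratings = pd.DataFrame({
    "UserID": [1, 1, 2, 2, 3, 3, 4, 4],
    "Rating": [5, 3, 4, 1, 2, 5, 3, 4],
})

# Seeded generator so the split is reproducible
rng = np.random.RandomState(0)

# One draw per user: True means "this user goes to the test set"
users = ratings["UserID"].unique()
in_test = pd.Series(rng.random_sample(len(users)) < 0.4, index=users)

# Broadcast the per-user decision to every rating by that user
ratings["testSet"] = ratings["UserID"].map(in_test).astype(bool)
print(ratings)
```

Because the decision is made per user rather than per row, no user's ratings end up split across the training and test sets.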

In [31]:
data2.head(100)

Unnamed: 0,UserID,Gender,Age,Occupation,Zip-code,MovieID,Rating,Timestamp,Title,Genres,...,Horror,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western,high_rating,testSet
0,1,F,1,10,48067,1193,5,2000-12-31 22:12:40,One Flew Over the Cuckoo's Nest (1975),Drama,...,0,0,0,0,0,0,0,0,True,False
1,1,F,1,10,48067,661,3,2000-12-31 22:35:09,James and the Giant Peach (1996),Animation|Children's|Musical,...,0,1,0,0,0,0,0,0,False,False
2,1,F,1,10,48067,914,3,2000-12-31 22:32:48,My Fair Lady (1964),Musical|Romance,...,0,1,0,1,0,0,0,0,False,False
3,1,F,1,10,48067,3408,4,2000-12-31 22:04:35,Erin Brockovich (2000),Drama,...,0,0,0,0,0,0,0,0,True,False
4,1,F,1,10,48067,2355,5,2001-01-06 23:38:11,"Bug's Life, A (1998)",Animation|Children's|Comedy,...,0,0,0,0,0,0,0,0,True,False
5,1,F,1,10,48067,1197,3,2000-12-31 22:37:48,"Princess Bride, The (1987)",Action|Adventure|Comedy|Romance,...,0,0,0,1,0,0,0,0,False,False
6,1,F,1,10,48067,1287,5,2000-12-31 22:33:59,Ben-Hur (1959),Action|Adventure|Drama,...,0,0,0,0,0,0,0,0,True,False
7,1,F,1,10,48067,2804,5,2000-12-31 22:11:59,"Christmas Story, A (1983)",Comedy|Drama,...,0,0,0,0,0,0,0,0,True,False
8,1,F,1,10,48067,594,4,2000-12-31 22:37:48,Snow White and the Seven Dwarfs (1937),Animation|Children's|Musical,...,0,1,0,0,0,0,0,0,True,False
9,1,F,1,10,48067,919,4,2000-12-31 22:22:48,"Wizard of Oz, The (1939)",Adventure|Children's|Drama|Musical,...,0,1,0,0,0,0,0,0,True,False
