In this lesson we are going to learn how to perform a series of transformations in **pandas** as quickly and efficiently as possible. Recent versions of **pandas** include a lot of very useful functionality, and I want to expose all of you to it.

So, let's get started.

In [29]:
import pandas as pd
import numpy as np


In the next series of steps, I am going to quickly pull all of the movie data into a single `DataFrame` object so that we can play with everything the data has to offer (every rating, the user who made it, the movie name, its genres, etc.).

I am also going to convert all of the genres in the movie data into a usable format so we can search over genre types quickly.

In [2]:
# A multi-character separator such as "::" needs the Python parsing engine
ratingData = pd.read_csv("../../data/movieData/ratings.dat", sep="::", engine="python", names=['UserID','MovieID','Rating','Timestamp'])
movieData = pd.read_table("../../data/movieData/movies.dat", sep="::", engine="python", names=["MovieID","Title","Genres"])
userData = pd.read_table("../../data/movieData/users.dat", sep="::", engine="python", names=["UserID","Gender","Age","Occupation","Zip-code"])



Again, we first load our 3 data files and label their columns appropriately, as always.

In [3]:
ratingData.Timestamp = pd.to_datetime(ratingData.Timestamp, unit="s")
movieData = pd.concat([movieData,movieData.Genres.str.get_dummies(sep = "|")],axis=1)
data = userData.merge(ratingData.merge(movieData))
del data["Genres"]

Now we format them (converting the timestamps to datetimes and expanding the genres into one indicator column per genre) and merge everything into a single mega `DataFrame` object that we will simply call `data`.

Let's take a look at the first few rows of `data`:

In [4]:
data.head()

Unnamed: 0,UserID,Gender,Age,Occupation,Zip-code,MovieID,Rating,Timestamp,Title,Action,...,Fantasy,Film-Noir,Horror,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
0,1,F,1,10,48067,1193,5,2000-12-31 22:12:40,One Flew Over the Cuckoo's Nest (1975),0,...,0,0,0,0,0,0,0,0,0,0
1,1,F,1,10,48067,661,3,2000-12-31 22:35:09,James and the Giant Peach (1996),0,...,0,0,0,1,0,0,0,0,0,0
2,1,F,1,10,48067,914,3,2000-12-31 22:32:48,My Fair Lady (1964),0,...,0,0,0,1,0,1,0,0,0,0
3,1,F,1,10,48067,3408,4,2000-12-31 22:04:35,Erin Brockovich (2000),0,...,0,0,0,0,0,0,0,0,0,0
4,1,F,1,10,48067,2355,5,2001-01-06 23:38:11,"Bug's Life, A (1998)",0,...,0,0,0,0,0,0,0,0,0,0


Here is the first cool fast data manipulation trick I will teach you:

**You can use the `assign` method on `DataFrame` objects to easily create new columns that are transformations of other columns or combinations of columns**

All you have to do is pass the name of the column you want to create as a keyword argument to the `assign` method, and pass either an expression built from existing columns or an anonymous (lambda) function as the value you want the new column to have.

Here is how you would create a new `Boolean` column called `high_rating` that was set to `True` only when the `Rating` was 4 or greater:

In [13]:
data = data.assign(high_rating = data.Rating >= 4)

This is useful because you can now pass any function you want and create any kind of new column.
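
As a minimal sketch of the function form, assuming a small made-up frame named `toy` (not the lesson's `data`): a lambda passed to `assign` receives the `DataFrame` being built, so a later keyword can use a column created by an earlier keyword in the same call. The column names `low_rating`, `rating_hour` and `low_and_early` are invented for this illustration.

```python
import pandas as pd

# Toy stand-in for the lesson's `data`; the values are invented for illustration.
toy = pd.DataFrame({
    "Rating": [5, 2, 4, 1],
    "Timestamp": pd.to_datetime([
        "2000-12-31 22:12:40", "2000-12-31 05:58:43",
        "2001-01-06 23:38:11", "2000-12-31 06:56:03",
    ]),
})

with_new = toy.assign(
    # Boolean column computed from an existing column
    low_rating=lambda df: df.Rating <= 2,
    # Hour of day extracted from the datetime column
    rating_hour=lambda df: df.Timestamp.dt.hour,
    # Uses the two columns created just above in the same call
    low_and_early=lambda df: df.low_rating & (df.rating_hour < 6),
)

print(with_new[["Rating", "low_rating", "rating_hour", "low_and_early"]])
print(list(toy.columns))  # the original frame is left untouched
```

Because `assign` returns a new `DataFrame`, nothing changes unless you bind the result to a name.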

Try it yourself:

* Create a column called `morning_rating` that is `True` when the `Timestamp` of the rating is before noon.
* Create a column called `high_morning_rating` that is `True` when both `morning_rating` and `high_rating` are `True`.

In [26]:
##YOUR CODE HERE

Here is another incredibly useful feature in pandas:

**Use the `query` method to immediately return all of the rows that satisfy a given selection statement, written in something very close to plain English**

If the column you are using for the `query` stores `Boolean` values (`True`/`False`) then a simple call passing that column returns only rows with `True`:

In [28]:
data.query("morning_rating")

Unnamed: 0,UserID,Gender,Age,Occupation,Zip-code,MovieID,Rating,Timestamp,Title,Genres,...,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western,high_rating,morning_rating,high_morning_rating
254,5,M,25,20,55455,3408,3,2000-12-31 05:58:43,Erin Brockovich (2000),Drama,...,0,0,0,0,0,0,0,False,True,False
255,5,M,25,20,55455,2355,5,2000-12-31 05:53:01,"Bug's Life, A (1998)",Animation|Children's|Comedy,...,0,0,0,0,0,0,0,True,True,True
256,5,M,25,20,55455,919,4,2000-12-31 05:37:52,"Wizard of Oz, The (1939)",Adventure|Children's|Drama|Musical,...,1,0,0,0,0,0,0,True,True,True
257,5,M,25,20,55455,3105,2,2000-12-31 07:09:36,Awakenings (1990),Drama,...,0,0,0,0,0,0,0,False,True,False
258,5,M,25,20,55455,1721,1,2000-12-31 06:56:03,Titanic (1997),Drama|Romance,...,0,0,1,0,0,0,0,False,True,False
259,5,M,25,20,55455,2762,3,2000-12-31 06:10:54,"Sixth Sense, The (1999)",Thriller,...,0,0,0,0,1,0,0,False,True,False
260,5,M,25,20,55455,150,2,2000-12-31 06:56:03,Apollo 13 (1995),Drama,...,0,0,0,0,0,0,0,False,True,False
261,5,M,25,20,55455,2692,4,2000-12-31 06:09:37,Run Lola Run (Lola rennt) (1998),Action|Crime|Romance,...,0,0,1,0,0,0,0,True,True,True
262,5,M,25,20,55455,2028,2,2000-12-31 06:27:33,Saving Private Ryan (1998),Action|Drama|War,...,0,0,0,0,0,1,0,False,True,False
263,5,M,25,20,55455,608,4,2000-12-31 06:29:37,Fargo (1996),Crime|Drama|Thriller,...,0,0,0,0,1,0,0,True,True,True


You can pass even more complicated near-English statements; just make sure everything you pass is a single `string`.

So if we wanted to see all of the high morning ratings given by writers (`Occupation` = 20), we could simply `query` as follows:

In [31]:
data.query("Occupation == 20 & high_morning_rating")

Unnamed: 0,UserID,Gender,Age,Occupation,Zip-code,MovieID,Rating,Timestamp,Title,Genres,...,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western,high_rating,morning_rating,high_morning_rating
255,5,M,25,20,55455,2355,5,2000-12-31 05:53:01,"Bug's Life, A (1998)",Animation|Children's|Comedy,...,0,0,0,0,0,0,0,True,True,True
256,5,M,25,20,55455,919,4,2000-12-31 05:37:52,"Wizard of Oz, The (1939)",Adventure|Children's|Drama|Musical,...,1,0,0,0,0,0,0,True,True,True
261,5,M,25,20,55455,2692,4,2000-12-31 06:09:37,Run Lola Run (Lola rennt) (1998),Action|Crime|Romance,...,0,0,1,0,0,0,0,True,True,True
263,5,M,25,20,55455,608,4,2000-12-31 06:29:37,Fargo (1996),Crime|Drama|Thriller,...,0,0,0,0,1,0,0,True,True,True
266,5,M,25,20,55455,1213,5,2000-12-31 06:29:37,GoodFellas (1990),Crime|Drama,...,0,0,0,0,0,0,0,True,True,True
268,5,M,25,20,55455,1610,4,2000-12-31 06:54:05,"Hunt for Red October, The (1990)",Action|Thriller,...,0,0,0,0,1,0,0,True,True,True
269,5,M,25,20,55455,2858,4,2000-12-31 05:43:10,American Beauty (1999),Comedy|Drama,...,0,0,0,0,0,0,0,True,True,True
270,5,M,25,20,55455,515,4,2000-12-31 06:58:11,"Remains of the Day, The (1993)",Drama,...,0,0,0,0,0,0,0,True,True,True
273,5,M,25,20,55455,2427,5,2000-12-31 07:07:30,"Thin Red Line, The (1998)",Action|Drama|War,...,0,0,0,0,0,1,0,True,True,True
274,5,M,25,20,55455,593,4,2000-12-31 06:29:37,"Silence of the Lambs, The (1991)",Drama|Thriller,...,0,0,0,0,1,0,0,True,True,True
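
A short sketch of two more `query` conveniences, on a made-up frame named `toy`: conditions can be joined with `and`/`or`, and a local Python variable can be referenced inside the string by prefixing it with `@`. The variable name `min_rating` is hypothetical.

```python
import pandas as pd

# Toy stand-in for the lesson's `data`; the values are invented for illustration.
toy = pd.DataFrame({
    "Occupation": [20, 20, 7, 20],
    "Rating": [5, 2, 4, 4],
    "high_rating": [True, False, True, True],
})

# A comparison combined with a Boolean column
writers_high = toy.query("Occupation == 20 and high_rating")

# The same selection, with the threshold taken from a local variable via @
min_rating = 4  # hypothetical threshold
same_rows = toy.query("Occupation == 20 and Rating >= @min_rating")

print(writers_high)
print(same_rows)
```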


The real power of `query` and `assign` shows when you chain them together to answer a seemingly complicated question very quickly:

In [5]:
crapMovieCounts = (data.assign(crap_rating=data["Rating"]<=2)
                       .query("crap_rating")
                       .groupby("Title")
                       .size())
crapMovieCounts = crapMovieCounts.sort_values(ascending=False)  # Series.sort was removed from pandas
crapMovieCounts.head()

Title
Wild Wild West (1999)                               566
Star Wars: Episode I - The Phantom Menace (1999)    467
Blair Witch Project, The (1999)                     434
Mars Attacks! (1996)                                403
Arachnophobia (1990)                                382
dtype: int64

This works because `assign` returns a new `DataFrame` rather than modifying the original: you can create columns and modify data on the fly without ever adding them to the original dataset (the `crap_rating` column only exists for the duration of the chain!).
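
To check that claim on something small, here is a sketch with a made-up frame named `toy`; the column name `low_rating` is invented. After the chain runs, the original frame still has only its two original columns.

```python
import pandas as pd

# Toy ratings; titles and values are invented for illustration.
toy = pd.DataFrame({
    "Title": ["A", "A", "B", "B", "B", "C"],
    "Rating": [1, 2, 2, 5, 1, 4],
})

low_counts = (toy.assign(low_rating=lambda df: df.Rating <= 2)  # temporary column
                 .query("low_rating")                           # keep only low ratings
                 .groupby("Title")
                 .size()                                        # low ratings per title
                 .sort_values(ascending=False))

print(low_counts)
print(list(toy.columns))  # no `low_rating` column here
```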

The real data science method that we are going to explore is called **collaborative filtering**. We are going to see which of several methods is best at reconstructing movie ratings based on the viewing habits of other viewers.

We will only work with movies for which we have enough data. A movie with too few ratings is not going to work for us, because we can't make strong statements about how someone would rate a movie that few people have seen or rated.

Here is the pipeline we are going to work through for both questions:

1. Transform all non-numeric user/movie information into one-hot encoded columns across all individual ratings (like we have done before for genres)
2. Create useful aggregate feature columns from the ratings so that every unique movie in our database is now a single row
3. Attempt to cluster movies and analyze the clusters themselves.

First off, let's only use those movies that have been rated at least 100 times, and then draw a random sample of 100,000 of those ratings:

In [30]:
mostReviewedMoviesData = data.groupby("Title").filter(lambda x: x.shape[0]>=100)
mostReviewedMoviesData = mostReviewedMoviesData.loc[np.random.choice(mostReviewedMoviesData.index, size=100000, replace=False)]  # .ix was removed from pandas
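
To see what `groupby(...).filter(...)` keeps, here is a small sketch on a made-up frame with a threshold of 2 ratings instead of 100. The sampling step uses `DataFrame.sample`, which is a swapped-in alternative to the `np.random.choice` approach above; the sample size of 3 and the `random_state` are arbitrary choices for this illustration.

```python
import pandas as pd

# Toy ratings; titles and values are invented for illustration.
toy = pd.DataFrame({
    "Title": ["A", "A", "A", "B", "C", "C"],
    "Rating": [4, 5, 3, 2, 1, 5],
})

# Keep every row that belongs to a title with at least 2 ratings.
kept = toy.groupby("Title").filter(lambda g: len(g) >= 2)

# Draw rows at random without replacement.
sampled = kept.sample(n=3, random_state=0)

print(kept)
print(sampled)
```

Note that `filter` returns the original rows (not one row per group), so title "B" disappears entirely while every rating of "A" and "C" survives.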


Now we are going to split our total movie data into two sets, a **training set** and a **testing set**. The split is made per user: we will train on the ratings of roughly 60% of our users and hold out the ratings of the remaining users for testing our model.

The point of this splitting is we are going to build a model on one subset of our data, and then we are going to evaluate the quality of our collaborative filtering model on the remainder of the data.

We can evaluate a variety of approaches for recommending movies and then pick the approach that performs best on held-out data.

We can then use this best approach to recommend movies to a given user from among those they have not yet seen (that is, rated)!

In [31]:
def split_train_test(df, sample=0.4, testSetColumnName="testSet"):
    """Flag every rating made by roughly `sample` of the users as a test-set row."""
    userIds = df["UserID"].unique()
    # One random draw per user, so all of a user's ratings land in the same set
    testUsers = userIds[np.random.random(len(userIds)) < sample]
    df = df.copy()
    df[testSetColumnName] = df["UserID"].isin(testUsers)
    return df

In [32]:
labeledRatingsDataSplit = split_train_test(mostReviewedMoviesData)
movielens_train = labeledRatingsDataSplit[~labeledRatingsDataSplit.testSet]
movielens_test = labeledRatingsDataSplit[labeledRatingsDataSplit.testSet]


### Evaluation: performance criterion

- RMSE: $\sqrt{\frac{\sum(\hat y - y)^2}{n}}$

In order to evaluate the quality of our predictions, we are going to use the [root mean squared error](https://en.wikipedia.org/wiki/Root-mean-square_deviation) of the differences between our predicted ratings and the actual ratings, computed across all ratings in our **testing set**.

This approach to evaluating the quality of a machine learning model is very common.

We are going to create 2 functions: one to compute the RMSE given a sequence of actual and predicted ratings, and another to apply an estimator to every (user, movie) pair in the testing set and score the result:

In [33]:
def compute_rmse(y_pred, y_true):
    """ Compute Root Mean Squared Error. """
    return np.sqrt(np.mean(np.power(y_pred - y_true, 2)))

In [34]:
def evaluate(estimate_f):
    """ RMSE-based predictive performance evaluation with pandas. """
    ids_to_estimate = zip(movielens_test.UserID, movielens_test.MovieID)
    estimated = np.array([estimate_f(u,i) for (u,i) in ids_to_estimate])
    real = movielens_test.Rating.values
    return compute_rmse(estimated, real)
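
As a small worked example of the RMSE formula, with invented ratings: the errors below are -1, 0, -2 and 1, their squares are 1, 0, 4 and 1, the mean of the squares is 1.5, and the RMSE is the square root of 1.5. The function body is copied from the cell above so the sketch runs on its own.

```python
import numpy as np

def compute_rmse(y_pred, y_true):
    """ Compute Root Mean Squared Error. """
    return np.sqrt(np.mean(np.power(y_pred - y_true, 2)))

# Invented ratings for illustration.
y_true = np.array([5, 3, 4, 1])
y_pred = np.array([4, 3, 2, 2])

rmse = compute_rmse(y_pred, y_true)
print(rmse)  # square root of 1.5

# A perfect prediction has an RMSE of exactly zero.
print(compute_rmse(y_true, y_true))
```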

For our first model, we are going to use a really dumb approach. We are simply going to assume that every movie has the average rating across all movies in our **training dataset**.

In [35]:
avg_overall_rating = movielens_train.Rating.mean()
def estimate_overall_avg(user_id, item_id):
    """ The Answer is always the average rating across all movies in the training set. """
    return avg_overall_rating

print('RMSE for estimate_overall_avg: %s' % evaluate(estimate_overall_avg))

RMSE for estimate_overall_avg: 1.10472482824


Ok, so anything less than an RMSE of 1.1 would do better than a dumb model that assumes that every movie has an identical rating. That's our goal.

How about trying to estimate a given movie's rating for any new person seeing the film by using its average rating for all previous people that have seen the film:

In [36]:
avg_per_movie_rating = movielens_train.groupby("MovieID")["Rating"].mean()
def estimate_per_movie_avg(user_id, item_id):
    """ The Answer is always the average rating of the given movie in the training set. """
    # Fall back to the overall average if the movie never appears in the training set
    return avg_per_movie_rating.get(item_id, avg_overall_rating)

print('RMSE for estimate_per_movie_avg: %s' % evaluate(estimate_per_movie_avg))

RMSE for estimate_per_movie_avg: 0.998212360975


Well that seems a bit better!

Would the overall model be improved if we added the gender of the person as another piece of information? Here we return the average rating for the movie given the person's gender as well:

In [39]:
user_info = userData.set_index('UserID')
means_by_gender = movielens_train.pivot_table('Rating', index='MovieID', columns='Gender')
def estimate_per_gender_movie_avg(user_id, movie_id):
    """ Collaborative filtering using average rating per gender for each movie. """
    user_gender = user_info.loc[user_id, 'Gender']
    if user_gender in means_by_gender.columns: 
        # This cell is NaN when no one of this gender rated the movie in the training set
        return means_by_gender.loc[movie_id, user_gender]
    else:
        return avg_overall_rating
print('RMSE for estimate_per_gender_movie_avg: %s' % evaluate(estimate_per_gender_movie_avg))

RMSE for estimate_per_gender_movie_avg: nan


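The `nan` above appears because `means_by_gender` contains `NaN` wherever a movie has no training ratings from a given gender, and a single `NaN` estimate propagates through `np.mean` in `compute_rmse`. Once those missing cells are handled (for example, by falling back to the movie's overall mean, as the age model below does), this model barely improves on the error of the simple per-movie average model. Perhaps gender isn't that useful for recommending movies...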
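
One way to guard against missing cells in a table like `means_by_gender`, sketched on a made-up training frame. The fallback order used here (gender mean, then per-movie mean, then overall mean) and the function name `estimate` are assumptions of this sketch, not something the lesson prescribes.

```python
import numpy as np
import pandas as pd

# Toy training ratings; no woman has rated movie 2. Values are invented.
train = pd.DataFrame({
    "MovieID": [1, 1, 1, 2, 2],
    "Gender": ["F", "M", "M", "M", "M"],
    "Rating": [5, 3, 4, 2, 4],
})

means_by_gender = train.pivot_table("Rating", index="MovieID", columns="Gender")
avg_per_movie = train.groupby("MovieID")["Rating"].mean()
avg_overall = train.Rating.mean()

def estimate(gender, movie_id):
    """Gender mean if available, else per-movie mean, else overall mean."""
    if movie_id not in means_by_gender.index:
        return avg_overall
    value = means_by_gender.loc[movie_id, gender]
    if np.isnan(value):
        # No training ratings from this gender for this movie
        return avg_per_movie[movie_id]
    return value

print(means_by_gender)    # the (2, "F") cell is NaN
print(estimate("F", 1))   # gender mean exists
print(estimate("F", 2))   # falls back to the per-movie mean
print(estimate("M", 99))  # unknown movie, falls back to the overall mean
```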

In [40]:
means_by_age = movielens_train.pivot_table('Rating', index='MovieID', columns='Age')
def estimate_per_age_movie_avg(user_id, movie_id):
    """ Mean ratings by other users of the same age. """
    if movie_id not in means_by_age.index: 
        return avg_overall_rating
    user_age = user_info.loc[user_id, 'Age']
    if not np.isnan(means_by_age.loc[movie_id, user_age]):
        return means_by_age.loc[movie_id, user_age]
    else:
        # No training ratings from this age group: average over the other age groups
        return means_by_age.loc[movie_id].mean()
print('RMSE for estimate_per_age_movie_avg: %s' % evaluate(estimate_per_age_movie_avg))

RMSE for estimate_per_age_movie_avg: 1.06616725638


With an RMSE of about 1.07, this model actually performs worse than the simple per-movie average (about 1.00)! Is there anything else we can do to improve our predictions?

The last kind of model we are going to try involves weighting each person's rating for a given movie based on the similarity of that person's profile to the current person's profile. People whose rating patterns are more similar to the current person's will have their ratings weighted more heavily.

The predicted rating for the movie is then a weighted average of the ratings given by others, where the weight of each other person's rating is directly proportional to their similarity to the current person.

There are a variety of ways to evaluate similarity. Here are 3 common similarity metrics:
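
A tiny worked example of such a weighted average, with invented ratings and similarities: three other users rated a movie 5, 3 and 4, and their similarities to the current user are 0.9, 0.1 and 0.5. The prediction is (5 × 0.9 + 3 × 0.1 + 4 × 0.5) / (0.9 + 0.1 + 0.5) = 6.8 / 1.5, which sits closer to the most similar user's rating than the plain mean does.

```python
import numpy as np

# Invented values for illustration.
ratings = np.array([5.0, 3.0, 4.0])   # ratings by three other users
sims = np.array([0.9, 0.1, 0.5])      # their similarity to the current user

weighted = np.average(ratings, weights=sims)  # sum(r * w) / sum(w)
plain = ratings.mean()

print(weighted)  # 6.8 / 1.5
print(plain)     # unweighted mean, for comparison
```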

In [14]:
def euclidean(s1, s2):
    """Take two pd.Series objects and return their euclidean 'similarity'."""
    diff = s1 - s2
    return 1 / (1 + np.sqrt(np.sum(diff ** 2)))
def pearson(s1, s2):
    """Take two pd.Series objects and return a pearson correlation."""
    s1_c = s1 - s1.mean()
    s2_c = s2 - s2.mean()
    return np.sum(s1_c * s2_c) / np.sqrt(np.sum(s1_c ** 2) * np.sum(s2_c ** 2))
def cosine(s1, s2):
    """Take two pd.Series objects and return their cosine similarity."""
    return np.sum(s1 * s2) / np.sqrt(np.sum(s1 ** 2) * np.sum(s2 ** 2))
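
To get a feel for these metrics, here is a sketch that applies them to two made-up rating profiles with no missing values. The function bodies are copied from the cell above so the block runs on its own; on complete data, `pearson` should agree with NumPy's `np.corrcoef`.

```python
import numpy as np
import pandas as pd

def euclidean(s1, s2):
    """Take two pd.Series objects and return their euclidean 'similarity'."""
    diff = s1 - s2
    return 1 / (1 + np.sqrt(np.sum(diff ** 2)))

def pearson(s1, s2):
    """Take two pd.Series objects and return a pearson correlation."""
    s1_c = s1 - s1.mean()
    s2_c = s2 - s2.mean()
    return np.sum(s1_c * s2_c) / np.sqrt(np.sum(s1_c ** 2) * np.sum(s2_c ** 2))

def cosine(s1, s2):
    """Take two pd.Series objects and return their cosine similarity."""
    return np.sum(s1 * s2) / np.sqrt(np.sum(s1 ** 2) * np.sum(s2 ** 2))

# Two invented profiles: ratings of the same four movies by two users.
a = pd.Series([5.0, 3.0, 4.0, 1.0])
b = pd.Series([4.0, 2.0, 5.0, 1.0])

print("euclidean:", euclidean(a, b))
print("pearson:  ", pearson(a, b))
print("cosine:   ", cosine(a, b))
print("numpy r:  ", np.corrcoef(a, b)[0, 1])  # cross-check for pearson
```

A profile compared with itself gives the maximum value of 1 under all three metrics, which is a quick sanity check when swapping one metric for another.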

In [42]:
user_training_profiles = movielens_train.pivot_table('Rating', index='MovieID', columns='UserID')
all_profiles = labeledRatingsDataSplit.pivot_table('Rating', index='MovieID', columns='UserID')

In [None]:
count = 0  # progress counter; must exist before the first call
def estimate_similarity_weighted_avg(user_id, movie_id):
    """ Ratings weighted by correlation similarity. """
    ratings_by_others = movielens_train[movielens_train.MovieID == movie_id]
    if ratings_by_others.empty: 
        return avg_overall_rating
    # set_index returns a new frame, so we never modify a slice of movielens_train in place
    ratings_by_others = ratings_by_others.set_index('UserID')
    their_ids = ratings_by_others.index
    their_ratings = ratings_by_others.Rating
    their_profiles = user_training_profiles[their_ids]
    user_profile = all_profiles[user_id]
    # Similarity between each other user's profile (one column each) and this user's profile
    sims = their_profiles.apply(lambda profile: pearson(profile, user_profile), axis=0)
    ratings_sims = pd.DataFrame({'sim': sims, 'Rating': their_ratings})
    # Keep only positively correlated users
    ratings_sims = ratings_sims[ratings_sims.sim > 0]
    global count
    count = count + 1
    if count % 100 == 0:
        print(count)
    if ratings_sims.empty:
        return their_ratings.mean()
    else:
        return np.average(ratings_sims.Rating, weights=ratings_sims.sim)
print('RMSE for estimate_similarity_weighted_avg: %s' % evaluate(estimate_similarity_weighted_avg))

400
500
600
700
800
900
1000
1100
1200
1300
1400
1500
1600
1700

In [14]:
labeledRatingsDataSplit.testSet.value_counts()

False    558101
True     384124
dtype: int64

In [40]:
statsByGender = mostReviewedMoviesData.pivot_table("Rating", index="Title", columns="Gender", aggfunc=["mean", "std"]) #get per-gender avg, std of ratings per movie

In [48]:
statsByGender["meanDifference"] = statsByGender[("mean", "F")] - statsByGender[("mean", "M")] # get diff in mean rating between genders
# With two column levels, the new column's full label is ("meanDifference", "")
statsByGender = statsByGender.sort_values(("meanDifference", ""), ascending=False)
print("Movies women tended to like more than men: \n", statsByGender[["meanDifference"]].head(), "\n")
print("Movies men tended to like more than women: \n", statsByGender[["meanDifference"]][::-1].head(), "\n")
statsByGender = statsByGender.sort_values(("std", "F"), ascending=False)
print("Movies women tended to disagree on: \n", statsByGender[[("std","F")]].head(), "\n")
print("Movies women tended to agree on: \n", statsByGender[[("std","F")]][::-1].head(), "\n")

Movies women tended to like more than men: 
                        meanDifference
Gender                                
Title                                 
Pet Sematary II (1992)        0.974638
Cutthroat Island (1995)       0.858730
Dirty Dancing (1987)          0.830782
Air Bud (1997)                0.823377
Home Alone 3 (1997)           0.802726 

Movies men tended to like more than women: 
                                               meanDifference
Gender                                                       
Title                                                        
Friday the 13th Part V: A New Beginning (1985)      -0.892321
Friday the 13th Part VI: Jason Lives (1986)         -0.791667
Lifeforce (1985)                                    -0.744152
Marked for Death (1990)                             -0.737607
Quest for Fire (1981)                               -0.730730 

Movies women tended to disagree on: 
                                                 std
Gender    