In this lesson we are going to learn how to perform a series of transformations in **pandas** quickly and concisely. Recent versions of **pandas** include a lot of very useful functionality, and I want to expose all of you to it.

So, let's get started.

In [1]:
import pandas as pd
import numpy as np
%matplotlib inline
print(pd.__version__)  # show which pandas version this notebook was run with

0.16.2


In the next few steps, I am going to load all of the movie data into a single `DataFrame` object so that we can play with everything the data has to offer (every rating, the user who made it, the movie name, its genres, etc.).

I am also going to convert the genres in the movie data into a usable format (one 0/1 column per genre) so we can search over genre types quickly.

In [146]:
ratingData = pd.read_csv("../../data/movieData/ratings.dat",sep = "::",names = ['UserID','MovieID','Rating','Timestamp'])
movieData = pd.read_table("../../data/movieData/movies.dat",sep="::", names = ["MovieID","Title","Genres"])
userData = pd.read_table("../../data/movieData/users.dat", sep="::", names = ["UserID","Gender","Age","Occupation","Zip-code"])

As before, we first load our three data files and name their columns.

In [7]:
ratingData.Timestamp = pd.to_datetime(ratingData.Timestamp, unit="s")
movieData = pd.concat([movieData,movieData.Genres.str.get_dummies(sep = "|")],axis=1)
data = userData.merge(ratingData.merge(movieData))
del data["Genres"]

Next, we format them and merge everything into a single mega `DataFrame` object that we are just going to call `data`.
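To make the two key steps concrete, here is a minimal sketch on made-up miniature tables (the values and the names `movies`, `ratings`, `users` are hypothetical, not the real files). It shows `str.get_dummies` turning a pipe-delimited genre string into 0/1 columns, and `merge` joining on whichever columns the two frames share.

```python
import pandas as pd

# Hypothetical miniature versions of the three tables (values are made up).
movies = pd.DataFrame({
    "MovieID": [1, 2],
    "Title": ["Movie A (1995)", "Movie B (1964)"],
    "Genres": ["Animation|Comedy", "Musical|Romance"],
})
ratings = pd.DataFrame({
    "UserID": [10, 10, 11],
    "MovieID": [1, 2, 1],
    "Rating": [5, 3, 4],
})
users = pd.DataFrame({"UserID": [10, 11], "Gender": ["F", "M"]})

# Split the pipe-delimited genre string into one 0/1 column per genre.
genre_flags = movies.Genres.str.get_dummies(sep="|")
movies_wide = pd.concat([movies, genre_flags], axis=1)

# With no `on=` argument, merge() joins on the columns both frames share:
# MovieID for ratings/movies, then UserID for users/ratings.
merged = users.merge(ratings.merge(movies_wide))
print(merged[["UserID", "Gender", "Title", "Rating", "Comedy", "Musical"]])
```

Each row of `merged` is one rating, carrying the user's and the movie's information along with it.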

Let's take a look at the first few rows of `data`:

In [8]:
data.head()

Unnamed: 0,UserID,Gender,Age,Occupation,Zip-code,MovieID,Rating,Timestamp,Title,Genres,...,Fantasy,Film-Noir,Horror,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
0,1,F,1,10,48067,1193,5,2000-12-31 22:12:40,One Flew Over the Cuckoo's Nest (1975),Drama,...,0,0,0,0,0,0,0,0,0,0
1,1,F,1,10,48067,661,3,2000-12-31 22:35:09,James and the Giant Peach (1996),Animation|Children's|Musical,...,0,0,0,1,0,0,0,0,0,0
2,1,F,1,10,48067,914,3,2000-12-31 22:32:48,My Fair Lady (1964),Musical|Romance,...,0,0,0,1,0,1,0,0,0,0
3,1,F,1,10,48067,3408,4,2000-12-31 22:04:35,Erin Brockovich (2000),Drama,...,0,0,0,0,0,0,0,0,0,0
4,1,F,1,10,48067,2355,5,2001-01-06 23:38:11,"Bug's Life, A (1998)",Animation|Children's|Comedy,...,0,0,0,0,0,0,0,0,0,0


Here is the first fast data-manipulation trick I will teach you:

**You can use the `assign` method on `DataFrame` objects to easily create new columns that are transformations of other columns or combinations of columns**

All you have to do is pass the name of the column you want to create as a keyword argument to `assign`. The value can be either an already-computed column (such as a Boolean expression) or a function (for example, an anonymous `lambda`) that takes the `DataFrame` and returns the new column.

Here is how you would create a new `Boolean` column called `high_rating` that is set to `True` only when the `Rating` is 4 or greater:

In [24]:
data = data.assign(high_rating = data.Rating >= 4)

This is useful because you can now pass any function you want and create any kind of new column.
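Here is a small sketch of the two ways of passing a value to `assign`, on a made-up frame called `toy` (the column `low_and_not_high` is just an illustrative name). The lambda form receives the frame that `assign` is being called on, which matters once you start chaining.

```python
import pandas as pd

# Toy ratings (made-up values) standing in for the full `data` frame.
toy = pd.DataFrame({"Rating": [5, 2, 4, 1]})

# Passing a precomputed expression, as in the cell above.
with_flag = toy.assign(high_rating=toy.Rating >= 4)

# Passing a callable: pandas calls it with the DataFrame being assigned to,
# so it can see columns (like high_rating) that `toy` itself does not have.
with_both = with_flag.assign(
    low_and_not_high=lambda df: (df.Rating <= 2) & ~df.high_rating
)
print(with_both)
```

Note that `assign` returns a new `DataFrame`; `toy` itself is left without the extra columns.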

Try it yourself:

* Create a Boolean column called `morning_rating` that is `True` when the `Timestamp` of the rating is before noon.
* Create a Boolean column called `high_morning_rating` that is `True` when both `morning_rating` and `high_rating` are `True`.

In [26]:
##YOUR CODE HERE
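If you get stuck, here is one possible approach, sketched on a made-up three-row frame rather than the real `data`. It assumes `Timestamp` has already been converted to datetimes, and uses separate `assign` calls so that each new column can build on the previous one.

```python
import pandas as pd

# Made-up ratings with already-parsed timestamps.
toy = pd.DataFrame({
    "Rating": [5, 3, 4],
    "Timestamp": pd.to_datetime(
        ["2000-12-31 05:53:01", "2000-12-31 22:35:09", "2000-12-31 13:00:00"]
    ),
})

toy = toy.assign(high_rating=toy.Rating >= 4)
# .dt.hour is the hour of day (0-23), so "before noon" means hour < 12.
toy = toy.assign(morning_rating=toy.Timestamp.dt.hour < 12)
# Both conditions must hold for the combined flag.
toy = toy.assign(high_morning_rating=toy.morning_rating & toy.high_rating)
print(toy)
```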

Here is another incredibly useful feature in pandas:

**Use the `query` method to immediately return all of the rows that satisfy a given selection statement, written in something very close to plain English**

If the column you are using for the `query` stores `Boolean` values (`True`/`False`) then a simple call passing that column returns only rows with `True`:

In [28]:
data.query("morning_rating")

Unnamed: 0,UserID,Gender,Age,Occupation,Zip-code,MovieID,Rating,Timestamp,Title,Genres,...,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western,high_rating,morning_rating,high_morning_rating
254,5,M,25,20,55455,3408,3,2000-12-31 05:58:43,Erin Brockovich (2000),Drama,...,0,0,0,0,0,0,0,False,True,False
255,5,M,25,20,55455,2355,5,2000-12-31 05:53:01,"Bug's Life, A (1998)",Animation|Children's|Comedy,...,0,0,0,0,0,0,0,True,True,True
256,5,M,25,20,55455,919,4,2000-12-31 05:37:52,"Wizard of Oz, The (1939)",Adventure|Children's|Drama|Musical,...,1,0,0,0,0,0,0,True,True,True
257,5,M,25,20,55455,3105,2,2000-12-31 07:09:36,Awakenings (1990),Drama,...,0,0,0,0,0,0,0,False,True,False
258,5,M,25,20,55455,1721,1,2000-12-31 06:56:03,Titanic (1997),Drama|Romance,...,0,0,1,0,0,0,0,False,True,False
259,5,M,25,20,55455,2762,3,2000-12-31 06:10:54,"Sixth Sense, The (1999)",Thriller,...,0,0,0,0,1,0,0,False,True,False
260,5,M,25,20,55455,150,2,2000-12-31 06:56:03,Apollo 13 (1995),Drama,...,0,0,0,0,0,0,0,False,True,False
261,5,M,25,20,55455,2692,4,2000-12-31 06:09:37,Run Lola Run (Lola rennt) (1998),Action|Crime|Romance,...,0,0,1,0,0,0,0,True,True,True
262,5,M,25,20,55455,2028,2,2000-12-31 06:27:33,Saving Private Ryan (1998),Action|Drama|War,...,0,0,0,0,0,1,0,False,True,False
263,5,M,25,20,55455,608,4,2000-12-31 06:29:37,Fargo (1996),Crime|Drama|Thriller,...,0,0,0,0,1,0,0,True,True,True


You can pass even more complicated near-English statements; just make sure the whole expression you pass is a single `string`.

So if we wanted to know all of the movies that writers (`Occupation` = 20) rated highly in the morning, we could simply `query` as follows: 

In [31]:
data.query("Occupation == 20 & high_morning_rating")

Unnamed: 0,UserID,Gender,Age,Occupation,Zip-code,MovieID,Rating,Timestamp,Title,Genres,...,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western,high_rating,morning_rating,high_morning_rating
255,5,M,25,20,55455,2355,5,2000-12-31 05:53:01,"Bug's Life, A (1998)",Animation|Children's|Comedy,...,0,0,0,0,0,0,0,True,True,True
256,5,M,25,20,55455,919,4,2000-12-31 05:37:52,"Wizard of Oz, The (1939)",Adventure|Children's|Drama|Musical,...,1,0,0,0,0,0,0,True,True,True
261,5,M,25,20,55455,2692,4,2000-12-31 06:09:37,Run Lola Run (Lola rennt) (1998),Action|Crime|Romance,...,0,0,1,0,0,0,0,True,True,True
263,5,M,25,20,55455,608,4,2000-12-31 06:29:37,Fargo (1996),Crime|Drama|Thriller,...,0,0,0,0,1,0,0,True,True,True
266,5,M,25,20,55455,1213,5,2000-12-31 06:29:37,GoodFellas (1990),Crime|Drama,...,0,0,0,0,0,0,0,True,True,True
268,5,M,25,20,55455,1610,4,2000-12-31 06:54:05,"Hunt for Red October, The (1990)",Action|Thriller,...,0,0,0,0,1,0,0,True,True,True
269,5,M,25,20,55455,2858,4,2000-12-31 05:43:10,American Beauty (1999),Comedy|Drama,...,0,0,0,0,0,0,0,True,True,True
270,5,M,25,20,55455,515,4,2000-12-31 06:58:11,"Remains of the Day, The (1993)",Drama,...,0,0,0,0,0,0,0,True,True,True
273,5,M,25,20,55455,2427,5,2000-12-31 07:07:30,"Thin Red Line, The (1998)",Action|Drama|War,...,0,0,0,0,0,1,0,True,True,True
274,5,M,25,20,55455,593,4,2000-12-31 06:29:37,"Silence of the Lambs, The (1991)",Drama|Thriller,...,0,0,0,0,1,0,0,True,True,True
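As a small sketch of what the query string can contain, the example below (on made-up rows; `writer_code` is a hypothetical variable name) references a local Python variable with `@` and compares the result against the equivalent Boolean-mask selection.

```python
import pandas as pd

# Made-up rows with the same column names used above.
toy = pd.DataFrame({
    "Occupation": [20, 20, 7, 20],
    "Rating": [5, 2, 4, 4],
    "high_morning_rating": [True, False, True, False],
})

writer_code = 20  # a local Python variable, referenced in the query with @

# Conditions can be combined with `and` (or `&`); a bare Boolean column
# acts as a condition on its own.
via_query = toy.query("Occupation == @writer_code and high_morning_rating")

# The equivalent Boolean-mask version, for comparison.
via_mask = toy[(toy.Occupation == writer_code) & toy.high_morning_rating]

print(via_query)
```

Both selections return the same rows; `query` is mostly a readability win.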


The real power of `query` and `assign` shows when you chain them together to answer a seemingly complicated question very quickly:

In [47]:
crapMovieCounts = (data.assign(crap_rating=data["Rating"]<=2)
                       .query("crap_rating")
                       .groupby("Title")
                       .size())
crapMovieCounts.sort(ascending=False,inplace=True)
crapMovieCounts.head()

Title
Wild Wild West (1999)                               566
Star Wars: Episode I - The Phantom Menace (1999)    467
Blair Witch Project, The (1999)                     434
Mars Attacks! (1996)                                403
Arachnophobia (1990)                                382
dtype: int64

Another benefit is that you can create columns and modify data on the fly without ever changing the original dataset (the `crap_rating` column only exists for the duration of the chain!).
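The sketch below checks that claim on a made-up frame: after the chain runs, the temporary column is nowhere to be found in the original. The column name `low_rating` is hypothetical, and the sketch uses `sort_values`, the spelling on recent pandas versions, rather than the in-place `sort` used in the cell above.

```python
import pandas as pd

# Made-up ratings for three titles.
toy = pd.DataFrame({
    "Title": ["A", "A", "B", "C"],
    "Rating": [1, 2, 5, 2],
})

counts = (toy.assign(low_rating=lambda df: df.Rating <= 2)  # temporary column
             .query("low_rating")                           # keep only low ratings
             .groupby("Title")
             .size()                                        # low ratings per title
             .sort_values(ascending=False))

print(counts)
print(toy.columns.tolist())  # the original frame has no low_rating column
```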

Now we are actually going to start doing some data science.

The first real data science methods that we are going to explore are **clustering** (a form of unsupervised learning) and **classification** (a form of supervised learning). We are going to ask two questions:

1. Can we meaningfully cluster the movies in our dataset? If we can, it may give us an idea of how to better offer movies for watching to others.
2. Can we successfully predict whether someone is male or female, given their scoring and movie watching history? If we can, this would suggest that men and women have distinct viewing habits/tastes/etc.

We are only going to work with those movies for which we have enough data. A movie with too few ratings is not going to work for us because we can't make very strong statements about movies that few people have seen or rated.

Here is the pipeline we are going to work through for both questions:

1. Transform all non-numeric user/movie information into one-hot encoded columns across all individual ratings (like we have done before for genres)
2. Create useful aggregate feature columns from the ratings so that every unique movie in our database is now a single row
3. Attempt to cluster movies and analyze the clusters themselves.

First off, let's only use those movies that have been rated at least 100 times:

In [51]:
mostReviewedMoviesData = data.groupby("Title").filter(lambda x: x.shape[0]>=100)

Unnamed: 0,UserID,Gender,Age,Occupation,Zip-code,MovieID,Rating,Timestamp,Title,Genres,...,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western,high_rating,morning_rating,high_morning_rating
0,1,F,1,10,48067,1193,5,2000-12-31 22:12:40,One Flew Over the Cuckoo's Nest (1975),Drama,...,0,0,0,0,0,0,0,True,False,False
1,1,F,1,10,48067,661,3,2000-12-31 22:35:09,James and the Giant Peach (1996),Animation|Children's|Musical,...,1,0,0,0,0,0,0,False,False,False
2,1,F,1,10,48067,914,3,2000-12-31 22:32:48,My Fair Lady (1964),Musical|Romance,...,1,0,1,0,0,0,0,False,False,False
3,1,F,1,10,48067,3408,4,2000-12-31 22:04:35,Erin Brockovich (2000),Drama,...,0,0,0,0,0,0,0,True,False,False
4,1,F,1,10,48067,2355,5,2001-01-06 23:38:11,"Bug's Life, A (1998)",Animation|Children's|Comedy,...,0,0,0,0,0,0,0,True,False,False
5,1,F,1,10,48067,1197,3,2000-12-31 22:37:48,"Princess Bride, The (1987)",Action|Adventure|Comedy|Romance,...,0,0,1,0,0,0,0,False,False,False
6,1,F,1,10,48067,1287,5,2000-12-31 22:33:59,Ben-Hur (1959),Action|Adventure|Drama,...,0,0,0,0,0,0,0,True,False,False
7,1,F,1,10,48067,2804,5,2000-12-31 22:11:59,"Christmas Story, A (1983)",Comedy|Drama,...,0,0,0,0,0,0,0,True,False,False
8,1,F,1,10,48067,594,4,2000-12-31 22:37:48,Snow White and the Seven Dwarfs (1937),Animation|Children's|Musical,...,1,0,0,0,0,0,0,True,False,False
9,1,F,1,10,48067,919,4,2000-12-31 22:22:48,"Wizard of Oz, The (1939)",Adventure|Children's|Drama|Musical,...,1,0,0,0,0,0,0,True,False,False
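To see what `groupby(...).filter(...)` is doing, here is a minimal sketch on made-up data with a hypothetical threshold of 2 instead of 100. The function is called once per group, and every row of a group is kept or dropped together. The second version, using `transform("size")`, is an alternative I am suggesting here, not something from the cell above.

```python
import pandas as pd

# Made-up ratings: title A has 3 ratings, B has 1, C has 2.
toy = pd.DataFrame({
    "Title": ["A", "A", "A", "B", "C", "C"],
    "Rating": [5, 4, 3, 2, 1, 5],
})

min_ratings = 2  # hypothetical threshold for this toy example

# filter() keeps all rows of every group for which the function returns True.
kept = toy.groupby("Title").filter(lambda g: len(g) >= min_ratings)

# Alternative: attach each row's group size, then use an ordinary mask.
sizes = toy.groupby("Title")["Rating"].transform("size")
kept_via_mask = toy[sizes >= min_ratings]

print(kept)
```

On large data the mask version tends to be faster because it avoids calling a Python function per group, though I have not timed it on this dataset.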


Let's start with our first task:

1. Transform all non-numeric user/movie information into one-hot encoded columns across all individual ratings (like we have done before for genres)

What columns currently in our dataset require conversion into a proto-numeric format?

What columns do we think are useful information to try to predict whether a given movie would be good/bad?

These are the kinds of questions you should be asking yourself as you try to tackle this problem.

In [52]:
mostReviewedMoviesData.columns

Index([u'UserID', u'Gender', u'Age', u'Occupation', u'Zip-code', u'MovieID',
       u'Rating', u'Timestamp', u'Title', u'Genres', u'Action', u'Adventure',
       u'Animation', u'Children's', u'Comedy', u'Crime', u'Documentary',
       u'Drama', u'Fantasy', u'Film-Noir', u'Horror', u'Musical', u'Mystery',
       u'Romance', u'Sci-Fi', u'Thriller', u'War', u'Western', u'high_rating',
       u'morning_rating', u'high_morning_rating'],
      dtype='object')

It looks like we may want to transform the following columns:

* `Gender`
* `Age`
* `Occupation`
* `Timestamp`
* `Title` (since we have access to the year a movie was made, maybe that is a useful feature to work with)

We may also want to get rid of anything we don't currently think is useful here, or any categorical data that has too many dimensions to be useful.

On a per-movie basis, what kind of aggregate stats would be useful?

I suggest that we get rid of the `Zip-code` column, as it has too many distinct values: there are 3439 distinct zip codes in the dataset, and if we were to one-hot encode them we would have too many columns for the number of rows we are looking at. This is something we call the **curse of dimensionality** in machine learning.

In [158]:
#del mostReviewedMoviesData["Zip-code"]
mostReviewedMovieList = mostReviewedMoviesData.MovieID.unique()
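A quick way to spot columns like `Zip-code` is to count distinct values per column before encoding anything. The sketch below does this on made-up rows; the cutoff `max_levels` is a hypothetical number chosen only for the toy example.

```python
import pandas as pd

# Made-up user rows.
toy = pd.DataFrame({
    "Gender": ["F", "M", "M", "F"],
    "Zip-code": ["48067", "55455", "70072", "02460"],
    "Age": [1, 25, 25, 56],
})

# Number of distinct values in each column.
cardinality = toy.nunique()
print(cardinality)

max_levels = 3  # hypothetical cutoff for "too many categories"
too_wide = cardinality[cardinality > max_levels].index.tolist()
print(too_wide)  # columns that would produce too many one-hot columns
```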

Now, let's transform all of our initial data found in `userData` and `movieData` into proto-numeric formats using a combination of `get_dummies` and `assign` (or any other methods you think are useful here).

We will use `get_dummies` in those cases where the given column has more than 2 possible categorical values (`Age` and `Occupation`) and `assign` in those cases where a given column only has 2 categories (`Gender`).

In [82]:
userDataTransformed = userData.assign(is_male=userData.Gender=="M")
userDataTransformed.is_male = userDataTransformed.is_male.astype(int)
del userDataTransformed["Gender"]
userDataTransformed = pd.concat([userDataTransformed,pd.get_dummies(userDataTransformed.Age, prefix="age")],axis=1)
del userDataTransformed["Age"]
userDataTransformed = pd.concat([userDataTransformed,pd.get_dummies(userDataTransformed.Occupation, prefix="occ_")],axis=1)
del userDataTransformed["Occupation"]
del userDataTransformed["Zip-code"]

Now that we have transformed our `userData` as necessary, let's take a look at it:

In [84]:
userDataTransformed.head()

Unnamed: 0,UserID,is_male,age_1,age_18,age_25,age_35,age_45,age_50,age_56,occ__0,...,occ__11,occ__12,occ__13,occ__14,occ__15,occ__16,occ__17,occ__18,occ__19,occ__20
0,1,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,2,1,0,0,0,0,0,0,1,0,...,0,0,0,0,0,1,0,0,0,0
2,3,1,0,0,1,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
3,4,1,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,5,1,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
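For comparison, here is a more compact sketch of the same kind of encoding on made-up users. It assumes a reasonably recent pandas, where `pd.get_dummies` accepts `columns=`, a list of prefixes and `dtype=`, and where `drop(columns=...)` is available. I use the prefix `occ` here, so the column names differ slightly from the `occ__` names above.

```python
import pandas as pd

# Made-up users with the same column names as userData (minus Zip-code).
users = pd.DataFrame({
    "UserID": [1, 2, 3],
    "Gender": ["F", "M", "M"],
    "Age": [1, 56, 25],
    "Occupation": [10, 16, 15],
})

# Two categories: a single 0/1 column is enough.
encoded = (users.assign(is_male=(users.Gender == "M").astype(int))
                .drop(columns=["Gender"]))

# More than two categories: one 0/1 column per distinct value.
encoded = pd.get_dummies(
    encoded,
    columns=["Age", "Occupation"],
    prefix=["age", "occ"],
    dtype=int,
)
print(encoded)
```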


Let's do the same kind of processing with `movieData`:

* Extract all of the genres a movie belongs to into separate columns
* Extract the `year` of the movie and convert it into a decade

(You should be familiar with this as we have done this before). 

In [116]:
movieDataTransformed = movieData.assign(year = movieData.Title.str.slice(-5,-1))
movieDataTransformed.Title = movieDataTransformed.Title.str.slice(0,-7)
movieDataTransformed = pd.concat([movieDataTransformed, movieDataTransformed.Genres.str.get_dummies(sep="|")],axis=1)
del movieDataTransformed["Genres"]
movieDataTransformed.head()
movieDataTransformed = pd.concat([movieDataTransformed,pd.get_dummies(pd.cut(movieDataTransformed.year.astype(int),np.arange(1910,2010,10)))],axis=1)
del movieDataTransformed["year"]
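One detail worth seeing in isolation is how `pd.cut` assigns years to decades. In this sketch (made-up list of titles, and a regular expression in place of the slicing above) the bins are closed on the right, so a film from 1990 lands in `(1980, 1990]`, not `(1990, 2000]`.

```python
import numpy as np
import pandas as pd

titles = pd.Series(["Toy Story (1995)", "GoodFellas (1990)", "Ben-Hur (1959)"])

# Pull the four-digit year out of the trailing "(YYYY)".
year = titles.str.extract(r"\((\d{4})\)$", expand=False).astype(int)

# Bin edges 1910, 1920, ..., 2000, as in the cell above.
bins = np.arange(1910, 2010, 10)
decade = pd.cut(year, bins)
print(decade)

# One 0/1 column per decade bin.
decade_flags = pd.get_dummies(decade)
print(decade_flags.sum(axis=1).tolist())
```

Any year outside the bin edges would come back as missing and get all-zero decade columns, so it is worth checking the range of years before relying on this.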

Now that we have finished with both `movieData` and `userData`, let's properly convert timestamps and turn them into 2 additional features that may be useful:

* A column called `is_weekend` that tells us whether the rating happened on the weekend
* A transformation of each rating timestamp into the quarter of the day (6 hour period) in which the rating occurred

In [148]:
ratingDataTransformed = ratingData.assign(Timestamp=pd.to_datetime(ratingData.Timestamp,unit='s'))
ratingDataTransformed = ratingDataTransformed.assign(is_weekend = ratingDataTransformed.Timestamp.dt.dayofweek>=5)
ratingDataTransformed.is_weekend = ratingDataTransformed.is_weekend.astype(int)
ratingDataTransformed = pd.concat([ratingDataTransformed,pd.get_dummies(pd.cut(ratingDataTransformed.Timestamp.dt.hour + 1,np.arange(0,30,6)))], axis=1)
ratingDataTransformed.head()

Unnamed: 0,UserID,MovieID,Rating,Timestamp,is_weekend,"(0, 6]","(6, 12]","(12, 18]","(18, 24]"
0,1,1193,5,2000-12-31 22:12:40,1,0,0,0,1
1,1,661,3,2000-12-31 22:35:09,1,0,0,0,1
2,1,914,3,2000-12-31 22:32:48,1,0,0,0,1
3,1,3408,4,2000-12-31 22:04:35,1,0,0,0,1
4,1,2355,5,2001-01-06 23:38:11,1,0,0,0,1
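Two details in that cell are easy to miss, so here is a small sketch on three made-up timestamps. `dt.dayofweek` counts from Monday = 0, so 5 and 6 are Saturday and Sunday. The `+ 1` on the hour is there because `pd.cut` bins exclude their left edge: without it, ratings made in hour 0 would fall outside `(0, 6]`.

```python
import numpy as np
import pandas as pd

ts = pd.Series(pd.to_datetime([
    "2000-12-31 22:12:40",  # a Sunday evening
    "2001-01-03 05:30:00",  # a Wednesday, early morning
    "2001-01-06 00:00:00",  # a Saturday, exactly midnight
]))

# Monday=0 ... Sunday=6, so >= 5 means Saturday or Sunday.
is_weekend = (ts.dt.dayofweek >= 5).astype(int)

# Shift hours from 0-23 to 1-24 so that every hour falls in some bin.
quarter = pd.cut(ts.dt.hour + 1, np.arange(0, 30, 6))

# Without the shift, the midnight rating is not assigned to any bin.
quarter_unshifted = pd.cut(ts.dt.hour, np.arange(0, 30, 6))

print(is_weekend.tolist())
print(quarter)
print(quarter_unshifted.isna().tolist())
```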


Now let's combine the transformed ratings and movie and user info into one very large `DataFrame` and remove unnecessary columns:

In [149]:
allTransformedData = userDataTransformed.merge(ratingDataTransformed.merge(movieDataTransformed))

In [151]:
allTransformedData.columns

Index([u'UserID', u'is_male', u'age_1', u'age_18', u'age_25', u'age_35',
       u'age_45', u'age_50', u'age_56', u'occ__0', u'occ__1', u'occ__2',
       u'occ__3', u'occ__4', u'occ__5', u'occ__6', u'occ__7', u'occ__8',
       u'occ__9', u'occ__10', u'occ__11', u'occ__12', u'occ__13', u'occ__14',
       u'occ__15', u'occ__16', u'occ__17', u'occ__18', u'occ__19', u'occ__20',
       u'MovieID', u'Rating', u'Timestamp', u'is_weekend', u'(0, 6]',
       u'(6, 12]', u'(12, 18]', u'(18, 24]', u'Title', u'Action', u'Adventure',
       u'Animation', u'Children's', u'Comedy', u'Crime', u'Documentary',
       u'Drama', u'Fantasy', u'Film-Noir', u'Horror', u'Musical', u'Mystery',
       u'Romance', u'Sci-Fi', u'Thriller', u'War', u'Western', u'(1910, 1920]',
       u'(1920, 1930]', u'(1930, 1940]', u'(1940, 1950]', u'(1950, 1960]',
       u'(1960, 1970]', u'(1970, 1980]', u'(1980, 1990]', u'(1990, 2000]'],
      dtype='object')

In [152]:
del allTransformedData["Title"], allTransformedData["Timestamp"], allTransformedData["UserID"]

Now comes the very challenging part: how do we transform the data so that we keep aggregate per-movie statistics that we think would be useful for clustering the movies?

Here are some ideas for transformations (we can think of more together):

* Get each movie's avg./std. overall rating
* Get each movie's age/gender/occupation breakdown
* Get each movie's avg./std. male/female rating


Let's try to make each one of them happen:

In [162]:
# mean and standard deviation of the ratings each movie received
avg_std_movie_rating = allTransformedData.groupby("MovieID")["Rating"].agg([np.mean,np.std])

In [169]:
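# sum/size is the mean; for a 0/1 indicator column that is the fraction of a movie's ratings in that category
allTransformedData.groupby("MovieID").agg(lambda x: float(np.sum(x))/float(np.size(x)))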

Unnamed: 0_level_0,is_male,age_1,age_18,age_25,age_35,age_45,age_50,age_56,occ__0,occ__1,...,Western,"(1910, 1920]","(1920, 1930]","(1930, 1940]","(1940, 1950]","(1950, 1960]","(1960, 1970]","(1970, 1980]","(1980, 1990]","(1990, 2000]"
MovieID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,0.715455,0.053924,0.215696,0.380356,0.203659,0.068849,0.051998,0.025518,0.129995,0.075590,...,0,0,0,0,0,0,0,0,0,1
2,0.748930,0.051355,0.219686,0.372325,0.203994,0.074180,0.064194,0.014265,0.148359,0.075606,...,0,0,0,0,0,0,0,0,0,1
3,0.715481,0.037657,0.223849,0.368201,0.169456,0.069038,0.069038,0.062762,0.140167,0.077406,...,0,0,0,0,0,0,0,0,0,1
4,0.500000,0.017647,0.229412,0.452941,0.164706,0.070588,0.041176,0.023529,0.164706,0.123529,...,0,0,0,0,0,0,0,0,0,1
5,0.635135,0.050676,0.219595,0.395270,0.195946,0.060811,0.067568,0.010135,0.125000,0.087838,...,0,0,0,0,0,0,0,0,0,1
6,0.862766,0.022340,0.235106,0.442553,0.168085,0.047872,0.055319,0.028723,0.129787,0.060638,...,0,0,0,0,0,0,0,0,0,1
7,0.554585,0.030568,0.161572,0.393013,0.209607,0.089520,0.067686,0.048035,0.131004,0.082969,...,0,0,0,0,0,0,0,0,0,1
8,0.588235,0.176471,0.367647,0.191176,0.147059,0.044118,0.044118,0.029412,0.176471,0.058824,...,0,0,0,0,0,0,0,0,0,1
9,0.901961,0.029412,0.303922,0.401961,0.156863,0.058824,0.029412,0.019608,0.225490,0.058824,...,0,0,0,0,0,0,0,0,0,1
10,0.849099,0.049550,0.243243,0.411036,0.158784,0.061937,0.054054,0.021396,0.122748,0.067568,...,0,0,0,0,0,0,0,0,0,1


In [173]:
# average rating per movie, split by gender (columns: 0 = female, 1 = male)
allTransformedData.groupby(["MovieID","is_male"])["Rating"].agg(np.mean).unstack()

is_male,0,1
MovieID,Unnamed: 1_level_1,Unnamed: 2_level_1
1,4.187817,4.130552
2,3.278409,3.175238
3,3.073529,2.994152
4,2.976471,2.482353
5,3.212963,2.888298
6,3.682171,3.909988
7,3.588235,3.267717
8,3.357143,2.775000
9,2.100000,2.717391
10,3.470149,3.553050
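Here is the same set of per-movie aggregates on a made-up five-row frame, small enough to check by hand. It uses string names (`"mean"`, `"std"`) for the aggregations, which is how I would write it on a recent pandas. Note the missing value that `unstack` produces when a movie has no ratings from one gender.

```python
import pandas as pd

# Made-up ratings: movie 1 has three ratings, movie 2 has two (both from women).
toy = pd.DataFrame({
    "MovieID": [1, 1, 1, 2, 2],
    "is_male": [1, 0, 1, 0, 0],
    "Rating": [4, 5, 3, 2, 4],
})

# Average and standard deviation of each movie's ratings.
rating_stats = toy.groupby("MovieID")["Rating"].agg(["mean", "std"])

# The mean of a 0/1 column is the share of ratings that came from men.
share_male = toy.groupby("MovieID")["is_male"].mean()

# Average rating per movie and gender, with genders spread into columns.
by_gender = toy.groupby(["MovieID", "is_male"])["Rating"].mean().unstack()

print(rating_stats)
print(share_male)
print(by_gender)
```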


In [None]:
statsByGender = mostReviewedMoviesData.pivot_table("Rating", index="Title",columns="Gender",aggfunc = [np.mean,np.std]) #get per-gender avg, std of ratings per movie

In [5]:
statsByGender["meanDifference"] = statsByGender["mean"]["F"] - statsByGender["mean"]["M"] # get diff in mean rating between genders
statsByGender.sort("meanDifference", ascending = False, inplace=True)
print "Movies women tended to like more than men: \n", statsByGender.head(), "\n"
print "Movies men tended to like more than women: \n", statsByGender[::-1].head(), "\n"
statsByGender.sort(("std","F"), ascending = False, inplace=True)
print "Movies women tended to disagree on: \n", statsByGender.head(), "\n"
print "Movies women tended to agree on: \n", statsByGender[::-1].head(), "\n"

Movies women tended to like more than men: 
                             mean                 std           meanDifference
Gender                          F         M         F         M               
Title                                                                         
Pet Sematary II (1992)   2.833333  1.858696  1.340560  0.978691       0.974638
Cutthroat Island (1995)  3.200000  2.341270  1.361114  1.160432       0.858730
Dirty Dancing (1987)     3.790378  2.959596  1.105181  1.087738       0.830782
Air Bud (1997)           3.057143  2.233766  1.211291  1.086957       0.823377
Home Alone 3 (1997)      2.486486  1.683761  1.556735  0.934488       0.802726 

Movies men tended to like more than women: 
                                                    mean                 std  \
Gender                                                 F         M         F   
Title                                                                          
Friday the 13th Part V: A New Beginnin

In [14]:
def split_train_test(df,sample=0.4, testSetColumnName="testSet"):
    # Called once per group: with probability `sample`, flag every row of the group as test data
    if np.random.random() < sample:
        df.loc[:, testSetColumnName] = True
    return df

In [26]:
data["testSet"] = False
data2 = data.groupby("UserID").apply(split_train_test)
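The effect of the two cells above is a per-user split: all of a user's ratings end up together, either in the test set or out of it. Below is a sketch of an alternative way to get the same kind of split without `groupby().apply`, on made-up data. The fixed seed and the name `test_users` are my own assumptions, added so the split is reproducible.

```python
import numpy as np
import pandas as pd

# Made-up ratings from four users.
toy = pd.DataFrame({
    "UserID": [1, 1, 2, 2, 3, 3, 4, 4],
    "Rating": [5, 3, 4, 4, 2, 1, 5, 5],
})

rng = np.random.RandomState(0)  # arbitrary fixed seed, for reproducibility
sample = 0.4                    # probability that a user goes to the test set

# Draw one random number per user, not per rating.
user_ids = toy.UserID.unique()
test_users = user_ids[rng.random_sample(len(user_ids)) < sample]

# Flag every rating made by a test user.
toy = toy.assign(testSet=toy.UserID.isin(test_users))
print(toy)
```

Because each user is drawn independently, the share of users in the test set is only about `sample` on average; it will not be exactly 40%.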

In [31]:
data2.head(100)

Unnamed: 0,UserID,Gender,Age,Occupation,Zip-code,MovieID,Rating,Timestamp,Title,Genres,...,Horror,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western,high_rating,testSet
0,1,F,1,10,48067,1193,5,2000-12-31 22:12:40,One Flew Over the Cuckoo's Nest (1975),Drama,...,0,0,0,0,0,0,0,0,True,False
1,1,F,1,10,48067,661,3,2000-12-31 22:35:09,James and the Giant Peach (1996),Animation|Children's|Musical,...,0,1,0,0,0,0,0,0,False,False
2,1,F,1,10,48067,914,3,2000-12-31 22:32:48,My Fair Lady (1964),Musical|Romance,...,0,1,0,1,0,0,0,0,False,False
3,1,F,1,10,48067,3408,4,2000-12-31 22:04:35,Erin Brockovich (2000),Drama,...,0,0,0,0,0,0,0,0,True,False
4,1,F,1,10,48067,2355,5,2001-01-06 23:38:11,"Bug's Life, A (1998)",Animation|Children's|Comedy,...,0,0,0,0,0,0,0,0,True,False
5,1,F,1,10,48067,1197,3,2000-12-31 22:37:48,"Princess Bride, The (1987)",Action|Adventure|Comedy|Romance,...,0,0,0,1,0,0,0,0,False,False
6,1,F,1,10,48067,1287,5,2000-12-31 22:33:59,Ben-Hur (1959),Action|Adventure|Drama,...,0,0,0,0,0,0,0,0,True,False
7,1,F,1,10,48067,2804,5,2000-12-31 22:11:59,"Christmas Story, A (1983)",Comedy|Drama,...,0,0,0,0,0,0,0,0,True,False
8,1,F,1,10,48067,594,4,2000-12-31 22:37:48,Snow White and the Seven Dwarfs (1937),Animation|Children's|Musical,...,0,1,0,0,0,0,0,0,True,False
9,1,F,1,10,48067,919,4,2000-12-31 22:22:48,"Wizard of Oz, The (1939)",Adventure|Children's|Drama|Musical,...,0,1,0,0,0,0,0,0,True,False
