## MovieLens Data Set

The GroupLens Research Project is a research group in the Department of 
Computer Science and Engineering at the University of Minnesota. Its members 
work on many research projects in the fields of information filtering, 
collaborative filtering, and recommender systems. The project is led by 
professors John Riedl and Joseph Konstan. It began to explore automated 
collaborative filtering in 1992, but is best known for its worldwide trial of 
an automated collaborative filtering system for Usenet news in 1996. Since then 
the project has expanded its scope to research overall information filtering 
solutions, integrating content-based methods as well as improving current 
collaborative filtering technology.

Further information on the GroupLens Research Project, including research 
publications, can be found at the following web site:
        
        http://www.grouplens.org/


In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns 
from scipy import stats
from scipy.stats.mstats import winsorize

import warnings
warnings.filterwarnings('ignore')
warnings.warn("this will not show")

%matplotlib inline
# %matplotlib notebook

plt.rcParams["figure.figsize"] = (10, 6)
# plt.rcParams['figure.dpi'] = 100

sns.set_style("whitegrid")
pd.set_option('display.float_format', lambda x: '%.3f' % x)

pd.options.display.max_rows = 50
pd.options.display.max_columns = 50

## Reading Data Set

In [24]:
unames = ["user_id","gender", "age", "occupation", "zip"]
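# "::" is a multi-character separator, so the Python parsing engine is requested explicitly
users = pd.read_table("users.dat", sep = "::", header = None, names = unames, engine = "python")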

In [25]:
rnames = ["user_id","movie_id", "rating", "timestamp"]
ratings = pd.read_table("ratings.dat", sep = "::", header = None, names = rnames, engine = "python")

In [36]:
mnames = ["movie_id","title", "genres"]
# movies.dat contains non-ASCII titles, hence the explicit encoding
movies = pd.read_table("movies.dat", sep = "::", header = None, names = mnames, encoding = "ISO-8859-1", engine = "python")
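
The `::` separator is longer than one character, so pandas cannot use its default C parser for these files and relies on the slower Python engine. The following is a minimal sketch of that behaviour on an in-memory sample instead of the real file. The `sample` string and the `users_sample` name are illustrative assumptions, with rows copied from the first lines of the users table. Reading `zip` as a string is also an assumption, added here so that leading zeros would survive.

```python
import io
import pandas as pd

# Three sample rows in the same "::"-delimited layout as users.dat
sample = "1::F::1::10::48067\n2::M::56::16::70072\n3::M::25::15::55117\n"

unames = ["user_id", "gender", "age", "occupation", "zip"]

# A multi-character separator needs the Python parsing engine;
# dtype keeps zip codes as strings so leading zeros are not dropped
users_sample = pd.read_csv(
    io.StringIO(sample),
    sep="::",
    header=None,
    names=unames,
    engine="python",
    dtype={"zip": str},
)
print(users_sample)
```

`pd.read_table` accepts the same keyword arguments, so the sketch carries over to the calls in this section.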
    

In [37]:
users.head()

Unnamed: 0,user_id,gender,age,occupation,zip
0,1,F,1,10,48067
1,2,M,56,16,70072
2,3,M,25,15,55117
3,4,M,45,7,02460
4,5,M,25,20,55455


In [38]:
ratings.head()

Unnamed: 0,user_id,movie_id,rating,timestamp
0,1,1193,5,978300760
1,1,661,3,978302109
2,1,914,3,978301968
3,1,3408,4,978300275
4,1,2355,5,978824291
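
The `timestamp` column holds integers that represent seconds since the Unix epoch. If readable dates are needed, a conversion along these lines should work. This is a sketch on three values copied from the output above, and the names `ts` and `rated_at` are illustrative.

```python
import pandas as pd

# Timestamps copied from the first rows of the ratings table
ts = pd.Series([978300760, 978302109, 978301968])

# Interpret the integers as seconds since 1970-01-01 00:00:00 UTC
rated_at = pd.to_datetime(ts, unit="s")
print(rated_at)
```

On the full data the same call could be applied as `pd.to_datetime(ratings["timestamp"], unit="s")`.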


In [39]:
movies.head()

Unnamed: 0,movie_id,title,genres
0,1,Toy Story (1995),Animation|Children's|Comedy
1,2,Jumanji (1995),Adventure|Children's|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama
4,5,Father of the Bride Part II (1995),Comedy
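
Each movie lists its genres in a single pipe-delimited string, which makes per-genre analysis awkward. One possible way to get one row per movie and genre is sketched below on the first three titles. The frame `movies_sample` and the new `genre` column are assumptions made for illustration.

```python
import pandas as pd

# First three rows of the movies table
movies_sample = pd.DataFrame({
    "movie_id": [1, 2, 3],
    "title": ["Toy Story (1995)", "Jumanji (1995)", "Grumpier Old Men (1995)"],
    "genres": [
        "Animation|Children's|Comedy",
        "Adventure|Children's|Fantasy",
        "Comedy|Romance",
    ],
})

# Split each string into a list of genres, then emit one row per list element
genre_rows = (
    movies_sample
    .assign(genre=movies_sample["genres"].str.split("|"))
    .explode("genre")
    .drop(columns="genres")
)
print(genre_rows["genre"].value_counts())
```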


## Merging 3 Data Sets

In [40]:
df = pd.merge(pd.merge(ratings, users), movies)
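
When no `on` argument is given, `pd.merge` joins on the column names the two frames share: `user_id` for ratings and users, then `movie_id` for the result and movies. A more explicit variant might look like the sketch below. The toy frames and their values are illustrative only, and the `validate` check is an optional safeguard that is not part of the original cell.

```python
import pandas as pd

# Toy frames with the same key columns as the real tables (values are illustrative)
ratings_s = pd.DataFrame({
    "user_id": [1, 2, 2],
    "movie_id": [1193, 1193, 1],
    "rating": [5, 5, 4],
})
users_s = pd.DataFrame({"user_id": [1, 2], "gender": ["F", "M"]})
movies_s = pd.DataFrame({
    "movie_id": [1193, 1],
    "title": ["One Flew Over the Cuckoo's Nest (1975)", "Toy Story (1995)"],
})

# Name the join keys explicitly; validate raises if a key is duplicated
# on the right-hand side, which would silently multiply rating rows
merged = (
    ratings_s
    .merge(users_s, on="user_id", how="inner", validate="many_to_one")
    .merge(movies_s, on="movie_id", how="inner", validate="many_to_one")
)
print(merged)
```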

## Analyzing the Data

In [41]:
df.head(5)

Unnamed: 0,user_id,movie_id,rating,timestamp,gender,age,occupation,zip,title,genres
0,1,1193,5,978300760,F,1,10,48067,One Flew Over the Cuckoo's Nest (1975),Drama
1,2,1193,5,978298413,M,56,16,70072,One Flew Over the Cuckoo's Nest (1975),Drama
2,12,1193,4,978220179,M,25,12,32793,One Flew Over the Cuckoo's Nest (1975),Drama
3,15,1193,4,978199279,M,25,7,22903,One Flew Over the Cuckoo's Nest (1975),Drama
4,17,1193,5,978158471,M,50,1,95350,One Flew Over the Cuckoo's Nest (1975),Drama


In [42]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1000209 entries, 0 to 1000208
Data columns (total 10 columns):
 #   Column      Non-Null Count    Dtype 
---  ------      --------------    ----- 
 0   user_id     1000209 non-null  int64 
 1   movie_id    1000209 non-null  int64 
 2   rating      1000209 non-null  int64 
 3   timestamp   1000209 non-null  int64 
 4   gender      1000209 non-null  object
 5   age         1000209 non-null  int64 
 6   occupation  1000209 non-null  int64 
 7   zip         1000209 non-null  object
 8   title       1000209 non-null  object
 9   genres      1000209 non-null  object
dtypes: int64(6), object(4)
memory usage: 83.9+ MB


In [43]:
df.isnull().sum()

user_id       0
movie_id      0
rating        0
timestamp     0
gender        0
age           0
occupation    0
zip           0
title         0
genres        0
dtype: int64

In [47]:
## Mean rating of each film, grouped by gender:
mean_ratings = df.pivot_table("rating", index = "title", columns = "gender", aggfunc = "mean")
mean_ratings

gender,F,M
title,Unnamed: 1_level_1,Unnamed: 2_level_1
"$1,000,000 Duck (1971)",3.375,2.762
'Night Mother (1986),3.389,3.353
'Til There Was You (1997),2.676,2.733
"'burbs, The (1989)",2.793,2.962
...And Justice for All (1979),3.829,3.689
...,...,...
"Zed & Two Noughts, A (1985)",3.500,3.381
Zero Effect (1998),3.864,3.723
Zero Kelvin (Kjærlighetens kjøtere) (1995),,3.500
Zeus and Roxanne (1997),2.778,2.357
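
The empty `F` value for "Zero Kelvin" in the table above is a missing value: a title gets `NaN` in a column when no viewer of that gender rated it. The toy example below sketches this and shows that the pivot is equivalent to a `groupby` followed by `unstack`. The frame `toy` and its values are made up for illustration.

```python
import pandas as pd

# Title "B" has no rating from a female viewer
toy = pd.DataFrame({
    "title": ["A", "A", "A", "B", "B"],
    "gender": ["F", "M", "M", "M", "M"],
    "rating": [4, 3, 5, 2, 4],
})

# Mean rating per title (rows) and gender (columns)
by_pivot = toy.pivot_table(values="rating", index="title", columns="gender", aggfunc="mean")

# Same computation spelled out as groupby + unstack
by_group = toy.groupby(["title", "gender"])["rating"].mean().unstack("gender")

print(by_pivot)
```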


In [49]:
## Count the ratings each movie received; movies with fewer than 250 ratings are filtered out afterwards.
rating_by_title = df.groupby("title").size()
rating_by_title

title
$1,000,000 Duck (1971)                         37
'Night Mother (1986)                           70
'Til There Was You (1997)                      52
'burbs, The (1989)                            303
...And Justice for All (1979)                 199
                                             ... 
Zed & Two Noughts, A (1985)                    29
Zero Effect (1998)                            301
Zero Kelvin (Kjærlighetens kjøtere) (1995)      2
Zeus and Roxanne (1997)                        23
eXistenZ (1999)                               410
Length: 3706, dtype: int64

In [54]:
active_titles = rating_by_title.index[rating_by_title >= 250]
active_titles
## 250 is an arbitrary threshold; another value can be chosen.

Index(["'burbs, The (1989)", '10 Things I Hate About You (1999)',
       '101 Dalmatians (1961)', '101 Dalmatians (1996)', '12 Angry Men (1957)',
       '13th Warrior, The (1999)', '2 Days in the Valley (1996)',
       '20,000 Leagues Under the Sea (1954)', '2001: A Space Odyssey (1968)',
       '2010 (1984)',
       ...
       'X-Men (2000)', 'Year of Living Dangerously (1982)',
       'Yellow Submarine (1968)', "You've Got Mail (1998)",
       'Young Frankenstein (1974)', 'Young Guns (1988)',
       'Young Guns II (1990)', 'Young Sherlock Holmes (1985)',
       'Zero Effect (1998)', 'eXistenZ (1999)'],
      dtype='object', name='title', length=1216)

In [53]:
## Alternative method: keeps the rating counts together with the titles
act_title = rating_by_title.loc[rating_by_title >= 250]
act_title

title
'burbs, The (1989)                   303
10 Things I Hate About You (1999)    700
101 Dalmatians (1961)                565
101 Dalmatians (1996)                364
12 Angry Men (1957)                  616
                                    ... 
Young Guns (1988)                    562
Young Guns II (1990)                 369
Young Sherlock Holmes (1985)         379
Zero Effect (1998)                   301
eXistenZ (1999)                      410
Length: 1216, dtype: int64
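
The count-then-filter pattern used above can be sketched on a small frame as follows. The data, the `min_ratings` threshold, and the variable names are hypothetical and chosen only to keep the example readable.

```python
import pandas as pd

# Toy ratings: "A" has 3 ratings, "B" has 1, "C" has 2
toy = pd.DataFrame({
    "title": ["A", "A", "A", "B", "C", "C"],
    "rating": [5, 4, 3, 2, 4, 5],
})

min_ratings = 2  # hypothetical threshold, plays the role of 250

# Number of ratings per title
counts = toy.groupby("title").size()

# Titles that meet the threshold
active = counts.index[counts >= min_ratings]

# Keep only the rows that belong to those titles
filtered = toy[toy["title"].isin(active)]
print(list(active), len(filtered))
```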

In [62]:
## Mean ratings of the films that received at least 250 ratings.
mean_ratings = mean_ratings.loc[active_titles]
mean_ratings


gender,F,M
title,Unnamed: 1_level_1,Unnamed: 2_level_1
"'burbs, The (1989)",2.793,2.962
10 Things I Hate About You (1999),3.647,3.312
101 Dalmatians (1961),3.791,3.500
101 Dalmatians (1996),3.240,2.911
12 Angry Men (1957),4.184,4.328
...,...,...
Young Guns (1988),3.372,3.426
Young Guns II (1990),2.935,2.904
Young Sherlock Holmes (1985),3.515,3.363
Zero Effect (1998),3.864,3.723


In [65]:
## Top films among female viewers.
top_female_ratings = mean_ratings.sort_values(by = "F", ascending = False)
top_female_ratings

gender,F,M
title,Unnamed: 1_level_1,Unnamed: 2_level_1
"Close Shave, A (1995)",4.644,4.474
"Wrong Trousers, The (1993)",4.588,4.478
Sunset Blvd. (a.k.a. Sunset Boulevard) (1950),4.573,4.465
Wallace & Gromit: The Best of Aardman Animation (1996),4.563,4.385
Schindler's List (1993),4.563,4.491
...,...,...
"Avengers, The (1998)",1.915,2.017
Speed 2: Cruise Control (1997),1.907,1.863
Rocky V (1990),1.879,2.133
Barb Wire (1996),1.585,2.100


## Measuring Rating Disagreement

In [67]:
mean_ratings["diff"] = mean_ratings["M"] - mean_ratings["F"]
## Bracket access is required: mean_ratings.diff would return the DataFrame.diff method
mean_ratings["diff"]

title
'burbs, The (1989)                    0.169
10 Things I Hate About You (1999)    -0.335
101 Dalmatians (1961)                -0.291
101 Dalmatians (1996)                -0.329
12 Angry Men (1957)                   0.144
                                      ...  
Young Guns (1988)                     0.054
Young Guns II (1990)                 -0.031
Young Sherlock Holmes (1985)         -0.151
Zero Effect (1998)                   -0.141
eXistenZ (1999)                       0.190
Name: diff, Length: 1216, dtype: float64
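
Because `diff` is also the name of a `DataFrame` method, attribute-style access to a column with that name returns the method rather than the data. The small sketch below, on a made-up frame named `toy`, illustrates the difference between the two access styles.

```python
import pandas as pd

toy = pd.DataFrame({"F": [3.0, 4.0], "M": [3.5, 3.0]}, index=["A", "B"])
toy["diff"] = toy["M"] - toy["F"]

# Attribute access resolves to the DataFrame.diff method, not the column
print(callable(toy.diff))  # True

# Bracket access always returns the column
print(toy["diff"])
```

Choosing a column name that does not clash with a method (for example a hypothetical `rating_diff`) would avoid the ambiguity altogether.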

In [68]:
mean_ratings.sort_values("diff", ascending = False)

gender,F,M,diff
title,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
"Good, The Bad and The Ugly, The (1966)",3.495,4.221,0.726
"Kentucky Fried Movie, The (1977)",2.879,3.555,0.676
Dumb & Dumber (1994),2.698,3.337,0.639
"Longest Day, The (1962)",3.412,4.031,0.620
"Cable Guy, The (1996)",2.250,2.864,0.614
...,...,...,...
Steel Magnolias (1989),3.902,3.366,-0.536
Little Women (1994),3.871,3.322,-0.549
Grease (1978),3.975,3.367,-0.608
Jumpin' Jack Flash (1986),3.255,2.578,-0.676


In [72]:
mean_ratings.value_counts()

F      M      diff  
1.574  1.617  0.042     1
3.829  3.523  -0.305    1
3.833  3.919  0.085     1
3.832  3.937  0.104     1
       4.263  0.432     1
                       ..
3.375  3.626  0.251     1
       3.436  0.061     1
       3.317  -0.058    1
3.372  3.428  0.057     1
4.644  4.474  -0.171    1
Length: 1216, dtype: int64

In [75]:
## Standard deviation of ratings
rating_std_by_title = df.groupby("title")["rating"].std()

In [74]:
rating_std_by_title

title
$1,000,000 Duck (1971)                       1.093
'Night Mother (1986)                         1.119
'Til There Was You (1997)                    1.020
'burbs, The (1989)                           1.108
...And Justice for All (1979)                0.878
                                              ... 
Zed & Two Noughts, A (1985)                  1.053
Zero Effect (1998)                           1.043
Zero Kelvin (Kjærlighetens kjøtere) (1995)   0.707
Zeus and Roxanne (1997)                      1.123
eXistenZ (1999)                              1.179
Name: rating, Length: 3706, dtype: float64
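
The standard deviation measures disagreement among viewers regardless of gender, but it is unreliable for titles with very few ratings. A possible next step, sketched here on toy data, is to restrict the result to titles that meet a count threshold and sort it in descending order. The threshold of 3 and all names in the block are assumptions for the example.

```python
import pandas as pd

# "A" is divisive, "B" is unanimous, "C" has a single rating
toy = pd.DataFrame({
    "title": ["A", "A", "A", "B", "B", "B", "C"],
    "rating": [1, 5, 3, 4, 4, 4, 2],
})

# Titles with enough ratings (hypothetical threshold for the toy data)
counts = toy.groupby("title").size()
active = counts.index[counts >= 3]

# Sample standard deviation of the ratings per title
std_by_title = toy.groupby("title")["rating"].std()

# Keep active titles only and list the most divisive first
most_divisive = std_by_title.loc[active].sort_values(ascending=False)
print(most_divisive)
```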