# Movie Rankings

In this notebook we recommend movies to users. We have access to user ratings of movies and to each movie's popularity, measured here as the number of ratings it received. The goal is to find a way to combine ratings and popularity into one overall ranking.

# Loading Data

Run the following 3 cells to load 3 separate DataFrames from the MovieLens dataset. The cell after them merges the three into a single DataFrame.

The following URLs were found on GitHub for ease of downloading online.

In [1]:
# access movie ratings
import pandas as pd

url_movie_ratings = 'https://raw.githubusercontent.com/khanhnamle1994/movielens/master/ratings.csv'
# read the first 100,000 rows of the tab-separated file
df_ratings = pd.read_csv(url_movie_ratings, nrows=100000, sep='\\t', engine='python')
# drop columns that are not used in the ranking
df_ratings = df_ratings.drop(columns=['user_emb_id', 'movie_emb_id', 'timestamp'])
df_ratings.head()

Unnamed: 0,user_id,movie_id,rating
0,1,1193,5
1,1,661,3
2,1,914,3
3,1,3408,4
4,1,2355,5


In [2]:
# access movie info
url_movies = 'https://raw.githubusercontent.com/khanhnamle1994/movielens/master/movies.csv'
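# skip malformed lines; on_bad_lines is the current name for this option (pandas >= 1.3)
df_movies = pd.read_csv(url_movies, on_bad_lines='skip', encoding='latin-1', sep='\\t', engine='python')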
df_movies.head()

Unnamed: 0,movie_id,title,genres
0,1,Toy Story (1995),Animation|Children's|Comedy
1,2,Jumanji (1995),Adventure|Children's|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama
4,5,Father of the Bride Part II (1995),Comedy


In [3]:
# access user info
url_users = 'https://raw.githubusercontent.com/khanhnamle1994/movielens/master/users.csv'
df_users = pd.read_csv(url_users, sep='\\t', engine='python')
df_users.head()

Unnamed: 0,user_id,gender,age,occupation,zipcode,age_desc,occ_desc
0,1,F,1,10,48067,Under 18,K-12 student
1,2,M,56,16,70072,56+,self-employed
2,3,M,25,15,55117,25-34,scientist
3,4,M,45,7,2460,45-49,executive/managerial
4,5,M,25,20,55455,25-34,writer


In [4]:
# Join all 3 files into one dataframe
df = pd.merge(pd.merge(df_movies, df_ratings),df_users)
df.head()

Unnamed: 0,movie_id,title,genres,user_id,rating,gender,age,occupation,zipcode,age_desc,occ_desc
0,1,Toy Story (1995),Animation|Children's|Comedy,1,5,F,1,10,48067,Under 18,K-12 student
1,48,Pocahontas (1995),Animation|Children's|Musical|Romance,1,5,F,1,10,48067,Under 18,K-12 student
2,150,Apollo 13 (1995),Drama,1,5,F,1,10,48067,Under 18,K-12 student
3,260,Star Wars: Episode IV - A New Hope (1977),Action|Adventure|Fantasy|Sci-Fi,1,4,F,1,10,48067,Under 18,K-12 student
4,527,Schindler's List (1993),Drama|War,1,5,F,1,10,48067,Under 18,K-12 student
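
The merge above relies on pandas picking the shared column names as join keys. As a minimal sketch, assuming the keys are `movie_id` and `user_id`, the same join can name its keys explicitly and ask pandas to check the expected relationship. The three small frames are made-up stand-ins for `df_movies`, `df_ratings` and `df_users`, not the real data.

```python
import pandas as pd

# Toy stand-ins for df_movies, df_ratings and df_users (illustrative values only).
movies = pd.DataFrame({"movie_id": [1, 2], "title": ["Toy Story (1995)", "Jumanji (1995)"]})
ratings = pd.DataFrame({"user_id": [1, 1, 2], "movie_id": [1, 2, 1], "rating": [5, 3, 4]})
users = pd.DataFrame({"user_id": [1, 2], "gender": ["F", "M"]})

# Name the join key explicitly; validate raises MergeError if a key
# that should be unique on one side turns out to be duplicated.
merged = (
    movies.merge(ratings, on="movie_id", validate="one_to_many")
          .merge(users, on="user_id", validate="many_to_one")
)
print(merged[["movie_id", "title", "user_id", "rating", "gender"]])
```

Naming the keys guards against an accidental join on some other column that happens to appear in both frames.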


In [5]:
# keep only the first five columns: movie_id, title, genres, user_id, rating
df = df.iloc[:, :5]
df.head()

Unnamed: 0,movie_id,title,genres,user_id,rating
0,1,Toy Story (1995),Animation|Children's|Comedy,1,5
1,48,Pocahontas (1995),Animation|Children's|Musical|Romance,1,5
2,150,Apollo 13 (1995),Drama,1,5
3,260,Star Wars: Episode IV - A New Hope (1977),Action|Adventure|Fantasy|Sci-Fi,1,4
4,527,Schindler's List (1993),Drama|War,1,5


In [6]:
# per movie: mean rating, title, and number of ratings (the count of user_id)
df = df.groupby('movie_id').agg({'rating':'mean', 'title':'first',
                                 'user_id':'count'})
df.head()

movie_id,rating,title,user_id
1,4.159091,Toy Story (1995),220
2,3.294872,Jumanji (1995),78
3,3.157895,Grumpier Old Men (1995),57
4,2.684211,Waiting to Exhale (1995),19
5,3.0,Father of the Bride Part II (1995),28
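
After this aggregation the `user_id` column no longer holds user ids: it holds the number of ratings per movie. A possible alternative, sketched here on made-up rows, is pandas named aggregation, which lets the count column get its own name. The name `n_ratings` is an assumption for illustration and is not used elsewhere in this notebook.

```python
import pandas as pd

# Made-up rows with the same columns as the merged DataFrame.
ratings = pd.DataFrame({
    "movie_id": [1, 1, 2],
    "title": ["Toy Story (1995)", "Toy Story (1995)", "Jumanji (1995)"],
    "user_id": [1, 2, 1],
    "rating": [5, 4, 3],
})

# Named aggregation: output_column=(input_column, function).
per_movie = ratings.groupby("movie_id").agg(
    rating=("rating", "mean"),
    title=("title", "first"),
    n_ratings=("user_id", "count"),  # hypothetical name for the rating count
)
print(per_movie)
```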


In [7]:
# rank by mean rating alone
df = df.sort_values('rating',ascending=False)
df.head(10)

movie_id,rating,title,user_id
578,5.0,"Hour of the Pig, The (1993)",1
1423,5.0,Hearts and Minds (1996),1
3092,5.0,Chushingura (1962),2
2131,5.0,Autumn Sonata (Höstsonaten ) (1978),1
1002,5.0,Ed's Next Move (1996),1
3410,5.0,Soft Fruit (1999),1
1063,5.0,Johns (1996),1
2904,5.0,Rain (1932),2
1044,5.0,Surviving Picasso (1996),2
1563,5.0,Dream With the Fishes (1997),2
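
Every movie in this top 10 has a perfect 5.0 from only one or two users, so sorting by mean rating alone rewards movies with very few ratings. The next cell handles this with a hard cutoff on the rating count. Another common option, sketched below, is a shrunk (Bayesian-style) average that pulls each mean toward a prior, with a stronger pull when there are few ratings. This is not the method used in this notebook. The `prior_mean` and `prior_weight` values are arbitrary assumptions, and the three rows are rounded from the outputs above.

```python
import pandas as pd

def shrunk_rating(mean_rating, n_ratings, prior_mean, prior_weight):
    """Pull each mean toward prior_mean; the pull is stronger when n_ratings is small."""
    return (n_ratings * mean_rating + prior_weight * prior_mean) / (n_ratings + prior_weight)

# Three movies with ratings rounded from the notebook outputs.
movies = pd.DataFrame(
    {"rating": [5.0, 4.56, 4.16], "n_ratings": [1, 249, 220]},
    index=["Hour of the Pig, The (1993)", "Schindler's List (1993)", "Toy Story (1995)"],
)

# prior_mean and prior_weight are assumed values chosen only for illustration.
movies["shrunk"] = shrunk_rating(movies["rating"], movies["n_ratings"],
                                 prior_mean=3.5, prior_weight=25)
print(movies.sort_values("shrunk", ascending=False))
```

With these assumed settings the single 5.0 rating is pulled most of the way back to the prior, so the movie with one rating drops below the two heavily rated ones.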


In [11]:
# keep only movies with at least 100 ratings
df = df[df['user_id']>=100]
df.head(10)

movie_id,rating,title,user_id
527,4.562249,Schindler's List (1993),249
318,4.557447,"Shawshank Redemption, The (1994)",235
1148,4.544554,"Wrong Trousers, The (1993)",101
50,4.534483,"Usual Suspects, The (1995)",174
1207,4.490196,To Kill a Mockingbird (1962),102
1198,4.476534,Raiders of the Lost Ark (1981),277
260,4.475155,Star Wars: Episode IV - A New Hope (1977),322
858,4.459596,"Godfather, The (1972)",198
904,4.455446,Rear Window (1954),101
750,4.428571,Dr. Strangelove or: How I Learned to Stop Worr...,133


In [13]:
# the 10 movies with the most ratings
df.sort_values('user_id', ascending=False)[:10]

movie_id,rating,title,user_id
2858,4.275862,American Beauty (1999),406
1196,4.311377,Star Wars: Episode V - The Empire Strikes Back...,334
1210,4.06192,Star Wars: Episode VI - Return of the Jedi (1983),323
480,3.848297,Jurassic Park (1993),323
260,4.475155,Star Wars: Episode IV - A New Hope (1977),322
2571,4.383562,"Matrix, The (1999)",292
2028,4.410345,Saving Private Ryan (1998),290
1198,4.476534,Raiders of the Lost Ark (1981),277
3578,4.148014,Gladiator (2000),277
1580,3.776173,Men in Black (1997),277


In [15]:
# scale both columns by their maximum so the largest value in each becomes 1
df['rating_norm'] = df['rating']/df['rating'].max()
df['popularity_norm'] = df['user_id']/df['user_id'].max()
df.head()

movie_id,rating,title,user_id,rating_norm,popularity_norm
2858,4.275862,American Beauty (1999),406,0.955172,1.0
1196,4.311377,Star Wars: Episode V - The Empire Strikes Back...,334,0.963106,0.82266
1210,4.06192,Star Wars: Episode VI - Return of the Jedi (1983),323,0.90738,0.795567
480,3.848297,Jurassic Park (1993),323,0.85966,0.795567
260,4.475155,Star Wars: Episode IV - A New Hope (1977),322,0.999692,0.793103
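
Dividing by the maximum caps each column at 1 but does not fix where it starts. For example, after the 100-rating filter `popularity_norm` cannot fall below 100/406 ≈ 0.246, so the two columns may cover different ranges. One alternative, sketched here on made-up values and not used in this notebook, is min-max scaling, which maps each column onto the full 0 to 1 range. The helper name `min_max` is an assumption.

```python
import pandas as pd

def min_max(series):
    """Rescale a numeric Series so its minimum maps to 0 and its maximum to 1."""
    span = series.max() - series.min()
    if span == 0:
        return series * 0.0  # constant column: every value maps to 0
    return (series - series.min()) / span

# Made-up values in roughly the same ranges as the filtered movie table.
toy = pd.DataFrame({"rating": [3.8, 4.3, 4.5], "user_id": [100, 323, 406]})
toy["rating_norm"] = min_max(toy["rating"])
toy["popularity_norm"] = min_max(toy["user_id"])
print(toy)
```

Whether this is preferable depends on the goal: min-max scaling exaggerates small differences when a column's values are tightly clustered.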


In [18]:
# equal-weight blend of normalized rating and normalized popularity
df['ranking'] = 0.5*df['rating_norm'] + 0.5*df['popularity_norm']
df.sort_values('ranking', ascending=False)[:10]

movie_id,rating,title,user_id,rating_norm,popularity_norm,ranking
2858,4.275862,American Beauty (1999),406,0.955172,1.0,0.977586
260,4.475155,Star Wars: Episode IV - A New Hope (1977),322,0.999692,0.793103,0.896398
1196,4.311377,Star Wars: Episode V - The Empire Strikes Back...,334,0.963106,0.82266,0.892883
1210,4.06192,Star Wars: Episode VI - Return of the Jedi (1983),323,0.90738,0.795567,0.851473
2028,4.410345,Saving Private Ryan (1998),290,0.985214,0.714286,0.84975
2571,4.383562,"Matrix, The (1999)",292,0.979231,0.719212,0.849221
1198,4.476534,Raiders of the Lost Ark (1981),277,1.0,0.682266,0.841133
480,3.848297,Jurassic Park (1993),323,0.85966,0.795567,0.827613
3578,4.148014,Gladiator (2000),277,0.926613,0.682266,0.804439
1580,3.776173,Men in Black (1997),277,0.843548,0.682266,0.762907


In [None]:
# 1. Develop your own ranking metric
# 2. Find other datasets (Goodreads, music) and do something similar
#    They must have user ratings
# Cosine similarity
# Reddit and Hacker News formulas
# Who uses formulas
# Who uses what
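
For the cosine similarity item in the list above, here is a minimal sketch of how it could be computed between two movies, treating each movie as a vector of ratings with one entry per user. The rating vectors are made up, and treating an unrated movie as 0 is a simplifying assumption.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length rating vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for an all-zero vector
    return dot / (norm_a * norm_b)

# Hypothetical ratings from four users (0 = not rated), one vector per movie.
movie_a = [5, 4, 0, 3]
movie_b = [4, 5, 0, 2]
movie_c = [0, 1, 5, 0]

print(cosine_similarity(movie_a, movie_b))  # rated similarly by the same users
print(cosine_similarity(movie_a, movie_c))  # rated by mostly different users
```

A higher value means the same users tended to rate both movies highly, which is the basis for "users who liked this also liked" recommendations.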


In [3]:
import pandas as pd
vgds = pd.read_csv("videogames.csv")
vgds.head()

Unnamed: 0,Name,Platform,Year_of_Release,Genre,Publisher,NA_Sales,EU_Sales,JP_Sales,Other_Sales,Global_Sales,Critic_Score,Critic_Count,User_Score,User_Count,Developer,Rating
0,Wii Sports,Wii,2006.0,Sports,Nintendo,41.36,28.96,3.77,8.45,82.53,76.0,51.0,8.0,322.0,Nintendo,E
1,Super Mario Bros.,NES,1985.0,Platform,Nintendo,29.08,3.58,6.81,0.77,40.24,,,,,,
2,Mario Kart Wii,Wii,2008.0,Racing,Nintendo,15.68,12.76,3.79,3.29,35.52,82.0,73.0,8.3,709.0,Nintendo,E
3,Wii Sports Resort,Wii,2009.0,Sports,Nintendo,15.61,10.93,3.28,2.95,32.77,80.0,73.0,8.0,192.0,Nintendo,E
4,Pokemon Red/Pokemon Blue,GB,1996.0,Role-Playing,Nintendo,11.27,8.89,10.22,1.0,31.37,,,,,,


In [4]:
vgds['Global_Sales'] = pd.to_numeric(vgds.Global_Sales, errors='coerce')
vgds['User_Score'] = pd.to_numeric(vgds.User_Score, errors='coerce')
vgds['Critic_Score'] = pd.to_numeric(vgds.Critic_Score, errors='coerce')
# drop rows missing any of the three columns used in the ranking
vgds = vgds.dropna(axis=0, subset=['Global_Sales', 'User_Score', 'Critic_Score'])
vgds.head()

Unnamed: 0,Name,Platform,Year_of_Release,Genre,Publisher,NA_Sales,EU_Sales,JP_Sales,Other_Sales,Global_Sales,Critic_Score,Critic_Count,User_Score,User_Count,Developer,Rating
0,Wii Sports,Wii,2006.0,Sports,Nintendo,41.36,28.96,3.77,8.45,82.53,76.0,51.0,8.0,322.0,Nintendo,E
2,Mario Kart Wii,Wii,2008.0,Racing,Nintendo,15.68,12.76,3.79,3.29,35.52,82.0,73.0,8.3,709.0,Nintendo,E
3,Wii Sports Resort,Wii,2009.0,Sports,Nintendo,15.61,10.93,3.28,2.95,32.77,80.0,73.0,8.0,192.0,Nintendo,E
6,New Super Mario Bros.,DS,2006.0,Platform,Nintendo,11.28,9.14,6.5,2.88,29.8,89.0,65.0,8.5,431.0,Nintendo,E
7,Wii Play,Wii,2006.0,Misc,Nintendo,13.96,9.18,2.93,2.84,28.92,58.0,41.0,6.6,129.0,Nintendo,E
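
The cleaning cell above works in two steps: `pd.to_numeric(..., errors='coerce')` turns anything that is not a number into `NaN`, and `dropna` then removes those rows. A small sketch on made-up data follows. The `'tbd'` placeholder is a hypothetical example of a non-numeric entry, not a value confirmed to be in `videogames.csv`.

```python
import pandas as pd

# Made-up rows: one valid score, one non-numeric placeholder, one missing value.
raw = pd.DataFrame({"Name": ["A", "B", "C"], "User_Score": ["8.0", "tbd", None]})

# Non-numeric strings become NaN instead of raising an error.
raw["User_Score"] = pd.to_numeric(raw["User_Score"], errors="coerce")

# Rows with NaN in the listed columns are dropped.
clean = raw.dropna(subset=["User_Score"])
print(clean)
```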


In [5]:
vgds['sales_norm'] = vgds['Global_Sales']/vgds['Global_Sales'].max()
vgds['user_norm'] = vgds['User_Score']/vgds['User_Score'].max()
vgds['critic_norm'] = vgds['Critic_Score']/vgds['Critic_Score'].max()
vgds.head()

Unnamed: 0,Name,Platform,Year_of_Release,Genre,Publisher,NA_Sales,EU_Sales,JP_Sales,Other_Sales,Global_Sales,Critic_Score,Critic_Count,User_Score,User_Count,Developer,Rating,sales_norm,user_norm,critic_norm
0,Wii Sports,Wii,2006.0,Sports,Nintendo,41.36,28.96,3.77,8.45,82.53,76.0,51.0,8.0,322.0,Nintendo,E,1.0,0.833333,0.77551
2,Mario Kart Wii,Wii,2008.0,Racing,Nintendo,15.68,12.76,3.79,3.29,35.52,82.0,73.0,8.3,709.0,Nintendo,E,0.430389,0.864583,0.836735
3,Wii Sports Resort,Wii,2009.0,Sports,Nintendo,15.61,10.93,3.28,2.95,32.77,80.0,73.0,8.0,192.0,Nintendo,E,0.397068,0.833333,0.816327
6,New Super Mario Bros.,DS,2006.0,Platform,Nintendo,11.28,9.14,6.5,2.88,29.8,89.0,65.0,8.5,431.0,Nintendo,E,0.361081,0.885417,0.908163
7,Wii Play,Wii,2006.0,Misc,Nintendo,13.96,9.18,2.93,2.84,28.92,58.0,41.0,6.6,129.0,Nintendo,E,0.350418,0.6875,0.591837


In [6]:
# weights: 10% sales, 70% user score, 20% critic score
vgds['ranking'] = 0.1*vgds['sales_norm'] + 0.7*vgds['user_norm'] + 0.2*vgds['critic_norm']
vgds.sort_values('ranking', ascending=False)[:10]

Unnamed: 0,Name,Platform,Year_of_Release,Genre,Publisher,NA_Sales,EU_Sales,JP_Sales,Other_Sales,Global_Sales,Critic_Score,Critic_Count,User_Score,User_Count,Developer,Rating,sales_norm,user_norm,critic_norm,ranking
146,Metal Gear Solid,PS,1998.0,Action,Konami Digital Entertainment,3.18,1.83,0.78,0.24,6.03,94.0,20.0,9.4,918.0,KCEJ,M,0.073064,0.979167,0.959184,0.88456
1068,Resident Evil 4,GC,2005.0,Action,Capcom,0.98,0.42,0.22,0.06,1.69,96.0,82.0,9.4,767.0,Capcom,M,0.020477,0.979167,0.979592,0.883383
517,Metroid Prime,GC,2002.0,Shooter,Nintendo,1.96,0.67,0.1,0.09,2.82,97.0,70.0,9.3,747.0,Retro Studios,T,0.034169,0.96875,0.989796,0.879501
1546,Castlevania: Symphony of the Night,PS,1997.0,Platform,Konami Digital Entertainment,0.58,0.4,0.21,0.08,1.27,93.0,12.0,9.4,358.0,Konami,T,0.015388,0.979167,0.94898,0.876751
17,Grand Theft Auto: San Andreas,PS2,2004.0,Action,Take-Two Interactive,9.43,0.4,0.41,10.57,20.81,95.0,80.0,9.0,1588.0,Rockstar North,M,0.252151,0.9375,0.969388,0.875343
10999,Skies of Arcadia,DC,2000.0,Role-Playing,Sega,0.0,0.0,0.09,0.0,0.09,93.0,21.0,9.4,98.0,Overworks,T,0.001091,0.979167,0.94898,0.875322
9143,The Orange Box,PC,2007.0,Shooter,Electronic Arts,0.0,0.11,0.0,0.03,0.14,96.0,34.0,9.3,1495.0,Valve Software,M,0.001696,0.96875,0.979592,0.874213
3623,Metal Gear Solid 3: Subsistence,PS2,2005.0,Action,Konami Digital Entertainment,0.34,0.01,0.15,0.06,0.55,94.0,53.0,9.3,439.0,Aspect,M,0.006664,0.96875,0.959184,0.870628
97,Super Mario Galaxy 2,Wii,2010.0,Platform,Nintendo,3.56,2.35,0.98,0.62,7.51,97.0,87.0,9.1,1854.0,Nintendo EAD Tokyo,E,0.090997,0.947917,0.989796,0.870601
65,Final Fantasy VII,PS,1997.0,Role-Playing,Sony Computer Entertainment,3.01,2.47,3.28,0.96,9.72,92.0,20.0,9.2,1282.0,SquareSoft,T,0.117775,0.958333,0.938776,0.870366
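
The movie and video game rankings follow the same recipe: divide each column by its maximum, then take a weighted sum. As a sketch, that recipe could be wrapped in one helper so different weights are easy to try. The function name `weighted_ranking` is an assumption. The two rows are copied from the tables above, but they are normalized within this tiny frame, so the scores differ from the notebook output.

```python
import pandas as pd

def weighted_ranking(frame, weights):
    """Weighted sum of max-normalized columns; weights are rescaled to sum to 1."""
    total = sum(weights.values())
    score = pd.Series(0.0, index=frame.index)
    for column, weight in weights.items():
        score += (weight / total) * frame[column] / frame[column].max()
    return score

# Two rows taken from the video game table above.
games = pd.DataFrame(
    {"Global_Sales": [82.53, 6.03], "User_Score": [8.0, 9.4], "Critic_Score": [76.0, 94.0]},
    index=["Wii Sports", "Metal Gear Solid"],
)
games["ranking"] = weighted_ranking(
    games, {"Global_Sales": 0.1, "User_Score": 0.7, "Critic_Score": 0.2}
)
print(games.sort_values("ranking", ascending=False))
```

Rescaling the weights inside the helper means they do not have to sum to 1, so a call like `{"Global_Sales": 1, "User_Score": 7, "Critic_Score": 2}` gives the same ranking.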
