In [3]:
from google.colab import drive
drive.mount('/content/drive')
import numpy as np
import pandas as pd

Mounted at /content/drive


Load the ratings and movie data from Google Drive into pandas dataframes.

In [4]:
df_ratings = pd.read_csv('/content/drive/My Drive/rating.csv', parse_dates=['timestamp'])
df_movies = pd.read_csv('/content/drive/My Drive/movie.csv')


In [14]:
print(df_ratings.head())
print(df_movies)

   userId  movieId  rating           timestamp
0       1        2     3.5 2005-04-02 23:53:47
1       1       29     3.5 2005-04-02 23:31:16
2       1       32     3.5 2005-04-02 23:33:39
3       1       47     3.5 2005-04-02 23:32:07
4       1       50     3.5 2005-04-02 23:29:40
       movieId  ...                                       genres
0            1  ...  Adventure|Animation|Children|Comedy|Fantasy
1            2  ...                   Adventure|Children|Fantasy
2            3  ...                               Comedy|Romance
3            4  ...                         Comedy|Drama|Romance
4            5  ...                                       Comedy
...        ...  ...                                          ...
27273   131254  ...                                       Comedy
27274   131256  ...                                       Comedy
27275   131258  ...                                    Adventure
27276   131260  ...                           (no genres listed)
272

We merge the two dataframes on movieId, which appears in both. The resulting dataframe contains every rating from rating.csv with the corresponding movie title attached. The genres and timestamp columns are not needed for the recommendations, so we drop them.

In [6]:
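# Inner join on the shared key; movieId is the only column the two frames have in common.
df_combined = pd.merge(df_movies, df_ratings, on='movieId')
# Drop the columns that are not used for the recommendations.
df_combined = df_combined.drop(['genres', 'timestamp'], axis=1)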
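
To make the merge step concrete, here is a minimal sketch on two toy frames that stand in for movie.csv and rating.csv. The values are made up for illustration; only the column names follow the notebook.

```python
import pandas as pd

# Toy stand-ins for movie.csv and rating.csv (values are made up for illustration).
movies = pd.DataFrame({
    "movieId": [1, 2],
    "title": ["Toy Story (1995)", "Jumanji (1995)"],
    "genres": ["Adventure|Animation", "Adventure|Children"],
})
ratings = pd.DataFrame({
    "userId": [3, 6, 3],
    "movieId": [1, 1, 2],
    "rating": [4.0, 5.0, 3.5],
    "timestamp": pd.to_datetime(["2005-04-02", "2005-04-03", "2005-04-04"]),
})

# Inner join on the shared key: each rating row picks up its movie's title.
combined = pd.merge(movies, ratings, on="movieId")
combined = combined.drop(columns=["genres", "timestamp"])
print(combined)
```

Each rating keeps its own row, so a movie with several ratings appears several times with the same title.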

We tried to build a pivot table from the full dataset, but the current version of pandas would not create one from this much data. We therefore keep only the first 1,000,000 rows of the merged dataframe so that the pivot table can be built.
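
As a rough back-of-the-envelope check of why the full table is a problem, the sketch below estimates the size of a dense users-by-titles table. The user count is an assumed placeholder, not a figure from this notebook, and the title count is only approximately what the movie listing above shows.

```python
# Rough size estimate for a dense users x titles table.
n_users = 100_000    # ASSUMPTION: placeholder, not taken from the notebook
n_titles = 27_000    # roughly the number of rows printed for df_movies
bytes_per_cell = 8   # one float64 rating per cell

cells = n_users * n_titles
gib = cells * bytes_per_cell / 2**30
print(f"{cells:,} cells, about {gib:.1f} GiB as float64")
```

Under these assumptions the table has billions of cells, almost all of them empty, which is why working on a subset is the pragmatic choice here.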

In [7]:
print(df_combined.head())
df_combined = df_combined.iloc[:1000000]

   movieId             title  userId  rating
0        1  Toy Story (1995)       3     4.0
1        1  Toy Story (1995)       6     5.0
2        1  Toy Story (1995)       8     4.0
3        1  Toy Story (1995)      10     4.0
4        1  Toy Story (1995)      11     4.5
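
One consequence of slicing the first rows is worth checking. The merged frame is ordered by movie, so the cut keeps every rating for the earliest movies and none for the rest, which is consistent with the correlation output later listing only 146 titles. A minimal sketch on a toy frame (made-up values) shows how to measure this with `nunique`:

```python
import pandas as pd

# Toy frame ordered by movie, like the merged frame in this notebook.
toy = pd.DataFrame({
    "title": ["A", "A", "A", "B", "B", "C"],
    "userId": [1, 2, 3, 1, 2, 3],
    "rating": [4.0, 5.0, 3.0, 2.0, 4.5, 1.0],
})

subset = toy.iloc[:4]  # keep only the first 4 rows

# Compare how many distinct titles survive the cut.
titles_before = toy["title"].nunique()
titles_after = subset["title"].nunique()
print(titles_before, "titles before,", titles_after, "after")
```

Running the same two `nunique` calls on df_combined before and after the slice would show how many movies the reduced data actually covers.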


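The next step is to build the pivot table from the reduced data. Each row is a user, each column is a movie title, and each cell holds that user's rating for that movie (NaN where the user has not rated it). This layout lets us compare movies by how the same users rated them, which is the basis for the recommendations below.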

In [8]:
pt = df_combined.pivot_table(index = ["userId"], columns = ["title"], values = "rating")
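
To show what the pivot table looks like, here is a small sketch with made-up ratings. The column names match the notebook; the data does not come from it.

```python
import pandas as pd

# Long format: one row per (user, movie) rating. Values are made up.
toy = pd.DataFrame({
    "userId": [1, 1, 2, 3, 3],
    "title": ["A", "B", "A", "B", "C"],
    "rating": [4.0, 3.0, 5.0, 2.0, 4.5],
})

# Wide format: users as rows, titles as columns, ratings as values.
toy_pt = toy.pivot_table(index="userId", columns="title", values="rating")
print(toy_pt)
```

User 2 never rated movie B, so that cell is NaN. Correlations between two movies are computed only over users who rated both.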

The following function takes a movie title as a string and calls the pandas corrwith method on the pivot table to correlate every movie's ratings with those of the given movie. It then sorts the correlations in descending order, from most to least correlated. The movies at the top are the ones that would be recommended to someone who has watched the input movie, with the first entry being the most relevant.

In [15]:
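def get_movie_rating(movie_title):
  # Correlate every movie's rating column with the ratings of the given title.
  recommended_movies = pt.corrwith(pt[movie_title])
  # Sort from most to least correlated.
  recommended_movies = recommended_movies.sort_values(ascending=False)
  # The result is printed rather than returned, so the function returns None.
  print(recommended_movies)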
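
A possible variant, sketched here under assumptions rather than taken from the notebook, returns the sorted correlations and ignores movies that share too few raters with the input movie. The names `similar_movies` and `min_common` are hypothetical, and the toy ratings are made up.

```python
import numpy as np
import pandas as pd

def similar_movies(pivot, movie_title, min_common=2):
    """Hypothetical helper: correlations with movie_title, highest first."""
    target = pivot[movie_title]
    corr = pivot.corrwith(target)
    # Count, per movie, the users who rated both it and the target movie.
    common = pivot.loc[target.notna()].notna().sum()
    # Keep movies with enough shared raters and drop the target itself.
    corr = corr[common >= min_common].drop(movie_title, errors="ignore")
    return corr.dropna().sort_values(ascending=False)

# Toy pivot table: rows are users, columns are movies (made-up ratings).
toy_pt = pd.DataFrame(
    {
        "A": [5.0, 4.0, 1.0, 2.0],
        "B": [4.0, 5.0, 2.0, 1.0],
        "C": [1.0, 2.0, 5.0, 4.0],
        "D": [5.0, np.nan, np.nan, np.nan],
    },
    index=pd.Index([1, 2, 3, 4], name="userId"),
)

result = similar_movies(toy_pt, "A")
print(result)
```

Movie D is left out because only one user rated both A and D, which is too little to compute a meaningful correlation. Returning the Series also lets the caller slice it, for example with `.head(10)`.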

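As an example, we look up Catwalk (1996) and list the movies most correlated with it in the pivot table. Catwalk is a documentary, and we read the movies at the top of the list as ones grounded in real life, which would make sense for a documentary. Correlations of exactly 1.0 or -1.0 usually mean that very few users rated both movies, so these scores should be interpreted with caution.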

In [19]:
print(get_movie_rating('Catwalk (1996)'))

title
Nobody Loves Me (Keiner liebt mich) (1994)                        1.000000
Shopping (1994)                                                   1.000000
Catwalk (1996)                                                    1.000000
Wings of Courage (1995)                                           1.000000
When Night Is Falling (1995)                                      0.932545
                                                                    ...   
Gospa (1995)                                                     -0.981981
Headless Body in Topless Bar (1995)                              -1.000000
Angela (1995)                                                    -1.000000
Happiness Is in the Field (Bonheur est dans le pré, Le) (1995)   -1.000000
Guardian Angel (1994)                                                  NaN
Length: 146, dtype: float64
None
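
The many scores of exactly 1.000000 and -1.000000 in this output have a simple explanation when only two users rated both movies: any two points lie on a straight line, so the Pearson correlation is always perfect. A minimal sketch with made-up ratings:

```python
import pandas as pd

# Ratings of two movies by the same two users (values made up).
movie_a = pd.Series([4.0, 3.5])
movie_b = pd.Series([2.0, 1.0])       # falls when movie_a falls
movie_c = pd.Series([1.0, 5.0])       # rises when movie_a falls

r_ab = movie_a.corr(movie_b)
r_ac = movie_a.corr(movie_c)
print(r_ab, r_ac)
```

Up to floating-point rounding, the first value is 1 and the second is -1, regardless of how different the actual ratings are.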


