In [1]:
# imports
import pandas as pd

## MovieLens Data Structure

### Movies Data Set
This data set pairs each MovieLens ID with its title, formatted 'Name of Movie (Year)', and its genres, formatted 'genre1|genre2|...'.

In [20]:
movies = pd.read_csv('./Data/ml-latest-small/movies.csv')
movies.head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


We'll need to reorganize `genres`, most likely into one boolean column per genre, for example `genres_fantasy`. It would be interesting to look at how genres co-occur within a movie, and whether certain users rate multi-genre movies higher than single-genre ones.
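One possible way to build those boolean columns is pandas' `Series.str.get_dummies`, which splits on a separator and returns one indicator column per value. This is only a sketch on a few rows copied from the table above; `movies_sample`, `genre_flags`, `movies_flags`, and `genre_count` are illustrative names, not part of the data set.

```python
import pandas as pd

# Small stand-in for movies.csv, copied from the rows shown above
movies_sample = pd.DataFrame({
    'movieId': [1, 2, 3],
    'title': ['Toy Story (1995)', 'Jumanji (1995)', 'Grumpier Old Men (1995)'],
    'genres': ['Adventure|Animation|Children|Comedy|Fantasy',
               'Adventure|Children|Fantasy',
               'Comedy|Romance'],
})

# One indicator column per genre, split on the '|' separator,
# converted to booleans and renamed to the genres_<name> pattern
genre_flags = (
    movies_sample['genres']
    .str.get_dummies(sep='|')
    .astype(bool)
    .rename(columns=lambda g: 'genres_' + g.lower())
)

# Replace the packed string column with the boolean columns
movies_flags = movies_sample.drop(columns='genres').join(genre_flags)

# Number of genres per movie, handy for single- vs multi-genre comparisons
movies_flags['genre_count'] = genre_flags.sum(axis=1)
```

With this layout, `movies_flags[movies_flags['genres_fantasy']]` selects the fantasy movies, and `genre_count` separates single-genre from multi-genre titles.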

### Links Data Set
This data set pairs each MovieLens ID with its IMDb and TMDB IDs. This will be helpful if any data needs to be scraped from either of those websites. 
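If scraping is needed, the integer `imdbId` has to be turned back into IMDb's `tt`-prefixed title ID. The sketch below assumes the IDs are zero-padded to seven digits (as in `tt0114709` for Toy Story) and that title pages follow the URL pattern shown. `links_sample`, `imdb_tt`, and `imdb_url` are illustrative names.

```python
import pandas as pd

# Stand-in for links.csv, using IDs of the kind stored in that file
links_sample = pd.DataFrame({
    'movieId': [1, 2, 3],
    'imdbId': [114709, 113497, 113228],
})

# imdbId is parsed as an integer, so leading zeros are lost;
# pad back to 7 digits and add the 'tt' prefix (assumed ID format)
links_sample['imdb_tt'] = 'tt' + links_sample['imdbId'].astype(str).str.zfill(7)

# Assumed URL pattern for an IMDb title page
links_sample['imdb_url'] = 'https://www.imdb.com/title/' + links_sample['imdb_tt'] + '/'
```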

In [22]:
links = pd.read_csv('./Data/ml-latest-small/links.csv')
links.head()

Unnamed: 0,movieId,imdbId,tmdbId
0,1,114709,862.0
1,2,113497,8844.0
2,3,113228,15602.0
3,4,114885,31357.0
4,5,113041,11862.0


### Ratings Data Set
This data set contains the ratings from the tags research. Each `userId` has multiple reviews, one row per movie, recording the `movieId`, `rating`, and `timestamp`.

In [23]:
ratings = pd.read_csv('./Data/ml-latest-small/ratings.csv')
ratings.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931


In [6]:
ratings.dtypes

userId         int64
movieId        int64
rating       float64
timestamp      int64
dtype: object
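The `timestamp` column is stored as an integer. Assuming it holds seconds since the Unix epoch (1970-01-01 UTC), it can be converted to a datetime with `pd.to_datetime(..., unit='s')`. This is a sketch on the first rows shown above; `ratings_sample` and `rated_at` are illustrative names.

```python
import pandas as pd

# First rows of ratings.csv as displayed above
ratings_sample = pd.DataFrame({
    'userId': [1, 1, 1],
    'movieId': [1, 3, 6],
    'rating': [4.0, 4.0, 4.0],
    'timestamp': [964982703, 964981247, 964982224],
})

# Interpret the integers as seconds since 1970-01-01 UTC (assumption)
ratings_sample['rated_at'] = pd.to_datetime(
    ratings_sample['timestamp'], unit='s', utc=True
)
```

A proper datetime column makes it possible to group ratings by year or month with the `.dt` accessor.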

In [8]:
# Mean rating for Toy Story (movieId 1)
ratings[ratings['movieId']==1]['rating'].mean()

3.9209302325581397
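The same mean can be computed for every movie at once with a `groupby`. The sketch below uses made-up ratings rather than the real file, and also keeps the number of ratings per movie, since a mean over very few ratings is less reliable. `ratings_sample`, `movie_stats`, `mean_rating`, and `n_ratings` are illustrative names.

```python
import pandas as pd

# Made-up ratings for illustration only, not taken from ratings.csv
ratings_sample = pd.DataFrame({
    'userId': [1, 2, 3, 1, 2],
    'movieId': [1, 1, 1, 3, 3],
    'rating': [4.0, 5.0, 3.0, 4.0, 2.0],
})

# Mean rating and number of ratings per movie
movie_stats = (
    ratings_sample
    .groupby('movieId')['rating']
    .agg(mean_rating='mean', n_ratings='count')
)
```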

### Tags Data Set

In [24]:
tags = pd.read_csv('./Data/ml-latest-small/tags.csv')
tags.head()

Unnamed: 0,userId,movieId,tag,timestamp
0,2,60756,funny,1445714994
1,2,60756,Highly quotable,1445714996
2,2,60756,will ferrell,1445714992
3,2,89774,Boxing story,1445715207
4,2,89774,MMA,1445715200


## IEEE IMDb Review Data
This data set comes from IEEE DataPort. It contains 4,669,820 ratings from 1,499,238 users for 351,109 movies on IMDb. Each row has a userID, movieID, rating, and review date.

In [8]:
# This code comes directly from the IEEE instructions for reading in this data
import numpy as np

dataset = np.load("./Data/Large-Data/Dataset.npy")

dataset[0]

'ur4592644,tt0120884,10,16 January 2005'

The data is loaded as a NumPy array of comma-separated strings, so we need to split each row before turning it into a DataFrame.

In [9]:
# Turn into a list of rows, splitting each string on commas, like a csv file
data_list = [row.split(',') for row in dataset]

# Convert to pandas DataFrame with labeled columns
imdb_ratings = pd.DataFrame(data_list, columns=['UserID', 'MovieID', 'Rating', 'ReviewDate'])

In [13]:
print(imdb_ratings.dtypes)
imdb_ratings.head()

UserID                object
MovieID               object
Rating                 int64
ReviewDate    datetime64[ns]
dtype: object


Unnamed: 0,UserID,MovieID,Rating,ReviewDate
0,ur4592644,tt0120884,10,2005-01-16
1,ur3174947,tt0118688,3,2005-01-16
2,ur3780035,tt0387887,8,2005-01-16
3,ur4592628,tt0346491,1,2005-01-16
4,ur3174947,tt0094721,8,2005-01-16


In [12]:
# Convert Rating to numeric
imdb_ratings['Rating'] = pd.to_numeric(imdb_ratings['Rating'])

# Convert ReviewDate to datetime
imdb_ratings['ReviewDate'] = pd.to_datetime(imdb_ratings['ReviewDate'])

imdb_ratings.dtypes

UserID                object
MovieID               object
Rating                 int64
ReviewDate    datetime64[ns]
dtype: object
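With several million rows, it may help to give `pd.to_datetime` an explicit format instead of letting it infer one. The sketch below assumes every date looks like the raw value shown earlier, `16 January 2005`, which corresponds to `'%d %B %Y'`. The second raw row is reconstructed from the displayed table, and `raw_rows` and `imdb_sample` are illustrative names.

```python
import pandas as pd

# First raw string is the one displayed above; the second is reconstructed
# from the DataFrame output and assumed to use the same date layout
raw_rows = [
    'ur4592644,tt0120884,10,16 January 2005',
    'ur3174947,tt0118688,3,16 January 2005',
]

imdb_sample = pd.DataFrame(
    [row.split(',') for row in raw_rows],
    columns=['UserID', 'MovieID', 'Rating', 'ReviewDate'],
)

# Ratings arrive as strings after the split, so convert them to numbers
imdb_sample['Rating'] = pd.to_numeric(imdb_sample['Rating'])

# Day, full English month name, four-digit year (assumed for all rows)
imdb_sample['ReviewDate'] = pd.to_datetime(
    imdb_sample['ReviewDate'], format='%d %B %Y'
)
```

If some rows do not follow this layout, the call raises an error by default, which is a useful check that the format assumption holds.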

We can compute our own average IMDb movie ratings from this.

In [15]:
imdb_ratings[imdb_ratings['MovieID'] == 'tt0120884']

Unnamed: 0,UserID,MovieID,Rating,ReviewDate
0,ur4592644,tt0120884,10,2005-01-16
202637,ur9036543,tt0120884,8,2006-01-24
1335067,ur11167152,tt0120884,8,2011-06-14
2578671,ur0055545,tt0120884,8,1999-05-16
2578672,ur0338514,tt0120884,8,1999-05-22
2578673,ur0586020,tt0120884,10,2000-01-26
2578674,ur1083887,tt0120884,9,2001-03-03


In [17]:
imdb_ratings[imdb_ratings['MovieID'] == 'tt0120884']['Rating'].mean()

8.714285714285714

In [25]:
links[links['imdbId'] == '120884']

Unnamed: 0,movieId,imdbId,tmdbId


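The lookup returns no rows, but that does not yet show the movie is missing from the MovieLens research: `imdbId` in `links` is parsed as an integer (for example 114709), so a comparison against the string '120884' can never match. The check needs to be repeated with the integer 120884 before drawing a conclusion.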
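Because `links` stores `imdbId` as an integer while the IEEE data uses `tt`-prefixed strings, a lookup needs to convert between the two. The sketch below uses a two-row stand-in for `links`, so its results say nothing about the real file. `find_movielens_rows` is a hypothetical helper, not part of pandas or either data set.

```python
import pandas as pd

# Two-row stand-in for links.csv; the real file has many more movies
links_sample = pd.DataFrame({
    'movieId': [1, 2],
    'imdbId': [114709, 113497],
})

def find_movielens_rows(links_df, imdb_tt):
    """Hypothetical helper: look up an IMDb 'tt...' ID in a links-style frame."""
    # Drop the 'tt' prefix; int() also discards the leading zeros
    imdb_int = int(imdb_tt[2:])
    return links_df[links_df['imdbId'] == imdb_int]

found = find_movielens_rows(links_sample, 'tt0114709')    # one matching row
missing = find_movielens_rows(links_sample, 'tt0120884')  # absent from this sample only

# Comparing the integer column against a string matches nothing,
# even for an ID that is present
string_match = links_sample[links_sample['imdbId'] == '114709']
```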