### Recommendations with MovieTweetings: Getting to Know The Data

Throughout this lesson, you will be working with the [MovieTweetings Data](https://github.com/sidooms/MovieTweetings/tree/master/recsyschallenge2014). To learn more about the project and the dataset, see the [publication here](http://crowdrec2013.noahlab.com.hk/papers/crowdrec2013_Dooms.pdf).

First, import the libraries and read in the two datasets you will use throughout the lesson with the code below.

 

In [23]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

In [24]:
# Read in the MovieTweetings dataset, originally taken from
# https://github.com/sidooms/MovieTweetings/tree/master/latest
movies = pd.read_csv('data/movies.dat', delimiter='::', header=None,
                     names=['movie_id', 'movie', 'genre'],
                     dtype={'movie_id': object}, engine='python')
reviews = pd.read_csv('data/ratings.dat', delimiter='::', header=None,
                      names=['user_id', 'movie_id', 'rating', 'timestamp'],
                      dtype={'movie_id': object, 'user_id': object,
                             'timestamp': object}, engine='python')
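
To see what these arguments do without the data files on disk, here is a small sketch that parses two sample lines in the same `::`-separated layout. The zero-padded IDs and the names `sample` and `demo` are made up for illustration; only the `read_csv` arguments mirror the cell above.

```python
import io
import pandas as pd

# Two illustrative lines in the same layout as movies.dat (IDs are made up)
sample = (
    "0000008::Edison Kinetoscopic Record of a Sneeze (1894)::Documentary|Short\n"
    "0000091::Le manoir du diable (1896)::Short|Horror\n"
)

# A multi-character separator such as '::' needs engine='python'
demo = pd.read_csv(io.StringIO(sample), delimiter='::', header=None,
                   names=['movie_id', 'movie', 'genre'],
                   dtype={'movie_id': object}, engine='python')

# dtype=object keeps the IDs as strings, so any leading zeros survive
print(demo['movie_id'].tolist())
```

Reading the IDs as strings is what makes the later leading-zero cleanup possible; with the default type inference they would be parsed as integers.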

#### 1. Take a Look At The Data 

In [25]:
movies.head()

Unnamed: 0,movie_id,movie,genre
0,8,Edison Kinetoscopic Record of a Sneeze (1894),Documentary|Short
1,10,La sortie des usines Lumière (1895),Documentary|Short
2,12,The Arrival of a Train (1896),Documentary|Short
3,25,The Oxford and Cambridge University Boat Race ...,
4,91,Le manoir du diable (1896),Short|Horror


In [26]:
reviews.head()

Unnamed: 0,user_id,movie_id,rating,timestamp
0,1,114508,8,1381006850
1,2,208092,5,1586466072
2,2,358273,9,1579057827
3,2,10039344,5,1578603053
4,2,6751668,9,1578955697


#### 2. Data Cleaning

Next, we need to pull some additional relevant information out of the existing columns. 

For each of the datasets, there are a couple of cleaning steps we need to take care of:

#### Movies
* Pull the release year out of the title and store it in a new column
* Create dummy columns of 1's and 0's for the century of each movie (1800's, 1900's, and 2000's)
* Create dummy columns of 1's and 0's for each genre

#### Reviews
* Create a date from the Unix timestamp, then create dummy columns of 1's and 0's for its month and year

In [27]:
movies.head()

Unnamed: 0,movie_id,movie,genre
0,8,Edison Kinetoscopic Record of a Sneeze (1894),Documentary|Short
1,10,La sortie des usines Lumière (1895),Documentary|Short
2,12,The Arrival of a Train (1896),Documentary|Short
3,25,The Oxford and Cambridge University Boat Race ...,
4,91,Le manoir du diable (1896),Short|Horror


In [28]:
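# make a new column with the movie year, taken from the parentheses in the title
# (regex=False treats ')' as a literal character on every pandas version)
movies['date'] = movies['movie'].str.split(
    '(', expand=True)[1].str.replace(')', '', regex=False)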
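
Splitting on the first `(` assumes the year is the only parenthesized part of a title. As a hedged alternative (not the approach used in this lesson), the sketch below pulls a four-digit year from the trailing parentheses with `Series.str.extract`. The second title is invented to show the difference.

```python
import pandas as pd

titles = pd.Series([
    'Le manoir du diable (1896)',
    'A Made-Up Title (Part Two) (2001)',  # hypothetical title with extra parentheses
])

# Capture four digits inside the final pair of parentheses
years = titles.str.extract(r'\((\d{4})\)\s*$', expand=False)
print(years.tolist())
```

Titles without a trailing year come back as missing values with this pattern, so they would still need handling before the century step.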

  


In [32]:
def year_century(year):
    '''
    INPUT:
    year - movie year
    OUTPUT:
    century - century of movie
    '''
    century = year[:2] + "00's"
    return century

In [34]:
# create century column from movie year
movies['century'] = movies['date'].apply(year_century)

In [38]:
# find unique movie centuries
movies['century'].unique()

array(["1800's", "1900's", "2000's"], dtype=object)

In [48]:
def dummy_century(century, year):
    '''
    INPUT:
    century - movie century label, e.g. "1900's"
    year - century label from the list of centuries to compare against
    OUTPUT:
    (int) - 1 if century is equal to year, 0 otherwise
    '''
    return 1 if century == year else 0

In [58]:
# dummy the century column
years = ["1800's", "1900's", "2000's"]
for year in years:
    movies[year] = movies['century'].apply(dummy_century, args=(year,))
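
The same 1/0 columns can also be produced in one call with `pd.get_dummies`. This is a sketch on a toy Series, not a replacement for the loop above; the `century` values below are sample data.

```python
import pandas as pd

century = pd.Series(["1800's", "1900's", "2000's", "1900's"])  # toy data

# One column per distinct label; dtype=int gives 1's and 0's instead of booleans
dummies = pd.get_dummies(century, dtype=int)
print(dummies)
```

Each row has exactly one 1, in the column that matches its century label.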

In [83]:
movies.drop('century', axis=1, inplace=True)

In [52]:
# obtain set of unique genres
genres = set()

for genre in movies['genre']:
    # movies without a genre are read in as NaN (a float), which has no .split
    if isinstance(genre, str):
        genres.update(genre.split('|'))

In [53]:
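def dummy_genre(movie_genres, genre):
    '''
    INPUT:
    movie_genres - pipe-separated movie genres, or NaN if there are none
    genre - a genre
    OUTPUT:
    (int) - 1 if genre is one of movie_genres, 0 otherwise
    '''
    # NaN (missing genre) is a float, not a string
    if not isinstance(movie_genres, str):
        return 0
    # compare whole genre names so that 'Music' does not match 'Musical'
    return int(genre in movie_genres.split('|'))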

In [54]:
# dummy column the genre
for genre in genres:
    movies[genre] = movies['genre'].apply(dummy_genre, args=(genre,))
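
For pipe-separated labels like these, pandas also provides `Series.str.get_dummies`, which builds every genre column at once. A minimal sketch on toy data (the `genre` Series below stands in for `movies['genre']`):

```python
import pandas as pd

genre = pd.Series(['Documentary|Short', 'Short|Horror', None, 'Musical'])  # toy data

# One 0/1 column per distinct genre; a missing value becomes an all-zero row
genre_dummies = genre.str.get_dummies(sep='|')
print(genre_dummies)
```

Because the split happens on the separator, `Musical` does not create or set a `Music` column here.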

In [78]:
def remove_leading_zeros(movie_id):
    '''
    INPUT:
    movie_id - movie id string
    OUTPUT:
    (str) - movie id string with leading 0s removed
    '''
    return movie_id.lstrip('0')

In [81]:
# remove leading 0s from movie IDs
movies['movie_id'] = movies['movie_id'].apply(remove_leading_zeros)

In [99]:
movies.head()

Unnamed: 0,movie_id,movie,genre,date,1800's,1900's,2000's,Crime,Reality-TV,Music,...,Film-Noir,News,Musical,Comedy,Animation,Western,Mystery,Adult,Horror,War
0,8,Edison Kinetoscopic Record of a Sneeze (1894),Documentary|Short,1894,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,10,La sortie des usines Lumière (1895),Documentary|Short,1895,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,12,The Arrival of a Train (1896),Documentary|Short,1896,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,25,The Oxford and Cambridge University Boat Race ...,,1895,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,91,Le manoir du diable (1896),Short|Horror,1896,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0


In [59]:
reviews.head()

Unnamed: 0,user_id,movie_id,rating,timestamp
0,1,114508,8,1381006850
1,2,208092,5,1586466072
2,2,358273,9,1579057827
3,2,10039344,5,1578603053
4,2,6751668,9,1578955697


In [62]:
# create date column from unix timestamp
# (timestamp was read in as a string, so cast it to an integer first)
reviews['date'] = pd.to_datetime(reviews['timestamp'].astype('int64'), unit='s')
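
As a quick check of what `unit='s'` means, the sketch below converts the first timestamp shown in `reviews.head()` by hand. The variable name `ts` is only for illustration.

```python
import pandas as pd

# First timestamp from reviews.head(): seconds since 1970-01-01 00:00:00 UTC
ts = pd.to_datetime(int('1381006850'), unit='s')
print(ts)                 # 2013-10-05 21:00:50
print(ts.year, ts.month)  # 2013 10
```

The `year` and `month` attributes used here are the same ones the following cells turn into dummy columns.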

In [85]:
# create month column to get dummies from
reviews['month'] = reviews['date'].dt.month

In [94]:
# get month dummies (dtype=int keeps them as 1's and 0's rather than booleans)
reviews = pd.concat([reviews, pd.get_dummies(
    reviews['month'], prefix='month', dtype=int)], axis=1)

In [95]:
# create year column to get dummies from
reviews['year'] = reviews['date'].dt.year

In [96]:
# get year dummies (dtype=int keeps them as 1's and 0's rather than booleans)
reviews = pd.concat([reviews, pd.get_dummies(
    reviews['year'], prefix='year', dtype=int)], axis=1)

In [98]:
reviews.drop(['month', 'year'], axis=1, inplace=True)

In [101]:
# remove leading 0s from movie IDs
reviews['movie_id'] = reviews['movie_id'].apply(remove_leading_zeros)

In [102]:
reviews.head()

Unnamed: 0,user_id,movie_id,rating,timestamp,date,month_1,month_2,month_3,month_4,month_5,...,month_11,month_12,year_2013,year_2014,year_2015,year_2016,year_2017,year_2018,year_2019,year_2020
0,1,114508,8,1381006850,2013-10-05 21:00:50,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,0
1,2,208092,5,1586466072,2020-04-09 21:01:12,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,1
2,2,358273,9,1579057827,2020-01-15 03:10:27,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
3,2,10039344,5,1578603053,2020-01-09 20:50:53,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
4,2,6751668,9,1578955697,2020-01-13 22:48:17,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1


In [104]:
# save the cleaned data to CSV
movies.to_csv('data/movies_clean.csv', index=False)
reviews.to_csv('data/reviews_clean.csv', index=False)
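
When the cleaned files are read back later, `read_csv` will infer `movie_id` as an integer unless told otherwise. Here is a minimal sketch of that round trip, using a temporary file and a toy frame in place of the real data; the names `toy`, `inferred`, and `as_text` are illustrative.

```python
import os
import tempfile
import pandas as pd

toy = pd.DataFrame({'movie_id': ['8', '10'], 'rating': [8, 5]})  # toy stand-in

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'reviews_clean.csv')
    toy.to_csv(path, index=False)  # index=False avoids an extra unnamed column
    inferred = pd.read_csv(path)
    as_text = pd.read_csv(path, dtype={'movie_id': object})

print(inferred['movie_id'].dtype)    # an integer dtype
print(as_text['movie_id'].tolist())  # IDs kept as strings
```

Passing the same `dtype` for `movie_id` when loading both cleaned files keeps the column types aligned if the two tables are joined on that column later.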