<p style="font-family: Arial; font-size:3.75em;font-style:bold"><br>
Pandas</p><br>


In [1]:
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

<p style="font-family: Arial; font-size:1.75em;color:#2462C0; font-style:bold"><br>

Import Libraries
</p>

In [2]:
import pandas as pd
import numpy as np

<p style="font-family: Arial; font-size:1.75em;color:#2462C0; font-style:bold">
Case Study: Movie Data Analysis</p>
<br>This notebook uses a dataset from the MovieLens website. We will describe the dataset further as we explore it using *pandas*.

## Download the Dataset

Please note that **you will need to download the dataset**. 

Here are the links to the data source and location:
* **Data Source:** MovieLens web site (filename: ml-20m.zip)
* **Location:** https://grouplens.org/datasets/movielens/


<p style="font-family: Arial; font-size:1.75em;color:#2462C0; font-style:bold">
Use Pandas to Read the Dataset<br>
</p>
<br>

In this notebook, we will be using three CSV files:
* **ratings.csv :** *userId*,*movieId*,*rating*, *timestamp*

* **tags.csv :** *userId*,*movieId*, *tag*, *timestamp*

* **movies.csv :** *movieId*, *title*, *genres* <br>

Using the *read_csv* function in pandas, we will ingest these three files.
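
As a minimal sketch of what *read_csv* does, the snippet below parses a tiny in-memory stand-in for *movies.csv* instead of the real file. The three rows are copied from the movies preview in this notebook; the `io.StringIO` wrapper and the name `toy_movies` are assumptions made only so the example runs without the download.

```python
import io

import pandas as pd

# Toy stand-in for movies.csv: header plus three rows from the notebook's preview.
csv_text = """movieId,title,genres
1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
2,Jumanji (1995),Adventure|Children|Fantasy
3,Grumpier Old Men (1995),Comedy|Romance
"""

# read_csv accepts a file path or any file-like object; sep=',' is the default.
toy_movies = pd.read_csv(io.StringIO(csv_text), sep=',')

# Column types are inferred: movieId becomes an integer column, the rest are text.
print(toy_movies.dtypes)
```

With the real dataset, the same call takes the path to the extracted CSV file.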

In [3]:
movies = pd.read_csv('movielens/movies.csv', sep=',')
print(type(movies))
movies.head(15)

<class 'pandas.core.frame.DataFrame'>


Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy
5,6,Heat (1995),Action|Crime|Thriller
6,7,Sabrina (1995),Comedy|Romance
7,8,Tom and Huck (1995),Adventure|Children
8,9,Sudden Death (1995),Action
9,10,GoldenEye (1995),Action|Adventure|Thriller


In [4]:
# Timestamps represent seconds since midnight Coordinated Universal Time (UTC) of January 1, 1970

tags = pd.read_csv('movielens/tags.csv', sep=',')
tags.head()
tags = tags.set_index(['userId','movieId'])
tags.head()

Unnamed: 0,userId,movieId,tag,timestamp
0,18,4141,Mark Waters,1240597180
1,65,208,dark hero,1368150078
2,65,353,dark hero,1368150079
3,65,521,noir thriller,1368149983
4,65,592,dark hero,1368150078


userId,movieId,tag,timestamp
18,4141,Mark Waters,1240597180
65,208,dark hero,1368150078
65,353,dark hero,1368150079
65,521,noir thriller,1368149983
65,592,dark hero,1368150078


In [5]:
ratings = pd.read_csv('movielens/ratings.csv', sep=',') 
ratings.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,2,3.5,1112486027
1,1,29,3.5,1112484676
2,1,32,3.5,1112484819
3,1,47,3.5,1112484727
4,1,50,3.5,1112484580


In [6]:
# For the current analysis, we drop the timestamp column

del ratings['timestamp']
del tags['timestamp']

In [7]:
movies.shape

(27278, 3)

In [8]:
ratings.shape

(20000263, 3)

In [9]:
tags.shape

(465564, 1)

<h1 style="font-size:2em;color:#2467C0">Descriptive Statistics</h1>

Let's look at how the ratings are distributed.

In [10]:
# TODO: summary statistics of the rating data
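
One possible way to fill in this cell is `Series.describe()`. The sketch below runs on a small made-up `toy_ratings` frame (the rating values are invented for illustration, not taken from MovieLens); on the real data the same call would be applied to `ratings['rating']`.

```python
import pandas as pd

# Made-up toy ratings; the values are illustrative only.
toy_ratings = pd.DataFrame({
    'userId': [1, 1, 2, 2, 3],
    'movieId': [2, 29, 2, 32, 29],
    'rating': [3.5, 4.0, 5.0, 0.5, 4.5],
})

# describe() reports count, mean, std, min, quartiles and max in one call.
rating_summary = toy_ratings['rating'].describe()
print(rating_summary)

# value_counts() shows how often each distinct rating occurs.
print(toy_ratings['rating'].value_counts().sort_index())
```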



In [11]:
# TODO: check the validity of rating data: any rating greater than 5 or smaller than 0?
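
A hedged sketch of the range check: build a boolean mask for out-of-range values and ask whether any entry is `True`. The `toy_ratings` frame is made up for illustration; the bounds 0 and 5 come from the TODO above.

```python
import pandas as pd

# Made-up toy ratings; the values are illustrative only.
toy_ratings = pd.DataFrame({'rating': [3.5, 4.0, 5.0, 0.5, 4.5]})

# True for any rating outside the closed interval [0, 5].
out_of_range = (toy_ratings['rating'] > 5) | (toy_ratings['rating'] < 0)

# any() collapses the mask to a single answer.
print(out_of_range.any())

# The observed minimum and maximum give the same information at a glance.
print(toy_ratings['rating'].min(), toy_ratings['rating'].max())
```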



<h1 style="font-size:2em;color:#2467C0">Data Cleaning: Handling Missing Data</h1>

In [12]:
# Do we have any missing data in the three files? If so, use dropna() to remove the corresponding rows
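
A minimal sketch of the missing-data check, assuming a toy frame in which one tag has been deliberately blanked out (the missing value is invented for illustration). On the real data the same two calls would be applied to `movies`, `ratings` and `tags` in turn.

```python
import numpy as np
import pandas as pd

# Toy tags frame; the NaN is injected on purpose to have something to drop.
toy_tags = pd.DataFrame({
    'userId': [18, 65, 65],
    'movieId': [4141, 208, 353],
    'tag': ['Mark Waters', np.nan, 'dark hero'],
})

# isnull().any() reports, per column, whether at least one value is missing.
print(toy_tags.isnull().any())

# dropna() returns a new frame without the rows that contain missing values.
cleaned_tags = toy_tags.dropna()
print(cleaned_tags.shape)
```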



<h1 style="font-size:2em;color:#2467C0">Filters for Selecting Rows</h1>

In [13]:
# from ratings, select rows where ratings are higher than 4.0
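
Row filtering in pandas is usually done with a boolean mask. The sketch below uses a made-up `toy_ratings` frame (values invented for illustration); the names `is_highly_rated` and `highly_rated` are hypothetical.

```python
import pandas as pd

# Made-up toy ratings; the values are illustrative only.
toy_ratings = pd.DataFrame({
    'userId': [1, 1, 2, 2, 3],
    'movieId': [2, 29, 2, 32, 29],
    'rating': [3.5, 4.0, 5.0, 0.5, 4.5],
})

# Comparing a column to a scalar yields one True/False per row.
is_highly_rated = toy_ratings['rating'] > 4.0

# Indexing with the mask keeps only the rows where it is True.
highly_rated = toy_ratings[is_highly_rated]
print(highly_rated)
```

Note that the comparison is strict, so a rating of exactly 4.0 is excluded.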



In [14]:
# from movies, select rows where the movie belongs to the Animation genre
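
Because *genres* is a single pipe-separated string, one way to select a genre is a substring test with `str.contains`. The sketch uses three rows from the movies preview in this notebook; `toy_movies` and `animation_movies` are hypothetical names.

```python
import pandas as pd

# Three rows copied from the movies preview in this notebook.
toy_movies = pd.DataFrame({
    'movieId': [1, 2, 3],
    'title': ['Toy Story (1995)', 'Jumanji (1995)', 'Grumpier Old Men (1995)'],
    'genres': [
        'Adventure|Animation|Children|Comedy|Fantasy',
        'Adventure|Children|Fantasy',
        'Comedy|Romance',
    ],
})

# str.contains tests each genres string for the substring 'Animation'.
is_animation = toy_movies['genres'].str.contains('Animation')

animation_movies = toy_movies[is_animation]
print(animation_movies['title'].tolist())
```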




<h1 style="font-size:2em;color:#2467C0">Group By and Aggregate </h1>

In [15]:
# calculate the average ratings for each movie
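
A possible sketch of the group-by step: group the ratings by *movieId* and take the mean of *rating* within each group. The `toy_ratings` values are made up for illustration.

```python
import pandas as pd

# Made-up toy ratings; the values are illustrative only.
toy_ratings = pd.DataFrame({
    'userId': [1, 1, 2, 2, 3],
    'movieId': [2, 29, 2, 32, 29],
    'rating': [3.5, 4.0, 5.0, 0.5, 4.5],
})

# One mean per movieId; the result is a Series indexed by movieId.
average_rating = toy_ratings.groupby('movieId')['rating'].mean()
print(average_rating)
```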



<h1 style="font-size:2em;color:#2467C0">Merge Dataframes</h1>

<p style="font-family: Arial; font-size:1.75em;color:#2462C0; font-style:bold"><br>


Combine aggregation, merging, and filters to get useful analytics
</p>

In [16]:
'''
TODO:
1. add the average rating to movies
2. Show comedy movies where the average ratings are higher than 4.0
'''




'\nTODO:\n1. add the average rating to movies\n2. Show comedy movies where the average ratings are higher than 4.0\n'
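
One way the two steps could be combined is sketched below: aggregate the ratings per movie, merge the result into the movies table on *movieId*, then filter on genre and average rating. Titles and genres come from the movies preview in this notebook; the rating values and the column name `avg_rating` are assumptions made for illustration.

```python
import pandas as pd

# Titles and genres from the movies preview in this notebook.
toy_movies = pd.DataFrame({
    'movieId': [1, 2, 3, 5],
    'title': ['Toy Story (1995)', 'Jumanji (1995)',
              'Grumpier Old Men (1995)', 'Father of the Bride Part II (1995)'],
    'genres': ['Adventure|Animation|Children|Comedy|Fantasy',
               'Adventure|Children|Fantasy', 'Comedy|Romance', 'Comedy'],
})

# Made-up ratings; the values are illustrative only.
toy_ratings = pd.DataFrame({
    'movieId': [1, 1, 2, 3, 3, 5],
    'rating': [4.5, 5.0, 4.5, 3.0, 4.0, 4.5],
})

# Step 1: average rating per movie, kept as a regular column for merging.
avg = (toy_ratings.groupby('movieId', as_index=False)['rating'].mean()
       .rename(columns={'rating': 'avg_rating'}))
movies_with_avg = toy_movies.merge(avg, on='movieId', how='inner')

# Step 2: comedies whose average rating is above 4.0.
is_comedy = movies_with_avg['genres'].str.contains('Comedy')
is_high = movies_with_avg['avg_rating'] > 4.0
comedy_high = movies_with_avg[is_comedy & is_high]
print(comedy_high[['title', 'avg_rating']])
```

An inner merge drops movies that have no ratings at all; `how='left'` would keep them with a missing average instead.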

<h1 style="font-size:2em;color:#2467C0">Vectorized String Operations</h1>


In [17]:
# TODO: split movie genres into different columns
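
A hedged sketch of two vectorized options: `str.split` with `expand=True` spreads the pipe-separated genres across positional columns, while `str.get_dummies` builds one indicator column per genre. The rows come from the movies preview in this notebook; the variable names are hypothetical.

```python
import pandas as pd

# Three rows copied from the movies preview in this notebook.
toy_movies = pd.DataFrame({
    'title': ['Toy Story (1995)', 'Jumanji (1995)', 'Grumpier Old Men (1995)'],
    'genres': [
        'Adventure|Animation|Children|Comedy|Fantasy',
        'Adventure|Children|Fantasy',
        'Comedy|Romance',
    ],
})

# Option 1: positional columns 0, 1, 2, ...; shorter lists are padded with missing values.
genre_columns = toy_movies['genres'].str.split('|', expand=True)
print(genre_columns)

# Option 2: one indicator column per distinct genre.
genre_flags = toy_movies['genres'].str.get_dummies(sep='|')
print(genre_flags)
```

The indicator form is often easier to filter on, since a genre's column does not depend on its position in the string.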



In [18]:
# TODO: Extract years from title
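
A minimal sketch using `str.extract` with a regular expression, assuming the year appears as four digits in parentheses at the end of the title, as it does in the rows previewed in this notebook. Titles that do not follow this pattern would yield a missing value.

```python
import pandas as pd

# Titles copied from the movies preview in this notebook.
toy_movies = pd.DataFrame({
    'title': ['Toy Story (1995)', 'Jumanji (1995)', 'Heat (1995)'],
})

# Capture four digits inside parentheses at the end of the string.
# expand=False returns a Series rather than a one-column DataFrame.
toy_movies['year'] = toy_movies['title'].str.extract(r'\((\d{4})\)\s*$', expand=False)
print(toy_movies)
```

The extracted values are strings; `pd.to_numeric` can convert them when numeric comparisons are needed.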



<h1 style="font-size:2em;color:#2467C0">Parsing Timestamps</h1>

Timestamps are common in sensor data or other time series datasets.
Let us revisit the *tags.csv* dataset and read the timestamps!


In [19]:
tags = pd.read_csv('./movielens/tags.csv', sep=',')

In [20]:
tags.dtypes

userId        int64
movieId       int64
tag          object
timestamp     int64
dtype: object

<p style="font-family: Arial; font-size:1.35em;color:#2462C0; font-style:bold">

Unix time / POSIX time / epoch time records 
time in seconds <br> since midnight Coordinated Universal Time (UTC) of January 1, 1970
</p>

In [21]:
tags['parsed_time'] = pd.to_datetime(tags['timestamp'], unit='s')
tags.head(5)

Unnamed: 0,userId,movieId,tag,timestamp,parsed_time
0,18,4141,Mark Waters,1240597180,2009-04-24 18:19:40
1,65,208,dark hero,1368150078,2013-05-10 01:41:18
2,65,353,dark hero,1368150079,2013-05-10 01:41:19
3,65,521,noir thriller,1368149983,2013-05-10 01:39:43
4,65,592,dark hero,1368150078,2013-05-10 01:41:18


<p style="font-family: Arial; font-size:1.35em;color:#2462C0; font-style:bold">

Selecting rows based on timestamps
</p>

In [22]:
greater_than_t = tags['parsed_time'] > '2015-02-01'

selected_rows = tags[greater_than_t]

tags.shape, selected_rows.shape

((465564, 5), (12130, 5))

<p style="font-family: Arial; font-size:1.35em;color:#2462C0; font-style:bold">

Sorting the table using the timestamps
</p>

In [23]:
tags.sort_values(by='parsed_time', ascending=True)[:10]

Unnamed: 0,userId,movieId,tag,timestamp,parsed_time
333932,100371,2788,monty python,1135429210,2005-12-24 13:00:10
333927,100371,1732,coen brothers,1135429236,2005-12-24 13:00:36
333924,100371,1206,stanley kubrick,1135429248,2005-12-24 13:00:48
333923,100371,1193,jack nicholson,1135429371,2005-12-24 13:02:51
333939,100371,5004,peter sellers,1135429399,2005-12-24 13:03:19
333922,100371,47,morgan freeman,1135429412,2005-12-24 13:03:32
333921,100371,47,brad pitt,1135429412,2005-12-24 13:03:32
333936,100371,4011,brad pitt,1135429431,2005-12-24 13:03:51
333937,100371,4011,guy ritchie,1135429431,2005-12-24 13:03:51
333920,100371,32,bruce willis,1135429442,2005-12-24 13:04:02


<h1 style="font-size:2em;color:#2467C0">Average Movie Ratings over Time </h1>

## Are movie ratings related to the year of release?

In [24]:
# TODO: Explore the trend of average ratings over time
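
One possible approach, sketched on toy data: extract the release year from each title, compute the average rating per movie, then average those per year. The rating values and the title `Placeholder Movie (1996)` are invented for illustration, and the sketch assumes every title ends with a four-digit year in parentheses.

```python
import pandas as pd

# Two real titles from the preview plus one invented placeholder title.
toy_movies = pd.DataFrame({
    'movieId': [1, 2, 3],
    'title': ['Toy Story (1995)', 'Jumanji (1995)', 'Placeholder Movie (1996)'],
})

# Made-up ratings; the values are illustrative only.
toy_ratings = pd.DataFrame({
    'movieId': [1, 1, 2, 3, 3],
    'rating': [4.0, 5.0, 3.0, 2.0, 3.0],
})

# Release year parsed from the title and stored as an integer.
year = toy_movies['title'].str.extract(r'\((\d{4})\)\s*$', expand=False).astype(int)
movies_with_year = toy_movies.assign(year=year)

# Average rating per movie, attached to the movie's release year.
per_movie = toy_ratings.groupby('movieId', as_index=False)['rating'].mean()
merged = movies_with_year.merge(per_movie, on='movieId')

# Average of the per-movie averages for each release year.
yearly_average = merged.groupby('year')['rating'].mean()
print(yearly_average)
```

Averaging per movie first gives every movie equal weight within its year; grouping the raw ratings by year instead would weight heavily rated movies more. With matplotlib installed, `yearly_average.plot()` would draw the trend.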

