# IMDB Data Analysis

Shenyue Jia

[jiashenyue.info](https://jiashenyue.info)

In [1]:
# data wrangling
import numpy as np
import pandas as pd

# database
import pymysql
from sqlalchemy import create_engine

## Download and prepare data

In [2]:
basics_url="https://datasets.imdbws.com/title.basics.tsv.gz"
basics = pd.read_csv(basics_url, sep='\t', low_memory=False)

In [3]:
basics.head()

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres
0,tt0000001,short,Carmencita,Carmencita,0,1894,\N,1,"Documentary,Short"
1,tt0000002,short,Le clown et ses chiens,Le clown et ses chiens,0,1892,\N,5,"Animation,Short"
2,tt0000003,short,Pauvre Pierrot,Pauvre Pierrot,0,1892,\N,4,"Animation,Comedy,Romance"
3,tt0000004,short,Un bon bock,Un bon bock,0,1892,\N,12,"Animation,Short"
4,tt0000005,short,Blacksmith Scene,Blacksmith Scene,0,1893,\N,1,"Comedy,Short"


### Clean data

- Replace the `\N` placeholder that the IMDb files use for missing values with `NaN`

In [4]:
df_basics = basics.copy().replace({'\\N':np.nan})
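
As an alternative, the placeholder can be converted while the file is being read. The sketch below uses a small in-memory sample (rows made up for illustration) in place of the full download, and relies on the standard `na_values` parameter of `pd.read_csv`.

```python
import io
import pandas as pd

# Tiny in-memory stand-in for title.basics.tsv.gz (illustrative rows only)
sample_tsv = (
    "tconst\ttitleType\tstartYear\tendYear\truntimeMinutes\n"
    "tt0000001\tshort\t1894\t\\N\t1\n"
    "tt0000002\tshort\t1892\t\\N\t\\N\n"
)

# Treat the literal string \N as missing while parsing
sample = pd.read_csv(io.StringIO(sample_tsv), sep="\t", na_values="\\N")
print(sample.isna().sum())
```

Parsing this way also lets pandas infer numeric dtypes for columns that contain no other text, which the `replace` approach does not do on its own.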

### Filter data

- Drop rows where `runtimeMinutes` is null

In [5]:
df_basics = df_basics[pd.notnull(df_basics['runtimeMinutes'])]

- Drop rows where `genres` is null

In [6]:
df_basics = df_basics[pd.notnull(df_basics['genres'])]
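
The two null filters can also be written as a single `dropna` call. A minimal sketch on a made-up frame (`demo` and its rows are illustrative; only the column names come from the IMDb table):

```python
import numpy as np
import pandas as pd

# Illustrative frame; column names follow the IMDb basics table
demo = pd.DataFrame({
    "tconst": ["tt1", "tt2", "tt3"],
    "runtimeMinutes": ["90", np.nan, "45"],
    "genres": ["Drama", "Comedy", np.nan],
})

# Drop rows that are missing either column in one step
demo_clean = demo.dropna(subset=["runtimeMinutes", "genres"])
print(demo_clean)
```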

- Keep only rows where `titleType` is `movie`

In [7]:
df_basics = df_basics.query("titleType == 'movie'")

In [8]:
df_basics.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 377587 entries, 8 to 9634718
Data columns (total 9 columns):
 #   Column          Non-Null Count   Dtype 
---  ------          --------------   ----- 
 0   tconst          377587 non-null  object
 1   titleType       377587 non-null  object
 2   primaryTitle    377587 non-null  object
 3   originalTitle   377587 non-null  object
 4   isAdult         377587 non-null  object
 5   startYear       371208 non-null  object
 6   endYear         0 non-null       object
 7   runtimeMinutes  377587 non-null  object
 8   genres          377587 non-null  object
dtypes: object(9)
memory usage: 28.8+ MB


- Keep only movies with a `startYear` from 2000 through 2021 (inclusive)

In [22]:
years = list(range(2000,2022,1))
years_str = list(map(str, years))
years_str

['2000',
 '2001',
 '2002',
 '2003',
 '2004',
 '2005',
 '2006',
 '2007',
 '2008',
 '2009',
 '2010',
 '2011',
 '2012',
 '2013',
 '2014',
 '2015',
 '2016',
 '2017',
 '2018',
 '2019',
 '2020',
 '2021']

In [26]:
df_basics = df_basics[df_basics['startYear'].isin(years_str)]
df_basics.head()

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres
13082,tt0013274,movie,Istoriya grazhdanskoy voyny,Istoriya grazhdanskoy voyny,0,2021,,133,Documentary
34803,tt0035423,movie,Kate & Leopold,Kate & Leopold,0,2001,,118,"Comedy,Fantasy,Romance"
61116,tt0062336,movie,The Tango of the Widower and Its Distorting Mi...,El tango del viudo y su espejo deformante,0,2020,,70,Drama
67669,tt0069049,movie,The Other Side of the Wind,The Other Side of the Wind,0,2018,,122,Drama
77964,tt0079644,movie,November 1828,November 1828,0,2001,,140,"Drama,War"
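
Because `startYear` is stored as text, the filter above matches against a list of year strings. An alternative, sketched here on a made-up frame, is to convert the column to numbers and use a range check; `demo` and its rows are illustrative.

```python
import numpy as np
import pandas as pd

# Illustrative frame with startYear stored as strings, as in the source data
demo = pd.DataFrame({
    "primaryTitle": ["A", "B", "C", "D"],
    "startYear": ["1999", "2000", "2021", np.nan],
})

# Convert to numbers; missing or unparseable values become NaN
start_year = pd.to_numeric(demo["startYear"], errors="coerce")

# between() includes both endpoints by default; NaN never matches
demo_recent = demo[start_year.between(2000, 2021)]
print(demo_recent)
```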


- Eliminate movies that include "Documentary" in `genres`
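
The notebook does not show a cell for this step. One possible approach, sketched on a made-up frame, is a substring match on the comma-separated `genres` string; `demo` and the variable names are assumptions for illustration.

```python
import pandas as pd

# Illustrative frame; genres are comma-separated strings as in the source data
demo = pd.DataFrame({
    "primaryTitle": ["A", "B", "C"],
    "genres": ["Documentary", "Comedy,Fantasy,Romance", "Documentary,Drama"],
})

# True where the genres string mentions Documentary
is_documentary = demo["genres"].str.contains("Documentary", case=False)

# Keep every row that is not a documentary
demo_no_docs = demo[~is_documentary]
print(demo_no_docs)
```

Applied to the notebook's data, the same mask would be built from `df_basics['genres']`. Rows with null `genres` were already dropped earlier, so the mask should contain no missing values.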

## Configure MySQL

In [2]:
pymysql.install_as_MySQLdb()

In [None]:
# Create connection string using credentials following this format
# connection = "dialect+driver://username:password@host:port/database"
username = "root"
password = "root" # (or whatever password you chose during mysql installation)
db_name = "world"
connection = f"mysql+pymysql://{username}:{password}@localhost/{db_name}"
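
A password containing characters such as `@` or `/` needs to be URL-encoded before it goes into the connection string, otherwise the URL is split in the wrong place. A minimal sketch using only the standard library; the credentials below are placeholders, not real ones.

```python
from urllib.parse import quote_plus

# Placeholder credentials for illustration only
username = "root"
password = "p@ss/word"
db_name = "world"

# quote_plus escapes characters that would otherwise break the URL
connection = f"mysql+pymysql://{username}:{quote_plus(password)}@localhost/{db_name}"
print(connection)
```

The resulting string can then be passed to `create_engine`, which the notebook imports at the top.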