# 1  Movies - Part 1
According to the data dictionary, null values are encoded as `\N`.

You will want to find those and replace them with `np.nan`.
However, the backslash (`\`) is Python's escape character: it changes the meaning of the character that comes next.
So if we were to write `df.replace({'\N':np.nan})`, Python would not read `'\N'` as a backslash followed by an N. In Python 3, `\N` starts a named Unicode escape, so this string raises a syntax error.
To fix this, add a second backslash, which tells Python that you actually want a literal `\`.
`df.replace({'\\N':np.nan})`
Don't forget to make these replacements permanent by assigning the result back to the dataframe!
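
As a minimal sketch of the escaping rule above, the cell below uses a small made-up dataframe (`demo` and its values are illustrative assumptions, not IMDb data) to show that `'\\N'` is a two-character string and that the replacement only sticks once the result is assigned back.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for an IMDb table; the values are made up.
demo = pd.DataFrame({"endYear": ["\\N", "1999"], "genres": ["Drama", "\\N"]})

# In source code, "\\N" is two characters: a backslash followed by N.
escaped_length = len("\\N")

# replace() returns a new dataframe, so assign it back to keep the change.
demo = demo.replace({"\\N": np.nan})
n_missing = int(demo.isna().sum().sum())
print(escaped_length, n_missing)
```

A raw string such as `r'\N'` should express the same two characters, if you prefer that spelling.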

## Imports

In [1]:
import pandas as pd
import numpy as np

## Data

In [2]:
akas_url = 'https://datasets.imdbws.com/title.akas.tsv.gz'
basics_url = 'https://datasets.imdbws.com/title.basics.tsv.gz'
ratings_url = 'https://datasets.imdbws.com/title.ratings.tsv.gz'

In [3]:
basics = pd.read_csv(basics_url, sep='\t', low_memory=False)
basics.head(2)

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres
0,tt0000001,short,Carmencita,Carmencita,0,1894,\N,1,"Documentary,Short"
1,tt0000002,short,Le clown et ses chiens,Le clown et ses chiens,0,1892,\N,5,"Animation,Short"


In [4]:
ratings = pd.read_csv(ratings_url, sep='\t', low_memory=False)
ratings.head(2)

Unnamed: 0,tconst,averageRating,numVotes
0,tt0000001,5.7,1876
1,tt0000002,5.9,248


In [5]:
akas = pd.read_csv(akas_url, sep='\t', low_memory=False)
akas.head(2)

Unnamed: 0,titleId,ordering,title,region,language,types,attributes,isOriginalTitle
0,tt0000001,1,Карменсіта,UA,\N,imdbDisplay,\N,0
1,tt0000001,2,Carmencita,DE,\N,\N,literal title,0


## Preprocessing Steps
- Change null value encoding from \N to np.nan
- Eliminate movies that are null for runtimeMinutes, genres, or startYear
- Include only movies released between 2000 and 2021 (inclusive)
- Include only full-length movies (titleType = movie)
- Include only fictional movies (exclude the documentary genre)

In [6]:
basics = basics.replace({'\\N':np.nan})
ratings = ratings.replace({'\\N':np.nan}) 
akas = akas.replace({'\\N':np.nan}) 
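
An alternative to replacing after loading, sketched here under the assumption that you are free to change the `read_csv` calls: the `na_values` parameter lets pandas treat `\N` as missing while it parses. The TSV text below is made up for illustration and is not real IMDb data.

```python
import io
import pandas as pd

# Tiny made-up TSV in the same layout as the IMDb files.
sample_tsv = "tconst\tstartYear\tendYear\nttA\t1894\t\\N\nttB\t\\N\t\\N\n"

# na_values marks the literal text \N as missing during parsing.
sample = pd.read_csv(io.StringIO(sample_tsv), sep="\t", na_values="\\N")
missing_per_column = sample.isna().sum()
print(missing_per_column)
```

One difference to keep in mind: with this approach numeric columns that contain missing values are parsed as floats rather than kept as strings, so later type conversions may need adjusting.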

In [7]:
basics = basics.dropna(subset = ['runtimeMinutes', 'genres', 'startYear'])

In [8]:
basics['startYear'] = basics['startYear'].astype(int)
basics = basics.loc[(basics['startYear'] >= 2000) & (basics['startYear'] <= 2021)]
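
The same inclusive year range can also be written with `Series.between`, which includes both endpoints by default. This is only an equivalent spelling, shown on a made-up frame (`years` is a hypothetical name), not a change to the cell above.

```python
import pandas as pd

# Made-up titles with years on either side of the 2000-2021 window.
years = pd.DataFrame({
    "tconst": ["ttA", "ttB", "ttC", "ttD"],
    "startYear": [1999, 2000, 2021, 2022],
})

# between() is inclusive on both ends by default.
in_range = years["startYear"].between(2000, 2021)
kept = years.loc[in_range, "tconst"].tolist()
print(kept)
```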

In [9]:
basics = basics.loc[basics['titleType'] == 'movie']

In [10]:
is_documentary = basics['genres'].str.contains('documentary',case=False)
basics = basics[~is_documentary]
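
The documentary mask above relies on null genres having been dropped earlier, since `str.contains` returns a missing value for a missing genre. If the steps were ever reordered, passing `na=False` is one defensive option. The sketch below uses a made-up series to show the behaviour.

```python
import numpy as np
import pandas as pd

# Made-up genre strings, including one missing value.
genres = pd.Series(["Documentary,Short", "Drama", np.nan, "Comedy,documentary"])

# na=False treats a missing genre as "not a documentary" so the mask stays boolean.
is_doc = genres.str.contains("documentary", case=False, na=False)
kept_genres = genres[~is_doc].tolist()
print(is_doc.tolist())
```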

- Filter the akas dataframe to include only movies released in the US
- Filter the basics and ratings dataframes to keep only the titles that remain

In [11]:
akas = akas.loc[akas['titleId'].isin(basics['tconst'])]

In [12]:
usAKAs = akas.loc[akas['region'] == 'US']

In [13]:
basics = basics.loc[basics['tconst'].isin(usAKAs['titleId'])]

In [14]:
ratings = ratings.loc[ratings['tconst'].isin(basics['tconst'])]

In [15]:
basics.shape

(79212, 9)

In [16]:
ratings.shape

(65618, 3)
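
The ratings dataframe ends up with fewer rows than basics because `isin` only keeps ratings for titles still in basics, and not every remaining title has a rating. The sketch below replays the same chain of filters on made-up data (all `toy_` names and values are illustrative assumptions).

```python
import pandas as pd

toy_basics = pd.DataFrame({"tconst": ["ttA", "ttB", "ttC"]})
toy_akas = pd.DataFrame({
    "titleId": ["ttA", "ttA", "ttB", "ttC"],
    "region": ["US", "DE", "FR", "US"],
})
# ttC has no rating at all.
toy_ratings = pd.DataFrame({"tconst": ["ttA", "ttB"], "averageRating": [5.7, 5.9]})

# Keep US releases, then keep only titles that have one.
toy_us = toy_akas.loc[toy_akas["region"] == "US"]
toy_basics = toy_basics.loc[toy_basics["tconst"].isin(toy_us["titleId"])]

# Keep only ratings for titles that survived the basics filter.
toy_ratings = toy_ratings.loc[toy_ratings["tconst"].isin(toy_basics["tconst"])]
print(toy_basics.shape, toy_ratings.shape)
```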

## Create MovieProject folder

In [18]:
import os
os.makedirs('Data/MovieProject/',exist_ok=True) 
# Confirm folder created
os.listdir("Data/MovieProject/")

[]

## Save dataframes to compressed .csv.gz files

In [19]:
## Save dataframes to Compressed .csv.gz files
basics.to_csv("Data/MovieProject/basics.csv.gz",compression='gzip',index=False)
ratings.to_csv("Data/MovieProject/ratings.csv.gz",compression='gzip',index=False)
usAKAs.to_csv("Data/MovieProject/akas.csv.gz",compression='gzip',index=False)
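
To see what a gzip round trip does to a dataframe without touching the project folder, here is a small sketch that writes a made-up frame to a temporary directory and reads it back. The file name `toy.csv.gz` and the frame contents are assumptions for illustration.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Made-up rows: runtimeMinutes is stored as strings and endYear is all missing.
toy = pd.DataFrame({
    "tconst": ["ttA", "ttB"],
    "endYear": [np.nan, np.nan],
    "runtimeMinutes": ["118", "70"],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy.csv.gz")
    toy.to_csv(path, compression="gzip", index=False)
    # read_csv infers gzip compression from the .gz extension.
    reloaded = pd.read_csv(path)

same_shape = reloaded.shape == toy.shape
print(reloaded.dtypes)
```

Missing values are written as empty fields and come back as NaN, while column types are inferred again on reload, so a column saved as numeric strings may return as integers.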

## Open saved file and preview

In [21]:
# Open saved file and preview
basics = pd.read_csv("Data/MovieProject/basics.csv.gz", low_memory = False)
basics.head()

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres
0,tt0035423,movie,Kate & Leopold,Kate & Leopold,0,2001,,118,"Comedy,Fantasy,Romance"
1,tt0062336,movie,The Tango of the Widower and Its Distorting Mi...,El Tango del Viudo y Su Espejo Deformante,0,2020,,70,Drama
2,tt0069049,movie,The Other Side of the Wind,The Other Side of the Wind,0,2018,,122,Drama
3,tt0088751,movie,The Naked Monster,The Naked Monster,0,2005,,100,"Comedy,Horror,Sci-Fi"
4,tt0093119,movie,Grizzly II: Revenge,Grizzly II: The Predator,0,2020,,74,"Horror,Music,Thriller"
