<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Successful-Movies" data-toc-modified-id="Successful-Movies-1">Successful Movies</a></span><ul class="toc-item"><li><span><a href="#Project-3-Part-1-(Core)" data-toc-modified-id="Project-3-Part-1-(Core)-1.1">Project 3 Part 1 (Core)</a></span><ul class="toc-item"><li><span><a href="#What-Makes-a-Movie-Successful?" data-toc-modified-id="What-Makes-a-Movie-Successful?-1.1.1"><strong><em>What Makes a Movie Successful?</em></strong></a></span></li><li><span><a href="#Ingrid-Arbieto-Nelson" data-toc-modified-id="Ingrid-Arbieto-Nelson-1.1.2">Ingrid Arbieto Nelson</a></span></li><li><span><a href="#Business-Problem" data-toc-modified-id="Business-Problem-1.1.3">Business Problem</a></span></li><li><span><a href="#For-Part-1-of-the-project,-you-will-be" data-toc-modified-id="For-Part-1-of-the-project,-you-will-be-1.1.4">For Part 1 of the project, you will be</a></span></li><li><span><a href="#The-Data" data-toc-modified-id="The-Data-1.1.5">The Data</a></span></li><li><span><a href="#Specifications" data-toc-modified-id="Specifications-1.1.6">Specifications</a></span></li><li><span><a href="#Deliverable" data-toc-modified-id="Deliverable-1.1.7">Deliverable</a></span></li><li><span><a href="#IMDb-Dataset-Details" data-toc-modified-id="IMDb-Dataset-Details-1.1.8">IMDb Dataset Details</a></span></li></ul></li><li><span><a href="#PreProcessing" data-toc-modified-id="PreProcessing-1.2">PreProcessing</a></span><ul class="toc-item"><li><span><a href="#Import-Libraries" data-toc-modified-id="Import-Libraries-1.2.1">Import Libraries</a></span></li><li><span><a href="#Load-Data" data-toc-modified-id="Load-Data-1.2.2">Load Data</a></span></li><li><span><a href="#Filtering-Movie-Data" data-toc-modified-id="Filtering-Movie-Data-1.2.3">Filtering Movie Data</a></span></li><li><span><a href="#Final-Check-File-Info" data-toc-modified-id="Final-Check-File-Info-1.2.4">Final Check File Info</a></span></li><li><span><a 
href="#Save-Compressed-Files" data-toc-modified-id="Save-Compressed-Files-1.2.5">Save Compressed Files</a></span></li></ul></li></ul></li></ul></div>

# Successful Movies
---

## Project 3 Part 1 (Core)

* ### ***What Makes a Movie Successful?***

* ### Ingrid Arbieto Nelson

### Business Problem
For this project, you have been hired to produce a MySQL database on Movies from a subset of IMDB's publicly available dataset. Ultimately, you will use this database to analyze what makes a movie successful and will provide recommendations to the stakeholder on how to make a successful movie.

Over the course of this project, you will:

* Part 1: Download several files from IMDB’s movie data set and filter out the subset of movies requested by the stakeholder.
* Part 2: Use an API to extract box office revenue and profit data to add to your IMDB data and perform exploratory data analysis.
* Part 3: Construct and export a MySQL database using your data.
* Part 4: Apply hypothesis testing to explore what makes a movie successful.
* Part 5 (Optional): Produce a Linear Regression model to predict movie performance.

<img src ='Data/theater.png'>

### For Part 1 of the project, you will be 
* 1 Creating your project repository
* 2 Loading the official IMDB data for the requested tables
* 3 Filtering out unnecessary data
* 4 Saving the filtered tables as gzip-compressed csv files (".csv.gz") in your repository.


### The Data
* IMDB provides several files with varied information for movies, TV shows, made-for-TV movies, etc.

   *  Overview/Data Dictionary: https://www.imdb.com/interfaces/
   *  Downloads page: https://datasets.imdbws.com/
*  **Based on their previous research, the stakeholder wants to focus on the following files**:

   *  title.basics.tsv.gz
   *  title.ratings.tsv.gz
   *  title.akas.tsv.gz

### Specifications
Your stakeholder only wants you to include information for movies based on the following specifications:

*  Exclude any movie with missing values for genre or runtime
*  Include only full-length movies (titleType = "movie").
*  Include only fictional movies (not from documentary genre)
*  Include only movies that were released 2000 - 2022 (include 2000 and 2022)
*  Include only movies that were released in the United States

### Deliverable
After filtering out movies that do not meet the stakeholder's specifications:

*  Before saving, run a final .info() for each of the dataframes to show a summary of how many movies remain and the datatypes of each feature.
*  Save each dataframe as a compressed csv file in the "Data/" folder inside your repository.
*  Commit your changes to your repository in GitHub desktop and Publish repository / Push Changes.
*  Submit the link to your repository.



### IMDb Dataset Details

#### **title.basics.tsv.gz** 
Contains the following information for titles:
*  tconst (string) - alphanumeric unique identifier of the title
*  titleType (string) – the type/format of the title (e.g. movie, short, tvseries, tvepisode, video, etc)
*  primaryTitle (string) – the more popular title / the title used by the filmmakers on promotional materials at the point of release
*  originalTitle (string) - original title, in the original language
*  isAdult (boolean) - 0: non-adult title; 1: adult title
*  startYear (YYYY) – represents the release year of a title. In the case of TV Series, it is the series start year
*  endYear (YYYY) – TV Series end year. ‘\N’ for all other title types
*  runtimeMinutes – primary runtime of the title, in minutes
*  genres (string array) – includes up to three genres associated with the title

**title.ratings.tsv.gz**
Contains the IMDb rating and votes information for titles:
*  tconst (string) - alphanumeric unique identifier of the title
*  averageRating – weighted average of all the individual user ratings
*  numVotes - number of votes the title has received

**title.akas.tsv.gz**
Contains the following information for titles:

*  titleId (string) - a tconst, an alphanumeric unique identifier of the title
*  ordering (integer) – a number to uniquely identify rows for a given titleId
*  title (string) – the localized title
*  region (string) - the region for this version of the title
*  language (string) - the language of the title
*  types (array) - Enumerated set of attributes for this alternative title. One or more of the following: "alternative", "dvd", "festival", "tv", "video", "working", "original", "imdbDisplay". New values may be added in the future without warning
*  attributes (array) - Additional terms to describe this alternative title, not enumerated
*  isOriginalTitle (boolean) – 0: not original title; 1: original title

## PreProcessing

### Import Libraries
---

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

### Load Data
---

In [2]:
basics_url="https://datasets.imdbws.com/title.basics.tsv.gz"
basics = pd.read_csv(basics_url, sep='\t', low_memory=False)
basics.head()

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres
0,tt0000001,short,Carmencita,Carmencita,0,1894,\N,1,"Documentary,Short"
1,tt0000002,short,Le clown et ses chiens,Le clown et ses chiens,0,1892,\N,5,"Animation,Short"
2,tt0000003,short,Pauvre Pierrot,Pauvre Pierrot,0,1892,\N,4,"Animation,Comedy,Romance"
3,tt0000004,short,Un bon bock,Un bon bock,0,1892,\N,12,"Animation,Short"
4,tt0000005,short,Blacksmith Scene,Blacksmith Scene,0,1893,\N,1,"Comedy,Short"


In [3]:
ratings_url="https://datasets.imdbws.com/title.ratings.tsv.gz"
ratings = pd.read_csv(ratings_url, sep='\t', low_memory=False)
ratings.head()

Unnamed: 0,tconst,averageRating,numVotes
0,tt0000001,5.7,1966
1,tt0000002,5.8,263
2,tt0000003,6.5,1808
3,tt0000004,5.6,178
4,tt0000005,6.2,2607


In [4]:
akas_url="https://datasets.imdbws.com/title.akas.tsv.gz"
akas = pd.read_csv(akas_url, sep='\t', low_memory=False)
akas.head()

Unnamed: 0,titleId,ordering,title,region,language,types,attributes,isOriginalTitle
0,tt0000001,1,Карменсіта,UA,\N,imdbDisplay,\N,0
1,tt0000001,2,Carmencita,DE,\N,\N,literal title,0
2,tt0000001,3,Carmencita - spanyol tánc,HU,\N,imdbDisplay,\N,0
3,tt0000001,4,Καρμενσίτα,GR,\N,imdbDisplay,\N,0
4,tt0000001,5,Карменсита,RU,\N,imdbDisplay,\N,0
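
The akas table is by far the largest of the three files, so loading it whole can strain memory. As a possible alternative (not what this notebook does), the file could be read in chunks and filtered to US rows as it streams in. The sketch below uses a small in-memory TSV with invented rows in place of the real download; `sample_tsv` and the chunk size are assumptions for illustration.

```python
import io

import pandas as pd

# Tiny in-memory stand-in for title.akas.tsv.gz; rows are illustrative only.
sample_tsv = (
    "titleId\tordering\ttitle\tregion\n"
    "tt0000001\t1\tCarmencita\tDE\n"
    "tt0000001\t2\tCarmencita\tUS\n"
    "tt0000002\t1\tLe clown et ses chiens\tFR\n"
    "tt0000002\t2\tThe Clown and His Dogs\tUS\n"
    "tt0000005\t1\tBlacksmith Scene\tUS\n"
    "tt0000005\t2\tBlacksmithing Scene\tUS\n"
)

us_chunks = []
# chunksize makes read_csv return an iterator of DataFrames instead of one frame.
for chunk in pd.read_csv(io.StringIO(sample_tsv), sep="\t", chunksize=3):
    # Keep only the US rows from each chunk before moving on to the next one.
    us_chunks.append(chunk[chunk["region"] == "US"])

akas_us = pd.concat(us_chunks, ignore_index=True)
print(akas_us)
```

With the real file, the `io.StringIO(...)` argument would be replaced by the download URL and the chunk size raised to something far larger.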


### Filtering Movie Data
---

#### Title Basics :

In [5]:
# IMDb encodes missing values as '\N'; replace them with NaN
basics = basics.replace({'\\N':np.nan})
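
As an alternative sketch, the `\N` marker can be declared at load time through the `na_values` parameter of `pd.read_csv`, which lets pandas infer numeric dtypes right away. The in-memory sample below is invented for illustration and is not the real file.

```python
import io

import pandas as pd

# Two invented rows using IMDb's '\N' marker for missing values.
sample_tsv = (
    "tconst\tstartYear\tendYear\truntimeMinutes\n"
    "tt0000001\t1894\t\\N\t1\n"
    "tt0000002\t\\N\t\\N\t5\n"
)

# na_values tells read_csv to treat the literal string '\N' as NaN while parsing.
sample = pd.read_csv(io.StringIO(sample_tsv), sep="\t", na_values="\\N")

print(sample.dtypes)
print(sample.isna().sum())
```

Here `startYear` and `endYear` come back as `float64` and `runtimeMinutes` as `int64`, so no separate `replace` or numeric conversion is needed for these columns.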

In [6]:
# Display the count of missing values by column
print(basics.isna().sum())

tconst                  0
titleType               0
primaryTitle           11
originalTitle          11
isAdult                 1
startYear         1327682
endYear           9700427
runtimeMinutes    6917265
genres             442042
dtype: int64


In [7]:
# drop movies with null values for runtime minutes
basics = basics.dropna(axis=0, subset=['runtimeMinutes'])

In [8]:
# drop movies with null values for genre
basics = basics.dropna(axis=0, subset=['genres'])

In [9]:
# Check missing values after dropping
print(basics.isna().sum())

tconst                  0
titleType               0
primaryTitle            1
originalTitle           1
isAdult                 0
startYear          164736
endYear           2762837
runtimeMinutes          0
genres                  0
dtype: int64


In [10]:
# see title types
basics['titleType'].value_counts()

tvEpisode       1425631
short            599275
movie            381476
video            180144
tvMovie           91428
tvSeries          90217
tvSpecial         18051
tvMiniSeries      17123
tvShort            8790
videoGame           322
Name: titleType, dtype: int64

In [11]:
# create filter for only movie titles
only_movies = basics['titleType'] == 'movie'

# filter for df with only movies
basics = basics.loc[only_movies,:]

In [12]:
# see title types
basics['titleType'].value_counts()

movie    381476
Name: titleType, dtype: int64

In [13]:
basics.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 381476 entries, 8 to 9806247
Data columns (total 9 columns):
 #   Column          Non-Null Count   Dtype 
---  ------          --------------   ----- 
 0   tconst          381476 non-null  object
 1   titleType       381476 non-null  object
 2   primaryTitle    381476 non-null  object
 3   originalTitle   381476 non-null  object
 4   isAdult         381476 non-null  object
 5   startYear       375057 non-null  object
 6   endYear         0 non-null       object
 7   runtimeMinutes  381476 non-null  object
 8   genres          381476 non-null  object
dtypes: object(9)
memory usage: 29.1+ MB


In [14]:
# convert startYear to numeric; values that cannot be parsed become NaN
basics['startYear'] = pd.to_numeric(basics['startYear'], errors='coerce')

In [15]:
# create filter for movie years between 2000 to 2022
movie_years = (basics['startYear'] >= 2000) & (basics['startYear']<=2022)

# filter for df with movies between 2000 to 2022
basics = basics.loc[movie_years,:]
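
The year filter depends on `startYear` being numeric. A minimal sketch of that behavior, assuming `errors="coerce"` is used so unparseable values become NaN and drop out of the range check (the sample values, including `"unknown"`, are invented):

```python
import numpy as np
import pandas as pd

# Invented sample: string years, a missing value, and one unparseable entry.
years = pd.Series(["1999", "2000", "2022", "2023", np.nan, "unknown"])

# errors="coerce" turns anything that is not a number into NaN.
numeric_years = pd.to_numeric(years, errors="coerce")

# between() is inclusive on both ends by default; NaN compares as False.
in_range = numeric_years.between(2000, 2022)
print(in_range.tolist())  # [False, True, True, False, False, False]
```

`Series.between(2000, 2022)` is equivalent to the two-sided comparison joined with `&`, so either form satisfies the "include 2000 and 2022" requirement.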

In [16]:
# Exclude movies that are included in the documentary category.
is_documentary = basics['genres'].str.contains('documentary',case=False)
basics = basics[~is_documentary]
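
The documentary filter works here because rows with missing genres were already dropped. If that step were skipped, `str.contains` would return NaN for those rows and the `~` negation would fail. A small sketch with invented values shows how the `na=False` argument guards against that:

```python
import numpy as np
import pandas as pd

# Invented genre strings, including one missing value.
genres = pd.Series(["Drama", "Documentary,History", "Comedy,Romance", np.nan])

# na=False makes missing genres count as "not a documentary",
# so the mask is purely boolean and safe to negate with ~.
is_documentary = genres.str.contains("documentary", case=False, na=False)
kept = genres[~is_documentary]

print(is_documentary.tolist())  # [False, True, False, False]
print(len(kept))  # 3
```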

In [17]:
# Check missing values after dropping
print(basics.isna().sum())

tconst                 0
titleType              0
primaryTitle           0
originalTitle          0
isAdult                0
startYear              0
endYear           147348
runtimeMinutes         0
genres                 0
dtype: int64


In [18]:
basics.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 147348 entries, 34803 to 9806147
Data columns (total 9 columns):
 #   Column          Non-Null Count   Dtype  
---  ------          --------------   -----  
 0   tconst          147348 non-null  object 
 1   titleType       147348 non-null  object 
 2   primaryTitle    147348 non-null  object 
 3   originalTitle   147348 non-null  object 
 4   isAdult         147348 non-null  object 
 5   startYear       147348 non-null  float64
 6   endYear         0 non-null       object 
 7   runtimeMinutes  147348 non-null  object 
 8   genres          147348 non-null  object 
dtypes: float64(1), object(8)
memory usage: 11.2+ MB


In [19]:
# drop endYear; it only applies to TV series, so it is empty for movies
basics = basics.drop(columns='endYear')

#### Title akas :

* Filter *aka (also known as)* titles for only US movies

In [20]:
# IMDb encodes missing values as '\N'; replace them with NaN
akas = akas.replace({'\\N':np.nan})

In [21]:
akas.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 35723981 entries, 0 to 35723980
Data columns (total 8 columns):
 #   Column           Dtype 
---  ------           ----- 
 0   titleId          object
 1   ordering         int64 
 2   title            object
 3   region           object
 4   language         object
 5   types            object
 6   attributes       object
 7   isOriginalTitle  object
dtypes: int64(1), object(7)
memory usage: 2.1+ GB


In [22]:
# display max rows
pd.set_option('display.max_rows', None)

In [23]:
# find US value
akas['region'].value_counts().sort_values(ascending=False).head(20)

DE     4277454
FR     4273124
JP     4272002
IN     4212815
ES     4193455
IT     4173999
PT     4104892
US     1432658
GB      447322
CA      225666
XWW     172384
AU      133888
BR      117477
RU       95365
MX       93941
GR       92210
PL       88006
FI       86949
SE       76434
HU       74358
Name: region, dtype: int64

In [24]:
pd.reset_option('display.max_rows')

In [25]:
# create filter for US movies
US_filter = akas['region'] == 'US'

# filter df for US movies
akas = akas.loc[US_filter,:]

In [26]:
akas.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1432658 entries, 5 to 35723725
Data columns (total 8 columns):
 #   Column           Non-Null Count    Dtype 
---  ------           --------------    ----- 
 0   titleId          1432658 non-null  object
 1   ordering         1432658 non-null  int64 
 2   title            1432658 non-null  object
 3   region           1432658 non-null  object
 4   language         3894 non-null     object
 5   types            978158 non-null   object
 6   attributes       46456 non-null    object
 7   isOriginalTitle  1431313 non-null  object
dtypes: int64(1), object(7)
memory usage: 98.4+ MB


#### Title Basics :  *(cont.)*

* keep only US movie titles

In [27]:
# Filter the basics table down to US titles only by using the filtered akas dataframe
keep_basics =basics['tconst'].isin(akas['titleId'])
keep_basics.head(20)

34803      True
61116      True
67669      True
77964     False
86801      True
93938      True
98043      True
100076    False
101042     True
106106     True
107990    False
108911    False
110365    False
110477     True
110540     True
111851     True
112119    False
113046    False
113286     True
113725     True
Name: tconst, dtype: bool

In [28]:
basics = basics[keep_basics]
basics

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,runtimeMinutes,genres
34803,tt0035423,movie,Kate & Leopold,Kate & Leopold,0,2001.0,118,"Comedy,Fantasy,Romance"
61116,tt0062336,movie,The Tango of the Widower and Its Distorting Mi...,El tango del viudo y su espejo deformante,0,2020.0,70,Drama
67669,tt0069049,movie,The Other Side of the Wind,The Other Side of the Wind,0,2018.0,122,Drama
86801,tt0088751,movie,The Naked Monster,The Naked Monster,0,2005.0,100,"Comedy,Horror,Sci-Fi"
93938,tt0096056,movie,Crime and Punishment,Crime and Punishment,0,2002.0,126,Drama
...,...,...,...,...,...,...,...,...
9805435,tt9914942,movie,Life Without Sara Amat,La vida sense la Sara Amat,0,2019.0,74,Drama
9805830,tt9915872,movie,The Last White Witch,My Girlfriend is a Wizard,0,2019.0,97,"Comedy,Drama,Fantasy"
9805970,tt9916170,movie,The Rehearsal,O Ensaio,0,2019.0,51,Drama
9805979,tt9916190,movie,Safeguard,Safeguard,0,2020.0,95,"Action,Adventure,Thriller"


In [29]:
basics.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 86552 entries, 34803 to 9806063
Data columns (total 8 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   tconst          86552 non-null  object 
 1   titleType       86552 non-null  object 
 2   primaryTitle    86552 non-null  object 
 3   originalTitle   86552 non-null  object 
 4   isAdult         86552 non-null  object 
 5   startYear       86552 non-null  float64
 6   runtimeMinutes  86552 non-null  object 
 7   genres          86552 non-null  object 
dtypes: float64(1), object(7)
memory usage: 5.9+ MB
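
The `isin` step above can be tried on toy frames. The sketch below uses invented rows and hypothetical names (`basics_toy`, `akas_toy`) to show one reason `isin` suits this job: a title that appears several times in akas still yields a single row in basics, whereas a merge would repeat it.

```python
import pandas as pd

# Invented stand-ins for the filtered basics and US-only akas tables.
basics_toy = pd.DataFrame({
    "tconst": ["tt1", "tt2", "tt3"],
    "primaryTitle": ["A", "B", "C"],
})
akas_toy = pd.DataFrame({
    "titleId": ["tt1", "tt1", "tt3"],  # tt1 has two US titles
    "region": ["US", "US", "US"],
})

# Boolean mask: True where the movie has at least one US entry in akas.
keep = basics_toy["tconst"].isin(akas_toy["titleId"])
us_basics = basics_toy[keep]

print(keep.tolist())  # [True, False, True]
print(us_basics["tconst"].tolist())  # ['tt1', 'tt3']
```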


#### Title Ratings :

In [30]:
# IMDb encodes missing values as '\N'; replace them with NaN
ratings = ratings.replace({'\\N':np.nan})

In [31]:
ratings.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1305933 entries, 0 to 1305932
Data columns (total 3 columns):
 #   Column         Non-Null Count    Dtype  
---  ------         --------------    -----  
 0   tconst         1305933 non-null  object 
 1   averageRating  1305933 non-null  float64
 2   numVotes       1305933 non-null  int64  
dtypes: float64(1), int64(1), object(1)
memory usage: 29.9+ MB


In [32]:
# Filter the ratings table down to US titles only by using the filtered akas dataframe
keep_ratings =ratings['tconst'].isin(akas['titleId'])
keep_ratings.head(20)

0      True
1      True
2     False
3     False
4      True
5      True
6      True
7      True
8      True
9      True
10    False
11     True
12     True
13     True
14     True
15     True
16    False
17    False
18    False
19    False
Name: tconst, dtype: bool

In [33]:
ratings = ratings[keep_ratings]
ratings

Unnamed: 0,tconst,averageRating,numVotes
0,tt0000001,5.7,1966
1,tt0000002,5.8,263
4,tt0000005,6.2,2607
5,tt0000006,5.2,181
6,tt0000007,5.4,816
...,...,...,...
1305894,tt9916200,8.1,229
1305895,tt9916204,8.1,262
1305902,tt9916348,8.1,18
1305903,tt9916362,6.4,5307


In [34]:
ratings.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 496709 entries, 0 to 1305908
Data columns (total 3 columns):
 #   Column         Non-Null Count   Dtype  
---  ------         --------------   -----  
 0   tconst         496709 non-null  object 
 1   averageRating  496709 non-null  float64
 2   numVotes       496709 non-null  int64  
dtypes: float64(1), int64(1), object(1)
memory usage: 15.2+ MB


### Final Check File Info

In [35]:
# title basics info
basics.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 86552 entries, 34803 to 9806063
Data columns (total 8 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   tconst          86552 non-null  object 
 1   titleType       86552 non-null  object 
 2   primaryTitle    86552 non-null  object 
 3   originalTitle   86552 non-null  object 
 4   isAdult         86552 non-null  object 
 5   startYear       86552 non-null  float64
 6   runtimeMinutes  86552 non-null  object 
 7   genres          86552 non-null  object 
dtypes: float64(1), object(7)
memory usage: 5.9+ MB


In [36]:
basics.head()

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,runtimeMinutes,genres
34803,tt0035423,movie,Kate & Leopold,Kate & Leopold,0,2001.0,118,"Comedy,Fantasy,Romance"
61116,tt0062336,movie,The Tango of the Widower and Its Distorting Mi...,El tango del viudo y su espejo deformante,0,2020.0,70,Drama
67669,tt0069049,movie,The Other Side of the Wind,The Other Side of the Wind,0,2018.0,122,Drama
86801,tt0088751,movie,The Naked Monster,The Naked Monster,0,2005.0,100,"Comedy,Horror,Sci-Fi"
93938,tt0096056,movie,Crime and Punishment,Crime and Punishment,0,2002.0,126,Drama


In [37]:
# title ratings info
ratings.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 496709 entries, 0 to 1305908
Data columns (total 3 columns):
 #   Column         Non-Null Count   Dtype  
---  ------         --------------   -----  
 0   tconst         496709 non-null  object 
 1   averageRating  496709 non-null  float64
 2   numVotes       496709 non-null  int64  
dtypes: float64(1), int64(1), object(1)
memory usage: 15.2+ MB


In [38]:
ratings.head()

Unnamed: 0,tconst,averageRating,numVotes
0,tt0000001,5.7,1966
1,tt0000002,5.8,263
4,tt0000005,6.2,2607
5,tt0000006,5.2,181
6,tt0000007,5.4,816


In [39]:
# title akas info
akas.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1432658 entries, 5 to 35723725
Data columns (total 8 columns):
 #   Column           Non-Null Count    Dtype 
---  ------           --------------    ----- 
 0   titleId          1432658 non-null  object
 1   ordering         1432658 non-null  int64 
 2   title            1432658 non-null  object
 3   region           1432658 non-null  object
 4   language         3894 non-null     object
 5   types            978158 non-null   object
 6   attributes       46456 non-null    object
 7   isOriginalTitle  1431313 non-null  object
dtypes: int64(1), object(7)
memory usage: 98.4+ MB


In [40]:
akas.head()

Unnamed: 0,titleId,ordering,title,region,language,types,attributes,isOriginalTitle
5,tt0000001,6,Carmencita,US,,imdbDisplay,,0
14,tt0000002,7,The Clown and His Dogs,US,,,literal English title,0
33,tt0000005,10,Blacksmith Scene,US,,imdbDisplay,,0
36,tt0000005,1,Blacksmithing Scene,US,,alternative,,0
41,tt0000005,6,Blacksmith Scene #1,US,,alternative,,0


### Save Compressed Files

In [41]:
## Save title basics df
basics.to_csv("Data/title_basics.csv.gz",compression='gzip',index=False)

In [42]:
## Save title ratings df
ratings.to_csv("Data/title_ratings.csv.gz",compression='gzip',index=False)

In [43]:
## Save title akas df
akas.to_csv("Data/title_akas.csv.gz",compression='gzip',index=False)
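
The save and reload round trip can be sketched on a small frame. This example writes to a temporary directory rather than the repository's "Data/" folder, and the rows are a reduced stand-in for the real table. It also illustrates why dtypes can change after reloading: numeric strings stored as `object` are inferred as integers when the csv is read back.

```python
import os
import tempfile

import pandas as pd

# Reduced stand-in for the filtered basics table; runtimeMinutes holds strings.
df = pd.DataFrame({
    "tconst": ["tt0035423", "tt0062336"],
    "startYear": [2001.0, 2020.0],
    "runtimeMinutes": ["118", "70"],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "title_basics.csv.gz")
    # index=False keeps the old row labels out of the file.
    df.to_csv(path, compression="gzip", index=False)
    # read_csv infers gzip compression from the .gz extension.
    reloaded = pd.read_csv(path)

print(df.dtypes)
print(reloaded.dtypes)
```

After the round trip, `runtimeMinutes` is `int64` instead of `object`, while the values themselves are unchanged.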

In [44]:
# Open saved file and preview again
basics = pd.read_csv("Data/title_basics.csv.gz", low_memory = False)
basics.head()

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,runtimeMinutes,genres
0,tt0035423,movie,Kate & Leopold,Kate & Leopold,0,2001.0,118,"Comedy,Fantasy,Romance"
1,tt0062336,movie,The Tango of the Widower and Its Distorting Mi...,El tango del viudo y su espejo deformante,0,2020.0,70,Drama
2,tt0069049,movie,The Other Side of the Wind,The Other Side of the Wind,0,2018.0,122,Drama
3,tt0088751,movie,The Naked Monster,The Naked Monster,0,2005.0,100,"Comedy,Horror,Sci-Fi"
4,tt0096056,movie,Crime and Punishment,Crime and Punishment,0,2002.0,126,Drama


In [45]:
basics.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 86552 entries, 0 to 86551
Data columns (total 8 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   tconst          86552 non-null  object 
 1   titleType       86552 non-null  object 
 2   primaryTitle    86552 non-null  object 
 3   originalTitle   86552 non-null  object 
 4   isAdult         86552 non-null  int64  
 5   startYear       86552 non-null  float64
 6   runtimeMinutes  86552 non-null  int64  
 7   genres          86552 non-null  object 
dtypes: float64(1), int64(2), object(5)
memory usage: 5.3+ MB


In [46]:
# Open saved file and preview again
ratings = pd.read_csv("Data/title_ratings.csv.gz", low_memory = False)
ratings.head()

Unnamed: 0,tconst,averageRating,numVotes
0,tt0000001,5.7,1966
1,tt0000002,5.8,263
2,tt0000005,6.2,2607
3,tt0000006,5.2,181
4,tt0000007,5.4,816


In [47]:
ratings.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 496709 entries, 0 to 496708
Data columns (total 3 columns):
 #   Column         Non-Null Count   Dtype  
---  ------         --------------   -----  
 0   tconst         496709 non-null  object 
 1   averageRating  496709 non-null  float64
 2   numVotes       496709 non-null  int64  
dtypes: float64(1), int64(1), object(1)
memory usage: 11.4+ MB


In [48]:
# Open saved file and preview again
akas = pd.read_csv("Data/title_akas.csv.gz", low_memory = False)
akas.head()

Unnamed: 0,titleId,ordering,title,region,language,types,attributes,isOriginalTitle
0,tt0000001,6,Carmencita,US,,imdbDisplay,,0.0
1,tt0000002,7,The Clown and His Dogs,US,,,literal English title,0.0
2,tt0000005,10,Blacksmith Scene,US,,imdbDisplay,,0.0
3,tt0000005,1,Blacksmithing Scene,US,,alternative,,0.0
4,tt0000005,6,Blacksmith Scene #1,US,,alternative,,0.0


In [49]:
akas.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1432658 entries, 0 to 1432657
Data columns (total 8 columns):
 #   Column           Non-Null Count    Dtype  
---  ------           --------------    -----  
 0   titleId          1432658 non-null  object 
 1   ordering         1432658 non-null  int64  
 2   title            1432658 non-null  object 
 3   region           1432658 non-null  object 
 4   language         3894 non-null     object 
 5   types            978158 non-null   object 
 6   attributes       46456 non-null    object 
 7   isOriginalTitle  1431313 non-null  float64
dtypes: float64(1), int64(1), object(6)
memory usage: 87.4+ MB
