# JAMBOREE PRODUCTIONS | Cleaning Data Set

### This script contains the following points:
#### 1. Importing Libraries
#### 2. Importing Data
#### 3. Checking the Data
#### 4. Dropping all Unreleased Films
        Reasoning
#### 5. Dropping Unnecessary Columns
        Reasoning
#### 6. Renaming Columns
        Reasoning
#### 7. Checking for Duplicated Rows
#### 8. Imputing Budget and Revenue Values
        Reasoning
#### 9. Checking for Missing Values
        Observation and Plan
        New Observation
        Production Companies
        Production Countries
        Spoken Languages
#### 10. Checking for Incorrect Values
        Reasoning
        Runtime
        Budget
        Revenue
        Release Date
        Original Language
#### 11. Checking for Mixed-Type Columns
#### 12. Creating New Columns for Genres
        Action
        Adventure
        Animation
        Comedy
        Crime
        Documentary
        Drama
        Family
        Fantasy
        History
        Horror
        Mystery
        Music
        Romance
        Science Fiction
        TV Movie
        Thriller
        War
        Western
#### 13. Creating New Runtime Category Column
        Reasoning
#### 14. Creating New Budget Category Column
        Reasoning
#### 15. Data Profile
#### 16. Exporting Data

## 1. Importing Libraries

In [1]:
# Import Libraries
import pandas as pd
from ydata_profiling import ProfileReport
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os

## 2. Importing Data

In [2]:
# Set project folder as a string
path = r'/Users/matthewjones/Documents/CareerFoundry/Data Immersion/Achievement 6/Jamboree Entertainment Analysis'

In [3]:
df_movies = pd.read_csv(os.path.join(path, '02. Data', 'Original Data', 'TMDB_movie_dataset_v11.csv'), 
                        index_col = False)

In [4]:
iso_languages = pd.read_csv(os.path.join(path, '02. Data', 'Original Data', 'iso_639-1.csv'), 
                            index_col = False)

## 3. Checking the Data

In [5]:
# Check the output
df_movies.head(5)

Unnamed: 0,id,title,vote_average,vote_count,status,release_date,revenue,runtime,adult,backdrop_path,...,original_title,overview,popularity,poster_path,tagline,genres,production_companies,production_countries,spoken_languages,keywords
0,27205,Inception,8.364,34495,Released,2010-07-15,825532764,148,False,/8ZTVqvKDQ8emSGUEMjsS4yHAwrp.jpg,...,Inception,"Cobb, a skilled thief who commits corporate es...",83.952,/oYuLEt3zVCKq57qu2F8dT7NIa6f.jpg,Your mind is the scene of the crime.,"Action, Science Fiction, Adventure","Legendary Pictures, Syncopy, Warner Bros. Pict...","United Kingdom, United States of America","English, French, Japanese, Swahili","rescue, mission, dream, airplane, paris, franc..."
1,157336,Interstellar,8.417,32571,Released,2014-11-05,701729206,169,False,/pbrkL804c8yAv3zBZR4QPEafpAR.jpg,...,Interstellar,The adventures of a group of explorers who mak...,140.241,/gEU2QniE6E77NI6lCU6MxlNBvIx.jpg,Mankind was born on Earth. It was never meant ...,"Adventure, Drama, Science Fiction","Legendary Pictures, Syncopy, Lynda Obst Produc...","United Kingdom, United States of America",English,"rescue, future, spacecraft, race against time,..."
2,155,The Dark Knight,8.512,30619,Released,2008-07-16,1004558444,152,False,/nMKdUUepR0i5zn0y1T4CsSB5chy.jpg,...,The Dark Knight,Batman raises the stakes in his war on crime. ...,130.643,/qJ2tW6WMUDux911r6m7haRef0WH.jpg,Welcome to a world without rules.,"Drama, Action, Crime, Thriller","DC Comics, Legendary Pictures, Syncopy, Isobel...","United Kingdom, United States of America","English, Mandarin","joker, sadism, chaos, secret identity, crime f..."
3,19995,Avatar,7.573,29815,Released,2009-12-15,2923706026,162,False,/vL5LR6WdxWPjLPFRLe133jXWsh5.jpg,...,Avatar,"In the 22nd century, a paraplegic Marine is di...",79.932,/kyeqWdyUXW608qlYkRqosgbbJyK.jpg,Enter the world of Pandora.,"Action, Adventure, Fantasy, Science Fiction","Dune Entertainment, Lightstorm Entertainment, ...","United States of America, United Kingdom","English, Spanish","future, society, culture clash, space travel, ..."
4,24428,The Avengers,7.71,29166,Released,2012-04-25,1518815515,143,False,/9BBTo63ANSmhC4e6r62OJFuK2GL.jpg,...,The Avengers,When an unexpected enemy emerges and threatens...,98.082,/RYMX2wcKCBAr24UyPD7xwmjaTn.jpg,Some assembly required.,"Science Fiction, Action, Adventure",Marvel Studios,United States of America,"English, Hindi, Russian","new york city, superhero, shield, based on com..."


In [6]:
# Check the shape
df_movies.shape

(1055148, 24)

In [7]:
# List out column names for ease of access
df_movies.columns

Index(['id', 'title', 'vote_average', 'vote_count', 'status', 'release_date',
       'revenue', 'runtime', 'adult', 'backdrop_path', 'budget', 'homepage',
       'imdb_id', 'original_language', 'original_title', 'overview',
       'popularity', 'poster_path', 'tagline', 'genres',
       'production_companies', 'production_countries', 'spoken_languages',
       'keywords'],
      dtype='object')

## 4. Dropping all Unreleased Films

### Reasoning:
        Our analysis covers only released films. Because more than 1,000,000 released entries remain, 
        dropping the other films should have minimal effect on our analysis.

In [8]:
# See the breakdown of films by Released status
df_movies['status'].value_counts(dropna = False)

status
Released           1035010
In Production         7389
Post Production       6502
Planned               5675
Canceled               288
Rumored                284
Name: count, dtype: int64

In [9]:
# Subset the dataframe to include only 'Released' films
df_released = df_movies[df_movies['status'] == 'Released']

In [10]:
# Check that the new dataframe has the correct number of rows
df_released.shape

(1035010, 24)

## 5. Dropping Unnecessary Columns

### Reasoning:
        status - all films have the same value, 'Released'
        
        adult - not necessary for our analysis (and all films that we will consider have the same value, False)
        
        backdrop_path - not necessary for our analysis
        
        homepage - not necessary for our analysis
        
        imdb_id - not necessary for our analysis
        
        original_title - we do not need another title column
        
        popularity - this metric is biased towards newer films that have had a recent uptick in interest; it is 
        not a reliable measure for assessing a film's success
        
        poster_path - not necessary for our analysis
        
        tagline - not necessary for our analysis
        
        keywords - not necessary for our analysis

In [11]:
# Drop the columns
df_droppedcolumns = df_released.drop(['status', 'adult', 'backdrop_path', 'homepage', 'imdb_id', 
                                      'original_title', 'popularity', 'poster_path', 'tagline', 'keywords'], 
                                     axis = 1)

In [12]:
# Check that the new dataframe no longer has the dropped columns
df_droppedcolumns.columns

Index(['id', 'title', 'vote_average', 'vote_count', 'release_date', 'revenue',
       'runtime', 'budget', 'original_language', 'overview', 'genres',
       'production_companies', 'production_countries', 'spoken_languages'],
      dtype='object')

## 6. Renaming Columns

### Reasoning
        overview --> synopsis : synopsis is a more intuitive name for this column

In [13]:
# Rename the 'overview' column
df_renamed = df_droppedcolumns.rename({'overview' : 'synopsis'}, axis = 1)

In [14]:
# Check that the new dataframe has the renamed column
df_renamed.columns

Index(['id', 'title', 'vote_average', 'vote_count', 'release_date', 'revenue',
       'runtime', 'budget', 'original_language', 'synopsis', 'genres',
       'production_companies', 'production_countries', 'spoken_languages'],
      dtype='object')

## 7. Checking for Duplicated Rows

In [15]:
# Create a subset of rows that are duplicated
df_dups = df_renamed[df_renamed.duplicated()]

In [16]:
# Check the shape of the duplicated row dataframe
df_dups.shape

(367, 14)

In [17]:
# Subset the dataframe to drop all duplicates (.copy() so that later .loc updates act on an independent dataframe)
df_no_dups = df_renamed.drop_duplicates().copy()

In [18]:
# Check that the new dataframe has the correct number of rows
df_no_dups.shape

(1034643, 14)
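
The check above only catches rows that are identical in every column. As a minimal sketch on toy data (the values are hypothetical, and this is not a step the notebook performs), the same method can also flag rows that share an `id` but differ elsewhere:

```python
import pandas as pd

# Toy data (hypothetical): one exact duplicate, and one id that appears with two titles
toy = pd.DataFrame({'id': [1, 1, 2, 2], 'title': ['A', 'A', 'B', 'B2']})

# Rows identical in every column (the first occurrence is not counted)
n_exact_dups = int(toy.duplicated().sum())

# After dropping exact duplicates, look for ids that still repeat
deduped = toy.drop_duplicates()
n_same_id = int(deduped.duplicated(subset = ['id']).sum())

print(n_exact_dups, n_same_id, deduped.shape)
```

If `n_same_id` were greater than 0 on the real data, those rows would need a manual look before deciding which one to keep.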

## 8. Imputing Budget and Revenue Values

### Reasoning:
        A large number of films have a budget of 0 or a revenue of 0. When a film has a budget but no revenue, 
        or a revenue but no budget, the 0 is most likely an error in data collection, so we impute it with 
        the column mean (calculated over all films, including those recorded as 0). Films with no budget AND 
        no revenue could also be errors, but we will not be considering these films in our analysis.

In [19]:
# Subset the dataframe to include films with a budget over 0, but no revenue OR a revenue over 0, but no budget
df_missing_budge_rev = df_no_dups[((df_no_dups['budget'] > 0) & (df_no_dups['revenue'] == 0)) |
                                  ((df_no_dups['revenue'] > 0) & (df_no_dups['budget'] == 0))]

In [20]:
# Check the shape of the subset
df_missing_budge_rev.shape

(44023, 14)

There are a total of **44,023 entries** where exactly one of budget or revenue is 0. This is < 5% of the total number of films. Imputing should not significantly alter the data.

In [21]:
# Calculate the mean and median of the 'budget' and 'revenue' column and create a variable for them
avg_budget = df_no_dups['budget'].mean(axis = 0)
med_budget = df_no_dups['budget'].median(axis = 0)
avg_revenue = df_no_dups['revenue'].mean(axis = 0)
med_revenue = df_no_dups['revenue'].median(axis = 0)

print('The average budget is: ', avg_budget)
print('The median budget is: ', med_budget)
print('The average revenue is: ', avg_revenue)
print('The median revenue is: ', med_revenue)

The average budget is:  278702.51775250013
The median budget is:  0.0
The average revenue is:  729497.7704850852
The median revenue is:  0.0


So many films have a budget or revenue recorded as 0 that **the medians for both are 0 and therefore unusable**. 
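
To illustrate the effect, here is a minimal sketch on toy data (the values are hypothetical). It also shows, purely for comparison, a mean taken over non-zero entries only; that alternative is an assumption of this sketch and is not what the notebook uses.

```python
import pandas as pd

# Toy data (hypothetical values): most budgets are recorded as 0
toy = pd.DataFrame({'budget': [0, 0, 0, 0, 100, 300]})

median_all = toy['budget'].median()                          # the zeros dominate
mean_all = toy['budget'].mean()                              # the zeros pull the mean down
mean_nonzero = toy.loc[toy['budget'] > 0, 'budget'].mean()   # ignores the zeros

print('Median over all rows: ', median_all)
print('Mean over all rows: ', mean_all)
print('Mean over non-zero rows: ', mean_nonzero)
```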

In [22]:
# For films with a budget over 0 AND no revenue, replace the 0 with the average revenue
df_no_dups.loc[(df_no_dups['budget'] > 0) & (df_no_dups['revenue'] == 0), 'revenue'] = avg_revenue
df_no_dups.loc[df_no_dups['revenue'] == avg_revenue].shape # 37,543 entries changed!


(37543, 14)

In [23]:
# For films with a revenue over 0 AND no budget, replace the 0 with the average budget
df_no_dups.loc[(df_no_dups['revenue'] > 0) & (df_no_dups['budget'] == 0), 'budget'] = avg_budget
df_no_dups.loc[df_no_dups['budget'] == avg_budget].shape # 6,480 entries changed!


(6480, 14)

### 44,023 = 37,543 + 6,480
We addressed all the films we previously subset in df_missing_budge_rev
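
The same bookkeeping can be sketched on a toy dataframe (hypothetical values). The point is only to show that computing both masks before any assignment keeps the two groups separate, so their sizes add up to the number of imputed rows:

```python
import pandas as pd

# Toy data (hypothetical values)
toy = pd.DataFrame({
    'budget':  [0.0, 50.0, 0.0, 80.0],
    'revenue': [0.0, 0.0, 120.0, 200.0],
})

avg_budget = toy['budget'].mean()
avg_revenue = toy['revenue'].mean()

# Build both masks first, so an imputed value cannot change which rows match
no_revenue = (toy['budget'] > 0) & (toy['revenue'] == 0)
no_budget = (toy['revenue'] > 0) & (toy['budget'] == 0)

toy.loc[no_revenue, 'revenue'] = avg_revenue
toy.loc[no_budget, 'budget'] = avg_budget

n_imputed = int(no_revenue.sum() + no_budget.sum())
print(n_imputed)
```

Counting from the masks, rather than by searching for rows equal to the average, also avoids miscounting a film whose true value happens to equal the average.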

## 9. Checking for Missing Values

In [24]:
# Check for missing values
df_no_dups.isnull().sum()

id                           0
title                       10
vote_average                 0
vote_count                   0
release_date            127068
revenue                      0
runtime                      0
budget                       0
original_language            0
synopsis                192302
genres                  396701
production_companies    554763
production_countries    436719
spoken_languages        423697
dtype: int64

### Observation and Plan:
        Six of the 14 columns have an unacceptable amount of missing data. We want to have as much 
        information for Genres, Production Companies, Production Countries, and Spoken Languages as possible. 
        Because it would be too arduous to correct hundreds of thousands of entries, I can subset the dataframe 
        to include only the films with a significant vote count, a budget, and a revenue. These are the films 
        on which we will focus our analysis, and this subset has a manageable number of missing values to correct.
        
        So, the subset of df_no_dups, df_missing_values, will be used to identify the films that need 
        information updated. But df_no_dups will be the dataframe that gets updated.

In [25]:
df_missing_values = df_no_dups[(df_no_dups['budget'] > 0) & (df_no_dups['revenue'] > 0) & 
                               (df_no_dups['vote_count'] >= 150)]

In [26]:
df_missing_values.isnull().sum()

id                       0
title                    0
vote_average             0
vote_count               0
release_date             0
revenue                  0
runtime                  0
budget                   0
original_language        0
synopsis                 2
genres                   0
production_companies    27
production_countries     3
spoken_languages         2
dtype: int64

### New Observation:
        Compared to the missing values of the entire dataframe, this subset has only three columns (production 
        companies, production countries, and spoken languages) that need missing data to be filled in*.
        
        *Note: Missing data was found through another online movie database, IMDb. When searching for missing 
        production companies, care was taken to ensure the production companies were legitimate companies. All 
        manually entered production companies were cross-checked against the original data source to confirm 
        they also produced another film in the data. Only production companies that were involved at a 
        co-production level or higher were included. Another caveat is that an individual production company may 
        be listed under more than one name. 
        
        For example:
        Jamboree Entertainment could also be listed as Jamboree Films or Jamboree Productions
        Jamboree Productions could also be listed as Jamboree Producciones.
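
One way to handle such aliases is sketched below. The alias table is hypothetical (built from the example names above, and it assumes all of them refer to the same company), and `normalize_companies` is an illustrative helper, not part of this notebook. It also assumes company names never contain a comma themselves.

```python
import pandas as pd

# Hypothetical alias table: every known alias maps to one canonical name
aliases = {
    'Jamboree Films': 'Jamboree Entertainment',
    'Jamboree Productions': 'Jamboree Entertainment',
    'Jamboree Producciones': 'Jamboree Entertainment',
}

def normalize_companies(cell):
    """Map each comma-separated company to its canonical name and drop repeats."""
    if pd.isna(cell):
        return cell
    names = [name.strip() for name in cell.split(',')]
    canonical = [aliases.get(name, name) for name in names]
    # dict.fromkeys keeps the first occurrence of each name, in order
    return ', '.join(dict.fromkeys(canonical))

companies = pd.Series(['Jamboree Films, Rai Cinema',
                       'Jamboree Producciones, Jamboree Productions',
                       None])
normalized = companies.apply(normalize_companies)
print(normalized.tolist())
```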

### Production Companies

In [27]:
# Subset the dataframe to see films with missing production companies
df_no_company = df_missing_values[df_missing_values['production_companies'].isnull()]

In [28]:
# Check the output
df_no_company

Unnamed: 0,id,title,vote_average,vote_count,release_date,revenue,runtime,budget,original_language,synopsis,genres,production_companies,production_countries,spoken_languages
6395,246355,Saw,7.0,503,2003-01-01,729497.8,10,2000.0,en,"David, an orderly at a hospital, tells his hor...","Crime, Horror, Thriller",,Australia,English
6714,21481,Twitches,6.757,468,2005-10-14,729497.8,86,20000000.0,en,"Twins separated at birth, Camryn and Alex meet...","Comedy, Drama, Family, Fantasy, TV Movie",,United States of America,English
6972,9893,Sleepover,5.997,443,2004-07-09,10143020.0,89,10000000.0,en,As their first year of high school looms ahead...,"Family, Comedy",,United States of America,"English, Portuguese"
7453,61717,Wendy Wu: Homecoming Warrior,6.148,401,2006-06-16,729497.8,90,6000000.0,en,"It is the story of an average, popular America...","Family, Action, Adventure, TV Movie",,United States of America,English
7521,10961,Behind the Mask: The Rise of Leslie Vernon,6.467,396,2006-08-29,69136.0,92,278702.5,en,The next great psycho horror slasher has given...,"Comedy, Horror, Thriller",,United States of America,English
7689,1619,The Way of the Gun,6.347,383,2000-09-08,19125400.0,119,8500000.0,en,Two criminal drifters without sympathy get mor...,"Action, Crime, Drama, Thriller",,United States of America,"Spanish, English"
8157,652722,In the Arms of an Assassin,7.98,348,2019-12-06,425332.0,101,278702.5,es,Victor (William Levy) is one of the world’s mo...,"Romance, Thriller",,Dominican Republic,Spanish
8411,27040,Meshes of the Afternoon,7.679,332,1943-01-01,729497.8,14,275.0,en,A woman returning home falls asleep and has vi...,Horror,,United States of America,No Language
8875,525686,Mi prima la sexóloga,6.984,306,2016-07-21,729497.8,104,15000.0,es,A young man is afraid of asking for sex tips t...,Comedy,,Bolivia,Spanish
9444,457601,Everything You Want,7.442,278,2017-05-11,249009.0,106,278702.5,it,"An aimless young troublemaker, Alessandro, squ...","Comedy, Drama",,Italy,Italian


In [29]:
# Update the production companies of each film
df_no_dups.loc[df_no_dups['id'] == 246355, 'production_companies'] = ''
df_no_dups.loc[df_no_dups['id'] == 21481, 'production_companies'] = 'Disney Pictures'
df_no_dups.loc[df_no_dups['id'] == 9893, 'production_companies'] = 'Metro-Goldwyn Mayer, Landscape Entertainment, Weinstock Productions'
df_no_dups.loc[df_no_dups['id'] == 61717, 'production_companies'] = 'Disney Pictures, Regan/Jon Productions'
df_no_dups.loc[df_no_dups['id'] == 10961, 'production_companies'] = 'Glen Echo Entertainment, Code Entertainment'
df_no_dups.loc[df_no_dups['id'] == 1619, 'production_companies'] = 'Artisan Entertainment'
df_no_dups.loc[df_no_dups['id'] == 652722, 'production_companies'] = 'William Levy Entertainment'
df_no_dups.loc[df_no_dups['id'] == 27040, 'production_companies'] = ''
df_no_dups.loc[df_no_dups['id'] == 525686, 'production_companies'] = 'Miguel Chavez & Associates'
df_no_dups.loc[df_no_dups['id'] == 457601, 'production_companies'] = 'IBC Movie, Rai Cinema, Pupkin Entertainment'
df_no_dups.loc[df_no_dups['id'] == 59722, 'production_companies'] = 'The Asylum'
df_no_dups.loc[df_no_dups['id'] == 97795, 'production_companies'] = 'Nostromo Pictures, Kinology'
df_no_dups.loc[df_no_dups['id'] == 15641, 'production_companies'] = 'Stage 6 Films'
df_no_dups.loc[df_no_dups['id'] == 10790, 'production_companies'] = 'Castelao Productions, Filmax, Filmstudio Bojana, Future Films, Radiovision Plus'
df_no_dups.loc[df_no_dups['id'] == 77000, 'production_companies'] = 'Aurora Film, Rai Cinema'
df_no_dups.loc[df_no_dups['id'] == 15664, 'production_companies'] = 'Lionsgate Films, Thousand Words'
df_no_dups.loc[df_no_dups['id'] == 383807, 'production_companies'] = 'Les Films du Bélier, Les Films Pelléas, France 2 Cinéma, Mars Films, Jouror Productions, CN5 Productions, Ezekiel Film Production, Frakas Productions, RTBF, Proximus'
df_no_dups.loc[df_no_dups['id'] == 576071, 'production_companies'] = 'Unplanned Movie, Believe Entertainment'
df_no_dups.loc[df_no_dups['id'] == 734309, 'production_companies'] = 'Giant Sables Media Entertainment, ZenHQ Films'
df_no_dups.loc[df_no_dups['id'] == 670355, 'production_companies'] = ''
df_no_dups.loc[df_no_dups['id'] == 19139, 'production_companies'] = 'Milkshake Films'
df_no_dups.loc[df_no_dups['id'] == 368993, 'production_companies'] = 'Ra Ra Productions'
df_no_dups.loc[df_no_dups['id'] == 862855, 'production_companies'] = ''
df_no_dups.loc[df_no_dups['id'] == 12621, 'production_companies'] = 'Bona Fide Productions'
df_no_dups.loc[df_no_dups['id'] == 22492, 'production_companies'] = 'Thomas Tull Productions'
df_no_dups.loc[df_no_dups['id'] == 743439, 'production_companies'] = 'Nickelodeon Animation Studios, Nickelodeon Productions, Spin Master Entertainment'
df_no_dups.loc[df_no_dups['id'] == 11458, 'production_companies'] = 'Myriad Pictures, Wildwood Enterprises'

### Production Countries

In [30]:
# Subset the dataframe to see films with missing production country
df_no_country = df_missing_values[df_missing_values['production_countries'].isnull()]

In [31]:
# Check the output
df_no_country

Unnamed: 0,id,title,vote_average,vote_count,release_date,revenue,runtime,budget,original_language,synopsis,genres,production_companies,production_countries,spoken_languages
813,27576,Salt,6.398,5109,2010-07-21,293329100.0,100,110000000.0,en,"As a CIA officer, Evelyn Salt swore an oath to...","Action, Mystery, Thriller","Wintergreen Productions, Columbia Pictures, Re...",,"English, Russian, Korean"
11340,14822,Return from Witch Mountain,5.849,212,1978-03-10,16393000.0,95,278702.5,en,Tony and Tia are other-worldly twins endowed w...,"Adventure, Fantasy, Science Fiction, Family",Walt Disney Productions,,English
12683,19139,Goal! III : Taking On The World,3.946,177,2009-06-22,729497.8,91,10000000.0,en,"Mexican footballer Santiago Muñez, along with ...",Drama,,,"English, Spanish"


In [32]:
# Update the production country of each film
df_no_dups.loc[df_no_dups['id'] == 27576, 'production_countries'] = 'United States of America'
df_no_dups.loc[df_no_dups['id'] == 14822, 'production_countries'] = 'United States of America'
df_no_dups.loc[df_no_dups['id'] == 19139, 'production_countries'] = 'United Kingdom'

### Spoken Languages

In [33]:
# Subset the dataframe to see films with missing spoken language
df_no_spoken_languages = df_missing_values[df_missing_values['spoken_languages'].isnull()]

In [34]:
# Check the output
df_no_spoken_languages

Unnamed: 0,id,title,vote_average,vote_count,release_date,revenue,runtime,budget,original_language,synopsis,genres,production_companies,production_countries,spoken_languages
1827,594767,Shazam! Fury of the Gods,6.662,2416,2023-03-15,133437105.0,130,125000000.0,en,"Billy Batson and his foster siblings, who tran...","Comedy, Action, Fantasy","New Line Cinema, The Safran Company, DC Films",United States of America,
3110,511809,West Side Story,7.036,1334,2021-12-08,74530532.0,157,100000000.0,en,Two youngsters from rival New York City gangs ...,"Drama, Romance, Crime","Amblin Entertainment, 20th Century Studios",United States of America,


In [35]:
# Update the spoken languages of each film
df_no_dups.loc[df_no_dups['id'] == 594767, 'spoken_languages'] = 'English, Greek, French, Spanish'
df_no_dups.loc[df_no_dups['id'] == 511809, 'spoken_languages'] = 'English, Spanish'

## 10. Checking for Incorrect Values

In [36]:
# Check the spread of the continuous variables
df_no_dups.describe()

Unnamed: 0,id,vote_average,vote_count,revenue,runtime,budget
count,1034643.0,1034643.0,1034643.0,1034643.0,1034643.0,1034643.0
mean,686285.7,2.04686,20.72846,755968.3,50.1827,280448.0
std,369747.7,3.091633,333.7834,18140160.0,62.3038,4950615.0
min,2.0,0.0,0.0,-12.0,-28.0,0.0
25%,380853.5,0.0,0.0,0.0,1.0,0.0
50%,685431.0,0.0,0.0,0.0,29.0,0.0
75%,1007976.0,5.0,1.0,0.0,90.0,0.0
max,1307736.0,10.0,34495.0,3000000000.0,14400.0,900000000.0


In [37]:
# Check if there are unusual values
df_no_dups['release_date'].sort_values()

562475     1800-01-01
550453     1800-09-11
227562     1865-01-01
511392     1865-01-01
523744     1865-01-01
              ...    
1055031           NaN
1055037           NaN
1055123           NaN
1055125           NaN
1055144           NaN
Name: release_date, Length: 1034643, dtype: object

In [38]:
# Check the values of the 'original_language' column
df_no_dups['original_language'].value_counts()

original_language
en    559797
fr     61911
es     52873
de     51908
ja     45962
       ...  
kv         1
nr         1
kg         1
aa         1
ii         1
Name: count, Length: 174, dtype: int64

### Reasoning:
        Runtime - we only want to consider feature films (The Academy of Motion Picture Arts and Sciences 
        defines a feature film as at least 40 minutes long). There's also at least one film listed at 14,400 
        minutes, but we will keep the extreme values on that end in the dataset.
        
        Budget/Revenue - although we've corrected some of the films with no budget/revenue, it is still possible 
        that some films have incorrect values (too low). Missing budget and revenue information was found 
        through IMDb or Wikipedia. As a last resort, the average budget was used.
        
        Release Date - besides null values, we also have implausible release dates as early as 1800 (the 
        earliest surviving motion picture dates from 1888)
        
        Original Language - while there aren't unusual values in this column, the language abbreviations are not 
        useful in their current state

### Runtime

In [39]:
# Subset the dataframe to see films shorter than 40 minutes
df_short_films = df_no_dups[df_no_dups['runtime'] < 40]

In [40]:
# Check the output
df_short_films.shape # Over half the data are from films under 40 min long!

(538873, 14)

In [41]:
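# Subset the dataframe to only include films with a runtime of at least 40 minutes
# (.copy() makes df_features an independent dataframe, so later .loc updates do not act on a slice)
df_features = df_no_dups[df_no_dups['runtime'] >= 40].copy()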

In [42]:
df_features.shape

(495770, 14)

### Budget

In [43]:
# Subset the dataframe to see films with a budget of less than $1,000
df_low_budget = df_features[(df_features['budget'] > 0) & (df_features['budget'] < 1000) & (df_features['vote_count'] >= 150)]

In [44]:
# Check the output
df_low_budget

Unnamed: 0,id,title,vote_average,vote_count,release_date,revenue,runtime,budget,original_language,synopsis,genres,production_companies,production_countries,spoken_languages
3623,619979,Deep Water,5.8,1114,2022-03-18,729497.8,116,4.0,en,Vic and Melinda Van Allen are a couple in the ...,"Drama, Mystery, Thriller","New Regency Pictures, Entertainment 360, Film ...",United States of America,English
5191,5967,The Umbrellas of Cherbourg,7.393,676,1964-02-19,729497.8,93,7.0,fr,This simple romantic tragedy begins in 1957. G...,"Drama, Romance","Parc Film, Madeleine Films, Beta Film","France, Germany",French
5196,21435,French Fried Vacation 3: Friends Forever,4.06,675,2006-02-01,84.0,95,35.0,fr,"After the Club Med and skiing, what happened t...",Comedy,"Les Films Christian Fechner, TPS Cinéma, TF1 F...",France,"English, French, Italian"
6219,22582,Tom and Jerry: The Movie,6.266,522,1992-10-01,729497.8,84,35.0,en,The popular cartoon cat and mouse are thrown i...,"Family, Animation, Comedy","Film Roman, Live Entertainment, WMG Film, Turn...","Germany, United States of America",English
6260,851644,20th Century Girl,8.264,517,2022-10-06,729497.8,119,119.0,ko,Yeon-du asks her best friend Bora to collect a...,"Romance, Drama",Yong Film,South Korea,Korean
6828,282297,Cowspiracy: The Sustainability Secret,7.692,458,2014-07-01,729497.8,90,117.0,en,"Follow the shocking, yet humorous, journey of ...",Documentary,"First Spark Media, A.U.M. Films",United States of America,English
7608,10407,Housesitter,6.116,389,1992-06-12,94.0,102,26.0,en,"After building his dream house, architect Newt...","Comedy, Romance","Universal Pictures, Imagine Entertainment",United States of America,English
9071,13701,Immortal Beloved,7.067,295,1994-12-16,729497.8,121,120.0,en,A chronicle of the life of infamous classical ...,"Drama, Music, Romance","Columbia Pictures, Majestic Films Internationa...",United States of America,English
9542,11310,City Slickers II: The Legend of Curly's Gold,5.208,274,1994-06-10,43.0,116,40.0,en,Mitch Robbins 40th birthday begins quite well ...,"Action, Comedy, Drama, Western",Castle Rock Entertainment,United States of America,English
10061,11703,Kiss of the Spider Woman,7.024,254,1985-07-26,17005230.0,120,11.0,en,The story of two radically different men throw...,Drama,HB Filmes,"Brazil, United States of America","English, French, German, Italian, Portuguese"


In [45]:
# Update the budget information for each film
df_features.loc[df_features['id'] == 619979, 'budget'] = 48917499
df_features.loc[df_features['id'] == 5967, 'budget'] = avg_budget
df_features.loc[df_features['id'] == 21435, 'budget'] = 37567950
df_features.loc[df_features['id'] == 22582, 'budget'] = 8000000
df_features.loc[df_features['id'] == 851644, 'budget'] = avg_budget
df_features.loc[df_features['id'] == 282297, 'budget'] = 117092
df_features.loc[df_features['id'] == 10407, 'budget'] = 26000000
df_features.loc[df_features['id'] == 13701, 'budget'] = 9900000
df_features.loc[df_features['id'] == 11310, 'budget'] = 40000000
df_features.loc[df_features['id'] == 11703, 'budget'] = 1500000
df_features.loc[df_features['id'] == 14424, 'budget'] = 5000000
df_features.loc[df_features['id'] == 141581, 'budget'] = 5900000
df_features.loc[df_features['id'] == 571055, 'budget'] = avg_budget
df_features.loc[df_features['id'] == 20343, 'budget'] = 1350000
df_features.loc[df_features['id'] == 119193, 'budget'] = 14180000
df_features.loc[df_features['id'] == 16664, 'budget'] = 1200000
df_features.loc[df_features['id'] == 34672, 'budget'] = 6700000
df_features.loc[df_features['id'] == 829410, 'budget'] = 7000000
df_features.loc[df_features['id'] == 13489, 'budget'] = 8500000

# Budget data for Kickboxer 2 is not readily available. The estimate was taken from the first Kickboxer film
df_features.loc[df_features['id'] == 24993, 'budget'] = 1500000

### Revenue

In [46]:
# Subset the dataframe to see films with a revenue of less than $1,000
df_low_revenue = df_features[(df_features['revenue'] > 0) & (df_features['revenue'] < 1000) & (df_features['vote_count'] >= 150)]

In [47]:
# Check the output
df_low_revenue

Unnamed: 0,id,title,vote_average,vote_count,release_date,revenue,runtime,budget,original_language,synopsis,genres,production_companies,production_countries,spoken_languages
5196,21435,French Fried Vacation 3: Friends Forever,4.06,675,2006-02-01,84.0,95,37567950.0,fr,"After the Club Med and skiing, what happened t...",Comedy,"Les Films Christian Fechner, TPS Cinéma, TF1 F...",France,"English, French, Italian"
5443,13531,Empire Records,6.65,632,1995-09-22,303.0,90,278702.5,en,The employees of an independent music store le...,"Music, Comedy, Drama","Monarchy Enterprises B.V., New Regency Picture...",United States of America,English
7189,10714,The Jungle Book,5.908,425,1994-12-23,43.0,111,278702.5,en,"Raised by wild animals since childhood, Mowgli...","Family, Adventure, Drama",Walt Disney Pictures,United States of America,English
7608,10407,Housesitter,6.116,389,1992-06-12,94.0,102,26000000.0,en,"After building his dream house, architect Newt...","Comedy, Romance","Universal Pictures, Imagine Entertainment",United States of America,English
8222,10839,Cross of Iron,7.0,344,1977-01-29,201.0,132,6000000.0,en,"It is 1943, and the German army—ravaged and de...","Drama, Action, History, War","Rapid Film, EMI Films, Terra-Filmkunst","Germany, United Kingdom","German, English, French, Russian"
8671,94562,Turkish for Beginners,6.2,317,2012-03-14,23.0,105,278702.5,de,During an emergency landing on a deserted isla...,Comedy,"Constantin Film, Rat Pack Filmproduktion, Bluv...",Germany,German
8739,354979,Dog Eat Dog,5.105,313,2016-11-04,80.0,93,278702.5,en,Carved from a lifetime of experience that runs...,"Drama, Crime, Thriller","shanghai gigantic pictures, Pure Dopamine, adm...",United States of America,English
9483,6936,Ben X,6.942,276,2007-08-26,27.0,90,1500000.0,nl,Harassed by bullies because of his mild autism...,Drama,MMG Film & TV Production,Belgium,Dutch
9542,11310,City Slickers II: The Legend of Curly's Gold,5.208,274,1994-06-10,43.0,116,40000000.0,en,Mitch Robbins 40th birthday begins quite well ...,"Action, Comedy, Drama, Western",Castle Rock Entertainment,United States of America,English
9617,7294,The Man Without a Past,7.391,271,2002-03-01,921.0,97,278702.5,fi,"Arriving in Helsinki, a nameless man is beaten...","Drama, Comedy, Romance","Sputnik, YLE, Pyramide Productions, Pandora Fi...","Finland, France, Germany",Finnish


In [48]:
# Update the revenue information for each film
df_features.loc[df_features['id'] == 21435, 'revenue'] = 84152064
df_features.loc[df_features['id'] == 13531, 'revenue'] = 303841
df_features.loc[df_features['id'] == 10714, 'revenue'] = 52389402
df_features.loc[df_features['id'] == 10407, 'revenue'] = 94900635
df_features.loc[df_features['id'] == 10839, 'revenue'] = 1509000
df_features.loc[df_features['id'] == 94562, 'revenue'] = 23957607
df_features.loc[df_features['id'] == 354979, 'revenue'] = 184404
df_features.loc[df_features['id'] == 6936, 'revenue'] = 2744414
df_features.loc[df_features['id'] == 11310, 'revenue'] = 43622150
df_features.loc[df_features['id'] == 7294, 'revenue'] = 9564237
df_features.loc[df_features['id'] == 7916, 'revenue'] = 783501
df_features.loc[df_features['id'] == 14424, 'revenue'] = 1229330
df_features.loc[df_features['id'] == 17911, 'revenue'] = 722
df_features.loc[df_features['id'] == 94204, 'revenue'] = 525
df_features.loc[df_features['id'] == 34672, 'revenue'] = 247327
df_features.loc[df_features['id'] == 19855, 'revenue'] = 16928556
df_features.loc[df_features['id'] == 18196, 'revenue'] = 4000000
df_features.loc[df_features['id'] == 429107, 'revenue'] = 7425391
df_features.loc[df_features['id'] == 283559, 'revenue'] = 4800000
df_features.loc[df_features['id'] == 612491, 'revenue'] = 1003063
df_features.loc[df_features['id'] == 19267, 'revenue'] = 21500000
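
The one-line-per-film updates above could also be driven from a single mapping, which keeps the id and revenue pairs in one place. The following is a minimal sketch on a small stand-in dataframe, not the notebook's actual method: `demo` and `revenue_updates` are hypothetical names, and only three of the id/revenue pairs from the cell above are reproduced.

```python
import pandas as pd

# Stand-in for df_features: three films corrected above plus one untouched film
demo = pd.DataFrame({'id': [354979, 6936, 11310, 27205],
                     'revenue': [80.0, 27.0, 43.0, 825532800.0]})

# Hypothetical mapping of film id -> corrected revenue (values copied from the cell above)
revenue_updates = {354979: 184404, 6936: 2744414, 11310: 43622150}

# Look up each id in the mapping; ids without an entry keep their existing revenue
demo['revenue'] = demo['id'].map(revenue_updates).fillna(demo['revenue'])

print(demo)
```

A mapping like this is easier to audit than a long run of `.loc` assignments, although the result is the same.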

### Release Date

In the previously created df_missing_values (all films that have a budget, a revenue, and at least 150 votes), no release date looks suspicious: the values run from 1902 to 2023. The implausible dates in df_features (such as 1800-09-11) therefore belong only to films that are already excluded from the analysis.

In [49]:
# Check if there are unusual values
df_features['release_date'].sort_values()

550453     1800-09-11
163055     1894-10-08
822749     1899-11-01
514792     1900-01-01
602776     1900-01-01
              ...    
1054714           NaN
1054835           NaN
1054848           NaN
1054899           NaN
1055125           NaN
Name: release_date, Length: 495770, dtype: object

In [50]:
# Check if there are unusual values in the df_missing_values dataframe
df_missing_values['release_date'].sort_values() 

2672     1902-04-17
6035     1903-12-07
13258    1904-10-01
6583     1915-02-08
8943     1916-09-04
            ...    
9964     2023-08-23
10129    2023-08-30
7919     2023-09-06
6246     2023-09-13
8203     2023-09-22
Name: release_date, Length: 9551, dtype: object
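
A sorted listing only shows the extremes. To count implausible dates directly, one option is to parse the column and compare it against a cutoff. This is a minimal sketch on a stand-in series; the cutoff `earliest_plausible` is an assumption chosen for illustration, not a value used in this analysis.

```python
import pandas as pd

# Stand-in values mirroring the kinds of dates seen above, including a missing one
dates = pd.Series(['1800-09-11', '1902-04-17', '2023-09-22', None])

# errors = 'coerce' turns unparseable or missing entries into NaT instead of raising
parsed = pd.to_datetime(dates, errors = 'coerce')

# Hypothetical cutoff, for illustration only
earliest_plausible = pd.Timestamp('1890-01-01')

# Comparisons against NaT evaluate to False, so missing dates are counted separately
suspicious = parsed < earliest_plausible
n_suspicious = int(suspicious.sum())
n_missing = int(parsed.isna().sum())

print(n_suspicious, 'suspicious;', n_missing, 'missing')
```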

### Original Language

In [51]:
# Check the output of the language code dataframe
iso_languages.head()

Unnamed: 0,family,name,nativeName,639-1,639-2,639-2/B
0,Northwest Caucasian,Abkhaz,"аҧсуа бызшәа, аҧсшәа",ab,abk,
1,Afro-Asiatic,Afar,Afaraf,aa,aar,
2,Indo-European,Afrikaans,Afrikaans,af,afr,
3,Niger–Congo,Akan,Akan,ak,aka,
4,Indo-European,Albanian,Shqip,sq,sqi,alb


In [52]:
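# Merge df_features on 'original_language' with iso_languages on '639-1'
# Note: no 'how' is passed, so this is an inner merge and rows without a matching language code are dropped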
df_features = df_features.merge(iso_languages, left_on = 'original_language', right_on = '639-1', indicator = True)

In [53]:
# Check the output
df_features.head()

Unnamed: 0,id,title,vote_average,vote_count,release_date,revenue,runtime,budget,original_language,synopsis,...,production_companies,production_countries,spoken_languages,family,name,nativeName,639-1,639-2,639-2/B,_merge
0,27205,Inception,8.364,34495,2010-07-15,825532800.0,148,160000000.0,en,"Cobb, a skilled thief who commits corporate es...",...,"Legendary Pictures, Syncopy, Warner Bros. Pict...","United Kingdom, United States of America","English, French, Japanese, Swahili",Indo-European,English,English,en,eng,,both
1,157336,Interstellar,8.417,32571,2014-11-05,701729200.0,169,165000000.0,en,The adventures of a group of explorers who mak...,...,"Legendary Pictures, Syncopy, Lynda Obst Produc...","United Kingdom, United States of America",English,Indo-European,English,English,en,eng,,both
2,155,The Dark Knight,8.512,30619,2008-07-16,1004558000.0,152,185000000.0,en,Batman raises the stakes in his war on crime. ...,...,"DC Comics, Legendary Pictures, Syncopy, Isobel...","United Kingdom, United States of America","English, Mandarin",Indo-European,English,English,en,eng,,both
3,19995,Avatar,7.573,29815,2009-12-15,2923706000.0,162,237000000.0,en,"In the 22nd century, a paraplegic Marine is di...",...,"Dune Entertainment, Lightstorm Entertainment, ...","United States of America, United Kingdom","English, Spanish",Indo-European,English,English,en,eng,,both
4,24428,The Avengers,7.71,29166,2012-04-25,1518816000.0,143,220000000.0,en,When an unexpected enemy emerges and threatens...,...,Marvel Studios,United States of America,"English, Hindi, Russian",Indo-European,English,English,en,eng,,both


In [54]:
# Check the merge indicator (an inner merge can only yield 'both', so compare row counts to spot dropped rows)
df_features['_merge'].value_counts()

_merge
both          490711
left_only          0
right_only         0
Name: count, dtype: int64
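
Because `merge` defaults to `how = 'inner'`, the `_merge` column above can only contain `both`. A left merge keeps unmatched rows and labels them `left_only`, which makes the indicator check informative. The sketch below uses small stand-in frames; `films` and `codes` are hypothetical, and `'xx'` is a made-up code with no match.

```python
import pandas as pd

# Stand-in frames: 'xx' has no counterpart in the language table
films = pd.DataFrame({'id': [1, 2, 3], 'original_language': ['en', 'fi', 'xx']})
codes = pd.DataFrame({'639-1': ['en', 'fi', 'nl'], 'name': ['English', 'Finnish', 'Dutch']})

# Default (inner) merge: the unmatched film silently disappears
inner = films.merge(codes, left_on = 'original_language', right_on = '639-1', indicator = True)

# Left merge: every film is kept and unmatched rows are flagged as 'left_only'
left = films.merge(codes, left_on = 'original_language', right_on = '639-1',
                   how = 'left', indicator = True)

print(len(films), len(inner), len(left))
print(left['_merge'].value_counts())
```

In this notebook the row count goes from 495770 (the length reported for `release_date` earlier) to 490711 after the merge, which suggests that some films had a language code with no match in `iso_languages`.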

In [55]:
# Drop the extra columns
df_features = df_features.drop(['original_language', 'family', 'nativeName', '639-1', '639-2', 
                                '639-2/B', '_merge'], axis = 1)

In [56]:
# Rename the new full language name column
df_features = df_features.rename({'name' : 'original_language'}, axis = 1)

In [57]:
# Check the column output 
df_features.columns

Index(['id', 'title', 'vote_average', 'vote_count', 'release_date', 'revenue',
       'runtime', 'budget', 'synopsis', 'genres', 'production_companies',
       'production_countries', 'spoken_languages', 'original_language'],
      dtype='object')

## 11. Checking for Mixed-Type Columns

In [58]:
# Check for mixed-type columns (flag any column holding values whose type differs from the first row's)
for col in df_features.columns.tolist():
    weird = (df_features[[col]].map(type) != df_features[[col]].iloc[0].apply(type)).any(axis = 1)
    if len(df_features[weird]) > 0:
        print(col)

title
release_date
synopsis
genres
production_companies
production_countries
spoken_languages


In [59]:
# Convert the mixed-type text columns to strings (missing values become the literal string 'nan')
df_features['title'] = df_features['title'].astype('str')
df_features['synopsis'] = df_features['synopsis'].astype('str')
df_features['genres'] = df_features['genres'].astype('str')
df_features['production_companies'] = df_features['production_companies'].astype('str')
df_features['production_countries'] = df_features['production_countries'].astype('str')
df_features['spoken_languages'] = df_features['spoken_languages'].astype('str')

In [60]:
# Convert 'release_date' to datetime (missing dates become NaT)
df_features['release_date'] = pd.to_datetime(df_features['release_date'])

In [61]:
# Re-check for mixed-type columns ('release_date' can still appear because missing dates are NaT, not Timestamp)
for col in df_features.columns.tolist():
    weird = (df_features[[col]].map(type) != df_features[[col]].iloc[0].apply(type)).any(axis = 1)
    if len(df_features[weird]) > 0:
        print(col)

release_date
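
`release_date` is still reported after the conversion, most likely because missing dates are stored as `NaT`, whose Python type differs from `Timestamp`. The sketch below illustrates this on stand-in data; `has_mixed_types` is a hypothetical helper, not part of the notebook.

```python
import pandas as pd

def has_mixed_types(series):
    """Return True if the series holds values of more than one Python type."""
    return len({type(value) for value in series}) > 1

# A datetime column with one missing value
dates = pd.to_datetime(pd.Series(['2010-07-15', None]))

print(dates.dtype)                           # one dtype for the whole column
print({type(value).__name__ for value in dates})
print(has_mixed_types(dates))
```

Since the column has a single datetime dtype throughout, this is a side effect of the type-based check rather than a genuine mixed-type problem.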


## 12. Creating New Columns for Genres

### Action

In [62]:
# Check the 'genres' column to see if the word 'Action' appears
action_list = []

for value in df_features['genres']:
    if 'Action' in value:
        action_list.append(1)
    else: action_list.append(0)

In [63]:
# Create a new column using the values from the previous for loop
df_features['action'] = action_list
df_features['action'].value_counts(dropna = False)

action
0    460570
1     30141
Name: count, dtype: int64

### Adventure

In [64]:
# Check the 'genres' column to see if the word 'Adventure' appears
adventure_list = []

for value in df_features['genres']:
    if 'Adventure' in value:
        adventure_list.append(1)
    else: adventure_list.append(0)

In [65]:
# Create a new column using the values from the previous for loop
df_features['adventure'] = adventure_list
df_features['adventure'].value_counts(dropna = False)

adventure
0    474938
1     15773
Name: count, dtype: int64

### Animation

In [66]:
# Check the 'genres' column to see if the word 'Animation' appears
animation_list = []

for value in df_features['genres']:
    if 'Animation' in value:
        animation_list.append(1)
    else: animation_list.append(0)

In [67]:
# Create a new column using the values from the previous for loop
df_features['animation'] = animation_list
df_features['animation'].value_counts(dropna = False)

animation
0    481448
1      9263
Name: count, dtype: int64

### Comedy

In [68]:
# Check the 'genres' column to see if the word 'Comedy' appears
comedy_list = []

for value in df_features['genres']:
    if 'Comedy' in value:
        comedy_list.append(1)
    else: comedy_list.append(0)

In [69]:
# Create a new column using the values from the previous for loop
df_features['comedy'] = comedy_list
df_features['comedy'].value_counts(dropna = False)

comedy
0    415842
1     74869
Name: count, dtype: int64

### Crime

In [70]:
# Check the 'genres' column to see if the word 'Crime' appears
crime_list = []

for value in df_features['genres']:
    if 'Crime' in value:
        crime_list.append(1)
    else: crime_list.append(0)

In [71]:
# Create a new column using the values from the previous for loop
df_features['crime'] = crime_list
df_features['crime'].value_counts(dropna = False)

crime
0    467424
1     23287
Name: count, dtype: int64

### Documentary

In [72]:
# Check the 'genres' column to see if the word 'Documentary' appears
documentary_list = []

for value in df_features['genres']:
    if 'Documentary' in value:
        documentary_list.append(1)
    else: documentary_list.append(0)

In [73]:
# Create a new column using the values from the previous for loop
df_features['documentary'] = documentary_list
df_features['documentary'].value_counts(dropna = False)

documentary
0    411189
1     79522
Name: count, dtype: int64

### Drama

In [74]:
# Check the 'genres' column to see if the word 'Drama' appears
drama_list = []

for value in df_features['genres']:
    if 'Drama' in value:
        drama_list.append(1)
    else: drama_list.append(0)

In [75]:
# Create a new column using the values from the previous for loop
df_features['drama'] = drama_list
df_features['drama'].value_counts(dropna = False)

drama
0    359414
1    131297
Name: count, dtype: int64

### Family

In [76]:
# Check the 'genres' column to see if the word 'Family' appears
family_list = []

for value in df_features['genres']:
    if 'Family' in value:
        family_list.append(1)
    else: family_list.append(0)

In [77]:
# Create a new column using the values from the previous for loop
df_features['family'] = family_list
df_features['family'].value_counts(dropna = False)

family
0    475517
1     15194
Name: count, dtype: int64

### Fantasy

In [78]:
# Check the 'genres' column to see if the word 'Fantasy' appears
fantasy_list = []

for value in df_features['genres']:
    if 'Fantasy' in value:
        fantasy_list.append(1)
    else: fantasy_list.append(0)

In [79]:
# Create a new column using the values from the previous for loop
df_features['fantasy'] = fantasy_list
df_features['fantasy'].value_counts(dropna = False)

fantasy
0    479445
1     11266
Name: count, dtype: int64

### History

In [80]:
# Check the 'genres' column to see if the word 'History' appears
history_list = []

for value in df_features['genres']:
    if 'History' in value:
        history_list.append(1)
    else: history_list.append(0)

In [81]:
# Create a new column using the values from the previous for loop
df_features['history'] = history_list
df_features['history'].value_counts(dropna = False)

history
0    479943
1     10768
Name: count, dtype: int64

### Horror

In [82]:
# Check the 'genres' column to see if the word 'Horror' appears
horror_list = []

for value in df_features['genres']:
    if 'Horror' in value:
        horror_list.append(1)
    else: horror_list.append(0)

In [83]:
# Create a new column using the values from the previous for loop
df_features['horror'] = horror_list
df_features['horror'].value_counts(dropna = False)

horror
0    462983
1     27728
Name: count, dtype: int64

### Music

In [84]:
# Check the 'genres' column to see if the word 'Music' appears
music_list = []

for value in df_features['genres']:
    if 'Music' in value:
        music_list.append(1)
    else: music_list.append(0)

In [85]:
# Create a new column using the values from the previous for loop
df_features['music'] = music_list
df_features['music'].value_counts(dropna = False)

music
0    464394
1     26317
Name: count, dtype: int64

### Mystery

In [86]:
# Check the 'genres' column to see if the word 'Mystery' appears
mystery_list = []

for value in df_features['genres']:
    if 'Mystery' in value:
        mystery_list.append(1)
    else: mystery_list.append(0)

In [87]:
# Create a new column using the values from the previous for loop
df_features['mystery'] = mystery_list
df_features['mystery'].value_counts(dropna = False)

mystery
0    478795
1     11916
Name: count, dtype: int64

### Romance

In [88]:
# Check the 'genres' column to see if the word 'Romance' appears
romance_list = []

for value in df_features['genres']:
    if 'Romance' in value:
        romance_list.append(1)
    else: romance_list.append(0)

In [89]:
# Create a new column using the values from the previous for loop
df_features['romance'] = romance_list
df_features['romance'].value_counts(dropna = False)

romance
0    453595
1     37116
Name: count, dtype: int64

### Science Fiction

In [90]:
# Check the 'genres' column to see if the word 'Science Fiction' appears
science_fiction_list = []

for value in df_features['genres']:
    if 'Science Fiction' in value:
        science_fiction_list.append(1)
    else: science_fiction_list.append(0)

In [91]:
# Create a new column using the values from the previous for loop
df_features['science_fiction'] = science_fiction_list
df_features['science_fiction'].value_counts(dropna = False)

science_fiction
0    479725
1     10986
Name: count, dtype: int64

### TV Movie

In [92]:
# Check the 'genres' column to see if the word 'TV Movie' appears
tv_movie_list = []

for value in df_features['genres']:
    if 'TV Movie' in value:
        tv_movie_list.append(1)
    else: tv_movie_list.append(0)

In [93]:
# Create a new column using the values from the previous for loop
df_features['tv_movie'] = tv_movie_list
df_features['tv_movie'].value_counts(dropna = False)

tv_movie
0    473006
1     17705
Name: count, dtype: int64

### Thriller

In [94]:
# Check the 'genres' column to see if the word 'Thriller' appears
thriller_list = []

for value in df_features['genres']:
    if 'Thriller' in value:
        thriller_list.append(1)
    else: thriller_list.append(0)

In [95]:
# Create a new column using the values from the previous for loop
df_features['thriller'] = thriller_list
df_features['thriller'].value_counts(dropna = False)

thriller
0    460494
1     30217
Name: count, dtype: int64

### War

In [96]:
# Check the 'genres' column to see if the word 'War' appears
war_list = []

for value in df_features['genres']:
    if 'War' in value:
        war_list.append(1)
    else: war_list.append(0)

In [97]:
# Create a new column using the values from the previous for loop
df_features['war'] = war_list
df_features['war'].value_counts(dropna = False)

war
0    483785
1      6926
Name: count, dtype: int64

### Western

In [98]:
# Check the 'genres' column to see if the word 'Western' appears
western_list = []

for value in df_features['genres']:
    if 'Western' in value:
        western_list.append(1)
    else: western_list.append(0)

In [99]:
# Create a new column using the values from the previous for loop
df_features['western'] = western_list
df_features['western'].value_counts(dropna = False)

western
0    484609
1      6102
Name: count, dtype: int64
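
The nineteen genre cells above all follow the same pattern, so they could be collapsed into a single loop. This is a minimal sketch on stand-in data, assuming the same substring test and the same lower-case, underscore-separated column names; `demo` is hypothetical and `genre_names` lists only a few genres for brevity.

```python
import pandas as pd

# Stand-in for the 'genres' column, including the 'nan' string left by astype('str')
demo = pd.DataFrame({'genres': ['Drama, Crime, Thriller',
                                'Action, Comedy, Drama, Western',
                                'Science Fiction',
                                'nan']})

# Shortened list for illustration; the full notebook covers nineteen genres
genre_names = ['Action', 'Drama', 'Science Fiction', 'Western']

for genre in genre_names:
    column = genre.lower().replace(' ', '_')
    # regex = False treats the genre name as a literal substring, like the `in` test above
    demo[column] = demo['genres'].str.contains(genre, regex = False).astype(int)

print(demo)
```

A loop like this also guarantees that every genre column is built with identical logic.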

## 13. Creating New Runtime Category Column

### Reasoning:
        These categories will be used to describe films in the analysis. The category cutoffs were sourced from
        IMDb.

In [100]:
# Create a runtime category with 4 levels based on IMDB runtime cutoffs 
df_features.loc[df_features['runtime'] < 100, 'runtime_category'] = 'Short Films'
df_features.loc[(df_features['runtime'] >= 100) & (df_features['runtime'] < 140), 'runtime_category'] = 'Mid-Length Films'
df_features.loc[(df_features['runtime'] >= 140) & (df_features['runtime'] < 180), 'runtime_category'] = 'Long Films'
df_features.loc[df_features['runtime'] >= 180, 'runtime_category'] = 'Super Long Films'

In [101]:
df_features['runtime_category'].value_counts()

runtime_category
Short Films         323183
Mid-Length Films    119501
Long Films           29827
Super Long Films     18200
Name: count, dtype: int64
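
The four `.loc` assignments above amount to binning `runtime` at 100, 140, and 180 minutes. As an alternative, `pd.cut` can express the same cutoffs in one call. This is a sketch on stand-in values (`demo` is hypothetical); note that it produces a categorical column rather than plain strings.

```python
import numpy as np
import pandas as pd

# Stand-in runtimes placed on and around each cutoff
demo = pd.DataFrame({'runtime': [93, 100, 139, 140, 179, 180]})

bins = [-np.inf, 100, 140, 180, np.inf]
labels = ['Short Films', 'Mid-Length Films', 'Long Films', 'Super Long Films']

# right = False makes each bin include its lower edge and exclude its upper edge,
# matching the >= / < comparisons used above
demo['runtime_category'] = pd.cut(demo['runtime'], bins = bins, labels = labels, right = False)

print(demo)
```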

## 14. Creating New Budget Category Column

### Reasoning:
        These categories will be used to describe films in the analysis. The category cutoffs were sourced from
        SAG-AFTRA and Studio Binder.

In [102]:
# Create a new column separating films into 4 budget categories
df_features.loc[df_features['budget'] <= 250000, 'budget_category'] = 'Micro-Budget'
df_features.loc[(df_features['budget'] > 250000) & (df_features['budget'] <= 5000000), 'budget_category'] = 'Low-Budget'
df_features.loc[(df_features['budget'] > 5000000) & (df_features['budget'] <= 50000000), 'budget_category'] = 'Mid-Budget'
df_features.loc[df_features['budget'] > 50000000, 'budget_category'] = 'High-Budget'

In [103]:
df_features['budget_category'].value_counts()

budget_category
Micro-Budget    468492
Low-Budget       14447
Mid-Budget        6394
High-Budget       1378
Name: count, dtype: int64
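
The budget cutoffs above are inclusive at the upper edge (`<=`). One way to express such ordered thresholds compactly is `np.select`, which returns the choice for the first condition that holds. This is a sketch on stand-in values (`demo` is hypothetical). Unlike the `.loc` version, a missing budget would fall through to the default label here, so it assumes the column has no missing values.

```python
import numpy as np
import pandas as pd

# Stand-in budgets placed on and around each cutoff
demo = pd.DataFrame({'budget': [0, 250000, 250001, 5000000, 50000000, 50000001]})

# Conditions are checked in order, so each one only needs its upper bound
conditions = [demo['budget'] <= 250000,
              demo['budget'] <= 5000000,
              demo['budget'] <= 50000000]
choices = ['Micro-Budget', 'Low-Budget', 'Mid-Budget']

# Rows matching none of the conditions receive the default label
demo['budget_category'] = np.select(conditions, choices, default = 'High-Budget')

print(demo)
```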

## 15. Data Profile

In [104]:
df_features.shape

(490711, 35)

In [105]:
# Create a data profile
profile = ProfileReport(df_features, title = "TMDB Data Profile")

In [106]:
# Save the profile as an html file
profile.to_file("movie_report.html")

(using `df.profile_report(correlations={"auto": {"calculate": False}})`
If this is problematic for your use case, please report this as an issue:
https://github.com/ydataai/ydata-profiling/issues
(include the error message: 'could not convert string to float: 'High-Budget'')


## 16. Exporting Data

In [108]:
# Export df_features to the Prepared Data folder
df_features.to_csv(os.path.join(path, '02. Data', 'Prepared Data', 'clean_movies.csv'))
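
By default, `to_csv` also writes the dataframe index, which shows up as an `Unnamed: 0` column when the file is read back. Whether that matters depends on how the file is consumed later. The sketch below shows the difference on a stand-in frame, using in-memory buffers instead of files; `demo` is hypothetical.

```python
import io
import pandas as pd

demo = pd.DataFrame({'id': [27205, 155], 'title': ['Inception', 'The Dark Knight']})

# Default export: the index is written as an unnamed first column
with_index = io.StringIO()
demo.to_csv(with_index)

# index = False leaves the index out of the file
without_index = io.StringIO()
demo.to_csv(without_index, index = False)

# Rewind the buffers and read them back to compare the resulting columns
with_index.seek(0)
without_index.seek(0)
cols_with = pd.read_csv(with_index).columns.tolist()
cols_without = pd.read_csv(without_index).columns.tolist()

print(cols_with)
print(cols_without)
```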