# Movies!

**Because who doesn't love a good movie?**

We've all been there: it's another Friday night, and the COVID pandemic has limited our options for evening entertainment. Go to a party at a friend's house? Can't -- that'd be irresponsible and potentially dangerous. Check out a night club? Nope (let's be real, that was an awful option even before the pandemic hit).

And so, we're resorting once again to -- movie night! Make some stove top popcorn and plop down for 90 - 180 minutes of unbridled cinematic joy in the comfort of your own home (and pajamas).

But how can we make the best choice when there are so many options? Nothing is worse than watching the ending credits flash on the screen and thinking (or exclaiming aloud, in extreme instances), "MAN, that sucked!" Here to help is a **movie recommender engine** using data obtained from <https://github.com/rashida048/Datasets/blob/master/movie_dataset.csv>.

## Part I: Exploration

Let's read in the `dataset` and do some preliminary analysis.

In [1]:
# Make necessary imports
import pandas as pd
import numpy as np

In [2]:
# Read in the CSV file and check the head
movies = pd.read_csv('../datasets/movie_dataset.csv')
movies.head()

Unnamed: 0,index,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,...,runtime,spoken_languages,status,tagline,title,vote_average,vote_count,cast,crew,director
0,0,237000000,Action Adventure Fantasy Science Fiction,http://www.avatarmovie.com/,19995,culture clash future space war space colony so...,en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,...,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,Avatar,7.2,11800,Sam Worthington Zoe Saldana Sigourney Weaver S...,"[{'name': 'Stephen E. Rivkin', 'gender': 0, 'd...",James Cameron
1,1,300000000,Adventure Fantasy Action,http://disney.go.com/disneypictures/pirates/,285,ocean drug abuse exotic island east india trad...,en,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...",139.082615,...,169.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"At the end of the world, the adventure begins.",Pirates of the Caribbean: At World's End,6.9,4500,Johnny Depp Orlando Bloom Keira Knightley Stel...,"[{'name': 'Dariusz Wolski', 'gender': 2, 'depa...",Gore Verbinski
2,2,245000000,Action Adventure Crime,http://www.sonypictures.com/movies/spectre/,206647,spy based on novel secret agent sequel mi6,en,Spectre,A cryptic message from Bond’s past sends him o...,107.376788,...,148.0,"[{""iso_639_1"": ""fr"", ""name"": ""Fran\u00e7ais""},...",Released,A Plan No One Escapes,Spectre,6.3,4466,Daniel Craig Christoph Waltz L\u00e9a Seydoux ...,"[{'name': 'Thomas Newman', 'gender': 2, 'depar...",Sam Mendes
3,3,250000000,Action Crime Drama Thriller,http://www.thedarkknightrises.com/,49026,dc comics crime fighter terrorist secret ident...,en,The Dark Knight Rises,Following the death of District Attorney Harve...,112.31295,...,165.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,The Legend Ends,The Dark Knight Rises,7.6,9106,Christian Bale Michael Caine Gary Oldman Anne ...,"[{'name': 'Hans Zimmer', 'gender': 2, 'departm...",Christopher Nolan
4,4,260000000,Action Adventure Science Fiction,http://movies.disney.com/john-carter,49529,based on novel mars medallion space travel pri...,en,John Carter,"John Carter is a war-weary, former military ca...",43.926995,...,132.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"Lost in our world, found in another.",John Carter,6.1,2124,Taylor Kitsch Lynn Collins Samantha Morton Wil...,"[{'name': 'Andrew Stanton', 'gender': 2, 'depa...",Andrew Stanton


In [3]:
# How many rows and columns does our dataset have?
movies.shape

(4803, 24)

In [4]:
# What are the names of the columns in our dataset?
movies.columns

Index(['index', 'budget', 'genres', 'homepage', 'id', 'keywords',
       'original_language', 'original_title', 'overview', 'popularity',
       'production_companies', 'production_countries', 'release_date',
       'revenue', 'runtime', 'spoken_languages', 'status', 'tagline', 'title',
       'vote_average', 'vote_count', 'cast', 'crew', 'director'],
      dtype='object')

In [5]:
# How many nulls are in our dataset?
movies.isnull().sum().sort_values(ascending = False)

homepage                3091
tagline                  844
keywords                 412
cast                      43
director                  30
genres                    28
overview                   3
runtime                    2
release_date               1
popularity                 0
budget                     0
id                         0
original_language          0
original_title             0
production_countries       0
production_companies       0
crew                       0
revenue                    0
spoken_languages           0
status                     0
title                      0
vote_average               0
vote_count                 0
index                      0
dtype: int64

In [6]:
# What datatype is each column?
movies.dtypes

index                     int64
budget                    int64
genres                   object
homepage                 object
id                        int64
keywords                 object
original_language        object
original_title           object
overview                 object
popularity              float64
production_companies     object
production_countries     object
release_date             object
revenue                   int64
runtime                 float64
spoken_languages         object
status                   object
tagline                  object
title                    object
vote_average            float64
vote_count                int64
cast                     object
crew                     object
director                 object
dtype: object

**Let's look at each column one by one to see what the data looks like**

## Part II: Step-by-Step Examination and Cleaning

### **Index**

In [7]:
# Preview column index
movies[['index']]

Unnamed: 0,index
0,0
1,1
2,2
3,3
4,4
...,...
4798,4798
4799,4799
4800,4800
4801,4801


This data simply matches the already-existing index data of each row. This column can and should be `dropped`.

In [8]:
# Drop this useless column from the dataframe
movies.drop(columns = 'index', inplace = True)

### **Budget**

In [9]:
# Preview column budget
movies[['budget']]

Unnamed: 0,budget
0,237000000
1,300000000
2,245000000
3,250000000
4,260000000
...,...
4798,220000
4799,9000
4800,0
4801,0


The data in this column is difficult to read due to the lack of `,` separators. Let's also assume the amounts are in US dollars.

Some of these values are for `$0` -- are these movies created by studios in a kind of pro bono fashion, or do the `$0` values represent missing data?

In [10]:
movies[movies['budget'] == 0].shape

(1037, 23)

As it turns out, `1037` of the values in the `budget` column are `$0`. 

Neither scenario is ideal. One idea is to drop this column entirely from the dataset, but I think a film's `budget` would be a good prediction factor to include in the recommendation engine. For example, some people might prefer blockbuster movies with blockbuster budgets, while other people prefer indie movies with modest budgets.

There are several tactics for tackling this problem. For this particular column, I'm going to fill in all of the missing values with the `mean` value of all the budget amounts in the dataset that are not equal to zero.

In [11]:
"${:,.2f}".format(round(sum(movies[movies['budget'] != 0]['budget']) / len(movies[movies['budget'] != 0]['budget']), 
                 2))

'$37,042,837.63'

The average non-zero budget for a movie in this dataset is `$37,042,837.63`. We'll now use this value to replace all the `$0` values in the `budget` column.

In [12]:
# Assign average budget to a variable
mean_budget = sum(movies[movies['budget'] != 0]['budget']) / len(movies[movies['budget'] != 0]['budget'])

# Define a function to take in a dataframe
def replace_0_budget(df):
    
    # Create an empty list that will be the new column
    new_budget_column = []
    
    # Begin for loop
    for i in df['budget']:
        
        # If budget doesn't equal 0
        if i != 0:
            
            # Append existing value to empty list
            new_budget_column.append(i)
        
        # If budget is equal to zero
        else:
            
            # Append average budget to new column instead of 0
            new_budget_column.append(mean_budget)
    
    # Replace old column with new column
    df['budget'] = new_budget_column        
    
    # Return dataframe
    return df

# Call function on the movies dataframe
replace_0_budget(movies)

Unnamed: 0,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,production_companies,...,runtime,spoken_languages,status,tagline,title,vote_average,vote_count,cast,crew,director
0,2.370000e+08,Action Adventure Fantasy Science Fiction,http://www.avatarmovie.com/,19995,culture clash future space war space colony so...,en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,"[{""name"": ""Ingenious Film Partners"", ""id"": 289...",...,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,Avatar,7.2,11800,Sam Worthington Zoe Saldana Sigourney Weaver S...,"[{'name': 'Stephen E. Rivkin', 'gender': 0, 'd...",James Cameron
1,3.000000e+08,Adventure Fantasy Action,http://disney.go.com/disneypictures/pirates/,285,ocean drug abuse exotic island east india trad...,en,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...",139.082615,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}, {""...",...,169.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"At the end of the world, the adventure begins.",Pirates of the Caribbean: At World's End,6.9,4500,Johnny Depp Orlando Bloom Keira Knightley Stel...,"[{'name': 'Dariusz Wolski', 'gender': 2, 'depa...",Gore Verbinski
2,2.450000e+08,Action Adventure Crime,http://www.sonypictures.com/movies/spectre/,206647,spy based on novel secret agent sequel mi6,en,Spectre,A cryptic message from Bond’s past sends him o...,107.376788,"[{""name"": ""Columbia Pictures"", ""id"": 5}, {""nam...",...,148.0,"[{""iso_639_1"": ""fr"", ""name"": ""Fran\u00e7ais""},...",Released,A Plan No One Escapes,Spectre,6.3,4466,Daniel Craig Christoph Waltz L\u00e9a Seydoux ...,"[{'name': 'Thomas Newman', 'gender': 2, 'depar...",Sam Mendes
3,2.500000e+08,Action Crime Drama Thriller,http://www.thedarkknightrises.com/,49026,dc comics crime fighter terrorist secret ident...,en,The Dark Knight Rises,Following the death of District Attorney Harve...,112.312950,"[{""name"": ""Legendary Pictures"", ""id"": 923}, {""...",...,165.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,The Legend Ends,The Dark Knight Rises,7.6,9106,Christian Bale Michael Caine Gary Oldman Anne ...,"[{'name': 'Hans Zimmer', 'gender': 2, 'departm...",Christopher Nolan
4,2.600000e+08,Action Adventure Science Fiction,http://movies.disney.com/john-carter,49529,based on novel mars medallion space travel pri...,en,John Carter,"John Carter is a war-weary, former military ca...",43.926995,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}]",...,132.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"Lost in our world, found in another.",John Carter,6.1,2124,Taylor Kitsch Lynn Collins Samantha Morton Wil...,"[{'name': 'Andrew Stanton', 'gender': 2, 'depa...",Andrew Stanton
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4798,2.200000e+05,Action Crime Thriller,,9367,united states\u2013mexico barrier legs arms pa...,es,El Mariachi,El Mariachi just wants to play his guitar and ...,14.269792,"[{""name"": ""Columbia Pictures"", ""id"": 5}]",...,81.0,"[{""iso_639_1"": ""es"", ""name"": ""Espa\u00f1ol""}]",Released,"He didn't come looking for trouble, but troubl...",El Mariachi,6.6,238,Carlos Gallardo Jaime de Hoyos Peter Marquardt...,"[{'name': 'Robert Rodriguez', 'gender': 0, 'de...",Robert Rodriguez
4799,9.000000e+03,Comedy Romance,,72766,,en,Newlyweds,A newlywed couple's honeymoon is upended by th...,0.642552,[],...,85.0,[],Released,A newlywed couple's honeymoon is upended by th...,Newlyweds,5.9,5,Edward Burns Kerry Bish\u00e9 Marsha Dietlein ...,"[{'name': 'Edward Burns', 'gender': 2, 'depart...",Edward Burns
4800,3.704284e+07,Comedy Drama Romance TV Movie,http://www.hallmarkchannel.com/signedsealeddel...,231617,date love at first sight narration investigati...,en,"Signed, Sealed, Delivered","""Signed, Sealed, Delivered"" introduces a dedic...",1.444476,"[{""name"": ""Front Street Pictures"", ""id"": 3958}...",...,120.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,,"Signed, Sealed, Delivered",7.0,6,Eric Mabius Kristin Booth Crystal Lowe Geoff G...,"[{'name': 'Carla Hetland', 'gender': 0, 'depar...",Scott Smith
4801,3.704284e+07,,http://shanghaicalling.com/,126186,,en,Shanghai Calling,When ambitious New York attorney Sam is sent t...,0.857008,[],...,98.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,A New Yorker in Shanghai,Shanghai Calling,5.7,7,Daniel Henney Eliza Coupe Bill Paxton Alan Ruc...,"[{'name': 'Daniel Hsia', 'gender': 2, 'departm...",Daniel Hsia


In [13]:
movies[movies['budget'] == 0].shape

(0, 23)

Huzzah!! Every movie with a `$0` budget now has its value set to the average of the non-zero budgets in the dataset.
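
The same replacement can also be sketched without an explicit loop. This is only an illustration on a made-up `toy` dataframe (the values and the name `nonzero_mean` are hypothetical), using `Series.where` to keep non-zero budgets and substitute the mean everywhere else:

```python
import pandas as pd

# Toy stand-in for the `budget` column (values are made up for illustration)
toy = pd.DataFrame({'budget': [100, 0, 300, 0]})

# Mean of the non-zero budgets only
nonzero_mean = toy.loc[toy['budget'] != 0, 'budget'].mean()

# Keep non-zero budgets; substitute the mean wherever the budget is 0
toy['budget'] = toy['budget'].where(toy['budget'] != 0, nonzero_mean)

print(toy['budget'].tolist())  # [100.0, 200.0, 300.0, 200.0]
```

As with the loop version, the column ends up as floats because the mean is a float.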

### **Genres**

In [14]:
# Preview column genres
movies[['genres']]

Unnamed: 0,genres
0,Action Adventure Fantasy Science Fiction
1,Adventure Fantasy Action
2,Action Adventure Crime
3,Action Crime Drama Thriller
4,Action Adventure Science Fiction
...,...
4798,Action Crime Thriller
4799,Comedy Romance
4800,Comedy Drama Romance TV Movie
4801,


It looks like this column lists every genre that applies to each movie, with no separator (e.g., `,`) between the genres other than a space. Let's see what the longest `genres` cell in this dataset looks like.

In [15]:
# Sort movies by longest genre
movies.genres.str.len().sort_values(ascending = False)

1617    51.0
3208    50.0
305     50.0
1277    50.0
168     50.0
        ... 
4674     NaN
4681     NaN
4714     NaN
4716     NaN
4801     NaN
Name: genres, Length: 4803, dtype: float64

In [16]:
movies.iloc[1617]['genres']

'Action Adventure Animation Science Fiction Thriller'

This movie looks like the greatest one on the list! Who wouldn't want to watch an `Action/Adventure/Animation/Science Fiction/Thriller`?

Glancing at the `NaNs` in this dataset, we can see that there are `28` blank cells in the `genres` column. Let's clear those out now.

In [17]:
# Drop rows in which the 'genre' values are NaNs.
movies = movies[movies['genres'].notna()]

# Check work
movies[movies['genres'].isna()]

Unnamed: 0,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,production_companies,...,runtime,spoken_languages,status,tagline,title,vote_average,vote_count,cast,crew,director


**How many unique genres are there in the dataset?**

In [18]:
# Create an empty list called 'genres'
genres = []

# Create a list of indices as long as the column 'genres' in the dataframe
for i in range(0, len(movies['genres'])):
    
    # Isolate each cell one by one and create a list of genres in the cell
    cell = movies.iloc[i]['genres'].split(' ')
    
    # Loop through each list to look for unique genres
    for genre in cell:
        
        # Add unique genres to the genre list
        if genre not in genres:
            genres.append(genre)
            
# Take a look at the genres list and how many unique genres are in it
print(f'There are {len(genres)} unique genres in the dataset, and they are:')
print(genres)

There are 22 unique genres in the dataset, and they are:
['Action', 'Adventure', 'Fantasy', 'Science', 'Fiction', 'Crime', 'Drama', 'Thriller', 'Animation', 'Family', 'Western', 'Comedy', 'Romance', 'Horror', 'Mystery', 'History', 'War', 'Music', 'Documentary', 'Foreign', 'TV', 'Movie']


Looking above, it looks like `Science Fiction` has been split up into `Science` and `Fiction`. This might be ignorable if this had happened to `Romantic Comedy`, but we can't ignore the 2nd best movie `genre` out there (the first being `Horror`, of course). 

We can also see that there are 2 genres, `TV` and `Movie`, included in this dataset. This is likely a split of the genre `TV Movie`. If the genre `TV Movie` were shortened to just `TV`, no context would be lost. That said, let's drop `Movie` from the final `genres` list.

Let's rewrite the loop above to fix this. This will be necessary for binarizing this column later.

In [19]:
# Create an empty list called 'genres'
genres = []

# Create a list of indices as long as the column 'genres' in the dataframe
for i in range(0, len(movies['genres'])):
    
    # Isolate each cell one by one and create a list of genres in the cell
    cell = movies.iloc[i]['genres'].split(' ')
    
    # Look for both 'Science' and 'Fiction' in each cell
    if 'Science' in cell and 'Fiction' in cell:
        
        # Drop 'Science'
        cell.remove('Science')
        
        # Drop 'Fiction'
        cell.remove('Fiction')
        
        #Add 'Science Fiction'
        cell.append('Science_Fiction')
    
    # Loop through each list to look for unique genres
    for genre in cell:
        
        # Skip over the genre 'Movie'
        if genre == 'Movie':
            pass

        # Add unique genres to the genre list
        elif genre not in genres:
            genres.append(genre)
            
# Take a look at the genres list and how many unique genres are in it
print(f'There are {len(genres)} unique genres in the dataset, and they are:')
print(genres)

There are 20 unique genres in the dataset, and they are:
['Action', 'Adventure', 'Fantasy', 'Science_Fiction', 'Crime', 'Drama', 'Thriller', 'Animation', 'Family', 'Western', 'Comedy', 'Romance', 'Horror', 'Mystery', 'History', 'War', 'Music', 'Documentary', 'Foreign', 'TV']
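
One Python detail is worth keeping in mind when checking that two tokens are both present in a cell: each membership test has to be spelled out. A small, self-contained illustration (the `cell` list here is hypothetical, not a row from the dataset):

```python
# A cell that contains 'Fiction' but not 'Science' (hypothetical example)
cell = ['Fiction', 'Drama']

# `'Science' and 'Fiction' in cell` parses as `'Science' and ('Fiction' in cell)`;
# a non-empty string is always truthy, so only the second test matters
loose_check = bool('Science' and 'Fiction' in cell)

# Writing out both membership tests checks for each token
strict_check = 'Science' in cell and 'Fiction' in cell

print(loose_check, strict_check)  # True False
```

With the loose form, a later `cell.remove('Science')` would raise a `ValueError` on a cell like this one.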


**Now, let's binarize the genres and drop the genres column from the dataset**

In [20]:
def binarize_genres(df):
    
    # Create a new dataframe with a column for each unique genre in the list above
    df2 = pd.DataFrame(columns = [genre for genre in genres], index = df.index, data = 0)
        
    # Create a list of indices as long as the column 'genres' in the dataframe
    for i in range(0, len(df['genres'])):
    
        # Isolate each cell one by one and create a list of genres in the cell
        cell = df.iloc[i]['genres'].split(' ')
    
        # Look for both 'Science' and 'Fiction' in each cell
        if 'Science' in cell and 'Fiction' in cell:
        
            # Drop 'Science'
            cell.remove('Science')
        
            # Drop 'Fiction'
            cell.remove('Fiction')
        
            #Add 'Science Fiction'
            cell.append('Science_Fiction')
    
        # Loop through the cell for each genre
        for genre in cell:
        
            # Loop through each column of df2
            for column in df2.columns:
                
                # Skip over the genre 'Movie'
                # This step is redundant, as we've already dropped 'Movie' from the 'genres' list, which
                # was used to create all the columns in df2. This has been included for continuity's sake.
                if genre == 'Movie':
                    pass
                
                # Look for matching genres
                elif genre == column:
                    
                    # Mark the genre as a 1 (positional assignment avoids chained indexing)
                    df2.iloc[i, df2.columns.get_loc(column)] += 1
    
    # Create a list variable called frames for our 2 dataframes
    frames = [df, df2]
    
    # Concatenate the two dataframes
    final_df = pd.concat(frames, axis = 1, sort = False)
    
    return final_df

In [21]:
# Call the function and reassign the result to movies
movies = binarize_genres(movies)

# Drop the genres column now that we've binarized its contents
movies.drop(columns = 'genres', inplace = True)
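
For comparison, here is a shorter sketch of the same binarizing idea using pandas string methods. It is not the approach used above, just an alternative under the assumption that `Science Fiction` and `TV Movie` are the only multi-word genres; the `toy` dataframe and the names `cleaned` and `genre_flags` are made up for illustration:

```python
import pandas as pd

# Toy stand-in for the space-separated `genres` column
toy = pd.DataFrame({'genres': ['Action Science Fiction', 'Comedy TV Movie', 'Drama']})

# Protect or shorten multi-word genres before splitting on spaces
cleaned = (
    toy['genres']
    .str.replace('Science Fiction', 'Science_Fiction', regex=False)
    .str.replace('TV Movie', 'TV', regex=False)
)

# Build one 0/1 column per genre
genre_flags = cleaned.str.get_dummies(sep=' ')

# Attach the flags and drop the original text column
toy = pd.concat([toy.drop(columns='genres'), genre_flags], axis=1)
print(toy)
```

`str.get_dummies` returns its columns in alphabetical order, so the column order differs from the order produced by the loop.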

### **Homepage**

In [22]:
movies[['homepage']]

Unnamed: 0,homepage
0,http://www.avatarmovie.com/
1,http://disney.go.com/disneypictures/pirates/
2,http://www.sonypictures.com/movies/spectre/
3,http://www.thedarkknightrises.com/
4,http://movies.disney.com/john-carter
...,...
4797,
4798,
4799,
4800,http://www.hallmarkchannel.com/signedsealeddel...


This column isn't important for a recommender engine and can be dropped.

In [23]:
# Drop column 'homepage' from dataset
movies.drop(columns = 'homepage', inplace = True)

### **ID**

In [24]:
movies[['id']]

Unnamed: 0,id
0,19995
1,285
2,206647
3,49026
4,49529
...,...
4797,67238
4798,9367
4799,72766
4800,231617


This column isn't important for a recommender engine and can be dropped.

In [25]:
# Drop column 'id' from dataset
movies.drop(columns = 'id', inplace = True)

### **Keywords**

In [26]:
movies[['keywords']]

Unnamed: 0,keywords
0,culture clash future space war space colony so...
1,ocean drug abuse exotic island east india trad...
2,spy based on novel secret agent sequel mi6
3,dc comics crime fighter terrorist secret ident...
4,based on novel mars medallion space travel pri...
...,...
4797,
4798,united states\u2013mexico barrier legs arms pa...
4799,
4800,date love at first sight narration investigati...


This looks like a tricky column. Not only are there 412 `NaNs` in this column, but the keywords here look pretty subjective. Who decided on these keywords? The Academy at large? Probably not. Due to its subjectivity and missing data, this column looks droppable from the final dataset.

In [27]:
# Drop column 'keywords' from dataset
movies.drop(columns = 'keywords', inplace = True)

### **Original Language**

In [28]:
movies[['original_language']]

Unnamed: 0,original_language
0,en
1,en
2,en
3,en
4,en
...,...
4797,en
4798,es
4799,en
4800,en


In [29]:
# Check unique values of this column
movies.original_language.unique()

array(['en', 'ja', 'fr', 'zh', 'es', 'de', 'hi', 'ru', 'ko', 'te', 'cn',
       'it', 'nl', 'ta', 'sv', 'th', 'da', 'xx', 'hu', 'cs', 'pt', 'is',
       'tr', 'nb', 'af', 'pl', 'he', 'ar', 'vi', 'ky', 'id', 'ro', 'fa',
       'no', 'sl', 'ps', 'el'], dtype=object)

Most of these look like two-letter ISO 639-1 language codes (the `spoken_languages` column labels its codes with an `iso_639_1` key), but a few, such as `xx` and `cn`, aren't standard codes, and the page this dataset was pulled from doesn't provide a key. Should we drop this column from the dataset, or try to find a key online? Probably not the latter -- a key created by someone else might abbreviate languages differently, which could corrupt the accuracy of this dataset.

In [30]:
movies.original_language.describe()

count     4775
unique      37
top         en
freq      4477
Name: original_language, dtype: object

It looks like 298 of the 4,775 movies remaining in this dataset aren't in English. In the interest of time, simplicity, and accuracy, it may be worth changing this project to a movie recommendation engine in which all movies are in English.
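
The 298 figure is the `count` minus the `freq` from the `describe()` output (4775 - 4477). The same kind of count can be read off directly; a minimal sketch on a made-up `toy` column (the values are hypothetical):

```python
import pandas as pd

# Toy stand-in for the `original_language` column
toy = pd.DataFrame({'original_language': ['en', 'en', 'fr', 'ja', 'en']})

# Number of movies per language code
counts = toy['original_language'].value_counts()

# Number of movies whose original language is not English
non_english = (toy['original_language'] != 'en').sum()

print(counts)
print(f'{non_english} of {len(toy)} movies are not in English')
```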

In [31]:
# Look at the dataset for all non-English movies
movies[movies['original_language'] != 'en']

Unnamed: 0,budget,original_language,original_title,overview,popularity,production_companies,production_countries,release_date,revenue,runtime,...,Comedy,Romance,Horror,Mystery,History,War,Music,Documentary,Foreign,TV
97,1.500000e+07,ja,シン・ゴジラ,From the mind behind Evangelion comes a hit la...,9.476999,"[{""name"": ""Cine Bazar"", ""id"": 5896}, {""name"": ...","[{""iso_3166_1"": ""JP"", ""name"": ""Japan""}]",2016-07-29,77000000,120.0,...,0,0,1,0,0,0,0,0,0,0
235,9.725040e+07,fr,Astérix aux Jeux Olympiques,Astérix and Obélix have to win the Olympic Gam...,20.344364,"[{""name"": ""Constantin Film"", ""id"": 47}, {""name...","[{""iso_3166_1"": ""BE"", ""name"": ""Belgium""}, {""is...",2008-01-13,132900000,116.0,...,1,0,0,0,0,0,0,0,0,0
317,9.400000e+07,zh,金陵十三釵,A Westerner finds refuge with a group of women...,12.516546,"[{""name"": ""Beijing New Picture Film Co. Ltd."",...","[{""iso_3166_1"": ""CN"", ""name"": ""China""}, {""iso_...",2011-12-15,95311434,145.0,...,0,0,0,0,1,1,0,0,0,0
474,3.704284e+07,fr,Évolution,11-year-old Nicolas lives with his mother in a...,3.300061,"[{""name"": ""Ex Nihilo"", ""id"": 3307}, {""name"": ""...","[{""iso_3166_1"": ""BE"", ""name"": ""Belgium""}, {""is...",2015-09-14,0,81.0,...,0,0,1,1,0,0,0,0,0,0
492,8.000000e+06,es,Don Gato: El inicio de la pandilla,Top Cat has arrived to charm his way into your...,0.719996,"[{""name"": ""Anima Estudios"", ""id"": 9965}, {""nam...","[{""iso_3166_1"": ""IN"", ""name"": ""India""}, {""iso_...",2015-10-30,0,89.0,...,1,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4739,3.704284e+07,fr,"I Love You, Don't Touch Me!","The story of a 25 year old virgin girl, lookin...",0.020839,[],"[{""iso_3166_1"": ""US"", ""name"": ""United States o...",1997-01-21,0,86.0,...,1,1,0,0,0,0,0,0,0,0
4751,3.704284e+07,pt,"Gabriela, Cravo e Canela","In 1925, Gabriela becomes cook, mistress, and ...",0.557602,"[{""name"": ""United Artists"", ""id"": 60}, {""name""...","[{""iso_3166_1"": ""BR"", ""name"": ""Brazil""}]",1983-03-24,0,99.0,...,0,1,0,0,0,0,0,0,0,0
4790,3.704284e+07,fa,دایره,Various women struggle to function in the oppr...,1.193779,"[{""name"": ""Jafar Panahi Film Productions"", ""id...","[{""iso_3166_1"": ""IR"", ""name"": ""Iran""}]",2000-09-08,0,90.0,...,0,0,0,0,0,0,0,0,1,0
4792,2.000000e+04,ja,キュア,A wave of gruesome murders is sweeping Tokyo. ...,0.212443,"[{""name"": ""Daiei Studios"", ""id"": 881}]","[{""iso_3166_1"": ""JP"", ""name"": ""Japan""}]",1997-11-06,99000,111.0,...,0,0,1,1,0,0,0,0,0,0


After exploring this list of non-English movies further, my decision to drop these from the dataset is validated. Although the `overview` column for these movies is in English, what good is a recommendation for a film called "دایره" to someone who doesn't speak the film's language? I searched for "دایره" on Google to see if I could find the movie described in this dataset, but my search turned up no results.

Let's drop these movies from our dataset.

In [32]:
# Drop all non-English movies
movies = movies[movies['original_language'] == 'en']
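
A side note on this filtering pattern: a dataframe produced by boolean filtering can trigger pandas' `SettingWithCopyWarning` when it is later modified in place. One way to sidestep that, sketched on a made-up `toy` dataframe (the names `toy` and `english` are hypothetical), is to take an explicit copy and reassign the result of `drop`:

```python
import pandas as pd

# Toy stand-in for the movies dataframe
toy = pd.DataFrame({'original_language': ['en', 'fr', 'en'], 'title': ['A', 'B', 'C']})

# Take an explicit copy of the filtered rows so later edits act on an independent frame
english = toy[toy['original_language'] == 'en'].copy()

# Reassign the result of drop() rather than using inplace=True
english = english.drop(columns='original_language')

print(english)
```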

In [113]:
# Now that all movies are in English, we can drop this column from our dataframe
movies = movies.drop(columns = 'original_language')


### **Original Title**

In [33]:
movies[['original_title']]

Unnamed: 0,original_title
0,Avatar
1,Pirates of the Caribbean: At World's End
2,Spectre
3,The Dark Knight Rises
4,John Carter
...,...
4796,Primer
4797,Cavite
4799,Newlyweds
4800,"Signed, Sealed, Delivered"


Nothing to see here. Using any `Natural Language Processing` or `Binarizing` here would be a mistake, unless people prefer to make watching decisions based on a movie's title and title alone.

What we can do, however, is set this column to be the dataframe's `index` column.

In [34]:
#Set index
movies.set_index('original_title', inplace = True)

# Check work
movies.head()

original_title,budget,original_language,overview,popularity,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,...,Comedy,Romance,Horror,Mystery,History,War,Music,Documentary,Foreign,TV
Avatar,237000000.0,en,"In the 22nd century, a paraplegic Marine is di...",150.437577,"[{""name"": ""Ingenious Film Partners"", ""id"": 289...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2009-12-10,2787965087,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",...,0,0,0,0,0,0,0,0,0,0
Pirates of the Caribbean: At World's End,300000000.0,en,"Captain Barbossa, long believed to be dead, ha...",139.082615,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}, {""...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2007-05-19,961000000,169.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",...,0,0,0,0,0,0,0,0,0,0
Spectre,245000000.0,en,A cryptic message from Bond’s past sends him o...,107.376788,"[{""name"": ""Columbia Pictures"", ""id"": 5}, {""nam...","[{""iso_3166_1"": ""GB"", ""name"": ""United Kingdom""...",2015-10-26,880674609,148.0,"[{""iso_639_1"": ""fr"", ""name"": ""Fran\u00e7ais""},...",...,0,0,0,0,0,0,0,0,0,0
The Dark Knight Rises,250000000.0,en,Following the death of District Attorney Harve...,112.31295,"[{""name"": ""Legendary Pictures"", ""id"": 923}, {""...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2012-07-16,1084939099,165.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",...,0,0,0,0,0,0,0,0,0,0
John Carter,260000000.0,en,"John Carter is a war-weary, former military ca...",43.926995,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}]","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2012-03-07,284139100,132.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",...,0,0,0,0,0,0,0,0,0,0
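
One thing to keep in mind when using titles as the index is that titles aren't guaranteed to be unique, and label-based lookups or drops act on every row that shares a label. A quick check could look like the sketch below; the `toy` dataframe and its values are made up, and whether the real dataset contains repeated titles isn't verified here:

```python
import pandas as pd

# Toy stand-in with a repeated title
toy = pd.DataFrame({'original_title': ['Alpha', 'Beta', 'Alpha'], 'runtime': [90, 100, 110]})

# Count titles that repeat an earlier title
duplicate_count = toy['original_title'].duplicated().sum()

# Use the title as the index, then confirm whether the index is unique
toy = toy.set_index('original_title')
print(duplicate_count, toy.index.is_unique)

# A label lookup returns every row that shares the title
print(toy.loc['Alpha'])
```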


### **Overview**

In [35]:
movies[['overview']]

original_title,overview
Avatar,"In the 22nd century, a paraplegic Marine is di..."
Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha..."
Spectre,A cryptic message from Bond’s past sends him o...
The Dark Knight Rises,Following the death of District Attorney Harve...
John Carter,"John Carter is a war-weary, former military ca..."
...,...
Primer,Friends/fledgling entrepreneurs invent a devic...
Cavite,"Adam, a security guard, travels from Californi..."
Newlyweds,A newlywed couple's honeymoon is upended by th...
"Signed, Sealed, Delivered","""Signed, Sealed, Delivered"" introduces a dedic..."


This column looks more insightful than the `keywords` column. We can use `Natural Language Processing` when creating the model for this dataset.

Before doing so, let's check one more time for `NaNs` in this column. There were `3` of them for `overview` initially, but we've dropped some rows from the `movies` dataframe at this point in the notebook.

In [36]:
# Check once more for nulls in the 'overview' column
movies['overview'].isna().sum()

1

In [37]:
# Let's check which movie this one null overview belongs to
movies[movies['overview'].isna()]

original_title,budget,original_language,overview,popularity,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,...,Comedy,Romance,Horror,Mystery,History,War,Music,Documentary,Foreign,TV
"To Be Frank, Sinatra at 100",2.0,en,,0.050625,"[{""name"": ""Eyeline Entertainment"", ""id"": 60343}]","[{""iso_3166_1"": ""GB"", ""name"": ""United Kingdom""}]",2015-12-12,0,,[],...,0,0,0,0,0,0,0,1,0,0


Ahhh yes, the critically acclaimed "To Be Frank, Sinatra at 100", the rave documentary with the $2 budget. 

On a serious note -- if we remove this film from our dataframe, we'll be able to use NLP for 100% of the `overviews`.
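
As a hedged alternative to dropping the row by its label, `dropna` with the `subset` argument removes every row whose `overview` is missing without having to name the film. The sketch below uses a small toy frame (the `No Overview Film` row is made up for illustration) rather than the real `movies` dataframe.

```python
import pandas as pd

# Toy stand-in for `movies`; the second row is hypothetical
toy = pd.DataFrame(
    {"overview": ["A cryptic message...", None, "John Carter is..."]},
    index=pd.Index(["Spectre", "No Overview Film", "John Carter"], name="original_title"),
)

# Drop every row whose 'overview' is missing
toy_clean = toy.dropna(subset=["overview"])

# Confirm that no null overviews remain
print(toy_clean["overview"].isna().sum())
```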

In [38]:
# Remove the movie 'To Be Frank, Sinatra at 100' from the dataframe
movies.drop(labels = 'To Be Frank, Sinatra at 100', inplace = True)

In [39]:
# Import Tokenizer
from nltk.tokenize import RegexpTokenizer

# Import Lemmatizer
from nltk.stem import WordNetLemmatizer

# Import stopwords.
from nltk.corpus import stopwords

In [40]:
# Write a function to tokenize and then lemmatize the overview
def lemmatize_overview(df):
    
    # Create an empty list that will ultimately be a new column of lemmas
    overview_lemmas = []

    # Instantiate a tokenizer
    # Use the pattern r'\w+' to keep runs of word characters, ignoring spaces and punctuation
    tokenizer = RegexpTokenizer(r'\w+')
    
    # Instantiate a lemmatizer
    lemmatizer = WordNetLemmatizer()
    
    for cell in df['overview']:
        
        # Tokenize the individual overview
        overview_tokens = tokenizer.tokenize(cell.lower())
        
        # Lemmatize the individual overview
        overview_lemma = [lemmatizer.lemmatize(token) for token in overview_tokens]
        
        # Append the lemmatized cell to the list titled 'overview_lemmas'
        overview_lemmas.append(overview_lemma)
    
    # Drop the original overview column from the dataframe
    df.drop(columns = 'overview', inplace = True)
    
    # Replace column with the lemmas, rename it 'overview'
    df['overview'] = overview_lemmas
    
    # Return df
    return df


In [41]:
# Call the function and see if it worked
movies = lemmatize_overview(movies)
movies['overview']

original_title
Avatar                                      [in, the, 22nd, century, a, paraplegic, marine...
Pirates of the Caribbean: At World's End    [captain, barbossa, long, believed, to, be, de...
Spectre                                     [a, cryptic, message, from, bond, s, past, sen...
The Dark Knight Rises                       [following, the, death, of, district, attorney...
John Carter                                 [john, carter, is, a, war, weary, former, mili...
                                                                  ...                        
Primer                                      [friend, fledgling, entrepreneur, invent, a, d...
Cavite                                      [adam, a, security, guard, travel, from, calif...
Newlyweds                                   [a, newlywed, couple, s, honeymoon, is, upende...
Signed, Sealed, Delivered                   [signed, sealed, delivered, introduces, a, ded...
My Date with Drew                           [

Now that we've `tokenized` and `lemmatized` the `overview` column, we can use one of the vectorizer algorithms (either `CountVectorizer` or `TfidfVectorizer`) to process the language for modeling. We'll do that in the next notebook, right before creating the `recommendation engine` itself.
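
To give a feel for what a TF-IDF vectorizer does with tokenized text, here is a minimal pure-Python sketch. It is an illustration only: the overviews are made up, the `tfidf` helper is hypothetical, and the formula (term frequency times `log(N / document frequency)`) is the plain textbook version. Library implementations such as scikit-learn's `TfidfVectorizer` use a somewhat different (smoothed and normalized) formula.

```python
import math
import re
from collections import Counter

# Toy overviews, made up for illustration
overviews = [
    "A marine is sent to a distant moon.",
    "A pirate sails to the end of the world.",
    "A spy follows a cryptic message.",
]

# Tokenize lowercased text into runs of word characters (same pattern as above)
docs = [re.findall(r"\w+", text.lower()) for text in overviews]

# Document frequency: number of overviews that contain each token
doc_freq = Counter(token for doc in docs for token in set(doc))
n_docs = len(docs)

def tfidf(doc):
    """Return {token: tf * idf} for one tokenized overview."""
    counts = Counter(doc)
    return {
        token: (count / len(doc)) * math.log(n_docs / doc_freq[token])
        for token, count in counts.items()
    }

weights = [tfidf(doc) for doc in docs]

# A token that appears in every overview ("a") gets weight 0,
# while a token unique to one overview ("marine") gets a positive weight
print(weights[0]["a"], weights[0]["marine"])
```

The point of the weighting is that words shared by every overview carry no information for telling movies apart, whereas rarer words do.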

### **Popularity**

In [42]:
movies[['popularity']]

Unnamed: 0_level_0,popularity
original_title,Unnamed: 1_level_1
Avatar,150.437577
Pirates of the Caribbean: At World's End,139.082615
Spectre,107.376788
The Dark Knight Rises,112.312950
John Carter,43.926995
...,...
Primer,23.307949
Cavite,0.022173
Newlyweds,0.642552
"Signed, Sealed, Delivered",1.444476


I don't like data columns like these. How is this measured? On what scale? And according to whom? This looks like something that can be dropped from our dataset.

In [43]:
movies.drop(columns = ['popularity'], inplace = True)

### **Production Companies**

In [44]:
movies[['production_companies']]

Unnamed: 0_level_0,production_companies
original_title,Unnamed: 1_level_1
Avatar,"[{""name"": ""Ingenious Film Partners"", ""id"": 289..."
Pirates of the Caribbean: At World's End,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}, {""..."
Spectre,"[{""name"": ""Columbia Pictures"", ""id"": 5}, {""nam..."
The Dark Knight Rises,"[{""name"": ""Legendary Pictures"", ""id"": 923}, {""..."
John Carter,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}]"
...,...
Primer,"[{""name"": ""Thinkfilm"", ""id"": 446}]"
Cavite,[]
Newlyweds,[]
"Signed, Sealed, Delivered","[{""name"": ""Front Street Pictures"", ""id"": 3958}..."


In [45]:
# Let's look at an example of a single "production_company" cell value
movies['production_companies'].iloc[0]

'[{"name": "Ingenious Film Partners", "id": 289}, {"name": "Twentieth Century Fox Film Corporation", "id": 306}, {"name": "Dune Entertainment", "id": 444}, {"name": "Lightstorm Entertainment", "id": 574}]'

It looks like this is a list of dictionaries that has been converted into a string. In the cell below, I'll see whether it's possible to isolate the names of the production companies, which will be crucial to doing some feature engineering in the next notebook.

In [46]:
# Let's use json to convert this string back to a list of dictionaries
import json

# Use json.loads to parse the string into a list of dictionaries
production_companies_0 = json.loads(movies['production_companies'].iloc[0])
production_companies_0

[{'name': 'Ingenious Film Partners', 'id': 289},
 {'name': 'Twentieth Century Fox Film Corporation', 'id': 306},
 {'name': 'Dune Entertainment', 'id': 444},
 {'name': 'Lightstorm Entertainment', 'id': 574}]

In [47]:
# Create an empty list called 'production_companies'
production_companies = []

# Create a list of indices as long as the column 'production_companies' in the dataframe
for i in range(0, len(movies['production_companies'])):
    
    # Isolate each cell one by one and create a list of production_companies in the cell
    cell = movies.iloc[i]['production_companies'].split(' ')
    
    # Loop through each list to look for unique production_companies
    for production_company in cell:
        
        # Add unique production_companies to the production_companies list
        if production_company not in production_companies:
            production_companies.append(production_company)
            
# Take a look at the production_companies list and how many unique production_companies are in it
print(f'There are {len(production_companies)} unique production_companies in the dataset, and they are:')
print(production_companies)

There are 10433 unique production_companies in the dataset, and they are:
['[{"name":', '"Ingenious', 'Film', 'Partners",', '"id":', '289},', '{"name":', '"Twentieth', 'Century', 'Fox', 'Corporation",', '306},', '"Dune', 'Entertainment",', '444},', '"Lightstorm', '574}]', '"Walt', 'Disney', 'Pictures",', '2},', '"Jerry', 'Bruckheimer', 'Films",', '130},', '"Second', 'Mate', 'Productions",', '19936}]', '"Columbia', '5},', '"Danjaq",', '10761},', '"B24",', '69434}]', '"Legendary', '923},', '"Warner', 'Bros.",', '6194},', '"DC', '9993},', '"Syncopy",', '9996}]', '2}]', '"Laura', 'Ziskin', '326},', '"Marvel', 'Enterprises",', '19551}]', 'Animation', 'Studios",', '6125}]', '420},', '"Prime', 'Focus",', '15357},', '"Revolution', 'Sun', '76043}]', '"Heyday', '7364}]', 'Comics",', '429},', '"Atlas', '507},', '"Cruel', '&', 'Unusual', '9995},', '"RatPac-Dune', '41624}]', '"Bad', 'Hat', 'Harry', '9168}]', '"Eon', '7576}]', '"Infinitum', 'Nihil",', '2691},', '"Silver', 'Bullet', 'Productions', '(

This was worth exploring, but this looks like a column that should be dropped. The loop reports `10,433` unique production companies, but the true number is lower: because each cell was split on spaces, "New Line Cinema", for instance, was broken up into "New", "Line", and "Cinema". Fixing this simply isn't worth the effort -- including these companies in the dataframe may have been valuable, but that value doesn't justify the time it would take.
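
For reference, should this column ever be revisited, the names could be isolated by parsing each cell with `json.loads` instead of splitting on spaces. The sketch below is a hedged illustration on three cell values copied from the outputs above, not on the full column, so it says nothing about the real number of unique companies.

```python
import json

# Sample cell values copied from the outputs above
cells = [
    '[{"name": "Ingenious Film Partners", "id": 289}, '
    '{"name": "Twentieth Century Fox Film Corporation", "id": 306}, '
    '{"name": "Dune Entertainment", "id": 444}, '
    '{"name": "Lightstorm Entertainment", "id": 574}]',
    '[{"name": "Walt Disney Pictures", "id": 2}]',
    '[]',
]

# Collect whole company names rather than space-separated fragments
unique_companies = set()
for cell in cells:
    for company in json.loads(cell):
        unique_companies.add(company["name"])

print(len(unique_companies))
print(sorted(unique_companies))
```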

In [48]:
# Let's drop this column from the dataframe
movies.drop(columns = 'production_companies', inplace = True)

### **Production Countries**

In [49]:
movies[['production_countries']]

Unnamed: 0_level_0,production_countries
original_title,Unnamed: 1_level_1
Avatar,"[{""iso_3166_1"": ""US"", ""name"": ""United States o..."
Pirates of the Caribbean: At World's End,"[{""iso_3166_1"": ""US"", ""name"": ""United States o..."
Spectre,"[{""iso_3166_1"": ""GB"", ""name"": ""United Kingdom""..."
The Dark Knight Rises,"[{""iso_3166_1"": ""US"", ""name"": ""United States o..."
John Carter,"[{""iso_3166_1"": ""US"", ""name"": ""United States o..."
...,...
Primer,"[{""iso_3166_1"": ""US"", ""name"": ""United States o..."
Cavite,[]
Newlyweds,[]
"Signed, Sealed, Delivered","[{""iso_3166_1"": ""US"", ""name"": ""United States o..."


The same is true of this column as of the `production_companies` column. We could convert these strings back into lists of dictionaries, but would it even be worth it? `production_countries` probably isn't a great predictor for a recommendation engine.

In [50]:
movies.drop(columns = 'production_countries', inplace = True)

### **Release Date**

In [51]:
movies['release_date']

original_title
Avatar                                      2009-12-10
Pirates of the Caribbean: At World's End    2007-05-19
Spectre                                     2015-10-26
The Dark Knight Rises                       2012-07-16
John Carter                                 2012-03-07
                                               ...    
Primer                                      2004-10-08
Cavite                                      2005-03-12
Newlyweds                                   2011-12-26
Signed, Sealed, Delivered                   2013-10-13
My Date with Drew                           2005-08-05
Name: release_date, Length: 4476, dtype: object

The `data type` here is an `object` and should instead be converted to a `datetime` format.

In [52]:
# Import datetime
from datetime import datetime

# Change datatype of 'release_date' column to datetime
movies['release_date'] = pd.to_datetime(movies['release_date'])

# Check work
movies['release_date']

original_title
Avatar                                     2009-12-10
Pirates of the Caribbean: At World's End   2007-05-19
Spectre                                    2015-10-26
The Dark Knight Rises                      2012-07-16
John Carter                                2012-03-07
                                              ...    
Primer                                     2004-10-08
Cavite                                     2005-03-12
Newlyweds                                  2011-12-26
Signed, Sealed, Delivered                  2013-10-13
My Date with Drew                          2005-08-05
Name: release_date, Length: 4476, dtype: datetime64[ns]
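
As a small, hedged illustration of what the `datetime` dtype makes possible, the sketch below parses a toy series and pulls out the release year with the `.dt` accessor. The `errors="coerce"` argument is an optional extra that turns unparseable strings into `NaT` instead of raising; the third value is deliberately invalid to show that.

```python
import pandas as pd

# Two dates from the output above plus one deliberately invalid string
dates = pd.Series(["2009-12-10", "2007-05-19", "not a date"])

# Unparseable entries become NaT rather than raising an error
parsed = pd.to_datetime(dates, errors="coerce")

# Date components are available through the .dt accessor
years = parsed.dt.year
print(years)
```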

### **Revenue**

In [53]:
movies[['revenue']]

Unnamed: 0_level_0,revenue
original_title,Unnamed: 1_level_1
Avatar,2787965087
Pirates of the Caribbean: At World's End,961000000
Spectre,880674609
The Dark Knight Rises,1084939099
John Carter,284139100
...,...
Primer,424760
Cavite,0
Newlyweds,0
"Signed, Sealed, Delivered",0


The values in this column are a little hard to read without any comma separators.
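
For a one-off readable view, Python's format specification supports a `,` thousands separator. A minimal sketch using Avatar's revenue from the table above:

```python
# Avatar's revenue, taken from the table above
revenue = 2787965087

# The ',' option in the format spec inserts thousands separators
formatted = f"${revenue:,}"
print(formatted)  # $2,787,965,087
```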

It looks like some movies have made `$0` in revenue, which is difficult to believe. How many movies in our dataframe have posted `$0` in revenue?

In [54]:
movies[movies['revenue'] == 0].shape

(1244, 35)

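Hmmm, so `1,244` movies haven't made a dime? This seems highly implausible; these zeros probably stand in for missing values. Let's compute the mean revenue of the movies that do report a nonzero figure.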



In [55]:
# Mean revenue of movies with nonzero revenue, formatted as dollars
"${:,.2f}".format(round(sum(movies[movies['revenue'] != 0]['revenue']) / len(movies[movies['revenue'] != 0]['revenue']), 
                 2))

'$120,540,082.02'

The average revenue earned by a movie in this dataset, excluding the `$0` entries, is `$120,540,082.02`. We'll now use this value to replace all the `$0s` in the `revenue` column.

In [56]:
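# Mean revenue across movies that report a nonzero revenue
mean_revenue = sum(movies[movies['revenue'] != 0]['revenue']) / len(movies[movies['revenue'] != 0]['revenue'])

# Replace every $0 revenue in the dataframe with the mean revenue
def replace_0_revenue(df):
    new_revenue_column = []
    for i in df['revenue']:
        # Keep nonzero revenues as they are
        if i != 0:
            new_revenue_column.append(i)
        # Substitute the mean for a $0 revenue
        else:
            new_revenue_column.append(mean_revenue)
    df['revenue'] = new_revenue_column
    return df

replace_0_revenue(movies)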

Unnamed: 0_level_0,budget,original_language,release_date,revenue,runtime,spoken_languages,status,tagline,title,vote_average,...,Romance,Horror,Mystery,History,War,Music,Documentary,Foreign,TV,overview
original_title,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
Avatar,2.370000e+08,en,2009-12-10,2.787965e+09,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,Avatar,7.2,...,0,0,0,0,0,0,0,0,0,"[in, the, 22nd, century, a, paraplegic, marine..."
Pirates of the Caribbean: At World's End,3.000000e+08,en,2007-05-19,9.610000e+08,169.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"At the end of the world, the adventure begins.",Pirates of the Caribbean: At World's End,6.9,...,0,0,0,0,0,0,0,0,0,"[captain, barbossa, long, believed, to, be, de..."
Spectre,2.450000e+08,en,2015-10-26,8.806746e+08,148.0,"[{""iso_639_1"": ""fr"", ""name"": ""Fran\u00e7ais""},...",Released,A Plan No One Escapes,Spectre,6.3,...,0,0,0,0,0,0,0,0,0,"[a, cryptic, message, from, bond, s, past, sen..."
The Dark Knight Rises,2.500000e+08,en,2012-07-16,1.084939e+09,165.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,The Legend Ends,The Dark Knight Rises,7.6,...,0,0,0,0,0,0,0,0,0,"[following, the, death, of, district, attorney..."
John Carter,2.600000e+08,en,2012-03-07,2.841391e+08,132.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"Lost in our world, found in another.",John Carter,6.1,...,0,0,0,0,0,0,0,0,0,"[john, carter, is, a, war, weary, former, mili..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Primer,7.000000e+03,en,2004-10-08,4.247600e+05,77.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,What happens if it actually works?,Primer,6.9,...,0,0,0,0,0,0,0,0,0,"[friend, fledgling, entrepreneur, invent, a, d..."
Cavite,3.704284e+07,en,2005-03-12,1.205401e+08,80.0,[],Released,,Cavite,7.5,...,0,0,0,0,0,0,0,1,0,"[adam, a, security, guard, travel, from, calif..."
Newlyweds,9.000000e+03,en,2011-12-26,1.205401e+08,85.0,[],Released,A newlywed couple's honeymoon is upended by th...,Newlyweds,5.9,...,1,0,0,0,0,0,0,0,0,"[a, newlywed, couple, s, honeymoon, is, upende..."
"Signed, Sealed, Delivered",3.704284e+07,en,2013-10-13,1.205401e+08,120.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,,"Signed, Sealed, Delivered",7.0,...,1,0,0,0,0,0,0,0,1,"[signed, sealed, delivered, introduces, a, ded..."


In [57]:
movies[movies['revenue'] == 0].shape

(0, 35)

It worked! Every `$0` value in the `revenue` column has now been replaced with the dataset's mean revenue of `$120,540,082.02`.
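
The same replacement can be written without an explicit loop. The sketch below is a hedged alternative on a toy frame (two revenues from the table above plus two zeros), using `Series.mask` to swap in the nonzero mean wherever the revenue is `0`.

```python
import pandas as pd

# Toy stand-in for `movies`: two real revenues from above and two zeros
toy = pd.DataFrame({"revenue": [2787965087, 0, 424760, 0]})

# Mean of the nonzero revenues only
nonzero_mean = toy.loc[toy["revenue"] != 0, "revenue"].mean()

# Replace zeros with that mean; other values are left untouched
toy["revenue"] = toy["revenue"].mask(toy["revenue"] == 0, nonzero_mean)

print(toy)
```

Keep in mind that mean imputation gives every formerly `$0` movie the identical revenue, so the column carries no real signal for those rows.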

### Runtime

In [58]:
# Check format and data type
movies['runtime']

original_title
Avatar                                      162.0
Pirates of the Caribbean: At World's End    169.0
Spectre                                     148.0
The Dark Knight Rises                       165.0
John Carter                                 132.0
                                            ...  
Primer                                       77.0
Cavite                                       80.0
Newlyweds                                    85.0
Signed, Sealed, Delivered                   120.0
My Date with Drew                            90.0
Name: runtime, Length: 4476, dtype: float64

In [59]:
# There were 2 NaNs initially in this dataframe -- how many are there now?
movies['runtime'].isna().sum()

0

### Spoken Languages

In [60]:
# Check format and data type
movies['spoken_languages']

original_title
Avatar                                      [{"iso_639_1": "en", "name": "English"}, {"iso...
Pirates of the Caribbean: At World's End             [{"iso_639_1": "en", "name": "English"}]
Spectre                                     [{"iso_639_1": "fr", "name": "Fran\u00e7ais"},...
The Dark Knight Rises                                [{"iso_639_1": "en", "name": "English"}]
John Carter                                          [{"iso_639_1": "en", "name": "English"}]
                                                                  ...                        
Primer                                               [{"iso_639_1": "en", "name": "English"}]
Cavite                                                                                     []
Newlyweds                                                                                  []
Signed, Sealed, Delivered                            [{"iso_639_1": "en", "name": "English"}]
My Date with Drew                            

In [61]:
# Isolate one cell to see what its contents look like
movies['spoken_languages'].iloc[0]

'[{"iso_639_1": "en", "name": "English"}, {"iso_639_1": "es", "name": "Espa\\u00f1ol"}]'

The contents of this column are unclear. I've seen Avatar, the movie that the `spoken_languages` above corresponds to, and there's no Spanish in the movie. Could it mean that there exists a Spanish translated version of Avatar? 
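
To make the cell easier to read, it can be parsed with `json.loads`, which also decodes the `\u00f1` escape. This is just a sketch on the single Avatar value shown above; it does not settle what the column is meant to represent.

```python
import json

# The Avatar cell from above, written as a raw string so the \u escape
# reaches json.loads unchanged
cell = r'[{"iso_639_1": "en", "name": "English"}, {"iso_639_1": "es", "name": "Espa\u00f1ol"}]'

# Parse the string and pull out the language names
language_names = [language["name"] for language in json.loads(cell)]
print(language_names)
```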

Whatever the purpose of this column is, it doesn't appear to be necessary. Let's drop it.

In [62]:
movies.drop(columns = ['spoken_languages'], inplace = True)

### Status

In [63]:
# Check format and data type
movies['status']

original_title
Avatar                                      Released
Pirates of the Caribbean: At World's End    Released
Spectre                                     Released
The Dark Knight Rises                       Released
John Carter                                 Released
                                              ...   
Primer                                      Released
Cavite                                      Released
Newlyweds                                   Released
Signed, Sealed, Delivered                   Released
My Date with Drew                           Released
Name: status, Length: 4476, dtype: object

In [64]:
# Look at unique values in this column
movies['status'].unique()

array(['Released', 'Post Production', 'Rumored'], dtype=object)

Movies that are `Rumored` or in `Post Production` are included in this dataset? Let's take a closer look.

In [65]:
# Take a look at all movies with the status 'Post Production'
movies[movies['status'] == 'Post Production']

Unnamed: 0_level_0,budget,original_language,release_date,revenue,runtime,status,tagline,title,vote_average,vote_count,...,Romance,Horror,Mystery,History,War,Music,Documentary,Foreign,TV,overview
original_title,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
Brotherly Love,1900000.0,en,2015-04-24,120540100.0,89.0,Post Production,,Brotherly Love,6.9,21,...,0,0,0,0,0,0,0,0,0,"[west, philadelphia, basketball, star, sergio,..."
Higher Ground,2000000.0,en,2011-08-26,841733.0,109.0,Post Production,,Higher Ground,5.3,14,...,0,0,0,0,0,0,0,0,0,"[a, chronicle, of, one, woman, s, lifelong, st..."


In [66]:
# Take a look at all movies with the status 'Rumored'
movies[movies['status'] == 'Rumored']

Unnamed: 0_level_0,budget,original_language,release_date,revenue,runtime,status,tagline,title,vote_average,vote_count,...,Romance,Horror,Mystery,History,War,Music,Documentary,Foreign,TV,overview
original_title,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
The Helix... Loaded,37042840.0,en,2005-01-01,120540100.0,97.0,Rumored,,The Helix... Loaded,4.8,2,...,0,0,0,0,0,0,0,0,0,[]
Crying with Laughter,37042840.0,en,2009-06-01,120540100.0,93.0,Rumored,A Bad Trip Down Memory Lane,Crying with Laughter,7.0,1,...,0,0,0,0,0,0,0,0,0,"[powerfully, redemptive, and, darkly, comedic,..."
The Harvest (La Cosecha),56000.0,en,2011-07-29,120540100.0,80.0,Rumored,,The Harvest (La Cosecha),0.0,0,...,0,0,0,0,0,0,1,0,0,"[the, story, of, the, child, who, work, 12, 14..."
Little Big Top,37042840.0,en,2006-01-01,120540100.0,0.0,Rumored,,Little Big Top,10.0,1,...,0,0,0,0,0,0,0,0,0,"[an, aging, out, of, work, clown, return, to, ..."
The Naked Ape,37042840.0,en,2006-09-16,120540100.0,110.0,Rumored,,The Naked Ape,5.0,1,...,0,0,0,0,0,0,0,0,0,"[the, naked, ape, is, a, coming, of, age, film..."


Movies with any status other than `Released` should be dropped. Not only is a recommendation for a movie that hasn't been released a crappy recommendation, but these movies look like they were never released. For example, The Naked Ape (ahem) has apparently been rumored since 2006, and a simple Google search has confirmed that there is no such film.

In [67]:
# Drop the 7 movies from the dataframe whose statuses are either 'Rumored' or 'Post Production'
movies = movies[movies['status'] == 'Released']

### Tagline

In [68]:
movies['tagline']

original_title
Avatar                                                            Enter the World of Pandora.
Pirates of the Caribbean: At World's End       At the end of the world, the adventure begins.
Spectre                                                                 A Plan No One Escapes
The Dark Knight Rises                                                         The Legend Ends
John Carter                                              Lost in our world, found in another.
                                                                  ...                        
Primer                                                     What happens if it actually works?
Cavite                                                                                    NaN
Newlyweds                                   A newlywed couple's honeymoon is upended by th...
Signed, Sealed, Delivered                                                                 NaN
My Date with Drew                            

This column essentially looks like a less informative and far cornier edition of the `overview` column. Let's drop it.

In [69]:
movies.drop(columns = ['tagline'], inplace = True)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  errors=errors,


It's odd that the above command generated a warning... let's check to make sure that this column is gone from `movies`.

In [70]:
movies.columns

Index(['budget', 'original_language', 'release_date', 'revenue', 'runtime',
       'status', 'title', 'vote_average', 'vote_count', 'cast', 'crew',
       'director', 'Action', 'Adventure', 'Fantasy', 'Science_Fiction',
       'Crime', 'Drama', 'Thriller', 'Animation', 'Family', 'Western',
       'Comedy', 'Romance', 'Horror', 'Mystery', 'History', 'War', 'Music',
       'Documentary', 'Foreign', 'TV', 'overview'],
      dtype='object')

Good, it looks like the column is gone.
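
The message above is pandas' `SettingWithCopyWarning`. It most likely appeared because `movies` had just been reassigned to a filtered selection of itself (the `status == 'Released'` step), so pandas could not tell whether the in-place `drop` was meant to affect the original frame. A hedged sketch of one common way to avoid it, on a toy frame, is to take an explicit `.copy()` when filtering:

```python
import pandas as pd

# Toy stand-in for `movies`
toy = pd.DataFrame(
    {
        "status": ["Released", "Rumored", "Released"],
        "tagline": ["Enter the World of Pandora.", None, "The Legend Ends"],
    }
)

# .copy() makes the filtered result an independent DataFrame
released = toy[toy["status"] == "Released"].copy()

# An in-place drop on the explicit copy leaves the original frame untouched
released.drop(columns=["tagline"], inplace=True)

print(released.columns.tolist())
```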

### Title

How is this different from `original_title`?

In [72]:
# Check column and data type
movies['title']

original_title
Avatar                                                                        Avatar
Pirates of the Caribbean: At World's End    Pirates of the Caribbean: At World's End
Spectre                                                                      Spectre
The Dark Knight Rises                                          The Dark Knight Rises
John Carter                                                              John Carter
                                                              ...                   
Primer                                                                        Primer
Cavite                                                                        Cavite
Newlyweds                                                                  Newlyweds
Signed, Sealed, Delivered                                  Signed, Sealed, Delivered
My Date with Drew                                                  My Date with Drew
Name: title, Length: 4469, dtype: object

Let's check to see how exactly, if at all, this column is different from `original_title`.

In [109]:
# Define a function to look for differences in 'original_title' and 'title'
def check_title_differences(df):
    
    # Create an empty dataframe
    title_differences = pd.DataFrame()
    
    # Create an empty list called 'original titles'
    original_titles = []
    
    # Create an empty list called 'titles'
    titles = []
    
    # Start for loop to go through the entire dataframe
    for i in range(0, len(df.index)):
        
        # If statement: if 'original title' and 'title' are unequal values
        if df.index[i] != df.iloc[i]['title']:
            
            # Add unequal original title to appropriate list
            original_titles.append(df.index[i])
            
            # Add unequal title to appropriate list
            titles.append(df.iloc[i]['title'])
    
    # Create 'original titles' column in new dataframe with newly created list
    title_differences['original_titles'] = original_titles
    
    # Create 'titles' column in new dataframe with newly created list
    title_differences['titles'] = titles
    
    # Return new dataframe
    return title_differences

In [110]:
# Call function on 'movies' dataframe
check_title_differences(movies)

Unnamed: 0,original_titles,titles
0,4: Rise of the Silver Surfer,Fantastic 4: Rise of the Silver Surfer
1,Arthur et les Minimoys,Arthur and the Invisibles
2,Deux frères,Two Brothers
3,Michael Jackson's This Is It,This Is It
4,Lo imposible,The Impossible
5,Nomad,Nomad: The Warrior
6,Le Hussard sur le toit,The Horseman on the Roof
7,The House of Magic,Thunder and the House of Magic
8,The Neverending Story,The NeverEnding Story
9,EverAfter,Ever After: A Cinderella Story
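
The same comparison can also be expressed without a loop by building a boolean mask. The sketch below is a hedged alternative on a toy frame that uses three title pairs from this notebook; it assumes, as in `movies`, that the original title is the index.

```python
import pandas as pd

# Toy stand-in for `movies`, indexed by original title
toy = pd.DataFrame(
    {"title": ["Avatar", "The Impossible", "Two Brothers"]},
    index=pd.Index(["Avatar", "Lo imposible", "Deux frères"], name="original_title"),
)

# Element-wise comparison of the index against the 'title' column
original = toy.index.to_numpy()
titles = toy["title"].to_numpy()
mask = original != titles

# Keep only the rows where the two titles differ
title_differences = pd.DataFrame(
    {"original_titles": original[mask], "titles": titles[mask]}
)
print(title_differences)
```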


It might be a good idea to set the index of `movies` to `title`, simply because the `title` column seems to be entirely in English, whereas the `original_title` column (which is now the index) is not.

In [111]:
movies.set_index('title')

Unnamed: 0_level_0,budget,original_language,release_date,revenue,runtime,status,vote_average,vote_count,cast,crew,...,Romance,Horror,Mystery,History,War,Music,Documentary,Foreign,TV,overview
title,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
Avatar,2.370000e+08,en,2009-12-10,2.787965e+09,162.0,Released,7.2,11800,Sam Worthington Zoe Saldana Sigourney Weaver S...,"[{'name': 'Stephen E. Rivkin', 'gender': 0, 'd...",...,0,0,0,0,0,0,0,0,0,"[in, the, 22nd, century, a, paraplegic, marine..."
Pirates of the Caribbean: At World's End,3.000000e+08,en,2007-05-19,9.610000e+08,169.0,Released,6.9,4500,Johnny Depp Orlando Bloom Keira Knightley Stel...,"[{'name': 'Dariusz Wolski', 'gender': 2, 'depa...",...,0,0,0,0,0,0,0,0,0,"[captain, barbossa, long, believed, to, be, de..."
Spectre,2.450000e+08,en,2015-10-26,8.806746e+08,148.0,Released,6.3,4466,Daniel Craig Christoph Waltz L\u00e9a Seydoux ...,"[{'name': 'Thomas Newman', 'gender': 2, 'depar...",...,0,0,0,0,0,0,0,0,0,"[a, cryptic, message, from, bond, s, past, sen..."
The Dark Knight Rises,2.500000e+08,en,2012-07-16,1.084939e+09,165.0,Released,7.6,9106,Christian Bale Michael Caine Gary Oldman Anne ...,"[{'name': 'Hans Zimmer', 'gender': 2, 'departm...",...,0,0,0,0,0,0,0,0,0,"[following, the, death, of, district, attorney..."
John Carter,2.600000e+08,en,2012-03-07,2.841391e+08,132.0,Released,6.1,2124,Taylor Kitsch Lynn Collins Samantha Morton Wil...,"[{'name': 'Andrew Stanton', 'gender': 2, 'depa...",...,0,0,0,0,0,0,0,0,0,"[john, carter, is, a, war, weary, former, mili..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Primer,7.000000e+03,en,2004-10-08,4.247600e+05,77.0,Released,6.9,658,Shane Carruth David Sullivan Casey Gooden Anan...,"[{'name': 'Shane Carruth', 'gender': 2, 'depar...",...,0,0,0,0,0,0,0,0,0,"[friend, fledgling, entrepreneur, invent, a, d..."
Cavite,3.704284e+07,en,2005-03-12,1.205401e+08,80.0,Released,7.5,2,,"[{'name': 'Neill Dela Llana', 'gender': 0, 'de...",...,0,0,0,0,0,0,0,1,0,"[adam, a, security, guard, travel, from, calif..."
Newlyweds,9.000000e+03,en,2011-12-26,1.205401e+08,85.0,Released,5.9,5,Edward Burns Kerry Bish\u00e9 Marsha Dietlein ...,"[{'name': 'Edward Burns', 'gender': 2, 'depart...",...,1,0,0,0,0,0,0,0,0,"[a, newlywed, couple, s, honeymoon, is, upende..."
"Signed, Sealed, Delivered",3.704284e+07,en,2013-10-13,1.205401e+08,120.0,Released,7.0,6,Eric Mabius Kristin Booth Crystal Lowe Geoff G...,"[{'name': 'Carla Hetland', 'gender': 0, 'depar...",...,1,0,0,0,0,0,0,0,1,"[signed, sealed, delivered, introduces, a, ded..."


In [112]:
# Check columns
movies.columns

Index(['budget', 'original_language', 'release_date', 'revenue', 'runtime',
       'status', 'title', 'vote_average', 'vote_count', 'cast', 'crew',
       'director', 'Action', 'Adventure', 'Fantasy', 'Science_Fiction',
       'Crime', 'Drama', 'Thriller', 'Animation', 'Family', 'Western',
       'Comedy', 'Romance', 'Horror', 'Mystery', 'History', 'War', 'Music',
       'Documentary', 'Foreign', 'TV', 'overview'],
      dtype='object')
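
The column listing above still contains `title`, which is consistent with how `set_index` behaves: it returns a new DataFrame and leaves the original unchanged unless the result is assigned back. A minimal sketch on a toy frame (two rows taken from the tables above):

```python
import pandas as pd

# Toy stand-in for `movies`, indexed by original title
toy = pd.DataFrame(
    {"title": ["Avatar", "Spectre"], "runtime": [162.0, 148.0]},
    index=pd.Index(["Avatar", "Spectre"], name="original_title"),
)

# Without assignment, the result is discarded and `toy` is unchanged
toy.set_index("title")
still_has_title = "title" in toy.columns

# Assigning the result back is what actually changes the index
toy = toy.set_index("title")

print(still_has_title, toy.index.name, toy.columns.tolist())
```

Note that the previous index (`original_title` here) is replaced by default, so it would need to be saved as a column first if it is still wanted.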