## Final Project Submission

Please fill out:
* Student name: Christine Egan
* Student pace: part time
* Scheduled project review date/time: 
* Instructor name: Amber Yandow
* Blog post URL:

## Table of Contents
### 1. Introduction
### 2. Libraries & Functions
### 3. Data Mining & Cleaning Methodology
### 4. Exploratory Data Analysis
### 5. Findings and Suggestions  
<p>&nbsp;</p>

## 1. Introduction
The goal of this project is to suggest strategies that will maximize that chances that will make a movie the most commercially successful. This will be achieved through examining curated data with some important questions in mind:

#### A. Budgets & Revenue
    i. Which movies generate the highest revenues?
    ii. Which movies have the best budget to earnings value?

#### B. Domestic & Foreign Revenues
    i. How do foreign and domestic revenues compare?
    ii. Which genres are the most popular in domestic and/or foriegn markets?

#### C. Movie Length & Revenues
    i. How does movie length effect budget?
    ii. How does movie length effect earnings?

#### D. Studio & Revenues
    i. Which studios have the highest budget movies?
    ii. Which studios have the highest revenue movies?
<p>&nbsp;</p>

## 2. Functions and Libraries

### A. Libraries
    gzip (needed to unzip the zipped data)
    shutil (needed to create a copy of the unzipped data that can returned in the function)
    pandas (needed to organize the csv file into a coherent data frame)

In [1]:
import gzip # need gzip to unzip the csv
import shutil # need to use to copy the file object
import pandas as pd # need to use to organize and manipulate the csv with python

### B. Functions
Below is a list of custom functions that I defined to streamline my initial analysis.

#### i. gz_open
    is used to open any files I need from the zippedData folder

In [2]:
def gz_open(f): # f is the file that needs to be unzipped
    with gzip.open(f, 'rb') as f_in:
        f = f.replace('.gz','')
        with open(f, 'wb') as f_out:
            shutil.copyfileobj(f_in, f_out) # f_in =  the file source, f_out = file destination
    return f_out

#### ii. percent_null
    is used to determine the percentage of null values in a particular dataset

In [3]:
def percent_null(df):
    x = len(df.isna().sum())
    print(x*100/len(df))
    print()

#### iii. percent_null_coll 
    iterates percent null over each column in a dataset

In [4]:
def percent_null_col(df):
    x = ['column_name','missing_data', 'missing_in_percentage']
    missing_data = pd.DataFrame(columns=x)
    columns = df.columns
    for col in columns:
        icolumn_name = col
        imissing_data = df[col].isnull().sum()
        imissing_in_percentage = (df[col].isnull().sum()/df[col].shape[0])*100
        missing_data.loc[len(missing_data)] = [icolumn_name, imissing_data, imissing_in_percentage]
    print(missing_data)
    print()
    
    

#### iv. df_info
    quickly prints metadata from the data set, such as:
    column names
    shape
    null values
    the percentage of null values for the entire dataframe
    the percentage of null values per each column
    basic statistics
    datatypes
    a preview of the dataframe

In [5]:
def df_info(df): # need to fix this to remove the extra print lines
    print('Columns')
    print(list(df.columns),'\n')
    print('Shape')
    print(df.shape,'\n')
    print('Null Values')
    print(df.isnull().sum(),'\n')
    print('Total Percentage of Missing Data') # the total percentage if data missing for the entire data frame
    print(percent_null(df),'\n')
    print('Percentage of Missing Data Per Column') # the percentage of missing data per column
    print(percent_null_col(df),'\n')
    print('Info')
    print(df.info(),'\n')
    print('Description')
    print(df.describe(),'\n')
    print('Preview')
    print(df.head(),'\n')

## 3. Data Mining and Cleaning

#### A. Dataframes:
    A. bom_movie_gross_df
    B. imbd_name_basics_df
    C. imbd_title_akas_df
    D. tn_movie_budgets_df

#### B. Methodology:
    1) Preparing the dataframe
        i. unzipping the file
        ii. reading the file
    2) Generating preliminary information about the dataframe to guide analysis
    3) Cleaning the data
        i. handling missing values
        ii. formatting data for ease of use

## i. bom_movie_gross_df

### 1. Preparing the dataframe

In [6]:
# Step 1 - Preparing the dataframe
gz_open('zippedData/bom.movie_gross.csv.gz') # unzipping the file
df = pd.read_csv('bom.movie_gross.csv') # reading the file, creating a df

### 2. Examining the quick facts from the dataframe

In [7]:
# Step 2 - Get some quick data to help guide analysis
df_info(df)

Columns
['title', 'studio', 'domestic_gross', 'foreign_gross', 'year'] 

Shape
(3387, 5) 

Null Values
title                0
studio               5
domestic_gross      28
foreign_gross     1350
year                 0
dtype: int64 

Total Percentage of Missing Data
0.14762326542663123

None 

Percentage of Missing Data Per Column
      column_name missing_data  missing_in_percentage
0           title            0               0.000000
1          studio            5               0.147623
2  domestic_gross           28               0.826690
3   foreign_gross         1350              39.858282
4            year            0               0.000000

None 

Info
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3387 entries, 0 to 3386
Data columns (total 5 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   title           3387 non-null   object 
 1   studio          3382 non-null   object 
 2   domestic_gross  3359 non-null   float64


### 3. Cleaning the data

#### a. Determining how to handle missing values and examining unique values

#### df.title

We need to know if any titles are repeated.

In [8]:
x = df[df.title.duplicated()]
x

Unnamed: 0,title,studio,domestic_gross,foreign_gross,year
3045,Bluebeard,WGUSA,43100.0,,2017


Even though I am pretty sure that this
row can be eliminated, I want to double check that there is truly nothing different about those two rows.

In [9]:
x = df[df.title == 'Bluebeard']
x

Unnamed: 0,title,studio,domestic_gross,foreign_gross,year
317,Bluebeard,Strand,33500.0,5200.0,2010
3045,Bluebeard,WGUSA,43100.0,,2017


Further analysis revealed that this might actually be two different movies with the same title, based on the differences in studio,domestic_gross,foreign_gross, and year! I'm going to leave it in for now.

#### df.studio   
There are only (5) missing values, comprising ~.15% of the total data.
Since we would like to use **studio** as a variable in our analysis, it might be helpful to just drop these rows. <br>
<br>
First, we will preview them just to double check there isn't anything unique about them that might change my mind.   

In [10]:
df[df.studio.isna()] # storing the subset of rows with null vals in studio

Unnamed: 0,title,studio,domestic_gross,foreign_gross,year
210,Outside the Law (Hors-la-loi),,96900.0,3300000.0,2010
555,Fireflies in the Garden,,70600.0,3300000.0,2011
933,Keith Lemon: The Film,,,4000000.0,2012
1862,Plot for Peace,,7100.0,,2014
2825,Secret Superstar,,,122000000.0,2017


I don't see anything special about these movies, so we can drop them.

In [11]:
df=df.dropna(subset=['studio'], how='all') # to drop the rows with null vals for studio
df.isna().sum() # checking to make sure they are gone

title                0
studio               0
domestic_gross      26
foreign_gross     1349
year                 0
dtype: int64

Next, it is important to get an idea of how many unique studios are represented in the dataframe.

In [12]:
print('Number of Unique Studios: ',df.studio.nunique()) # 257 unique studios
print()
x = df['studio'].value_counts()
print('Top 15 Studios with the Most Releases between 2010 - 2018:','\n',x.head(15)) # preview the studios with the most recent releases
print()

Number of Unique Studios:  257

Top 15 Studios with the Most Releases between 2010 - 2018: 
 IFC       166
Uni.      147
WB        140
Magn.     136
Fox       136
SPC       123
Sony      110
BV        106
LGF       103
Par.      101
Eros       89
Wein.      77
CL         74
Strand     68
FoxS       67
Name: studio, dtype: int64



It will be helpful to know how many unique values there are in studio, as well as checking to see if there are different possible placeholders for null values (i.e. 'unknown','na',etc.)

In [13]:
x = len(df) # we might need this to compare later
print('Number of Unique studio Vals: ',df.studio.nunique()) # display the number of unique values 

Number of Unique studio Vals:  257


It will also be helpful to see how many times a particular studio appears in the dataframe, because that indicates how many movies the studio has produced. This can help us limit our analysis to studios that produce a large number of films. I'm going to limit it to the top 15 studios.

In [14]:
y = df.studio.value_counts()
y.head(15) # displaying the number of instances of a particular studio

IFC       166
Uni.      147
WB        140
Magn.     136
Fox       136
SPC       123
Sony      110
BV        106
LGF       103
Par.      101
Eros       89
Wein.      77
CL         74
Strand     68
FoxS       67
Name: studio, dtype: int64

#### df.domestic_gross
There are only **28** missing values, comprising **~.8%** of the total data in that column.   
However, since there columns for both domestic gross and foreign gross it is important to investigate if the rows that contain null values in domestic_gross also contain null values in foreign gross. 

In [15]:
print('Films with Null Values for Domestic Gross \n')
x = df[df.domestic_gross.isna()]
print(x[0:28])  # since it is a small amount of nulls we can check all of them to make sure we are right

Films with Null Values for Domestic Gross 

                                      title   studio  domestic_gross  \
230              It's a Wonderful Afterlife      UTV             NaN   
298   Celine: Through the Eyes of the World     Sony             NaN   
302                              White Lion    Scre.             NaN   
306                        Badmaash Company     Yash             NaN   
327                      Aashayein (Wishes)  Relbig.             NaN   
537                                   Force     FoxS             NaN   
713                        Empire of Silver     NeoC             NaN   
871                            Solomon Kane     RTWC             NaN   
928                            The Tall Man    Imag.             NaN   
936                     Lula, Son of Brazil     NYer             NaN   
966                          The Cup (2012)     Myr.             NaN   
1017                              Dark Tide      WHE             NaN   
1079                

Since the rows that have a null domestic_gross value have a foriegn gross value, it is appropriate to fill in domestic_gross with zero.   
We just have to adjust our analysis accordingly when we are examining the relationship of these variables to profitability.   
For example, if we needed to calculate any descriptive stats for the domestic_gross, it would be important to make sure these movies were not included, to avoid skewing those stats.

In [16]:
df.domestic_gross.fillna(0,inplace=True) 

#### df.foreign_gross
Since there are **1350** missing values, or **~49%** of the data for that column, it cannot be discounted without investigation.   
I have a feeling that just like rows with domestic_gross nulls have foriegn_gross values, that rows with foreign_gross null values will have values in domestic_gross.
However, it is too large to view the entire set, so I am going to check it using the size.
The size of the subset of movies that have null foreign_gross and and values in domestic_gross should be the same size as the subset of all movies will foriegn_gross null values.


In [17]:
x = len(df.loc[((df['foreign_gross'].isna()) & (df['domestic_gross'] > 0))])
y = len(df[df['foreign_gross'].isna()])
x == y

True

They are the same size, so I feel comfortable applying the same method of replacing the null values with zeroes.

In [18]:
df.foreign_gross.fillna(0,inplace=True)

Finally, let's double check that all the null values are eliminated.

In [19]:
print(df.isna().sum())

title             0
studio            0
domestic_gross    0
foreign_gross     0
year              0
dtype: int64


#### df.year

Here we can quickly determine the range of years represented in the dataframe. 
I chose to unique() instead of min()/max() methods, just in case there were any missing years.

In [20]:
df.year.unique()

array([2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018])

#### Renaming the dataframe

I want to be able to easily reuse some of this code for different data frames, so I decided to use the variable df to represent the dataframes when I clean and prepare them individually. However, once they are clean, I want to give them a unique title so that all of the dataframes can be combined without confusion later on.

In [21]:
bom_movie_gross_df = df

# IGNORE THIS SECTION FOR NOW
### Dataframe B: imbd_name_basics_df
A data frame containing names of people involved in making movies

In [None]:
# Step 1 - Unzip the database that we need
gz_open('zippedData/imdb.name.basics.csv.gz')

# Step 2 - Create a dataframe
df = pd.read_csv('zippedData/imdb.name.basics.csv')

# Step 3 - Get some quick data to help guide analysis
df_info(df)

In [None]:
# i need to find out which movies are the same across more than one df

# a) nconst 
# seems like a unique ID number, i wonder if it is in other df?
x = len(df)
print('Length of DF: ',x)
print('Number of Unique nconst Vals: ',df.nconst.nunique()) # if it the same length as x it confirms it is an identifying value
print()

# b) primary name seems like the name of a person involved in making the movie, do these correspond to nconst?
y = df.primary_name.nunique()
print('Number of Unique primary_name Vals: ',y)
z = x - y
print('Number of Repeated primary_names: ',z)
print()
# i need to look at some of these primary names and see if they are duplicates
a = df.primary_name.value_counts()
print(a.head()) # i can't safely remove duplicates for right now
print()
# what do some of these duplicates look like
#print(df[df['primary_name'] == "Michael Brown"])
# so it seems like there are two Michael Browns, one born in 1996 and one born in 1978
# there is obviously a lot to be done regarding this name problem
#print(df[df['primary_name'] == "James Brown"])
#print()
# what I am begining to realize about this column is that it is going to be difficult to use as a starting point
# it is probably smarter to start with the films and their profitability and then perhaps investigate the actors/directors of those films

# c) birth year
# i'm going to drop this col because it is 88% nans, plus I can't think of how it is relevant
# unless i need birth year to determine the difference between two actors?
#df = df.drop('birth_year', 1)
# i made need to fill these values in with a place holder, but we will see

# d) death year
# i'm going to drop this one because it is 99% nans, plus I can't think of how it is relevant
df = df.drop('death_year', 1)

# d) primary profession
# how many unique vals/what are they?
y = df.primary_profession.nunique()
print('Number of Unique primary_profession Vals: ',df.primary_profession.nunique())
z = x - y
print('Number of Repeated primary_profession Vals: ',z)
# there might be a way to extract the true number of unique professions, not just this explosion of combinations
# perhaps I can narrow it down to lists of actors/directors/writers
print()

# e) known for titles (are these t-const numbers?)
y = df.known_for_titles.nunique()
print('Number of Unique known_for_titles Vals: ',df.known_for_titles.nunique())
z = x - y
print('Number of Repeated known_for_titles Vals: ',z)
# can we cross check these against other df?
print()

imbd_name_basics_df = df

imbd_name_basics_df.head()

# IGNORE ABOVE THIS LINE FOR NOW

## ii. imbd_title_akas_df


### 1. Preparing the dataframe

In [22]:
gz_open('zippedData/imdb.title.akas.csv.gz')
df = pd.read_csv('zippedData/imdb.title.akas.csv')

### 2. Examining the quick facts from the dataframe

In [23]:
df_info(df)

Columns
['title_id', 'ordering', 'title', 'region', 'language', 'types', 'attributes', 'is_original_title'] 

Shape
(331703, 8) 

Null Values
title_id                  0
ordering                  0
title                     0
region                53293
language             289988
types                163256
attributes           316778
is_original_title        25
dtype: int64 

Total Percentage of Missing Data
0.0024117960946991738

None 

Percentage of Missing Data Per Column
         column_name missing_data  missing_in_percentage
0           title_id            0               0.000000
1           ordering            0               0.000000
2              title            0               0.000000
3             region        53293              16.066481
4           language       289988              87.423991
5              types       163256              49.217523
6         attributes       316778              95.500493
7  is_original_title           25               0.007537

None

### 3. Cleaning the Data

#### a. Determining how to handle missing values and examining unique values

#### title_id.df

This columm has no null values. However, we might need to check for duplicate values.   
I'm not sure what a title_id is, but I'm assuming it's an identifying number assigned to a particular movie.   
If there are duplicates, we can infer that it is the same movie.

In [24]:
x = len(df)
print('Length of DF: ',x)
y = df.title_id.nunique()
print('Number of Unique title_id Vals: ',y)
z = x - y
print('Number of Repeated title_id: ',z)
print()
print('List of unique title_id Vals')
x = df['title_id'].value_counts()
print(x.head(20))

Length of DF:  331703
Number of Unique title_id Vals:  122302
Number of Repeated title_id:  209401

List of unique title_id Vals
tt2488496    61
tt1201607    55
tt2310332    55
tt2948356    53
tt1790809    53
tt2278871    53
tt2452042    52
tt4630562    52
tt1217209    51
tt2820852    51
tt0398286    51
tt1345836    51
tt0892791    51
tt1596343    50
tt2527336    49
tt1905041    49
tt2096673    49
tt0848228    49
tt1399103    49
tt5323662    49
Name: title_id, dtype: int64


The length of the dataframe is 331,703, but the number of unique title IDs is 122,302.   
This suggests to me that is dataframe represented data for 122,302 films.
If I use a multi-index approach, the dataframe might instead be organized by the title of the film.
This might be necessary to see if there is overlap with bom_movie_gross_df.

#### df.ordering

I am unsure what ordering is used for, but there aren't any null values. First, I will check the number of unique values.

In [25]:
x = len(df)
print('Length of DF: ',x)
y = df.ordering.nunique()
print('Number of Unique ordering Vals: ',y)
z = x - y
print('Number of Repeated ordering Vals: ',z)
print()
print('List of unique ordering Vals')
x = df['ordering'].value_counts()
print(x.head(20))

Length of DF:  331703
Number of Unique ordering Vals:  61
Number of Repeated ordering Vals:  331642

List of unique ordering Vals
1     122302
2      44686
3      41608
4      22586
5      15084
6      11103
7       8745
8       7156
9       6086
10      5269
11      4612
12      4086
13      3683
14      3319
15      3018
16      2740
17      2486
18      2271
19      2097
20      1941
Name: ordering, dtype: int64


I don't think that these numbers will add anything to my analysis, so I am going to drop them.

In [26]:
df.drop('ordering', axis=1, inplace=True)
print(list(df.columns)) # just making sure it's gone

['title_id', 'title', 'region', 'language', 'types', 'attributes', 'is_original_title']


#### df.title

Next, I'm going to check to see if there are duplicates in the title column, as well as how many unique values we are dealing with.
I chose this method instead of .duplicated() because I want to know how many unique movies there are, how many times each movie is listed, and how large the set of duplicates is.

In [27]:
x = len(df)
print('Length of DF: ',x)
y = df.title.nunique()
print('Number of Unique title Vals: ',y)
z = x - y
print('Number of Repeated title: ',z)
print()
print('List of unique title Vals')
y = df.title.value_counts()
print(y[:10])
print()

Length of DF:  331703
Number of Unique title Vals:  252781
Number of Repeated title:  78922

List of unique title Vals
Robin Hood    32
Home          30
Alone         27
Thor          25
Broken        25
Love          25
Deadpool      24
Iron Man 3    23
Brooklyn      23
Macbeth       22
Name: title, dtype: int64



#### df.region

This column contains **53,293** missing values, accounting for **~16%** of the data in that column.   
However, that means that approximately 84% of the data is actually available, and in such a large data set, might still be useful. For now, I think it is best to fill the null values with "unknown" and see where the land.

In [28]:
df.region.fillna('unknown',inplace=True) # fill in the missing values
df.region.isna().sum() # check to make sure they are all gone.

0

I want to look into the unique values in region and see which regions occur with the most frequency. It will also be important to double check what each abbreviation means.

In [29]:
y = df.region.nunique()
print('Number of Unique region Vals: ',y)
z = x - y
print('Number of Repeated region Vals: ',z)
print()
print('List of region Vals')
y = df.region.value_counts()
print(y[:10])

Number of Unique region Vals:  214
Number of Repeated region Vals:  331489

List of region Vals
unknown    53293
US         51490
XWW        18467
RU         13817
DE         11634
FR         10990
ES          9007
GB          8942
CA          8871
PL          8691
Name: region, dtype: int64


#### df.language

This column has **289,988**, which is  **~87%** of the data.
I am considering dropping this, but I wonder if there is information about language that can be extrapolated by looking at other columns, such as the title and the region.   
I can start that process by comparing the region and language columns.

In [30]:
x = df.loc[((df['language'].isna()) & (df['region'] == 'unknown'))]
x

Unnamed: 0,title_id,title,region,language,types,attributes,is_original_title
38,tt0369610,Jurassic World,unknown,,original,,1.0
80,tt0401729,John Carter,unknown,,original,,1.0
83,tt10010134,Versailles Rediscovered - The Sun King's Vanis...,unknown,,original,,1.0
86,tt10027708,Miguelito - Canto a Borinquen,unknown,,original,,1.0
90,tt10050722,Thing I Don't Get,unknown,,original,,1.0
...,...,...,...,...,...,...,...
331690,tt9723084,Anderswo. Allein in Afrika,unknown,,original,,1.0
331692,tt9726638,Monkey King: The Volcano,unknown,,original,,1.0
331696,tt9755806,Big Shark,unknown,,original,,1.0
331698,tt9827784,Sayonara kuchibiru,unknown,,original,,1.0


There don't seem to be any rows where language is null and region is unknown.   
Later on, I will use region and title to draw inferences about language if necessary.
For now, we will also fill them as unknown.

In [31]:
df.language.fillna('unknown',inplace=True) # fill in the missing values
df.language.isna().sum() # check to make sure they are all gone.

0

To be safe, let's do one last comparison to see if language = unknown and region = unknown are complementary sets.

In [32]:
x = len(df[df.region == 'unknown'])
print('region = unknown: ', x)
y = len(df[df.language == 'unknown'])
print('language = unknown: ', y)
z = len(df)
print('Do all unknown region values have values for language?: ', z - x == y)
print('Do all unknown language values have values for region?: ', z - y == x)
print('Rows with unknown values for region and language?: ', len(df.loc[((df['language'] == 'unknown') & (df['region'] == 'unknown'))]))

region = unknown:  53293
language = unknown:  289988
Do all unknown region values have values for language?:  False
Do all unknown language values have values for region?:  False
Rows with unknown values for region and language?:  53293


It seems like any rows where region is unknown, language is also unknown.

Next, I want to check the number of unique languages and which languages occur with the most frequency. Abbreviations will also have to be double checked.

In [33]:
y = df.language.nunique()
print('Number of Unique language Vals: ',y)
z = x - y
print('Number of Repeated language Vals: ',z)
print()
y = df.language.value_counts()
print(y[:10])
print()

Number of Unique language Vals:  77
Number of Repeated language Vals:  53216

unknown    289988
en          22895
tr           3847
bg           3609
fr           3576
he           2680
sv            965
cmn           727
fa            482
hi            307
Name: language, dtype: int64



#### df.types

This column of the data set has **163,256**, comprising **~49%** of the data in that column. Since I'm not sure if this column will be important to any of the questions I've outlined, I might consider dropping it. I am going to fill it in with unknown until I examine all of the unique values. 

In [34]:
df.types.fillna('unknown',inplace=True) # fill in the missing values
df.types.isna().sum() # check to make sure they are all gone.

0

In [35]:
y = df.types.nunique()
print('Number of Unique type Vals: ',y)
z = x - y
print('Number of Repeated type Vals: ',z,'\n')
y = df.types.value_counts()
print(y)

Number of Unique type Vals:  11
Number of Repeated type Vals:  53282 

unknown             163256
imdbDisplay         100461
original             44700
working               8680
alternative           6564
festival              3307
dvd                   2995
tv                    1617
video                  121
dvdimdbDisplay          1
festivalworking         1
Name: types, dtype: int64


After examining the values in the type column, I think I might leave it in because it seems to indicate where/how the film was released, which might enrich our analysis.

#### df.attributes

This column had very little data, missing a total of 316,778 values or ~95%.
I'm going to check the unique values in case there is anything special, but i will most likely drop it.

In [36]:
y = df.attributes.nunique()
print('Number of Unique attributes Vals: ',y)
z = x - y
print('Number of Repeated attributes Vals: ', z, '\n')
print()
print('List of attributes Vals')
y = df.attributes.value_counts()
print(y[:10])

Number of Unique attributes Vals:  77
Number of Repeated attributes Vals:  53216 


List of attributes Vals
new title                           1700
alternative spelling                1394
literal English title               1054
complete title                      1034
original subtitled version           879
informal English title               816
transliterated ISO-LATIN-1 title     695
dubbed version                       645
short title                          640
alternative transliteration          611
Name: attributes, dtype: int64


Even though a lot of values are missing, I think that the information will be helpful when dealing with foreign films. Since many titles are repeated, it makes sense that there would only be these specifications in certain rows. I am going to leave the column in and use unknown to fill in the values.

In [37]:
df.attributes.fillna('unknown',inplace=True) # fill in the missing values
df.attributes.isna().sum() # check to make sure they are all gone.

0

#### df.is_original_title

This column has **25** missing values, account for **~0.008%** of the data. It might be useful to preview what rows are missing data in this column.

In [38]:
df[df.is_original_title.isna()]

Unnamed: 0,title_id,title,region,language,types,attributes,is_original_title
76516,tt1572192,Scream Queen Campfire,US,unknown,unknown,unknown,
161142,tt3300342,Misfortune,US,unknown,unknown,unknown,
176091,tt2397619,Woody Allen: A Documentary,US,unknown,unknown,unknown,
176092,tt2397619,Woody Allen: El documental,AR,unknown,unknown,unknown,
176093,tt2397619,Woody Allen: Um Documentário,BR,unknown,unknown,unknown,
176094,tt2397619,Woody Allen: A Documentary,DE,unknown,unknown,unknown,
176095,tt2397619,Woody Allen: El documental,ES,unknown,unknown,unknown,
176096,tt2397619,"Woody Allen: A Documentary - Manhattan, Movies...",FI,unknown,unknown,unknown,
176097,tt2397619,"Woody Allen, a Documentary",FR,unknown,unknown,unknown,
176098,tt2397619,Woody,IT,unknown,unknown,unknown,


The preview indicates that these rows still have a lot of useful data, so I think dropping the rows might not be necessary. I will consider dropping the column later on, but since most of the rows do have this data, and I might need it to help determine if the movie is foreign or domestic, I want to leave it in unless I find a good reason to take it out. For now, I will use unknown.

In [39]:
df.is_original_title.fillna('unknown',inplace=True)

Next, I need to check what the unique values are for the majority of the rows.

In [40]:
y = df.is_original_title.nunique()
print('Number of Unique is_original_title Vals: ',y)
z = x - y
print('Number of Repeated is_original_title Vals: ',z)
print()
print('List of is_original_title Vals')
y = df.is_original_title.value_counts()
print(y[:10])

Number of Unique is_original_title Vals:  3
Number of Repeated is_original_title Vals:  53290

List of is_original_title Vals
0.0        286978
1.0         44700
unknown        25
Name: is_original_title, dtype: int64


I think it will be easier if instead of 0 and 1, if the values were true or false.

In [41]:
f = lambda x: True if x==1.0 else False
df.is_original_title = df.is_original_title.map(f)
print(df.head())

    title_id                                    title region language  \
0  tt0369610                            Джурасик свят     BG       bg   
1  tt0369610                        Jurashikku warudo     JP  unknown   
2  tt0369610  Jurassic World: O Mundo dos Dinossauros     BR  unknown   
3  tt0369610                  O Mundo dos Dinossauros     BR  unknown   
4  tt0369610                           Jurassic World     FR  unknown   

         types   attributes  is_original_title  
0      unknown      unknown              False  
1  imdbDisplay      unknown              False  
2  imdbDisplay      unknown              False  
3      unknown  short title              False  
4  imdbDisplay      unknown              False  


Now, I will double check to make sure there are no more missing values.

In [42]:
df.isna().sum()

title_id             0
title                0
region               0
language             0
types                0
attributes           0
is_original_title    0
dtype: int64

Finally, I will rename the dataframe so I can use it in conjunction with the other later.

In [43]:
imbd_title_akas_df = df

### Dataframe D: tn_movie_budgets_df

In [44]:
gz_open('zippedData/tn.movie_budgets.csv.gz')
df = pd.read_csv('zippedData/tn.movie_budgets.csv')

In [45]:
df_info(df)

Columns
['id', 'release_date', 'movie', 'production_budget', 'domestic_gross', 'worldwide_gross'] 

Shape
(5782, 6) 

Null Values
id                   0
release_date         0
movie                0
production_budget    0
domestic_gross       0
worldwide_gross      0
dtype: int64 

Total Percentage of Missing Data
0.10377032168799723

None 

Percentage of Missing Data Per Column
         column_name missing_data  missing_in_percentage
0                 id            0                    0.0
1       release_date            0                    0.0
2              movie            0                    0.0
3  production_budget            0                    0.0
4     domestic_gross            0                    0.0
5    worldwide_gross            0                    0.0

None 

Info
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5782 entries, 0 to 5781
Data columns (total 6 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   i

#### df.id

This row seems like a second index to me, and I'm concerned it's going to confuse me. So I'm going to drop it.

In [46]:
df.drop('id', axis=1, inplace=True)
df.columns # checking to make sure it is gone.

Index(['release_date', 'movie', 'production_budget', 'domestic_gross',
       'worldwide_gross'],
      dtype='object')

#### df.release_date

I noticed that this dataframe has a release date column that I can extract the year from. I can store the year in a new column, because it might be useful when joining to the first dataframe bom_movie_gross_df, which also has a year column. 

In [47]:
df['year'] = df.release_date.str.replace('^.* ', '') # create a column with only the year by extracting the it from release_date
df.year.head() # checking to make sure it worked

0    2009
1    2011
2    2019
3    2015
4    2017
Name: year, dtype: object

If I combine those two dataframes later on, I need to take into consideration that the other data frame begins in 2010, which is good because we want to use the most current data we can. This list contains movies with many release dates prior to 2010. I would like to make sure that I can select data that matches the range of the other data at some point if necessary.    
However, there is a small issue. Since we extracted the year from release date, which contained non-numerical data, our datatype for the year column is a string, not an integer. That would make it impossible to use operations to filter the data that we need. So I am converting the year to an integer, then limiting the data to data from 2010 and later. 

In [48]:
df['year'] = (df['year']).astype(int) # making all the values of this column integers
df = df[df['year'] >= 2010] # filtering out the entries prior to 2010

In [49]:
df.sort_values(['year'], ascending=[1]) # sorting by the year

Unnamed: 0,release_date,movie,production_budget,domestic_gross,worldwide_gross,year
1794,"Aug 27, 2010",Takers,"$32,000,000","$57,744,720","$70,587,268",2010
2321,"Dec 31, 2010",Belka i Strelka. Zvezdnye sobaki,"$25,000,000",$0,"$9,445,081",2010
4664,"Oct 20, 2010",Paranormal Activity 2,"$3,000,000","$84,752,907","$177,512,032",2010
2351,"Aug 20, 2010",Piranha 3D,"$24,000,000","$25,003,155","$83,660,160",2010
2352,"Nov 24, 2010",Faster,"$24,000,000","$23,240,020","$35,792,945",2010
...,...,...,...,...,...,...
5047,"Feb 1, 2019",Braid,"$1,660,000",$0,"$80,745",2019
954,"Jan 25, 2019",The Kid Who Would Be King,"$59,000,000","$16,790,790","$28,348,446",2019
1205,"Dec 31, 2020",Hannibal the Conqueror,"$50,000,000",$0,$0,2020
535,"Feb 21, 2020",Call of the Wild,"$82,000,000",$0,$0,2020


#### df.movie

There are no missing values in movie, but I want to check and see if there are duplicate titles.   
I am using value_counts() first because I want to get an idea of how many times a particular title might be listed.

In [50]:
df.movie.value_counts()

Snitch                                 2
The Square                             2
Trance                                 2
Robin Hood                             2
A Most Wanted Man                      1
                                      ..
The Hobbit: The Desolation of Smaug    1
LOL                                    1
Live by Night                          1
The Imitation Game                     1
Fist Fight                             1
Name: movie, Length: 2190, dtype: int64

Since only a small selection of these titles appear more than once, we can quickly check them to see if they are actually duplicates or if something different is going on. I want to check across all the columns of the row (not just title) to make sure it is actually a duplicate.

In [51]:
df[df.duplicated(subset='movie', keep=False)].sort_values('movie', axis = 0, ascending = True)

Unnamed: 0,release_date,movie,production_budget,domestic_gross,worldwide_gross,year
38,"May 14, 2010",Robin Hood,"$210,000,000","$105,487,148","$322,459,006",2010
408,"Nov 21, 2018",Robin Hood,"$99,000,000","$30,824,628","$84,747,441",2018
3025,"Feb 22, 2013",Snitch,"$15,000,000","$42,930,462","$57,907,734",2013
5351,"Dec 31, 2012",Snitch,"$850,000",$0,$0,2012
5009,"Apr 9, 2010",The Square,"$1,900,000","$406,216","$740,932",2010
5099,"Oct 25, 2013",The Square,"$1,500,000","$124,244","$176,262",2013
2970,"Apr 5, 2013",Trance,"$16,000,000","$2,322,593","$22,594,052",2013
5330,"Dec 31, 2012",Trance,"$950,000",$0,$0,2012


It seems like there is something weird going on with these movies. Even though the title is repeated, not all of them represent the same movie, based on the year. I can tell from these data frames that title as well as year are going to be important identifying criteria.    
I will have to handle each title a little differently.

I think I can drop the entries for ***Trance*** and ***Snitch*** that have zero in domestic_gross and worldwide_gross.

In [52]:
df = df.loc[~((df['movie'] == 'Trance') | (df['movie'] == 'Snitch')) & (df['year'] == 2012),:]
# this line says that df is the subset of rows where movie is not Trance or Snitch and year (for that movie) is not 2012

There are multiple titles that are called ***Robin Hood*** and ***The Square*** but they have different years and different production_budget/domestic_gross/worldwide_gross. So I am going to treat them as different movies.

#### df.production_budget

There are no missing values in this column, and presumably each value represents a dollar amount in US dollars. But I can't perform numerical operations on a string, so I need to remove the dollar sign.

In [53]:
df['production_budget'] = df.production_budget.str.replace('$', '').str.replace(',', '')

In [54]:
df.head()

Unnamed: 0,release_date,movie,production_budget,domestic_gross,worldwide_gross,year
10,"Jul 20, 2012",The Dark Knight Rises,275000000,"$448,139,099","$1,084,439,099",2012
13,"Mar 9, 2012",John Carter,275000000,"$73,058,679","$282,778,100",2012
18,"Dec 14, 2012",The Hobbit: An Unexpected Journey,250000000,"$303,003,568","$1,017,003,568",2012
26,"May 4, 2012",The Avengers,225000000,"$623,279,547","$1,517,935,897",2012
30,"Jul 3, 2012",The Amazing Spider-Man,220000000,"$262,030,663","$757,890,267",2012


df.domestic_gross

In [55]:
df['domestic_gross'] = df.domestic_gross.str.replace('$', '').str.replace(',', '')

In [56]:
df.head()

Unnamed: 0,release_date,movie,production_budget,domestic_gross,worldwide_gross,year
10,"Jul 20, 2012",The Dark Knight Rises,275000000,448139099,"$1,084,439,099",2012
13,"Mar 9, 2012",John Carter,275000000,73058679,"$282,778,100",2012
18,"Dec 14, 2012",The Hobbit: An Unexpected Journey,250000000,303003568,"$1,017,003,568",2012
26,"May 4, 2012",The Avengers,225000000,623279547,"$1,517,935,897",2012
30,"Jul 3, 2012",The Amazing Spider-Man,220000000,262030663,"$757,890,267",2012


df.worldwide_gross

In [57]:
df['worldwide_gross'] = df.worldwide_gross.str.replace('$', '').str.replace(',', '')

In [58]:
tn_movie_budgets_df = df

## Combining DF for Questions

#### Dataframes we can use:
    bom_movie_gross_df
    imbd_name_basics_df
    imbd_title_akas_df
    tn_movie_budgets_df

In [396]:
print(list(bom_movie_gross_df.columns))
#print(list(imbd_name_basics_df.columns))
print(list(imbd_title_akas_df.columns))
print(list(tn_movie_budgets_df.columns))

['title', 'studio', 'domestic_gross', 'foreign_gross', 'year']
['title_id', 'title', 'region', 'language', 'types', 'attributes', 'is_original_title']
['title', 'production_budget', 'domestic_gross', 'worldwide_gross', 'year']


In [397]:
df = bom_movie_gross_df.sort_values(by = 'year', ascending = 1)

In [398]:
df = df[df.year == 2018]

In [399]:
df = df.sort_values(by = 'domestic_gross', ascending = 0)

In [400]:
df.head()

Unnamed: 0,title,studio,domestic_gross,foreign_gross,year
3080,Black Panther,BV,700100000.0,646900000.0,2018
3079,Avengers: Infinity War,BV,678800000.0,1369.5,2018
3082,Incredibles 2,BV,608600000.0,634200000.0,2018
3081,Jurassic World: Fallen Kingdom,Uni.,417700000.0,891800000.0,2018
3083,Aquaman,WB,335100000.0,812700000.0,2018


In [401]:
df.drop(['foreign_gross','year'], axis = 1, inplace = True)

In [402]:
df.head()

Unnamed: 0,title,studio,domestic_gross
3080,Black Panther,BV,700100000.0
3079,Avengers: Infinity War,BV,678800000.0
3082,Incredibles 2,BV,608600000.0
3081,Jurassic World: Fallen Kingdom,Uni.,417700000.0
3083,Aquaman,WB,335100000.0


In [404]:
df['domestic_gross'] = (df['domestic_gross']).astype(int)
df.head()

Unnamed: 0,title,studio,domestic_gross
3080,Black Panther,BV,700100000
3079,Avengers: Infinity War,BV,678800000
3082,Incredibles 2,BV,608600000
3081,Jurassic World: Fallen Kingdom,Uni.,417700000
3083,Aquaman,WB,335100000


In [405]:
df['domestic_gross'] = df['domestic_gross'].div(1000000)

In [406]:
df.head()

Unnamed: 0,title,studio,domestic_gross
3080,Black Panther,BV,700.1
3079,Avengers: Infinity War,BV,678.8
3082,Incredibles 2,BV,608.6
3081,Jurassic World: Fallen Kingdom,Uni.,417.7
3083,Aquaman,WB,335.1


In [407]:
x = df.title[:10]

In [408]:
x

3080                     Black Panther
3079            Avengers: Infinity War
3082                     Incredibles 2
3081    Jurassic World: Fallen Kingdom
3083                           Aquaman
3087                        Deadpool 2
3096      Dr. Seuss' The Grinch (2018)
3086     Mission: Impossible - Fallout
3089              Ant-Man and the Wasp
3084                 Bohemian Rhapsody
Name: title, dtype: object

In [413]:
y = list(tn_movie_budgets_df.title)
len(y)
x = list(df.title)

In [417]:
import os 
import sys 
import pandas as pd 
from bs4 import BeautifulSoup 
   
path = 'html.html'
   
# empty list 
data = [] 
   
# for getting the header from 
# the HTML file 
list_header = [] 
soup = BeautifulSoup(open(path),'html.parser') 
header = soup.find_all("table")[0].find("tr") 
  
for items in header: 
    try: 
        list_header.append(items.get_text()) 
    except: 
        continue
  
# for getting the data  
HTML_data = soup.find_all("table")[0].find_all("tr")[1:] 
  
for element in HTML_data: 
    sub_data = [] 
    for sub_element in element: 
        try: 
            sub_data.append(sub_element.get_text()) 
        except: 
            continue
    data.append(sub_data) 
  
# Storing the data into Pandas 
# DataFrame  
dataFrame = pd.DataFrame(data = data, columns = list_header) 
   
# Converting Pandas DataFrame 
# into CSV file 
dataFrame.to_csv('Geeks.csv') 

AttributeError: 'DataFrame' object has no attribute 'title'

# IGNORE

In [309]:
bom_movie_gross_df.set_index('title')

Unnamed: 0_level_0,studio,domestic_gross,foreign_gross,year
title,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Toy Story 3,BV,415000000.0,652000000,2010
Alice in Wonderland (2010),BV,334200000.0,691300000,2010
Harry Potter and the Deathly Hallows Part 1,WB,296000000.0,664300000,2010
Inception,WB,292600000.0,535700000,2010
Shrek Forever After,P/DW,238700000.0,513900000,2010
...,...,...,...,...
The Quake,Magn.,6200.0,0,2018
Edward II (2018 re-release),FM,4800.0,0,2018
El Pacto,Sony,2500.0,0,2018
The Swan,Synergetic,2400.0,0,2018


In [310]:
imbd_title_akas_df.set_index('title')

Unnamed: 0_level_0,title_id,region,language,types,attributes,is_original_title
title,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Джурасик свят,tt0369610,BG,bg,unknown,unknown,False
Jurashikku warudo,tt0369610,JP,unknown,imdbDisplay,unknown,False
Jurassic World: O Mundo dos Dinossauros,tt0369610,BR,unknown,imdbDisplay,unknown,False
O Mundo dos Dinossauros,tt0369610,BR,unknown,unknown,short title,False
Jurassic World,tt0369610,FR,unknown,imdbDisplay,unknown,False
...,...,...,...,...,...,...
Sayonara kuchibiru,tt9827784,unknown,unknown,original,unknown,True
Farewell Song,tt9827784,XWW,en,imdbDisplay,unknown,False
La atención,tt9880178,unknown,unknown,original,unknown,True
La atención,tt9880178,ES,unknown,unknown,unknown,False


In [311]:
tn_movie_budgets_df.rename({'movie': 'title'}, axis=1, inplace=True) # renaming so we can match the other dataframes
#tn_movie_budgets_df.head()
#tn_movie_budgets_df.drop('release_date', axis=1, inplace = True) # we don't need the full date
#tn_movie_budgets_df.head()
tn_movie_budgets_df.sort_values(by = 'title', inplace = True)
#tn_movie_budgets_df.set_index('title') # set the index to the title
tn_movie_budgets_df.set_index('title')

Unnamed: 0_level_0,production_budget,domestic_gross,worldwide_gross,year
title,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
2016: Obamaâs America,2500000,33349941,33349941,2012
21 Jump Street,42000000,138447667,202812429,2012
A Little Bit of Heaven,12500000,10011,1100287,2012
A Lonely Place to Die,4000000,0,442550,2012
A Thousand Words,40000000,18450127,20790486,2012
...,...,...,...,...
Wreck-It Ralph,165000000,189412677,496511521,2012
Wuthering Heights,8000000,100915,2721534,2012
Your Sister's Sister,120000,1597486,3090593,2012
Zambezia,20000000,0,34454336,2012


In [312]:
bom_movie_gross_df
imbd_title_akas_df
df = pd.merge(bom_movie_gross_df, imbd_title_akas_df, how="outer", on="title")
df.head()

Unnamed: 0,title,studio,domestic_gross,foreign_gross,year,title_id,region,language,types,attributes,is_original_title
0,Toy Story 3,BV,415000000.0,652000000,2010.0,tt0435761,DK,unknown,unknown,unknown,False
1,Toy Story 3,BV,415000000.0,652000000,2010.0,tt0435761,UY,unknown,unknown,3-D version,False
2,Toy Story 3,BV,415000000.0,652000000,2010.0,tt0435761,JP,en,unknown,unknown,False
3,Toy Story 3,BV,415000000.0,652000000,2010.0,tt0435761,ES,unknown,imdbDisplay,unknown,False
4,Toy Story 3,BV,415000000.0,652000000,2010.0,tt0435761,unknown,unknown,original,unknown,True


In [313]:
df.isna().sum()
df.drop(['types','attributes','is_original_title'], axis = 1, inplace = True)
df.head()

Unnamed: 0,title,studio,domestic_gross,foreign_gross,year,title_id,region,language
0,Toy Story 3,BV,415000000.0,652000000,2010.0,tt0435761,DK,unknown
1,Toy Story 3,BV,415000000.0,652000000,2010.0,tt0435761,UY,unknown
2,Toy Story 3,BV,415000000.0,652000000,2010.0,tt0435761,JP,en
3,Toy Story 3,BV,415000000.0,652000000,2010.0,tt0435761,ES,unknown
4,Toy Story 3,BV,415000000.0,652000000,2010.0,tt0435761,unknown,unknown


In [314]:
# df.sort_values(by = 'domestic_gross', inplace = True, ascending = 0)
df.head()
title = list(df.title.unique()) # all of the unique titles from the combined df
title = title[:25] # top 25 by dom_gross
title 

['Toy Story 3',
 'Alice in Wonderland (2010)',
 'Harry Potter and the Deathly Hallows Part 1',
 'Inception',
 'Shrek Forever After',
 'The Twilight Saga: Eclipse',
 'Iron Man 2',
 'Tangled',
 'Despicable Me',
 'How to Train Your Dragon',
 'Clash of the Titans (2010)',
 'The Chronicles of Narnia: The Voyage of the Dawn Treader',
 "The King's Speech",
 'Tron Legacy',
 'The Karate Kid',
 'Prince of Persia: The Sands of Time',
 'Black Swan',
 'Megamind',
 'Robin Hood',
 'The Last Airbender',
 'Little Fockers',
 'Resident Evil: Afterlife',
 'Shutter Island',
 'Salt',
 'Sex and the City 2']

In [315]:
df = df[df['title'].isin(title)]
df.head()

Unnamed: 0,title,studio,domestic_gross,foreign_gross,year,title_id,region,language
0,Toy Story 3,BV,415000000.0,652000000,2010.0,tt0435761,DK,unknown
1,Toy Story 3,BV,415000000.0,652000000,2010.0,tt0435761,UY,unknown
2,Toy Story 3,BV,415000000.0,652000000,2010.0,tt0435761,JP,en
3,Toy Story 3,BV,415000000.0,652000000,2010.0,tt0435761,ES,unknown
4,Toy Story 3,BV,415000000.0,652000000,2010.0,tt0435761,unknown,unknown


In [316]:
df.title.value_counts()
df

Unnamed: 0,title,studio,domestic_gross,foreign_gross,year,title_id,region,language
0,Toy Story 3,BV,415000000.0,652000000,2010.0,tt0435761,DK,unknown
1,Toy Story 3,BV,415000000.0,652000000,2010.0,tt0435761,UY,unknown
2,Toy Story 3,BV,415000000.0,652000000,2010.0,tt0435761,JP,en
3,Toy Story 3,BV,415000000.0,652000000,2010.0,tt0435761,ES,unknown
4,Toy Story 3,BV,415000000.0,652000000,2010.0,tt0435761,unknown,unknown
...,...,...,...,...,...,...,...,...
153,Sex and the City 2,WB (NL),95300000.0,193000000,2010.0,tt1261945,DE,unknown
154,Sex and the City 2,WB (NL),95300000.0,193000000,2010.0,tt1261945,TR,tr
155,Sex and the City 2,WB (NL),95300000.0,193000000,2010.0,tt1261945,unknown,unknown
156,Sex and the City 2,WB (NL),95300000.0,193000000,2010.0,tt1261945,GR,unknown


In [317]:
#df.drop(['language','region'], axis = 1, inplace = True)
df.shape

(158, 8)

In [318]:
df

Unnamed: 0,title,studio,domestic_gross,foreign_gross,year,title_id,region,language
0,Toy Story 3,BV,415000000.0,652000000,2010.0,tt0435761,DK,unknown
1,Toy Story 3,BV,415000000.0,652000000,2010.0,tt0435761,UY,unknown
2,Toy Story 3,BV,415000000.0,652000000,2010.0,tt0435761,JP,en
3,Toy Story 3,BV,415000000.0,652000000,2010.0,tt0435761,ES,unknown
4,Toy Story 3,BV,415000000.0,652000000,2010.0,tt0435761,unknown,unknown
...,...,...,...,...,...,...,...,...
153,Sex and the City 2,WB (NL),95300000.0,193000000,2010.0,tt1261945,DE,unknown
154,Sex and the City 2,WB (NL),95300000.0,193000000,2010.0,tt1261945,TR,tr
155,Sex and the City 2,WB (NL),95300000.0,193000000,2010.0,tt1261945,unknown,unknown
156,Sex and the City 2,WB (NL),95300000.0,193000000,2010.0,tt1261945,GR,unknown


In [319]:
df['year'] = (df['year']).astype(int)

In [320]:
df = df[df.region == 'US']

In [321]:
df.shape

(22, 8)

In [322]:
df

Unnamed: 0,title,studio,domestic_gross,foreign_gross,year,title_id,region,language
6,Toy Story 3,BV,415000000.0,652000000,2010,tt0435761,US,unknown
14,Inception,WB,292600000.0,535700000,2010,tt1375666,US,unknown
24,Inception,WB,292600000.0,535700000,2010,tt1375666,US,en
26,Shrek Forever After,P/DW,238700000.0,513900000,2010,tt0892791,US,unknown
29,The Twilight Saga: Eclipse,Sum.,300500000.0,398000000,2010,tt1325004,US,unknown
39,Iron Man 2,Par.,312400000.0,311500000,2010,tt1228705,US,unknown
47,Tangled,BV,200800000.0,391000000,2010,tt0398286,US,unknown
49,Despicable Me,Uni.,251500000.0,291600000,2010,tt1323594,US,unknown
53,How to Train Your Dragon,P/DW,217600000.0,277300000,2010,tt0892769,US,unknown
55,The Chronicles of Narnia: The Voyage of the Da...,Fox,104400000.0,311300000,2010,tt0980970,US,unknown


In [323]:
df.drop_duplicates(subset="title", inplace = True)
df

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """Entry point for launching an IPython kernel.


Unnamed: 0,title,studio,domestic_gross,foreign_gross,year,title_id,region,language
6,Toy Story 3,BV,415000000.0,652000000,2010,tt0435761,US,unknown
14,Inception,WB,292600000.0,535700000,2010,tt1375666,US,unknown
26,Shrek Forever After,P/DW,238700000.0,513900000,2010,tt0892791,US,unknown
29,The Twilight Saga: Eclipse,Sum.,300500000.0,398000000,2010,tt1325004,US,unknown
39,Iron Man 2,Par.,312400000.0,311500000,2010,tt1228705,US,unknown
47,Tangled,BV,200800000.0,391000000,2010,tt0398286,US,unknown
49,Despicable Me,Uni.,251500000.0,291600000,2010,tt1323594,US,unknown
53,How to Train Your Dragon,P/DW,217600000.0,277300000,2010,tt0892769,US,unknown
55,The Chronicles of Narnia: The Voyage of the Da...,Fox,104400000.0,311300000,2010,tt0980970,US,unknown
57,The King's Speech,Wein.,135500000.0,275400000,2010,tt1504320,US,unknown


In [324]:
df = df[:20]
df.drop(['title_id','region','language'], axis = 1, inplace = True)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  errors=errors,


In [325]:
df.drop(['year'], axis = 1, inplace = True)

In [326]:
df

Unnamed: 0,title,studio,domestic_gross,foreign_gross
6,Toy Story 3,BV,415000000.0,652000000
14,Inception,WB,292600000.0,535700000
26,Shrek Forever After,P/DW,238700000.0,513900000
29,The Twilight Saga: Eclipse,Sum.,300500000.0,398000000
39,Iron Man 2,Par.,312400000.0,311500000
47,Tangled,BV,200800000.0,391000000
49,Despicable Me,Uni.,251500000.0,291600000
53,How to Train Your Dragon,P/DW,217600000.0,277300000
55,The Chronicles of Narnia: The Voyage of the Da...,Fox,104400000.0,311300000
57,The King's Speech,Wein.,135500000.0,275400000


In [339]:
#df = df.drop('index', axis = 1)
df
#df = df.drop('level_0', axis = 1)
df

Unnamed: 0,title,studio,domestic_gross,foreign_gross
0,Toy Story 3,BV,415000000.0,652000000
1,Inception,WB,292600000.0,535700000
2,Shrek Forever After,P/DW,238700000.0,513900000
3,The Twilight Saga: Eclipse,Sum.,300500000.0,398000000
4,Iron Man 2,Par.,312400000.0,311500000
5,Tangled,BV,200800000.0,391000000
6,Despicable Me,Uni.,251500000.0,291600000
7,How to Train Your Dragon,P/DW,217600000.0,277300000
8,The Chronicles of Narnia: The Voyage of the Da...,Fox,104400000.0,311300000
9,The King's Speech,Wein.,135500000.0,275400000
