In [124]:
import pandas as pd
import numpy as np

# General Questions to self.
- Should we be writing/saving the cleaned versions to new CSVs, etc.? Confer with Kevin after cleaning?
- Could we create a column with the genres, so we can compute gross per genre?

# 1. Loading and cleaning Bom gross

In [48]:
Bom = pd.read_csv("Data/bom.movie_gross.csv")

In [49]:
Bom.head()
Bom.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3387 entries, 0 to 3386
Data columns (total 5 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   title           3387 non-null   object 
 1   studio          3382 non-null   object 
 2   domestic_gross  3359 non-null   float64
 3   foreign_gross   2037 non-null   object 
 4   year            3387 non-null   int64  
dtypes: float64(1), int64(1), object(3)
memory usage: 132.4+ KB


# Comments on missing values for BOM Movie Gross

- 5 missing values in studio (will most likely remove, since they're only 5 out of 3387)
    - one of the movies with a missing studio grossed $100m+, so dropping it could pull a big gross out of the foreign category
- 28 missing values in domestic gross (turn to 0)
- 1350 missing values in foreign gross (turn to 0)
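
The counts above were read off `Bom.info()` by hand. A minimal sketch of getting the same summary programmatically, on a made-up toy frame (the `toy` and `summary` names are assumptions for illustration, not part of the notebook):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for Bom; the values are made up for illustration.
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "studio": ["X", None, "Y", "Z"],
    "domestic_gross": [1.0, np.nan, 3.0, 4.0],
})

# Count nulls per column and express them as a share of all rows.
missing = toy.isna().sum()
summary = pd.DataFrame({
    "missing": missing,
    "pct_missing": (missing / len(toy) * 100).round(1),
})
print(summary)
```

Running the same two lines on `Bom` should reproduce the 5 / 28 / 1350 figures without subtracting by hand.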

In [50]:
# foreign_gross is stored as str and has to be converted to float. At least one value
# contains a comma, so strip commas first, then cast to float, then replace all nulls
# with 0 to denote that those movies didn't have any foreign gross.
# Could do a geographical based analysis??
Bom.loc[Bom['foreign_gross'].isna()]
Bom['foreign_gross'] = Bom['foreign_gross'].str.replace(',','')
Bom['foreign_gross'] = Bom['foreign_gross'].astype(float)
Bom['foreign_gross'] = Bom['foreign_gross'].fillna(0)
Bom.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3387 entries, 0 to 3386
Data columns (total 5 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   title           3387 non-null   object 
 1   studio          3382 non-null   object 
 2   domestic_gross  3359 non-null   float64
 3   foreign_gross   3387 non-null   float64
 4   year            3387 non-null   int64  
dtypes: float64(2), int64(1), object(2)
memory usage: 132.4+ KB
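
An alternative to the replace / `astype(float)` / `fillna` chain is `pd.to_numeric` with `errors="coerce"`, which would also survive any other stray non-numeric strings. A sketch on made-up values (the `raw` and `cleaned` names are hypothetical):

```python
import pandas as pd

# Toy values standing in for foreign_gross: plain digits, a comma-formatted
# number, and a missing entry. Illustrative only, not taken from the file.
raw = pd.Series(["652000000", "1,131.0", None], dtype="object")

# Strip thousands separators first, then convert; errors="coerce" turns
# anything still unparseable into NaN instead of raising.
cleaned = pd.to_numeric(raw.str.replace(",", "", regex=False), errors="coerce")

# Treat missing foreign gross as 0, matching the choice made in this notebook.
cleaned = cleaned.fillna(0)
print(cleaned)
```

The trade-off is that `errors="coerce"` silently hides bad values, so it is worth checking the null count before the `fillna(0)` step.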


In [52]:
# Domestic gross has some missing values as well, will inspect them first. Upon confirming
# that they're just foreign films, I'll replace those nulls with 0.
Bom.loc[Bom['domestic_gross'].isna()]
Bom['domestic_gross'] = Bom['domestic_gross'].fillna(0)
Bom.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3387 entries, 0 to 3386
Data columns (total 5 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   title           3387 non-null   object 
 1   studio          3382 non-null   object 
 2   domestic_gross  3387 non-null   float64
 3   foreign_gross   3387 non-null   float64
 4   year            3387 non-null   int64  
dtypes: float64(2), int64(1), object(2)
memory usage: 132.4+ KB


In [54]:
# Will create a new column, total_gross, to be able to compare the aggregate gross.
Bom['total_gross'] = Bom['domestic_gross'] + Bom['foreign_gross']

In [58]:
# Going to inspect the movies that have null studio values. Since it's only 5, going to see
# if I can fill them in just by looking them up. Decided on just dropping the 5 movies,
# since they are such a small percentage of the overall data set.
Bom.loc[Bom['studio'].isna()]
Clean_Bom = Bom.dropna(subset = ['studio'])
Clean_Bom.isna().sum()

title             0
studio            0
domestic_gross    0
foreign_gross     0
year              0
total_gross       0
dtype: int64
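
On the open question of saving cleaned versions: if we do go that route, a minimal sketch could look like the following. The file name `bom_clean.csv` and the temp directory are assumptions for illustration, and the toy frame stands in for `Clean_Bom`.

```python
import os
import tempfile

import pandas as pd

# Toy stand-in for Clean_Bom; values are made up.
clean = pd.DataFrame({
    "title": ["A", "B"],
    "studio": ["X", "Y"],
    "total_gross": [10.0, 20.0],
})

# Hypothetical output location; a temp directory keeps the sketch
# self-contained. In the notebook this would be a project folder instead.
out_dir = tempfile.mkdtemp()
out_path = os.path.join(out_dir, "bom_clean.csv")

# index=False avoids writing the row index as an extra unnamed column.
clean.to_csv(out_path, index=False)

# Reading the file back confirms the round trip.
reloaded = pd.read_csv(out_path)
print(reloaded)
```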

# 2. Loading in and cleaning RT movie info


In [158]:
Rt = pd.read_csv('Data/rt.movie_info.tsv',sep = '\t')

In [159]:
Rt.head()
#Rt.isna().sum()
Rt.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1560 entries, 0 to 1559
Data columns (total 12 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   id            1560 non-null   int64 
 1   synopsis      1498 non-null   object
 2   rating        1557 non-null   object
 3   genre         1552 non-null   object
 4   director      1361 non-null   object
 5   writer        1111 non-null   object
 6   theater_date  1201 non-null   object
 7   dvd_date      1201 non-null   object
 8   currency      340 non-null    object
 9   box_office    340 non-null    object
 10  runtime       1530 non-null   object
 11  studio        494 non-null    object
dtypes: int64(1), object(11)
memory usage: 146.4+ KB


# Missing data comments
- synopsis, only missing 62 entries
- rating missing 3 entries. After review of the rows, they will not add anything to the analysis, so we'll remove them.
- genre missing 8 values, can google or can remove since it's only eight entries. Should not be significant to delete them. After reviewing the null entries, determined they would not add anything to the analysis.
- director, missing 199 entries, as we go through the data will see how important that is
- writer, missing 449 entries
- theater_date, missing 359 entries, does not seem very important. But maybe seasonality or holidays do play a role
- dvd_date, missing 359 entries as well. Would not consider it as important as the premiere date.
- currency, missing 1220 entries. Can assume all in dollars, but need to look at values.
- box_office, missing 1220 entries as well. This will make it very difficult to run analytics by genre type. Need to dig into the nulls
- runtime, missing 30 entries
- studio, missing 1066 entries
- I don't have any movie titles....

# Now to clean the new dataframe
- Should condense genre
- Should turn the dates to date value
- Should turn the box office number into an int or float
- Should standardize runtime into minutes

In [167]:
# Cleaning up the box office numbers: strip the commas here so the str values can be cast to int later
Rt['box_office'] = Rt['box_office'].str.replace(',','')

In [164]:
# Cleaning up the theater and dvd dates to datetime format, could potentially add a column
# extracting the month val. Can see if there are any specific months that are better for movies.
Rt['theater_date'] = pd.to_datetime(Rt['theater_date'])
Rt['dvd_date'] = pd.to_datetime(Rt['dvd_date'])
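
For the month idea, a small sketch of extracting the release month once the dates are parsed. The `"Jul 24, 1971"` style format is an assumption about how the raw column looks, and the `dates`, `parsed` and `release_month` names are hypothetical.

```python
import pandas as pd

# Toy dates; the "Mon D, YYYY" format is an assumption about the raw file.
dates = pd.Series(["Jul 24, 1971", "Dec 25, 2010", None])

# errors="coerce" turns unparseable entries into NaT rather than raising.
parsed = pd.to_datetime(dates, format="%b %d, %Y", errors="coerce")

# .dt.month yields 1-12, with NaN where the date is missing.
release_month = parsed.dt.month
print(release_month)
```

A `groupby` on that month column would then show whether certain months tend to do better, for the rows that have a box office value.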

In [172]:
# To condense genres, could just have first genre title as the genre type. Think on it.
Rt['genre'].value_counts().head(20)

Drama                                               151
Comedy                                              110
Comedy|Drama                                         80
Drama|Mystery and Suspense                           67
Art House and International|Drama                    62
Action and Adventure|Drama                           42
Action and Adventure|Drama|Mystery and Suspense      40
Drama|Romance                                        35
Comedy|Romance                                       32
Horror                                               31
Art House and International|Comedy|Drama             31
Action and Adventure|Science Fiction and Fantasy     24
Comedy|Drama|Romance                                 23
Classics|Drama                                       21
Action and Adventure|Mystery and Suspense            20
Action and Adventure                                 19
Classics|Drama|Mystery and Suspense                  18
Horror|Mystery and Suspense                     
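
Two possible ways to condense the pipe-delimited genres, sketched on made-up rows: keep only the first listed genre, or explode to one row per genre. The `primary_genre`, `exploded` and `gross_by_genre` names are hypothetical, and which option fits better is still undecided.

```python
import pandas as pd

# Toy rows using the pipe-delimited genre format shown in the counts above;
# the box_office values are made up.
toy = pd.DataFrame({
    "genre": ["Comedy|Drama", "Drama", "Horror|Mystery and Suspense"],
    "box_office": [100, 200, 300],
})

# Option 1: keep only the first listed genre as the movie's primary genre.
toy["primary_genre"] = toy["genre"].str.split("|").str[0]

# Option 2: one row per (movie, genre) pair, so a movie counts toward every
# genre it lists.
exploded = toy.assign(genre=toy["genre"].str.split("|")).explode("genre")
gross_by_genre = exploded.groupby("genre")["box_office"].sum()
print(gross_by_genre)
```

Option 1 keeps one row per movie but depends on how the genres happen to be ordered; option 2 counts a movie's gross once per genre, so the per-genre sums will not add up to the overall total.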

In [175]:
# Creating a separate dataframe to clean it up. Guessing that we'll have separate clean
# versions. Clean only genre, clean only box office.

# Clean_Rt_Box contains only the entries with non-null values for box_office, so it can be
# used to analyze genre vs. box office.
Clean_Rt_Box = Rt.dropna(subset=['box_office'])
Clean_Rt_Box['box_office'] = Clean_Rt_Box['box_office'].astype(int)



A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  Clean_Rt_Box['box_office'] = Clean_Rt_Box['box_office'].astype(int)
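
The warning above appears because the frame returned by `dropna` is treated as a possible slice of `Rt`. One common way to avoid it is to take an explicit `.copy()` before assigning columns. A sketch on a toy frame (the `rt` and `clean_box` names are stand-ins, not the notebook's variables):

```python
import pandas as pd

# Toy stand-in for Rt with comma-free numeric strings and one missing value.
rt = pd.DataFrame({"id": [1, 2, 3], "box_office": ["600000", None, "41032915"]})

# .copy() makes the subset an independent frame, so later column
# assignments are unambiguous and the original frame is left untouched.
clean_box = rt.dropna(subset=["box_office"]).copy()
clean_box["box_office"] = clean_box["box_office"].astype(int)
print(clean_box.dtypes)
```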


In [185]:
Clean_Rt_Box['runtime'] = Clean_Rt_Box['runtime'].str.replace("minutes",'').str.strip()
# for 499 and 

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  Clean_Rt_Box['runtime'] = Clean_Rt_Box['runtime'].str.replace("minutes",'').str.strip()
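
Stripping the word "minutes" still leaves runtime as a string. A sketch of going all the way to a numeric column, assuming every value has the `"<n> minutes"` form (the `runtime` and `minutes` names here are hypothetical toy variables):

```python
import pandas as pd

# Toy runtimes in the "<n> minutes" form implied by the cell above.
runtime = pd.Series(["108 minutes", "95 minutes", None])

# Pull out the leading digits, then convert; missing rows stay missing.
minutes = pd.to_numeric(
    runtime.str.extract(r"(\d+)", expand=False), errors="coerce"
)
print(minutes)
```

Because of the missing row the result is float rather than int; dropping or filling the nulls first would allow an integer cast.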


In [None]:
# Will create another dataframe below with all null directors removed, so we can run
# analysis on director vs. box office.
#Clean_Rt_Director = Clean_Rt_Box.dropna(subset = ['director'])
# load budgets and see what movie titles are there, then run iteratively through the list
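
A rough sketch of what the director / box office analysis could look like once the null directors are dropped. The toy frame stands in for `Clean_Rt_Box`, and the director names, numbers and the `by_director` name are all made up.

```python
import pandas as pd

# Toy stand-in for Clean_Rt_Box; names and numbers are made up.
toy = pd.DataFrame({
    "director": ["Dir A", "Dir B", None, "Dir A"],
    "box_office": [100, 50, 75, 300],
})

# Drop rows without a director, then summarise box office per director.
by_director = (
    toy.dropna(subset=["director"])
       .groupby("director")["box_office"]
       .agg(["count", "mean", "sum"])
       .sort_values("sum", ascending=False)
)
print(by_director)
```

Including `count` alongside the mean helps flag directors whose average rests on a single film.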