# MICROSOFT'S NEW MOVIE STUDIO (Exploratory Data Analysis)

# 1. Defining the Question
 
   ### a) Specifying the Data Analytic Question
   
The movie industry is characterized by different genres, including animation, action, and comedy. When producing a movie, it is important to decide which genre to enter, and one way to do so is to determine which movies perform best in the market. Several aspects can be used to measure how genres perform. The question for this analysis is: what are the top-performing movie genres in terms of box office revenue, and how can this information guide Microsoft's new movie studio in selecting the most promising genre for its film production?
  
  ### b) Defining the Metrics for Success
  
The objective is to identify the movie genres that have performed exceptionally well in terms of box office revenue. By analyzing and comparing revenue figures across genres, we can determine which genres have a higher likelihood of commercial success. Additionally, other metrics can be used to evaluate success, such as the involvement of popular actors in highly rated movies and the directors and writers with the best-rated films.
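The comparison described above could be sketched as follows. This is a minimal illustration on made-up numbers, assuming genres arrive as pipe-separated strings (as in the Rotten Tomatoes `genre` column shown later); the helper name `revenue_by_genre` and the choice of the median as the summary statistic are assumptions, not part of the original analysis.

```python
import pandas as pd

# Toy data, invented for illustration only; genres are pipe-separated
movies = pd.DataFrame({
    "genre": ["Drama|Romance", "Action and Adventure|Drama", "Comedy", "Comedy|Romance"],
    "box_office": [10_000_000, 50_000_000, 20_000_000, 40_000_000],
})

def revenue_by_genre(df, genre_col="genre", revenue_col="box_office"):
    """Median revenue and movie count per individual genre, highest median first."""
    exploded = (
        df.dropna(subset=[genre_col, revenue_col])
          # turn "Drama|Romance" into ["Drama", "Romance"], then one row per genre
          .assign(**{genre_col: lambda d: d[genre_col].str.split("|", regex=False)})
          .explode(genre_col)
    )
    return (
        exploded.groupby(genre_col)[revenue_col]
                .agg(["median", "count"])
                .sort_values("median", ascending=False)
    )

genre_summary = revenue_by_genre(movies)
print(genre_summary)
```

The median is used here because box office figures tend to be skewed by a few very large releases; the count column helps flag genres whose ranking rests on very few movies.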

### c) Understanding the context 
We will consider the following key aspects:
 1. Market dynamics: Get an understanding of the current state of the movie industry, including market trends, audience preferences and competition. This will help identify opportunities and challenges.
 2. Business goals: Understand Microsoft's goals and objectives for entering the movie industry.
 3. External factors: Familiarize ourselves with any relevant regulations or policies that may impact the movie production and distribution process.
 
 ### d) Recording the Experimental Design
 1. Data collection
 2. Data preprocessing
 3. Exploratory data analysis (EDA)
 4. Genre performance analysis
 5. Comparative analysis
 6. Insights and recommendations
 7. Sensitivity analysis 
 8. Reporting and Visualization
 
 ### e) Data Relevance
 1. Data selection. Keep only the data that is relevant to our analysis.
 2. Data quality. Check the completeness, accuracy, consistency and reliability of the data.
 3. Data scope. Consider the scope of the data in terms of the time period it covers.
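The quality checks listed above could be sketched with a small helper like the one below. The function name `quality_report` and the sample frame are hypothetical; the idea is simply to summarize completeness per column and count exact duplicate rows before deciding how to clean each data set.

```python
import pandas as pd

def quality_report(df):
    """Per-column dtype, missing count, missing share (%) and distinct-value count."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "missing": df.isna().sum(),
        "missing_pct": (df.isna().mean() * 100).round(2),
        "n_unique": df.nunique(),  # NaN is not counted as a distinct value
    })

# Small made-up sample standing in for one of the movie data sets
sample = pd.DataFrame({
    "title": ["A", "B", "C", "C"],
    "domestic_gross": [1.0, None, 3.0, 3.0],
})

report = quality_report(sample)
duplicate_rows = int(sample.duplicated().sum())  # exact duplicate rows
print(report)
print("duplicate rows:", duplicate_rows)
```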


## 2. IMPORTING RELEVANT LIBRARIES

In [66]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import sqlite3
import warnings
warnings.filterwarnings('ignore')

## 3. LOADING THE DATA

In [3]:
# loading data of bom gross incomes
bom_gross = pd.read_csv('data/bom.movie_gross.csv.gz')
bom_gross.head()

Unnamed: 0,title,studio,domestic_gross,foreign_gross,year
0,Toy Story 3,BV,415000000.0,652000000,2010
1,Alice in Wonderland (2010),BV,334200000.0,691300000,2010
2,Harry Potter and the Deathly Hallows Part 1,WB,296000000.0,664300000,2010
3,Inception,WB,292600000.0,535700000,2010
4,Shrek Forever After,P/DW,238700000.0,513900000,2010


In [None]:
bom_gross.info()

In [4]:
# loading data of cast names
cast_names = pd.read_csv('data/imdb.name.basics.csv.gz')
cast_names

Unnamed: 0,nconst,primary_name,birth_year,death_year,primary_profession,known_for_titles
0,nm0061671,Mary Ellen Bauder,,,"miscellaneous,production_manager,producer","tt0837562,tt2398241,tt0844471,tt0118553"
1,nm0061865,Joseph Bauer,,,"composer,music_department,sound_department","tt0896534,tt6791238,tt0287072,tt1682940"
2,nm0062070,Bruce Baum,,,"miscellaneous,actor,writer","tt1470654,tt0363631,tt0104030,tt0102898"
3,nm0062195,Axel Baumann,,,"camera_department,cinematographer,art_department","tt0114371,tt2004304,tt1618448,tt1224387"
4,nm0062798,Pete Baxter,,,"production_designer,art_department,set_decorator","tt0452644,tt0452692,tt3458030,tt2178256"
...,...,...,...,...,...,...
606643,nm9990381,Susan Grobes,,,actress,
606644,nm9990690,Joo Yeon So,,,actress,"tt9090932,tt8737130"
606645,nm9991320,Madeline Smith,,,actress,"tt8734436,tt9615610"
606646,nm9991786,Michelle Modigliani,,,producer,


In [5]:
# loading data for movie titles
movie_title = pd.read_csv('data/imdb.title.akas.csv.gz')
movie_title.head()

Unnamed: 0,title_id,ordering,title,region,language,types,attributes,is_original_title
0,tt0369610,10,Джурасик свят,BG,bg,,,0.0
1,tt0369610,11,Jurashikku warudo,JP,,imdbDisplay,,0.0
2,tt0369610,12,Jurassic World: O Mundo dos Dinossauros,BR,,imdbDisplay,,0.0
3,tt0369610,13,O Mundo dos Dinossauros,BR,,,short title,0.0
4,tt0369610,14,Jurassic World,FR,,imdbDisplay,,0.0


In [6]:
# loading data for directors and writers
crew_information = pd.read_csv('data/imdb.title.crew.csv.gz')
crew_information.head()

Unnamed: 0,tconst,directors,writers
0,tt0285252,nm0899854,nm0899854
1,tt0438973,,"nm0175726,nm1802864"
2,tt0462036,nm1940585,nm1940585
3,tt0835418,nm0151540,"nm0310087,nm0841532"
4,tt0878654,"nm0089502,nm2291498,nm2292011",nm0284943
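Because `directors` and `writers` hold comma-separated `nconst` identifiers, linking them to names or ratings usually requires one row per person. A minimal sketch of that reshaping, using three of the rows shown above (the frame name `directors_long` is an assumption):

```python
import pandas as pd

# Three rows copied from the crew output above
crew = pd.DataFrame({
    "tconst": ["tt0285252", "tt0438973", "tt0878654"],
    "directors": ["nm0899854", None, "nm0089502,nm2291498,nm2292011"],
})

directors_long = (
    crew.dropna(subset=["directors"])                            # skip titles without a director
        .assign(directors=lambda d: d["directors"].str.split(","))
        .explode("directors")                                    # one row per (title, director)
        .rename(columns={"directors": "nconst"})
        .reset_index(drop=True)
)
print(directors_long)
```

The resulting `nconst` column can then be matched against the `nconst` column of the names data set.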


In [7]:
# loading data of individuals involved and their roles
principals_information = pd.read_csv('data/imdb.title.principals.csv.gz')
principals_information.head()

Unnamed: 0,tconst,ordering,nconst,category,job,characters
0,tt0111414,1,nm0246005,actor,,"[""The Man""]"
1,tt0111414,2,nm0398271,director,,
2,tt0111414,3,nm3739909,producer,producer,
3,tt0323808,10,nm0059247,editor,,
4,tt0323808,1,nm3579312,actress,,"[""Beth Boothby""]"


In [50]:
# loading data for movie ratings
movie_ratings = pd.read_csv('data/imdb.title.ratings.csv.gz')
movie_ratings.head()

Unnamed: 0,tconst,averagerating,numvotes
0,tt10356526,8.3,31
1,tt10384606,8.9,559
2,tt1042974,6.4,20
3,tt1043726,4.2,50352
4,tt1060240,6.5,21


In [51]:
# loading data for movie information
movie_info = pd.read_csv('data/rt.movie_info.tsv.gz', delimiter='\t')
movie_info.head()

Unnamed: 0,id,synopsis,rating,genre,director,writer,theater_date,dvd_date,currency,box_office,runtime,studio
0,1,"This gritty, fast-paced, and innovative police...",R,Action and Adventure|Classics|Drama,William Friedkin,Ernest Tidyman,"Oct 9, 1971","Sep 25, 2001",,,104 minutes,
1,3,"New York City, not-too-distant-future: Eric Pa...",R,Drama|Science Fiction and Fantasy,David Cronenberg,David Cronenberg|Don DeLillo,"Aug 17, 2012","Jan 1, 2013",$,600000.0,108 minutes,Entertainment One
2,5,Illeana Douglas delivers a superb performance ...,R,Drama|Musical and Performing Arts,Allison Anders,Allison Anders,"Sep 13, 1996","Apr 18, 2000",,,116 minutes,
3,6,Michael Douglas runs afoul of a treacherous su...,R,Drama|Mystery and Suspense,Barry Levinson,Paul Attanasio|Michael Crichton,"Dec 9, 1994","Aug 27, 1997",,,128 minutes,
4,7,,NR,Drama|Romance,Rodney Bennett,Giles Cooper,,,,,200 minutes,


In [52]:
# loading data for ratings and reviews
ratings_reviews = pd.read_csv('data/rt.reviews.tsv.gz', delimiter='\t', encoding='latin-1')
ratings_reviews.head()

Unnamed: 0,id,review,rating,fresh,critic,top_critic,publisher,date
0,3,A distinctly gallows take on contemporary fina...,3/5,fresh,PJ Nabarro,0,Patrick Nabarro,"November 10, 2018"
1,3,It's an allegory in search of a meaning that n...,,rotten,Annalee Newitz,0,io9.com,"May 23, 2018"
2,3,... life lived in a bubble in financial dealin...,,fresh,Sean Axmaker,0,Stream on Demand,"January 4, 2018"
3,3,Continuing along a line introduced in last yea...,,fresh,Daniel Kasman,0,MUBI,"November 16, 2017"
4,3,... a perverse twist on neorealism...,,fresh,,0,Cinema Scope,"October 12, 2017"


In [40]:
tmdb_movies = pd.read_csv('data/tmdb.movies.csv.gz', index_col=0)
tmdb_movies

Unnamed: 0,genre_ids,id,original_language,original_title,popularity,release_date,title,vote_average,vote_count
0,"[12, 14, 10751]",12444,en,Harry Potter and the Deathly Hallows: Part 1,33.533,2010-11-19,Harry Potter and the Deathly Hallows: Part 1,7.7,10788
1,"[14, 12, 16, 10751]",10191,en,How to Train Your Dragon,28.734,2010-03-26,How to Train Your Dragon,7.7,7610
2,"[12, 28, 878]",10138,en,Iron Man 2,28.515,2010-05-07,Iron Man 2,6.8,12368
3,"[16, 35, 10751]",862,en,Toy Story,28.005,1995-11-22,Toy Story,7.9,10174
4,"[28, 878, 12]",27205,en,Inception,27.920,2010-07-16,Inception,8.3,22186
...,...,...,...,...,...,...,...,...,...
26512,"[27, 18]",488143,en,Laboratory Conditions,0.600,2018-10-13,Laboratory Conditions,0.0,1
26513,"[18, 53]",485975,en,_EXHIBIT_84xxx_,0.600,2018-05-01,_EXHIBIT_84xxx_,0.0,1
26514,"[14, 28, 12]",381231,en,The Last One,0.600,2018-10-01,The Last One,0.0,1
26515,"[10751, 12, 28]",366854,en,Trailer Made,0.600,2018-06-22,Trailer Made,0.0,1
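The `genre_ids` column is read from CSV as text that looks like a Python list, so it needs parsing before any per-genre analysis. The sketch below assumes that representation and uses a partial id-to-name lookup; the mapping is an assumption based on TMDB's published genre list and should be checked against the TMDB API before use.

```python
import ast
import pandas as pd

# Partial lookup (assumption: ids follow TMDB's genre list); extend as needed
GENRE_NAMES = {12: "Adventure", 14: "Fantasy", 16: "Animation", 28: "Action",
               35: "Comedy", 878: "Science Fiction", 10751: "Family"}

# Two rows copied from the output above
tmdb = pd.DataFrame({
    "title": ["Iron Man 2", "Toy Story"],
    "genre_ids": ["[12, 28, 878]", "[16, 35, 10751]"],
})

# Safely evaluate the string "[12, 28, 878]" into the list [12, 28, 878]
tmdb["genre_ids"] = tmdb["genre_ids"].apply(ast.literal_eval)

# Translate ids to names, keeping unknown ids visible instead of dropping them
tmdb["genres"] = tmdb["genre_ids"].apply(
    lambda ids: [GENRE_NAMES.get(i, f"unknown({i})") for i in ids]
)
print(tmdb[["title", "genres"]])
```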


In [53]:
# loading data for movie budget
movie_budget = pd.read_csv('data/tn.movie_budgets.csv.gz')
movie_budget.head()

Unnamed: 0,id,release_date,movie,production_budget,domestic_gross,worldwide_gross
0,1,"Dec 18, 2009",Avatar,"$425,000,000","$760,507,625","$2,776,345,279"
1,2,"May 20, 2011",Pirates of the Caribbean: On Stranger Tides,"$410,600,000","$241,063,875","$1,045,663,875"
2,3,"Jun 7, 2019",Dark Phoenix,"$350,000,000","$42,762,350","$149,762,350"
3,4,"May 1, 2015",Avengers: Age of Ultron,"$330,600,000","$459,005,868","$1,403,013,963"
4,5,"Dec 15, 2017",Star Wars Ep. VIII: The Last Jedi,"$317,000,000","$620,181,382","$1,316,721,747"
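The budget and gross columns above are strings with a dollar sign and thousands separators, so they cannot be compared or summed as they are. A minimal sketch of the conversion, using two rows from the output above (the helper `money_to_int` and the derived column `gross_minus_budget` are hypothetical names):

```python
import pandas as pd

def money_to_int(series):
    """Strip '$' and ',' from strings such as '$425,000,000' and return int64 values."""
    cleaned = series.str.replace(r"[$,]", "", regex=True)
    return pd.to_numeric(cleaned).astype("int64")

# Two rows copied from the budget output above
budgets = pd.DataFrame({
    "movie": ["Avatar", "Dark Phoenix"],
    "production_budget": ["$425,000,000", "$350,000,000"],
    "worldwide_gross": ["$2,776,345,279", "$149,762,350"],
})

for col in ["production_budget", "worldwide_gross"]:
    budgets[col] = money_to_int(budgets[col])

# Simple difference, not a true profit figure (marketing and other costs are ignored)
budgets["gross_minus_budget"] = budgets["worldwide_gross"] - budgets["production_budget"]
print(budgets)
```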


## 4. DATA WRANGLING

  ### 4.1 Dropping columns
  
Out of the several datasets that were collected, only some features and rows are relevant to this process. Therefore, in this step, the features that are not required from each dataset will be dropped. The remaining datasets will then be joined.

In [25]:
# dropping the studio column from the data set
new_bom_gross = bom_gross.drop("studio", axis=1)
new_bom_gross.head()

Unnamed: 0,title,domestic_gross,foreign_gross,year
0,Toy Story 3,415000000.0,652000000,2010
1,Alice in Wonderland (2010),334200000.0,691300000,2010
2,Harry Potter and the Deathly Hallows Part 1,296000000.0,664300000,2010
3,Inception,292600000.0,535700000,2010
4,Shrek Forever After,238700000.0,513900000,2010


In [26]:
new_bom_gross.shape


(3387, 4)

The new data set has 3387 rows and 4 columns

In [27]:
# selecting the relevant columns of the cast
new_cast_names = cast_names[['nconst', 'primary_name', 'known_for_titles']]
new_cast_names.head()

Unnamed: 0,nconst,primary_name,known_for_titles
0,nm0061671,Mary Ellen Bauder,"tt0837562,tt2398241,tt0844471,tt0118553"
1,nm0061865,Joseph Bauer,"tt0896534,tt6791238,tt0287072,tt1682940"
2,nm0062070,Bruce Baum,"tt1470654,tt0363631,tt0104030,tt0102898"
3,nm0062195,Axel Baumann,"tt0114371,tt2004304,tt1618448,tt1224387"
4,nm0062798,Pete Baxter,"tt0452644,tt0452692,tt3458030,tt2178256"


In [29]:
new_cast_names.shape

(606648, 3)

The new data set has 606648 rows and 3 columns

In [20]:
# select the relevant columns
new_movie_title = movie_title[['title_id', 'ordering', 'title', 'is_original_title']]
new_movie_title.head()

Unnamed: 0,title_id,ordering,title,is_original_title
0,tt0369610,10,Джурасик свят,0.0
1,tt0369610,11,Jurashikku warudo,0.0
2,tt0369610,12,Jurassic World: O Mundo dos Dinossauros,0.0
3,tt0369610,13,O Mundo dos Dinossauros,0.0
4,tt0369610,14,Jurassic World,0.0


In [30]:
new_movie_title.shape

(331703, 4)

The new data set has 331703 rows and 4 columns

In [78]:
# dropping the irrelevant columns from the data set
new_principals_information = principals_information.drop(['characters','job'], axis=1)
new_principals_information.head()

Unnamed: 0,tconst,ordering,nconst,category
0,tt0111414,1,nm0246005,actor
1,tt0111414,2,nm0398271,director
2,tt0111414,3,nm3739909,producer
3,tt0323808,10,nm0059247,editor
4,tt0323808,1,nm3579312,actress


In [32]:
new_principals_information.shape

(1028186, 4)

The new data set has 1028186 rows and 4 columns

In [24]:
# selecting useful columns
new_movie_info = movie_info[['id', 'genre', 'rating', 'director', 'writer', 'currency', 'box_office']]
new_movie_info.head()

Unnamed: 0,id,genre,rating,director,writer,currency,box_office
0,1,Action and Adventure|Classics|Drama,R,William Friedkin,Ernest Tidyman,,
1,3,Drama|Science Fiction and Fantasy,R,David Cronenberg,David Cronenberg|Don DeLillo,$,600000.0
2,5,Drama|Musical and Performing Arts,R,Allison Anders,Allison Anders,,
3,6,Drama|Mystery and Suspense,R,Barry Levinson,Paul Attanasio|Michael Crichton,,
4,7,Drama|Romance,NR,Rodney Bennett,Giles Cooper,,


In [34]:
new_movie_info.shape

(1560, 7)

The new data set has 1560 rows and 7 columns

In [23]:
# selecting relevant columns for ratings and reviews
new_ratings_reviews = ratings_reviews[['id', 'rating', 'top_critic', 'date']]
new_ratings_reviews

Unnamed: 0,id,rating,top_critic,date
0,3,3/5,0,"November 10, 2018"
1,3,,0,"May 23, 2018"
2,3,,0,"January 4, 2018"
3,3,,0,"November 16, 2017"
4,3,,0,"October 12, 2017"
...,...,...,...,...
54427,2000,,1,"September 24, 2002"
54428,2000,1/5,0,"September 21, 2005"
54429,2000,2/5,0,"July 17, 2005"
54430,2000,2.5/5,0,"September 7, 2003"


In [37]:
new_ratings_reviews.shape

(54432, 4)

The new data set has 54432 rows and 4 columns

In [44]:
new_tmdb_movies = tmdb_movies.drop('original_language', axis=1)
new_tmdb_movies

Unnamed: 0,genre_ids,id,original_title,popularity,release_date,title,vote_average,vote_count
0,"[12, 14, 10751]",12444,Harry Potter and the Deathly Hallows: Part 1,33.533,2010-11-19,Harry Potter and the Deathly Hallows: Part 1,7.7,10788
1,"[14, 12, 16, 10751]",10191,How to Train Your Dragon,28.734,2010-03-26,How to Train Your Dragon,7.7,7610
2,"[12, 28, 878]",10138,Iron Man 2,28.515,2010-05-07,Iron Man 2,6.8,12368
3,"[16, 35, 10751]",862,Toy Story,28.005,1995-11-22,Toy Story,7.9,10174
4,"[28, 878, 12]",27205,Inception,27.920,2010-07-16,Inception,8.3,22186
...,...,...,...,...,...,...,...,...
26512,"[27, 18]",488143,Laboratory Conditions,0.600,2018-10-13,Laboratory Conditions,0.0,1
26513,"[18, 53]",485975,_EXHIBIT_84xxx_,0.600,2018-05-01,_EXHIBIT_84xxx_,0.0,1
26514,"[14, 28, 12]",381231,The Last One,0.600,2018-10-01,The Last One,0.0,1
26515,"[10751, 12, 28]",366854,Trailer Made,0.600,2018-06-22,Trailer Made,0.0,1


In [45]:
new_tmdb_movies.shape

(26517, 8)

The new data set has 26517 rows and 8 columns

Now that we have obtained all the required information from each data set, we can format the data sets.

### 4.2 Formatting Datatypes.

The movies datasets are in different formats so we explore them to get an understanding of the datatypes and determine how to join them.
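As a rough sketch of how the IMDB tables might be joined, the ratings key `tconst` can be matched to `title_id` in the titles table. The frames below are tiny made-up stand-ins, and the choice of a left join is an assumption. Data sets without a shared identifier (for example the Box Office Mojo table, which only carries `title`) would need title-based matching instead.

```python
import pandas as pd

# Made-up stand-ins for movie_ratings and new_movie_title
ratings = pd.DataFrame({"tconst": ["tt1", "tt2", "tt3"],
                        "averagerating": [8.3, 6.4, 4.2]})
titles = pd.DataFrame({"title_id": ["tt1", "tt2", "tt4"],
                       "title": ["A", "B", "D"]})

merged = ratings.merge(
    titles,
    left_on="tconst", right_on="title_id",
    how="left",              # keep every rated movie, matched or not
    validate="one_to_many",  # fail loudly if tconst is duplicated in ratings
    indicator=True,          # adds a _merge column describing each row's origin
)

# Rated movies that found no title record
unmatched = merged[merged["_merge"] == "left_only"]
print(merged)
```

Checking the `_merge` column before continuing shows how many rows are lost or left incomplete by the join.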

#### 4.2.1 Exploring the IMDB and Rotten Tomatoes data

In [58]:
# info must be called with parentheses to print the column summary
new_cast_names.info()

In [60]:
# Generating descriptive statistics (count, unique, top, freq) of the text columns
new_cast_names.describe() 


Unnamed: 0,nconst,primary_name,known_for_titles
count,606648,606648,576444
unique,606648,577203,482207
top,nm0061671,Michael Brown,tt4773466
freq,1,16,45


The count for known_for_titles is 576,444, lower than the 606,648 rows in the data set, which implies that 30,204 values in this column are missing.

In [67]:
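# Replace missing values in the known_for_titles column with the placeholder "Unknown"
# (assign returns a new frame, which avoids chained in-place modification of a column selection)
new_cast_names = new_cast_names.assign(
    known_for_titles=new_cast_names['known_for_titles'].fillna("Unknown")
)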


In [65]:
new_cast_names.describe()

Unnamed: 0,nconst,primary_name,known_for_titles
count,606648,606648,606648
unique,606648,577203,482208
top,nm0061671,Michael Brown,Unknown
freq,1,16,30204


The count is now 606648 for every column, which matches the number of rows, so this data set no longer has any missing values.

In [69]:
new_movie_title.describe()

Unnamed: 0,ordering,is_original_title
count,331703.0,331678.0
mean,5.125872,0.134769
std,6.706664,0.341477
min,1.0,0.0
25%,1.0,0.0
50%,2.0,0.0
75%,6.0,0.0
max,61.0,1.0


We can see that the count of is_original_title (331678) is lower than the total row count (331703), indicating the presence of missing values.

In [71]:
# checking for the number of missing values in is_original_title to determine how this may affect our analysis
new_movie_title['is_original_title'].isnull().sum()


25

In [72]:
# replacing the missing values with the mean, since only 25 values are missing
# (note: the mean, about 0.13, is not itself a valid 0/1 flag value)
mean_value = new_movie_title['is_original_title'].mean()
new_movie_title['is_original_title'] = new_movie_title['is_original_title'].fillna(mean_value)
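Since `is_original_title` is a 0/1 flag, a mean fill introduces a third, fractional value. One alternative, shown here only as a sketch on made-up data, is to fill with the most frequent value (the mode) so the column stays binary; which option is appropriate depends on how the flag is used later.

```python
import pandas as pd

# Made-up flag column with one missing entry
flags = pd.DataFrame({"is_original_title": [0.0, 0.0, 1.0, None, 0.0]})

# Mean fill: the gap becomes 0.25, which is neither 0 nor 1
mean_fill = flags["is_original_title"].fillna(flags["is_original_title"].mean())

# Mode fill: the gap becomes the most common flag value, so the column stays binary
most_common = flags["is_original_title"].mode().iloc[0]
mode_fill = flags["is_original_title"].fillna(most_common).astype(int)

print(mean_fill.tolist())
print(mode_fill.tolist())
```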


In [84]:
new_movie_title.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 331703 entries, 0 to 331702
Data columns (total 4 columns):
 #   Column             Non-Null Count   Dtype  
---  ------             --------------   -----  
 0   title_id           331703 non-null  object 
 1   ordering           331703 non-null  int64  
 2   title              331702 non-null  object 
 3   is_original_title  331703 non-null  float64
dtypes: float64(1), int64(1), object(2)
memory usage: 10.1+ MB


The is_original_title column is now fully populated; the title column still has one missing value (331702 of 331703 non-null).

In [83]:
# checking for missing values in the columns
new_principals_information.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1028186 entries, 0 to 1028185
Data columns (total 4 columns):
 #   Column    Non-Null Count    Dtype 
---  ------    --------------    ----- 
 0   tconst    1028186 non-null  object
 1   ordering  1028186 non-null  int64 
 2   nconst    1028186 non-null  object
 3   category  1028186 non-null  object
dtypes: int64(1), object(3)
memory usage: 31.4+ MB


There are no missing values in any of the columns.

In [85]:
# displaying the information of the data set
new_movie_info.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1560 entries, 0 to 1559
Data columns (total 7 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   id          1560 non-null   int64 
 1   genre       1552 non-null   object
 2   rating      1557 non-null   object
 3   director    1361 non-null   object
 4   writer      1111 non-null   object
 5   currency    340 non-null    object
 6   box_office  340 non-null    object
dtypes: int64(1), object(6)
memory usage: 85.4+ KB


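The column types are consistent, but the data set is not complete: genre, rating, director and writer all have missing values, and only 340 of the 1560 rows carry a currency and box_office figure.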

In [86]:
new_ratings_reviews.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 54432 entries, 0 to 54431
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   id          54432 non-null  int64 
 1   rating      40915 non-null  object
 2   top_critic  54432 non-null  int64 
 3   date        54432 non-null  object
dtypes: int64(2), object(2)
memory usage: 1.7+ MB
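The `rating` column is stored as text such as `3/5` or `2.5/5`, and about a quarter of the entries are missing (40915 of 54432 are non-null). A possible way to put such ratings on a common 0 to 1 scale is sketched below; the helper `rating_to_fraction` is hypothetical, and treating anything that is not of the form `x/y` (for instance a letter grade) as missing is an assumption.

```python
import pandas as pd

def rating_to_fraction(value):
    """Convert 'x/y' strings to x / y; return None for anything else."""
    if not isinstance(value, str) or "/" not in value:
        return None
    num, _, den = value.partition("/")
    try:
        num, den = float(num), float(den)
    except ValueError:
        return None
    return num / den if den else None  # guard against a zero denominator

# Made-up sample; "B+" stands in for any non-fractional rating format
reviews = pd.DataFrame({"rating": ["3/5", None, "2.5/5", "B+", "1/5"]})
reviews["rating_frac"] = pd.to_numeric(reviews["rating"].map(rating_to_fraction))
print(reviews)
```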


#### 4.2.2 Exploring **TMDB** movies data.

In [88]:
# Exploring the TMDB movies dataset
new_tmdb_movies.info()

<class 'pandas.core.frame.DataFrame'>
Index: 26517 entries, 0 to 26516
Data columns (total 8 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   genre_ids       26517 non-null  object 
 1   id              26517 non-null  int64  
 2   original_title  26517 non-null  object 
 3   popularity      26517 non-null  float64
 4   release_date    26517 non-null  object 
 5   title           26517 non-null  object 
 6   vote_average    26517 non-null  float64
 7   vote_count      26517 non-null  int64  
dtypes: float64(2), int64(2), object(4)
memory usage: 1.8+ MB


Based on the information provided there are no missing values.

#### 4.2.3 Exploring income generated by movies in the BOM data.

In [89]:
new_bom_gross.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3387 entries, 0 to 3386
Data columns (total 4 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   title           3387 non-null   object 
 1   domestic_gross  3359 non-null   float64
 2   foreign_gross   2037 non-null   object 
 3   year            3387 non-null   int64  
dtypes: float64(1), int64(1), object(2)
memory usage: 106.0+ KB


In [90]:
new_bom_gross.head()

Unnamed: 0,title,domestic_gross,foreign_gross,year
0,Toy Story 3,415000000.0,652000000,2010
1,Alice in Wonderland (2010),334200000.0,691300000,2010
2,Harry Potter and the Deathly Hallows Part 1,296000000.0,664300000,2010
3,Inception,292600000.0,535700000,2010
4,Shrek Forever After,238700000.0,513900000,2010


In [94]:
new_bom_gross['domestic_gross']

0       415000000.0
1       334200000.0
2       296000000.0
3       292600000.0
4       238700000.0
           ...     
3382         6200.0
3383         4800.0
3384         2500.0
3385         2400.0
3386         1700.0
Name: domestic_gross, Length: 3387, dtype: float64

There are missing values in two columns: domestic_gross (float64) and foreign_gross (object). Converting both to numeric types lets us calculate their measures of central tendency and replace the missing values appropriately.

Both columns contain missing (non-finite) values, which block the conversion, so we first replace them with the column mode. The foreign_gross values also contain commas, which have to be removed before the conversion.

In [101]:
# Fill missing values in 'domestic_gross' with the column mode
domestic_mode = new_bom_gross['domestic_gross'].mode().iloc[0]
new_bom_gross['domestic_gross'] = new_bom_gross['domestic_gross'].fillna(domestic_mode)

# Fill missing values in 'foreign_gross' with the column mode
foreign_mode = new_bom_gross['foreign_gross'].mode().iloc[0]
new_bom_gross['foreign_gross'] = new_bom_gross['foreign_gross'].fillna(foreign_mode)

# 'domestic_gross' is already float64, so it can be cast to integer directly
# (going through strings such as '415000000.0' would make the integer cast fail)
new_bom_gross['domestic_gross'] = new_bom_gross['domestic_gross'].astype(int)

# 'foreign_gross' is stored as text: remove the commas, then convert to float
new_bom_gross['foreign_gross'] = new_bom_gross['foreign_gross'].astype(str).str.replace(',', '').astype(float)
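For comparison, the same cleaning could be sketched with `pd.to_numeric`, converting first and filling afterwards. The sample values are made up, and the median fill is an assumption swapped in for the mode used above: for revenue figures the mode is often just whichever amount happens to repeat, whereas the median reflects a typical movie.

```python
import pandas as pd

# Made-up sample mirroring the two gross columns
gross = pd.DataFrame({
    "domestic_gross": [415000000.0, None, 1700.0],
    "foreign_gross": ["652000000", "1,131.6", None],
})

# Remove thousands separators, then convert; unparseable entries become NaN
gross["foreign_gross"] = pd.to_numeric(
    gross["foreign_gross"].str.replace(",", "", regex=False), errors="coerce"
)

# Fill the remaining gaps with each column's median (assumption, see above)
for col in ["domestic_gross", "foreign_gross"]:
    gross[col] = gross[col].fillna(gross[col].median())

print(gross)
print(gross.dtypes)
```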


In [102]:
new_bom_gross.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3387 entries, 0 to 3386
Data columns (total 4 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   title           3387 non-null   object 
 1   domestic_gross  3387 non-null   int32  
 2   foreign_gross   3387 non-null   float64
 3   year            3387 non-null   int64  
dtypes: float64(1), int32(1), int64(1), object(1)
memory usage: 92.7+ KB
