## Data Cleaning  ##

This notebook reduces the number of files being used, then cleans and merges them. The goal is to have the data organized so that we can work with the following:
- genre (TMDB)
- studio (gross profits)
- cost (TN, i.e. The Numbers)
- profit (BOM, i.e. Box Office Mojo)

In [67]:
!ls zippedData

InitialDataExploration.ipynb
Untitled.ipynb
bom.movie_gross.csv.gz
imdb.name.basics.csv.gz
imdb.title.akas.csv.gz
imdb.title.basics.csv.gz
imdb.title.crew.csv.gz
imdb.title.principals.csv.gz
imdb.title.ratings.csv.gz
rt.movie_info.tsv.gz
rt.reviews.tsv.gz
tmdb.movies.csv.gz
tn.movie_budgets.csv.gz


In [68]:
import pandas as pd

movie_gross_df = pd.read_csv('zippedData/bom.movie_gross.csv.gz')
# imdb_name_basics_df = pd.read_csv('zippedData/imdb.name.basics.csv.gz')
# imdb_title_akas_df = pd.read_csv('zippedData/imdb.title.akas.csv.gz')
imdb_title_basics_df = pd.read_csv('zippedData/imdb.title.basics.csv.gz')
# imdb_title_crew_df = pd.read_csv('zippedData/imdb.title.crew.csv.gz')
# imdb_title_principals_df = pd.read_csv('zippedData/imdb.title.principals.csv.gz')
imdb_title_ratings_df = pd.read_csv('zippedData/imdb.title.ratings.csv.gz')
# rt_movie_info_df = pd.read_csv('zippedData/rt.movie_info.tsv.gz', delimiter='\t')
# rt_reviews_df = pd.read_csv('zippedData/rt.reviews.tsv.gz', delimiter = '\t', encoding='latin1')
tmdb_movies_df = pd.read_csv('zippedData/tmdb.movies.csv.gz')
tn_movie_budgets_df = pd.read_csv('zippedData/tn.movie_budgets.csv.gz')

## Suppress scientific notation ##

In [69]:
## displays floats to two decimal places
## to undo this change: pd.reset_option('^display.', silent=True)

pd.options.display.float_format = '{:.2f}'.format
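
If the two-decimal format is only wanted for a single display, a scoped alternative is `pd.option_context`, which restores the previous setting on exit. A minimal sketch, using a made-up two-row frame (`demo_df` is hypothetical, not one of the project files):

```python
import pandas as pd

# Toy frame standing in for the real data (values are illustrative only).
demo_df = pd.DataFrame({"domestic_gross": [415000000.0, 100.0]})

# The float format applies only inside the `with` block;
# the previous display setting is restored afterwards.
with pd.option_context("display.float_format", "{:.2f}".format):
    formatted = demo_df.to_string()

print(formatted)
```

This avoids having to remember the `reset_option` call later in the notebook.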

In [70]:
movie_gross_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3387 entries, 0 to 3386
Data columns (total 5 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   title           3387 non-null   object 
 1   studio          3382 non-null   object 
 2   domestic_gross  3359 non-null   float64
 3   foreign_gross   2037 non-null   object 
 4   year            3387 non-null   int64  
dtypes: float64(1), int64(1), object(3)
memory usage: 132.4+ KB


### Cleaning To Dos for Movie Gross ###
- Drop rows with null values in the studio and domestic gross fields, since there aren't many of them
- Ignore the foreign gross nulls; the info may be listed in other files, or could possibly be calculated from them
- Do a unique value check on year to make sure nothing looks odd
- Do a unique value check on studio; can we replace the abbreviations with proper names?
- Sort by domestic gross and check the tail
- Clean "(YEAR)" out of movie titles (via regular expression?)

In [71]:
movie_gross_df.dropna(subset=['studio','domestic_gross'], inplace = True)

In [72]:
movie_gross_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3356 entries, 0 to 3386
Data columns (total 5 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   title           3356 non-null   object 
 1   studio          3356 non-null   object 
 2   domestic_gross  3356 non-null   float64
 3   foreign_gross   2007 non-null   object 
 4   year            3356 non-null   int64  
dtypes: float64(1), int64(1), object(3)
memory usage: 157.3+ KB
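
Note that `foreign_gross` is still `object` dtype in the output above, so it can't be summed or sorted numerically yet. A possible conversion is sketched below. It assumes the column holds numeric strings, some with thousands separators; that is an assumption, since the raw values aren't inspected in this notebook. The `sample` frame and the `foreign_gross_num` column name are made up for illustration.

```python
import pandas as pd

# Toy stand-in for movie_gross_df (values are illustrative only).
sample = pd.DataFrame({"foreign_gross": ["652000000", "1,131.0", None]})

# Strip thousands separators (assumed to be commas), then convert.
# errors="coerce" turns anything unparseable into NaN instead of raising.
cleaned = sample["foreign_gross"].str.replace(",", "", regex=False)
sample["foreign_gross_num"] = pd.to_numeric(cleaned, errors="coerce")

print(sample)
```

Nulls stay null after the conversion, which fits the plan of ignoring them for now.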


In [73]:
movie_gross_df['year'].unique()

array([2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018], dtype=int64)

**NOTE THAT THIS ONLY RUNS FROM 2010-2018**

In [74]:
movie_gross_df['studio'].unique()

array(['BV', 'WB', 'P/DW', 'Sum.', 'Par.', 'Uni.', 'Fox', 'Wein.', 'Sony',
       'FoxS', 'SGem', 'WB (NL)', 'LGF', 'MBox', 'CL', 'W/Dim.', 'CBS',
       'Focus', 'MGM', 'Over.', 'Mira.', 'IFC', 'CJ', 'NM', 'SPC', 'ParV',
       'Gold.', 'JS', 'RAtt.', 'Magn.', 'Free', '3D', 'UTV', 'Rela.',
       'Zeit.', 'Anch.', 'PDA', 'Lorb.', 'App.', 'Drft.', 'Osci.', 'IW',
       'Rog.', 'Eros', 'Relbig.', 'Viv.', 'Hann.', 'Strand', 'NGE',
       'Scre.', 'Kino', 'Abr.', 'CZ', 'ATO', 'First', 'GK', 'FInd.',
       'NFC', 'TFC', 'Pala.', 'Imag.', 'NAV', 'Arth.', 'CLS', 'Mont.',
       'Olive', 'CGld', 'FOAK', 'IVP', 'Yash', 'ICir', 'FM', 'Vita.',
       'WOW', 'Truly', 'Indic.', 'FD', 'Vari.', 'TriS', 'ORF', 'IM',
       'Elev.', 'Cohen', 'NeoC', 'Jan.', 'MNE', 'Trib.', 'Rocket',
       'OMNI/FSR', 'KKM', 'Argo.', 'SMod', 'Libre', 'FRun', 'WHE', 'P4',
       'KC', 'SD', 'AM', 'MPFT', 'Icar.', 'AGF', 'A23', 'Da.', 'NYer',
       'Rialto', 'DF', 'KL', 'ALP', 'LG/S', 'WGUSA', 'MPI', 'RTWC', 'FIP',
  

We are not going to try to replace all of these with proper names, since there are too many. What may end up happening is that we look at the top few studios and correct just those.
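
Correcting only a handful of studios could be done with a small lookup dictionary and `Series.map`, falling back to the original abbreviation when no match exists. This is a rough sketch: the `studio_names` dictionary is hypothetical and the expansions in it are my best guesses, so they should be verified before use.

```python
import pandas as pd

# Hypothetical lookup; the full names are assumptions to be double-checked.
studio_names = {
    "BV": "Buena Vista (Disney)",
    "WB": "Warner Bros.",
    "Uni.": "Universal",
    "Fox": "20th Century Fox",
    "Par.": "Paramount",
    "LGF": "Lionsgate",
}

# Toy stand-in for movie_gross_df['studio'].
sample = pd.DataFrame({"studio": ["BV", "WB", "IFC", "Uni."]})

# map() gives NaN for abbreviations missing from the dictionary,
# so fillna() keeps the original abbreviation in those rows.
sample["studio_name"] = sample["studio"].map(studio_names).fillna(sample["studio"])

print(sample)
```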

In [75]:
v_count = movie_gross_df['studio'].value_counts()

In [76]:
v_count[:11]

IFC      166
Uni.     147
WB       140
Fox      136
Magn.    136
SPC      123
Sony     109
BV       106
LGF      102
Par.     101
Eros      89
Name: studio, dtype: int64

In [77]:
movie_gross_df.loc[movie_gross_df['studio'] == "LGF"] #plug in different studio values to see movie titles 

Unnamed: 0,title,studio,domestic_gross,foreign_gross,year
26,The Expendables,LGF,103100000.00,171400000,2010
51,Saw 3D,LGF,45700000.00,90400000,2010
64,Killers,LGF,47100000.00,51100000,2010
65,Kick-Ass,LGF,48100000.00,48100000,2010
87,The Last Exorcism,LGF,41000000.00,26700000,2010
...,...,...,...,...,...
3207,Hell Fest,LGF,11100000.00,7100000,2018
3229,Kin,LGF,5700000.00,4300000,2018
3231,Traffik,LGF,9200000.00,336000,2018
3235,Condorito: La Pelicula,LGF,448000.00,8000000,2018


In [78]:
movie_gross_df.sort_values('domestic_gross').head(10)

Unnamed: 0,title,studio,domestic_gross,foreign_gross,year
1476,Storage 24,Magn.,100.0,,2013
2321,The Chambermaid,FM,300.0,,2015
2757,Satanic,Magn.,300.0,,2016
2756,News From Planet Mars,KL,300.0,,2016
1018,Apartment 143,Magn.,400.0,426000.0,2012
3078,2:22,Magn.,400.0,,2017
3077,Max & Leon,Distrib.,500.0,,2017
1126,Death of a Superhero,Trib.,600.0,,2012
2920,Amityville: The Awakening,W/Dim.,700.0,7700000.0,2017
1475,Into the White,Magn.,700.0,,2013


Domestic gross looks sensible at both the high and low ends.

In [79]:
#importing regex so I can more easily find the films with the year added to their titles

import re

In [80]:
reg_expression = r'\([0-9]{4}\)' # raw string; matches a 4-digit number between '(' and ')'

# found the below format online, returns all movies whose title matches the regular expression

titles_need_formatting = movie_gross_df[movie_gross_df['title'].str.count(reg_expression)>0]
titles_need_formatting

Unnamed: 0,title,studio,domestic_gross,foreign_gross,year
1,Alice in Wonderland (2010),BV,334200000.00,691300000,2010
10,Clash of the Titans (2010),WB,163200000.00,330000000,2010
55,A Nightmare on Elm Street (2010),WB (NL),63100000.00,52600000,2010
85,Legion (2010),SGem,40200000.00,27800000,2010
106,Death at a Funeral (2010),SGem,42700000.00,6300000,2010
...,...,...,...,...,...
3326,The Little Mermaid (2018),Conglomerate,147000.00,,2018
3340,Revenge (2018),Neon,102000.00,,2018
3341,Unstoppable (2018),WGUSA,101000.00,,2018
3365,The Apparition (2018),MBox,28300.00,,2018


In [81]:
## export the matching rows to a csv file so I could quickly confirm by eye that
## every result had the additional year info at the end of the title string
## (they did)

titles_need_formatting.to_csv("titles.csv")



In [82]:
# removes any "(YEAR)" match from the title string, then strips trailing whitespace
# general form: new_string = re.sub(pattern, replacement, a_string)

movie_gross_df['title'] = movie_gross_df['title'].apply(lambda x: re.sub(reg_expression,"",x)).str.rstrip()

# movie_gross_df[movie_gross_df['title'].apply(lambda x: re.sub(reg_expression,"",x))]
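
The same cleanup can also be written with the vectorized `Series.str.replace`, with no `apply`/`lambda`. The sketch below is a variation, not the code used above: it anchors the pattern to the end of the title and absorbs the preceding whitespace in the pattern itself. The `titles` series and `year_suffix` name are made up for illustration.

```python
import pandas as pd

# Toy titles standing in for movie_gross_df['title'].
titles = pd.Series([
    "Alice in Wonderland (2010)",
    "Inception",
    "2001: A Space Odyssey (2018 re-release)",
])

# Optional whitespace, then "(YYYY)" at the very end of the string.
year_suffix = r"\s*\(\d{4}\)$"
cleaned = titles.str.replace(year_suffix, "", regex=True)

print(cleaned.tolist())
```

Because the pattern requires exactly four digits between the parentheses, the "(2018 re-release)" title is left alone and can be handled separately.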

In [83]:
#this cell confirms that all the TITLE (YEAR) values in the title column have 
#had the year info removed. 

movie_gross_df[movie_gross_df['title'].str.count(reg_expression)>0]

Unnamed: 0,title,studio,domestic_gross,foreign_gross,year


In [84]:
# finds all the re-releases

substring = 're-release'

movie_gross_df[movie_gross_df['title'].str.contains(substring, regex=False)]

Unnamed: 0,title,studio,domestic_gross,foreign_gross,year
1823,The Conformist (2014 re-release),KL,58700.0,,2014
1833,Alphaville (2013 re-release),Rialto,47700.0,,2014
2139,The Third Man (2015 re-release),Rialto,449000.0,,2015
2604,Only Yesterday (2016 re-release),GK,453000.0,,2016
3264,2001: A Space Odyssey (2018 re-release),WB,3200000.0,,2018
3289,Schindler's List (2018 re-release),Uni.,833000.0,,2018
3296,The Sound of Music (2018 re-release),Fathom,616000.0,,2018
3383,Edward II (2018 re-release),FM,4800.0,,2018


In [85]:
#drop the re-releases from the data set as they were not made in the listed year

movie_gross_df.drop(movie_gross_df[movie_gross_df['title'].str.contains(substring, regex=False)].index, inplace=True)

In [86]:
#check to make sure re-releases are gone from df

movie_gross_df[movie_gross_df['title'].str.contains(substring, regex=False)]

Unnamed: 0,title,studio,domestic_gross,foreign_gross,year
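
Another way to express this filter is to build a boolean mask once and keep the rows where it is False. A small sketch on made-up rows (`sample` and the third title are hypothetical):

```python
import pandas as pd

# Toy stand-in for movie_gross_df.
sample = pd.DataFrame({
    "title": ["The Third Man (2015 re-release)", "Inception", "re-release special"],
    "year": [2015, 2010, 2018],
})

# str.contains flags the substring anywhere in the title.
# (str.find returns 0 for a match at the very start, so a "> 0" test
# would not flag the third row.)
is_rerelease = sample["title"].str.contains("re-release", regex=False)

# Keep everything that is not a re-release.
kept = sample[~is_rerelease].reset_index(drop=True)

print(kept)
```

The mask can be reused for the "are they gone?" check with `is_rerelease.sum()`.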


In [87]:
#Clean up column titles 


movie_gross_df.rename(columns = {'title':'Title', 'studio':'Studio', 'domestic_gross':'Domestic Gross',
          'foreign_gross':'Foreign Gross', 'year':'Year'}, inplace = True)
movie_gross_df

Unnamed: 0,Title,Studio,Domestic Gross,Foreign Gross,Year
0,Toy Story 3,BV,415000000.00,652000000,2010
1,Alice in Wonderland,BV,334200000.00,691300000,2010
2,Harry Potter and the Deathly Hallows Part 1,WB,296000000.00,664300000,2010
3,Inception,WB,292600000.00,535700000,2010
4,Shrek Forever After,P/DW,238700000.00,513900000,2010
...,...,...,...,...,...
3381,Beauty and the Dogs,Osci.,8900.00,,2018
3382,The Quake,Magn.,6200.00,,2018
3384,El Pacto,Sony,2500.00,,2018
3385,The Swan,Synergetic,2400.00,,2018


## Cleaning To Dos for tmdb_movies_df ##
- add genre columns
- check min/max values
- note: no nulls
- clean out rows with empty lists in the genre column (about 9% of the data set, roughly 2.5k rows; more than I'd like, but an average isn't going to cut it and there's no way to manually enter that many)

In [88]:
tmdb_movies_df.tail()

Unnamed: 0.1,Unnamed: 0,genre_ids,id,original_language,original_title,popularity,release_date,title,vote_average,vote_count
26512,26512,"[27, 18]",488143,en,Laboratory Conditions,0.6,2018-10-13,Laboratory Conditions,0.0,1
26513,26513,"[18, 53]",485975,en,_EXHIBIT_84xxx_,0.6,2018-05-01,_EXHIBIT_84xxx_,0.0,1
26514,26514,"[14, 28, 12]",381231,en,The Last One,0.6,2018-10-01,The Last One,0.0,1
26515,26515,"[10751, 12, 28]",366854,en,Trailer Made,0.6,2018-06-22,Trailer Made,0.0,1
26516,26516,"[53, 27]",309885,en,The Church,0.6,2018-10-05,The Church,0.0,1


In [89]:
tmdb_movies_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26517 entries, 0 to 26516
Data columns (total 10 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Unnamed: 0         26517 non-null  int64  
 1   genre_ids          26517 non-null  object 
 2   id                 26517 non-null  int64  
 3   original_language  26517 non-null  object 
 4   original_title     26517 non-null  object 
 5   popularity         26517 non-null  float64
 6   release_date       26517 non-null  object 
 7   title              26517 non-null  object 
 8   vote_average       26517 non-null  float64
 9   vote_count         26517 non-null  int64  
dtypes: float64(2), int64(3), object(5)
memory usage: 2.0+ MB


### Remove rows with empty genre lists ###

In [90]:
tmdb_movies_df[tmdb_movies_df['genre_ids'] == '[]']

Unnamed: 0.1,Unnamed: 0,genre_ids,id,original_language,original_title,popularity,release_date,title,vote_average,vote_count
517,517,[],31059,ru,Наша Russia: Яйца судьбы,3.87,2010-01-21,Nasha Russia: Yaytsa sudby,4.30,25
559,559,[],151316,en,Shrek’s Yule Log,3.42,2010-12-07,Shrek’s Yule Log,4.70,9
589,589,[],75828,en,Erratum,3.15,2010-09-16,Erratum,6.60,7
689,689,[],150782,en,Bikini Frankenstein,2.62,2010-01-18,Bikini Frankenstein,6.00,4
731,731,[],200946,en,Weakness,2.45,2010-10-24,Weakness,4.50,2
...,...,...,...,...,...,...,...,...,...,...
26495,26495,[],556601,en,Recursion,0.60,2018-08-28,Recursion,2.00,1
26497,26497,[],514045,en,The Portuguese Kid,0.60,2018-02-14,The Portuguese Kid,2.00,1
26498,26498,[],497839,en,The 23rd Annual Critics' Choice Awards,0.60,2018-01-11,The 23rd Annual Critics' Choice Awards,2.00,1
26500,26500,[],561932,en,Two,0.60,2018-02-04,Two,1.00,1


In [91]:

tmdb_movies_df.drop(tmdb_movies_df[tmdb_movies_df['genre_ids'] == '[]'].index, inplace = True)

In [92]:
tmdb_movies_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 24038 entries, 0 to 26516
Data columns (total 10 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Unnamed: 0         24038 non-null  int64  
 1   genre_ids          24038 non-null  object 
 2   id                 24038 non-null  int64  
 3   original_language  24038 non-null  object 
 4   original_title     24038 non-null  object 
 5   popularity         24038 non-null  float64
 6   release_date       24038 non-null  object 
 7   title              24038 non-null  object 
 8   vote_average       24038 non-null  float64
 9   vote_count         24038 non-null  int64  
dtypes: float64(2), int64(3), object(5)
memory usage: 2.0+ MB


In [93]:
tmdb_movies_df[tmdb_movies_df['genre_ids'] == '[]']

Unnamed: 0.1,Unnamed: 0,genre_ids,id,original_language,original_title,popularity,release_date,title,vote_average,vote_count


In [94]:
tmdb_movies_df.describe()

Unnamed: 0.1,Unnamed: 0,id,popularity,vote_average,vote_count
count,24038.0,24038.0,24038.0,24038.0,24038.0
mean,13377.27,292504.55,3.38,5.98,214.05
std,7701.86,155419.52,4.5,1.78,1007.21
min,0.0,27.0,0.6,0.0,1.0
25%,6628.25,148607.25,0.64,5.0,2.0
50%,13466.5,307170.0,1.49,6.0,6.0
75%,20104.75,419629.25,4.42,7.0,34.0
max,26516.0,608079.0,80.77,10.0,22186.0


In [100]:
## create a column for each genre from the genre ID list and populate it with a boolean for each film
## first build a dictionary of the genre codes, then loop through it; each iteration creates and fills one column

genre_dict = {28:'Action', 12: 'Adventure', 16: 'Animation', 35: 'Comedy', 80:'Crime', 99:'Documentary',18:'Drama', 10751:'Family',
              14:'Fantasy', 36: 'History', 27:'Horror', 10402:'Music', 9648:'Mystery', 10749:'Romance', 878:'Science Fiction',
              53:'Thriller', 10752:'War',37:'Western'}
#didn't include TV Movie category since we don't care about those (may want to filter them out...)

In [96]:
tmdb_movies_df[tmdb_movies_df['genre_ids'].str.contains('10770')]

Unnamed: 0.1,Unnamed: 0,genre_ids,id,original_language,original_title,popularity,release_date,title,vote_average,vote_count
64,64,"[10770, 16, 14, 10751]",47626,en,Legend of the BoneKnapper Dragon,13.01,2010-10-15,Legend of the BoneKnapper Dragon,6.40,116
89,89,"[35, 10402, 10770]",44244,en,Camp Rock 2: The Final Jam,11.54,2010-09-03,Camp Rock 2: The Final Jam,6.10,878
101,101,"[16, 10751, 10770]",50393,en,Kung Fu Panda Holiday,11.08,2010-11-26,Kung Fu Panda Holiday,6.60,110
158,158,"[35, 10749, 10770, 10402]",35558,en,Starstruck,9.41,2010-02-14,Starstruck,6.70,627
235,235,"[10770, 18, 10751, 14]",50479,en,Avalon High,8.10,2010-11-12,Avalon High,6.10,313
...,...,...,...,...,...,...,...,...,...,...
26381,26381,"[10770, 99]",560930,en,Gypsy's Revenge,0.60,2018-11-06,Gypsy's Revenge,6.70,3
26387,26387,"[99, 10770]",525846,en,Casey Anthony's Parents Speak,0.60,2018-05-28,Casey Anthony's Parents Speak,6.50,2
26398,26398,"[80, 10770]",571692,en,Nightmare Best Friend,0.60,2018-12-29,Nightmare Best Friend,6.00,1
26402,26402,"[18, 10749, 10770]",562466,en,Christmas on the Coast,0.60,2018-11-25,Christmas on the Coast,6.00,1


## Clear out TV movies, then create a True/False column for each genre from the genre ID column ##

In [97]:
# clears out the tv movies 
tmdb_movies_df.drop(tmdb_movies_df.loc[tmdb_movies_df['genre_ids'].str.contains('10770')].index, inplace = True)

In [98]:
tmdb_movies_df['genre_ids'].str.contains('28')

0        False
1        False
2         True
3        False
4         True
         ...  
26512    False
26513    False
26514     True
26515     True
26516    False
Name: genre_ids, Length: 22954, dtype: bool

In [101]:
for key, value in genre_dict.items():
    tmdb_movies_df[value] = tmdb_movies_df['genre_ids'].str.contains(str(key))
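
The loop above tests for each ID as a substring of the `genre_ids` text. That appears to work for the IDs in `genre_dict`, since none of the short codes shows up inside a longer one, but it would also match a hypothetical ID such as 128 when looking for 28. A more defensive sketch parses each string into a real list first with `ast.literal_eval`. The `sample` frame and the shortened dictionary here are for illustration only.

```python
import ast

import pandas as pd

# Shortened copy of the genre dictionary, for illustration.
genre_dict = {28: "Action", 12: "Adventure", 878: "Science Fiction"}

# Toy stand-in for tmdb_movies_df; genre_ids is stored as text, as in the CSV.
sample = pd.DataFrame({"genre_ids": ["[12, 28, 878]", "[16, 35, 10751]", "[878]"]})

# Turn each "[12, 28, 878]" string into an actual Python list of ints.
parsed = sample["genre_ids"].apply(ast.literal_eval)

# One boolean column per genre, using exact list membership.
for genre_id, genre_name in genre_dict.items():
    sample[genre_name] = parsed.apply(lambda ids, gid=genre_id: gid in ids)

print(sample)
```

The `gid=genre_id` default argument pins the current ID inside the lambda, which keeps the loop safe if the function is ever evaluated later.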

In [102]:
tmdb_movies_df.head()

Unnamed: 0.1,Unnamed: 0,genre_ids,id,original_language,original_title,popularity,release_date,title,vote_average,vote_count,...,Fantasy,History,Horror,Music,Mystery,Romance,Science Fiction,Thriller,War,Western
0,0,"[12, 14, 10751]",12444,en,Harry Potter and the Deathly Hallows: Part 1,33.53,2010-11-19,Harry Potter and the Deathly Hallows: Part 1,7.7,10788,...,True,False,False,False,False,False,False,False,False,False
1,1,"[14, 12, 16, 10751]",10191,en,How to Train Your Dragon,28.73,2010-03-26,How to Train Your Dragon,7.7,7610,...,True,False,False,False,False,False,False,False,False,False
2,2,"[12, 28, 878]",10138,en,Iron Man 2,28.52,2010-05-07,Iron Man 2,6.8,12368,...,False,False,False,False,False,False,True,False,False,False
3,3,"[16, 35, 10751]",862,en,Toy Story,28.0,1995-11-22,Toy Story,7.9,10174,...,False,False,False,False,False,False,False,False,False,False
4,4,"[28, 878, 12]",27205,en,Inception,27.92,2010-07-16,Inception,8.3,22186,...,False,False,False,False,False,False,True,False,False,False


In [104]:
#cleaning up column names
column_names = list(tmdb_movies_df.columns)

column_names

['Unnamed: 0',
 'genre_ids',
 'id',
 'original_language',
 'original_title',
 'popularity',
 'release_date',
 'title',
 'vote_average',
 'vote_count',
 'Action',
 'Adventure',
 'Animation',
 'Comedy',
 'Crime',
 'Documentary',
 'Drama',
 'Family',
 'Fantasy',
 'History',
 'Horror',
 'Music',
 'Mystery',
 'Romance',
 'Science Fiction',
 'Thriller',
 'War',
 'Western']

In [107]:
tmdb_movies_df.rename(columns={'genre_ids':'Genre IDs','id':'ID','original_language':'Original Language','original_title':'Original Title',
 'popularity':'Popularity','release_date':'Release Date', 'title':'Title', 'vote_average':'Average Vote',
 'vote_count':'Vote Count'}, inplace=True)
tmdb_movies_df

Unnamed: 0.1,Unnamed: 0,Genre IDs,ID,Original Language,Original Title,Popularity,Release Date,Title,Average Vote,Vote Count,...,Fantasy,History,Horror,Music,Mystery,Romance,Science Fiction,Thriller,War,Western
0,0,"[12, 14, 10751]",12444,en,Harry Potter and the Deathly Hallows: Part 1,33.53,2010-11-19,Harry Potter and the Deathly Hallows: Part 1,7.70,10788,...,True,False,False,False,False,False,False,False,False,False
1,1,"[14, 12, 16, 10751]",10191,en,How to Train Your Dragon,28.73,2010-03-26,How to Train Your Dragon,7.70,7610,...,True,False,False,False,False,False,False,False,False,False
2,2,"[12, 28, 878]",10138,en,Iron Man 2,28.52,2010-05-07,Iron Man 2,6.80,12368,...,False,False,False,False,False,False,True,False,False,False
3,3,"[16, 35, 10751]",862,en,Toy Story,28.00,1995-11-22,Toy Story,7.90,10174,...,False,False,False,False,False,False,False,False,False,False
4,4,"[28, 878, 12]",27205,en,Inception,27.92,2010-07-16,Inception,8.30,22186,...,False,False,False,False,False,False,True,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
26512,26512,"[27, 18]",488143,en,Laboratory Conditions,0.60,2018-10-13,Laboratory Conditions,0.00,1,...,False,False,True,False,False,False,False,False,False,False
26513,26513,"[18, 53]",485975,en,_EXHIBIT_84xxx_,0.60,2018-05-01,_EXHIBIT_84xxx_,0.00,1,...,False,False,False,False,False,False,False,True,False,False
26514,26514,"[14, 28, 12]",381231,en,The Last One,0.60,2018-10-01,The Last One,0.00,1,...,True,False,False,False,False,False,False,False,False,False
26515,26515,"[10751, 12, 28]",366854,en,Trailer Made,0.60,2018-06-22,Trailer Made,0.00,1,...,False,False,False,False,False,False,False,False,False,False
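
With a `Title` column now present in both the gross and TMDB frames, one possible shape for the eventual merge is sketched below. This is a sketch under assumptions, not the merge actually used: it matches on title plus year to avoid pairing remakes that share a name, and the `Release Year` column and toy frames (`gross`, `tmdb`) are made up for illustration.

```python
import pandas as pd

# Toy stand-ins for the two cleaned frames.
gross = pd.DataFrame({
    "Title": ["Inception", "Toy Story 3"],
    "Studio": ["WB", "BV"],
    "Year": [2010, 2010],
})
tmdb = pd.DataFrame({
    "Title": ["Inception", "Toy Story"],
    "Release Date": ["2010-07-16", "1995-11-22"],
    "Action": [True, False],
})

# Derive a year from the release date string so it can be compared to Year.
tmdb["Release Year"] = pd.to_datetime(tmdb["Release Date"]).dt.year

# Inner join keeps only films found in both sources with matching years.
merged = gross.merge(
    tmdb,
    left_on=["Title", "Year"],
    right_on=["Title", "Release Year"],
    how="inner",
)

print(merged)
```

Exact title matching will miss films whose titles are spelled differently in the two sources, so the row count after merging is worth checking.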


### Cleaning Up imdb_title_ratings_df ###

In [15]:
imdb_title_ratings_df.head()

Unnamed: 0,tconst,averagerating,numvotes
0,tt10356526,8.3,31
1,tt10384606,8.9,559
2,tt1042974,6.4,20
3,tt1043726,4.2,50352
4,tt1060240,6.5,21


In [20]:
imdb_title_ratings_df.describe()

Unnamed: 0,averagerating,numvotes
count,73856.0,73856.0
mean,6.33,3523.66
std,1.47,30294.02
min,1.0,5.0
25%,5.5,14.0
50%,6.5,49.0
75%,7.4,282.0
max,10.0,1841066.0


These numbers seem okay; nothing looks improbable.

In [21]:
imdb_title_ratings_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 73856 entries, 0 to 73855
Data columns (total 3 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   tconst         73856 non-null  object 
 1   averagerating  73856 non-null  float64
 2   numvotes       73856 non-null  int64  
dtypes: float64(1), int64(1), object(1)
memory usage: 1.7+ MB


In [63]:
#cleans up column titles

imdb_title_ratings_df.rename(columns = {'averagerating':'Average Rating', 'numvotes':'Number of Votes'}, inplace = True)
imdb_title_ratings_df

Unnamed: 0,tconst,Average Rating,Number of Votes
0,tt10356526,8.30,31
1,tt10384606,8.90,559
2,tt1042974,6.40,20
3,tt1043726,4.20,50352
4,tt1060240,6.50,21
...,...,...,...
73851,tt9805820,8.10,25
73852,tt9844256,7.50,24
73853,tt9851050,4.70,14
73854,tt9886934,7.00,5
