# Pandas vs. NumPy

In data analytics, two of the most popular Python packages are pandas and NumPy.

Here's a link to the differences: https://www.educba.com/pandas-vs-numpy/

We will only be using pandas for now, but it's my habit to import both anyway. The commonly used aliases are pd and np.

In [3]:
#Importing packages
import pandas as pd
import numpy as np

In [4]:
df = pd.read_excel('movie_training.xlsx') # Make sure this Excel file is in the same folder as your Jupyter notebook.

Make this a habit each time you load a dataset: take a look at it!

    * Go through the first n rows using head(n). The default is 5.
    * Count the number of nulls in each column.
    * Plot the data. Before you do your analysis, always plot! This is part of Exploratory Data Analysis (EDA), and it helps you spot patterns in the data that you might not have been aware of.
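As a rough sketch of the first two checks, here is what they look like on a tiny made-up DataFrame. The `toy` frame and its values are assumptions for illustration only, not the movie data.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the movie data (values are made up for illustration).
toy = pd.DataFrame({
    "movie_title": ["A", "B", "C", "D"],
    "duration": [82.0, np.nan, 123.0, 130.0],
    "content_rating": ["PG", None, "R", None],
})

first_rows = toy.head(2)        # first n rows; head() alone returns 5
null_counts = toy.isna().sum()  # number of missing values per column
summary = toy.describe()        # count, mean, min, max, ... for numeric columns

print(first_rows)
print(null_counts)
print(summary)
```

For the plotting step, something like `toy['duration'].hist()` is one quick option if matplotlib is installed.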

In [5]:
df.head()

Unnamed: 0,color,director_name,num_critic_for_reviews,duration,director_facebook_likes,actor_3_facebook_likes,actor_2_name,actor_1_facebook_likes,gross,genres,...,country,content_rating,budget,title_year,actor_2_facebook_likes,aspect_ratio,movie_facebook_likes,IMDB_user_reviews,IMDB_user_votes,IMDB_score
0,Color,Eric Leighton,145.0,82.0,0,388.0,D.B. Sweeney,1000.0,137748063.0,Adventure|Animation|Family|Thriller,...,USA,PG,127500000.0,2000,558.0,1.85,0,241.0,38438.0,6.5
1,Color,Ron Howard,175.0,110.0,2000,636.0,T.J. Thyne,1000.0,260031035.0,Comedy|Family|Fantasy,...,USA,PG,123000000.0,2000,722.0,1.85,0,482.0,141414.0,6.0
2,Color,John Woo,237.0,123.0,610,653.0,Dougray Scott,10000.0,215397307.0,Action|Adventure|Thriller,...,USA,PG-13,125000000.0,2000,794.0,2.35,0,1426.0,242188.0,6.1
3,Color,Wolfgang Petersen,231.0,130.0,249,461.0,Mary Elizabeth Mastrantonio,784.0,182618434.0,Action|Adventure|Drama|Thriller,...,USA,PG-13,140000000.0,2000,638.0,2.35,0,779.0,133076.0,6.4
4,Color,Roland Emmerich,192.0,142.0,776,1000.0,Adam Baldwin,13000.0,113330342.0,Action|Drama|History|War,...,USA,R,110000000.0,2000,2000.0,2.35,4000,1144.0,207613.0,7.1


In [43]:
df.head(25)

Unnamed: 0,color,director_name,num_critic_for_reviews,duration,director_facebook_likes,actor_3_facebook_likes,actor_2_name,actor_1_facebook_likes,gross,genres,...,country,content_rating,budget,title_year,actor_2_facebook_likes,aspect_ratio,movie_facebook_likes,IMDB_user_reviews,IMDB_user_votes,IMDB_score
0,Color,Eric Leighton,145.0,82.0,0,388.0,D.B. Sweeney,1000.0,137748063.0,Adventure|Animation|Family|Thriller,...,USA,PG,127500000.0,2000,558.0,1.85,0,241.0,38438.0,6.5
1,Color,Ron Howard,175.0,110.0,2000,636.0,T.J. Thyne,1000.0,260031035.0,Comedy|Family|Fantasy,...,USA,PG,123000000.0,2000,722.0,1.85,0,482.0,141414.0,6.0
2,Color,John Woo,237.0,123.0,610,653.0,Dougray Scott,10000.0,215397307.0,Action|Adventure|Thriller,...,USA,PG-13,125000000.0,2000,794.0,2.35,0,1426.0,242188.0,6.1
3,Color,Wolfgang Petersen,231.0,130.0,249,461.0,Mary Elizabeth Mastrantonio,784.0,182618434.0,Action|Adventure|Drama|Thriller,...,USA,PG-13,140000000.0,2000,638.0,2.35,0,779.0,133076.0,6.4
4,Color,Roland Emmerich,192.0,142.0,776,1000.0,Adam Baldwin,13000.0,113330342.0,Action|Drama|History|War,...,USA,R,110000000.0,2000,2000.0,2.35,4000,1144.0,207613.0,7.1
5,Color,Dominic Sena,175.0,127.0,57,3000.0,Angelina Jolie Pitt,12000.0,101643008.0,Action|Crime|Thriller,...,USA,PG-13,90000000.0,2000,11000.0,2.35,0,498.0,218341.0,6.5
6,Color,Ridley Scott,265.0,171.0,0,695.0,Connie Nielsen,3000.0,187670866.0,Action|Drama|Romance,...,USA,R,103000000.0,2000,933.0,2.35,21000,2368.0,982637.0,8.5
7,Color,Mark Dindal,141.0,78.0,10,253.0,Wendie Malick,558.0,89296573.0,Adventure|Animation|Comedy|Family|Fantasy,...,USA,G,100000000.0,2000,452.0,1.66,0,297.0,128285.0,7.3
8,Color,Bibo Bergeron,82.0,89.0,10,442.0,Rosie Perez,2000.0,50802661.0,Adventure|Animation|Comedy|Family|Romance,...,USA,PG,95000000.0,2000,919.0,1.85,0,139.0,58300.0,6.9
9,Color,Robert Zemeckis,185.0,130.0,0,568.0,Amber Valletta,11000.0,155370362.0,Drama|Fantasy|Horror|Mystery|Thriller,...,USA,PG-13,100000000.0,2000,627.0,2.35,0,683.0,98403.0,6.6


Notice the ellipsis between genres and country. It means that some columns have been hidden from view. Listing the columns shows you how many there really are, and gives you a place to copy column names from.

In this example it should be fine as the column names are pretty straightforward, but you'll thank me when you come across column names such as 'ar_13_fthsw_len' and 'lo_13fthsw_len'.

In [8]:
#Listing the columns
df.columns

Index(['color', 'director_name', 'num_critic_for_reviews', 'duration',
       'director_facebook_likes', 'actor_3_facebook_likes', 'actor_2_name',
       'actor_1_facebook_likes', 'gross', 'genres', 'actor_1_name',
       'movie_title', 'cast_total_facebook_likes', 'actor_3_name',
       'facenumber_in_poster', 'plot_keywords', 'movie_imdb_link', 'language',
       'country', 'content_rating', 'budget', 'title_year',
       'actor_2_facebook_likes', 'aspect_ratio', 'movie_facebook_likes',
       'IMDB_user_reviews', 'IMDB_user_votes', 'IMDB_score'],
      dtype='object')

Having more data is usually better, but sometimes you gotta go Marie Kondo on your data.

This could be because:

    * Some columns just don't have any use. Let's say you were working with a subset of data for Apple Music users in the US. The "Country" column will only contain USA, which is of no use to you. So that has to go.
    * Too many null values. There are ways to deal with nulls, but sometimes it's just too much! There is also not much to gain from working with such columns.
    * Model complexity can also play a role. Since every movie name is unique, it makes no sense to take them into account. Sure, actor names could really help in determining a movie's rating, but that is too complex for the scope of this project. There are ways to get around this, but we'll just drop them for now.
    * Domain knowledge plays an important role. Understanding which variables don't make sense for a given domain is vital. You'll see a lot of this in case competitions, where they'll just throw everything at you.
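The first two reasons can be checked mechanically. Below is a minimal sketch on made-up data; the `toy` frame and the 0.5 cutoff are arbitrary assumptions, and the domain-knowledge and complexity calls still have to be made by hand.

```python
import numpy as np
import pandas as pd

# Made-up data for illustration only.
toy = pd.DataFrame({
    "country": ["USA", "USA", "USA", "USA"],   # constant column
    "gross": [1.0, np.nan, np.nan, np.nan],    # mostly missing
    "duration": [82.0, 110.0, 123.0, 130.0],
})

# Columns with a single distinct value carry no information.
constant_cols = [c for c in toy.columns if toy[c].nunique(dropna=False) == 1]

# Fraction of missing values per column; 0.5 is an arbitrary cutoff.
null_fraction = toy.isna().mean()
mostly_null_cols = null_fraction[null_fraction > 0.5].index.tolist()

trimmed = toy.drop(columns=constant_cols + mostly_null_cols)
print(constant_cols, mostly_null_cols, list(trimmed.columns))
```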
    

In [9]:
#Dropping unnecessary columns

df1 = df.drop(['actor_3_facebook_likes','actor_3_name','facenumber_in_poster','cast_total_facebook_likes','plot_keywords','movie_imdb_link','aspect_ratio'], axis = 1)

In [10]:
df1.head(5)

Unnamed: 0,color,director_name,num_critic_for_reviews,duration,director_facebook_likes,actor_2_name,actor_1_facebook_likes,gross,genres,actor_1_name,...,language,country,content_rating,budget,title_year,actor_2_facebook_likes,movie_facebook_likes,IMDB_user_reviews,IMDB_user_votes,IMDB_score
0,Color,Eric Leighton,145.0,82.0,0,D.B. Sweeney,1000.0,137748063.0,Adventure|Animation|Family|Thriller,Alfre Woodard,...,English,USA,PG,127500000.0,2000,558.0,0,241.0,38438.0,6.5
1,Color,Ron Howard,175.0,110.0,2000,T.J. Thyne,1000.0,260031035.0,Comedy|Family|Fantasy,Clint Howard,...,English,USA,PG,123000000.0,2000,722.0,0,482.0,141414.0,6.0
2,Color,John Woo,237.0,123.0,610,Dougray Scott,10000.0,215397307.0,Action|Adventure|Thriller,Tom Cruise,...,English,USA,PG-13,125000000.0,2000,794.0,0,1426.0,242188.0,6.1
3,Color,Wolfgang Petersen,231.0,130.0,249,Mary Elizabeth Mastrantonio,784.0,182618434.0,Action|Adventure|Drama|Thriller,Karen Allen,...,English,USA,PG-13,140000000.0,2000,638.0,0,779.0,133076.0,6.4
4,Color,Roland Emmerich,192.0,142.0,776,Adam Baldwin,13000.0,113330342.0,Action|Drama|History|War,Heath Ledger,...,English,USA,R,110000000.0,2000,2000.0,4000,1144.0,207613.0,7.1


Always check if your code worked! Never assume! But don't forget to delete the check cells once you're done.

In [11]:
#Listing all the columns for df1
df1.columns

Index(['color', 'director_name', 'num_critic_for_reviews', 'duration',
       'director_facebook_likes', 'actor_2_name', 'actor_1_facebook_likes',
       'gross', 'genres', 'actor_1_name', 'movie_title', 'language', 'country',
       'content_rating', 'budget', 'title_year', 'actor_2_facebook_likes',
       'movie_facebook_likes', 'IMDB_user_reviews', 'IMDB_user_votes',
       'IMDB_score'],
      dtype='object')

# Data Wrangling!

Very rarely do you get served data on a silver platter. Data collection might not be uniform, or the data might not arrive in the shape that is most useful for analysts. Hence data scientists spend a significant amount of time cleaning the data into a form that models can accept.

This is where looking at your data helps. You'll understand which columns need work and which don't.

People say it's boring, but I quite like it. But then again, I pour milk first, then cereal...


In [12]:
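# Keep only the first listed genre as the movie's primary genre
df1['genre'] = df1['genres'].str.split('|').str[0]
# drop() returns a new DataFrame, so assign it back to keep the change
df1 = df1.drop(['genres'], axis = 1)
df1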

Unnamed: 0,color,director_name,num_critic_for_reviews,duration,director_facebook_likes,actor_2_name,actor_1_facebook_likes,gross,actor_1_name,movie_title,...,country,content_rating,budget,title_year,actor_2_facebook_likes,movie_facebook_likes,IMDB_user_reviews,IMDB_user_votes,IMDB_score,genre
0,Color,Eric Leighton,145.0,82.0,0,D.B. Sweeney,1000.0,137748063.0,Alfre Woodard,Dinosaur,...,USA,PG,127500000.0,2000,558.0,0,241.0,38438.0,6.5,Adventure
1,Color,Ron Howard,175.0,110.0,2000,T.J. Thyne,1000.0,260031035.0,Clint Howard,How the Grinch Stole Christmas,...,USA,PG,123000000.0,2000,722.0,0,482.0,141414.0,6.0,Comedy
2,Color,John Woo,237.0,123.0,610,Dougray Scott,10000.0,215397307.0,Tom Cruise,Mission: Impossible II,...,USA,PG-13,125000000.0,2000,794.0,0,1426.0,242188.0,6.1,Action
3,Color,Wolfgang Petersen,231.0,130.0,249,Mary Elizabeth Mastrantonio,784.0,182618434.0,Karen Allen,The Perfect Storm,...,USA,PG-13,140000000.0,2000,638.0,0,779.0,133076.0,6.4,Action
4,Color,Roland Emmerich,192.0,142.0,776,Adam Baldwin,13000.0,113330342.0,Heath Ledger,The Patriot,...,USA,R,110000000.0,2000,2000.0,4000,1144.0,207613.0,7.1,Action
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3592,Color,Warren Sheppard,3.0,94.0,0,Randy Jay Burrell,918.0,,Jennifer Hale,Fight to the Finish,...,USA,PG-13,150000.0,2016,402.0,381,,,,Action
3593,Color,Darren Lynn Bousman,10.0,97.0,163,Barry Bostwick,636.0,,Paul Sorvino,Alleluia! The Devil's Carnival,...,USA,,500000.0,2016,456.0,707,,,,Horror
3594,Color,Joel Paul Reisig,1.0,108.0,431,Joel Paul Reisig,466.0,,Carrie Bradstreet,Rodeo Girl,...,USA,PG,500000.0,2016,431.0,0,,,,Family
3595,Color,Luke Dye,1.0,84.0,0,Jeff Delaney,385.0,,Mike Stanley,The Little Ponderosa Zoo,...,USA,,500000.0,2016,169.0,9,,,,Family


The order of the columns above is a bit disorganized. Time to organize them!

In [13]:
# .copy() makes df2 an independent DataFrame rather than a slice of df1
df2 = df1[['movie_title','title_year','duration','color','genre','language','country','content_rating','director_name','actor_1_name','actor_2_name','budget','gross','director_facebook_likes','actor_1_facebook_likes','actor_2_facebook_likes','movie_facebook_likes','num_critic_for_reviews','IMDB_user_votes','IMDB_user_reviews','IMDB_score']].copy()

In [14]:
df2.head(5)

Unnamed: 0,movie_title,title_year,duration,color,genre,language,country,content_rating,director_name,actor_1_name,...,budget,gross,director_facebook_likes,actor_1_facebook_likes,actor_2_facebook_likes,movie_facebook_likes,num_critic_for_reviews,IMDB_user_votes,IMDB_user_reviews,IMDB_score
0,Dinosaur,2000,82.0,Color,Adventure,English,USA,PG,Eric Leighton,Alfre Woodard,...,127500000.0,137748063.0,0,1000.0,558.0,0,145.0,38438.0,241.0,6.5
1,How the Grinch Stole Christmas,2000,110.0,Color,Comedy,English,USA,PG,Ron Howard,Clint Howard,...,123000000.0,260031035.0,2000,1000.0,722.0,0,175.0,141414.0,482.0,6.0
2,Mission: Impossible II,2000,123.0,Color,Action,English,USA,PG-13,John Woo,Tom Cruise,...,125000000.0,215397307.0,610,10000.0,794.0,0,237.0,242188.0,1426.0,6.1
3,The Perfect Storm,2000,130.0,Color,Action,English,USA,PG-13,Wolfgang Petersen,Karen Allen,...,140000000.0,182618434.0,249,784.0,638.0,0,231.0,133076.0,779.0,6.4
4,The Patriot,2000,142.0,Color,Action,English,USA,R,Roland Emmerich,Heath Ledger,...,110000000.0,113330342.0,776,13000.0,2000.0,4000,192.0,207613.0,1144.0,7.1


# Cleaning Individual Columns

### IMDB Score

If a row has no IMDB score, we have to drop it. This is the only case where we drop rows.

In [39]:
df2 = df2.dropna(subset = ['IMDB_score'])
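To make the effect of `subset` concrete, here is a small sketch on made-up data (the `toy` frame is an assumption for illustration). Only rows missing the listed column are removed; missing values elsewhere are left alone.

```python
import numpy as np
import pandas as pd

# Made-up data for illustration only.
toy = pd.DataFrame({
    "movie_title": ["A", "B", "C"],
    "gross": [np.nan, 2.0, 3.0],
    "IMDB_score": [6.5, np.nan, 7.1],
})

# Only rows missing IMDB_score are removed; the NaN in gross is kept.
kept = toy.dropna(subset=["IMDB_score"])
print(kept)
```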

### Content Rating

My approach to handling NAs is to first check the unique values of the column, just to see what the values look like. Sometimes NAs are represented in other ways, such as NaN, '', '0' or '-'. In our data, we have 'Unrated' and 'Not Rated'.

Again, ALWAYS LOOK AT YOUR DATA BEFORE DIVING IN!
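A small sketch of how disguised missing values might be surfaced and normalized. The `ratings` series and the `placeholders` list are hypothetical; which strings count as "missing" depends on the dataset.

```python
import numpy as np
import pandas as pd

# Made-up column containing several ways of saying "missing".
ratings = pd.Series(["PG", "", "-", None, "R", "Unrated"])

# value_counts(dropna=False) shows every distinct value, including NaN.
print(ratings.value_counts(dropna=False))

# Hypothetical list of placeholders to treat as missing; adjust per dataset.
placeholders = ["", "-"]
cleaned = ratings.replace(placeholders, np.nan)
print(cleaned.isna().sum())
```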

In [15]:
df2['content_rating'].unique()

array(['PG', 'PG-13', 'R', 'G', nan, 'Unrated', 'Not Rated', 'NC-17',
       'TV-G', 'TV-14', 'TV-PG'], dtype=object)

Here we can see that 'Unrated' and 'Not Rated' mean the same thing. We are also going to bundle NaN with 'Unrated'.

In [16]:
# Merge 'Not Rated' into 'Unrated' and treat missing ratings as 'Unrated' too.
# Reassigning df2 avoids chained assignment (df2[col][mask] = value).
df2 = df2.assign(
    content_rating=df2['content_rating'].replace('Not Rated', 'Unrated').fillna('Unrated')
)
df2['content_rating'].unique()

array(['PG', 'PG-13', 'R', 'G', 'Unrated', 'NC-17', 'TV-G', 'TV-14',
       'TV-PG'], dtype=object)
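Assignments of the form `df['col'][mask] = value` are chained indexing, which is what pandas' SettingWithCopyWarning is about: the write may land on a temporary copy. One common alternative is a single `.loc` call, sketched here on a made-up frame (`toy` is an assumption for illustration).

```python
import numpy as np
import pandas as pd

# Made-up data for illustration only.
toy = pd.DataFrame({"content_rating": ["PG", "Not Rated", np.nan, "R"]})

# Chained form (avoid): toy['content_rating'][mask] = 'Unrated'
# It indexes twice, so pandas may end up writing to a temporary copy.

# Single .loc call: row mask and column label in one indexing step.
toy.loc[toy["content_rating"] == "Not Rated", "content_rating"] = "Unrated"
toy.loc[toy["content_rating"].isna(), "content_rating"] = "Unrated"

print(toy["content_rating"].unique())
```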

### Color

In [17]:
df2['color'].unique()

array(['Color', ' Black and White', nan], dtype=object)

In [18]:
#Checking the null values
df2['movie_title'][df2['color'].isnull()]

1980                                    Shinjuku Incident
2185                                            Dear John
2488                       Snow Flower and the Secret Fan
2552                                           The Ridges
2706                                         Freaky Deaky
2731                                     Small Apartments
2971                           Once Upon a Time in Queens
3100                                              Red Sky
3166    Alpha and Omega 4: The Legend of the Saw Tooth...
3203                                     Something Wicked
3231                                          A Fine Step
3395                                Into the Grizzly Maze
3430                                The Rise of the Krays
3588                                 Kickboxer: Vengeance
Name: movie_title, dtype: object

In [20]:
# Assumption: movies with a missing value are treated as 'Color'
df2 = df2.fillna({'color': 'Color'})


In [21]:
df2['color'].unique()

array(['Color', ' Black and White'], dtype=object)
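Note that ' Black and White' in the output above carries a leading space. If that matters downstream (for example when comparing strings), one possible cleanup is `str.strip()`, sketched here on a made-up series rather than on df2.

```python
import pandas as pd

# Made-up values mirroring the labels seen in the output above.
color = pd.Series(["Color", " Black and White", "Color"])

# str.strip() removes leading and trailing whitespace from each value.
color_clean = color.str.strip()
print(color_clean.unique())
```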

### Duration

In [22]:
df2['duration'].unique()

array([ 82., 110., 123., 130., 142., 127., 171.,  78.,  89.,  94., 119.,
       114., 143., 100., 109.,  90., 145., 106.,  92., 104., 124., 135.,
       105., 116., 125., 126., 152.,  91., 108.,  99., 131., 118.,  87.,
       190.,  93., 107., 220., 136., 103.,  84., 102.,  95.,  97., 129.,
       120., 128., 121., 115.,  98., 112.,  88.,  77., 113.,  85.,  96.,
       101.,  83., 140.,  81.,  46., 122.,  nan, 111., 167.,  86., 184.,
       159., 165., 146., 141., 132., 144., 147., 300., 133., 138., 153.,
       174., 216., 172., 117.,  74.,  80., 150.,  72.,  76.,  73.,  79.,
       154., 134., 192., 280., 139., 206., 196., 170., 137., 178., 148.,
        59., 201., 157., 194., 163.,  68.,  20., 160.,  75.,  47., 169.,
       151., 193., 168., 156., 176., 162., 189., 158.,   7., 270.,  66.,
        53.,  35., 166.,  45.,  42., 215., 161.,  41.,  63., 186., 164.,
       173., 182., 180.,  14., 195., 240., 177.,  67.,  62., 149.,  71.,
        52.,  60., 187., 155., 183.])

Sometimes you need to check where the nulls are coming from. If there is a pattern, you get a better idea of how to tackle them.

In [23]:
#Checking the null values
df2['movie_title'][df2['duration'].isnull()]

153                            Hum To Mohabbat Karega
1138                              Dil Jo Bhi Kahey...
1410                                    The Naked Ape
1917                              Black Water Transit
2116     Harry Potter and the Deathly Hallows: Part I
2309                                         N-Secure
2345    Harry Potter and the Deathly Hallows: Part II
2712                             Should've Been Romeo
2938                                            Barfi
3235                                          Destiny
3462                                Karachi se Lahore
3485                                 Romantic Schemer
Name: movie_title, dtype: object
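One way to look for such a pattern is to compute the share of missing values per group of another column. This is only a sketch on made-up data; the `toy` frame and the choice of `country` as the grouping column are assumptions, not a finding about the movie dataset.

```python
import numpy as np
import pandas as pd

# Made-up data for illustration only.
toy = pd.DataFrame({
    "country": ["USA", "USA", "India", "India", "UK"],
    "duration": [82.0, 110.0, np.nan, np.nan, 95.0],
})

# Share of missing durations within each country.
null_rate = toy["duration"].isna().groupby(toy["country"]).mean()
print(null_rate.sort_values(ascending=False))
```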

In [24]:
#np.nanmean(df2['duration'])
np.nanmedian(df2['duration']) #Why did we use median?

103.0

## Imputing Values

You need to decide which value to impute in place of NaN:

    * A constant: You can impute a constant in place of NaN. This requires domain knowledge.
    * Max/min values: You can impute the minimum or maximum value of the column. However, these extremes can be far from the values that are actually missing.
    * Mean: This is a commonly used approach, but outliers can throw it off. E.g. [100,90,89,67,97,2,87]: the 2 drags the average down to 76.
    * Median: The median is the value in the middle once the data is sorted in ascending order (89 in the example above). It is robust to outliers, which is why it is the method used here.
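The mean-versus-median example from the list can be checked directly with NumPy:

```python
import numpy as np

values = np.array([100, 90, 89, 67, 97, 2, 87])

mean_value = values.mean()        # pulled down by the outlier 2
median_value = np.median(values)  # middle of the sorted values

print(mean_value, median_value)   # 76.0 89.0
```

Without the outlier, the mean of the remaining six values would be about 88.3, much closer to the median.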

In [25]:
df2 = df2.fillna({'duration': np.nanmedian(df2['duration'])})
df2['duration'].unique()


array([ 82., 110., 123., 130., 142., 127., 171.,  78.,  89.,  94., 119.,
       114., 143., 100., 109.,  90., 145., 106.,  92., 104., 124., 135.,
       105., 116., 125., 126., 152.,  91., 108.,  99., 131., 118.,  87.,
       190.,  93., 107., 220., 136., 103.,  84., 102.,  95.,  97., 129.,
       120., 128., 121., 115.,  98., 112.,  88.,  77., 113.,  85.,  96.,
       101.,  83., 140.,  81.,  46., 122., 111., 167.,  86., 184., 159.,
       165., 146., 141., 132., 144., 147., 300., 133., 138., 153., 174.,
       216., 172., 117.,  74.,  80., 150.,  72.,  76.,  73.,  79., 154.,
       134., 192., 280., 139., 206., 196., 170., 137., 178., 148.,  59.,
       201., 157., 194., 163.,  68.,  20., 160.,  75.,  47., 169., 151.,
       193., 168., 156., 176., 162., 189., 158.,   7., 270.,  66.,  53.,
        35., 166.,  45.,  42., 215., 161.,  41.,  63., 186., 164., 173.,
       182., 180.,  14., 195., 240., 177.,  67.,  62., 149.,  71.,  52.,
        60., 187., 155., 183.])

### Budget

In [26]:
df2['budget'].unique()


array([1.2750000e+08, 1.2300000e+08, 1.2500000e+08, 1.4000000e+08,
       1.1000000e+08, 9.0000000e+07, 1.0300000e+08, 1.0000000e+08,
       9.5000000e+07, 9.2000000e+07, 8.5000000e+07, 6.5000000e+07,
       8.2000000e+07, 8.0000000e+07, 4.4000000e+07, 7.0000000e+07,
       7.6000000e+07, 7.5000000e+07, 6.0000000e+07, 6.2000000e+07,
       5.5000000e+07, 4.6000000e+07, 5.1000000e+07, 5.2000000e+07,
       5.0000000e+07, 4.8000000e+07, 4.5000000e+07, 5.7000000e+07,
       4.3000000e+07, 4.0000000e+07, 3.4000000e+07, 3.6000000e+07,
       3.3000000e+07, 3.0000000e+07, 3.2000000e+07, 3.1000000e+07,
       2.8000000e+07, 2.6000000e+07, 2.5000000e+07, 2.4000000e+07,
       2.3000000e+07, 2.2000000e+07, 2.0000000e+07, 1.9000000e+07,
       1.8000000e+07, 1.5600000e+07, 1.6000000e+07, 1.5000000e+07,
       1.4000000e+07, 1.3500000e+07, 1.3000000e+07, 1.2800000e+07,
       1.0000000e+07, 6.0000000e+06, 1.1000000e+07, 9.5000000e+06,
       9.0000000e+06, 8.0000000e+06, 8.5000000e+06, 7.0000000e

In [27]:
np.nanmedian(df2['budget'])

21000000.0

In [28]:
df2 = df2.fillna({'budget': np.nanmedian(df2['budget'])})
df2['budget'].unique()


array([1.2750000e+08, 1.2300000e+08, 1.2500000e+08, 1.4000000e+08,
       1.1000000e+08, 9.0000000e+07, 1.0300000e+08, 1.0000000e+08,
       9.5000000e+07, 9.2000000e+07, 8.5000000e+07, 6.5000000e+07,
       8.2000000e+07, 8.0000000e+07, 4.4000000e+07, 7.0000000e+07,
       7.6000000e+07, 7.5000000e+07, 6.0000000e+07, 6.2000000e+07,
       5.5000000e+07, 4.6000000e+07, 5.1000000e+07, 5.2000000e+07,
       5.0000000e+07, 4.8000000e+07, 4.5000000e+07, 5.7000000e+07,
       4.3000000e+07, 4.0000000e+07, 3.4000000e+07, 3.6000000e+07,
       3.3000000e+07, 3.0000000e+07, 3.2000000e+07, 3.1000000e+07,
       2.8000000e+07, 2.6000000e+07, 2.5000000e+07, 2.4000000e+07,
       2.3000000e+07, 2.2000000e+07, 2.0000000e+07, 1.9000000e+07,
       1.8000000e+07, 1.5600000e+07, 1.6000000e+07, 1.5000000e+07,
       1.4000000e+07, 1.3500000e+07, 1.3000000e+07, 1.2800000e+07,
       1.0000000e+07, 6.0000000e+06, 1.1000000e+07, 9.5000000e+06,
       9.0000000e+06, 8.0000000e+06, 8.5000000e+06, 7.0000000e

### Gross

In [29]:
df2 = df2.fillna({'gross': np.nanmedian(df2['gross'])})
df2['gross'].unique()


array([1.37748063e+08, 2.60031035e+08, 2.15397307e+08, ...,
       1.23777000e+05, 3.10526900e+06, 2.07730700e+07])

### Actor 1 and 2 Facebook Likes


In [30]:
df2['actor_1_facebook_likes'].unique()

array([1.00e+03, 1.00e+04, 7.84e+02, 1.30e+04, 1.20e+04, 3.00e+03,
       5.58e+02, 2.00e+03, 1.10e+04, 8.33e+02, 1.50e+04, 6.11e+02,
       9.81e+02, 7.43e+02, 2.00e+00, 2.20e+04, 2.00e+04, 8.67e+02,
       1.80e+04, 1.60e+04, 3.24e+02, 9.75e+02, 8.89e+02, 4.85e+02,
       2.30e+04, 8.00e+03, 2.90e+04, 5.99e+02, 5.45e+02, 8.11e+02,
       6.70e+02, 5.79e+02, 2.87e+02, 2.40e+04, 3.74e+02, 9.00e+03,
       9.44e+02, 8.12e+02, 2.10e+04, 8.18e+02, 8.26e+02, 3.30e+04,
       9.71e+02, 9.55e+02, 4.97e+02, 5.41e+02, 5.00e+03, 7.74e+02,
       8.93e+02, 1.40e+04, 3.27e+02, 5.54e+02, 3.04e+02, 3.02e+02,
       7.57e+02, 5.55e+02, 1.03e+02, 1.93e+02, 9.12e+02, 9.89e+02,
       4.77e+02, 6.31e+02, 5.71e+02, 9.63e+02, 9.31e+02, 8.65e+02,
       2.60e+04, 4.00e+03, 9.02e+02, 7.23e+02, 1.87e+02, 7.76e+02,
       8.44e+02, 9.57e+02, 6.97e+02, 7.16e+02, 8.38e+02, 9.30e+01,
       9.39e+02, 9.45e+02, 7.36e+02, 8.80e+01, 4.72e+02, 1.27e+02,
       8.50e+01, 3.88e+02, 3.53e+02, 7.67e+02, 8.27e+02, 9.78e

In [31]:
df2['actor_2_facebook_likes'].unique()

array([5.58e+02, 7.22e+02, 7.94e+02, 6.38e+02, 2.00e+03, 1.10e+04,
       9.33e+02, 4.52e+02, 9.19e+02, 6.27e+02, 1.00e+03, 6.24e+02,
       4.10e+02, 7.95e+02, 5.92e+02, 9.75e+02, 2.74e+02, 1.17e+02,
       0.00e+00, 7.64e+02, 8.26e+02, 1.30e+04, 9.55e+02, 4.95e+02,
       7.18e+02, 1.92e+02, 9.09e+02, 3.00e+03, 1.00e+04, 2.90e+02,
       8.91e+02, 7.13e+02, 1.10e+02, 8.83e+02, 4.00e+03, 4.14e+02,
       6.02e+02, 3.99e+02, 5.75e+02, 3.37e+02, 6.25e+02, 8.61e+02,
       3.70e+02, 5.30e+02, 1.57e+02, 2.58e+02, 9.41e+02, 8.86e+02,
       5.51e+02, 6.49e+02, 4.36e+02, 2.23e+02, 9.00e+03, 7.02e+02,
       1.80e+04, 8.44e+02, 8.69e+02, 3.47e+02, 5.96e+02, 8.48e+02,
       6.92e+02, 8.54e+02, 9.57e+02, 8.47e+02, 9.91e+02, 2.81e+02,
       4.39e+02, 2.53e+02, 9.60e+02, 2.16e+02, 5.39e+02, 2.20e+02,
       4.90e+02, 2.40e+02, 4.03e+02, 2.10e+01, 1.84e+02, 7.86e+02,
       8.41e+02, 8.97e+02, 3.97e+02, 2.15e+02, 2.30e+02, 9.34e+02,
       4.42e+02, 9.00e+02, 5.95e+02, 7.44e+02, 6.85e+02, 7.23e

In [32]:
df2 = df2.fillna({
    'actor_1_facebook_likes': np.nanmedian(df2['actor_1_facebook_likes']),
    'actor_2_facebook_likes': np.nanmedian(df2['actor_2_facebook_likes']),
})


In [33]:
df2.head()

Unnamed: 0,movie_title,title_year,duration,color,genre,language,country,content_rating,director_name,actor_1_name,...,budget,gross,director_facebook_likes,actor_1_facebook_likes,actor_2_facebook_likes,movie_facebook_likes,num_critic_for_reviews,IMDB_user_votes,IMDB_user_reviews,IMDB_score
0,Dinosaur,2000,82.0,Color,Adventure,English,USA,PG,Eric Leighton,Alfre Woodard,...,127500000.0,137748063.0,0,1000.0,558.0,0,145.0,38438.0,241.0,6.5
1,How the Grinch Stole Christmas,2000,110.0,Color,Comedy,English,USA,PG,Ron Howard,Clint Howard,...,123000000.0,260031035.0,2000,1000.0,722.0,0,175.0,141414.0,482.0,6.0
2,Mission: Impossible II,2000,123.0,Color,Action,English,USA,PG-13,John Woo,Tom Cruise,...,125000000.0,215397307.0,610,10000.0,794.0,0,237.0,242188.0,1426.0,6.1
3,The Perfect Storm,2000,130.0,Color,Action,English,USA,PG-13,Wolfgang Petersen,Karen Allen,...,140000000.0,182618434.0,249,784.0,638.0,0,231.0,133076.0,779.0,6.4
4,The Patriot,2000,142.0,Color,Action,English,USA,R,Roland Emmerich,Heath Ledger,...,110000000.0,113330342.0,776,13000.0,2000.0,4000,192.0,207613.0,1144.0,7.1


### Num Critic for Reviews

In [34]:
df2 = df2.fillna({'num_critic_for_reviews': np.nanmedian(df2['num_critic_for_reviews'])})


### IMDB User Votes

In [35]:
df2['IMDB_user_votes'].unique()

array([3.84380e+04, 1.41414e+05, 2.42188e+05, ..., 4.75020e+04,
       7.00000e+01,         nan])

In [36]:
df2 = df2.fillna({'IMDB_user_votes': np.nanmedian(df2['IMDB_user_votes'])})


In [37]:
df2['IMDB_user_reviews'].unique()

array([2.410e+02, 4.820e+02, 1.426e+03, 7.790e+02, 1.144e+03, 4.980e+02,
       2.368e+03, 2.970e+02, 1.390e+02, 6.830e+02, 6.430e+02, 6.280e+02,
       9.490e+02, 1.051e+03, 7.700e+01, 1.710e+02, 2.890e+02, 2.370e+02,
       3.150e+02, 1.308e+03, 3.480e+02, 1.970e+02, 1.401e+03, 3.790e+02,
       3.770e+02, 1.344e+03, 3.950e+02, 3.260e+02, 2.650e+02, 6.900e+01,
       6.020e+02, 3.220e+02, 2.670e+02, 8.220e+02, 2.820e+02, 8.500e+01,
       5.070e+02, 2.930e+02, 3.010e+02, 4.810e+02, 2.610e+02, 5.480e+02,
       9.100e+01, 8.670e+02, 2.540e+02, 3.700e+02, 1.830e+02, 1.940e+02,
       3.580e+02, 7.340e+02, 1.800e+02, 1.300e+02, 6.600e+02, 1.670e+02,
       1.880e+02, 2.840e+02, 2.120e+02, 6.400e+01, 6.770e+02, 3.350e+02,
       6.100e+01, 1.480e+02, 1.810e+02, 2.240e+02, 5.100e+02, 4.020e+02,
       4.900e+01, 2.590e+02, 3.650e+02, 8.620e+02, 1.620e+02, 3.180e+02,
       2.750e+02, 2.630e+02, 1.010e+02, 6.740e+02, 8.050e+02, 4.700e+01,
       3.800e+01, 5.300e+01, 1.580e+02, 1.380e+02, 

In [38]:
df2 = df2.fillna({'IMDB_user_reviews': np.nanmedian(df2['IMDB_user_reviews'])})
df2['IMDB_user_reviews'].unique()


array([2.410e+02, 4.820e+02, 1.426e+03, 7.790e+02, 1.144e+03, 4.980e+02,
       2.368e+03, 2.970e+02, 1.390e+02, 6.830e+02, 6.430e+02, 6.280e+02,
       9.490e+02, 1.051e+03, 7.700e+01, 1.710e+02, 2.890e+02, 2.370e+02,
       3.150e+02, 1.308e+03, 3.480e+02, 1.970e+02, 1.401e+03, 3.790e+02,
       3.770e+02, 1.344e+03, 3.950e+02, 3.260e+02, 2.650e+02, 6.900e+01,
       6.020e+02, 3.220e+02, 2.670e+02, 8.220e+02, 2.820e+02, 8.500e+01,
       5.070e+02, 2.930e+02, 3.010e+02, 4.810e+02, 2.610e+02, 5.480e+02,
       9.100e+01, 8.670e+02, 2.540e+02, 3.700e+02, 1.830e+02, 1.940e+02,
       3.580e+02, 7.340e+02, 1.800e+02, 1.300e+02, 6.600e+02, 1.670e+02,
       1.880e+02, 2.840e+02, 2.120e+02, 6.400e+01, 6.770e+02, 3.350e+02,
       6.100e+01, 1.480e+02, 1.810e+02, 2.240e+02, 5.100e+02, 4.020e+02,
       4.900e+01, 2.590e+02, 3.650e+02, 8.620e+02, 1.620e+02, 3.180e+02,
       2.750e+02, 2.630e+02, 1.010e+02, 6.740e+02, 8.050e+02, 4.700e+01,
       3.800e+01, 5.300e+01, 1.580e+02, 1.380e+02, 
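The column-by-column median imputation above could also be done in one step. This is a sketch on made-up data, not a drop-in replacement for the cells above; the `toy` frame and the hand-picked `numeric_cols` list are assumptions.

```python
import numpy as np
import pandas as pd

# Made-up data for illustration only.
toy = pd.DataFrame({
    "budget": [100.0, np.nan, 300.0],
    "gross": [np.nan, 50.0, 70.0],
    "movie_title": ["A", "B", "C"],
})

numeric_cols = ["budget", "gross"]   # columns to impute (chosen by hand)

# DataFrame.median() skips NaN by default; fillna with a Series fills
# each column with its own median.
medians = toy[numeric_cols].median()
toy[numeric_cols] = toy[numeric_cols].fillna(medians)
print(toy)
```

Doing each column separately, as in this notebook, has the advantage that you look at every column before deciding how to fill it.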

# Writing a CSV file

In [None]:
df2.to_csv('data.csv')


#For Athena, the header row is not required, so write the file without header and index
df2.to_csv('data1.csv', header = False,index = False)
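A short sketch of what the `index` argument changes. By default `to_csv` writes the row index as an unnamed first column, which shows up as 'Unnamed: 0' when the file is read back. The example writes to in-memory buffers instead of files, and `toy` is made-up data.

```python
import io
import pandas as pd

# Made-up data for illustration only.
toy = pd.DataFrame({"movie_title": ["A", "B"], "IMDB_score": [6.5, 7.1]})

# Default: the index is written as an unnamed first column.
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
cols_with_index = pd.read_csv(with_index).columns.tolist()

# index=False keeps only the real columns.
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = pd.read_csv(without_index).columns.tolist()

print(cols_with_index)
print(cols_without_index)
```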