## The Dataset

The IMDb Movies Dataset comprises the 1,000 most popular and highly rated films on IMDb, an extensive online database of information on movies, television series, videos, and related content.

According to the dataset's information page, the IMDb Movies Dataset was constructed using web scraping techniques to extract information directly from the IMDb site itself. Web scraping is the automated extraction of data from relevant webpages in a short amount of time (Hillier, 2021).

Because the data were scraped, the dataset may contain inconsistencies arising from variations in how information is presented across web pages. Data cleaning and preprocessing are therefore required for this dataset.

The dataset contains 1,000 observations (rows) across 16 variables (columns). Each variable is described below:

- **`Poster_Link`** - link to the image of the movie's poster on IMDb
- **`Series_Title`** - the title of the movie
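- **`Released_Year`** - the year the movie was released
- **`Certificate`** - the certificate (age rating) given to the movie
- **`Runtime`** - the total runtime of the movie, in minutes
- **`Genre`** - the genre or genres of the movie
- **`IMDB_Rating`** - the movie's rating on IMDb
- **`Overview`** - a short summary of the movie
- **`Meta_score`** - the movie's Metascore
- **`Director`** - the name of the movie's director
- **`Star1`** - the name of one of the movie's stars
- **`Star2`** - the name of one of the movie's stars
- **`Star3`** - the name of one of the movie's stars
- **`Star4`** - the name of one of the movie's stars
- **`No_of_Votes`** - the total number of votes the movie received
- **`Gross`** - the movie's gross earnings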

## Reading the Dataset
Our first step is to load the dataset into a pandas `DataFrame` using the [`read_csv`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html) function. Note that you may need to change the path depending on where the file is located on your machine.


In [1]:
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings("ignore")

In [2]:
imdb_df = pd.read_csv("imdb_top_1000.csv")
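
After loading, a quick sanity check can confirm that the file was read as expected. The sketch below is only an illustration: it reads a tiny made-up CSV from memory instead of `imdb_top_1000.csv`, and the names `sample_csv`, `sample_df`, `expected_cols`, and `missing_cols` are hypothetical.

```python
import io

import pandas as pd

# Tiny in-memory stand-in for imdb_top_1000.csv
# (made-up rows, and only three of the 16 columns)
sample_csv = io.StringIO(
    "Series_Title,Released_Year,IMDB_Rating\n"
    "Movie A,1994,9.3\n"
    "Movie B,1972,9.2\n"
)

sample_df = pd.read_csv(sample_csv)

# Columns we expect to find; extend this list for the real file
expected_cols = ["Series_Title", "Released_Year", "IMDB_Rating"]
missing_cols = [col for col in expected_cols if col not in sample_df.columns]

print(sample_df.shape)  # (number of rows, number of columns)
print(missing_cols)     # an empty list means every expected column is present
```

For the real dataset, the same check would be run on `imdb_df`, where the shape should be 1000 rows by 16 columns.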

## Data Cleaning
The researchers inspected the dataset more closely before cleaning it. Some variables have missing values, and some have data types that do not match their contents: `Released_Year`, `Runtime`, and `Gross`, for example, are stored as `object` rather than as numbers.

In [3]:
imdb_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 16 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Poster_Link    1000 non-null   object 
 1   Series_Title   1000 non-null   object 
 2   Released_Year  1000 non-null   object 
 3   Certificate    899 non-null    object 
 4   Runtime        1000 non-null   object 
 5   Genre          1000 non-null   object 
 6   IMDB_Rating    1000 non-null   float64
 7   Overview       1000 non-null   object 
 8   Meta_score     843 non-null    float64
 9   Director       1000 non-null   object 
 10  Star1          1000 non-null   object 
 11  Star2          1000 non-null   object 
 12  Star3          1000 non-null   object 
 13  Star4          1000 non-null   object 
 14  No_of_Votes    1000 non-null   int64  
 15  Gross          831 non-null    object 
dtypes: float64(2), int64(1), object(13)
memory usage: 125.1+ KB


In [4]:
print(imdb_df.isnull().sum())

Poster_Link        0
Series_Title       0
Released_Year      0
Certificate      101
Runtime            0
Genre              0
IMDB_Rating        0
Overview           0
Meta_score       157
Director           0
Star1              0
Star2              0
Star3              0
Star4              0
No_of_Votes        0
Gross            169
dtype: int64


The following variables have missing values. Because they are important to the analysis, their missing values will be imputed.
* Certificate
* Meta_score
* Gross
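
To see how much of each column is affected, the counts can be expressed as percentages. In the real dataset of 1000 rows, the counts above correspond to 10.1% (`Certificate`), 15.7% (`Meta_score`), and 16.9% (`Gross`). The sketch below shows one way to build such a summary; it uses made-up rows, and `toy_df` and `missing_summary` are hypothetical names.

```python
import numpy as np
import pandas as pd

# Made-up rows that mimic the three columns with missing values
toy_df = pd.DataFrame({
    "Certificate": ["U", None, "A", "UA"],
    "Meta_score": [80.0, np.nan, np.nan, 90.0],
    "Gross": ["1,000", "2,500", None, "4,000"],
})

# Count and percentage of missing values per column
missing_summary = pd.DataFrame({
    "n_missing": toy_df.isnull().sum(),
    "pct_missing": (toy_df.isnull().mean() * 100).round(1),
})

# Keep only the columns that actually have missing values
missing_summary = missing_summary[missing_summary["n_missing"] > 0]
print(missing_summary)
```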

In [5]:
print("Number of duplicated data:", imdb_df.duplicated().sum())

Number of duplicated data: 0


In [6]:
imdb_df['Certificate'].value_counts()

Certificate
U           234
A           197
UA          175
R           146
PG-13        43
PG           37
Passed       34
G            12
Approved     11
TV-PG         3
GP            2
TV-14         1
16            1
TV-MA         1
Unrated       1
U/A           1
Name: count, dtype: int64

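The `Certificate` values are inconsistent. They appear to mix labels from different rating systems (for example, `U`, `UA`, and `A` alongside `PG`, `PG-13`, and `R`), and `UA` and `U/A` look like two spellings of the same label.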
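
One possible way to reduce this inconsistency is to map spelling variants onto a single label. The sketch below is an illustration on made-up values, not a step taken in this notebook; the mapping in `label_fixes` is an assumption (it treats `U/A` as a variant of `UA`) and should be checked against a reliable source before being applied to the real data.

```python
import pandas as pd

# Made-up sample of Certificate labels
certificates = pd.Series(["U", "UA", "U/A", "A", "PG-13", "U/A"])

# Hypothetical mapping: assumes 'U/A' is a spelling variant of 'UA'
label_fixes = {"U/A": "UA"}

# replace() swaps whole values that match a key and leaves the rest untouched
cleaned = certificates.replace(label_fixes)
print(cleaned.value_counts())
```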

In [7]:
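## Replace null values in Certificate with the most frequent category ('U')
## Assign the result back rather than calling fillna(inplace=True) on a column selection
imdb_df['Certificate'] = imdb_df['Certificate'].fillna('U')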
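
The most frequent category can also be computed instead of typed in by hand, which guards against typos and against the data changing. A minimal sketch on made-up values follows; `cert`, `most_frequent`, and `cert_filled` are hypothetical names.

```python
import pandas as pd

# Made-up Certificate column with two missing entries
cert = pd.Series(["U", "A", None, "U", "UA", None])

# mode() ignores missing values and returns a Series (there can be ties),
# so take the first entry
most_frequent = cert.mode()[0]

# Fill the missing entries with that category
cert_filled = cert.fillna(most_frequent)
print(most_frequent)
print(cert_filled.tolist())
```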

In [8]:
## Convert the Runtime object to int
imdb_df['Runtime'] = imdb_df['Runtime'].str.replace(' min', '').astype(int)
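
The conversion above assumes every value has the form `"<number> min"`; a single stray value would make `astype(int)` raise an error. A more defensive variant, sketched here on made-up values (the `"unknown"` entry is invented to show the failure case), extracts the digits and flags rows that cannot be parsed.

```python
import pandas as pd

# Made-up Runtime strings; "unknown" is an invented bad value
runtime_raw = pd.Series(["142 min", "175 min", " 96 min", "unknown"])

# Pull out the first run of digits; rows without digits become NaN
minutes = runtime_raw.str.extract(r"(\d+)", expand=False)

# Convert to numbers; anything unparseable stays NaN instead of raising
runtime_min = pd.to_numeric(minutes, errors="coerce")

# Rows that need manual attention
bad_rows = runtime_raw[runtime_min.isna()]
print(runtime_min.tolist())
print(bad_rows.tolist())
```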

In [9]:
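## Replace null values in Meta_score with the column mean
## Assign the result back rather than calling fillna(inplace=True) on a column selection
imdb_df['Meta_score'] = imdb_df['Meta_score'].fillna(imdb_df['Meta_score'].mean())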

In [10]:
## Convert Gross to float
imdb_df['Gross'] = imdb_df['Gross'].str.replace(',', '').astype(float)

In [11]:
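## Replace null values in Gross with the column mean
## Assign the result back rather than calling fillna(inplace=True) on a column selection
imdb_df['Gross'] = imdb_df['Gross'].fillna(imdb_df['Gross'].mean())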
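
Since 169 of the 1000 `Gross` values are missing, mean imputation gives roughly 17% of the rows the same value. It may therefore be useful to record which rows were imputed before filling them. The sketch below illustrates the idea on made-up numbers; the `Gross_imputed` column is a hypothetical addition, not part of the dataset or of this notebook's pipeline.

```python
import numpy as np
import pandas as pd

# Made-up Gross values with two missing entries
toy = pd.DataFrame({"Gross": [100.0, np.nan, 300.0, np.nan]})

# Record which rows are about to be imputed (hypothetical helper column)
toy["Gross_imputed"] = toy["Gross"].isna()

# mean() skips missing values by default
gross_mean = toy["Gross"].mean()

# Fill the missing values and assign the result back to the column
toy["Gross"] = toy["Gross"].fillna(gross_mean)
print(toy)
```

The flag makes it possible to repeat an analysis later with and without the imputed rows.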

In [12]:
# Check values in Released_Year
imdb_df['Released_Year'].value_counts()

Released_Year
2014    32
2004    31
2009    29
2013    28
2016    28
        ..
1926     1
1936     1
1924     1
1921     1
PG       1
Name: count, Length: 100, dtype: int64

One observation has the non-numeric value `'PG'` in `Released_Year`, which points to a data entry or scraping error.

In [13]:
imdb_df[imdb_df['Released_Year'] == 'PG']

| | Poster_Link | Series_Title | Released_Year | Certificate | Runtime | Genre | IMDB_Rating | Overview | Meta_score | Director | Star1 | Star2 | Star3 | Star4 | No_of_Votes | Gross |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 966 | https://m.media-amazon.com/images/M/MV5BNjEzYj... | Apollo 13 | PG | U | 140 | Adventure, Drama, History | 7.6 | NASA must devise a strategy to return Apollo 1... | 77.0 | Ron Howard | Tom Hanks | Bill Paxton | Kevin Bacon | Gary Sinise | 269197 | 173837933.0 |


Apollo 13 was released in 1995, so the erroneous `'PG'` value in `Released_Year` is replaced with 1995.

In [14]:
imdb_df['Released_Year'] = imdb_df['Released_Year'].replace('PG', '1995')
imdb_df[imdb_df['Series_Title'] == 'Apollo 13']

| | Poster_Link | Series_Title | Released_Year | Certificate | Runtime | Genre | IMDB_Rating | Overview | Meta_score | Director | Star1 | Star2 | Star3 | Star4 | No_of_Votes | Gross |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 966 | https://m.media-amazon.com/images/M/MV5BNjEzYj... | Apollo 13 | 1995 | U | 140 | Adventure, Drama, History | 7.6 | NASA must devise a strategy to return Apollo 1... | 77.0 | Ron Howard | Tom Hanks | Bill Paxton | Kevin Bacon | Gary Sinise | 269197 | 173837933.0 |
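
Non-numeric years can also be located programmatically rather than by reading the `value_counts()` output, and once they are fixed the column can be cast to an integer type. The sketch below works on a made-up series in which the `'PG'` entry stands in for the Apollo 13 row; `years`, `as_number`, `invalid`, and `fixed` are hypothetical names, and the cast to `int` is an optional extra step, not something done in this notebook.

```python
import pandas as pd

# Made-up Released_Year values; 'PG' stands in for the Apollo 13 row
years = pd.Series(["1994", "1972", "PG", "2008"])

# Values that cannot be parsed as numbers become NaN
as_number = pd.to_numeric(years, errors="coerce")

# The original strings behind those NaN entries
invalid = years[as_number.isna()]
print(invalid.tolist())

# After replacing the known bad value, the column can be cast to int
fixed = years.replace("PG", "1995").astype(int)
print(fixed.tolist())
```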


In [15]:
## https://stackoverflow.com/questions/62408093/one-hot-encoding-multiple-categorical-data-in-a-column
genre = imdb_df['Genre'].str.split(', ').tolist()

flat_genre = [item for sublist in genre for item in sublist]

set_genre = set(flat_genre)

unique_genre = list(set_genre)

imdb_df = imdb_df.reindex(imdb_df.columns.tolist() + unique_genre, axis=1, fill_value=0)

# for each value inside column, update the dummy
for index, row in imdb_df.iterrows():
    for val in row.Genre.split(', '):
        if val != 'NA':
            imdb_df.loc[index, val] = 1

imdb_df.drop('Genre', axis = 1, inplace = True)    

In [16]:
## Merge the four star columns into a single list-valued column
## (built in one step to avoid chained assignment such as imdb_df['Stars'][i] = ...)
star_cols = ['Star1', 'Star2', 'Star3', 'Star4']
imdb_df['Stars'] = imdb_df[star_cols].values.tolist()

imdb_df.drop(star_cols, axis = 1, inplace = True)

In [24]:
imdb_df.drop(['Poster_Link', 'Overview'], axis = 1, inplace = True)

In [25]:
imdb_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 31 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Series_Title   1000 non-null   object 
 1   Released_Year  1000 non-null   object 
 2   Certificate    1000 non-null   object 
 3   Runtime        1000 non-null   int32  
 4   IMDB_Rating    1000 non-null   float64
 5   Meta_score     1000 non-null   float64
 6   Director       1000 non-null   object 
 7   No_of_Votes    1000 non-null   int64  
 8   Gross          1000 non-null   float64
 9   Thriller       1000 non-null   int64  
 10  Drama          1000 non-null   int64  
 11  Biography      1000 non-null   int64  
 12  Action         1000 non-null   int64  
 13  Adventure      1000 non-null   int64  
 14  Crime          1000 non-null   int64  
 15  Family         1000 non-null   int64  
 16  Comedy         1000 non-null   int64  
 17  Sport          1000 non-null   int64  
 18  History  

In [20]:
reordered_cols = ['Series_Title', 'Released_Year', 'Certificate', 'Runtime', 'IMDB_Rating', 'Meta_score', 'Director', 'Stars', 'No_of_Votes', 'Gross']
final_cols = reordered_cols + unique_genre

In [26]:
set(imdb_df.columns) == set(final_cols)

True

In [27]:
imdb_df = imdb_df[final_cols]

In [28]:
imdb_df.head()

| | Series_Title | Released_Year | Certificate | Runtime | IMDB_Rating | Meta_score | Director | Stars | No_of_Votes | Gross | ... | Film-Noir | Horror | Romance | Music | Fantasy | Mystery | Sci-Fi | Animation | Musical | Western |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | The Shawshank Redemption | 1994 | A | 142 | 9.3 | 80.0 | Frank Darabont | [Tim Robbins, Morgan Freeman, Bob Gunton, Will... | 2343110 | 28341469.0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | The Godfather | 1972 | A | 175 | 9.2 | 100.0 | Francis Ford Coppola | [Marlon Brando, Al Pacino, James Caan, Diane K... | 1620367 | 134966411.0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | The Dark Knight | 2008 | UA | 152 | 9.0 | 84.0 | Christopher Nolan | [Christian Bale, Heath Ledger, Aaron Eckhart, ... | 2303232 | 534858444.0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | The Godfather: Part II | 1974 | A | 202 | 9.0 | 90.0 | Francis Ford Coppola | [Al Pacino, Robert De Niro, Robert Duvall, Dia... | 1129952 | 57300000.0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 12 Angry Men | 1957 | U | 96 | 9.0 | 96.0 | Sidney Lumet | [Henry Fonda, Lee J. Cobb, Martin Balsam, John... | 689845 | 4360000.0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
