# IMDB Scraping Data

**Start date:** 9/3/2023 

**Scope:** The main goal of this analysis is to practice data cleaning and perform an exploratory analysis of this dataset.

In [239]:
#Importing the data
# !kaggle datasets download -d bharatnatrayn/movies-dataset-for-feature-extracion-prediction

In [240]:
# import zipfile

# with zipfile.ZipFile('movies-dataset-for-feature-extracion-prediction.zip', 'r') as zip_ref:
#     zip_ref.extractall()

In [241]:
import pandas as pd
import re

In [242]:
movies_raw_dataset = pd.read_csv('movies.csv')

In [243]:
movies_raw_dataset.head()

Unnamed: 0,MOVIES,YEAR,GENRE,RATING,ONE-LINE,STARS,VOTES,RunTime,Gross
0,Blood Red Sky,(2021),"\nAction, Horror, Thriller",6.1,\nA woman with a mysterious illness is forced ...,\n Director:\nPeter Thorwarth\n| \n Star...,21062.0,121.0,
1,Masters of the Universe: Revelation,(2021– ),"\nAnimation, Action, Adventure",5.0,\nThe war for Eternia begins again in what may...,"\n \n Stars:\nChris Wood, \nSara...",17870.0,25.0,
2,The Walking Dead,(2010–2022),"\nDrama, Horror, Thriller",8.2,\nSheriff Deputy Rick Grimes wakes up from a c...,"\n \n Stars:\nAndrew Lincoln, \n...",885805.0,44.0,
3,Rick and Morty,(2013– ),"\nAnimation, Adventure, Comedy",9.2,\nAn animated series that follows the exploits...,"\n \n Stars:\nJustin Roiland, \n...",414849.0,23.0,
4,Army of Thieves,(2021),"\nAction, Crime, Horror",,"\nA prequel, set before the events of Army of ...",\n Director:\nMatthias Schweighöfer\n| \n ...,,,


At first glance, a few issues need work:
- The dataset contains series as well as movies
- All text fields contain special characters such as '\n'
- The RunTime column contains the full length for movies but only the episode length for series
- For readability, the column names also need to be standardized

Let's start with the column names, then look at the dataset summary and types.

In [244]:
movies_raw_dataset.columns = movies_raw_dataset.columns.str.capitalize()

In [245]:
movies_raw_dataset.describe()

Unnamed: 0,Rating,Runtime
count,8179.0,7041.0
mean,6.921176,68.688539
std,1.220232,47.258056
min,1.1,1.0
25%,6.2,36.0
50%,7.1,60.0
75%,7.8,95.0
max,9.9,853.0


The Rating column looks fine, since IMDb ratings range from 1 to 10. However, a maximum run time of 853 minutes (more than 14 hours) seems excessive. Given that the 75th percentile is 95 minutes, we are likely looking at an outlier. Let's check it out.

In [246]:
movies_raw_dataset.nlargest(5, columns='Runtime')

Unnamed: 0,Movies,Year,Genre,Rating,One-line,Stars,Votes,Runtime,Gross
1902,El tiempo entre costuras,(2013–2014),"\nAdventure, Drama, History",8.3,\nSira Quiroga is a young Spanish dressmaker e...,"\n \n Stars:\nAdriana Ugarte, \n...",3876,853.0,
1081,Soupçons,(2004–2018),"\nDocumentary, Crime, Drama",7.9,\nThe high-profile murder trial of American no...,"\n \n Stars:\nMichael Peterson, ...",20200,629.0,
2498,The Innocence Files,(2020),"\nDocumentary, Crime",8.0,\nCases of wrongful conviction that the Innoce...,"\n \n Stars:\nPeter Neufeld, \nB...",2335,573.0,
201,The Haunting of Hill House,(2018),"\nDrama, Horror, Mystery",8.6,"\nFlashing between past and present, a fractur...","\n \n Stars:\nMichiel Huisman, \...",195117,572.0,
820,Cosmos: A Spacetime Odyssey,(2014),\nDocumentary,9.3,\nAn exploration of our discovery of the laws ...,\n \n Stars:\nNeil deGrasse Tyso...,114386,557.0,


So apparently the Runtime column mixes two kinds of values: some series report the duration per episode, while others report the total duration of the series. We will have to find a way to differentiate these.
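
One possible starting point, sketched below on toy values rather than the real data, is to flag unusually long runtimes with the conventional interquartile-range fence. The numbers and the 1.5 multiplier are assumptions for illustration; a flag like this only marks candidates for review and cannot tell a long film from a series total on its own.

```python
import pandas as pd

# Toy runtimes in minutes (illustrative values, not the full dataset).
runtimes = pd.Series(
    [23.0, 25.0, 44.0, 60.0, 90.0, 95.0, 121.0, 572.0, 853.0], name="Runtime"
)

# Quartiles and the upper Tukey fence (Q3 + 1.5 * IQR).
q1, q3 = runtimes.quantile([0.25, 0.75])
upper_fence = q3 + 1.5 * (q3 - q1)

# Rows above the fence are candidates for "total series duration".
suspicious = runtimes[runtimes > upper_fence]
print(upper_fence)
print(suspicious.tolist())
```

On the notebook's data the same lines could be applied to `movies_raw_dataset['Runtime']`, with the flagged rows inspected by hand.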

In [247]:
movies_raw_dataset.dtypes

Movies       object
Year         object
Genre        object
Rating      float64
One-line     object
Stars        object
Votes        object
Runtime     float64
Gross        object
dtype: object

Both the Votes and Gross columns are objects, although they should be numeric. Let's take a closer look at Gross. Since head() shows no values for this column, we don't yet know how it is formatted or whether it needs additional work.

In [248]:
movies_raw_dataset.loc[~movies_raw_dataset['Gross'].isna(), 'Gross'].head(20)

77      $75.47M
85     $402.45M
95      $89.22M
111    $315.54M
125     $57.01M
128    $260.00M
132    $132.38M
143    $167.77M
144    $404.52M
145     $15.07M
156     $70.10M
159    $210.61M
161    $327.48M
165    $390.53M
171    $303.00M
172     $56.63M
175     $58.06M
181    $353.01M
189     $46.89M
191      $7.00M
Name: Gross, dtype: object

So the Gross column will also need some formatting. 
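
Before writing a parser, it can help to check which magnitude suffixes actually occur, since the 20 rows above only show 'M'. The sketch below uses a few toy values in the same `$<number><suffix>` shape; the Series name `gross` is a stand-in for `movies_raw_dataset['Gross']`.

```python
import pandas as pd

# Toy values shaped like the Gross column (illustrative only).
gross = pd.Series(["$75.47M", "$402.45M", None, "$7.00M"], name="Gross")

# Capture any trailing letters after the number; missing rows stay missing.
suffixes = gross.str.extract(r"([A-Za-z]+)$", expand=False)

# Count each suffix, including missing values.
print(suffixes.value_counts(dropna=False))
```

If only one suffix shows up on the real column, the formatter needs just that branch; any unexpected suffix would surface here instead of raising later inside `float()`.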

We can now work on the column types and formatting

In [250]:
movies_raw_dataset['Year']

0            (2021)
1          (2021– )
2       (2010–2022)
3          (2013– )
4            (2021)
           ...     
9994       (2021– )
9995       (2021– )
9996       (2022– )
9997       (2021– )
9998       (2021– )
Name: Year, Length: 9999, dtype: object

In [251]:
#Remove parentheses from the Year column
#regex=True is passed explicitly because the default for str.replace changed to False in pandas 2.0
movies_raw_dataset['Year'] = movies_raw_dataset['Year'].str.replace(r'[()]', '', regex=True)


In [252]:
#Remove new line char from Genre
movies_raw_dataset['Genre'] = movies_raw_dataset['Genre'].str.strip('\n')

In [253]:
#Remove new line char from One-Line
movies_raw_dataset['One-line'] = movies_raw_dataset['One-line'].str.strip('\n')

In [254]:
#Remove new line and special characters from Stars
#regex=True is passed explicitly so that the character class is treated as a pattern
movies_raw_dataset['Stars'] = movies_raw_dataset['Stars'].str.replace(r'[\n|]', '', regex=True)


In [255]:
#Cast Votes as float to preserve NaN. Casting to a NumPy integer type would force us to fill those records, and filling with 0 or another number might skew future analysis
movies_raw_dataset['Votes'] = movies_raw_dataset['Votes'].str.replace(',','').astype(float)
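
As an alternative to float, pandas also has a nullable integer dtype, `"Int64"` (capital I), which keeps missing values as `<NA>` without filling them. The sketch below shows it on toy vote strings; whether it is preferable here depends on what later analysis steps expect, so treat it as an option rather than a replacement for the cell above.

```python
import pandas as pd

# Toy vote counts with thousands separators and one missing value.
votes = pd.Series(["21,062", "17,870", None, "885,805"], name="Votes")

# Strip the separators, parse to numbers, then cast to nullable integers.
votes_int = pd.to_numeric(votes.str.replace(",", "", regex=False)).astype("Int64")
print(votes_int)
```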

In [256]:
#Get the Gross values correctly formatted
def number_formatter(x:str):
    '''Removes the letter identifier of the number magnitude and multiplies it by the value represented, returning a numerical variable'''
    if 'M' in x:
        return float(x.strip('M'))*1000000
    elif 'k' in x:
        return float(x.strip('k'))*1000
    else:
        return float(x)

In [263]:
movies_raw_dataset['Gross'] = movies_raw_dataset['Gross'].str.strip('$')
movies_raw_dataset['Gross'] = movies_raw_dataset['Gross'].apply(lambda x: number_formatter(str(x)))

In [257]:
movies_raw_dataset.head()

Unnamed: 0,Movies,Year,Genre,Rating,One-line,Stars,Votes,Runtime,Gross
0,Blood Red Sky,2021,"Action, Horror, Thriller",6.1,A woman with a mysterious illness is forced in...,Director:Peter Thorwarth Stars:Peri Ba...,21062.0,121.0,
1,Masters of the Universe: Revelation,2021–,"Animation, Action, Adventure",5.0,The war for Eternia begins again in what may b...,"Stars:Chris Wood, Sarah Michel...",17870.0,25.0,
2,The Walking Dead,2010–2022,"Drama, Horror, Thriller",8.2,Sheriff Deputy Rick Grimes wakes up from a com...,"Stars:Andrew Lincoln, Norman R...",885805.0,44.0,
3,Rick and Morty,2013–,"Animation, Adventure, Comedy",9.2,An animated series that follows the exploits o...,"Stars:Justin Roiland, Chris Pa...",414849.0,23.0,
4,Army of Thieves,2021,"Action, Crime, Horror",,"A prequel, set before the events of Army of th...",Director:Matthias Schweighöfer Stars:M...,,,


Looking a lot better!

Let's now check for duplicates on the data.

In [258]:
#First we drop full duplicates
movies_raw_dataset.drop_duplicates(inplace=True, ignore_index=True)

In [259]:
#Checking for partial duplicates
movies_raw_dataset[movies_raw_dataset.duplicated(subset=['Movies','Year','Genre'], keep=False)].sort_values(by='Movies')

Unnamed: 0,Movies,Year,Genre,Rating,One-line,Stars,Votes,Runtime,Gross
8265,13 Reasons Why,2017–2020,"Drama, Mystery, Thriller",5.9,Clay's mental health continues to decline as t...,Director:Sunu Gonera Stars:Dylan Minne...,1420.0,60.0,
8109,13 Reasons Why,2017–2020,"Drama, Mystery, Thriller",6.6,"The school goes into lockdown and Clay, Tony a...",Director:Brenda Strong Stars:Dylan Min...,1694.0,61.0,
8284,13 Reasons Why,2017–2020,"Drama, Mystery, Thriller",6.8,When the dean begins a new investigation and t...,Director:Tommy Lohmann Stars:Dylan Min...,1640.0,60.0,
8108,13 Reasons Why,2017–2020,"Drama, Mystery, Thriller",6.1,The Jensens make the boys take a drug test. Wh...,Director:Brenda Strong Stars:Dylan Min...,1507.0,57.0,
8264,13 Reasons Why,2017–2020,"Drama, Mystery, Thriller",5.6,As the school gears up for the Love Is Love da...,Director:Michael Sucsy Stars:Dylan Min...,1635.0,59.0,
...,...,...,...,...,...,...,...,...,...
8340,ÜberWeihnachten,2020,"Comedy, Drama, Romance",7.2,The Most Wonderful Christmas of the Year - the...,"Stars:Luke Mockridge, Seyneb S...",59.0,45.0,
8339,ÜberWeihnachten,2020,"Comedy, Drama, Romance",7.5,Sausages and Potato Salad - Basti hooks up wit...,"Stars:Luke Mockridge, Seyneb S...",66.0,50.0,
8338,ÜberWeihnachten,2020,"Comedy, Drama, Romance",7.2,"Home Is Where the Tree Is - Bastian, an aspiri...","Stars:Luke Mockridge, Seyneb S...",65.0,48.0,
6352,Far Cry,,"Animation, Action, Adventure",,Plot under wraps. Adaptation of the Ubisoft game.,,,,


It seems that for some series each row is an episode, since the rows differ in plot or in director/cast. We can group these entries by averaging the rating, votes, runtime, and gross, and concatenating the text fields to process later.

In [265]:
aggregations = {
    'Rating':'mean',
    'One-line':'sum',
    'Stars':'sum',
    'Votes':'mean',
    'Runtime':'mean',
    'Gross':'mean'
}

movies_raw_dataset = movies_raw_dataset.groupby(['Movies','Year','Genre']).agg(aggregations).reset_index()
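
One caveat with the grouping above: by default, `groupby` drops rows whose key columns contain missing values, so an entry such as Far Cry (which has no Year) would likely disappear from the result. The toy frame below sketches the difference made by `dropna=False`, available since pandas 1.1; the "Show A" rows and their ratings are made up for illustration.

```python
import pandas as pd

# Toy frame: two episode rows of one show, plus one title with a missing Year.
toy = pd.DataFrame({
    "Movies": ["Show A", "Show A", "Far Cry"],
    "Year": ["2017–2020", "2017–2020", None],
    "Rating": [6.0, 7.0, None],
})

# Default behaviour: the row with a missing key is dropped.
default_groups = toy.groupby(["Movies", "Year"]).agg({"Rating": "mean"}).reset_index()

# dropna=False keeps groups whose key contains a missing value.
kept_groups = (
    toy.groupby(["Movies", "Year"], dropna=False).agg({"Rating": "mean"}).reset_index()
)

print(len(default_groups), len(kept_groups))
```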

Let's check again now

In [271]:
movies_raw_dataset[movies_raw_dataset.duplicated(subset=['Movies','Year','Genre'], keep=False)].sort_values(by='Movies')

Unnamed: 0,Movies,Year,Genre,Rating,One-line,Stars,Votes,Runtime,Gross


Perfect! Our dataset looks much cleaner!

There are some additional ideas we could implement, but they have significant drawbacks.
- We could try to divide the dataset into a movies dataset and a series dataset. However, our only criterion would be the presence of a year range in the _Year_ column, which would mean that the series ran for several years. This holds for most series, but series broadcast within a single year would be misclassified.
- The _Genre_ column could also be split into several columns, but that could imply a hierarchy, and the categories seem to be ordered alphabetically rather than by importance. As an example, a linear regression model could be influenced by this, assigning different importance to a genre depending on which column it appears in.
- Creating columns for the director and stars is a possibility, but again, we have no information on the display order. Since it is very unlikely that two different movies/series share the same cast, keeping the whole cast could be useless. Isolating the director works in most cases, but not for series whose episodes had different directors, as we saw above.
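
For reference, the first two ideas could be sketched as follows on a toy frame built from rows shown earlier. The column name `Is_series_guess` is hypothetical, and the flag has exactly the weakness described above: a series that aired within a single year is labelled as a movie. For genres, one 0/1 indicator column per genre avoids tying a genre to a position; the sketch assumes the Genre strings carry no stray whitespace around the commas.

```python
import pandas as pd

# Toy rows mirroring the cleaned dataset's shape.
toy = pd.DataFrame({
    "Movies": ["Blood Red Sky", "The Walking Dead", "Rick and Morty"],
    "Year": ["2021", "2010–2022", "2013–"],
    "Genre": [
        "Action, Horror, Thriller",
        "Drama, Horror, Thriller",
        "Animation, Adventure, Comedy",
    ],
})

# Heuristic: an en dash in Year suggests a year range, hence a series.
toy["Is_series_guess"] = toy["Year"].str.contains("–", regex=False)

# One indicator column per genre, independent of the order in the string.
genre_flags = toy["Genre"].str.get_dummies(sep=", ")

print(toy[["Movies", "Is_series_guess"]])
print(genre_flags)
```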

In [273]:
movies_raw_dataset.head()

Unnamed: 0,Movies,Year,Genre,Rating,One-line,Stars,Votes,Runtime,Gross
0,13 Reasons Why,2017–2020,"Drama, Mystery, Thriller",6.14,"The police question Tyler about the guns, leav...",Director:Russell Mulcahy Stars:Dylan M...,1693.6,62.8,
1,1899,2022–,"Drama, History, Horror",,Add a PlotAdd a Plot,Director:Baran bo Odar Stars:Aneurin B...,,,
2,3Below: Tales of Arcadia,2018–2019,"Animation, Action, Adventure",7.95,"Left vulnerable after Omen's attack, the royal...",Director:Andrew L. Schmidt Stars:Tatia...,129.5,20.461538,
3,50M2,2021–,"Comedy, Drama, Thriller",7.3,"While seeking answers about his parents, Shado...",Director:Selçuk Aydemir Stars:Engin Öz...,119.5,49.0,
4,7Seeds,2019–2020,"Animation, Action, Adventure",7.108696,"Convinced that Botan has kidnapped them, Natsu...","Stars:Morgan Berry, Amber Lee ...",30.521739,24.695652,
