# IMDB Scraping Data

**Start date:** 9/3/2023 

**Scope:** This analysis has the main goal of praticing cleaning data and peforming an exploratory analysis on this dataset.

In [1]:
#Importing the data
# !kaggle datasets download -d bharatnatrayn/movies-dataset-for-feature-extracion-prediction

In [2]:
# import zipfile

# with zipfile.ZipFile('movies-dataset-for-feature-extracion-prediction.zip', 'r') as zip_ref:
#     zip_ref.extractall()

In [3]:
import pandas as pd

In [4]:
movies_raw_dataset = pd.read_csv('movies.csv')

In [5]:
movies_raw_dataset.head()

Unnamed: 0,MOVIES,YEAR,GENRE,RATING,ONE-LINE,STARS,VOTES,RunTime,Gross
0,Blood Red Sky,(2021),"\nAction, Horror, Thriller",6.1,\nA woman with a mysterious illness is forced ...,\n Director:\nPeter Thorwarth\n| \n Star...,21062.0,121.0,
1,Masters of the Universe: Revelation,(2021– ),"\nAnimation, Action, Adventure",5.0,\nThe war for Eternia begins again in what may...,"\n \n Stars:\nChris Wood, \nSara...",17870.0,25.0,
2,The Walking Dead,(2010–2022),"\nDrama, Horror, Thriller",8.2,\nSheriff Deputy Rick Grimes wakes up from a c...,"\n \n Stars:\nAndrew Lincoln, \n...",885805.0,44.0,
3,Rick and Morty,(2013– ),"\nAnimation, Adventure, Comedy",9.2,\nAn animated series that follows the exploits...,"\n \n Stars:\nJustin Roiland, \n...",414849.0,23.0,
4,Army of Thieves,(2021),"\nAction, Crime, Horror",,"\nA prequel, set before the events of Army of ...",\n Director:\nMatthias Schweighöfer\n| \n ...,,,


On a first glance, there are some issues that will need some work:
- The database does not contain only movies but movies and series
- All text fields contain special characters like '\n'
- RunTime column contains the full length of movies but only the episode length for series
- For beautifying purposes, column names also need to be standardized

Let's start with the columns names and then we have a look at the dataset summary and types

In [None]:
movies_raw_dataset.columns = pd.Series(movies_raw_dataset.columns).apply(lambda x: x[0].upper() + x[1:].lower())

In [6]:
movies_raw_dataset.describe()

Unnamed: 0,RATING,RunTime
count,8179.0,7041.0
mean,6.921176,68.688539
std,1.220232,47.258056
min,1.1,1.0
25%,6.2,36.0
50%,7.1,60.0
75%,7.8,95.0
max,9.9,853.0


Rating column seems fine, as the ratings on IMDB go from 1 to 10. However, 853 for run time seems a little excessive (more than 14 hours). Given that th percentile 75% is 95 minutes, we are likely looking at an outlier here. Let's check it out.

In [11]:
movies_raw_dataset.nlargest(5, columns='Runtime')

Unnamed: 0,MOVIES,YEAR,GENRE,RATING,ONE-LINE,STARS,VOTES,RunTime,Gross
1902,El tiempo entre costuras,(2013–2014),"\nAdventure, Drama, History",8.3,\nSira Quiroga is a young Spanish dressmaker e...,"\n \n Stars:\nAdriana Ugarte, \n...",3876,853.0,
1081,Soupçons,(2004–2018),"\nDocumentary, Crime, Drama",7.9,\nThe high-profile murder trial of American no...,"\n \n Stars:\nMichael Peterson, ...",20200,629.0,
2498,The Innocence Files,(2020),"\nDocumentary, Crime",8.0,\nCases of wrongful conviction that the Innoce...,"\n \n Stars:\nPeter Neufeld, \nB...",2335,573.0,
201,The Haunting of Hill House,(2018),"\nDrama, Horror, Mystery",8.6,"\nFlashing between past and present, a fractur...","\n \n Stars:\nMichiel Huisman, \...",195117,572.0,
820,Cosmos: A Spacetime Odyssey,(2014),\nDocumentary,9.3,\nAn exploration of our discovery of the laws ...,\n \n Stars:\nNeil deGrasse Tyso...,114386,557.0,


So apparently we have mixed values on the RunTime column, with some series having the duration per episode and other having the total duration of the series. We will have to find a way to differentiate these. 

In [7]:
movies_raw_dataset.dtypes

MOVIES       object
YEAR         object
GENRE        object
RATING      float64
ONE-LINE     object
STARS        object
VOTES        object
RunTime     float64
Gross        object
dtype: object

Both votes and Gross columns are objects, while they could supposedly by numerical values. Let's have a deeper look at Gross. Since on the head there are no values for this column, we don't really know what is its formatting and if it needs any additional work.

In [9]:
movies_raw_dataset.loc[~movies_raw_dataset['Gross'].isna(), 'Gross'].head(20)

77      $75.47M
85     $402.45M
95      $89.22M
111    $315.54M
125     $57.01M
128    $260.00M
132    $132.38M
143    $167.77M
144    $404.52M
145     $15.07M
156     $70.10M
159    $210.61M
161    $327.48M
165    $390.53M
171    $303.00M
172     $56.63M
175     $58.06M
181    $353.01M
189     $46.89M
191      $7.00M
Name: Gross, dtype: object

So the Gross column will also need some formatting. 

We can now work on the column types and formatting

In [32]:
movies_raw_dataset.dtypes

Movies       object
Year        float64
Genre       float64
Rating      float64
One-line     object
Stars        object
Votes        object
Runtime     float64
Gross        object
dtype: object

In [30]:
#Remove parenthesis from the Year Column
movies_raw_dataset['Year'] = movies_raw_dataset['Year'].str.strip(['(',')']).astype(str)

AttributeError: Can only use .str accessor with string values!

In [31]:
#Remove new line char from Genre
movies_raw_dataset['Genre'] = movies_raw_dataset['Genre'].str.strip(['\n'])

Let's now check for duplicates on the data.

In [21]:
#First we drop full duplicates
movies_raw_dataset.drop_duplicates(inplace=True, ignore_index=True)

In [26]:
#Checking for partial duplicates
movies_raw_dataset[movies_raw_dataset.duplicated(subset=['Movies','Year','Genre'], keep=False)].sort_values(by='Movies')

Unnamed: 0,Movies,Year,Genre,Rating,One-line,Stars,Votes,Runtime,Gross
8376,13 Reasons Why,(2017–2020),"\nDrama, Mystery, Thriller",5.9,\nClay's mental health continues to decline as...,\n Director:\nSunu Gonera\n| \n Stars:\n...,1420,60.0,
8197,13 Reasons Why,(2017–2020),"\nDrama, Mystery, Thriller",6.6,"\nThe school goes into lockdown and Clay, Tony...",\n Director:\nBrenda Strong\n| \n Stars:...,1694,61.0,
8395,13 Reasons Why,(2017–2020),"\nDrama, Mystery, Thriller",6.8,\nWhen the dean begins a new investigation and...,\n Director:\nTommy Lohmann\n| \n Stars:...,1640,60.0,
8196,13 Reasons Why,(2017–2020),"\nDrama, Mystery, Thriller",6.1,\nThe Jensens make the boys take a drug test. ...,\n Director:\nBrenda Strong\n| \n Stars:...,1507,57.0,
8375,13 Reasons Why,(2017–2020),"\nDrama, Mystery, Thriller",5.6,\nAs the school gears up for the Love Is Love ...,\n Director:\nMichael Sucsy\n| \n Stars:...,1635,59.0,
...,...,...,...,...,...,...,...,...,...
8457,ÜberWeihnachten,(2020),"\nComedy, Drama, Romance",7.2,\nThe Most Wonderful Christmas of the Year - t...,"\n \n Stars:\nLuke Mockridge, \n...",59,45.0,
8456,ÜberWeihnachten,(2020),"\nComedy, Drama, Romance",7.5,\nSausages and Potato Salad - Basti hooks up w...,"\n \n Stars:\nLuke Mockridge, \n...",66,50.0,
8455,ÜberWeihnachten,(2020),"\nComedy, Drama, Romance",7.2,"\nHome Is Where the Tree Is - Bastian, an aspi...","\n \n Stars:\nLuke Mockridge, \n...",65,48.0,
6352,Far Cry,,"\nAnimation, Action, Adventure",,\nPlot under wraps. Adaptation of the Ubisoft ...,\n,,,


It seems like there are series in which each line is an episode because they either have a different plot or a different director/actors. We can group this entries, using the rating and votes average, the sum of the runtime and concatenating the text fields to process later.

In [27]:
aggregations = {
    'Rating':'mean',
    'One-line':'sum',
    'Stars':'sum',
    'Votes':'mean',
    'Runtime':'mean',
    'Gross':'mean'
}

movies_raw_dataset = movies_raw_dataset.groupby(['Movies','Year','Genre']).agg(aggregations)

TypeError: Could not convert 1,7981,5971,7141,6191,5071,6941,6351,4201,6402,312 to numeric

Let's start with the correcting the case of our column names in order to make it easier to write our code afterwards.