# Outline of Data Cleaning Process 🧽:

- Documenting assumptions and normalization done for the data
- Preliminary data inspection

**Notes on Data:**
- Understanding **IMDB Rating**
  - The IMDB rating is weighted by a variety of factors, including a voter's overall past voting reputation (https://www.getafollower.com/blog/imdb-ratings/)
  - This weighting is what makes it a particularly reliable metric for understanding a movie's success
- Understanding **Certificate**
  - Age rating for the movie; certificate documentation: https://help.imdb.com/article/contribution/titles/certificates/GU757M8ZJ9ZPXB39#
- Understanding **Gross**
  - Gross box office earnings for the movie

# 1. Handling Imports and Data Access

In [1]:
# Required library imports
import kagglehub
from kagglehub import KaggleDatasetAdapter
import pandas as pd

In [3]:
# Accessing IMDB data from Kaggle and putting into a DataFrame

# Set the path to the file you'd like to load
file_path = "imdb_top_1000.csv"

# Load the latest version
movies = kagglehub.load_dataset(
  KaggleDatasetAdapter.PANDAS,
  "harshitshankhdhar/imdb-dataset-of-top-1000-movies-and-tv-shows",
  file_path
)

# Preview the loaded DataFrame (the variable is `movies`; `df` was never defined)
print("First 5 records:", movies.head())


First 5 records:                                          Poster_Link  \
0  https://m.media-amazon.com/images/M/MV5BMDFkYT...   
1  https://m.media-amazon.com/images/M/MV5BM2MyNj...   
2  https://m.media-amazon.com/images/M/MV5BMTMxNT...   
3  https://m.media-amazon.com/images/M/MV5BMWMwMG...   
4  https://m.media-amazon.com/images/M/MV5BMWU4N2...   

               Series_Title Released_Year Certificate  Runtime  \
0  The Shawshank Redemption          1994           A  142 min   
1             The Godfather          1972           A  175 min   
2           The Dark Knight          2008          UA  152 min   
3    The Godfather: Part II          1974           A  202 min   
4              12 Angry Men          1957           U   96 min   

                  Genre  IMDB_Rating  \
0                 Drama          9.3   
1          Crime, Drama          9.2   
2  Action, Crime, Drama          9.0   
3          Crime, Drama          9.0   
4          Crime, Drama          9.0   

        

# 2. Inspecting Data and Beginning Data Cleaning Process

**Attributes Contained in Original Data:**
- Poster_Link
- Series_Title
- Released_Year
- Certificate (age rating)
- Runtime
- Genre
- IMDB_Rating (IMDB score for the movie, out of 10)
- Overview
- Meta_score (Metascore for the movie, out of 100)
- Director
- Star1
- Star2
- Star3
- Star4 (Star1 through Star4 are the top 4 star actors for the movie)
- No_of_Votes
- Gross (gross box office earnings for the movie)
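
A quick way to check these attributes against the loaded data is to look at each column's dtype and missing-value count. The sketch below is a minimal example, not one of the notebook's own cells: it runs on a small stand-in frame named `sample`, built from a few values shown in the output above, and the `summarize_columns` helper is a name introduced here for illustration. In the notebook, the real `movies` DataFrame would take the place of `sample`.

```python
import pandas as pd

# Hypothetical stand-in for `movies`: a few columns and rows copied from the preview above.
sample = pd.DataFrame({
    "Series_Title": ["The Shawshank Redemption", "The Godfather", "The Dark Knight"],
    "Runtime": ["142 min", "175 min", "152 min"],
    "Genre": ["Drama", "Crime, Drama", "Action, Crime, Drama"],
    "IMDB_Rating": [9.3, 9.2, 9.0],
})

def summarize_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column with its dtype and its number of missing values."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "n_missing": df.isna().sum(),
    })

summary = summarize_columns(sample)
print(summary)
```

Columns that should be numeric but show up with a text dtype, or that have a non-zero `n_missing`, are the ones likely to need attention during cleaning.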

In [22]:
# looking at the columns

movies.columns


Index(['Poster_Link', 'Series_Title', 'Released_Year', 'Certificate',
       'Runtime', 'Genre', 'IMDB_Rating', 'Overview', 'Meta_score', 'Director',
       'Star1', 'Star2', 'Star3', 'Star4', 'No_of_Votes', 'Gross'],
      dtype='str')

**Attributes I will discard:** The poster link is not useful for this project, so I will drop it and keep everything else for now.
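
One thing to watch in a notebook: re-running a cell that drops a column raises a `KeyError` the second time, because the column is already gone. A re-run-safe variant, sketched here on a small stand-in frame named `sample` (hypothetical, with values taken from the preview above), passes `errors="ignore"` to `DataFrame.drop`.

```python
import pandas as pd

# Hypothetical stand-in for `movies`, using values from the preview above.
sample = pd.DataFrame({
    "Poster_Link": ["https://m.media-amazon.com/images/M/MV5BMDFkYT..."],
    "Series_Title": ["The Shawshank Redemption"],
    "IMDB_Rating": [9.3],
})

# errors="ignore" skips labels that are not present, so repeating the drop is harmless.
sample = sample.drop(columns=["Poster_Link"], errors="ignore")
sample = sample.drop(columns=["Poster_Link"], errors="ignore")  # second run: no KeyError

print(list(sample.columns))
```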

In [23]:
movies = movies.drop("Poster_Link", axis=1)
movies.head()

| | Series_Title | Released_Year | Certificate | Runtime | Genre | IMDB_Rating | Overview | Meta_score | Director | Star1 | Star2 | Star3 | Star4 | No_of_Votes | Gross |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | The Shawshank Redemption | 1994 | A | 142 min | Drama | 9.3 | Two imprisoned men bond over a number of years... | 80.0 | Frank Darabont | Tim Robbins | Morgan Freeman | Bob Gunton | William Sadler | 2343110 | 28341469 |
| 1 | The Godfather | 1972 | A | 175 min | Crime, Drama | 9.2 | An organized crime dynasty's aging patriarch t... | 100.0 | Francis Ford Coppola | Marlon Brando | Al Pacino | James Caan | Diane Keaton | 1620367 | 134966411 |
| 2 | The Dark Knight | 2008 | UA | 152 min | Action, Crime, Drama | 9.0 | When the menace known as the Joker wreaks havo... | 84.0 | Christopher Nolan | Christian Bale | Heath Ledger | Aaron Eckhart | Michael Caine | 2303232 | 534858444 |
| 3 | The Godfather: Part II | 1974 | A | 202 min | Crime, Drama | 9.0 | The early life and career of Vito Corleone in ... | 90.0 | Francis Ford Coppola | Al Pacino | Robert De Niro | Robert Duvall | Diane Keaton | 1129952 | 57300000 |
| 4 | 12 Angry Men | 1957 | U | 96 min | Crime, Drama | 9.0 | A jury holdout attempts to prevent a miscarria... | 96.0 | Sidney Lumet | Henry Fonda | Lee J. Cobb | Martin Balsam | John Fiedler | 689845 | 4360000 |
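
The preview shows `Runtime` stored as text such as `142 min`, which would need normalizing before any numeric analysis. One possible approach is to extract the digits and convert them to integers. This sketch assumes every value follows the `<number> min` pattern, which has not been verified against the full dataset. The `runtime_to_minutes` helper and the `Runtime_min` column name are illustrative and not part of the original notebook.

```python
import pandas as pd

def runtime_to_minutes(runtime: pd.Series) -> pd.Series:
    """Convert strings like '142 min' to nullable integer minutes."""
    # Capture the digits before 'min'; values that do not match the pattern become missing.
    digits = runtime.str.extract(r"^\s*(\d+)\s*min\s*$", expand=False)
    return pd.to_numeric(digits, errors="coerce").astype("Int64")

# Runtime values copied from the first five rows shown above.
sample = pd.DataFrame({"Runtime": ["142 min", "175 min", "152 min", "202 min", "96 min"]})
sample["Runtime_min"] = runtime_to_minutes(sample["Runtime"])
print(sample)
```

Using the nullable `Int64` dtype keeps the column integer-typed even if some rows fail to parse, so unexpected formats surface as missing values rather than as errors.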
