### **Introduction**
This project is my first step into the world of data analysis and understanding what it takes to work with data.

The idea behind this project is simple: pick a dataset, learn to clean and analyze it, extract statistical insights, and then create visualizations that communicate some of the findings.

LINK TO DATASET: https://www.kaggle.com/datasets/harshitshankhdhar/imdb-dataset-of-top-1000-movies-and-tv-shows

For this mini project, I decided to use a Kaggle dataset containing the top 1000 movies on IMDB.

### **Part 1: Data Exploration and Cleaning**

In [142]:
# importing Python libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt  # the plotting API lives in the pyplot submodule
import seaborn as sns

##### **Data Overview**
- how many rows and columns does the dataset have?
- what are the data types of each column? are there any missing values?

In [143]:
# reading data from .csv file
data = pd.read_csv("imdb_top_1000.csv")

In [144]:
data.shape

(1000, 16)

There are 1000 rows and 16 columns in the dataset. This is before any unnecessary columns are removed.

In [145]:
data.dtypes

Poster_Link       object
Series_Title      object
Released_Year     object
Certificate       object
Runtime           object
Genre             object
IMDB_Rating      float64
Overview          object
Meta_score       float64
Director          object
Star1             object
Star2             object
Star3             object
Star4             object
No_of_Votes        int64
Gross             object
dtype: object

Most of the columns in the dataset have the `object` data type (which pandas uses for strings and mixed values). Columns such as `Released_Year`, `Runtime`, and `Gross` should be converted to a numeric type such as `int64` to make them easier to work with. Additionally, columns such as `Poster_Link`, `Certificate`, `Overview`, and `Meta_score` are unnecessary for my use case and can be removed.
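A minimal sketch of how that conversion could look. It runs on a small hypothetical sample rather than the real file, and it assumes the raw values look like `"142 min"` for `Runtime` and may contain thousands separators in `Gross`. The `"n/a"` entry is an invented malformed value, included only to show how `errors="coerce"` behaves. Because a plain `int64` column cannot hold missing values, the sketch uses the nullable `Int64` dtype wherever nulls can appear.

```python
import pandas as pd

# Hypothetical sample that mimics the assumed raw string formats.
sample = pd.DataFrame({
    "Released_Year": ["1994", "1972", "n/a"],        # "n/a" is an invented bad value
    "Runtime": ["142 min", "175 min", "152 min"],
    "Gross": ["28,341,469", "134,966,411", None],    # None stands in for a missing value
})

cleaned = sample.copy()

# Unparseable years become missing instead of raising an error.
cleaned["Released_Year"] = pd.to_numeric(
    cleaned["Released_Year"], errors="coerce"
).astype("Int64")

# Strip the " min" suffix, then parse the remaining digits.
cleaned["Runtime"] = pd.to_numeric(
    cleaned["Runtime"].str.replace(" min", "", regex=False)
)

# Remove thousands separators (if any), then parse; missing values stay missing.
cleaned["Gross"] = pd.to_numeric(
    cleaned["Gross"].str.replace(",", "", regex=False), errors="coerce"
).astype("Int64")

print(cleaned.dtypes)
```

On the real dataset the same three steps would be applied to `data` instead of `sample`, after checking which formats actually occur in each column.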

In [146]:
data.isna().sum()

Poster_Link        0
Series_Title       0
Released_Year      0
Certificate      101
Runtime            0
Genre              0
IMDB_Rating        0
Overview           0
Meta_score       157
Director           0
Star1              0
Star2              0
Star3              0
Star4              0
No_of_Votes        0
Gross            169
dtype: int64

The output above shows the number of null values in each column. `Certificate` and `Meta_score` also contain nulls, but both columns will be dropped, so only the null values in `Gross` need to be handled.
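Two common ways to handle missing values like these are dropping the affected rows or filling them with a summary statistic. The sketch below shows both on a small hypothetical sample (the index labels are made up and `Gross` is assumed to be numeric already). It illustrates the options only and is not necessarily the approach used later in this notebook.

```python
import pandas as pd

# Hypothetical sample: five movies, two with a missing Gross value.
sample = pd.DataFrame(
    {"Gross": [28341469.0, None, 534858444.0, None, 4360000.0]},
    index=["A", "B", "C", "D", "E"],  # placeholder titles
)

# Option 1: drop the rows where Gross is missing.
dropped = sample.dropna(subset=["Gross"])

# Option 2: fill with the median, keeping a flag so imputed rows stay identifiable.
filled = sample.copy()
filled["Gross_missing"] = filled["Gross"].isna()
filled["Gross"] = filled["Gross"].fillna(filled["Gross"].median())

print(len(dropped), "rows kept after dropping")
print(filled)
```

Dropping keeps only observed values but shrinks the dataset, while median filling keeps every row at the cost of flattening the distribution of `Gross`. With 169 of 1000 values missing, that trade-off is worth thinking about before choosing.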

In [147]:
data.duplicated().sum()

0

There are no duplicate rows in the dataset.

##### **Data Cleaning**
- identify and handle any missing or inconsistent data
- are there any outliers in the dataset? if so, how would you handle them?
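One common way to answer the outlier question is the 1.5 × IQR rule. The sketch below applies it to a hypothetical series of runtimes in minutes. Most values are borrowed from rows of this dataset, and the 400 is invented to give the rule something to flag. This is a generic convention, not a method prescribed by the dataset.

```python
import pandas as pd

# Hypothetical runtimes in minutes; 400 is an invented extreme value.
runtime = pd.Series([96, 115, 118, 142, 152, 175, 202, 400], name="Runtime")

q1 = runtime.quantile(0.25)
q3 = runtime.quantile(0.75)
iqr = q3 - q1

# Values beyond 1.5 * IQR from the quartiles are flagged as potential outliers.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
outliers = runtime[(runtime < lower) | (runtime > upper)]

print(f"bounds: [{lower}, {upper}]")
print(outliers)
```

Flagged values are worth inspecting rather than deleting automatically: a very long film is a legitimate data point, whereas a data-entry error is not.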

In [148]:
data = data.drop(columns=["Poster_Link", "Certificate", "Overview", "Meta_score", "No_of_Votes"]) # dropping unnecessary columns
data = data.rename(columns={"Series_Title" : "Movie_Title", "Released_Year" : "Release_Year"}) # renaming columns 
data = data.set_index("Movie_Title") # using the Movie_Title as the index instead of 0,1,2...
data

| Movie_Title | Release_Year | Runtime | Genre | IMDB_Rating | Director | Star1 | Star2 | Star3 | Star4 | Gross |
|---|---|---|---|---|---|---|---|---|---|---|
| The Shawshank Redemption | 1994 | 142 min | Drama | 9.3 | Frank Darabont | Tim Robbins | Morgan Freeman | Bob Gunton | William Sadler | 28341469 |
| The Godfather | 1972 | 175 min | Crime, Drama | 9.2 | Francis Ford Coppola | Marlon Brando | Al Pacino | James Caan | Diane Keaton | 134966411 |
| The Dark Knight | 2008 | 152 min | Action, Crime, Drama | 9.0 | Christopher Nolan | Christian Bale | Heath Ledger | Aaron Eckhart | Michael Caine | 534858444 |
| The Godfather: Part II | 1974 | 202 min | Crime, Drama | 9.0 | Francis Ford Coppola | Al Pacino | Robert De Niro | Robert Duvall | Diane Keaton | 57300000 |
| 12 Angry Men | 1957 | 96 min | Crime, Drama | 9.0 | Sidney Lumet | Henry Fonda | Lee J. Cobb | Martin Balsam | John Fiedler | 4360000 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| Breakfast at Tiffany's | 1961 | 115 min | Comedy, Drama, Romance | 7.6 | Blake Edwards | Audrey Hepburn | George Peppard | Patricia Neal | Buddy Ebsen | NaN |
| Giant | 1956 | 201 min | Drama, Western | 7.6 | George Stevens | Elizabeth Taylor | Rock Hudson | James Dean | Carroll Baker | NaN |
| From Here to Eternity | 1953 | 118 min | Drama, Romance, War | 7.6 | Fred Zinnemann | Burt Lancaster | Montgomery Clift | Deborah Kerr | Donna Reed | 30500000 |
| Lifeboat | 1944 | 97 min | Drama, War | 7.6 | Alfred Hitchcock | Tallulah Bankhead | John Hodiak | Walter Slezak | William Bendix | NaN |
