# IMDB Movie Analysis
This project analyzes the IMDB movie dataset, using the CSV file provided by ZTM.
The goals of the project are:
1. Clean and preprocess the data.
2. Determine the relationship between genre and movie rating.
3. Analyze the factors that contribute to box office success.
4. Predict movie ratings via confidence intervals.
5. Write up a conclusion that highlights the key insights learned.

Importing libraries

In [26]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as st

In [27]:
data = pd.read_csv("imdb.csv")
data.head()

,Poster_Link,Series_Title,Released_Year,Certificate,Runtime,Genre,IMDB_Rating,Overview,Meta_score,Director,Star1,Star2,Star3,Star4,No_of_Votes,Gross
0,https://m.media-amazon.com/images/M/MV5BMDFkYT...,The Shawshank Redemption,1994,A,142 min,Drama,9.3,Two imprisoned men bond over a number of years...,80.0,Frank Darabont,Tim Robbins,Morgan Freeman,Bob Gunton,William Sadler,2343110,28341469
1,https://m.media-amazon.com/images/M/MV5BM2MyNj...,The Godfather,1972,A,175 min,"Crime, Drama",9.2,An organized crime dynasty's aging patriarch t...,100.0,Francis Ford Coppola,Marlon Brando,Al Pacino,James Caan,Diane Keaton,1620367,134966411
2,https://m.media-amazon.com/images/M/MV5BMTMxNT...,The Dark Knight,2008,UA,152 min,"Action, Crime, Drama",9.0,When the menace known as the Joker wreaks havo...,84.0,Christopher Nolan,Christian Bale,Heath Ledger,Aaron Eckhart,Michael Caine,2303232,534858444
3,https://m.media-amazon.com/images/M/MV5BMWMwMG...,The Godfather: Part II,1974,A,202 min,"Crime, Drama",9.0,The early life and career of Vito Corleone in ...,90.0,Francis Ford Coppola,Al Pacino,Robert De Niro,Robert Duvall,Diane Keaton,1129952,57300000
4,https://m.media-amazon.com/images/M/MV5BMWU4N2...,12 Angry Men,1957,U,96 min,"Crime, Drama",9.0,A jury holdout attempts to prevent a miscarria...,96.0,Sidney Lumet,Henry Fonda,Lee J. Cobb,Martin Balsam,John Fiedler,689845,4360000


In [28]:
data.describe()

,IMDB_Rating,Meta_score,No_of_Votes
count,1000.0,843.0,1000.0
mean,7.9493,77.97153,273692.9
std,0.275491,12.376099,327372.7
min,7.6,28.0,25088.0
25%,7.7,70.0,55526.25
50%,7.9,79.0,138548.5
75%,8.1,87.0,374161.2
max,9.3,100.0,2343110.0


In [29]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 16 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Poster_Link    1000 non-null   object 
 1   Series_Title   1000 non-null   object 
 2   Released_Year  1000 non-null   object 
 3   Certificate    899 non-null    object 
 4   Runtime        1000 non-null   object 
 5   Genre          1000 non-null   object 
 6   IMDB_Rating    1000 non-null   float64
 7   Overview       1000 non-null   object 
 8   Meta_score     843 non-null    float64
 9   Director       1000 non-null   object 
 10  Star1          1000 non-null   object 
 11  Star2          1000 non-null   object 
 12  Star3          1000 non-null   object 
 13  Star4          1000 non-null   object 
 14  No_of_Votes    1000 non-null   int64  
 15  Gross          831 non-null    object 
dtypes: float64(2), int64(1), object(13)
memory usage: 125.1+ KB
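
The `info()` output above shows that three columns have gaps: `Certificate` (899 of 1000 non-null), `Meta_score` (843) and `Gross` (831), i.e. 101, 157 and 169 missing values respectively. A minimal sketch of how the same counts could be read off directly is shown below. The frame `toy` and its values are hypothetical stand-ins for `data`, used only so the snippet runs on its own.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for `data`; values are illustrative only.
toy = pd.DataFrame({
    "Certificate": ["A", None, "UA", "U"],
    "Meta_score": [80.0, 100.0, np.nan, np.nan],
    "Gross": ["28,341,469", None, None, None],
})

# Count missing entries per column and express them as a share of all rows.
missing = toy.isna().sum()
missing_share = (missing / len(toy)).round(2)
summary = pd.DataFrame({"missing": missing, "share": missing_share})
print(summary)
```

Applied to the real data, `data.isna().sum()` should report the same gaps that `info()` implies.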


In [30]:
df = data[["Series_Title" ,"Director", "Genre", "IMDB_Rating", "Released_Year", "Gross", "Runtime", "Meta_score"]]
df.head()

,Series_Title,Director,Genre,IMDB_Rating,Released_Year,Gross,Runtime,Meta_score
0,The Shawshank Redemption,Frank Darabont,Drama,9.3,1994,28341469,142 min,80.0
1,The Godfather,Francis Ford Coppola,"Crime, Drama",9.2,1972,134966411,175 min,100.0
2,The Dark Knight,Christopher Nolan,"Action, Crime, Drama",9.0,2008,534858444,152 min,84.0
3,The Godfather: Part II,Francis Ford Coppola,"Crime, Drama",9.0,1974,57300000,202 min,90.0
4,12 Angry Men,Sidney Lumet,"Crime, Drama",9.0,1957,4360000,96 min,96.0


In [31]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 8 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Series_Title   1000 non-null   object 
 1   Director       1000 non-null   object 
 2   Genre          1000 non-null   object 
 3   IMDB_Rating    1000 non-null   float64
 4   Released_Year  1000 non-null   object 
 5   Gross          831 non-null    object 
 6   Runtime        1000 non-null   object 
 7   Meta_score     843 non-null    float64
dtypes: float64(2), object(6)
memory usage: 62.6+ KB


In [33]:
# Released_Year is stored as object (string); convert it to numeric, coercing invalid entries to NaN
df["Released_Year"] = pd.to_numeric(df["Released_Year"], errors = "coerce")

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df["Released_Year"] = pd.to_numeric(df["Released_Year"], errors = "coerce")
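
The warning above appears because `df` was created by selecting columns from `data`, so pandas cannot tell whether the assignment is meant to modify `data` as well. One common way to avoid it, sketched here under the assumption that `df` is meant to be an independent table, is to take an explicit `.copy()` when building the subset. The names `raw` and `subset` and the toy values are hypothetical.

```python
import pandas as pd

# Hypothetical toy frame standing in for `data`.
raw = pd.DataFrame({
    "Series_Title": ["Film A", "Film B", "Film C"],
    "Released_Year": ["1994", "unknown", "2008"],
    "Star1": ["x", "y", "z"],
})

# .copy() makes the subset an independent DataFrame, so later column
# assignments are unambiguous and pandas has no reason to warn.
subset = raw[["Series_Title", "Released_Year"]].copy()
subset["Released_Year"] = pd.to_numeric(subset["Released_Year"], errors="coerce")

print(subset.dtypes)
print(raw["Released_Year"].tolist())  # the original frame is unchanged
```

In this notebook the equivalent change would be appending `.copy()` to the column selection in cell `In [30]`.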


In [34]:
# errors="coerce" replaces any non-numeric year with NaN instead of raising; row 966 is one such case
df.iloc[966]
# Released_Year is now NaN for this row

Series_Title                     Apollo 13
Director                        Ron Howard
Genre            Adventure, Drama, History
IMDB_Rating                            7.6
Released_Year                          NaN
Gross                          173,837,933
Runtime                            140 min
Meta_score                            77.0
Name: 966, dtype: object
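
Row 966 was checked by position here. A more general way to find every row affected by the coercion is to compare the column before and after conversion, as in the sketch below. It assumes the raw strings are still available; `years_raw` and its values are hypothetical.

```python
import pandas as pd

# Hypothetical raw year strings, including one non-numeric entry.
years_raw = pd.Series(["1994", "1972", "unknown", "2008"], name="Released_Year")
years_num = pd.to_numeric(years_raw, errors="coerce")

# Rows that had a value before conversion but are NaN afterwards were coerced.
coerced = years_raw.notna() & years_num.isna()
print(years_raw[coerced])
```

In this notebook the raw strings remain in `data["Released_Year"]`, so the same mask could be built from that column and `df["Released_Year"]`.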

In [40]:
# create a Decade column
df["Released_Decade"] = np.floor(df["Released_Year"]/10)*10

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df["Released_Decade"] = np.floor(df["Released_Year"]/10)*10
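
The decade is obtained by dividing the year by 10, rounding down, and multiplying by 10 again, so 1994 becomes 1990. Because `Released_Year` contains a NaN, the result is a float column. The sketch below shows the same arithmetic on hypothetical values, plus an optional conversion to pandas' nullable `Int64` dtype, which is an addition for illustration and not something the notebook does.

```python
import numpy as np
import pandas as pd

# Hypothetical years, including one missing value.
years = pd.Series([1994.0, 1972.0, 2008.0, np.nan, 1957.0])

# Same arithmetic as above: round the year down to the start of its decade.
decade_float = np.floor(years / 10) * 10

# Optional: the nullable integer dtype keeps the missing value as <NA>
# and stores 1990 rather than 1990.0.
decade_int = decade_float.astype("Int64")
print(decade_int.tolist())
```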


In [44]:
# Transform the Runtime column from the string "x min" to the integer x
df["Runtime"] = df["Runtime"].str.replace(" min", "").astype(int)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df["Runtime"] = df["Runtime"].str.replace(" min", "").astype(int)
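
The `astype(int)` call works here because `Runtime` has no missing values (1000 non-null). If the column did contain gaps or oddly formatted entries, the conversion would raise. A more defensive variant, sketched below on hypothetical values, extracts the digits with a regular expression and converts them with `pd.to_numeric`.

```python
import pandas as pd

# Hypothetical runtime strings, including one missing entry.
runtime_raw = pd.Series(["142 min", "175 min", "96 min", None])

# Pull out the first run of digits, then convert; missing or
# unparseable entries end up as <NA> instead of raising an error.
minutes = pd.to_numeric(
    runtime_raw.str.extract(r"(\d+)", expand=False), errors="coerce"
).astype("Int64")
print(minutes.tolist())
```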
