## Final Project Submission

Please fill out:
* Student name: 
* Student pace: self paced / part time / full time
* Scheduled project review date/time: 
* Instructor name: 
* Blog post URL:


In [5]:
# Import the libraries used throughout the analysis
import pandas as pd 
import numpy as np  
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import pearsonr

In [6]:
# Load the data into DataFrames
df_title=pd.read_csv('zippedData/imdb.title.basics.csv.gz')
df_ratings=pd.read_csv('zippedData/imdb.title.ratings.csv.gz')
df_gross=pd.read_csv('zippedData/bom.movie_gross.csv.gz')


In [None]:
df_ratings.info()
df_ratings.head()

# One factor to consider is a film's critical reception, as films with high ratings from critics tend to perform better at the box office

In [None]:
# Filter the ratings dataframe to include only films with at least 1000 votes
filtered_ratings = df_ratings[df_ratings['numvotes'] >= 1000]

filtered_ratings 

In [None]:
# Sort the ratings dataframe by averagerating in descending order
sorted_ratings = filtered_ratings.sort_values(by='averagerating', ascending=False)
sorted_ratings.head()

# Table showing the top 5 films with the highest average ratings, with their genres

In [None]:
# Join the titles and ratings dataframes on tconst
top_5 = pd.merge(df_title, sorted_ratings, on='tconst')

In [None]:



# Merge top_5 with the ratings again on tconst
# (top_5 already has genres from df_title, so this only duplicates the rating columns with _x/_y suffixes)
df_merged = pd.merge(top_5, sorted_ratings, on='tconst')

# Display the top 5 films with strong critical acclaim
top_5[['tconst','primary_title','averagerating','genres']].sort_values(by='averagerating', ascending=False).head(5)


top_5.info()

#convert to str and Split the str of listed genres on the comma
top_5['genres'] = top_5['genres'].apply(str)

top_5['genres'] = top_5['genres'].str.split(',')

#Break down the genres column into separate rows where each movie is an instance of all genres it belongs to

top_5 = (top_5
 .set_index(['tconst','primary_title',"averagerating"])['genres']
 .apply(pd.Series)
 .stack()
 .reset_index()
 .rename(columns={0:'genres'}))

top_5.head(5)
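
The genre split above could also be written with `DataFrame.explode`, which avoids the `apply(pd.Series).stack()` chain and the extra `level_n` column it leaves behind. This is only a sketch on a small made-up table (the `toy` frame and its values are illustrative, not taken from the IMDb files). It also drops titles with no genre, rather than keeping the string `'nan'` that `apply(str)` produces for missing values.

```python
import pandas as pd

# Toy stand-in for the merged titles/ratings table (values are illustrative only)
toy = pd.DataFrame({
    "tconst": ["tt001", "tt002", "tt003"],
    "primary_title": ["Film A", "Film B", "Film C"],
    "averagerating": [8.1, 7.4, 6.9],
    "genres": ["Drama,History", "Comedy", None],
})

# Split the comma-separated string into a list, then emit one row per genre
exploded = (
    toy.assign(genres=toy["genres"].str.split(","))
       .explode("genres")
       .dropna(subset=["genres"])   # drop titles that have no genre listed
       .reset_index(drop=True)
)

print(exploded)
```

Whether to drop or keep titles without a genre is a judgment call; keeping them would need an explicit placeholder label instead.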

# Select the averagerating and genres columns
df_new = top_5[["averagerating", "genres"]].sort_values(by='averagerating', ascending=False).head(5)


df_new


# Set the figure size
plt.figure(figsize=(8, 4))

# Set the y-axis to be the genres column and the x-axis to be the averagerating column
# Set the bar thickness to 0.5 (barh takes this as `height`)
plt.barh(df_new["genres"], df_new["averagerating"], height=0.5)

# Set the x-axis range to be from 9.2 to 9.8
plt.xlim(9.2,9.8)

# Add a title and labels to the x-axis and y-axis
plt.title("Average Rating vs. Genres")
plt.ylabel("Genres")
plt.xlabel("Average Rating")

# Show the plot
plt.show()



# Rename title to primary_title
df_gross.rename(columns={'title':'primary_title'}, inplace=True)

df_gross.info()

df_gross.head()

# Replace 'nan' with null values (i.e., NaN) in the foreign_gross column
df_gross['foreign_gross'] = df_gross['foreign_gross'].replace('nan', np.nan)

# Convert the foreign_gross column to a numeric data type
df_gross['foreign_gross'] = pd.to_numeric(df_gross['foreign_gross'], errors='coerce')

# Identify rows with null values in the foreign_gross column and set them to 0
null_mask = df_gross['foreign_gross'].isnull()
df_gross.loc[null_mask, 'foreign_gross'] = 0

# Create a new total_gross column by adding the domestic_gross and foreign_gross columns
df_gross['total_gross'] = df_gross['domestic_gross'] + df_gross['foreign_gross']

# Display info & gross
df_gross.info()

df_gross.head()
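
One thing worth checking in the cleaning step above: `pd.to_numeric(..., errors='coerce')` turns any string it cannot parse into NaN, which then becomes 0. If some `foreign_gross` entries are stored with thousands separators (an assumption about the raw file, for example `"1,131.6"`), those films would silently lose their foreign gross. A minimal sketch on made-up data, stripping the commas first:

```python
import numpy as np
import pandas as pd

# Toy stand-in for df_gross (values are illustrative only)
toy_gross = pd.DataFrame({
    "primary_title": ["Film A", "Film B", "Film C"],
    "domestic_gross": [100.0, np.nan, 50.0],
    "foreign_gross": ["200", "1,131.6", np.nan],
})

# Remove thousands separators before converting, so "1,131.6" is not coerced to NaN
foreign = pd.to_numeric(
    toy_gross["foreign_gross"].str.replace(",", "", regex=False),
    errors="coerce",
)

# Treat a missing gross figure as 0 on both sides before summing
toy_gross["foreign_gross"] = foreign.fillna(0)
toy_gross["total_gross"] = toy_gross["domestic_gross"].fillna(0) + toy_gross["foreign_gross"]

print(toy_gross)
```

Filling `domestic_gross` with 0 is an extra choice made in this sketch. The cell above leaves it as NaN, so `total_gross` is NaN for those rows and they are removed later by `dropna`.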

df_title.info()
df_title.head()

#convert to str and Split the str of listed genres on the comma
df_title['genres'] = df_title['genres'].apply(str)

df_title['genres'] = df_title['genres'].str.split(',')

#Break down the genres column into separate rows where each movie is an instance of all genres it belongs to

df_title = (df_title
 .set_index(['tconst','primary_title','original_title','start_year','runtime_minutes'])['genres']
 .apply(pd.Series)
 .stack()
 .reset_index()
 .rename(columns={0:'genres'}))

df_title.head()

# Merge data frames on the tconst column
df_merged = pd.merge(df_ratings, df_title, on='tconst')


# Drop unnecessary column
df_ratings_title = df_merged.drop(columns=['original_title'])

df_ratings_title.head()

# Merge IMDb and Box Office Mojo data

movies = df_ratings_title.merge(df_gross, on="primary_title")

movies.head()
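
Merging on `primary_title` alone can pair a box office record with every IMDb title that shares the same name (remakes, for example). A possible safeguard, sketched here on made-up data, is to match on the release year as well. It assumes the Box Office Mojo table has a `year` column that lines up with `start_year`, which should be checked against `df_gross.info()`.

```python
import pandas as pd

# Toy IMDb-style table: two different films share the same title
imdb = pd.DataFrame({
    "primary_title": ["Film A", "Film A", "Film B"],
    "start_year": [2010, 2018, 2018],
    "averagerating": [6.6, 5.3, 6.9],
})

# Toy box-office table (the `year` column is an assumption about df_gross)
bom = pd.DataFrame({
    "primary_title": ["Film A", "Film B"],
    "year": [2010, 2018],
    "total_gross": [300.0, 150.0],
})

# Title-only merge: the 2010 gross is also attached to the 2018 film
by_title = imdb.merge(bom, on="primary_title")

# Title + year merge: each gross record matches only the film from that year
by_title_year = imdb.merge(
    bom,
    left_on=["primary_title", "start_year"],
    right_on=["primary_title", "year"],
)

print(len(by_title), len(by_title_year))
```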

# Top 10 movie genres by number of movies (released)

top_genres = movies.groupby('genres').size().sort_values(ascending=False).head(10)
top_genres

# Divide the top 10 genres into subsets of five for plotting
print(top_genres)
first_five = list(top_genres.index[:5])
next_five = list(top_genres.index[5:10])
final_five = list(top_genres.index[10:15])  # empty: top_genres holds only 10 genres


#Create a DataFrame for each subset
top_five_df = movies[movies['genres'].isin(first_five)]


# Set the figure size before plotting so it applies to this figure
sns.set(rc={'figure.figsize':(16,8)})

# Create a stacked histogram for the first subset using Seaborn
ax_one = sns.histplot(data = top_five_df, x = 'averagerating', hue = 'genres', multiple="stack", palette='pastel')
ax_one.set_title("Top 5 Genres", size = 16)
ax_one.set_xlabel('Average User Rating', size=13)
ax_one.set_ylabel('Count of Movies', size = 13)

top_genres.plot.barh(title='Top 10 Genres by total number of movies')
plt.xlabel('Number of movies')
plt.ylabel('Genres')
plt.show()
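
The counts above show which genres are most common, not which are rated highest. To compare genres on rating, one option is to average `averagerating` per genre and ignore genres with very few films. The sketch below uses made-up values, and the `min_movies` threshold is a hypothetical choice rather than something taken from the analysis.

```python
import pandas as pd

# Toy stand-in for the exploded `movies` table (one row per film/genre pair)
toy_movies = pd.DataFrame({
    "genres": ["Drama", "Drama", "Comedy", "Comedy", "History"],
    "averagerating": [8.0, 7.0, 6.0, 6.5, 9.0],
})

# Mean rating and number of films per genre
genre_stats = toy_movies.groupby("genres")["averagerating"].agg(["mean", "count"])

# Hypothetical cutoff: ignore genres backed by fewer than 2 films
min_movies = 2
ranked = (
    genre_stats[genre_stats["count"] >= min_movies]
    .sort_values("mean", ascending=False)
)

print(ranked)
```

Here History has the highest single rating but is excluded because only one film supports it, which is the kind of case the threshold is meant to catch.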


movies.isnull().sum()

# Percentage of NaN values per column
movies.describe()

percentage_nan = movies.isnull().mean() * 100

percentage_nan

# Runtime analysis: how runtime relates to movie gross

#Drop rows with missing or invalid data:
movies.dropna(inplace=True)

movies

#Convert runtime and gross revenue to numerical data types:
movies['runtime_minutes'] = pd.to_numeric(movies['runtime_minutes'], errors='coerce')
movies['gross'] = pd.to_numeric(movies['total_gross'], errors='coerce')

movies['runtime_minutes'].plot.hist()
plt.show()

#Scatter Plot --relationship between runtime and average rating
movies.plot.scatter(x='runtime_minutes', y='averagerating')
plt.show()
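
The scatter plot gives a visual impression only. To put a number on the link between runtime and gross, a Pearson correlation could be computed. This is a minimal sketch on made-up values using `Series.corr`; `scipy.stats.pearsonr`, already imported at the top, returns the same coefficient together with a p-value.

```python
import pandas as pd

# Toy stand-in for the `movies` table (values are illustrative only)
toy = pd.DataFrame({
    "runtime_minutes": [90, 100, 110, 120, 130],
    "total_gross": [10.0, 20.0, 15.0, 30.0, 40.0],
})

# Keep only rows where both values are present
clean = toy.dropna(subset=["runtime_minutes", "total_gross"])

# Pearson correlation coefficient between runtime and gross
r = clean["runtime_minutes"].corr(clean["total_gross"], method="pearson")

print(f"Pearson r between runtime and gross: {r:.2f}")
```

A coefficient near 0 on the real data would suggest runtime alone says little about gross. Because the exploded table repeats each film once per genre, it may be worth dropping duplicate `tconst` rows first.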


# Runtime range and quantile bands

* The lowest runtime value is 3 minutes.
* The highest runtime value is 272 minutes.
* Runtimes between 3 and 87 minutes make up the first 11.11% of the data.
* Runtimes between 87 and 93 minutes make up the next 11.11% of the data, and so on for the remaining bands.

percentiles = movies['runtime_minutes'].quantile(np.linspace(0, 1, num=10, endpoint=True))
percentiles.describe()


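# `percentiles` is indexed by quantile level (0.0 ... 1.0), so percentiles[1] and
# percentiles[0] return the maximum and minimum runtime, not the 10% cut points.
# Movies at or above the 90th percentile of runtime (the longest 10%)
top_10_pct = movies[movies['runtime_minutes'] >= movies['runtime_minutes'].quantile(0.9)]

top_10_pct



# Movies at or below the 10th percentile of runtime (the shortest 10%)
bottom_10_pct = movies[movies['runtime_minutes'] <= movies['runtime_minutes'].quantile(0.1)]
bottom_10_pct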

median_runtime = movies['runtime_minutes'].median()

median_runtime

mean_runtime = movies['runtime_minutes'].mean()
mean_runtime
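
The median and mean describe what a typical runtime is, but not how gross changes with runtime. One way to check that is to bin films by runtime and compare the median gross per bin. The sketch below uses `pd.qcut` on made-up values; the three bins and their labels are hypothetical choices.

```python
import pandas as pd

# Toy stand-in for the `movies` table (values are illustrative only)
toy = pd.DataFrame({
    "runtime_minutes": [80, 85, 90, 95, 100, 105, 110, 115, 120, 125, 130, 135],
    "total_gross": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0, 12.0],
})

# Split runtimes into three equally sized groups
toy["runtime_bin"] = pd.qcut(
    toy["runtime_minutes"], q=3, labels=["short", "medium", "long"]
)

# Median gross within each runtime group
summary = toy.groupby("runtime_bin", observed=True)["total_gross"].median()

print(summary)
```

On the real data, similar medians across bins would suggest that runtime is not a strong driver of gross.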



# Conclusion

* The genres that have received the highest ratings are Drama, Comedy, Biography and History. Microsoft should invest in these genres, as they are rated higher than the other genres.

* Based on the top 10 genres by total number of movies produced, Microsoft should also consider Action, Romance and Thriller, as these genres dominate the number of movies produced.

* With a median runtime of 105.0 minutes and a mean runtime of 107.29 minutes, we can conclude the following about the runtime data:
  * Half of the movies in the data have a runtime of 105.0 minutes or less, and the other half have a runtime of 105.0 minutes or more.
  * The average runtime of a movie in the data is 107.29 minutes.
  * Using these values as a reference point, we can make tentative recommendations about runtime. Movies with a runtime close to 105.0 minutes (e.g., 100-110 minutes) might tend to be more successful, as they are close to the "typical" runtime of a movie in the data.
  * Movies with a runtime significantly above or below 105.0 minutes (e.g., less than 90 minutes or more than 120 minutes) might be less successful, as they fall outside the "typical" range of runtimes in the data.
  * These runtime recommendations rest on what is typical in the data; the median and mean alone do not show that typical runtimes earn more.

