### Introduction

Our company is entering the film industry by launching a new movie studio but lacks experience in film production. We are tasked with analyzing current box office trends to identify successful film types. The goal is to translate these insights into actionable recommendations that will guide the studio head in creating films that align with market preferences and drive success.

### Problem Statement
The company has seen the major players producing original video content and wants to enter the market. It has decided to launch a new movie studio but has no experience in making movies. Our task is to explore which types of films are currently performing best at the box office, and then translate those findings into actionable insights that the head of the new studio can use to decide which films to create.

### Objectives
1. Evaluate the performance of various film genres in both domestic and international markets.
2. Identify the movie studios that consistently produce high-performing films.
3. Investigate the correlation between production budgets and both domestic and international revenues.
4. Examine trends in film production over time.
5. Assess the impact of a film's popularity score and average rating on its overall performance.

### Data
Datasets used were obtained from:
* [Box Office Mojo](https://www.boxofficemojo.com/)
* [IMDB](https://www.imdb.com/)
* [Rotten Tomatoes](https://www.rottentomatoes.com/)
* [TheMovieDB](https://www.themoviedb.org/)
* [The Numbers](https://www.the-numbers.com/)


### Exploratory Data Analysis

In [1]:
# import the necessary packages
import pandas as pd
import numpy as np
import sqlite3
import seaborn as sns
import matplotlib.pyplot as plt
import scipy.stats as stats
import statsmodels.api as sm

#### Loading the CSV and TSV Files

In [102]:
# let's import the necessary files
movie_gross_df = pd.read_csv('data/bom.movie_gross.csv.gz')
movie_info_df = pd.read_csv('data/rt.movie_info.tsv.gz',delimiter='\t')
movie_reviews_df = pd.read_csv('data/rt.reviews.tsv.gz',delimiter='\t',encoding='latin-1')
movies_df = pd.read_csv('data/tmdb.movies.csv.gz',index_col=0)
movie_budget_df = pd.read_csv('data/tn.movie_budgets.csv.gz')


# connecting to the database
conn = sqlite3.connect('data/im.db')
cursor = conn.cursor()
cursor.execute("""SELECT name FROM sqlite_master WHERE type = 'table';""")
table_name = cursor.fetchall()
table_name

[('movie_basics',),
 ('directors',),
 ('known_for',),
 ('movie_akas',),
 ('movie_ratings',),
 ('persons',),
 ('principals',),
 ('writers',)]
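
Once the table names are known, any table can be pulled into pandas with `pd.read_sql`. The sketch below is a minimal illustration only: it uses an in-memory database with a toy `movie_basics` table as a stand-in for `data/im.db`, and the column names are assumptions, since the real schema is not shown here.

```python
import sqlite3
import pandas as pd

# Stand-in for data/im.db: an in-memory database with a toy movie_basics table.
# The column names are assumptions made for this illustration.
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("CREATE TABLE movie_basics (movie_id TEXT, primary_title TEXT)")
demo_conn.executemany(
    "INSERT INTO movie_basics VALUES (?, ?)",
    [("tt001", "Example A"), ("tt002", "Example B")],
)
demo_conn.commit()

# List the tables, then load a preview of one of them into a DataFrame.
tables = pd.read_sql("SELECT name FROM sqlite_master WHERE type = 'table';", demo_conn)
basics_preview = pd.read_sql("SELECT * FROM movie_basics LIMIT 5;", demo_conn)
demo_conn.close()
```

Pointing the same two queries at the `conn` object opened above should work the same way, assuming the table of interest exists in the database.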

In [103]:
# check the first few rows 
movie_gross_df.head()

| | title | studio | domestic_gross | foreign_gross | year |
|---|---|---|---|---|---|
| 0 | Toy Story 3 | BV | 415000000.0 | 652000000 | 2010 |
| 1 | Alice in Wonderland (2010) | BV | 334200000.0 | 691300000 | 2010 |
| 2 | Harry Potter and the Deathly Hallows Part 1 | WB | 296000000.0 | 664300000 | 2010 |
| 3 | Inception | WB | 292600000.0 | 535700000 | 2010 |
| 4 | Shrek Forever After | P/DW | 238700000.0 | 513900000 | 2010 |


In [104]:
movie_gross_df.shape

(3387, 5)

In [105]:
movie_gross_df.isna().sum().sort_values(ascending = False)

foreign_gross     1350
domestic_gross      28
studio               5
title                0
year                 0
dtype: int64
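
Raw counts are easier to judge as a share of rows: with 3387 rows, the 1350 missing `foreign_gross` values amount to roughly 40% of the column, while `domestic_gross` and `studio` are each missing in under 1% of rows. As a sketch, `isna().mean()` gives that fraction directly. The frame below is made up purely for illustration and is not the Box Office Mojo data.

```python
import numpy as np
import pandas as pd

# Toy frame with invented values, used only to show the calculation.
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "domestic_gross": [1.0, np.nan, 3.0, 4.0],
    "foreign_gross": ["2", None, None, "5"],
})

# isna() marks missing cells; the column mean of that boolean mask is the
# fraction missing, which is scaled to a percentage here.
missing_pct = (toy.isna().mean() * 100).sort_values(ascending=False)
```

Applying the same expression to `movie_gross_df` would give the percentages for the real columns.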

In [106]:
movie_gross_df1 = movie_gross_df.copy(deep = True)

In [None]:
# let's drop the rows with a missing studio or domestic_gross
movie_gross_df1.dropna(subset = ['studio', 'domestic_gross'], inplace = True)

# convert the values in foreign_gross to floats (strip the thousands separators first)
movie_gross_df1['foreign_gross'] = movie_gross_df1['foreign_gross'].str.replace(',','')
movie_gross_df1['foreign_gross'] = pd.to_numeric(movie_gross_df1['foreign_gross'])

# fill the remaining missing values with the column median; assign the result back,
# since fillna(inplace = True) on a column selection may not update the DataFrame
movie_gross_df1['foreign_gross'] = movie_gross_df1['foreign_gross'].fillna(movie_gross_df1['foreign_gross'].median())

movie_gross_df1.isna().sum().sort_values(ascending = False)
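
These cleaning steps can be checked on a small toy frame before running them on the full dataset. The sketch below wraps them in a hypothetical helper, `clean_gross`, which is not part of the original notebook, and uses invented values. It returns a new frame and assigns the filled column back explicitly.

```python
import pandas as pd

def clean_gross(df):
    """Hypothetical helper: drop rows lacking studio or domestic_gross, then
    convert foreign_gross to float and fill its gaps with the median."""
    out = df.dropna(subset=["studio", "domestic_gross"]).copy()
    # Strip thousands separators so strings such as "1,000.5" parse as numbers.
    stripped = out["foreign_gross"].str.replace(",", "", regex=False)
    out["foreign_gross"] = pd.to_numeric(stripped)
    # The median is computed on the non-missing values only.
    out["foreign_gross"] = out["foreign_gross"].fillna(out["foreign_gross"].median())
    return out

# Toy data with invented values.
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "studio": ["BV", None, "WB", "WB"],
    "domestic_gross": [10.0, 20.0, 30.0, 40.0],
    "foreign_gross": ["1,000.5", "7", None, "3"],
})
cleaned = clean_gross(toy)
```

Here row `B` is dropped because its studio is missing, and the gap in row `C` is filled with the median of the two remaining values.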