## OVERVIEW

Microsoft is entering the film industry by establishing a new movie studio. Because the company has limited experience creating movies, it needs to understand which types of films are currently succeeding at the box office. Analyzing prevailing trends and success factors will let Microsoft make informed decisions about which films to produce and align its strategy with a highly competitive, evolving market.

## PROBLEM STATEMENT
Microsoft's new movie studio lacks the necessary knowledge and understanding of the film industry to make informed decisions about the types of films to create. This knowledge gap hinders their ability to compete with established companies in producing successful and engaging original content. There is a need for actionable insights into current box office trends and success factors to guide Microsoft's decision-making process and ensure their success in the highly competitive film market.

## MAIN OBJECTIVE
The main objective is to provide Microsoft's new movie studio with actionable insights into current box office trends and success factors in the film industry, enabling them to make informed decisions about the types of films to create and compete effectively with other established companies producing original video content.

## SPECIFIC OBJECTIVES
1. Identify the most successful film genres in the current market to inform Microsoft's new movie studio on potential areas of focus for film production.
2. Determine which studios produce the highest-grossing movies.
3. Find the average runtime of movies and recommend a runtime for movies produced by the studio.
4. Determine the relationship between a movie's budget and the gross income it generates.
5. Determine how the original language of a film affects its popularity.

OBJECTIVE 1: Identify the most successful film genres in the current market to inform Microsoft's new movie studio on potential areas of focus for film production.
To find this out, I compared the domestic gross, the foreign gross and the calculated worldwide gross of the top 10 grossing movies in each category (domestic, foreign and worldwide). This helps identify which genres generate the highest income and where to focus marketing.
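The comparison described above can be sketched as follows. This is a minimal illustration, not the notebook's actual code: the helper name `top_grossing` and the `worldwide_gross` column are assumptions, the five rows are copied from the dataset preview, and both gross columns are assumed to be numeric already.

```python
import pandas as pd

# Five rows copied from the preview of the Box Office Mojo dataset.
sample = pd.DataFrame({
    "title": [
        "Toy Story 3",
        "Alice in Wonderland (2010)",
        "Harry Potter and the Deathly Hallows Part 1",
        "Inception",
        "Shrek Forever After",
    ],
    "studio": ["BV", "BV", "WB", "WB", "P/DW"],
    "domestic_gross": [415000000.0, 334200000.0, 296000000.0, 292600000.0, 238700000.0],
    "foreign_gross": [652000000.0, 691300000.0, 664300000.0, 535700000.0, 513900000.0],
    "year": [2010, 2010, 2010, 2010, 2010],
})


def top_grossing(df, n=10):
    """Return the top-n rows by domestic, foreign and worldwide gross.

    Hypothetical helper: worldwide gross is assumed to be the sum of the
    domestic and foreign figures.
    """
    ranked = df.assign(worldwide_gross=df["domestic_gross"] + df["foreign_gross"])
    columns = ("domestic_gross", "foreign_gross", "worldwide_gross")
    return {col: ranked.nlargest(n, col) for col in columns}


top = top_grossing(sample, n=3)
print(top["worldwide_gross"][["title", "worldwide_gross"]])
```

Returning one table per category keeps the three rankings separate, so a film that leads domestically but not abroad is easy to spot.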

In [5]:
#imports
import pandas as pd
import numpy as np
import sqlite3
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline


Loading the 'bom.movie_gross.csv.gz' dataset for the gross analysis

In [7]:
bom_movie_df = pd.read_csv('bom.movie_gross.csv.gz')
bom_movie_df

| | title | studio | domestic_gross | foreign_gross | year |
|---|---|---|---|---|---|
| 0 | Toy Story 3 | BV | 415000000.0 | 652000000 | 2010 |
| 1 | Alice in Wonderland (2010) | BV | 334200000.0 | 691300000 | 2010 |
| 2 | Harry Potter and the Deathly Hallows Part 1 | WB | 296000000.0 | 664300000 | 2010 |
| 3 | Inception | WB | 292600000.0 | 535700000 | 2010 |
| 4 | Shrek Forever After | P/DW | 238700000.0 | 513900000 | 2010 |
| ... | ... | ... | ... | ... | ... |
| 3382 | The Quake | Magn. | 6200.0 | NaN | 2018 |
| 3383 | Edward II (2018 re-release) | FM | 4800.0 | NaN | 2018 |
| 3384 | El Pacto | Sony | 2500.0 | NaN | 2018 |
| 3385 | The Swan | Synergetic | 2400.0 | NaN | 2018 |


Understanding the dataset better

In [8]:
bom_movie_df.shape

(3387, 5)

In [9]:
bom_movie_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3387 entries, 0 to 3386
Data columns (total 5 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   title           3387 non-null   object 
 1   studio          3382 non-null   object 
 2   domestic_gross  3359 non-null   float64
 3   foreign_gross   2037 non-null   object 
 4   year            3387 non-null   int64  
dtypes: float64(1), int64(1), object(3)
memory usage: 132.4+ KB
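The `.info()` output lists `foreign_gross` as `object` rather than a numeric type, so it cannot be summed or ranked until it is converted. A minimal sketch of one way to do that follows. The helper name `to_numeric_gross` is hypothetical, and the comma-separated value in the example is an assumed illustration of how text might end up in the column, not a row taken from the dataset.

```python
import pandas as pd


def to_numeric_gross(series):
    """Convert a gross column stored as text to float.

    Thousands separators are removed first; anything that still cannot be
    parsed becomes NaN instead of raising an error.
    """
    cleaned = series.str.replace(",", "", regex=False)
    return pd.to_numeric(cleaned, errors="coerce")


# The first two values appear in the dataset preview; the None and the
# comma-formatted value are assumed examples.
raw = pd.Series(["652000000", "691300000", None, "1,500,000"], name="foreign_gross")
converted = to_numeric_gross(raw)
print(converted)
```

Applied to the real frame, this would look like `bom_movie_df['foreign_gross'] = to_numeric_gross(bom_movie_df['foreign_gross'])`. Using `errors="coerce"` trades strictness for convenience, so it is worth comparing null counts before and after the conversion.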


## DATA CLEANING

### CHECKING AND DELETING DUPLICATES

In [11]:
bom_movie_df.duplicated().sum()

0

There are no duplicate rows in the dataset.

### DROPPING NULL VALUES IN OUR DATASET

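The `.info()` output shows null values in the studio, domestic_gross and foreign_gross columns. The gaps in studio (5) and domestic_gross (28) are minimal, but foreign_gross is missing 1,350 of 3,387 values, roughly 40% of the rows.
Rows with a null in any of these columns are dropped because the worldwide gross needs both the domestic and the foreign figure. This leaves at most 2,037 rows, so the cleaned sample only covers films with a reported foreign gross.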
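Before dropping rows, it can help to quantify how much each column is missing. The sketch below is one way to do that, assuming a hypothetical helper named `null_summary`. The sample rows are taken from the dataset preview, with missing foreign figures written as `np.nan`.

```python
import numpy as np
import pandas as pd


def null_summary(df):
    """Count missing values per column and express them as a share of all rows."""
    missing = df.isna().sum()
    return pd.DataFrame({"missing": missing, "share": missing / len(df)})


# Four rows from the dataset preview; two have no foreign gross.
sample = pd.DataFrame({
    "title": ["Toy Story 3", "Inception", "The Quake", "El Pacto"],
    "studio": ["BV", "WB", "Magn.", "Sony"],
    "domestic_gross": [415000000.0, 292600000.0, 6200.0, 2500.0],
    "foreign_gross": ["652000000", "535700000", np.nan, np.nan],
})

summary = null_summary(sample)
print(summary)
```

Running `null_summary(bom_movie_df)` on the full frame should reproduce the gaps implied by `.info()`, for example 3387 - 2037 = 1350 missing `foreign_gross` values.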

In [12]:
bom_movie_df.dropna(subset=['studio','domestic_gross','foreign_gross'], inplace=True)
bom_movie_df

| | title | studio | domestic_gross | foreign_gross | year |
|---|---|---|---|---|---|
| 0 | Toy Story 3 | BV | 415000000.0 | 652000000 | 2010 |
| 1 | Alice in Wonderland (2010) | BV | 334200000.0 | 691300000 | 2010 |
| 2 | Harry Potter and the Deathly Hallows Part 1 | WB | 296000000.0 | 664300000 | 2010 |
| 3 | Inception | WB | 292600000.0 | 535700000 | 2010 |
| 4 | Shrek Forever After | P/DW | 238700000.0 | 513900000 | 2010 |
| ... | ... | ... | ... | ... | ... |
| 3275 | I Still See You | LGF | 1400.0 | 1500000 | 2018 |
| 3286 | The Catcher Was a Spy | IFC | 725000.0 | 229000 | 2018 |
| 3309 | Time Freak | Grindstone | 10000.0 | 256000 | 2018 |
| 3342 | Reign of Judges: Title of Liberty - Concept Short | Darin Southa | 93200.0 | 5200 | 2018 |
