![example](images/director_shot.jpeg)

# MICROSOFT MOVIE STUDIO MARKET RESEARCH

**Authors:** Winfred Kabuya
***

## Overview

Microsoft is planning to launch its own movie studio and wants to select the most profitable genre for its first film. The goal of this project is to help Microsoft build a successful studio by determining which types of films currently perform best at the box office. To do so, we analyze movie production data from several datasets to determine which movies have earned the most revenue relative to their production costs. Based on this analysis, Microsoft can choose a genre with the potential to generate the highest profits for its new studio.

## Business Problem

Microsoft lacks experience in making movies and needs to understand which types of films resonate with audiences and are the most profitable. To maximize returns, Microsoft should identify the top-grossing movies and the types of movies that yield the highest profits once revenue is compared against production costs and other production expenses. We use exploratory data analysis to generate insights, identify patterns, and anticipate market trends. The results are organized by genre, target audience, and other relevant factors. The resulting recommendations support data-driven decision-making, since it is essential to understand the total market and profit potential before investing a significant amount of money.

## Data Understanding

The data for this project come from multiple sources. We plan to use box office data to determine the most successful movie genres and to examine trends over time, combining quantitative and qualitative data to answer our research questions.

The data represent box office performance, audience preferences, and market trends in the movie industry. The sample for box office data will be all movies released in the past decade, while the sample for audience preference data will be a representative sample of moviegoers. The variables included will depend on the data source, but may include genre, budget, production company, release date, audience demographics, and preference data such as preferred genres, actors, and directors.

The target variable will vary depending on the data source and research question. For box office data, the target variable will be revenue or profitability. For audience preference data, the target variable will be the preference for specific genres, actors, or directors.
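
As a rough illustration of the profitability target described above, the sketch below derives profit and return on investment (ROI) from a budget and a gross figure. The column names `production_budget` and `worldwide_gross` follow the budget file used in this notebook; the toy values and the derived `profit` and `roi` columns are assumptions for illustration, not results.

```python
import pandas as pd

# Toy figures (made up for illustration), already stored as numbers.
toy = pd.DataFrame({
    "movie": ["Film A", "Film B", "Film C"],
    "production_budget": [100_000_000, 20_000_000, 50_000_000],
    "worldwide_gross": [250_000_000, 80_000_000, 40_000_000],
})

# Profit: revenue earned minus production cost.
toy["profit"] = toy["worldwide_gross"] - toy["production_budget"]

# ROI: profit per dollar of budget, so small and large films can be compared.
toy["roi"] = toy["profit"] / toy["production_budget"]

print(toy.sort_values("roi", ascending=False))
```

ROI is shown alongside raw profit because a low-budget film can return more per dollar invested than a blockbuster with a larger absolute profit.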

The properties of the variables we intend to use will depend on the research question and data source. We will need to examine the distribution of variables, identify outliers, and determine if any data cleaning or transformation is necessary. We will also need to consider the quality and completeness of the data and ensure that the data sources are reliable and trustworthy.
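
One common way to examine a distribution and flag outliers is the interquartile range (IQR) rule. The sketch below is a minimal version on made-up numbers; the helper name `iqr_outliers`, the multiplier `k=1.5`, and the sample values are assumptions, and a different rule may suit the real data better.

```python
import pandas as pd

def iqr_outliers(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return a boolean mask marking values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)

# Made-up gross figures in millions; one value is far larger than the rest.
gross = pd.Series([10, 12, 11, 13, 12, 14, 400], name="gross_millions")

print(gross.describe())      # summary of the distribution
mask = iqr_outliers(gross)
print(gross[mask])           # rows flagged as outliers
```

For box office data, very large values are often genuine blockbusters rather than errors, so flagged rows are candidates for inspection rather than automatic removal.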

In [25]:
# Import standard packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import sqlite3
import zipfile
%matplotlib inline

In [26]:
# Extract the zipped database into the same folder as the other data files
with zipfile.ZipFile('zippedData/im.db.zip') as my_zip:
    my_zip.extractall(path='zippedData')

In [27]:
# Connect to the extracted SQLite database
con = sqlite3.connect('zippedData/im.db')


In [28]:
# List the tables stored in the database
pd.read_sql("""
SELECT *
FROM sqlite_schema
WHERE type='table'
""", con)


| | type | name | tbl_name | rootpage | sql |
|---|---|---|---|---|---|
| 0 | table | movie_basics | movie_basics | 2 | `CREATE TABLE "movie_basics" (\n"movie_id" TEXT...` |
| 1 | table | directors | directors | 3 | `CREATE TABLE "directors" (\n"movie_id" TEXT,\n...` |
| 2 | table | known_for | known_for | 4 | `CREATE TABLE "known_for" (\n"person_id" TEXT,\...` |
| 3 | table | movie_akas | movie_akas | 5 | `CREATE TABLE "movie_akas" (\n"movie_id" TEXT,\...` |
| 4 | table | movie_ratings | movie_ratings | 6 | `CREATE TABLE "movie_ratings" (\n"movie_id" TEX...` |
| 5 | table | persons | persons | 7 | `CREATE TABLE "persons" (\n"person_id" TEXT,\n ...` |
| 6 | table | principals | principals | 8 | `CREATE TABLE "principals" (\n"movie_id" TEXT,\...` |
| 7 | table | writers | writers | 9 | `CREATE TABLE "writers" (\n"movie_id" TEXT,\n ...` |
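
Several of the tables listed above share a `movie_id` column, so they can be combined with a SQL join. The sketch below uses a small in-memory database as a stand-in so that it runs without `im.db`. Only `movie_id` is visible in the truncated schema output; the columns `primary_title`, `genres`, and `averagerating`, and all inserted rows, are assumptions for illustration.

```python
import sqlite3

# In-memory stand-in so the example runs without im.db.
# Only movie_id is confirmed by the schema listing; other columns are assumed.
demo = sqlite3.connect(":memory:")
demo.executescript("""
CREATE TABLE movie_basics (movie_id TEXT, primary_title TEXT, genres TEXT);
CREATE TABLE movie_ratings (movie_id TEXT, averagerating REAL);
INSERT INTO movie_basics VALUES
    ('tt001', 'Film A', 'Drama'),
    ('tt002', 'Film B', 'Comedy'),
    ('tt003', 'Film C', 'Drama');
INSERT INTO movie_ratings VALUES ('tt001', 7.5), ('tt002', 6.0);
""")

# An inner join keeps only the movies that appear in both tables.
rows = demo.execute("""
SELECT b.primary_title, b.genres, r.averagerating
FROM movie_basics AS b
JOIN movie_ratings AS r USING (movie_id)
ORDER BY r.averagerating DESC
""").fetchall()

print(rows)
demo.close()
```

Against the real database, a query of the same shape could be passed to `pd.read_sql(query, con)` once the actual column names have been checked.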


In [29]:
# Load the movie budgets data from the compressed CSV file
budget_movies = pd.read_csv("zippedData/tn.movie_budgets.csv.gz", index_col=0)
budget_movies.head()

| id | release_date | movie | production_budget | domestic_gross | worldwide_gross |
|---|---|---|---|---|---|
| 1 | Dec 18, 2009 | Avatar | $425,000,000 | $760,507,625 | $2,776,345,279 |
| 2 | May 20, 2011 | Pirates of the Caribbean: On Stranger Tides | $410,600,000 | $241,063,875 | $1,045,663,875 |
| 3 | Jun 7, 2019 | Dark Phoenix | $350,000,000 | $42,762,350 | $149,762,350 |
| 4 | May 1, 2015 | Avengers: Age of Ultron | $330,600,000 | $459,005,868 | $1,403,013,963 |
| 5 | Dec 15, 2017 | Star Wars Ep. VIII: The Last Jedi | $317,000,000 | $620,181,382 | $1,316,721,747 |
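
The money columns in the preview above are formatted strings such as `$425,000,000`, and the dates are text such as `Dec 18, 2009`, so both need converting before any arithmetic or time-based analysis. The sketch below shows one possible conversion using values copied from the preview; the helper name `parse_dollars` is hypothetical, and the month abbreviation parsing assumes an English locale.

```python
from datetime import datetime

def parse_dollars(text: str) -> int:
    """Convert a string such as '$425,000,000' to the integer 425000000."""
    return int(text.replace("$", "").replace(",", ""))

# Values copied from the first row of the preview.
sample = ["$425,000,000", "$760,507,625", "$2,776,345,279"]
amounts = [parse_dollars(s) for s in sample]
print(amounts)

# Dates in the preview follow the 'Mon D, YYYY' pattern.
release = datetime.strptime("Jun 7, 2019", "%b %d, %Y")
print(release.date())
```

On the full dataframe, the same helper could be applied column by column, for example with `budget_movies["production_budget"].map(parse_dollars)`.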


In [38]:
# Count missing values per column in the movie gross data
df = pd.read_csv('zippedData/bom.movie_gross.csv.gz', index_col=0)
df.isna().sum()

studio               5
domestic_gross      28
foreign_gross     1350
year                 0
dtype: int64
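
The counts above suggest two different situations: `studio` and `domestic_gross` are missing in only a few rows, while `foreign_gross` is missing in many. The sketch below shows one possible treatment on made-up rows, and it is an assumption rather than the approach taken in this notebook: drop the few incomplete rows, and flag missing foreign figures instead of filling them with zero.

```python
import numpy as np
import pandas as pd

# Made-up rows that mimic the column layout of the movie gross data.
toy_gross = pd.DataFrame(
    {
        "studio": ["BV", None, "WB", "P/DW"],
        "domestic_gross": [415e6, 3e6, np.nan, 238.7e6],
        "foreign_gross": [652e6, np.nan, 664.3e6, np.nan],
        "year": [2010, 2010, 2010, 2010],
    },
    index=pd.Index(["Film A", "Film B", "Film C", "Film D"], name="title"),
)

# Few rows lack studio or domestic_gross, so dropping them loses little data.
cleaned = toy_gross.dropna(subset=["studio", "domestic_gross"]).copy()

# Many rows lack foreign_gross; flag them rather than dropping or zero-filling,
# so later steps can decide how to treat releases without a foreign figure.
cleaned["has_foreign_gross"] = cleaned["foreign_gross"].notna()

print(cleaned)
```

Filling `foreign_gross` with zero would treat "unknown" as "earned nothing abroad", which could understate worldwide revenue, so the flag keeps that decision explicit.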

In [33]:
# Load the movie gross data and preview the first rows
movie_gross= pd.read_csv("zippedData/bom.movie_gross.csv.gz", index_col=0)
movie_gross.head()

| title | studio | domestic_gross | foreign_gross | year |
|---|---|---|---|---|
| Toy Story 3 | BV | 415000000.0 | 652000000 | 2010 |
| Alice in Wonderland (2010) | BV | 334200000.0 | 691300000 | 2010 |
| Harry Potter and the Deathly Hallows Part 1 | WB | 296000000.0 | 664300000 | 2010 |
| Inception | WB | 292600000.0 | 535700000 | 2010 |
| Shrek Forever After | P/DW | 238700000.0 | 513900000 | 2010 |


## Data Preparation

Describe and justify the process for preparing the data for analysis.

***
Questions to consider:
* Were there variables you dropped or created?
* How did you address missing values or outliers?
* Why are these choices appropriate given the data and the business problem?
***

In [34]:
# Count how often each original_title appears in the TMDB movies data
df = pd.read_csv('zippedData/tmdb.movies.csv.gz', index_col=0)
df['original_title'].value_counts()

Eden                                   7
Home                                   6
Aftermath                              5
Lucky                                  5
Legend                                 5
                                      ..
Tomorrow                               1
Elizabeth Smart: Questions Answered    1
Legend of the Naked Ghost              1
9 Meter                                1
Home Makeover                          1
Name: original_title, Length: 24835, dtype: int64
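
The counts above show that some titles appear several times (for example, `Eden` appears 7 times), so a title alone is not a reliable unique key. The sketch below separates exact duplicate rows from different releases that share a title. The `release_date` column and all rows are assumptions for illustration; the real columns should be checked first.

```python
import pandas as pd

# Made-up rows: one exact duplicate and one same-title, different-date release.
toy_movies = pd.DataFrame({
    "original_title": ["Eden", "Eden", "Eden", "Home"],
    "release_date": ["2012-01-01", "2012-01-01", "2015-06-01", "2015-03-27"],
})

# Exact duplicate rows carry no extra information and are safe to drop.
deduped = toy_movies.drop_duplicates()

# Titles that still repeat are distinct releases; count them per title.
repeat_counts = deduped["original_title"].value_counts()
print(repeat_counts[repeat_counts > 1])
```

When merging with other datasets, joining on title together with a release year or date is one way to avoid matching unrelated films that share a name.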

In [None]:
#findi

## Data Modeling
Describe and justify the process for analyzing or modeling the data.

***
Questions to consider:
* How did you analyze or model the data?
* How did you iterate on your initial approach to make it better?
* Why are these choices appropriate given the data and the business problem?
***

In [32]:
# Here you run your code to model the data


## Evaluation
Evaluate how well your work solves the stated business problem.

***
Questions to consider:
* How do you interpret the results?
* How well does your model fit your data? How much better is this than your baseline model?
* How confident are you that your results would generalize beyond the data you have?
* How confident are you that this model would benefit the business if put into use?
***

## Conclusions
Provide your conclusions about the work you've done, including any limitations or next steps.

***
Questions to consider:
* What would you recommend the business do as a result of this work?
* What are some reasons why your analysis might not fully solve the business problem?
* What else could you do in the future to improve this project?
***