<a href="https://colab.research.google.com/github/oigwe-frx/movie-database-analysis/blob/oi%2Fcontext-statement/Movie_Analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


-------------------------------------
# **Project 1: Movie Data Analysis**
-------------------------------------

--------------------
## **Context**
--------------------

In an era marked by the convergence of technology and entertainment, the film industry is a pillar of global cultural dissemination and economic vitality. Understanding the intricate factors influencing a movie's commercial success is paramount within this dynamic landscape. Leveraging data from the Internet Movie Database (IMDb), this project aims to delve into the multifaceted realm of box office hits, dissecting the variables that potentially shape their triumph or demise.

The cinematic ecosystem is a tapestry woven with diverse elements, ranging from star power and production budget to genre and critical reception. Against this backdrop, the analysis endeavors to discern patterns and correlations within the data, unraveling the enigmatic interplay between various attributes and a movie's financial performance.

As the digital era reshapes audience preferences and consumption patterns, traditional metrics of success undergo a metamorphosis. Thus, this investigation seeks to illuminate established paradigms and explore emerging trends and disruptions catalyzed by technological advancements and shifting audience dynamics.

By examining the data carefully with exploratory and statistical methods, this project aims to offer insights that go beyond conventional wisdom. The goal is to give film industry stakeholders actionable information for navigating a changing landscape and improving their chances of making films that resonate with audiences and succeed at the box office.

------------------
## **Objective**
------------------

This data analysis and visualization project investigates the attributes that most influence the commercial success of movies. Using data sourced from IMDb, it looks for patterns and correlations between various factors and a movie's box office performance. The resulting analysis and visualizations are intended to give film industry stakeholders actionable insights for making informed decisions about the financial viability of their films.

-----------------------------
## **Key Questions**
-----------------------------


* What are the most popular movies?
    * Determine the top-rated or most popular titles based on IMDb ratings or user reviews.
      
* What are the trends in movie genres over time?
    * Analyze how the popularity of different genres has evolved over the years.
      
* Which actors have appeared in the most movies?
    * Identify prolific actors and actresses in the IMDb database.
      
* What are the highest-grossing movies of all time?
    * Investigate box office revenue data to find the most financially successful films.
      
* Are there any correlations between movie budgets and box office performance?
    * Explore whether higher budgets lead to higher box office earnings.
      
* What is the distribution of movie ratings on IMDb?
    * Analyze the distribution of IMDb ratings to understand audience preferences.
      
* Are there any seasonal trends in movie releases?
    * Investigate whether certain genres or movies tend to be released during specific seasons or holidays.
      
* What is the average movie runtime?
    * Calculate the average duration of movies and see if there are any trends over time.
      
* Who are the top-rated directors?
    * Identify directors with the highest-rated movies in the database.
      
* Are there any patterns in user reviews or sentiment analysis?
    * Analyze user reviews to identify patterns in sentiment and opinions about movies or TV shows.
      
* Are there any geographical trends in movie preferences?
    * Explore whether movie preferences vary by region or country.
      
* How has the film industry evolved over the years?
    * Look at historical data to understand film production, technology, and distribution changes.
      
* What are the most common keywords or tags associated with movies?
    * Identify keywords or tags that are frequently used to describe movies.
      
* Can movie ratings or box office success be predicted from certain features?
    * Build predictive models to see if you can forecast movie ratings or financial performance.
      
* What are the most influential factors for IMDb ratings?
    * Analyze which factors, such as genre, director, or cast, impact ratings most.

* How do user ratings compare to critic ratings?
    * Compare IMDb user ratings with ratings from professional critics to assess the level of agreement or disagreement.

* Are there any outliers or anomalies in the data?
    * Look for unusual or unexpected patterns in the data that may require further investigation.
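
Several of the questions above reduce to a few pandas operations. The sketch below is a minimal illustration on a tiny made-up DataFrame. The column names (`title`, `budget`, `gross`, `release_date`) and all values are placeholders, not the actual IMDb dataset, and would need to be swapped for the real columns once the data is loaded.

```python
import pandas as pd

# Hypothetical toy data: column names and values are placeholders only
toy = pd.DataFrame({
    "title": ["Film A", "Film B", "Film C", "Film D"],
    "budget": [10.0, 50.0, 120.0, 200.0],   # in millions (made up)
    "gross": [30.0, 40.0, 300.0, 650.0],    # in millions (made up)
    "release_date": ["2010-06-18", "2012-12-21", "2015-07-03", "2019-12-20"],
})

# Highest-grossing titles: keep the n rows with the largest revenue
top_grossing = toy.nlargest(2, "gross")[["title", "gross"]]

# Budget vs. box office: Pearson correlation between the two columns
budget_gross_corr = toy["budget"].corr(toy["gross"])

# Seasonal trends: parse the date and count releases per calendar month
toy["release_month"] = pd.to_datetime(toy["release_date"]).dt.month
releases_per_month = toy["release_month"].value_counts().sort_index()

print(top_grossing)
print(round(budget_gross_corr, 3))
print(releases_per_month)
```

A positive correlation here would only show that budget and revenue move together. It would not by itself establish that higher budgets cause higher earnings.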

------------------------------------
## **Dataset Description**
------------------------------------

...

##  **Importing the necessary libraries and overview of the dataset**

In [None]:
# Library to suppress warnings
import warnings
warnings.filterwarnings('ignore')

In [None]:
# Libraries to help with reading and manipulating data
import pandas as pd
import numpy as np

# Libraries to help with data visualization
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

# Library to extract datetime features
import datetime as dt

In [None]:
from google.colab import drive
drive.mount('/content/drive')

### **Loading the dataset**

In [None]:
data = pd.read_csv('/content/drive/...')

In [None]:
# Copying data to another variable to avoid any changes to the original data
df = data.copy()

### **View the first 5 rows of the dataset**

In [None]:
# Looking at head (the first 5 observations)
df.head()

**Observations:**

* ...

### **View the last 5 rows of the dataset**

In [None]:
# Looking at tail (the last 5 observations)
df.tail()

**Observations:**

* ...

### **Checking the shape of the dataset**

In [None]:
df.shape

### **Checking the info()**

In [None]:
df.info()

**Observations:**

* ...

### **Summary of the data**

In [None]:
df.describe().T

**By default, the describe() function shows the summary of numeric variables only. Let's check the summary of non-numeric variables.**  

In [None]:
df.describe(exclude = 'number').T
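
To make the difference between the two `describe()` calls concrete, here is a small self-contained sketch on a made-up frame. The `genre` and `rating` columns are placeholders, not columns confirmed to exist in the project dataset.

```python
import pandas as pd

# Hypothetical frame with one text column and one numeric column
demo = pd.DataFrame({
    "genre": ["Drama", "Comedy", "Drama", "Action"],
    "rating": [7.1, 6.4, 8.0, 5.9],
})

numeric_summary = demo.describe().T                # numeric columns only
text_summary = demo.describe(exclude="number").T   # non-numeric columns only

print(numeric_summary.index.tolist())   # only the numeric column is summarized
print(text_summary.index.tolist())      # only the text column is summarized
print(text_summary.loc["genre", ["unique", "top", "freq"]].tolist())
```

For non-numeric columns the summary reports `count`, `unique`, `top` (the most frequent value), and `freq` (how often it occurs) instead of the mean and quartiles.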

**Observations:**

* ...

**Let's check the count of each unique category in each of the categorical variables.**

In [None]:
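# Making a list of all categorical variables
cat_cols = df.select_dtypes(include=['object', 'category']).columns.tolist()

# Printing the count of each unique value in each column
for col in cat_cols:
    print(df[col].value_counts(dropna=False))
    print('-' * 50)
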

### **Missing value treatment**

In [None]:
# Checking missing values
df.isna().sum()

In [None]:
# isnull() is an alias of isna(), so this returns the same counts as the cell above
df.isnull().sum()
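
The cells above only count missing values. One possible treatment, sketched here under assumptions, is to fill numeric gaps with the column median and non-numeric gaps with an explicit "Unknown" label. The helper name `fill_missing` and the toy columns are hypothetical. Whether imputing, dropping rows, or leaving the gaps is appropriate depends on what the real data shows.

```python
import numpy as np
import pandas as pd

def fill_missing(frame):
    """Return a copy with numeric NaNs set to the column median
    and all other missing values set to 'Unknown'."""
    out = frame.copy()
    for col in out.columns:
        if pd.api.types.is_numeric_dtype(out[col]):
            out[col] = out[col].fillna(out[col].median())
        else:
            out[col] = out[col].fillna("Unknown")
    return out

# Hypothetical toy data with gaps (names and values are made up)
toy = pd.DataFrame({
    "runtime": [90.0, np.nan, 120.0, 150.0],
    "genre": ["Drama", None, "Comedy", "Drama"],
})

cleaned = fill_missing(toy)
print(toy.isna().sum())
print(cleaned)
```

The function works on a copy, so the original frame keeps its gaps for comparison. Columns of `category` dtype would need "Unknown" added as a category before filling.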

## **Exploratory Data Analysis: Univariate**

**Let us explore the numerical variables first.**

In [None]:
def histogram_boxplot(feature, figsize=(15, 10), bins="auto"):
    """ Boxplot and histogram combined on a shared x-axis
    feature: 1-d feature array
    figsize: size of fig (default (15, 10))
    bins: number of bins (default "auto")
    """
    f, (ax_box, ax_hist) = plt.subplots(
        nrows=2,  # Number of rows of the subplot grid
        sharex=True,  # The X-axis will be shared among all the subplots
        gridspec_kw={"height_ratios": (.25, .75)},
        figsize=figsize
    )

    # Creating the subplots
    # Boxplot will be created and the mean value of the column will be indicated using some symbol
    sns.boxplot(x=feature, ax=ax_box, showmeans=True, color='red')

    # For histogram
    sns.histplot(x=feature, kde=False, ax=ax_hist, bins=bins)
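    # nanmean/nanmedian ignore missing values, so the lines are still drawn when NaNs are present
    ax_hist.axvline(np.nanmean(feature), color='g', linestyle='--')      # Add mean to the histogram
    ax_hist.axvline(np.nanmedian(feature), color='black', linestyle='-') # Add median to the histogram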

    plt.show()
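
The box plot flags points that lie more than 1.5 times the interquartile range beyond the quartiles, which is seaborn's default whisker length. A numeric companion to the plot can list those points directly. The helper below, `iqr_outliers`, is a hypothetical addition and the runtimes are made-up values.

```python
import numpy as np
import pandas as pd

def iqr_outliers(feature, whis=1.5):
    """Return the values lying outside [Q1 - whis*IQR, Q3 + whis*IQR]."""
    s = pd.Series(feature).dropna()
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - whis * iqr, q3 + whis * iqr
    return s[(s < lower) | (s > upper)]

# Hypothetical runtimes in minutes; 400 is a deliberately extreme value
runtimes = [88, 95, 101, 104, 110, 118, 125, np.nan, 400]
outliers = iqr_outliers(runtimes)
print(outliers.tolist())
```

Flagged values are candidates for inspection, not automatic deletions. An unusually long runtime may be a data entry error or a genuine edge case.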

### **Observations on [...]**

In [None]:
histogram_boxplot(df[...])

**Observations:**
* ...

[Rinse and Repeat for different plots...]

**Now, let's explore the categorical variables.**

In [None]:
def bar_perc(data, z):
    """ Bar plot of category counts, annotated with each category's percentage
    data: DataFrame
    z: name of the categorical column
    """
    total = len(data[z]) # Length of the column
    plt.figure(figsize = (15, 5))

    # Convert a copy of the column to a categorical data type (the input DataFrame is not modified)
    col = data[z].astype('category')

    # Bars are ordered from most to least frequent category
    ax = sns.countplot(x=col, palette='Paired', order=col.value_counts().index)

    for p in ax.patches:
        percentage = '{:.1f}%'.format(100 * p.get_height() / total) # Percentage of each class
        x = p.get_x() + p.get_width() / 2 - 0.05                    # Horizontal position of the label
        y = p.get_y() + p.get_height()                              # Top of the bar
        ax.annotate(percentage, (x, y), size = 12)                  # Annotate the percentage

    plt.show()                                                      # Display the plot

### **Observations on [...]**

In [None]:
bar_perc(df, ...)

**Observations:**
* ...

[Rinse and Repeat for different plots...]

## **Exploratory Data Analysis: Multivariate**

[Rinse and Repeat (same idea as the univariate) for different plots...]
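
As a starting point for this section, a correlation matrix over the numeric columns and a grouped summary are two common first steps. The sketch below assumes hypothetical columns (`budget`, `gross`, `rating`, `genre`) and made-up values. It is an illustration, not output from the project dataset.

```python
import pandas as pd

# Hypothetical toy data: names and values are placeholders
toy = pd.DataFrame({
    "budget": [10.0, 50.0, 120.0, 200.0, 35.0],
    "gross": [45.0, 40.0, 300.0, 650.0, 20.0],
    "rating": [7.2, 5.8, 7.9, 8.1, 6.0],
    "genre": ["Drama", "Comedy", "Action", "Action", "Comedy"],
})

# Numeric vs. numeric: pairwise Pearson correlations
corr = toy.select_dtypes(include="number").corr()

# Numeric vs. categorical: median gross per genre, highest first
median_gross_by_genre = (
    toy.groupby("genre")["gross"].median().sort_values(ascending=False)
)

print(corr.round(2))
print(median_gross_by_genre)
```

With seaborn imported as in the notebook, the matrix can be drawn with `sns.heatmap(corr, annot=True)`, and the grouped comparison can be plotted with `sns.boxplot(x="genre", y="gross", data=toy)`.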

## **Conclusion and Recommendations**

-----------------------------------------------------------------
### **Conclusion**
-----------------------------------------------------------------

We analyzed a dataset of nearly ...
The data spanned ...
The main feature of interest here is the ...
From a business perspective, ...
Thus, we determined the factors that affect ...

We have been able to conclude that:

1. ...

--------------------------------------------------
### **Recommendation to business**
--------------------------------------------------

1. ...

---------------------------------
###  **Further Analysis**
---------------------------------
1. Dig deeper to explore the variation of ...