# **Project Name**    - **AMAZON PRIME EDA PROJECT by Mary Celine**



##### **Project Type**    - EDA
##### **Contribution**    - Individual

# **Project Summary**

**Project Objective**

The objective of this project is to analyze a movies and TV shows dataset to uncover trends related to content type, genres, release patterns, ratings, and cast involvement. The goal is to derive meaningful insights that can support content strategy, audience targeting, and data-driven business decisions in the entertainment industry.

<br>

**Dataset Overview**

The project uses two datasets:


1.  Title Dataset
*   Contains information about movies and TV shows.
*   Key features include:
    *   Title name, type (Movie/Show)
    *   Release year
    *   Runtime and number of seasons
    *   Genres and production countries
    *   IMDb and TMDB ratings, votes, and popularity

2.  Credit Dataset
*   Contains cast and crew details.
*   Key features include:
    *   Person ID and name
    *   Role (Actor/Director)
    *   Character name
    *   Title ID (used to link with the Titles dataset)

The datasets are linked using a common title ID, enabling combined analysis of content and cast information.

<br>

**Data Cleaning and Preparation**

Handled missing values in columns such as runtime, seasons, and ratings.

Converted data types (e.g., release year, ratings) into appropriate formats.

Cleaned and standardized genre and country lists for consistency.

Removed duplicate records where applicable.

Merged the Titles and Credits datasets using the title ID for integrated analysis.
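
The genre and country clean-up can be sketched as follows. This is a minimal illustration, assuming the `genres` column stores each list as text such as `"['drama', 'comedy']"` (not confirmed by this notebook); the toy frame and the helper name `parse_list_cell` are hypothetical.

```python
import ast

import pandas as pd

# Toy stand-in for the titles data; values are illustrative, not from the real dataset.
toy_titles = pd.DataFrame({
    "id": ["t1", "t2", "t3"],
    "genres": ["['drama', 'comedy']", "[' Drama ']", "[]"],
})


def parse_list_cell(cell):
    """Turn a stringified list into a clean list of lower-case labels."""
    if not isinstance(cell, str):
        return []
    try:
        values = ast.literal_eval(cell)
    except (ValueError, SyntaxError):
        return []
    if not isinstance(values, (list, tuple)):
        return []
    return [str(v).strip().lower() for v in values if str(v).strip()]


toy_titles["genre_list"] = toy_titles["genres"].apply(parse_list_cell)

# One row per (title, genre) pair makes counting straightforward;
# titles with an empty list become NaN after explode and are dropped.
genre_counts = (
    toy_titles.explode("genre_list")["genre_list"]
    .dropna()
    .value_counts()
)
print(genre_counts)
```

The same helper could be applied to the production countries column if it uses the same text format.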

<br>

**Exploratory Data Analysis (EDA) & Key Insights**

1. Movies dominate the dataset. TV shows, which run for multiple seasons, may support longer-term engagement, although the dataset does not measure engagement directly.

2. Most content is produced in a limited number of countries, with the US being the major contributor.

3. Drama, Comedy, and Romance are the most common genres across both movies and shows.

4. Higher IMDb scores generally correlate with higher vote counts, indicating stronger audience engagement.

5. Content released after the 2000s shows a significant increase, highlighting the rapid growth of the entertainment industry.

6. Certain actors and directors appear repeatedly, indicating their strong industry presence and influence.
<br>

**Business Recommendations**

* Focus on producing content in high-performing genres such as Drama and Comedy to maximize audience reach.

* Invest more in TV shows, as they promote longer viewer retention.

* Leverage popular actors and directors to improve content visibility and engagement.

* Expand production into emerging regions to diversify content offerings and tap into new markets.

* Use IMDb and TMDB ratings as benchmarks for evaluating content success and guiding future projects.
<br>

**Conclusion**
<br>

This project demonstrates how structured data analysis can provide valuable insights into content performance and industry trends. By analyzing ratings, genres, release patterns, and cast data, the project highlights key factors influencing audience engagement. The findings can help stakeholders make informed decisions related to content creation, marketing strategy, and platform growth.

# **GitHub Link -**

Provide your GitHub Link here.

# **Problem Statement**


**To analyze movies and TV shows data to identify trends in content performance, audience preferences, and cast influence for data-driven decision making in the entertainment industry.**

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure each chart addresses the following format.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 15 logical & meaningful charts having important insights.


[ Hints : - Do the visualization in a structured way while following the "UBM" rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





6. You may add more ML algorithms for model creation. Make sure each algorithm addresses the following format.


*   Explain the ML Model used and its performance using Evaluation metric Score Chart.


*   Cross-Validation & Hyperparameter Tuning

*   Have you seen any improvement? Note down the improvement with the updated Evaluation metric Score Chart.

*   Explain each evaluation metric's indication towards business and the business impact of the ML model used.




















# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns


### Dataset Loading

In [None]:
# Load Dataset
d1 = pd.read_csv("/content/dataset/credits.csv")
d2 = pd.read_csv("/content/dataset/titles.csv")

In [None]:
d1.dtypes

In [None]:
d2.dtypes

In [None]:
#group names by the same id (no duplicates)
names_by_id = (
    d1.groupby('id')['name']
      .apply(lambda x: list(dict.fromkeys(x)))   # removes repeated names
      .reset_index())


In [None]:
# actors + directors grouped separately
actors = (
    d1[d1['role'] == 'ACTOR']
      .groupby('id')['name']
      .apply(lambda x: list(dict.fromkeys(x)))
      .reset_index()
      .rename(columns={'name': 'actors'}))

directors = (
    d1[d1['role'] == 'DIRECTOR']
      .groupby('id')['name']
      .apply(lambda x: list(dict.fromkeys(x)))
      .reset_index()
      .rename(columns={'name': 'directors'}))
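
The `list(dict.fromkeys(x))` idiom used in these cells removes repeated names while keeping them in order of first appearance, which a plain `set` does not guarantee. A small sketch on made-up credits rows (the names are invented for illustration):

```python
import pandas as pd

# Hypothetical credits rows; "Ann Lee" is listed twice for title t1.
toy_credits = pd.DataFrame({
    "id": ["t1", "t1", "t1", "t2"],
    "name": ["Ann Lee", "Bo Cruz", "Ann Lee", "Cy Diaz"],
    "role": ["ACTOR", "ACTOR", "ACTOR", "ACTOR"],
})

actors_by_title = (
    toy_credits[toy_credits["role"] == "ACTOR"]
    .groupby("id")["name"]
    .apply(lambda names: list(dict.fromkeys(names)))  # dedupe, keep order
    .reset_index()
    .rename(columns={"name": "actors"})
)
print(actors_by_title)
```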


In [None]:
merged = d1.merge(d2, on="id", how="inner")
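
Because the credits table has one row per person and title, this inner join repeats every title once per credit, and titles without any credits are dropped. Title-level statistics computed on the joined frame are therefore weighted by cast size. A minimal sketch with toy data (all values invented) shows the effect and one way to check the join shape with `validate`:

```python
import pandas as pd

toy_credits = pd.DataFrame({
    "id": ["t1", "t1", "t1", "t2"],
    "name": ["Ann Lee", "Bo Cruz", "Di Park", "Cy Diaz"],
})
toy_titles = pd.DataFrame({
    "id": ["t1", "t2", "t3"],
    "imdb_score": [8.0, 4.0, 6.0],
})

# validate raises MergeError if an id appears more than once in the titles frame.
toy_merged = toy_credits.merge(
    toy_titles, on="id", how="inner", validate="many_to_one"
)

# Mean over credit rows: title t1 counts three times.
credit_level_mean = toy_merged["imdb_score"].mean()

# Mean over titles: keep one row per id first.
title_level_mean = toy_merged.drop_duplicates(subset="id")["imdb_score"].mean()

print(len(toy_merged), credit_level_mean, title_level_mean)
```

For charts that describe titles rather than credits, `drop_duplicates(subset="id")` before plotting keeps each title from being counted repeatedly.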

In [None]:
# merge into titles
data = d2.merge(actors, on='id', how='left')
data = data.merge(directors, on='id', how='left')

In [None]:
d1

In [None]:
d2

In [None]:
d1['id'] = d1['id'].astype(str)


In [None]:
d2['id'] = d2['id'].astype(str)

In [None]:
d2 = d2.drop(
    columns=['age_certification', 'description', 'seasons', 'imdb_id', 'runtime'],
    errors='ignore')   # avoids crashing if one column is missing

In [None]:
# --- d1 conversions ---
# role -> category
if 'role' in d1.columns:
    d1['role'] = d1['role'].astype('category')


# --- d2 conversions ---
# id -> string
if 'id' in d2.columns:
    d2['id'] = d2['id'].astype('string')

# type -> category
if 'type' in d2.columns:
    d2['type'] = d2['type'].astype('category')

### Dataset First View

In [None]:
# Dataset First Look
d1.head()

In [None]:
d2.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
d1.shape

In [None]:
# Dataset Rows & Columns count
d2.shape

### Dataset Information

In [None]:
# Dataset Info
d1.info()


In [None]:
# Dataset Info
d2.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
d1.duplicated().sum()

In [None]:
# Dataset Duplicate Value Count
d2.duplicated().sum()

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
d1.isnull().sum()

In [None]:
# Missing Values/Null Values Count
d2.isnull().sum()

In [None]:
(d1["character"].isna() |
 d1["character"].astype(str).str.strip().eq("") |
 d1["character"].astype(str).eq("[]")).sum()


In [None]:
# Values read from CSV are text, so an empty list shows up as the string "[]".
# Treat blanks and "[]" as missing.
d1["character"] = d1["character"].replace(r"^\s*(\[\])?\s*$", np.nan, regex=True)
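
Values loaded with `pd.read_csv` are text, so an empty character list is stored as the string `[]` rather than a Python list, and a comparison with `[]` would not match it. A minimal sketch on toy values (invented for illustration):

```python
import numpy as np
import pandas as pd

# Toy column mimicking CSV-loaded text.
character = pd.Series(["Hero", "[]", "   ", None, "Villain"])

# Comparing each cell to a real list never matches the text "[]".
list_matches = character.apply(lambda x: x == []).sum()

# A single anchored regex covers blank cells and the literal "[]".
cleaned = character.replace(r"^\s*(\[\])?\s*$", np.nan, regex=True)

print(list_matches, cleaned.isna().sum())
```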

In [None]:
d1["character"].isnull().sum()

In [None]:
d1["character"] = d1["character"].fillna("Unknown")

In [None]:
d2[["imdb_score", "imdb_votes", "tmdb_popularity", "tmdb_score"]].isnull().sum()

In [None]:
d2["imdb_score"]       = d2["imdb_score"].fillna(d2["imdb_score"].median())
d2["imdb_votes"]       = d2["imdb_votes"].fillna(d2["imdb_votes"].median())
d2["tmdb_popularity"]  = d2["tmdb_popularity"].fillna(d2["tmdb_popularity"].median())
d2["tmdb_score"]       = d2["tmdb_score"].fillna(d2["tmdb_score"].median())
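
Median imputation is less sensitive to extreme values than the mean, but once the gaps are filled the imputed rows can no longer be told apart from real observations. A hedged sketch on toy vote counts, where the flag column name `imdb_votes_was_missing` is hypothetical:

```python
import numpy as np
import pandas as pd

# Toy vote counts with one gap and one very large value.
toy = pd.DataFrame({"imdb_votes": [10.0, 20.0, np.nan, 30.0, 100000.0]})

# Record missingness before filling so imputed rows can be excluded later.
toy["imdb_votes_was_missing"] = toy["imdb_votes"].isna()

median_votes = toy["imdb_votes"].median()  # middle of the observed values
mean_votes = toy["imdb_votes"].mean()      # pulled up by the large value

toy["imdb_votes"] = toy["imdb_votes"].fillna(median_votes)
print(median_votes, mean_votes)
```

Keeping the flag makes it possible to rerun a chart on observed values only and check whether the imputed rows change the picture.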


In [None]:
# Visualizing the missing values
missing_counts = merged.isnull().sum()

plt.figure(figsize=(10, 6))
plt.bar(missing_counts.index, missing_counts)

plt.xticks(rotation=45, ha='right')
plt.ylabel("Number of Missing Values")
plt.title("Missing Values per Column")
plt.tight_layout()
plt.show()

### What did you know about your dataset?

Answer Here

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
merged.columns

In [None]:
# Dataset Describe
merged.describe()

### Variables Description

Answer Here

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
merged.nunique()

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.
import numpy as np
import pandas as pd

# 1️⃣ Work on a copy (safety first)
data = merged.copy()

# 2️⃣ Trim column names (remove accidental spaces)
data.columns = data.columns.str.strip()

# 3️⃣ Drop obvious useless columns (edit this list if needed)
drop_cols = ["description", "imdb_id"]  # example from your earlier notes
data = data.drop(columns=[c for c in drop_cols if c in data.columns], errors="ignore")

# 4️⃣ Handle missing text fields
text_cols = data.select_dtypes(include="object").columns

for col in text_cols:
    data[col] = (
        data[col]
        .replace(r"^\s*(\[\])?\s*$", np.nan, regex=True)  # blanks and the text "[]" -> NaN
        .fillna("Unknown")                             # placeholder
    )

# 5️⃣ Handle missing numeric fields (median is safer than mean)
num_cols = data.select_dtypes(include=["float64", "int64"]).columns

for col in num_cols:
    data[col] = data[col].fillna(data[col].median())

# 6️⃣ Convert some columns to proper data types (examples — adjust as needed)
convert_to_category = ["type", "role"]
for col in convert_to_category:
    if col in data.columns:
        data[col] = data[col].astype("category")

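# The titles data names this column "seasons"; round before casting because
# a median-filled value may not be a whole number
if "seasons" in data.columns:
    data["seasons"] = pd.to_numeric(data["seasons"], errors="coerce").round().astype("Int64")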

if "id" in data.columns:
    data["id"] = data["id"].astype(str)

# 7️⃣ Final health check
print("Dataset shape:", data.shape)
print("\nMissing values after cleaning:\n", data.isnull().sum())
print("\nColumn dtypes:\n", data.dtypes)

# ready
analysis_ready = data


### What all manipulations have you done and insights you found?

<br>

**Data Cleaning**

1.   Joined the two datasets
2.   Standardized column names
3.   Removed low-value columns
4.   Cleaned text columns
5.   Cleaned numeric columns
6.   Fixed data types
7.   Built missing-value visibility
<br>

**Business Insights**


1.   Records had missing character information
2.   Numeric fields (scores, votes, popularity) contained gaps
3.   ID consistency was good
4.   Categorical variables like type/role are well-structured








## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
# Chart - 1 visualization code
plt.figure(figsize=(8,5))
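# One row per title: the merged frame repeats each title once per credit
sns.histplot(data=merged.drop_duplicates(subset="id"), x="imdb_score", bins=20, kde=True)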
plt.title("Distribution of IMDb Scores")
plt.show()

##### 1. Why did you pick the specific chart?

**Histograms show how scores are spread — low, average, or high.**

##### 2. What is/are the insight(s) found from the chart?

**Most content clusters around mid-range scores; only a few score extremely high.**

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Positive — helps identify quality benchmarks.
Negative caution: relying only on high-rated content may reduce catalog diversity.**

#### Chart - 2

In [None]:
# Chart - 2 visualization code
plt.figure(figsize=(8,5))
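# One row per title: the merged frame repeats each title once per credit
sns.histplot(data=merged.drop_duplicates(subset="id"), x="tmdb_popularity", bins=20, kde=True)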
plt.title("Distribution of TMDB Popularity")
plt.show()

##### 1. Why did you pick the specific chart?

Popularity behaves differently from ratings — this shows that contrast.

##### 2. What is/are the insight(s) found from the chart?

Popularity is skewed — a handful of titles dominate most attention.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive: tells marketing where attention naturally flows.
Risk: over-investing only in trending titles hurts long-tail discovery.

#### Chart - 3

In [None]:
# Chart - 3 visualization code
plt.figure(figsize=(8,5))
sns.scatterplot(data=merged, x="imdb_score", y="tmdb_popularity")
sns.regplot(data=merged, x="imdb_score", y="tmdb_popularity", scatter=False)
plt.title("Do Higher Ratings Drive Popularity?")
plt.show()

##### 1. Why did you pick the specific chart?

Scatter plots show relationships between two numeric variables.

##### 2. What is/are the insight(s) found from the chart?

Slight upward trend — ratings help, but hype and marketing matter too.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive: invest in well-reviewed titles.
Risk: ignoring marketing makes great content invisible.

#### Chart - 4

In [None]:
# Chart - 4 visualization code
plt.figure(figsize=(8,5))
sns.scatterplot(data=merged, x="imdb_votes", y="imdb_score")
plt.xscale("log")
plt.title("Do More Votes Mean Better Scores?")
plt.show()

##### 1. Why did you pick the specific chart?

Votes reflect audience size — we check if big audiences rate higher.

##### 2. What is/are the insight(s) found from the chart?

Many high-vote titles hover around average scores — popularity ≠ quality.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Great reality check for executives: volume doesn't equal value.

#### Chart - 5

In [None]:
# Chart - 5 visualization code
plt.figure(figsize=(8,5))
# One row per title so large casts do not weight the distribution
sns.boxplot(data=merged.drop_duplicates(subset="id"), x="type", y="imdb_score")
plt.title("Score Comparison by Content Type")
plt.show()

##### 1. Why did you pick the specific chart?

Boxplots compare distributions across categories.

##### 2. What is/are the insight(s) found from the chart?

Certain content types have tighter, more consistent quality.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive: guides production strategy.
Risk: abandoning weaker categories may shrink audience segments.

#### Chart - 6

In [None]:
# Chart - 6 visualization code
plt.figure(figsize=(8,5))
# Average over titles, not over credit rows
sns.barplot(data=merged.drop_duplicates(subset="id"), x="type", y="tmdb_popularity", estimator=np.mean)
plt.title("Average Popularity by Content Type")
plt.show()

##### 1. Why did you pick the specific chart?

Bar charts show average impact clearly.

##### 2. What is/are the insight(s) found from the chart?

Some formats consistently attract more views.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Helps allocate promotion budget more intelligently.

#### Chart - 7

In [None]:
# Chart - 7 visualization code
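# Correlate title-level metrics: one row per title, and skip the person identifier if present
corr = (
    merged.drop_duplicates(subset="id")
          .drop(columns=["person_id"], errors="ignore")
          .corr(numeric_only=True)
)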
plt.figure(figsize=(8,6))
sns.heatmap(corr, annot=True, cmap="coolwarm")
plt.title("Correlation Between Metrics")
plt.show()

##### 1. Why did you pick the specific chart?

Heatmaps summarize relationships across many variables fast.

##### 2. What is/are the insight(s) found from the chart?

Shows which metrics move together — and which don’t.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Guides modeling, forecasting, and prioritization.

#### Chart - 8

In [None]:
# Chart - 8 visualization code
merged["rating_gap"] = merged["tmdb_score"] - merged["imdb_score"]

plt.figure(figsize=(8,5))
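# One row per title: the merged frame repeats each title once per credit
sns.histplot(data=merged.drop_duplicates(subset="id"), x="rating_gap", bins=20, kde=True)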
plt.title("Distribution of Rating Gap (TMDB - IMDb)")
plt.axvline(0, color="red", linestyle="--")
plt.show()


##### 1. Why did you pick the specific chart?

A histogram of the score difference shows how closely the two platforms agree and in which direction they diverge.

##### 2. What is/are the insight(s) found from the chart?

Many titles receive similar ratings on both platforms, but some are rated much higher on one than on the other.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Helps content teams avoid trusting one score blindly.
Marketing messaging can be tailored to platform bias.

<br>

Negative - If leadership only chases titles with high ratings on one site, they may misjudge audience sentiment elsewhere.

#### Chart - 9

In [None]:
# Chart - 9 visualization code
plt.figure(figsize=(8,5))
sns.scatterplot(data=merged, x="tmdb_score", y="imdb_score")
plt.title("Do IMDb and TMDB Agree?")
plt.show()

##### 1. Why did you pick the specific chart?

Compare platform perception.

##### 2. What is/are the insight(s) found from the chart?

Similar patterns, but some strong disagreements exist.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Teaches teams not to rely on a single platform metric.

#### Chart - 10

In [None]:
# Chart - 10 visualization code
# One row per title first; otherwise the top rows are the same title repeated per credit
top = merged.drop_duplicates(subset="id").nlargest(10, "tmdb_popularity")

plt.figure(figsize=(8,5))
sns.barplot(data=top, y="title", x="tmdb_popularity")
plt.title("Top 10 Most Popular Titles")
plt.show()

##### 1. Why did you pick the specific chart?

Leaders love rankings.

##### 2. What is/are the insight(s) found from the chart?

Star power and franchises dominate.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Supports targeted campaigns and licensing negotiations.

#### Chart - 11

In [None]:
# Chart - 11 visualization code
plt.figure(figsize=(7,5))
# Count titles, not credit rows
merged.drop_duplicates(subset="id")["type"].value_counts().plot(kind="bar")
plt.title("Content Distribution by Type")
plt.show()

##### 1. Why did you pick the specific chart?

Inventory analysis — what we produce most.

##### 2. What is/are the insight(s) found from the chart?

One category usually dominates production.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Reveals whether catalog strategy is balanced or biased.

#### Chart - 12

In [None]:
# Chart - 12 visualization code
plt.figure(figsize=(8,5))
sns.scatterplot(data=merged, x="imdb_votes", y="tmdb_popularity")
plt.xscale("log")
plt.title("Relationship Between Audience Size and Popularity")
plt.show()

##### 1. Why did you pick the specific chart?

A scatter plot is ideal when checking numeric–numeric relationships.
Votes represent audience base, while popularity represents engagement momentum.

##### 2. What is/are the insight(s) found from the chart?

Titles with more votes tend to be more popular, but many exceptions exist.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Confirms that growing viewership tends to feed popularity — reinforcing marketing loops.

Negative - Not every big-audience title stays popular.

#### Chart - 13

In [None]:
# Chart - 13 visualization code
merged["rating_band"] = pd.cut(
    merged["imdb_score"],
    bins=[0, 4, 6, 8, 10],
    labels=["Low", "Average", "Good", "Excellent"]
)

plt.figure(figsize=(8,5))
# One row per title so large casts do not weight the distribution
sns.boxplot(data=merged.drop_duplicates(subset="id"), x="rating_band", y="tmdb_popularity")
plt.title("Popularity vs Rating Bands")
plt.show()

##### 1. Why did you pick the specific chart?

**Boxplots compare distributions across categories.**

##### 2. What is/are the insight(s) found from the chart?

Excellent titles trend more popular, but some Average titles punch above their weight.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Supports investing in consistently high-quality content — it compounds over time.
<br>

It also shows hype-driven titles exist.
Over-relying on hype leads to short-term spikes but weak long-term retention.
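
The rating bands in Chart 13 depend on how `pd.cut` treats bin edges. By default each bin includes its right edge and excludes its left edge, so a score of exactly 6.0 falls in "Average" and a score of exactly 0 falls in no band. A short sketch on made-up scores:

```python
import pandas as pd

scores = pd.Series([0.0, 4.0, 4.1, 6.0, 8.0, 10.0])
bins = [0, 4, 6, 8, 10]
labels = ["Low", "Average", "Good", "Excellent"]

# Default bins are right-inclusive: (0, 4], (4, 6], (6, 8], (8, 10].
default_bands = pd.cut(scores, bins=bins, labels=labels)

# include_lowest=True also captures a score of exactly 0 in the first band.
inclusive_bands = pd.cut(scores, bins=bins, labels=labels, include_lowest=True)

print(default_bands.tolist())
print(inclusive_bands.tolist())
```

Checking `rating_band.isna().sum()` after binning is a quick way to see whether any scores fell outside the bins.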

#### Chart - 14 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code
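# Correlate title-level metrics: one row per title, and skip the person identifier if present
corr = (
    merged.drop_duplicates(subset="id")
          .drop(columns=["person_id"], errors="ignore")
          .corr(numeric_only=True)
)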

plt.figure(figsize=(8,6))
sns.heatmap(corr, annot=True, cmap="coolwarm", fmt=".2f")
plt.title("Correlation Heatmap")
plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

A correlation heatmap is like a control tower:

* it shows how every numeric variable relates to every other

* values near +1 mean strong positive relationship

* values near –1 mean strong negative relationship

* values near 0 mean no meaningful relationship

##### 2. What is/are the insight(s) found from the chart?



1.  imdb_votes → strongly correlated with tmdb_popularity
2.  tmdb_score and imdb_score are moderately correlated
3.  weak/no correlation between some fields — meaning they move independently.
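
`DataFrame.corr` uses the Pearson coefficient by default, which measures linear association. For heavily skewed columns such as vote counts or popularity, a rank-based (Spearman) coefficient can be a useful cross-check, and it can be computed as the Pearson correlation of the ranks. A sketch on synthetic data (not from the dataset):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic, strongly non-linear but monotonic relationship.
x = rng.uniform(0, 10, size=200)
toy = pd.DataFrame({"x": x, "y": np.exp(x)})

pearson = toy.corr().loc["x", "y"]          # linear association
spearman = toy.rank().corr().loc["x", "y"]  # Pearson on ranks = Spearman

print(round(pearson, 2), round(spearman, 2))
```

When the two coefficients differ a lot, the relationship is probably monotonic but not linear, and a log transform of the skewed column may be worth trying before modeling.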



#### Chart - 15 - Pair Plot

In [None]:
# Pair Plot visualization code
sns.pairplot(
    merged[["imdb_score", "tmdb_score", "tmdb_popularity", "imdb_votes"]],
    diag_kind="kde"
)
plt.suptitle("Pair Plot — Relationships Between Key Metrics", y=1.02)
plt.show()

##### 1. Why did you pick the specific chart?

A pair plot gives:

* scatter plots for every pair of variables

* distributions along the diagonal

* visual clustering and trend hints

##### 2. What is/are the insight(s) found from the chart?

* Popularity increases loosely with votes

* Higher ratings sometimes cluster together

* Some relationships show only weak correlation

## ***5. Hypothesis Testing***

### Based on your chart experiments, define three hypothetical statements from the dataset. In the next three questions, perform hypothesis testing to obtain final conclusion about the statements through your code and statistical testing.

Answer Here.

### Hypothetical Statement - 1

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Answer Here.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value

##### Which statistical test have you done to obtain P-Value?

Answer Here.

##### Why did you choose the specific statistical test?

Answer Here.

### Hypothetical Statement - 2

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Answer Here.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value

##### Which statistical test have you done to obtain P-Value?

Answer Here.

##### Why did you choose the specific statistical test?

Answer Here.

### Hypothetical Statement - 3

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Answer Here.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value

##### Which statistical test have you done to obtain P-Value?

Answer Here.

##### Why did you choose the specific statistical test?

Answer Here.

## ***6. Feature Engineering & Data Pre-processing***

### 1. Handling Missing Values

In [None]:
# Handling Missing Values & Missing Value Imputation

#### What all missing value imputation techniques have you used and why did you use those techniques?

Answer Here.

### 2. Handling Outliers

In [None]:
# Handling Outliers & Outlier treatments

##### What all outlier treatment techniques have you used and why did you use those techniques?

Answer Here.

### 3. Categorical Encoding

In [None]:
# Encode your categorical columns

#### What all categorical encoding techniques have you used & why did you use those techniques?

Answer Here.

### 4. Textual Data Preprocessing
(It's mandatory for textual dataset i.e., NLP, Sentiment Analysis, Text Clustering etc.)

#### 1. Expand Contraction

In [None]:
# Expand Contraction

#### 2. Lower Casing

In [None]:
# Lower Casing

#### 3. Removing Punctuations

In [None]:
# Remove Punctuations

#### 4. Removing URLs & Removing words containing digits

In [None]:
# Remove URLs & remove words containing digits

#### 5. Removing Stopwords & Removing White spaces

In [None]:
# Remove Stopwords

In [None]:
# Remove White spaces

#### 6. Rephrase Text

In [None]:
# Rephrase Text

#### 7. Tokenization

In [None]:
# Tokenization

#### 8. Text Normalization

In [None]:
# Normalizing Text (i.e., Stemming, Lemmatization etc.)

##### Which text normalization technique have you used and why?

Answer Here.

#### 9. Part of speech tagging

In [None]:
# POS Tagging

#### 10. Text Vectorization

In [None]:
# Vectorizing Text

##### Which text vectorization technique have you used and why?

Answer Here.

### 4. Feature Manipulation & Selection

#### 1. Feature Manipulation

In [None]:
# Manipulate Features to minimize feature correlation and create new features

#### 2. Feature Selection

In [None]:
# Select your features wisely to avoid overfitting

##### What all feature selection methods have you used  and why?

Answer Here.

##### Which all features you found important and why?

Answer Here.

### 5. Data Transformation

#### Do you think that your data needs to be transformed? If yes, which transformation have you used. Explain Why?

In [None]:
# Transform Your data

### 6. Data Scaling

In [None]:
# Scaling your data

##### Which method have you used to scale your data and why?

### 7. Dimensionality Reduction

##### Do you think that dimensionality reduction is needed? Explain Why?

Answer Here.

In [None]:
# Dimensionality Reduction (If needed)

##### Which dimensionality reduction technique have you used and why? (If dimensionality reduction done on dataset.)

Answer Here.

### 8. Data Splitting

In [None]:
# Split your data to train and test. Choose Splitting ratio wisely.

##### What data splitting ratio have you used and why?

Answer Here.

### 9. Handling Imbalanced Dataset

##### Do you think the dataset is imbalanced? Explain Why.

Answer Here.

In [None]:
# Handling Imbalanced Dataset (If needed)

##### What technique did you use to handle the imbalance dataset and why? (If needed to be balanced)

Answer Here.

## ***7. ML Model Implementation***

### ML Model - 1

In [None]:
# ML Model - 1 Implementation

# Fit the Algorithm

# Predict on the model

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# ML Model - 1 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Fit the Algorithm

# Predict on the model

##### Which hyperparameter optimization technique have you used and why?

Answer Here.

##### Have you seen any improvement? Note down the improvement with the updated Evaluation metric Score Chart.

Answer Here.

### ML Model - 2

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# ML Model - 2 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Fit the Algorithm

# Predict on the model

##### Which hyperparameter optimization technique have you used and why?

Answer Here.

##### Have you seen any improvement? Note down the improvement with the updated Evaluation metric Score Chart.

Answer Here.

#### 3. Explain each evaluation metric's indication towards business and the business impact of the ML model used.

Answer Here.

### ML Model - 3

In [None]:
# ML Model - 3 Implementation

# Fit the Algorithm

# Predict on the model

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# ML Model - 3 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Fit the Algorithm

# Predict on the model

##### Which hyperparameter optimization technique have you used and why?

Answer Here.

##### Have you seen any improvement? Note down the improvement with the updated Evaluation metric Score Chart.

Answer Here.

### 1. Which Evaluation metrics did you consider for a positive business impact and why?

Answer Here.

### 2. Which ML model did you choose from the above created models as your final prediction model and why?

Answer Here.

### 3. Explain the model which you have used and the feature importance using any model explainability tool?

Answer Here.

## ***8.*** ***Future Work (Optional)***

### 1. Save the best performing ml model in a pickle file or joblib file format for deployment process.


In [None]:
# Save the File

### 2. Again Load the saved model file and try to predict unseen data for a sanity check.


In [None]:
# Load the File and predict unseen data.

### ***Congrats! Your model is successfully created and ready for deployment on a live server for a real user interaction !!!***

# **Conclusion**

Write the conclusion here.

### ***Hurrah! You have successfully completed your Machine Learning Capstone Project !!!***