# Business Understanding

## Context
A superstore is planning its year-end sale. It wants to launch a new offer, a gold membership that gives a 20% discount on all purchases, for only $499 instead of the usual $999. The offer will be valid only for existing customers, and a phone-call campaign is currently being planned for them. Management believes that the best way to reduce the cost of the campaign is to build a predictive model that classifies which customers are likely to purchase the offer.

## Objective
The superstore wants to predict the likelihood that a customer will give a positive response and to identify the factors that affect the customer's response. You need to analyze the data provided to identify these factors and then build a model that predicts the probability that a customer will give a positive response.

## About the Dataset
This data was gathered during last year's campaign. The data description is as follows:

- **Response (target)**: 1 if customer accepted the offer in the last campaign, 0 otherwise
- **Id**: Unique ID of each customer
- **Year_Birth**: Customer's year of birth
- **Complain**: 1 if the customer complained in the last 2 years
- **Dt_Customer**: Date of customer's enrollment with the company
- **Education**: Customer's level of education
- **Marital_Status**: Customer's marital status
- **Kidhome**: Number of small children in customer's household
- **Teenhome**: Number of teenagers in customer's household
- **Income**: Customer's yearly household income
- **MntFishProducts**: The amount spent on fish products in the last 2 years
- **MntMeatProducts**: The amount spent on meat products in the last 2 years
- **MntFruits**: The amount spent on fruit products in the last 2 years
- **MntSweetProducts**: The amount spent on sweet products in the last 2 years
- **MntWines**: The amount spent on wine products in the last 2 years
- **MntGoldProds**: The amount spent on gold products in the last 2 years
- **NumDealsPurchases**: Number of purchases made with discount
- **NumCatalogPurchases**: Number of purchases made using catalog (buying goods to be shipped through the mail)
- **NumStorePurchases**: Number of purchases made directly in stores
- **NumWebPurchases**: Number of purchases made through the company's website
- **NumWebVisitsMonth**: Number of visits to company's website in the last month
- **Recency**: Number of days since the last purchase

## Goals of the Analysis
1. **Identify Factors Influencing Customer Response**: Analyze the data to identify the key factors that influence whether a customer accepts the offer.
2. **Build a Predictive Model**: Develop a model to predict the likelihood of a customer accepting the offer, which will help in targeting the right customers and reducing campaign costs.
3. **Optimize Marketing Strategy**: Use insights from the analysis to optimize the marketing strategy and improve the effectiveness of the campaign.

By understanding the data and identifying the important factors, we can better predict customer behavior and tailor the marketing efforts to maximize the success of the year-end sale.


# Data Understanding

## Overview of the Dataset
In this section, we will explore the dataset to understand its structure and contents. This will involve loading the data, displaying basic information, and performing some initial visualizations.

### 1. Load the Dataset
First, let's load the dataset and take a look at the first few rows to get an idea of what it looks like.

### Import Library

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from sklearn.ensemble import RandomForestClassifier

pd.set_option('display.max_columns', None)

In [None]:
df = pd.read_csv('superstore_data.csv')

In [None]:
df.head()

### 2. Data Structure and Types

We will now check the structure of the dataset, including the number of rows and columns, and the data types of each column.

In [None]:
df.info()

### 3. Summary Statistics

Next, let's generate summary statistics for the numerical columns in the dataset to understand their distribution and central tendencies.

In [None]:
df.describe()

### 4. Missing Values

We need to check for any missing values in the dataset, as handling missing data will be an important step in data preparation.

In [None]:
df.isna().sum()

As we can see, the Income column contains missing values.

### 5. Data Distribution
To better understand the data, let's visualize the distribution of key variables. This includes both numerical and categorical variables.

#### 5.1 Distribution of Numerical Variables
We will plot histograms for the numerical variables to see their distributions.

In [None]:
# Plot histograms for numerical variables
df.hist(bins=20, figsize=(20, 15))
plt.show()


#### 5.2 Box Plot of Numerical Variables

We will plot box plots for the numerical variables to identify outliers and understand the spread of the data.

In [None]:
# Box plot for Income
numerical_columns = ['Income']

plt.figure(figsize=(20, 15))
df[numerical_columns].boxplot(rot=45)
plt.title('Box Plot of Numerical Variables')
plt.show()

In [None]:
# Box plot for Year_Birth
numerical_columns = ['Year_Birth']

plt.figure(figsize=(20, 15))
df[numerical_columns].boxplot(rot=45)
plt.title('Box Plot of Numerical Variables')
plt.show()

In [None]:
# Box plots for the remaining numerical columns (excluding Income, Id, and Year_Birth)
numerical_columns = df.columns[df.dtypes != 'object'].drop(['Income', 'Id', 'Year_Birth'])

plt.figure(figsize=(20, 15))
df[numerical_columns].boxplot(rot=45)
plt.title('Box Plot of Numerical Variables')
plt.show()

#### 5.3 Distribution of Categorical Variables

We will use bar plots to visualize the distribution of categorical variables.

In [None]:
# Plot bar plots for categorical variables
categorical_columns = ['Education', 'Complain', 'Response', 'Marital_Status']

fig, axes = plt.subplots(2, 2, figsize=(15, 10))

for i, col in enumerate(categorical_columns):
    df[col].value_counts().plot(kind='bar', ax=axes[i//2, i%2])
    axes[i//2, i%2].set_title(f'Distribution of {col}')
    axes[i//2, i%2].set_ylabel('Count')

plt.tight_layout()
plt.show()


#### 5.4 Top and Bottom Values

We will examine the top 10 values for Income and MntMeatProducts, and the bottom 10 values for Year_Birth.

In [None]:
# Top 10 values for Income
top_10_income = df.nlargest(10, 'Income')
print("Top 10 values for Income:")
print(top_10_income[['Id', 'Income']])

# Top 10 values for MntMeatProducts
top_10_meat = df.nlargest(10, 'MntMeatProducts')
print("Top 10 values for MntMeatProducts:")
print(top_10_meat[['Id', 'MntMeatProducts']])

# Bottom 10 values for Year_Birth
bottom_10_year_birth = df.nsmallest(10, 'Year_Birth')
print("Bottom 10 values for Year_Birth:")
print(bottom_10_year_birth[['Id', 'Year_Birth']])

### 6. Pair Plot

We will use pair plots to visualize the relationships between multiple numerical variables.

In [None]:
import seaborn as sns

# Select a subset of columns for the pair plot
pairplot_columns = ['Income', 'MntFishProducts', 'MntMeatProducts', 'MntFruits', 'MntSweetProducts', 
                    'MntWines', 'MntGoldProds', 'NumDealsPurchases', 'Recency', 'Response']

# Plot pair plot
sns.pairplot(df[pairplot_columns], hue='Response')
plt.show()


### 7. Correlation Analysis

We will perform a correlation analysis to identify relationships between numerical variables and the target variable (Response).

In [None]:
# Calculate correlation matrix
correlation_matrix = df.select_dtypes(include=[np.number]).corr()

# Display the correlation matrix
plt.figure(figsize=(15, 10))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm')
plt.title('Correlation Matrix')
plt.show()
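The heatmap shows every pairwise correlation, which can make the relationship with the target hard to read. As a minimal sketch, the snippet below ranks the numerical columns by the absolute value of their correlation with Response. It runs on a small synthetic frame (`toy`, a hypothetical stand-in for `df`, with made-up values), so the numbers it prints say nothing about the real data. Note that Pearson correlation only captures linear association, and with a binary target it should be read as a rough screening tool.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the superstore frame
rng = np.random.default_rng(0)
n = 500
income = rng.normal(50000, 15000, n)
recency = rng.integers(0, 100, n)
# Make the toy target depend on income so that a correlation is visible
response = (income + rng.normal(0, 10000, n) > 60000).astype(int)
toy = pd.DataFrame({"Income": income, "Recency": recency, "Response": response})

# Correlation of each numerical column with the target, strongest first
target_corr = (
    toy.select_dtypes(include=[np.number])
    .corr()["Response"]
    .drop("Response")
    .sort_values(key=lambda s: s.abs(), ascending=False)
)
print(target_corr)
```

Replacing `toy` with `df` would apply the same ranking to the real columns.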

# Data Preparation

## Introduction
In this section, we will prepare the data for modeling. This includes handling missing values and outliers, encoding categorical variables, engineering features, and splitting the data into training and testing sets. Proper data preparation is crucial for building an effective predictive model.

### 1. Handling Missing Values
First, we will handle any missing values in the dataset. We will decide whether to fill in missing values or drop the corresponding rows/columns based on the amount and significance of the missing data.


In [None]:
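# Check for missing values
missing_values = df.isnull().sum()
print(missing_values)

# Fill missing values (example: fill missing income with median income)
# Assign the result back instead of calling fillna(inplace=True) on a column
# selection, which recent pandas versions warn about
df['Income'] = df['Income'].fillna(df['Income'].median())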

# # If other columns have missing values, we need to decide on the strategy
# # For simplicity, we will drop rows with any missing values
# df.dropna(inplace=True)

# Verify that there are no missing values left
df.isnull().sum()

### 2. Handling Outliers

We will handle outliers in the Year_Birth, Income, and MntMeatProducts columns to ensure they do not adversely affect our model.

In [None]:
# Remove rows where Year_Birth is earlier than 1940
df = df[df['Year_Birth'] >= 1940]

# Remove rows where Income is greater than or equal to 400000
df = df[df['Income'] < 400000]

# Remove rows where MntMeatProducts is greater than or equal to 1000
df = df[df['MntMeatProducts'] < 1000]
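The cut-offs above are fixed thresholds chosen by inspecting the plots. A common data-driven alternative is the 1.5 × IQR rule, the same rule box plots use to draw their whiskers. The sketch below illustrates it on a hypothetical toy column with made-up values; it is not the method used in this notebook, and on skewed columns such as the spending amounts it may flag many legitimate customers as outliers.

```python
import pandas as pd

# Hypothetical toy values; 666666 mimics an implausibly large income
toy = pd.DataFrame({"Income": [30000, 42000, 51000, 58000, 64000, 72000, 666666]})

# Compute the interquartile range and the lower/upper fences
q1 = toy["Income"].quantile(0.25)
q3 = toy["Income"].quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Keep only the rows that fall inside the fences
filtered = toy[toy["Income"].between(lower, upper)].copy()
print(f"Fences: [{lower:.0f}, {upper:.0f}], rows kept: {len(filtered)} of {len(toy)}")
```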

### 3. Encoding Categorical Variables

We will encode the categorical variables. Both Education and Marital_Status are one-hot encoded here; an ordinal encoding would be an alternative for Education if its levels are treated as ordered.

In [None]:
# One-hot encode categorical variables
df = pd.get_dummies(df, columns=['Education', 'Marital_Status'], drop_first=True)

# Display the first few rows of the updated dataframe
df.head()

### 4. Feature Engineering

We will create new features or transform existing ones to enhance the predictive power of the model.

#### 4.1 Creating Age from Year_Birth

We will create a new feature Age from Year_Birth.

In [None]:
# Create Age feature
df['Age'] = 2024 - df['Year_Birth']

# Drop the original Year_Birth column
df.drop('Year_Birth', axis=1, inplace=True)

### 5. Splitting the Data

We will split the data into training and testing sets to evaluate the performance of our predictive model.

In [None]:
from sklearn.model_selection import train_test_split

# Define the target variable and features
X = df.drop(['Response', 'Id', 'Dt_Customer'], axis=1)
y = df['Response']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Display the shape of the training and testing sets
X_train.shape, X_test.shape, y_train.shape, y_test.shape
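When the target is imbalanced, a purely random split can leave the training and testing sets with different shares of positive responses. Passing `stratify` to `train_test_split` keeps the class proportions the same in both parts. The sketch below demonstrates this on hypothetical synthetic arrays (`X_toy`, `y_toy`, with an assumed 15% positive rate); whether the real Response column is imbalanced enough to need it should be checked with `y.value_counts()`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced target: 150 positives out of 1000 rows
rng = np.random.default_rng(42)
X_toy = rng.normal(size=(1000, 3))
y_toy = np.array([1] * 150 + [0] * 850)

# stratify=y_toy preserves the class ratio in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

print("Positive rate (train):", y_tr.mean())
print("Positive rate (test): ", y_te.mean())
```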

# Modeling

## Introduction
In this section, we will build a predictive model using the Random Forest Classifier and train it on our training dataset.

### 1. Training the Model
We will start by training the Random Forest Classifier on the training dataset.


In [None]:
# Initialize the Random Forest Classifier
rfc = RandomForestClassifier(random_state=42, class_weight='balanced')

# Train the model
rfc.fit(X_train, y_train)

# Make predictions on the testing set
y_pred = rfc.predict(X_test)

# Evaluate the model
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))
print("\nClassification Report:")
print(classification_report(y_test, y_pred))
print("\nAccuracy Score:")
print(accuracy_score(y_test, y_pred))

In [None]:
# Initialize the Random Forest Classifier without class weighting, for comparison
rfc = RandomForestClassifier(random_state=42)

# Train the model
rfc.fit(X_train, y_train)

# Make predictions on the testing set
y_pred = rfc.predict(X_test)

# Evaluate the model
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))
print("\nClassification Report:")
print(classification_report(y_test, y_pred))
print("\nAccuracy Score:")
print(accuracy_score(y_test, y_pred))
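One of the goals is to identify the factors that influence the response. A fitted random forest exposes `feature_importances_`, which gives a first, rough ranking of the features. The sketch below shows the pattern on hypothetical synthetic data (`X_toy` with placeholder column names); applying it to the notebook's model would mean using `rfc` and `X_train.columns` instead. These impurity-based importances are known to favor features with many distinct values, so treat the ranking as a starting point rather than a final answer.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Hypothetical synthetic data with placeholder feature names
X_arr, y_toy = make_classification(
    n_samples=400, n_features=5, n_informative=2, n_redundant=0, random_state=0
)
X_toy = pd.DataFrame(X_arr, columns=[f"feature_{i}" for i in range(5)])

model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_toy, y_toy)

# Pair each importance with its column name and sort from most to least important
importances = pd.Series(
    model.feature_importances_, index=X_toy.columns
).sort_values(ascending=False)
print(importances)
```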

### 2. Hyperparameter Tuning

To improve the performance of our model, we will perform hyperparameter tuning using GridSearchCV.

In [None]:
from sklearn.model_selection import GridSearchCV

# Define the parameter grid
param_grid = {
    'max_depth': [4, 6, 8, 10],
    'criterion': ['gini', 'entropy']
}

# Initialize the GridSearchCV
grid_search = GridSearchCV(estimator=rfc, param_grid=param_grid, cv=5, n_jobs=-1, verbose=2)

# Fit the model
grid_search.fit(X_train, y_train)

# Print the best parameters and best score
print("Best Parameters:", grid_search.best_params_)
print("Best Score:", grid_search.best_score_)
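By default, `GridSearchCV` scores a classifier by accuracy, which can look high on an imbalanced target even when few positive responses are found. The `scoring` parameter lets the search optimize a different metric. The sketch below uses `scoring="f1"` on hypothetical synthetic data with a small grid; the metric choice is an assumption for illustration, and recall or ROC AUC may suit the campaign's cost trade-off better.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Hypothetical imbalanced data: roughly 15% positives
X_toy, y_toy = make_classification(
    n_samples=300, n_features=6, weights=[0.85, 0.15], random_state=0
)

# Same kind of search as above, but ranked by F1 on the positive class
search = GridSearchCV(
    estimator=RandomForestClassifier(n_estimators=25, random_state=0),
    param_grid={"max_depth": [4, 8]},
    scoring="f1",
    cv=3,
)
search.fit(X_toy, y_toy)

print("Best Parameters:", search.best_params_)
print("Best F1 (cross-validated):", search.best_score_)
```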

# Evaluation

## Introduction
In this section, we will evaluate the performance of our optimized Random Forest Classifier model. We will use various evaluation metrics such as the confusion matrix, classification report, and accuracy score to assess how well our model performs on the testing dataset.

### 1. Confusion Matrix
The confusion matrix provides a summary of the prediction results on the testing dataset. It shows the number of true positives, true negatives, false positives, and false negatives.

In [None]:
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score
import seaborn as sns
import matplotlib.pyplot as plt

# Use the best model to make predictions
best_rfc = grid_search.best_estimator_
y_pred_best = best_rfc.predict(X_test)

# Compute confusion matrix
conf_matrix = confusion_matrix(y_test, y_pred_best)

# Plot confusion matrix
plt.figure(figsize=(8, 6))
sns.heatmap(conf_matrix, annot=True, fmt='d', cmap='Blues', xticklabels=['Not Purchased', 'Purchased'], yticklabels=['Not Purchased', 'Purchased'])
plt.ylabel('Actual')
plt.xlabel('Predicted')
plt.title('Confusion Matrix')
plt.show()

### 2. Classification Report

The classification report provides a detailed breakdown of the precision, recall, F1-score, and support for each class.

In [None]:
# Print classification report
print("Classification Report:")
print(classification_report(y_test, y_pred_best, target_names=['Not Purchased', 'Purchased']))
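The objective asks for the probability of a positive response, not only a yes/no label. `predict_proba` returns those probabilities, and ROC AUC measures how well they rank responders above non-responders. The sketch below shows the idea on hypothetical synthetic data; with the notebook's objects, the equivalent call would be `best_rfc.predict_proba(X_test)[:, 1]`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced data standing in for the customer features
X_toy, y_toy = make_classification(
    n_samples=600, n_features=6, weights=[0.85, 0.15], random_state=1
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=1, stratify=y_toy
)
model = RandomForestClassifier(n_estimators=50, random_state=1).fit(X_tr, y_tr)

# Column 1 of predict_proba is the estimated probability of class 1
proba = model.predict_proba(X_te)[:, 1]
print("ROC AUC:", roc_auc_score(y_te, proba))

# Indices of the 10 test rows with the highest predicted probability
top_idx = np.argsort(proba)[::-1][:10]
print("Top 10 probabilities:", proba[top_idx])
```

Sorting customers by this probability gives a call list in which the most likely buyers come first, which is how the model can reduce the cost of the phone campaign.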

# Deployment

## Introduction
In this section, we will discuss how to deploy our optimized Random Forest Classifier model. Deployment involves saving the trained model so that it can be used to make predictions on new data. We will use joblib to save the model and provide an example of how to load the model and make predictions.

### 1. Saving the Model
We will use joblib to save the trained model to a file. This allows us to reuse the model without having to retrain it.


In [None]:
import joblib

# Save the tuned model (best estimator from the grid search) to a file
joblib_file = "rfc_model.joblib"
joblib.dump(best_rfc, joblib_file)

print(f"Model saved to {joblib_file}")
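To round out the deployment step, here is a minimal sketch of saving a model, loading it back, and making predictions. It trains a small stand-in model on hypothetical synthetic data and writes to a temporary directory so that it runs on its own; in the notebook, the saved object would be the tuned model and the file would live at a permanent path.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Hypothetical stand-in for the trained model
X_toy, y_toy = make_classification(n_samples=200, n_features=5, random_state=0)
model = RandomForestClassifier(n_estimators=20, random_state=0).fit(X_toy, y_toy)

# Save to a temporary file, then load it back as a separate object
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "rfc_model.joblib")
    joblib.dump(model, model_path)
    loaded_model = joblib.load(model_path)

# Predict response probabilities for "new" rows with the loaded model
new_proba = loaded_model.predict_proba(X_toy[:5])[:, 1]
same_predictions = np.array_equal(model.predict(X_toy), loaded_model.predict(X_toy))

print("Predicted probabilities:", new_proba)
print("Loaded model matches original:", same_predictions)
```

New data must go through the same preparation steps as the training data (imputation, one-hot encoding, the Age feature) and have the same columns in the same order before it is passed to the loaded model.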