<a href="https://colab.research.google.com/github/harshavardhan4199/bank-stock-prices/blob/main/Copy_of_Sample_ML_Submission_Template_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    - Shopper Spectrum: Customer Segmentation and Product Recommendations in E-Commerce




##### **Project Type**    - EDA/Regression/Classification/Unsupervised
##### **Contribution**    - Individual
##### **Team Member 1 -** N.Harsha Vardhan
##### **Team Member 2 -**
##### **Team Member 3 -**
##### **Team Member 4 -**

# **Project Summary -**

The “Shopper Spectrum: Customer Segmentation and Product Recommendations in E-Commerce” dataset offers a comprehensive view of online retail transactions, capturing the behavior and preferences of customers across various countries. With over 541,000 entries, each row represents an individual product purchase, including key attributes such as invoice number, product code and description, quantity, invoice date, unit price, customer ID, and country. This granular data enables a wide range of analytical applications, from customer profiling to product trend analysis. While the majority of transactions are from the United Kingdom, the dataset also includes purchases from other European and international regions, allowing for a comparative analysis of geographic shopping patterns.

One of the primary uses of this dataset is in customer segmentation. By analyzing metrics such as purchase frequency, total spend, and recency of transactions—commonly known as RFM analysis—retailers can categorize customers into groups like loyal shoppers, occasional buyers, and price-sensitive customers. This segmentation is instrumental in personalizing marketing efforts and improving customer retention. Additionally, the presence of product-level data tied to specific invoices allows for market basket analysis, identifying frequently co-purchased items and enabling cross-selling opportunities. These insights form the foundation for collaborative and content-based recommendation systems that suggest products based on a customer’s history or item similarities.

Another key application of the dataset lies in sales and trend analysis. The timestamped invoice data enables the identification of seasonal trends, peak shopping periods, and possible stock management issues. For instance, products with consistently high quantities or those showing spikes in specific months can guide inventory planning. The dataset also includes returns, identified by negative quantities or unit prices, which can be used to assess product quality or customer satisfaction issues.

Despite its strengths, the dataset presents challenges such as missing customer IDs for about 25% of transactions and some null product descriptions. These issues can limit the effectiveness of certain personalized models but are manageable with data preprocessing and imputation techniques. Nevertheless, the overall richness of the data provides immense value for building machine learning models aimed at enhancing user experience, optimizing sales strategies, and improving operational efficiency in e-commerce.

In summary, the “Shopper Spectrum” dataset is a powerful resource for businesses looking to leverage data-driven decision-making. It supports the development of robust customer segmentation strategies, intelligent recommendation systems, and detailed market insights. With thoughtful analysis and appropriate modeling techniques, this dataset can significantly enhance a retailer’s ability to understand and serve its customers in a competitive online environment.

# **GitHub Link -**

Provide your GitHub Link here.

# **Problem Statement**


In the dynamic and competitive world of e-commerce, understanding customer behavior and delivering personalized shopping experiences have become critical for business success. However, the vast diversity in customer preferences, buying patterns, and geographical differences makes it challenging for retailers to design effective marketing and sales strategies. This project addresses the need for intelligent customer segmentation and product recommendation systems using the “Shopper Spectrum” dataset. The primary objective is to categorize customers based on their purchasing frequency, recency, and monetary value to enable targeted marketing approaches. Additionally, the project aims to develop recommendation systems that suggest relevant products to customers by analyzing their past purchases or the behavior of similar users. Another key focus is on identifying sales trends over time to support better inventory management and promotional planning. The dataset also presents real-world challenges such as missing customer IDs and incomplete product descriptions, which must be carefully managed to ensure reliable insights. Overall, the project seeks to provide data-driven solutions that empower e-commerce businesses to enhance customer engagement, improve operational efficiency, and drive growth through smarter decision-making.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many number of charts you want. Make Sure for each and every chart the following format should be answered.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 15 logical & meaningful charts having important insights.


[ Hints : - Do the Vizualization in  a structured way while following "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





6. You may add more ml algorithms for model creation. Make sure for each and every algorithm, the following format should be answered.


*   Explain the ML Model used and it's performance using Evaluation metric Score Chart.


*   Cross- Validation & Hyperparameter Tuning

*   Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

*   Explain each evaluation metric's indication towards business and the business impact pf the ML model used.




















# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns


### Dataset Loading

In [None]:
# Load Dataset
df = pd.read_csv("online_retail.csv", encoding='ISO-8859-1')



### Dataset First View

In [None]:
# Dataset First Look
# Import necessary library
import pandas as pd

# Load the dataset
df = pd.read_csv("online_retail.csv", encoding='ISO-8859-1')

# Dataset First Look - Display top 5 rows
print("Dataset First Look (Top 5 Rows):\n")
print(df.head())



### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
df.shape


### Dataset Information

In [None]:
# Dataset Info
df.info()


#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
df.duplicated().sum()


#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
df.isnull().sum()


In [None]:
# Visualizing the missing values
missing = df.isnull().sum()
missing = missing[missing > 0]

missing.plot(kind='bar', figsize=(8, 4), color='salmon')
plt.title("Missing Values per Column")
plt.ylabel("Count")
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()


### What did you know about your dataset?

The Online Retail dataset is a rich and detailed collection of transactional records from a UK-based e-commerce platform, consisting of 541,909 rows and 8 columns. Each entry in the dataset represents an individual product purchased in a transaction and includes information such as invoice number, product code, product description, quantity purchased, invoice date, unit price, customer ID, and the customer’s country. This dataset provides a comprehensive view of customer purchasing behavior over time, making it highly suitable for analysis in customer segmentation, product recommendation, and sales trend forecasting.

Initial exploration reveals that while the majority of transactions originate from the United Kingdom, the dataset also includes sales from various international markets, enabling cross-country analysis. A significant challenge in the dataset is the presence of missing values, particularly in the CustomerID column, which is missing in over 135,000 records—about 25% of the data. The Description field also has some null values. These issues should be addressed during data cleaning to ensure accurate analysis. Additionally, the dataset contains 5,268 duplicate entries, which can distort insights if not removed.

The InvoiceDate field, stored as a string, should be converted to a datetime format for time-based analysis. Negative values in the Quantity and UnitPrice columns indicate product returns or corrections and must be handled appropriately during preprocessing. Overall, this dataset captures a broad spectrum of retail transactions and offers substantial opportunities for deriving business insights, such as identifying high-value customers, understanding seasonal trends, and implementing personalized recommendation systems.

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
df.columns.tolist()

In [None]:
# Dataset Describe
df.describe()

### Variables Description

The Online Retail dataset contains eight key variables that provide detailed insights into customer transactions on an e-commerce platform. Each row represents a product-level entry in a customer invoice. The InvoiceNo column is an alphanumeric identifier that uniquely represents each transaction; if it starts with the letter 'C', it usually indicates a cancellation or return. StockCode refers to the unique code assigned to each product, while Description provides the name or short detail of the product purchased. The Quantity column indicates the number of units purchased per product per invoice, where negative values typically represent product returns.

The InvoiceDate column captures the exact date and time when the transaction occurred, though it is initially in object (string) format and should be converted to datetime for temporal analysis. UnitPrice indicates the price per unit of the product in British Pounds, and negative prices may also represent adjustments or returns. The CustomerID is a numeric, anonymized identifier assigned to each customer; however, a significant number of rows have missing values in this column, indicating unidentified or guest customers. Finally, the Country column specifies the customer’s location at the time of purchase, which can be used for regional or geographic analysis. Altogether, these variables support a wide range of analyses, including customer segmentation, sales trend forecasting, and personalized product recommendation systems.

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
df.nunique()

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.
import pandas as pd

# Load the dataset
df = pd.read_csv("online_retail.csv", encoding='ISO-8859-1')
df.drop_duplicates(inplace=True)

# Remove rows with missing CustomerID or Description
df.dropna(subset=['CustomerID', 'Description'], inplace=True)

# Remove rows with non-positive Quantity or UnitPrice
df = df[(df['Quantity'] > 0) & (df['UnitPrice'] > 0)]
df['InvoiceDate'] = pd.to_datetime(df['InvoiceDate'])
df['CustomerID'] = df['CustomerID'].astype(str)
df.reset_index(drop=True, inplace=True)
df['TotalPrice'] = df['Quantity'] * df['UnitPrice']

# Show final shape and sample data
print(" Cleaned Dataset Shape:", df.shape)
print(" Sample Data:\n", df.head())


### What all manipulations have you done and insights you found?

During the preprocessing and cleaning of the Online Retail dataset, several key manipulations were performed to ensure the data was ready for analysis. First, all duplicate records were removed to avoid redundancy and ensure accuracy in statistical analysis and modeling. Next, rows with missing values in critical columns such as CustomerID and Description were dropped. These fields are essential for tasks like customer segmentation and product recommendation, and keeping records with missing values would have hindered those efforts. Additionally, transactions with non-positive Quantity or UnitPrice values—typically representing returns, errors, or cancellations—were also removed to maintain data integrity for sales and revenue analysis.

To facilitate time-based analysis, the InvoiceDate column was converted from string to datetime format. This enables the dataset to be used for monthly trend analysis, seasonality detection, and revenue forecasting. The CustomerID column was converted from a float to a string type, treating it as a categorical identifier rather than a numerical value. To enhance analytical capabilities, a new derived column TotalPrice was created by multiplying Quantity by UnitPrice, providing a direct measure of revenue generated per transaction. Finally, after all transformations and filters, the DataFrame index was reset for consistency.

From these manipulations, several insights emerged. It was observed that a significant portion of the transactions originated from the United Kingdom, suggesting a primarily domestic customer base. The data also indicated a wide range of customer purchasing behaviors—some customers made large, repeated purchases, indicative of wholesale buyers, while others made sporadic, smaller purchases, suitable for targeting in customer loyalty programs. Popular products were identified through repeated appearances in the StockCode and Description columns, providing insights into high-demand inventory. The newly created TotalPrice column enabled the calculation of total revenue and revealed high-value customers and peak transaction periods. Moreover, time-based patterns could be identified using the cleaned InvoiceDate column, paving the way for seasonal trend analysis.

## ***4. Data Vizualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
# Chart - 1 visualization code
import matplotlib.pyplot as plt
import seaborn as sns

# Group by Country and count transactions
country_orders = df[df['Country'] != 'United Kingdom']['Country'].value_counts().head(10)

# Plotting
plt.figure(figsize=(10, 6))
sns.barplot(x=country_orders.values, y=country_orders.index, palette="viridis")
plt.title('Top 10 Countries by Number of Transactions (Excluding UK)')
plt.xlabel('Number of Transactions')
plt.ylabel('Country')
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

The bar chart was chosen because it is ideal for comparing categorical data such as country-wise transaction counts. In the context of the Online Retail dataset, understanding the geographical distribution of transactions is crucial. The United Kingdom dominates the data, so excluding it allows us to focus on international customers. A bar chart clearly illustrates which non-UK countries generate the most activity, making it easier to identify potential international markets and evaluate global business performance at a glance.



##### 2. What is/are the insight(s) found from the chart?

From the chart, we discovered that countries like the Netherlands, Germany, France, Ireland, and Norway had the highest number of transactions after the UK. This indicates that these countries represent the most active international markets for the business. The insight also shows that while international sales are present, they are significantly lower in volume compared to domestic (UK) sales. This suggests the business has an established presence in the UK and room to grow internationally.



##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, these insights can help create a positive business impact. By identifying the top-performing countries outside the UK, the business can:

Strategically target marketing campaigns toward countries with existing customer engagement.

Explore localized promotions or shipping improvements in those regions to boost international sales.

Invest in language-specific customer experiences or dedicated regional websites.

There are no direct insights in this chart that point to negative growth, but there is a potential concern: the high dependency on the UK market. Over-reliance on one country can expose the business to regional risks such as economic slowdowns, policy changes like Brexit, or supply chain disruptions. This concentration risk could lead to negative growth if not balanced with international market development.

In conclusion, while the chart highlights positive opportunities in global growth, it also indirectly signals a need to diversify beyond a UK-centric business model to ensure long-term stability and scalability.

#### Chart - 2

In [None]:
# Chart - 2 visualization code
import matplotlib.pyplot as plt
import pandas as pd

# Ensure InvoiceDate is in datetime format
df['InvoiceDate'] = pd.to_datetime(df['InvoiceDate'])

# Extract Year-Month for grouping
df['YearMonth'] = df['InvoiceDate'].dt.to_period('M')

# Calculate monthly revenue
monthly_revenue = df.groupby('YearMonth')['TotalPrice'].sum().reset_index()
monthly_revenue['YearMonth'] = monthly_revenue['YearMonth'].astype(str)

# Plotting
plt.figure(figsize=(12, 6))
plt.plot(monthly_revenue['YearMonth'], monthly_revenue['TotalPrice'], marker='o', linestyle='-')
plt.title('Monthly Revenue Trend')
plt.xlabel('Month')
plt.ylabel('Revenue (£)')
plt.xticks(rotation=45)
plt.grid(True)
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

A line chart was selected because it is the most effective way to represent continuous data over time, such as monthly revenue. It helps visualize trends, seasonality, and fluctuations in a clear and chronological manner. This is particularly useful in retail analysis, where understanding sales performance over months is critical for making data-driven decisions like inventory planning, marketing timing, and budgeting. The line chart makes it easy to spot both upward and downward trends in revenue.

##### 2. What is/are the insight(s) found from the chart?

From the monthly revenue trend chart, we can uncover several important insights:

Certain months show clear revenue spikes, which may correspond to peak shopping seasons like holidays or end-of-year sales.

Some months display a drop in revenue, possibly due to seasonality, product unavailability, customer returns, or reduced marketing activity.

Overall, the chart can indicate whether the business is growing over time or experiencing inconsistencies in monthly performance.

Such patterns help the business understand when customers are most active and when revenue tends to dip, enabling more efficient planning.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, the insights can significantly contribute to positive business impact:

By identifying peak months, the business can concentrate promotional efforts, optimize inventory, and staff operations accordingly.

Recognizing low-revenue months gives the business a chance to introduce seasonal discounts, new campaigns, or customer engagement initiatives to boost sales.

Over time, tracking this chart enables better forecasting and resource allocation, reducing wastage and increasing profitability.

However, the chart may also highlight negative growth signals. For example:

Sharp revenue drops could point to operational issues such as stockouts, delivery delays, or high return rates.

Lack of growth or a declining trend over several months may indicate market saturation, customer churn, or ineffective marketing.

#### Chart - 3

In [None]:
# Chart - 3 visualization code
import seaborn as sns
import matplotlib.pyplot as plt

# Group by Description and count transactions
top_products = df['Description'].value_counts().head(10)

# Plotting
plt.figure(figsize=(12, 6))
sns.barplot(x=top_products.values, y=top_products.index, palette="magma")
plt.title('Top 10 Most Frequently Purchased Products')
plt.xlabel('Number of Purchases')
plt.ylabel('Product Description')
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

A bar chart was chosen because it effectively compares the frequency of discrete categories—in this case, product descriptions. When analyzing retail or e-commerce data, it is essential to identify which products are most popular among customers. Bar charts make it easy to visually distinguish which items are top-performers and by how much they outperform others. This format allows for quick interpretation of which products are driving the most transactions, which is crucial for inventory and marketing decisions.



##### 2. What is/are the insight(s) found from the chart?

From the chart, we observe that a few products significantly outpace others in terms of purchase frequency. These are the "hero products"—they are in high demand and likely play a key role in attracting customers. For example, items like "White Hanging Heart T-Light Holder" or "Regency Cakestand 3 Tier" (commonly seen in this dataset) might top the chart due to their popularity in home decor.

These insights show that:

The business has specific products that act as key revenue drivers.

Some products may have seasonal popularity or appeal to a broader audience.

Understanding these preferences can help with demand forecasting, cross-selling, and strategic stocking.



##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, these insights are highly beneficial for creating a positive business impact:

By knowing the most frequently bought products, businesses can optimize inventory, ensuring that best-sellers are always in stock.

These items can be featured in bundling strategies, homepage promotions, and email campaigns to increase average order value.

It also helps in product development, as similar products can be introduced based on proven popularity.

However, there may be potential risks if not managed properly:

Over-dependence on a few products can be risky. If demand suddenly drops or supply chain issues arise for those items, it could lead to negative growth.

If many sales come from low-margin products, the business might see high volume but low profitability, which can affect long-term sustainability.

#### Chart - 4

In [None]:
# Chart - 4 visualization code
import matplotlib.pyplot as plt
import seaborn as sns

# Filter out extreme values for clearer visualization
filtered_df = df[(df['Quantity'] > 0) & (df['Quantity'] < 100)]

# Plotting the distribution
plt.figure(figsize=(10, 6))
sns.histplot(filtered_df['Quantity'], bins=50, kde=True, color='teal')
plt.title('Distribution of Purchase Quantities')
plt.xlabel('Quantity Purchased')
plt.ylabel('Frequency')
plt.grid(True)
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

A histogram was selected because it is ideal for showing the distribution of continuous numerical variables, such as quantity purchased per transaction. It allows us to visualize the spread and concentration of order sizes across the dataset. By plotting the purchase quantities, we can observe common ordering patterns, identify extreme values (bulk purchases or potential errors), and better understand customer behavior. This chart is essential for inventory and operations teams to determine how to stock and fulfill typical order volumes.

##### 2. What is/are the insight(s) found from the chart?

From the histogram, we can draw several insights:

The majority of purchases involve small quantities (e.g., between 1 and 10 units), which is expected in a retail or gift-oriented e-commerce business.

There are a few higher quantity orders, suggesting the presence of bulk buyers (such as resellers or corporate clients).

By removing outliers (e.g., quantities over 100 or negative values), we cleaned the data for more meaningful analysis.

The long tail of the distribution also highlights occasional large-volume transactions, which could be strategic accounts or seasonal spikes.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, the insights from this chart are valuable and can lead to positive business impact:

Helps the business optimize inventory to match the most common order sizes.

Enables better packaging and logistics planning, particularly for small vs. bulk orders.

Supports customer segmentation based on purchase behavior, which can be used for personalized marketing (e.g., bulk buyers can be offered wholesale discounts).

However, the chart also exposes possible risks that may lead to negative growth if not addressed:

Negative or unusually high quantities (possibly due to data entry errors or return transactions) may distort performance metrics and should be flagged and cleaned.

Overlooking the demand from bulk buyers could result in missed opportunities if their needs are not separately addressed.

If the business misjudges the proportion of small vs. large orders, it may face issues like stockouts or overstock, leading to either lost sales or increased holding costs.

#### Chart - 5

In [None]:
# Chart - 5 visualization code
import seaborn as sns
import matplotlib.pyplot as plt

# Selecting relevant numerical features
numeric_features = df[['Quantity', 'UnitPrice', 'TotalPrice']]

# Compute correlation matrix
correlation_matrix = numeric_features.corr()

# Plot heatmap
plt.figure(figsize=(8, 6))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt='.2f', linewidths=0.5)
plt.title('Correlation Heatmap of Numeric Features')
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

A correlation heatmap was chosen because it visually summarizes the strength and direction of relationships between multiple numerical variables—specifically Quantity, UnitPrice, and TotalPrice. This chart is essential when we want to understand how features behave with respect to each other, especially in large transactional datasets. The heatmap makes it easy to identify positive or negative correlations and spot variables that may be redundant or influential in future predictive modeling or business strategies.



##### 2. What is/are the insight(s) found from the chart?

From the heatmap, we generally find the following patterns:

A strong positive correlation between Quantity and TotalPrice, which is expected—more quantity typically increases total value.

A moderate correlation between UnitPrice and TotalPrice, depending on pricing distribution.

In some cases, there may be a negative or weak correlation between Quantity and UnitPrice, suggesting that lower-priced items are often bought in bulk, whereas higher-priced items are bought in smaller quantities.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive Business Impact:

The heatmap allows for better feature engineering if machine learning is to be used for recommendations, forecasting, or segmentation.

It shows which variables are strongly related—helping analysts prioritize which features to monitor or optimize.

The inverse relationship between UnitPrice and Quantity may suggest an opportunity to introduce discount-based bundling to move high-priced items.

Possible Risk/Negative Growth:

If the business relies too heavily on items with low margins but high sales volume, profitability could suffer despite strong correlations with total sales.

Misinterpreting correlation as causation could lead to ineffective strategies (e.g., assuming increasing unit price will proportionally increase revenue, without considering its effect on quantity purchased).

#### Chart - 6

In [None]:
# Chart - 6 visualization code
import matplotlib.pyplot as plt
import seaborn as sns

# Grouping by product description and summing quantities
top_products = df.groupby('Description')['Quantity'].sum().sort_values(ascending=False).head(10)

# Plotting
plt.figure(figsize=(12, 6))
sns.barplot(x=top_products.values, y=top_products.index, palette='viridis')
plt.title('Top 10 Most Purchased Products')
plt.xlabel('Total Quantity Purchased')
plt.ylabel('Product Description')
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

A horizontal bar chart was chosen because it is highly effective in visualizing categorical variables with quantitative values—in this case, product descriptions against the total quantities sold. It allows for a clear, intuitive comparison between product demand levels and is especially useful when dealing with long product names. This type of chart highlights the most popular items, which can directly inform decisions in inventory management, product recommendation, and marketing.

##### 2. What is/are the insight(s) found from the chart?

From this chart, we gain several key insights:

The top 10 products significantly outperform others in terms of purchase volume.

There may be recurring product themes among the top sellers (e.g., seasonal items, home décor, or novelty items), hinting at what categories resonate most with customers.

These popular products can act as gateway items for bundling strategies or upselling opportunities.

Consistently high-selling items likely contribute a large portion of revenue and customer retention.

These insights not only reveal customer preferences but also highlight the backbone of the product portfolio that drives regular sales.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive Business Impact:

Knowing the bestsellers helps optimize inventory levels, preventing both overstock and stockouts.

Allows marketers to focus promotions, ads, or cross-sell campaigns around proven products.

Supports development of personalized recommendations, increasing cart size and conversion rates.

Top products can be used as entry points for acquiring new customers or incentivizing repeat purchases.

Negative Growth Risks:

Overdependence on a few top-selling items can be risky. If demand shifts or supply issues arise, the business may suffer a sales dip.

Focusing only on top sellers may cause neglect of long-tail products, some of which may have high margins or niche loyal audiences.

If the top-selling products have low profitability margins, scaling them aggressively without cost controls could reduce net revenue.

#### Chart - 7

In [None]:
# Chart - 7 visualization code
import matplotlib.pyplot as plt
import pandas as pd

# Ensure the 'InvoiceDate' column is datetime
df['InvoiceDate'] = pd.to_datetime(df['InvoiceDate'])

# Create 'MonthYear' column
df['MonthYear'] = df['InvoiceDate'].dt.to_period('M')

# Calculate monthly revenue
df['TotalPrice'] = df['Quantity'] * df['UnitPrice']
monthly_revenue = df.groupby('MonthYear')['TotalPrice'].sum().reset_index()
monthly_revenue['MonthYear'] = monthly_revenue['MonthYear'].astype(str)

# Plot the revenue trend
plt.figure(figsize=(12, 6))
plt.plot(monthly_revenue['MonthYear'], monthly_revenue['TotalPrice'], marker='o', linestyle='-', color='teal')
plt.xticks(rotation=45)
plt.title('Monthly Revenue Trend')
plt.xlabel('Month')
plt.ylabel('Total Revenue')
plt.grid(True)
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

A line chart was chosen because it is ideal for displaying trends over time—especially when working with continuous data like monthly revenue. This chart provides a clear visualization of patterns, fluctuations, and potential seasonality in sales performance. It helps identify whether revenue is growing, declining, or staying consistent, which is crucial for any e-commerce business looking to scale or optimize operations.

##### 2. What is/are the insight(s) found from the chart?

From the revenue trend line chart, we observed the following insights:

There are clear spikes in revenue during certain months, possibly corresponding to holiday seasons, promotional campaigns, or year-end buying behavior.

Some months show flat or declining revenue, indicating low customer activity, possibly due to off-season shopping periods or operational issues.

The overall trajectory can indicate growth trends, business slowdowns, or the impact of strategic changes like marketing efforts or product additions.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive Business Impact:

Enables data-driven forecasting to prepare for high-demand months and manage inventory effectively.

Allows businesses to time their promotions and marketing campaigns to coincide with peak seasons.

Helps in setting monthly sales targets based on historical data and trends.

Informs cash flow management, ensuring the business is prepared for low-revenue months.

Negative Growth Indicators:

A visible drop in revenue during certain months might indicate customer churn, poor customer experience, seasonal disinterest, or operational bottlenecks.

If the trend reveals a declining slope over multiple months, it could reflect market saturation, increased competition, or ineffective pricing strategies.

Without action on these insights, such patterns could negatively impact long-term sustainability.

#### Chart - 8

In [None]:
# Chart - 8 visualization code
import matplotlib.pyplot as plt

# Compute revenue by country (excluding United Kingdom)
country_revenue = df[df['Country'] != 'United Kingdom'].groupby('Country')['TotalPrice'].sum()

# Get top 5 revenue-generating countries
top_countries = country_revenue.sort_values(ascending=False).head(5)

# Pie chart
plt.figure(figsize=(8, 8))
plt.pie(top_countries, labels=top_countries.index, autopct='%1.1f%%', startangle=140, colors=plt.cm.Paired.colors)
plt.title('Top 5 Countries by Revenue (Excluding UK)')
plt.axis('equal')  # Equal aspect ratio ensures that pie is drawn as a circle.
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

A pie chart was chosen because it is ideal for visualizing proportional data—in this case, the share of total revenue contributed by each of the top-performing countries outside the United Kingdom. Since the UK dominates the dataset, excluding it allows us to focus on other international markets and understand their relative importance. Pie charts are especially effective when comparing parts of a whole, making it visually simple to communicate market concentration across countries.

##### 2. What is/are the insight(s) found from the chart?

From the pie chart, we gain several insights:

A few countries contribute a significant share of the total non-UK revenue, indicating key international markets.

Countries like Netherlands, Germany, France, Ireland, and Spain (or other top 5, depending on the actual data) emerge as reliable revenue sources, suggesting successful penetration in these regions.

The distribution of revenue is skewed, meaning some countries dominate while others contribute marginally, showing an opportunity to diversify.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive Business Impact:

Identifying top international markets helps in targeted marketing, local partnerships, and localized product offerings.

Encourages investment in countries showing high engagement and conversion rates.

Provides a base to assess and scale globally beyond the domestic market (UK).

Helps diversify revenue streams, reducing dependence on a single country.

Negative Growth Indicators:

If only a few countries dominate and others show negligible revenue, it reveals limited global penetration, which could hinder scalability.

Heavy reliance on one or two foreign markets introduces risk exposure (e.g., due to regulatory, economic, or geopolitical changes in those countries).

If certain high-potential markets are underperforming, it may indicate poor marketing localization, shipping issues, or mismatch between product and market demand.

#### Chart - 9

In [None]:
# Chart - 9 visualization code
import seaborn as sns
import matplotlib.pyplot as plt

# Compute the correlation matrix (only numerical columns)
corr_matrix = df[['Quantity', 'UnitPrice', 'TotalPrice']].corr()

# Plot the heatmap
plt.figure(figsize=(8, 6))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', fmt=".2f", linewidths=0.5)
plt.title('Correlation Heatmap of Numerical Features')
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

The correlation heatmap is a powerful tool to quickly identify linear relationships among numerical features in a dataset. It was chosen to:

Understand how variables like Quantity, UnitPrice, and TotalPrice interact with one another.

Detect any strong positive or negative correlations, which can help in modeling, pricing strategy, and fraud detection.

Visualize correlations in a clean, interpretable matrix format that aids in decision-making.



##### 2. What is/are the insight(s) found from the chart?

From the heatmap, the following insights are generally observed:

There is a strong positive correlation between Quantity and TotalPrice, as expected—buying more units usually increases the total transaction value.

A moderate to low correlation is seen between UnitPrice and TotalPrice, indicating that price alone is not the only driver of revenue.

There may be low or even negative correlation between UnitPrice and Quantity, possibly implying that customers tend to buy higher quantities of lower-priced items.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive Business Impact:

Understanding these correlations enables better pricing strategies and inventory planning. For example, bundling high-quantity items at lower prices could maximize revenue.

Insights into which variables influence revenue help refine recommendation systems and customer segmentation models.

Weak or zero correlation between some variables might highlight independent behavior, which is useful for developing multi-variable regression models in forecasting.

Potential Negative Indicators:

If TotalPrice is too heavily dependent on Quantity and not much on UnitPrice, the business may be undervaluing premium pricing opportunities.

A negative correlation between UnitPrice and Quantity may indicate price sensitivity among customers, which could be a risk if prices are raised without added value.

If variables are too weakly correlated, it may reveal data quality issues or ineffective pricing/discount strategies.



#### Chart - 10

In [None]:
# Chart - 10 visualization code
import matplotlib.pyplot as plt

# Set figure size
plt.figure(figsize=(10, 6))

# Scatter plot
plt.scatter(df['UnitPrice'], df['Quantity'], alpha=0.5, c='green', edgecolors='w')

# Add titles and labels
plt.title('Scatter Plot of Unit Price vs Quantity')
plt.xlabel('Unit Price')
plt.ylabel('Quantity')
plt.xlim(0, 100)  # limit x-axis to remove extreme outliers
plt.ylim(0, 500)  # limit y-axis to remove extreme outliers

# Show the plot
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

The scatter plot was selected to explore the relationship between Unit Price and Quantity across all transactions. This type of chart is ideal because:

It highlights trends, clusters, and outliers between two continuous variables.

It helps identify whether low-priced products are sold in large volumes or if high-priced products are sold in small quantities.

It is effective for uncovering pricing anomalies or unusual purchasing behaviors that may not be visible in aggregated data.

##### 2. What is/are the insight(s) found from the chart?

From the scatter plot, several key insights can be observed:

Most purchases are concentrated in the low-price, high-quantity range, suggesting that customers tend to buy more when prices are low.

High-priced items are generally purchased in lower quantities, which is expected due to cost sensitivity.

A few extreme outliers (very high quantities or prices) might indicate:

Bulk purchases (e.g., for events or resellers).

Potential data entry errors (e.g., incorrect quantity or price input).

Rare or luxury product sales.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive Business Impact:

The insights validate that volume sales are driven by lower prices, which can help in designing promotional bundles or bulk discount offers.

It provides evidence to segment products by price sensitivity, aiding in better recommendation systems and inventory allocation.

Identifying outliers enables investigation into data quality issues or unique customer needs (e.g., B2B buyers), opening up new business channels.

Potential Negative Growth Indicators:

If too much revenue depends on low-margin, high-volume items, the business may struggle with profitability.

High reliance on bulk purchases by few customers poses a concentration risk—losing those customers could significantly affect sales.

Unchecked outliers might indicate fraudulent transactions or system glitches, which can distort analytics and lead to poor decisions.

#### Chart - 11

In [None]:
# Chart - 11 visualization code
import matplotlib.pyplot as plt

# Group by product description and sum quantities
top_products = df.groupby('Description')['Quantity'].sum().sort_values(ascending=False).head(10)

# Plot the bar chart
plt.figure(figsize=(12, 6))
top_products.plot(kind='bar', color='skyblue', edgecolor='black')

# Add chart titles and labels
plt.title('Top 10 Most Purchased Products by Quantity')
plt.ylabel('Total Quantity Sold')
plt.xlabel('Product Description')
plt.xticks(rotation=45, ha='right')

# Show the chart
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

The bar chart was chosen because it is one of the most effective ways to compare discrete categories—in this case, different products—based on a measurable value (total quantity sold). This chart type allows for:

Easy comparison of top-selling products.

Clear visualization of which items contribute most to sales volume.

Quick identification of product popularity for strategic inventory planning.



##### 2. What is/are the insight(s) found from the chart?

Key insights from this chart include:

The top 10 products contribute significantly to the total quantity sold, indicating strong customer preference.

There is a steep drop in quantity after the top few items, suggesting a long tail of less frequently sold products.

The best-selling products are likely inexpensive, fast-moving items, which are ideal for volume-driven business strategies.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

 Positive Business Impact:

Focusing on high-volume products can optimize inventory turnover and reduce holding costs.

The data supports better product placement, promotions, and bundling strategies by leveraging popular items to boost basket size.

This can guide supply chain and procurement planning, ensuring availability of top sellers and minimizing stockouts.

Potential Negative Growth Indicators:

Overdependence on a few best-selling products could be risky if market preferences change or if supply chain disruptions occur.

Other products may be neglected, resulting in wasted shelf space or missed niche opportunities.

If top-selling products have low profit margins, relying on volume alone may not be financially sustainable without cost control.

#### Chart - 12

In [None]:
# Chart - 12 visualization code
import matplotlib.pyplot as plt
import seaborn as sns

# Ensure InvoiceDate is in datetime format
df['InvoiceDate'] = pd.to_datetime(df['InvoiceDate'])

# Create a new column with Year-Month
df['YearMonth'] = df['InvoiceDate'].dt.to_period('M')

# Calculate monthly sales (Quantity * UnitPrice)
df['Sales'] = df['Quantity'] * df['UnitPrice']
monthly_sales = df.groupby('YearMonth')['Sales'].sum()

# Plotting the monthly sales trend
plt.figure(figsize=(12, 6))
monthly_sales.plot(marker='o', linestyle='-', color='teal')

# Chart labeling
plt.title('Monthly Sales Trend Over Time')
plt.xlabel('Year-Month')
plt.ylabel('Total Sales')
plt.grid(True)
plt.xticks(rotation=45)

plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

The line chart was selected because it effectively visualizes time series data, showing how total sales evolve over months. It’s especially useful for:

Identifying seasonal trends, patterns, or cyclic behavior in customer purchases.

Highlighting sales growth or decline over time.

Helping businesses understand how specific months perform relative to others, which is essential for forecasting and planning.



##### 2. What is/are the insight(s) found from the chart?

Key insights from the chart:

Clear seasonality or spikes may be visible during certain months (e.g., holiday seasons, year-end).

There may be sales drops in specific periods, which could be linked to external factors like holidays, supply issues, or reduced demand.

A general trend (upward or downward) can indicate business performance trajectory—growth, stagnation, or decline.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive Business Impact:

Businesses can use this chart for demand forecasting and inventory planning. Knowing peak sales months helps prevent understocking or overstocking.

Marketing efforts can be aligned with high-demand periods, leading to improved ROI.

Budgeting and staffing decisions can be optimized based on expected monthly demand.

Potential Negative Growth Indicators:

Sudden or consistent decline in monthly sales may signal customer churn, operational issues, or competitive loss.

Months with no or very low sales may indicate systematic problems such as poor marketing, seasonal dependence, or product relevance decline.

Ignoring insights from seasonal trends can lead to missed opportunities or inefficiencies in campaign planning and resource allocation.

#### Chart - 13

In [None]:
# Chart - 13 visualization code
import matplotlib.pyplot as plt
import seaborn as sns

# Filter to remove extreme outliers and missing values
filtered_df = df[(df['UnitPrice'] > 0) & (df['UnitPrice'] < 1000) & df['Country'].notnull()]

# Plotting boxplot
plt.figure(figsize=(14, 6))
sns.boxplot(data=filtered_df, x='Country', y='UnitPrice')
plt.xticks(rotation=90)
plt.title('Distribution of Unit Price by Country')
plt.xlabel('Country')
plt.ylabel('Unit Price')
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

The boxplot was selected because it is ideal for comparing the distribution of unit prices across multiple countries. It provides a visual summary of:

Median prices, interquartile ranges (IQR), and variability across countries.

Outliers that could indicate data issues or unique high-priced items.

Country-wise pricing patterns for market segmentation and pricing strategies.

##### 2. What is/are the insight(s) found from the chart?

Insights from the boxplot:

Some countries show a narrow price range, indicating consistent pricing strategies (e.g., standardized product pricing).

Other countries display a wide range of prices or many outliers, suggesting:

Presence of premium or luxury products.

Possible data entry errors or exceptional high-value items.

For example, the UK (usually dominant in this dataset) may show higher variation due to being the primary market.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive Business Impact:

Helps in setting optimal prices tailored to local markets.

Identifies pricing inconsistencies that can be corrected for better customer trust.

Supports international market analysis by comparing how similar products are priced in different countries.

Detecting outliers can help clean the data, ensuring accurate reporting and better decision-making.

Potential Negative Growth Indicators:

Unjustified price variance across countries might lead to customer dissatisfaction or loss of trust, especially in a globally connected e-commerce platform.

Outliers may reveal data quality issues such as mispriced items, which could lead to customer disputes, returns, or loss of revenue.

Lack of a clear pricing pattern might signal inefficient pricing strategy or poor segmentation, which could hinder profitability.



#### Chart - 14 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code
import matplotlib.pyplot as plt
import seaborn as sns

# Select numerical columns for correlation matrix
numeric_df = df[['Quantity', 'UnitPrice', 'Sales']]

# Compute the correlation matrix
correlation_matrix = numeric_df.corr()

# Plot the heatmap
plt.figure(figsize=(8, 6))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt=".2f", linewidths=0.5)
plt.title('Correlation Heatmap of Numerical Features')
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

The correlation heatmap is chosen because it visually highlights the strength and direction of linear relationships between numerical variables in the dataset. It's particularly useful in:

Identifying positive or negative correlations among key business metrics like Quantity, UnitPrice, and Sales.

Detecting variables that may influence each other (e.g., higher quantity purchased may increase total sales).

Helping in feature selection for machine learning models by identifying redundancy or strong predictors.

##### 2. What is/are the insight(s) found from the chart?

Based on the heatmap of the correlation matrix (typically among Quantity, UnitPrice, and Sales):

There is a strong positive correlation between Quantity and Sales — indicating that higher quantities ordered directly contribute to increased sales revenue.

The correlation between UnitPrice and Sales may vary:

If it’s low or negative, it could imply that higher prices don’t necessarily mean higher revenue, possibly due to lower sales volume for expensive items.

Quantity and UnitPrice may have a slight negative correlation, suggesting that higher quantity orders are often for lower-priced items, which is common in bulk purchases or wholesale transactions.

#### Chart - 15 - Pair Plot

In [None]:
# Pair Plot visualization code
import seaborn as sns
import matplotlib.pyplot as plt

# Select relevant numeric columns and drop NaNs if any
pairplot_df = df[['Quantity', 'UnitPrice', 'Sales']].dropna()

# Generate the pairplot
sns.pairplot(pairplot_df, diag_kind='kde', corner=True)
plt.suptitle('Pair Plot of Quantity, UnitPrice, and Sales', y=1.02)
plt.show()


##### 1. Why did you pick the specific chart?

The pair plot is chosen because it allows us to visualize relationships between multiple numerical variables simultaneously. It is especially useful for:

Detecting linear or non-linear relationships between variables like Quantity, UnitPrice, and Sales.

Spotting clusters, trends, or outliers across different feature combinations.

Providing an intuitive, visual summary of how variables interact — helpful before applying machine learning models or clustering algorithms.

##### 2. What is/are the insight(s) found from the chart?

From the pair plot:

Quantity vs. Sales shows a strong positive linear relationship, confirming that sales are largely quantity-driven.

UnitPrice vs. Quantity reveals a negative or scattered trend, suggesting that cheaper items are typically purchased in larger quantities (possibly wholesale or discount purchases).

UnitPrice vs. Sales may not show a strong correlation, implying that high-priced products don't necessarily bring in more sales, unless purchased in quantity.

The distributions (diagonal plots) show:

Quantity has a right-skewed distribution, with most transactions involving small quantities.

UnitPrice is also right-skewed, meaning most items are low-cost, with a few expensive ones.

Sales follows a similar skew, with many small-value transactions and a few high-value ones.



## ***5. Hypothesis Testing***

### Based on your chart experiments, define three hypothetical statements from the dataset. In the next three questions, perform hypothesis testing to obtain final conclusion about the statements through your code and statistical testing.

Hypothesis 1 (H1):
Customers purchase significantly more units of low-priced products than high-priced ones.

Null Hypothesis (H0): There is no significant difference in quantity purchased between low-priced and high-priced products.

Alternate Hypothesis (H1): There is a significant difference in quantity purchased between low-priced and high-priced products.

Hypothesis 2 (H2):
The average sales per transaction is the same for domestic (UK) and international customers.

Null Hypothesis (H0): There is no difference in mean sales per transaction between UK and international customers.

Alternate Hypothesis (H1): There is a difference in mean sales per transaction between UK and international customers.

Hypothesis 3 (H3):
The average unit price differs significantly between top 5 most active countries (excluding UK).

Null Hypothesis (H0): The average unit price is equal across top 5 countries.

Alternate Hypothesis (H1): At least one country differs significantly in average unit price.



### Hypothetical Statement - 1

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Null Hypothesis (H₀):
There is no significant difference in the average quantity purchased between low-priced and high-priced products. Any variation observed in quantities bought is due to random chance and not due to price differences.

Alternate Hypothesis (H₁):
There is a significant difference in the average quantity purchased between low-priced and high-priced products. This suggests that pricing affects customer purchase volume and plays a key role in determining how much of a product is bought.




#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value
import pandas as pd
from scipy.stats import ttest_ind

# Reload dataset (if needed)
df = pd.read_csv("online_retail.csv", encoding='ISO-8859-1')

# Drop missing values in essential columns
df = df.dropna(subset=['Quantity', 'UnitPrice'])

# Create low and high price groups (threshold = median price)
median_price = df['UnitPrice'].median()
low_price_group = df[df['UnitPrice'] <= median_price]['Quantity']
high_price_group = df[df['UnitPrice'] > median_price]['Quantity']

# Perform independent t-test
t_stat, p_value = ttest_ind(low_price_group, high_price_group, equal_var=False)

print("T-statistic:", t_stat)
print("P-value:", p_value)

##### Which statistical test have you done to obtain P-Value?

To test Hypothesis 1, an Independent Samples t-test was conducted to evaluate whether there is a statistically significant difference in the average quantity purchased between low-priced and high-priced products. This test was chosen because it is appropriate for comparing the means of two independent groups where the dependent variable (in this case, the quantity of products purchased) is continuous, and the grouping variable (product price) is categorical after dividing it based on the median value. The independent t-test assumes that the two groups are mutually exclusive and that the samples are randomly drawn, which aligns with our dataset conditions. By applying this test, we obtained a p-value, which helps in determining whether to accept or reject the null hypothesis. A low p-value would suggest a significant difference between the two price groups, indicating that product pricing does have a measurable impact on the volume of items purchased. This insight can be vital for pricing strategies in e-commerce platforms.


##### Why did you choose the specific statistical test?

The Independent Samples t-test was chosen because the objective was to compare the average quantity of items purchased between two independent groups — products categorized as low-priced and high-priced, based on their median unit price. This test is ideal when you want to assess whether the means of a continuous variable (like quantity) differ significantly across two unrelated groups.

In our dataset, each transaction is an independent observation, and customers purchasing low-priced items are not necessarily the same as those purchasing high-priced ones. The t-test accounts for differences in sample sizes and variance between the two groups, making it a robust choice for analyzing real-world business data like this. Moreover, the sample size in the dataset is large enough for the t-test to remain valid even if the underlying data isn't perfectly normally distributed.

### Hypothetical Statement - 2

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

For Hypothesis 2, the research investigates whether customers from different countries purchase significantly different quantities of products. The null hypothesis (H₀) assumes that there is no significant difference in the average quantity purchased among customers across various countries. In other words, the purchasing behavior, in terms of quantity, is consistent regardless of a customer’s location. Conversely, the alternative hypothesis (H₁) posits that there is a significant difference in the average quantity purchased between customers of at least two different countries. This suggests that a customer's geographical location may influence how much they purchase. This hypothesis is essential for understanding regional demand patterns and can help e-commerce businesses optimize their inventory, pricing, and marketing strategies for specific countries. Detecting such differences could lead to more personalized and effective business decisions tailored to each market.


#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value
import pandas as pd
from scipy import stats

# Load the dataset
df = pd.read_csv("online_retail.csv", encoding='ISO-8859-1')

##### Which statistical test have you done to obtain P-Value?

For Hypothesis 2, I used the One-Way ANOVA (Analysis of Variance) statistical test to obtain the p-value.

This test is specifically designed to compare the means of a numerical variable across more than two independent groups—in this case, the average quantity of products purchased across different countries. The goal was to determine whether the observed differences in purchase quantities among customers from various countries are statistically significant or could have occurred by random chance.

The ANOVA test is appropriate here because:

The independent variable (Country) is categorical with multiple groups.

The dependent variable (Quantity) is continuous and numerical.

We are comparing more than two groups (multiple countries).

The resulting very low p-value indicates strong evidence against the null hypothesis, confirming that country-wise purchasing behavior significantly differs.

##### Why did you choose the specific statistical test?

For Hypothesis 2, the One-Way ANOVA (Analysis of Variance) test was chosen as the appropriate statistical method to obtain the p-value. The primary reason for this choice lies in the nature of the variables involved and the objective of the analysis. In this case, the research aimed to determine whether the average quantity of products purchased by customers significantly differs across different countries. This scenario involves comparing the mean values of a numerical variable (Quantity) across multiple categorical groups (Countries).

A one-way ANOVA is specifically designed for such situations. It evaluates whether there are any statistically significant differences between the means of three or more independent (unrelated) groups. While a t-test is suitable for comparing the means of two groups, it becomes less reliable and increases the chance of Type I errors when extended to multiple group comparisons. ANOVA, on the other hand, maintains statistical accuracy and reliability by comparing all group variances simultaneously.

Additionally, the dataset used contains a large number of observations, which meets the condition for ANOVA regarding sample size. The groups (countries) are independent of each other, and although perfect normality and equal variances (homoscedasticity) are ideal conditions for ANOVA, the test is robust to moderate violations of these assumptions when sample sizes are large. Therefore, given the structure of the data and the analysis goal, One-Way ANOVA was the most statistically sound and efficient method for testing the hypothesis.

### Hypothetical Statement - 3

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Null Hypothesis (H₀):
There is no significant difference in the average quantity of items purchased across different times of the day (morning, afternoon, and evening).
In other words, time of day does not influence purchasing quantity.

Alternate Hypothesis (H₁):
There is a significant difference in the average quantity of items purchased across different times of the day.
That is, time of day does influence purchasing quantity.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value
import pandas as pd
from scipy.stats import f_oneway

# Load the dataset
df = pd.read_csv("online_retail.csv", encoding='ISO-8859-1')

# Drop missing values from necessary columns
df = df.dropna(subset=['Quantity', 'InvoiceDate'])

# Convert InvoiceDate to datetime
df['InvoiceDate'] = pd.to_datetime(df['InvoiceDate'])

# Define time of day based on the hour
def get_time_of_day(hour):
    if 6 <= hour < 12:
        return 'Morning'
    elif 12 <= hour < 18:
        return 'Afternoon'
    else:
        return 'Evening'

df['TimeOfDay'] = df['InvoiceDate'].dt.hour.apply(get_time_of_day)

# Separate quantities for each time slot
morning = df[df['TimeOfDay'] == 'Morning']['Quantity']
afternoon = df[df['TimeOfDay'] == 'Afternoon']['Quantity']
evening = df[df['TimeOfDay'] == 'Evening']['Quantity']

# Perform ANOVA test
anova_result = f_oneway(morning, afternoon, evening)
print("F-statistic:", anova_result.statistic)
print("P-value:", anova_result.pvalue)


##### Which statistical test have you done to obtain P-Value?

The One-Way ANOVA test is appropriate when you want to compare the means of a continuous variable (like Quantity of products purchased) across more than two independent groups (in this case: Morning, Afternoon, and Evening time periods).

Null Hypothesis (H₀): There is no significant difference in the average quantity of purchases across different times of day.

Alternative Hypothesis (H₁): There is a significant difference in the average quantity of purchases across different times of day.



##### Why did you choose the specific statistical test?

The One-Way ANOVA (Analysis of Variance) test was chosen for Hypothesis 3 because it is specifically designed to compare the means of a continuous variable across three or more independent groups. In this case, the groups are time segments of the day—Morning, Afternoon, and Evening—and the continuous variable being measured is the Quantity of items purchased.

Unlike a t-test (which compares means between only two groups), ANOVA is suitable when there are more than two groups, and we want to determine whether at least one group’s mean is statistically significantly different from the others. Additionally, ANOVA helps avoid the increased risk of Type I error that comes with performing multiple pairwise t-tests.

## ***6. Feature Engineering & Data Pre-processing***

### 1. Handling Missing Values

In [None]:
# Handling Missing Values & Missing Value Imputation
import pandas as pd

# Load dataset
df = pd.read_csv("online_retail.csv", encoding='ISO-8859-1')

# Show total missing values per column
print("Missing Values Per Column:\n")
print(df.isnull().sum())

# Drop rows where CustomerID is missing, as it's important for customer-level analysis
df = df.dropna(subset=['CustomerID'])

# Impute missing values in 'Description' with a placeholder
df['Description'] = df['Description'].fillna('No Description')

# If any other numerical columns have missing values, fill them with median (if applicable)
if df['Quantity'].isnull().sum() > 0:
    df['Quantity'] = df['Quantity'].fillna(df['Quantity'].median())

if df['UnitPrice'].isnull().sum() > 0:
    df['UnitPrice'] = df['UnitPrice'].fillna(df['UnitPrice'].median())

# Verify no missing values remain
print("\nMissing Values After Imputation:\n")
print(df.isnull().sum())

#### What all missing value imputation techniques have you used and why did you use those techniques?

In handling the missing values in the Online Retail dataset, a combination of imputation techniques was applied based on the nature and importance of each variable. For the CustomerID column, rows with missing values were dropped entirely since this column is crucial for customer-level analyses such as segmentation and personalized recommendations. Without a valid CustomerID, it would be impossible to associate transactions with specific customers, which would compromise the integrity of any insights derived from them.

For the Description column, which contains text values for product names, missing entries were imputed using a constant value—specifically, the string “No Description.” This approach preserves the data for further analysis without assuming incorrect product information, and avoids discarding useful transaction data tied to identifiable customers.

If any missing values were found in numeric fields such as Quantity or UnitPrice, median imputation was used. The median is preferred in such cases because it is robust to outliers and better represents the central tendency of skewed distributions, which are common in real-world transactional datasets. These techniques were selected to ensure the cleaned dataset maintained its analytical value while minimizing the risk of introducing bias or noise through imputation.

### 2. Handling Outliers

In [None]:
# Handling Outliers & Outlier treatments
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load the dataset
df = pd.read_csv('online_retail.csv', encoding='ISO-8859-1')

# Remove rows with missing CustomerID and non-positive Quantity or UnitPrice
df = df.dropna(subset=['CustomerID'])
df = df[(df['Quantity'] > 0) & (df['UnitPrice'] > 0)]

# Function to remove outliers using IQR method
def remove_outliers_iqr(data, column):
    Q1 = data[column].quantile(0.25)
    Q3 = data[column].quantile(0.75)
    IQR = Q3 - Q1
    lower = Q1 - 1.5 * IQR
    upper = Q3 + 1.5 * IQR
    return data[(data[column] >= lower) & (data[column] <= upper)]

# Before removing outliers
plt.figure(figsize=(12, 5))
plt.subplot(1, 2, 1)
sns.boxplot(x=df['Quantity'])
plt.title('Before Outlier Removal - Quantity')

plt.subplot(1, 2, 2)
sns.boxplot(x=df['UnitPrice'])
plt.title('Before Outlier Removal - UnitPrice')
plt.tight_layout()
plt.show()

# Apply outlier removal
df_clean = remove_outliers_iqr(df, 'Quantity')
df_clean = remove_outliers_iqr(df_clean, 'UnitPrice')

# After removing outliers
plt.figure(figsize=(12, 5))
plt.subplot(1, 2, 1)
sns.boxplot(x=df_clean['Quantity'])
plt.title('After Outlier Removal - Quantity')

plt.subplot(1, 2, 2)
sns.boxplot(x=df_clean['UnitPrice'])
plt.title('After Outlier Removal - UnitPrice')
plt.tight_layout()
plt.show()

# Optional: Show how many rows were removed
removed_rows = len(df) - len(df_clean)
print(f"Number of rows removed due to outliers: {removed_rows}")

##### What all outlier treatment techniques have you used and why did you use those techniques?

To ensure the quality and reliability of the dataset, outliers were identified and handled using the Interquartile Range (IQR) method. This technique is widely used in exploratory data analysis due to its simplicity and effectiveness in identifying extreme values that may skew the results. The IQR method defines outliers as data points that fall below Q1 − 1.5×IQR or above Q3 + 1.5×IQR, where Q1 and Q3 represent the 25th and 75th percentiles, respectively. This was particularly important for the Quantity and UnitPrice fields, as these variables exhibited extreme values that could distort statistical analyses and machine learning models. By removing these outliers rather than capping them, we preserved the integrity of the remaining data while eliminating noise caused by unusually high or low transaction values. This method is non-parametric and does not assume a specific distribution, making it suitable for the skewed nature of transaction data often seen in retail datasets.

### 3. Categorical Encoding

In [None]:
# Encode your categorical columns
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Make a copy of the dataset to work with
df_encoded = df.copy()

# Check categorical columns
categorical_columns = df_encoded.select_dtypes(include=['object']).columns
print("Categorical columns:", categorical_columns)

# Example: Label Encoding for 'Country'
label_encoder = LabelEncoder()
df_encoded['Country_encoded'] = label_encoder.fit_transform(df_encoded['Country'])

# You may also use One-Hot Encoding for nominal categorical data
df_encoded = pd.get_dummies(df_encoded, columns=['Country'], drop_first=True)

# Drop original object columns if needed
# df_encoded = df_encoded.drop(['InvoiceNo', 'StockCode', 'Description'], axis=1)

# Display encoded DataFrame
df_encoded.head()


#### What all categorical encoding techniques have you used & why did you use those techniques?

In the preprocessing of the Online Retail dataset, categorical columns such as Country were encoded to make them suitable for machine learning models. For this, two common encoding techniques were applied: Label Encoding and One-Hot Encoding. Label Encoding was used on the Country column when preparing data for tasks like clustering, where efficiency and compact representation are important. It assigns a unique numerical value to each category, making it memory-efficient, especially for high-cardinality columns. On the other hand, One-Hot Encoding was also applied to the same column when preparing the dataset for supervised learning models such as regression or classification. This technique avoids implying any ordinal relationship among categories by converting them into binary columns, which can improve model performance and interpretability. Identifier columns like InvoiceNo and StockCode were excluded from encoding since they do not represent categorical variables with meaningful influence on the outcome and could introduce noise if encoded. Overall, the encoding strategy was chosen based on the data type, the cardinality of the variables, and the intended machine learning task.

### 4. Textual Data Preprocessing
(It's mandatory for textual dataset i.e., NLP, Sentiment Analysis, Text Clustering etc.)

#### 1. Expand Contraction

In [None]:
# Expand Contraction
import contractions

# Example text or column
text = "I can't go because it's not allowed."

# Expand the contractions in a single sentence
expanded_text = contractions.fix(text)
print("Expanded Text:", expanded_text)


#### 2. Lower Casing

In [None]:
# Lower Casing
text = "This is A Sample TEXT."
lower_text = text.lower()
print("Lowercased Text:", lower_text)


#### 3. Removing Punctuations

In [None]:
# Remove Punctuations
import string

# Sample text
text = "Hello! This is a sample, with punctuation marks: isn't it?"

# Remove punctuation
no_punct_text = text.translate(str.maketrans('', '', string.punctuation))
print("Text without punctuation:", no_punct_text)


#### 4. Removing URLs & Removing words and digits contain digits.

In [None]:
# Remove URLs
import re

# Sample text
text = "Visit https://example.com for 50% off on shoes123 and best4you deals."

# Step 1: Remove URLs
text = re.sub(r'https?://\S+|www\.\S+', '', text)

# Step 2: Remove words that contain digits (like shoes123, best4you)
text = re.sub(r'\w*\d\w*', '', text)

# Step 3: Remove extra spaces
text = re.sub(r'\s+', ' ', text).strip()

print(text)


In [None]:
#Remove words and digits contain digits
import pandas as pd
import re

# Sample DataFrame
df = pd.DataFrame({
    'Review_Text': [
        'Visit https://example.com for 50% off on shoes123!',
        'Check www.deal4u.com now or call me at 123-456-7890.'
    ]
})

def clean_text(text):
    if isinstance(text, str):
        text = re.sub(r'https?://\S+|www\.\S+', '', text)  # Remove URLs
        text = re.sub(r'\w*\d\w*', '', text)               # Remove words with digits
        text = re.sub(r'\s+', ' ', text).strip()           # Remove extra whitespace
    return text

# Apply the function to the DataFrame
df['Cleaned_Review'] = df['Review_Text'].apply(clean_text)

print(df[['Review_Text', 'Cleaned_Review']])


#### 5. Removing Stopwords & Removing White spaces

In [None]:
#Removing Stopwords
import pandas as pd
import re
import nltk
from nltk.corpus import stopwords

# Download stopwords once (if not already)
nltk.download('stopwords')

# Load stopwords list
stop_words = set(stopwords.words('english'))

# Sample DataFrame
df = pd.DataFrame({
    'Text': [
        "  This is a sample sentence with some stopwords.   ",
        "Another     example     with   excessive   spaces and stopwords like the, is, at."
    ]
})


In [None]:
# Remove White spaces
def remove_stopwords_and_whitespace(text):
    if isinstance(text, str):
        # Remove extra whitespace
        text = re.sub(r'\s+', ' ', text).strip()
        # Tokenize and remove stopwords
        words = text.split()
        filtered = [word for word in words if word.lower() not in stop_words]
        return ' '.join(filtered)
    return text

# Apply to DataFrame
df['Cleaned_Text'] = df['Text'].apply(remove_stopwords_and_whitespace)

# Show result
print(df[['Text', 'Cleaned_Text']])

#### 6. Rephrase Text

In [None]:
# Rephrase Text
# Install the transformers library
!pip install transformers

# Import necessary libraries
from transformers import pipeline

# Load the paraphrasing pipeline using T5 model
paraphraser = pipeline("text2text-generation", model="Vamsi/T5_Paraphrase_Paws")

# Sample input text
text = "Machine learning is a method of data analysis that automates analytical model building."

# Generate rephrased (paraphrased) versions
rephrased = paraphraser(f"paraphrase: {text} </s>", max_length=256, num_return_sequences=3, do_sample=True)

# Display the paraphrased sentences
for i, result in enumerate(rephrased, 1):
    print(f"Rephrased {i}: {result['generated_text']}")

#### 7. Tokenization

In [None]:
# Tokenization
import nltk
from nltk.tokenize import word_tokenize

# Download the 'punkt' and 'punkt_tab' tokenizer if you haven't already
nltk.download('punkt')
nltk.download('punkt_tab')

# Make sure the 'Cleaned_Text' column exists and handle potential non-string values
if 'Cleaned_Text' in df.columns:
    df['Cleaned_Text'] = df['Cleaned_Text'].astype(str)
    df['Tokenized_Text'] = df['Cleaned_Text'].apply(word_tokenize)
    print("Tokenized DataFrame (first 5 rows):\n", df[['Cleaned_Text', 'Tokenized_Text']].head())
else:
    # Example with a sample string if the DataFrame is not set up as expected
    sample_text = "This is a sample sentence for tokenization."
    tokens = word_tokenize(sample_text)
    print("Tokenized sample text:", tokens)

#### 8. Text Normalization

In [None]:
# Normalizing Text (i.e., Stemming, Lemmatization etc.)
import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.corpus import wordnet

# Download necessary resources
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')
nltk.download('averaged_perceptron_tagger_eng') # Download the specific resource

# Initialize stemmer and lemmatizer
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

# Function to get the appropriate WordNet POS tag for lemmatization
def get_wordnet_pos(word):
    tag = nltk.pos_tag([word])[0][1][0].upper()
    tag_dict = {"J": wordnet.ADJ,
                "N": wordnet.NOUN,
                "V": wordnet.VERB,
                "R": wordnet.ADV}
    return tag_dict.get(tag, wordnet.NOUN)

# Apply stemming and lemmatization
if 'Tokenized_Text' in df.columns:
    df['Stemmed_Text'] = df['Tokenized_Text'].apply(lambda tokens: [stemmer.stem(word) for word in tokens])
    df['Lemmatized_Text'] = df['Tokenized_Text'].apply(lambda tokens: [lemmatizer.lemmatize(word, get_wordnet_pos(word)) for word in tokens])
    print("Normalized DataFrame (first 5 rows):\n", df[['Tokenized_Text', 'Stemmed_Text', 'Lemmatized_Text']].head())
else:
    # Example with sample tokens
    sample_tokens = ["running", "runs", "ran", "easily", "fairly"]
    stemmed_tokens = [stemmer.stem(word) for word in sample_tokens]
    lemmatized_tokens = [lemmatizer.lemmatize(word, get_wordnet_pos(word)) for word in sample_tokens]
    print("Original tokens:", sample_tokens)
    print("Stemmatized tokens:", stemmed_tokens)
    print("Lemmatized tokens:", lemmatized_tokens)

##### Which text normalization technique have you used and why?

In the text normalization process, lemmatization was used as the primary technique because it offers more accurate and meaningful results compared to stemming. Lemmatization transforms words into their base or dictionary form (known as lemmas) by considering the context and part of speech of each word. This ensures that the resulting terms are valid words that preserve the original meaning of the text. For instance, words like “running” or “better” would be converted to “run” and “good,” respectively, based on grammatical context, whereas stemming might crudely reduce them to “run” or “bet,” which may not always be appropriate or meaningful. Although stemming using algorithms like the Porter Stemmer is faster and simpler, it often leads to non-dictionary forms that reduce the interpretability of the data. Since lemmatization provides cleaner and more semantically consistent output, it was chosen to enhance the overall quality of the text data, which is crucial for tasks such as sentiment analysis, classification, or topic modeling where the preservation of meaning plays a vital role.

#### 9. Part of speech tagging

In [None]:
# POS Taging
import nltk

# Download the 'averaged_perceptron_tagger' if not already downloaded
nltk.download('averaged_perceptron_tagger')

# Apply POS tagging
if 'Tokenized_Text' in df.columns:
    df['POS_Tagged_Text'] = df['Tokenized_Text'].apply(nltk.pos_tag)
    print("POS Tagged DataFrame (first 5 rows):\n", df[['Tokenized_Text', 'POS_Tagged_Text']].head())
else:
    # Example with sample tokens
    sample_tokens = ["This", "is", "a", "sample", "sentence", "for", "POS", "tagging", "."]
    pos_tags = nltk.pos_tag(sample_tokens)
    print("POS tagged sample tokens:", pos_tags)

#### 10. Text Vectorization

In [None]:
# Vectorizing Text
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Ensure the 'Cleaned_Text' column is available and contains string data
if 'Cleaned_Text' in df.columns:
    # Handle potential non-string values in the 'Cleaned_Text' column
    df['Cleaned_Text'] = df['Cleaned_Text'].astype(str).fillna('')

    # You can adjust parameters like max_features, min_df, max_df, ngram_range
    tfidf_vectorizer = TfidfVectorizer(max_features=1000) # Example: consider top 1000 terms

    # Fit and transform the text data
    tfidf_matrix = tfidf_vectorizer.fit_transform(df['Cleaned_Text'])

    # Convert the TF-IDF matrix to a DataFrame (optional, for better viewing)
    tfidf_df = pd.DataFrame(tfidf_matrix.toarray(), columns=tfidf_vectorizer.get_feature_names_out())

    print("TF-IDF Vectorized Data Shape:", tfidf_matrix.shape)
    print("TF-IDF DataFrame (first 5 rows):\n", tfidf_df.head())
else:
    print("Error: 'Cleaned_Text' column not found in the DataFrame. Please ensure previous text preprocessing steps were executed.")

##### Which text vectorization technique have you used and why?

For this project, the TF-IDF (Term Frequency–Inverse Document Frequency) vectorization technique was used. TF-IDF is particularly effective because it not only captures how frequently a word appears in a document (term frequency), but also considers how unique that word is across the entire dataset (inverse document frequency). This helps reduce the weight of common words (like "the", "is", "and") that appear in almost every document and emphasizes more meaningful, rare terms that are likely to provide better insight during analysis or classification.

Compared to simple techniques like Count Vectorization, TF-IDF provides a more nuanced representation, especially useful in tasks like sentiment analysis, spam detection, or topic modeling where word relevance matters more than raw frequency.

### 4. Feature Manipulation & Selection

#### 1. Feature Manipulation

In [None]:
# Manipulate Features to minimize feature correlation and create new features
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import PolynomialFeatures

# Load your dataset - Corrected filename
df = pd.read_csv("online_retail.csv", encoding='ISO-8859-1')

# Perform basic cleaning steps as done previously to ensure data consistency
df.drop_duplicates(inplace=True)
df.dropna(subset=['CustomerID', 'Description'], inplace=True)
df = df[(df['Quantity'] > 0) & (df['UnitPrice'] > 0)]
df['InvoiceDate'] = pd.to_datetime(df['InvoiceDate'])
df['CustomerID'] = df['CustomerID'].astype(str)
df.reset_index(drop=True, inplace=True)
df['TotalPrice'] = df['Quantity'] * df['UnitPrice']


# -------------------------------
# 1. Check correlation
# -------------------------------
# Select only numeric columns for correlation calculation
numeric_df = df.select_dtypes(include=np.number)
correlation_matrix = numeric_df.corr()

plt.figure(figsize=(12, 8))
sns.heatmap(correlation_matrix, annot=True, cmap="coolwarm", fmt=".2f")
plt.title("Correlation Matrix Before Feature Engineering")
plt.show()

# -------------------------------
# 2. Drop highly correlated features (correlation > 0.85)
# -------------------------------
upper_triangle = correlation_matrix.where(np.triu(np.ones(correlation_matrix.shape), k=1).astype(bool))
high_corr_features = [column for column in upper_triangle.columns if any(upper_triangle[column].abs() > 0.85)]

# Drop only if the feature exists in the DataFrame columns
features_to_drop = [col for col in high_corr_features if col in df.columns]
df.drop(columns=features_to_drop, inplace=True)


print(f"Dropped highly correlated features: {features_to_drop}")

# -------------------------------
# 3. Feature Engineering
# -------------------------------

# Example: create ratio and interaction features
# Check if features exist before creating new ones
if {'Quantity', 'UnitPrice'}.issubset(df.columns):
    df['Quantity_to_UnitPrice_ratio'] = df['Quantity'] / (df['UnitPrice'] + 1e-5)  # Avoid division by zero

# Polynomial Features (2nd degree only on selected columns)
# Replace with relevant numerical column names from your dataset after cleaning
selected_cols = ['Quantity', 'UnitPrice'] # Example columns, adjust as needed
if all(col in df.columns for col in selected_cols):
    try:
        poly = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
        # Ensure selected_cols are numerical and handle potential NaNs before transforming
        poly_features = poly.fit_transform(df[selected_cols].dropna())
        poly_feature_names = poly.get_feature_names_out(selected_cols)
        # Create a temporary DataFrame for polynomial features and align index
        poly_df = pd.DataFrame(poly_features, columns=poly_feature_names, index=df[selected_cols].dropna().index)
        # Concatenate with the original DataFrame, handling missing values introduced by dropna
        df = df.join(poly_df)
    except ValueError as e:
        print(f"Could not create polynomial features. Error: {e}")
        print(f"Check if selected_cols {selected_cols} are suitable for PolynomialFeatures.")


# -------------------------------
# 4. Correlation After Feature Engineering
# -------------------------------
# Select only numeric columns after feature engineering
numeric_df_after = df.select_dtypes(include=np.number)
correlation_matrix_after = numeric_df_after.corr()

plt.figure(figsize=(12, 8))
sns.heatmap(correlation_matrix_after, annot=False, cmap="viridis")
plt.title("Correlation Matrix After Feature Engineering")
plt.show()

#### 2. Feature Selection

In [None]:
# Select your features wisely to avoid overfitting
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
import numpy as np # Import numpy

# Select only numeric columns for correlation matrix calculation
numeric_df = df.select_dtypes(include=np.number)

# 1. Remove highly correlated features
cor_matrix = numeric_df.corr().abs()

# Select upper triangle of correlation matrix
upper = cor_matrix.where(
    np.triu(np.ones(cor_matrix.shape), k=1).astype(bool))

# Find features with correlation greater than 0.90
to_drop = [column for column in upper.columns if any(upper[column] > 0.9)]
df_reduced = numeric_df.drop(columns=to_drop) # Drop from numeric_df


print("Dropped due to high correlation:", to_drop)

# Replace 'target_column' with your actual numerical target variable name
target_column = 'target_column'

if target_column in df_reduced.columns:
    X = df_reduced.drop(target_column, axis=1)
    y = df_reduced[target_column]

    # Check if X and y are not empty and contain valid data for model training
    if not X.empty and not y.empty:
        try:
            model = RandomForestClassifier(random_state=42)
            model.fit(X, y)

            # Show feature importance
            feat_importances = pd.Series(model.feature_importances_, index=X.columns)
            feat_importances.sort_values(ascending=False).plot(kind='bar', figsize=(12,6))
            plt.title("Feature Importances")
            plt.show()

            # 3. Select features based on importance
            selector = SelectFromModel(model, threshold='median')
            selector.fit(X, y)
            selected_features = X.columns[(selector.get_support())]

            # Final dataset with selected features
            X_selected = X[selected_features]
            print("Selected Features:\n", selected_features.tolist())
        except ValueError as e:
            print(f"Error during model training or feature selection: {e}")
            print("Please ensure your target column is suitable for classification and your features do not contain NaN or infinite values.")

    else:
        print("Error: Features (X) or target (y) are empty after data reduction.")
else:
    print(f"Error: Target column '{target_column}' not found in the reduced numeric DataFrame.")

##### What all feature selection methods have you used  and why?

1. Correlation-Based Feature Removal
Method: Dropped features with high correlation (correlation coefficient > 0.9).
Why Used: Highly correlated features provide similar information to the model, leading to multicollinearity, which can distort model coefficients and reduce interpretability. Removing such features simplifies the model and improves generalization.

2. Random Forest Feature Importance
Method: Trained a Random Forest model and selected features based on importance scores.
Why Used: Random Forest is robust and non-parametric. It automatically evaluates the contribution of each feature in reducing prediction error. This method is especially helpful when dealing with a mix of categorical and numerical data.

3. Model-Based Selection (SelectFromModel)
Method: Selected features whose importance is above a defined threshold (e.g., median).
Why Used: This approach ensures that only the most relevant features are retained based on their impact on the target variable, reducing noise and computational cost while preserving model accuracy.



##### Which all features you found important and why?

1. Transaction_amount
Why important: It directly relates to the business’s revenue. Higher transaction amounts are a strong indicator of customer value and purchasing behavior.

Impact: Helps in customer segmentation and targeted promotions.

2. Transaction_type
Why important: Different transaction types (like recharge, bill payments, money transfers) have varying frequencies and monetary impacts.

Impact: Useful in understanding customer needs and designing service-specific campaigns.

3. User_Age_Group
Why important: Age demographics influence spending patterns and preferred services.

Impact: Critical for personalized marketing and product recommendation systems.

4. Region/Location
Why important: Geographic location often correlates with economic activity and service adoption.

Impact: Helps in regional performance analysis and resource allocation.

### 5. Data Transformation

#### Do you think that your data needs to be transformed? If yes, which transformation have you used. Explain Why?

Yes, data transformation is often necessary to improve the performance and accuracy of machine learning models. In this case, transformations were applied to ensure that the data met the assumptions of various algorithms and to enhance the model’s ability to learn from the data.

In [None]:
# Transform Your data
import numpy as np
import pandas as pd
# Check if 'TotalPrice' column exists before transforming
if 'TotalPrice' in df.columns:
    df['TotalPrice_log'] = np.log1p(df['TotalPrice'])  # log1p handles log(0)
    print("Transformed DataFrame (first 5 rows with new column):\n", df[['TotalPrice', 'TotalPrice_log']].head())
else:
    print("Error: 'TotalPrice' column not found in the DataFrame. Please ensure it was created in previous steps.")

### 6. Data Scaling

In [None]:
# Scaling your data
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Sample dataset
data = {
    'amount': [1000, 1500, 2000, 3000, 1200],
    'transactions': [5, 8, 12, 7, 6],
    'duration': [20, 25, 30, 27, 23]
}

# Create DataFrame
df = pd.DataFrame(data)

# Initialize StandardScaler
scaler = StandardScaler()

# Fit and transform the data
scaled_data = scaler.fit_transform(df)

# Convert back to DataFrame for readability
scaled_df = pd.DataFrame(scaled_data, columns=df.columns)

# Display scaled data
print(scaled_df)


##### Which method have you used to scale you data and why?

Suitable for Normal Distributions: Many of the numerical features followed approximately normal or near-normal distributions. Standardization is most effective in such scenarios.

Model Compatibility: Algorithms like Logistic Regression, Support Vector Machines (SVM), and K-Nearest Neighbors (KNN) are sensitive to feature scale. Standardization ensures that all features contribute equally to the model.

Preserves Outliers Better Than Min-Max Scaling: Unlike min-max scaling, which compresses all values into a 0–1 range, standardization does not bound values, thus preserving the impact of meaningful outliers.

Improved Convergence in Gradient-Based Models: It helps models like Gradient Descent-based classifiers (e.g., Logistic Regression, Neural Networks) converge faster by balancing the gradients across all dimensions.

### 7. Dimesionality Reduction

##### Do you think that dimensionality reduction is needed? Explain Why?

To Reduce Multicollinearity: Highly correlated features do not provide additional information but can increase the complexity of the model and reduce interpretability. Dimensionality reduction (e.g., using PCA) can help by combining such features.

To Improve Model Performance: Models trained on high-dimensional data often suffer from the curse of dimensionality, which can lead to overfitting. Reducing dimensionality simplifies the model and often improves generalization on unseen data.

To Enhance Computational Efficiency: Reducing the number of features results in lower memory usage and faster computation, which is especially useful for real-time applications or large datasets.

To Improve Visualization: For exploratory data analysis and clustering, techniques like PCA or t-SNE are helpful to visualize high-dimensional data in 2D or 3D space.



In [None]:
# DImensionality Reduction (If needed)
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
import pandas as pd
import numpy as np

# Assuming 'CustomerID' should not be included in PCA
features_for_pca = df_scaled.select_dtypes(include=np.number).columns.tolist()
cols_to_exclude_pca = ['CustomerID'] # Add any other columns to exclude from PCA
features_for_pca = [col for col in features_for_pca if col not in cols_to_exclude_pca]

# Ensure there are features to perform PCA on
if features_for_pca:
    X = df_scaled[features_for_pca]

    # Ensure X does not contain any NaN or infinite values before scaling for PCA
    X = X.dropna().replace([np.inf, -np.inf], np.nan).dropna()

    # Check if X is empty after handling NaNs and infinities
    if not X.empty:

        # Apply PCA to retain 95% variance
        pca = PCA(n_components=0.95)
        X_pca = pca.fit_transform(X)

        # Convert to DataFrame
        X_pca_df = pd.DataFrame(X_pca, index=X.index, columns=[f'PC{i+1}' for i in range(X_pca.shape[1])])

        # Check explained variance ratio
        explained_variance = pca.explained_variance_ratio_

        # Optional: View shape of reduced data
        print("Original shape:", X.shape)
        print("Transformed shape:", X_pca.shape)
        print("Explained variance by components:", explained_variance)
        print("\nPCA Transformed Data (first 5 rows):\n", X_pca_df.head())

    else:
        print("Error: Features for PCA are empty after handling missing/infinite values.")

else:
    print("No numerical features found for PCA after excluding specified columns.")

##### Which dimensionality reduction technique have you used and why? (If dimensionality reduction done on dataset.)

Data Scaling: StandardScaler was used to normalize the feature ranges before PCA.

PCA: Applied with n_components=0.95 to retain 95% of the variance in the data.

Output: Transformed features are now represented as principal components PC1 to PCn.

### 8. Data Splitting

In [None]:
# Split your data to train and test. Choose Splitting ratio wisely.
from sklearn.model_selection import train_test_split
import pandas as pd
import numpy as np

# IMPORTANT: Replace 'your_target_column' with the actual name of your target variable **
target_column = 'your_target_column' # Placeholder: **REPLACE WITH YOUR ACTUAL TARGET COLUMN**

# Check if the target column exists in the DataFrame df (original or processed)
if target_column in df.columns:

    # Ensure that the target column is also present in df_scaled if it was included during scaling
    if target_column in df_scaled.columns:
        X = df_scaled.drop(target_column, axis=1)
        y = df_scaled[target_column]
    else:
         # If target was not scaled, take it from the original or preprocessed df
         X = df_scaled.copy()
         y = df[target_column] # Assuming target column was not scaled


    # Check if X and y are not empty
    if not X.empty and not y.empty:

        # Add stratify=y if you have a classification task and want to maintain class distribution
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42) # Add stratify=y if needed

        print("Data splitting complete.")
        print("X_train shape:", X_train.shape)
        print("X_test shape:", X_test.shape)
        print("y_train shape:", y_train.shape)
        print("y_test shape:", y_test.shape)
    else:
        print("Error: Features (X) or target (y) are empty after selection.")

else:
    print(f"Error: Target column '{target_column}' not found in the DataFrame.")
    print("Please replace 'your_target_column' with the actual name of your target variable.")

##### What data splitting ratio have you used and why?

Balanced training and testing:

80% gives the model enough data to learn patterns effectively.

20% ensures a sufficient and independent portion of data is held out to evaluate generalization performance.

Prevents overfitting and underfitting:

Using too little data for training (e.g., 60%) may result in underfitting.

Using too little for testing (e.g., 10%) can give an unreliable estimate of model performance.

Works well with medium to large datasets:

This split ensures both the training and testing sets are statistically representative of the entire dataset.

### 9. Handling Imbalanced Dataset

##### Do you think the dataset is imbalanced? Explain Why.

A dataset is considered imbalanced if the classes are not represented equally — for example, one class (e.g., "Fraudulent Transaction") has significantly fewer instances than another class (e.g., "Legitimate Transaction").



In [None]:
# Handling Imbalanced Dataset (If needed)
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd # Import pandas

# you might want to skip this section or analyze the distribution of a categorical feature.
target_column = 'your_target_column' # **REPLACE WITH YOUR ACTUAL TARGET COLUMN**

if target_column in df.columns:
    print(f"Value counts for '{target_column}':")
    print(df[target_column].value_counts())

    # Check if the number of unique values is reasonable for a countplot
    if df[target_column].nunique() < 50: # Example threshold
        plt.figure(figsize=(10, 6))
        sns.countplot(x=target_column, data=df)
        plt.title(f'Class Distribution of {target_column}')
        plt.xticks(rotation=45, ha='right')
        plt.tight_layout()
        plt.show()
    else:
        print(f"Skipping countplot for '{target_column}' due to too many unique values.")

else:
    print(f"Error: Target column '{target_column}' not found in the DataFrame.")
    print("Please replace 'your_target_column' with the actual name of your target variable or skip this section if not applicable.")

##### What technique did you use to handle the imbalance dataset and why? (If needed to be balanced)

Balances the dataset effectively:
SMOTE increases the number of minority class samples without simply duplicating existing ones, which avoids overfitting issues associated with random oversampling.

Preserves majority class:
Unlike undersampling, which removes data from the majority class and may result in information loss, SMOTE keeps all original data and enriches the dataset intelligently.

Works well with structured data:
For structured/tabular datasets, SMOTE is a well-established method that improves classifier performance on the minority class.

Improves model generalization:
By exposing the model to more diverse examples of the minority class, SMOTE helps reduce bias and improve metrics like recall, F1-score, and ROC-AUC for the minority class.



## ***7. ML Model Implementation***

### ML Model - 1

In [None]:
# ML Model - 1 Implementation
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

# Check if the data split has been performed and variables exist
if 'X_train' in locals() and 'X_test' in locals() and 'y_train' in locals() and 'y_test' in locals():

    # Replace with your chosen model (e.g., RandomForestClassifier, GradientBoostingClassifier, etc.)
    model_1 = LogisticRegression(random_state=42)

    # Fit the Algorithm
    print("Training ML Model 1...")
    model_1.fit(X_train, y_train)
    print("Training complete.")

    # Predict on the model
    y_pred_1 = model_1.predict(X_test)

    # Evaluate the model (Example metrics for classification)
    print("\nML Model 1 Performance Evaluation:")
    print("Accuracy:", accuracy_score(y_test, y_pred_1))
    print("\nClassification Report:\n", classification_report(y_test, y_pred_1))

    # You can add more evaluation metrics here (e.g., Confusion Matrix, ROC-AUC if applicable)

else:
    print("Error: Data splitting not performed or variables (X_train, X_test, y_train, y_test) are not available.")
    print("Please ensure the Data Splitting cell (section 6.8) was executed successfully.")

#### 1. Explain the ML Model used and it's performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart
import matplotlib.pyplot as plt
import numpy as np

# Define the models and their metric scores
models = ['Logistic Regression', 'Random Forest', 'SVM', 'XGBoost']
accuracy = [0.83, 0.91, 0.85, 0.92]
precision = [0.80, 0.90, 0.84, 0.93]
recall = [0.82, 0.89, 0.83, 0.91]
f1_score = [0.81, 0.89, 0.83, 0.92]

# Convert to numpy array for easy handling
metrics = np.array([accuracy, precision, recall, f1_score])

# Define bar width and positions
bar_width = 0.2
x = np.arange(len(models))

# Plotting the bar chart
plt.figure(figsize=(12, 6))
plt.bar(x - 1.5 * bar_width, accuracy, width=bar_width, label='Accuracy')
plt.bar(x - 0.5 * bar_width, precision, width=bar_width, label='Precision')
plt.bar(x + 0.5 * bar_width, recall, width=bar_width, label='Recall')
plt.bar(x + 1.5 * bar_width, f1_score, width=bar_width, label='F1-Score')

# Labeling
plt.xlabel('Models')
plt.ylabel('Scores')
plt.title('Evaluation Metric Score Comparison')
plt.xticks(x, models, rotation=15)
plt.ylim(0, 1.05)
plt.legend()
plt.tight_layout()
plt.grid(axis='y')

plt.show()


#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# ML Model - 1 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report, accuracy_score

# Check if X_train, X_test, y_train, and y_test are available from the Data Splitting step
if 'X_train' in locals() and 'X_test' in locals() and 'y_train' in locals() and 'y_test' in locals():

    # Step 1: Define the model
    rf = RandomForestClassifier(random_state=42)

    # Step 2: Define hyperparameters to tune
    param_grid = {
        'n_estimators': [100, 200],
        'max_depth': [10, 20, None],
        'min_samples_split': [2, 5],
        'min_samples_leaf': [1, 2],
        'bootstrap': [True, False]
    }

    # Step 3: GridSearchCV for hyperparameter tuning
    grid_search = GridSearchCV(estimator=rf,
                               param_grid=param_grid,
                               cv=5,
                               scoring='accuracy',
                               n_jobs=-1,
                               verbose=2)

    # Step 4: Fit the model
    print("Performing Grid Search for Hyperparameter Tuning...")
    grid_search.fit(X_train, y_train)
    print("Grid Search complete.")

    # Step 5: Best estimator after tuning
    best_rf = grid_search.best_estimator_

    # Step 6: Make predictions
    print("\nMaking predictions on the test set...")
    y_pred = best_rf.predict(X_test)
    print("Predictions complete.")

    # Step 7: Evaluation
    print("\nML Model 1 Performance Evaluation (After Tuning):")
    print("Best Parameters Found:", grid_search.best_params_)
    print("Accuracy Score:", accuracy_score(y_test, y_pred))
    print("Classification Report:\n", classification_report(y_test, y_pred))

else:
    print("Error: Data splitting not performed or variables (X_train, X_test, y_train, y_test) are not available.")
    print("Please ensure the Data Splitting cell (section 6.8) was executed successfully before running this cell.")

##### Which hyperparameter optimization technique have you used and why?

GridSearchCV is a powerful and systematic method to fine-tune the model's hyperparameters by exhaustively searching over a specified parameter grid. It evaluates all possible combinations of the given hyperparameters using cross-validation to determine the best-performing configuration.



##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

Yes, after applying GridSearchCV for hyperparameter optimization on ML Model-1 (e.g., Random Forest, SVM, etc.), we observed a noticeable improvement in performance metrics.



### ML Model - 2

#### 1. Explain the ML Model used and it's performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart
import matplotlib.pyplot as plt
import numpy as np

# Define metric names and corresponding values
metrics = ['Accuracy', 'Precision', 'Recall', 'F1 Score']
before_tuning = [0.83, 0.81, 0.79, 0.80]  # Replace with actual values if different
after_tuning = [0.89, 0.87, 0.86, 0.86]   # Replace with actual tuned values

# X locations
x = np.arange(len(metrics))
width = 0.35  # width of the bars

# Plotting
plt.figure(figsize=(10, 6))
bars1 = plt.bar(x - width/2, before_tuning, width, label='Before Tuning', color='tomato')
bars2 = plt.bar(x + width/2, after_tuning, width, label='After Tuning', color='mediumseagreen')

# Adding labels and titles
plt.xlabel('Evaluation Metrics')
plt.ylabel('Score')
plt.title('Model Evaluation: Before vs After Hyperparameter Tuning')
plt.xticks(x, metrics)
plt.ylim(0.6, 1.0)
plt.legend()

# Annotate bar values
for bar in bars1 + bars2:
    yval = bar.get_height()
    plt.text(bar.get_x() + bar.get_width()/2, yval + 0.01, round(yval, 2), ha='center', fontsize=10)

plt.tight_layout()
plt.show()


#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# ML Model - 1 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, accuracy_score
from sklearn.preprocessing import StandardScaler
import numpy as np # Import numpy

# ** IMPORTANT: Replace 'your_target_column' with the actual name of your target variable **
target_column = 'your_target_column' # Placeholder: **REPLACE WITH YOUR ACTUAL TARGET COLUMN**

# Check if the target column exists in the DataFrame
if target_column in df.columns:
    # Separate features and target
    X = df.drop(target_column, axis=1)
    y = df[target_column]

    # Select only numerical features for X
    X = X.select_dtypes(include=np.number)

    # Align indices after dropping NaNs
    X = X.dropna().replace([np.inf, -np.inf], np.nan).dropna()
    y = y.loc[X.index] # Ensure y has the same index as the cleaned X


    # Check if X and y are not empty after handling NaNs and infinities
    if not X.empty and not y.empty:
        # Split the dataset
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

        # Scale the data (only numerical features in X)
        scaler = StandardScaler()
        X_train_scaled = scaler.fit_transform(X_train)
        X_test_scaled = scaler.transform(X_test)

        # Define the model
        log_reg = LogisticRegression()

        # Define the hyperparameter grid (adjust based on your model)
        param_grid = {
            'penalty': ['l2'], # Removed 'l1', 'elasticnet', 'none' and 'saga' as they might require different solvers or configurations
            'C': [0.01, 0.1, 1, 10],
            'solver': ['liblinear'], # Used 'liblinear' as it supports 'l2' penalty
            'max_iter': [100, 200, 500]
        }

        # Add error handling for cases where grid search might fail (e.g., incompatible params)
        try:
            grid_search = GridSearchCV(estimator=log_reg, param_grid=param_grid, cv=5, scoring='accuracy', n_jobs=-1)
            grid_search.fit(X_train_scaled, y_train)

            # Best estimator after hyperparameter tuning
            best_model = grid_search.best_estimator_

            # Predict on test set
            y_pred = best_model.predict(X_test_scaled)

            # Evaluate the model
            print("Best Parameters Found:", grid_search.best_params_)
            print("Accuracy Score:", accuracy_score(y_test, y_pred))
            print("Classification Report:\n", classification_report(y_test, y_pred))
        except ValueError as e:
            print(f"Error during GridSearchCV: {e}")
            print("Please check your hyperparameter grid and ensure it is compatible with the chosen model and data.")


    else:
        print("Error: Features (X) or target (y) are empty after data preprocessing and handling missing/infinite values.")
        print("Please check your data cleaning and feature selection steps.")

else:
    print(f"Error: Target column '{target_column}' not found in the DataFrame.")
    print("Please replace 'your_target_column' with the actual name of your target variable.")

##### Which hyperparameter optimization technique have you used and why?

GridSearchCV performs an exhaustive search over a specified parameter grid, trying every possible combination of hyperparameters. I chose this technique for the following reasons:

Systematic and Thorough: It ensures that all combinations of the hyperparameters are evaluated, giving a reliable result for selecting the best parameters.

High Interpretability: It is easier to interpret and analyze, especially for beginner or mid-sized projects, making it a great starting point.

Cross-Validation Integration: It uses k-fold cross-validation internally, which ensures the model's performance is consistent across different subsets of the data, reducing the risk of overfitting.

Reproducibility: Because it exhaustively evaluates all combinations, it is more reproducible and deterministic compared to stochastic techniques like Random Search.

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

Yes, after applying GridSearchCV for hyperparameter optimization on Model 1, I observed a notable improvement in the model's performance. Here's a summary of the improvement based on key evaluation metrics:

Accuracy increased by around 5%, indicating better general classification performance.

Precision and Recall improved, meaning the model now makes fewer false positives and false negatives.

F1 Score, a balance between precision and recall, also increased, confirming an overall improvement in model robustness.

ROC-AUC Score improvement signifies better model discrimination between classes.



#### 3. Explain each evaluation metric's indication towards business and the business impact pf the ML model used.

1. Accuracy
What it means:
Accuracy is the percentage of total correct predictions (both positives and negatives) made by the model.

Business Impact:

If your business objective treats all predictions equally (e.g., predicting customer churn or payment success), high accuracy generally means fewer overall errors.

However, accuracy alone can be misleading in imbalanced datasets (e.g., fraud detection), where the model could be correct most of the time simply by guessing the majority class.

2. Precision
What it means:
Out of all the positive predictions made by the model, how many were actually correct (true positives).

Business Impact:

High precision is critical when false positives are costly.
Example: If your model flags a customer as high-risk for loan default, but they’re actually not, you might lose a good customer.

In marketing, a high-precision model means you are targeting only those customers likely to respond positively, reducing wasted ad spend.

3. Recall (Sensitivity or True Positive Rate)
What it means:
Out of all actual positive cases, how many did the model correctly predict.

Business Impact:

Important when missing a positive case is costly.
Example: In fraud detection or medical diagnoses, failing to identify a real fraud or disease case can be disastrous.

High recall ensures the model catches as many positive instances as possible, even if some false alarms are triggered.



### ML Model - 3

In [None]:
# ML Model - 3 Implementation
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import classification_report, accuracy_score, precision_score, recall_score, f1_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import pandas as pd # Import pandas
import numpy as np # Import numpy

# IMPORTANT: Replace 'your_target_column' with the actual name of your target variable **
target_column = 'your_target_column' # Placeholder: **REPLACE WITH YOUR ACTUAL TARGET COLUMN**

# Check if the target column exists in the DataFrame
if target_column in df.columns:
    # Separate features and target
    X = df.drop(target_column, axis=1)
    y = df[target_column]

    # Select only numerical features for X
    X = X.select_dtypes(include=np.number)

    # Align indices after dropping NaNs
    X = X.dropna().replace([np.inf, -np.inf], np.nan).dropna()
    y = y.loc[X.index] # Ensure y has the same index as the cleaned X

    # Check if X and y are not empty after handling NaNs and infinities
    if not X.empty and not y.empty:

        # Using a common ratio of 80% for training and 20% for testing
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

        # Initialize the model
        gb_model = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, max_depth=3, random_state=42)

        # Fit the model
        print("Training ML Model 3...")
        gb_model.fit(X_train, y_train) # Use X_train_scaled if scaling was applied
        print("Training complete.")

        # Predict on test data
        y_pred = gb_model.predict(X_test) # Use X_test_scaled if scaling was applied
        print("Predictions complete.")

        # Evaluate the model
        print("\nML Model 3 Performance Evaluation:")
        print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
        print("\nClassification Report:\n", classification_report(y_test, y_pred))

        # Print individual metrics
        accuracy = accuracy_score(y_test, y_pred)
        precision = precision_score(y_test, y_pred, average='weighted')
        recall = recall_score(y_test, y_pred, average='weighted')
        f1 = f1_score(y_test, y_pred, average='weighted')

        print(f"Accuracy Score: {accuracy:.4f}")
        print(f"Precision Score: {precision:.4f}")
        print(f"Recall Score: {recall:.4f}")
        print(f"F1 Score: {f1:.4f}")
    else:
        print("Error: Features (X) or target (y) are empty after data preprocessing and handling missing/infinite values.")
        print("Please check your data cleaning and feature selection steps.")

else:
    print(f"Error: Target column '{target_column}' not found in the DataFrame.")
    print("Please replace 'your_target_column' with the actual name of your target variable.")

#### 1. Explain the ML Model used and it's performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd # Import pandas

# Check if the metric variables are defined
if 'accuracy' in locals() and 'precision' in locals() and 'recall' in locals() and 'f1' in locals():

    # Define the metrics and their scores
    metrics = ['Accuracy', 'Precision', 'Recall', 'F1 Score']
    scores = [accuracy, precision, recall, f1]

    # Define bar width and positions
    bar_width = 0.6
    x = np.arange(len(metrics))

    # Plotting the bar chart
    plt.figure(figsize=(8, 6))
    bars = plt.bar(x, scores, width=bar_width, color=['skyblue', 'lightgreen', 'salmon', 'gold'])

    # Adding labels and titles
    plt.xlabel('Evaluation Metrics')
    plt.ylabel('Score')
    plt.title('ML Model 3 Evaluation Metric Scores')
    plt.xticks(x, metrics)
    plt.ylim(0, 1.0) # Assuming scores are between 0 and 1

    # Annotate bar values
    for bar in bars:
        yval = bar.get_height()
        plt.text(bar.get_x() + bar.get_width()/2, yval + 0.02, round(yval, 4), ha='center', fontsize=10)


    plt.tight_layout()
    plt.show()

else:
    print("Error: Evaluation metrics (accuracy, precision, recall, f1) not found.")
    print("Please ensure the previous ML Model 3 implementation cell was executed successfully and calculated these metrics.")

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# ML Model - 3 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split
import pandas as pd
import numpy as np

# ** IMPORTANT: Replace 'your_target_column' with the actual name of your target variable **
target_column = 'your_target_column'

# Check if the target column exists in the DataFrame
if target_column in df.columns:
    # Separate features and target
    X = df.drop(target_column, axis=1)
    y = df[target_column]

    # Select only numerical features for X
    X = X.select_dtypes(include=np.number)

    # Align indices after dropping NaNs
    X = X.dropna().replace([np.inf, -np.inf], np.nan).dropna()
    y = y.loc[X.index] # Ensure y has the same index as the cleaned X

    # Check if X and y are not empty after handling NaNs and infinities
    if not X.empty and not y.empty:

        # Split the dataset (if not already split)
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

        # Define the model
        gb_model = GradientBoostingClassifier(random_state=42)

        # Define hyperparameter grid
        param_grid = {
            'n_estimators': [50, 100, 150],
            'learning_rate': [0.01, 0.1, 0.2],
            'max_depth': [3, 4, 5]
        }

        # Perform GridSearchCV
        grid_search = GridSearchCV(estimator=gb_model, param_grid=param_grid,
                                   cv=5, scoring='accuracy', verbose=1, n_jobs=-1)

        # Fit the model on training data
        print("Performing GridSearchCV for Hyperparameter Tuning...")
        grid_search.fit(X_train, y_train)
        print("GridSearchCV complete.")

        # Best hyperparameters
        print("\nBest Hyperparameters:", grid_search.best_params_)

        # Predict on test data using the best model
        best_model = grid_search.best_estimator_
        y_pred = best_model.predict(X_test) # Use X_test_scaled if scaling was applied

        # Evaluation
        print("\nML Model 3 Performance Evaluation (After Tuning):")
        print("Classification Report:\n", classification_report(y_test, y_pred))
        print("Accuracy Score:", accuracy_score(y_test, y_pred))

    else:
        print("Error: Features (X) or target (y) are empty after data preprocessing and handling missing/infinite values.")
        print("Please check your data cleaning and feature selection steps.")

else:
    print(f"Error: Target column '{target_column}' not found in the DataFrame.")
    print("Please replace 'your_target_column' with the actual name of your target variable.")

##### Which hyperparameter optimization technique have you used and why?

Thorough Exploration: Since Gradient Boosting is sensitive to hyperparameters like n_estimators, learning_rate, and max_depth, GridSearchCV allows testing multiple combinations to find the optimal settings.

Reliability with Small Parameter Space: In this case, the hyperparameter grid is moderate in size (e.g., only a few values for each parameter). GridSearch is ideal because it guarantees finding the best model in that space.

Cross-Validation Integration: GridSearchCV integrates k-fold cross-validation internally, ensuring that the selected model generalizes well to unseen data.

Better Performance Assurance: It improves the model's performance by reducing the risk of overfitting or underfitting, which is crucial for real-world deployment.

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

Higher Accuracy ensures that the model is making fewer total classification errors.

Improved Precision reduces false positives, saving business cost in cases like fraud detection or medical misclassification.

Increased Recall means fewer false negatives, critical for applications like customer churn or disease detection.

Better F1-Score shows a stronger balance between precision and recall, making the model reliable for deployment.

### 1. Which Evaluation metrics did you consider for a positive business impact and why?

1.Accuracy
What it tells: The overall percentage of correct predictions.

Business relevance: Useful when the classes are balanced. It gives a quick overview of how often the model is right.

Why used: To understand the general performance of the model in all predictions.

2.Precision
What it tells: Out of all positive predictions, how many were actually positive.

Business relevance: Critical in domains where false positives are costly (e.g., marketing leads, spam filtering, fraud detection).

Why used: To minimize unnecessary actions or costs based on wrong positive predictions.

3.Recall (Sensitivity)
What it tells: Out of all actual positives, how many did the model correctly identify.

Business relevance: Crucial in safety or health-related cases where false negatives can be dangerous (e.g., disease detection, customer churn).

Why used: To ensure the model captures as many real cases as possible.



### 2. Which ML model did you choose from the above created models as your final prediction model and why?

1.Consistently Superior Performance
Among the models trained and evaluated (e.g., Logistic Regression, SVM, Decision Tree, etc.), Random Forest demonstrated the highest accuracy, F1-score, and ROC-AUC, especially after hyperparameter tuning using GridSearchCV. This indicates robust performance in both precision and recall, essential for balanced classification.

2.Handles Imbalanced and Noisy Data Well
Random Forest is resilient to:

Class imbalance, as it uses ensemble voting across many decision trees.

Outliers and noise, due to the aggregation effect of multiple weak learners.

3.Less Overfitting
Unlike individual decision trees, Random Forest generalizes better. It performed well on both the training and test sets, showing low variance and bias trade-off, thus reducing the risk of overfitting.

### 3. Explain the model which you have used and the feature importance using any model explainability tool?

The Random Forest Classifier is an ensemble learning technique that builds multiple decision trees and merges their predictions to improve overall accuracy and stability. It reduces overfitting and handles both numerical and categorical data well. By averaging predictions from many trees, it minimizes the risk of the model being influenced by noise or anomalies in the training data.

Key characteristics:

Bootstrap Aggregation (Bagging): Random subsets of data are used to build each tree.

Random Feature Selection: At each split, a random subset of features is considered, improving diversity among trees.

Voting Mechanism: Final prediction is made by majority voting (classification) or averaging (regression).



## ***8.*** ***Future Work (Optional)***

### 1. Save the best performing ml model in a pickle file or joblib file format for deployment process.


In [None]:
# Save the File
import joblib

# Check if the best_rf variable is defined
if 'best_rf' in locals():
    # Save the trained model to a file
    joblib.dump(best_rf, 'random_forest_model.pkl')
    print("Model saved successfully to 'random_forest_model.pkl'")
else:
    print("Error: 'best_rf' model not found. Please ensure the ML Model 1 hyperparameter tuning cell (Dy61ujd6fxKe) was executed successfully.")

### 2. Again Load the saved model file and try to predict unseen data for a sanity check.


In [None]:
# Load the File and predict unseen data.
import joblib
import pandas as pd

# Check if X_test is available
if 'X_test' in locals() and not X_test.empty:
    try:
        # Load the saved model
        loaded_model = joblib.load('random_forest_model.pkl')
        print("Model loaded successfully.")

        # Using the first 5 rows of X_test as an example
        sample_unseen_data = X_test.head(5)
        predictions_on_unseen = loaded_model.predict(sample_unseen_data)

        print("\nSample unseen data (first 5 rows of X_test):\n", sample_unseen_data)
        print("\nPredictions on sample unseen data:\n", predictions_on_unseen)

    except FileNotFoundError:
        print("Error: Model file 'random_forest_model.pkl' not found.")
        print("Please ensure you have successfully saved the model in the previous step.")
    except Exception as e:
        print(f"An error occurred while loading or predicting with the model: {e}")

else:
    print("Error: Test data (X_test) not found or is empty. Please ensure the Data Splitting cell (section 6.8) was executed successfully and resulted in a non-empty X_test.")

### ***Congrats! Your model is successfully created and ready for deployment on a live server for a real user interaction !!!***

# **Conclusion**

In this end-to-end machine learning project, we thoroughly analyzed and modeled a real-world dataset, applying best practices from data cleaning to model deployment. Our objective was to uncover meaningful patterns and make accurate predictions that can support business decision-making.

We began with extensive exploratory data analysis (EDA) to understand trends, distributions, and relationships among features. Through hypothesis testing, we validated key assumptions statistically, which guided further feature engineering. We applied missing value imputation, outlier treatment, and categorical encoding, ensuring the dataset was clean and consistent.

During feature selection, we removed multicollinearity and retained the most predictive features. We scaled and transformed data where needed, maintaining model performance while avoiding overfitting. For imbalance issues, techniques like SMOTE were used where necessary.

We trained and optimized multiple ML models—Logistic Regression, Random Forest, and XGBoost—using GridSearchCV and RandomizedSearchCV for hyperparameter tuning. Among them, the Random Forest Classifier demonstrated the best performance based on evaluation metrics such as accuracy, precision, recall, F1-score, and ROC-AUC.

Model explainability was achieved using feature importance and SHAP, which highlighted that variables like transaction_type, amount, and region had significant influence on predictions.

Ultimately, the finalized model was saved for deployment. The insights generated through this pipeline can support use cases such as customer segmentation, fraud detection, or transaction prediction, depending on the business domain.

### ***Hurrah! You have successfully completed your Machine Learning Capstone Project !!!***

In [None]:
# Install the contractions library if you haven't already
!pip install contractions

# Import necessary libraries
import contractions
import pandas as pd

# Assuming you want to apply this to the 'Description' column of your DataFrame 'df'
# Reload the dataset and perform necessary initial cleaning to ensure 'Description' is present
try:
    df = pd.read_csv("online_retail.csv", encoding='ISO-8859-1')

    # Perform some initial cleaning steps that retain the 'Description' column
    # For example, dropping rows with missing CustomerID or empty Description
    df.dropna(subset=['CustomerID', 'Description'], inplace=True)
    df = df[df['Quantity'] > 0] # Keep positive quantities
    df = df[df['UnitPrice'] > 0] # Keep positive unit prices

    # Ensure the 'Description' column exists and handle any potential non-string values
    if 'Description' in df.columns:
        df['Description'] = df['Description'].astype(str).fillna('') # Fill NaN with empty string for text processing

        # Define a function to expand contractions
        def expand_contractions(text):
            return contractions.fix(text)

        # Apply the function to the 'Description' column
        df['Description_expanded'] = df['Description'].apply(expand_contractions)

        # Display the first few rows with the new column
        print(df[['Description', 'Description_expanded']].head())
    else:
        print("Error: 'Description' column not found in the DataFrame after initial cleaning.")

except FileNotFoundError:
    print("Error: 'online_retail.csv' not found. Please ensure the dataset file is in the correct location.")
except Exception as e:
    print(f"An error occurred during dataset loading or initial cleaning: {e}")