Methodology hypotheses: 
- We will not use deep learning, because it offers little insight into the patterns it learns (no explainability). We will conclude with an explainability analysis.
- We will work with regression and classification (XGBoost regressor and classifier), and ARIMA for forecasting.
- Qualitative data analysis must also identify outliers 
- If too many features, proceed with [feature selection](https://machinelearningmastery.com/feature-selection-with-real-and-categorical-data/)
- Check for class imbalance: If one class is much less frequent than others, traditional accuracy may not be a good indicator of performance.
- Use appropriate metrics: Consider using F1 score or precision instead of accuracy to evaluate your model, especially for the underrepresented class.
- Visualize performance: Creating an ROC curve helps to understand the trade-offs between true positive and false positive rates across different threshold settings. 
- We will validate the model with bootstrapping if time allows.
- Use the [SHAP](https://shap.readthedocs.io/en/latest/index.html) library for explainability.
- Using XGBoost, we will analyze feature importance, which is derived from the fitted trees.
- Clustering will be done at the end.
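
To make the class-imbalance and metric points above concrete, here is a minimal sketch on made-up labels (they are illustrative and not taken from the dataset). It assumes scikit-learn is available and shows how a model that always predicts the majority class can look acceptable on accuracy while failing the minority class entirely.

```python
from sklearn.metrics import accuracy_score, f1_score, recall_score

# Toy labels (illustrative only): 8 satisfied (1) and 2 dissatisfied (0) customers.
y_true = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
# A naive model that predicts "satisfied" for everyone.
y_pred = [1] * 10

accuracy = accuracy_score(y_true, y_pred)
# Score the minority class (label 0) explicitly via pos_label.
minority_recall = recall_score(y_true, y_pred, pos_label=0)
# zero_division=0 avoids a warning when the class is never predicted.
minority_f1 = f1_score(y_true, y_pred, pos_label=0, zero_division=0)

print(f"accuracy={accuracy:.2f}, "
      f"minority recall={minority_recall:.2f}, "
      f"minority F1={minority_f1:.2f}")
```

Reporting the metric for the underrepresented class separately, as done here with `pos_label=0`, is one way to keep the majority class from masking poor performance.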




# Data Insights 
Our aim is to uncover insights into customer satisfaction, delivery efficiency, and purchasing patterns through statistical analysis, machine learning, and data visualization.

## Analyzed Questions
- Sales Trends Over Time: Identifying periods of high demand to inform inventory and marketing strategies.
- Popular Product Categories: Highlighting top-selling categories to prioritize stock and marketing efforts.
- Geographic Distribution of Sales: Tailoring logistics strategies to improve delivery times in key markets.
- Influence of Payment Types on Purchases: Enhancing the checkout process to increase conversion rates.
- Customer Satisfaction Distribution: Identifying improvement areas in product quality, service, and delivery.
- Factors Affecting Delivery Times: Analyzing and optimizing delivery performance to reduce late deliveries.
- Time to Delivery vs. Customer Satisfaction: Examining the impact of delivery times on satisfaction levels.
- Key Predictors of Customer Satisfaction: Using machine learning to focus efforts on aspects crucial to customers.
- Actual vs. Estimated Delivery Times: Assessing delivery estimate accuracy and its satisfaction impact.
- Strategies for Business Improvement: Formulating recommendations based on data insights to enhance customer experience and business performance.

### Factors that influence customer behavior
1. Demographic
2. Psychographic
3. Social
4. Cultural
5. Economic

The following code addresses all of these, given a dataset and the following hypothesis: 

#### Loading the dataset 

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import missingno as msno
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import os

In [None]:
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
        
def load_datasets(dataset_names, base_path):
    datasets = {}
    for name in dataset_names:
        file_path = os.path.join(base_path, name + '.csv')
        try:
            datasets[name] = pd.read_csv(file_path)
        except FileNotFoundError:
            print(f"File not found: {file_path}")
            datasets[name] = None
    return datasets

In [None]:
# Assuming /kaggle/input/brazilian-ecommerce/ is the correct directory
base_path = '/kaggle/input/brazilian-ecommerce/'

# Dataset names
dataset_names = [
    'olist_customers_dataset', 
    'olist_order_items_dataset', 
    'olist_order_payments_dataset', 
    'olist_order_reviews_dataset', 
    'olist_orders_dataset', 
    'olist_sellers_dataset', 
    'olist_products_dataset', 
    'product_category_name_translation'
]

# Load every CSV into a dict keyed by dataset name, using load_datasets defined above
datasets = load_datasets(dataset_names, base_path)

# Displaying the first few rows of each dataset to understand their structure
for name, df in datasets.items():
    if df is not None:
        print(f"\nFirst few rows of {name}:")
        print(df.head())
    else:
        print(f"Failed to load {name}")

In [None]:
import pandas as pd

def merge_datasets(datasets):
    # Start by merging orders with customers
    merged = pd.merge(datasets['olist_orders_dataset'], datasets['olist_customers_dataset'], on='customer_id', how='left')

    # Add other datasets with the correct merge keys
    merge_keys = {
        'order_items': 'order_id',
        'order_payments': 'order_id',
        'order_reviews': 'order_id',
        'sellers': 'seller_id',
        'products': 'product_id',
        'product_category_name_translation': 'product_category_name'  # Make sure this dataset is loaded correctly
    }

    for name, key in merge_keys.items():
        if name in datasets:
            merged = pd.merge(merged, datasets[name], on=key, how='left')

    return merged

# Load the individual datasets again so they can be stored under the short keys
# ('order_items', 'sellers', ...) that merge_datasets looks up
olist_customers_df = pd.read_csv('/kaggle/input/brazilian-ecommerce/olist_customers_dataset.csv')
order_items_df = pd.read_csv('/kaggle/input/brazilian-ecommerce/olist_order_items_dataset.csv')
order_payments_df = pd.read_csv('/kaggle/input/brazilian-ecommerce/olist_order_payments_dataset.csv')
order_reviews_df = pd.read_csv('/kaggle/input/brazilian-ecommerce/olist_order_reviews_dataset.csv')
olist_orders_df = pd.read_csv('/kaggle/input/brazilian-ecommerce/olist_orders_dataset.csv')
sellers_df = pd.read_csv('/kaggle/input/brazilian-ecommerce/olist_sellers_dataset.csv')
products_df = pd.read_csv('/kaggle/input/brazilian-ecommerce/olist_products_dataset.csv')
product_category_name_translation_df = pd.read_csv('/kaggle/input/brazilian-ecommerce/product_category_name_translation.csv')

# Define the datasets dictionary
datasets = {
    'olist_customers_dataset': olist_customers_df,
    'order_items': order_items_df,
    'order_payments': order_payments_df,
    'order_reviews': order_reviews_df,
    'olist_orders_dataset': olist_orders_df,
    'sellers': sellers_df,
    'products': products_df,
    'product_category_name_translation': product_category_name_translation_df
}

# Now, you can call the merge_datasets function with the loaded datasets
merged_df = merge_datasets(datasets)
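
One thing worth checking after a chain of left merges like the one above is row fan-out: when an order has several items and several payments, each combination becomes its own row. The sketch below uses small made-up tables (the values are hypothetical; only the column names follow the dataset) to show the effect on row counts and on summed values.

```python
import pandas as pd

# Toy tables (illustrative only): order o1 has two items and two payments.
orders = pd.DataFrame({"order_id": ["o1", "o2"]})
items = pd.DataFrame({"order_id": ["o1", "o1", "o2"],
                      "price": [10.0, 20.0, 5.0]})
payments = pd.DataFrame({"order_id": ["o1", "o1", "o2"],
                         "payment_value": [15.0, 15.0, 5.0]})

merged = (orders
          .merge(items, on="order_id", how="left")
          .merge(payments, on="order_id", how="left"))

rows = len(merged)                            # rows after both one-to-many merges
unique_orders = merged["order_id"].nunique()  # distinct orders
true_total = items["price"].sum()             # item revenue before merging
inflated_total = merged["price"].sum()        # item revenue after fan-out

print(rows, unique_orders, true_total, inflated_total)
```

If the same pattern holds in `merged_df`, then counts based on `.size()` are counts of merged rows rather than of orders, and `nunique()` on `order_id` may be the safer measure when orders are what is being counted.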

In [None]:
# Save the raw merged data before cleaning
merged_df.to_csv('olist_merged_data.csv', index=False)
merged_df.info()
# Check for duplicates (print, since only the last expression of a cell is displayed)
print(merged_df.duplicated().sum())
# Check missing values by percentage in each column
print(merged_df.isnull().sum() / len(merged_df) * 100)
# Drop columns with more than 50% missing values
merged_df = merged_df.dropna(thresh=int(len(merged_df) * 0.5), axis=1)

# Drop rows with missing values
merged_df = merged_df.dropna()

# Confirm that no missing values remain
merged_df.info()

# Clean and preprocess data
def preprocess_data(df):
    # Drop columns with more than 50% missing values
    df.dropna(thresh=len(df) * 0.5, axis=1, inplace=True)
    
    # Convert datetime columns
    datetime_cols = ['order_purchase_timestamp', 'order_approved_at', 'order_delivered_carrier_date', 
                    'order_delivered_customer_date', 'order_estimated_delivery_date', 
                    'shipping_limit_date', 'review_creation_date', 'review_answer_timestamp']
    for col in datetime_cols:
        df[col] = pd.to_datetime(df[col], errors='coerce')
    
    # Calculate new features
    df['time_to_delivery'] = (df['order_delivered_customer_date'] - df['order_approved_at']).dt.days
    df['order_processing_time'] = (df['order_approved_at'] - df['order_purchase_timestamp']).dt.days
    df['estimated_vs_actual_shipping'] = (df['order_estimated_delivery_date'] - df['order_delivered_customer_date']).dt.days
    df['product_volume_m3'] = (df['product_length_cm'] * df['product_width_cm'] * df['product_height_cm']) / 1000000
    df['satisfaction'] = (df['review_score'] >= 4).astype(int)
    df['order_value'] = df['price'] + df['freight_value']

    # create late delivery flag
    df['late_delivery'] = (df['order_delivered_customer_date'] > df['order_estimated_delivery_date']).astype(int)


    # Drop rows with missing values
    df.dropna(inplace=True)

    # create seasonal features from order_purchase_timestamp
    df['order_month'] = df['order_purchase_timestamp'].dt.month
    df['order_day'] = df['order_purchase_timestamp'].dt.dayofweek
    df['order_hour'] = df['order_purchase_timestamp'].dt.hour

    return df

merged_df = preprocess_data(merged_df)

# drop unnecessary columns
merged_df.drop(['product_name_lenght', 'product_description_lenght', 'product_photos_qty', 'review_score', 'seller_zip_code_prefix']
               , axis=1, inplace=True) 
# save the cleaned dataset
merged_df.to_csv('olist_merged_data_clean.csv', index=False)

merged_df.info()

# check summary statistics
merged_df.describe()

# Check the share of satisfied (1) vs. dissatisfied (0) reviews, in percent
merged_df['satisfaction'].value_counts() / len(merged_df) * 100
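
As a quick sanity check on the engineered delivery features, the following sketch recomputes three of them on two made-up orders (the timestamps are hypothetical). It shows in particular that `.dt.days` floors the difference, so an order delivered 2.5 days after the estimate gets `estimated_vs_actual_shipping = -3`, not `-2`.

```python
import pandas as pd

# Two illustrative orders: one delivered early, one delivered late.
toy = pd.DataFrame({
    "order_approved_at": ["2018-01-01 10:00", "2018-01-01 10:00"],
    "order_delivered_customer_date": ["2018-01-05 09:00", "2018-01-12 12:00"],
    "order_estimated_delivery_date": ["2018-01-10", "2018-01-10"],
})
for col in toy.columns:
    toy[col] = pd.to_datetime(toy[col])

# Same formulas as in preprocess_data.
toy["time_to_delivery"] = (
    toy["order_delivered_customer_date"] - toy["order_approved_at"]
).dt.days
toy["estimated_vs_actual_shipping"] = (
    toy["order_estimated_delivery_date"] - toy["order_delivered_customer_date"]
).dt.days
toy["late_delivery"] = (
    toy["order_delivered_customer_date"] > toy["order_estimated_delivery_date"]
).astype(int)

print(toy[["time_to_delivery", "estimated_vs_actual_shipping", "late_delivery"]])
```

Positive values of `estimated_vs_actual_shipping` therefore mean the order arrived ahead of the estimate, and negative values mean it arrived late.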

In [None]:
# Select only numeric columns from the DataFrame
numeric_columns = merged_df.select_dtypes(include=[np.number])

# Plot the correlation matrix heatmap with numeric columns
plt.figure(figsize=(20, 8))
sns.heatmap(numeric_columns.corr(), annot=True, cmap='coolwarm')
plt.show()

#### Univariate Analysis

In [None]:
# Ensure that the order_purchase_timestamp is in datetime format
merged_df['order_purchase_timestamp'] = pd.to_datetime(merged_df['order_purchase_timestamp'])

# Extract year and month for aggregation
merged_df['year_month'] = merged_df['order_purchase_timestamp'].dt.to_period('M')

# Aggregate data by year and month
sales_trends = merged_df.groupby('year_month').size()

# Plotting
plt.figure(figsize=(15, 6))
sales_trends.plot(kind='line', marker='o')
plt.title('Sales Trends Over Time (Monthly)')
plt.xlabel('Year-Month')
plt.ylabel('Number of Orders')
plt.grid(True)
plt.show()

In [None]:
# Grouping data by product category
category_analysis = merged_df.groupby('product_category_name_english').size().sort_values(ascending=False)

# Plotting the top 10 product categories by sales volume
plt.figure(figsize=(12, 6))
category_analysis.head(10).plot(kind='bar')
plt.title('Top 10 Product Categories by Sales Volume')
plt.xlabel('Product Category')
plt.ylabel('Number of Orders')
plt.xticks(rotation=45)
plt.show()


In [None]:
# Grouping data by customer state
state_sales = merged_df.groupby('customer_state').size().sort_values(ascending=False)

# Plotting sales distribution by state
plt.figure(figsize=(15, 6))
state_sales.plot(kind='bar')
plt.title('Geographic Distribution of Sales by State')
plt.xlabel('State')
plt.ylabel('Number of Orders')
plt.xticks(rotation=45)
plt.show()

In [None]:
# payment type
merged_df['payment_type'].value_counts().plot(kind='pie', autopct='%1.1f%%')
plt.title('Payment Type Distribution')
plt.show()

In [None]:
# Distribution of CSAT percentages on pie chart
plt.figure(figsize=(12, 6))
merged_df['satisfaction'].value_counts().plot(kind='pie', autopct='%1.1f%%')
plt.title('Distribution of CSAT Scores')
plt.show()

In [None]:
#Calculate the average delivery time
average_delivery_time = merged_df['time_to_delivery'].mean()

# Plot the distribution of delivery times
plt.figure(figsize=(12, 6))
sns.histplot(merged_df['time_to_delivery'].dropna(), bins=30, kde=True, label='Delivery Time')
plt.axvline(average_delivery_time, color='green', linestyle='dashed', linewidth=2, label='Average Delivery Time')
plt.text(average_delivery_time + 0.5, plt.ylim()[1] / 2, f'Average: {average_delivery_time:.2f} days', color='green', fontsize=12, ha='left')
plt.title('Distribution of Delivery Times')
plt.xlabel('Delivery Time (Days)')
plt.ylabel('Frequency')
plt.legend()
plt.show()

In [None]:
# Calculate the average delivery time
average_delivery_time = merged_df['time_to_delivery'].mean()

# Initialize counts and labels lists for CSAT categories
csat_labels_above, csat_labels_below = [], []
csat_counts_above, csat_counts_below = [], []

# Filter the DataFrame for CSAT values above and below average time_to_delivery
csat_above_avg = merged_df[merged_df['time_to_delivery'] >= average_delivery_time]
csat_below_avg = merged_df[merged_df['time_to_delivery'] < average_delivery_time]

# Calculate the counts for CSAT categories above average time_to_delivery
csat_counts_above = csat_above_avg['satisfaction'].value_counts().tolist()
csat_labels_above = csat_above_avg['satisfaction'].value_counts().index.tolist()

# Calculate the counts for CSAT categories below average time_to_delivery
csat_counts_below = csat_below_avg['satisfaction'].value_counts().tolist()
csat_labels_below = csat_below_avg['satisfaction'].value_counts().index.tolist()

# Create data for the pie charts
colors = ['lightgreen', 'lightcoral']

# Create the two pie charts
plt.figure(figsize=(12, 5))

# Pie chart for CSAT above average time_to_delivery
plt.subplot(1, 2, 1)
plt.pie(csat_counts_above, labels=csat_labels_above, colors=colors, autopct='%1.1f%%', startangle=140)
plt.title('CSAT Above Avg Time to Delivery')

# Pie chart for CSAT below average time_to_delivery
plt.subplot(1, 2, 2)
plt.pie(csat_counts_below, labels=csat_labels_below, colors=colors, autopct='%1.1f%%', startangle=140)
plt.title('CSAT Below Avg Time to Delivery')

plt.tight_layout()
plt.show()
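
The comparison in the two pie charts can also be summarized as a single table of satisfaction rates per group. This is a minimal sketch on made-up values (the numbers and the `delivery_group` column name are illustrative assumptions), using the same split at the mean delivery time.

```python
import numpy as np
import pandas as pd

# Illustrative data: three fast and three slow deliveries.
toy = pd.DataFrame({
    "time_to_delivery": [2, 4, 6, 20, 25, 30],
    "satisfaction":     [1, 1, 1, 1, 0, 0],
})

avg = toy["time_to_delivery"].mean()
# Label each order relative to the average delivery time.
toy["delivery_group"] = np.where(
    toy["time_to_delivery"] >= avg, "above_avg", "below_avg"
)

# Because satisfaction is 0/1, the group mean is the share of satisfied customers.
rates = toy.groupby("delivery_group")["satisfaction"].mean()
print(rates)
```

A table like this makes the two groups directly comparable without relying on reading slice sizes from separate charts.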

#### Bivariate Analysis

In [None]:
# Group by month to calculate average Time to Delivery and proportion of late deliveries
monthly_delivery_stats = merged_df.groupby('order_month').agg(
    avg_time_to_delivery=('time_to_delivery', 'mean'),
    proportion_late=('late_delivery', lambda x: (x > 0).mean())
).reset_index()

# Create a plot using Plotly
fig = make_subplots(specs=[[{"secondary_y": True}]])

# Add a bar for average Time to Delivery
fig.add_trace(
    go.Bar(x=monthly_delivery_stats['order_month'], y=monthly_delivery_stats['avg_time_to_delivery'], 
           name='Avg. Time to Delivery (days)'),
    secondary_y=False,
)

# Add a line for the proportion of late deliveries
fig.add_trace(
    go.Scatter(x=monthly_delivery_stats['order_month'], y=monthly_delivery_stats['proportion_late'], 
               name='Proportion of Late Deliveries', mode='lines+markers', marker=dict(color='red')),
    secondary_y=True,
)

# Add titles and labels
fig.update_layout(
    title='Proportion of Late Delivery vs Average Time to Delivery in Monthly Trends',
    xaxis_title='Month',
    yaxis_title='Average Time to Delivery (days)',
    yaxis2_title='Proportion of Late Deliveries',
    legend=dict(y=0.5, traceorder='reversed', font_size=16)
)

# Show the plot
fig.show()

In [None]:
# Preparing the data
grouped_data = merged_df.groupby(['order_month', 'late_delivery'])['product_weight_g'].count().reset_index()
grouped_data = grouped_data.pivot(index='order_month', columns='late_delivery', values='product_weight_g').fillna(0)

# Creating the bar for each 'late_delivery' status
bars = []
for late_delivery_status in grouped_data.columns:
    bars.append(go.Bar(name=str(late_delivery_status),
                       x=grouped_data.index,
                       y=grouped_data[late_delivery_status]))

# Creating the figure and adding the bars
fig = go.Figure(data=bars)

# Change the bar mode to stacked; each bar counts merged rows (order lines) per month
fig.update_layout(barmode='stack', 
                  title='Number of Order Lines by Month and Late Delivery Status',
                  xaxis_title='Order Month',
                  yaxis_title='Number of Order Lines')

# Show the plot
fig.show()

In [None]:
from matplotlib.ticker import PercentFormatter

# Calculate the percentage of late deliveries by category
late_deliveries_by_category = merged_df[merged_df['late_delivery'] == 1]['product_category_name_english'].value_counts()
total_late_deliveries = late_deliveries_by_category.sum()
late_deliveries_by_category_percent = (late_deliveries_by_category / total_late_deliveries).cumsum() * 100

# Pareto chart: bars show late deliveries per category, the line shows the cumulative percentage
fig, ax = plt.subplots(figsize=(14, 9))
late_deliveries_by_category.plot(kind='bar', ax=ax, color='skyblue')
ax2 = ax.twinx()
ax2.plot(late_deliveries_by_category_percent.index, late_deliveries_by_category_percent.values, color='red', marker='D', ms=7)
ax2.yaxis.set_major_formatter(PercentFormatter())

# Add a horizontal line at the 80% cumulative threshold
ax2.axhline(80, color='green', linestyle='--', linewidth=2)
# Identifying the point where cumulative percentage surpasses 80%
category_80_idx = late_deliveries_by_category_percent[late_deliveries_by_category_percent >= 80].index[0]
category_80_position = late_deliveries_by_category.index.get_loc(category_80_idx)

ax.axvline(category_80_position, color='purple', linestyle='--', linewidth=2)

ax.tick_params(axis='y', colors='skyblue')
ax2.tick_params(axis='y', colors='red')
ax.set_xlabel('Product Category')
ax.set_ylabel('Number of Late Deliveries', color='skyblue')
ax2.set_ylabel('Cumulative Percentage', color='red')
plt.title('Pareto Chart of Late Deliveries by Product Category with 80/20 Threshold')

plt.xticks(rotation=90)
plt.tight_layout()
plt.show()

### We can use the above insight in targeted marketing efforts by:
- Focusing on improving the logistics and delivery processes for the product categories that contribute most to late deliveries, as they could negatively impact customer satisfaction.
- Developing specialized marketing campaigns that reassure customers about timely delivery for these high-late-delivery categories.
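
The Pareto logic used above (sort descending, accumulate, find the first category at or past 80%) can be isolated in a small helper. This is a sketch under stated assumptions: `pareto_cutoff` is a hypothetical name, and the category counts are made up for illustration.

```python
import pandas as pd

def pareto_cutoff(counts: pd.Series, threshold: float = 80.0):
    """Return cumulative percentages and the first category reaching the threshold."""
    ordered = counts.sort_values(ascending=False)
    cumulative_pct = ordered.cumsum() / ordered.sum() * 100
    # First label whose cumulative share is at or above the threshold.
    cutoff_label = cumulative_pct[cumulative_pct >= threshold].index[0]
    cutoff_position = cumulative_pct.index.get_loc(cutoff_label)
    return cumulative_pct, cutoff_label, cutoff_position

# Illustrative late-delivery counts per category.
late_counts = pd.Series(
    {"toys": 5, "furniture": 50, "electronics": 35, "books": 10}
)
cumulative_pct, cutoff_label, cutoff_position = pareto_cutoff(late_counts)
print(cumulative_pct)
print(cutoff_label, cutoff_position)
```

Sorting inside the helper matters: the cumulative percentage is only meaningful as a Pareto curve when categories are accumulated from largest to smallest.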


In [None]:
# Calculate the total order value by category
order_value_by_category = merged_df.groupby('product_category_name_english')['order_value'].sum()
total_order_value = order_value_by_category.sum()

# Sort categories by order value in descending order before accumulating,
# so the cumulative percentage follows the bar order in the Pareto chart
sorted_order_value_by_category = order_value_by_category.sort_values(ascending=False)
cumulative_order_value_percent = sorted_order_value_by_category.cumsum() / total_order_value * 100

# Plot the Pareto chart
fig, ax = plt.subplots(figsize=(14, 9))
sorted_order_value_by_category.plot(kind='bar', ax=ax, color='green')
ax2 = ax.twinx()
ax2.plot(cumulative_order_value_percent.index, cumulative_order_value_percent.values, color='red', marker='D', ms=7)
ax2.yaxis.set_major_formatter(PercentFormatter())

# Adding lines for the 80% threshold
ax2.axhline(80, color='green', linestyle='--', linewidth=2)
category_80_idx_corrected = cumulative_order_value_percent[cumulative_order_value_percent >= 80].index[0]
category_80_position_corrected = cumulative_order_value_percent.index.get_loc(category_80_idx_corrected)

ax.axvline(category_80_position_corrected, color='purple', linestyle='--', linewidth=2)

ax.tick_params(axis='y', colors='green')
ax2.tick_params(axis='y', colors='red')
ax.set_xlabel('Product Category')
ax.set_ylabel('Total Order Value', color='green')
ax2.set_ylabel('Cumulative Percentage', color='red')
plt.title('Pareto Chart of Order Value by Product Category with 80/20 Threshold')

plt.xticks(rotation=90)
plt.tight_layout()
plt.show()

Pareto Chart of Order Value by Product Category with 80/20 Threshold: A small number of categories generate most of the revenue, which points to market focus areas or best-sellers.
### We can use the above insight in targeted marketing efforts by:
- Allocating more budget to advertising the top-performing product categories, which contribute most to order value, to maximize ROI.
- Creating bundles or promotions that pair high-value items with other products to increase the overall order value.

In [None]:
from matplotlib import dates as mdates

# Preparing data for Control Chart: Calculate daily average time to delivery
daily_delivery_times = merged_df.copy()
daily_delivery_times['order_approved_at'] = pd.to_datetime(daily_delivery_times['order_approved_at'])
daily_delivery_times.set_index('order_approved_at', inplace=True)
daily_avg_delivery_time = daily_delivery_times['time_to_delivery'].resample('D').mean().dropna()

# Control Chart calculations
mean_delivery_time = daily_avg_delivery_time.mean()
std_dev_delivery_time = daily_avg_delivery_time.std()
upper_control_limit = mean_delivery_time + (std_dev_delivery_time * 3)
lower_control_limit = mean_delivery_time - (std_dev_delivery_time * 3)

# Plotting the Control Chart
fig, ax = plt.subplots(figsize=(14, 8))
daily_avg_delivery_time.plot(ax=ax, marker='o', linestyle='-', color='blue', markersize=4)
ax.axhline(mean_delivery_time, color='green', linestyle='--')
ax.axhline(upper_control_limit, color='red', linestyle='--')
ax.axhline(lower_control_limit, color='red', linestyle='--')

# Formatting the plot
ax.set_title('Control Chart for Daily Average Time to Delivery')
ax.set_xlabel('Date')
ax.set_ylabel('Average Time to Delivery (Days)')
ax.legend(['Daily Avg Time to Delivery', 'Mean', 'Upper Control Limit', 'Lower Control Limit'])
ax.xaxis.set_major_locator(mdates.MonthLocator())
ax.xaxis.set_major_formatter(mdates.DateFormatter('%Y-%m'))

plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

Control Chart for Daily Average Time to Delivery: Delivery times are generally consistent, but spikes suggest occasional delays that could impact customer satisfaction.
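
To list the spikes rather than read them off the chart, the same three-sigma limits can be applied as a filter. The sketch below is illustrative: `control_limits` is a hypothetical helper, and the daily series is made up (19 ordinary days and one slow day).

```python
import pandas as pd

def control_limits(series: pd.Series, n_sigma: float = 3.0):
    """Return (lower, center, upper) limits as mean +/- n_sigma standard deviations."""
    center = series.mean()
    spread = series.std()
    return center - n_sigma * spread, center, center + n_sigma * spread

# Illustrative daily averages: 19 days at 10 days, one day at 30 days.
daily = pd.Series(
    [10.0] * 19 + [30.0],
    index=pd.date_range("2018-01-01", periods=20, freq="D"),
)

lower, center, upper = control_limits(daily)
# Days falling outside the control limits.
out_of_control = daily[(daily > upper) | (daily < lower)]
print(out_of_control)
```

On the real data, the same filter applied to `daily_avg_delivery_time` would give the dates of the spikes, which could then be compared against satisfaction on those dates.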

In [None]:
# Analyzing the Variability of Time to Delivery Across Different Product Categories
category_delivery_times = merged_df.groupby('product_category_name_english')['time_to_delivery'].mean().sort_values()

plt.figure(figsize=(12, 10))
sns.barplot(x=category_delivery_times.values, y=category_delivery_times.index, palette="viridis")
plt.title('Average Time to Delivery by Product Category')
plt.xlabel('Average Time to Delivery (Days)')
plt.ylabel('Product Category')
plt.tight_layout()
plt.show()

In [None]:
# Calculate the average satisfaction score over weeks
merged_df['order_approved_at'] = pd.to_datetime(merged_df['order_approved_at'])
merged_df['week_year'] = merged_df['order_approved_at'].dt.strftime('%Y-%U')

# Calculating average satisfaction score over weeks
weekly_satisfaction = merged_df.groupby('week_year')['satisfaction'].mean().reset_index()

# Adding more features for analysis: Average time to delivery and order value per week
weekly_features = merged_df.groupby('week_year').agg({
    'time_to_delivery': 'mean',
    'order_value': 'mean',
    'satisfaction': 'mean'
}).reset_index()

# Additional feature: Number of Late Deliveries per week
weekly_late_deliveries = merged_df.groupby('week_year')['late_delivery'].sum()

# Normalize the additional features to compare them on the same scale as satisfaction scores
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()

weekly_features_scaled = weekly_features.copy()
weekly_features_scaled[['time_to_delivery', 'order_value', 'satisfaction']] = scaler.fit_transform(
    weekly_features[['time_to_delivery', 'order_value', 'satisfaction']]
)

# Adding normalized number of late deliveries
weekly_features_scaled['late_deliveries_normalized'] = scaler.fit_transform(weekly_late_deliveries.values.reshape(-1, 1))

# Time to Delivery and Satisfaction Score
plt.figure(figsize=(14, 5))
plt.plot(weekly_features_scaled['week_year'], weekly_features_scaled['satisfaction'], label='Satisfaction Score', color='red', marker='o')
plt.plot(weekly_features_scaled['week_year'], weekly_features_scaled['time_to_delivery'], label='Time to Delivery', color='blue', linestyle='--')
plt.title('Time to Delivery vs. Satisfaction Score Over Weeks')
plt.xlabel('Week of the Year')
plt.ylabel('Normalized Scores and Values')
plt.legend()
plt.xticks(rotation=90)
plt.tight_layout()
plt.show()

Time to Delivery vs. Satisfaction Score: This graph plots normalized weekly delivery times against normalized weekly satisfaction, making it easier to spot weeks where longer or shorter delivery times coincide with changes in satisfaction.

In [None]:
from scipy.stats import linregress

# Calculating the linear regression between Time to Delivery and Satisfaction Score
slope, intercept, r_value, p_value, std_err = linregress(weekly_features_scaled['time_to_delivery'], weekly_features_scaled['satisfaction'])

# Calculate the best fit line
line = slope * weekly_features_scaled['time_to_delivery'] + intercept

# Scatter Plot with Best Fit Line
plt.figure(figsize=(14, 8))
plt.scatter(weekly_features_scaled['time_to_delivery'], weekly_features_scaled['satisfaction'], color='blue', alpha=0.6)
plt.plot(weekly_features_scaled['time_to_delivery'], line, color='red', label=f'Best Fit Line\nR={r_value:.2f}, R²={r_value**2:.2f}')

plt.title('Time to Delivery vs. Satisfaction Score with Best Fit Line')
plt.xlabel('Normalized Time to Delivery')
plt.ylabel('Normalized Satisfaction Score')
plt.legend()
plt.grid(True)
plt.show()


Correlation Score (R): The correlation coefficient of -0.79 indicates a strong negative relationship between Time to Delivery and Satisfaction Score. As the time to delivery increases, the satisfaction score tends to decrease, and vice versa.

R² Value: The R² value of 0.62 means that approximately 62% of the variability in the weekly satisfaction scores can be explained by the variability in the weekly time to delivery. This is a substantial proportion, and it points to a strong association between delivery time and customer satisfaction at the weekly level.
These scores underscore the importance of efficient delivery processes as a likely driver of customer satisfaction. Efforts to reduce delivery times could therefore be expected to have a positive effect on overall satisfaction scores.
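
Because the regression above is run on min-max scaled values, it is worth confirming that the scaling does not affect R. The sketch below uses synthetic data (generated with an arbitrary seed, not the notebook's weekly features) to show that the Pearson correlation is the same before and after min-max scaling, and that R² is simply its square.

```python
import numpy as np

# Synthetic weekly values (illustrative only): longer delivery, lower satisfaction.
rng = np.random.default_rng(0)
x = rng.normal(12, 3, size=60)                       # delivery time in days
y = 0.9 - 0.02 * x + rng.normal(0, 0.03, size=60)    # satisfaction rate

def min_max(a):
    """Scale an array linearly to the [0, 1] range."""
    return (a - a.min()) / (a.max() - a.min())

r_raw = np.corrcoef(x, y)[0, 1]
r_scaled = np.corrcoef(min_max(x), min_max(y))[0, 1]
r_squared = r_raw ** 2

print(f"r_raw={r_raw:.4f}, r_scaled={r_scaled:.4f}, R^2={r_squared:.4f}")
```

The scaling is therefore only a plotting convenience. Note also that a correlation computed on weekly averages describes weeks, not individual orders, so the order-level relationship may be weaker.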



#### Machine Learning

In [None]:
# Identify the 10 features most strongly correlated with 'satisfaction'
# Select only the numeric columns for correlation calculation
numeric_cols = merged_df.select_dtypes(include=[np.number])

# Compute the correlation matrix for numeric columns only
corr_matrix = numeric_cols.corr()

# Print the top 10 correlation scores, ranked by absolute value so that strong
# negative correlations are kept; 'satisfaction' itself is excluded
satisfaction_corr = corr_matrix['satisfaction'].drop('satisfaction')
print(satisfaction_corr.sort_values(key=lambda s: s.abs(), ascending=False).head(10))

In [None]:
# Set the correlation threshold
threshold = 0.05

# Get the features whose absolute correlation with 'satisfaction' exceeds the threshold
high_corr_features = corr_matrix.index[(corr_matrix['satisfaction'].abs() > threshold) & (corr_matrix.index != 'satisfaction')].tolist()

# Print the highly correlated features
print(high_corr_features)

# Check the data types of the selected features
merged_df[high_corr_features].dtypes



In [None]:
# Use the four selected features for modeling
top_4_features = ['payment_value', 'time_to_delivery', 'estimated_vs_actual_shipping', 'late_delivery']
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
import xgboost as xgb
from sklearn.metrics import classification_report, confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns

# Broader candidate feature list (seven entries despite the name); the models below are trained on top_4_features
top_6_features = ['estimated_vs_actual_shipping', 'order_month', 'order_hour', 'price', 'payment_sequential', 'order_value', 'payment_installments']
X = merged_df[top_4_features]
y = merged_df['satisfaction']

# Splitting the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Pipeline for numerical features
numerical_transformer = Pipeline(steps=[
    ('scaler', StandardScaler())
])

# Applying ColumnTransformer to preprocess the data
preprocessor = ColumnTransformer(
    transformers=[('num', numerical_transformer, top_4_features)]
)

# Preprocessing the data
X_train_preprocessed = preprocessor.fit_transform(X_train)
X_test_preprocessed = preprocessor.transform(X_test)

# Initialize models
models = {
    'Logistic Regression': LogisticRegression(random_state=42),
    'Decision Tree': DecisionTreeClassifier(random_state=42, max_depth=10, min_samples_split=50),
    'Random Forest': RandomForestClassifier(random_state=42, n_estimators=100, max_depth=10, min_samples_split=10, min_samples_leaf=4),
    'XGBoost': xgb.XGBClassifier(random_state=42)
}

# Function to fit models, make predictions, and evaluate them
def evaluate_model(model, X_train, y_train, X_test, y_test, model_name):
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    conf_matrix = confusion_matrix(y_test, y_pred)
    class_report = classification_report(y_test, y_pred)
    
    # Plotting confusion matrix
    plt.figure(figsize=(6, 5))
    sns.heatmap(conf_matrix, annot=True, fmt='d', cmap='Blues', cbar=False)
    plt.xlabel('Predicted')
    plt.ylabel('Actual')
    plt.title(f'{model_name} Confusion Matrix')
    plt.show()
    
    print(f"{model_name} Classification Report:")
    print(class_report)

# Evaluate each model
for model_name, model in models.items():
    print(f"Evaluating {model_name}")
    evaluate_model(model, X_train_preprocessed, y_train, X_test_preprocessed, y_test, model_name)
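
The methodology calls for an ROC curve, which the evaluation above does not yet produce. The following is a minimal sketch of how one could be computed, using synthetic data from `make_classification` and a logistic regression as stand-ins (both are assumptions for illustration; the `_demo` names are chosen to avoid overwriting the notebook's variables).

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced binary problem (about 75% positives), for illustration only.
X_demo, y_demo = make_classification(
    n_samples=500, n_features=4, n_informative=3, n_redundant=1,
    n_clusters_per_class=1, weights=[0.25, 0.75], random_state=42,
)
# stratify keeps the class ratio the same in the train and test splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
# ROC analysis needs scores or probabilities, not hard class predictions.
proba = clf.predict_proba(X_te)[:, 1]

fpr, tpr, thresholds = roc_curve(y_te, proba)
auc = roc_auc_score(y_te, proba)
print(f"ROC AUC: {auc:.3f}")
```

Plotting `fpr` against `tpr` gives the curve itself. The same calls should work with any of the fitted models above that expose `predict_proba`.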



In [None]:
from sklearn.model_selection import GridSearchCV

# Initialize the XGBoost Classifier
xgb_model = xgb.XGBClassifier(random_state=42)

# Define the parameter grid to search
param_grid = {
    'n_estimators': [50, 100, 150],  # Fewer trees to keep the model simpler
    'max_depth': [3, 4, 5],          # Shallow trees to prevent overfitting
    'learning_rate': [0.1, 0.01, 0.05],
    'subsample': [0.8, 1.0],
    'colsample_bytree': [0.8, 1.0]
}

# Initialize GridSearchCV
grid_search = GridSearchCV(
    estimator=xgb_model,
    param_grid=param_grid,
    scoring='accuracy',  # accuracy can mislead on imbalanced classes; 'f1' or 'roc_auc' are alternatives
    cv=3,
    verbose=1,
    n_jobs=-1
)

# Fit GridSearchCV
grid_search.fit(X_train_preprocessed, y_train)

# Retrieve the best model
best_model = grid_search.best_estimator_

# Predictions
train_preds = best_model.predict(X_train_preprocessed)
test_preds = best_model.predict(X_test_preprocessed)


In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns

# Initializing models
log_reg = LogisticRegression(random_state=42)
decision_tree = DecisionTreeClassifier(random_state=42)
random_forest = RandomForestClassifier(random_state=42)

# A function to fit models, make predictions, and evaluate them
def evaluate_model(model, X_train, y_train, X_test, y_test):
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    conf_matrix = confusion_matrix(y_test, y_pred)
    class_report = classification_report(y_test, y_pred)
    
    # Plotting confusion matrix
    plt.figure(figsize=(6, 5))
    sns.heatmap(conf_matrix, annot=True, fmt='d', cmap='Blues', cbar=False)
    plt.xlabel('Predicted')
    plt.ylabel('Actual')
    plt.title(f'{model.__class__.__name__} Confusion Matrix')
    plt.show()
    
    print(f"{model.__class__.__name__} Classification Report:")
    print(class_report)
    return model

# Evaluating Logistic Regression
evaluate_model(log_reg, X_train_preprocessed, y_train, X_test_preprocessed, y_test)

# Evaluating Decision Tree
evaluate_model(decision_tree, X_train_preprocessed, y_train, X_test_preprocessed, y_test)

# Evaluating Random Forest
evaluate_model(random_forest, X_train_preprocessed, y_train, X_test_preprocessed, y_test)


In [None]:
# Function to plot confusion matrix
def plot_confusion_matrix(true_values, predictions, set_name):
    matrix = confusion_matrix(true_values, predictions)
    plt.figure(figsize=(6,5))
    sns.heatmap(matrix, annot=True, fmt='d', cmap='Blues', cbar=False)
    plt.xlabel('Predicted')
    plt.ylabel('Actual')
    plt.title(f'{set_name} Confusion Matrix')
    plt.show()

# Function to print classification report
def print_classification_report(true_values, predictions, set_name):
    report = classification_report(true_values, predictions)
    print(f"{set_name} Classification Report:")
    print(report)

# Visualize and print reports for both sets
plot_confusion_matrix(y_train, train_preds, "Training")
print_classification_report(y_train, train_preds, "Training")

plot_confusion_matrix(y_test, test_preds, "Testing")
print_classification_report(y_test, test_preds, "Testing")

# Print best parameters
print("Best Parameters:", grid_search.best_params_)
print("Best Score:", grid_search.best_score_)
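
The classification reports above depend on a fixed decision threshold. As a complement, the sketch below computes ROC AUC (threshold-free) alongside macro F1 at a chosen threshold. It is an illustration only: the helper `summarize_binary_scores` and the toy labels and scores are hypothetical and do not come from the project data.

```python
from sklearn.metrics import f1_score, roc_auc_score, roc_curve


def summarize_binary_scores(y_true, y_score, threshold=0.5):
    """Return threshold-free and thresholded metrics for a binary classifier.

    y_score holds the predicted probability of the positive class (label 1).
    """
    # Turn probabilities into hard labels at the chosen threshold
    y_pred = [1 if score >= threshold else 0 for score in y_score]
    # Points of the ROC curve, ready to be plotted as tpr against fpr
    fpr, tpr, _ = roc_curve(y_true, y_score)
    return {
        "roc_auc": roc_auc_score(y_true, y_score),
        "f1_macro": f1_score(y_true, y_pred, average="macro"),
        "fpr": fpr,
        "tpr": tpr,
    }


# Toy labels and scores, made up for illustration only
toy_true = [0, 0, 1, 1]
toy_score = [0.1, 0.4, 0.35, 0.8]
summary = summarize_binary_scores(toy_true, toy_score)
print(round(summary["roc_auc"], 3), round(summary["f1_macro"], 3))
```

With the tuned model, the scores would presumably come from `best_model.predict_proba(X_test_preprocessed)[:, 1]`, and `summary["fpr"]` and `summary["tpr"]` could be passed to `plt.plot` to draw the ROC curve.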


In [None]:
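# Create the final pipeline using the best hyperparameters from the grid search.
# xgb_model itself is reused (with the tuned parameters applied) so that later
# cells that reference xgb_model work with the estimator fitted by this pipeline.
xgb_model.set_params(**grid_search.best_params_)
final_pipeline = Pipeline(steps=[
    ('preprocessing', preprocessor),
    ('classifier', xgb_model)
])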

final_pipeline

In [None]:
# Fit the pipeline to your data
final_pipeline.fit(X_train, y_train)

In [None]:
X_train.columns

In [None]:
X_train.describe()

In [None]:
# Save the fitted pipeline as a .pkl file
import joblib

joblib.dump(final_pipeline, 'final_ECommerce_model.pkl')
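
Before relying on the saved file in an app, it can be worth checking that a reloaded pipeline predicts exactly like the in-memory one. The sketch below shows one way to do that on a toy pipeline. Everything in it is an assumption for illustration: the data is made up, `StandardScaler` and `LogisticRegression` stand in for the real preprocessor and XGBoost classifier, and the file is written to a temporary directory rather than to `final_ECommerce_model.pkl`.

```python
import os
import tempfile

import joblib
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy stand-in for the real training data (values are made up)
toy_X = pd.DataFrame({
    "time_to_delivery": [3, 5, 8, 20, 35, 50],
    "payment_value": [50.0, 80.0, 120.0, 60.0, 200.0, 90.0],
})
toy_y = [1, 1, 1, 0, 0, 0]

toy_pipeline = Pipeline(steps=[
    ("preprocessing", StandardScaler()),
    ("classifier", LogisticRegression()),
])
toy_pipeline.fit(toy_X, toy_y)

# Save to a temporary directory, then reload from disk
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_pipeline.pkl")
    joblib.dump(toy_pipeline, path)
    reloaded = joblib.load(path)

# The reloaded pipeline should reproduce the original predictions exactly
round_trip_ok = np.array_equal(toy_pipeline.predict(toy_X), reloaded.predict(toy_X))
print("Round trip OK:", round_trip_ok)
```

The same comparison could be run on `final_pipeline` and a sample of `X_test` after the `joblib.dump` call above.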


In [None]:
model = joblib.load('final_ECommerce_model.pkl')

import random

class SatisfactionFinder:
    def __init__(self, model, preprocessor, features, trials=50):
        self.model = model
        self.preprocessor = preprocessor
        self.features = features
        self.trials = trials

    def random_input(self):
        """Generate a random input within plausible ranges for each feature."""
        ranges = {
            'estimated_vs_actual_shipping': (-189, 146),
            'time_to_delivery': (-7, 208),
            'payment_value': (0.0, 13664.08),
            'order_item_id': (1.0, 21.0),
            'late_delivery': (0, 1)  # Binary feature
        }

        # Float bounds draw a continuous value; integer bounds draw a whole number
        return {feature: random.uniform(*ranges[feature]) if isinstance(ranges[feature][0], float)
                else random.randint(*ranges[feature]) for feature in self.features}

    def find_not_satisfied(self):
        """Loop to find a set of values that predict 'Not Satisfied'."""
        for _ in range(self.trials):
            # Generate random input
            user_data = self.random_input()

            # Convert to DataFrame
            input_df = pd.DataFrame([user_data])

            # Preprocess and predict
            input_preprocessed = self.preprocessor.transform(input_df)
            prediction = self.model.predict(input_preprocessed)

            # Check if prediction is 'Not Satisfied'
            if prediction[0] == 0:
                return user_data, "Not Satisfied"

        return None, "Not found"

# Reuse the classifier and preprocessor stored in the loaded pipeline
finder = SatisfactionFinder(
    model.named_steps['classifier'],
    model.named_steps['preprocessing'],
    ['estimated_vs_actual_shipping', 'time_to_delivery', 'payment_value', 'late_delivery'],
    trials=200
)

# Find a 'Not Satisfied' prediction
user_data, result = finder.find_not_satisfied()

print("User Data:", user_data)
print("Result:", result)

In [None]:
model.predict(pd.DataFrame([{
    'estimated_vs_actual_shipping': 130,
    'time_to_delivery': 133, 
    'payment_value': 9591,
    'late_delivery': 1 
}], dtype=float))

In [None]:
model.predict(pd.DataFrame([{
    'estimated_vs_actual_shipping': 5,
    'time_to_delivery': 7,
    'payment_value': 300,
    'late_delivery': 0 
}], dtype=float))

#### Building the app

In [None]:
%%writefile ECB.py

import streamlit as st
import joblib
import pandas as pd

# Load the trained pipeline
model = joblib.load('final_ECommerce_model.pkl')

# Define the structure of the app
def main():
    st.title('Customer Satisfaction Prediction App')

    # Define inputs with ranges and default values based on the data
    estimated_vs_actual_shipping = st.number_input('Estimated vs Actual Shipping Days', min_value=-189, max_value=146, value=11)
    time_to_delivery = st.number_input('Time to Delivery', min_value=-7, max_value=208, value=9)
    payment_value = st.number_input('Payment Value', min_value=0.0, max_value=13664.08, value=107.78)
    late_delivery = st.number_input('Late Delivery', min_value=0, max_value=1, value=0)

    # Prediction button
    if st.button('Predict Satisfaction'):
        # Build a one-row DataFrame so the pipeline receives the same
        # column names it was trained on (as in the notebook cells above)
        input_data = pd.DataFrame([{
            'estimated_vs_actual_shipping': estimated_vs_actual_shipping,
            'time_to_delivery': time_to_delivery,
            'payment_value': payment_value,
            'late_delivery': late_delivery
        }])

        # Get the prediction
        prediction = model.predict(input_data)

        # Output the prediction
        if prediction[0] == 1:
            st.success('The customer is satisfied.')
        else:
            st.error('The customer is not satisfied.')

if __name__ == '__main__':
    main()

In [None]:
import ipywidgets as widgets
from IPython.display import display, HTML
import joblib
import pandas as pd

# Custom CSS to increase the font size and prevent collapsing
style = """
<style>
.widget-label { min-width: 25ex !important; }
.widget-label p { font-size: 16px !important; }
.slider-width { width: 100% !important; } /* Adjust the width as needed */
</style>
"""

# Display the custom CSS
display(HTML(style))

# Load your trained model
model = joblib.load('final_ECommerce_model.pkl')

# Define layout for the sliders
slider_layout = widgets.Layout(width='500px')  # Adjust the width as needed

# Create input widgets for user input with updated ranges and types
estimated_vs_actual_shipping = widgets.IntSlider(
    value=11, min=-189, max=146, step=1,
    description='Estimated vs Actual Shipping Days:',
    style={'description_width': 'initial'},  # Prevent collapsing
    layout=slider_layout
)

time_to_delivery = widgets.IntSlider(
    value=9, min=-7, max=208, step=1,
    description='Time to Delivery:',
    style={'description_width': 'initial'},  # Prevent collapsing
    layout=slider_layout
)

payment_value = widgets.FloatSlider(
    value=107.78, min=0.0, max=13664.08, step=0.01,
    description='Payment Value:',
    style={'description_width': 'initial'},  # Prevent collapsing
    layout=slider_layout
)

late_delivery = widgets.IntSlider(
    value=0, min=0, max=1, step=1,
    description='Late Delivery:',
    style={'description_width': 'initial'},  # Prevent collapsing
    layout=slider_layout
)

# Create a button widget for making predictions
predict_button = widgets.Button(description='Predict Satisfaction')

# Define a function to make predictions and display the result
def predict_satisfaction(b):
    # Collect values from widgets and create a DataFrame for prediction
    user_input = pd.DataFrame({
        'estimated_vs_actual_shipping': [estimated_vs_actual_shipping.value],
        'time_to_delivery': [time_to_delivery.value],
        'payment_value': [payment_value.value],
        'late_delivery': [late_delivery.value]
    })

    # Predict using the model
    prediction = model.predict(user_input)
    
    # Update the result label based on the prediction
    if prediction[0] == 1:
        result_label.value = 'The customer is satisfied.'
    else:
        result_label.value = 'The customer is not satisfied.'

# Attach the predict_satisfaction function to the button's click event
predict_button.on_click(predict_satisfaction)

# Create a label widget to display the prediction result
result_label = widgets.Label()

# Display the input widgets and the result label
input_widgets = [
    estimated_vs_actual_shipping,
    time_to_delivery,
    payment_value,
    late_delivery,
    predict_button,
    result_label  # This should also be included in the list to be displayed
]

for widget in input_widgets:
    display(widget)

### Types of customer behavior
1. **Complex:** occurs when customers invest significant time and effort in evaluating products before making a purchase. High-involvement products, such as cars or expensive electronics, often trigger this type of behavior.
2. **Dissonance-reducing:** takes place when customers experience post-purchase anxiety or uncertainty about their decision. This can arise when consumers feel that they had to decide quickly, without enough time to weigh the pros and cons, or when their choice was based on limited information.
3. **Habitual buying:** characterized by consumers relying on routines and habits when making purchasing decisions. This behavior is common in low-involvement product categories, such as groceries or personal care items, where consumers are less inclined to research products extensively before purchase.
4. **Variety seeking:** arises when customers actively seek new experiences, products, or brands, even if they are satisfied with their current choices. This behavior typically occurs in categories where products are low-involvement, low-cost commodities and consumers feel minimal risk in trying new options.

# Market Insights

# Customer Segmentation