# <a id="top"></a>  📊 DM2425_ABCDEats
Authors:<br><br>
Student Name - Gonçalo Custódio<br>
- Student id - 20211643<br>
- Contact e-mail - 20211643@novaims.unl.pt<br>
  
Student Name - Diogo Correia<br>
- Student id - 20211586<br>
- Contact e-mail - 20211586@novaims.unl.pt<br>
  
Student Name - João Santos<br>
- Student id - 20211691<br>
- Contact e-mail - 20211691@novaims.unl.pt<br>
  
Student Name - Nuno Bernardino<br>
- Student id - 20211546<br>
- Contact e-mail - 20211546@novaims.unl.pt<br>

## Index

1. [Exploration of the Dataset](#1-exploration-of-the-dataset)   

2. [Preprocessing](#2-preprocessing)  
   2.1 [Numeric and Categorical Variables](#2.1-numeric-and-categorical-variables)  
   2.2 [Duplicated Values Treatment](#2.2-duplicated-values-treatment)  
   2.3 [Missing Values Treatment](#2.3-missing-values-treatment)  

3. [Identify Trends, Patterns, or Anomalies](#3-identify-trends-patterns-or-anomalies)  
   3.1 [Descriptive Statistics](#3.1-descriptive-statistics)  
   3.2 [Visualization](#3.2-visualization)  
       3.2.1 [Histograms](#3.2.1-histograms)  
       3.2.2 [Boxplots](#3.2.2-boxplots)  
       3.2.3 [Categorical Variables](#3.2.3-categorical-variables)  
       3.2.4 [Correlation Analysis](#3.2.4-correlation-analysis)  
   3.3 [Trend Analysis](#3.3-trend-analysis)  

4. [Outliers](#4-outliers)  
   4.1 [Outliers Removal](#4.1-outliers-removal)  
   4.2 [Missing Values Treatment after Outlier Removal](#4.2-missing-values-treatment-after-outlier-removal)  
   4.3 [Outliers Check After Treatment](#4.3-outliers-check-after-treatment)  

5. [Create New Features](#5-create-new-features)  
   5.1 [Visualizations of New Features](#5.1-visualizations-of-new-features)  
   5.2 [Impute Missing Values of New Features (KNN)](#5.2-impute-missing-values-of-new-features-knn)  

6. [Feature Selection](#6-feature-selection)  
   6.1 [Correlation with New Features](#6.1-correlation-with-new-features)  
   6.2 [Cramer V (Categorical Features)](#6.2-cramer-v-categorical-features)  

7. [Variance Evaluation](#7-variance-evaluation)  

8. [Scaling Numerical Data (MinMaxScaler)](#8-scaling-numerical-data-minmaxscaler)  

9. [Export](#9-export)

# Imports

In [147]:
#!pip install kmodes
#!pip install somoclu
#!pip install minisom

In [148]:
import numpy as np 
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.feature_selection import VarianceThreshold
from sklearn.preprocessing import MinMaxScaler
from sklearn.impute import KNNImputer
from scipy.stats import chi2_contingency
from scipy.stats import zscore

**Read the Dataset**

In [149]:
data = pd.read_excel("DM2425_ABCDEats_DATASET.xlsx", sheet_name="DM2425_ABCDEats_DATASET")
data_copy = data.copy()

**Set the index to the customer_id column**

In [150]:
data.set_index('customer_id', inplace=True)

# 1. Exploration of the Dataset
[⬆️ Back to Top](#top)

**Initial Analysis**

To start the exploration, we use `data.info()` to get an overview of the dataset: the number of entries, the column names, the non-null counts, and the data type of each variable. This summary lets us spot missing values and potential data type issues, and gives a high-level view of the dataset's structure before further analysis.



In [None]:
data.info()

**Check data types of our variables**

In this step, we use the `data.dtypes` attribute to examine the data type of each variable. This confirms whether the variables are appropriately typed (e.g., integers, floats, objects) and helps us spot inconsistent or unexpected types that need adjustment. Knowing the data types at this stage guides the choice of preprocessing and analysis techniques for each variable.

In [None]:
data.dtypes

**Summary Statistics**

In [None]:
data.describe(include='all').transpose()

**Check Missing Values**

In [None]:
data.isnull().sum()

**Total Orders per Hour**

In [None]:
hour_columns = [f'HR_{i}' for i in range(24)]
hourly_order_volume = data[hour_columns].sum()

plt.figure(figsize=(10, 6))
sns.barplot(x=hourly_order_volume.index, y=hourly_order_volume.values, color='orange')
plt.title("Volume of Total Orders per Hour", fontsize=16)
plt.xlabel("Hours", fontsize=12)
plt.ylabel("Orders", fontsize=12)
plt.xticks(rotation=45)
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.tight_layout()
plt.show()

# 2. Preprocessing
[⬆️ Back to Top](#top)

**The variables `last_promo`, `payment_method` and `customer_region` are loaded as type `object`. Since each of them holds a limited set of labels, we convert them to `category` to facilitate future analysis:**

## 2.1 Numeric and Categorical Variables

In [94]:
data['last_promo'] = data['last_promo'].astype('category')
data['payment_method'] = data['payment_method'].astype('category')
data['customer_region'] = data['customer_region'].astype('category')
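# customer_age keeps its missing values as NaN here so that the KNN imputer in 2.3 can fill them;
# filling with 0 at this point would turn missing ages into real-looking ages of 0.
data['customer_age'] = data['customer_age'].astype('float64')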
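
As a small illustration of why the `category` type helps, the sketch below converts a made-up label column (the values in `toy` are assumptions, not taken from the dataset) and compares the memory used before and after the conversion.

```python
import pandas as pd

# Hypothetical label column with only three distinct values
toy = pd.Series(["CARD", "CASH", "CARD", "DIGI", "CARD"] * 1000)
as_cat = toy.astype("category")

# A categorical stores each distinct label once plus a small integer code per row
obj_bytes = toy.memory_usage(deep=True)
cat_bytes = as_cat.memory_usage(deep=True)

print(list(as_cat.cat.categories))
print(f"original: {obj_bytes} bytes, category: {cat_bytes} bytes")
```

Besides the memory saving, the explicit list of categories makes it easy to check for unexpected labels.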

**We will divide the variables into lists for categorical and numerical variables to facilitate future interactions:**

In [95]:
category_var = ['customer_region', 'last_promo', 'payment_method']

In [96]:
number_var = ['customer_age', 'vendor_count', 'product_count', 'is_chain', 'first_order', 'last_order', 
              'CUI_American', 'CUI_Asian', 'CUI_Beverages', 'CUI_Cafe', 'CUI_Chicken Dishes', 'CUI_Chinese', 
              'CUI_Desserts', 'CUI_Healthy', 'CUI_Indian', 'CUI_Italian', 'CUI_Japanese', 'CUI_Noodle Dishes', 
              'CUI_OTHER', 'CUI_Street Food / Snacks', 'CUI_Thai', 'DOW_0', 'DOW_1', 'DOW_2', 'DOW_3', 'DOW_4', 
              'DOW_5', 'DOW_6', 'HR_0', 'HR_1', 'HR_2', 'HR_3', 'HR_4', 'HR_5', 'HR_6', 'HR_7', 'HR_8', 'HR_9', 
              'HR_10', 'HR_11', 'HR_12', 'HR_13', 'HR_14', 'HR_15', 'HR_16', 'HR_17', 'HR_18', 'HR_19', 'HR_20', 
              'HR_21', 'HR_22', 'HR_23']

In [None]:
print("Categorical Variable Types:")
print(data[category_var].dtypes)
print("\nNumerical Variable Types:")
print(data[number_var].dtypes)

## 2.2 Duplicated Values Treatment

In [None]:
duplicates = data[data.index.duplicated(keep=False)]

if not duplicates.empty:
    print(duplicates)

In [None]:
data = data[~data.index.duplicated(keep='first')]

print(f"New number of lines in the dataset: {len(data)}")
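
The difference between `keep=False` (used to inspect the duplicates) and `keep='first'` (used to drop them) can be seen in a minimal sketch. The `toy` frame and its IDs are made up for illustration.

```python
import pandas as pd

# Hypothetical frame in which the ID "a1" appears twice
toy = pd.DataFrame(
    {"product_count": [1, 2, 3]},
    index=pd.Index(["a1", "b2", "a1"], name="customer_id"),
)

# keep=False marks every occurrence of a repeated ID, which is useful for inspection
all_dupes = toy[toy.index.duplicated(keep=False)]

# keep='first' marks only the later occurrences, so the first row of each ID survives
deduped = toy[~toy.index.duplicated(keep="first")]

print(all_dupes)
print(deduped)
```

Note that this check looks only at the index: two rows with the same ID but different values are treated as duplicates, and the later one is dropped.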

## 2.3 Missing Values Treatment

In [None]:
data.isnull().sum()

In [None]:
knn_imputer = KNNImputer(n_neighbors=5)
data[number_var] = knn_imputer.fit_transform(data[number_var])
print(data.isnull().sum())
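
To illustrate what the KNN imputation does, here is a minimal sketch on a made-up frame (the `toy` data and `n_neighbors=2` are assumptions chosen to keep the example small). A missing entry is replaced by the mean of that feature over the nearest rows, where the distance is computed on the features both rows have observed.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Hypothetical data: one missing age, everything else observed
toy = pd.DataFrame({
    "customer_age": [20.0, 22.0, np.nan, 40.0],
    "product_count": [1.0, 2.0, 2.0, 10.0],
})

imputer = KNNImputer(n_neighbors=2)
toy_imputed = pd.DataFrame(
    imputer.fit_transform(toy), columns=toy.columns, index=toy.index
)

# Row 2 is closest (on product_count) to rows 1 and 0,
# so its age becomes the mean of their ages
print(toy_imputed)
```

Because the distance is computed on raw values, features with large ranges weigh more in the neighbour search than features with small ranges.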

# 3. Identify Trends, Patterns, or Anomalies



[⬆️ Back to Top](#top)

## 3.1 Descriptive Statistics

In [None]:
data[number_var].describe()

In [None]:
for var in category_var:
    print(f"Distribution for {var}:")
    print(data[var].value_counts(normalize=True) * 100)
    print("\n")

## 3.2 Visualization


### 3.2.1 Histograms

In [None]:
data[number_var].hist(figsize=(15, 10), bins=20)

### 3.2.2 Boxplots

In [None]:
other_features = ['customer_region', 'customer_age', 'vendor_count', 'product_count','is_chain', 'first_order', 'last_order', 'last_promo', 'payment_method']

cuisine_columns = [col for col in data.columns if col.startswith('CUI_')]
dow_columns = [col for col in data.columns if col.startswith('DOW_')]
hr_columns = [col for col in data.columns if col.startswith('HR_')]

groups = {
    'Customer Region & Related Features': other_features,
    'Customer Age & Vendor Count': ['customer_age', 'vendor_count'],
    'Cuisines': cuisine_columns,
    'Days of the Week': dow_columns,
    'Hours of the Day': hr_columns
}

for group_name, columns in groups.items():
    plt.figure(figsize=(12, 6))
    data[columns].boxplot(vert=False, grid=False)
    plt.title(f'Boxplots for {group_name}')
    plt.xlabel('Value')
    plt.ylabel('Features')
    plt.show()

### 3.2.3 Categorical Variables

In [None]:
for var in category_var:
    data[var].value_counts().plot(kind='bar', title=f"Distribution of {var}")
    plt.show()

### 3.2.4 Correlation Analysis

**1. Correlation between numeric features**

In [None]:
plt.figure(figsize=(15, 10))
corr = data[number_var].corr()
# Show only the cells with |correlation| > 0.8; everything else is masked
sns.heatmap(corr, annot=True, cmap='coolwarm', mask=corr.abs() <= 0.8)
plt.title("Heatmap of High Correlations (|correlation| > 0.8)")
plt.show()

**2. Table Correlation**

In [None]:
correlation_matrix = data[number_var].corr()

high_corr_pairs = []

for i in range(len(correlation_matrix.columns)):
    for j in range(i):
        if abs(correlation_matrix.iloc[i, j]) >= 0.8:  # covers strong positive and strong negative correlations
            high_corr_pairs.append((correlation_matrix.columns[i], correlation_matrix.columns[j], correlation_matrix.iloc[i, j]))

high_corr_df = pd.DataFrame(high_corr_pairs, columns=['Variable 1', 'Variable 2', 'Correlation'])

high_corr_df
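
The same table can also be built without nested loops by masking the lower triangle of the correlation matrix. The sketch below is one possible way to do it; the helper name `high_corr_table` and the `toy` data are assumptions for illustration.

```python
import numpy as np
import pandas as pd

def high_corr_table(df, threshold=0.8):
    """Return column pairs whose absolute Pearson correlation is >= threshold."""
    corr = df.corr()
    # Keep only the strictly lower triangle: each pair once, no diagonal
    lower = np.tril(np.ones(corr.shape, dtype=bool), k=-1)
    pairs = corr.where(lower).stack().reset_index()
    pairs.columns = ["Variable 1", "Variable 2", "Correlation"]
    # NaN cells (masked entries) never satisfy the comparison, so they are dropped here
    return pairs[pairs["Correlation"].abs() >= threshold].reset_index(drop=True)

toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 10],  # perfectly correlated with a
    "c": [5, 1, 4, 2, 3],   # only weakly related to a and b
})
result = high_corr_table(toy)
print(result)
```

Taking the absolute value once is enough to catch both strong positive and strong negative correlations.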

**3. Correlation between categorical features (Cramér's V)**

In [None]:
# Ensure data has all these columns
assert all(col in data.columns for col in category_var), "Some columns are missing in data"

def cramers_v(x, y):
    contingency_matrix = pd.crosstab(x, y)
    chi2 = chi2_contingency(contingency_matrix)[0]
    n = contingency_matrix.sum().sum()
    r, k = contingency_matrix.shape
    return np.sqrt(chi2 / (n * (min(r, k) - 1)))

correlation_matrix = pd.DataFrame(index=category_var, columns=category_var)

for col1 in category_var:
    for col2 in category_var:
        if col1 == col2:
            correlation_matrix.loc[col1, col2] = 1.0
        else:
            # Labels are passed as they are; rounding does not apply to categorical values
            correlation_matrix.loc[col1, col2] = cramers_v(data[col1], data[col2])

correlation_matrix = correlation_matrix.astype(float)

plt.figure(figsize=(15, 8))
sns.heatmap(correlation_matrix,annot=True, fmt=".2f",cmap="coolwarm",vmin=0, vmax=1,cbar=True,)
plt.xticks(rotation=45, ha='right')
plt.yticks(rotation=0)
plt.tight_layout()
plt.show()
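
Cramér's V rescales the chi-square statistic to the range 0 to 1: V = sqrt(chi2 / (n * (min(r, k) - 1))), where n is the number of observations and r, k are the dimensions of the contingency table. The sketch below checks the two extremes on made-up labels. The function name `cramers_v_sketch` and the data are assumptions, and unlike the cell above it passes `correction=False` so that SciPy's continuity correction for 2x2 tables does not shift the result.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

def cramers_v_sketch(x, y):
    table = pd.crosstab(x, y)
    # correction=False disables Yates' continuity correction (applied to 2x2 tables by default)
    chi2 = chi2_contingency(table, correction=False)[0]
    n = table.to_numpy().sum()
    r, k = table.shape
    return np.sqrt(chi2 / (n * (min(r, k) - 1)))

region = pd.Series(["north", "north", "south", "south"] * 5)
same = region.copy()                      # identical labels: perfect association
indep = pd.Series(["card", "cash"] * 10)  # spread evenly across both regions

v_same = cramers_v_sketch(region, same)
v_indep = cramers_v_sketch(region, indep)
print(v_same, v_indep)
```

Values close to 1 indicate that one variable almost determines the other, while values close to 0 indicate no association.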

## 3.3 Trend Analysis

In [None]:
dow_data = data[[f'DOW_{i}' for i in range(7)]].sum()
dow_data.plot(kind='bar', title="Orders by Day of the Week")
plt.show()

In [None]:
data[[col for col in number_var if 'CUI_' in col]].mean().plot(kind='bar')
plt.title("Average Spending by Cuisine Type")
plt.show()

# 4. Outliers
[⬆️ Back to Top](#top)

In [None]:
z_scores = data[number_var].apply(zscore)
outliers = (z_scores.abs() > 3).sum()
print("Outliers per variable:")
print(outliers)
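
As a quick sanity check of the z-score rule, the sketch below applies the same steps to a made-up column with a single extreme value (the `toy` data is an assumption). `scipy.stats.zscore` uses the population standard deviation (`ddof=0`) by default.

```python
import pandas as pd
from scipy.stats import zscore

# Hypothetical column: 29 ordinary customers and one extreme one
toy = pd.DataFrame({"product_count": [10.0] * 29 + [100.0]})

z = toy.apply(zscore)
flagged = (z.abs() > 3).sum()

print(z["product_count"].max())
print(flagged)
```

In very small samples a single extreme value inflates the standard deviation so much that its z-score may never exceed 3, so the rule is more reliable on columns with many observations.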

## 4.1 Outliers Removal

In [None]:
print(data.isnull().sum())

In [114]:
def outliers_removal(data, number_var):
    for col in number_var:
        Q1 = data[col].quantile(0.25)
        Q3 = data[col].quantile(0.75)
        IQR = Q3 - Q1
        lower_bound = Q1 - 1.5 * IQR
        upper_bound = Q3 + 1.5 * IQR
        data[col] = data[col].where((data[col] >= lower_bound) & (data[col] <= upper_bound), np.nan)
        
    return data

In [None]:
outliers_removal(data, number_var)
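
One behaviour of the IQR rule worth keeping in mind is what happens on columns that are mostly zero. The sketch below uses a made-up series (an assumption, not the real data): when at least 75% of the values are 0, both quartiles are 0, the bounds collapse to 0, and every non-zero value is replaced by NaN.

```python
import pandas as pd

def iqr_bounds(s):
    """Return the lower and upper 1.5 * IQR bounds of a numeric series."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Hypothetical sparse column: eight zeros and two real values
sparse = pd.Series([0, 0, 0, 0, 0, 0, 0, 0, 3, 7], dtype=float)

low, high = iqr_bounds(sparse)
masked = sparse.where(sparse.between(low, high))  # values outside the bounds become NaN

print(low, high)
print(masked.isna().sum())
```

Comparing the missing-value counts before and after the removal step shows which columns are affected in this way.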

In [None]:
print(data.isnull().sum())

## 4.2 Missing Values Treatment after Outlier Removal

In [None]:
data[number_var] = knn_imputer.fit_transform(data[number_var])
print(data.isnull().sum())

## 4.3 Outliers Check After Treatment

In [118]:
def detect_outliers_iqr(df, groups, missing_threshold=5):
    results = {'columns_with_outliers': [], 'outlier_counts': {}, 'bounds': {}}

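    # Note: this reassignment overrides the `groups` argument, so only the numeric groups below are plotted
    groups = {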
        'Cuisines': cuisine_columns,
        'Days of the Week': dow_columns,
        'Hours of the Day': hr_columns,
        'Other Features': ['customer_age', 'vendor_count', 'product_count', 'is_chain', 'first_order', 'last_order']
    }

    for group_name, columns in groups.items():
        num_cols = len(columns)
        cols_per_row = 4  # Number of columns per row in the grid
        rows = -(-num_cols // cols_per_row)  # Ceiling division for number of rows
        fig, axes = plt.subplots(rows, cols_per_row, figsize=(16, rows * 4))
        axes = axes.flatten()

        for i, column in enumerate(columns):
            if column in df.select_dtypes(include=[np.number]).columns:
                Q1 = df[column].quantile(0.25)
                Q3 = df[column].quantile(0.75)
                IQR = Q3 - Q1
                lower_bound = Q1 - 1.5 * IQR
                upper_bound = Q3 + 1.5 * IQR

                outlier_data = df[(df[column] < lower_bound) | (df[column] > upper_bound)]
                outlier_percent = len(outlier_data) / len(df) * 100

                results['bounds'][column] = {'lower_bound': lower_bound, 'upper_bound': upper_bound}
                results['outlier_counts'][column] = len(outlier_data)
                if outlier_percent > missing_threshold:
                    results['columns_with_outliers'].append(column)

                sns.boxplot(data=df, x=column, color='orange', ax=axes[i], showfliers=False)
                sns.stripplot(data=outlier_data, x=column, color='red', jitter=True, ax=axes[i])
                axes[i].set_title(f"{column}")

        for j in range(len(columns), len(axes)):
            fig.delaxes(axes[j])  # Remove unused subplots

        plt.suptitle(f"Boxplots with Outliers for {group_name}", fontsize=16)
        plt.tight_layout(rect=[0, 0, 1, 0.96])
        plt.show()

    print("Columns with more than {}% outliers:".format(missing_threshold))
    print(results['columns_with_outliers'])

    return results

In [None]:
outlier_results = detect_outliers_iqr(data, groups, missing_threshold=5)

In [None]:
print(data[['CUI_American', 'CUI_Asian']].describe())

**We won't remove these outliers, since we consider them reasonable values. Although the means of these features are close to zero, the mean is pulled down by the large number of zeros.**

# 5. Create New Features
[⬆️ Back to Top](#top)

In [121]:
data['order_activity_duration'] = data['last_order'] - data['first_order']
number_var.append("order_activity_duration")

In [122]:
data['order_frequency'] = data['product_count'] / (data['order_activity_duration'].replace(0, 1))
number_var.append("order_frequency")
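
The `replace(0, 1)` guard avoids a division by zero for customers whose first and last order fall on the same day. A minimal sketch with made-up values (the `toy` frame is an assumption) shows the effect:

```python
import pandas as pd

# Hypothetical customers; the second one ordered on a single day only
toy = pd.DataFrame({
    "first_order": [0, 10, 5],
    "last_order": [30, 10, 65],
    "product_count": [6, 3, 12],
})

toy["order_activity_duration"] = toy["last_order"] - toy["first_order"]
# A duration of 0 is treated as 1 day so the division stays finite
toy["order_frequency"] = toy["product_count"] / toy["order_activity_duration"].replace(0, 1)

print(toy)
```

For single-day customers the frequency equals `product_count`, which can be much larger than the values of customers who have been active for a long time.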

In [123]:
data['loyal_customer'] = data['vendor_count'] < data['product_count']
category_var.append("loyal_customer")

In [124]:
data['cuisine_diversity'] = (data[[col for col in number_var if col.startswith('CUI_')]] > 0).sum(axis=1)
number_var.append("cuisine_diversity")

In [125]:
data['favorite_cuisine'] = data[[col for col in number_var if col.startswith('CUI_')]].idxmax(axis=1)
category_var.append("favorite_cuisine")
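
The two cuisine features behave as follows on a made-up frame with two cuisine columns (the `toy` values are assumptions). The last row shows an edge case: when all values are tied, including all zeros, `idxmax` returns the first column.

```python
import pandas as pd

cui_cols = ["CUI_American", "CUI_Asian"]

# Hypothetical spending: one clear favourite, one tie, one customer with no spending
toy = pd.DataFrame({
    "CUI_American": [0.0, 5.0, 0.0],
    "CUI_Asian": [12.0, 5.0, 0.0],
})

toy["cuisine_diversity"] = (toy[cui_cols] > 0).sum(axis=1)  # number of cuisines with spending
toy["favorite_cuisine"] = toy[cui_cols].idxmax(axis=1)      # column with the highest value

print(toy)
```

Customers with a `cuisine_diversity` of 0 therefore still receive a favourite cuisine, which may be worth checking before interpreting that feature.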

In [126]:
data['frequent_order_flag'] = data['product_count'] > data['product_count'].mean()
category_var.append("frequent_order_flag")

In [127]:
data['weekend_spending'] = data['DOW_0'] + data['DOW_6']
number_var.append("weekend_spending")

In [128]:
data['week_spending'] = data[['DOW_1', 'DOW_2', 'DOW_3', 'DOW_4', 'DOW_5']].sum(axis=1)
number_var.append("week_spending")

In [129]:
data['supper_spending'] = data[['HR_0', 'HR_1', 'HR_2', 'HR_3', 'HR_4', 'HR_5']].sum(axis=1)
data['breakfast_spending'] = data[['HR_6', 'HR_7', 'HR_8', 'HR_9', 'HR_10']].sum(axis=1)
data['lunch_spending'] = data[['HR_11', 'HR_12', 'HR_13', 'HR_14', 'HR_15']].sum(axis=1)
data['snack_spending'] = data[['HR_16', 'HR_17', 'HR_18']].sum(axis=1)
data['dinner_spending'] = data[['HR_19', 'HR_20', 'HR_21', 'HR_22', 'HR_23']].sum(axis=1)
number_var.append("supper_spending")
number_var.append("breakfast_spending")
number_var.append("lunch_spending")
number_var.append("snack_spending")
number_var.append("dinner_spending")

In [130]:
asian_cuisines = ['CUI_Asian', 'CUI_Chinese', 'CUI_Indian', 'CUI_Japanese', 'CUI_Noodle Dishes', 'CUI_Thai']
western_cuisines = ['CUI_American', 'CUI_Italian', 'CUI_Street Food / Snacks', 'CUI_Cafe']
others_cuisines = ['CUI_Healthy', 'CUI_Beverages', 'CUI_OTHER', 'CUI_Chicken Dishes', 'CUI_Desserts']

data['Asian_Cuisines'] = data[asian_cuisines].sum(axis=1)
data['Western_Cuisines'] = data[western_cuisines].sum(axis=1)
data['Others_Cuisines'] = data[others_cuisines].sum(axis=1)

number_var.append("Asian_Cuisines")
number_var.append("Western_Cuisines")
number_var.append("Others_Cuisines")

In [131]:
data['Total_Cuisine_Orders'] = data['Asian_Cuisines'] + data['Western_Cuisines'] + data['Others_Cuisines']
data['Asian_Cuisines_Ratio'] = data['Asian_Cuisines'] / data['Total_Cuisine_Orders']
data['Western_Cuisines_Ratio'] = data['Western_Cuisines'] / data['Total_Cuisine_Orders']
data['Others_Cuisines_Ratio'] = data['Others_Cuisines'] / data['Total_Cuisine_Orders']

number_var.append("Total_Cuisine_Orders")
number_var.append("Asian_Cuisines_Ratio")
number_var.append("Western_Cuisines_Ratio")
number_var.append("Others_Cuisines_Ratio")
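
When `Total_Cuisine_Orders` is 0, the ratios are computed as 0 / 0 and become NaN, which is why the new features are imputed afterwards. A minimal sketch with made-up values (the `toy` frame and its shortened column names are assumptions):

```python
import pandas as pd

# Hypothetical customers; the second one has no cuisine values at all
toy = pd.DataFrame({
    "Asian_Cuisines": [10.0, 0.0],
    "Western_Cuisines": [30.0, 0.0],
})

toy["Total"] = toy["Asian_Cuisines"] + toy["Western_Cuisines"]
toy["Asian_Ratio"] = toy["Asian_Cuisines"] / toy["Total"]  # 0 / 0 yields NaN in pandas

print(toy)
```

An alternative to imputation, if it suits the analysis, would be to set these ratios to 0 with `fillna(0)`.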

In [132]:
# Original day, hour and cuisine columns, plus the intermediate cuisine group sums
columns_to_remove = (
    [f'DOW_{i}' for i in range(7)]
    + [f'HR_{i}' for i in range(24)]
    + asian_cuisines + western_cuisines + others_cuisines
    + ['Asian_Cuisines', 'Western_Cuisines', 'Others_Cuisines']
)

data = data.drop(columns=columns_to_remove)

for column in columns_to_remove:
    number_var.remove(column)

In [None]:
data.columns

## 5.1 Visualizations of New Features

In [None]:
# Plot the new features. Some of the columns listed here (e.g. total_spending, high_spender)
# are not created in this notebook, so missing columns are skipped instead of raising a KeyError.

# (column, title, x-axis label, bins, kde, color)
hist_specs = [
    ('total_spending', "Distribution of Total Spending", "Total Spending", 30, True, 'blue'),
    ('avg_spending_per_cuisine', "Distribution of Average Spending Per Cuisine", "Average Spending Per Cuisine", 30, True, 'green'),
    ('order_activity_duration', "Distribution of Order Activity Duration", "Order Activity Duration (Days)", 30, True, 'orange'),
    ('order_frequency', "Distribution of Order Frequency", "Order Frequency", 30, True, 'purple'),
    ('cuisine_diversity', "Distribution of Cuisine Diversity", "Number of Unique Cuisines Ordered", 15, False, 'purple'),
    ('inactive_days', "Distribution of Inactive Days", "Inactive Days", 30, True, 'pink'),
]

# (column, title, x-axis label, y-axis label, palette)
count_specs = [
    ('high_spender', "High Spenders Distribution", "High Spender", "Count", 'pastel'),
    ('loyal_customer', "Loyal Customers Distribution", "Loyal Customer", "Count", 'coolwarm'),
    ('peak_order_hour', "Peak Order Hour Distribution", "Hour of the Day", "Number of Customers", 'viridis'),
    ('peak_order_day', "Peak Order Day Distribution", "Day of the Week (0=Sunday, 6=Saturday)", "Number of Customers", 'coolwarm'),
    ('frequent_order_flag', "Frequent Order Flag Distribution", "Frequent Order Flag", "Count", 'pastel'),
]

for col, title, xlabel, bins, kde, color in hist_specs:
    if col not in data.columns:
        print(f"Skipping '{col}': column not found in data")
        continue
    plt.figure(figsize=(10, 6))
    sns.histplot(data[col], bins=bins, kde=kde, color=color)
    plt.title(title)
    plt.xlabel(xlabel)
    plt.ylabel("Frequency")
    plt.show()

for col, title, xlabel, ylabel, palette in count_specs:
    if col not in data.columns:
        print(f"Skipping '{col}': column not found in data")
        continue
    plt.figure(figsize=(10, 6))
    sns.countplot(x=col, data=data, palette=palette)
    plt.title(title)
    plt.xlabel(xlabel)
    plt.ylabel(ylabel)
    plt.show()

# Favorite Cuisine
plt.figure(figsize=(12, 6))
data['favorite_cuisine'].value_counts().plot(kind='bar', color='skyblue')
plt.title("Favorite Cuisine Distribution")
plt.xlabel("Cuisine")
plt.ylabel("Count")
plt.show()

## 5.2 Impute Missing Values of New Features (KNN)

In [None]:
knn_imputer = KNNImputer(n_neighbors=5)
data[number_var] = knn_imputer.fit_transform(data[number_var])
print(data.isnull().sum())

# 6. Feature Selection
[⬆️ Back to Top](#top)

## 6.1 Correlation with New Features

In [None]:
correlation_matrix = data[number_var].corr()

high_corr_pairs = []

for i in range(len(correlation_matrix.columns)):
    for j in range(i):
        if abs(correlation_matrix.iloc[i, j]) >= 0.8:  # covers strong positive and strong negative correlations
            high_corr_pairs.append((correlation_matrix.columns[i], correlation_matrix.columns[j], correlation_matrix.iloc[i, j]))

high_corr_df = pd.DataFrame(high_corr_pairs, columns=['Variable 1', 'Variable 2', 'Correlation'])

high_corr_df

## 6.2 Cramer V (Categorical Features)

In [None]:
# Ensure data has all these columns
assert all(col in data.columns for col in category_var), "Some columns are missing in data"

def cramers_v(x, y):
    contingency_matrix = pd.crosstab(x, y)
    chi2 = chi2_contingency(contingency_matrix)[0]
    n = contingency_matrix.sum().sum()
    r, k = contingency_matrix.shape
    return np.sqrt(chi2 / (n * (min(r, k) - 1)))

correlation_matrix = pd.DataFrame(index=category_var, columns=category_var)

for col1 in category_var:
    for col2 in category_var:
        if col1 == col2:
            correlation_matrix.loc[col1, col2] = 1.0
        else:
            # Labels are passed as they are; rounding does not apply to categorical values
            correlation_matrix.loc[col1, col2] = cramers_v(data[col1], data[col2])

correlation_matrix = correlation_matrix.astype(float)

plt.figure(figsize=(15, 8))
sns.heatmap(correlation_matrix,annot=True, fmt=".2f",cmap="coolwarm",vmin=0, vmax=1,cbar=True,)
plt.xticks(rotation=45, ha='right')
plt.yticks(rotation=0)
plt.tight_layout()
plt.show()

**Drop vendor_count based on the correlation analysis**

In [138]:
data = data.drop(columns=['vendor_count'])
number_var.remove('vendor_count')

# 7. Variance Evaluation
[⬆️ Back to Top](#top)

In [139]:
selector = VarianceThreshold(threshold=0.01)
data_num = selector.fit_transform(data[number_var])

In [140]:
filtered_columns = data[number_var].columns[selector.get_support()]
data_num = pd.DataFrame(data_num, columns=filtered_columns, index=data.index)
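
To show what the variance filter keeps and drops, here is a minimal sketch on made-up columns (the `toy` data is an assumption). A feature is kept only if its variance is above the threshold.

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# Hypothetical features: one nearly constant, one that varies
toy = pd.DataFrame({
    "almost_constant": [1.0, 1.0, 1.0, 1.0, 1.1],
    "varied": [1.0, 5.0, 2.0, 8.0, 3.0],
})

selector = VarianceThreshold(threshold=0.01)
selector.fit(toy)

kept = list(toy.columns[selector.get_support()])
print(selector.variances_)
print(kept)
```

Since the filter is applied here before scaling, the threshold of 0.01 is measured in each feature's original units, so features on small scales are more likely to be dropped than features on large scales.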

In [None]:
data_num.columns

## Encoding Categorical Data (Frequency Encoding)

In [None]:
data[category_var] = data[category_var].apply(lambda col: col.map(col.value_counts(normalize=True)))
data.head()
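
The mapping above is a frequency encoding: each label is replaced by its share of the rows. A minimal sketch on a made-up column (the `toy` values are assumptions):

```python
import pandas as pd

# Hypothetical payment methods for four customers
toy = pd.DataFrame({"payment_method": ["CARD", "CARD", "CASH", "DIGI"]})

# Share of rows per label, then look up that share for every row
freq = toy["payment_method"].value_counts(normalize=True)
toy["payment_method_freq"] = toy["payment_method"].map(freq)

print(toy)
```

Labels that occur equally often, such as `CASH` and `DIGI` here, receive the same encoded value and can no longer be told apart.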

In [143]:
data = data.drop(columns=category_var)

# 8. Scaling Numerical Data (MinMaxScaler)
[⬆️ Back to Top](#top)

In [144]:
scaler = MinMaxScaler()
data_num = scaler.fit_transform(data_num)
data_num = pd.DataFrame(data_num, columns=filtered_columns, index=data.index)
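
`MinMaxScaler` rescales each column independently with (x - min) / (max - min), so every feature ends up in the range 0 to 1. A minimal sketch on made-up values (the `toy` frame is an assumption):

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical features on very different scales
toy = pd.DataFrame({
    "customer_age": [20.0, 30.0, 40.0],
    "product_count": [1.0, 2.0, 5.0],
})

scaled = pd.DataFrame(
    MinMaxScaler().fit_transform(toy), columns=toy.columns, index=toy.index
)
print(scaled)
```

Because the minimum and maximum define the scale, a single extreme value compresses all other values of that column towards 0.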

In [None]:
data_num.head()

# 9. Export
[⬆️ Back to Top](#top)

In [146]:
data_num.to_excel("Numeric_DM2425_ABCDEats_DATASET.xlsx", index=False)