In [1]:
# Import the necessary libraries
import warnings
from math import sqrt

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.metrics import mean_squared_error as MSE

# Suppress library warnings to keep the notebook output readable
warnings.simplefilter('ignore')

plt.style.use('ggplot')
%matplotlib inline

# Business Understanding


## Overview
As EURO 2024 approaches, Adidas aims to strategically engage football enthusiasts in Great Britain, a key market for sports merchandise. With excitement building, Adidas seeks to use data-driven insights to optimize its marketing strategies and product offerings for football fans in this region. By focusing on customer segmentation and a recommendation system, the objective is to understand consumer behavior, tailor marketing efforts, and enhance customer experiences, ultimately driving sales and brand loyalty during this high-profile event.

## Problem Statement 
As EURO 2024 approaches, Adidas faces steep competition from rivals that also seek to engage football enthusiasts during the tournament. To stay ahead, the company needs an informed way to allocate resources and shape its marketing strategy. This requires understanding the diverse characteristics and behaviors of its customer base, particularly football fans in Great Britain, so that marketing efforts and product offerings can be optimized.

## Challenges
- Data Quality and Availability:
  - Inconsistent or missing data could reduce the accuracy of the customer segmentation and recommendation models.
  - Limited data for new users (the cold start problem) may make it difficult to generate personalized recommendations.
- Customer Segmentation:
  - Determining the optimal number of customer segments is challenging, as too few or too many clusters lead to poor segmentation.
  - The segments must be actionable and meaningful in a business context.
- Real-Time Recommendations:
  - The recommendation system must respond quickly and accurately in a real-time environment.
  - The system must scale well under varying loads, especially during high-traffic periods such as EURO 2024.
- Deployment and Integration:
  - The recommendation system must be linked to the dummy website in a way that accurately simulates real-world usage.
  - Dependencies must be managed so that all components work together seamlessly, particularly in a containerized environment.
- User Experience and Adoption:
  - The interface must be user-friendly and showcase the recommendation system effectively.
  - Recommendations must be perceived as relevant and helpful by end users.
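
One common way to approach the cluster-count challenge listed above is to compare silhouette scores across candidate values of k. The sketch below is illustrative only: it runs on synthetic two-feature data rather than the project tables, and the helper name `best_k_by_silhouette` and the candidate range 2 to 6 are assumptions, not part of this notebook's pipeline.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score


def best_k_by_silhouette(X, k_values):
    """Return (best_k, scores), where scores maps each k to its silhouette score."""
    scores = {}
    for k in k_values:
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
        scores[k] = silhouette_score(X, labels)
    best_k = max(scores, key=scores.get)
    return best_k, scores


# Synthetic stand-in for scaled customer features: three well-separated groups.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in centers])

best_k, scores = best_k_by_silhouette(X, range(2, 7))
print(best_k, {k: round(v, 3) for k, v in scores.items()})
```

On real customer data the scores are rarely this clear-cut, so the silhouette result would typically be weighed against whether the resulting segments are actionable.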

## Proposed Solution
The solution involves integrating and cleaning Great Britain-specific sales, customer, and engagement data to create a unified dataset.
We will perform demographic and engagement analysis to identify key customer segments through clustering techniques.
A recommendation system combining collaborative and content-based filtering will then be developed to deliver personalized product suggestions to these segments.
Continuous evaluation using metrics such as Silhouette Score, Precision, and Recall will ensure the effectiveness of the segmentation and recommendation system, ultimately driving increased sales, customer engagement, and satisfaction.
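
To make the Precision and Recall metrics mentioned above concrete, the following minimal sketch computes precision@k and recall@k for a single user. The function name `precision_recall_at_k` and the article identifiers are hypothetical and used only for illustration; the notebook does not prescribe a particular evaluation routine.

```python
def precision_recall_at_k(recommended, relevant, k):
    """Precision@k and recall@k for one user.

    recommended: ranked list of item ids produced by the recommender
    relevant: set of item ids the user actually interacted with
    """
    top_k = recommended[:k]
    hits = sum(1 for item in top_k if item in relevant)
    # Convention used here: divide by k even if fewer than k items were recommended.
    precision = hits / k if k else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical article numbers, for illustration only.
recommended = ["A1", "B2", "C3", "D4", "E5"]
relevant = {"B2", "E5", "Z9"}

precision, recall = precision_recall_at_k(recommended, relevant, k=5)
print(precision, recall)
```

Averaging these per-user values over a held-out set of users gives system-level figures that can be tracked over time.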

## Success Metrics 
- Model Accuracy: Maintain an overall model accuracy rate of 80% or higher in predicting user preferences.
- Functional Storefront: Ensure the website accurately simulates a real e-commerce platform, showcasing product recommendations with names and descriptions.
- Model Integration: Successfully integrate the recommendation model into the website, allowing it to dynamically generate and display personalized product suggestions for each user.

## Conclusion
By focusing on Great Britain and leveraging customer segmentation and recommendation systems, Adidas can better understand and engage football fans in this key market. The proposed solution will enable more personalized marketing strategies and product recommendations, driving higher sales and stronger customer loyalty in the lead-up to EURO 2024.



# Data Understanding

## Data sources 
Three datasets are used:
- `ConsTable_EU.csv`: consumer information.
- `SalesTable_EU.csv`: sales information.
- `EngagementTable_GB.csv`: customer engagement data for Great Britain.

In [2]:
# Load all datasets
cons_eu = pd.read_csv('data/ConsTable_EU.csv')
sales_eu = pd.read_csv('data/SalesTable_EU.csv')
engagement_gb = pd.read_csv('data/EngagementTable_GB.csv')


In [3]:
# Consumer Information
print('Consumer Information'.center(50, '-'))
print(f'Shape: {cons_eu.shape}')
print('Info:')
cons_eu.info()  # info() prints its report directly and returns None
print(f'Description:\n{cons_eu.describe()}')

print('\n' + '-'*50 + '\n')

# Sales
print('Sales'.center(50, '-'))
print(f'Shape: {sales_eu.shape}')
print('Info:')
sales_eu.info()
print(f'Description:\n{sales_eu.describe()}')

print('\n' + '-'*50 + '\n')

# Engagement Data
print('Engagement Data'.center(50, '-'))
print(f'Shape: {engagement_gb.shape}')
print('Info:')
engagement_gb.info()
print(f'Description:\n{engagement_gb.describe()}')

---------------Consumer Information---------------
Shape: (355461, 8)
Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 355461 entries, 0 to 355460
Data columns (total 8 columns):
 #   Column                     Non-Null Count   Dtype  
---  ------                     --------------   -----  
 0   acid                       355461 non-null  object 
 1   loyalty_memberid           266450 non-null  object 
 2   birth_year                 133642 non-null  float64
 3   consumer_gender            355461 non-null  object 
 4   market_name                355461 non-null  object 
 5   first_signup_country_code  355461 non-null  object 
 6   member_latest_tier         266335 non-null  object 
 7   member_latest_points       266335 non-null  float64
dtypes: float64(2), object(6)
memory usage: 21.7+ MB
Description:
          birth_year  member_latest_points
count  133642.000000         266335.000000
mean     1987.942346            150.370961
std        13.421753           1241.023133
[output truncated]

 1. **Consumer Information**
- **Shape**: (355,461 rows, 8 columns)
- **Key Columns**:
  - `acid`: Unique identifier for each consumer (non-null).
  - `loyalty_memberid`: Membership ID (missing for about 25% of consumers).
  - `birth_year`: Year of birth (available for about 38% of consumers).
  - `consumer_gender`, `market_name`, `first_signup_country_code`: Demographic and location data.
  - `member_latest_tier`, `member_latest_points`: Loyalty program data, available for around 75% of consumers.
- **Notable Statistics**:
  - `birth_year`: Average year of birth is ~1988, with a range from 1882 to 2009.
  - `member_latest_points`: Points range widely, with some negative values and a max of 377,850.4 points.

 2. **Sales Data**
- **Shape**: (178,334 rows, 20 columns)
- **Key Columns**:
  - `acid`, `order_no`, `order_date`: Order identifiers and dates (non-null).
  - `market_name`, `country`: Geographic data.
  - `quantity_ordered`, `quantity_returned`, `quantity_cancelled`, `quantity_delivered`: Metrics on order fulfilment.
  - `exchange_rate_to_EUR`, `order_item_unit_price_net`: Financial data related to orders.
- **Notable Statistics**:
  - `quantity_ordered`: Average slightly above 1 item per order.
  - `quantity_returned`: 21% of items are returned on average.
  - `order_item_unit_price_net`: Prices range from -€45.76 to €14,628.10, indicating some anomalies.

 3. **Engagement Data**
- **Shape**: (33,148 rows, 29 columns)
- **Key Columns**:
  - `acid`: Consumer ID.
  - `year`, `quarter_of_year`, `month_of_year`, `week_of_year`: Temporal data for tracking engagement.
  - Various `freq_*` columns: Metrics capturing the frequency of consumer interactions (e.g., signups, app usage, purchases).
- **Notable Statistics**:
  - `freq_signup`, `freq_sportsapp`, `freq_email`, etc.: Most engagement metrics have low averages, indicating that most consumers interact only sporadically.
  - `freq_dotcom`, `freq_flagshipapp`: Show more consistent engagement, with some consumers interacting very frequently (e.g., up to 399 times on the flagship app).
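
The anomalies noted in the summaries above (very early birth years, negative loyalty points, negative unit prices) can be surfaced explicitly before any cleaning decision is made. The sketch below uses small toy frames with the same column names; the values are loosely based on the extremes quoted above, and the plausibility bounds `MIN_BIRTH_YEAR` and `MAX_BIRTH_YEAR` are assumptions to be tuned to the business context.

```python
import pandas as pd

# Toy rows that reuse the column names above; not the real data.
cons = pd.DataFrame({
    "acid": ["a", "b", "c", "d"],
    "birth_year": [1882.0, 1990.0, None, 2005.0],
    "member_latest_points": [-20.0, 150.0, 0.0, 377850.4],
})
sales = pd.DataFrame({
    "order_no": ["o1", "o2", "o3"],
    "order_item_unit_price_net": [-45.76, 44.38, 14628.10],
})

# Assumed plausibility bounds for birth years.
MIN_BIRTH_YEAR, MAX_BIRTH_YEAR = 1922, 2009

# Missing birth years are handled separately, so only flag recorded values.
implausible_birth = cons["birth_year"].notna() & ~cons["birth_year"].between(
    MIN_BIRTH_YEAR, MAX_BIRTH_YEAR
)
negative_points = cons["member_latest_points"] < 0
negative_price = sales["order_item_unit_price_net"] < 0

print(cons[implausible_birth | negative_points])
print(sales[negative_price])
```

Flagging rather than dropping keeps the decision visible: a negative price might be a refund or discount line rather than an error.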


# Data Preparation

In [4]:
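# Extract the GB rows from cons_eu and sales_eu

# .copy() makes the subsets independent frames, so the cleaning steps below modify them safely
cons_gb = cons_eu[cons_eu['first_signup_country_code'] == 'GB'].copy()
sales_gb = sales_eu[sales_eu['country'] == 'GB'].copy()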
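
The Proposed Solution calls for a unified dataset. One way this might be assembled is to aggregate sales and engagement to one row per consumer and join everything on `acid`. The sketch below runs on toy frames; the aggregate column names (`n_orders`, `items_ordered`, `total_dotcom`) are hypothetical and not taken from the project.

```python
import pandas as pd

cons = pd.DataFrame({
    "acid": ["a", "b", "c"],
    "consumer_gender": ["Male", "Unknown", "Female"],
})
sales = pd.DataFrame({
    "acid": ["a", "a", "b"],
    "order_no": ["o1", "o2", "o3"],
    "quantity_ordered": [1, 2, 1],
})
engagement = pd.DataFrame({"acid": ["a", "c", "c"], "freq_dotcom": [3, 1, 4]})

# Collapse each table to one row per consumer before joining.
sales_per_cons = sales.groupby("acid").agg(
    n_orders=("order_no", "nunique"),
    items_ordered=("quantity_ordered", "sum"),
).reset_index()
eng_per_cons = engagement.groupby("acid").agg(
    total_dotcom=("freq_dotcom", "sum"),
).reset_index()

# Left joins keep every consumer, including those with no sales or engagement.
unified = (
    cons.merge(sales_per_cons, on="acid", how="left")
        .merge(eng_per_cons, on="acid", how="left")
)
activity_cols = ["n_orders", "items_ordered", "total_dotcom"]
unified[activity_cols] = unified[activity_cols].fillna(0)
print(unified)
```

Aggregating before the join avoids multiplying rows when a consumer has several orders and several engagement records.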

In [5]:
cons_gb

Unnamed: 0,acid,loyalty_memberid,birth_year,consumer_gender,market_name,first_signup_country_code,member_latest_tier,member_latest_points
1,H24SNEP4HNBA6KVC,,,Unknown,Western Europe,GB,,
2,OUU8CGXKA9WIL7LW,,,Unknown,Western Europe,GB,,
4,LTZVC7YMPXJGLKOW,92D6786B8DB74C2984E382B5E99F5C393588563F269F62...,,Male,Western Europe,GB,Level 1,0.0
8,LVCIKJXEH56NKV7A,29B11577BABDB20529C1D400FDC621824C83D6256F96D4...,,Unknown,Western Europe,GB,Level 1,0.0
10,Z1AOUCZ2VG2FQ3UQ,5E00B3036468BEDCAA8BDAEEAAC9CFC2BAEAF4C09F4267...,,Unknown,Western Europe,GB,Level 1,0.0
...,...,...,...,...,...,...,...,...
355440,UT1RW39PZACT2BD3,,,Unknown,Western Europe,GB,,
355442,PDQUJIFM3L6HPOPM,,,Unknown,Western Europe,GB,,
355444,1TQICCC4WUN38C9B,DBAD013F2B8C5E2593CD88E51A492B1A3E5D24918EC297...,,Male,Western Europe,GB,Level 1,0.0
355446,C8ZGDRZN28M5N7SA,40663F074DC4CC74E8AAF58468A07BB1894B7A4A71EF41...,,Male,Western Europe,GB,Level 1,0.0


In [6]:
sales_gb

Unnamed: 0,acid,order_no,order_date,market_name,country,article_no,key_category_descr,sports_category_descr,product_division,product_type,product_age_group,product_gender,quantity_ordered,quantity_returned,quantity_cancelled,quantity_delivered,no_of_items_after_returns,currency,exchange_rate_to_EUR,order_item_unit_price_net
1,LE9L3EOB8US4OCNM,AUK51130901,2022-06-09 00:00:00+00:00,Western Europe,GB,HB6519,TRAINING APP,TRAINING,APPAREL,SHORTS,ADULT,MEN,7,0,0,7,7,GBP,1.176471,12.537143
13,XNHKABE0B9HBQ9UB,AUK58840020,2022-11-26 00:00:00+00:00,Western Europe,GB,GZ0619,RUNNING FTW,RUNNING,FOOTWEAR,SHOES (LOW),ADULT,WOMEN,1,0,0,1,1,GBP,1.176471,44.380000
14,BQPEN1MTYSPZ8UJI,AUK55004486,2022-09-02 00:00:00+00:00,Western Europe,GB,HL9057,ORIGINALS APP,ORIGINALS,APPAREL,SWEATSHIRT,ADULT,WOMEN,1,0,0,1,1,GBP,1.176471,44.330000
15,LB5Q806DN5B2C34V,AUK53592920,2022-07-31 00:00:00+00:00,Western Europe,GB,GW2962,ORIGINALS FTW,ORIGINALS,FOOTWEAR,SHOES (LOW),ADULT,MEN,1,0,0,1,1,GBP,1.176471,54.160000
16,YFRSHW5AXP0F1VC3,AUK58715695,2022-11-25 00:00:00+00:00,Western Europe,GB,GN1990,TRAINING ACC HW,NOT SPORTS SPECIFIC,ACCESSORIES/HARDWARE,CAP,ADULT,MEN,1,0,0,1,1,GBP,1.176471,4.790000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
178322,5DTX7N77YSTINSR3,AUK58640550,2022-11-24 00:00:00+00:00,Western Europe,GB,GX9527,ORIGINALS FTW,ORIGINALS,FOOTWEAR,SHOES (LOW),ADULT,MEN,1,0,0,1,1,GBP,1.176471,61.100000
178326,Z6J9Z1BLG8C8EMY0,AUK58954873,2022-11-27 00:00:00+00:00,Western Europe,GB,HF1488,FOOTBALL APP LICENSED,FOOTBALL/SOCCER,APPAREL,JERSEY,JUNIOR,KIDS,1,0,1,0,0,GBP,1.176471,50.000000
178329,57U8ESPG67OD3YZW,AUK50152075,2022-05-10 00:00:00+00:00,Western Europe,GB,GW8731,SPORTSWEAR FTW KIDS,RUNNING,FOOTWEAR,SHOES (LOW),KIDS,KIDS,1,1,0,1,0,GBP,1.176471,29.750000
178330,8E5RYMR4XQ6IK8O4,AUK47690042,2022-03-06 00:00:00+00:00,Western Europe,GB,H09117,ORIGINALS APP,ORIGINALS,APPAREL,TRACK PANT,ADULT,MEN,1,1,0,1,0,GBP,1.176471,40.830000


In [7]:
# Null values in the GB consumer data
print(cons_gb.isnull().sum())

acid                             0
loyalty_memberid             18169
birth_year                   58280
consumer_gender                  0
market_name                      0
first_signup_country_code        0
member_latest_tier           18169
member_latest_points         18169
dtype: int64
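
Raw counts are easier to judge alongside percentages. A small helper along the following lines could report both; the name `missing_summary` is an assumption, and the frame here is toy data rather than `cons_gb`.

```python
import pandas as pd


def missing_summary(df):
    """Return missing counts and percentages per column, highest first."""
    counts = df.isnull().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "missing_pct": (counts / len(df) * 100).round(1),
    })
    return summary.sort_values("missing", ascending=False)


toy = pd.DataFrame({
    "acid": ["a", "b", "c", "d"],
    "birth_year": [None, 1990.0, None, None],
    "member_latest_tier": ["Level 1", None, "Level 1", "Level 2"],
})
summary = missing_summary(toy)
print(summary)
```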


In [8]:
# Fill missing loyalty data with 'Non-member' (assignment instead of chained inplace fills)
cons_gb['loyalty_memberid'] = cons_gb['loyalty_memberid'].fillna('Non-member')
cons_gb['member_latest_tier'] = cons_gb['member_latest_tier'].fillna('Non-member')
cons_gb['member_latest_points'] = cons_gb['member_latest_points'].fillna(0)

# Fill missing birth_year with the median birth year
median_birth_year = cons_gb['birth_year'].median()
cons_gb['birth_year'] = cons_gb['birth_year'].fillna(median_birth_year)

# Convert birth_year to age, using 2022 (the year of the sales rows shown above) as the reference year
current_year = 2022
cons_gb['age'] = current_year - cons_gb['birth_year']

# Drop the birth_year column
cons_gb = cons_gb.drop('birth_year', axis=1)
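
Because birth year is missing for a large number of GB consumers (58,280 rows in the null count above), filling with the median concentrates many consumers at a single age. A possible refinement, sketched here on toy data, is to record which ages were imputed so that later analysis can account for it. The column name `age_imputed` and the constant `REFERENCE_YEAR` are assumptions introduced for this example.

```python
import pandas as pd

toy = pd.DataFrame({
    "acid": ["a", "b", "c", "d"],
    "birth_year": [1980.0, None, 2000.0, None],
})

REFERENCE_YEAR = 2022  # same reference year as the cell above

# Flag rows before imputing, so the information is not lost.
toy["age_imputed"] = toy["birth_year"].isna()
median_birth_year = toy["birth_year"].median()
toy["age"] = REFERENCE_YEAR - toy["birth_year"].fillna(median_birth_year)
toy = toy.drop(columns="birth_year")
print(toy)
```

The flag can then be used as a feature in its own right or to exclude imputed ages from age-based segment profiles.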

In [9]:
# Checking null values
print(cons_gb.isnull().sum())

acid                         0
loyalty_memberid             0
consumer_gender              0
market_name                  0
first_signup_country_code    0
member_latest_tier           0
member_latest_points         0
age                          0
dtype: int64


In [10]:
# Checking for duplicates
print(cons_gb.duplicated().sum())

0
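
`duplicated()` with no arguments only catches rows that are identical in every column. Since `acid` is described as a unique consumer identifier, it may also be worth checking for repeats on that key alone. A minimal sketch on toy data:

```python
import pandas as pd

toy = pd.DataFrame({
    "acid": ["a", "b", "b", "c"],
    "consumer_gender": ["Male", "Unknown", "Female", "Male"],
})

# Full-row check: no two rows are identical in every column.
full_row_dupes = toy.duplicated().sum()

# Key check: keep=False marks every row that shares an acid, not just the later ones.
key_dupes = toy[toy.duplicated(subset="acid", keep=False)]
print(full_row_dupes, len(key_dupes))
```

Here the full-row check reports no duplicates even though consumer `b` appears twice with conflicting attributes.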


In [11]:
#Check for null values in the sales data
print(sales_gb.isnull().sum())

acid                          0
order_no                      0
order_date                    0
market_name                   0
country                       0
article_no                    0
key_category_descr            0
sports_category_descr         0
product_division              0
product_type                  0
product_age_group             0
product_gender                0
quantity_ordered              0
quantity_returned             0
quantity_cancelled            0
quantity_delivered            0
no_of_items_after_returns     0
currency                      0
exchange_rate_to_EUR          0
order_item_unit_price_net    23
dtype: int64


In [12]:
# Fill the missing prices with the median unit price
median_price = sales_gb['order_item_unit_price_net'].median()
sales_gb['order_item_unit_price_net'] = sales_gb['order_item_unit_price_net'].fillna(median_price)
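
A single global median ignores the price differences between product types. One possible alternative, sketched below on toy data, fills each missing price with the median of its `product_type` and falls back to the global median when a group has no observed price. This is an assumption about what may be appropriate, not the method used in the cell above.

```python
import pandas as pd

toy = pd.DataFrame({
    "product_type": ["JERSEY", "JERSEY", "JERSEY", "CAP", "CAP", "SOCKS"],
    "order_item_unit_price_net": [50.0, None, 60.0, 5.0, None, None],
})

price = toy["order_item_unit_price_net"]

# Median price per product type, aligned row by row with the original frame.
group_median = toy.groupby("product_type")["order_item_unit_price_net"].transform("median")

# Use the group median first; fall back to the global median for all-missing groups.
toy["order_item_unit_price_net"] = price.fillna(group_median).fillna(price.median())
print(toy)
```

With only 23 missing prices in `sales_gb`, the choice between the two approaches is unlikely to change results much, but the grouped version scales better if more gaps appear.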

In [13]:
#Checking null value
print(sales_gb.isnull().sum())

acid                         0
order_no                     0
order_date                   0
market_name                  0
country                      0
article_no                   0
key_category_descr           0
sports_category_descr        0
product_division             0
product_type                 0
product_age_group            0
product_gender               0
quantity_ordered             0
quantity_returned            0
quantity_cancelled           0
quantity_delivered           0
no_of_items_after_returns    0
currency                     0
exchange_rate_to_EUR         0
order_item_unit_price_net    0
dtype: int64


In [14]:
#Checking duplicates
print(sales_gb.duplicated().sum())

0


In [15]:
#Checking for missing values in Engagement data
print(engagement_gb.isnull().sum())

acid                       0
country                    0
year                       0
quarter_of_year            0
month_of_year              0
week_of_year               0
freq_signup                0
freq_sportsapp             0
freq_survey                0
freq_raffle                0
freq_reviews               0
freq_email                 0
freq_adiclub_email         0
freq_pn                    0
freq_adiclub_pn            0
freq_transactions          0
freq_earn_points           0
freq_points_redemption     0
freq_rewards_redemption    0
freq_confirmed             0
freq_dotcom                0
freq_flagshipapp           0
freq_hype                  0
freq_pdp                   0
freq_plp                   0
freq_add_to_cart           0
freq_preference            0
freq_wishlist              0
refresh_date               0
dtype: int64


In [16]:
#Checking duplicates
print(engagement_gb.duplicated().sum())

0
