## Understanding the problem statement

- Objective: Build an ML model that optimizes channel distribution to maximize market reach and revenue

### Phase 1: EDA
- Understand how variables differ across channels and regions
- Visualize at the week, city and channel levels
- Insights: action items to optimize the distribution channels
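
A minimal sketch of the week, city and channel roll-up listed above, assuming the sales table has the `Date`, `City_ID`, `Channel`, `Units_Sold` and `Sales` columns that appear later in this notebook. The rows are made up for illustration, and `freq="W"` (pandas weeks ending on Sunday) is an assumption about how weeks should be cut.

```python
import pandas as pd

# Toy rows that mimic the columns of sales_df (values are made up).
sales = pd.DataFrame({
    "Date": pd.to_datetime(["2023-01-01", "2023-01-02", "2023-01-08", "2023-01-02"]),
    "City_ID": ["CT001", "CT001", "CT001", "CT002"],
    "Channel": ["Q Commerce", "Q Commerce", "E Commerce", "E Commerce"],
    "Units_Sold": [10, 20, 30, 40],
    "Sales": [100.0, 200.0, 300.0, 400.0],
})

# Week x city x channel totals; each week is labelled by its closing Sunday.
weekly = (
    sales.groupby([pd.Grouper(key="Date", freq="W"), "City_ID", "Channel"])[["Units_Sold", "Sales"]]
    .sum()
    .reset_index()
)

# Share of total revenue contributed by each channel.
channel_totals = sales.groupby("Channel")["Sales"].sum()
channel_share = channel_totals / channel_totals.sum()

print(weekly)
print(channel_share)
```

The same `weekly` frame can feed line or bar charts per city or per channel without further reshaping.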

## Ingesting the data

In [2]:
# READING ALL THE DATA IN PANDAS DATAFRAME

import pandas as pd
city_df = pd.read_csv('../../data/raw/city.csv', encoding='ISO-8859-1')
competitors_df = pd.read_csv('../../data/raw/competitivelandscape.csv')
consumer_behaviour_df = pd.read_csv('../../data/raw/consumerbehavior.csv')
market_influencers_df = pd.read_csv('../../data/raw/externalmarketinfluencers.csv')
products_df = pd.read_csv('../../data/raw/products.csv')
sales_df = pd.read_csv('../../data/raw/retailsalesdistribution.csv')

## Univariate analysis

### Sales data

In [2]:
sales_df.head()

Unnamed: 0,Date,City_ID,SKU_ID,Channel,Units_Sold,Sales
0,2023-01-01,CT001,SKU1002,Q Commerce,268,80.4
1,2023-01-01,CT001,SKU1004,Q Commerce,168,50.4
2,2023-01-01,CT001,SKU1001,E Commerce,521,156.3
3,2023-01-01,CT001,SKU1002,E Commerce,247,74.1
4,2023-01-01,CT001,SKU1007,E Commerce,161,12075.0


In [3]:
from ydata_profiling import ProfileReport

sales_data_profile = ProfileReport(sales_df, title='Sales Data Profiling Report', explorative=True)




In [4]:
sales_data_profile.to_file('../../eda/sales_data_profile.html')


- 20 cities, 7 SKUs, 5 channels
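
The counts above can be checked with `nunique`. A small sketch on made-up rows; `toy_sales` is a hypothetical stand-in for `sales_df`, on which the same call would be expected to return the 20 / 7 / 5 noted above.

```python
import pandas as pd

# Hypothetical stand-in for sales_df with the same identifier columns.
toy_sales = pd.DataFrame({
    "City_ID": ["CT001", "CT001", "CT002"],
    "SKU_ID": ["SKU1001", "SKU1002", "SKU1001"],
    "Channel": ["Q Commerce", "E Commerce", "E Commerce"],
})

# Number of distinct values per identifier column.
cardinality = toy_sales[["City_ID", "SKU_ID", "Channel"]].nunique()
print(cardinality)
```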

### Competitive Behavior

In [6]:
from ydata_profiling import ProfileReport

competitors_data_profile = ProfileReport(competitors_df, title='Competitors Data Profiling Report', explorative=True)

competitors_data_profile.to_file('../../eda/competitors_data_profile.html')



### Products 

In [7]:
from ydata_profiling import ProfileReport

products_data_profile = ProfileReport(products_df, title='Products Data Profiling Report', explorative=True)

products_data_profile.to_file('../../eda/products_data_profile.html')



### External Market Influence

In [8]:
from ydata_profiling import ProfileReport

external_market_influence_data_profile = ProfileReport(market_influencers_df, title='External Market Influence Data Profiling Report', explorative=True)

external_market_influence_data_profile.to_file('../../eda/external_market_influence_data_profile.html')



### Customer Behavior 

In [9]:
from ydata_profiling import ProfileReport

customer_behavior_data_profile = ProfileReport(consumer_behaviour_df, title='Customer Behavior Data Profiling Report', explorative=True)

customer_behavior_data_profile.to_file('../../eda/customer_behavior_data_profile.html')



### City

In [10]:
from ydata_profiling import ProfileReport

city_data_profile = ProfileReport(city_df, title='City Data Profiling Report', explorative=True)

city_data_profile.to_file('../../eda/city_data_profile.html')



## Data Preprocessing

### Sales Data

In [6]:

# STANDARDIZING THE DATE FORMAT IN SALES DATAFRAME
import pandas as pd
sales_df_processed = sales_df.copy()

# Parse with an explicit format so malformed dates raise, then store as ISO 'YYYY-MM-DD' strings
sales_df_processed['Date'] = pd.to_datetime(sales_df_processed['Date'], format='%Y-%m-%d').dt.strftime('%Y-%m-%d')

sales_df_processed.head()



Unnamed: 0,Date,City_ID,SKU_ID,Channel,Units_Sold,Sales
0,2023-01-01,CT001,SKU1002,Q Commerce,268,80.4
1,2023-01-01,CT001,SKU1004,Q Commerce,168,50.4
2,2023-01-01,CT001,SKU1001,E Commerce,521,156.3
3,2023-01-01,CT001,SKU1002,E Commerce,247,74.1
4,2023-01-01,CT001,SKU1007,E Commerce,161,12075.0


In [7]:
# save data
sales_df_processed.to_csv('../../data/processed/sales_processed.csv', index=False)
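
Because `strftime` stores `Date` as text, the column comes back as plain strings when the processed CSV is read again unless it is parsed explicitly. A small sketch of that round trip, using an in-memory buffer in place of the real file path and two rows for illustration:

```python
import io
import pandas as pd

# Two illustrative rows with ISO-formatted date strings, as produced above.
processed = pd.DataFrame({"Date": ["2023-01-01", "2023-01-02"], "Sales": [80.4, 50.4]})

# Write to an in-memory CSV (stand-in for sales_processed.csv).
buffer = io.StringIO()
processed.to_csv(buffer, index=False)

# Default read: Date is loaded as text.
buffer.seek(0)
plain = pd.read_csv(buffer)

# Explicit parse: Date is loaded as datetime64.
buffer.seek(0)
parsed = pd.read_csv(buffer, parse_dates=["Date"])

print(plain["Date"].dtype, parsed["Date"].dtype)
```

Passing `parse_dates` at load time keeps `.dt` accessors and time-based grouping available downstream.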

### Competitor Data

In [8]:

# STANDARDIZING THE DATE FORMAT IN COMPETITORS DATAFRAME
import pandas as pd
competitors_df_processed = competitors_df.copy()

# Parse with an explicit format so malformed dates raise, then store as ISO 'YYYY-MM-DD' strings
competitors_df_processed['Date'] = pd.to_datetime(competitors_df_processed['Date'], format='%Y-%m-%d').dt.strftime('%Y-%m-%d')

competitors_df_processed.head()



Unnamed: 0,Date,Channel,Brand,Mentions_Count,Sentiment_Score,Share_of_Voice
0,2023-01-02,E Commerce,Amazon Solimo,156,61.3,15.0
1,2023-01-02,E Commerce,Minute Maid,346,70.0,33.2
2,2023-01-02,E Commerce,Real Fruit Juice,542,70.9,51.9
3,2023-01-02,General Trade,Minute Maid,551,72.4,53.8
4,2023-01-02,General Trade,Paper Boat,474,67.0,46.2


In [9]:
# save data
competitors_df_processed.to_csv('../../data/processed/competitors_processed.csv', index=False)
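
One possible sanity check on this table, sketched under an assumption the notebook does not state: that `Share_of_Voice` is a percentage that should total roughly 100 within each `Date` and `Channel`. The rows below are the five shown by `head()` above, which may not cover every brand in a group, and the 0.5 tolerance is arbitrary.

```python
import pandas as pd

# The five rows displayed by competitors_df_processed.head().
competitors = pd.DataFrame({
    "Date": ["2023-01-02"] * 5,
    "Channel": ["E Commerce"] * 3 + ["General Trade"] * 2,
    "Brand": ["Amazon Solimo", "Minute Maid", "Real Fruit Juice", "Minute Maid", "Paper Boat"],
    "Share_of_Voice": [15.0, 33.2, 51.9, 53.8, 46.2],
})

# Total share of voice per date and channel.
totals = competitors.groupby(["Date", "Channel"])["Share_of_Voice"].sum()

# Groups whose total drifts from 100 by more than the (assumed) tolerance.
off_target = totals[(totals - 100).abs() > 0.5]

print(totals)
print(off_target)
```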

### Products

In [12]:
import pandas as pd
import re
import numpy as np

products_df_processed = products_df.copy()

# Convert "Flavor Variant" column to categorical data type
products_df_processed['Flavor Variant'] = products_df_processed['Flavor Variant'].astype('category')


# Convert Launch Date from MM/DD/YYYY to YYYY-MM-DD
products_df_processed['Launch Date'] = pd.to_datetime(products_df_processed['Launch Date (MM/DD/YYYY)'], format='%m/%d/%Y').dt.strftime('%Y-%m-%d')

def convert_pack_size(pack_str):
    """
    Extract numeric value and convert pack size to litres.
    Assumes the pack size string contains a number and a unit (ml or L).
    """
    match = re.search(r'([\d\.]+)\s*(ml|l)', pack_str, re.IGNORECASE)
    if match:
        value = float(match.group(1))
        unit = match.group(2).lower()
        if unit == 'ml':
            return value / 1000  # convert millilitres to litres
        else:
            return value
    return np.nan

products_df_processed['Pack Size (L)'] = products_df_processed['Pack Size (ml/L)'].apply(convert_pack_size)

channels = {
    'General Trade': r'general trade',
    'E Commerce': r'e[-\s]?commerce',
    'Modern Trade': r'modern trade',
    'HoReCa': r'horeca',
    'Q Commerce': r'q[-\s]?commerce'
}

def encode_distribution(dist_str):
    """
    Returns a Series with one-hot encoding for each distribution channel.
    """
    dist_lower = dist_str.lower()
    result = {}
    for channel, pattern in channels.items():
        result[channel] = 1 if re.search(pattern, dist_lower) else 0
    return pd.Series(result)

dist_df = products_df_processed['Distribution Coverage'].apply(encode_distribution)
products_df_processed = pd.concat([products_df_processed, dist_df], axis=1)  

products_df_processed = products_df_processed[['Product Name', 'Flavor Variant', 'SKU Identification Number', 
          'Launch Date', 'Pack Size (L)', 'General Trade', 'E Commerce', 
          'Modern Trade', 'HoReCa', 'Q Commerce']]

products_df_processed.head(100)






Unnamed: 0,Product Name,Flavor Variant,SKU Identification Number,Launch Date,Pack Size (L),General Trade,E Commerce,Modern Trade,HoReCa,Q Commerce
0,Minute Maid Apple Juice - Honey Infused,Apple,SKU1001,2022-02-19,1.0,1,1,0,0,0
1,Minute Maid Mixed Fruit Juice,Mixed Fruit,SKU1002,2015-04-23,1.0,1,1,1,1,1
2,Minute Maid Gritty Guava,Guava,SKU1003,2023-02-19,1.0,1,0,1,0,0
3,Minute Maid Pulpy Orange,Orange,SKU1004,2015-04-23,1.0,1,1,1,1,1
4,Minute Maid 135ml Mixed Fruit Juice,Mixed Fruit,SKU1006,2015-04-23,0.135,1,0,0,0,0
5,Minute Maid 250ml Pulpy Orange,Orange,SKU1007,2015-04-23,0.25,1,1,1,1,0
6,Minute Maid 250ml Mixed Fruit Juice,Mixed Fruit,SKU1008,2015-04-23,0.25,1,1,1,1,0
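
`convert_pack_size` above assumes every pack size is a string. A hedged variant that also tolerates missing values is sketched below; `pack_size_to_litres` and the sample strings are hypothetical, since the raw values of `Pack Size (ml/L)` are not shown in this notebook.

```python
import re
import numpy as np
import pandas as pd

# A number (optionally with decimals) followed by a unit of ml or l, case-insensitive.
PACK_PATTERN = re.compile(r"(\d+(?:\.\d+)?)\s*(ml|l)\b", re.IGNORECASE)

def pack_size_to_litres(pack_str):
    """Return the pack size in litres, or NaN when it cannot be parsed."""
    if not isinstance(pack_str, str):
        return np.nan  # covers None and NaN
    match = PACK_PATTERN.search(pack_str)
    if match is None:
        return np.nan
    value = float(match.group(1))
    # Millilitres are converted to litres; litres pass through unchanged.
    return value / 1000 if match.group(2).lower() == "ml" else value

# Made-up inputs covering both units, a missing value and an unparseable string.
samples = pd.Series(["1L", "250 ml", "135ML", None, "unknown"])
litres = samples.apply(pack_size_to_litres)
print(litres)
```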


In [13]:
# save data
products_df_processed.to_csv('../../data/processed/products_processed.csv', index=False)

### External Market Influence

In [18]:
import pandas as pd

market_influencers_df_processed = market_influencers_df.copy()

# convert date to yyyy-mm-dd format
market_influencers_df_processed['Week_Start_Date'] = pd.to_datetime(market_influencers_df_processed['Week_Start_Date'], format='%Y-%m-%d').dt.strftime('%Y-%m-%d')

market_influencers_df_processed['Festival'] = market_influencers_df_processed['Festival'].fillna("No Festival")


market_influencers_df_processed.head()


Unnamed: 0,Week_Start_Date,City_ID,Avg_Temperature,Weather_Type,Festival
0,2023-01-02,CT001,21.0,Cold,No Festival
1,2023-01-02,CT002,19.4,Cold,No Festival
2,2023-01-02,CT003,19.8,Cold,No Festival
3,2023-01-02,CT004,20.8,Cold,No Festival
4,2023-01-02,CT005,21.6,Cold,No Festival


In [19]:
# save data
market_influencers_df_processed.to_csv('../../data/processed/market_influencers_processed.csv', index=False)
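
This table is weekly while the sales table is daily, so joining them needs a common week key. A minimal sketch, assuming weeks start on Monday (2023-01-02, the first `Week_Start_Date` shown above, is a Monday) and using made-up rows in place of the real frames:

```python
import pandas as pd

# Made-up daily sales rows for one city.
daily = pd.DataFrame({
    "Date": pd.to_datetime(["2023-01-01", "2023-01-02", "2023-01-05", "2023-01-09"]),
    "City_ID": ["CT001"] * 4,
    "Sales": [80.4, 50.4, 156.3, 74.1],
})

# Roll each date back to the Monday of its week (weekday: Monday=0 ... Sunday=6).
daily["Week_Start_Date"] = daily["Date"] - pd.to_timedelta(daily["Date"].dt.weekday, unit="D")

# Aggregate to the weekly grain of the market influencers table.
weekly = daily.groupby(["Week_Start_Date", "City_ID"], as_index=False)["Sales"].sum()

# Made-up weekly weather rows.
weather = pd.DataFrame({
    "Week_Start_Date": pd.to_datetime(["2023-01-02", "2023-01-09"]),
    "City_ID": ["CT001", "CT001"],
    "Weather_Type": ["Cold", "Cold"],
})

# Left join keeps sales weeks that have no weather record (Weather_Type is NaN there).
merged = weekly.merge(weather, on=["Week_Start_Date", "City_ID"], how="left")
print(merged)
```

Sales dated 2023-01-01 fall in the week starting 2022-12-26, which has no match in the weekly table here, so a left join makes such gaps visible instead of dropping them.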

### Customer Behavior

In [24]:

import pandas as pd

consumer_behaviour_df_processed = consumer_behaviour_df.copy()
consumer_behaviour_df_processed['Purchase_Frequency'] = consumer_behaviour_df_processed['Purchase_Frequency'].fillna(consumer_behaviour_df_processed['Purchase_Frequency'].median())

consumer_behaviour_df_processed['Preferred_Channel'] = consumer_behaviour_df_processed['Preferred_Channel'].replace('HoReca', 'HoReCa')

consumer_behaviour_df_processed['Active'] = consumer_behaviour_df_processed['Active'].map({True: 1, False: 0})



consumer_behaviour_df_processed.head()


Unnamed: 0,Customer_ID,Age_Group,Income_Level,City_ID,Preferred_Flavor,Purchase_Frequency,Price_Sensitivity,Preferred_Channel,Active
0,cust_000001,26-35,362798.681598,CT015,Guava,2.0,Medium,E-commerce,1
1,cust_000002,36-45,402562.34409,CT002,Orange,3.0,Low,General Trade,1
2,cust_000003,46-55,373492.509082,CT011,Orange,3.0,Medium,Modern Trade,1
3,cust_000004,18-25,442767.523488,CT017,Mixed Fruit,2.0,Low,Modern Trade,0
4,cust_000005,36-45,643181.570314,CT008,Guava,4.0,Medium,Modern Trade,0


In [25]:
# save data
consumer_behaviour_df_processed.to_csv('../../data/processed/consumer_behavior_processed.csv', index=False)
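
The output above still shows `E-commerce` in `Preferred_Channel`, while the sales table uses `E Commerce`. If the two tables are to be compared by channel, the labels need a common spelling. A sketch that reuses the regex patterns from the products cell; `canonical_channel` is a hypothetical helper, and whether this harmonization is wanted is an assumption.

```python
import re
import pandas as pd

# Canonical channel names and the patterns used in the products preprocessing cell.
CHANNEL_PATTERNS = {
    "General Trade": r"general trade",
    "E Commerce": r"e[-\s]?commerce",
    "Modern Trade": r"modern trade",
    "HoReCa": r"horeca",
    "Q Commerce": r"q[-\s]?commerce",
}

def canonical_channel(label):
    """Map a channel label to its canonical spelling; unknown labels pass through."""
    if not isinstance(label, str):
        return label
    for name, pattern in CHANNEL_PATTERNS.items():
        if re.fullmatch(pattern, label.strip(), re.IGNORECASE):
            return name
    return label

# Illustrative labels; only 'E-commerce' and 'HoReca' are confirmed by the notebook.
raw = pd.Series(["E-commerce", "HoReca", "General Trade", "Q-Commerce", "Other"])
clean = raw.map(canonical_channel)
print(clean.tolist())
```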

### City

In [28]:

import pandas as pd

city_df_processed = city_df.copy()

city_df_processed = city_df_processed.dropna(axis=1, how='all')
city_df_processed['City_Name'] = city_df_processed['City_Name'].astype('category')

city_df_processed['Per_Capita_Income (INR)'] = city_df_processed['Per_Capita_Income (INR)'].astype(str).str.replace(',', '')
city_df_processed['Per_Capita_Income (INR)'] = pd.to_numeric(city_df_processed['Per_Capita_Income (INR)'], errors='coerce')

city_df_processed.head()



Unnamed: 0,City_ID,City_Name,City_tier,Population_Density(persons/km),Per_Capita_Income (INR)
0,CT001,Delhi,Tier 1,14893,461910
1,CT002,Mumbai,Tier 1,20518,400000
2,CT003,Kolkata,Tier 1,24252,171184
3,CT004,Chennai,Tier 1,14456,585501
4,CT005,Bengaluru,Tier 1,4378,352000


In [29]:
# save data
city_df_processed.to_csv('../../data/processed/city_processed.csv', index=False)
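
`errors='coerce'` turns anything unparseable into `NaN` without warning, so it can help to list the raw values that failed. A small sketch on made-up income strings (the real raw values are not shown in this notebook):

```python
import pandas as pd

# Made-up raw values: comma-grouped, plain, and unparseable.
raw_income = pd.Series(["4,61,910", "400000", "n/a"])

# Same cleaning as above: strip commas, then coerce to numeric.
income = pd.to_numeric(
    raw_income.astype(str).str.replace(",", "", regex=False),
    errors="coerce",
)

# Raw values that could not be converted.
failed = raw_income[income.isna()]
print(income.tolist())
print(failed.tolist())
```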

## Generating reports of processed data

In [32]:
# READ THE PROCESSED CITY, COMPETITOR, CONSUMER BEHAVIOR, MARKET INFLUENCE, PRODUCTS AND SALES DATA,
# CREATE YDATA-PROFILING REPORTS AND SAVE THEM IN THE EDA/REPORTS/PROCESSED FOLDER
import pandas as pd
from ydata_profiling import ProfileReport

city_df = pd.read_csv('../../data/processed/city_processed.csv')
competitors_df = pd.read_csv('../../data/processed/competitors_processed.csv')
consumer_behaviour_df = pd.read_csv('../../data/processed/consumer_behavior_processed.csv')
market_influencers_df = pd.read_csv('../../data/processed/market_influencers_processed.csv')
products_df = pd.read_csv('../../data/processed/products_processed.csv')
sales_df = pd.read_csv('../../data/processed/sales_processed.csv')


sales_data_profile = ProfileReport(sales_df, title='Sales Data Profiling Report', explorative=True)

sales_data_profile.to_file('../../eda/reports/processed/sales_data_profile.html')

competitors_data_profile = ProfileReport(competitors_df, title='Competitors Data Profiling Report', explorative=True)

competitors_data_profile.to_file('../../eda/reports/processed/competitors_data_profile.html')

products_data_profile = ProfileReport(products_df, title='Products Data Profiling Report', explorative=True)

products_data_profile.to_file('../../eda/reports/processed/products_data_profile.html')

external_market_influence_data_profile = ProfileReport(market_influencers_df, title='External Market Influence Data Profiling Report', explorative=True)

external_market_influence_data_profile.to_file('../../eda/reports/processed/external_market_influence_data_profile.html')

customer_behavior_data_profile = ProfileReport(consumer_behaviour_df, title='Customer Behavior Data Profiling Report', explorative=True)

customer_behavior_data_profile.to_file('../../eda/reports/processed/customer_behavior_data_profile.html')

city_data_profile = ProfileReport(city_df, title='City Data Profiling Report', explorative=True)

city_data_profile.to_file('../../eda/reports/processed/city_data_profile.html')


