**Feature Selection Rationale**

The goal of feature selection is to choose variables that are likely to be good indicators of customer similarity. We want features that capture:

1. **Customer Demographics:** Basic information about the customer.
2. **Purchase Behavior:** How much, how often, and what they buy.
3. **Engagement/Recency:** How recently they have interacted with the platform.

Here's the rationale for each feature:

**Customer Features:**

*   **Region (One-Hot Encoded):**
    *   **Reasoning:** Customers from the same region might share similar preferences, cultural influences, or economic backgrounds, which could affect their purchasing behavior. One-hot encoding allows us to use this categorical variable in numerical similarity calculations.
*   **SignupYear:**
    *   **Reasoning:** The year a customer signed up can be a proxy for their experience with the platform or their potential loyalty. Customers who signed up around the same time might have been exposed to similar marketing campaigns or platform features.
*   **SignupMonth:**
    *   **Reasoning:** The month of signup might capture some seasonal effects. For example, customers who signed up during a holiday season might have different purchase patterns than those who signed up during other times of the year.
*   **DaysSinceSignup:**
    *   **Reasoning:** This feature captures the customer's tenure or how long they've been a customer. Longer-tenured customers might have different purchase behaviors (e.g., more loyalty, higher spending) compared to newer customers.

**Transaction Features:**

*   **TotalSpending:**
    *   **Reasoning:** This is a direct measure of the customer's overall value to the business. Customers who spend similar amounts are likely to be similar in terms of their purchasing power or their level of engagement with the platform.
*   **NumPurchases:**
    *   **Reasoning:** This feature reflects the customer's purchase frequency. Customers who make a similar number of purchases might have similar needs or shopping habits.
*   **AvgPurchaseValue:**
    *   **Reasoning:** This captures the average amount spent per transaction. It helps to distinguish between customers who make many small purchases and those who make fewer but larger purchases.
*   **Category Purchases (Per-Category Counts):**
    *   **Reasoning:** The categories a customer buys from are strong indicators of their preferences and interests, so customers who buy from similar categories are likely to have similar needs or lifestyles. The notebook builds these columns with a crosstab, so each column holds the number of transactions in that category rather than a 0/1 flag.
*   **Recency:**
    *   **Reasoning:** This feature indicates how recently a customer has made a purchase. Customers who have purchased recently are more likely to be actively engaged with the platform. Similar recency suggests similar current engagement levels.
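
As a minimal sketch of how the transaction features above can be derived, the example below aggregates a tiny made-up transaction log in one pass with pandas named aggregation. The column names mirror the notebook's; the rows and the `snapshot` reference date are invented for illustration.

```python
import pandas as pd

# Tiny made-up transaction log using the same column names as the notebook
toy = pd.DataFrame({
    'TransactionID': ['T1', 'T2', 'T3', 'T4'],
    'CustomerID': ['C1', 'C1', 'C2', 'C1'],
    'TotalValue': [10.0, 30.0, 200.0, 20.0],
    'TransactionDate': pd.to_datetime(
        ['2024-01-05', '2024-03-01', '2024-02-10', '2024-02-15']),
})
snapshot = pd.Timestamp('2024-04-01')  # hypothetical reference date

# One row per customer: spend, frequency, average basket, last purchase date
summary = toy.groupby('CustomerID').agg(
    TotalSpending=('TotalValue', 'sum'),
    NumPurchases=('TransactionID', 'count'),
    AvgPurchaseValue=('TotalValue', 'mean'),
    LastPurchase=('TransactionDate', 'max'),
)

# Recency: whole days between the last purchase and the reference date
summary['Recency'] = (snapshot - summary['LastPurchase']).dt.days
print(summary)
```

Here `C1` makes three small purchases and `C2` one large one, which is exactly the contrast that `AvgPurchaseValue` is meant to expose.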

**Why These Features?**

*   **Relevance to Similarity:** Each of these features is potentially relevant to determining customer similarity in the context of an eCommerce business. They capture different aspects of a customer's profile and behavior that could influence their preferences and future actions.
*   **Availability:** These features can be derived from the commonly available data in customer and transaction datasets.
*   **Interpretability:** The features are relatively easy to understand and interpret, which makes it easier to explain the model's recommendations.

**Why Cosine Similarity?**

Cosine similarity is a good choice for this problem because:

*   **High-Dimensional Data:** It works well with high-dimensional data, which is common in lookalike models where we have many features (especially after one-hot encoding).
*   **Magnitude Invariance:** It focuses on the angle between vectors, not their magnitudes. This is important because we want to consider customers similar even if one has made many more purchases than the other, as long as their purchase patterns and preferences are similar.
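
To make the magnitude-invariance point concrete, here is a small sketch with made-up purchase vectors. The `cosine` helper and the three customers are illustrative assumptions, not part of the notebook; the helper computes the same quantity as scikit-learn's `cosine_similarity` for a single pair.

```python
import numpy as np

def cosine(u, v):
    """Cosine of the angle between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical per-category purchase counts (Books, Clothing, Electronics)
light_buyer = np.array([1.0, 2.0, 0.0])
heavy_buyer = light_buyer * 10             # same mix, ten times the volume
other_buyer = np.array([0.0, 1.0, 5.0])    # different mix

same_mix = cosine(light_buyer, heavy_buyer)
diff_mix = cosine(light_buyer, other_buyer)
euclidean = float(np.linalg.norm(light_buyer - heavy_buyer))

print(f"same mix: {same_mix:.3f}, different mix: {diff_mix:.3f}, "
      f"euclidean distance (same mix): {euclidean:.2f}")
```

The two customers with the same mix score as identical under cosine similarity even though a Euclidean distance would place them far apart.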

**Why Standardize Numerical Features?**

Standardization (scaling to zero mean and unit variance) is important for numerical features because:

*   **Equal Weighting:** Cosine similarity is built from dot products and vector norms, so features with larger values or ranges (for example TotalSpending compared with NumPurchases) can dominate the score. Standardization puts all numerical features on a comparable scale so that no single feature dominates simply because of its units.
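
A small sketch of the effect, using made-up numbers: two customers who spend similar amounts but behave very differently look almost identical before scaling, because the spending column swamps everything else. The values and the `cosine` helper are assumptions for illustration; the manual scaling uses the population standard deviation, which is what `StandardScaler` does.

```python
import numpy as np

def cosine(u, v):
    """Cosine of the angle between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical customers: columns are [TotalSpending, NumPurchases, Recency]
raw = np.array([
    [5000.0,  2.0, 300.0],   # big spender, rarely buys, long inactive
    [5200.0, 40.0,   5.0],   # big spender, buys often, recently active
    [ 300.0,  3.0, 280.0],   # small spender, rarely buys, long inactive
])

# Standardize each column to zero mean and unit variance
scaled = (raw - raw.mean(axis=0)) / raw.std(axis=0)

raw_sim = cosine(raw[0], raw[1])
scaled_sim = cosine(scaled[0], scaled[1])
print(f"raw: {raw_sim:.3f}, standardized: {scaled_sim:.3f}")
```

On the raw values the first two customers are nearly indistinguishable; after standardization their opposite purchase frequency and recency pull the score well below zero.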


# Data Loading and Preprocessing

## Load dependencies

In [3]:
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.metrics.pairwise import cosine_similarity
from datetime import datetime

## Load the datasets

In [None]:
customers = pd.read_csv('data/Customers.csv')
products = pd.read_csv('data/Products.csv')
transactions = pd.read_csv('data/Transactions.csv')

## Preprocess the datasets

In [5]:
# Convert date columns to datetime objects
customers['SignupDate'] = pd.to_datetime(customers['SignupDate'])
transactions['TransactionDate'] = pd.to_datetime(transactions['TransactionDate'])

# Merge transactions with products to get product category information
transactions = pd.merge(transactions, products[['ProductID', 'Category']], on='ProductID')

In [12]:
transactions

|     | TransactionID | CustomerID | ProductID | TransactionDate     | Quantity | TotalValue | Price  | Category    |
|-----|---------------|------------|-----------|---------------------|----------|------------|--------|-------------|
| 0   | T00001        | C0199      | P067      | 2024-08-25 12:38:23 | 1        | 300.68     | 300.68 | Electronics |
| 1   | T00112        | C0146      | P067      | 2024-05-27 22:23:54 | 1        | 300.68     | 300.68 | Electronics |
| 2   | T00166        | C0127      | P067      | 2024-04-25 07:38:55 | 1        | 300.68     | 300.68 | Electronics |
| 3   | T00272        | C0087      | P067      | 2024-03-26 22:55:37 | 2        | 601.36     | 300.68 | Electronics |
| 4   | T00363        | C0070      | P067      | 2024-03-21 15:10:10 | 3        | 902.04     | 300.68 | Electronics |
| ... | ...           | ...        | ...       | ...                 | ...      | ...        | ...    | ...         |
| 995 | T00496        | C0118      | P037      | 2024-10-24 08:30:27 | 1        | 459.86     | 459.86 | Electronics |
| 996 | T00759        | C0059      | P037      | 2024-06-04 02:15:24 | 3        | 1379.58    | 459.86 | Electronics |
| 997 | T00922        | C0018      | P037      | 2024-04-05 13:05:32 | 4        | 1839.44    | 459.86 | Electronics |
| 998 | T00959        | C0115      | P037      | 2024-09-29 10:16:02 | 2        | 919.72     | 459.86 | Electronics |


# Feature Engineering

## Customer Features

In [6]:
# One-hot encode 'Region'
customers = pd.get_dummies(customers, columns=['Region'], prefix=['Region'])

# Extract year and month from SignupDate
customers['SignupYear'] = customers['SignupDate'].dt.year
customers['SignupMonth'] = customers['SignupDate'].dt.month

# Calculate days since signup (measured from the time the notebook runs, so raw values change between runs)
today = datetime.now()
customers['DaysSinceSignup'] = (today - customers['SignupDate']).dt.days

## Transaction Features

In [7]:
# Calculate Total Spending
customer_spending = transactions.groupby('CustomerID')['TotalValue'].sum().reset_index()
customer_spending.rename(columns={'TotalValue': 'TotalSpending'}, inplace=True)
customers = pd.merge(customers, customer_spending, on='CustomerID', how='left')

# Calculate Number of Purchases
customer_purchases = transactions.groupby('CustomerID')['TransactionID'].count().reset_index()
customer_purchases.rename(columns={'TransactionID': 'NumPurchases'}, inplace=True)
customers = pd.merge(customers, customer_purchases, on='CustomerID', how='left')

# Calculate Average Purchase Value
customers['AvgPurchaseValue'] = customers['TotalSpending'] / customers['NumPurchases']

# Count purchases per category (crosstab yields per-customer counts, not 0/1 flags)
category_purchases = pd.crosstab(transactions['CustomerID'], transactions['Category']).add_prefix('Category_')
customers = pd.merge(customers, category_purchases, on='CustomerID', how='left')
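
The crosstab above yields purchase counts per category rather than 0/1 indicators. If binary flags are preferred, one possible conversion is sketched below on made-up data. Whether counts or flags work better for the similarity score is a modeling choice the notebook does not settle, and the `toy`, `counts`, and `flags` names are illustrative.

```python
import pandas as pd

# Made-up transactions: C1 buys Books twice and Clothing once, C2 buys Electronics once
toy = pd.DataFrame({
    'CustomerID': ['C1', 'C1', 'C1', 'C2'],
    'Category': ['Books', 'Books', 'Clothing', 'Electronics'],
})

# Same construction as the notebook: one column per category, values are counts
counts = pd.crosstab(toy['CustomerID'], toy['Category']).add_prefix('Category_')

# Collapse counts to flags: 1 if the customer ever bought from the category
flags = (counts > 0).astype(int)

print(counts)
print(flags)
```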

## Recency Features

In [8]:
# Calculate Recency
last_purchase_date = transactions.groupby('CustomerID')['TransactionDate'].max().reset_index()
last_purchase_date.rename(columns={'TransactionDate': 'LastPurchaseDate'}, inplace=True)
customers = pd.merge(customers, last_purchase_date, on='CustomerID', how='left')
customers['Recency'] = (today - customers['LastPurchaseDate']).dt.days
customers.drop('LastPurchaseDate', axis=1, inplace=True)
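
Because `today` comes from the system clock, the raw `Recency` and `DaysSinceSignup` values change from run to run. Once they are standardized, though, a different reference date only shifts every value by the same number of days, and standardization removes a constant shift. The sketch below illustrates this with made-up dates at midnight; with real timestamps the whole-day truncation in `.dt.days` can still cause off-by-one differences for individual rows. The `standardized_recency` helper is hypothetical.

```python
import pandas as pd

# Made-up last-purchase dates for three customers
last_purchase = pd.Series(pd.to_datetime(['2024-01-10', '2024-06-01', '2024-11-20']))

def standardized_recency(reference):
    """Days since last purchase, scaled to zero mean and unit variance."""
    days = (reference - last_purchase).dt.days
    return (days - days.mean()) / days.std(ddof=0)

# Two different reference dates, roughly six months apart
a = standardized_recency(pd.Timestamp('2024-12-31'))
b = standardized_recency(pd.Timestamp('2025-06-30'))

print(a.round(4).tolist())
print(b.round(4).tolist())
```

Both calls print the same standardized values, so pinning the reference date mainly matters for reproducing the raw feature table.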

## Check for NaN values

In [14]:
# Count NaN values per column
customers.isnull().sum()

CustomerID              0
CustomerName            0
SignupDate              0
Region_Asia             0
Region_Europe           0
Region_North America    0
Region_South America    0
SignupYear              0
SignupMonth             0
DaysSinceSignup         0
TotalSpending           1
NumPurchases            1
AvgPurchaseValue        1
Category_Books          1
Category_Clothing       1
Category_Electronics    1
Category_Home Decor     1
Recency                 1
dtype: int64

In [16]:
# Drop the one customer with no transactions (NaN in every transaction-derived column)
customers.dropna(inplace=True)
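
One thing to watch after `dropna`: the surviving rows keep their original index labels, so labels and row positions stop matching for every row after the dropped one. That matters whenever an index label is later used to address a NumPy array built from the frame, such as a similarity matrix. A small sketch on made-up data:

```python
import pandas as pd

toy = pd.DataFrame({
    'CustomerID': ['C1', 'C2', 'C3', 'C4'],
    'TotalSpending': [10.0, None, 30.0, 40.0],
})
toy = toy.dropna()  # removes C2, but C3 and C4 keep labels 2 and 3

# Index label versus row position for the same customer
label = toy[toy['CustomerID'] == 'C4'].index[0]
position = toy['CustomerID'].tolist().index('C4')

# Resetting the index makes labels and positions agree again
reindexed = toy.reset_index(drop=True)
label_after = reindexed[reindexed['CustomerID'] == 'C4'].index[0]

print(label, position, label_after)
```

Either resetting the index right after dropping rows or looking rows up by position avoids the mismatch.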

# Similarity Calculation

## Feature Calculation

In [18]:
features_for_similarity = ['SignupYear', 'SignupMonth', 'DaysSinceSignup', 'TotalSpending', 
                           'NumPurchases', 'AvgPurchaseValue', 'Recency'] + list(category_purchases.columns)
# Add one-hot encoded region columns
features_for_similarity += [col for col in customers.columns if 'Region_' in col]

## Standardization

In [19]:
# Separate numerical and categorical features
numerical_features = ['SignupYear', 'SignupMonth', 'DaysSinceSignup', 'TotalSpending', 'NumPurchases', 'AvgPurchaseValue','Recency']
categorical_features = list(category_purchases.columns) + [col for col in customers.columns if 'Region_' in col]

# Standardize numerical features
scaler = StandardScaler()
customers[numerical_features] = scaler.fit_transform(customers[numerical_features])

## Cosine Similarity Calculation

In [20]:
# Prepare the feature matrix for similarity calculation
customer_features = customers[features_for_similarity]

# Calculate cosine similarity
similarity_matrix = cosine_similarity(customer_features)

# Lookalike Recommendation

## Lookalike Function

In [21]:
def get_lookalikes(customer_id, similarity_matrix, customers_df, top_n=3):
    # Use the row position, not the index label: similarity_matrix rows follow the
    # row order of customers_df, and labels are no longer contiguous after dropna()
    customer_pos = customers_df['CustomerID'].tolist().index(customer_id)

    # Pair every other customer's position with its score, skipping the customer itself
    similar_customers = [(pos, score)
                         for pos, score in enumerate(similarity_matrix[customer_pos])
                         if pos != customer_pos]
    similar_customers.sort(key=lambda x: x[1], reverse=True)

    lookalikes = []
    for pos, score in similar_customers[:top_n]:
        lookalike_id = customers_df.iloc[pos]['CustomerID']
        lookalikes.append((lookalike_id, float(score)))  # plain float, not a NumPy scalar
    return lookalikes
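
For larger customer bases, the per-customer Python sort can be replaced by one vectorized pass over the whole matrix. The sketch below is one possible approach on a made-up 4×4 similarity matrix; `top_lookalikes`, `ids`, and `sim` are hypothetical names, and it assumes the rows of the matrix are in the same order as `ids`.

```python
import numpy as np

ids = np.array(['C1', 'C2', 'C3', 'C4'])
sim = np.array([
    [1.0, 0.9, 0.2, 0.5],
    [0.9, 1.0, 0.1, 0.4],
    [0.2, 0.1, 1.0, 0.8],
    [0.5, 0.4, 0.8, 1.0],
])

def top_lookalikes(sim, ids, top_n=2):
    masked = sim.copy()
    np.fill_diagonal(masked, -np.inf)  # never recommend a customer to themselves
    # Column positions of the top_n highest scores in each row
    order = np.argsort(-masked, axis=1)[:, :top_n]
    return {
        str(ids[row]): [(str(ids[col]), float(masked[row, col])) for col in order[row]]
        for row in range(len(ids))
    }

result = top_lookalikes(sim, ids)
print(result['C1'])
```

Masking the diagonal excludes each customer explicitly instead of relying on the self-match sorting first.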

## Generate Lookalike Recommendations for the first 20 customers

In [22]:
lookalike_data = []
for i in range(1, 21):
    customer_id = f'C{i:04}'
    lookalikes = get_lookalikes(customer_id, similarity_matrix, customers)
    lookalike_data.append({'cust_id': customer_id, 'lookalikes': lookalikes})

lookalike_df = pd.DataFrame(lookalike_data)

## Generate Lookalike.csv file

In [25]:
lookalike_dict = {}
for index, row in lookalike_df.iterrows():
    cust_id = row['cust_id']
    lookalikes_list = []
    for lookalike_id, score in row['lookalikes']:
        lookalikes_list.append([lookalike_id, score])
    lookalike_dict[cust_id] = lookalikes_list

lookalike_final_df = pd.DataFrame(list(lookalike_dict.items()), columns=['cust_id', 'List<cust_id, score>'])

In [26]:
lookalike_final_df

|   | cust_id | `List<cust_id, score>`                             |
|---|---------|----------------------------------------------------|
| 0 | C0001   | [[C0184, 0.8355257283673876], [C0005, 0.799847...  |
| 1 | C0002   | [[C0134, 0.9651268284105505], [C0166, 0.849010...  |
| 2 | C0003   | [[C0129, 0.9035246927758588], [C0031, 0.883629...  |
| 3 | C0004   | [[C0113, 0.9370076150944698], [C0122, 0.868311...  |
| 4 | C0005   | [[C0007, 0.9064548766601567], [C0199, 0.812312...  |
| 5 | C0006   | [[C0187, 0.7825101903373963], [C0126, 0.779802...  |
| 6 | C0007   | [[C0005, 0.9064548766601567], [C0112, 0.773751...  |
| 7 | C0008   | [[C0194, 0.8915192640736973], [C0059, 0.869261...  |
| 8 | C0009   | [[C0077, 0.8225586718117179], [C0010, 0.778424...  |
| 9 | C0010   | [[C0077, 0.8049878155520012], [C0062, 0.799237...  |


In [27]:
lookalike_final_df.to_csv("Lookalike.csv", index=False)
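
Since each cell in the second column is a Python list, `to_csv` writes it as its string representation, and anyone reading the file back gets strings rather than lists. The sketch below shows one way to parse such a cell with `ast.literal_eval`, using an in-memory buffer and made-up scores instead of the real file. It assumes the scores are plain Python floats; NumPy scalars may be written in a form (such as `np.float64(...)` under NumPy 2) that `literal_eval` rejects.

```python
import ast
import io

import pandas as pd

# Made-up output frame with the same column layout as Lookalike.csv
out = pd.DataFrame({
    'cust_id': ['C0001'],
    'List<cust_id, score>': [[['C0184', 0.84], ['C0005', 0.8]]],
})

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
out.to_csv(buffer, index=False)
buffer.seek(0)

# Reading back yields a string per cell; literal_eval turns it into a list again
loaded = pd.read_csv(buffer)
raw_cell = loaded.loc[0, 'List<cust_id, score>']
parsed = ast.literal_eval(raw_cell)

print(type(raw_cell).__name__, parsed)
```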