# Basic Imports

In [None]:
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.preprocessing import StandardScaler
import numpy as np

# Load and Merge the Datasets

In [None]:
customers = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/eCommerce files/Customers.csv')
products = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/eCommerce files/Products.csv')
transactions = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/eCommerce files/Transactions.csv')

# Merge datasets
merged = transactions.merge(customers, on='CustomerID', how='left').merge(products, on='ProductID', how='left')

# Feature Engineering on Customer Data

In [None]:
profile_features = customers.copy()
profile_features['DaysSinceSignup'] = (pd.Timestamp.now() - pd.to_datetime(profile_features['SignupDate'])).dt.days
profile_features = pd.get_dummies(profile_features, columns=['Region'], drop_first=True)

**Feature Engineering (on Customer Data):**
   - `profile_features = customers.copy()`:  Creates a copy of the customer data to avoid modifying the original.
   - `profile_features['DaysSinceSignup'] = ...`:  Calculates the number of days since each customer signed up. It uses the current timestamp (`pd.Timestamp.now()`) and subtracts the signup date (`profile_features['SignupDate']`).  The result is stored in a new column 'DaysSinceSignup'.
   - `profile_features = pd.get_dummies(...)`:  Performs one-hot encoding on the 'Region' column. One-hot encoding transforms categorical data into indicator columns (0s and 1s). `drop_first=True` removes the first category column to avoid multicollinearity (redundant information). pandas orders the categories alphabetically, so if there were regions 'North', 'South', 'East', 'West', this would create the columns 'Region_North', 'Region_South', 'Region_West' (no 'Region_East'), because a customer with zeros in all three columns must be from 'East'.
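
To check which dummy column `drop_first=True` actually removes, here is a minimal sketch on made-up data (the customer IDs and region names are illustrative and not taken from `Customers.csv`). pandas sorts the categories of a string column before dropping the first one, so the implicit baseline is the alphabetically first region.

```python
import pandas as pd

# Toy customer table; IDs and regions are invented for illustration.
toy_customers = pd.DataFrame({
    "CustomerID": ["C1", "C2", "C3", "C4"],
    "Region": ["North", "South", "East", "West"],
})

# Same call as in the notebook: one-hot encode Region and drop the first level.
toy_encoded = pd.get_dummies(toy_customers, columns=["Region"], drop_first=True)

# Categories are sorted alphabetically, so 'East' becomes the implicit baseline.
region_columns = [c for c in toy_encoded.columns if c.startswith("Region_")]
print(region_columns)
```

A row with zeros in every remaining `Region_*` column then stands for the dropped baseline category.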

In [None]:
transaction_features = merged.groupby('CustomerID').agg({
    'TotalValue': ['mean', 'sum'],
    'TransactionID': 'count',
    'Category': lambda x: x.mode()[0] if not x.mode().empty else np.nan
}).reset_index()
transaction_features.columns = ['CustomerID', 'AvgTransactionValue', 'TotalSpending', 'TransactionCount', 'FavoriteCategory']

**Feature Engineering (on Transaction data):**

1. **Grouping by Customer:** `merged.groupby('CustomerID')`:  The transactions are grouped by each unique customer ID. This allows us to aggregate transaction information for each customer individually.

2. **Aggregation:**  The `.agg(...)` function calculates summary statistics for each customer's transactions:
   - `'TotalValue': ['mean', 'sum']`: Calculates the average transaction value (`mean`) and total spending (`sum`) for each customer.
   - `'TransactionID': 'count'`: Counts the total number of transactions made by each customer.
   - `'Category': lambda x: x.mode()[0] if not x.mode().empty else np.nan`:  Finds the most frequent category (mode) purchased by each customer. If several categories tie, `mode()` returns all of them in sorted order and `[0]` picks the first. The `empty` check covers the case where a customer has no non-missing category values; the result is then `np.nan`.

3. **Renaming Columns:**  The code renames the columns of the resulting DataFrame to more descriptive names (e.g., `AvgTransactionValue`, `TotalSpending`, `TransactionCount`, `FavoriteCategory`).
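
The aggregation steps above can be tried on a small, made-up transaction table. This sketch uses pandas named aggregation, which sets the output column names directly instead of renaming them afterwards; that is an alternative style, not what the notebook does. The helper name `most_frequent` is hypothetical. Customer `A` has a tie between two categories, which shows that the alphabetically first mode is the one picked.

```python
import numpy as np
import pandas as pd

# Invented transactions: customer A has a tie between two categories.
toy = pd.DataFrame({
    "CustomerID": ["A", "A", "B", "B", "B"],
    "TransactionID": ["T1", "T2", "T3", "T4", "T5"],
    "TotalValue": [10.0, 30.0, 5.0, 5.0, 20.0],
    "Category": ["Books", "Apparel", "Toys", "Toys", "Books"],
})


def most_frequent(values: pd.Series):
    """Return the most frequent value, or NaN if there are no non-missing values."""
    modes = values.mode()  # sorted; contains several entries when there is a tie
    return modes.iloc[0] if not modes.empty else np.nan


# Named aggregation: output_column=(input_column, aggregation)
toy_features = toy.groupby("CustomerID").agg(
    AvgTransactionValue=("TotalValue", "mean"),
    TotalSpending=("TotalValue", "sum"),
    TransactionCount=("TransactionID", "count"),
    FavoriteCategory=("Category", most_frequent),
).reset_index()

print(toy_features)
```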


In [None]:
# Merge all features
final_data = profile_features.merge(transaction_features, on='CustomerID', how='left')

# Encode favorite categories
final_data = pd.get_dummies(final_data, columns=['FavoriteCategory'], drop_first=True)
final_data.fillna(0, inplace=True)

# Data Normalization

In [None]:
# Normalize data
scaler = StandardScaler()
normalized_features = scaler.fit_transform(final_data.drop(['CustomerID', 'CustomerName', 'SignupDate'], axis=1))

# Build Recommendations

In [None]:
# Compute cosine similarity
similarity_matrix = cosine_similarity(normalized_features)

# Build recommendations for first 20 customers (top 3 look-alike)
top_lookalikes = {}
for i, customer_id in enumerate(final_data['CustomerID'][:20]):
    # Pair each row index with its score, leaving out the customer itself
    similarity_scores = [(idx, score) for idx, score in enumerate(similarity_matrix[i]) if idx != i]
    sorted_scores = sorted(similarity_scores, key=lambda x: x[1], reverse=True)
    top_3 = [(final_data['CustomerID'].iloc[idx], round(score, 2)) for idx, score in sorted_scores[:3]]
    top_lookalikes[customer_id] = top_3

In [None]:
# Create Lookalike.csv
lookalike_data = [{'cust_id': k, 'lookalikes': v} for k, v in top_lookalikes.items()]
lookalike_df = pd.DataFrame(lookalike_data)

In [None]:
lookalike_df
# Let's take a look at the lookalike dataset

,cust_id,lookalikes
0,C0001,"[(C0192, 0.94), (C0184, 0.93), (C0091, 0.93)]"
1,C0002,"[(C0134, 0.97), (C0106, 0.94), (C0088, 0.87)]"
2,C0003,"[(C0052, 1.0), (C0031, 0.96), (C0076, 0.94)]"
3,C0004,"[(C0165, 0.97), (C0155, 0.96), (C0169, 0.91)]"
4,C0005,"[(C0007, 0.96), (C0186, 0.87), (C0140, 0.86)]"
5,C0006,"[(C0187, 0.93), (C0168, 0.89), (C0171, 0.87)]"
6,C0007,"[(C0005, 0.96), (C0140, 0.89), (C0186, 0.76)]"
7,C0008,"[(C0065, 0.85), (C0059, 0.78), (C0189, 0.78)]"
8,C0009,"[(C0010, 0.95), (C0062, 0.93), (C0198, 0.88)]"
9,C0010,"[(C0062, 0.98), (C0103, 0.96), (C0009, 0.95)]"


In [None]:
# saving the dataset
lookalike_df.to_csv('/content/drive/MyDrive/Colab Notebooks/eCommerce files/Lookalike.csv', index=False)

Here's a breakdown of the insights derived from the code:

**1. Data Preparation and Feature Engineering:**

* **Customer Profiles:**  Days since signup and region are used as features.  The 'Region' is one-hot encoded, transforming it into a numerical representation suitable for machine learning algorithms. This lets the similarity calculation take regional differences into account.  Days since signup measures customer tenure (how long the account has existed), which may help surface patterns tied to the customer's lifecycle. Because it is computed from `pd.Timestamp.now()`, its values depend on the date the notebook is run.
* **Transaction Data:** The code aggregates transaction information for each customer, focusing on purchasing behavior. Key features derived are:
    * `AvgTransactionValue`: Average value of each transaction per customer. This indicates typical spending habits.
    * `TotalSpending`: Total money spent by the customer. A higher value suggests a more valuable customer.
    * `TransactionCount`: Number of transactions. Frequency of purchase could be a valuable indicator of customer engagement.
    * `FavoriteCategory`: Most frequently purchased product category. Shows preferred product areas for each customer. This category is then one-hot encoded, just like region.

* **Data Merging:** Customer profiles and transaction aggregates are merged into `final_data`.  This combines all relevant features for each customer into a single dataset.

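* **Handling Missing Data:**  Customers with no transactions have no match in the left merge, so their aggregated columns (`AvgTransactionValue`, `TotalSpending`, `TransactionCount`) are `NaN`; `fillna(0)` sets these to 0. `FavoriteCategory` is one-hot encoded before this step, and `pd.get_dummies` by default encodes a missing category as all zeros rather than `NaN`.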
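
A small sketch, on invented data, of how a left merge introduces missing values: a customer who appears in the profile table but has no rows in the aggregated transaction table gets `NaN` in every aggregated column, and `fillna(0)` then replaces those with 0. The table contents below are assumptions for illustration only.

```python
import pandas as pd

# C3 has a profile but no transactions.
profiles = pd.DataFrame({"CustomerID": ["C1", "C2", "C3"]})
aggregates = pd.DataFrame({
    "CustomerID": ["C1", "C2"],
    "TotalSpending": [120.0, 45.5],
    "TransactionCount": [3, 1],
})

# Left merge keeps every profile row; unmatched rows get NaN.
merged_toy = profiles.merge(aggregates, on="CustomerID", how="left")
missing_before = int(merged_toy.isna().sum().sum())

# Same treatment as in the notebook: replace missing values with 0.
filled = merged_toy.fillna(0)

print(merged_toy)
print(filled)
```

Filling with 0 is a reasonable reading for counts and totals ("no purchases"), though whether it is appropriate for an average is a modelling choice.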

**2. Data Normalization:**

* **Standardization:** Using `StandardScaler`, all features are standardized (z-score normalization), so each column has zero mean and unit variance. This matters for cosine similarity because the score is computed from the raw feature magnitudes: without standardization, features with larger scales (like total spending) would dominate the calculation over features with smaller scales (like the number of transactions).
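
The effect of scale can be illustrated with a minimal sketch on invented numbers (two columns standing in for total spending and transaction count). On the raw values every pair of customers looks almost identical because the large-scale column dominates; after `StandardScaler` the smaller-scale column contributes as well, and some similarities become negative.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.preprocessing import StandardScaler

# Invented rows: [TotalSpending, TransactionCount]
raw = np.array([
    [1000.0, 1.0],
    [1100.0, 9.0],
    [3000.0, 2.0],
    [2900.0, 10.0],
])

# Cosine similarity on raw values: spending dwarfs the transaction count.
raw_sim = cosine_similarity(raw)

# Standardize each column to zero mean and unit variance, then compare again.
scaled = StandardScaler().fit_transform(raw)
scaled_sim = cosine_similarity(scaled)

print(np.round(raw_sim, 3))
print(np.round(scaled_sim, 3))
```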

**3. Look-Alike Model (Cosine Similarity):**

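* **Cosine Similarity:**  The core of the look-alike model is cosine similarity. It measures the cosine of the angle between two vectors (representing customers in this case).  A value of 1 means the two vectors point in the same direction (customers with the same relative feature profile), 0 means they are orthogonal (unrelated profiles), and -1 means they point in opposite directions. Because the standardized features can be negative, scores here range from -1 to 1.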
* **Similarity Matrix:**  A `similarity_matrix` is built, where each entry represents the similarity between two customers.
* **Top Look-Alikes:** For the first 20 customers, the code identifies the top 3 most similar customers (look-alikes) based on the similarity scores. It excludes the customer itself.
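
The steps above can also be sketched with NumPy sorting instead of a Python `sorted` call. This is an illustrative variant, not the notebook's code: the helper name `top_k_lookalikes` and the toy vectors are assumptions. It removes each customer from their own candidate list explicitly by masking the diagonal, so the result does not rely on the self-similarity being ranked first.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity


def top_k_lookalikes(features, ids, k=3):
    """Map each id to its k most similar other ids with scores (hypothetical helper)."""
    sim = cosine_similarity(features)
    np.fill_diagonal(sim, -np.inf)  # a customer is never their own look-alike
    result = {}
    for i, customer_id in enumerate(ids):
        # Stable sort on negated scores: highest similarity first, ties by position.
        best = np.argsort(-sim[i], kind="stable")[:k]
        result[customer_id] = [(ids[j], round(float(sim[i, j]), 2)) for j in best]
    return result


# Toy feature vectors: C1 and C2 are identical, C3 is orthogonal, C4 is opposite.
toy_features = np.array([
    [1.0, 0.0],
    [1.0, 0.0],
    [0.0, 1.0],
    [-1.0, 0.0],
])
toy_ids = ["C1", "C2", "C3", "C4"]

lookalikes = top_k_lookalikes(toy_features, toy_ids, k=2)
print(lookalikes["C1"])
```

The sketch assumes `k` is smaller than the number of customers; otherwise the masked self-entry would appear at the end of the list.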

**4. Output and Storage:**

* **Look-Alike Data:** The results are stored in a DataFrame containing the customer ID and a list of their top 3 look-alike customers and their similarity scores.
* **CSV File:** The look-alike data is saved to `Lookalike.csv`.
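
When a column holding lists of tuples is written to CSV, each list is stored as its string representation and has to be parsed again on load. One possible alternative, sketched below, is a long format with one row per (customer, look-alike) pair. The column names `rank`, `lookalike_id` and `score` are assumptions introduced for this example; the two dictionary entries are copied from the output displayed above.

```python
import pandas as pd

# Two entries copied from the notebook's displayed output.
top_lookalikes = {
    "C0001": [("C0192", 0.94), ("C0184", 0.93), ("C0091", 0.93)],
    "C0002": [("C0134", 0.97), ("C0106", 0.94), ("C0088", 0.87)],
}

# One row per (customer, look-alike) pair, with an explicit rank.
rows = [
    {"cust_id": cust_id, "rank": rank, "lookalike_id": other_id, "score": score}
    for cust_id, matches in top_lookalikes.items()
    for rank, (other_id, score) in enumerate(matches, start=1)
]
lookalike_long = pd.DataFrame(rows)

print(lookalike_long)
```

A table in this shape round-trips through `to_csv` and `read_csv` as plain columns, at the cost of repeating `cust_id` on each row.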

**Overall Insights and Potential Improvements:**

* **Feature Importance:** The code doesn't analyze the importance of each feature. Techniques like feature importance from tree-based models or permutation importance could help identify which factors are most influential in determining look-alike customers. This would enhance interpretability and allow you to refine the feature set.
* **Alternative Similarity Measures:** Explore other similarity or distance metrics (e.g., Euclidean distance, Manhattan distance, or Jaccard similarity) to see if they yield better results. The choice of distance metric depends on the nature of the data and the problem.
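* **Hyperparameter Tuning:** If you move to methods with tunable parameters (for example, the number of neighbours in a nearest-neighbour search, or per-feature weights), tune those parameters against a validation metric rather than fixing them by hand.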
* **Dynamic Threshold:** Instead of a fixed top 3, consider a dynamic threshold based on the similarity score.  This would give you a more flexible way to identify look-alike customers.
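* **Recency, Frequency, Monetary Value (RFM) Analysis:** The current features already cover frequency (`TransactionCount`) and monetary value (`TotalSpending`), but not recency. If the transaction data includes a purchase date, adding the days since each customer's last purchase would complete an RFM view of customer behavior.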
* **Model Evaluation:** No evaluation of the model performance is included in the code. Defining metrics and validation approaches to evaluate the model's effectiveness at identifying truly similar customers would be valuable.
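
As one illustration of the dynamic-threshold idea, the sketch below keeps every other customer whose similarity is at or above a cut-off instead of always taking three. The helper name `lookalikes_above`, the cut-off of 0.9 and the toy scores are all assumptions for this example; a suitable threshold would have to be chosen from the data.

```python
import numpy as np


def lookalikes_above(sim_row, ids, self_index, threshold=0.9, max_results=None):
    """Return (id, score) pairs with similarity >= threshold, best first (hypothetical helper)."""
    sim_row = np.asarray(sim_row)
    order = np.argsort(-sim_row, kind="stable")  # highest similarity first
    picked = [
        (ids[j], float(sim_row[j]))
        for j in order
        if j != self_index and sim_row[j] >= threshold
    ]
    return picked if max_results is None else picked[:max_results]


# Invented similarity row for customer C1 (index 0) against four customers.
ids = ["C1", "C2", "C3", "C4"]
sim_row_c1 = np.array([1.0, 0.95, 0.91, 0.40])

matches = lookalikes_above(sim_row_c1, ids, self_index=0)
print(matches)
```

With a threshold, some customers may get no look-alikes at all and others many, so an optional cap such as `max_results` can keep the output bounded.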


The provided code is a good starting point. These improvements can help make it more robust and insightful.


# END