A Lookalike Model was developed to identify customers similar to a given customer based on their transactional behavior and profile data. 

The model computes similarity from three per-customer features: total revenue (spending), average quantity per transaction (purchase volume), and the most frequently purchased product category (preference).

Approach

Data Preparation:
Merged the Customers, Products, and Transactions datasets.

Aggregated three features for each customer: total revenue, average quantity, and preferred (most frequently purchased) product category.

Encoded categorical variables (e.g., preferred categories) using one-hot encoding.
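
The aggregation and encoding steps above can be illustrated on a toy table. This is a minimal sketch, not the notebook's actual data: the `toy` frame, its values, and the names `toy_features` and `toy_encoded` are hypothetical and exist only for illustration.

```python
import pandas as pd

# Hypothetical toy transactions (assumption: already merged with product info).
toy = pd.DataFrame({
    "CustomerID": ["C1", "C1", "C2", "C2", "C2"],
    "Category":   ["Books", "Books", "Toys", "Books", "Toys"],
    "Quantity":   [1, 3, 2, 2, 4],
    "Price":      [10.0, 10.0, 5.0, 10.0, 5.0],
})
toy["Revenue"] = toy["Quantity"] * toy["Price"]

# One row per customer: total spend, mean quantity, most frequent category.
toy_features = toy.groupby("CustomerID").agg(
    total_revenue=("Revenue", "sum"),
    avg_quantity=("Quantity", "mean"),
    preferred_category=("Category", lambda s: s.mode().iloc[0]),
).reset_index()

# One-hot encode the single categorical column.
toy_encoded = pd.get_dummies(toy_features, columns=["preferred_category"])
print(toy_encoded)
```

Each category becomes its own indicator column (here `preferred_category_Books` and `preferred_category_Toys`), so the encoded frame is fully numeric apart from `CustomerID`.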

Model Development:
Standardized the data using StandardScaler.
Computed pairwise similarity using cosine similarity.
Identified the top 3 most similar customers for each customer.
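
The three model steps can be sketched as one small function. This is an illustrative variant under stated assumptions, not the notebook's exact code: `top_k_lookalikes`, `toy_X`, and `toy_ids` are hypothetical names, and the sketch excludes each customer's own entry by masking the diagonal rather than by position in a sorted list.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.preprocessing import StandardScaler


def top_k_lookalikes(features, ids, k=3):
    """Return {id: [(other_id, score), ...]} with the k most similar rows."""
    scaled = StandardScaler().fit_transform(features)  # zero mean, unit variance per column
    sim = cosine_similarity(scaled)                    # (n, n) pairwise similarity
    np.fill_diagonal(sim, -np.inf)                     # a customer is never its own lookalike
    result = {}
    for row, cust_id in enumerate(ids):
        order = np.argsort(-sim[row], kind="stable")[:k]  # indices of the k highest scores
        result[cust_id] = [(ids[j], float(sim[row, j])) for j in order]
    return result


# Hypothetical toy data: two low spenders (A, B) and two high spenders (C, D).
toy_X = np.array([[100.0, 1.0], [110.0, 1.2], [900.0, 5.0], [950.0, 5.5]])
toy_ids = ["A", "B", "C", "D"]
toy_lookalikes = top_k_lookalikes(toy_X, toy_ids, k=2)
print(toy_lookalikes["A"])
```

Masking the diagonal keeps the result correct even when another customer has a similarity of exactly 1.0, in which case a position-based slice could drop the wrong row.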

Output:
Generated the top 3 lookalikes and their similarity scores for the first 20 customers (C0001 to C0020).

Saved the recommendations to a CSV file with columns:

CustomerID, SimilarCustomer1, Score1, SimilarCustomer2, Score2, SimilarCustomer3, Score3.
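
To make the row layout concrete, the sketch below writes one row in this column order to an in-memory buffer. The helper `lookalike_row` is a hypothetical name introduced for illustration; the example values are the C0001 scores reported in this document, shown at three decimals.

```python
import csv
import io


def lookalike_row(cust_id, matches):
    """Flatten [(id1, s1), (id2, s2), ...] into [cust_id, id1, s1, id2, s2, ...]."""
    row = [cust_id]
    for other_id, score in matches:
        row.extend([other_id, score])
    return row


header = ["CustomerID", "SimilarCustomer1", "Score1",
          "SimilarCustomer2", "Score2", "SimilarCustomer3", "Score3"]

buffer = io.StringIO()  # stands in for a real file in this sketch
writer = csv.writer(buffer)
writer.writerow(header)
writer.writerow(lookalike_row("C0001", [("C0181", 0.996), ("C0055", 0.992), ("C0048", 0.992)]))
csv_text = buffer.getvalue()
print(csv_text)
```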

Sample Results

Customer C0001:

SimilarCustomer1: C0181 (Score: 0.996)

SimilarCustomer2: C0055 (Score: 0.992)

SimilarCustomer3: C0048 (Score: 0.992)

Customer C0002:

SimilarCustomer1: C0029 (Score: 1.000)

SimilarCustomer2: C0062 (Score: 0.991)

SimilarCustomer3: C0030 (Score: 0.984)

In [17]:
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.preprocessing import StandardScaler

customers_df = pd.read_csv('dataset/Customers.csv')
products_df = pd.read_csv('dataset/Products.csv')
transactions_df = pd.read_csv('dataset/Transactions.csv')

transactions_df['TransactionDate'] = pd.to_datetime(transactions_df['TransactionDate'])

In [18]:
merged_df = transactions_df.merge(customers_df, on='CustomerID').merge(products_df, on='ProductID')

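# The merge produces Price_x (left side, from Transactions) and Price_y (from Products);
# revenue is computed here from the product price.
merged_df['Revenue'] = merged_df['Quantity'] * merged_df['Price_y']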

# Aggregate customer features
customer_features = merged_df.groupby('CustomerID').agg(
    total_revenue=('Revenue', 'sum'),
    avg_quantity=('Quantity', 'mean'),
    preferred_category=('Category', lambda x: x.mode()[0])  # Most frequent category; ties resolve to the first in sorted order
).reset_index()

In [19]:
customer_features = pd.get_dummies(customer_features, columns=['preferred_category'])

scaler = StandardScaler()
scaled_features = scaler.fit_transform(customer_features.drop(columns=['CustomerID']))

similarity_matrix = cosine_similarity(scaled_features)

similarity_df = pd.DataFrame(similarity_matrix, index=customer_features['CustomerID'], columns=customer_features['CustomerID'])

# Generate lookalike recommendations for the first 20 customers
lookalike_data = {}
for cust_id in customer_features['CustomerID'][:20]:
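    # Drop the customer's own entry explicitly; slicing [1:4] can skip a real match when another customer ties at 1.0
    similar_customers = similarity_df[cust_id].drop(cust_id).sort_values(ascending=False).head(3)  # Top 3 lookalikes
    lookalike_data[cust_id] = (similar_customers.index.tolist(), similar_customers.values.tolist())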

In [20]:
import csv
with open('Neeraj_Kumark_Lookalike.csv', mode='w', newline='') as file:
    writer = csv.writer(file)
    writer.writerow(['CustomerID', 'SimilarCustomer1', 'Score1', 'SimilarCustomer2', 'Score2', 'SimilarCustomer3', 'Score3'])
    for cust_id, (similar_ids, scores) in lookalike_data.items():
        # Flatten [(id1, s1), (id2, s2), (id3, s3)] into id1, s1, id2, s2, id3, s3
        writer.writerow([cust_id, *sum(zip(similar_ids, scores), ())])

print("Sample Lookalike Recommendations:")
for cust_id, (similar_ids, scores) in list(lookalike_data.items())[:5]:
    print(f"Customer {cust_id}:")
    for idx, (sim_id, score) in enumerate(zip(similar_ids, scores), start=1):
        print(f"  SimilarCustomer{idx}: {sim_id} (Score: {score:.3f})")

Sample Lookalike Recommendations:
Customer C0001:
  SimilarCustomer1: C0181 (Score: 0.996)
  SimilarCustomer2: C0055 (Score: 0.992)
  SimilarCustomer3: C0048 (Score: 0.992)
Customer C0002:
  SimilarCustomer1: C0029 (Score: 1.000)
  SimilarCustomer2: C0062 (Score: 0.991)
  SimilarCustomer3: C0030 (Score: 0.984)
Customer C0003:
  SimilarCustomer1: C0089 (Score: 0.999)
  SimilarCustomer2: C0136 (Score: 0.954)
  SimilarCustomer3: C0110 (Score: 0.952)
Customer C0004:
  SimilarCustomer1: C0171 (Score: 0.993)
  SimilarCustomer2: C0168 (Score: 0.991)
  SimilarCustomer3: C0153 (Score: 0.989)
Customer C0005:
  SimilarCustomer1: C0186 (Score: 0.998)
  SimilarCustomer2: C0199 (Score: 0.998)
  SimilarCustomer3: C0140 (Score: 0.991)
