## TASK 2: Lookalike Model

Build a Lookalike Model that takes a user's information as input and recommends 3 similar customers based on their profile and transaction history. The model should:
- Use both customer and product information.
- Assign a similarity score to each recommended customer.

Deliverables:
- Give the top 3 lookalikes with their similarity scores for the first 20 customers (CustomerID: C0001 - C0020) in Customers.csv. Form a “Lookalike.csv” which has just one map: Map<cust_id, List<cust_id, score>>
- A Jupyter Notebook/Python script explaining your model development. 

Evaluation Criteria:
- Model accuracy and logic.
- Quality of recommendations and similarity scores.

In [43]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.metrics.pairwise import cosine_similarity

In [44]:
df_customers = pd.read_csv("/home/dispers/PRACHI/Customers.csv")
df_products = pd.read_csv("/home/dispers/PRACHI/Products.csv")
df_txns = pd.read_csv("/home/dispers/PRACHI/Transactions.csv")

# Merging all three datasets to ease analysis
df_merged = pd.merge(df_txns, df_products, how = "left", on = ["ProductID","Price"])
df_merged = pd.merge(df_merged, df_customers, how = "left", on = "CustomerID")
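
The first merge above joins on both `ProductID` and `Price`, so a transaction whose price does not exactly match the catalogue price would be kept but left with missing product fields. A minimal sketch of how this could be checked, using small made-up frames (the toy values and the names `txns`, `products`, `checked` and `unmatched` are illustrative and not part of the dataset):

```python
import pandas as pd

# Toy stand-ins for Transactions.csv and Products.csv (hypothetical values)
txns = pd.DataFrame({
    "TransactionID": ["T1", "T2", "T3"],
    "ProductID": ["P1", "P2", "P1"],
    "Price": [10.0, 20.0, 10.5],   # T3's price differs from the catalogue price
})
products = pd.DataFrame({
    "ProductID": ["P1", "P2"],
    "Price": [10.0, 20.0],
    "Category": ["Books", "Clothing"],
})

# indicator=True adds a _merge column marking each row as "both" or "left_only"
checked = pd.merge(txns, products, how="left", on=["ProductID", "Price"], indicator=True)
unmatched = checked[checked["_merge"] == "left_only"]
print(len(unmatched), "transaction(s) without a matching product row")
```

If the count is non-zero on the real data, merging on `ProductID` alone may be the safer choice.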


Region is stored as text, which numeric methods cannot use directly. We therefore one-hot encode it into one boolean column per region, which is easier for the model to process and compare.

In [45]:
# One hot encoding for the various regions
df_merged = pd.get_dummies(df_merged, columns=['Region'])

Every customer makes transactions in different months of the year. Extracting month from the transaction dates can help us analyze the customer's behaviour according to the season. It is possible that a certain type/category of product is purchased more in a certain month of the year. 

In [46]:
# Getting month in which the transaction was made
df_merged['TransactionMonth'] = pd.to_datetime(df_merged['TransactionDate']).dt.month

As we need to recommend three similar customers based on each customer's personal information and transaction history, we build one profile per customer.

In [47]:
cust_profiles = df_merged.groupby('CustomerID').agg({
    'TotalValue': 'sum',
    'TransactionID':'count',
    'Quantity':'sum',
    'TransactionMonth': lambda x: x.mode()[0],    
    'Region_Asia': 'first',
    'Region_Europe': 'first',
    'Region_North America': 'first',
    'Region_South America': 'first',
    'Category': lambda x: ','.join(x),  
}).reset_index()

cust_profiles['AvgTxnValue'] = cust_profiles['TotalValue'] / cust_profiles['TransactionID']
cust_profiles = cust_profiles.rename(columns={"TransactionID":"TotalTxns"})
cust_profiles

Unnamed: 0,CustomerID,TotalValue,TotalTxns,Quantity,TransactionMonth,Region_Asia,Region_Europe,Region_North America,Region_South America,Category,AvgTxnValue
0,C0001,3354.52,5,12,1,False,False,False,True,"Books,Home Decor,Electronics,Electronics,Elect...",670.904000
1,C0002,1862.74,4,10,2,True,False,False,False,"Home Decor,Home Decor,Clothing,Clothing",465.685000
2,C0003,2725.38,4,14,6,False,False,False,True,"Home Decor,Home Decor,Clothing,Electronics",681.345000
3,C0004,5354.88,8,23,12,False,False,False,True,"Books,Home Decor,Home Decor,Home Decor,Books,B...",669.360000
4,C0005,2034.24,3,7,3,True,False,False,False,"Home Decor,Electronics,Electronics",678.080000
...,...,...,...,...,...,...,...,...,...,...,...
194,C0196,4982.88,4,12,8,False,True,False,False,"Books,Clothing,Home Decor,Home Decor",1245.720000
195,C0197,1928.65,3,9,1,False,True,False,False,"Home Decor,Electronics,Electronics",642.883333
196,C0198,931.83,2,3,9,False,True,False,False,"Electronics,Clothing",465.915000
197,C0199,1979.28,4,9,8,False,True,False,False,"Electronics,Home Decor,Home Decor,Electronics",494.820000


Categories are also stored as text, so we one-hot encode them as we did for the regions. Each category column is 1 if the customer bought at least one product from that category and 0 otherwise.

The ranges of the columns vary greatly (for example, Quantity and TotalValue), so we use MinMaxScaler to rescale each feature to the range 0-1. This keeps the relative spacing between values within a column while stopping large-valued columns from dominating the similarity score.

In [48]:
categories_encoded = cust_profiles['Category'].str.get_dummies(sep=',')
cust_profiles = pd.concat([cust_profiles, categories_encoded], axis=1).drop(columns=['Category'])

scaler = MinMaxScaler()
features = ['Quantity', 'TotalValue', 'TotalTxns'] + list(categories_encoded.columns)
cust_profiles[features] = scaler.fit_transform(cust_profiles[features])
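
To make the scaling step concrete: min-max scaling maps each value x in a column to (x - min) / (max - min). The sketch below applies this by hand to a toy column and compares it with `MinMaxScaler`; the numbers are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy column (hypothetical values), e.g. total spend for four customers
x = np.array([[100.0], [250.0], [400.0], [1000.0]])

# Manual min-max scaling, computed per column
manual = (x - x.min(axis=0)) / (x.max(axis=0) - x.min(axis=0))

# The same transformation via scikit-learn
scaled = MinMaxScaler().fit_transform(x)

print(np.allclose(manual, scaled))
print(scaled.ravel())
```

The smallest value in each column becomes 0 and the largest becomes 1, so columns that are already 0/1 indicators are left unchanged.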


In [49]:
cust_profiles

Unnamed: 0,CustomerID,TotalValue,TotalTxns,Quantity,TransactionMonth,Region_Asia,Region_Europe,Region_North America,Region_South America,AvgTxnValue,Books,Clothing,Electronics,Home Decor
0,C0001,0.308942,0.4,0.354839,1,False,False,False,True,670.904000,1.0,0.0,1.0,1.0
1,C0002,0.168095,0.3,0.290323,2,True,False,False,False,465.685000,0.0,1.0,0.0,1.0
2,C0003,0.249541,0.3,0.419355,6,False,False,False,True,681.345000,0.0,1.0,1.0,1.0
3,C0004,0.497806,0.7,0.709677,12,False,False,False,True,669.360000,1.0,0.0,1.0,1.0
4,C0005,0.184287,0.2,0.193548,3,True,False,False,False,678.080000,0.0,0.0,1.0,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
194,C0196,0.462684,0.3,0.354839,8,False,True,False,False,1245.720000,1.0,1.0,0.0,1.0
195,C0197,0.174318,0.2,0.258065,1,False,True,False,False,642.883333,0.0,0.0,1.0,1.0
196,C0198,0.080203,0.1,0.064516,9,False,True,False,False,465.915000,0.0,1.0,1.0,0.0
197,C0199,0.179098,0.3,0.258065,8,False,True,False,False,494.820000,0.0,0.0,1.0,1.0


Next we compute the cosine similarity between customers based on the scaled features of their profiles.
Each customer is treated as a vector of features, and the similarity score is the cosine of the angle between two such vectors. The resulting matrix is then converted to a DataFrame indexed by CustomerID for further processing.
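
As a small worked example of the score itself, the sketch below computes the cosine similarity of two toy vectors by hand and checks it against scikit-learn. The vectors `a` and `b` are hypothetical, not rows from the dataset.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Two toy feature vectors (hypothetical values, already scaled to 0-1)
a = np.array([0.3, 0.4, 1.0, 0.0])
b = np.array([0.2, 0.3, 1.0, 1.0])

# cos(theta) = (a . b) / (||a|| * ||b||)
manual = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# scikit-learn expects 2-D arrays and returns a pairwise matrix
sk = cosine_similarity(a.reshape(1, -1), b.reshape(1, -1))[0, 0]

print(round(manual, 4), round(sk, 4))
```

Because every feature here is non-negative, the scores fall between 0 and 1, with 1 meaning the two vectors point in the same direction.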

In [50]:
similarity_matrix = cosine_similarity(cust_profiles[features])
similarity_df = pd.DataFrame(similarity_matrix, index=cust_profiles['CustomerID'], columns=cust_profiles['CustomerID'])
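
Note that the `features` list used here covers the spend, transaction and category columns but not the `Region_*` flags. If region should also influence the score, one possible variant is sketched below on toy data; the frame `toy`, its values, and the names `base_features` and `region_cols` are assumptions made for illustration only.

```python
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# Toy profiles (hypothetical values) with the same kinds of columns as cust_profiles
toy = pd.DataFrame({
    "CustomerID": ["C1", "C2", "C3"],
    "TotalValue": [0.30, 0.31, 0.29],
    "Books": [1.0, 1.0, 1.0],
    "Region_Asia": [True, False, True],
    "Region_Europe": [False, True, False],
})

base_features = ["TotalValue", "Books"]
region_cols = [c for c in toy.columns if c.startswith("Region_")]

# Cast the boolean flags to float so they sit on the same 0-1 scale as the rest
X = toy[base_features + region_cols].astype(float)

sim = pd.DataFrame(cosine_similarity(X), index=toy["CustomerID"], columns=toy["CustomerID"])
print(sim.round(4))
```

In this toy case C1 and C3 share a region and score much closer to each other than to C2, even though all three have nearly identical spend.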

Finding the three most similar customers for each of the first 20 customers (C0001 - C0020) based on the similarity scores.

In [51]:
lookalikes = {}
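for cust_id in cust_profiles['CustomerID'][:20]:
    # Drop the customer's own entry explicitly rather than assuming it ranks first
    similar_customers = similarity_df[cust_id].drop(cust_id).nlargest(3)
    # Cast to a plain float so the scores are written to the CSV as bare numbers
    lookalikes[cust_id] = [(other_id, round(float(score), 4)) for other_id, score in similar_customers.items()]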

Creating a CSV to store the lookalikes as required.

In [52]:
lookalikes_list = [
    {'CustomerID': cust_id, 'Lookalikes': str(similars)}
    for cust_id, similars in lookalikes.items()
]
lookalikes_df = pd.DataFrame(lookalikes_list)
lookalikes_df.to_csv('Lookalike.csv', index=False)

print("CSV created.")

CSV created.
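
Since each list of lookalikes is stored as a string, a reader of the file has to parse it back into Python objects. A minimal sketch of one way to do that with `ast.literal_eval` follows. It uses an in-memory buffer and made-up IDs and scores instead of the real file, and assumes the scores were written as plain floats.

```python
import ast
import io
import pandas as pd

# Toy stand-in for Lookalike.csv (hypothetical IDs and scores)
lookalikes = {"C0001": [("C0107", 0.9981), ("C0137", 0.9975), ("C0191", 0.9962)]}
rows = [{"CustomerID": k, "Lookalikes": str(v)} for k, v in lookalikes.items()]
buffer = io.StringIO()
pd.DataFrame(rows).to_csv(buffer, index=False)
buffer.seek(0)

# Read it back and turn each string into a list of (cust_id, score) tuples
loaded = pd.read_csv(buffer)
loaded["Lookalikes"] = loaded["Lookalikes"].apply(ast.literal_eval)
restored = dict(zip(loaded["CustomerID"], loaded["Lookalikes"]))
print(restored["C0001"][0])
```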
