# Task 2: Lookalike Model

To create a lookalike model, I first combined customer and transaction data. This included customer profiles from Customers.csv and their purchasing behavior from Transactions.csv. Product information was also incorporated to enrich the data.

In [2]:
import pandas as pd

# Loading the datasets
customers = pd.read_csv('Customers.csv')
transactions = pd.read_csv('Transactions.csv')
products = pd.read_csv('Products.csv')

In [3]:
print(products.head())


  ProductID              ProductName     Category   Price
0      P001     ActiveWear Biography        Books  169.30
1      P002    ActiveWear Smartwatch  Electronics  346.30
2      P003  ComfortLiving Biography        Books   44.12
3      P004            BookWorld Rug   Home Decor   95.69
4      P005          TechPro T-Shirt     Clothing  429.31


In [4]:
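# Merge transactions with product details, then with customer profiles.
# Transactions.csv already has a Price column, so pandas suffixes the pair as
# Price_x (from transactions) and Price_y (from products).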
merged_data = pd.merge(transactions, products, on='ProductID', how='left')
merged_data = pd.merge(merged_data, customers, on='CustomerID', how='left')

print(merged_data.head())


  TransactionID CustomerID ProductID      TransactionDate  Quantity  \
0        T00001      C0199      P067  2024-08-25 12:38:23         1   
1        T00112      C0146      P067  2024-05-27 22:23:54         1   
2        T00166      C0127      P067  2024-04-25 07:38:55         1   
3        T00272      C0087      P067  2024-03-26 22:55:37         2   
4        T00363      C0070      P067  2024-03-21 15:10:10         3   

   TotalValue  Price_x                      ProductName     Category  Price_y  \
0      300.68   300.68  ComfortLiving Bluetooth Speaker  Electronics   300.68   
1      300.68   300.68  ComfortLiving Bluetooth Speaker  Electronics   300.68   
2      300.68   300.68  ComfortLiving Bluetooth Speaker  Electronics   300.68   
3      601.36   300.68  ComfortLiving Bluetooth Speaker  Electronics   300.68   
4      902.04   300.68  ComfortLiving Bluetooth Speaker  Electronics   300.68   

      CustomerName         Region  SignupDate  
0   Andrea Jenkins         Europe  202

In [6]:
# Renaming Price_y to Price for clarity
merged_data = merged_data.rename(columns={'Price_y': 'Price'})


print(merged_data.head())


  TransactionID CustomerID ProductID      TransactionDate  Quantity  \
0        T00001      C0199      P067  2024-08-25 12:38:23         1   
1        T00112      C0146      P067  2024-05-27 22:23:54         1   
2        T00166      C0127      P067  2024-04-25 07:38:55         1   
3        T00272      C0087      P067  2024-03-26 22:55:37         2   
4        T00363      C0070      P067  2024-03-21 15:10:10         3   

   TotalValue  Price_x                      ProductName     Category   Price  \
0      300.68   300.68  ComfortLiving Bluetooth Speaker  Electronics  300.68   
1      300.68   300.68  ComfortLiving Bluetooth Speaker  Electronics  300.68   
2      300.68   300.68  ComfortLiving Bluetooth Speaker  Electronics  300.68   
3      601.36   300.68  ComfortLiving Bluetooth Speaker  Electronics  300.68   
4      902.04   300.68  ComfortLiving Bluetooth Speaker  Electronics  300.68   

      CustomerName         Region  SignupDate  
0   Andrea Jenkins         Europe  2022-12-0
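
The Price_x and Price_y columns appear because both the transactions and the products tables carry a Price column, and pandas adds suffixes to overlapping names. As a minimal sketch on made-up rows (the `toy_` frames and their values are illustrative, not taken from the datasets), dropping the transaction-side column before merging would give a single Price column with no rename step:

```python
import pandas as pd

# Toy stand-ins for Transactions.csv and Products.csv (illustrative values only)
toy_transactions = pd.DataFrame({
    'TransactionID': ['T1', 'T2'],
    'ProductID': ['P1', 'P2'],
    'Quantity': [1, 2],
    'Price': [10.0, 20.0],
})
toy_products = pd.DataFrame({
    'ProductID': ['P1', 'P2'],
    'Category': ['Books', 'Clothing'],
    'Price': [10.0, 20.0],
})

# Merging as-is duplicates the overlapping column with _x/_y suffixes
suffixed = pd.merge(toy_transactions, toy_products, on='ProductID', how='left')

# Dropping the transaction-side Price first leaves one unsuffixed column
single_price = pd.merge(
    toy_transactions.drop(columns='Price'), toy_products,
    on='ProductID', how='left',
)

print(list(suffixed.columns))
print(list(single_price.columns))
```

Either route works here because the two price columns hold the same values in the output shown above; if they could differ, which one to keep would need to be decided explicitly.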

Now use the updated merged_data to calculate the aggregated features.

In [7]:
# Aggregate transactional data for each customer
customer_features = merged_data.groupby('CustomerID').agg({
    'Quantity': 'sum',         # Total quantity purchased
    'TotalValue': 'sum',       # Total spending
    'Price': 'mean'            # Average price of purchased products
}).reset_index()

# Merge back with customer profile data
customer_data = pd.merge(customers, customer_features, on='CustomerID')

# Display the final customer_data
print(customer_data.head())


  CustomerID        CustomerName         Region  SignupDate  Quantity  \
0      C0001    Lawrence Carroll  South America  2022-07-10        12   
1      C0002      Elizabeth Lutz           Asia  2022-02-13        10   
2      C0003      Michael Rivera  South America  2024-03-07        14   
3      C0004  Kathleen Rodriguez  South America  2022-10-09        23   
4      C0005         Laura Weber           Asia  2022-08-15         7   

   TotalValue       Price  
0     3354.52  278.334000  
1     1862.74  208.920000  
2     2725.38  195.707500  
3     5354.88  240.636250  
4     2034.24  291.603333  


Explanation:
I started by combining transactional and customer data to create a dataset that adds three per-customer aggregates to each customer's profile fields:

1. Total quantity of items purchased (Quantity).
2. Total spending (TotalValue).
3. Average price of products purchased (Price).

This dataset summarizes each customer's purchasing behavior alongside their profile, and the three aggregates serve as the inputs to the similarity calculation.
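
One thing worth checking at this step: pd.merge defaults to an inner join, so a customer with no transactions would silently drop out of customer_data. The sketch below, on made-up data (the `toy_` frames are hypothetical), shows one way to surface such customers using a left join with `indicator=True`:

```python
import pandas as pd

# Toy stand-ins: customer C3 has no transactions (illustrative values only)
toy_customers = pd.DataFrame({
    'CustomerID': ['C1', 'C2', 'C3'],
    'Region': ['Asia', 'Europe', 'Asia'],
})
toy_transactions = pd.DataFrame({
    'CustomerID': ['C1', 'C1', 'C2'],
    'Quantity': [1, 2, 3],
    'TotalValue': [10.0, 40.0, 90.0],
    'Price': [10.0, 20.0, 30.0],
})

# Same aggregates as above, written with named aggregation
toy_features = toy_transactions.groupby('CustomerID').agg(
    Quantity=('Quantity', 'sum'),
    TotalValue=('TotalValue', 'sum'),
    Price=('Price', 'mean'),
).reset_index()

# A left join keeps every customer; _merge marks rows with no feature match
checked = pd.merge(toy_customers, toy_features, on='CustomerID',
                   how='left', indicator=True)
missing = checked.loc[checked['_merge'] == 'left_only', 'CustomerID'].tolist()
print(missing)
```

Customers reported in `missing` have no aggregates and therefore no row in the similarity matrix, so any later lookup by their ID would fail.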

## Compute Similarity Scores

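I used the cosine similarity metric to compare customers based on their aggregated transactional features: total quantity, total spending, and average product price. This helps identify customers with similar spending patterns and purchasing behavior. Profile fields such as Region and SignupDate are not part of the feature set.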

In [8]:
from sklearn.preprocessing import StandardScaler
from sklearn.metrics.pairwise import cosine_similarity

# Select features for similarity calculation
features = ['Quantity', 'TotalValue', 'Price']
scaler = StandardScaler()
scaled_features = scaler.fit_transform(customer_data[features])

# Compute cosine similarity
similarity_matrix = cosine_similarity(scaled_features)

# Convert similarity matrix to a DataFrame
similarity_df = pd.DataFrame(similarity_matrix, index=customer_data['CustomerID'], columns=customer_data['CustomerID'])

# Display similarity matrix for reference
print(similarity_df.head())


CustomerID     C0001     C0002     C0003     C0004     C0005     C0006  \
CustomerID                                                               
C0001       1.000000  0.104513 -0.524923 -0.925208  0.909351  0.442395   
C0002       0.104513  1.000000  0.791531 -0.464035  0.506433 -0.844066   
C0003      -0.524923  0.791531  1.000000  0.172432 -0.124725 -0.994780   
C0004      -0.925208 -0.464035  0.172432  1.000000 -0.990272 -0.083333   
C0005       0.909351  0.506433 -0.124725 -0.990272  1.000000  0.029596   

CustomerID     C0007     C0008     C0009     C0010  ...     C0191     C0192  \
CustomerID                                          ...                       
C0001       0.957854 -0.980620  0.885035 -0.268370  ...  0.953552  0.875392   
C0002      -0.126391 -0.208586  0.552510  0.929885  ...  0.366172  0.561020   
C0003      -0.694381  0.426063 -0.070251  0.960431  ... -0.270712 -0.056387   
C0004      -0.786871  0.960972 -0.985116 -0.108724  ... -0.969254 -0.975266   
C0005  

Explanation:
To calculate similarity between customers:

1. I selected Quantity, TotalValue, and Price as the key features.
2. These features were scaled using StandardScaler (zero mean and unit variance per feature) so that no single feature dominates because of its magnitude.
3. I applied the cosine_similarity function to measure the similarity between customers in the feature space.
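
To make the three steps concrete, here is a minimal NumPy sketch of the same computation on a made-up feature matrix (the values in `X` are illustrative only). It standardizes each column with the population standard deviation, which is what StandardScaler uses, and then takes the dot product of the unit-length rows:

```python
import numpy as np

# Rows are customers; columns stand in for Quantity, TotalValue, Price
X = np.array([
    [1.0, 10.0, 5.0],
    [2.0, 20.0, 5.0],
    [3.0, 60.0, 20.0],
])

# Standardize each feature: subtract the column mean, divide by the column std
scaled = (X - X.mean(axis=0)) / X.std(axis=0)

# Cosine similarity: dot product of rows after scaling each row to unit length
unit_rows = scaled / np.linalg.norm(scaled, axis=1, keepdims=True)
similarity = unit_rows @ unit_rows.T

print(np.round(similarity, 3))
```

Because standardization centers every feature at zero, the scaled vectors can point in opposite directions, which is why the similarity matrix above contains negative scores as well as positive ones.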

## Find the Top 3 Similar Customers

For each customer, I identified the top 3 most similar customers based on the highest similarity scores (excluding the customer itself).

In [9]:
# Function to get top 3 similar customers
def get_top_similar_customers(cust_id, similarity_df):
    # Drop the customer's own entry by label rather than by sort position,
    # so another customer tied at 1.0 cannot take its place
    similar_customers = similarity_df.loc[cust_id].drop(cust_id).nlargest(3)
    return list(similar_customers.index), list(similar_customers.values)

# Generate Lookalike data for first 20 customers
lookalike_results = {}
for cust_id in customers['CustomerID'][:20]:
    similar_ids, scores = get_top_similar_customers(cust_id, similarity_df)
    lookalike_results[cust_id] = list(zip(similar_ids, scores))

# Display results
print(lookalike_results)


{'C0001': [('C0103', 0.9975729385618538), ('C0092', 0.9968787968825864), ('C0135', 0.9927364238882178)], 'C0002': [('C0029', 0.9998543931340029), ('C0077', 0.9961038168882547), ('C0157', 0.9954784900159904)], 'C0003': [('C0111', 0.9984874468302141), ('C0190', 0.9966561574371822), ('C0038', 0.9901332836738033)], 'C0004': [('C0165', 0.9983897071764074), ('C0162', 0.9980867096016258), ('C0075', 0.996932345616167)], 'C0005': [('C0167', 0.9999721868436701), ('C0020', 0.99971426883456), ('C0128', 0.9987615592886807)], 'C0006': [('C0168', 0.9976122332196319), ('C0196', 0.9950250564515252), ('C0187', 0.9947524750205508)], 'C0007': [('C0125', 0.9998486580402707), ('C0089', 0.99834375759003), ('C0085', 0.9960335186380587)], 'C0008': [('C0084', 0.9960866913262758), ('C0113', 0.9958170325568012), ('C0017', 0.9931732089853939)], 'C0009': [('C0130', 0.9999651017117012), ('C0128', 0.9985963548763069), ('C0192', 0.9985908489461927)], 'C0010': [('C0176', 0.9994511608148322), ('C0055', 0.993840552919188

Explanation:
For each customer:

1. I took the customer's row of similarity scores and ordered it from highest to lowest.
2. I excluded the customer itself, whose self-similarity is 1.0.
3. I selected the 3 remaining customers with the highest scores.
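
One subtlety in the steps above: two customers whose scaled feature vectors point in exactly the same direction also score 1.0 against each other, so the order among tied entries is not guaranteed to put the customer itself first. The sketch below uses a hand-written similarity table (the IDs and scores are made up, and `top_similar_by_label` is a hypothetical helper) to show exclusion by label:

```python
import pandas as pd

# Toy similarity table in which A and B are tied at 1.0 (illustrative values)
ids = ['A', 'B', 'C', 'D', 'E']
toy_similarity = pd.DataFrame(
    [[1.0, 1.0, 0.2, 0.5, -0.3],
     [1.0, 1.0, 0.2, 0.5, -0.3],
     [0.2, 0.2, 1.0, 0.1, 0.4],
     [0.5, 0.5, 0.1, 1.0, 0.0],
     [-0.3, -0.3, 0.4, 0.0, 1.0]],
    index=ids, columns=ids,
)

def top_similar_by_label(cust_id, similarity_df, n=3):
    # Remove the customer's own column entry by name, then keep the n largest
    scores = similarity_df.loc[cust_id].drop(cust_id)
    return scores.nlargest(n)

top_b = top_similar_by_label('B', toy_similarity)
print(top_b)
```

Here customer A is kept as B's closest lookalike even though it ties with B's own score.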

The results were saved in a CSV file, with the format: Map<cust_id, List<cust_id, score>>

In [11]:
# Convert lookalike results into a DataFrame
lookalike_df = pd.DataFrame([
    {'CustomerID': cust_id, 'Lookalikes': lookalikes}
    for cust_id, lookalikes in lookalike_results.items()
])

# Save to CSV
lookalike_df.to_csv('Doradla_PardhaSaradhiRaju_Lookalike.csv', index=False)

# Display saved DataFrame
print(lookalike_df.head())


  CustomerID                                         Lookalikes
0      C0001  [(C0103, 0.9975729385618538), (C0092, 0.996878...
1      C0002  [(C0029, 0.9998543931340029), (C0077, 0.996103...
2      C0003  [(C0111, 0.9984874468302141), (C0190, 0.996656...
3      C0004  [(C0165, 0.9983897071764074), (C0162, 0.998086...
4      C0005  [(C0167, 0.9999721868436701), (C0020, 0.999714...


Explanation:
The final lookalike results were saved in a structured format with two columns:

1. CustomerID: The ID of the customer for whom lookalikes are generated.
2. Lookalikes: A list of tuples, each containing the ID of a similar customer and their similarity score.

This file follows the FirstName_LastName_Lookalike.csv naming pattern and meets the deliverable requirements.
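
Note that a CSV cell can only hold text, so the Lookalikes column is written as the string form of each list. A sketch of reading such a file back into a dictionary, using an in-memory buffer and made-up results instead of the real file, and assuming the scores are plain Python floats:

```python
import ast
import io

import pandas as pd

# Toy results in the same Map<cust_id, List<cust_id, score>> shape (made up)
toy_results = {
    'C1': [('C2', 0.99), ('C3', 0.95)],
    'C2': [('C1', 0.99), ('C3', 0.90)],
}
toy_df = pd.DataFrame([
    {'CustomerID': cust_id, 'Lookalikes': lookalikes}
    for cust_id, lookalikes in toy_results.items()
])

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
toy_df.to_csv(buffer, index=False)
buffer.seek(0)

# After read_csv each Lookalikes cell is a string; literal_eval parses it
loaded = pd.read_csv(buffer)
loaded['Lookalikes'] = loaded['Lookalikes'].apply(ast.literal_eval)
restored = dict(zip(loaded['CustomerID'], loaded['Lookalikes']))
print(restored['C1'])
```

NumPy scalar types may be written with a different text representation depending on the NumPy version, so converting each score with `float()` before saving is a reasonable precaution if the file needs to be parsed this way.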

## Steps Followed

1. Data Preparation: Combined customer, transaction, and product data to create a comprehensive dataset with key features like total spending, quantity purchased, and average price.
2. Similarity Computation: Scaled the features and used cosine similarity to measure similarity between customers.
3. Top 3 Recommendations: For each customer, identified the top 3 most similar customers based on similarity scores.
4. Results: Generated and saved recommendations for the first 20 customers as a CSV file.