# A lookalike model that takes a customer's information as input and recommends the 3 most similar customers based on profile and transaction history.

Cosine similarity compares the direction of two customers' feature vectors rather than their magnitude. If two customers share an interest in similar product categories, it captures that shared interest and considers them similar, even if the total monetary value differs.
In summary, cosine similarity helps us find customers who have similar behaviors, preferences, and purchasing patterns, largely independent of the overall scale of their spending or purchase frequency. This makes it well suited to building a recommendation system or identifying lookalike customers.
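To make the scale-invariance point concrete, here is a minimal sketch on made-up data. The three spending vectors and the `cosine` helper are assumptions for illustration only; they are not part of the dataset or of the notebook's pipeline.

```python
import numpy as np

def cosine(a, b):
    """Cosine of the angle between vectors a and b."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical spend per product category: [electronics, books, clothing]
customer_a = [100.0, 50.0, 0.0]
customer_b = [1000.0, 500.0, 0.0]  # same category mix as A, ten times the spend
customer_c = [0.0, 50.0, 100.0]    # different category mix

sim_ab = cosine(customer_a, customer_b)
sim_ac = cosine(customer_a, customer_c)

print(f"A vs B: {sim_ab:.4f}")  # same direction, so the score is 1.0000
print(f"A vs C: {sim_ac:.4f}")  # mostly different categories, so the score is 0.2000
```

This invariance holds exactly for raw vectors. Once features are mean-centered by a scaler, the origin shifts, so the scores then reflect how customers deviate from the average customer rather than their raw proportions.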

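One-hot encoding is applied to the Region column to convert the categorical values (e.g., regions like "North", "South") into binary columns (e.g., Region_South, Region_West, etc.), allowing them to be used in machine learning algorithms. Because drop_first=True is used, the first region (in sorted order) gets no column of its own and is represented by zeros in all the remaining Region columns.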
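As a small illustration of what this encoding produces, the sketch below runs `pd.get_dummies` on a toy frame. The customer IDs and region names are hypothetical, and `dtype=int` is used here only to make the 0/1 values easy to read.

```python
import pandas as pd

# Hypothetical customers; IDs and region names are made up for illustration.
toy = pd.DataFrame({
    "CustomerID": ["C1", "C2", "C3", "C4"],
    "Region": ["North", "South", "West", "South"],
})

# drop_first=True drops the first category in sorted order ("North" here),
# so a North customer is encoded as all zeros in the Region columns.
encoded = pd.get_dummies(toy, columns=["Region"], drop_first=True, dtype=int)
print(encoded)
```

Customer C1 (North) ends up with 0 in both Region_South and Region_West, which is how the dropped baseline category is represented.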

Scaling Numerical Features: The numerical features (total_spent, avg_transaction_value, num_transactions, unique_products_bought) are scaled using StandardScaler, which standardizes each feature to a mean of 0 and a standard deviation of 1. This matters for similarity calculations because it prevents features with larger raw values (such as total_spent) from dominating the similarity metric.
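The effect of standardization can be checked on a toy array. The numbers below are invented and stand in for two features on very different scales (a spend-like column and a count-like column).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical rows: [total_spent, num_transactions]
raw = np.array([
    [100.0, 1.0],
    [200.0, 2.0],
    [300.0, 3.0],
    [4000.0, 4.0],
])

scaled = StandardScaler().fit_transform(raw)

# After scaling, each column has mean ~0 and (population) standard deviation ~1
means = scaled.mean(axis=0)
stds = scaled.std(axis=0)
print(np.round(means, 6), np.round(stds, 6))
```

Both columns now live on a comparable scale, so neither one dominates a distance or similarity computation purely because of its units.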

The customer data is transformed into a matrix (customer_matrix), where each row corresponds to a customer, and the columns are their purchase behaviors and region information. Non-numerical columns (like CustomerName, SignupDate) are dropped as they are not useful for similarity analysis.

The cosine similarity between each pair of customers is calculated using cosine_similarity() from sklearn. This similarity metric compares customers based on their purchase behaviors and regions. A score close to 1 indicates that the customers are similar, and a score close to 0 indicates little similarity. Because the standardized features can be negative, scores can also fall below 0, down to -1, which indicates customers whose profiles point in opposite directions.
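A minimal sketch of how `cosine_similarity` behaves on a small matrix, including the negative scores that can appear once features are standardized. The four feature rows are hypothetical and chosen so the results are easy to verify by hand.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical standardized feature rows (values may be negative after scaling)
features = np.array([
    [1.0, 2.0],    # customer 0
    [2.0, 4.0],    # customer 1: same direction as customer 0
    [-1.0, -2.0],  # customer 2: opposite direction to customer 0
    [2.0, -1.0],   # customer 3: orthogonal to customer 0
])

# Pairwise similarity: entry [i, j] compares customer i with customer j
sim = cosine_similarity(features)
print(np.round(sim, 2))
```

The result is a symmetric matrix with 1 on the diagonal. Customer 0 scores 1 against customer 1, -1 against customer 2, and 0 against customer 3.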

import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.metrics.pairwise import cosine_similarity

customers_df = pd.read_csv("Customers.csv")
products_df = pd.read_csv("Products.csv")
transactions_df = pd.read_csv("Transactions.csv")

# Merge Transactions with Product Details 
transactions_df = transactions_df.merge(products_df, on="ProductID", how="left")

# Aggregate transaction data for each customer
customer_transactions = transactions_df.groupby("CustomerID").agg(
    total_spent=("TotalValue", "sum"),  # Total spending per customer
    avg_transaction_value=("TotalValue", "mean"),  # Average transaction value
    num_transactions=("TransactionID", "count"),  # Number of transactions
    unique_products_bought=("ProductID", "nunique")  # Number of unique products bought
).reset_index()

# Merge Customer Data with Transaction Features
customer_data = customers_df.merge(customer_transactions, on="CustomerID", how="left")

# Fill missing values (Customers with no transactions get 0s)
customer_data.fillna(0, inplace=True)

# Encode Categorical Variables (One-hot encoding for Region)
customer_data = pd.get_dummies(customer_data, columns=["Region"], drop_first=True)

# Feature scaling: standardize the numerical features (mean 0, standard deviation 1)
numeric_features = ["total_spent", "avg_transaction_value", "num_transactions", "unique_products_bought"]
scaler = StandardScaler()
customer_data[numeric_features] = scaler.fit_transform(customer_data[numeric_features])


# Remove Non-Numerical Columns
customer_matrix = customer_data.set_index("CustomerID").drop(columns=["CustomerName", "SignupDate"])

# Compute Cosine Similarity Between Customers
similarity_matrix = cosine_similarity(customer_matrix)

# Convert Similarity Matrix to DataFrame
similarity_df = pd.DataFrame(similarity_matrix, index=customer_data["CustomerID"], columns=customer_data["CustomerID"])

# Get the first 20 customers
top_20_customers = customers_df["CustomerID"][:20]

# To store top 3 similar customers for each customer
lookalike_dict = {}

for cust_id in top_20_customers:
    similar_customers = similarity_df[cust_id].drop(index=cust_id).nlargest(3)
    lookalike_dict[cust_id] = list(zip(similar_customers.index, similar_customers.values))
    
# Print each customer ID with its lookalikes and their similarity scores
for cust_id, lookalikes in lookalike_dict.items():
    print(f"Customer ID {cust_id} has the following lookalikes:")
    for similar_customer, similarity_score in lookalikes:
        print(f"\tLookalike: {similar_customer} with Similarity Score: {similarity_score:.4f}")
    print()


lookalike_df = pd.DataFrame(lookalike_dict.items(), columns=["CustomerID", "Lookalikes"])


lookalike_csv_path = "Lookalike.csv"
lookalike_df.to_csv(lookalike_csv_path, index=False)

# Compute Mean Similarity Score of Recommended Lookalikes
mean_similarity_score = np.mean([score for pairs in lookalike_dict.values() for _, score in pairs])

# Results
print("Lookalike Recommendations Saved as Lookalike.csv")
print(f"Mean Similarity Score of Recommended Lookalikes: {mean_similarity_score:.4f}")


Customer ID C0001 has the following lookalikes:
	Lookalike: C0137 with Similarity Score: 0.9998
	Lookalike: C0152 with Similarity Score: 0.9995
	Lookalike: C0107 with Similarity Score: 0.9654

Customer ID C0002 has the following lookalikes:
	Lookalike: C0043 with Similarity Score: 0.9870
	Lookalike: C0142 with Similarity Score: 0.9769
	Lookalike: C0097 with Similarity Score: 0.9602

Customer ID C0003 has the following lookalikes:
	Lookalike: C0133 with Similarity Score: 0.9886
	Lookalike: C0052 with Similarity Score: 0.9427
	Lookalike: C0112 with Similarity Score: 0.9358

Customer ID C0004 has the following lookalikes:
	Lookalike: C0108 with Similarity Score: 0.9864
	Lookalike: C0113 with Similarity Score: 0.9743
	Lookalike: C0155 with Similarity Score: 0.9611

Customer ID C0005 has the following lookalikes:
	Lookalike: C0159 with Similarity Score: 0.9993
	Lookalike: C0123 with Similarity Score: 0.9986
	Lookalike: C0178 with Similarity Score: 0.9986

Customer ID C0006 has the following

# The mean cosine similarity of the recommended lookalikes is 0.9662. This is an average similarity score, not a predictive accuracy.
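The per-customer lookup used in the loop could also be wrapped in a small reusable function, so that any single customer ID can be queried directly. This is only a sketch: `top_lookalikes` is a hypothetical helper name, and the 4x4 similarity table is made-up data standing in for the real similarity DataFrame.

```python
import pandas as pd

def top_lookalikes(similarity_df, customer_id, n=3):
    """Return the n most similar customers to customer_id as (id, score) pairs."""
    if customer_id not in similarity_df.index:
        raise KeyError(f"Unknown customer: {customer_id}")
    # Drop the customer's own entry (self-similarity is always 1)
    scores = similarity_df.loc[customer_id].drop(labels=customer_id)
    top = scores.nlargest(n)
    return list(zip(top.index, top.values))

# Hypothetical similarity table for four customers
ids = ["C1", "C2", "C3", "C4"]
toy_similarity = pd.DataFrame(
    [[1.0, 0.9, 0.2, 0.5],
     [0.9, 1.0, 0.3, 0.4],
     [0.2, 0.3, 1.0, 0.8],
     [0.5, 0.4, 0.8, 1.0]],
    index=ids, columns=ids,
)

result = top_lookalikes(toy_similarity, "C1", n=2)
print(result)  # C2 first (0.9), then C4 (0.5)
```

Raising a `KeyError` for unknown IDs is one possible design choice; returning an empty list would work equally well if missing customers are expected.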