Part 1: Creating the Synthetic Dataset
1.1. Define the Dataset Structure
We'll create three main components:

Customers:

customer_id: Unique identifier.
age: Age of the customer.
gender: Gender (Male, Female, Other).
skin_type: Skin type (Oily, Dry, Combination, Sensitive).
skin_concerns: Primary skin concerns (e.g., Acne, Wrinkles).
Products:

product_id: Unique identifier.
product_name: Name of the product.
brand: Brand name.
category: Product category (Cleanser, Moisturizer, Serum, etc.).
ingredients: Key ingredients.
benefits: Primary benefits.
suitable_skin_types: Suitable skin types.
Interactions:

interaction_id: Unique identifier.
customer_id: Reference to the customer.
product_id: Reference to the product.
rating: Rating given by the customer (1 to 5).
review: Short review text.
1.2. Generate Synthetic Data
We'll use Python with the pandas, numpy, and faker libraries to generate this data. You can run the following script in a Jupyter Notebook or Google Colab.



In [1]:
#a. Install Necessary Libraries
#If you're using Google Colab, most libraries are pre-installed. Otherwise, install them using pip:
#!pip install pandas numpy faker

#b. Python Script to Generate Synthetic Data
import pandas as pd
import numpy as np
from faker import Faker
import random

# Initialize Faker
fake = Faker()

# Seed for reproducibility
Faker.seed(0)
np.random.seed(0)
random.seed(0)

# Define sample data
genders = ['Male', 'Female', 'Other']
skin_types = ['Oily', 'Dry', 'Combination', 'Sensitive']
skin_concerns_list = ['Acne', 'Wrinkles', 'Dark Spots', 'Dryness', 'Redness', 'Uneven Texture']
categories = ['Cleanser', 'Moisturizer', 'Serum', 'Sunscreen', 'Exfoliator', 'Mask']
brands = ['BrandA', 'BrandB', 'BrandC', 'BrandD', 'BrandE']
ingredients = ['Hyaluronic Acid', 'Salicylic Acid', 'Vitamin C', 'Retinol', 'Niacinamide', 'Glycolic Acid']
benefits = ['Hydration', 'Oil Control', 'Brightening', 'Anti-Aging', 'Exfoliation', 'Soothing']

# Generate Customers Data
num_customers = 1000
customers = []
for i in range(1, num_customers + 1):
    customer = {
        'customer_id': i,
        'age': np.random.randint(18, 65),
        'gender': random.choice(genders),
        'skin_type': random.choice(skin_types),
        'skin_concerns': ', '.join(random.sample(skin_concerns_list, k=random.randint(1,3)))
    }
    customers.append(customer)

customers_df = pd.DataFrame(customers)

# Generate Products Data
num_products = 200
products = []
for i in range(1, num_products + 1):
    product = {
        'product_id': i,
        'product_name': f"{random.choice(['Ultra', 'Hydra', 'Clear', 'Radiant', 'Pure'])} {random.choice(['Glow', 'Fresh', 'Smooth', 'Bright', 'Revive'])} {random.randint(100,999)}",
        'brand': random.choice(brands),
        'category': random.choice(categories),
        'ingredients': ', '.join(random.sample(ingredients, k=random.randint(2,4))),
        'benefits': ', '.join(random.sample(benefits, k=random.randint(1,3))),
        'suitable_skin_types': ', '.join(random.sample(skin_types, k=random.randint(1,4)))
    }
    products.append(product)

products_df = pd.DataFrame(products)

# Generate Interactions Data
interactions = []
interaction_id = 1
for _ in range(5000):  # 5 interactions per customer on average
    customer = random.choice(customers_df['customer_id'].tolist())
    product = random.choice(products_df['product_id'].tolist())
    rating = random.randint(1, 5)
    review = fake.sentence(nb_words=10)
    interaction = {
        'interaction_id': interaction_id,
        'customer_id': customer,
        'product_id': product,
        'rating': rating,
        'review': review
    }
    interactions.append(interaction)
    interaction_id += 1

interactions_df = pd.DataFrame(interactions)

# Save to CSV
customers_df.to_csv('customers.csv', index=False)
products_df.to_csv('products.csv', index=False)
interactions_df.to_csv('interactions.csv', index=False)

print("Synthetic dataset created and saved as CSV files.")




Synthetic dataset created and saved as CSV files.


c. Explanation of the Script
Libraries:

pandas: For data manipulation.
numpy: For numerical operations.
faker: To generate realistic fake data.
random: For random selections.
Data Definitions:

Define lists of possible values for genders, skin types, skin concerns, product categories, brands, ingredients, and benefits.
Customers Data:

Generate 1,000 customers with random ages (18-65), genders, skin types, and 1-3 skin concerns.
Products Data:

Generate 200 skincare products with random names, brands, categories, 2-4 ingredients, 1-3 benefits, and suitable skin types.
Interactions Data:

Simulate 5,000 interactions where customers rate and review products. Each interaction links a customer to a product with a rating (1-5) and a short review.
Saving Data:

Save the generated data into CSV files: customers.csv, products.csv, and interactions.csv.
d. Sample Data Preview
Customers (customers.csv):

customer_id	age	gender	skin_type	skin_concerns
1	25	Female	Combination	Acne, Wrinkles
2	34	Male	Oily	Dark Spots
...	...	...	...	...
Products (products.csv):

product_id	product_name	brand	category	ingredients	benefits	suitable_skin_types
1	Ultra Glow 123	BrandA	Cleanser	Hyaluronic Acid, Vitamin C	Hydration	Oily, Dry
2	Hydra Fresh 456	BrandB	Moisturizer	Salicylic Acid, Retinol, Niacinamide	Oil Control, Anti-Aging	Oily, Combination
...	...	...	...	...	...	...
Interactions (interactions.csv):

interaction_id	customer_id	product_id	rating	review
1	1	5	4	"This product works well."
2	2	3	5	"Loved the results after use."
...	...	...	...	...


Part 2: Building and Training the Neural Network Using PyTorch
We'll build a neural network using PyTorch to predict whether a customer will like a product (liked) based on their profile and product attributes. We'll follow these steps:

Data Preprocessing
Dataset and DataLoader Creation
Neural Network Definition
Training the Model
Evaluating the Model
Saving and Uploading the Model to Hugging Face
2.1. Setting Up the Environment
Ensure you have the necessary libraries installed. If you're using Google Colab, most are pre-installed. Otherwise, install them using pip:

In [None]:
pip install pandas numpy scikit-learn torch torchvision transformers


In [19]:
#2.2. Import Necessary Libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import classification_report, confusion_matrix
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader

#2.3. Load and Preprocess the Data
# Load datasets
customers = pd.read_csv('customers.csv')
products = pd.read_csv('products.csv')
interactions = pd.read_csv('interactions.csv')

# Merge interactions with customers and products
data = interactions.merge(customers, on='customer_id').merge(products, on='product_id')

# Create binary target variable
data['liked'] = data['rating'].apply(lambda x: 1 if x >= 4 else 0)

# Drop unnecessary columns
data = data.drop(['interaction_id', 'rating', 'review'], axis=1)

# Feature Engineering

# Initialize LabelEncoders
le_gender = LabelEncoder()
le_skin_type = LabelEncoder()
le_category = LabelEncoder()
le_brand = LabelEncoder()

# Encode categorical features
data['gender_encoded'] = le_gender.fit_transform(data['gender'])
data['skin_type_encoded'] = le_skin_type.fit_transform(data['skin_type'])
data['category_encoded'] = le_category.fit_transform(data['category'])
data['brand_encoded'] = le_brand.fit_transform(data['brand'])

# Feature Counts for multi-valued fields
data['skin_concerns_count'] = data['skin_concerns'].apply(lambda x: len(x.split(',')))
data['suitable_skin_types_count'] = data['suitable_skin_types'].apply(lambda x: len(x.split(',')))

# Text vectorization for ingredients and benefits
cv_ingredients = CountVectorizer(max_features=100)
cv_benefits = CountVectorizer(max_features=50)

ingredients_matrix = cv_ingredients.fit_transform(data['ingredients'])
benefits_matrix = cv_benefits.fit_transform(data['benefits'])

# Convert to DataFrame
ingredients_df = pd.DataFrame(ingredients_matrix.toarray(), columns=cv_ingredients.get_feature_names_out())
benefits_df = pd.DataFrame(benefits_matrix.toarray(), columns=cv_benefits.get_feature_names_out())

# Concatenate with original dataframe
data = pd.concat([data, ingredients_df, benefits_df], axis=1)

# Drop original text columns
data = data.drop(['gender', 'skin_type', 'skin_concerns', 'category', 'brand', 'suitable_skin_types', 'product_name', 'ingredients', 'benefits'], axis=1)

# Handle missing values
imputer = SimpleImputer(strategy='mean')
data = pd.DataFrame(imputer.fit_transform(data), columns=data.columns)

# Feature Scaling
scaler = StandardScaler()
numerical_features = ['age', 'skin_concerns_count', 'suitable_skin_types_count'] + list(ingredients_df.columns) + list(benefits_df.columns)
data[numerical_features] = scaler.fit_transform(data[numerical_features])

# Define feature columns and target
feature_columns = [col for col in data.columns if col != 'liked']

X = data[feature_columns].values
y = data['liked'].values

# Train-Test Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((4000, 24), (1000, 24), (4000,), (1000,))

In [4]:
# Display some train and test data
print("Sample of training data:")
print(X_train[:5])
print("\nCorresponding training labels:")
print(y_train[:5])

print("\nSample of test data:")
print(X_test[:5])
print("\nCorresponding test labels:")
print(y_test[:5])

print("\nFeature names:")
print(feature_columns)

print("\nShape of training data:", X_train.shape)
print("Shape of test data:", X_test.shape)



Sample of training data:
[[ 2.88000000e+02  7.20000000e+01 -7.73528100e-01  1.00000000e+00
   2.00000000e+00  1.21785578e+00  3.00000000e+00  1.00000000e+00
  -1.35375500e+00]
 [ 8.34000000e+02  1.94000000e+02  9.87492047e-01  0.00000000e+00
   0.00000000e+00  1.21785578e+00  5.00000000e+00  4.00000000e+00
   5.29599200e-01]
 [ 7.40000000e+01  1.15000000e+02 -6.26776421e-01  1.00000000e+00
   2.00000000e+00 -3.66456103e-03  0.00000000e+00  2.00000000e+00
  -1.35375500e+00]
 [ 5.93000000e+02  5.00000000e+00 -1.50728649e+00  2.00000000e+00
   0.00000000e+00 -1.22518490e+00  5.00000000e+00  3.00000000e+00
  -4.12077898e-01]
 [ 2.27000000e+02  1.76000000e+02 -2.59897223e-01  1.00000000e+00
   1.00000000e+00  1.21785578e+00  5.00000000e+00  3.00000000e+00
  -4.12077898e-01]]

Corresponding training labels:
[0 0 1 1 0]

Sample of test data:
[[ 2.08000000e+02  1.35000000e+02  1.42774708e+00  0.00000000e+00
   0.00000000e+00 -3.66456103e-03  0.00000000e+00  2.00000000e+00
   1.47127630e+00]
 [

In [21]:
#2.4. Create a Custom Dataset Class
class SkincareDataset(Dataset):
    def __init__(self, features, labels):
        self.X = torch.tensor(features, dtype=torch.float32)
        self.y = torch.tensor(labels, dtype=torch.float32).unsqueeze(1)  # For binary classification

    def __len__(self):
        return len(self.y)

    def __getitem__(self, idx):
        return self.X[idx], self.y[idx]

#Create DataLoaders

# Create Dataset objects
train_dataset = SkincareDataset(X_train, y_train)
test_dataset = SkincareDataset(X_test, y_test)

# Create DataLoaders
batch_size = 32
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=batch_size, shuffle=False)





In [22]:
# 2.5. Inspect DataLoader contents

print("Sample from train_loader:")
for batch_features, batch_labels in train_loader:
    print("Batch features shape:", batch_features.shape)
    print("Batch labels shape:", batch_labels.shape)
    print("First few features in batch:")
    print(batch_features[:2])
    print("Corresponding labels:")
    print(batch_labels[:2])
    break  # We only need to see one batch

print("\nSample from test_loader:")
for batch_features, batch_labels in test_loader:
    print("Batch features shape:", batch_features.shape)
    print("Batch labels shape:", batch_labels.shape)
    print("First few features in batch:")
    print(batch_features[:2])
    print("Corresponding labels:")
    print(batch_labels[:2])
    break  # We only need to see one batch


Sample from train_loader:
Batch features shape: torch.Size([32, 24])
Batch labels shape: torch.Size([32, 1])
First few features in batch:
tensor([[ 1.2700e+02,  3.6000e+01,  3.2711e-01,  1.0000e+00,  3.0000e+00,
          3.0000e+00,  0.0000e+00, -3.6646e-03,  1.4713e+00, -8.1959e-01,
         -1.0488e+00, -9.7745e-01,  1.0606e+00, -8.5201e-01,  9.0490e-01,
          1.0276e+00,  1.2719e+00,  1.2719e+00, -7.0912e-01, -7.1359e-01,
         -6.0816e-01, -6.2578e-01, -7.1359e-01, -7.0499e-01],
        [ 1.5600e+02,  1.4400e+02, -1.0670e+00,  0.0000e+00,  3.0000e+00,
          4.0000e+00,  2.0000e+00, -1.2252e+00, -1.3538e+00,  6.3824e-01,
         -1.0488e+00,  1.0231e+00, -9.4283e-01, -8.5201e-01,  9.0490e-01,
         -9.7316e-01,  1.2719e+00,  1.2719e+00, -7.0912e-01,  1.4014e+00,
         -6.0816e-01, -6.2578e-01,  1.4014e+00,  1.4185e+00]])
Corresponding labels:
tensor([[0.],
        [0.]])

Sample from test_loader:
Batch features shape: torch.Size([32, 24])
Batch labels shape: torch

In [32]:

#2.6. Define the Neural Network
class SkincareNN(nn.Module):
    def __init__(self, input_size):
        super(SkincareNN, self).__init__()
        self.fc1 = nn.Linear(input_size, 256)
        self.bn1 = nn.BatchNorm1d(256)
        self.fc2 = nn.Linear(256, 128)
        self.bn2 = nn.BatchNorm1d(128)
        self.fc3 = nn.Linear(128, 64)
        self.bn3 = nn.BatchNorm1d(64)
        self.fc4 = nn.Linear(64, 32)
        self.bn4 = nn.BatchNorm1d(32)
        self.fc5 = nn.Linear(32, 1)
        self.dropout = nn.Dropout(0.5)
        self.leaky_relu = nn.LeakyReLU(0.1)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        x = self.leaky_relu(self.bn1(self.fc1(x)))
        x = self.dropout(x)
        x = self.leaky_relu(self.bn2(self.fc2(x)))
        x = self.dropout(x)
        x = self.leaky_relu(self.bn3(self.fc3(x)))
        x = self.dropout(x)
        x = self.leaky_relu(self.bn4(self.fc4(x)))
        x = self.dropout(x)
        x = self.sigmoid(self.fc5(x))
        return x

#2.7. Initialize the Model, Loss Function, and Optimizer
input_size = X_train.shape[1]
model = SkincareNN(input_size)

# Define loss function
criterion = nn.BCEWithLogitsLoss()

# Define optimizer
learning_rate = 0.001
optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate, weight_decay=1e-4)

# Learning rate scheduler
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode='min', factor=0.2, patience=5, verbose=True)





In [33]:
#2.8. Training the Model
# Training Loop
num_epochs = 20

for epoch in range(num_epochs):
    model.train()
    running_loss = 0.0
    for inputs, labels in train_loader:
        # Zero the parameter gradients
        optimizer.zero_grad()
        
        # Forward pass
        outputs = model(inputs)
        loss = criterion(outputs, labels)
        
        # Backward pass and optimization
        loss.backward()
        optimizer.step()
        
        running_loss += loss.item() * inputs.size(0)
    
    epoch_loss = running_loss / len(train_loader.dataset)
    
    # Evaluation on Test Set
    model.eval()
    correct = 0
    total = 0
    with torch.no_grad():
        for inputs, labels in test_loader:
            outputs = model(inputs)
            predicted = (outputs >= 0.5).float()
            total += labels.size(0)
            correct += (predicted == labels).sum().item()
    
    accuracy = correct / total * 100
    print(f'Epoch [{epoch+1}/{num_epochs}], Loss: {epoch_loss:.4f}, Test Accuracy: {accuracy:.2f}%')




Epoch [1/20], Loss: 0.7462, Test Accuracy: 60.20%
Epoch [2/20], Loss: 0.7168, Test Accuracy: 60.20%
Epoch [3/20], Loss: 0.7050, Test Accuracy: 60.20%
Epoch [4/20], Loss: 0.7010, Test Accuracy: 60.20%
Epoch [5/20], Loss: 0.6985, Test Accuracy: 60.20%
Epoch [6/20], Loss: 0.6969, Test Accuracy: 60.20%
Epoch [7/20], Loss: 0.6959, Test Accuracy: 60.20%
Epoch [8/20], Loss: 0.6953, Test Accuracy: 60.20%
Epoch [9/20], Loss: 0.6948, Test Accuracy: 60.20%
Epoch [10/20], Loss: 0.6946, Test Accuracy: 60.20%
Epoch [11/20], Loss: 0.6940, Test Accuracy: 60.20%
Epoch [12/20], Loss: 0.6939, Test Accuracy: 60.20%
Epoch [13/20], Loss: 0.6941, Test Accuracy: 60.20%
Epoch [14/20], Loss: 0.6937, Test Accuracy: 60.20%
Epoch [15/20], Loss: 0.6938, Test Accuracy: 60.20%
Epoch [16/20], Loss: 0.6936, Test Accuracy: 60.20%
Epoch [17/20], Loss: 0.6937, Test Accuracy: 60.20%
Epoch [18/20], Loss: 0.6936, Test Accuracy: 60.20%
Epoch [19/20], Loss: 0.6935, Test Accuracy: 60.20%
Epoch [20/20], Loss: 0.6935, Test Accura

In [None]:
#2.9. Evaluate the Model
# Final Evaluation
model.eval()
y_pred = []
y_true = []

with torch.no_grad():
    for inputs, labels in test_loader:
        outputs = model(inputs)
        predicted = (outputs >= 0.5).float()
        y_pred.extend(predicted.squeeze().tolist())
        y_true.extend(labels.squeeze().tolist())

# Classification Report
print(classification_report(y_true, y_pred))

# Confusion Matrix
cm = confusion_matrix(y_true, y_pred)
print("Confusion Matrix:")
print(cm)


In [None]:
#2.10. Save and Upload the Model to Hugging Face
# Save the model
torch.save(model.state_dict(), 'skincare_nn.pth')