### Crafting Recommendations Based on User Interests with NLP Models


#### Determining the Challenges to Tackle
The vast amount of information available today leaves users with an overwhelming number of choices. As a result, recommendation systems have become essential tools that help users discover relevant products or content. Traditional methods often relied on keyword filtering or content-based filtering, but advances in Natural Language Processing (NLP) and transformer models have opened up new possibilities for building more sophisticated, context-aware recommendation systems.


#### Key Objectives
* Semantic Understanding: Utilize NLP models to extract meaningful semantic information from both user input and product descriptions.

* Context-Aware Recommendations: Build a recommendation system that understands the user's context and preferences in order to provide more accurate suggestions.

* Scalability: Design the system to be scalable, allowing for efficient handling of large datasets and a growing number of products.

#### Technical Framework
* Step 1: Use an embedding model to map every product description to a low-dimensional vector space, obtaining "description vectors."

* Step 2: Embed the query (the user input) in the same way.

* Step 3: Compute the cosine similarity between the query vector and each description vector, and retrieve the set of most similar descriptions.

* Step 4: Return the list of recommended products.
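
The four steps above can be sketched without any model at all. The snippet below is a minimal illustration that assumes the embeddings already exist: the three-dimensional vectors, the product names, and the `recommend` helper are all made up for the example and are not output from a real embedding model.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def recommend(query_vector, description_vectors, top_k=2):
    """Rank products by similarity to the query and keep the top_k."""
    scored = [
        (name, cosine_similarity(query_vector, vector))
        for name, vector in description_vectors.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]

# Step 1: hypothetical "description vectors" (toy values, not model output)
description_vectors = {
    "backpack": [0.9, 0.1, 0.0],
    "earbuds": [0.1, 0.9, 0.1],
    "bed sheets": [0.0, 0.2, 0.9],
}

# Step 2: hypothetical query vector for an outdoor-related request
query_vector = [0.8, 0.2, 0.1]

# Steps 3 and 4: score every description and return the best matches
top_matches = recommend(query_vector, description_vectors, top_k=2)
for name, score in top_matches:
    print(f"{name}: {score:.4f}")
```

In a real system the toy vectors would be replaced by the output of an embedding model, but the scoring and ranking logic would stay the same.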



<img src="img/nlp.png" alt="nlp" style="width:100%;"/>

### Why embeddings?
An "embedding" is a numerical representation of words, phrases, or sentences in a continuous vector space. Comparing embeddings is fast and inexpensive compared with other methods.

Why not just use full-text search such as Lucene? Because text carries more information than vocabulary alone, including grammar, semantics, emotions, themes, and context. For instance, "nuts" and "almonds", or "outdoor" and "bike", are closely related in meaning even though the words themselves do not match.
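
To make the limitation concrete, here is a deliberately crude stand-in for keyword matching. It is not how Lucene actually scores documents (a real engine adds analyzers, stemming, and ranking functions); the `keyword_overlap` helper and the sample phrases are assumptions made for illustration only.

```python
def keyword_overlap(query, document):
    """Jaccard overlap between the lowercase word sets of two texts."""
    query_words = set(query.lower().split())
    document_words = set(document.lower().split())
    if not query_words or not document_words:
        return 0.0
    shared = query_words & document_words
    combined = query_words | document_words
    return len(shared) / len(combined)

# Related in meaning, but no shared words, so the lexical score is zero
nuts_vs_almonds = keyword_overlap("nuts", "roasted almonds")
outdoor_vs_bike = keyword_overlap("outdoor gear", "mountain bike")

# Sharing a literal word is the only way to score above zero
bike_vs_bike = keyword_overlap("bike", "mountain bike")

print(nuts_vs_almonds, outdoor_vs_bike, bike_vs_bike)
```

A purely lexical score cannot see that almonds are nuts. An embedding model is meant to close exactly this gap by placing related phrases near each other in the vector space.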

### Let's start with a pre-trained NLP model from SentenceTransformers
Based on my research, the all-mpnet-base-v2 model (https://huggingface.co/sentence-transformers/all-mpnet-base-v2) is rated as one of the top performers for semantic search.

In [2]:
from sentence_transformers import SentenceTransformer, util

# Load a pre-trained sentence embedding model
embed_model = SentenceTransformer('all-mpnet-base-v2')

In [8]:
# Example sentences to compare
examples = [
    "what is the Capital City of USA?",
    "Washington D.C.",
    "Beijing",
    "Seattle",
    "Shenzhen",
    "Climate change poses significant challenges to global ecosystems.",
    "How is the weather today?"
]
# Compute embeddings for the sentences
sentence_embeddings = embed_model.encode(examples, convert_to_tensor=True)

# Calculate cosine similarity between pairs of sentences
cosine_similarity_matrix = util.pytorch_cos_sim(sentence_embeddings, sentence_embeddings)

# Print the cosine similarity matrix
print("Cosine Similarity Matrix:")
print(cosine_similarity_matrix)

Cosine Similarity Matrix:
tensor([[ 1.0000,  0.5814,  0.4435,  0.4848,  0.3636, -0.0497,  0.1315],
        [ 0.5814,  1.0000,  0.4952,  0.6263,  0.4216, -0.0394,  0.1140],
        [ 0.4435,  0.4952,  1.0000,  0.4947,  0.7890, -0.0284,  0.1381],
        [ 0.4848,  0.6263,  0.4947,  1.0000,  0.4441,  0.0206,  0.2155],
        [ 0.3636,  0.4216,  0.7890,  0.4441,  1.0000, -0.0332,  0.0863],
        [-0.0497, -0.0394, -0.0284,  0.0206, -0.0332,  1.0000,  0.1699],
        [ 0.1315,  0.1140,  0.1381,  0.2155,  0.0863,  0.1699,  1.0000]])


From the similarity matrix above, you can see that the sentence "what is the Capital City of USA?" is most similar to "Washington D.C.", with a score of 0.5814. "Seattle" comes second with a score of 0.4848 because it is a city in the USA. "Beijing" is third with a score of 0.4435 since it is also a capital city, but not of the USA. "Shenzhen" comes next with a score of 0.3636 since it is a city, but neither a capital nor in the USA. The last two sentences are only weakly similar to "what is the Capital City of USA?" (-0.0497 and 0.1315) because they contain no relevant information.
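
The ranking described above can be read off the first row of the matrix programmatically. The sketch below simply copies that row from the printed output instead of recomputing it with the model, so it runs on its own; the variable names `query_row` and `ranked` are introduced here for illustration.

```python
examples = [
    "what is the Capital City of USA?",
    "Washington D.C.",
    "Beijing",
    "Seattle",
    "Shenzhen",
    "Climate change poses significant challenges to global ecosystems.",
    "How is the weather today?",
]

# First row of the similarity matrix printed above (query vs. every sentence)
query_row = [1.0000, 0.5814, 0.4435, 0.4848, 0.3636, -0.0497, 0.1315]

# Skip index 0 (the query compared with itself) and sort by score
ranked = sorted(
    zip(examples[1:], query_row[1:]),
    key=lambda pair: pair[1],
    reverse=True,
)

for sentence, score in ranked:
    print(f"{score:7.4f}  {sentence}")
```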

In [10]:
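import nltk
# Note: nltk.sent_tokenize (used below) needs the Punkt tokenizer data,
# e.g. nltk.download("punkt") (newer NLTK releases name it "punkt_tab")
# Example product descriptions (long paragraphs)
## These product descriptions were generated by ChatGPT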
## Product1: QuantumSound Wireless Earbuds
## Product2: EcoSmart Smart Thermostat
## Product3: AdventureTech Waterproof Backpack
## Product4: LuxeLiving Organic Cotton Bed Sheets
## Product5: SmartChef Multifunctional Cooking Appliance
product_descriptions = [
    "Immerse yourself in superior audio quality with our QuantumSound Wireless Earbuds. These sleek earbuds feature advanced noise-canceling technology, ensuring a crystal-clear sound experience. With touch controls and a long-lasting battery life, these earbuds are perfect for music enthusiasts on the go.",
    "Take control of your home's climate with the EcoSmart Smart Thermostat. This intelligent thermostat learns your preferences over time and adjusts heating and cooling to optimize energy efficiency. With a user-friendly interface and compatibility with smart home ecosystems, it's a sustainable and convenient choice for modern homes.",
    "Gear up for your next outdoor adventure with the AdventureTech Waterproof Backpack. This rugged backpack is designed for durability and functionality, featuring multiple compartments, waterproof zippers, and padded shoulder straps. Whether you're hiking, camping, or commuting, this backpack ensures your essentials stay safe and dry.",
    "Elevate your sleep experience with LuxeLiving Organic Cotton Bed Sheets. Crafted from 100% organic cotton, these luxurious sheets offer a silky-smooth feel and breathability. With a sateen weave and timeless design, they add a touch of elegance to any bedroom while promoting a comfortable and eco-friendly night's rest.",
    "Transform your kitchen with the SmartChef Multifunctional Cooking Appliance. This versatile device combines a pressure cooker, slow cooker, air fryer, and more into a single compact unit. With smart connectivity and an intuitive control panel, it simplifies meal preparation, letting you create delicious and healthy dishes with ease."
]

# User input (interests)
user_interests = "Looking for a product for outdoor activities."

# Text cleaning and sentence tokenization
cleaned_descriptions = []
for description in product_descriptions:
    # Example: Lowercasing and removing punctuation
    cleaned_description = description.lower().replace(".", "").replace(",", "")
    # Add more cleaning steps as needed
    
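    # Tokenize the cleaned text into sentences
    # (the periods were stripped above, so this normally yields one sentence)
    sentences = nltk.sent_tokenize(cleaned_description)
    
    # Encode the sentence(s), then average them into one vector per description
    # so the similarity computed below is always a single number
    sentence_embeddings = embed_model.encode(sentences, convert_to_tensor=True).mean(dim=0)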
    
    
    # Store the cleaned and encoded description
    cleaned_descriptions.append({
        "original_description": description,
        "cleaned_description": cleaned_description,
        "document_embedding": sentence_embeddings
    })

# Encode user interests to obtain an embedding
user_interest_embedding = embed_model.encode(user_interests, convert_to_tensor=True)

# Calculate cosine similarity between user interests and product embeddings
similarities = [util.pytorch_cos_sim(user_interest_embedding, desc["document_embedding"]).item() for desc in cleaned_descriptions]

# Combine product descriptions with similarity scores
recommendations = list(zip(product_descriptions, similarities))

# Sort recommendations by similarity score (higher score means more similar)
recommendations.sort(key=lambda x: x[1], reverse=True)

# Display top recommendations
for product, similarity in recommendations:
    print(f"Product: {product} | Similarity: {similarity:.4f}")


Product: Gear up for your next outdoor adventure with the AdventureTech Waterproof Backpack. This rugged backpack is designed for durability and functionality, featuring multiple compartments, waterproof zippers, and padded shoulder straps. Whether you're hiking, camping, or commuting, this backpack ensures your essentials stay safe and dry. | Similarity: 0.4568
Product: Immerse yourself in superior audio quality with our QuantumSound Wireless Earbuds. These sleek earbuds feature advanced noise-canceling technology, ensuring a crystal-clear sound experience. With touch controls and a long-lasting battery life, these earbuds are perfect for music enthusiasts on the go. | Similarity: 0.2009
Product: Transform your kitchen with the SmartChef Multifunctional Cooking Appliance. This versatile device combines a pressure cooker, slow cooker, air fryer, and more into a single compact unit. With smart connectivity and an intuitive control panel, it simplifies meal preparation, letting you creat

You can see that the "AdventureTech Waterproof Backpack" has the highest similarity score (0.4568) when the user asks for a product for outdoor activities!
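
When a description is split into several sentences, there is more than one way to turn the per-sentence embeddings into a single product score. The sketch below compares two simple choices: taking the best-matching sentence (max) or averaging the sentence scores (mean). The two-dimensional vectors and the `product_score` helper are hypothetical and chosen only to make the difference visible; which reduction works better for real product data is an open question that would need testing.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def product_score(query_vector, sentence_vectors, reduce="max"):
    """Collapse per-sentence similarities into one score for a product."""
    scores = [cosine_similarity(query_vector, v) for v in sentence_vectors]
    if reduce == "max":
        return max(scores)
    if reduce == "mean":
        return sum(scores) / len(scores)
    raise ValueError(f"unknown reduce mode: {reduce}")

# Toy data: one sentence matches the query exactly, the other is unrelated
query_vector = [1.0, 0.0]
sentence_vectors = [[1.0, 0.0], [0.0, 1.0]]

best = product_score(query_vector, sentence_vectors, reduce="max")
average = product_score(query_vector, sentence_vectors, reduce="mean")
print(f"max: {best:.2f}, mean: {average:.2f}")
```

With max, a single highly relevant sentence is enough to surface a product. With mean, unrelated sentences in the same description pull the score down.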