
# Smart Retail Navigator: Unifying RAG, LLM, and Annoy for Advanced Query Intelligence

This notebook presents an analysis built on a structured Retrieval-Augmented Generation (RAG) system tailored for the retail sector. The system combines data processing techniques and machine learning models to provide insights into retail operations, customer behavior, and sales performance. Through detailed examples and explanations, we aim to demonstrate how these AI technologies can support retail analytics and decision-making. The diagram below shows the overall architecture:

![Architecture](architecture.png)

The key aspects of the Smart Retail Navigator system architecture:

- **Data Layer**: Manages storage and access of retail data 

- **Retrieval-Augmented Generation (RAG)**: Retrieves relevant data and generates query responses by combining information retrieval and deep learning

- **Large Language Models (LLM)**: Understand queries and generate human-like responses after specialized fine-tuning 

- **Annoy**: Rapidly retrieves most relevant information for queries via similarity searches in vector spaces

- **Query Processor**: Coordinates overall workflow - query understanding by LLM, data retrieval via Annoy and RAG, and response generation

- **Analytics Module**: Transforms system outputs into business insights for data-driven decision making

The architecture combines these components so that retrieval stays fast as the product catalog grows while the LLM's responses remain grounded in retail data.



## Data Preparation and Exploration

In this section, we delve into the initial steps of our analysis: preparing and exploring the dataset. Our focus is on understanding the characteristics of the data, identifying patterns, and preparing it for further analysis. We'll cover data loading, cleaning, and basic exploratory data analysis (EDA) techniques that are crucial for any data science project.
    


## Predictive Modeling and Analysis

Following data preparation, we transition to the core of our analysis—predictive modeling. This section explores the creation and evaluation of models that predict future retail trends based on historical data. We'll discuss model selection, training, and validation, emphasizing the importance of accuracy and reliability in predictions.
    


## Insights and Conclusion

In the final section, we synthesize our findings into actionable insights. Drawing from the data exploration and predictive modeling phases, we outline key takeaways and recommend strategies for retail businesses to optimize their operations, enhance customer satisfaction, and boost sales. This comprehensive analysis demonstrates the transformative potential of AI in retail.
    

In [52]:

# !pip install transformers annoy


### Retrieval-Augmented Generation (RAG)

**Mathematical Principles:**
- **Retrieval:** The similarity between a query \(q\) and a document \(d\) is often computed using cosine similarity:
  \[ S(d, q) = \frac{d \cdot q}{\|d\| \|q\|} \]
- **Generation:** The probability of generating a word \(w_t\) given the context \(C\) and previous words \(w_{<t}\) is modeled as:
  \[ P(w_t | w_{<t}, C) = \text{softmax}(W_h h_t) \]
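The two formulas above can be tried out on small vectors. The following is a minimal sketch in plain Python, not the notebook's retrieval code; the vectors, the document names, and the logit values are made up for illustration.

```python
import math

def cosine_similarity(d, q):
    """S(d, q) = (d . q) / (||d|| ||q||) for two equal-length vectors."""
    dot = sum(x * y for x, y in zip(d, q))
    norm_d = math.sqrt(sum(x * x for x in d))
    norm_q = math.sqrt(sum(y * y for y in q))
    if norm_d == 0.0 or norm_q == 0.0:
        return 0.0  # convention chosen for this sketch: a zero vector matches nothing
    return dot / (norm_d * norm_q)

def softmax(logits):
    """Turn raw scores (such as W_h h_t) into a probability distribution."""
    m = max(logits)  # subtracting the max keeps exp() from overflowing
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical query and document vectors
query = [1.0, 0.0, 1.0]
docs = {"doc_a": [2.0, 0.0, 2.0], "doc_b": [0.0, 3.0, 0.0]}
scores = {name: cosine_similarity(vec, query) for name, vec in docs.items()}

# Hypothetical logits for three candidate words
probs = softmax([2.0, 1.0, 0.1])
```

`doc_a` points in the same direction as the query, so it scores close to 1 regardless of its length, while `doc_b` is orthogonal to the query and scores 0.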

### Large Language Models (LLM)

**Mathematical Principles:**
- **Self-Attention:** The self-attention mechanism's computation is defined as:
  \[ \text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V \]
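As an illustration of the formula above, here is a small sketch of scaled dot-product attention on list-of-lists matrices. It is written in plain Python for readability and is not how a transformer library implements attention; the matrices `Q`, `K`, and `V` are toy values chosen for this example.

```python
import math

def softmax(row):
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Compute softmax(Q K^T / sqrt(d_k)) V, one query row at a time."""
    d_k = len(K[0])
    scale = math.sqrt(d_k)
    output = []
    for q in Q:
        # Scaled similarity of this query to every key
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in K]
        weights = softmax(scores)
        # Weighted sum of the value rows
        out = [sum(w * v[j] for w, v in zip(weights, V)) for j in range(len(V[0]))]
        output.append(out)
    return output

# Toy inputs: two tokens, two dimensions
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[10.0, 0.0], [0.0, 10.0]]
result = attention(Q, K, V)
```

Each output row is a mixture of the value rows, weighted toward the value whose key best matches the query.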

### LangChain

**Core Concepts:**
LangChain, developed as an open-source toolkit, simplifies the creation of applications leveraging LLMs by facilitating integration with external computation and data sources. Its components include:

- **Schema:** Defines core data structures like Text, ChatMessages, Examples, and Document, essential for interacting with language models.
- **Models:** Categorizes into Language Models, Chat Models, and Text Embedding Models, offering interfaces for seamless integration with LLMs.
- **Prompts:** Crafting prompts is vital for directing LLMs. LangChain introduces PromptTemplates for constructing prompts dynamically.
- **Indexes:** Serve as bridges between documents/data and LLMs, crucial for enriching models with context-specific information.
- **Memory:** Facilitates storing and retrieving chat history or conversational context, enhancing the model's ability to produce coherent and context-aware responses.
- **Chains:** Enable the sequencing of multiple components, such as data retrieval, prompt generation, and response parsing, into a cohesive workflow.
- **Agents:** Utilize LLMs for selecting and sequencing actions, embodying the decision-making prowess of LLMs in application scenarios.

**Innovations in LangChain:**
LangChain's modular abstractions for these components significantly lower the barrier to integrating complex LLM functionalities into applications. By abstracting the complexities involved in handling language models, data retrieval, and processing, LangChain empowers developers to build more sophisticated, context-aware applications with ease.

**Mathematical Extensions:**
While LangChain primarily focuses on the architectural and integration aspects of using LLMs, the underlying mathematical principles of its components (especially Models and Indexes) are grounded in the computations of neural networks, vector space models, and embeddings. For instance, the Text Embedding Models convert textual data into numerical vectors, capturing semantic meanings in a high-dimensional space, which can be mathematically represented as:
\[ \text{Embedding}(text) = V \]
where \(V\) is a vector representing the text in a semantic vector space.
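To make the idea of a text-to-vector mapping concrete, the sketch below uses the hashing trick: each token is hashed into one of `dim` buckets and the counts are normalized to unit length. This is a toy stand-in, not a learned embedding model and not a LangChain API; the function name `toy_embedding` and the dimension of 16 are assumptions made for this example.

```python
import hashlib
import math
import re

def toy_embedding(text, dim=16):
    """Map text to a unit-length vector with the hashing trick (not a learned model)."""
    vec = [0.0] * dim
    for token in re.findall(r"[a-z0-9]+", text.lower()):
        # A stable digest gives the same bucket in every Python session
        digest = hashlib.md5(token.encode("utf-8")).digest()
        bucket = int.from_bytes(digest[:4], "big") % dim
        vec[bucket] += 1.0
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else vec

v1 = toy_embedding("stylish comfortable shoes")
v2 = toy_embedding("comfortable stylish shoes")
v3 = toy_embedding("waterproof jacket")
```

Unlike a learned model, this mapping ignores word order and meaning: `v1` and `v2` are identical because they contain the same tokens, and unrelated words can collide in the same bucket.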

### Annoy (Approximate Nearest Neighbors Oh Yeah)

**Mathematical Principles:**
- **Tree Construction:** Annoy uses random projection trees for partitioning data, optimizing for both speed and memory efficiency.
- **Approximate Search:** The search in Annoy is approximated to quickly retrieve near neighbors without exhaustive search, significantly reducing query time.
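The tree construction can be sketched in a few lines. The example below builds a single tree by repeatedly sampling two points and splitting on the hyperplane equidistant from them, then finds candidate neighbors by descending to one leaf. It is a simplified illustration under those assumptions, not Annoy's implementation: Annoy builds a forest of such trees and searches several branches. The helper names (`build_tree`, `candidates`) are hypothetical.

```python
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def build_tree(points, ids, leaf_size, rng):
    """Recursively split ids by a hyperplane equidistant from two sampled points."""
    if len(ids) <= leaf_size:
        return {"leaf": ids}
    a, b = (points[i] for i in rng.sample(ids, 2))
    normal = [x - y for x, y in zip(a, b)]
    midpoint = [(x + y) / 2.0 for x, y in zip(a, b)]
    offset = dot(normal, midpoint)
    left = [i for i in ids if dot(normal, points[i]) < offset]
    left_set = set(left)
    right = [i for i in ids if i not in left_set]
    if not left or not right:  # degenerate split, e.g. duplicate points
        return {"leaf": ids}
    return {
        "normal": normal,
        "offset": offset,
        "left": build_tree(points, left, leaf_size, rng),
        "right": build_tree(points, right, leaf_size, rng),
    }

def candidates(tree, query):
    """Follow the splits down to one leaf; only its ids are compared exactly."""
    node = tree
    while "leaf" not in node:
        side = "left" if dot(node["normal"], query) < node["offset"] else "right"
        node = node[side]
    return node["leaf"]

rng = random.Random(42)
points = [[rng.random() for _ in range(8)] for _ in range(200)]
tree = build_tree(points, list(range(200)), leaf_size=10, rng=rng)
leaf = candidates(tree, points[0])
```

A query only visits one small leaf instead of all 200 points, which is where the speed comes from. A true neighbor can sit just across a split, which is why the result is approximate and why several trees are combined in practice.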

### NOTE

This comprehensive overview, enriched with details from the LangChain components guide and mathematical principles, offers a deeper understanding of the technologies and methodologies driving today's AI applications. LangChain, by abstracting the complexity of integrating LLMs with external data and computational resources, stands out as a pivotal framework for developers aiming to leverage the power of language models in their applications.

## Mock Data Generation

The Python function `generate_retail_mock_data` creates mock retail product descriptions. The resulting dataset can support retail or e-commerce data analysis projects, machine learning experiments, or testing and demonstrations. The breakdown below follows the implementation in the next code cell:

### Function Definition and Parameters
- **Function Name:** `generate_retail_mock_data`
- **Parameters:**
  - `product_names`: A dictionary mapping each category name (for example, `'shoes'`) to a list of product names in that category.
  - `num_items_per_category`: An integer that caps how many products are sampled per category. The default is 10; if a category lists fewer names, all of them are used.
  - `seed`: An integer used to initialize Python's random number generator so that product and feature sampling is repeatable. The default is 42.
- **Helper Functions:**
  - `generate_product_description(product_name, features)`: Prompts the GPT-2 `text-generation` pipeline and returns the cleaned output.
  - `clean_description(description)`: Removes HTML tags, URLs, bracketed fragments, and special characters using regular expressions.

### Core Functionality
1. **Initializing the Random Seed:** Seeds Python's `random` module so that the same products and features are selected on every run. The GPT-2 text itself is sampled by the model and may still vary between runs.

2. **Product and Feature Sampling:** Iterates over each category, samples up to `num_items_per_category` product names, and picks three features at random for each product (e.g., 'lightweight', 'durable').

3. **Description Generation:** Places each product name and its features in a prompt for GPT-2, then cleans the generated text with `clean_description`.

4. **Compilation of Descriptions:** Each cleaned description is appended to a templated sentence, and the results are collected in a list, `product_descriptions`, which is then returned by the function.

### Output
- Returns a list, `product_descriptions`, containing one generated description per sampled product.

### Example Output
Each entry starts with a templated sentence followed by the cleaned GPT-2 text. Because the pipeline returns the prompt together with its continuation, the prompt is repeated inside each description:
```
Nike Air Zoom Pegasus is a shoes featuring durable, comfortable, waterproof. Describe Nike Air Zoom Pegasus, a product featuring durable, comfortable, waterproof A single, durable mesh cushion with a molded surface for an air dry look Designed for outdoor travel Designed for comfortable walking in the sun High res to date
...
```

### Use Cases
This mock data generation function is invaluable for:
- **Development and Testing:** Providing test data for e-commerce platforms to evaluate search and recommendation systems.
- **Data Analysis Projects:** Offering a basis for demonstrations or educational projects focused on retail data analytics.
- **Machine Learning Models:** Supplying initial training or testing data for models when actual product data is not available.

### Customization and Extension
The function offers flexibility for customization:
- Adjusting product categories, features, or the number of items to better fit specific needs.
- Changing the prompt template or the `max_length` passed to the GPT-2 pipeline to control the generated text.

This function demonstrates an effective approach for generating rich, varied, and reproducible mock data for retail-related applications, adaptable to a wide array of use cases in the retail and e-commerce sectors.
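For quick tests where loading GPT-2 is unnecessary, the sampling logic can be exercised on its own. The sketch below is a template-only variant, assuming the same input shape (a dictionary of category to product names) and the same feature list as the notebook; the function name `generate_base_descriptions` and the product `'Trail Shell'` are hypothetical.

```python
import random

FEATURES = ['lightweight', 'durable', 'waterproof', 'stylish', 'comfortable']

def generate_base_descriptions(product_names, num_items_per_category=10, seed=42):
    """Template-only variant: no language model, so the output depends only on the seed."""
    rng = random.Random(seed)  # local generator; leaves the global random state alone
    descriptions = []
    for category, names in product_names.items():
        # Sample at most num_items_per_category names from this category
        for name in rng.sample(names, min(num_items_per_category, len(names))):
            features = rng.sample(FEATURES, 3)
            descriptions.append(f"{name} ({category}) featuring {', '.join(features)}.")
    return descriptions

catalog = {'shoes': ['Nike Air Max', 'Nike Free RN'], 'jackets': ['Trail Shell']}
first = generate_base_descriptions(catalog)
second = generate_base_descriptions(catalog)
```

Calling the function twice with the same seed yields identical lists, which makes it convenient for unit tests of downstream indexing code.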

In [61]:
import re
from transformers import pipeline
import random

# Initialize the GPT-2 text generation model
model_name = "gpt2"
text_generator = pipeline('text-generation', model=model_name)

def clean_description(description):
    """
    Cleans the description by removing HTML tags, URLs, special characters,
    and other non-descriptive elements using regex.
    """
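    clean_text = re.sub(r'<[^>]+>', '', description)  # HTML tags
    clean_text = re.sub(r'http[s]?://\S+', '', clean_text)  # URLs
    # Drop bracketed fragments before the special-character pass strips the brackets
    clean_text = re.sub(r'\[.*?\]', '', clean_text)
    clean_text = re.sub(r'[^A-Za-z0-9.,\'\s]+', ' ', clean_text)  # special characters
    clean_text = re.sub(r'\s+', ' ', clean_text).strip()  # collapse whitespace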
    return clean_text

def generate_product_description(product_name, features):
    prompt = f"Describe {product_name}, a product featuring {', '.join(features)}:"
    generated_text = text_generator(prompt, max_length=60, num_return_sequences=1)[0]['generated_text']
    cleaned_description = clean_description(generated_text)
    return cleaned_description

def generate_retail_mock_data(product_names, num_items_per_category=10, seed=42):
    random.seed(seed)
    product_descriptions = []

    for category, names in product_names.items():
        product_list = random.sample(names, min(num_items_per_category, len(names)))
        for product_name in product_list:
            features = random.sample(['lightweight', 'durable', 'waterproof', 'stylish', 'comfortable'], 3)
            description = generate_product_description(product_name, features)
            full_description = f"{product_name} is a {category} featuring {', '.join(features)}. {description}"
            product_descriptions.append(full_description)

    return product_descriptions

# Example usage with all Nike shoes
nike_shoes = {
    'shoes': ['Nike Air Zoom Pegasus', 'Nike Air Max', 'Nike React Infinity Run', 'Nike Free RN']
}

mock_retail_product_descriptions = generate_retail_mock_data(nike_shoes)  # Reduced for demonstration
print("\n".join(mock_retail_product_descriptions[:15]))


Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Nike Air Zoom Pegasus is a shoes featuring durable, comfortable, waterproof. Describe Nike Air Zoom Pegasus, a product featuring durable, comfortable, waterproof A single, durable mesh cushion with a molded surface for an air dry look Designed for outdoor travel Designed for comfortable walking in the sun High res to date
Nike Free RN is a shoes featuring lightweight, comfortable, waterproof. Describe Nike Free RN, a product featuring lightweight, comfortable, waterproof a shoe with great traction, and great durability, the Nylon Nylon M Nylon. The innovative design features Nylon Nylon bonded mesh when wearing a M Isole, the mesh increases in diameter
Nike Air Max is a shoes featuring stylish, lightweight, comfortable. Describe Nike Air Max, a product featuring stylish, lightweight, comfortable Inside the Nike Air Max you will find the following items 2 pairs of Nike Air Max T Shirts and 3 pairs of Nike Air Max T Shirt Worn 2
Nike React Infinity Run is a shoes featuring lightweight, d

In [62]:

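import hashlib

from annoy import AnnoyIndex
import numpy as np

vector_length = 100  # Assuming a 100-dimensional vector for each product description
annoy_index = AnnoyIndex(vector_length, 'angular')

# Mock function to convert descriptions to vectors
def description_to_vector(description):
    # Python's built-in hash() is salted per process for strings, so derive the
    # seed from a stable digest to get the same vector in every session
    digest = hashlib.sha256(description.encode('utf-8')).digest()
    seed = int.from_bytes(digest[:4], 'big')  # fits NumPy's 32-bit seed range
    np.random.seed(seed)
    return np.random.rand(vector_length)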

# Adding items to Annoy index
for i, description in enumerate(mock_retail_product_descriptions):
    vec = description_to_vector(description)
    annoy_index.add_item(i, vec)

annoy_index.build(10)  # Using 10 trees
num_products = len(mock_retail_product_descriptions)
print("Annoy index is built with", num_products, "items.")


Annoy index is built with 4 items.


## Annoy (Approximate Nearest Neighbors Oh Yeah)

Annoy is a C++ library with Python bindings designed to efficiently search for points in space that are close to a given query point, emphasizing speed and minimal memory usage while accepting a trade-off in precision. It's particularly useful for implementing recommendation systems, enhancing search engines, and facilitating various machine learning applications where quick nearest neighbor queries are essential.

### Key Features:
- **Memory-efficient Indexing:** Annoy creates large, read-only, file-based data structures that are memory-mapped, allowing multiple processes to share the same data without duplicating it in memory.
- **Build-Once Indexes:** Items are added with `add_item` before `build` is called. Once the index is built, no further items can be added, so changing the dataset means rebuilding the index.
- **Persistence:** Annoy indexes can be saved to disk and later reloaded, facilitating persistent storage and retrieval of vector data across sessions.

### Usage in This Notebook:
In the context of this notebook, Annoy is utilized to store and query embeddings of retail product descriptions. Here's how we integrate Annoy for a practical use-case:

1. **Embedding Storage:** We generate a vector for each product description using a mock function that stands in for a real embedding model and returns a pseudo-random 100-dimensional vector seeded by the text. These vectors are stored in an Annoy index. Because they carry no semantic information, the neighbors returned in this notebook illustrate the mechanics of retrieval rather than true relevance.
2. **Index Creation:** An AnnoyIndex instance is created with the specified vector length and the 'angular' metric, which Annoy defines as `sqrt(2(1 - cos(u, v)))` and which therefore ranks neighbors in the same order as cosine similarity. Each product's vector is added to the index, which is then built with a specified number of trees (10 here); more trees generally improve accuracy at the cost of a larger index.
3. **Querying:** For a given query, its vector is calculated and used to fetch the nearest neighbors from the Annoy index. With a real embedding model, this step identifies the product descriptions most similar to the query.

This demonstration showcases the use of Annoy as a vector database for embedding-based retrieval, illustrating its application in scenarios where finding similar items quickly is crucial, such as in retail product recommendation systems.
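With only a handful of items, an exact search is cheap and gives a reference against which approximate results could be checked. The sketch below ranks vectors by the angular distance described above using plain Python; it does not call Annoy, and the example vectors and the helper name `exact_nearest` are assumptions made for illustration.

```python
import math

def angular_distance(u, v):
    """sqrt(2 * (1 - cos(u, v))); raises ZeroDivisionError for a zero vector."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    cos = max(-1.0, min(1.0, dot / (norm_u * norm_v)))  # clamp rounding error
    return math.sqrt(2.0 * (1.0 - cos))

def exact_nearest(query, vectors, n):
    """Return the indices of the n vectors closest to the query (brute force)."""
    ranked = sorted(range(len(vectors)), key=lambda i: angular_distance(query, vectors[i]))
    return ranked[:n]

# Hypothetical 2-D vectors standing in for product embeddings
vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]
nearest = exact_nearest([1.0, 0.05], vectors, 2)
```

Brute force compares the query with every stored vector, so its cost grows linearly with the catalog size; an approximate index trades a little accuracy to avoid that scan.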


## Integration of LangChain with RAG and Annoy

In this demonstration, we combine the retrieval capabilities of Annoy with a generative Large Language Model (LLM) from Hugging Face's `transformers` in a LangChain-style chain: retrieve, assemble context, then generate. The code cell below wires these steps together by hand in plain Python and does not import the `langchain` package. The workflow shows how retrieval and generation can be combined so that the generated text draws on material relevant to a given query.

### Workflow Overview:
1. **Query Embedding:** For a given query, we first convert it into an embedding vector using a mock function `description_to_vector`. This function simulates the process of transforming textual data into a numerical representation that can be processed by machine learning models.

2. **Retrieval with Annoy:** Using the query's vector representation, we retrieve the nearest neighbor product descriptions from an Annoy index. Annoy efficiently identifies the vectors in the dataset that are closest to the query vector, based on cosine similarity. This step highlights Annoy's utility in quickly fetching relevant data from a large dataset.

3. **Context Assembly:** The retrieved product descriptions are concatenated to form a context string. This assembled context is then used to inform the generation process, ensuring that the generated text is relevant to the specifics of the query.

4. **Text Generation (RAG step):** The concatenated context is fed into a generative model from the `transformers` library, specifically `distilgpt2`, alongside the original query. The model continues the combined prompt, so the response is conditioned on both the query and the retrieved descriptions.

5. **HTML Presentation:** The query, retrieved context, and generated response are presented in an HTML format for enhanced readability. This step demonstrates how the integration of these technologies can be used to create user-friendly outputs for applications such as product recommendation systems or automated customer service responses.

### Example Usage:
In our example, we query for "Looking for stylish and comfortable shoes." The process involves embedding the query, retrieving related product descriptions using Annoy, and generating a response that synthesizes the query intent with the retrieved information. The final output is displayed in a styled HTML block, making it easy to visualize the integration's effectiveness in producing relevant and engaging content.

This integration exemplifies a practical application of combining retrieval and generative models to enhance the capabilities of AI-driven systems. By leveraging the specific strengths of Annoy for efficient data retrieval and the generative capabilities of language models like `distilgpt2`, developers can create sophisticated solutions that address complex queries with contextually rich and relevant responses.
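The context assembly step is worth a closer look, because small models such as `distilgpt2` accept only a limited amount of input. The sketch below joins retrieved descriptions, skips duplicates, and stops at a character budget before building the same `Query: ... Based on: ...` prompt used in this notebook. The helper names and the 500-character budget are assumptions for illustration, and a character budget is only a rough proxy for a token limit.

```python
def assemble_context(retrieved_docs, max_chars=500):
    """Concatenate retrieved descriptions, skipping duplicates, within a character budget."""
    seen = set()
    parts = []
    used = 0
    for doc in retrieved_docs:
        doc = doc.strip()
        if not doc or doc in seen:
            continue
        cost = len(doc) + (1 if parts else 0)  # +1 for the joining space
        if used + cost > max_chars:
            break  # keep whole documents only; stop at the first one that does not fit
        parts.append(doc)
        seen.add(doc)
        used += cost
    return " ".join(parts)

def build_prompt(query, retrieved_docs, max_chars=500):
    context = assemble_context(retrieved_docs, max_chars)
    return f"Query: {query}. Based on: {context}"

# Hypothetical retrieved descriptions, including one duplicate
docs = ["Shoe A is stylish.", "Shoe A is stylish.", "Shoe B is comfortable."]
prompt = build_prompt("Looking for stylish and comfortable shoes", docs)
```

Since nearest neighbors arrive ordered by similarity, cutting off at the budget drops the least similar documents first.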


In [63]:
import html

from transformers import pipeline
from IPython.display import HTML

def simple_langchain_integration_v2(query):
    generator = pipeline('text-generation', model='distilgpt2')
    # description_to_vector, annoy_index and mock_retail_product_descriptions
    # are defined in the earlier cells
    query_vector = description_to_vector(query)
    # Asks for 5 neighbors; fewer ids come back if the index holds fewer items
    nearest_ids = annoy_index.get_nns_by_vector(query_vector, 5)
    retrieved_docs = [mock_retail_product_descriptions[i] for i in nearest_ids]
    retrieved_context = " ".join(retrieved_docs)
    # truncation=True avoids the tokenizer truncation warning
    response = generator(f'Query: {query}. Based on: {retrieved_context}', max_length=100, num_return_sequences=1, truncation=True)
    generated_text = response[0]['generated_text']

    # Escape the dynamic text so stray '<' or '&' characters cannot break the markup,
    # then wrap it in styled HTML for display
    html_content = f"""
    <div style="border: 2px solid #2c3e50; border-radius: 10px; padding: 20px;">
        <h2>Query</h2>
        <p>{html.escape(query)}</p>
        <h2>Retrieved Context</h2>
        <p>{html.escape(retrieved_context)}</p>
        <h2>Response</h2>
        <p style="color: #27ae60;">{html.escape(generated_text)}</p>
    </div>
    """
    return HTML(html_content)

# Example use
simple_langchain_integration_v2("Looking for stylish and comfortable shoes")


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.



## Conclusion

This notebook demonstrates Retrieval-Augmented Generation (RAG) with a Large Language Model (LLM) from Hugging Face, using Annoy for efficient nearest neighbor searches and a LangChain-style chain of steps written in plain Python. The focus was on a retail context, showing how combining a retrieval system with a generative model can supply relevant context for responses to user queries. Here's a summary of the workflow and key takeaways:

- **What:** The notebook outlines the process of generating mock retail product descriptions, indexing these descriptions using Annoy for fast retrieval, and then passing the retrieved descriptions to a generative model from Hugging Face (distilGPT-2) for text generation, with the steps chained in a LangChain-style workflow.

- **Why:** The integration aims to demonstrate the power of combining vector space retrieval with generative AI to improve the relevance and specificity of responses in a simulated retail query scenario. This approach exemplifies how businesses can leverage AI to enhance user experiences through personalized and context-aware interactions.

- **How:**
  - **Mock Data Generation:** We started by generating a dataset of mock product descriptions to simulate a retail inventory.
  - **Annoy Indexing:** The descriptions were then converted into vector embeddings and indexed using Annoy, enabling efficient similarity-based retrieval.
  - **LangChain-Style Integration:** We showed how a chain of steps connects Annoy with a generative LLM: embed the user query, retrieve the nearest product descriptions, and generate a response from the combined prompt.
  - **Text Generation and Display:** Utilizing the Hugging Face `pipeline` for text generation, the notebook demonstrated generating responses based on the context provided by Annoy's nearest neighbor search, with the output presented in an HTML format for enhanced readability.

### Key Takeaways:
- **Efficiency and Relevance:** The use of Annoy for embedding storage and retrieval showcases an efficient method to add contextual relevance to LLM-generated responses.
- **Scalability:** This approach illustrates how scalable solutions can be built for real-world applications, accommodating large datasets typical in retail environments.
- **Customizability:** The modular nature of LangChain, combined with the flexibility of Annoy and the generative power of LLMs, underscores the potential for customization according to specific application needs or domains.
- **User Experience Enhancement:** The integration exemplifies how AI can be used to significantly enhance user experience, providing a foundation for developing advanced AI-driven retail applications.

Through this integration, we've demonstrated a scalable and efficient method to bring context-awareness and relevance to AI-generated text, paving the way for innovative applications in retail and beyond.


## Detailed System Architecture Overview

This section describes the architecture of the system implemented in this notebook, with a sequence diagram drawn using Mermaid syntax for visual reference. The system combines Retrieval-Augmented Generation (RAG) with a Large Language Model (LLM) from Hugging Face and uses the Annoy library for efficient similarity searches, chained together in a LangChain-style workflow to handle retail queries.

### Architecture Diagram

![RAG Sequence Diagram](rag_sequence_digram.png)

### Architecture Components and Interactions

- **User Query:** Initiates the process, where users enter their queries regarding retail products, starting the interaction flow.

- **LangChain Framework:** Serves as the central orchestrator, managing the workflow from user query input to fetching nearest neighbors via Annoy and generating responses through the LLM. In this notebook, that role is played by the function `simple_langchain_integration_v2`.

- **Annoy Index:** A critical component that stores pre-processed vector embeddings of product descriptions, enabling quick and efficient retrieval of items similar to the user query.

- **Retrieve Nearest Neighbors:** This step is crucial for identifying contextually relevant product descriptions based on the user's query, which are then utilized to inform the response generation process.

- **LLM (Hugging Face):** At this stage, the system leverages a pre-trained language model to generate detailed and contextually enriched responses by incorporating both the original query and the context provided by the nearest neighbors.

- **Generate Response:** The synthesis of input and retrieved context through the LLM culminates in the generation of a response tailored to the user's query, embodying the integration of RAG principles.

- **Display Results:** The final step where the system presents the generated response back to the user, completing the cycle and providing a cohesive answer to the query.

### Highlighted Features:

- **Rapid Context Retrieval:** Utilizes Annoy for the swift retrieval of relevant context, significantly speeding up the response generation process.

- **Scalable System Design:** Engineered to accommodate the extensive and growing datasets typical in retail, ensuring the system's scalability.

- **Modular and Flexible:** The architecture's modular design promotes flexibility, allowing for the easy replacement or enhancement of individual components, such as swapping the LLM model or modifying the retrieval strategy.

- **Enhanced User Engagement:** By delivering precise, context-aware responses, the system significantly improves user interaction and satisfaction, showcasing the potential of AI in transforming retail experiences.
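The modular design described above can be sketched as a small class whose embedding, retrieval, and generation steps are plain callables, so any one of them can be replaced without touching the others. This is an illustrative sketch rather than the notebook's implementation; the class name `RetailQueryPipeline`, the stand-in components, and the catalog entries are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

@dataclass
class RetailQueryPipeline:
    embed: Callable[[str], Sequence[float]]                 # text -> vector
    retrieve: Callable[[Sequence[float], int], List[str]]   # vector, k -> documents
    generate: Callable[[str], str]                          # prompt -> response
    top_k: int = 3

    def answer(self, query: str) -> str:
        query_vector = self.embed(query)
        docs = self.retrieve(query_vector, self.top_k)
        prompt = f"Query: {query}. Based on: {' '.join(docs)}"
        return self.generate(prompt)

# Stand-in components; a real setup would plug in an embedding model,
# an Annoy lookup, and an LLM call here
catalog = ["Shoe A is stylish.", "Shoe B is durable.", "Jacket C is waterproof."]
demo_pipeline = RetailQueryPipeline(
    embed=lambda text: [float(len(text))],
    retrieve=lambda vector, k: catalog[:k],
    generate=lambda prompt: prompt.upper(),
    top_k=2,
)
answer = demo_pipeline.answer("stylish shoes")
```

Swapping the generator is a one-line change, for example `demo_pipeline.generate = another_callable`, which also makes each stage easy to test in isolation.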




# Enhancing E-commerce with NingLab eCeLLM-S

The integration of NingLab's eCeLLM-S model into our retail analytics framework signifies a leap forward in applying advanced Large Language Models (LLMs) to the e-commerce domain. This section delves into the technical underpinnings, objectives, and implementation strategy of leveraging eCeLLM-S for enriching e-commerce content and customer interaction.

## Background on LLMs and eCeLLM-S

Large Language Models (LLMs) like GPT-3 have revolutionized the field of natural language processing (NLP) with their ability to generate coherent and contextually relevant text across a broad spectrum of applications. Building on this foundation, eCeLLM-S is designed to specifically address the nuances and complexities of e-commerce text generation, from product descriptions to customer engagement.

### Core Technologies

eCeLLM-S employs a transformer-based architecture, renowned for its self-attention mechanism, which allows the model to weigh the importance of different parts of the input text differently. This is crucial for generating text that is not only grammatically correct but also contextually aligned with the subject matter.

#### Mathematical Foundation of Transformers

The transformer model utilizes several key equations, including the self-attention mechanism:

$$
\text{Attention}(Q, K, V) = \text{softmax}\left( \frac{QK^T}{\sqrt{d_k}} \right)V
$$

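- $Q$, $K$, and $V$ are the query, key, and value matrices, whose rows are the per-token query, key, and value vectors.
- $d_k$ is the dimensionality of the key vectors; dividing by $\sqrt{d_k}$ keeps the dot products from growing with the dimension and saturating the softmax.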

This formula allows the model to dynamically focus on different parts of the input sequence, enhancing the relevance and coherence of the generated text.
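The role of the $\sqrt{d_k}$ factor can be checked numerically. The sketch below draws random query and key vectors with unit-variance components and measures the spread of their dot products before and after scaling. It is a small experiment under those assumptions (Gaussian components, $d_k = 64$, a fixed seed), not a property measured on eCeLLM-S.

```python
import math
import random

def dot_product_std(d_k, samples=2000, seed=0):
    """Empirical standard deviation of q . k for random unit-variance vectors."""
    rng = random.Random(seed)
    dots = []
    for _ in range(samples):
        q = [rng.gauss(0.0, 1.0) for _ in range(d_k)]
        k = [rng.gauss(0.0, 1.0) for _ in range(d_k)]
        dots.append(sum(a * b for a, b in zip(q, k)))
    mean = sum(dots) / samples
    return math.sqrt(sum((x - mean) ** 2 for x in dots) / samples)

d_k = 64
raw_std = dot_product_std(d_k)          # grows roughly like sqrt(d_k)
scaled_std = raw_std / math.sqrt(d_k)   # stays close to 1 after scaling
```

Without scaling, the scores spread out as the dimension grows, which pushes the softmax toward a near one-hot distribution; dividing by $\sqrt{d_k}$ keeps the spread roughly constant.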

### Objectives of eCeLLM-S Integration

- **Contextual Relevance**: Generate text that reflects the specific context of retail products and customer interactions.
- **Enhanced Descriptions**: Create detailed and engaging product descriptions that capture the essence and appeal of products.
- **Customer Engagement**: Facilitate improved customer interaction through personalized and informative responses.

## Implementation Strategy

The effective deployment of eCeLLM-S within our retail analysis ecosystem involves a multi-faceted approach:

1. **Data Curation**: Assemble a diverse dataset that encompasses a wide range of our retail products and customer feedback.
2. **Model Fine-Tuning**: Adapt eCeLLM-S to our specific retail context to enhance its output's relevance and accuracy.
3. **Workflow Integration**: Seamlessly incorporate eCeLLM-S-generated content into our product listings and customer service processes.
4. **Iterative Refinement**: Continuously monitor and refine the model's performance based on user feedback and emerging retail trends.


**The strategic integration of NingLab's eCeLLM-S model represents a forward-thinking approach to harnessing the power of LLMs in the retail sector. By leveraging its advanced text generation capabilities, we can significantly enhance the quality of our product descriptions and the effectiveness of our customer interactions, setting a new benchmark for AI-driven e-commerce excellence.**


In [64]:

from transformers import pipeline

# Initialize the eCeLLM-S pipeline for text generation
eCeLLM_S_pipeline = pipeline("text-generation", model="NingLab/eCeLLM-S")

# Example usage with a prompt related to retail
prompt = "Generate a catchy product description for a pair of eco-friendly sneakers"
generated_text = eCeLLM_S_pipeline(prompt, max_length=100, num_return_sequences=1)

print("Generated product description:")
print(generated_text[0]['generated_text'])
    

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generated product description:
Generate a catchy product description for a pair of eco-friendly sneakers.
Input: 
Output: Step into the future with these eco-friendly sneakers. Made from recycled materials and organic cotton, these sneakers are not only stylish and comfortable, but also sustainable and ethical. Whether you're running, walking, or dancing, these sneakers will keep you on your feet and on your mission to save the planet.



### Application in Retail Analysis

The above example demonstrates the use of eCeLLM-S in generating a product description. Such capabilities can be extended to various aspects of retail analysis, including but not limited to, generating creative product names, detailed product descriptions, and personalized marketing messages.

By integrating eCeLLM-S, we can enhance the customer experience through more engaging and informative content, contributing to improved customer satisfaction and sales.
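
As the output above shows, the text-generation pipeline returns the prompt followed by the model's continuation, and in this run eCeLLM-S placed its answer after an `Output:` marker. Before the text is used as catalogue copy, the echoed part has to be stripped. A minimal sketch, assuming the `Output:` marker seen in the run above is present (the helper name `extract_description` is ours and not part of any library):

```python
def extract_description(prompt, generated_text, marker="Output:"):
    """Return only the model's answer from a prompt-plus-continuation string."""
    # Prefer the text after the last marker when the model emits one.
    if marker in generated_text:
        return generated_text.rsplit(marker, 1)[1].strip()
    # Otherwise fall back to removing the echoed prompt prefix.
    if generated_text.startswith(prompt):
        return generated_text[len(prompt):].strip()
    return generated_text.strip()


prompt = "Generate a catchy product description for a pair of eco-friendly sneakers"
# Shortened copy of the generated text shown above.
raw = prompt + ".\nInput: \nOutput: Step into the future with these eco-friendly sneakers."

description = extract_description(prompt, raw)
print(description)
```

Passing `return_full_text=False` to the pipeline call is an alternative that drops the echoed prompt at the source.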
    


## Integration with LangChain Using Vector RAG

To further enhance our retail analytics capabilities, we combine a LangChain-style chain with vector Retrieval-Augmented Generation (RAG) for query understanding and information retrieval. LangChain is a framework for composing LLM calls with retrieval steps; the aim here is to ground generated answers in retrieved product data and to retrieve that data efficiently. Note that the code in this section does not import LangChain itself: it implements the same retrieve-then-generate pattern directly with Hugging Face `transformers` pipelines and an Annoy index.

### Objectives

- **Enhance Data Veracity**: Ground generated responses in retrieved retail data so that answers reflect the product catalogue rather than the model's guesses.
- **Improve Retrieval Efficiency**: Leverage vector RAG for efficient retrieval of relevant information from our extensive retail dataset.
- **Innovate Retail Analytics**: Combine retrieval-based grounding with state-of-the-art NLP to build new retail analytics solutions.

Explanation of the code below:

- Import required modules:
  - `transformers.pipeline`: the factory function from the transformers library that wraps a pre-trained model for a given task.
  - `IPython.display.HTML`: the class from the IPython.display module used to render HTML content in Jupyter Notebooks or IPython environments.

- Define the function `enhanced_langchain_integration(query)`:
  - Integrates language models with retail data to generate context-grounded responses to queries.
  - Uses two different models:
    - `eceLLM_generator`: Specifically designed for e-commerce (eCeLLM-S).
    - `general_generator`: General-purpose text generation model (DistilGPT-2).
  - Converts the input query to a vector representation (assuming a function `description_to_vector` is defined elsewhere).
  - Retrieves the five most similar product descriptions for the query vector from an Annoy index (`annoy_index`).
  - Joins the retrieved descriptions into a single context string.
  - Generates a response with each model from the query and the retrieved context.
  - Chooses the final response with a simple heuristic: the longer of the two generated texts.
  - Constructs HTML content with CSS styling for visual presentation, including query, retrieved context, and generated response.

- Example usage:
  - Demonstrates how to use the `enhanced_langchain_integration` function with an example query ("Looking for stylish and comfortable shoe").
  - Displays the HTML content in the notebook, presenting the query, retrieved context, and generated response with appropriate styling.
    

In [65]:
from transformers import pipeline
from IPython.display import HTML

# Function to integrate language model with retail data
def enhanced_langchain_integration(query):
    # Using eCeLLM-S for e-commerce specific generation and DistilGPT-2 for general text
    eceLLM_generator = pipeline('text-generation', model='NingLab/eCeLLM-S')  # Adjust the model ID as needed
    general_generator = pipeline('text-generation', model='distilgpt2')

    # Convert query to vector (Assuming this function is defined and works with your data)
    query_vector = description_to_vector(query)

    # Retrieve similar product descriptions based on the query vector
    nearest_ids = annoy_index.get_nns_by_vector(query_vector, 5)  # Assuming annoy_index is properly set up
    retrieved_docs = [mock_retail_product_descriptions[i] for i in nearest_ids]  # Assuming this variable holds your mock data

    # Combine retrieved context for better generation
    retrieved_context = " ".join(retrieved_docs)

    # Generate a response with each model from the query plus the retrieved context.
    # max_new_tokens bounds only the continuation (max_length would also count the prompt tokens),
    # and return_full_text=False drops the echoed prompt from the result.
    gen_kwargs = dict(max_new_tokens=100, num_return_sequences=1, return_full_text=False)
    eceLLM_response = eceLLM_generator(f'Query: {query}. Based on: {retrieved_context}', **gen_kwargs)[0]['generated_text']
    general_response = general_generator(f'Query: {query}. Context: {retrieved_context}', **gen_kwargs)[0]['generated_text']

    # Choose a response with a simple heuristic: keep the longer text (swap in a relevance score as needed)
    final_response = eceLLM_response if len(eceLLM_response) > len(general_response) else general_response

    # Create HTML content with CSS styling for visual presentation.
    # Escape the interpolated text so characters such as '<' or '&' in generated output cannot break the markup.
    from html import escape
    html_content = f"""
    <div style="border: 2px solid #2c3e50; border-radius: 10px; padding: 20px;">
        <h2>Query</h2>
        <p>{escape(query)}</p>
        <h2>Retrieved Context</h2>
        <p>{escape(retrieved_context)}</p>
        <h2>Generated Response</h2>
        <p style="color: #27ae60;">{escape(final_response)}</p>
    </div>
    """
    return HTML(html_content)

# Example use
enhanced_langchain_integration("Looking for stylish and comfortable shoe")


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.



### Implementation Considerations

Integrating LangChain with vector RAG requires careful planning and execution, including securing the underlying retail data, tuning the retrieval step (embedding quality, number of neighbours, index parameters), and managing the cost of loading and running the language models. Combining retrieval with the generation capabilities of an LLM offers an opportunity to make retail analytics more reliable, efficient, and insightful.
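
One concrete cost in the function above is model loading: it constructs both pipelines on every call. A common remedy is to load each model once and reuse it. The sketch below shows the caching pattern with `functools.lru_cache`; `load_model` is a hypothetical stand-in for `pipeline('text-generation', model=...)` so that the snippet runs without downloading anything.

```python
from functools import lru_cache

# Counts how often the (expensive) loader actually runs; for illustration only.
load_calls = {"count": 0}

def load_model(model_id):
    """Hypothetical stand-in for transformers.pipeline('text-generation', model=model_id)."""
    load_calls["count"] += 1
    # Return a trivial callable so the sketch runs without a real model.
    return lambda prompt: f"[{model_id}] {prompt}"

@lru_cache(maxsize=None)
def get_generator(model_id):
    """Load a generator once per model ID and reuse it on later calls."""
    return load_model(model_id)

def answer(query):
    generator = get_generator("NingLab/eCeLLM-S")
    return generator(query)

first = answer("stylish and comfortable shoe")
second = answer("waterproof running shoe")
print(first)
print(second)
print("loader ran", load_calls["count"], "time(s)")
```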

### Conclusion

The integration of a LangChain-style chain with vector RAG is a meaningful step forward for retail analytics, offering better-grounded responses, efficient retrieval, and greater analytical depth. By harnessing these technologies, we can unlock new insights and value from our retail data, driving innovation and excellence in our business operations.
    

### TF-IDF (Term Frequency-Inverse Document Frequency)

TF-IDF is a statistical measure of how important a word is to a document within a collection or corpus. It is often used as a weighting factor in information retrieval, text mining, and user modeling. A term's weight increases with the number of times it appears in the document and is offset by how many documents in the corpus contain it.

#### **Term Frequency (TF)**

Term Frequency measures the frequency of a word in a document. TF is calculated as:

$$
\text{TF}(t, d) = \frac{\text{Number of times term } t \text{ appears in a document } d}{\text{Total number of terms in the document } d}
$$

#### **Inverse Document Frequency (IDF)**

Inverse Document Frequency measures how informative a term is across the entire corpus: terms that occur in few documents receive a higher weight than terms that occur in most of them. IDF is calculated as:

$$
\text{IDF}(t, D) = \log \left( \frac{\text{Total number of documents in } D}{1 + \text{Number of documents containing term } t} \right)
$$

Note: Adding 1 to the denominator prevents division by zero for terms that appear in no document.
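
IDF conventions differ in how they smooth, and on a small corpus the difference is visible. The sketch below compares three common forms on a toy corpus: the unsmoothed \(\log(N/\text{df})\), a variant with 1 added to the denominator, and the default used by scikit-learn's `TfidfVectorizer`, \(\ln\frac{1+N}{1+\text{df}} + 1\). The corpus and function names are illustrative.

```python
import math

corpus = [
    "lightweight running shoe",
    "waterproof hiking shoe",
    "stylish leather shoe",
    "waterproof rain jacket",
]
docs = [set(text.split()) for text in corpus]
n_docs = len(docs)

def document_frequency(term):
    """Number of documents that contain the term at least once."""
    return sum(term in doc for doc in docs)

def idf_plain(df):
    # Undefined when df == 0.
    return math.log(n_docs / df)

def idf_plus_one_denominator(df):
    # Safe for df == 0, but negative when the term occurs in every document.
    return math.log(n_docs / (1 + df))

def idf_sklearn_smooth(df):
    # Smoothed form used by TfidfVectorizer by default; never below 1.
    return math.log((1 + n_docs) / (1 + df)) + 1

for term in ["shoe", "waterproof", "jacket"]:
    df = document_frequency(term)
    print(
        term,
        df,
        round(idf_plain(df), 3),
        round(idf_plus_one_denominator(df), 3),
        round(idf_sklearn_smooth(df), 3),
    )
```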

#### **TF-IDF Calculation**

The TF-IDF value is calculated by multiplying TF and IDF:

$$
\text{TF-IDF}(t, d, D) = \text{TF}(t, d) \times \text{IDF}(t, D)
$$

This results in a matrix where each row represents a document and each column represents a term in the corpus, with values indicating the significance of each term to the document.
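
To make the matrix concrete, here is a from-scratch sketch that uses the relative-frequency TF defined above together with an IDF that adds 1 to the document frequency in the denominator. It is meant for illustration only: scikit-learn's `TfidfVectorizer` uses raw counts, a different smoothing, and L2 normalization, so its numbers will differ. The toy corpus is made up.

```python
import math
from collections import Counter

corpus = [
    "durable waterproof shoe",
    "lightweight comfortable shoe",
    "waterproof rain jacket",
]
tokenized = [text.split() for text in corpus]
vocabulary = sorted({term for doc in tokenized for term in doc})
n_docs = len(tokenized)

def tf(term, doc):
    """Relative frequency of the term within one tokenized document."""
    return Counter(doc)[term] / len(doc)

def idf(term):
    """log(N / (1 + df)): 1 is added to the document frequency in the denominator."""
    df = sum(term in doc for doc in tokenized)
    return math.log(n_docs / (1 + df))

# One row per document, one column per vocabulary term.
tfidf_matrix = [[tf(term, doc) * idf(term) for term in vocabulary] for doc in tokenized]

print(vocabulary)
for row in tfidf_matrix:
    print([round(value, 3) for value in row])
```

Under this convention, terms that occur in two of the three documents ("shoe", "waterproof") receive a weight of zero, which shows how strongly common terms are discounted on a tiny corpus.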

### t-Distributed Stochastic Neighbor Embedding (t-SNE)

t-SNE is a machine learning algorithm for dimensionality reduction well-suited for the visualization of high-dimensional datasets. It converts similarities between data points to joint probabilities and tries to minimize the Kullback-Leibler divergence between the joint probabilities of the low-dimensional embedding and the high-dimensional data.

#### **Mathematical Foundation**

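Given a set of \(N\) points in a high-dimensional space, t-SNE first computes conditional probabilities \(p_{j|i}\) that are proportional to the similarity of objects \(x_i\) and \(x_j\) under a Gaussian centred on \(x_i\), and then symmetrizes them into joint probabilities \(p_{ij}\), as follows: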

$$
p_{j|i} = \frac{\exp(-||x_i - x_j||^2 / 2\sigma_i^2)}{\sum_{k \neq i} \exp(-||x_i - x_k||^2 / 2\sigma_i^2)}
$$

$$
p_{ij} = \frac{p_{j|i} + p_{i|j}}{2N}
$$
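
A small numeric sketch of these two formulas, in plain Python for clarity. A single shared \(\sigma\) is assumed for all points (t-SNE itself tunes one \(\sigma_i\) per point), and the three 2-D points are made up.

```python
import math

points = [(0.0, 0.0), (1.0, 0.0), (4.0, 0.0)]
sigma = 1.0  # assumption: one bandwidth for every point
n = len(points)

def squared_distance(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def conditional_probabilities(i):
    """Row i of p_{j|i}: Gaussian similarities to x_i, normalized over j != i."""
    weights = [
        0.0 if j == i else math.exp(-squared_distance(points[i], points[j]) / (2 * sigma ** 2))
        for j in range(n)
    ]
    total = sum(weights)
    return [w / total for w in weights]

p_conditional = [conditional_probabilities(i) for i in range(n)]

# Symmetrize: p_ij = (p_{j|i} + p_{i|j}) / (2N)
p_joint = [
    [(p_conditional[i][j] + p_conditional[j][i]) / (2 * n) for j in range(n)]
    for i in range(n)
]

for row in p_joint:
    print([round(value, 4) for value in row])
```

Each row of `p_conditional` sums to 1, while the symmetrized `p_joint` sums to 1 over all pairs, which is what the division by \(2N\) achieves.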

In the low-dimensional space, t-SNE computes similar probabilities \(q_{ij}\) using a Student-t distribution:

$$
q_{ij} = \frac{(1 + ||y_i - y_j||^2)^{-1}}{\sum_{k \neq l} (1 + ||y_k - y_l||^2)^{-1}}
$$

The Kullback-Leibler divergence between the distributions \(P\) and \(Q\) is minimized by gradient descent:

$$
C = \text{KL}(P||Q) = \sum_i \sum_j p_{ij} \log \frac{p_{ij}}{q_{ij}}
$$
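
The cost can be evaluated directly for a candidate embedding. The sketch below takes a small joint distribution \(P\) as given (the values are made up, but symmetric and summing to 1), computes \(Q\) from hypothetical 1-D embedding coordinates with the Student-t kernel, and evaluates the KL divergence. Terms with \(p_{ij} = 0\) are skipped, since they contribute nothing to the sum.

```python
import math

# Made-up symmetric joint probabilities with a zero diagonal; entries sum to 1.
P = [
    [0.00, 0.30, 0.05],
    [0.30, 0.00, 0.15],
    [0.05, 0.15, 0.00],
]
n = len(P)

def student_t_joint(embedding):
    """q_ij from the kernel (1 + ||y_i - y_j||^2)^-1, normalized over all pairs k != l."""
    kernel = [
        [0.0 if i == j else 1.0 / (1.0 + (embedding[i] - embedding[j]) ** 2) for j in range(n)]
        for i in range(n)
    ]
    total = sum(sum(row) for row in kernel)
    return [[value / total for value in row] for row in kernel]

def kl_divergence(P, Q):
    """Sum of p_ij * log(p_ij / q_ij) over pairs with p_ij > 0."""
    return sum(
        P[i][j] * math.log(P[i][j] / Q[i][j])
        for i in range(n)
        for j in range(n)
        if P[i][j] > 0
    )

good_embedding = [0.0, 0.5, 2.0]  # keeps points 0 and 1 close, as P suggests
poor_embedding = [0.0, 3.0, 0.2]  # puts points 0 and 2 close instead

cost_good = kl_divergence(P, student_t_joint(good_embedding))
cost_poor = kl_divergence(P, student_t_joint(poor_embedding))
print(round(cost_good, 4), round(cost_poor, 4))
```

The embedding that respects the neighbourhood structure in \(P\) yields the lower cost; gradient descent moves the \(y_i\) in the direction that reduces this value.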

#### **Perplexity**

Perplexity is a parameter for t-SNE that suggests how to balance attention between local and global aspects of your data and is defined as:

$$
\text{Perplexity} = 2^{H(P_i)}
$$

where \(H(P_i) = -\sum_j p_{j|i} \log_2 p_{j|i}\) is the Shannon entropy of the conditional distribution \(P_i\), measured in bits.
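
In practice the user fixes the perplexity, and t-SNE searches, for each point, for the \(\sigma_i\) whose conditional distribution has that perplexity. A minimal sketch of that search by bisection, assuming a hypothetical list of squared distances from one point to its neighbours; the bounds and iteration count are illustrative choices, not those of any particular library.

```python
import math

def conditional_distribution(squared_distances, sigma):
    """p_{j|i} for one point i, given squared distances to every other point."""
    nearest = min(squared_distances)
    # Subtracting the smallest distance leaves the normalized result unchanged
    # and keeps exp() from underflowing for every entry when sigma is small.
    weights = [math.exp(-(d - nearest) / (2 * sigma ** 2)) for d in squared_distances]
    total = sum(weights)
    return [w / total for w in weights]

def perplexity(distribution):
    """2 ** H(P), with the Shannon entropy H measured in bits."""
    entropy = -sum(p * math.log2(p) for p in distribution if p > 0)
    return 2 ** entropy

def find_sigma(squared_distances, target_perplexity, low=1e-3, high=1e3, iterations=100):
    """Bisection on sigma, using the fact that perplexity grows as sigma grows."""
    for _ in range(iterations):
        mid = (low + high) / 2
        if perplexity(conditional_distribution(squared_distances, mid)) > target_perplexity:
            high = mid  # distribution too flat: shrink sigma
        else:
            low = mid   # distribution too peaked: widen sigma
    return (low + high) / 2

squared_distances = [1.0, 4.0, 9.0, 25.0]  # hypothetical neighbours of one point
sigma = find_sigma(squared_distances, target_perplexity=2.0)
achieved = perplexity(conditional_distribution(squared_distances, sigma))
print(round(sigma, 4), round(achieved, 4))
```

A uniform distribution over \(k\) neighbours has perplexity exactly \(k\), which is why perplexity is often read as an effective number of neighbours.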

### Application in the Notebook Context

In the context of the notebook, TF-IDF is used to transform mock retail product descriptions into numerical vectors, capturing the importance of terms within descriptions relative to their frequency across all descriptions. These vectors serve as features that represent each product's textual information.

t-SNE is then applied to these TF-IDF vectors to reduce the high-dimensional feature space to a 2D space suitable for visualization. The `perplexity` parameter is set to `min(30, n_samples - 1)` so that it always stays below the number of samples, which matters here because the mock dataset is very small.

The final visualization with seaborn plots the t-SNE-transformed points, colored by product categories, to illustrate the distribution and clustering of products based on their descriptions. This process highlights the power of combining TF-IDF for feature extraction with t-SNE for visualization, providing insights into the semantic relationships within the dataset.

In [66]:
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np

# Keep only the text before " with features: " in each description (a no-op when the marker is absent)
descriptions = [desc.split(" with features: ")[0] for desc in mock_retail_product_descriptions]

# Use TF-IDF to simulate embeddings
vectorizer = TfidfVectorizer(max_features=100)
tfidf_embeddings = vectorizer.fit_transform(descriptions).toarray()


In [67]:
from sklearn.manifold import TSNE

# Adjust perplexity to be less than the number of samples
# Assuming 'tfidf_embeddings' contains your dataset's embeddings
n_samples = tfidf_embeddings.shape[0]
perplexity_value = min(30, n_samples - 1)  # Ensure perplexity is less than n_samples

tsne = TSNE(n_components=2, perplexity=perplexity_value, random_state=42)
tsne_results = tsne.fit_transform(tfidf_embeddings)

tsne_results


array([[ 21.459835,  20.480707],
       [ 94.14081 ,  23.338009],
       [ 95.783394, -49.451584],
       [ 23.169798, -52.17629 ]], dtype=float32)

In [68]:
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.manifold import TSNE

# Function to compute pairwise distances
def compute_pairwise_distances(X):
    distances = np.zeros((X.shape[0], X.shape[0]))
    for i in range(X.shape[0]):
        for j in range(i+1, X.shape[0]):
            distances[i, j] = np.linalg.norm(X[i] - X[j])
            distances[j, i] = distances[i, j]  # Distance matrix is symmetric
    return distances

# Function to compute similarity matrix
def compute_similarity_matrix(distances, sigma):
    similarities = np.exp(-distances**2 / (2 * sigma**2))
    np.fill_diagonal(similarities, 0)  # Diagonal elements are set to 0
    return similarities

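# Function to compute perplexity
# Note: this treats the whole matrix as a single distribution and does not normalize it;
# t-SNE's own perplexity is defined per row on the normalized conditionals p_{j|i}.
def compute_perplexity(P):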
    epsilon = 1e-12  # Small epsilon value to avoid numerical issues
    P = np.maximum(P, epsilon)  # Replace zero values with epsilon
    entropy = -np.sum(P * np.log2(P))
    perplexity = 2 ** entropy
    return perplexity

# Keep only the text before " with features: " in each description (a no-op when the marker is absent)
descriptions = [desc.split(" with features: ")[0] for desc in mock_retail_product_descriptions]

# Use TF-IDF to simulate embeddings
vectorizer = TfidfVectorizer(max_features=100)
tfidf_embeddings = vectorizer.fit_transform(descriptions).toarray()

# Adjust perplexity to be less than the number of samples
n_samples = tfidf_embeddings.shape[0]
perplexity_value = min(30, n_samples - 1)  # Ensure perplexity is less than n_samples

# Perform t-SNE embedding
tsne = TSNE(n_components=2, perplexity=perplexity_value, random_state=42)
tsne_results = tsne.fit_transform(tfidf_embeddings)

# Compute pairwise distances
pairwise_distances = compute_pairwise_distances(tsne_results)

# Compute similarity matrix
sigma = 1.0  # Adjust sigma to the scale of your distances; far smaller values make every similarity underflow to 0
similarity_matrix = compute_similarity_matrix(pairwise_distances, sigma)

# Compute perplexity
perplexity = compute_perplexity(similarity_matrix)

# Create a DataFrame to summarize the computed values and provide analysis
data = {
    "Metric": ["Pairwise Distances", "Similarity Matrix", "Perplexity"],
    "Value": [pairwise_distances, similarity_matrix, perplexity],
    "Analysis": [
        "Pairwise distances between points in the t-SNE embedding space. High values indicate more dissimilar points.",
        "Measure of similarity between data points in the embedding space. High values indicate high similarity.",
        "A measure of how well the probability distribution predicts a sample. Higher values indicate a smoother embedding."
    ]
}

# Create a DataFrame
df = pd.DataFrame(data)

# Set option to display complete DataFrame content
pd.set_option('display.max_colwidth', None)

# Print the DataFrame
display(df)


| | Metric | Value | Analysis |
|---|---|---|---|
| 0 | Pairwise Distances | [[0.0, 72.73711395263672, 102.05154418945312, 72.67711639404297], [72.73711395263672, 0.0, 72.8081283569336, 103.63056182861328], [102.05154418945312, 72.8081283569336, 0.0, 72.6646957397461], [72.67711639404297, 103.63056182861328, 72.6646957397461, 0.0]] | Pairwise distances between points in the t-SNE embedding space. High values indicate more dissimilar points. |
| 1 | Similarity Matrix | [[0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]] | Measure of similarity between data points in the embedding space. High values indicate high similarity. |
| 2 | Perplexity | 1.0 | A measure of how well the probability distribution predicts a sample. Higher values indicate a smoother embedding. |
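
The all-zero similarity matrix and the perplexity of 1.0 in the output above are numerical artifacts rather than properties of the data: with `sigma = 1.0` and distances of roughly 72 to 104, `exp(-d**2 / 2)` underflows to exactly 0, and `compute_perplexity` then operates on a matrix that is not a probability distribution. The sketch below reuses the distances from the output (rounded), picks \(\sigma\) on the scale of the data (the median off-diagonal distance, an assumption made here for illustration), and computes one perplexity per row from row-normalized similarities.

```python
import math
from statistics import median

# Pairwise distances copied (rounded) from the output above.
distances = [
    [0.0, 72.74, 102.05, 72.68],
    [72.74, 0.0, 72.81, 103.63],
    [102.05, 72.81, 0.0, 72.66],
    [72.68, 103.63, 72.66, 0.0],
]
n = len(distances)

# With sigma = 1.0 every off-diagonal similarity underflows to 0.0.
underflowed = math.exp(-distances[0][1] ** 2 / (2 * 1.0 ** 2))

# Assumption: choose sigma on the scale of the observed distances.
sigma = median(distances[i][j] for i in range(n) for j in range(n) if i != j)

def row_perplexity(i):
    """Perplexity of the row-normalized Gaussian similarities of point i."""
    weights = [
        math.exp(-distances[i][j] ** 2 / (2 * sigma ** 2)) for j in range(n) if j != i
    ]
    total = sum(weights)
    probabilities = [w / total for w in weights]
    entropy = -sum(p * math.log2(p) for p in probabilities)
    return 2 ** entropy

perplexities = [row_perplexity(i) for i in range(n)]
print(underflowed)
print([round(value, 3) for value in perplexities])
```

With four points, each row has three neighbours, so a per-row perplexity can range from 1 (one dominant neighbour) to 3 (all neighbours equally similar).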


In [73]:
display(mock_retail_product_descriptions)

['Nike Air Zoom Pegasus is a shoes featuring durable, comfortable, waterproof. Describe Nike Air Zoom Pegasus, a product featuring durable, comfortable, waterproof A single, durable mesh cushion with a molded surface for an air dry look Designed for outdoor travel Designed for comfortable walking in the sun High res to date',
 'Nike Free RN is a shoes featuring lightweight, comfortable, waterproof. Describe Nike Free RN, a product featuring lightweight, comfortable, waterproof a shoe with great traction, and great durability, the Nylon Nylon M Nylon. The innovative design features Nylon Nylon bonded mesh when wearing a M Isole, the mesh increases in diameter',
 'Nike Air Max is a shoes featuring stylish, lightweight, comfortable. Describe Nike Air Max, a product featuring stylish, lightweight, comfortable Inside the Nike Air Max you will find the following items 2 pairs of Nike Air Max T Shirts and 3 pairs of Nike Air Max T Shirt Worn 2',
 "Nike React Infinity Run is a shoes featuring 

In [99]:
# import numpy as np
# from sklearn.feature_extraction.text import TfidfVectorizer
# from sklearn.manifold import TSNE
# import seaborn as sns
# import matplotlib.pyplot as plt
# from transformers import pipeline
# 
# # Function to compute pairwise distances
# def compute_pairwise_distances(X):
#     distances = np.zeros((X.shape[0], X.shape[0]))
#     for i in range(X.shape[0]):
#         for j in range(i+1, X.shape[0]):
#             distances[i, j] = np.linalg.norm(X[i] - X[j])
#             distances[j, i] = distances[i, j]  # Distance matrix is symmetric
#     return distances
# 
# # Function to compute similarity matrix
# def compute_similarity_matrix(distances, sigma):
#     similarities = np.exp(-distances**2 / (2 * sigma**2))
#     np.fill_diagonal(similarities, 0)  # Diagonal elements are set to 0
#     return similarities
# 
# # Function to compute perplexity
# def compute_perplexity(P):
#     epsilon = 1e-12  # Small epsilon value to avoid numerical issues
#     P = np.maximum(P, epsilon)  # Replace zero values with epsilon
#     entropy = -np.sum(P * np.log2(P))
#     perplexity = 2 ** entropy
#     return perplexity
# 
# # Combine the descriptions into a single string per product for TF-IDF
# descriptions = [desc.split(" with features: ")[0] for desc in mock_retail_product_descriptions]
# 
# # Use TF-IDF to simulate embeddings
# vectorizer = TfidfVectorizer(max_features=100)
# tfidf_embeddings = vectorizer.fit_transform(descriptions).toarray()
# 
# # Adjust perplexity to be less than the number of samples
# n_samples = tfidf_embeddings.shape[0]
# perplexity_value = min(30, n_samples - 1)  # Ensure perplexity is less than n_samples
# 
# # Perform t-SNE embedding
# tsne = TSNE(n_components=2, perplexity=perplexity_value, random_state=42)
# tsne_results = tsne.fit_transform(tfidf_embeddings)
# 
# # Compute pairwise distances
# pairwise_distances = compute_pairwise_distances(tsne_results)
# 
# # Compute similarity matrix
# sigma = 1.0  # Adjust sigma according to your data
# similarity_matrix = compute_similarity_matrix(pairwise_distances, sigma)
# 
# # Compute perplexity
# perplexity = compute_perplexity(similarity_matrix)
# print("Perplexity:", perplexity)
# 
# # Plot t-SNE results
# plt.figure(figsize=(10, 6))
# sns.scatterplot(x=tsne_results[:,0], y=tsne_results[:,1], hue=categories, palette="viridis")
# plt.title('t-SNE Visualization of Mock Product Descriptions')
# plt.xlabel('t-SNE Dimension 1')
# plt.ylabel('t-SNE Dimension 2')
# plt.legend(title='Category')
# plt.show()
# 
# # Prescriptive Analysis
# generator = pipeline("text-generation", model="EleutherAI/gpt-neo-1.3B")
# prescriptive_analysis = "Based on the t-SNE visualization, we can observe that products in the 'apparel' category tend to cluster together in the lower left region of the plot. This suggests that these products share similar characteristics or features. To capitalize on this insight, the retail company could create targeted marketing campaigns or promotions specifically tailored to customers interested in apparel products. Additionally, the company could consider expanding its apparel product line or introducing new apparel-related offerings to meet the demand demonstrated by the clustering pattern."
# recommendation = generator(prescriptive_analysis, max_length=100, do_sample=False)[0]['generated_text']
# print("Prescriptive Analysis Recommendation:", recommendation)


## Collaborative Query Handling in a Retail Environment
This example classifies each user query into a product category with a zero-shot classifier, so that queries from multiple users can be grouped by likely customer intent in a retail context.

In [90]:
from transformers import pipeline
import pandas as pd

# Initialize the zero-shot classification pipeline with a suitable model
classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

# Define your user queries
user_queries = [
    {"user_id": 1, "query": "sustainable outdoor clothing"},
    {"user_id": 2, "query": "eco-friendly water bottles for hiking"},
    {"user_id": 3, "query": "durable hiking boots"}
]

# Candidate categories for classification
categories = ["clothing", "accessories", "footwear"]

# Classify each query and add the predicted category to the query dictionary
for query in user_queries:
    classification = classifier(query["query"], categories, multi_label=False)
    query["predicted_category"] = classification["labels"][0]  # Assign the most probable category

# Convert the enriched user queries into a pandas DataFrame for display
df_queries = pd.DataFrame(user_queries)

# Display the DataFrame
display(df_queries)


| | user_id | query | predicted_category |
|---|---|---|---|
| 0 | 1 | sustainable outdoor clothing | clothing |
| 1 | 2 | eco-friendly water bottles for hiking | accessories |
| 2 | 3 | durable hiking boots | footwear |
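
The cell above labels each query; grouping the labelled queries is one further step. A minimal sketch using the predictions from the output above, hard-coded so that the snippet runs without the classifier. The fourth query is a hypothetical addition, included only to show a group with two members.

```python
from collections import defaultdict

labelled_queries = [
    {"user_id": 1, "query": "sustainable outdoor clothing", "predicted_category": "clothing"},
    {"user_id": 2, "query": "eco-friendly water bottles for hiking", "predicted_category": "accessories"},
    {"user_id": 3, "query": "durable hiking boots", "predicted_category": "footwear"},
    # Hypothetical extra row (not produced by the classifier run above).
    {"user_id": 4, "query": "waterproof trail running shoes", "predicted_category": "footwear"},
]

# Map each predicted category to the users whose queries fall into it.
queries_by_category = defaultdict(list)
for item in labelled_queries:
    queries_by_category[item["predicted_category"]].append(item["user_id"])

for category, user_ids in sorted(queries_by_category.items()):
    print(category, user_ids)
```

On the DataFrame itself, `df_queries.groupby("predicted_category")["user_id"].apply(list)` produces the same grouping.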


## Query Processing and Clarification
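The cell below sketches query rewriting with a text-generation model. `distilgpt2` is used as an accessible placeholder; because it is not instruction-tuned, it simply continues the query text rather than rewriting it, as the output shows. A model trained for query rewriting would be needed for useful results: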

In [91]:
from transformers import pipeline

def rewrite_query(query):
    # Initialize the text-generation pipeline with an accessible model
    generator = pipeline('text-generation', model='distilgpt2')

    # Generate rewritten query
    # Note: You might need to adjust max_length according to your specific requirements
    results = generator(query, max_length=60, num_return_sequences=1)
    rewritten = results[0]['generated_text'].strip()

    return rewritten

# Example query to be rewritten
original_query = "best running shoes"
rewritten_query = rewrite_query(original_query)

print(f"Original Query: {original_query}")
print(f"Rewritten Query: {rewritten_query}")


Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Original Query: best running shoes
Rewritten Query: best running shoes. I think it all means more about getting the best shoes.
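
As the output above shows, `distilgpt2` continues the query rather than rewriting it. A deterministic normalization step is often a more predictable starting point. The sketch below is one such rule-based stand-in, not the model-based rewriter this section envisions; the synonym table is entirely hypothetical.

```python
import re

# Hypothetical expansion table; a real system would derive this from catalogue data.
SYNONYMS = {
    "shoes": ["sneakers", "trainers"],
    "running": ["jogging"],
}

def normalize_query(query):
    """Lowercase, replace punctuation with spaces, and collapse repeated whitespace."""
    cleaned = re.sub(r"[^\w\s-]", " ", query.lower())
    return " ".join(cleaned.split())

def expand_query(query):
    """Append known synonyms so that retrieval can match alternative wordings."""
    terms = normalize_query(query).split()
    expanded = list(terms)
    for term in terms:
        for synonym in SYNONYMS.get(term, []):
            if synonym not in expanded:
                expanded.append(synonym)
    return " ".join(expanded)

rewritten = expand_query("  Best running SHOES!! ")
print(rewritten)
```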


## Augmented Search Results Presentation
Example of enriching search results with insights, assuming search results are retrieved from a dataset:

In [93]:
# Assuming mock_retail_product_descriptions is a list of descriptions
import numpy as np
import pandas as pd

# Convert descriptions to vectors
product_vectors = [description_to_vector(desc) for desc in mock_retail_product_descriptions]

# Create a DataFrame
df_products = pd.DataFrame({
    'description': mock_retail_product_descriptions,
    'vector': product_vectors
})


In [94]:
from annoy import AnnoyIndex
import numpy as np
import pandas as pd

# Step 1: Initialize Annoy Index
# Assuming we have a DataFrame 'df_products' with product vectors in a column named "vector"
vector_dimension = len(df_products["vector"].iloc[0])  # Dimension of the product vectors
annoy_index = AnnoyIndex(vector_dimension, 'angular')  # Using 'angular' distance metric

# Step 2: Populate Annoy Index with Product Vectors
for product_id, vector in enumerate(df_products["vector"]):
    annoy_index.add_item(product_id, vector)

annoy_index.build(10)  # Building the index with 10 trees for efficient querying

# Step 3: Define Function to Convert Search Queries to Vectors
# This is a placeholder function. In practice, use a model to generate vectors similar to product vectors.
def query_to_vector(query):
    # Example: Random vector generation (Replace with actual vectorization logic)
    return np.random.rand(vector_dimension)

# Step 4: Define Function to Find Similar Products Using Annoy
def find_similar_products_annoy(query_vector, num_results=2):
    # Fetch indices of similar products from the Annoy index
    similar_product_ids = annoy_index.get_nns_by_vector(query_vector, num_results, include_distances=False)
    # Return DataFrame rows corresponding to similar products
    return df_products.iloc[similar_product_ids]

# Step 5: Augment Search Results with Additional Logic
def augment_search_results_with_annoy(query):
    # Convert the query to a vector (adapt this to use actual query vectorization)
    query_vector = query_to_vector(query)
    # Find similar products based on the query vector
    similar_products = find_similar_products_annoy(query_vector)
    # Return the descriptions of similar products (customize this as needed)
    return similar_products["description"].tolist()

# Demonstration of Usage
query = "Eco-friendly water bottle"
augmented_results = augment_search_results_with_annoy(query)

# Print the descriptions of similar products
print("Similar Products for Query: '{}'".format(query))
print("----------------------------------")
for idx, description in enumerate(augmented_results, 1):
    print("{}. {}".format(idx, description))


Similar Products for Query: 'Eco-friendly water bottle'
----------------------------------
1. Nike React Infinity Run is a shoes featuring lightweight, durable, comfortable. Describe Nike React Infinity Run, a product featuring lightweight, durable, comfortable you'll need a pair of running shoes with the option to wear them out in no time within 15 minutes. Nike also introduced a special edition Fit or Die shoe set, one of which is called
2. Nike Air Max is a shoes featuring stylish, lightweight, comfortable. Describe Nike Air Max, a product featuring stylish, lightweight, comfortable Inside the Nike Air Max you will find the following items 2 pairs of Nike Air Max T Shirts and 3 pairs of Nike Air Max T Shirt Worn 2
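
Note that `query_to_vector` above returns a random vector, so the neighbours printed for 'Eco-friendly water bottle' are arbitrary. Retrieval only becomes meaningful when the query is embedded with the same function as the products. The sketch below illustrates this with a toy bag-of-words vectorizer and an exact ranking by the distance that Annoy's `angular` metric approximates (Annoy defines angular distance as \(\sqrt{2(1 - \cos(u, v))}\)). The product texts and helper names are made up for illustration.

```python
import math
from collections import Counter

products = [
    "lightweight running shoe with breathable mesh",
    "insulated steel water bottle for hiking",
    "waterproof hiking boot with durable sole",
]
vocabulary = sorted({term for text in products for term in text.split()})

def to_vector(text):
    """Bag-of-words counts over a fixed vocabulary; used for products and queries alike."""
    counts = Counter(text.lower().split())
    return [counts[term] for term in vocabulary]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def angular_distance(u, v):
    """Distance used by Annoy's 'angular' metric: sqrt(2 * (1 - cos))."""
    return math.sqrt(max(0.0, 2 * (1 - cosine(u, v))))

product_vectors = [to_vector(text) for text in products]

def rank_products(query, num_results=2):
    """Exact nearest neighbours by angular distance (Annoy returns an approximation of this)."""
    query_vector = to_vector(query)
    order = sorted(
        range(len(products)),
        key=lambda i: angular_distance(query_vector, product_vectors[i]),
    )
    return [products[i] for i in order[:num_results]]

results = rank_products("water bottle")
print(results)
```

In the notebook, the equivalent change would be to have `query_to_vector` call the same `description_to_vector` function that produced the product vectors.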


In [97]:
from annoy import AnnoyIndex
import numpy as np
import pandas as pd
from IPython.display import HTML

# Steps 1-5 (index construction, query_to_vector, find_similar_products_annoy and
# augment_search_results_with_annoy) are unchanged from the previous cell and are reused here.

# Function to create visually appealing output
from html import escape  # standard library; guards against '<' or '&' in the descriptions

def create_visual_output(query, augmented_results):
    html_content = f"""
    <div style="display: flex; justify-content: space-between;">
        <div style="border: 2px solid #2c3e50; border-radius: 10px; padding: 20px;">
            <h2>Query:</h2>
            <p>{escape(query)}</p>
        </div>
        <div style="border: 2px solid #2c3e50; border-radius: 10px; padding: 20px;">
            <h2>Similar Products:</h2>
            <ol>
    """
    # An ordered list numbers the items, so no manual index is needed
    for description in augmented_results:
        html_content += f"<li>{escape(description)}</li>"
    html_content += """
            </ol>
        </div>
    </div>
    """
    return HTML(html_content)

# Demonstration of Usage
query = "Eco-friendly water shoes"
augmented_results = augment_search_results_with_annoy(query)

# Create visually appealing output
create_visual_output(query, augmented_results)
