# **Fashion Search AI Project**
This project demonstrates a generative search system that recommends fashion products in response to natural-language queries.
It combines OpenAI embeddings and language models with FAISS (Facebook AI Similarity Search) to run semantic search over product data.
Product descriptions are converted to embeddings, each user query is matched against those embeddings by vector similarity, and the closest products are passed to a language model that writes the answer.

### **Components of the Project**:
1. **OpenAI GPT and Embeddings**: Used to generate embeddings from product data and to process and generate responses to user queries.
2. **FAISS**: A vector similarity search engine to store and retrieve embeddings efficiently.
3. **LangChain**: Orchestrates the generative model with retrieval from FAISS.
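
The retrieval idea behind these components can be illustrated without any external service. The following is a minimal sketch, assuming hand-written three-dimensional vectors in place of real OpenAI embeddings and a brute-force scan in place of a FAISS index. The product names, the vector values, and the helper names `cosine_similarity` and `top_k` are hypothetical, and cosine similarity is used only for illustration; the metric a real FAISS index uses depends on how it is configured.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_a * norm_b)

def top_k(query_vec, catalog, k=2):
    """Return the k catalog names whose vectors are most similar to query_vec."""
    ranked = sorted(
        catalog,
        key=lambda name: cosine_similarity(query_vec, catalog[name]),
        reverse=True,
    )
    return ranked[:k]

# Toy "embeddings": made-up values, not real model output
catalog = {
    "athletic crew socks": [0.9, 0.1, 0.0],
    "hiking socks": [0.8, 0.3, 0.1],
    "leather belt": [0.0, 0.2, 0.9],
}
query_vec = [1.0, 0.2, 0.0]  # stands in for the embedding of a sock-related query

print(top_k(query_vec, catalog, k=2))
```

In the notebook itself, the embedding model produces the vectors and FAISS performs the nearest-neighbour lookup, so no manual ranking like this is needed.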

### **How to Run**:
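1. Install the necessary dependencies: the OpenAI and LangChain packages (the code imports `langchain` and `langchain_openai`), FAISS, numpy, pandas, etc.
2. Save the product data as a JSONL file (one JSON object per line) and update the file path in the code. The notebook also reads the OpenAI API key from `OPENAI_API_Key.txt`.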
3. Run each section in the notebook to initialize the models, load data, generate embeddings, and run the query system.
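
For readers without the dataset at hand, the sketch below writes and reads a tiny JSONL file with the fields that the preprocessing code in this notebook looks up (`title`, `description`, `features`, `store`, `price`, and so on). The records, the file name `sample_products.jsonl`, and the helpers `write_jsonl` and `read_jsonl` are hypothetical and only illustrate the one-object-per-line layout.

```python
import json
import os
import tempfile

def write_jsonl(path, records):
    """Write one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")

def read_jsonl(path):
    """Read a JSONL file back into a list of dicts, skipping blank lines."""
    with open(path, "r", encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

# Made-up sample records, not taken from the real dataset
sample_records = [
    {
        "title": "Example crew socks",
        "description": ["Soft cotton blend."],
        "features": ["Cushioned sole", "Moisture wicking"],
        "store": "ExampleStore",
        "price": 12.99,
        "average_rating": 4.5,
        "rating_number": 10,
        "images": [],
        "categories": [],
        "details": {},
    },
    {
        "title": "Example leather belt",
        "description": [],
        "features": [],
        "store": "ExampleStore",
        "price": None,  # serialized as null
        "average_rating": 4.0,
        "rating_number": 3,
        "images": [],
        "categories": [],
        "details": {},
    },
]

sample_path = os.path.join(tempfile.mkdtemp(), "sample_products.jsonl")
write_jsonl(sample_path, sample_records)
loaded = read_jsonl(sample_path)
print(len(loaded), loaded[0]["title"])
```

A file produced this way can be passed to the notebook's loader in place of the full dataset for a quick dry run.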


In [1]:
# Import the modules for text splitting, embeddings, FAISS storage, and retrieval QA
import os
from langchain_openai import OpenAIEmbeddings
from langchain.vectorstores import FAISS
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.chains import RetrievalQA
from langchain_openai import OpenAI
import json
import time

In [2]:
# Read the OpenAI API key from a local file and expose it through the
# OPENAI_API_KEY environment variable, which the langchain_openai classes read
with open("OPENAI_API_Key.txt", "r") as key_file:
    os.environ['OPENAI_API_KEY'] = key_file.read().strip()

In [3]:
# Function to load product data from a JSONL file (one JSON object per line)
def load_jsonl_dataset(file_path='meta_Amazon_Fashion.jsonl'):
    products = []
    try:
        with open(file_path, 'r', encoding='utf-8') as f:
            for line in f:
                try:
                    product = json.loads(line)
                    products.append(product)
                except json.JSONDecodeError:
                    print(f"Skipping invalid JSON line: {line}")
    except FileNotFoundError:
        print(f"File not found: {file_path}")
    except Exception as e:
        print(f"An error occurred: {e}")
    
    return products

# Function to preprocess the data, including categories handling
def preprocess_data(products):
    texts = []
    
    for product in products:
        # Extract necessary fields. The .get default applies only when the key is
        # missing, so a stored null (for example "price": null) stays None.
        title = product.get('title', 'No title')
        description = " ".join(product.get('description') or [])  # list of strings; tolerate a missing or null value
        features = " ".join(product.get('features') or [])        # list of strings; tolerate a missing or null value
        store = product.get('store', 'Unknown store')
        price = product.get('price', 'No price listed')
        
        # Extract ratings
        average_rating = product.get('average_rating', 'No rating available')
        rating_number = product.get('rating_number', 'No rating number available')

        # Extract main image: the first non-empty hi_res or large URL
        images = product.get('images') or []
        image_urls = (img.get('hi_res') or img.get('large') for img in images if img)
        main_image_url = next((url for url in image_urls if url), 'No image available')

        # Extract categories
        categories = " > ".join(product.get('categories') or [])  # Join categories with a separator if they exist

        # Extract details if available
        details = product.get('details') or {}
        package_dimensions = details.get('Package Dimensions', 'Unknown dimensions')
        item_model = details.get('Item model number', 'No model number')
        date_available = details.get('Date First Available', 'No date available')

        # Concatenate all fields into a single string for embedding
        text = (f"Title: {title}. Description: {description}. Features: {features}. "
                f"Store: {store}. Price: {price}. Average Rating: {average_rating} "
                f"based on {rating_number} ratings. Main Image: {main_image_url}. "
                f"Categories: {categories}. Package Dimensions: {package_dimensions}. "
                f"Item Model: {item_model}. Date First Available: {date_available}.")
        
        texts.append(text)
    
    return texts

# Example of how to run the code
if __name__ == '__main__':
    products = load_jsonl_dataset('meta_Amazon_Fashion.jsonl')
    processed_texts = preprocess_data(products)
    
    # Print the first preprocessed product's text for verification
    # (skipped when nothing was loaded, e.g. the file was not found)
    if processed_texts:
        print(processed_texts[0])

# load_jsonl_dataset opens the specified JSONL file and parses each line into a product dict;
# preprocess_data then flattens each product into a single text string for embedding.

Title: YUEDGE 5 Pairs Men's Moisture Control Cushioned Dry Fit Casual Athletic Crew Socks for Men (Blue, Size 9-12). Description: . Features: . Store: GiveGift. Price: None. Average Rating: 4.6 based on 16 ratings. Main Image: https://m.media-amazon.com/images/I/81XlFXImFrS._AC_UL1500_.jpg. Categories: . Package Dimensions: 10.31 x 8.5 x 1.73 inches; 14.82 Ounces. Item Model: DHES5PM21DH12. Date First Available: February 12, 2021.


In [4]:
# Keep only the first 10,000 product texts for embedding
processed_texts = processed_texts[:10000]

In [5]:
# Split long product texts into smaller, overlapping chunks with LangChain's RecursiveCharacterTextSplitter
def split_text(texts, chunk_size=1000, chunk_overlap=200):
    splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap)
    chunks = [splitter.split_text(text) for text in texts]
    return [chunk for sublist in chunks for chunk in sublist]  # Flatten the list of lists
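
To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of fixed-window chunking with overlap. It is a simplified stand-in, not the algorithm LangChain uses: `RecursiveCharacterTextSplitter` tries to break at separators such as paragraph and line boundaries, whereas this hypothetical `sliding_chunks` helper cuts at fixed character offsets.

```python
def sliding_chunks(text, chunk_size, chunk_overlap):
    """Cut text into windows of chunk_size characters that share
    chunk_overlap characters with the previous window."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks

print(sliding_chunks("abcdefghij", chunk_size=4, chunk_overlap=2))
# → ['abcd', 'cdef', 'efgh', 'ghij']
```

Each chunk repeats the last two characters of the one before it, which is the purpose of the overlap: text near a chunk boundary still appears intact in at least one chunk.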

In [6]:
# Embed the texts with OpenAI and store the vectors in FAISS for similarity search
def store_embeddings_in_faiss(texts):
    embeddings_model = OpenAIEmbeddings()  # Initialize OpenAI Embeddings model

    # Use FAISS from_texts, which generates embeddings internally
    vector_store = FAISS.from_texts(texts, embeddings_model)

    return vector_store

In [7]:
# Create a retrieval-based Q&A chain
def create_retrieval_qa_chain(vector_store):
    # langchain_openai.OpenAI wraps OpenAI's completion endpoint and uses the
    # library's default model; chat models such as GPT-4 need ChatOpenAI instead
    llm = OpenAI()
    qa_chain = RetrievalQA.from_chain_type(llm=llm, retriever=vector_store.as_retriever())
    return qa_chain
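
Conceptually, a retrieval QA chain fetches the chunks most similar to the query and places them in the prompt sent to the language model. LangChain calls this strategy "stuff", and as far as we know it is the default for `RetrievalQA.from_chain_type`. The sketch below shows only that prompt-assembly step. The helper `build_stuffed_prompt`, the prompt wording, and the sample chunks are hypothetical and do not reproduce LangChain's actual template.

```python
def build_stuffed_prompt(question, documents):
    """Concatenate retrieved chunks into one context block and append the question."""
    context = "\n\n".join(documents)
    return (
        "Use the following product information to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )

# Made-up chunks standing in for what the retriever would return
retrieved = [
    "Title: Example crew socks. Features: moisture wicking.",
    "Title: Example hiking socks. Features: thick cushion.",
]
prompt = build_stuffed_prompt("Find me moisture-wicking socks for men.", retrieved)
print(prompt)
```

Because every retrieved chunk is pasted into a single prompt, the chunk size and the number of retrieved chunks together determine how much of the model's context window each query consumes.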

In [8]:
# Query the system for answers
def query_qa_system(qa_chain, query):
    response = qa_chain.invoke({"query": query})
    return response

In [9]:

chunks = split_text(processed_texts)

In [10]:
vector_store = store_embeddings_in_faiss(chunks)

In [11]:
qa_chain = create_retrieval_qa_chain(vector_store)

In [12]:
query = "Find me moisture-wicking socks for men."
response = query_qa_system(qa_chain, query)
print(response)

{'query': 'Find me moisture-wicking socks for men.', 'result': " Gold Toe Mens Socks (6 Pairs) Cotton Socks, Moisture Wicking Socks, Mens Dress Crew Socks, YUEDGE 5 Pairs Men's Moisture Control Cushioned Dry Fit Casual Athletic Crew Socks for Men, and Forestgrow Men's Performance Hiking Socks w Thick Cushion Anti-Blister Moisture Wicking Multisport Outdoor Trekking Crew Socks 3322 Dark Grey 3 Pairs are all examples of moisture-wicking socks for men."}
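
As the output above shows, the chain returns a dict with `query` and `result` keys, and the generated text can start with a space. A small sketch of pulling out a clean answer string follows; the helper name `extract_answer` and the shortened sample response are hypothetical.

```python
def extract_answer(response):
    """Return the generated answer without surrounding whitespace."""
    return response.get("result", "").strip()

# Shortened, made-up response in the same shape as the chain output
sample_response = {
    "query": "Find me moisture-wicking socks for men.",
    "result": " Example crew socks and Example hiking socks are moisture-wicking socks for men.",
}
print(extract_answer(sample_response))
```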
