In [1]:
!pip install openai sentence-transformers llama-index faiss-gpu

Collecting openai
  Downloading openai-1.31.1-py3-none-any.whl (324 kB)
Collecting sentence-transformers
  Downloading sentence_transformers-3.0.0-py3-none-any.whl (224 kB)
Collecting llama-index
  Downloading llama_index-0.10.43-py3-none-any.whl (6.8 kB)
Collecting faiss-gpu
  Downloading faiss_gpu-1.7.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (85.5 MB)
Collecting httpx<1,>=0.23.0 (from openai)
  Downloading httpx-0.27.0-py3-none-any.whl (75 kB)
Collecting llama-index-agent-open

In [2]:
!pip install faiss-gpu sentence-transformers transformers



In [3]:
import os
import numpy as np
import pandas as pd
import faiss
from sentence_transformers import SentenceTransformer
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
from pathlib import Path
from google.colab import files
import torch
from sklearn.metrics.pairwise import cosine_similarity

# Load the SentenceTransformer model for embeddings
embed_model = SentenceTransformer('sentence-transformers/paraphrase-MiniLM-L6-v2')

# Upload the CSV file
uploaded = files.upload()
file_name = list(uploaded.keys())[0]

# Load data from CSV
df = pd.read_csv(file_name)
print(df.info())
print(df.head())

# Concatenate all columns into a single text column
df['combined_text'] = df.apply(lambda row: ' '.join(row.values.astype(str)), axis=1)

# Truncate documents to a reasonable length (e.g., 512 characters) to manage memory usage
df['combined_text'] = df['combined_text'].apply(lambda x: x[:512])

# Create a list of combined text for each row
documents = df['combined_text'].tolist()

# Create embeddings in batches; batch_size=1 keeps peak memory low at the cost of speed
def create_embeddings_in_batches(texts, batch_size=1):
    embeddings = []
    for i in range(0, len(texts), batch_size):
        batch_texts = texts[i:i + batch_size]
        batch_embeddings = embed_model.encode(batch_texts, convert_to_tensor=True).cpu().numpy()
        embeddings.append(batch_embeddings)
    return np.vstack(embeddings)

# Create embeddings for the documents in batches
embeddings = create_embeddings_in_batches(documents)

# Create and populate FAISS index
embedding_dim = embeddings.shape[1]
faiss_index = faiss.IndexFlatL2(embedding_dim)
faiss_index.add(embeddings)

# Load a small causal language model for generation
model_name = "gpt2-medium"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Build the text-generation pipeline on GPU 0, requesting float16 for inference
generator = pipeline('text-generation', model=model, tokenizer=tokenizer, device=0, torch_dtype=torch.float16)

# Define a function to retrieve documents and generate a response
def rag_query(query, query_type="factual", top_k=5, relevance_threshold=0.5):
    # Embed the query
    query_embedding = embed_model.encode([query], convert_to_tensor=True).cpu().numpy()

    # Retrieve relevant documents
    D, I = faiss_index.search(query_embedding, top_k)
    retrieved_docs = [documents[i] for i in I[0]]

    # Compute cosine similarity between the query and each retrieved document
    doc_embeddings = embeddings[I[0]]
    similarities = cosine_similarity(query_embedding, doc_embeddings).flatten()

    # Check if the maximum similarity is below the threshold
    if max(similarities) < relevance_threshold:
        return "The query does not seem to be relevant to the documents in the dataset."

    context = " ".join(retrieved_docs)

    # Formulate the prompt based on query type
    if query_type == "factual":
        prompt = f"Context: {context}\n\nQuestion: {query}\nAnswer:"
    elif query_type == "summary":
        prompt = f"Context: {context}\n\nPlease provide a summary of the above context."
    elif query_type == "clarification":
        prompt = f"Context: {context}\n\nI need more details about the following topic: {query}"
    else:
        return "Invalid query type specified."

    # Generate the response with the GPT-2 pipeline, capped at 50 new tokens.
    # Note: by default 'generated_text' holds the prompt followed by the continuation.
    with torch.no_grad():  # Disable gradients to save memory
        response = generator(prompt, max_new_tokens=50, do_sample=True, temperature=0.7)[0]['generated_text']
    return response

# Release cached GPU memory that PyTorch is not using before running queries
torch.cuda.empty_cache()



Saving menstrual_qa.csv to menstrual_qa.csv
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 530 entries, 0 to 529
Data columns (total 2 columns):
 #   Column                Non-Null Count  Dtype 
---  ------                --------------  ----- 
 0   instruction (string)  530 non-null    object
 1   output (string)       530 non-null    object
dtypes: object(2)
memory usage: 8.4+ KB
None
                                instruction (string)  \
0           What is a normal menstrual cycle length?   
1       What are common causes of irregular periods?   
2              How can I alleviate menstrual cramps?   
3      What are the signs of a heavy menstrual flow?   
4  Is it normal to experience mood swings during ...   

                                     output (string)  
0  A normal menstrual cycle typically ranges from...  
1  Common causes of irregular periods include hor...  
2  Menstrual cramps can be alleviated through var...  
3  Signs of a heavy menstrual flow include soakin.
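
The cell above stores the document embeddings in `faiss.IndexFlatL2`, which performs exact (brute-force) search and reports squared Euclidean distances, smallest first. The following is a minimal pure-Python sketch of that idea, not the FAISS implementation; `l2_search` and the toy vectors are hypothetical names and data used only for illustration.

```python
def l2_search(query, vectors, top_k):
    """Brute-force nearest-neighbour search by squared L2 distance.

    Hypothetical helper for illustration only; it mirrors the shape of
    the (distances, indices) pair returned by a flat L2 index.
    """
    scored = []
    for idx, vec in enumerate(vectors):
        # Squared Euclidean distance: no square root is taken
        dist = sum((q - v) ** 2 for q, v in zip(query, vec))
        scored.append((dist, idx))
    scored.sort()  # ascending distance, ties broken by index
    top = scored[:top_k]
    distances = [d for d, _ in top]
    indices = [i for _, i in top]
    return distances, indices


# Toy 2-D "embeddings" (made-up values, not taken from the dataset)
toy_vectors = [[0.0, 0.0], [1.0, 0.0], [3.0, 4.0]]
distances, indices = l2_search([1.0, 0.0], toy_vectors, top_k=2)
print(distances, indices)
```

In the notebook, `indices` plays the role of `I[0]`: it is used to look up the matching rows in `documents`.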


In [4]:
# Test the RAG system with different query types
query = "What are menstrual-cramps?"
response_factual = rag_query(query, query_type="factual")
print("Factual Response:", response_factual)


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Factual Response: Context: Why do some individuals experience menstrual cramps during their period? Menstrual cramps, also known as dysmenorrhea, occur due to the contraction of the uterus as it sheds its lining during menstruation. Increased levels of prostaglandins, hormone-like substances, contribute to uterine muscle contractions and pain. What are menstrual cramps? Dysmenorrhea is the medical term for menstrual cramps, caused by uterine contractions. Primary dysmenorrhea refers to recurrent, crampy pain occurring with menses in the absence of a disorder, while secondary dysmenorrhea refers to menstrual pain associated with an underlying pelvic pathology (disorder). What is the typical duration of menstrual cramps for most women? Menstrual cramps typically last for 1 to 3 days during menstruation. What are common symptoms experienced during menstruation? Common symptoms during menstruation include menstrual cramps (dysmenorrhea), bloating, breast tenderness, fatigue, mood swings, h
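
The factual response above opens with `Context:` because the `text-generation` pipeline returns the prompt followed by the continuation in `generated_text` by default. Below is a minimal sketch of one way to keep only the model's continuation, assuming the exact prompt string is still available; `strip_prompt` is a hypothetical helper and the strings are made-up examples, not output from this notebook.

```python
def strip_prompt(generated_text, prompt):
    """Return only the text generated after the prompt.

    Hypothetical helper: if the output does not start with the prompt,
    the text is returned unchanged apart from surrounding whitespace.
    """
    if generated_text.startswith(prompt):
        return generated_text[len(prompt):].strip()
    return generated_text.strip()


# Made-up strings standing in for a prompt and the pipeline's full output
example_prompt = "Context: some retrieved text\n\nQuestion: What is X?\nAnswer:"
example_output = example_prompt + " X is an example."
answer_only = strip_prompt(example_output, example_prompt)
print(answer_only)
```

An alternative, if the pipeline call itself can be changed, is to pass `return_full_text=False` to the generator so that only the newly generated text is returned.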

In [5]:
query_summary = "Give me a summary of PMS"
response_summary = rag_query(query_summary, query_type="summary")
print("Summary Response:", response_summary)


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Summary Response: Context: How is PMS managed? Management of PMS may involve lifestyle changes (such as regular exercise, healthy diet, stress reduction techniques), over-the-counter pain relievers, hormonal birth control, and medications to alleviate specific symptoms like mood swings or bloating. What is Premenstrual Syndrome (PMS)? PMS is a combination of physical, emotional, and psychological symptoms that occur in the days or weeks before menstruation and typically resolve once menstruation begins. What are the common symptoms of PMS? Common symptoms of PMS include mood swings, irritability, fatigue, bloating, breast tenderness, food cravings, and headaches. What are some natural remedies for PMS (premenstrual syndrome)? Natural remedies for PMS include dietary changes (such as reducing caffeine and increasing intake of fruits and vegetables), regular exercise, stress management techniques (such as yoga and meditation), and herbal supplements. What is the term for the emotional sy

In [6]:
query_clarification = "Can you provide more details about what to eat during periods?"
response_clarification = rag_query(query_clarification, query_type="clarification")
print("Clarification Response:", response_clarification)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Clarification Response: Context: what is the ideal diet for each phase of menstrual cycle During the menstrual cycle, dietary needs may vary across phases. In the follicular phase, focus on iron-rich foods like leafy greens and lean meats to replenish iron lost during menstruation. Prioritize complex carbohydrates such as whole grains and legumes to maintain energy levels. Incorporate foods rich in omega-3 fatty acids like salmon and flaxseeds to help alleviate menstrual cramps. During ovulation, emphasize foods high in antioxidants like b What are some dietary strategies for managing common menstrual symptoms? Eating small, frequent meals, reducing salt intake, and incorporating anti-inflammatory foods can help manage symptoms like bloating and mood swings. Can specific dietary patterns affect menstrual cycles? Yes, factors such as balanced macronutrient intake, hydration, and avoiding excessive caffeine and alcohol can influence menstrual regularity What are some nutrient-rich foods 

In [7]:
query_clarification = "What is machine learning?"
response_clarification = rag_query(query_clarification, query_type="clarification")
print("Clarification Response:", response_clarification)

Clarification Response: The query does not seem to be relevant to the documents in the dataset.
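
The off-topic query above is rejected because `rag_query` compares the highest cosine similarity among the retrieved documents with `relevance_threshold` (0.5 by default) and returns early when it falls below. The following is a small pure-Python sketch of that gate under the same rule; `cosine` and `is_relevant` are hypothetical helpers, and the 2-D vectors are toy values rather than real embeddings.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def is_relevant(query_vec, doc_vecs, threshold=0.5):
    """True when the best-matching document reaches the threshold."""
    if not doc_vecs:
        return False
    best = max(cosine(query_vec, d) for d in doc_vecs)
    return best >= threshold


# Toy vectors: the second document points roughly the same way as the query
on_topic = is_relevant([1.0, 0.0], [[0.0, 1.0], [1.0, 1.0]])
off_topic = is_relevant([1.0, 0.0], [[0.0, 1.0]])
print(on_topic, off_topic)
```

The threshold is a tuning choice: a higher value rejects more borderline queries, while a lower one lets weaker matches through to the generator.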
