# Task 1 - 🧠 Retrieval-Augmented Generation (RAG) Prototype for Bay Wheels (Lyft)
In this notebook, we implemented a Retrieval-Augmented Generation (RAG) pipeline for Bay Wheels, a bike-sharing service operated by Lyft. The goal was to build a chatbot capable of answering typical business and sales-related queries such as:

- “Which area has the highest number of trips in a day?”
- “What are the pricing plans or ticket options?”
- “How many stations are available across the network?”

Because this is a prototype, the full trips dataset has not yet been embedded into the vector database, so questions that depend on trip-level data cannot be answered yet. The chatbot can still answer general questions about Bay Wheels from the summary data that was embedded.

We used Pinecone as the vector database (VectorDB) for document retrieval and Mistral, a lightweight open-source language model run locally through the Ollama framework, for answer generation.
A local model was used because of OpenAI API quota limitations during testing.

# Fallback to Local LLM (Ollama)

Initially, this notebook used the OpenAI GPT-4 model via the `ChatOpenAI` API for answering questions using a Retrieval-Augmented Generation (RAG) pipeline.

However, due to **OpenAI API quota limitations** (error code `429 - insufficient_quota`), the pipeline now uses a **local open-source LLM** (`mistral`) through the [Ollama](https://ollama.com) framework.
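
The notebook handles the quota error by swapping the model outright. If both backends were to stay available, the same idea could be expressed as a runtime fallback. The sketch below is a minimal illustration under stated assumptions: `QuotaExceededError`, `hosted_model`, and `local_model` are hypothetical stand-ins and are not part of the OpenAI, LangChain, or Ollama APIs.

```python
class QuotaExceededError(Exception):
    """Hypothetical stand-in for a provider's '429 - insufficient_quota' error."""


def hosted_model(prompt: str) -> str:
    # Simulates a hosted API rejecting the call once the quota is exhausted.
    raise QuotaExceededError("429 - insufficient_quota")


def local_model(prompt: str) -> str:
    # Simulates a local model; a real version would call the Ollama-served model here.
    return f"[local] answer to: {prompt}"


def answer_with_fallback(prompt, primary, fallback):
    """Try the primary model first and use the fallback only on a quota error."""
    try:
        return primary(prompt)
    except QuotaExceededError:
        return fallback(prompt)


reply = answer_with_fallback("How many stations are there?", hosted_model, local_model)
print(reply)
```

Catching only the quota error keeps other failures (for example, a malformed prompt) visible instead of silently rerouting them to the local model.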


# Loading Packages

In [1]:
import os
import time
from pinecone import Pinecone, ServerlessSpec
from langchain_pinecone import PineconeVectorStore
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.document_loaders import DirectoryLoader, TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.llms import Ollama
from langchain_core.output_parsers import StrOutputParser
from langchain.prompts import PromptTemplate

# Creating Pinecone Index

In [12]:
# Set your API keys for Pinecone
pc = Pinecone(
    api_key=os.environ['PINECONE_API_KEY']
)

# Create Index if not already created
pinecone_index_name = "langchain-embeddings-demo"
if pinecone_index_name not in pc.list_indexes().names():
    pc.create_index(
        name=pinecone_index_name,
        dimension=384,  # 384 is the output dimension of the all-MiniLM-L6-v2 embeddings used below
        metric='cosine',
        spec=ServerlessSpec(
            cloud='aws',
            region='us-east-1'
        )
    )

    while not pc.describe_index(pinecone_index_name).status['ready']:
        time.sleep(1)

    print("Pinecone Index provisioned")
else:
    print("Pinecone Index Already Provisioned")




# Creating And Loading Embeddings

In [8]:
# 1. Initialize HuggingFace Embeddings (no API key required)
embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")  # You can choose other models too

# 2. Load all text files from a directory
directory_path = "Desktop/Lyft-Project/Lyft-baywheels-ChatBot/data"  # Path to your folder with .txt files
loader = DirectoryLoader(directory_path, glob="*.txt", loader_cls=TextLoader)
documents = loader.load()

# 3. Split documents into chunks
text_splitter = RecursiveCharacterTextSplitter(chunk_size=2000, chunk_overlap=50)
split_documents = text_splitter.split_documents(documents)

# 4. Connect to Pinecone and insert documents
pinecone_index_name = "langchain-embeddings-demo"

vectorstore = PineconeVectorStore(index_name=pinecone_index_name, embedding=embeddings)
vectorstore.add_documents(documents=split_documents)

print("✅ Embeddings from local HuggingFace model created and stored in Pinecone Vector Database!")


✅ Embeddings from local HuggingFace model created and stored in Pinecone Vector Database!
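
To make the `chunk_size` and `chunk_overlap` parameters concrete, here is a simplified fixed-window splitter. It is not the algorithm `RecursiveCharacterTextSplitter` uses (that class prefers to break on separators such as paragraph and line breaks); the function name `split_with_overlap` and the sample text are invented for illustration.

```python
def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into windows of chunk_size characters.

    Each window starts chunk_size - chunk_overlap characters after the
    previous one, so consecutive chunks share chunk_overlap characters.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


# Illustrative 25-character sample; the notebook uses chunk_size=2000, chunk_overlap=50.
sample = "".join(str(i % 10) for i in range(25))
chunks = split_with_overlap(sample, chunk_size=10, chunk_overlap=3)
for chunk in chunks:
    print(chunk)
```

In this simplified model, a text shorter than `chunk_size` comes back as a single chunk, so the overlap only matters for longer documents.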


# Asking the Questions and Getting Answers

In [10]:
# 1. Load Embeddings (from HuggingFace)
embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")

# 2. Connect to Pinecone (already created)
pinecone_index_name = "langchain-embeddings-demo"
vector_store = PineconeVectorStore(index_name=pinecone_index_name, embedding=embeddings)

# 3. Define Retriever
retriever = vector_store.as_retriever(search_kwargs={"k": 1})

# 4. Use Ollama's local model as the LLM (like mistral, llama3, gemma)
llm = Ollama(model="mistral")  # You can also use "llama3", "gemma", etc.

# 5. Define Prompt Template
prompt_template = PromptTemplate(
    template="""
Use the following context to answer the question as accurately as possible.
Context:
{context}

Question:
{question}

Answer:""",
    input_variables=["context", "question"]
)

# 6. Create the LLM Chain
llm_chain = prompt_template | llm | StrOutputParser()

# 7. Ask a Question
query = "What are the number of stations for baywheels?"
docs = retriever.invoke(query)
context = "\n\n".join([doc.page_content for doc in docs])

# 8. Run the RAG pipeline
answer = llm_chain.invoke({"context": context, "question": query})

# 9. Output the Answer
print("Question:", query)
print("Answer:", answer)

Question: What are the number of stations for baywheels?
Answer:  As of January 2018, there were 262+ docking stations for Bay Wheels (Lyft Bikes). However, keep in mind that this number might have changed since then.
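
The retriever step combines two settings from this notebook: the index was created with `metric='cosine'`, and the retriever asks for `k=1` result. Pinecone performs the search on its side; the sketch below only illustrates the scoring idea in plain Python. The three-dimensional vectors and chunk texts are invented for illustration (real `all-MiniLM-L6-v2` vectors have 384 dimensions), and `top_k` is a hypothetical helper.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as matching nothing
    return dot / (norm_a * norm_b)


def top_k(query_vector, stored, k=1):
    """Return the k (score, text) pairs with the highest cosine similarity."""
    scored = [(cosine_similarity(query_vector, vec), text) for text, vec in stored]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Toy vectors, invented for illustration only.
stored_chunks = [
    ("Fleet: 262+ docking stations", [0.9, 0.1, 0.0]),
    ("Pricing: day pass and memberships", [0.1, 0.9, 0.1]),
    ("History: launched in 2013", [0.0, 0.2, 0.9]),
]
query_vector = [0.8, 0.2, 0.1]

best = top_k(query_vector, stored_chunks, k=1)
print(best[0][1])
```

With `k=1` the prompt receives only the single best-scoring chunk, which keeps the context short but means an answer spread across several chunks could be missed; raising `k` trades prompt length for coverage.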


# Deleting Pinecone Index

In [11]:
# Set your API keys for Pinecone
pc = Pinecone(
    api_key=os.environ['PINECONE_API_KEY']
)

# Delete the index if it exists
pinecone_index_name = "langchain-embeddings-demo"
if pinecone_index_name in pc.list_indexes().names():
    pc.delete_index(name=pinecone_index_name)

    print("Pinecone Index Deleted")
else:
    print("Pinecone Index Had Already Been Deleted")

Pinecone Index Deleted


# File Embedded in the Vector Database

This file contains a structured text summary of Bay Wheels (Lyft) data, scraped from the official Lyft website and Wikipedia. It was embedded into the VectorDB for retrieval and chatbot testing.

---

## **Bay Wheels (Lyft Bikes) – System Overview & Data Summary**

### 1. Operating Entity
- Operated by **Motivate**, owned by **Lyft**.  
  *Reference: [1]*

---

### 2. Service Area & Fleet
- Operates in: **San Francisco Bay Area** (San Francisco, East Bay, San Jose)  
- Fleet: ~2,600+ bicycles, 262+ docking stations *(as of Jan 2018)*  
  *Reference: [2]*

---

### 3. History & Launch Dates
- **Aug 29, 2013** – Launched as *Bay Area Bike Share*  
- **June 28, 2017** – Rebranded to *Ford GoBike*  
- **June 11, 2019** – Relaunched as *Bay Wheels* under Lyft  
  *References: [3], [4], [5]*

---

### 4. System Expansion Goals
- Planned growth to:  
  → ~7,000 bikes  
  → ~540 stations (across SF, Oakland, Berkeley, Emeryville, San Jose)  
  *Reference: [6]*

---

### 5. Bike Types & Technology
- Classic **docked bikes** and hybrid **e‑bikes**  
- Dockless option with built-in lock  
- **Clipper card** supported for contactless access  
- Bikes by **8D Technologies** and **Motivate**  
  *Reference: [7]*

---

### 6. Pricing (Effective mid 2025)
- Single Ride: 3.99 dollar / 30 min
- Day Pass: 15 dollar/day (unlimited 30-min classic rides)  
- Monthly: 29 dollar/month (45 min free + ebike discounts)  
- Annual: 150 dollar/year  
- Lyft Pink: 199 dollar/year (includes rideshare perks)  
  *Reference: [8]*

---

### 7. How It Works
1. Unlock via Lyft app or Clipper QR  
2. Ride  
3. Return to dock for free (or $2 lock fee for e‑bike racks)  
  *Reference: [9]*

---

### 8. Additional Features
- **Ride Together**: unlocks bikes for guests  
- **Bike Angels**: rebalancing incentive program  
  *References: [10], [11]*

---

### 9. Data & Developer Access
- Open access to trip histories, ridership, membership stats  
  *Reference: [12]*

---

### 10. Resources & Extras
- Blog updates, adaptive bikeshare logs, art bike initiatives  
  *Reference: [13]*

---

> This file was generated using ChatGPT to scrape and summarize content from official Bay Wheels sources.


# Future Enhancement: Integrating Trip-Level Data in Production
While this prototype focuses on answering general business and operational questions using curated Bay Wheels data, the model's usefulness can be significantly enhanced in a production setting by incorporating detailed trip-level datasets.

The raw trip data includes fields such as `duration_sec`, `start_time`, `end_time`, `start_station_name`, `end_station_id`, and `user_type`, which can provide rich insights into rider behavior, traffic patterns, and operational bottlenecks.

However, due to the large size and granularity of this dataset, it has been excluded from this prototype to maintain performance and minimize vector database load.

In a real-world deployment, this trip data can be transformed before embedding by:

- Aggregating the number of trips between stations per day
- Summarizing peak usage periods by station
- Creating department-specific views (e.g., operations vs. planning)

This structured summarization approach would retain critical insights while keeping the RAG system optimized and scalable for production.
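
The first two transformations listed above could be sketched with the standard library as follows. Only the field names (`start_time`, `start_station_name`, `end_station_id`) come from the dataset description; the sample rows, the timestamp format, the station names, and the helper `peak_hour` are assumptions made for illustration.

```python
from collections import Counter
from datetime import datetime

# Hypothetical sample rows; real trip files may use a different timestamp format.
trips = [
    {"start_time": "2019-06-11 08:05:00", "start_station_name": "Station A", "end_station_id": 12},
    {"start_time": "2019-06-11 08:40:00", "start_station_name": "Station A", "end_station_id": 12},
    {"start_time": "2019-06-11 17:15:00", "start_station_name": "Station A", "end_station_id": 30},
    {"start_time": "2019-06-11 08:20:00", "start_station_name": "Station B", "end_station_id": 12},
    {"start_time": "2019-06-12 09:00:00", "start_station_name": "Station B", "end_station_id": 30},
]

daily_pairs = Counter()  # (day, start station, end station id) -> trip count
hourly = Counter()       # (start station, hour of day) -> trip count

for trip in trips:
    started = datetime.strptime(trip["start_time"], "%Y-%m-%d %H:%M:%S")
    day = started.date().isoformat()
    daily_pairs[(day, trip["start_station_name"], trip["end_station_id"])] += 1
    hourly[(trip["start_station_name"], started.hour)] += 1


def peak_hour(station):
    """Return the busiest start hour for a station, or None if it has no trips."""
    counts = {hour: n for (name, hour), n in hourly.items() if name == station}
    if not counts:
        return None
    return max(counts, key=counts.get)


# One short sentence per aggregate; these sentences, not the raw rows, would be embedded.
summaries = [
    f"On {day}, {count} trip(s) went from {start} to station {end_id}."
    for (day, start, end_id), count in sorted(daily_pairs.items())
]

print(summaries[0])
print("Peak hour at Station A:", peak_hour("Station A"))
```

Embedding a few summary sentences per station and day, rather than one vector per trip, is what keeps the index small in this approach.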