<a href="https://colab.research.google.com/github/kdats/Ichange-my-city-using-RAG-bluru/blob/main/Ichangemycity_RAG_Demo.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# About iChangeMyCity

> iChangeMyCity is a civic tech platform in India, started by Janaagraha Centre for Citizenship and Democracy.  
> It lets people file complaints about local issues like garbage, roads, streetlights, water supply, traffic, public places, and more, directly to city authorities.  
> This platform is now one of India's largest collections of civic complaints, using regular people’s experiences to help cities.

---

## Objective

- **Build** a RAG pipeline using open-source vector search and Google Gemini LLM  
- **Allow** semantic question answering over real city complaint data  
- **Show** how RAG can help people actually use large, messy data in a useful way

---

## About This Dataset

Source: iChangeMyCity.com, covering public complaints from Bangalore between 2019 and 2022.  
Columns used in this notebook:  
- `category_title`: Type of civic issue (e.g. Garbage, Roads, Water)  
- `location`: Name of the locality or neighborhood in Bangalore  
- `ward_title`: The BBMP municipal ward name  
- `description`: The complaint details in the user’s own words  
- `complaint_status_title`: Status of the complaint (resolved, in progress, or pending)  
- `created_at`: The date when the complaint was made  
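
As a quick sanity check, the sketch below verifies that a CSV header contains the column names the notebook's code reads later (`category_title`, `location`, `ward_title`, `description`, `complaint_status_title`, `created_at`). The helper name `missing_columns` and the inline sample row are made up for illustration and are not part of the dataset.

```python
import csv
import io

# Column names that the notebook's code reads from the CSV.
REQUIRED_COLUMNS = [
    "category_title", "location", "ward_title",
    "description", "complaint_status_title", "created_at",
]

def missing_columns(csv_text, required=REQUIRED_COLUMNS):
    """Return the required column names that are absent from the CSV header."""
    reader = csv.DictReader(io.StringIO(csv_text))
    header = reader.fieldnames or []
    return [name for name in required if name not in header]

# Hypothetical one-row sample, not taken from the real dataset.
sample = (
    "category_title,location,ward_title,description,complaint_status_title,created_at\n"
    "Garbage,Example Layout,Example Ward,Garbage not collected,Pending,2021-01-01\n"
)
print(missing_columns(sample))                          # complete header
print(missing_columns("location,description\nA,B\n"))  # incomplete header
```

Running a check like this before building the `text` field would likely surface a renamed or missing column earlier than a `KeyError` deep in the pipeline.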

---

## What Problem Are We Trying to Solve?

Bangalore is a large, fast-growing city, so it generates a high volume of civic complaints.  
Regular dashboards show counts and basic statistics, but they cannot answer specific questions about what people are reporting.

People – officials, citizens, journalists – need to know things like:
- Which areas get garbage complaints again and again
- What are the three main unresolved infrastructure problems in a ward
- How are civic issues changing in neighborhoods

The idea here is to build an AI system that uses RAG to answer natural-language questions about city complaints, drawing its answers from the dataset itself.

---

## Why This Matters

- Data can help the city decide what to fix first and see which areas have the most problems
- Residents, activists, or the media can find out what’s really happening in their area
- This type of solution can work for any city if it has this kind of data

---

## What Does This Notebook Do?

- Loads and cleans up the complaints data
- Lets you search for meaning, not just keywords, in thousands of complaints
- Uses Google Gemini LLM to answer your questions, using real complaint info
- Shows how Retrieval-Augmented Generation (RAG) helps find real answers in city data

---

## Purpose

This notebook shows a RAG pipeline that summarizes and analyzes complaints from Bangalore.  
It can answer questions such as:
- Which areas get the most garbage complaints
- What’s the main complaint in a certain ward
- Which localities have recurring water or road issues

The goal is to give city officials, researchers, and activists more than counts: answers drawn from the complaints themselves.

---

## Why RAG (Retrieval-Augmented Generation)

Regular language models sometimes make up facts (hallucinate) or can’t give answers based on your data.  
RAG works by:
- **Retrieval**: finds the most relevant info from your own database (not just the internet)
- **Augmentation**: sends this context to the language model (here, Gemini) along with an instruction to answer only from what was retrieved
- **Generation**: writes a natural answer using that info

Because answers are grounded in retrieved records, they are easier to verify and they reflect whatever data is currently in the store.
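
The three steps can be sketched in a few lines of plain Python. This is a toy illustration only: retrieval here ranks by simple word overlap rather than embeddings, the `llm` argument is a stand-in callable rather than Gemini, and the records are invented.

```python
import re

def tokenize(text):
    """Lowercase a string and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(query, records, k=2):
    """Retrieval: rank records by word overlap with the query (toy scoring)."""
    q = tokenize(query)
    ranked = sorted(records, key=lambda r: len(q & tokenize(r)), reverse=True)
    return ranked[:k]

def augment(query, context_records):
    """Augmentation: place the retrieved records in the prompt."""
    context = "\n".join(f"- {r}" for r in context_records)
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

def generate(prompt, llm):
    """Generation: hand the augmented prompt to any callable LLM."""
    return llm(prompt)

# Hypothetical records, not taken from the dataset.
records = [
    "Garbage not collected for a week near the park",
    "Streetlight broken on the main road",
    "Garbage dumped on the footpath every day",
]
query = "Where is garbage a problem?"
top = retrieve(query, records)
prompt = augment(query, top)
# Stand-in for a real model call: it only reports the prompt size.
answer = generate(prompt, llm=lambda p: f"(model reply to {len(p)} characters of prompt)")
print(prompt)
```

The notebook follows the same shape, with embedding-based search in place of word overlap and a Gemini call in place of the stand-in.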

---

## Workflow Overview

- Install libraries: pandas, langchain, langchain_community, sentence-transformers, chromadb, google-generativeai
- Load and prepare complaint data
- Embed complaint texts for semantic search
- Store vectors in ChromaDB
- Search for the most relevant complaints for a query
- Send retrieved context and query to Gemini LLM
- Print out answers you can use for analysis

---

## API Key and Security

The Gemini API key is read from a local `env.txt` file; an environment variable or a notebook secret works equally well.  
Always keep your real keys private: do not print them in cell output or commit them to a repository.
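
One possible way to handle the key is sketched below: look in an environment variable first, fall back to a file, and only ever display a masked version. The variable name `GOOGLE_API_KEY` and the helper names `load_api_key` and `mask` are assumptions made for this example, not something the notebook defines.

```python
import os

def load_api_key(env_var="GOOGLE_API_KEY", fallback_path="env.txt"):
    """Return the key from the environment, else the first non-empty line of a file."""
    key = os.environ.get(env_var, "").strip()
    if key:
        return key
    if os.path.exists(fallback_path):
        with open(fallback_path, "r") as f:
            for line in f:
                if line.strip():
                    return line.strip()
    raise RuntimeError(f"No API key found in ${env_var} or {fallback_path}")

def mask(secret, visible=2):
    """Keep only the first few characters so the key never appears in output."""
    return secret[:visible] + "*" * max(len(secret) - visible, 0)

# Placeholder string, not a real key.
print(mask("not-a-real-key"))
```

Printing the masked form is enough to confirm that a key was loaded without exposing it in a shared notebook.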

---

## Example Use Cases

- Find wards with the most unresolved infrastructure issues
- Get top complaints for Dodda Nekkundi
- Find out what issues came up most in 2021

---

## Getting Started

- Change the sample CSV and API key as needed
- Try your own questions at the end to see how RAG works

This is built with open-source and Google AI tools, and it can be used for any city’s complaint records if you have similar data.


In [18]:
env_path = "/content/env.txt"
csv_dataset_path = "/content/icmyc-2019-2022.csv"

#### Install the required libraries


In [19]:
! pip install -q pandas langchain chromadb google-generativeai langchain_community


### Import all required libraries for data handling, embeddings, and RAG.


In [20]:
import os
import pandas as pd
import google.generativeai as genai
from langchain_community.document_loaders import DataFrameLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import Chroma
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate


#### 🔑 Set your Gemini API key and choose the Gemini model version (e.g., 'gemini-2.5-flash' or 'gemini-pro').
#### ⚠️ **Keep your API key secret in production!**


In [21]:
# Read the API key from the env file (path set in the first cell)
with open(env_path, "r") as f:
    for line in f:
        if line.strip():
            google_api_key = line.strip()

# Configure the Gemini client with the key
genai.configure(api_key=google_api_key)
# Print a masked version only; never print the full key
print(google_api_key[:2] + "*" * (len(google_api_key) - 2))

AI**************************************


#### 📥 Download a sample civic complaints CSV (or upload your own to Colab).
#### Example sample CSV: https://data.opencity.in/datastore/dump/5f99b09a-64b5-45f0-ab18-4cf0a0cabf6d?bom=True&format=csv
#### 👉 Replace this link or file path with your actual dataset as needed.


In [22]:
# use /content/icmyc-2019-2022.csv in google colab notebook

df = pd.read_csv(csv_dataset_path, encoding_errors="ignore")
df.fillna('Unknown', inplace=True)


### Prepare Text Chunks for Embedding
#### Convert complaint records into single text fields for vectorization.
#### (Optional) Chunk large text for better semantic search.


In [23]:
# Prepare combined text for each record
df['text'] = (
    "category_title: " + df['category_title'].astype(str) +
    "; location: " + df['location'].astype(str) +
    "; ward_title: " + df['ward_title'].astype(str) +
    "; description: " + df['description'].astype(str) +
    "; complaint_status_title: " + df['complaint_status_title'].astype(str) +
    "; created_at: " + df['created_at'].astype(str)
)

# Load as LangChain documents
loader = DataFrameLoader(df, page_content_column='text')
documents = loader.load()

# Enrich with metadata
for doc, (_, row) in zip(documents, df.iterrows()):
    doc.metadata = {
        "location": row["location"],
        "category_title": row["category_title"],
        "ward_title": row["ward_title"],
        "complaint_status_title": row["complaint_status_title"],
        "created_at": row["created_at"]
    }


In [24]:
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=600,
    chunk_overlap=150,
    separators=["\n\n", "\n", ";", ".", " "]
)
texts = text_splitter.split_documents(documents)
print(f"Total chunks created: {len(texts)}")


Total chunks created: 454
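
To make `chunk_size` and `chunk_overlap` concrete, here is a simplified sliding-window splitter. It is not how `RecursiveCharacterTextSplitter` works internally (that class also tries to break at the listed separators); the function name `sliding_chunks` is made up for this sketch, which only shows how consecutive chunks share characters.

```python
def sliding_chunks(text, chunk_size=600, chunk_overlap=150):
    """Split text into fixed-size windows; neighbours share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks

demo = "abcdefghijklmnopqrstuvwxyz"
pieces = sliding_chunks(demo, chunk_size=10, chunk_overlap=4)
for piece in pieces:
    print(piece)
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of storing some text twice.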


### Generate Embeddings with Sentence Transformers
####  Embed all text chunks using a lightweight, high-quality open-source model.


In [25]:
! pip install -q sentence-transformers

from langchain_community.embeddings import HuggingFaceEmbeddings

embeddings = HuggingFaceEmbeddings(model_name='sentence-transformers/all-MiniLM-L6-v2')
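
Each chunk becomes a vector (384 numbers for `all-MiniLM-L6-v2`), and texts with similar meaning end up with vectors that point in similar directions. A common way to compare two such vectors is cosine similarity, sketched below with made-up 3-dimensional vectors. This is illustrative only and is not necessarily the distance metric the vector store uses by default.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Made-up vectors standing in for real embeddings.
garbage_1 = [0.9, 0.1, 0.0]    # e.g. "garbage not collected"
garbage_2 = [0.8, 0.2, 0.1]    # e.g. "waste dumped on the road"
streetlight = [0.0, 0.2, 0.9]  # e.g. "streetlight not working"

print(cosine_similarity(garbage_1, garbage_2))
print(cosine_similarity(garbage_1, streetlight))
```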


### Create a Chroma Vector Database
####  Create a vector store in Chroma for fast similarity search and semantic retrieval.


In [26]:
vectordb = Chroma.from_documents(
    texts,
    embeddings,
    persist_directory="./vectordb"
)
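# With chromadb 0.4 or newer, data is written to persist_directory automatically
# and this call is kept only for compatibility with older versions.
vectordb.persist()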


### Retrieval Function (Vector Search)
#### Function to retrieve the top-K most relevant chunks for a user query.

In [27]:
retriever = vectordb.as_retriever(
    search_type="mmr",      # Maximal Marginal Relevance (diverse & relevant)
    search_kwargs={"k": 6, "fetch_k": 20}
)
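
With `search_type="mmr"`, the retriever first fetches `fetch_k` candidates and then picks `k` of them so that each pick is relevant to the query but not a near-copy of one already chosen. The sketch below is a minimal pure-Python version of that selection rule under stated assumptions: the vectors are made up, `mmr_select` is a hypothetical helper, and `lambda_mult=0.5` is used as the balance between relevance and diversity.

```python
import math

def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def mmr_select(query_vec, candidate_vecs, k=2, lambda_mult=0.5):
    """Return indices of k candidates chosen greedily by Maximal Marginal Relevance."""
    selected = []
    remaining = list(range(len(candidate_vecs)))
    while remaining and len(selected) < k:
        def score(i):
            relevance = cosine(query_vec, candidate_vecs[i])
            # Similarity to the closest already-selected item (0 if none yet).
            redundancy = max(
                (cosine(candidate_vecs[i], candidate_vecs[j]) for j in selected),
                default=0.0,
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected

query = [1.0, 0.0]
candidates = [
    [0.95, 0.31],  # most relevant
    [0.94, 0.34],  # almost identical to the first candidate
    [0.80, -0.60], # less relevant, but about something different
]
print(mmr_select(query, candidates, k=2))                   # balances relevance and diversity
print(mmr_select(query, candidates, k=2, lambda_mult=1.0))  # pure relevance ranking
```

For complaint data this matters because many records describe the same incident in nearly the same words; a diversity term keeps the context sent to the model from being filled with duplicates.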


### RAG QA – Query Gemini LLM Using Retrieved Context
#### Use Gemini LLM (via API) to answer the user's question, given the retrieved context.
A wrapper function takes the user query, retrieves the matching documents, and asks Gemini for an answer.
<br>Run some real-world civic complaints analytics queries!


In [28]:
import warnings

# Assumes google_api_key is set and the retriever has been built in the cells above
warnings.filterwarnings('ignore')
def ask_gemini(context, query):
    genai.configure(api_key=google_api_key)
    model = genai.GenerativeModel('gemini-2.5-pro')
    prompt = (
        "You are an expert civic complaints analyst AI.\n"
        "Answer the following query using only the given context (from complaints records).\n\n"
        f"Context:\n{context}\n\n"
        f"Query: {query}\nAnswer:"
    )
    response = model.generate_content(prompt)
    return response.text

def get_rag_answer(query, retriever):
    # The retriever decides how many chunks to return (k=6 in its search_kwargs above)
    docs = retriever.get_relevant_documents(query)
    context = "\n".join(doc.page_content for doc in docs)
    return ask_gemini(context, query)

# Example queries
query1 = "Which areas are most prone to garbage dumping issues?"
resp1 = get_rag_answer(query1, retriever)
print("Most garbage dump prone areas:\n", resp1)

query2 = "List the areas that are likely to have the most infrastructure related complaints."
resp2 = get_rag_answer(query2, retriever)
print("Areas with most infra issues:\n", resp2)

query3 = "What are the top 3 most common complaints from Dodda Nekkundi, in ranked order?"
resp3 = get_rag_answer(query3, retriever)
print("Ward 85 top complaints:\n", resp3)


Most garbage dump prone areas:
 Based on the complaints records, the following areas are prone to garbage dumping issues:

*   **Marithi Nilaya #17, 1St Lane,1St Cross Nagalingeswara Temple, Kundalahalli, Brookefield**: This location has repeated complaints regarding the lack of regular garbage collection.
*   **Near Collins Aerospace, Vijayanagar, Epip Zone, Whitefield**: A complaint was filed about a large garbage dump in this area.
*   **SLS Square Front Gate Road, Phase 2, Brookefield**: A huge dumping zone was reported in front of this location.
*   **ITPL Bypass Road, Doddanekkundi Extension, Friends Layout**: A complaint was made about garbage not being picked up on time.
*   **7th H Cross Road, Dodda Nekkundi Extension, Chinnapanna Halli**: This area has issues with waste dumping in open places due to irregular collection by municipal workers.
Areas with most infra issues:
 Based on the provided records, the following areas have multiple infrastructure-related complaints:

*   