# Traffic Violation RAG System
In this exam, you will implement a Retrieval-Augmented Generation (RAG) system that uses a language model and a vector database to answer questions about traffic violations. The goal is to generate answers grounded in a dataset of traffic violations and fines.

Here are helpful resources:
* [LangChain](https://www.langchain.com/)
* [Groq Cloud documentation](https://console.groq.com/docs/models)
* [LangChain HuggingFace](https://python.langchain.com/docs/integrations/text_embedding/sentence_transformers/)
* [Chroma Vector Store](https://python.langchain.com/docs/integrations/vectorstores/chroma/)
* [Chroma Website](https://docs.trychroma.com/getting-started)
* [ChatGroq LangChain](https://python.langchain.com/docs/integrations/chat/groq/)
* [LLM Chain](https://api.python.langchain.com/en/latest/chains/langchain.chains.llm.LLMChain.html#langchain.chains.llm.LLMChain)

Dataset [source](https://www.moi.gov.sa/wps/portal/Home/sectors/publicsecurity/traffic/contents/!ut/p/z0/04_Sj9CPykssy0xPLMnMz0vMAfIjo8ziDTxNTDwMTYy83V0CTQ0cA71d_T1djI0MXA30gxOL9L30o_ArApqSmVVYGOWoH5Wcn1eSWlGiH1FSlJiWlpmsagBlKCQWqRrkJmbmqRqUZebngB2gUJAKdERJZmqxfkG2ezgAhzhSyw!!/)

Install the required packages if needed:
```python
!pip install langchain_huggingface langchain langchain-community langchain_chroma chromadb langchain_groq
```


In [None]:
!kaggle datasets download -d khaledzsa/dataset
!unzip dataset.zip

Dataset URL: https://www.kaggle.com/datasets/khaledzsa/dataset
License(s): unknown
Downloading dataset.zip to /content
Archive:  dataset.zip
  inflating: Dataset.csv             


## Step 1: Install Required Libraries

To begin, install the necessary libraries for this project. The libraries include `LangChain` for building language model chains, and `Chroma` for managing a vector database.

In [None]:
!pip install langchain_huggingface langchain langchain-community langchain_chroma chromadb langchain_groq

## Step 2: Load the Traffic Violations Dataset

You are provided with a dataset of traffic violations. Load the CSV file into a pandas DataFrame and preview the first few rows using `.head()`. You can also inspect the dataset's characteristics, such as its shape and column types.

In [None]:
import pandas as pd
df = pd.read_csv('/content/Dataset.csv')
df.head()

| Unnamed: 0 | المخالفة | الغرامة |
|---|---|---|
| 0 | قيادة المركبة في الأسواق التي لا يسمح بالقيادة... | الغرامة المالية 100 - 150 ريال |
| 1 | ترك المركبة مفتوحة وفي وضع التشغيل بعد مغادرتها. | الغرامة المالية 100 - 150 ريال |
| 2 | عدم وجود تأمين ساري للمركبة. | الغرامة المالية 100 - 150 ريال |
| 3 | عبور المشاة للطرق من غير الأماكن المخصصة لهم. | الغرامة المالية 100 - 150 ريال |
| 4 | عدم تقيد المشاة بالإشارات الخاصة بهم. | الغرامة المالية 100 - 150 ريال |
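
To look beyond the first rows, a quick pass over the file can report the row count, the column names, and the distinct fine ranges. The sketch below is a minimal illustration using only the standard library's `csv` module on a small inline sample; the two sample rows are copied from the preview above, and the helper name `summarize_csv` is an assumption for illustration.

```python
import csv
import io

# Two-row sample shaped like Dataset.csv (unnamed index column, violation, fine).
SAMPLE_CSV = (
    ",المخالفة,الغرامة\n"
    "0,عدم وجود تأمين ساري للمركبة.,الغرامة المالية 100 - 150 ريال\n"
    "1,عدم تقيد المشاة بالإشارات الخاصة بهم.,الغرامة المالية 100 - 150 ريال\n"
)

def summarize_csv(text):
    """Return (row_count, column_names, distinct_fines) for CSV text."""
    reader = csv.DictReader(io.StringIO(text))
    rows = list(reader)
    fines = sorted({row["الغرامة"] for row in rows})
    return len(rows), reader.fieldnames, fines

row_count, columns, fines = summarize_csv(SAMPLE_CSV)
print(row_count, columns, fines)
```

With pandas, `df.shape`, `df.info()`, and `df['الغرامة'].value_counts()` give the same kind of overview on the full DataFrame.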


In [None]:
# Alternative: load the CSV as LangChain documents (one document per row)

from langchain_community.document_loaders.csv_loader import CSVLoader

loader = CSVLoader(file_path='/content/Dataset.csv')
data = loader.load()

In [None]:
data

## Step 3: Create Markdown Content from the Dataset

For each traffic violation in the dataset, generate markdown text that describes the violation and the associated fine. Loop over the dataset and store the generated markdown in a list. Each entry should look like this (the violation in bold, followed by the fine):

**المخالفة** - الغرامة

In [None]:
markdown_violations = []

for index, row in df.iterrows():
    violation = row['المخالفة']
    fine = row['الغرامة']
    markdown_text = f"المخالفة: {violation} - الغرامة: {fine}."
    markdown_violations.append(markdown_text)

In [None]:
for entry in markdown_violations[:5]:
    print(entry)

المخالفة: قيادة المركبة في الأسواق التي لا يسمح بالقيادة فيها. - الغرامة: الغرامة المالية 100 - 150 ريال.
المخالفة: ترك المركبة مفتوحة وفي وضع التشغيل بعد مغادرتها. - الغرامة: الغرامة المالية 100 - 150 ريال.
المخالفة: عدم وجود تأمين ساري للمركبة. - الغرامة: الغرامة المالية 100 - 150 ريال.
المخالفة: عبور المشاة للطرق من غير الأماكن المخصصة لهم. - الغرامة: الغرامة المالية 100 - 150 ريال.
المخالفة: عدم تقيد المشاة بالإشارات الخاصة بهم. - الغرامة: الغرامة المالية 100 - 150 ريال.
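
The printed entries use a plain `label: value` layout, while the format requested at the start of this step puts the violation in bold. A minimal sketch of the bold variant follows; the helper name `to_markdown_entry` is hypothetical, and the example values are copied from the rows shown above.

```python
def to_markdown_entry(violation, fine):
    """Format one record as '**violation** - fine' (Markdown bold for the violation)."""
    return f"**{violation.strip()}** - {fine.strip()}"

entry = to_markdown_entry(
    "عدم وجود تأمين ساري للمركبة.",
    "الغرامة المالية 100 - 150 ريال",
)
print(entry)
```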


## Step 4: Chunk the Markdown Data

Using LangChain's `RecursiveCharacterTextSplitter`, split the markdown texts into smaller chunks that will be stored in the vector database.

In [None]:
! pip install langchain_community


In [None]:
!pip install langchain



In [None]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

chunk_size = 512
splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size)

chunks = splitter.create_documents(markdown_violations)

In [None]:
print(f"Number of chunks: {len(chunks)}")

Number of chunks: 104
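
Each entry here is far shorter than `chunk_size`, so the splitter most likely leaves every entry as a single chunk. To show what happens with longer text, the sketch below is a simplified stand-in for recursive character splitting: it tries coarse separators first, packs pieces greedily up to the size limit, and falls back to fixed-width slices. It is not LangChain's implementation, it ignores chunk overlap, and `split_text` is a hypothetical name.

```python
def split_text(text, chunk_size, separators=("\n\n", "\n", " ")):
    """Split text into chunks of at most chunk_size characters (no overlap)."""
    if len(text) <= chunk_size:
        return [text]
    for sep in separators:
        if sep in text:
            chunks, current = [], ""
            for piece in text.split(sep):
                candidate = piece if not current else current + sep + piece
                if len(candidate) <= chunk_size:
                    current = candidate
                    continue
                if current:
                    chunks.append(current)
                current = ""
                if len(piece) > chunk_size:
                    # The piece alone is too long: split it with finer separators.
                    chunks.extend(split_text(piece, chunk_size, separators))
                else:
                    current = piece
            if current:
                chunks.append(current)
            return chunks
    # No separator present: fall back to fixed-width slices.
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

print(split_text("aaa bbb ccc ddd eee", 10))
```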


## Step 5: Generate Embeddings for the Documents

In [None]:
!pip install sentence-transformers



Generate embeddings for the chunks of text using a pre-trained Arabic BERT model from Hugging Face (`asafaya/bert-base-arabic`). These embeddings will be stored in a `Chroma` vector store.

In [None]:
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import Chroma

embeddings = HuggingFaceEmbeddings(model_name="asafaya/bert-base-arabic")





In [None]:
!pip install langchain_chroma


In [None]:
!pip install chromadb



In [None]:
from langchain_chroma import Chroma

# Embed every chunk and store the vectors in an in-memory Chroma collection
vectordb = Chroma.from_documents(documents=chunks, embedding=embeddings)
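
Conceptually, a vector store answers a query by embedding it and ranking the stored chunks by how close their vectors are to the query vector. The sketch below illustrates that ranking with cosine similarity, one common choice that may differ from the metric Chroma is actually configured with. The vectors are made up, and `cosine_similarity` and `top_k` are hypothetical helper names.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def top_k(query_vec, store, k=3):
    """store is a list of (text, vector); return the k texts most similar to query_vec."""
    ranked = sorted(
        store,
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:k]]

# Made-up 3-dimensional vectors; real embeddings have hundreds of dimensions.
toy_store = [
    ("chunk about licenses", [0.9, 0.1, 0.0]),
    ("chunk about parking", [0.0, 1.0, 0.0]),
    ("chunk about plates", [0.1, 0.0, 0.9]),
]
print(top_k([1.0, 0.0, 0.0], toy_store, k=2))
```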

## Step 6: Define the RAG Prompt Template

Define a custom prompt template in Arabic to retrieve traffic violation-related answers based on the context. Ensure the template encourages the model to give **advice** in **Arabic**, staying within the context provided.

In [None]:
from langchain.prompts import PromptTemplate

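# The Arabic template reads, in English: "Do not answer other questions. Below is a
# list of traffic violations and fines: {context}. Based on the list above, answer
# the following question: {question}."
prompt_template = PromptTemplate(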
    input_variables=["context", "question"],
    template= "   لاتجب على الاسئلة الاخرى ،فيما يلي قائمة بالمخالفات المرورية والغرامات   : \n {context} \n \n بناءً على القائمة أعلاه، أجب على السؤال التالي: {question}."
)
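
A prompt template is essentially a string with named placeholders that are filled in at query time. The sketch below shows the same idea with plain `str.format`; the English template text, the sample chunks, and the helper name `build_prompt` are illustrative assumptions, not the notebook's prompt.

```python
TEMPLATE = (
    "Below is a list of traffic violations and fines:\n"
    "{context}\n\n"
    "Based on the list above, answer the following question: {question}"
)

def build_prompt(context_chunks, question):
    """Join the retrieved chunks and substitute them into the template."""
    context = "\n".join(f"- {chunk}" for chunk in context_chunks)
    return TEMPLATE.format(context=context, question=question)

prompt = build_prompt(
    ["Violation A - fine 100 to 150 riyals", "Violation B - fine 150 to 300 riyals"],
    "What is the fine for violation A?",
)
print(prompt)
```

`PromptTemplate.format(context=..., question=...)` performs the equivalent substitution on the notebook's template.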

## Step 7: Initialize the Language Model

Initialize the language model using the Groq API. Set up the model with a specific configuration, including the API key, temperature setting, and model name.

In [None]:
!pip install langchain-groq



In [None]:
import os

# Read the key from the environment instead of hard-coding it in the notebook
groq_api_key = os.environ["GROQ_API_KEY"]

In [None]:
from langchain_groq import ChatGroq

llm = ChatGroq(
    model="llama-3.1-8b-instant",
    temperature=0.7,
    api_key=groq_api_key
)
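
The three settings named in this step (API key, temperature, model name) can be gathered in one place and read from the environment, which keeps credentials out of the notebook file. The sketch below is one possible arrangement under stated assumptions: `LLMSettings` and `load_settings` are hypothetical names, and `GROQ_MODEL` and `GROQ_TEMPERATURE` are made-up variable names rather than anything Groq or LangChain defines.

```python
import os
from dataclasses import dataclass

@dataclass(frozen=True)
class LLMSettings:
    api_key: str
    model: str
    temperature: float

def load_settings(env=os.environ):
    """Read settings from a mapping of environment variables; fail early if the key is missing."""
    api_key = env.get("GROQ_API_KEY", "")
    if not api_key:
        raise RuntimeError("GROQ_API_KEY is not set")
    return LLMSettings(
        api_key=api_key,
        model=env.get("GROQ_MODEL", "llama-3.1-8b-instant"),    # hypothetical variable name
        temperature=float(env.get("GROQ_TEMPERATURE", "0.7")),  # hypothetical variable name
    )

# Demonstration with a fake environment mapping (placeholder key, not a real one).
settings = load_settings({"GROQ_API_KEY": "placeholder-key", "GROQ_TEMPERATURE": "0.2"})
print(settings.model, settings.temperature)
```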

## Step 8: Create the LLM Chain

Now, you will create an LLM Chain that combines the language model and the prompt template you defined. This chain will be used to generate responses based on the retrieved context.

In [None]:
from langchain.chains import LLMChain

chain = LLMChain(
    llm=llm,
    prompt=prompt_template
)
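
Conceptually, such a chain fills the prompt template and then passes the resulting text to the model. The sketch below imitates that flow with a stub in place of the model; `make_chain`, `echo_model`, and `toy_chain` are hypothetical names, and no LangChain or Groq call is made.

```python
def make_chain(template, model):
    """Return a callable that fills `template` and passes the prompt to `model`."""
    def run(**variables):
        prompt = template.format(**variables)
        return model(prompt)
    return run

def echo_model(prompt):
    # Stub standing in for a real LLM call: it only reports the prompt length.
    return f"[model received {len(prompt)} characters]"

toy_chain = make_chain("Context: {context}\nQuestion: {question}", echo_model)
print(toy_chain(context="Violation A - fine 100 riyals", question="What is the fine?"))
```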

## Step 9: Implement the Query Function

Create a function `query_rag` that will take a user query as input, retrieve relevant context from the vector store, and use the language model to generate a response based on that context.

In [None]:
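def query_rag(question):
    """Send `question` and a context string to the LLM chain and return its answer."""
    # NOTE: the context below is a literal placeholder; nothing is retrieved from
    # `vectordb` here, so the model never sees the dataset.
    relevant_context = "..."

    response = chain.run({
        "context": relevant_context,
        "question": question
    })

    return response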

In [None]:
# Try it out

user_query = "ماهي الغرامة على القيادة بدون رخصة؟"  # "What is the fine for driving without a license?"

answer = query_rag(user_query)
print(answer)


لا أرى القائمة في أسفل السؤال. ولكن يمكنني إجابة السؤال برغم ذلك.

غرامة القيادة بدون رخصة تختلف من دولة إلى أخرى، ولكن العامة تتراوح بين 500 إلى 2000 دولار أو حتى إيقاف السيارة وتأهيلها لاتخاذ الإجراءات القانونية.


In [None]:
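# Try it again with a violation that appears in the dataset

user_query = 'ماهي الغرامة عدم تقيد المشاة بالإشارات الخاصة بهم.'  # "What is the fine for pedestrians not obeying their signals."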

answer = query_rag(user_query)
print(answer)

حسناً. على أساس القائمة في سؤالك، الغرامة عدم تقيد المشاة بالإشارات الخاصة بهم هي: 150.


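As the outputs above show, `query_rag` does not work well: in the first answer the model says it cannot see the list, and neither answer is grounded in the dataset. So we try another approach.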
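
A likely cause is visible in `query_rag` itself: `relevant_context` is the literal string `"..."`, so nothing from the vector store ever reaches the prompt. A minimal sketch of the intended flow follows, with the retriever and the model passed in as plain functions so the example runs without any API. `query_rag_with_retrieval`, `fake_retrieve`, and `fake_generate` are hypothetical names, and the sample chunks paraphrase dataset entries in English. In the notebook, the retrieval step would presumably be something like `vectordb.similarity_search(question, k=3)`, with each result's `page_content` joined into the context.

```python
def query_rag_with_retrieval(question, retrieve, generate, k=3):
    """Retrieve k chunks for the question, build the context, and generate an answer."""
    chunks = retrieve(question, k)
    context = "\n".join(chunks)
    return generate(context=context, question=question)

# Stand-ins for the vector store contents (assumptions for illustration).
FAKE_CHUNKS = [
    "Violation: no valid insurance - Fine: 100 - 150 riyals",
    "Violation: misuse of the horn - Fine: 150 - 300 riyals",
]

def fake_retrieve(question, k):
    # Naive keyword overlap instead of an embedding search.
    words = set(question.lower().split())
    ranked = sorted(
        FAKE_CHUNKS,
        key=lambda chunk: len(words & set(chunk.lower().split())),
        reverse=True,
    )
    return ranked[:k]

def fake_generate(context, question):
    # Stub standing in for the LLM chain: echoes what the model would be given.
    return f"Q: {question}\nContext used:\n{context}"

answer = query_rag_with_retrieval(
    "fine for misuse of the horn", fake_retrieve, fake_generate, k=1
)
print(answer)
```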



---
## Alternative Approach: RetrievalQA

In [None]:
retriever = vectordb.as_retriever(search_kwargs={"k":3})

In [None]:
from langchain.chains import RetrievalQA

qa_chain = RetrievalQA.from_chain_type(
    llm = llm,
    chain_type = "stuff",
    retriever=retriever,
    return_source_documents=False,
)
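
With `chain_type="stuff"`, the retrieved documents are concatenated ("stuffed") into a single prompt that is sent to the model in one call. The sketch below illustrates the idea only; `stuff_documents` is a hypothetical helper, and the prompt wording is an assumption rather than LangChain's internal template.

```python
def stuff_documents(documents, question, separator="\n\n"):
    """Concatenate all retrieved documents into one prompt string."""
    context = separator.join(documents)
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}"
    )

docs = ["Document 1 text", "Document 2 text", "Document 3 text"]  # e.g. k=3 retrieved chunks
stuffed_prompt = stuff_documents(docs, "What is the fine?")
print(stuffed_prompt)
```

This approach works as long as the combined documents fit in the model's context window, which is unlikely to be a concern for three short entries.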

In [None]:
query = "ما هي الغرامة عدم تقيد المشاة بالإشارات الخاصة به"  # "What is the fine for pedestrians not obeying their signals"
result = qa_chain.invoke({"query": query})
print(f"Answer: {result['result']}")

Answer: الغرامة المالية لعدم تقيد المشاة بالإشارات الخاصة بهم هي 100-150 ريال.


In [None]:
query = "طريقة عمل البيتزا"  # "How to make pizza" (off-topic on purpose)
result = qa_chain.invoke({"query": query})
print(f"Answer: {result['result']}")

Answer: لا يوجد علاقة بين السؤال الخاص بك و السيارة التي ذكرتها في السياق الذي قدمته.


In [None]:
query = "هل هناك مخالفة على المركبة بدون لوحة وكم سعر المخالفة"  # "Is there a violation for a vehicle without a plate, and how much is the fine"
result = qa_chain.invoke({"query": query})
print(f"Answer: {result['result']}")

Answer: نعم، هناك مخالفة على المركبة بدون لوحة خلفية. المخالفة هي: سير المركبة بلا لوحة خلفية، أو بلا لوحات. الغرامة المالية لذلك هي 3000 - 6000 ريال.


In [None]:
query = "غرامة عدم حمل رخصة"  # "Fine for not carrying a license"
result = qa_chain.invoke({"query": query})
print(f"Answer: {result['result']}")

Answer: غرامة عدم حمل رخصة القيادة هي 1000 - 2000 ريال.


In [None]:
query = "غرامة عدم حمل رخصة"  # same query again; the wording of the answer varies between runs
result = qa_chain.invoke({"query": query})
print(f"Answer: {result['result']}")

Answer: بالتأكيد، وفقًا للبيانات المتاحة علي وقع المخالفة "قيادة المركبة قبل الحصول على رخصة قيادة أو في حال سحب الرخصة" الغرامة 1000 - 2000 ريال.


In [None]:
query = "حمل رخصة"  # "carrying a license" (deliberately vague)
result = qa_chain.invoke({"query": query})
print(f"Answer: {result['result']}")

Answer: هل أنت تريد سؤال عن الغرامة للعامل في موضوع رخصة السواقه؟

سيتم الإجابة على الذي ذكرت أو أى سؤال يتعلق بحقائب رخص شهر الشوال خاصنك، أو جبت لتقصر على شك  سؤالك غير مدروس


In [None]:
query = "غرامة إساءة استعمال منبة المركبة"  # "Fine for misusing the vehicle horn"
result = qa_chain.invoke({"query": query})
print(f"Answer: {result['result']}")

Answer: الغرامة المالية لهذه المخالفة هي 150 - 300 ريال.




---



## Step 10: Inference - Running Queries in the RAG System

In this final step, you will implement an inference pipeline to handle real-time queries. You will allow the system to retrieve the most relevant violations and fines based on a user's input and generate a response.

1. Inference Workflow:

  * The user inputs a query (e.g., "ماهي الغرامة على القيادة بدون رخصة؟", meaning "What is the fine for driving without a license?").
  * The system searches for the most relevant context from the traffic violation vector store.
  * It generates an answer and advice based on the context.

2. Goal:
  * Run the inference to answer questions based on the traffic violation dataset.
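
The workflow above can be sketched as a small loop that takes queries one at a time, skips blank input, and stops on an exit word. The answer function is passed in so the example runs without a model; `run_inference`, `stub_answer`, and the `"exit"` convention are assumptions for illustration.

```python
def run_inference(queries, answer_fn, exit_word="exit"):
    """Answer each query in turn; stop at exit_word and skip blank lines."""
    results = []
    for raw in queries:
        query = raw.strip()
        if query.lower() == exit_word:
            break
        if not query:
            continue
        results.append((query, answer_fn(query)))
    return results

def stub_answer(query):
    # Stub; in the notebook this role would be played by the RAG chain.
    return f"(answer for: {query})"

sample_queries = ["first question", "  ", "second question", "exit", "ignored"]
for query, answer in run_inference(sample_queries, stub_answer):
    print(query, "->", answer)
```

For interactive use, `queries` could be an iterator over `input()` calls, for example `iter(lambda: input("> "), "exit")`.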

Future Work:

1. **Expanding the Database:**
   - **Adding more data:** You may need to include more traffic violations and fines from multiple sources, especially from different countries or traffic laws, to make the system more comprehensive.
   - **Regular data updates:** Traffic laws and fines change over time, so the data should be updated regularly to ensure the accuracy of the responses.

2. **Improving the Quality of Numerical Representations (Embeddings):**
   - **Using stronger embedding models:** Retrieval accuracy can be improved by using stronger embedding models, such as OpenAI's embedding models or Arabic-capable sentence-embedding models from Hugging Face, which may better capture legal language and violation-specific terms.
   - **Fine-tuning models:** You can train the models specifically on traffic data to make them more capable of understanding questions related to violations and fines.

In [None]:
import pandas as pd

# Load the data (as already done above)
df = pd.read_csv('/content/Dataset.csv')

# Look up a violation by exact substring match
def search_violation(question):
    # Return the first violation whose full text appears in the question
    for index, row in df.iterrows():
        if row['المخالفة'] in question:
            return f"{row['المخالفة']} - {row['الغرامة']}"
    # Fallback message: "The question is not in the database. Please enter a question about violations."
    return "السؤال غير موجود في قاعدة البيانات. الرجاء إدخال سؤال يتعلق بالمخالفات."

# Test the search function
question = input("أدخل سؤالك هنا: ")  # "Enter your question here: "
answer = search_violation(question)
print(answer)
