# Traffic Violation RAG System
In this exam, you will implement a Retrieval-Augmented Generation (RAG) system that uses a language model and a vector database to answer questions about traffic violations. The goal is to ground each generated answer in a dataset of traffic violations and their fines.

Here are helpful resources:
* [LangChain](https://www.langchain.com/)
* [Groq Cloud documentation](https://console.groq.com/docs/models)
* [LangChain HuggingFace](https://python.langchain.com/docs/integrations/text_embedding/sentence_transformers/)
* [Chroma Vector Store](https://python.langchain.com/docs/integrations/vectorstores/chroma/)
* [Chroma Website](https://docs.trychroma.com/getting-started)
* [ChatGroq LangChain](https://python.langchain.com/docs/integrations/chat/groq/)
* [LLM Chain](https://api.python.langchain.com/en/latest/chains/langchain.chains.llm.LLMChain.html#langchain.chains.llm.LLMChain)

Dataset [source](https://www.moi.gov.sa/wps/portal/Home/sectors/publicsecurity/traffic/contents/!ut/p/z0/04_Sj9CPykssy0xPLMnMz0vMAfIjo8ziDTxNTDwMTYy83V0CTQ0cA71d_T1djI0MXA30gxOL9L30o_ArApqSmVVYGOWoH5Wcn1eSWlGiH1FSlJiWlpmsagBlKCQWqRrkJmbmqRqUZebngB2gUJAKdERJZmqxfkG2ezgAhzhSyw!!/)

Install the dependencies if needed:
```python
!pip install langchain_huggingface langchain langchain-community langchain_chroma chromadb langchain_groq
```

In [1]:
# !kaggle datasets download -d khaledzsa/dataset
# !unzip dataset.zip

## Step 1: Install Required Libraries

To begin, install the necessary libraries for this project. The libraries include `LangChain` for building language model chains, and `Chroma` for managing a vector database.

In [2]:
!pip install langchain_huggingface langchain langchain-community langchain_chroma chromadb langchain_groq

## Step 2: Load the Traffic Violations Dataset

You are provided with a dataset of traffic violations. Load the CSV file into a pandas DataFrame and preview the first few rows using `.head()`. You can also inspect the dataset's structure, for example with `.info()`.

In [3]:
import pandas as pd

In [4]:
df=pd.read_csv('Dataset.csv')
df.head()

| | المخالفة | الغرامة |
|---|---|---|
| 0 | قيادة المركبة في الأسواق التي لا يسمح بالقيادة... | الغرامة المالية 100 - 150 ريال |
| 1 | ترك المركبة مفتوحة وفي وضع التشغيل بعد مغادرتها. | الغرامة المالية 100 - 150 ريال |
| 2 | عدم وجود تأمين ساري للمركبة. | الغرامة المالية 100 - 150 ريال |
| 3 | عبور المشاة للطرق من غير الأماكن المخصصة لهم. | الغرامة المالية 100 - 150 ريال |
| 4 | عدم تقيد المشاة بالإشارات الخاصة بهم. | الغرامة المالية 100 - 150 ريال |


In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 104 entries, 0 to 103
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   المخالفة  104 non-null    object
 1   الغرامة   104 non-null    object
dtypes: object(2)
memory usage: 1.8+ KB


## Step 3: Create Markdown Content from the Dataset

For each traffic violation in the dataset, generate a markdown string that describes the violation and its fine. Loop over the rows of the DataFrame and collect the strings in a list. Each entry should follow this pattern, where `المخالفة` (the violation) and `الغرامة` (the fine) are the dataset's column names:

**المخالفة** - الغرامة
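
The notebook does not include a cell for this step, so the following is a minimal sketch of one way to do it. The helper names `to_markdown` and `build_markdown_texts` are hypothetical, and the single sample row is copied from the `df.head()` preview above so the block runs without the CSV file.

```python
def to_markdown(violation: str, fine: str) -> str:
    """Format one dataset row as '**violation** - fine'."""
    return f"**{violation.strip()}** - {fine.strip()}"


def build_markdown_texts(rows):
    """Turn an iterable of (violation, fine) pairs into a list of markdown strings."""
    return [to_markdown(violation, fine) for violation, fine in rows]


# One row taken from the df.head() preview, used here as stand-in data.
sample_rows = [
    ("عدم وجود تأمين ساري للمركبة.", "الغرامة المالية 100 - 150 ريال"),
]
sample_markdown = build_markdown_texts(sample_rows)
print(sample_markdown[0])
```

With the DataFrame loaded in Step 2, the same helper could be called as `build_markdown_texts(zip(df["المخالفة"], df["الغرامة"]))`.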

## Step 4: Chunk the Markdown Data

Using LangChain's `RecursiveCharacterTextSplitter`, split the markdown texts into smaller chunks that will be stored in the vector database.
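
To build intuition for the `chunk_size` and `chunk_overlap` parameters, here is a simplified fixed-window chunker. It is only an illustration: `RecursiveCharacterTextSplitter` splits on a hierarchy of separators rather than at fixed offsets, and the function name `sliding_chunks` is hypothetical.

```python
def sliding_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into windows of chunk_size characters that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window already reaches the end of the text.
        if start + chunk_size >= len(text):
            break
        start += step
    return chunks


print(sliding_chunks("abcdefghij", chunk_size=4, chunk_overlap=1))
# ['abcd', 'defg', 'ghij']
```

Each violation entry in this dataset is a single short sentence, so with a `chunk_size` of 1500 characters an entry should normally fit in one chunk.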

In [6]:
pip install langchain




In [7]:
from tqdm.notebook import tqdm
from langchain.docstore.document import Document as LangchainDocument


In [8]:
from langchain.text_splitter import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 1500,
    chunk_overlap = 150
)

In [9]:
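# Passing `df` itself iterates over its column names and yields only two
# documents ('المخالفة' and 'الغرامة'), so build one markdown string per row.
markdown_texts = [f"**{v}** - {f}" for v, f in zip(df["المخالفة"], df["الغرامة"])]
text_split = text_splitter.create_documents(markdown_texts)

In [10]:
text_split[:3]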

## Step 5: Generate Embeddings for the Documents

Generate embeddings for the chunks of text using HuggingFace's pre-trained Arabic language model. These embeddings will be stored in a `Chroma` vector store.
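
As a rough picture of what the vector store does with these embeddings, the sketch below ranks toy vectors by cosine similarity to a query vector. The three-dimensional vectors and their labels are invented for illustration, and cosine similarity is an assumption here: the metric Chroma actually uses depends on how the collection is configured.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k=2):
    """Return the names of the k vectors most similar to query_vec."""
    scored = sorted(
        ((cosine_similarity(query_vec, vec), name) for name, vec in doc_vecs.items()),
        reverse=True,
    )
    return [name for _, name in scored[:k]]


# Made-up embeddings standing in for real model output.
toy_vectors = {
    "parking": [1.0, 0.0, 0.1],
    "speeding": [0.0, 1.0, 0.2],
    "insurance": [0.1, 0.1, 1.0],
}
toy_query = [0.9, 0.3, 0.0]
print(top_k(toy_query, toy_vectors, k=2))
# ['parking', 'speeding']
```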

In [11]:
pip install langchain_huggingface



In [12]:
pip install -U langchain-community



In [13]:
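# Use the dedicated integration packages installed above instead of the
# deprecated `langchain.embeddings` and `langchain.vectorstores` import paths.
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_chroma import Chroma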




In [14]:
# emb = [len(tokenizer.encode(doc.page_content)) for doc in tqdm(text_split)]


In [15]:
import matplotlib.pyplot as plt

In [16]:
# fig = pd.Series(emb).hist()
# plt.title("Distribution of document lengths in the knowledge base (in count of tokens)")
# plt.show()

In [17]:
!pip install chromadb



In [22]:
embeddings = HuggingFaceEmbeddings(model_name="asafaya/bert-base-arabic")
db=Chroma.from_documents(text_split,embeddings)




## Step 6: Define the RAG Prompt Template

Define a custom prompt template in Arabic to retrieve traffic violation-related answers based on the context. Ensure the template greets the user first, states that the information provided could be incorrect, and advises the user to visit the traffic initiative website to verify. Additionally, provide the user with advice in Arabic, ensuring it stays within the given context.

In [23]:
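from langchain.prompts import PromptTemplate

# English gloss of the Arabic instruction below: "Answer everything related to
# Saudi traffic violations, using Saudi riyals. Stick to Arabic, and if asked
# about anything else, do not answer."
template = """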
جاوب على كل ما يتعلق بالمخالفات السعوديه بعملة الريال السعودي التزم باللغه العربيه واذا سالك عن اي شيء اخر لا تجاوبه

السؤال: {query}

السياق: {context}

الإجابة:
"""
QA_CHAIN_PROMPT = PromptTemplate(input_variables=["query", "context"], template=template)
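
The step description also asks for a greeting, a note that the information may be incorrect, a pointer to the traffic website, and context-bound advice. The sketch below shows one possible wording that covers those points. The Arabic text, the name `RAG_TEMPLATE`, and the helper `build_prompt` are illustrative assumptions, not the exam's reference answer; plain `str.format` is used so the block runs without LangChain.

```python
# Hypothetical template wording. In English: greet the user; answer in Arabic
# using only the context and give the fine in Saudi riyals; say you do not know
# if the context lacks the answer; warn that the information may be inaccurate
# and advise checking the official traffic website; end with short advice
# related to the context.
RAG_TEMPLATE = """\
ابدأ إجابتك بتحية المستخدم.
أجب عن السؤال باللغة العربية اعتماداً على السياق التالي فقط، واذكر الغرامة بالريال السعودي.
إذا لم تجد الإجابة في السياق فقل إنك لا تعرف.
نبّه المستخدم إلى أن المعلومات قد تكون غير دقيقة، وانصحه بزيارة الموقع الرسمي للمرور للتحقق منها.
اختم بنصيحة مرورية قصيرة مرتبطة بالسياق.

السياق: {context}

السؤال: {query}

الإجابة:
"""


def build_prompt(query: str, context: str) -> str:
    """Fill the two placeholders of the template."""
    return RAG_TEMPLATE.format(query=query, context=context)


print(build_prompt("ماهي عقوبة عدم وجود تأمين؟", "**عدم وجود تأمين ساري للمركبة.** - الغرامة المالية 100 - 150 ريال"))
```

The same string could be passed to `PromptTemplate(input_variables=["query", "context"], template=RAG_TEMPLATE)` in place of `template`.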

## Step 7: Initialize the Language Model

Initialize the language model using the Groq API. Set up the model with a specific configuration, including the API key, temperature setting, and model name.

https://console.groq.com/docs/quickstart

In [24]:
pip install groq




In [25]:
pip install langchain_groq



In [26]:
import os
from groq import Groq


In [27]:
import getpass
import os
# Prompt for the key instead of hard-coding it, so the secret never lands in the notebook.
os.environ["GROQ_API_KEY"] = getpass.getpass("Enter your Groq API key: ")

Enter your Groq API key: ··········


In [44]:
from langchain_groq import ChatGroq

# ChatGroq reads GROQ_API_KEY from the environment; never paste the key into the code.
chat_model = ChatGroq(
    model_name="llama3-8b-8192",
    temperature=0.1
)



## Step 8: Create the LLM Chain

Now, you will create an LLM Chain that combines the language model and the prompt template you defined. This chain will be used to generate responses based on the retrieved context.

https://api.python.langchain.com/en/latest/chains/langchain.chains.llm.LLMChain.html

In [45]:
from langchain.chains import LLMChain

llm_chain = LLMChain(
    llm=chat_model,
    prompt=QA_CHAIN_PROMPT
)

In [33]:
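# Smoke test of the chain with a placeholder context, so the answer is not yet
# grounded in the dataset. The query means "Stopping on railway tracks".
response = llm_chain.run({
    "query": "الوقوف على خطوط السكة الحديدية",
    "context": "النص الخاص بالغرامات المرورية "
})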


In [34]:
response

'الوقوف على خطوط السكة الحديدية هو مخالفة مرورية تتمتع بالغرامة المالية وفقاً لنص المادة 17 من قانون المرور السعودي، والتي تspecifies أن الغرامة لذلك المخالفة هي 300 ريال سعودي.'

## Step 9: Implement the Query Function

Create a function `query_rag` that will take a user query as input, retrieve relevant context from the vector store, and use the language model to generate a response based on that context.

In [36]:
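def query_rag(user_query, vectorstore, llm_chain):
    """Retrieve context for `user_query` and generate an answer with `llm_chain`."""
    # Fetch the 5 chunks most similar to the query.
    docs = vectorstore.similarity_search(user_query, k=5)
    # Join the retrieved chunks into a single context string.
    context = "\n".join(doc.page_content for doc in docs)
    response = llm_chain.run({
        "query": user_query,
        "context": context
    })

    return response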
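
To check the data flow of this function without calling the embedding model or the Groq API, it can be exercised with stand-in objects. This is a sketch under stated assumptions: `FakeDoc`, `FakeVectorStore`, and `EchoChain` are hypothetical test doubles that only mimic the `similarity_search` and `run` methods used above, and the function is repeated (with an extra `k` parameter) so the block runs on its own.

```python
from dataclasses import dataclass


@dataclass
class FakeDoc:
    """Stand-in for a LangChain document: only page_content is needed."""
    page_content: str


class FakeVectorStore:
    """Stand-in store that returns the first k texts regardless of the query."""

    def __init__(self, texts):
        self._texts = list(texts)

    def similarity_search(self, query, k=5):
        return [FakeDoc(text) for text in self._texts[:k]]


class EchoChain:
    """Stand-in chain that echoes its inputs instead of calling a model."""

    def run(self, inputs):
        return f"QUERY: {inputs['query']}\nCONTEXT:\n{inputs['context']}"


def query_rag(user_query, vectorstore, llm_chain, k=5):
    docs = vectorstore.similarity_search(user_query, k=k)
    context = "\n".join(doc.page_content for doc in docs)
    return llm_chain.run({"query": user_query, "context": context})


store = FakeVectorStore(["fine A", "fine B", "fine C"])
print(query_rag("example question", store, EchoChain(), k=2))
```

Because the function depends only on those two methods, the Chroma store and the real chain can be swapped in without changing it.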

In [41]:
!pip install faiss-cpu

Collecting faiss-cpu
  Downloading faiss_cpu-1.8.0.post1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (3.7 kB)
Downloading faiss_cpu-1.8.0.post1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (27.0 MB)
Installing collected packages: faiss-cpu
Successfully installed faiss-cpu-1.8.0.post1


In [42]:
from langchain_community.vectorstores import FAISS
vectorstore = FAISS.from_documents(text_split, embeddings)

## Step 10: Inference - Running Queries in the RAG System

In this final step, you will implement an inference pipeline to handle real-time queries. You will allow the system to retrieve the most relevant violations and fines based on a user's input and generate a response.

1. Inference Workflow:

  * The user inputs a query (e.g., "ماهي عقوبة عدم الوقوف وقوفاً تاماً عند إشارة؟", which means "What is the penalty for not coming to a complete stop at a signal?").
  * The system searches for the most relevant context from the traffic violation vector store.
  * It generates an answer and advice based on the context.

2. Goal:
  * Run the inference to answer questions based on the traffic violation dataset.

In [46]:
user_query = "ماهي عقوبة عدم الوقوف وقوفاً تاماً عند إشارة؟"
response = query_rag(user_query, db, llm_chain)
print(response)



عقوبة عدم الوقوف وقوفاً تاماً عند إشارة هو غرامة مالية تتراوح بين 50 ريالاً سعودياً و200 ريالاً سعودياً.
