# AI Mufti (مفتي الذكاء الاصطناعي)

# Prepare Data

## Convert the data to question-and-answer format

In [None]:
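# Read the raw file, where questions and answers alternate line by line
with open('full_data.txt', 'r', encoding='utf-8') as file:
    lines = file.readlines()

# Label even-indexed lines "السائل" (questioner) and odd-indexed lines "الجواب" (answer)
labeled = []
for i, line in enumerate(lines):
    prefix = "السائل: " if i % 2 == 0 else "الجواب: "
    labeled.append(prefix + line.strip() + "\n")

# Overwrite the file with the labeled text
with open('full_data.txt', 'w', encoding='utf-8') as output_file:
    output_file.write("".join(labeled))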
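
The alternating labeling can also be written as a small pure function, which makes it easy to check on a few lines before overwriting `full_data.txt`. This is only a minimal sketch: the name `label_qa_lines` is hypothetical and is not part of the notebook.

```python
def label_qa_lines(lines):
    """Prefix even-indexed lines as questions and odd-indexed lines as answers."""
    labeled = []
    for i, line in enumerate(lines):
        prefix = "السائل: " if i % 2 == 0 else "الجواب: "
        labeled.append(prefix + line.strip())
    # End every line, including the last one, with a newline
    return ("\n".join(labeled) + "\n") if labeled else ""


# Made-up sample lines, not taken from the dataset
sample = ["What is X?\n", "X is Y.\n"]
print(label_qa_lines(sample))
```

Keeping the transformation separate from the file I/O means the original file is touched only after the output looks right.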


## Split the big file into small files

In [None]:
filename = 'full_data.txt'
lines_per_file = 6  # each output file holds six lines (three question/answer pairs)

with open(filename, 'r', encoding='utf-8') as file:
    lines = file.readlines()

# Write each consecutive group of lines to a numbered file: 1.txt, 2.txt, ...
for file_number, start in enumerate(range(0, len(lines), lines_per_file), start=1):
    with open(f"{file_number}.txt", 'w', encoding='utf-8') as output_file:
        output_file.writelines(lines[start:start + lines_per_file])
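
The grouping logic can be checked on its own, without touching the file system. The sketch below uses a hypothetical helper, `split_into_groups`, and made-up lines to show how a line count that is not a multiple of six leaves a shorter final group.

```python
def split_into_groups(lines, group_size=6):
    """Return consecutive groups of at most `group_size` lines."""
    return [lines[i:i + group_size] for i in range(0, len(lines), group_size)]


# 14 made-up lines produce two full groups and one short group
groups = split_into_groups([f"line {n}\n" for n in range(14)])
print([len(g) for g in groups])  # [6, 6, 2]
```

Because questions and answers alternate, a group size of six keeps three complete question/answer pairs together in each file.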


## Number of Files After Splitting

In [1]:
import os
folder = './data_fatwa/'
num_files = len(os.listdir(folder))
print(f"The folder '{folder}' contains {num_files} files.")

The folder './data_fatwa/' contains 106 files.


## Details of the full dataset

In [4]:
import os
filename = './full_data.txt'
if os.path.isfile(filename):
    with open(filename, 'r', encoding='utf-8') as file:
        contents = file.read()
        num_lines = contents.count('\n') + 1 
        num_words = len(contents.split())
        num_chars = len(contents)
    print(f"File '{filename}' contains the Fekh Al Shafay dataset, which is a collection of religious texts.")
    print(f"The file contains {num_lines} lines, {num_words} words, and {num_chars} characters.")
else:
    print(f"File '{filename}' does not exist.")


File './full_data.txt' contains the Fekh Al Shafay dataset, which is a collection of religious texts.
The file contains 629 lines, 44016 words, and 235519 characters.
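
The line count above comes from `contents.count('\n') + 1`, which reports one extra line whenever the text ends with a newline. Since the formatting step ends every line with `\n`, the figure may be one higher than the number of text lines. A small comparison on a made-up string (not the real dataset):

```python
text = "first\nsecond\nthird\n"  # made-up sample that ends with a newline

count_by_newlines = text.count("\n") + 1       # counts the empty tail as a line
count_by_splitlines = len(text.splitlines())   # counts only the actual lines

print(count_by_newlines, count_by_splitlines)  # 4 3
```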


# Installing libraries

## 1- llama-index
- LlamaIndex (GPT Index) is a project that provides a central interface for connecting your LLMs to external data.

## Proposed Solution
- LlamaIndex is a simple, flexible interface between your external data and LLMs.
- It offers a comprehensive toolset for trading off cost and performance.

## Source
[Click here to visit GitHub](https://github.com/jerryjliu/llama_index)

In [None]:
!pip install llama-index

## 2- langchain
- LangChain is a framework for developing applications powered by language models. Its authors believe that the most powerful and differentiated applications will do more than call out to a language model via an API.

## Source
[Click here to visit the documentation](https://python.langchain.com/en/latest/index.html)

In [None]:
!pip install langchain

# Importing libraries 

## Faiss

Faiss (Facebook AI Similarity Search) is a library that provides fast algorithms for similarity search and clustering of high-dimensional dense vectors; the `faiss` Python package is its wrapper.

The cell below imports `llama_index.readers.faiss`, the `llama_index` reader module built on Faiss. In that setting documents are represented as dense vectors in a high-dimensional space, which allows efficient similarity search using exact k-nearest neighbors or approximate nearest neighbors. The rest of this notebook does not call this module.
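
To make the idea of similarity search over dense vectors concrete, here is a minimal brute-force sketch in plain Python. It is not Faiss and does not use its API; the function names and the two-dimensional vectors are made up for illustration. Libraries such as Faiss exist because this exhaustive comparison becomes too slow for large collections.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query, vectors, k=2):
    """Return the indices of the k vectors most similar to `query`."""
    scored = [(cosine_similarity(query, v), i) for i, v in enumerate(vectors)]
    scored.sort(reverse=True)  # highest similarity first
    return [i for _, i in scored[:k]]


# Made-up document vectors
docs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
print(top_k([1.0, 0.0], docs, k=2))  # [0, 1]
```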

In [23]:
from llama_index.readers import faiss



# Explanation of llama_index package

The `llama_index` package provides several useful classes and modules for working with text data and language models. Here's a brief overview of some of the main components:

- `SimpleDirectoryReader`: reads the files in a directory and returns their contents as a list of `Document` objects.
- `GPTListIndex`: an index that stores document chunks as a sequential list; by default a query walks through the chunks in order to build the answer.
- `readers`: a module that contains the data loader classes (readers) for various sources.
- `GPTSimpleVectorIndex`: an index that embeds each document chunk as a vector and keeps the vectors in memory. At query time it retrieves the chunks most similar to the query and passes them to the LLM.
- `LLMPredictor`: a wrapper around a LangChain LLM that the index uses to generate text from prompts.
- `PromptHelper`: a helper that keeps prompts within the model's token limits, for example by deciding how text is split into chunks.
- `ServiceContext`: a container that bundles the shared resources used for indexing and querying, such as the `LLMPredictor` and the `PromptHelper`.

Only `SimpleDirectoryReader`, `GPTSimpleVectorIndex`, `LLMPredictor`, `PromptHelper`, and `ServiceContext` are used in the rest of this notebook; `GPTListIndex` and `readers` are imported but never called.


In [24]:
from llama_index import SimpleDirectoryReader, GPTListIndex, readers, GPTSimpleVectorIndex, LLMPredictor, PromptHelper, ServiceContext

In [25]:
from langchain import OpenAI

In [26]:
import sys

In [27]:
import os

In [28]:
from IPython.display import Markdown, display

# Construct_index
## Used to index the data
### Function that takes the folder of split files and converts it into an index file

In [None]:
def construct_index(directory_path):
    # maximum number of tokens the LLM accepts as input (its context window)
    max_input_size = 4096
    # number of tokens reserved for the LLM's output
    # (note: 4000 of 4096 leaves only 96 tokens for the prompt and context)
    num_outputs = 4000
    # maximum overlap, in tokens, between consecutive chunks
    max_chunk_overlap = 40
    # maximum size, in tokens, of each chunk
    chunk_size_limit = 2000

    # create a PromptHelper that keeps prompts and chunks within the limits above
    prompt_helper = PromptHelper(max_input_size, num_outputs, max_chunk_overlap, chunk_size_limit=chunk_size_limit)

    # create an LLMPredictor that uses the gpt-3.5-turbo model to generate answers
    # (embeddings are produced separately, by the service context's default embedding model)
    llm_predictor = LLMPredictor(llm=OpenAI(temperature=0.4, model_name="gpt-3.5-turbo", max_tokens=num_outputs))

    # read the text files from the specified directory into a list of Document objects
    documents = SimpleDirectoryReader(directory_path).load_data()

    # bundle the predictor and prompt helper into a ServiceContext shared by the index
    service_context = ServiceContext.from_defaults(llm_predictor=llm_predictor, prompt_helper=prompt_helper)

    # build a GPTSimpleVectorIndex from the documents using the service context
    index = GPTSimpleVectorIndex.from_documents(documents, service_context=service_context)

    # save the index to a file named 'index.json'
    index.save_to_disk('index.json')

    # return the index object
    return index
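
The chunk size and chunk overlap settings control how a long text is cut into pieces that share a little context at their boundaries. The sketch below illustrates that idea on a list of words. It is an assumption-laden illustration, not how LlamaIndex itself splits text, and `chunk_with_overlap` is a hypothetical name.

```python
def chunk_with_overlap(tokens, chunk_size, overlap):
    """Split `tokens` into chunks of `chunk_size` items, each sharing
    `overlap` items with the previous chunk."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        # Stop once a chunk has reached the end of the input
        if start + chunk_size >= len(tokens):
            break
    return chunks


words = "a b c d e f g h".split()
print(chunk_with_overlap(words, chunk_size=4, overlap=1))
# [['a', 'b', 'c', 'd'], ['d', 'e', 'f', 'g'], ['g', 'h']]
```

The overlap means a sentence that straddles a boundary still appears, at least partly, in both neighboring chunks.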


## Set the API key

In [29]:
from getpass import getpass  # prompt for the key instead of hard-coding it
os.environ["OPENAI_API_KEY"] = getpass("OpenAI API key: ")

# Calling the function

In [None]:
construct_index("./data_fatwa/")

# Function that takes a question and returns the answer

In [32]:
def ask_ai():
    index = GPTSimpleVectorIndex.load_from_disk('index.json')
    query = input("What do you want to ask? ")
    response = index.query(query)
    display(Markdown(f"Response: <b>{response.response}</b>"))

In [44]:
def ask_ai_2():
    index = GPTSimpleVectorIndex.load_from_disk('index.json')
    query = input("What do you want to ask? ")
    response = index.query(query)
    
    # split long answers into pieces of at most this many characters (not tokens)
    max_context_length = 4069
    response_chunks = [response.response[i:i+max_context_length] for i in range(0, len(response.response), max_context_length)]
    for i, chunk in enumerate(response_chunks):
        display(Markdown(f"Response part {i+1}: <b>{chunk}</b>"))


# Testing

In [33]:
ask_ai()

What do you want to ask? السلام عليكم


INFO:llama_index.token_counter.token_counter:> [query] Total LLM token usage: 2238 tokens
INFO:llama_index.token_counter.token_counter:> [query] Total embedding token usage: 12 tokens


Response: <b>
الإجابة: السلام عليكم ورحمة الله وبركاته.</b>

In [34]:
ask_ai()

What do you want to ask? ماهي مشروعية الصلاة


INFO:llama_index.token_counter.token_counter:> [query] Total LLM token usage: 1531 tokens
INFO:llama_index.token_counter.token_counter:> [query] Total embedding token usage: 18 tokens


Response: <b>
مشروعية الصلاة هي التزام المسلمين بأداء الصلاة المكتوبة في القرآن والسنة النبوية، والتي يجب عليهم الإجتهاد لأدائها في الوقت المحدد وبشكل صحيح.</b>

In [35]:
ask_ai()

What do you want to ask? مشروعية صلاة الجمعة


INFO:llama_index.token_counter.token_counter:> [query] Total LLM token usage: 2410 tokens
INFO:llama_index.token_counter.token_counter:> [query] Total embedding token usage: 19 tokens


Response: <b>
صلاة الجمعة هي عبادة مشروعة في الإسلام، وهي واجبة على كل مسلم. وتحتوي على العديد من الحكم والفوائد، مثل التجمع الشعبي والتضامن والتعاون والتنبيه للأحداث التي تحدث في المجتمع. وتشتمل على شروط وجوب صلاة الجمعة، مثل الإسلام والبلوغ والعقل والحرية الكاملة والذكورة والصحة ال</b>

In [37]:
ask_ai()

What do you want to ask? من انت


INFO:llama_index.token_counter.token_counter:> [query] Total LLM token usage: 448 tokens
INFO:llama_index.token_counter.token_counter:> [query] Total embedding token usage: 6 tokens


Response: <b>
أنا مفتي الذكاء الأصطناعي تم تصميمي في كلية العلوم جامعة الأزهر 2023.</b>

In [38]:
ask_ai()

What do you want to ask? من هو نجيب محفوظ


INFO:llama_index.token_counter.token_counter:> [query] Total LLM token usage: 452 tokens
INFO:llama_index.token_counter.token_counter:> [query] Total embedding token usage: 19 tokens


Response: <b>
لا يمكنني الإجابة على هذا السؤال لأنه ليس في سياق عملي.</b>

In [39]:
ask_ai()

What do you want to ask? real madrid


INFO:llama_index.token_counter.token_counter:> [query] Total LLM token usage: 424 tokens
INFO:llama_index.token_counter.token_counter:> [query] Total embedding token usage: 3 tokens


Response: <b>
لا يمكنني الإجابة، ذلك غير سياق عملي لذا.</b>

In [47]:
ask_ai_2()

What do you want to ask? ما حد السارقة


INFO:llama_index.token_counter.token_counter:> [query] Total LLM token usage: 1465 tokens
INFO:llama_index.token_counter.token_counter:> [query] Total embedding token usage: 12 tokens


Response part 1: <b>
الإجابة: حد السارقة هو الإعدام، وهو العقاب الأقصى الذي يحصل على الشخص الذي يرتكب جريمة السرقة.</b>

In [49]:
ask_ai_2()

What do you want to ask? فترة المسج على الجبيرة


INFO:llama_index.token_counter.token_counter:> [query] Total LLM token usage: 1080 tokens
INFO:llama_index.token_counter.token_counter:> [query] Total embedding token usage: 22 tokens


Response part 1: <b>
ليس للمسح على الجبيرة أو العصابة فترة معينة، بل يظل يمسح عليها ما دام العذر موجوداً، فإذا زال العذرـ بأن اندمل الجرح، وانجبر الكسر ـ بطل المسح ووجب الغسل.</b>

In [50]:
ask_ai_2()

What do you want to ask? ما وقت صلاة الترويح


INFO:llama_index.token_counter.token_counter:> [query] Total LLM token usage: 524 tokens
INFO:llama_index.token_counter.token_counter:> [query] Total embedding token usage: 18 tokens


Response part 1: <b>
وقت صلاة الترويح هو بين صلاة العشاء وصلاة الفجر، وتصلى قبل الوتر.</b>