<a href="https://colab.research.google.com/github/orekhovsky/GenAI-mini-projects/blob/main/simple_RAG_with_Gigachat.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Без векторной бд

In [None]:
import pandas as pd

df = pd.read_parquet("hf://datasets/rag-datasets/rag-mini-bioasq/data/passages.parquet/part.0.parquet")
df_test = pd.read_parquet("hf://datasets/rag-datasets/rag-mini-bioasq/data/test.parquet/part.0.parquet")

import ast

# преобразование столбца relevant_passage_ids из строки в список
df_test['relevant_passage_ids'] = df_test['relevant_passage_ids'].apply(ast.literal_eval)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


In [None]:
pip install gigachat, transformers

In [None]:
pip install -U langchain-community

In [None]:
# функция для формирования контекста по вопросу из relevant_passage_ids
def perform_rag(question, df, df_test):

    matching_rows = df_test[df_test['question'] == question]

    # Получение индекса 'id' вместо стандартного числового индекса
    question_index = matching_rows.index[0]  # Здесь index вернёт значение из индекса 'id'
    print(f"Question ID: {question_index}")

    # Получение релевантных чанков из корпуса текстов
    relevant_passage_ids = df_test.loc[question_index, 'relevant_passage_ids']
    relevant_passages = [df.loc[i, 'passage'] for i in relevant_passage_ids]
    print(relevant_passage_ids)
    # Формирование входных данных для модели
    context = " ".join(relevant_passages)

    return context

In [None]:
from google.colab import userdata
from langchain.chat_models.gigachat import GigaChat


auth = userdata.get('SBER_AUTH')


llm = GigaChat(
    credentials=auth,
    model='GigaChat:latest',
    verify_ssl_certs=False,
    profanity_check=False
)


q1 = 'Is Hirschsprung disease a mendelian or a multifactorial disorder?'
context = perform_rag(q1, df, df_test)


input_text = f"""Answer the user's question.
Use only the information from the context. If the context does not contain enough information to answer the question, let the user know.
Context: {context}
Question: {q1}
Answer:"""


response = llm.predict(input_text)


print(f"Question: {q1}")
print(f"Answer: {response}")


Question ID: 0
[20598273, 6650562, 15829955, 15617541, 23001136, 8896569, 21995290, 12239580, 15858239]
Question: Is Hirschsprung disease a mendelian or a multifactorial disorder?
Answer: Hirschsprung disease (HSCR) is considered to be a **multifactorial** disorder rather than a Mendelian disorder. This means that the disease results from the combined effects of multiple genetic factors and possibly environmental influences. While there are specific genes like RET and EDNRB that play key roles in the development of HSCR, the disease often exhibits variable expressivity and incomplete penetrance, which complicates its inheritance pattern. Additionally, many cases do not follow a clear Mendelian inheritance pattern, further supporting the multifactorial nature of the disease.


In [None]:
df_test['answer'][0]

"Coding sequence mutations in RET, GDNF, EDNRB, EDN3, and SOX10 are involved in the development of Hirschsprung disease. The majority of these genes was shown to be related to Mendelian syndromic forms of Hirschsprung's disease, whereas the non-Mendelian inheritance of sporadic non-syndromic Hirschsprung disease proved to be complex; involvement of multiple loci was demonstrated in a multiplicative model."

In [None]:
pip install rouge-score

In [None]:
from rouge_score import rouge_scorer
true_answer = df_test['answer'][0]

scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)
scores = scorer.score(true_answer, response)

# метрики
print("ROUGE-1:", scores['rouge1'])
print("ROUGE-2:", scores['rouge2'])
print("ROUGE-L:", scores['rougeL'])


ROUGE-1: Score(precision=0.2857142857142857, recall=0.39344262295081966, fmeasure=0.3310344827586207)
ROUGE-2: Score(precision=0.08433734939759036, recall=0.11666666666666667, fmeasure=0.0979020979020979)
ROUGE-L: Score(precision=0.14285714285714285, recall=0.19672131147540983, fmeasure=0.16551724137931034)


## Эксперименты с другими моделями

In [4]:
from huggingface_hub import InferenceClient

client = InferenceClient(api_key="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")

messages = [
	{
		"role": "user",
		"content": "What is the capital of France?"
	}
]

completion = client.chat.completions.create(
    model="meta-llama/Llama-3.3-70B-Instruct",
	messages=messages,
	max_tokens=500
)

print(completion.choices[0].message)

BadRequestError: (Request ID: wyl1DdCWrhvz91Qnqktcp)

Bad request:
Model requires a Pro subscription; check out hf.co/pricing to learn more. Make sure to include your HF token in your query.