<a href="https://colab.research.google.com/github/malinphy/os_llms_colab/blob/main/small_llm_3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [2]:
!pip install git+https://github.com/huggingface/transformers
!pip install --upgrade requests torch einops accelerate bitsandbytes
!pip install faiss-cpu -q
!pip install langchain -q
!pip install sentence-transformers -q

Collecting git+https://github.com/huggingface/transformers
  Cloning https://github.com/huggingface/transformers to /tmp/pip-req-build-trxw1s_6
  Running command git clone --filter=blob:none --quiet https://github.com/huggingface/transformers /tmp/pip-req-build-trxw1s_6
  Resolved https://github.com/huggingface/transformers to commit 1be0145d6a045652c075fd5965d1e394cdb17654
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done


In [3]:
# import os
# os.listdir('drive/MyDrive/QQ_PROJECTS/turkish_pro_OS/data/data/vector_stores')

In [4]:
import textwrap
from qa import questions
from langchain.vectorstores import FAISS
from langchain.embeddings import HuggingFaceEmbeddings
import pandas as pd
import numpy as np
from langchain.prompts import PromptTemplate

In [5]:
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, MarianMTModel, MarianTokenizer
# from nltk.tokenize import sent_tokenize
tokenizer_eng2tr = AutoTokenizer.from_pretrained("Helsinki-NLP/opus-mt-tc-big-en-tr")
model_eng2tr = AutoModelForSeq2SeqLM.from_pretrained("Helsinki-NLP/opus-mt-tc-big-en-tr")



In [6]:
# Load the Turkish -> English model once; reloading it on every call is slow.
model_name_tr2eng = "Helsinki-NLP/opus-mt-tr-en"
tokenizer_tr2eng = AutoTokenizer.from_pretrained(model_name_tr2eng)
model_tr2eng = MarianMTModel.from_pretrained(model_name_tr2eng)

def tr2eng(input_text):
    """Translate a single Turkish string to English."""
    batch = tokenizer_tr2eng([input_text], return_tensors="pt")
    generated_ids = model_tr2eng.generate(**batch)
    return tokenizer_tr2eng.batch_decode(generated_ids, skip_special_tokens=True)[0]

In [7]:
def eng2tr(english_text):
    """Translate an English string (or list of strings) to Turkish.

    Reuses tokenizer_eng2tr and model_eng2tr loaded in cell 5 instead of
    reloading the model on every call. Returns a list of translations.
    """
    batch = tokenizer_eng2tr(english_text, return_tensors="pt", padding=True)
    translated = model_eng2tr.generate(**batch)
    return [tokenizer_eng2tr.decode(t, skip_special_tokens=True) for t in translated]

In [8]:
embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
db = FAISS.load_local('drive/MyDrive/QQ_PROJECTS/turkish_pro_OS/data/data/vector_stores/translated_faiss_index_helsinki',
                      embeddings = embeddings)

In [9]:
# `questions` maps a key to a Turkish question; translate each question to English.
english_questions = [tr2eng(questions[i]) for i in questions]
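Translating one question per call leaves the tokenizer's padding support unused. The following is a minimal sketch of batched translation, not the notebook's own method: `batched` and `translate_batched` are hypothetical helper names, and `tokenizer` and `model` are assumed to be a matching Marian pair such as the `Helsinki-NLP/opus-mt-tr-en` checkpoint used in `tr2eng`. The batch size of 8 is an arbitrary choice.

```python
def batched(items, size):
    """Yield consecutive slices of `items` with at most `size` elements each."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def translate_batched(texts, tokenizer, model, batch_size=8):
    """Translate `texts` in padded batches instead of one call per string.

    Assumption: `tokenizer` and `model` behave like a Hugging Face
    Marian tokenizer/model pair loaded from the same checkpoint.
    """
    translations = []
    for chunk in batched(list(texts), batch_size):
        # padding=True pads to the longest string in the chunk;
        # truncation=True cuts inputs that exceed the model's maximum length.
        batch = tokenizer(chunk, return_tensors="pt", padding=True, truncation=True)
        generated_ids = model.generate(**batch)
        translations.extend(
            tokenizer.batch_decode(generated_ids, skip_special_tokens=True)
        )
    return translations
```

Given a tokenizer and model loaded from the Turkish-to-English checkpoint, `translate_batched(list(questions.values()), tokenizer, model)` would return the English questions in the same order.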



In [10]:
turkish_questions = [questions[i] for i in questions]
turkish_questions

['İşçi sözleşme süresinin bitmesinden önce yahut bildirim süresine uymaksızın işini bırakıp başka bir işverenin işine girerse sözleşmenin bu suretle feshinden ötürü, işçinin sorumluluğu yanında, yeni işveren hangi hallerde işçi ile birlikte sorumludur.',
 'İşveren tarafından Sözleşmenin feshinde usül nedir? Nasıl olmalıdır.',
 'İşverence fesih hakkının kötüye kullanılarak sona erdirildiği durumlarda işçiye ödenecek tazminat nasıl hesaplanır? Bu tazminat hesaplamasında hangi menfaatler de göz önünde tutulur?',
 'Çalışma Koşullarında esaslı değişiklik nedir? İşveren bu değişikliği nasıl yapar, işçinin kabul etmeme usulü ve işçinin kabul etmemesi durumunda işverenin yapması gerekenler nelerdir',
 'İş Sözleşmesi feshinin geçersizliğine karar verildiğinde işveren, işçiyi ne kadar süre içinde işe başlatmak zorundadır. İşveren, İşçiyi başvurusu üzerine hangi süre içinde işe başlatmaz ise, ne kadar tazminat ödemekle yükümlü olur. Kararın kesinleşmesine kadar çalıştırılmadığı süre için işçiye n

In [11]:
questions_df = pd.DataFrame({'turkish_questions':turkish_questions,
              'english_questions':english_questions})

questions_df.head(2)

                                   turkish_questions                                  english_questions
0  İşçi sözleşme süresinin bitmesinden önce yahut...  If the employee leaves his job before the term...
1  İşveren tarafından Sözleşmenin feshinde usül n...  What is the procedure of dissolution of the co...


In [12]:
query = questions_df['english_questions'][0]

In [13]:
def sim_results(query, k):
    """Return the page contents and metadata of the k chunks most similar to `query`."""
    docs = db.similarity_search(query, k=k)
    content = [doc.page_content for doc in docs]
    meta = [doc.metadata for doc in docs]
    return content, meta
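Conceptually, `similarity_search` embeds the query and returns the stored chunks whose vectors lie closest to it. The sketch below illustrates that idea in plain Python with an exhaustive scan, assuming Euclidean (L2) distance; it is not how FAISS is implemented, and the toy vectors and texts are made up for illustration.

```python
import heapq
import math


def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def top_k(query_vec, doc_vecs, docs, k):
    """Return the k documents whose vectors are nearest to `query_vec`."""
    scored = ((l2_distance(query_vec, vec), doc) for vec, doc in zip(doc_vecs, docs))
    nearest = heapq.nsmallest(k, scored, key=lambda pair: pair[0])
    return [doc for _, doc in nearest]


# Toy 2-dimensional "embeddings" (hypothetical values).
doc_vecs = [[0.0, 1.0], [1.0, 0.0], [0.9, 0.1]]
docs = ["notice periods", "night work", "shift work"]
print(top_k([1.0, 0.0], doc_vecs, docs, k=2))
```

In the notebook the vectors come from `all-MiniLM-L6-v2` and the index is loaded from disk, so only the ranking idea carries over.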

In [14]:
page_content, metadata = sim_results(query,10)

In [15]:
page_content

["Article 23 - The responsibility of the new employer is the responsibility of the employee, as well as the new employer is responsible for the dissolution of the contract if the employee's behavior is caused by the employee's new employer before the end of the contract or if the contract expires before its deadline or if the contract expires without its notice.",
 'During the probationary period, the parties can terminate the contract of employment without notice and without compensation.',
 'Even though the term specified in the contract expires, the temporary employment relationship is under way, as of the end of the contract of employment between the employer and the worker who employs a temporary worker.',
 'The contract of employment with the temporary worker states that if the worker is not called to work within a period of time, he can terminate the contract of employment for the right reason.',
 'If not, the termination of the employer is a valid dissolution, and the employer 

In [16]:
multiple_input_prompt = PromptTemplate(
    input_variables=["Inputs", "Question"],
    # A single-line template keeps source indentation out of the prompt.
    template="{Inputs}\nQUESTION :{Question}.",
)
# multiple_input_prompt.format(Inputs=" ".join(page_content), Question=query)
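To see what the model receives, the prompt assembly can be reproduced with plain `str.format` as a stand-in for `PromptTemplate`. This is a minimal sketch: `build_prompt` is a hypothetical helper, the template is a version of the one above without leading indentation, and the passages and question are made up.

```python
TEMPLATE = "{Inputs}\nQUESTION :{Question}."


def build_prompt(passages, question, template=TEMPLATE):
    """Join the retrieved passages with spaces and fill in the template."""
    return template.format(Inputs=" ".join(passages), Question=question)


# Hypothetical passages and question, for illustration only.
prompt = build_prompt(["Passage one.", "Passage two."], "What is night work?")
print(prompt)
```

Printing the prompt makes it easy to spot details such as the trailing period that the template appends after the question.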

## OS results

In [17]:
# db_os = FAISS.load_local('drive/MyDrive/QQ_PROJECTS/turkish_pro_OS/data/data/vector_stores/faiss_index_MCdocs_en_500_30_translated_nllb-200-distilled-600M_embedded_all-MiniLM-L6-v2',
#                       embeddings = embeddings)

# page_contents_os = db_os.similarity_search('What is Nightwork?', k=10)
# page_contents_os_total = []
# for i in page_contents_os:
#     page_contents_os_total.append(i.page_content)

# page_contents_os_total

## LLM

In [18]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
from transformers import pipeline
import torch

checkpoint = "MBZUAI/LaMini-Flan-T5-783M"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
base_model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint,
                                             device_map='auto',
                                             torch_dtype=torch.float32)

pipe = pipeline('text2text-generation',
                 model = base_model,
                 tokenizer = tokenizer,
                 max_length = 1024,
                 do_sample=True,
                 temperature=0.1,
                 top_p=0.95,
                 )
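With `do_sample=True` and `temperature=0.1`, sampling is close to greedy decoding, because dividing the logits by a small temperature sharpens the distribution before the softmax. The sketch below shows the effect on made-up logits; `softmax_with_temperature` is a hypothetical helper, not part of the `transformers` API.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Softmax over logits divided by `temperature` (must be positive)."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.5]  # made-up scores for three candidate tokens
print(softmax_with_temperature(logits, 1.0))
print(softmax_with_temperature(logits, 0.1))
```

At temperature 0.1 almost all probability mass lands on the highest-scoring token, so repeated runs should produce nearly identical answers.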

In [19]:
# multiple_input_prompt = PromptTemplate(
#     input_variables=["Inputs", "Question"],
#     template="{Inputs} {Question}."
# )
# multiple_input_prompt.format(Inputs=" ".join(page_contents[1:]), Question=english_questions[11])

In [20]:
# english_questions[0]

In [21]:
# " ".join(page_contents)

In [22]:
# %%time
# import textwrap
# response = ''
# instruction = multiple_input_prompt.format(Inputs=" ".join(page_content), Question=query)

# # instruction = multiple_input_prompt.format(Inputs=" ".join(page_contents), Question=english_questions[0])
# generated_text = pipe(instruction)
# for text in generated_text:
#   response += text['generated_text']
# wrapped_text = textwrap.fill(response, width=1200)
# print(wrapped_text)



In [23]:
# input_prompt = multiple_input_prompt.format(Inputs=" ".join(page_content), Question=query)
# print(input_prompt)

In [24]:
def llm_cpu(prompt):
    """Run `prompt` through the LaMini pipeline, print the answer, and return it."""
    generated_text = pipe(prompt)
    response = ''.join(text['generated_text'] for text in generated_text)
    wrapped_text = textwrap.fill(response, width=1200)
    print(wrapped_text)
    return wrapped_text

In [25]:
# llm_cpu(input_prompt)

In [26]:
total_contents = []
total_metadatas = []
answers = []
for i in range(len(questions_df)):
    page_contents , metadatas = sim_results(questions_df['english_questions'][i], k = 10)
    total_contents.append(page_contents)
    total_metadatas.append(metadatas)
    input_prompt = multiple_input_prompt.format(Inputs=" ".join(page_contents), Question=questions_df['english_questions'][i])
    answers.append(llm_cpu(input_prompt))

Yes.
The procedure of dissolution of the contract by the employer is to make a written declaration of dissolution and make the reason for the dissolution clear and clear. The employer must also make sure that the notice of dissolution is provided to the employee.
The compensation for the worker when the right to dissolution has been misused by the employer is immediately dismissed by the employer or journalist to the other side. The amount of compensation which corresponds to the amount of notice period in writing, has been immediately dismissed by the employer or journalist to the other side. The last journalist will receive compensation for a month's salary each year for each year or so of the service year or change of the contract, which has been reported to be dissolution. The employer has the right to ask for compensation for the damage he suffered.
The fundamental change in working conditions is the ability to work in a certain capacity. The employer can make this change by relyi

Token indices sequence length is longer than the specified maximum sequence length for this model (591 > 512). Running this sequence through the model will result in indexing errors


No.
The notice was made to the other side six weeks after the notice was made to the worker.
The answer is: The answer is not provided in the given text.
Yes.
The wages and other rights of a part-time worker are reserved for working days.
The temporary employer should declare business accidents and occupational health notifications to the employment office immediately under the 13th and 14th Articles of Social Insurance and General Health Insurance Act 31/5/2006.
The employer must report the employment situation to the Employment Regional Directorate by the end of the month, when the latest work is completed by the institutional document.
Yes.
Yes.
Mails are arranged to replace each other by working day-to-day during the second week of work that comes at the most in a work week at night.
Yes.
Yes.
Yes.
Women's employees can not be employed in the night mail for a year or so, until they are born with the doctor's report that they are pregnant.
Yes.
Yes.
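The tokenizer warning in the output above (591 > 512) shows that at least one prompt built from ten retrieved passages exceeded the tokenizer's configured maximum length. One possible mitigation, sketched here under stated assumptions, is to keep passages in rank order only while they fit a token budget. `fit_passages` is a hypothetical helper, and the default whitespace counter is a crude stand-in; in the notebook one could pass `lambda s: len(tokenizer.encode(s))` instead.

```python
def fit_passages(passages, question, max_tokens=512, count_tokens=None):
    """Keep the highest-ranked passages that fit in `max_tokens` with the question.

    `count_tokens` maps a string to an approximate token count; the default
    splits on whitespace, which only roughly tracks a real tokenizer.
    """
    if count_tokens is None:
        count_tokens = lambda text: len(text.split())
    budget = max_tokens - count_tokens(question)
    kept = []
    for passage in passages:
        cost = count_tokens(passage)
        if cost > budget:
            break  # stop at the first passage that no longer fits
        kept.append(passage)
        budget -= cost
    return kept
```

The kept passages can then be joined and formatted into the prompt as before; the count is approximate because joining and templating add a few tokens of their own.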


In [27]:
# Save one answer per row.
answer_set_1 = pd.DataFrame({'answers': answers})
answer_set_1.to_csv('./answer_set_3')