In [1]:
%pip install transformers torch langchain-community pypdf youtube_transcript_api

Defaulting to user installation because normal site-packages is not writeable
Note: you may need to restart the kernel to use updated packages.


# 1. RAG - Talk to your data
You can use `document_loaders` from the `langchain-community` library to load various types of data and "talk to it" using LLMs.

The `langchain-community` library offers many different loaders, including:
- [Web](https://python.langchain.com/docs/integrations/document_loaders/web_base/)
- [Twitter](https://python.langchain.com/docs/integrations/document_loaders/twitter/)
- [Discord](https://python.langchain.com/docs/integrations/document_loaders/discord/)
- [Github](https://python.langchain.com/docs/integrations/document_loaders/github/)
- [CSV](https://python.langchain.com/docs/integrations/document_loaders/csv/)
- [Youtube](https://python.langchain.com/docs/integrations/document_loaders/youtube_transcript/)

and many more.

### Import the loader
First, import the appropriate class from `langchain_community.document_loaders`. All available options are listed [here](https://python.langchain.com/docs/integrations/document_loaders/).

Some loaders require additional dependencies that you have to install yourself (for example, `PyPDFLoader` needs the `pypdf` package).

In [2]:
from langchain_community.document_loaders import PyPDFLoader



### Load the data
Create a loader and call its `load` method to load the data.

In [16]:
loader = PyPDFLoader(file_path="zaliczenie.pdf")

data = loader.load()

### Load the model

In [10]:
from transformers import pipeline  # huggingface

model_id = "gpt2"
model = pipeline(
    "text-generation",
    model=model_id,
    torch_dtype="auto",
)

Device set to use mps:0


### Implement the `generate` function


In [11]:
def generate(prompt: str) -> str:
    response = model(prompt, max_new_tokens=100, temperature=0.7)
    if response and isinstance(response, list) and "generated_text" in response[0]:
        return response[0]["generated_text"]
    else:
        raise ValueError("Could not read the generated text from the model response.")

In [12]:
# Test the `generate` function
generate("Hello World!")

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


'Hello World!\n\nI have an old friend who is a good friend of mine, a man named Ben. He has been with me at the last minute and told me that he wanted to play the game with me. I told him he could play it with me but he had to do it the way it was programmed. When Ben came out of his bed he was so excited to play and so excited to play, he told me that he would make me a card with the word "Haven" in'
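
The output above starts with the prompt itself: by default the `text-generation` pipeline returns the prompt followed by the continuation. When only the continuation is needed, for example when one answer is fed into another prompt, the echoed prompt can be cut off. A minimal sketch; `strip_prompt` is a hypothetical helper name, not part of `transformers`:

```python
def strip_prompt(prompt: str, generated_text: str) -> str:
    """Return only the continuation if the output starts with the prompt."""
    if generated_text.startswith(prompt):
        return generated_text[len(prompt):].lstrip()
    # The output did not echo the prompt, so return it unchanged.
    return generated_text


echoed = "Hello World!\n\nI have an old friend"
print(strip_prompt("Hello World!", echoed))
```

The pipeline call also accepts `return_full_text=False`, which should have the same effect at generation time.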

### Write the prompt
Write a base prompt that will be used to generate a response to the user's `query` based on the available data.

In [17]:
BASE_PROMPT = """
You are an advanced assistant. Use the provided data to generate a precise and helpful response to the query below.

Query: {query}

Data:
{data}

Your response should be clear, concise, and directly address the query based on the data provided.
"""

### Generate the response

In [18]:
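query = "Summarize this project"
# `load` returns a list of Document objects; pass their text, not the list's repr
context = "\n\n".join(doc.page_content for doc in data)
prompt = BASE_PROMPT.format(query=query, data=context)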

In [None]:
generate(prompt)

Token indices sequence length is longer than the specified maximum sequence length for this model (4347 > 1024). Running this sequence through the model will result in indexing errors
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
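
The warning above means that the prompt (4347 tokens) does not fit into the model's 1024-token limit, so the whole document cannot be passed at once. A common workaround in RAG is to split the text into chunks and include only the chunks most relevant to the query. The sketch below scores relevance by plain word overlap, which is an assumption made for illustration (real pipelines usually use embeddings), and word counts only approximate token counts. `split_into_chunks` and `top_chunks` are hypothetical helper names.

```python
import re


def split_into_chunks(text: str, chunk_size: int = 50) -> list[str]:
    """Split text into chunks of at most `chunk_size` whitespace-separated words."""
    words = text.split()
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]


def _words(text: str) -> set[str]:
    """Lowercased set of alphanumeric words, used for overlap scoring."""
    return set(re.findall(r"\w+", text.lower()))


def top_chunks(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Return the k chunks that share the most words with the query."""
    query_words = _words(query)
    ranked = sorted(chunks, key=lambda c: len(query_words & _words(c)), reverse=True)
    return ranked[:k]


example_chunks = [
    "The project builds a weather station.",
    "Grading is based on a written report.",
    "Sensors log temperature every minute.",
]
print(top_chunks("How is the grading done?", example_chunks, k=1))
```

Only the selected chunks would then be joined and passed as `data` when formatting the prompt.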


# 2. Prompt Chaining
You can chain multiple prompts one after another to transform or further process generated responses before reaching the desired result.

The task is to turn a programming task into a finished code snippet by chaining prompts as follows:
1. Generate a plan for solving the problem (preferably step by step)
2. Generate additional considerations that should be taken into account
3. Generate the final code

### Define the prompts

The prompt must be properly formatted. You can use HTML tags, markdown, or other formatting options.
Use placeholders in the prompts.
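
As an illustration of placeholders, the sketch below fills a template with `str.format`, the same mechanism the notebook uses for `BASE_PROMPT`. The template text and the name `EXAMPLE_PROMPT` are made up for this example. Note that literal braces inside a template have to be doubled.

```python
EXAMPLE_PROMPT = """
<task>
{query}
</task>

Answer with a JSON object such as {{"steps": ["..."]}}.
"""

# `query` is replaced; the doubled braces become single literal braces.
filled = EXAMPLE_PROMPT.format(query="Sort a list of integers.")
print(filled)
```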

In [None]:
GENERATE_PLAN_PROMPT = """
Your prompt here. Include the placeholder for `query`.

Make sure that this prompt generates a step-by-step plan to solve the problem, not the final code.
"""

In [None]:
GENERATE_CONSIDERATIONS_PROMPT = """
Your prompt here. Include the placeholder for `query` and `plan`.

Make sure that this prompt generates additional considerations, not the final code or the new plan.
"""

In [None]:
GENERATE_CODE_PROMPT = """
Your prompt here. Include the placeholders for `query`, `plan`, and `considerations`.

Make sure that this prompt generates just the final code snippet without any additional information or comments from the model.
"""

### Create the chain

In [None]:
def run_chain(query: str) -> str:
    # 1. Generate a step-by-step plan
    print("Generating a step-by-step plan...")
    prompt = GENERATE_PLAN_PROMPT.format(query=query)
    plan = generate(prompt)
    print(plan)

    # 2. Generate additional considerations
    print("\n\nGenerating additional considerations...")
    prompt = GENERATE_CONSIDERATIONS_PROMPT.format(query=query, plan=plan)
    considerations = generate(prompt)
    print(considerations)

    # 3. Generate the final code snippet
    print("\n\nGenerating the final code snippet...")
    prompt = GENERATE_CODE_PROMPT.format(query=query, plan=plan, considerations=considerations)
    code = generate(prompt)
    print(code)

    return code
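
To see how data flows through such a chain without waiting for a model, the same pattern can be dry-run with a stand-in generator. In the sketch below, `fake_generate`, `run_steps`, and the templates are hypothetical and exist only for the demonstration. Each step's output is stored under its name and becomes available as a placeholder for the later steps.

```python
def fake_generate(prompt: str) -> str:
    """Stand-in for a model call: reports how long the prompt was."""
    return f"<output for {len(prompt)}-char prompt>"


def run_steps(query: str, steps: list[tuple[str, str]], generate_fn) -> dict[str, str]:
    """Run prompt templates in order, exposing earlier outputs as placeholders."""
    context = {"query": query}
    for name, template in steps:
        # str.format ignores unused keyword arguments, so passing the
        # whole context is safe even if a template uses only some keys.
        context[name] = generate_fn(template.format(**context))
    return context


steps = [
    ("plan", "Plan how to solve: {query}"),
    ("considerations", "Task: {query}\nPlan: {plan}\nList edge cases."),
    ("code", "Task: {query}\nPlan: {plan}\nNotes: {considerations}\nWrite the code."),
]
result = run_steps("Reverse a string.", steps, fake_generate)
print(result["code"])
```

Keeping the steps in a list makes it easy to add or reorder stages, at the cost of being less explicit than the hand-written function above.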

### Test the chain

In [None]:
example_query_1 = "Write a Python function to find all prime numbers in a range from 1 to n."
example_query_2 = "Write a function that takes a list of words and a single word, and returns all the words in the list that are anagrams of the given word."
example_query_3 = ""  # Add your own query to test the chain

In [None]:
code_snippet = run_chain(example_query_1)

In [None]:
from IPython.display import display, Code

# Display the generated code snippet
display(Code(code_snippet, language='python'))
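
Models often wrap code in a markdown fence or add a sentence before it, even when asked not to. A small post-processing step can make the result easier to paste and run. The sketch below is one possible approach; `extract_code` is a hypothetical helper, and it assumes that only the first fenced block matters.

```python
import re

FENCE = "`" * 3  # built programmatically to avoid a literal fence in this example

# Opening fence with an optional language tag, then the body up to the closing fence.
_FENCED_BLOCK = re.compile(FENCE + r"[\w+-]*[ \t]*\n(.*?)" + FENCE, re.DOTALL)


def extract_code(model_output: str) -> str:
    """Return the body of the first fenced block, or the stripped text if there is none."""
    match = _FENCED_BLOCK.search(model_output)
    if match:
        return match.group(1).strip()
    return model_output.strip()


raw = "Here is the code:\n" + FENCE + "python\nprint('hi')\n" + FENCE + "\nDone."
print(extract_code(raw))
```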

Paste the generated code into the cell below to check that it works correctly.

In [None]:
# Paste the generated code snippet here

# 3. Text validator - Homework
Write a text validator that checks whether a text breaks any of the rules. If it does, the validator should return appropriate feedback.

### Define the criteria

In [None]:
RULES = {
    "no_personal_info": "Should not contain any personal information.",
    "english_only": "Should be in English.",
    "no_questions": "Should not contain any questions.",
    # Feel free to add more rules here
}

### Implement the validator

In [None]:
VALIDATION_PROMPT = """
You are a validator. You need to ensure that the provided text meets the criteria.

<Criteria>
Code: {rule_code}
Description: {rule_description}
</Criteria>

<Text to check>
{text_to_check}
</Text to check>

# Output format
Output the result in the following JSON format:
{{
    "criteria_met": bool,  # true if the criterion is met, false otherwise
    "feedback": str  # Feedback if the criterion is not met, otherwise an empty string
}}

Return just the JSON without any additional information or comments.
"""

In [None]:
import json


def validate_rule(text: str, rule_code: str) -> dict:
    # 1. Load the rule description from the RULES dictionary for the given `rule_code`
    # 2. Prepare the prompt using `VALIDATION_PROMPT` and `format` method
    # 3. Run the `generate` function
    # 4. Use `json.loads` to transform the string output into a dictionary
    # 5. Add the `rule_code` to the dictionary
    # 6. Return the dictionary. The dictionary should contain the following keys: "criteria_met", "feedback", "rule_code"
    pass
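
Turning the model's text reply into a dictionary is the fragile part of this function: a model may surround the JSON with extra text, and `json.loads` then raises `json.JSONDecodeError`. One hedged way to handle this is to cut out the outermost `{...}` span before parsing. `parse_json_object` below is a hypothetical helper, not the required solution, and its first-brace/last-brace heuristic assumes the reply contains a single JSON object.

```python
import json


def parse_json_object(raw: str) -> dict:
    """Parse the span from the first '{' to the last '}' in `raw` as JSON."""
    start = raw.find("{")
    end = raw.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("No JSON object found in the model output.")
    return json.loads(raw[start:end + 1])


reply = 'Sure! {"criteria_met": false, "feedback": "Contains a name."} Hope this helps.'
result = parse_json_object(reply)
result["rule_code"] = "no_personal_info"
print(result)
```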

### Test the validator

In [None]:
def run_validator(text: str):
    for rule_code in RULES.keys():
        print(f"Checking rule '{rule_code}'...")
        result = validate_rule(text, rule_code)

        assert result["criteria_met"], f"Rule '{rule_code}' is not met. Feedback: {result['feedback']}"

        print("Rule is met.")
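
Because `assert` stops at the first failing rule, the loop above reports at most one violation per run. If a full report is preferable, the results can be collected instead. A sketch under assumptions: `collect_violations` and `fake_validate` are hypothetical, and the stand-in validator uses a trivial string check in place of a model call.

```python
from typing import Callable


def collect_violations(text: str, rule_codes: list[str],
                       validate_fn: Callable[[str, str], dict]) -> list[dict]:
    """Run every rule and return the results whose criteria were not met."""
    results = [validate_fn(text, code) for code in rule_codes]
    return [r for r in results if not r["criteria_met"]]


def fake_validate(text: str, rule_code: str) -> dict:
    """Stand-in validator: only knows how to detect question marks."""
    met = not (rule_code == "no_questions" and "?" in text)
    feedback = "" if met else "Text contains a question."
    return {"criteria_met": met, "feedback": feedback, "rule_code": rule_code}


for violation in collect_violations("Is this ok?", ["english_only", "no_questions"], fake_validate):
    print(f"{violation['rule_code']}: {violation['feedback']}")
```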

In [None]:
text_to_check = "My name is John and I like to play basketball. Do you know how to play basketball?"

In [None]:
run_validator(text_to_check)

### Implement the `anonymize` function - Bonus task
If the text contains data that breaks the rules above (e.g. personal information), the `anonymize` function should replace that data with a placeholder.

In [None]:
ANONYMIZE_PROMPT = """
Your prompt here.
"""


def anonymize(text: str) -> str:
    # Implement the function that will replace the personal information with a placeholder
    # Make sure to return the anonymized text (string)
    pass
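
Before writing the prompt, it can help to have a non-LLM baseline to compare against. The sketch below is a rule-based stand-in, not the prompt-based solution the exercise asks for: it masks e-mail addresses and long digit sequences with regular expressions. The patterns, the placeholder strings, and the name `mask_contact_details` are assumptions chosen for illustration, and the function will not catch names such as "John".

```python
import re

# Simplified patterns; real e-mail and phone formats are more varied.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
PHONE_RE = re.compile(r"\+?\d[\d\s-]{7,}\d")


def mask_contact_details(text: str) -> str:
    """Replace e-mail addresses and phone-like digit runs with placeholders."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    return PHONE_RE.sub("[PHONE]", text)


print(mask_contact_details("Write to john@example.com or call +48 123 456 789."))
```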

In [None]:
print("Checking rule 'no_personal_info'...")
result = validate_rule(text_to_check, rule_code="no_personal_info")

if not result["criteria_met"]:
    print("Personal information found. Anonymizing the text...")
    anonymized_text = anonymize(text_to_check)
    print(anonymized_text)

    print("Re-running the validation...")
    # Store the new result; otherwise the assert checks the pre-anonymization outcome
    result = validate_rule(anonymized_text, rule_code="no_personal_info")

    assert result["criteria_met"], "Anonymized text still contains personal information. Refine your prompt."

print("Rule is met.")