<a href="https://colab.research.google.com/github/lykskai/HodgkinAvatar/blob/main/llama.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

1) Install Required Libraries

In [1]:
!pip install transformers datasets torch pdfplumber
!pip install faiss-cpu

Collecting datasets
  Downloading datasets-3.2.0-py3-none-any.whl.metadata (20 kB)
Collecting pdfplumber
  Downloading pdfplumber-0.11.5-py3-none-any.whl.metadata (42 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m42.5/42.5 kB[0m [31m2.2 MB/s[0m eta [36m0:00:00[0m
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.5.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess<0.70.17 (from datasets)
  Downloading multiprocess-0.70.16-py311-none-any.whl.metadata (7.2 kB)
Collecting fsspec<=2024.9.0,>=2023.1.0 (from fsspec[http]<=2024.9.0,>=2023.1.0->datasets)
  Downloading fsspec-2024.9.0-py3-none-any.whl.metadata (11 kB)
Collecting pdfminer.six==20231228 (from pdfplumber)
  Downloading pdfminer.six-20231228-py3-none-any.whl.metadata (4.2 kB)
Collecting pypdfium2>=4.18.0 (from pdfplumber)
  Downloading pypdfium2

2) Importing the necessary libraries for our code

In [2]:
from transformers import pipeline, Trainer, TrainingArguments
from datasets import Dataset
from transformers import RagTokenizer, RagRetriever, RagSequenceForGeneration
import pdfplumber
import torch
import json

** For uploading and extracting files!

In [None]:
# Upload a file manually
from google.colab import files
uploaded = files.upload()

# Extract full text from the uploaded file
for filename in uploaded.keys():
    print(f"Processing file: {filename}")
    with pdfplumber.open(filename) as pdf:
        full_text = "\n".join(page.extract_text() for page in pdf.pages)

    # Save extracted text for manual editing
    output_file = filename.replace(".pdf", "_extracted.txt")
    with open(output_file, "w") as f:
        f.write(full_text)
        print(f"Text extracted and saved to {output_file}")

### b. Manually Edit Extracted Text

Saving tf9332901032.pdf to tf9332901032.pdf
Processing file: tf9332901032.pdf
Text extracted and saved to tf9332901032_extracted.txt


** For downloading extracted files!

In [None]:
from google.colab import files

files.download("tf9332901032_extracted.txt")


<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

3) Upload and Process Text file

In [3]:
# Upload the text file
from google.colab import files ##

uploaded = files.upload()

# Read the uploaded file
for filename in uploaded.keys():
    if filename == "4articles-dch.txt":
        with open(filename, "r") as f:
            combined_text = f.read()

print("Combined text loaded successfully!")

Saving 4articles-dch.txt to 4articles-dch.txt
Combined text loaded successfully!


4) Create a dataset, convert the combined text into a Hugging Face Dataset:


In [4]:
lines = combined_text.split("\n")  # Split into lines or entries

dataset = Dataset.from_dict({"text": lines})

5)  Tokenize the Dataset

In [5]:
from transformers import AutoTokenizer

# Initialize the tokenizer
model_id = "meta-llama/Meta-Llama-3-8B"
tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=True)

# Ensure the tokenizer has a pad token
if tokenizer.pad_token is None:
    tokenizer.add_special_tokens({'pad_token': '[PAD]'})

# Tokenize the dataset
def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True, max_length=512)

tokenized_datasets = dataset.map(tokenize_function, batched=True)

tokenizer_config.json:   0%|          | 0.00/50.6k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/9.09M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/73.0 [00:00<?, ?B/s]

Map:   0%|          | 0/5503 [00:00<?, ? examples/s]

6) Split the dataset

In [6]:
train_test_split = tokenized_datasets.train_test_split(test_size=0.1)
train_dataset = train_test_split["train"]
eval_dataset = train_test_split["test"]

7) Incorporate RAG for retrieval

In [7]:
### a. Prepare Passages for Retrieval
# Use the combined dataset from previous steps for retrieval
passages = [{"text": line} for line in lines if line.strip() != ""]  # Convert non-empty lines to passages

# Save passages to a Hugging Face dataset and add Faiss index
from datasets import Dataset
passages_dataset = Dataset.from_dict({"text": [p["text"] for p in passages]})

# Use a simple embedding function (e.g., MiniLM) for Faiss indexing
from transformers import AutoModel, AutoTokenizer
embedding_model = AutoModel.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
embedding_tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")

def embed_passages(batch):
    inputs = embedding_tokenizer(batch["text"], padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        embeddings = embedding_model(**inputs).last_hidden_state.mean(dim=1)
    return {"embeddings": embeddings.numpy()}

passages_dataset = passages_dataset.map(embed_passages, batched=True, batch_size=8)
passages_dataset.add_faiss_index(column="embeddings", index_name="embeddings_index")

# Save the dataset and Faiss index
passages_dataset.drop_index("embeddings_index")  # Drop the Faiss index before saving the dataset
passages_dataset.save_to_disk("./passages_dataset")  # Save the dataset without the index
passages_dataset.add_faiss_index(column="embeddings", index_name="embeddings_index")  # Re-add the index after saving
passages_dataset.get_index("embeddings_index").save("./passages_index.faiss")  # Save the Faiss index separately

### b. Load the RAG Model and Retriever
# Load the RAG tokenizer, retriever, and model
rag_model_name = "facebook/rag-sequence-base"
rag_retriever = RagRetriever.from_pretrained(
    rag_model_name,
    dataset_path="./passages_dataset",
    index_name="embeddings_index"  # Use the correct index name
)

rag_model = RagSequenceForGeneration.from_pretrained(rag_model_name, retriever=rag_retriever)

### c. Test the RAG Pipeline
# Example question
input_question = "What did Dorothy Hodgkin study?"
inputs = rag_retriever.tokenizer(input_question, return_tensors="pt")
outputs = rag_model.generate(input_ids=inputs["input_ids"])
print(rag_retriever.tokenizer.decode(outputs[0], skip_special_tokens=True))


config.json:   0%|          | 0.00/612 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/90.9M [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/350 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/112 [00:00<?, ?B/s]

Map:   0%|          | 0/5491 [00:00<?, ? examples/s]

  0%|          | 0/6 [00:00<?, ?it/s]

Saving the dataset (0/1 shards):   0%|          | 0/5491 [00:00<?, ? examples/s]

  0%|          | 0/6 [00:00<?, ?it/s]

config.json:   0%|          | 0.00/4.55k [00:00<?, ?B/s]

(…)_encoder_tokenizer/tokenizer_config.json:   0%|          | 0.00/48.0 [00:00<?, ?B/s]

question_encoder_tokenizer/vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

(…)ncoder_tokenizer/special_tokens_map.json:   0%|          | 0.00/112 [00:00<?, ?B/s]

The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'RagTokenizer'. 
The class this function is called from is 'DPRQuestionEncoderTokenizer'.
The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'RagTokenizer'. 
The class this function is called from is 'DPRQuestionEncoderTokenizerFast'.


(…)enerator_tokenizer/tokenizer_config.json:   0%|          | 0.00/26.0 [00:00<?, ?B/s]

generator_tokenizer/vocab.json:   0%|          | 0.00/899k [00:00<?, ?B/s]

generator_tokenizer/merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

(…)erator_tokenizer/special_tokens_map.json:   0%|          | 0.00/772 [00:00<?, ?B/s]

The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'RagTokenizer'. 
The class this function is called from is 'BartTokenizer'.
The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'RagTokenizer'. 
The class this function is called from is 'BartTokenizerFast'.


ValueError: Please provide `index_name` or `index_path`.

8) define the data collator

In [None]:
from transformers import DataCollatorWithPadding

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

9) Set Training Arguments

In [None]:
training_args = TrainingArguments(
    output_dir="./results",
    evaluation_strategy="epoch",
    learning_rate=5e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=3,
    weight_decay=0.01,
    save_steps=10_000,
    save_total_limit=2,
    logging_dir="./logs",
    logging_steps=200,
    push_to_hub=False,
    fp16=torch.cuda.is_available(),
)

NEW) Fine tune LLaMA for style

In [None]:
from transformers import LlamaForCausalLM, LlamaTokenizer, Trainer

# Load tokenizer and model
model_id = "meta-llama/Meta-Llama-3-8B"
model = LlamaForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")
tokenizer = LlamaTokenizer.from_pretrained(model_id)

# Initialize the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    tokenizer=tokenizer,
    data_collator=data_collator,
)

# Train the Model
trainer.train()

# Test the model using a text-generation pipeline
from transformers import pipeline

generation_pipeline = pipeline("text-generation", model=model, tokenizer=tokenizer, device_map="auto")
print(generation_pipeline("Hey, how are you doing today?", max_length=50))

config.json:   0%|          | 0.00/654 [00:00<?, ?B/s]

model.safetensors.index.json:   0%|          | 0.00/23.9k [00:00<?, ?B/s]

Downloading shards:   0%|          | 0/4 [00:00<?, ?it/s]

model-00001-of-00004.safetensors:   0%|          | 0.00/4.98G [00:00<?, ?B/s]

KeyboardInterrupt: 

11) Save the Fine-Tuned Model

In [None]:
trainer.save_model("./fine_tuned_llama")

3) Load the pretrained LLaMA model and tokenizer

In [None]:
# Load the LLaMA 3 model using Hugging Face
model_id = "meta-llama/Meta-Llama-3-8B"

pipeline = pipeline("text-generation", model=model_id,
                    model_kwargs={"torch_dtype": torch.bfloat16},
                    device_map="auto")

# Test the pipeline
print(pipeline("Hey, how are you doing today?"))

config.json:   0%|          | 0.00/654 [00:00<?, ?B/s]

model.safetensors.index.json:   0%|          | 0.00/23.9k [00:00<?, ?B/s]

Downloading shards:   0%|          | 0/4 [00:00<?, ?it/s]

model-00001-of-00004.safetensors:   0%|          | 0.00/4.98G [00:00<?, ?B/s]

model-00002-of-00004.safetensors:   0%|          | 0.00/5.00G [00:00<?, ?B/s]

model-00003-of-00004.safetensors:   0%|          | 0.00/4.92G [00:00<?, ?B/s]

model-00004-of-00004.safetensors:   0%|          | 0.00/1.17G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/177 [00:00<?, ?B/s]



tokenizer_config.json:   0%|          | 0.00/50.6k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/9.09M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/73.0 [00:00<?, ?B/s]

Device set to use cuda:0
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


[{'generated_text': 'Hey, how are you doing today? I hope you are having a great day and that you are ready for a fun and exciting blog post! Today I have a review of the new Dior Addict Lip Glow Colour Reviver Balm in the shade 002 Pink. This lip balm is a new release from Dior and I have been dying to get my hands on it since I first saw it on Instagram. I have been a huge fan of the original Dior Addict Lip Glow for years and years now and it is one of my all time favourite lip products, so I had to try the new balm!\nThe Dior Addict Lip Glow Colour Reviver Balm in 002 Pink is a lip balm that is formulated with a complex of hyaluronic acid and mango butter to provide your lips with a boost of hydration. It is also enriched with a colour pigments that are designed to react with your natural pH to create a custom colour that is perfect for you. The lip balm is also infused with rose oil to provide a subtle scent and it is formulated without parabens and mineral oils.\nThe lip balm com