## Introduction

This Jupyter notebook from our AI Tools team walks through fine-tuning Databricks Dolly v2-3b, an instruction-following language model, to answer questions about a specific set of Wikipedia articles. It covers every stage of the process: data collection, preprocessing, model fine-tuning, and testing.

Dolly v2-3b is a pretrained causal language model. Fine-tuning it on question-answer pairs derived from the chosen articles is intended to make its answers more accurate and better grounded in those articles.

The goal throughout is a customized model that understands and responds to queries in the context of the chosen articles. Such a model could be useful in applications ranging from customer support to research and education.

## Parameters
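The cell below defines the file paths, model names, article category, and training settings used throughout the notebook.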

In [24]:
# File paths
articles_file_path = "../data/articles.csv"

# Dataset paths
train_dataset_path = "../datasets/train_dataset.pkl"
val_dataset_path = "../datasets/val_dataset.pkl"

# Model parameters
tokenizer_for_sentence_splitting = 'punkt'
base_model_name = "databricks/dolly-v2-3b"
model_path = "./finetuned_model"

# Article parameters
articles_category = 'Science_fiction_films'

# Training parameters
batch_size = 1

## Data Collection

The cell below downloads the Wikipedia articles found in the subcategories of the chosen category through the MediaWiki API and caches them in a CSV file. If the CSV file already exists, it is loaded instead.

In [25]:
import json
import os
from typing import List, Dict

import pandas as pd
import requests
from tqdm.auto import tqdm

if not os.path.exists(articles_file_path):
    # Retrieve elements (subcategories or articles) for the specified category.
    def get_category_members(category: str, member_type: str) -> List[str]:
        base_url = 'https://en.wikipedia.org/w/api.php'
        params = {
            'action': 'query',
            'list': 'categorymembers',
            'cmtitle': category,
            'cmtype': member_type,
            'format': 'json',
            'cmlimit': 500
        }
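        response = requests.get(base_url, params=params, timeout=30)
        response.raise_for_status()  # Fail early on HTTP errors
        data = response.json()
        # Note: only the first page of results (up to 500 members) is read here
        members = [item['title'] for item in data['query']['categorymembers']]
        return members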


    # Retrieve full wikitext content of articles.
    def get_article_texts(articles: List[str]) -> Dict[str, str]:
        base_url = 'https://en.wikipedia.org/w/api.php'
        params = {
            'action': 'query',
            'prop': 'revisions',
            'rvprop': 'content',
            'format': 'json',
            'titles': '|'.join(articles)
        }
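        response = requests.get(base_url, params=params, timeout=30)
        response.raise_for_status()  # Fail early on HTTP errors
        data = response.json()
        texts = {}
        for page in data['query']['pages'].values():
            # Missing or invalid pages have no 'revisions' entry; skip them
            if 'revisions' in page:
                texts[page['title']] = page['revisions'][0]['*']
        return texts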


    # Download articles from subcategories of the specified category.
    def download_articles(category: str) -> pd.DataFrame:
        # Get subcategories of the given category
        subcategories = get_category_members(f'Category:{category}', 'subcat')
        all_articles = []

        # Loop through subcategories and get articles from each subcategory
        for subcategory in tqdm(subcategories, desc="Downloading subcategories"):
            articles = get_category_members(subcategory, 'page')
            all_articles.extend(articles)

        article_data = []
        # Process articles in batches to avoid hitting API limits
        for i in tqdm(range(0, len(all_articles), 50), desc="Downloading articles"):
            batch = all_articles[i:i + 50]
            texts = get_article_texts(batch)
            for title, text in texts.items():
                article_data.append({'title': title, 'text': text})

        return pd.DataFrame(article_data)


    # Download the articles and save them to a CSV file
    articles_df = download_articles(articles_category)
    os.makedirs(os.path.dirname(articles_file_path), exist_ok=True)  # Create ../data if needed
    articles_df.to_csv(articles_file_path, index=False)
else:
    # Load the articles from the existing CSV file
    articles_df = pd.read_csv(articles_file_path)

articles_df.head()

Unnamed: 0,title,text
0,Alfonso Cuarón,{{short description|Mexican filmmaker}}\n{{Red...
1,Brad Bird,"{{Short description|American film director, sc..."
2,Brian Patrick Butler,{{Short description|American actor and filmmak...
3,Charles Band,{{short description|American film director}}\n...
4,Chris Carter (screenwriter),{{Short description|American television and fi...
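
The category listing above reads a single response of at most 500 members. The MediaWiki API signals further results with a `continue` object whose values are sent back with the next request. The following is a minimal sketch of that loop, not the notebook's own code: `iter_category_members` and `fake_fetch` are hypothetical names, and the network call is replaced by a stub with made-up data so the example runs offline.

```python
from typing import Callable, Dict, Iterator


def iter_category_members(fetch: Callable[[Dict], Dict], category: str,
                          member_type: str) -> Iterator[str]:
    """Yield every member title, following the API's continuation values."""
    params = {
        'action': 'query',
        'list': 'categorymembers',
        'cmtitle': category,
        'cmtype': member_type,
        'format': 'json',
        'cmlimit': 500,
    }
    while True:
        data = fetch(params)
        for item in data['query']['categorymembers']:
            yield item['title']
        if 'continue' not in data:
            break  # No more result pages
        # Send the continuation values back with the next request
        params = {**params, **data['continue']}


# Stand-in for the HTTP request (hypothetical data) so the sketch runs offline
def fake_fetch(params: Dict) -> Dict:
    if 'cmcontinue' not in params:
        return {'query': {'categorymembers': [{'title': 'Alien (film)'}]},
                'continue': {'cmcontinue': 'page|2', 'continue': '-||'}}
    return {'query': {'categorymembers': [{'title': 'Blade Runner'}]}}


titles = list(iter_category_members(
    fake_fetch, 'Category:Science_fiction_films', 'page'))
print(titles)
```

In the notebook, `fetch` could wrap the real request, for example a function that calls `requests.get(base_url, params=params)` and returns the decoded JSON.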


## Creating a Dialogue Dataset

The cells below generate fill-in-the-blank (cloze-style) question-answer pairs from the Wikipedia articles, wrap them in PyTorch datasets, and cache the datasets as pickle files.
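
To make the shape of a generated pair concrete, here is a small self-contained sketch of the idea: the longest non-stopword in a sentence is blanked out and becomes the answer. It is an illustration under stated assumptions, not the notebook's implementation; `make_cloze_pair` is a hypothetical helper, and the tiny stopword set stands in for NLTK's English list.

```python
# Tiny illustrative stopword list (assumption); the notebook uses NLTK's English list
STOP_WORDS = {"the", "a", "an", "of", "in", "and", "is", "was"}
PUNCTUATION = ".,?;!"


def make_cloze_pair(sentence):
    """Blank out the longest non-stopword and return (question, answer)."""
    words = sentence.split()
    candidates = [w.strip(PUNCTUATION) for w in words]
    candidates = [w for w in candidates if w and w.lower() not in STOP_WORDS]
    if not candidates:
        return None  # Nothing suitable to blank out
    answer = max(candidates, key=len)  # The first longest word wins ties
    for i, word in enumerate(words):
        if word.strip(PUNCTUATION) == answer:
            # Replace only the word itself so trailing punctuation is kept
            words[i] = word.replace(answer, "____", 1)
            break
    return " ".join(words), answer


pair = make_cloze_pair("Ridley Scott directed the science fiction film Alien.")
print(pair)
# ('Ridley Scott ____ the science fiction film Alien.', 'directed')
```

Picking the longest word is a crude heuristic for finding an informative keyword, so some generated questions will blank out words that are long but not meaningful.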

In [26]:
from transformers import AutoTokenizer
import nltk

# Load the prerequisites
nltk.download(tokenizer_for_sentence_splitting)  # Download the punkt tokenizer for sentence splitting
nltk.download('stopwords')  # Stopword list used for keyword extraction in the next cell
tokenizer = AutoTokenizer.from_pretrained(base_model_name)  # Load the tokenizer

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\andre\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [27]:
from sklearn.model_selection import train_test_split
from nltk import sent_tokenize
from nltk.corpus import stopwords
from torch.utils.data import Dataset
import os
import pickle


# Dataset that tokenizes question-answer pairs for fine-tuning
class QADataset(Dataset):
    def __init__(self, df, qa_indices, tokenizer, max_input_length=512, max_output_length=512):
        self.df = df
        self.qa_indices = qa_indices
        self.tokenizer = tokenizer
        self.max_input_length = max_input_length
        self.max_output_length = max_output_length

    def __len__(self):
        return len(self.qa_indices)

    def __getitem__(self, idx):
        index, question, answer = self.qa_indices[idx]
        # A causal LM is trained on a single sequence, so the question and the
        # answer are joined and the labels are a copy of the input ids.
        encoded = self.tokenizer(f"{question}\n{answer}", max_length=self.max_input_length,
                                 padding="max_length", truncation=True, return_tensors="pt")
        input_ids = encoded["input_ids"].squeeze(0)
        attention_mask = encoded["attention_mask"].squeeze(0)
        labels = input_ids.clone()
        labels[attention_mask == 0] = -100  # Exclude padding positions from the loss
        return {"input_ids": input_ids,
                "attention_mask": attention_mask,
                "labels": labels}


# Replace keywords in a sentence with a placeholder
def replace_keywords(sentence, keywords):
    words = sentence.split()
    replaced = False
    for i, word in enumerate(words):
        if word.strip(".,?;!") in keywords and not replaced:
            words[i] = "____"
            replaced = True
    return " ".join(words)


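# Extract keywords from a sentence (the longest non-stopwords)
def extract_keywords(sentence, num_keywords=1):
    stop_words = set(stopwords.words("english"))  # Build the set once, not once per word
    # Strip punctuation so the keywords match the comparison in replace_keywords
    words = [word.strip(".,?;!") for word in sentence.split()]
    words = [word for word in words if word and word.lower() not in stop_words]
    words = sorted(words, key=len, reverse=True)
    return words[:num_keywords]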


# Create question-answer pairs from the articles
def create_qa_pairs(df):
    qa_pairs = []
    for index in tqdm(df.index, total=df.shape[0], desc="Creating QA pairs"):
        text = df.loc[index, "text"]
        sentences = sent_tokenize(text)
        for sentence in sentences:
            keywords = extract_keywords(sentence)
            question = replace_keywords(sentence, keywords)
            if "____" in question:
                qa_pairs.append((index, question, keywords[0]))
    return qa_pairs


# Split data into train and validation sets
def split_data(df, test_size=0.2, random_state=42):
    train_df, val_df = train_test_split(df, test_size=test_size, random_state=random_state)
    return train_df, val_df


# Check if datasets exist, otherwise create and save them
if os.path.exists(train_dataset_path) and os.path.exists(val_dataset_path):
    # Load the datasets
    with open(train_dataset_path, "rb") as train_file:
        train_dataset = pickle.load(train_file)

    with open(val_dataset_path, "rb") as val_file:
        val_dataset = pickle.load(val_file)
else:
    # Create the datasets
    qa_pairs = create_qa_pairs(articles_df)
    train_data, val_data = split_data(qa_pairs)

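    # QADataset expects the source DataFrame, the QA tuples, and the tokenizer
    train_dataset = QADataset(articles_df, train_data, tokenizer)
    val_dataset = QADataset(articles_df, val_data, tokenizer)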

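    # Save the datasets (create the target directory if it does not exist)
    os.makedirs(os.path.dirname(train_dataset_path), exist_ok=True)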
    with open(train_dataset_path, "wb") as train_file:
        pickle.dump(train_dataset, train_file)

    with open(val_dataset_path, "wb") as val_file:
        pickle.dump(val_dataset, val_file)

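## Training the Model

The cells below load the base model and fine-tune it on the question-answer datasets with the Hugging Face Trainer.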

In [28]:
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(base_model_name)  # Load the model

MemoryError: 
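
The `MemoryError` above is consistent with the size of the weights. As a rough back-of-the-envelope sketch, the memory needed just to hold the parameters is the parameter count times the bytes per parameter. The parameter count below is an assumption (about 2.8 billion for Dolly v2-3b), and `weights_size_gb` is a hypothetical helper.

```python
# Approximate parameter count for Dolly v2-3b (assumption: about 2.8 billion)
NUM_PARAMETERS = 2.8e9

# Storage size of one parameter for common data types
BYTES_PER_PARAMETER = {"float32": 4, "bfloat16": 2, "float16": 2, "int8": 1}


def weights_size_gb(num_parameters, dtype):
    """Return the memory needed for the weights alone, in GB (10**9 bytes)."""
    return num_parameters * BYTES_PER_PARAMETER[dtype] / 1e9


for dtype in BYTES_PER_PARAMETER:
    print(f"{dtype}: {weights_size_gb(NUM_PARAMETERS, dtype):.1f} GB")
```

Under these assumptions the default 32-bit load needs roughly 11 GB for the weights alone. Passing `torch_dtype=torch.bfloat16` to `from_pretrained` is one common way to halve that, although training still needs additional memory for gradients and optimizer state.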

In [None]:
from transformers import TrainingArguments, Trainer

# Trainer builds its own data loaders from the datasets, so none are created here

training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=3,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    eval_accumulation_steps=1,
    prediction_loss_only=True,
    learning_rate=5e-5,
    weight_decay=0.01,
    save_total_limit=1,
    load_best_model_at_end=True,
    metric_for_best_model="loss",
    greater_is_better=False,
    evaluation_strategy="epoch",  # Named eval_strategy in newer transformers releases
    save_strategy="epoch",
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
)

# Train the model, then report the validation metrics
trainer.train()
print(trainer.evaluate())

# Save the model and tokenizer
model.save_pretrained(model_path)
tokenizer.save_pretrained(model_path)

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

# Test the model
model = AutoModelForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)

# Create a text-generation pipeline ("text2text-generation" is for encoder-decoder models)
qa_pipeline = pipeline("text-generation", model=model, tokenizer=tokenizer)

# Test the model on some questions based on Science Fiction movies
questions = [
    "What is the main theme of the movie Blade Runner?",
    "Who is the author of the novel Dune?",
    "What is the name of the spaceship in the Alien movie?",
    "Who directed the movie The Matrix?",
    "What is the setting of the Star Wars series?",
    "What is the similarity between the movie Matrix and Star Wars?",
]

for question in questions:
    # max_new_tokens limits only the generated continuation, not the prompt
    answer = qa_pipeline(question, max_new_tokens=50)[0]["generated_text"]
    print(f"Q: {question}\nA: {answer}\n")
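
The questions above are passed to the model as bare strings. Dolly v2 was instruction-tuned with a fixed prompt layout, so wrapping each question in that layout may give better answers. The sketch below assumes the layout matches the one published with the model, which should be verified against the model card. `build_prompt` and `extract_answer` are hypothetical helpers, and the sample continuation is a placeholder, not model output.

```python
# Prompt layout assumed to match the one Dolly v2 was instruction-tuned with;
# verify it against the model card before relying on it
INTRO = ("Below is an instruction that describes a task. "
         "Write a response that appropriately completes the request.")
INSTRUCTION_KEY = "### Instruction:"
RESPONSE_KEY = "### Response:"


def build_prompt(question):
    """Wrap a question in the assumed instruction layout."""
    return f"{INTRO}\n\n{INSTRUCTION_KEY}\n{question}\n\n{RESPONSE_KEY}\n"


def extract_answer(generated_text):
    """Return the text after the response marker, or the full text if absent."""
    _, marker, tail = generated_text.partition(RESPONSE_KEY)
    return tail.strip() if marker else generated_text.strip()


prompt = build_prompt("Who directed the movie The Matrix?")
# Placeholder continuation standing in for real model output
print(extract_answer(prompt + "placeholder answer"))
```

With the pipeline from the cell above, `build_prompt(question)` would be passed in place of the raw question, and `extract_answer` applied to the returned `generated_text`.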