<a href="https://colab.research.google.com/github/keyom-ai/fine-tuning/blob/main/dataset_prep.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Processing WikiHow Data for LLM Fine-tuning

## Summary

This Jupyter/Colab notebook demonstrates the end-to-end process of collecting, cleaning, and formatting data from WikiHow for fine-tuning a Large Language Model (LLM) like Mistral or LLaMA. The notebook covers several key steps in the data preparation pipeline, from web scraping to creating the final dataset in a format suitable for LLM fine-tuning.

By following this notebook, students will learn:
1. How to scrape data from a website
2. Techniques for cleaning and formatting text data
3. The process of tokenization for LLMs
4. Creating prompt-completion pairs for instruction fine-tuning
5. Basic data augmentation techniques
6. Splitting data into training and validation sets
7. Converting data into a common format for LLM fine-tuning (JSONL)

Let's go through each step in detail:

In [65]:
!pip install requests beautifulsoup4 tqdm transformers scikit-learn



In [66]:
# Importing necessary libraries
import requests
from bs4 import BeautifulSoup
import json
import re
from tqdm import tqdm
from transformers import AutoTokenizer
from sklearn.model_selection import train_test_split


## Step 1: Data Collection

In this step, we use web scraping techniques to collect data from WikiHow articles.

**Explanation:**
- We define a function that uses the `requests` library to fetch the HTML content of a WikiHow page and `BeautifulSoup` to parse it.
- The function attempts to find the title and steps of the article using various HTML tags and class names.
- We then loop through a list of URLs, scraping each article and storing the results.


In [67]:
# Step 1: Data Collection
def scrape_wikihow_article(url):
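    # Fail fast on network stalls and HTTP errors instead of parsing an error page
    response = requests.get(url, timeout=10)
    response.raise_for_status()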
    soup = BeautifulSoup(response.content, 'html.parser')

    # Try multiple ways to find the title
    title_element = (
        soup.find('h1', class_='whb') or  # First try
        soup.find('h1', class_='firstHeading') or  # Second try
        soup.find('h1', class_='title') or  # Third try
        soup.find('h1')  # Last resort: any h1 tag
    )

    if title_element:
        title = title_element.text.strip()
    else:
        title = "Title not found"  # Handle cases where title isn't found

    # Try to find steps
    steps = [step.text.strip() for step in soup.find_all('div', class_='step')]

    # If no steps found, try to find list items in the main content
    if not steps:
        main_content = soup.find('div', class_='mf-section-1')
        if main_content:
            steps = [li.text.strip() for li in main_content.find_all('li')]

    return {'title': title, 'steps': steps}
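
The selector fallbacks can be checked offline before any request is sent. The following is a minimal sketch, assuming a hand-written HTML snippet (`SAMPLE_HTML`) and a hypothetical helper `parse_article`; neither is part of the notebook, and real WikiHow markup may differ from this snippet.

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for a page; not real WikiHow markup.
SAMPLE_HTML = """
<html><body>
  <h1 class="title">How to Plant a Tree</h1>
  <div class="step">Select a healthy tree.</div>
  <div class="step">Dig a hole.</div>
</body></html>
"""

def parse_article(html):
    """Hypothetical helper: same lookup idea as the scraper, but on a string."""
    soup = BeautifulSoup(html, 'html.parser')
    # Try a specific class first, then fall back to any <h1>.
    title_element = soup.find('h1', class_='whb') or soup.find('h1')
    title = title_element.get_text(strip=True) if title_element else "Title not found"
    steps = [div.get_text(strip=True) for div in soup.find_all('div', class_='step')]
    return {'title': title, 'steps': steps}

parsed = parse_article(SAMPLE_HTML)
print(parsed)
```

Separating parsing from fetching in this way makes it possible to test the selectors without hitting the site.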

In [68]:
# For demonstration, we'll use a small list of URLs
urls = [
    'https://www.wikihow.com/Budget-Your-Money',
    'https://www.wikihow.com/Tie-a-Tie',
    'https://www.wikihow.com/Plant-a-Tree'
]


In [69]:
articles = []
for url in tqdm(urls, desc="Scraping articles"):
    articles.append(scrape_wikihow_article(url))


Scraping articles: 100%|██████████| 3/3 [00:01<00:00,  2.08it/s]


In [70]:
# Print statement after scraping
print("\nStep 1: Raw scraped data (first article):")
print(json.dumps(articles[0], indent=2))


Step 1: Raw scraped data (first article):
{
  "title": "How to Budget Your Money",
  "steps": [
    "Create a budgeting spreadsheet.[2]\nX\nExpert Source\n\nSamantha Gorelick, CFP\u00aeFinancial Planner\nExpert Interview.  6 May 2020.\n\n  You can create a simple spreadsheet using Google Sheets or Excel. Your goal is to chart all your expenses and income during the course of a year. So, make a spreadsheet that shows all your information clearly, allowing you to quickly identify any areas where you can spend smarter.[3]\nX\nResearch source\n\n\n\n\nLabel the top row with the 12 months of the year.",
    "Find your monthly income after taxes and other payroll deductions. Your net income, or the income that is yours to spend, is your monthly income after taxes, FICA, and other payroll deductions. If you are on a salary, this will be a fixed amount each month, which you can find on your paystub. If you work an hourly position, your income may vary from month to month, but you can find an 

## Step 2 & 3: Data Cleaning and Formatting

These steps involve cleaning the raw scraped data and formatting it into a structured format.

**Explanation:**
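- We use the regular expression `[^\w\s]` to remove every character that is neither a word character nor whitespace. This strips all punctuation, including apostrophes and hyphens (so "won't" becomes "wont"), and leaves scraped footnote residue such as "spreadsheet2" in place.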
- The steps are formatted into a numbered list.
- The cleaned and formatted data is stored in a new structure with 'title' and 'content' fields.

In [71]:
# Step 2 & 3: Data Cleaning and Formatting
def clean_and_format_article(article):
    clean_title = re.sub(r'[^\w\s]', '', article['title'])
    clean_steps = [re.sub(r'[^\w\s]', '', step) for step in article['steps']]
    formatted_steps = "\n".join([f"{i+1}. {step}" for i, step in enumerate(clean_steps)])
    return {
        'title': clean_title,
        'content': formatted_steps
    }

cleaned_articles = [clean_and_format_article(article) for article in articles]

# Print statement after cleaning and formatting
print("\nStep 2 & 3: Cleaned and formatted data (first article):")
print(json.dumps(cleaned_articles[0], indent=2))


Step 2 & 3: Cleaned and formatted data (first article):
{
  "title": "How to Budget Your Money",
  "content": "1. Create a budgeting spreadsheet2\nX\nExpert Source\n\nSamantha Gorelick CFPFinancial Planner\nExpert Interview  6 May 2020\n\n  You can create a simple spreadsheet using Google Sheets or Excel Your goal is to chart all your expenses and income during the course of a year So make a spreadsheet that shows all your information clearly allowing you to quickly identify any areas where you can spend smarter3\nX\nResearch source\n\n\n\n\nLabel the top row with the 12 months of the year\n2. Find your monthly income after taxes and other payroll deductions Your net income or the income that is yours to spend is your monthly income after taxes FICA and other payroll deductions If you are on a salary this will be a fixed amount each month which you can find on your paystub If you work an hourly position your income may vary from month to month but you can find an average amount by loo
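
The output above still contains footnote residue ("spreadsheet2", "X", "Research source") and has lost its sentence punctuation. A gentler cleaning pass might look like the sketch below. The function name `gentle_clean` is hypothetical, and the patterns are assumptions based only on the residue visible in the Step 1 output; they do not handle the "Expert Source" block, and other pages may need different rules.

```python
import re

def gentle_clean(text):
    """Hypothetical alternative cleaner that keeps sentence punctuation."""
    # Drop bracketed footnote markers such as "[2]".
    text = re.sub(r'\[\d+\]', '', text)
    # Drop the "X / Research source" residue left behind by citation pop-ups.
    text = re.sub(r'\n\s*X\s*\n\s*Research source\s*', ' ', text)
    # Collapse runs of whitespace (including newlines) into single spaces.
    return re.sub(r'\s+', ' ', text).strip()

sample = (
    "Pick a species that won't struggle.[1]\nX\nResearch source\n\n\n\n"
    "Ask a local nursery owner."
)
cleaned = gentle_clean(sample)
print(cleaned)
```

Unlike the blanket `[^\w\s]` substitution, this keeps apostrophes and full stops, so sentence boundaries survive into the training data.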

**Explanation:**
This step sets up Hugging Face authentication, because the model we are going to use, Mistral-7B, is a gated model.

In [72]:
!pip install huggingface_hub
from huggingface_hub import notebook_login

notebook_login() # This will prompt you to enter your token




## Step 4: Tokenization

In this step, we tokenize the text using the tokenizer of our target LLM (in this case, Mistral). Remember that Mistral is a gated model: you must go to the Hugging Face website and agree to Mistral's terms before you can use the model and proceed to the next step. If you get a 403 error, it usually means you have not yet agreed to those terms.

**Explanation:**
- We use the Hugging Face `transformers` library to load the Mistral tokenizer.
- Each article is tokenized as the prompt immediately followed by the article content, with no separator between them.
- We truncate each sequence to at most 1024 tokens (adjust this based on your LLM's requirements); anything beyond that limit is discarded.


In [73]:
# Step 4: Tokenization
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")


In [74]:
def tokenize_article(article, max_length=1024):
    prompt = f"How do I {article['title']}?"
    completion = article['content']
    tokens = tokenizer.encode(prompt + completion)
    if len(tokens) > max_length:
        tokens = tokens[:max_length]
    return tokens

tokenized_articles = [tokenize_article(article) for article in cleaned_articles]
# Print statement after tokenization
print("\nStep 4: Tokenized data (first 10 tokens of first article):")
print(tokenized_articles[0][:10])
print(f"Total tokens in first article: {len(tokenized_articles[0])}")



Step 4: Tokenized data (first 10 tokens of first article):
[1, 1602, 511, 315, 1602, 298, 7095, 527, 3604, 17786]
Total tokens in first article: 1024
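
A total of exactly 1024 tokens means the first article hit the limit and was cut off, possibly mid-sentence. One alternative is to drop whole steps that do not fit. The sketch below is only illustrative: `fit_steps` and `toy_encode` are hypothetical names, and the whitespace "tokenizer" is a stand-in so the example runs without downloading a model. With a real tokenizer, the sum of per-step token counts only approximates the length of the concatenated text.

```python
def fit_steps(prompt, steps, encode, max_length=1024):
    """Keep as many whole steps as fit in the token budget (hypothetical helper)."""
    used = len(encode(prompt))
    kept = []
    for step in steps:
        cost = len(encode(step))
        if used + cost > max_length:
            break  # stop at the first step that would overflow the budget
        kept.append(step)
        used += cost
    return kept

def toy_encode(text):
    # Stand-in tokenizer: one "token" per whitespace-separated word.
    return text.split()

steps = ["1. Dig a hole", "2. Place the tree in the hole", "3. Water it"]
kept = fit_steps("How do I plant a tree?", steps, toy_encode, max_length=12)
print(kept)
```

To use the Mistral tokenizer instead, one could pass something like `lambda t: tokenizer.encode(t, add_special_tokens=False)` as `encode`, so that special tokens are not counted once per step.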


## Step 5: Creating Prompt-Completion Pairs

Here, we format our data into prompt-completion pairs, which is a common format for instruction fine-tuning.

**Explanation:**
- We create a prompt in the form of a question: "How do I [article title]?"
- The article content serves as the completion (answer) to this prompt.

In [75]:
# Step 5: Creating Prompt-Completion Pairs
def create_prompt_completion_pair(article):
    return {
        'prompt': f"How do I {article['title']}?",
        'completion': article['content']
    }

prompt_completion_pairs = [create_prompt_completion_pair(article) for article in cleaned_articles]
# Print statement after creating prompt-completion pairs
print("\nStep 5: Prompt-Completion Pair (first article):")
print(json.dumps(prompt_completion_pairs[0], indent=2))
print(f"Total prompt-completion pairs: {len(prompt_completion_pairs)}")



Step 5: Prompt-Completion Pair (first article):
{
  "prompt": "How do I How to Budget Your Money?",
  "completion": "1. Create a budgeting spreadsheet2\nX\nExpert Source\n\nSamantha Gorelick CFPFinancial Planner\nExpert Interview  6 May 2020\n\n  You can create a simple spreadsheet using Google Sheets or Excel Your goal is to chart all your expenses and income during the course of a year So make a spreadsheet that shows all your information clearly allowing you to quickly identify any areas where you can spend smarter3\nX\nResearch source\n\n\n\n\nLabel the top row with the 12 months of the year\n2. Find your monthly income after taxes and other payroll deductions Your net income or the income that is yours to spend is your monthly income after taxes FICA and other payroll deductions If you are on a salary this will be a fixed amount each month which you can find on your paystub If you work an hourly position your income may vary from month to month but you can find an average amount 
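
Because the scraped titles already begin with "How to", the prompt above reads "How do I How to Budget Your Money?". A minimal sketch of one way to avoid the doubled phrase follows; `title_to_prompt` is a hypothetical helper, not part of the notebook, and it assumes the only prefix to remove is "How to".

```python
def title_to_prompt(title):
    """Build a question prompt, dropping a leading 'How to' if present."""
    prefix = "how to "
    task = title.strip()
    if task.lower().startswith(prefix):
        task = task[len(prefix):]
    return f"How do I {task}?"

print(title_to_prompt("How to Budget Your Money"))
print(title_to_prompt("Tie a Tie"))
```

Titles without the prefix pass through unchanged, so the helper is safe to apply to every article.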

## Step 6: Data Augmentation

This step demonstrates a simple data augmentation technique to increase the size and diversity of our dataset.

**Explanation:**
- We create a new prompt variation for each article: "What are the steps to [article title]?"
- This doubles the number of samples and adds some variety in how the question is phrased; the completion is reused unchanged.

In [76]:
# Step 6: Data Augmentation (simplified example)
def augment_data(pair):
    augmented_pairs = [pair]
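    # pair['prompt'][9:] slices off the leading "How do I " (9 characters),
    # leaving the title and the trailing "?"
    augmented_pairs.append({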
        'prompt': f"What are the steps to {pair['prompt'][9:]}",
        'completion': pair['completion']
    })
    return augmented_pairs

augmented_data = [aug_pair for pair in prompt_completion_pairs for aug_pair in augment_data(pair)]
# Print statement after data augmentation
print("\nStep 6: Augmented Data (first two pairs for first article):")
print(json.dumps(augmented_data[:2], indent=2))




Step 6: Augmented Data (first two pairs for first article):
[
  {
    "prompt": "How do I How to Budget Your Money?",
    "completion": "1. Create a budgeting spreadsheet2\nX\nExpert Source\n\nSamantha Gorelick CFPFinancial Planner\nExpert Interview  6 May 2020\n\n  You can create a simple spreadsheet using Google Sheets or Excel Your goal is to chart all your expenses and income during the course of a year So make a spreadsheet that shows all your information clearly allowing you to quickly identify any areas where you can spend smarter3\nX\nResearch source\n\n\n\n\nLabel the top row with the 12 months of the year\n2. Find your monthly income after taxes and other payroll deductions Your net income or the income that is yours to spend is your monthly income after taxes FICA and other payroll deductions If you are on a salary this will be a fixed amount each month which you can find on your paystub If you work an hourly position your income may vary from month to month but you can fin
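
Slicing a fixed number of characters off the prompt works only while the original prompt template stays the same. A template-based variant is sketched below, assuming the task text is available separately; the names `TEMPLATES` and `augment_from_templates` and the third template are illustrative additions, not part of the notebook.

```python
# Illustrative prompt templates; the third one is an invented example.
TEMPLATES = [
    "How do I {task}?",
    "What are the steps to {task}?",
    "Can you explain how to {task}?",
]

def augment_from_templates(task, completion, templates=TEMPLATES):
    """Return one prompt-completion pair per template (hypothetical helper)."""
    return [
        {'prompt': template.format(task=task), 'completion': completion}
        for template in templates
    ]

pairs = augment_from_templates("plant a tree", "1. Dig a hole\n2. Place the tree")
for pair in pairs:
    print(pair['prompt'])
```

Adding a new phrasing then only requires appending a template, with no character offsets to keep in sync.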

## Step 7: Train-Validation Split

Here, we split our data into training and validation sets.

**Explanation:**
- We use scikit-learn's `train_test_split` function to randomly split our data.
- 80% of the data goes into the training set, and 20% into the validation set.

In [77]:
# Step 7: Train-Validation Split
train_data, val_data = train_test_split(augmented_data, test_size=0.2, random_state=42)
# Print statement after train-validation split
print("\nStep 7: Train-Validation Split:")
print(f"Training samples: {len(train_data)}")
print(f"Validation samples: {len(val_data)}")



Step 7: Train-Validation Split:
Training samples: 4
Validation samples: 2
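
Because the split is made over individual samples, two prompt variants of the same article can land on opposite sides of the split, so the validation set may contain completions the model has already seen in training. The sketch below keeps all variants of an article together. `split_by_article` is a hypothetical helper written with the standard library only, and it assumes that identical completions identify the same article.

```python
import random

def split_by_article(pairs, val_fraction=0.2, seed=42):
    """Split so that all pairs sharing a completion stay on the same side."""
    groups = {}
    for pair in pairs:
        groups.setdefault(pair['completion'], []).append(pair)
    keys = sorted(groups)
    random.Random(seed).shuffle(keys)
    # Always hold out at least one article for validation.
    n_val = max(1, round(len(keys) * val_fraction))
    val_keys = set(keys[:n_val])
    train = [p for k in keys if k not in val_keys for p in groups[k]]
    val = [p for k in keys if k in val_keys for p in groups[k]]
    return train, val

# Toy data: three articles, two prompt variants each.
pairs = [
    {'prompt': f"{style} {task}?", 'completion': f"steps for {task}"}
    for task in ("budget", "tie a tie", "plant a tree")
    for style in ("How do I", "What are the steps to")
]
train, val = split_by_article(pairs)
print(len(train), len(val))
```

scikit-learn also provides group-aware splitters such as `GroupShuffleSplit`, which serve the same purpose on larger datasets.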


## Step 8: Format Conversion

Finally, we convert our data into the JSONL format, which is commonly used for LLM fine-tuning.

**Explanation:**
- We write each prompt-completion pair as a JSON object on a new line in the output file.
- We create separate files for the training and validation data.

In [78]:
# Step 8: Format Conversion
def convert_to_jsonl(data, filename):
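    # Set the encoding explicitly so the output does not depend on the platform default
    with open(filename, 'w', encoding='utf-8') as f: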
        for item in data:
            json.dump(item, f)
            f.write('\n')


In [79]:
convert_to_jsonl(train_data, 'train_data.jsonl')
convert_to_jsonl(val_data, 'val_data.jsonl')
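
Before handing the files to a fine-tuning framework, it can be worth reading them back to confirm that every line parses as JSON. The following is a minimal round-trip sketch; `read_jsonl`, the sample record, and the temporary file path are illustrative and not part of the notebook.

```python
import json
import os
import tempfile

def read_jsonl(filename):
    """Load a JSONL file into a list of dicts, skipping blank lines."""
    with open(filename, 'r', encoding='utf-8') as f:
        return [json.loads(line) for line in f if line.strip()]

# Invented sample record; newlines inside strings are escaped by json.dumps,
# so each record still occupies exactly one line in the file.
records = [
    {'prompt': "How do I plant a tree?", 'completion': "1. Dig a hole\n2. Place the tree"}
]
path = os.path.join(tempfile.mkdtemp(), 'sample.jsonl')
with open(path, 'w', encoding='utf-8') as f:
    for record in records:
        f.write(json.dumps(record) + '\n')

loaded = read_jsonl(path)
print(loaded == records)
```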


## Conclusion

This notebook has walked through the entire process of preparing WikiHow data for fine-tuning an LLM. The resulting JSONL files can be used with various LLM fine-tuning frameworks. Remember that this is a simplified example, and in a real-world scenario, you would likely need to handle more edge cases, implement more robust error handling, and possibly include more sophisticated data cleaning and augmentation techniques.

In [80]:
# Print statement after format conversion
print("\nStep 8: Format Conversion")
print("First line of train_data.jsonl:")
with open('train_data.jsonl', 'r', encoding='utf-8') as f:
    print(f.readline().strip())

print(f"\nProcessed {len(articles)} articles.")
print(f"Created {len(train_data)} training samples and {len(val_data)} validation samples.")
print("Data has been saved to 'train_data.jsonl' and 'val_data.jsonl'.")
print("Done!")


Step 8: Format Conversion
First line of train_data.jsonl:
{"prompt": "What are the steps to How to Plant a Tree?", "completion": "1. Select a healthy tree that naturally thrives in your climate Trees live a long time so its important to pick a local species that wont struggle to survive If you arent sure which species grow locally spend some time researching trees that are native to your area1\nX\nResearch source\n\n\n\n\n\nYou can also to a local nursery owner for species suggestions\nTree roots always grow best in their native soil2\nX\nResearch source\n\n\n\n You shouldnt need to amend or fertilize the soil as long as the species is native and climateappropriate3\nX\nResearch source\n2. Plant most tree species in the fall or early spring Cool weather is the best time for planting since the trees are dormant during that time Planting a tree in late spring or summer when the roots are actively growing puts too much stress on the tree and it may not survive4\nX\nResearch source\n\n\n\