# About this Notebook
This notebook outlines the process of gathering data for an LLM application. This matters because understanding the ingredients of an LLM helps us reason about how and why it behaves in certain ways.

Pretraining for proprietary models is not transparent, so it is difficult to know how and why models such as GPT-4, Claude, and Llama behave the way they do.

By gathering data from the internet, we can train our own models and study how their behavior relates to the data they were trained on.


### Ingredients of an LLM 

1. Pretraining Data 
Garbage in, garbage out. In this notebook we will explore popular pretraining datasets and dig into the preprocessing steps taken to ensure that high-quality data is fed into our models.

2. Vocabulary and Tokenizer
To build a model over a language, we first have to determine the vocabulary of the language we are modelling. We also need to break the stream of text into the correct vocabulary units, a process referred to as tokenization.

3. Learning Objective
What is the model being trained to do? Pretraining focuses on general skills such as syntax, semantics, and reasoning, so that the model can handle any task we throw at it, even one it was not explicitly trained on. Training objectives should therefore be general enough to capture all of these skills.

4. Architecture
What is the internal architecture of the model? How does it process information? The architecture refers to the components of the model, how they connect and interact with each other, and how they process input. Each architecture has its own inductive bias: a set of assumptions about the data and tasks it will be used for. Biasing the model towards certain types of solutions can lead to better performance on certain tasks.

5. Fine Tuning
The components above describe the base model. We can then fine-tune it using supervised instruction fine-tuning, RLHF, or domain-adaptive or task-adaptive continued pretraining, so that the model is better attuned to specific domains and tasks.
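
As a toy illustration of ingredients 2 and 3, the sketch below builds a word-level vocabulary, tokenizes a sentence into integer IDs, and forms (context, next token) training pairs. It assumes whitespace tokenization and a next-token prediction objective, neither of which is specified in this notebook; real LLMs typically use subword tokenizers. The helpers `build_vocab`, `encode`, and `next_token_pairs` are hypothetical names introduced only for this example.

```python
def build_vocab(corpus):
    """Map each distinct whitespace-separated word to an integer ID."""
    vocab = {"<unk>": 0}  # reserve ID 0 for out-of-vocabulary words
    for sentence in corpus:
        for word in sentence.lower().split():
            vocab.setdefault(word, len(vocab))
    return vocab


def encode(text, vocab):
    """Tokenize text into a list of vocabulary IDs."""
    return [vocab.get(word, vocab["<unk>"]) for word in text.lower().split()]


def next_token_pairs(token_ids):
    """Pair every prefix with the token that follows it (assumed objective)."""
    return [(token_ids[:i], token_ids[i]) for i in range(1, len(token_ids))]


corpus = ["the cat sat", "the dog sat down"]
vocab = build_vocab(corpus)
ids = encode("the cat sat down quietly", vocab)
pairs = next_token_pairs(ids)

print(vocab)
print(ids)       # "quietly" is not in the vocabulary, so it maps to <unk>
print(pairs[0])
```

Words that never appeared in the corpus collapse to `<unk>`, which is one reason the choice of vocabulary matters before any training starts.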

![Base Models and Derivatives](images/basemodels_and_derivatives.png)

Base models are pretrained on large amounts of data and can be fine-tuned into derivative models for specific tasks and domains.


### Popular Pretraining Datasets
![Popular Pre Training Datasets](images/popular_pretraining_datasets.png)


### Exploring the Content of Pretraining Datasets

In [7]:
from datasets import load_dataset

# Load the dataset
realnewslike = load_dataset("allenai/c4", "realnewslike", streaming=True, split="train")

# Iterate over the dataset
for i, example in enumerate(realnewslike):
    if "Iceland" in example["text"]:
        print(example)
    if i == 10000:  # Limit to 10,000 iterations for demonstration
        break

{'text': 'Kerth, 67, was selected from 164 applicants to be the organization’s new chief executive. He sits on the board of the Sacramento Municipal Utility District and is a scion of the Kerth family who founded the American Ice Co. and the Iceland skating rink in North Sacramento. He is responsible for the design of ice rinks in what’s now called Sleep Train Arena in Sacramento and an outdoor rink at Squaw Valley’s High Camp in 1990.\nVeteran nonprofit leader Pam Saltenberger, who served as Sacramento Habitat’s interim CEO for about six months, said she and other members of the executive search team felt that Kerth’s experience in the political arena would help the organization lead and navigate discussions about how governments and nonprofits can work together to ensure there’s adequate housing for low-income residents.\nThe loss of government funding is not the only challenge for the Sacramento Habitat, which has a budget of $5.9 million for the fiscal year ending in June. Nationwi
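
The loop above breaks when `i == 10000`, after the record at index 10000 has already been checked, so it inspects 10,001 records. A minimal sketch of an alternative cap uses `itertools.islice`, which works on any iterable. The helper name `take_first` and the toy stream are illustrative assumptions, not part of the original cell.

```python
from itertools import islice


def take_first(stream, n):
    """Return an iterator over at most n items from any iterable."""
    return islice(stream, n)


# Stand-in for the streamed dataset: any iterable of {"text": ...} records works.
toy_stream = ({"text": f"document {i}"} for i in range(25))

# Only the first 10 records (document 0 through document 9) are inspected.
matches = [ex for ex in take_first(toy_stream, 10) if "7" in ex["text"]]
print(matches)
```

Passing the streamed `realnewslike` object in place of `toy_stream` should behave the same way, since the loop above already iterates over it directly.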

In [9]:
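import ssl

import nltk

# Workaround for certificate errors during nltk.download: fall back to an
# unverified HTTPS context. This disables certificate verification for the
# whole process, so use it only in a trusted environment.
try:
    _create_unverified_https_context = ssl._create_unverified_context
except AttributeError:
    pass
else:
    ssl._create_default_https_context = _create_unverified_https_context

nltk.download('stopwords')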

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/jamespotter/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [13]:
from collections import Counter
import re
from nltk.corpus import stopwords
import nltk


# Download stopwords if not already downloaded
try:
    stop_words = set(stopwords.words('english'))
except LookupError:
    nltk.download('stopwords')
    stop_words = set(stopwords.words('english'))

# Initialize counter
word_freq = Counter()

# Process first 10000 documents
for i, example in enumerate(realnewslike):
    # Convert to lowercase and split into words
    words = example["text"].lower().split()
    
    # Remove stopwords and count remaining words
    content_words = [word for word in words if word not in stop_words]
    word_freq.update(content_words)
    
    if i == 10000:  # Limit to 10,000 documents for demonstration
        break

# Print the 20 most common words
most_common_first_pass = []
print("20 most common words:")
for word, count in word_freq.most_common(20):
    print(f"{word}: {count}")
    most_common_first_pass.append(word)
    



20 most common words:
said: 10766
one: 9141
new: 9111
also: 8372
would: 7899
two: 5796
first: 5732
people: 5603
said.: 5512
like: 5465
last: 5250
could: 4663
time: 4502
—: 4475
get: 4277
many: 3951
even: 3881
make: 3776
–: 3696
may: 3689
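
The list above contains `said.` and two dash characters as separate entries because `str.split()` only splits on whitespace, so punctuation stays attached to words and standalone dashes count as words. The sketch below compares that behaviour with a regex that extracts runs of letters, using a made-up sentence. The pattern and the tiny stopword set are illustrative assumptions.

```python
import re
from collections import Counter

sentence = "He said. She said \u2014 and then he said, again."
toy_stop_words = {"he", "she", "and", "then"}  # tiny illustrative stand-in

# Whitespace splitting keeps punctuation attached to words.
split_counts = Counter(
    w for w in sentence.lower().split() if w not in toy_stop_words
)

# Extracting runs of ASCII letters drops punctuation and standalone dashes.
regex_counts = Counter(
    w for w in re.findall(r"[a-z]+", sentence.lower()) if w not in toy_stop_words
)

print(split_counts)
print(regex_counts)
```

Note that `[a-z]+` also splits words containing accented letters, so a pattern based on `\w` may be a better fit for multilingual text.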


In [14]:
# The most frequent words from the first pass (e.g. "said", "one", "new") are not stop words,
# but they say little about the topics in the data, so we filter them out and look for topical words.
# Initialize counter for topic analysis
topic_freq = Counter()

# Process first 10000 documents, looking for meaningful topics
for i, example in enumerate(realnewslike):
    # Convert to lowercase and tokenize
    text = example["text"].lower()
    
    # Remove special characters and extra whitespace
    text = re.sub(r'[^\w\s]', ' ', text)
    text = ' '.join(text.split())
    
    # Split into words and remove stopwords
    words = text.split()
    content_words = [word for word in words 
                    if word not in stop_words 
                    and len(word) > 3 
                    and not word.isdigit() 
                    and word not in most_common_first_pass
                    and not any(c in word for c in ['—', '–'])] 
    
    # Count meaningful words that could represent topics
    topic_freq.update(content_words)
    
    if i == 10000:
        break

# Print the 20 most common meaningful topics
print("\n20 most common topics:")
for word, count in topic_freq.most_common(20):
    print(f"{word}: {count}")



20 most common topics:
year: 8716
years: 5273
state: 4518
well: 3862
three: 3841
back: 3661
world: 3616
company: 3581
work: 3467
city: 3459
much: 3245
million: 3212
government: 3175
high: 3164
take: 3131
president: 3128
home: 3125
made: 3093
school: 3058
since: 3007


In [15]:
# The results are improving, but generic words such as "well", "back", and "much" remain.
# Next, we keep only nouns.
# Initialize counter for topic analysis with only nouns
topic_freq = Counter()

# Process first 10000 documents, looking for meaningful topics
for i, example in enumerate(realnewslike):
    # Convert to lowercase and tokenize
    text = example["text"].lower()
    
    # Remove special characters and extra whitespace 
    text = re.sub(r'[^\w\s]', ' ', text)
    text = ' '.join(text.split())
    
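    # Split into words (stopwords are filtered after tagging)
    words = text.split()
    
    # Tag parts of speech; nltk.pos_tag needs the NLTK perceptron tagger
    # resource, which must be downloaded beforehand with nltk.download
    pos_tags = nltk.pos_tag(words)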
    
    # Only keep nouns (NN = noun singular, NNS = noun plural, NNP = proper noun singular, NNPS = proper noun plural)
    content_words = [word for word, pos in pos_tags 
                    if pos.startswith('NN')
                    and word not in stop_words
                    and len(word) > 3
                    and not word.isdigit()
                    and word not in most_common_first_pass
                    and not any(c in word for c in ['—', '–'])]
    
    # Count meaningful nouns that could represent topics
    topic_freq.update(content_words)
    
    if i == 10000:
        break

# Print the 20 most common meaningful noun topics
print("\n20 most common noun topics:")
for word, count in topic_freq.most_common(20):
    print(f"{word}: {count}")





20 most common noun topics:
year: 8716
years: 5273
state: 4518
world: 3616
company: 3581
city: 3459
government: 3175
president: 3128
school: 3058
home: 2975
team: 2923
game: 2895
percent: 2855
week: 2737
part: 2727
business: 2697
season: 2679
country: 2358
life: 2352
group: 2255


In [16]:
# Print the 20 least common meaningful noun topics
print("\n20 least common noun topics:")
for word, count in topic_freq.most_common()[:-21:-1]:
    print(f"{word}: {count}")


20 least common noun topics:
husak: 1
montgomerie: 1
banqueting: 1
mallaghan: 1
traum: 1
giesler: 1
wherley: 1
bock: 1
votech: 1
purvis: 1
lavash: 1
djan: 1
noshe: 1
logbooks: 1
partnerwere: 1
esquino: 1
learjet: 1
domene: 1
ávila: 1
yihai: 1
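
Every word in the list above has a count of 1, so which 20 are shown depends only on how `Counter.most_common()` orders ties (by first occurrence), not on any real difference in rarity. A more informative summary is the share of distinct words that occur exactly once. The sketch below computes it on a toy counter with made-up counts, not the notebook's data.

```python
from collections import Counter

# Toy frequencies standing in for topic_freq; the counts are made up.
toy_freq = Counter({"year": 5, "state": 3, "learjet": 1, "lavash": 1, "logbooks": 1})

# Words seen exactly once (often called hapax legomena).
singletons = sorted(word for word, count in toy_freq.items() if count == 1)
share = len(singletons) / len(toy_freq)

print(singletons)
print(f"{share:.0%} of distinct words occur once")
```

Applying the same two lines to `topic_freq` would show how long the tail of one-off names and misspellings (such as `partnerwere`) is in the sampled documents.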


### Synthetic Data 
Cosmopedia is a synthetic dataset hosted on the Hugging Face Hub. This notebook uses its 100K-example subset, `HuggingFaceTB/cosmopedia-100k`.

In [20]:
from datasets import load_dataset
import random

# Load the Cosmopedia-100K dataset
dataset = load_dataset("HuggingFaceTB/cosmopedia-100k")

# Define some varied prompts to test
varied_prompts = [
    "Write a scientific article about black holes",
    "Compose a historical essay about ancient Rome",
    "Create a technical guide about machine learning",
    "Write a philosophical discussion about consciousness",
    "Explain quantum physics for beginners"
]

# Look at examples with different prompts
print("\nExamining Cosmopedia synthetic data samples with varied prompts:\n")

# Get random samples and pair with different prompts
samples = dataset['train'].select(random.sample(range(len(dataset['train'])), 5))

for i, (example, prompt) in enumerate(zip(samples, varied_prompts)):
    print(f"\nExample {i+1}:")
    print(f"Original Prompt: {example['prompt']}")
    print(f"Alternative Prompt: {prompt}")
    print(f"\nGenerated text: {example['text'][:500]}...")  # Show first 500 chars
    print("\n" + "="*80)

# Analyze dataset statistics
print("\nDataset Statistics:")
print(f"Total number of examples: {len(dataset['train'])}")
print(f"Average text length (first 100 examples): {sum(len(ex['text']) for ex in dataset['train'].select(range(100)))/100:.0f} characters")

# Analyze prompt diversity
unique_prompts = set(example['prompt'] for example in dataset['train'].select(range(1000)))
print(f"Number of unique prompts in first 1000 examples: {len(unique_prompts)}")


Generating train split: 100%|██████████| 100000/100000 [00:00<00:00, 170309.29 examples/s]


Examining Cosmopedia synthetic data samples:


Example 1:
Prompt: Here is an extract from a webpage: "What can cause my settlement offer to be delayed?
When you’ve been injured in an Austin truck accident, one of the most common questions is how long it will take for the insurance company to make an offer to settle your case. The answer depends on a variety of factors.
The process starts with filing an insurance claim and providing evidence that shows exactly what happened during the accident and who was at fault. This can involve gathering key Austin truck accident evidence such as:
- Medical records
- Photographs or video footage of the crash scene
- Witness statements
- Other documents related to your injuries and damages.
Once this information has been collected by both sides, negotiations may begin between your Austin truck accident lawyer and the insurance company on how much compensation should be offered in exchange for settling the case out of court.
It is important to rememb
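
The cell above draws its five rows with an unseeded `random.sample`, so each run shows different examples. A minimal sketch of reproducible sampling, assuming a fixed seed is acceptable for the analysis, is shown below. The helper name `sample_indices` is hypothetical.

```python
import random


def sample_indices(n_rows, k, seed=0):
    """Return k distinct row indices, reproducible for a given seed."""
    rng = random.Random(seed)  # local generator, leaves global random state untouched
    return rng.sample(range(n_rows), k)


first = sample_indices(100_000, 5)
second = sample_indices(100_000, 5)
print(first == second)  # same seed, same indices
```

The resulting list could be passed to `dataset['train'].select(...)` in place of the inline `random.sample(...)` call.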


