<a href="https://colab.research.google.com/github/nageswar307/LLMFromScratch/blob/main/LLMFromScratch.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## **Working with text and building our own tokenizer**

In [6]:
import re

In [7]:
with open("/content/drive/MyDrive/the-verdict.txt","r",encoding="utf-8") as f:
  raw_text=f.read()
print(len(raw_text))
print(raw_text[:100])

20479
I HAD always thought Jack Gisburn rather a cheap genius--though a good fellow enough--so it was no g


In [8]:
text = "Hello, world. This, is a test."
result = re.split(r"(\s)", text)
print(result)

['Hello,', ' ', 'world.', ' ', 'This,', ' ', 'is', ' ', 'a', ' ', 'test.']


In [9]:
result = re.split(r'([.,]|\s)', text)
result = [item for item in result if item.strip()]
print(result)

['Hello', ',', 'world', '.', 'This', ',', 'is', 'a', 'test', '.']


In [10]:
text = "Hello, world. Is this-- a test?"
result = re.split(r'([,.:;?_!"()\']|--|\s)', text)
result = [item for item in result if item.strip()]
print(result)

['Hello', ',', 'world', '.', 'Is', 'this', '--', 'a', 'test', '?']


In [11]:
preprocessed = re.split(r'([,.?_!"()\']|--|\s)', raw_text)
preprocessed = [item for item in preprocessed if item.strip()]
len(preprocessed)
print(preprocessed[:50])

['I', 'HAD', 'always', 'thought', 'Jack', 'Gisburn', 'rather', 'a', 'cheap', 'genius', '--', 'though', 'a', 'good', 'fellow', 'enough', '--', 'so', 'it', 'was', 'no', 'great', 'surprise', 'to', 'me', 'to', 'hear', 'that', ',', 'in', 'the', 'height', 'of', 'his', 'glory', ',', 'he', 'had', 'dropped', 'his', 'painting', ',', 'married', 'a', 'rich', 'widow', ',', 'and', 'established', 'himself']


In [12]:
all_words = sorted(list(set(preprocessed)))
vocab_size = len(all_words)
print(vocab_size)

1159


In [13]:
vocab = {token: idx for idx,token in enumerate(all_words)}
for i,j in enumerate(vocab.items()):
  print(j)
  if i==20:
    break

('!', 0)
('"', 1)
("'", 2)
('(', 3)
(')', 4)
(',', 5)
('--', 6)
('.', 7)
(':', 8)
(';', 9)
('?', 10)
('A', 11)
('Ah', 12)
('Among', 13)
('And', 14)
('Are', 15)
('Arrt', 16)
('As', 17)
('At', 18)
('Be', 19)
('Begin', 20)


In [14]:
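class SimpleTokenizerV1:
  def __init__(self, vocab):
    self.str_to_int = vocab  # token string -> token ID
    self.int_to_str = {i: s for s, i in vocab.items()}  # token ID -> token string

  def encode(self, text):
    # Split on punctuation, double dashes, and whitespace, keeping the delimiters
    preprocessed = re.split(r'([,.?_!"()\']|--|\s)', text)
    # Drop empty strings and whitespace-only items
    preprocessed = [item.strip() for item in preprocessed if item.strip()]
    # Raises KeyError for any token that is not in the vocabulary
    ids = [self.str_to_int[s] for s in preprocessed]
    return ids

  def decode(self, ids):
    text = " ".join([self.int_to_str[i] for i in ids])
    # Remove the space that the join inserted before punctuation
    text = re.sub(r'\s+([,.?!"()\'])', r'\1', text)
    return text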

In [15]:
tokenizer = SimpleTokenizerV1(vocab)
tokenizer.int_to_str[1000]

'tea'

In [16]:
text = """"It's the last he painted, you know," Mrs. Gisburn said with pardonable """
ids = tokenizer.encode(text)
print(ids)

[1, 58, 2, 872, 1013, 615, 541, 763, 5, 1155, 608, 5, 1, 69, 7, 39, 873, 1136, 773]


In [17]:
tokenizer.decode(ids)

'" It\' s the last he painted, you know," Mrs. Gisburn said with pardonable'
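SimpleTokenizerV1 looks every token up directly in the vocabulary, so encoding a word that never occurred in the source text raises a KeyError. One common workaround, sketched here under stated assumptions, is to reserve a placeholder entry and map unseen tokens to it. The toy vocabulary, the `<|unk|>` token name, and the helper `encode_with_unk` are illustrative and not part of the notebook.

```python
import re

SPLIT_PATTERN = r'([,.?_!"()\']|--|\s)'  # same pattern as the tokenizer above


def encode_with_unk(text, str_to_int, unk_token="<|unk|>"):
    """Encode text, mapping out-of-vocabulary tokens to the unk_token ID."""
    tokens = [t.strip() for t in re.split(SPLIT_PATTERN, text) if t.strip()]
    unk_id = str_to_int[unk_token]
    return [str_to_int.get(t, unk_id) for t in tokens]


# Hypothetical toy vocabulary with a reserved placeholder as the last entry.
toy_tokens = sorted({"Hello", ",", "world", "."}) + ["<|unk|>"]
toy_vocab = {tok: idx for idx, tok in enumerate(toy_tokens)}

# "strange" is not in the toy vocabulary, so it maps to the placeholder ID.
ids = encode_with_unk("Hello, strange world.", toy_vocab)
print(ids)
```

The trade-off is that decoding can no longer recover the original word, which is one motivation for the subword approach in the next section.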

In [18]:
txt = """From Wikipedia, the free encyclopedia
This article is about natural and human-made phenomena and structures of the world. For other uses of "Wonders of the World", see Wonders of the World (disambiguation).

The Seven Wonders of the Ancient World (from left to right, top to bottom): Great Pyramid of Giza, Hanging Gardens of Babylon, Temple of Artemis at Ephesus, Statue of Zeus at Olympia, Mausoleum at Halicarnassus (also known as the Mausoleum of Mausolus), Colossus of Rhodes, and the Lighthouse of Alexandria as depicted by 16th-century Dutch artist Maarten van Heemskerck.

Map of places listed in various Wonders of the World lists
Various lists of the Wonders of the World have been compiled from antiquity to the present day, in order to catalogue the world's most spectacular natural features and human-built structures.

The Seven Wonders of the Ancient World is the oldest known list of this type, documenting the most iconic and remarkable human-made creations of classical antiquity; it was based on guidebooks popular among Hellenic sightseers and as such only includes works located around the Mediterranean rim and in the ancient Near East. The number seven was chosen because the Greeks believed it represented perfection and plenty, and because it reflected the number of planets known in ancient times (five) plus the Sun and Moon.[1]

Seven Wonders of the Ancient World
Main article: Seven Wonders of the Ancient World

The Great Pyramid of Giza, the only wonder of the ancient world still in existence
The Greek historian Herodotus (484 – c. 425 BC) and the scholar Callimachus of Cyrene (c. 305–240 BC), at the Museum of Alexandria, made early lists of seven wonders. These lists have not survived, however, except as references in other writings.

The classic Seven Wonders were:

Great Pyramid of Giza, in Giza, Egypt, the earliest of the wonders to be completed, as well as the only one that still exists in the present day.
Colossus of Rhodes, in the harbor of the city of Rhodes, on the Greek island of the same name.
Hanging Gardens of Babylon, in Babylon, near present-day Hillah, Babylon Governorate, Iraq; or Nineveh, Mosul, Nineveh Governorate, Iraq.
Lighthouse of Alexandria, in Alexandria, Egypt.
Mausoleum at Halicarnassus, in Halicarnassus, a city of the Achaemenid Empire in present-day Turkey.
Statue of Zeus at Olympia, in Olympia, Greece.
Temple of Artemis at Ephesus, in the city of Ephesus, near present-day Selçuk, Turkey.
Lists from other eras
In the 19th and early 20th centuries, some writers emulated the classical list by creating their own lists with names such as "Wonders of the Middle Ages", "Seven Wonders of the Middle Ages", "Seven Wonders of the Medieval Mind", and "Architectural Wonders of the Middle Ages".[2] It is unlikely that any of these lists actually originated in the Middle Ages since the concept of a "Middle Age" did not become popular until at least the 16th century and the word "medieval" was not invented until the Enlightenment era. Brewer's Dictionary of Phrase and Fable refers to them as "later list[s]",[3] suggesting the lists were created after the Middle Ages.

Many of the structures on these lists were built much earlier than the Middle Ages but were well known throughout the world.[4][5] Typically representative of such lists are:[3][4][6][7]

Catacombs of Kom El Shoqafa, a 2nd-century funerary complex in Alexandria, Egypt.
Colosseum, a 1st-century amphitheatre in the centre of the city of Rome, Italy.
Great Wall of China, a series of defensive fortifications built across the historical northern borders of China, with some segments dating to as early as the 7th century BC.
Hagia Sophia, a 6th-century cathedral and mosque in Istanbul, Turkey.
Leaning Tower of Pisa, a 12th-century bell tower in Pisa, Italy.
Porcelain Tower of Nanjing, a 15th-century pagoda on the south bank of the external Qinhuai River in Nanjing, China.
Stonehenge, a Neolithic henge monument in Wiltshire, England dated to the 3rd millennium BC.
Other structures sometimes included on such lists include:

Cairo Citadel, a 13th-century Islamic fortification in Cairo, Egypt.[8]
Cluny Abbey, a 10th-century Benedictine monastery in Cluny, Saône-et-Loire, France.[9]
Ely Cathedral, a (currently Anglican) cathedral originally built in the 11th century in Ely, Cambridgeshire, England.[10]

Recent lists
Following in the tradition of the classical list, modern people and organisations have made their own lists of wonderful things, both ancient and modern, natural and artificial. Some of the most notable lists are presented below.

American Society of Civil Engineers

CN Tower in Toronto, Canada
In 1994, the American Society of Civil Engineers compiled a list of Seven Wonders of the Modern World, paying tribute to the "greatest civil engineering achievements of the 20th century".[11][12]

American Society of Civil Engineers Wonders
Wonder	Date started	Date finished	Location	Significance
Channel Tunnel	December 1, 1987	May 6, 1994	Strait of Dover, in the English Channel between the United Kingdom and France	Longest undersea portion of any tunnel in the world
CN Tower	February 6, 1973	June 26, 1976	Toronto, Ontario, Canada	Tallest freestanding structure in the world from 1976 to 2007
Empire State Building	March 17, 1930	April 11, 1931	New York City, New York, United States	Tallest structure in the world from 1931 to 1954; tallest freestanding structure in the world from 1931 to 1967; tallest building in the world from 1931 to 1970; first building with 100+ stories
Golden Gate Bridge	January 5, 1933	May 27, 1937	Golden Gate Strait, north of San Francisco, California, United States	Longest main span of any suspension bridge in the world from 1937 to 1964
Itaipu Dam	January 1970	May 5, 1984	Paraná River, on the border between Brazil and Paraguay	Largest operating hydroelectric facility in the world in terms of annual energy generation[13]
Netherlands North Sea Protection Works (Delta and Zuiderzee Works)	1920	May 10, 1997	Zeeland, South Holland, North Holland, Friesland and Flevoland, Netherlands	Largest hydraulic engineering project undertaken by the Netherlands during the 20th century
Panama Canal	January 1, 1880	January 7, 1914	Isthmus of Panama	Allows passage of oceangoing vessels between the Atlantic and Pacific oceans; one of the largest and most difficult engineering projects ever undertaken
USA Today's New Seven Wonders

Old City of Jerusalem
In November 2006, the American national newspaper USA Today and the American television show Good Morning America revealed a list of the "New Seven Wonders", both natural and human-made, as chosen by six judges.[14] The Grand Canyon was added as an eighth wonder on November 24, 2006, in response to viewer feedback.[15]

USA Today's New Seven Wonders
Wonder	Location
Potala Palace	Lhasa, Tibet
Old City of Jerusalem	Israel[n 1]
Polar ice caps	Earth's polar regions (Arctic and Antarctic)
Papahānaumokuākea Marine National Monument	Hawaii, United States
The Internet	Worldwide
Mayan ruins	Yucatán Peninsula, México
Great Migration of Serengeti and Masai Mara	Tanzania and Kenya
Grand Canyon (viewer-chosen eighth wonder)	Arizona, United States
Seven Natural Wonders of the World

Victoria Falls
Similar to the other lists of wonders, there is no consensus on a list of seven natural wonders of the world, and there has been debate over how large such a list should be. One of many existing versions of this list was compiled by CNN in 1997:[16]

Aurora, in the Earth's high-latitude regions (around the Arctic and Antarctic)
Grand Canyon, in Arizona, United States
Great Barrier Reef, off the coast of Queensland, Australia
Harbor of Rio de Janeiro, Brazil
Mount Everest, on the border of Nepal and China
Parícutin volcano, located in the state of Michoacán, Mexico
Victoria Falls, on the border of Zambia and Zimbabwe
New 7 Wonders of the World

El Castillo at Chichen Itza
In 2001, an initiative was started by the Swiss corporation New7Wonders Foundation to choose the New 7 Wonders of the World from a selection of 200 existing monuments through online votes.[17] The Great Pyramid of Giza—part of the Giza Pyramids, the only remaining wonder of the traditional Seven Wonders of the Ancient World, was not one of the winners announced in 2007 but was added as an honorary candidate.[18][19]

Wonder	Date of construction	Present-day location
Great Wall of China	Since 7th century BC[20]	China
Petra	c. 100 BC	Ma'an, Jordan
Christ the Redeemer	opened to the public October 12, 1931	Rio de Janeiro, Brazil
Machu Picchu	c. AD 1450	Urubamba Province, Peru
Chichen Itza	c. AD 600	Yucatán, Mexico
Colosseum	completed AD 80	Rome, Italy
Taj Mahal	completed c. AD 1648	Agra, India
Giza Pyramids (honorary candidates)	completed c. 2560 BC	Giza, Egypt

New 7 Wonders of Nature

Jeju Island
A similar contemporary effort to create a list of seven natural (as opposed to human-made) wonders chosen through a global poll, called the New 7 Wonders of Nature, was organized from 2007 to 2011 by the same group as the New 7 Wonders of the World campaign.

Iguazu Falls, on the border of the Argentine province of Misiones and the Brazilian state of Paraná
Hạ Long Bay, in Quảng Ninh province, Vietnam
Jeju Island, in the Jeju Province of South Korea
Puerto Princesa Underground River, in Palawan, Philippines
Table Mountain, overlooking the city of Cape Town, South Africa
Komodo Island, one of the 17,508 islands that comprise the Republic of Indonesia
Amazon rainforest, located in Brazil, Peru, Colombia, Venezuela, Ecuador, Bolivia, Guyana, Suriname, and French Guiana
New 7 Wonders Cities

Calle Crisologo, Vigan City
New 7 Wonders Cities, a third list organized by New7Wonders and determined by another global vote, includes entire cities:

Durban, South Africa
Vigan, Philippines
Havana, Cuba
Kuala Lumpur, Malaysia
Beirut, Lebanon
Doha, Qatar
La Paz, Bolivia
Seven Wonders of the Underwater World

The Great Barrier Reef
The list of "Seven Wonders of the Underwater World" was drawn up by CEDAM International, an American-based non-profit group for divers that is dedicated to ocean preservation and research. In 1989, CEDAM brought together a panel of marine scientists, including Eugenie Clark, to choose underwater areas which they considered worthy of protection. The results were announced at The National Aquarium in Washington, D.C., by actor Lloyd Bridges, star of TV's Sea Hunt:[21]

Palau
Belize Barrier Reef, Belize
Great Barrier Reef, Australia
Deep-sea hydrothermal vents (worldwide)
Galápagos Islands, Ecuador
Lake Baikal, Russia
Northern Red Sea, bordered by Saudi Arabia and Yemen on the eastern shore, and Egypt, Sudan, Eritrea, and Djibouti on the western shore
Seven Wonders of the Industrial World

Bell Rock Lighthouse
British author Deborah Cadbury wrote Seven Wonders of the Industrial World, a book telling the stories of seven great feats of engineering of the 19th and early 20th centuries.[22] In 2003, the BBC aired a seven-part docudrama exploring the same feats, with Cadbury as a producer.[23]

Wonder	Description	Completed
SS Great Eastern	British oceangoing passenger steamship	1858
Bell Rock Lighthouse	in the North Sea off the coast of Angus, Scotland	1810
Brooklyn Bridge	in New York City, New York, United States	1883
London sewerage system	serving London, England	1870
First transcontinental railroad	1,912-mile (3,077 km) continuous railroad line connecting existing rail networks in Iowa, Nebraska, Wyoming, Utah, Nevada, and California in the United States	1869
Panama Canal	51-mile (82 km) artificial waterway crossing the Isthmus of Panama and connecting the Atlantic and Pacific oceans	1914
Hoover Dam	on the Colorado River, spanning the border between Nevada and Arizona in the United States	1936
Seven Wonders of the Solar System

Enceladus
In a 1999 article, Astronomy magazine listed the "Seven Wonders of the Solar System". This article was later made into a video.[24]

Enceladus, a moon of Saturn
The Great Red Spot of Jupiter, a massive and persistent anticyclonic storm in the planet's southern hemisphere
The asteroid belt, a region of innumerable small solid bodies located between the orbits of Mars and Jupiter
The surface of the Sun
The oceans of Earth
The Rings of Saturn
Olympus Mons, an enormous shield volcano on Mars and the tallest planetary mountain in the Solar System
Other lists of wonders of the world
Many authors and organisations have composed lists of the wonders of the world that have been published in book or magazine form.

Seven Wonders of the World is a 1956 film in which Lowell Thomas searches the world for natural and artificial wonders and invites the audience to try to update the ancient Wonders of the World list."""

In [19]:
lst_txt = re.split(r"([,.?_!\"()\']|--|\s)",txt)
lst_txt = [item for item in lst_txt if item.strip()]
len(lst_txt)

2371

In [20]:
vocab = sorted(list(set(lst_txt)))
vocab_size = len(vocab)
vocab = {token: idx for idx,token in enumerate(vocab)}

In [21]:
class tokenizerV1:
  def __init__(self, vocab):
    self.str_to_int = vocab
    self.int_to_str = {v:k for k,v in vocab.items()}

  def encode(self,text):
    preprocessed_text = re.split(r"([,.?_!\"()\']|--|\s)",text)
    preprocessed_text = [item.strip() for item in preprocessed_text if item.strip()] # e.g. ["I", "am", "a", "man", ...]
    print("preprocessed_text --",preprocessed_text)
    token_ids = [self.str_to_int[token] for token in preprocessed_text]
    return token_ids

  def decode(self, token_ids): # e.g. [100, 234, 908, 567, ...]
    # token_id avoids shadowing the built-in id()
    decoded_text = " ".join([self.int_to_str[token_id] for token_id in token_ids])
    decoded_text = re.sub(r'\s+([,.?!"()\'])',r'\1',decoded_text)
    return decoded_text


In [22]:
tknzr = tokenizerV1(vocab)
print(tknzr.str_to_int)
print(tknzr.int_to_str)

{'"': 0, "'": 1, '(': 2, ')': 3, ',': 4, '.': 5, '077': 6, '1': 7, '10': 8, '100': 9, '100+': 10, '10th-century': 11, '11': 12, '11th': 13, '12': 14, '12th-century': 15, '13th-century': 16, '1450': 17, '15th-century': 18, '1648': 19, '16th': 20, '16th-century': 21, '17': 22, '1810': 23, '1858': 24, '1869': 25, '1870': 26, '1880': 27, '1883': 28, '1914': 29, '1920': 30, '1930': 31, '1931': 32, '1933': 33, '1936': 34, '1937': 35, '1954;': 36, '1956': 37, '1964': 38, '1967;': 39, '1970': 40, '1970;': 41, '1973': 42, '1976': 43, '1984': 44, '1987': 45, '1989': 46, '1994': 47, '1997': 48, '1997:[16]': 49, '1999': 50, '19th': 51, '1]': 52, '1st-century': 53, '200': 54, '2001': 55, '2003': 56, '2006': 57, '2007': 58, '2011': 59, '20th': 60, '24': 61, '2560': 62, '26': 63, '27': 64, '2nd-century': 65, '3': 66, '305–240': 67, '3rd': 68, '425': 69, '484': 70, '5': 71, '508': 72, '51-mile': 73, '6': 74, '600': 75, '6th-century': 76, '7': 77, '7th': 78, '80': 79, '82': 80, '912-mile': 81, ':': 82,

In [23]:
test_txt = "right scientists results............"
tknzr.encode(test_txt)

preprocessed_text -- ['right', 'scientists', 'results', '.', '.', '.', '.', '.', '.', '.', '.', '.', '.', '.', '.']


[790, 796, 788, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5]

In [24]:
test_ids = [790, 796, 788, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5]
tknzr.decode(test_ids)

'right scientists results............'

## **BPE - Byte Pair Encoding**

In [31]:
!pip install tiktoken
import tiktoken

Collecting tiktoken
  Downloading tiktoken-0.7.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.6 kB)
Downloading tiktoken-0.7.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.1 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.1/1.1 MB 11.3 MB/s eta 0:00:00
Installing collected packages: tiktoken
Successfully installed tiktoken-0.7.0


In [32]:
tokenizer = tiktoken.get_encoding("gpt2")

In [33]:
tokenizer.decode([3506, 5519, 2482])

'right scientists results'

In [34]:
text = "Hello, do you like tea? <|endoftext|> In the sunlit terra"
tkn_ids = tokenizer.encode(text, allowed_special={"<|endoftext|>"})
print(tkn_ids)

[15496, 11, 466, 345, 588, 8887, 30, 220, 50256, 554, 262, 4252, 18250, 1059, 430]


In [35]:
tokenizer.decode(tkn_ids)

'Hello, do you like tea? <|endoftext|> In the sunlit terra'

In [36]:
gibberish_txt = "awkyummy simplysuperb neetlykept"
tkn_ids = tokenizer.encode(gibberish_txt)
print(tkn_ids)

[707, 2584, 13513, 2391, 16668, 65, 497, 316, 306, 45089]


In [37]:


tokenizer.decode(tkn_ids)

'awkyummy simplysuperb neetlykept'

In [38]:
tokenizer.encode("qwghnyklmnaorcbddbcj")

[80, 86, 456, 3281, 41582, 76, 2616, 273, 21101, 1860, 15630, 73]

In [39]:
tokenizer.decode([80, 86, 456, 3281, 41582, 76, 2616, 273, 21101, 1860, 15630, 73])

'qwghnyklmnaorcbddbcj'

In [40]:
for i in [707, 2584, 13513, 2391, 16668, 65, 497, 316, 306, 45089]:
  print(tokenizer.decode([i]))

aw
ky
ummy
 simply
super
b
 ne
et
ly
kept


In [41]:
tokenizer.encode("Akwirwier")

[33901, 86, 343, 86, 959]
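The outputs above show the GPT-2 tokenizer splitting unfamiliar strings into smaller subword pieces instead of failing. To give a rough intuition for where those pieces come from, here is a toy sketch of the core BPE training step: count adjacent symbol pairs and merge the most frequent one. It works on characters of a made-up string and is not how tiktoken is implemented (GPT-2's BPE operates on bytes with a pre-trained merge table); the helper names are hypothetical.

```python
from collections import Counter


def most_frequent_pair(symbols):
    """Return the most common adjacent pair of symbols."""
    pairs = Counter(zip(symbols, symbols[1:]))
    return pairs.most_common(1)[0][0]


def merge_pair(symbols, pair):
    """Replace every non-overlapping occurrence of `pair` with one merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged


# Start from single characters and apply two merge steps.
symbols = list("lowlowerlowest")
for _ in range(2):
    pair = most_frequent_pair(symbols)
    symbols = merge_pair(symbols, pair)
print(symbols)
```

After two merges the frequent fragment "low" has become a single symbol, while rarer characters stay separate. This is why any string can still be encoded.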

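## **2.6 Data sampling with a sliding window**
Data sampling with a sliding window is the process of creating input sequences and their targets: for each input window, the target is the same window shifted one token to the right, so the model learns to predict the next token.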
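As a minimal sketch of the idea, the following pure-Python helper builds (input, target) pairs from a list of token IDs. The function name `sliding_window_pairs` and the made-up IDs are assumptions for illustration; the notebook's own PyTorch version appears further down.

```python
def sliding_window_pairs(token_ids, context_size, stride=1):
    """Return (input, target) pairs where the target is the input shifted right by one token."""
    pairs = []
    for start in range(0, len(token_ids) - context_size, stride):
        inputs = token_ids[start:start + context_size]
        targets = token_ids[start + 1:start + context_size + 1]
        pairs.append((inputs, targets))
    return pairs


toy_ids = [10, 20, 30, 40, 50, 60]  # made-up token IDs
pairs = sliding_window_pairs(toy_ids, context_size=4)
for inputs, targets in pairs:
    print(inputs, "-->", targets)
```

Inputs and targets always have the same length, and a larger stride reduces the overlap between consecutive windows.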

In [42]:
with open("/content/drive/MyDrive/the-verdict.txt","r",encoding="utf-8") as f:
  raw_text = f.read()

raw_text

'I HAD always thought Jack Gisburn rather a cheap genius--though a good fellow enough--so it was no great surprise to me to hear that, in the height of his glory, he had dropped his painting, married a rich widow, and established himself in a villa on the Riviera. (Though I rather thought it would have been Rome or Florence.)\n\n"The height of his glory"--that was what the women called it. I can hear Mrs. Gideon Thwing--his last Chicago sitter--deploring his unaccountable abdication. "Of course it\'s going to send the value of my picture \'way up; but I don\'t think of that, Mr. Rickham--the loss to Arrt is all I think of." The word, on Mrs. Thwing\'s lips, multiplied its _rs_ as though they were reflected in an endless vista of mirrors. And it was not only the Mrs. Thwings who mourned. Had not the exquisite Hermia Croft, at the last Grafton Gallery show, stopped me before Gisburn\'s "Moon-dancers" to say, with tears in her eyes: "We shall not look upon its like again"?\n\nWell!--even 

In [43]:
enc_txt = tokenizer.encode(raw_text)
enc_txt[:10]

[40, 367, 2885, 1464, 1807, 3619, 402, 271, 10899, 2138]

In [44]:
tokenizer.decode([40, 367, 2885, 1464, 1807, 3619, 402, 271, 10899, 2138])

'I HAD always thought Jack Gisburn rather'

In [45]:
enc_sample = enc_txt[50:]
context_size = 4
x = enc_sample[:context_size]
y = enc_sample[1:context_size+1]
print(enc_sample[:10])
print(x,y)
print(tokenizer.decode(enc_sample[:10]))
print(tokenizer.decode(x))
print(tokenizer.decode(y))

[290, 4920, 2241, 287, 257, 4489, 64, 319, 262, 34686]
[290, 4920, 2241, 287] [4920, 2241, 287, 257]
 and established himself in a villa on the Riv
 and established himself in
 established himself in a


In [46]:
for i in range(1,context_size+1):
  context = enc_sample[:i]
  desired = enc_sample[i:i+1]
  print(tokenizer.decode(context),"====>" ,tokenizer.decode(desired))

 and ====>  established
 and established ====>  himself
 and established himself ====>  in
 and established himself in ====>  a


In [47]:
!pip3 install torch



In [48]:
import torch
from torch.utils.data import Dataset, DataLoader

In [49]:
class GPTDatasetV1(Dataset):
  def __init__(self,text,tokenizer, max_length,stride):
    self.text = text
    self.tokenizer = tokenizer
    self.max_length = max_length
    self.stride = stride
    self.input_ids = []
    self.target_ids = []

    # Tokenize the entire text once
    token_ids = self.tokenizer.encode(self.text)
    # Slide a window of max_length tokens over the text, moving by stride each step
    for i in range(0,len(token_ids)-self.max_length,self.stride):
      input_chunks = token_ids[i:i+self.max_length]
      # Targets are the inputs shifted one position to the right
      target_chunks = token_ids[i+1:i+self.max_length+1]
      self.input_ids.append(torch.tensor(input_chunks))
      self.target_ids.append(torch.tensor(target_chunks))


  def __len__(self):
    # Number of (input, target) windows
    return len(self.input_ids)

  def __getitem__(self, idx):
    # Return one (input, target) pair of tensors
    return self.input_ids[idx], self.target_ids[idx]




In [50]:
text = """
In Python, when working with classes, using self.input_ids.append(...) instead of just input_ids.append(...) is necessary to modify the instance variable input_ids of the class rather than a local variable. Let's break this down to understand the difference.

Why Use self.input_ids.append(...)?
Instance Variables vs. Local Variables:

When you define self.input_ids in the __init__ method, it creates an instance variable that belongs to the instance of the class (self). This means self.input_ids is accessible throughout the entire instance of the class, and its state will be preserved across method calls.
On the other hand, if you were to just use input_ids.append(...), you would be referring to a local variable named input_ids within the scope of the __init__ method. However, there is no such local variable named input_ids; only self.input_ids exists as an instance variable.
Scope of Variables:

self.input_ids is an instance variable that can be accessed from any method within the class.
input_ids (without self) would be looked up as a local or global name within the __init__ method, and using it would raise a NameError because no variable called input_ids has been defined. The only input_ids defined in your class is self.input_ids.
Modifying Instance State:

By using self.input_ids.append(...), you are appending to the list that is an attribute of the instance (self). This ensures that any other method within the class that accesses self.input_ids will see the changes made.
If you use input_ids.append(...), even if input_ids were defined locally, it would only modify a local copy, and changes would not be reflected in the instance variable.
An Example to Illustrate
Consider the following simplified example to illustrate the difference:"""

In [51]:
tokenizer = tiktoken.get_encoding("gpt2")

In [52]:
dataset = GPTDatasetV1(text,tokenizer,10,5)

In [53]:
# Inspect only the first input/target pair
print(dataset.input_ids[0])
print(dataset.target_ids[0])

tensor([  198,   818, 11361,    11,   618,  1762,   351,  6097,    11,  1262])
tensor([  818, 11361,    11,   618,  1762,   351,  6097,    11,  1262,  2116])


## **PyTorch Practice**

In [54]:
import torch

In [55]:
tensor0d = torch.tensor(100)
tensor1d = torch.tensor([1,2,3,4,5])
tensor2d = torch.tensor([[1,2,3],[4,5,6]])
tensor3d = torch.tensor([[[1,2,3],[4,5,6]],[[7,8,9],[10,11,12]]])
print(tensor0d)
print(tensor1d)
print(tensor2d)
print(tensor3d)

tensor(100)
tensor([1, 2, 3, 4, 5])
tensor([[1, 2, 3],
        [4, 5, 6]])
tensor([[[ 1,  2,  3],
         [ 4,  5,  6]],

        [[ 7,  8,  9],
         [10, 11, 12]]])


In [56]:

print(tensor3d.dtype)
floatvec = torch.tensor([1.0,2.0,3.0])
print(floatvec.dtype)

torch.int64
torch.float32


When we create tensors from Python floats, **PyTorch uses 32-bit precision (torch.float32) by default**, as the cell above shows. Python integers, in contrast, become 64-bit integers (torch.int64).
1. A 32-bit floating-point number offers **sufficient precision for most deep learning tasks while consuming less memory** and fewer computational resources than a 64-bit floating-point number.
2. **GPU architectures are optimized for 32-bit computations, so this data type can significantly speed up model training and inference.**
3. **The precision can be changed readily with a tensor's .to method.**

The following code demonstrates this by converting a 64-bit integer tensor into a 32-bit float tensor:

In [57]:
tensor0d.to(torch.float32)

tensor(100.)
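To make the memory argument concrete, the short sketch below compares the bytes used per element for the default integer and float types and for an explicit 64-bit float. It assumes PyTorch's default dtype has not been changed; the variable names are illustrative.

```python
import torch

int_tensor = torch.tensor([1, 2, 3])            # Python ints -> torch.int64
float_tensor = torch.tensor([1.0, 2.0, 3.0])    # Python floats -> torch.float32
double_tensor = float_tensor.to(torch.float64)  # explicit up-cast to 64-bit

for name, t in [("int", int_tensor), ("float", float_tensor), ("double", double_tensor)]:
    # element_size() is the number of bytes used per element
    print(name, t.dtype, t.element_size(), "bytes per element")
```

Note that `.to` returns a new tensor with the requested dtype and leaves the original unchanged.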

**Common PyTorch tensor operations**

.shape, .size(), .reshape(to_size) (alternative: .view(to_size)), .T, .mT, and matmul (or @)


In [58]:
tensor2d.shape, tensor2d.size()

(torch.Size([2, 3]), torch.Size([2, 3]))

In [59]:
print(tensor2d)

tensor([[1, 2, 3],
        [4, 5, 6]])


In [60]:
print(tensor2d.reshape(3,2))

tensor([[1, 2],
        [3, 4],
        [5, 6]])


In [61]:
tensor3d,tensor3d.shape

(tensor([[[ 1,  2,  3],
          [ 4,  5,  6]],
 
         [[ 7,  8,  9],
          [10, 11, 12]]]),
 torch.Size([2, 2, 3]))

In [62]:
tensor3d.view(2,6)

tensor([[ 1,  2,  3,  4,  5,  6],
        [ 7,  8,  9, 10, 11, 12]])

In [63]:
tensor3d.view(6,2)

tensor([[ 1,  2],
        [ 3,  4],
        [ 5,  6],
        [ 7,  8],
        [ 9, 10],
        [11, 12]])

In [64]:
tensor3d.view(3,2,2)

tensor([[[ 1,  2],
         [ 3,  4]],

        [[ 5,  6],
         [ 7,  8]],

        [[ 9, 10],
         [11, 12]]])

In [65]:
tensor2d

tensor([[1, 2, 3],
        [4, 5, 6]])

In [66]:
tensor2d.T

tensor([[1, 4],
        [2, 5],
        [3, 6]])

In [67]:
tensor3d

tensor([[[ 1,  2,  3],
         [ 4,  5,  6]],

        [[ 7,  8,  9],
         [10, 11, 12]]])

In [68]:
tensor3d.mT

tensor([[[ 1,  4],
         [ 2,  5],
         [ 3,  6]],

        [[ 7, 10],
         [ 8, 11],
         [ 9, 12]]])

In [69]:
tensor3d.shape , tensor3d.mT.shape

(torch.Size([2, 2, 3]), torch.Size([2, 3, 2]))

In [70]:
tensor2d.matmul(tensor2d.T)

tensor([[14, 32],
        [32, 77]])

In [71]:
tensor2d @ tensor2d.T

tensor([[14, 32],
        [32, 77]])
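The cells above use .reshape and .view interchangeably, but they differ in one respect: .view only works when the requested shape is compatible with the tensor's existing memory layout, while .reshape falls back to copying when it is not. A small sketch of that difference, using a transposed tensor as the non-contiguous case (variable names are illustrative):

```python
import torch

t = torch.tensor([[1, 2, 3], [4, 5, 6]])
transposed = t.T  # shares storage with t, but is no longer contiguous

print(transposed.is_contiguous())

try:
    transposed.view(6)  # a flat view is impossible with this memory layout
    view_failed = False
except RuntimeError:
    view_failed = True

flat = transposed.reshape(6)  # copies the data when a view is impossible
print(view_failed, flat)
```

In practice, .reshape is the safer default; .view is useful when you want a guarantee that no copy is made.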

1. load input text
2. tokenize the input text
3. create input and target tensors using dataset and dataloader modules of PyTorch
4. create embeddings for those input and target tensors

## 1. load input text

In [72]:
with open("/content/drive/MyDrive/the-verdict.txt","r",encoding="utf-8") as f:
  raw_text = f.read()

In [73]:
# !pip install tiktoken
import tiktoken

## 2. tokenize the input text


In [74]:
tokenizer = tiktoken.get_encoding("gpt2")

## 3. create input and target tensors using dataset and dataloader modules of PyTorch

In [75]:
# !pip install torch
import torch
from torch.utils.data import Dataset, DataLoader

In [76]:
class GPTDatasetV1(Dataset):
  def __init__(self,text,max_length,stride):
    self.text = text
    self.max_length = max_length
    self.stride = stride
    self.input_ids = []
    self.target_ids = []

    token_ids = tokenizer.encode(text)  # relies on the global tokenizer created in step 2
    for i in range(0,len(token_ids)-max_length,stride):
      input_chunks = token_ids[i:i+max_length]
      target_chunks = token_ids[i+1:i+max_length+1]
      self.input_ids.append(torch.tensor(input_chunks))
      self.target_ids.append(torch.tensor(target_chunks))


  def __len__(self):
    return len(self.input_ids)

  def __getitem__(self,idx):
    return self.input_ids[idx], self.target_ids[idx]



In [77]:
def dataloader_v1(txt,max_length,stride,batch_size,shuffle=True,drop_last=True):
  dataset = GPTDatasetV1(txt,max_length,stride)
  dataloader = DataLoader(dataset,batch_size=batch_size,shuffle=shuffle,drop_last=drop_last)
  return dataloader

In [78]:
dataloader = dataloader_v1(raw_text,max_length=10,stride=5,batch_size=8)

In [88]:
input_iter = iter(dataloader)
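To see what the loader yields without needing the text file or tiktoken, here is a small sketch that runs on made-up token IDs. ToyWindowDataset and toy_ids are illustrative names, not part of the notebook; the windowing logic mirrors GPTDatasetV1 above.

```python
import torch
from torch.utils.data import Dataset, DataLoader


class ToyWindowDataset(Dataset):
    """Sliding-window dataset over a plain list of token IDs (illustrative only)."""

    def __init__(self, token_ids, max_length, stride):
        self.inputs, self.targets = [], []
        for i in range(0, len(token_ids) - max_length, stride):
            self.inputs.append(torch.tensor(token_ids[i:i + max_length]))
            self.targets.append(torch.tensor(token_ids[i + 1:i + max_length + 1]))

    def __len__(self):
        return len(self.inputs)

    def __getitem__(self, idx):
        return self.inputs[idx], self.targets[idx]


toy_ids = list(range(100))  # stand-in for real token IDs
loader = DataLoader(ToyWindowDataset(toy_ids, max_length=4, stride=4),
                    batch_size=8, shuffle=False, drop_last=True)

# The default collate function stacks the per-sample tensors into batch tensors.
inputs, targets = next(iter(loader))
print(inputs.shape, targets.shape)
print(inputs[0], targets[0])
```

Because the toy IDs are consecutive integers, every target equals its input plus one, which makes the one-token shift easy to verify by eye.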

## 4. creating embeddings for those input and target tensors

Note that we initialize these embedding weights with random values as a
preliminary step. This initialization serves as the starting point for the LLM's
learning process; the embedding weights are then optimized as part of
LLM training.
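A rough sketch of what "optimized as part of training" means for an embedding layer: after one gradient step, only the rows that were actually looked up change. The loss here is a placeholder sum, not a language-model loss, and the sizes and learning rate are arbitrary assumptions.

```python
import torch

torch.manual_seed(0)
embedding = torch.nn.Embedding(6, 3)
before = embedding.weight.detach().clone()

optimizer = torch.optim.SGD(embedding.parameters(), lr=0.1)
token_ids = torch.tensor([2, 3])
loss = embedding(token_ids).sum()  # placeholder loss, not a language-model loss
loss.backward()
optimizer.step()

after = embedding.weight.detach()
# True for rows whose values moved during the update
changed_rows = (after != before).any(dim=1)
print(changed_rows)
```

Rows for token IDs that never appear in a batch receive a zero gradient and therefore keep their random initial values until those tokens are seen.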

In [80]:
input_ids = torch.tensor([2,3,5,1])

For simplicity and illustration, suppose we have a small
vocabulary of only 6 tokens (instead of the 50,257 tokens in the BPE
tokenizer vocabulary), and we want to create embeddings of size 3 (in GPT-3,
the embedding size is 12,288 dimensions):

In [81]:
vocab_size = 6
output_dim = 3

In [82]:
torch.manual_seed(42)
embedding_layer = torch.nn.Embedding(vocab_size,output_dim)
embedding_layer.weight

Parameter containing:
tensor([[ 1.9269,  1.4873, -0.4974],
        [ 0.4396, -0.7581,  1.0783],
        [ 0.8008,  1.6806,  0.3559],
        [-0.6866,  0.6105,  1.3347],
        [-0.2316,  0.0418, -0.2516],
        [ 0.8599, -0.3097, -0.3957]], requires_grad=True)

We can see that the weight matrix of the embedding layer contains small,
random values. These values are optimized during LLM training as part of
the LLM optimization itself.
Moreover, we can see that the weight matrix has six rows and three columns:
one row for each of the six possible tokens in the vocabulary and
one column for each of the three embedding dimensions.

In [83]:
embedding_layer(torch.tensor(3))

tensor([-0.6866,  0.6105,  1.3347], grad_fn=<EmbeddingBackward0>)

If we compare the embedding vector for token ID 3 to the previous
embedding matrix, we see that it is identical to the 4th row (Python starts
with a zero index, so it's the row corresponding to index 3). In other words,
the embedding layer is essentially a look-up operation that retrieves rows
from the embedding layer's weight matrix via a token ID.
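The look-up view can be checked directly. The sketch below compares the embedding layer's output with plain row indexing into the weight matrix, and with the equivalent (but less efficient) one-hot matrix multiplication. The variable names are illustrative.

```python
import torch

torch.manual_seed(42)
embedding = torch.nn.Embedding(6, 3)
token_ids = torch.tensor([2, 3, 5, 1])

looked_up = embedding(token_ids)       # shape: (4, 3)
indexed = embedding.weight[token_ids]  # plain row indexing into the weight matrix

# One-hot rows select the same weight rows via matrix multiplication.
one_hot = torch.nn.functional.one_hot(token_ids, num_classes=6).float()
via_matmul = one_hot @ embedding.weight

print(torch.equal(looked_up, indexed))
print(torch.allclose(looked_up, via_matmul))
```

Both comparisons should agree, which is why an embedding layer can be thought of as an efficient replacement for a one-hot encoding followed by a linear layer without bias.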

In [84]:
vocab_size = 50257
output_dim = 256 # GPT-3 uses 12,288; a smaller size keeps this example light
torch.manual_seed(42)
token_embedding_layer = torch.nn.Embedding(vocab_size,output_dim)
token_embedding = token_embedding_layer(torch.tensor(0))
token_embedding.size()

torch.Size([256])

Positional embedding layer (one learnable vector per position in the context)

In [85]:
#we assume context_length same as max_length
max_length = 4
context_length = max_length
torch.manual_seed(42)
pos_embedding_layer = torch.nn.Embedding(context_length,output_dim)
pos_embedding = pos_embedding_layer(torch.arange(max_length))
pos_embedding

tensor([[ 1.9269,  1.4873,  0.9007,  ...,  0.5655,  0.5058,  0.2225],
        [-0.6855,  0.5636, -1.5072,  ...,  0.4232, -0.3389,  0.5180],
        [-1.3638,  0.1930, -0.6103,  ..., -1.6034, -0.4298,  0.5762],
        [ 0.3444, -3.1016, -1.4587,  ...,  1.1085,  0.5544,  1.5818]],
       grad_fn=<EmbeddingBackward0>)

In [86]:
torch.arange(max_length)

tensor([0, 1, 2, 3])

In [87]:
pos_embedding_layer(torch.arange(max_length))

tensor([[ 1.9269,  1.4873,  0.9007,  ...,  0.5655,  0.5058,  0.2225],
        [-0.6855,  0.5636, -1.5072,  ...,  0.4232, -0.3389,  0.5180],
        [-1.3638,  0.1930, -0.6103,  ..., -1.6034, -0.4298,  0.5762],
        [ 0.3444, -3.1016, -1.4587,  ...,  1.1085,  0.5544,  1.5818]],
       grad_fn=<EmbeddingBackward0>)
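The positional embeddings become useful once they are added to the token embeddings. A minimal sketch of that step, assuming a batch of 8 sequences with 4 tokens each; the random token IDs stand in for a real batch from the data loader, and the variable names are illustrative:

```python
import torch

torch.manual_seed(42)
vocab_size, output_dim = 50257, 256
batch_size, context_length = 8, 4

token_embedding_layer = torch.nn.Embedding(vocab_size, output_dim)
pos_embedding_layer = torch.nn.Embedding(context_length, output_dim)

# Stand-in batch of token IDs with shape (batch_size, context_length)
token_ids = torch.randint(0, vocab_size, (batch_size, context_length))

token_embeddings = token_embedding_layer(token_ids)                 # (8, 4, 256)
pos_embeddings = pos_embedding_layer(torch.arange(context_length))  # (4, 256)

# Broadcasting adds the same positional vectors to every sequence in the batch.
input_embeddings = token_embeddings + pos_embeddings
print(input_embeddings.shape)
```

For this to work, the positional layer needs at least as many rows as the sequence length, so context_length must match the max_length used when building the data loader.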

In [90]:
dataloader = dataloader_v1(raw_text,max_length=10,stride=5,batch_size=8)
embedding_layer = torch.nn.Embedding(vocab_size,output_dim)
for input_batch, target_batch in dataloader:
    # Convert input_batch to a tensor if it's a list of tensors
    input_batch = torch.stack(input_batch) if isinstance(input_batch, list) else input_batch

    print("Batch of Input Indices:\n", input_batch)

    # Generate embeddings for the input batch
    embeddings = embedding_layer(input_batch)
    print("Batch of Embeddings:\n", embeddings)


Streaming output truncated to the last 5000 lines.
         [-1.6872,  0.3549,  1.5316,  ...,  0.1461,  0.0456,  0.0372],
         ...,
         [-0.6723, -1.4593,  0.9186,  ..., -1.2261, -0.7068,  0.0707],
         [ 2.1120,  1.0673, -1.0083,  ...,  0.2292,  1.3644, -0.4726],
         [-1.8818, -0.1470,  1.4267,  ...,  1.2142, -0.4687,  1.1407]],

        [[ 0.3529,  0.7288,  0.2554,  ...,  1.3014, -0.3387, -1.1856],
         [ 1.7881,  1.9280,  0.7520,  ..., -1.0498,  1.3583,  0.1712],
         [ 1.9184,  0.0880,  0.8104,  ..., -0.6285, -0.0428, -0.4515],
         ...,
         [-1.8782,  0.7290,  0.7753,  ..., -0.0713,  0.1626, -0.8310],
         [-1.1602,  0.3133, -1.2542,  ...,  1.0247,  0.0166, -0.2712],
         [ 2.5644, -0.0383, -0.2427,  ...,  1.1613, -0.2773,  1.5121]],

        ...,

        [[ 0.0133, -0.0784, -0.4398,  ...,  1.2505, -2.7738,  0.2461],
         [-1.3971,  0.5285,  0.3505,  ...,  0.7033, -0.1221,  0.8168],
         [-0.0513,  0.6189, -1.7173, 