In [None]:
#pip install datasets

#### Dataset: `wikipedia/20220301.simple` (Simple English Wikipedia)

The `wikipedia/20220301.simple` dataset is a snapshot of Simple English Wikipedia content as of **March 1, 2022**, available through the Hugging Face `datasets` library.

#### Key Features

- **Source**: Simple English Wikipedia, a version of Wikipedia written in simplified English, accessible to a broader audience, including English learners and individuals with reading difficulties.
- **Snapshot Date**: Captures articles as they were on **March 1, 2022**, including content created or edited up until that date.
- **Content**:
  - Contains the **main text content** of articles, excluding contributor information and revision history.
  - Each entry has four fields: `id`, `url`, `title`, and `text` (the main body content).
  - Language used is straightforward, with simplified vocabulary and sentence structures.

#### Purpose

This dataset is well suited to **language modeling, summarization, classification**, and other **NLP tasks** that benefit from simpler language structures.

#### Split Information

- **Single Split**: Contains only the `train` split, which includes all available entries. The dataset can be used flexibly for training, testing, or evaluation as needed.
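
Because only a `train` split exists, any held-out set has to be carved out by hand. The following is a minimal, library-free sketch of a shuffled holdout split; the `holdout_split` helper, the toy `articles` list, and the 80/20 ratio are illustrative assumptions, not part of the dataset. The `datasets` library also provides `Dataset.train_test_split` for the same purpose.

```python
import random

def holdout_split(records, test_fraction=0.2, seed=42):
    """Shuffle a copy of `records` and split it into (train, test) lists."""
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be between 0 and 1")
    shuffled = list(records)
    # A seeded generator keeps the split reproducible across runs.
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

# Toy stand-in for article rows; real rows would come from the `train` split.
articles = [{"id": str(i), "title": f"Article {i}"} for i in range(50)]
train_rows, test_rows = holdout_split(articles, test_fraction=0.2)
print(len(train_rows), len(test_rows))  # prints: 40 10
```

Fixing the seed matters here: without it, every run would hold out different articles and results would not be comparable.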

In [1]:
from datasets import load_dataset

In [2]:
data = load_dataset("wikipedia", "20220301.simple", split='train[:50]', trust_remote_code=True)
data

Dataset({
    features: ['id', 'url', 'title', 'text'],
    num_rows: 50
})

In [3]:
import pandas as pd

In [4]:
# convert to a DataFrame (can take a couple of minutes on the full dataset)
df = pd.DataFrame(data)

In [5]:
df.shape

(50, 4)

In [9]:
df.head(6)

|   | id | url | title | text |
|---|----|-----|-------|------|
| 0 | 1 | https://simple.wikipedia.org/wiki/April | April | April is the fourth month of the year in the J... |
| 1 | 2 | https://simple.wikipedia.org/wiki/August | August | August (Aug.) is the eighth month of the year ... |
| 2 | 6 | https://simple.wikipedia.org/wiki/Art | Art | Art is a creative activity that expresses imag... |
| 3 | 8 | https://simple.wikipedia.org/wiki/A | A | A or a is the first letter of the English alph... |
| 4 | 9 | https://simple.wikipedia.org/wiki/Air | Air | Air refers to the Earth's atmosphere. Air is a... |
| 5 | 12 | https://simple.wikipedia.org/wiki/Autonomous%2... | Autonomous communities of Spain | Spain is divided in 17 parts called autonomous... |


In [10]:
# save to a local CSV file
# ~210 MB when the full dataset is exported
df.to_csv(r'D:\AI-DATASETS\02-MISC-large/wikipedia_data.csv', index=False)

In [11]:
data[2]

{'id': '6',
 'url': 'https://simple.wikipedia.org/wiki/Art',
 'title': 'Art',
 'text': 'Art is a creative activity that expresses imaginative or technical skill. It produces a product, an object. Art is a diverse range of human activities in creating visual, performing artifacts, and expressing the author\'s imaginative mind. The product of art is called a work of art, for others to experience.\n\nSome art is useful in a practical sense, such as a sculptured clay bowl that can be used. That kind of art is sometimes called a craft.\n\nThose who make art are called artists. They hope to affect the emotions of people who experience it. Some people find art relaxing, exciting or informative. Some say people are driven to make art due to their inner creativity.\n\n"The arts" is a much broader term. It includes drawing, painting, sculpting, photography, performance art, dance, music, poetry, prose and theatre.\n\nTypes of art \n\nArt is divided into the plastic arts, where something is made,

In [12]:
print(data[2]['text'])

Art is a creative activity that expresses imaginative or technical skill. It produces a product, an object. Art is a diverse range of human activities in creating visual, performing artifacts, and expressing the author's imaginative mind. The product of art is called a work of art, for others to experience.

Some art is useful in a practical sense, such as a sculptured clay bowl that can be used. That kind of art is sometimes called a craft.

Those who make art are called artists. They hope to affect the emotions of people who experience it. Some people find art relaxing, exciting or informative. Some say people are driven to make art due to their inner creativity.

"The arts" is a much broader term. It includes drawing, painting, sculpting, photography, performance art, dance, music, poetry, prose and theatre.

Types of art 

Art is divided into the plastic arts, where something is made, and the performing arts, where something is done by humans in action. The other division is betwee

In [13]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50 entries, 0 to 49
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   id      50 non-null     object
 1   url     50 non-null     object
 2   title   50 non-null     object
 3   text    50 non-null     object
dtypes: object(4)
memory usage: 1.7+ KB


In [14]:
#!pip install langchain

In [9]:
#!pip install openai

In [10]:
#!pip install pinecone-client

In [11]:
#!pip install tiktoken

**Pre-processing of the text**

- Split each article into chunks small enough to embed, measuring chunk length in tokens rather than characters.
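
To make the idea of overlapping chunks concrete, here is a simplified sketch. It is an assumption-laden stand-in: it splits on whitespace and counts words instead of tokens, and the `chunk_words` helper is hypothetical. It is not the algorithm used by `RecursiveCharacterTextSplitter`, which tries a list of separators in order.

```python
def chunk_words(text, chunk_size=8, chunk_overlap=2):
    """Split `text` into windows of at most `chunk_size` words,
    where consecutive windows share `chunk_overlap` words."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    words = text.split()
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reaches the end of the text
    return chunks

sample = "Art is a creative activity that expresses imaginative or technical skill"
for chunk in chunk_words(sample, chunk_size=6, chunk_overlap=2):
    print(chunk)
# Art is a creative activity that
# activity that expresses imaginative or technical
# or technical skill
```

The repeated words at each boundary are the overlap: they give every chunk a little context from its neighbor, so a sentence cut at a boundary is still partly visible on both sides.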

In [15]:
import tiktoken

tiktoken.encoding_for_model('gpt-3.5-turbo')

<Encoding 'cl100k_base'>

In [16]:
tokenizer = tiktoken.get_encoding('cl100k_base')

In [17]:
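def tiktoken_len(text):
    """Return the number of cl100k_base tokens in `text`."""
    tokens = tokenizer.encode(
        text,
        disallowed_special=()  # encode special-token strings as ordinary text
    )
    return len(tokens)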

In [18]:
tiktoken_len("hello I am a chunk of text and using the tiktoken_len function "
             "we can find the length of this chunk of text in tokens")

26

In [19]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

In [21]:
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size     = 400,
    chunk_overlap  = 20,
    length_function= tiktoken_len,
    separators     = ["\n\n", "\n", " ", ""]
)

In [23]:
chunks = text_splitter.split_text(data[2]['text'])
chunks

['Art is a creative activity that expresses imaginative or technical skill. It produces a product, an object. Art is a diverse range of human activities in creating visual, performing artifacts, and expressing the author\'s imaginative mind. The product of art is called a work of art, for others to experience.\n\nSome art is useful in a practical sense, such as a sculptured clay bowl that can be used. That kind of art is sometimes called a craft.\n\nThose who make art are called artists. They hope to affect the emotions of people who experience it. Some people find art relaxing, exciting or informative. Some say people are driven to make art due to their inner creativity.\n\n"The arts" is a much broader term. It includes drawing, painting, sculpting, photography, performance art, dance, music, poetry, prose and theatre.\n\nTypes of art \n\nArt is divided into the plastic arts, where something is made, and the performing arts, where something is done by humans in action. The other divis

In [24]:
tiktoken_len(chunks[0]) #, tiktoken_len(chunks[1]), tiktoken_len(chunks[2])

332

In [25]:
from langchain_openai import OpenAIEmbeddings

In [26]:
model_name = 'text-embedding-ada-002'

embed = OpenAIEmbeddings(
    model = model_name,
    #openai_api_key=api_key
)

In [27]:
texts = [
    'this is the first chunk of text',
    'then another second chunk of text is here'
]

In [28]:
res = embed.embed_documents(texts)
len(res), len(res[0])

(2, 1536)
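
Each text is now a 1536-dimensional vector, and retrieval later in this notebook ranks vectors with the `dotproduct` metric. OpenAI documents its embeddings as normalized to unit length; under that assumption, the dot product and cosine similarity give the same score. The sketch below checks this on two hypothetical 3-dimensional vectors (the helper names are illustrative, not from any library).

```python
import math

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    """Cosine similarity: dot product divided by the product of the norms."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def normalize(v):
    """Scale `v` to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

# Hypothetical 3-dimensional "embeddings"; real ada-002 vectors have 1536 dimensions.
a = normalize([1.0, 2.0, 2.0])
b = normalize([2.0, 1.0, 2.0])

# For unit-length vectors both norms are 1, so the two scores coincide.
print(math.isclose(dot(a, b), cosine(a, b)))  # prints: True
```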

#### Vector database

In [29]:
import os
from pinecone import Pinecone

In [30]:
pc_api_key = os.getenv('PINECONE_API_KEY')

In [31]:
# configure client
pc = Pinecone(api_key=pc_api_key)

In [32]:
from pinecone import ServerlessSpec

spec = ServerlessSpec(
    cloud = "aws", 
    region= "us-east-1"
)

In [33]:
index_name = 'langchain-retrieval-augmentation'

existing_indexes = [
    index_info["name"] for index_info in pc.list_indexes()
]

In [34]:
existing_indexes

['langchain-retrieval-augmentation']

In [35]:
import time  # needed by the wait loop below

# check if the index already exists (it shouldn't on the first run)
if index_name not in existing_indexes:
    # if does not exist, create index
    pc.create_index(
        index_name,
        dimension = 1536,  # dimensionality of ada 002
        metric    = 'dotproduct',
        spec      = spec
    )
    
    # wait for index to be initialized
    while not pc.describe_index(index_name).status['ready']:
        time.sleep(2)

In [36]:
import time

In [37]:
# connect to index
index = pc.Index(index_name)

time.sleep(1)

In [38]:
# view index stats
index.describe_index_stats()

{'dimension': 1536,
 'index_fullness': 0.0,
 'namespaces': {'': {'vector_count': 21}},
 'total_vector_count': 21}

In [39]:
len(data)

50

In [40]:
df.columns

Index(['id', 'url', 'title', 'text'], dtype='object')

In [42]:
# Preview the metadata that will be attached to a random sample of 10 rows
for i, record in df.sample(10).iterrows():
    print(i)

    # Retrieve metadata fields for the current record
    metadata = {
        'wiki-id': str(record['id']),
        'source':  record['url'],
        'title':   record['title']
    }
    print(metadata)

29
{'wiki-id': '53', 'source': 'https://simple.wikipedia.org/wiki/Angola', 'title': 'Angola'}
40
{'wiki-id': '71', 'source': 'https://simple.wikipedia.org/wiki/Bankruptcy', 'title': 'Bankruptcy'}
27
{'wiki-id': '51', 'source': 'https://simple.wikipedia.org/wiki/Asteroid', 'title': 'Asteroid'}
7
{'wiki-id': '14', 'source': 'https://simple.wikipedia.org/wiki/Alanis%20Morissette', 'title': 'Alanis Morissette'}
30
{'wiki-id': '54', 'source': 'https://simple.wikipedia.org/wiki/Argentina', 'title': 'Argentina'}
2
{'wiki-id': '6', 'source': 'https://simple.wikipedia.org/wiki/Art', 'title': 'Art'}
43
{'wiki-id': '80', 'source': 'https://simple.wikipedia.org/wiki/Beekeeping', 'title': 'Beekeeping'}
22
{'wiki-id': '45', 'source': 'https://simple.wikipedia.org/wiki/Algebra', 'title': 'Algebra'}
25
{'wiki-id': '49', 'source': 'https://simple.wikipedia.org/wiki/Architecture', 'title': 'Architecture'}
42
{'wiki-id': '76', 'source': 'https://simple.wikipedia.org/wiki/Browser', 'title': 'Browser'}


In [43]:
from tqdm.auto import tqdm
from uuid import uuid4

batch_limit = 100

texts     = []
metadatas = []

# Loop through a sample of 10 rows with tqdm for progress visualization
for i, record in tqdm(df.sample(10).iterrows(), total=10):
    
    # first get metadata fields for this record
    metadata = {
        'wiki-id': str(record['id']),
        'source':  record['url'],
        'title':   record['title']
    }
    
    # now we create chunks from the record text
    record_texts = text_splitter.split_text(record['text'])
    
    # create individual metadata dicts for each chunk
    record_metadatas = [{
        "chunk": j, "text": text, **metadata
    } for j, text in enumerate(record_texts)]
    
    # append these to current batches
    texts.extend(record_texts)
    metadatas.extend(record_metadatas)
    
    # if we have reached the batch_limit we can add texts
    if len(texts) >= batch_limit:
        ids    = [str(uuid4()) for _ in range(len(texts))]
        embeds = embed.embed_documents(texts)
        
        index.upsert(vectors=zip(ids, embeds, metadatas))
        
        texts     = []
        metadatas = []

if len(texts) > 0:
    ids    = [str(uuid4()) for _ in range(len(texts))]
    embeds = embed.embed_documents(texts)
    
    index.upsert(vectors=zip(ids, embeds, metadatas))

In [44]:
index.describe_index_stats()

{'dimension': 1536,
 'index_fullness': 0.0,
 'namespaces': {'': {'vector_count': 21}},
 'total_vector_count': 21}
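
The upsert loop above assigns every chunk a random `uuid4` ID. Since an upsert overwrites a vector only when its ID already exists, running the loop again on the same articles would add new vectors rather than replace the earlier ones. One option, sketched here as an assumption rather than something this notebook does, is to derive the ID from the article ID and chunk number. The `chunk_id` helper and `example_metadatas` list are hypothetical.

```python
def chunk_id(wiki_id, chunk_number):
    """Build a stable ID from the article ID and the chunk position."""
    return f"{wiki_id}-{chunk_number}"

# Hypothetical metadata dicts shaped like the ones built in the upsert loop.
example_metadatas = [
    {"wiki-id": "6", "chunk": 0},
    {"wiki-id": "6", "chunk": 1},
    {"wiki-id": "54", "chunk": 0},
]

stable_ids = [chunk_id(m["wiki-id"], m["chunk"]) for m in example_metadatas]
print(stable_ids)  # prints: ['6-0', '6-1', '54-0']
```

With IDs like these, re-processing an article produces the same IDs, so the same chunks are overwritten in place.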

#### Create the vector store and apply some querying
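- A vector store is not the same thing as an index.
- A vector store (an LLM-framework abstraction) wraps an index together with the embedding function and the metadata field that holds the text.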
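
The distinction can be sketched with two toy classes. Everything here is a hypothetical, in-memory illustration (`ToyIndex`, `ToyVectorStore`, and the keyword-counting `toy_embed` are not Pinecone or LangChain APIs): the index only stores and ranks vectors, while the vector store adds the embedding step and returns text.

```python
class ToyIndex:
    """Stores (id, vector, metadata) rows and ranks them by dot product."""

    def __init__(self):
        self._rows = {}

    def upsert(self, vectors):
        for vec_id, values, metadata in vectors:
            self._rows[vec_id] = (values, metadata)

    def query(self, vector, top_k):
        scored = [
            (sum(x * y for x, y in zip(vector, values)), vec_id, metadata)
            for vec_id, (values, metadata) in self._rows.items()
        ]
        scored.sort(key=lambda row: row[0], reverse=True)
        return scored[:top_k]


class ToyVectorStore:
    """Wraps an index plus an embedding function and a metadata text field."""

    def __init__(self, index, embed_query, text_field):
        self.index = index
        self.embed_query = embed_query
        self.text_field = text_field

    def similarity_search(self, query, k=3):
        # Embed the query text, search the index, return the stored text.
        hits = self.index.query(self.embed_query(query), top_k=k)
        return [metadata[self.text_field] for _, _, metadata in hits]


def toy_embed(text):
    """Hypothetical embedding: counts of three keywords."""
    lowered = text.lower()
    return [float(lowered.count(word)) for word in ("art", "month", "air")]


docs = [
    "Art is a creative activity.",
    "April is the fourth month.",
    "Air is a mixture of gases.",
]
toy_index = ToyIndex()
toy_index.upsert((str(i), toy_embed(d), {"text": d}) for i, d in enumerate(docs))

store = ToyVectorStore(toy_index, toy_embed, "text")
print(store.similarity_search("which month comes after March?", k=1))
# prints: ['April is the fourth month.']
```

The `ToyVectorStore` constructor takes the same three ingredients as the vector store built in this notebook: an index, a query-embedding function, and the name of the metadata field holding the text.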

In [45]:
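# alias the import so it does not shadow the `Pinecone` client class imported from `pinecone`
from langchain.vectorstores import Pinecone as LangchainPinecone

In [46]:
text_field = 'text'

In [47]:
vectorstore = LangchainPinecone(
    index, embed.embed_query, text_field
)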

In [48]:
query = "who was Benito Mussolini?"

In [49]:
vectorstore.similarity_search(
    query,  # our search query
    k =3    # return 3 most relevant docs
)

[Document(metadata={'chunk': 1.0, 'source': 'https://simple.wikipedia.org/wiki/Argentina', 'title': 'Argentina', 'wiki-id': '54'}, page_content='The oldest signs of people in Argentina are in the Patagonia (Piedra Museo, Santa Cruz), and are more than 13,000 years old. In 1480 the Inca Empire conquered northwestern Argentina, making it part of the empire. In the northeastern area, the Guaraní developed a culture based on yuca and sweet potato however typical dishes all around Argentina are pasta, red wines (Italian influence) and beef.\n\nOther languages spoken are Italian, English and German. Lunfardo is Argentinean slang and is a mix of Spanish and Italian. Argentinians are said to speak Spanish with an Italian accent.\n\nArgentina declared independent from Spain in 1816, and achieved it in a War led by José de San Martín in 1818. Many immigrants from Europe came to the country. By the 1920s it was the 7th wealthiest country in the world, but it began a decline after this. In the 194

#### Generative QnA
- Relevant documents are retrieved from the vector store.
- `RetrievalQA` sends these documents, together with the question, to the LLM.
- The LLM then generates a cohesive answer.
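
The chain below uses the `refine` chain type, which answers from the first retrieved document and then revises that answer once per remaining document. The following is a rough sketch of that control flow only; the prompt wording, the `refine_answer` helper, and the `fake_llm` stand-in are assumptions for illustration and do not reproduce LangChain's actual prompts.

```python
def refine_answer(question, documents, llm):
    """Answer from the first document, then revise once per remaining document."""
    if not documents:
        return llm(f"Question: {question}\nAnswer:")
    answer = llm(f"Context: {documents[0]}\nQuestion: {question}\nAnswer:")
    for doc in documents[1:]:
        answer = llm(
            f"Question: {question}\n"
            f"Existing answer: {answer}\n"
            f"New context: {doc}\n"
            "Refine the existing answer if the new context helps:"
        )
    return answer


calls = []

def fake_llm(prompt):
    """Hypothetical stand-in for a chat model; records prompts instead of calling an API."""
    calls.append(prompt)
    return f"draft {len(calls)}"


result = refine_answer("What is art?", ["doc A", "doc B", "doc C"], fake_llm)
print(result, len(calls))  # prints: draft 3 3
```

One consequence of this design is cost: the model is called once per retrieved document, so three documents mean three sequential calls.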



In [50]:
from langchain_openai import ChatOpenAI
from langchain.chains import RetrievalQA

In [51]:
# chat model used to generate the final answer
llm = ChatOpenAI(
    #openai_api_key=api_key,
    model_name = 'gpt-3.5-turbo',
    temperature= 0.0
)


In [52]:
qa = RetrievalQA.from_chain_type(
    llm       = llm,
    chain_type= "refine",
    retriever = vectorstore.as_retriever()
)

In [87]:
qa.run(query)

'Benito Mussolini was an Italian politician and leader who founded the Fascist Party in Italy. He ruled as Prime Minister from 1922 to 1943 and then as dictator from 1925 to 1945. Mussolini was a key figure in the creation of Fascism, an authoritarian and nationalistic political ideology.'