In [None]:
#pip install datasets

#### Dataset: `wikipedia/20220301.simple` (Simple English Wikipedia)

The `wikipedia/20220301.simple` dataset is a snapshot of Simple English Wikipedia content as of **March 1, 2022**, available through the Hugging Face `datasets` library.

#### Key Features

- **Source**: Simple English Wikipedia, a version of Wikipedia written in simplified English, accessible to a broader audience, including English learners and individuals with reading difficulties.
- **Snapshot Date**: Captures articles as they were on **March 1, 2022**, including content created or edited up until that date.
- **Content**:
  - Contains the **main text content** of articles, generally excluding metadata (e.g., contributor information, revision history).
  - Each entry typically includes an article **title** and its **main body content**.
  - Language used is straightforward, with simplified vocabulary and sentence structures.

#### Purpose

This dataset is widely used in **language modeling, summarization, classification**, and other **NLP tasks** that benefit from simpler language structures.

#### Split Information

- **Single Split**: Contains only the `train` split, which includes all available entries. The dataset can be used flexibly for training, testing, or evaluation as needed.
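
Because only a `train` split ships, any held-out set has to be carved out by hand. The `datasets` library offers `Dataset.train_test_split` for this. The sketch below shows the same idea with the standard library only, so it runs without downloading anything. The helper name `split_indices`, the 10% test fraction, and the seed are illustrative assumptions and not part of the original notebook.

```python
import random

def split_indices(n_rows, test_fraction=0.1, seed=42):
    """Return (train_idx, test_idx): a reproducible shuffled split of range(n_rows)."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)   # seeded so the split is repeatable
    n_test = int(n_rows * test_fraction)   # rounds down
    return indices[n_test:], indices[:n_test]

# 205328 is the row count this notebook reports for the `train` split.
train_idx, test_idx = split_indices(205328, test_fraction=0.1)
print(len(train_idx), len(test_idx))
```

Once the dataset is loaded, index lists like these could be passed to `data.select(...)` to materialise each subset.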

In [1]:
from datasets import load_dataset

In [2]:
data = load_dataset("wikipedia", "20220301.simple", split='train', trust_remote_code=True)
data

Dataset({
    features: ['id', 'url', 'title', 'text'],
    num_rows: 205328
})

In [17]:
import pandas as pd

In [18]:
# takes a couple of minutes to load
df = pd.DataFrame(data)

In [19]:
df.shape

(205328, 4)

In [20]:
df.sample(6)

Unnamed: 0,id,url,title,text
98461,429538,https://simple.wikipedia.org/wiki/Cha%20Tae-hyun,Cha Tae-hyun,Cha Tae-hyun is a South Korean actor. He was b...
192717,860032,https://simple.wikipedia.org/wiki/Grugliasco,Grugliasco,Grugliasco is a comune in the Metropolitan Cit...
189274,847473,https://simple.wikipedia.org/wiki/T.%20B.%20Jo...,T. B. Joshua,"Temitope Balogun Joshua (June 12, 1963 – June ..."
77285,334179,https://simple.wikipedia.org/wiki/List%20of%20...,List of counties in Tennessee,There are 95 counties in the State of Tennesse...
96603,420665,https://simple.wikipedia.org/wiki/Evapotranspi...,Evapotranspiration,Evapotranspiration is the movement of water fr...
131207,619769,https://simple.wikipedia.org/wiki/Dirk%20Berna...,Dirk Bernard Joseph Schouten,Dirk Bernard Joseph (Dick) Schouten (25 Januar...


In [16]:
# save to a local CSV file (about 210 MB)
df.to_csv(r'D:\AI-DATASETS\02-MISC-large/wikipedia_data.csv', index=False)

In [21]:
data[205324]

{'id': '910287',
 'url': 'https://simple.wikipedia.org/wiki/Bachhan%20Paandey',
 'title': 'Bachhan Paandey',
 'text': 'Bachchhan Paandey is an upcoming Indian Hindi-language action comedy film. It is directed by Farhad Samji, produced by Sajid Nadiawala, and stars Akshay Kumar, Kriti Sanon, Jacqueline Fernandez and Arshad Warsi. It is a remake of the 2014 Tamil film, which is called Jigarthanda,  which itself was inspired by the 2006 South Korean movie A Dirty Carnival. The film is scheduled to be released theatrically on 18 March 2022.\n\nPremise\nA budding director Myra (Kriti Sanon) wants to make a movie on gangsters. She chooses Bachchan Pandey (Akshay Kumar), who is a ruthless gangster. But her secret attempts to conduct the research on him fail when she gets caught for snooping over him.\n\nCast \n Akshay Kumar as Bachchhan Paandey \n Kriti Sanon as Myra Devekar\n Jacqueline Fernandez as Sophie\n Arshad Warsi as Vishu\n Pankaj Tripathi as Bhaves Bhoplo\n Prateik Babbar\n Sanjay M

In [22]:
print(data[205324]['text'])

Bachchhan Paandey is an upcoming Indian Hindi-language action comedy film. It is directed by Farhad Samji, produced by Sajid Nadiawala, and stars Akshay Kumar, Kriti Sanon, Jacqueline Fernandez and Arshad Warsi. It is a remake of the 2014 Tamil film, which is called Jigarthanda,  which itself was inspired by the 2006 South Korean movie A Dirty Carnival. The film is scheduled to be released theatrically on 18 March 2022.

Premise
A budding director Myra (Kriti Sanon) wants to make a movie on gangsters. She chooses Bachchan Pandey (Akshay Kumar), who is a ruthless gangster. But her secret attempts to conduct the research on him fail when she gets caught for snooping over him.

Cast 
 Akshay Kumar as Bachchhan Paandey 
 Kriti Sanon as Myra Devekar
 Jacqueline Fernandez as Sophie
 Arshad Warsi as Vishu
 Pankaj Tripathi as Bhaves Bhoplo
 Prateik Babbar
 Sanjay Mishra as Bufferia Chacha
 Abhimanyu Singh as Pendulum
 Snehal Daabbi
 Saharsh Kumar Shukla as Kaandi

References

2022 movies


In [23]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 205328 entries, 0 to 205327
Data columns (total 4 columns):
 #   Column  Non-Null Count   Dtype 
---  ------  --------------   ----- 
 0   id      205328 non-null  object
 1   url     205328 non-null  object
 2   title   205328 non-null  object
 3   text    205328 non-null  object
dtypes: object(4)
memory usage: 6.3+ MB


In [8]:
#!pip install langchain

In [9]:
#!pip install openai

In [10]:
#!pip install pinecone-client

In [11]:
#!pip install tiktoken

**Pre-processing of the text**

- chunking and related information

In [24]:
import tiktoken

tiktoken.encoding_for_model('gpt-3.5-turbo')

<Encoding 'cl100k_base'>

In [25]:
tokenizer = tiktoken.get_encoding('cl100k_base')

In [26]:
def tiktoken_len(text):
    tokens = tokenizer.encode(
        text,
        disallowed_special=()
    )
    return len(tokens)

In [27]:
tiktoken_len("hello I am a chunk of text and using the tiktoken_len function "
             "we can find the length of this chunk of text in tokens")

26

In [30]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

In [31]:
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size     = 400,
    chunk_overlap  = 20,
    length_function= tiktoken_len,
    separators     = ["\n\n", "\n", " ", ""]
)

In [33]:
chunks = text_splitter.split_text(data[205324]['text'])
chunks

['Bachchhan Paandey is an upcoming Indian Hindi-language action comedy film. It is directed by Farhad Samji, produced by Sajid Nadiawala, and stars Akshay Kumar, Kriti Sanon, Jacqueline Fernandez and Arshad Warsi. It is a remake of the 2014 Tamil film, which is called Jigarthanda,  which itself was inspired by the 2006 South Korean movie A Dirty Carnival. The film is scheduled to be released theatrically on 18 March 2022.\n\nPremise\nA budding director Myra (Kriti Sanon) wants to make a movie on gangsters. She chooses Bachchan Pandey (Akshay Kumar), who is a ruthless gangster. But her secret attempts to conduct the research on him fail when she gets caught for snooping over him.\n\nCast \n Akshay Kumar as Bachchhan Paandey \n Kriti Sanon as Myra Devekar\n Jacqueline Fernandez as Sophie\n Arshad Warsi as Vishu\n Pankaj Tripathi as Bhaves Bhoplo\n Prateik Babbar\n Sanjay Mishra as Bufferia Chacha\n Abhimanyu Singh as Pendulum\n Snehal Daabbi\n Saharsh Kumar Shukla as Kaandi\n\nReferences

In [35]:
tiktoken_len(chunks[0]) #, tiktoken_len(chunks[1]), tiktoken_len(chunks[2])

277
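
The splitter above tries the coarsest separator first and falls back to finer ones only for pieces that are still too long, which is why the short article stays in a single 277-token chunk. The following is a simplified sketch of that recursive idea, not LangChain's implementation: it ignores `chunk_overlap`, omits the empty-string separator, and uses a hypothetical `word_len` (word count) in place of `tiktoken_len` so it runs without extra packages.

```python
def word_len(text):
    # Stand-in for tiktoken_len: counts whitespace-separated words, not tokens.
    return len(text.split())

def recursive_split(text, chunk_size, separators, length_function=word_len):
    """Split text into chunks measuring at most chunk_size.

    Simplified: no overlap, separators are dropped at chunk boundaries, and a
    piece that no remaining separator can break is returned as-is.
    """
    if length_function(text) <= chunk_size or not separators:
        return [text] if text.strip() else []
    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = current + sep + piece if current else piece
        if length_function(candidate) <= chunk_size:
            current = candidate        # piece still fits in the open chunk
            continue
        if current.strip():
            chunks.append(current)     # close the open chunk
        if length_function(piece) <= chunk_size:
            current = piece            # piece starts a new chunk
        else:
            # piece alone is too long: retry it with the finer separators
            chunks.extend(recursive_split(piece, chunk_size, finer, length_function))
            current = ""
    if current.strip():
        chunks.append(current)
    return chunks

sample = "one two three\n\nfour five six seven\n\neight"
chunks_demo = recursive_split(sample, chunk_size=4, separators=["\n\n", "\n", " "])
print(chunks_demo)
```

With `chunk_size=4` each paragraph fits on its own, so the split happens only at the blank lines. A smaller limit forces the fallback to the space separator.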

In [36]:
from langchain_openai import OpenAIEmbeddings

In [37]:
model_name = 'text-embedding-ada-002'

embed = OpenAIEmbeddings(
    model = model_name,
    #openai_api_key=api_key
)

In [38]:
texts = [
    'this is the first chunk of text',
    'then another second chunk of text is here'
]

In [39]:
res = embed.embed_documents(texts)
len(res), len(res[0])

(2, 1536)

#### Vector database

In [42]:
import os
from pinecone import Pinecone

In [43]:
pc_api_key = os.getenv('PINECONE_API_KEY')

In [44]:
# configure client
pc = Pinecone(api_key=pc_api_key)

In [45]:
from pinecone import ServerlessSpec

spec = ServerlessSpec(
    cloud = "aws", 
    region= "us-east-1"
)

In [46]:
index_name = 'langchain-retrieval-augmentation'

existing_indexes = [
    index_info["name"] for index_info in pc.list_indexes()
]

In [47]:
existing_indexes

[]

In [48]:
import time

# check if index already exists (it shouldn't if this is the first time)
if index_name not in existing_indexes:
    # if it does not exist, create the index
    pc.create_index(
        index_name,
        dimension = 1536,  # dimensionality of ada 002
        metric    = 'dotproduct',
        spec      = spec
    )
    
    # wait for index to be initialized
    while not pc.describe_index(index_name).status['ready']:
        time.sleep(2)
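
The index is created with the `dotproduct` metric. For unit-length vectors the dot product equals cosine similarity, so both metrics rank results identically. If the embeddings are normalized to length 1, as OpenAI's documentation states, this applies here. The sketch below checks that property on toy 3-dimensional vectors. The vectors and the `doc-a`, `doc-b`, `doc-c` names are made up for illustration.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

# Toy unit-length vectors standing in for 1536-dimensional embeddings.
query_vec = normalize([1.0, 2.0, 0.5])
docs = {
    "doc-a": normalize([0.9, 2.1, 0.4]),
    "doc-b": normalize([-1.0, 0.2, 3.0]),
    "doc-c": normalize([2.0, 0.1, 0.1]),
}

by_dot = sorted(docs, key=lambda name: dot(query_vec, docs[name]), reverse=True)
by_cos = sorted(docs, key=lambda name: cosine(query_vec, docs[name]), reverse=True)
print(by_dot, by_cos)
```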

In [49]:
import time

In [50]:
# connect to index
index = pc.Index(index_name)

time.sleep(1)

In [51]:
# view index stats
index.describe_index_stats()

{'dimension': 1536,
 'index_fullness': 0.0,
 'namespaces': {},
 'total_vector_count': 0}

In [52]:
len(data)

205328

In [61]:
df.columns

Index(['id', 'url', 'title', 'text'], dtype='object')

In [63]:
from tqdm.auto import tqdm

# Loop through a sample of 10 rows with tqdm for progress visualization
for i, record in tqdm(df.sample(10).iterrows(), total=10):
    print(i)

    # Retrieve metadata fields for the current record
    metadata = {
        'wiki-id': str(record['id']),
        'source':  record['url'],
        'title':   record['title']
    }
    print(metadata)

  0%|          | 0/10 [00:00<?, ?it/s]

97844
{'wiki-id': '426904', 'source': 'https://simple.wikipedia.org/wiki/Chico%20Hamilton', 'title': 'Chico Hamilton'}
190023
{'wiki-id': '850494', 'source': 'https://simple.wikipedia.org/wiki/The%20Conspiracy%20Files', 'title': 'The Conspiracy Files'}
63422
{'wiki-id': '263072', 'source': 'https://simple.wikipedia.org/wiki/Nitrogen%20oxide', 'title': 'Nitrogen oxide'}
35468
{'wiki-id': '140304', 'source': 'https://simple.wikipedia.org/wiki/Dominion%20of%20Pakistan', 'title': 'Dominion of Pakistan'}
45517
{'wiki-id': '159626', 'source': 'https://simple.wikipedia.org/wiki/Auquainville', 'title': 'Auquainville'}
112212
{'wiki-id': '501298', 'source': 'https://simple.wikipedia.org/wiki/Wang%20Guozhen', 'title': 'Wang Guozhen'}
169867
{'wiki-id': '780208', 'source': 'https://simple.wikipedia.org/wiki/Oro%20y%20plata', 'title': 'Oro y plata'}
38149
{'wiki-id': '149503', 'source': 'https://simple.wikipedia.org/wiki/M%C4%83lini', 'title': 'Mălini'}
88696
{'wiki-id': '384652', 'source': 'https

In [66]:
from tqdm.auto import tqdm
from uuid import uuid4

batch_limit = 100

texts     = []
metadatas = []

# Loop through a sample of 10 rows with tqdm for progress visualization
for i, record in tqdm(df.sample(10).iterrows(), total=10):
    
    # first get metadata fields for this record
    metadata = {
        'wiki-id': str(record['id']),
        'source':  record['url'],
        'title':   record['title']
    }
    
    # now we create chunks from the record text
    record_texts = text_splitter.split_text(record['text'])
    
    # create individual metadata dicts for each chunk
    record_metadatas = [{
        "chunk": j, "text": text, **metadata
    } for j, text in enumerate(record_texts)]
    
    # append these to current batches
    texts.extend(record_texts)
    metadatas.extend(record_metadatas)
    
    # if we have reached the batch_limit we can add texts
    if len(texts) >= batch_limit:
        ids    = [str(uuid4()) for _ in range(len(texts))]
        embeds = embed.embed_documents(texts)
        
        index.upsert(vectors=zip(ids, embeds, metadatas))
        
        texts     = []
        metadatas = []

if len(texts) > 0:
    ids    = [str(uuid4()) for _ in range(len(texts))]
    embeds = embed.embed_documents(texts)
    
    index.upsert(vectors=zip(ids, embeds, metadatas))

  0%|          | 0/10 [00:00<?, ?it/s]

In [67]:
index.describe_index_stats()

{'dimension': 1536,
 'index_fullness': 0.0,
 'namespaces': {'': {'vector_count': 21}},
 'total_vector_count': 21}
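
The upsert loop above accumulates chunks and flushes them in batches, with one final flush for the remainder. Because the notebook checks the limit once per article, a batch can grow past `batch_limit`. A minimal sketch of the same accumulate-and-flush pattern that caps every batch is shown below. The `batched` helper and the stand-in chunk ids are hypothetical, and no embedding or Pinecone call is made.

```python
def batched(items, batch_limit):
    """Yield lists of at most batch_limit items, including the final partial batch."""
    batch = []
    for item in items:
        batch.append(item)
        if len(batch) >= batch_limit:
            yield batch
            batch = []
    if batch:  # flush whatever is left over
        yield batch

# 21 stand-in chunk ids (the vector count reported above), flushed with a small limit.
batches = list(batched([f"chunk-{i}" for i in range(21)], batch_limit=8))
print([len(b) for b in batches])  # prints [8, 8, 5]
```

In the real loop, each yielded batch would be embedded and upserted in one call.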

#### Create the vector store and run some queries
- A vector store is not the same thing as an index.
- The vector store is the LLM-framework abstraction (here, LangChain) and it wraps the index.
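
To make the distinction concrete, here is a toy sketch in which the index only stores vectors with metadata, while the vector store adds the embedding function and the name of the metadata field that holds the text. `ToyIndex`, `ToyVectorStore`, and the letter-frequency `toy_embed` are hypothetical stand-ins, not the Pinecone or LangChain classes.

```python
import math
from collections import Counter

def toy_embed(text):
    """Hypothetical embedding: unit-length letter-frequency vector over a-z."""
    counts = Counter(ch for ch in text.lower() if "a" <= ch <= "z")
    vec = [counts.get(chr(ord("a") + i), 0) for i in range(26)]
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

class ToyIndex:
    """Stores (id, vector, metadata) tuples and ranks by dot product; knows nothing about text."""
    def __init__(self):
        self.records = []

    def upsert(self, vectors):
        self.records.extend(vectors)

    def query(self, vector, top_k):
        def score(record):
            return sum(a * b for a, b in zip(vector, record[1]))
        return sorted(self.records, key=score, reverse=True)[:top_k]

class ToyVectorStore:
    """Wraps an index, an embedding function, and the metadata key holding the text."""
    def __init__(self, index, embed_query, text_field):
        self.index = index
        self.embed_query = embed_query
        self.text_field = text_field

    def similarity_search(self, query, k=3):
        hits = self.index.query(self.embed_query(query), top_k=k)
        return [meta[self.text_field] for _id, _vec, meta in hits]

toy_index = ToyIndex()
toy_texts = ["zebra zoo", "apple banana", "quiz buzz"]
toy_index.upsert([(str(i), toy_embed(t), {"text": t}) for i, t in enumerate(toy_texts)])

toy_store = ToyVectorStore(toy_index, toy_embed, "text")
print(toy_store.similarity_search("zzz", k=2))
```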

In [68]:
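# alias the LangChain class so it does not shadow pinecone.Pinecone imported earlier
from langchain.vectorstores import Pinecone as PineconeVectorStore

In [69]:
text_field = 'text'

In [70]:
vectorstore = PineconeVectorStore(
    index, embed.embed_query, text_field
)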


In [71]:
query = "who was Benito Mussolini?"

In [72]:
vectorstore.similarity_search(
    query,  # our search query
    k =3    # return 3 most relevant docs
)

[Document(metadata={'chunk': 0.0, 'source': 'https://simple.wikipedia.org/wiki/1884', 'title': '1884', 'wiki-id': '10526'}, page_content='Births \n May 8 – Harry S. Truman\n October 11 – Eleanor Roosevelt, First Lady of the United States, wife of Franklin D. Roosevelt (d. 1962)\n December 30 – Hideki Tōjō, 40th Prime Minister of Japan, Led the Attack on Pearl Harbour (d. 1948)\n\nDeaths \n January 25 – Johann Gottfried Piefke, German conductor and composer (b. 1815)\n March 21 – Ezra Abbot, American Bible scholar (b. 1819)\n April 4 – Marie Bashkirtseff, Russian artist (b. 1858)\n May 12 – Bedrich Smetana, Czech composer (b. 1824)\n May 13 –  Cyrus McCormick, American inventor (b. 1809)\n June 25 – Hans Rott, Austrian composer (b. 1858)\n July 1 – Allan Pinkerton, American detective (b. 1819)\n July 10 – Paul Morphy, American chess player (b. 1837)\n November 25 – Adolph Wilhelm Hermann Kolbe, German chemist (b. 1818)'),
 Document(metadata={'chunk': 0.0, 'source': 'https://simple.wikip

#### Generative QnA
- Documents are retrieved from the vector store or database.
- `RetrievalQA` sends these documents, plus the question, to the LLM.
- The LLM then generates a cohesive answer.
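
The "stuff" strategy places every retrieved document into a single prompt alongside the question. The sketch below illustrates that assembly step only. The `build_stuff_prompt` helper, the template wording, and the placeholder passages are assumptions for illustration and are not the prompt LangChain actually uses.

```python
def build_stuff_prompt(question, documents):
    """Concatenate all retrieved documents into one prompt (the "stuff" strategy)."""
    context = "\n\n".join(doc.strip() for doc in documents)
    return (
        "Use the following context to answer the question. "
        "If the context does not contain the answer, say you don't know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )

# Placeholder passages standing in for the text of retrieved chunks.
retrieved = ["First placeholder passage.", "  Second placeholder passage.  "]
prompt = build_stuff_prompt("who was Benito Mussolini?", retrieved)
print(prompt)
```

Since everything goes into one prompt, this approach only works while the combined chunks fit in the model's context window, which is one reason to keep chunks small.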



In [84]:
from langchain_openai import ChatOpenAI
from langchain.chains import RetrievalQA

In [85]:
# chat model used to generate the final answer
llm = ChatOpenAI(
    #openai_api_key=api_key,
    model_name = 'gpt-3.5-turbo',
    temperature= 0.0
)


In [86]:
qa = RetrievalQA.from_chain_type(
    llm       = llm,
    chain_type= "stuff",
    retriever = vectorstore.as_retriever()
)

In [87]:
qa.run(query)

'Benito Mussolini was an Italian politician and leader who founded the Fascist Party in Italy. He ruled as Prime Minister from 1922 to 1943 and then as dictator from 1925 to 1945. Mussolini was a key figure in the creation of Fascism, an authoritarian and nationalistic political ideology.'