## Gathering the data
![image.png](attachment:d73cc778-5a4f-44c5-8d87-af6f6bb7d069.png) ![image.png](attachment:55cb9daa-a478-4873-b885-ca36ae10297a.png) ![image.png](attachment:1de2cf40-28ac-49bf-b33e-014a1a72faf4.png)

### Preparing a Dataset

In [1]:
import requests 

In [2]:
response = requests.get("https://en.wikipedia.org/w/api.php?action=query&prop=extracts&exlimit=1&titles=2022&explaintext=1&formatversion=2&format=json")
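
The query string in that URL packs several API parameters together. As a minimal sketch, the same URL can be assembled from a dictionary with the standard library's `urlencode`. The names `BASE_URL` and `params` are only illustrative, and the comments give my reading of each parameter rather than anything stated in this notebook.

```python
from urllib.parse import urlencode

# Illustrative names; the values are copied from the URL used above
BASE_URL = "https://en.wikipedia.org/w/api.php"
params = {
    "action": "query",     # run a query against the API
    "prop": "extracts",    # ask for the page extract
    "exlimit": 1,          # return a single extract
    "titles": "2022",      # title of the page to fetch
    "explaintext": 1,      # plain text instead of HTML
    "formatversion": 2,    # newer JSON layout, where "pages" is a list
    "format": "json",
}

# urlencode keeps the dictionary's insertion order
url = BASE_URL + "?" + urlencode(params)
print(url)
```

Passing the same dictionary as `requests.get(BASE_URL, params=params)` should produce an equivalent request without building the string by hand.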

In [3]:
print(response) 

<Response [200]>


In [4]:
response.json() # this is the entire response from the API

{'batchcomplete': True,
 'query': {'pages': [{'pageid': 52412,
    'ns': 0,
    'title': '2022',

In [5]:
response.json()['query']['pages'][0]['extract'].split('\n')

['2022 (MMXXII) was a common year starting on Saturday of the Gregorian calendar, the 2022nd year of the Common Era (CE) and Anno Domini (AD) designations, the 22nd  year of the 3rd millennium and the 21st century, and the  3rd   year of the 2020s decade.  ',
 'The year saw the removal of nearly all COVID-19 restrictions and the reopening of international borders in most countries, while the global rollout of COVID-19 vaccines continued. The global economic recovery from the pandemic continued, though many countries experienced an ongoing inflation surge; in response, many central banks raised their interest rates to landmark levels. The world population reached eight billion people in 2022. The year also witnessed numerous natural disasters, including two devastating Atlantic hurricanes (Fiona and Ian), and the most powerful volcano eruption of the century so far. The later part of the year also saw the first public release of ChatGPT by OpenAI starting an arms race in artificial inte
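
The indexing above assumes the response always contains `query`, `pages`, and an `extract`. A slightly more defensive sketch is shown here. `extract_lines` and `sample_response` are hypothetical names, and the sample dictionary is a hand-made stand-in that mimics only the keys visible in the output above.

```python
def extract_lines(response_dict):
    """Return the lines of the first page's extract, or an empty list."""
    pages = response_dict.get("query", {}).get("pages", [])
    if not pages:
        return []
    # If the page has no "extract" key, fall back to an empty string
    extract = pages[0].get("extract", "")
    return extract.split("\n") if extract else []


# Hand-made stand-in for response.json(); not real API output
sample_response = {
    "batchcomplete": True,
    "query": {
        "pages": [
            {
                "pageid": 52412,
                "ns": 0,
                "title": "2022",
                "extract": "First line\n\n== Events ==",
            }
        ]
    },
}

lines = extract_lines(sample_response)
print(lines)
```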

In [6]:
import pandas as pd

df = pd.DataFrame()

df # we have an empty dataframe

In [7]:
# Store the page text in the dataframe, one row per line of the extract

df['text'] = response.json()['query']['pages'][0]['extract'].split('\n')
df

Unnamed: 0,text
0,2022 (MMXXII) was a common year starting on Sa...
1,The year saw the removal of nearly all COVID-1...
2,2022 was also dominated by wars and armed conf...
3,
4,
...,...
254,
255,== Nobel Prizes ==
256,
257,


In [8]:
# Drop the empty strings
df = df[df['text'].str.len() > 0]
df # Now we do not have any blank lines here

Unnamed: 0,text
0,2022 (MMXXII) was a common year starting on Sa...
1,The year saw the removal of nearly all COVID-1...
2,2022 was also dominated by wars and armed conf...
5,== Events ==
8,=== January ===
...,...
248,== Demographics ==
249,The world population was estimated to have rea...
252,== Deaths ==
255,== Nobel Prizes ==


In [9]:
# Now let's deal with the headings as well

df = df[~df['text'].str.startswith("==")] # Drop the rows that start with "==", which leaves us with almost clean data
df

Unnamed: 0,text
0,2022 (MMXXII) was a common year starting on Sa...
1,The year saw the removal of nearly all COVID-1...
2,2022 was also dominated by wars and armed conf...
9,January 1 – The Regional Comprehensive Econom...
10,January 2 – Abdalla Hamdok resigns as Prime Mi...
...,...
242,December 21–December 26 – A major winter storm...
243,December 24 – 2022 Fijian general election: Th...
244,December 29 – Brazilian football legend Pelé d...
245,December 31 – Former Pope Benedict XVI dies at...


In [10]:
df.tail(20)

Unnamed: 0,text
224,"November 16 – NASA launches Artemis 1, the fir..."
225,November 19 – The 2022 Malaysian general elect...
226,November 19–November 26 – The 2022 Central Ame...
227,November 20–December 18 – The 2022 FIFA World ...
228,November 20 – 2022 Nepalese general election: ...
229,November 21 – A 5.6 earthquake strikes near Ci...
230,"November 30 – OpenAI releases ChatGPT, an arti..."
234,December 2 – The G7 and Australia join the EU ...
235,December 5 – The National Ignition Facility ac...
236,December 7


In [11]:
from dateutil.parser import parse

df = df.copy()  # df is a filtered slice; copy it before writing to it
prefix = ""
for i, text in df["text"].items():
    # If the row already has " – " (en dash), it already has the needed date prefix
    if " – " not in text:
        try:
            # If the row's text is a date, set it as the new prefix
            parse(text)
            prefix = text
        except (ValueError, OverflowError):
            # If the row's text isn't a date, add the prefix.
            # Write through df.at: changes to rows yielded by iterrows()
            # are not guaranteed to reach the dataframe.
            df.at[i, "text"] = prefix + " – " + text
df = df[df["text"].str.contains(" – ")]

df

Unnamed: 0,text
0,– 2022 (MMXXII) was a common year starting on...
1,– The year saw the removal of nearly all COVI...
2,– 2022 was also dominated by wars and armed c...
9,January 1 – The Regional Comprehensive Econom...
10,January 2 – Abdalla Hamdok resigns as Prime Mi...
...,...
242,December 21–December 26 – A major winter storm...
243,December 24 – 2022 Fijian general election: Th...
244,December 29 – Brazilian football legend Pelé d...
245,December 31 – Former Pope Benedict XVI dies at...


In [12]:
df.reset_index(inplace=True, drop=True)
df

Unnamed: 0,text
0,– 2022 (MMXXII) was a common year starting on...
1,– The year saw the removal of nearly all COVI...
2,– 2022 was also dominated by wars and armed c...
3,January 1 – The Regional Comprehensive Econom...
4,January 2 – Abdalla Hamdok resigns as Prime Mi...
...,...
180,December 21–December 26 – A major winter storm...
181,December 24 – 2022 Fijian general election: Th...
182,December 29 – Brazilian football legend Pelé d...
183,December 31 – Former Pope Benedict XVI dies at...


In [13]:
df.to_csv('files/text.csv')
df

Unnamed: 0,text
0,– 2022 (MMXXII) was a common year starting on...
1,– The year saw the removal of nearly all COVI...
2,– 2022 was also dominated by wars and armed c...
3,January 1 – The Regional Comprehensive Econom...
4,January 2 – Abdalla Hamdok resigns as Prime Mi...
...,...
180,December 21–December 26 – A major winter storm...
181,December 24 – 2022 Fijian general election: Th...
182,December 29 – Brazilian football legend Pelé d...
183,December 31 – Former Pope Benedict XVI dies at...


![image.png](attachment:adf8f8ac-77c1-4d1c-99f6-71f5bc93d299.png)

```
import pandas as pd

# Load page text into a dataframe
# (response_dict is the parsed JSON, i.e. response.json() in the cells above)
df = pd.DataFrame()
df["text"] = response_dict["query"]["pages"][0]["extract"].split("\n")
```

__Wrangling the Data for Ingestion into the Model__

Data from the API is much cleaner than raw website source code, but it still needs some work to be ideally configured for our purposes.

In this demo, we walked through how to wrangle and clean the data in df:

1. Addressing the problem of empty rows by subsetting to include only rows where the length is > 0
2. Addressing the problem of headings by subsetting to exclude rows where the text starts with ==
3. Addressing the problem of rows without dates using a date parser and somewhat more complex logic
   
Don't worry too much about the details here; data wrangling is different for every dataset!

Below is this cleanup as a single block, adapted from the case study notebook:
```
from dateutil.parser import parse

# Clean up text to remove empty lines and headings
# (copy so the assignments below write to df itself, not to a slice)
df = df[(df["text"].str.len() > 0) & (~df["text"].str.startswith("=="))].copy()

# In some cases dates are used as headings instead of being part of the
# text sample; adjust so dated text samples start with dates
prefix = ""
for i, text in df["text"].items():
    # If the row already has " – " (en dash), it already has the needed date prefix
    if " – " not in text:
        try:
            # If the row's text is a date, set it as the new prefix
            parse(text)
            prefix = text
        except (ValueError, OverflowError):
            # If the row's text isn't a date, add the prefix
            df.at[i, "text"] = prefix + " – " + text
df = df[df["text"].str.contains(" – ")].reset_index(drop=True)
```
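
To make the prefix logic easier to follow, here is a toy version that runs on a plain list. It is a sketch under assumptions: `is_date` and `add_date_prefixes` are hypothetical helpers, `datetime.strptime` with a fixed "Month day" format stands in for `dateutil`'s more flexible `parse`, and every sample line except the first (shortened from the output above) is a made-up placeholder.

```python
from datetime import datetime


def is_date(text):
    """Return True if text looks like 'December 7' (simplified stand-in for parse)."""
    try:
        # A year is appended only so strptime has a complete date to parse
        datetime.strptime(text + " 2022", "%B %d %Y")
        return True
    except ValueError:
        return False


def add_date_prefixes(lines):
    """Prefix undated lines with the most recent date heading and drop the headings."""
    prefix = ""
    result = []
    for text in lines:
        if " – " in text:
            # Already starts with a date, keep as-is
            result.append(text)
        elif is_date(text):
            # A date used as a heading: remember it, but do not keep the row
            prefix = text
        else:
            # Undated text: attach the last date heading seen
            result.append(prefix + " – " + text)
    return result


sample = [
    "November 30 – OpenAI releases ChatGPT",
    "December 7",
    "First placeholder event under the date heading.",
    "Second placeholder event under the date heading.",
]
cleaned = add_date_prefixes(sample)
for line in cleaned:
    print(line)
```

As in the notebook code, text that appears before any date heading would get an empty prefix and start with " – ", which is what happens to the three introductory rows in the output above.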

## Review: Numeric Representations for Text Data

![image.png](attachment:ee85164f-48b7-411f-9f1c-5418a67561a5.png) ![image.png](attachment:6ad411f1-27a4-458f-859c-455c9c2befba.png) ![image.png](attachment:23db7287-3e6e-4543-baef-d9032546554a.png)

__Text Embeddings__

Embeddings are a sophisticated technique for converting text into vectors of numbers using a pre-trained machine learning model. Instead of each number in the vector indicating the presence or absence of a word, each number represents a dimension learned by the model. Every text sample maps to a vector of the same length (the same number of columns), so we can compare any two text samples, whatever their length, based on their embeddings!
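
One common way to compare two such vectors is cosine similarity, which measures how closely they point in the same direction. The sketch below uses made-up 3-dimensional vectors in place of real embeddings, and `cosine_similarity` is an illustrative helper rather than part of any library used in this notebook.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up vectors standing in for real embeddings
vec_a = [1.0, 0.0, 0.0]
vec_b = [2.0, 0.0, 0.0]  # same direction as vec_a, different length
vec_c = [0.0, 1.0, 0.0]  # perpendicular to vec_a

sim_ab = cosine_similarity(vec_a, vec_b)
sim_ac = cosine_similarity(vec_a, vec_c)
print(sim_ab, sim_ac)
```

Vectors pointing the same way score close to 1 and perpendicular vectors score close to 0, regardless of vector length.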

![image.png](attachment:95621470-90f5-4fdd-a3f2-24a29c82359a.png)



## Creating an Embedding Index

In [14]:
# Let's load the CSV file we saved earlier into our dataframe

import pandas as pd

df = pd.read_csv('files/text.csv', index_col=0)
df

Unnamed: 0,text
0,– 2022 (MMXXII) was a common year starting on...
1,– The year saw the removal of nearly all COVI...
2,– 2022 was also dominated by wars and armed c...
3,January 1 – The Regional Comprehensive Econom...
4,January 2 – Abdalla Hamdok resigns as Prime Mi...
...,...
180,December 21–December 26 – A major winter storm...
181,December 24 – 2022 Fijian general election: Th...
182,December 29 – Brazilian football legend Pelé d...
183,December 31 – Former Pope Benedict XVI dies at...


In [18]:
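import os
import openai

# Read the API key from the environment instead of hard-coding it in the notebook
openai.api_key = os.environ["OPENAI_API_KEY"]

EMBEDDING_MODEL_NAME = "text-embedding-ada-002"
# Note: openai.Embedding.create is the pre-1.0 interface of the openai package
response = openai.Embedding.create(
    input=df["text"].tolist(),
    model=EMBEDDING_MODEL_NAME
)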

RateLimitError: You exceeded your current quota, please check your plan and billing details. For more information on this error, read the docs: https://platform.openai.com/docs/guides/error-codes/api-errors.
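
Once the quota issue is resolved, one common pattern is to send the texts in batches and collect one vector per row. The following is only a sketch: `embed_in_batches`, `fake_embed`, and the batch size of 100 are assumptions for illustration, and `fake_embed` stands in for the real API call so the example runs without a key.

```python
def embed_in_batches(texts, embed_batch, batch_size=100):
    """Call embed_batch on consecutive slices of texts and collect the vectors."""
    embeddings = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        embeddings.extend(embed_batch(batch))
    return embeddings


def fake_embed(batch):
    # Hypothetical stand-in for the API call: one 2-number vector per text
    return [[float(len(text)), 0.0] for text in batch]


texts = [f"sample text {i}" for i in range(250)]
vectors = embed_in_batches(texts, fake_embed)
print(len(vectors))
```

With the legacy client used above, `embed_batch` could wrap `openai.Embedding.create` and return `[item["embedding"] for item in response["data"]]`; the collected list could then be stored with `df["embeddings"] = vectors`.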