## **Nugen Intelligence**
<img src="https://nugen.in/logo.png" alt="Nugen Logo" width="200"/>

Domain-aligned foundational models at industry-leading speeds with zero data retention! To learn more, visit [Nugen](https://docs.nugen.in/introduction).

### **Generating Embeddings with Nugen API**

This lesson demonstrates how to generate embeddings for text using the Nugen embeddings API. We will follow these steps:
1. Extract information from Wikipedia
2. Break it into smaller sections
3. Generate high-performance embeddings using the [Nugen API](https://docs.nugen.in/introduction)


With Nugen’s cutting-edge API, you can easily generate embeddings that are optimized for speed and accuracy, enabling faster and more relevant results in your applications.

### **Setup**
**Install Required Libraries**
We'll install the required Python libraries to interact with Wikipedia, split sections, and count tokens.

In [None]:
!pip install --quiet mwclient==0.11.0 mwparserfromhell==0.6.6 pandas==1.5.3 tiktoken==0.7.0 openai==1.34.0 requests

**Import Necessary Libraries**

These libraries help us work with Wikipedia articles, clean and process them, and prepare them for embedding using the Nugen API.

In [2]:
import mwclient
import mwparserfromhell
import pandas as pd
import re
import random
import requests
import tiktoken

### **Access the Nugen API**

**API Key Setup**

First, we need to set up access to the Nugen API so we can generate embeddings. To do this, you'll need an API key from Nugen. You can get a free API key from the [Nugen Dashboard](https://nugen-platform-frontend.azurewebsites.net/dashboard). Once you have your key, replace the placeholder value of api_key in the code below with the actual key.

In [None]:
url_api_server = "https://api.nugen.in/inference/embeddings"
api_key = "<your_api_key>"  # replace with the key from your Nugen dashboard
headers = {
    "Authorization": f"Bearer {api_key}",
    "Content-Type": "application/json"
}
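To avoid pasting the key into the notebook itself, one option is to read it from an environment variable. The sketch below is only an illustration: the variable name `NUGEN_API_KEY` and the helper `build_headers` are assumptions made for this example, not names defined by Nugen.

```python
import os


def build_headers(env_var: str = "NUGEN_API_KEY") -> dict[str, str]:
    """Build the request headers from an API key stored in an environment variable.

    The variable name is a hypothetical choice for this example.
    """
    api_key = os.environ.get(env_var)
    if not api_key:
        raise RuntimeError(f"Set the {env_var} environment variable before running this cell.")
    return {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
    }
```

After exporting the variable in your shell, `headers = build_headers()` yields the same dictionary shape as the cell above, and the key never appears in the saved notebook.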

### **Get Wikipedia Articles**
**Choosing Wikipedia Articles**

We are going to retrieve articles related to the 2022 Winter Olympics using a Wikipedia category. This section searches for all pages within that category.

In [4]:
CATEGORY_TITLE = "Category:2022 Winter Olympics"
WIKI_SITE = "en.wikipedia.org"

**Extract Article Titles**

We now gather all the article titles under this category.


In [None]:
def titles_from_category(category, max_depth):
    """Return a set of page titles in a given Wiki category and its subcategories."""
    titles = set()
    for cm in category.members():
        if type(cm) == mwclient.page.Page:
            titles.add(cm.name)
        elif isinstance(cm, mwclient.listing.Category) and max_depth > 0:
            deeper_titles = titles_from_category(cm, max_depth=max_depth - 1)
            titles.update(deeper_titles)
    return titles

# Initialize the Wikipedia client
site = mwclient.Site(WIKI_SITE)
category_page = site.pages[CATEGORY_TITLE]
titles = titles_from_category(category_page, max_depth=1)

# Select a random 20% of the articles; adjust the fraction for your use case.
# Note: the later cells iterate over the full `titles` set, not `sampled_titles`.
sample_size = int(0.2 * len(titles))
sampled_titles = random.sample(list(titles), sample_size)
print(f"Selected {len(sampled_titles)} article titles for processing.")

Selected 35 article titles for processing.


**How It Works**

  1. **titles_from_category Function**: This function takes a Wikipedia category and retrieves all article titles within that category and its subcategories up to a specified depth.

  2. **max_depth Parameter:** Controls how deep the function will go into subcategories.
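To see how the depth limit behaves without calling Wikipedia, here is a small sketch that runs the same traversal over an in-memory category tree. `FakePage`, `FakeCategory`, and `titles_from_fake_category` are hypothetical stand-ins written for this example; they are not part of mwclient.

```python
from dataclasses import dataclass, field


@dataclass
class FakePage:
    name: str


@dataclass
class FakeCategory:
    name: str
    children: list = field(default_factory=list)

    def members(self):
        # Mirrors the iteration used in titles_from_category
        return iter(self.children)


def titles_from_fake_category(category: FakeCategory, max_depth: int) -> set[str]:
    """Collect page titles, descending at most max_depth levels of subcategories."""
    titles = set()
    for member in category.members():
        if isinstance(member, FakePage):
            titles.add(member.name)
        elif isinstance(member, FakeCategory) and max_depth > 0:
            titles.update(titles_from_fake_category(member, max_depth - 1))
    return titles


root = FakeCategory("Category:Root", [
    FakePage("A"),
    FakeCategory("Category:Sub", [
        FakePage("B"),
        FakeCategory("Category:SubSub", [FakePage("C")]),
    ]),
])

for depth in range(3):
    print(depth, sorted(titles_from_fake_category(root, max_depth=depth)))
# 0 ['A']
# 1 ['A', 'B']
# 2 ['A', 'B', 'C']
```

With `max_depth=1`, as used above, pages directly in the category and in its immediate subcategories are collected, while anything nested deeper is skipped.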

### **Chunk Documents**
Now that we have our reference documents, we need to prepare them for search.

For this specific example on Wikipedia articles, we'll:

* Discard less relevant-looking sections like External Links and Footnotes
* Clean up the text by removing reference tags (e.g., `<ref>`), surrounding whitespace, and very short sections
* Split each article into sections
* Prepend titles and subtitles to each section's text, to help the embedding model understand the context
* If a section is long (say, > 1,600 tokens), we'll recursively split it into smaller sections, trying to split along semantic boundaries like paragraphs

In [6]:
SECTIONS_TO_IGNORE = [
    "See also",
    "References",
    "External links",
    "Further reading",
    "Footnotes",
    "Bibliography",
    "Sources",
    "Citations",
    "Literature",
    "Notes and references",
    "Photo gallery",
    "Works cited",
    "Photos",
    "Gallery",
    "Notes",
    "References and sources",
    "References and notes",
]

def all_subsections_from_section(section, parent_titles, sections_to_ignore):
    """Extract subsections from a Wikipedia section."""
    headings = [str(h) for h in section.filter_headings()]
    title = headings[0]
    if title.strip("=" + " ") in sections_to_ignore:
        return []
    titles = parent_titles + [title]
    full_text = str(section)
    section_text = full_text.split(title)[1]
    if len(headings) == 1:
        return [(titles, section_text)]
    else:
        first_subtitle = headings[1]
        section_text = section_text.split(first_subtitle)[0]
        results = [(titles, section_text)]
        for subsection in section.get_sections(levels=[len(titles) + 1]):
            results.extend(all_subsections_from_section(subsection, titles, sections_to_ignore))
        return results

The all_subsections_from_section function extracts the subsections of one section of a Wikipedia article. It finds the headings inside the section, separates the text that belongs directly to each heading, and recurses into nested subsections. Sections whose title appears in the ignore list (like references or external links) are skipped together with everything beneath them.



In [7]:
def all_subsections_from_title(
    title: str,
    sections_to_ignore: list[str] = SECTIONS_TO_IGNORE,
    site_name: str = WIKI_SITE,
) -> list[tuple[list[str], str]]:
    """From a Wikipedia page title, return a flattened list of all nested subsections.
    Each subsection is a tuple, where:
        - the first element is a list of parent subtitles, starting with the page title
        - the second element is the text of the subsection (but not any children)
    """
    site = mwclient.Site(site_name)
    page = site.pages[title]
    text = page.text()
    parsed_text = mwparserfromhell.parse(text)
    headings = [str(h) for h in parsed_text.filter_headings()]
    if headings:
        summary_text = str(parsed_text).split(headings[0])[0]
    else:
        summary_text = str(parsed_text)
    results = [([title], summary_text)]
    for subsection in parsed_text.get_sections(levels=[2]):
        results.extend(all_subsections_from_section(subsection, [title], sections_to_ignore))
    return results


The function takes a Wikipedia page title and returns all the subsections of that page, along with their corresponding parent titles. It extracts the page's text, finds all the headings, and organizes the content into a list of tuples. 

Each tuple contains:

1. A list of parent titles (starting with the page title).
2. The text of the subsection (excluding any sub-subsections).

Function Parameters:

1. title (str): The title of the Wikipedia page you want to extract subsections from (e.g., "Python (programming language)").

2. sections_to_ignore (default: SECTIONS_TO_IGNORE): A collection of section titles that should be excluded from the results. For example, you might want to skip sections like "References" or "External links."

3. site_name (str, default: WIKI_SITE): The host name of the wiki (e.g., "en.wikipedia.org"). This is useful if you're accessing a specific language version of Wikipedia or a different wiki site.

The function above splits the articles into smaller sections.
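To make the shape of the returned tuples concrete, here is a simplified, standard-library sketch that produces the same kind of `(parent_titles, text)` pairs from a small wikitext string. It is an illustration only: it uses a regular expression rather than mwparserfromhell, strips the `=` signs from headings (the real function keeps them), and `sections_with_parents` is a hypothetical name.

```python
import re

# A wikitext heading line such as "==History==" or "===Early years==="
HEADING = re.compile(r"^(=+)[ \t]*(.*?)[ \t]*=+[ \t]*$", re.MULTILINE)


def sections_with_parents(
    page_title: str,
    wikitext: str,
    ignore: frozenset[str] = frozenset({"References"}),
) -> list[tuple[list[str], str]]:
    """Flatten wikitext into (parent_titles, text) tuples, skipping ignored sections."""
    results = []
    stack = [(1, page_title)]  # (heading level, title); the page title acts as level 1
    pos = 0
    for match in list(HEADING.finditer(wikitext)) + [None]:
        end = match.start() if match else len(wikitext)
        titles = [name for _, name in stack]
        if not any(t in ignore for t in titles):
            results.append((titles, wikitext[pos:end].strip()))
        if match is None:
            break
        level, name = len(match.group(1)), match.group(2)
        # Close every open section at the same or a deeper level
        while len(stack) > 1 and stack[-1][0] >= level:
            stack.pop()
        stack.append((level, name))
        pos = match.end()
    return results


sample = (
    "Intro text.\n"
    "==History==\nOld times.\n"
    "===Early years===\nVery old.\n"
    "==References==\n<ref>x</ref>\n"
)
for titles, text in sections_with_parents("Page", sample):
    print(titles, "->", text)
# ['Page'] -> Intro text.
# ['Page', 'History'] -> Old times.
# ['Page', 'History', 'Early years'] -> Very old.
```

Each tuple carries the full chain of parent titles, which is what later lets every chunk be prefixed with its page and section names.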

**Clean Up Sections**

We first gather the sections from every article, then clean them by removing unnecessary markup such as reference tags (`<ref>...</ref>`) and dropping very short sections.

In [8]:
wikipedia_sections = []
for title in titles:
    wikipedia_sections.extend(all_subsections_from_title(title))
print(f"Found {len(wikipedia_sections)} sections in {len(titles)} pages.")

Found 1838 sections in 179 pages.


In [9]:
# clean text
def clean_section(section: tuple[list[str], str]) -> tuple[list[str], str]:
    """
    Return a cleaned up section with:
        - <ref>xyz</ref> patterns removed
        - leading/trailing whitespace removed
    """
    titles, text = section
    text = re.sub(r"<ref.*?</ref>", "", text)
    text = text.strip()
    return (titles, text)


wikipedia_sections = [clean_section(ws) for ws in wikipedia_sections]

# Filter out short/blank sections
def keep_section(section: tuple[list[str], str]) -> bool:
    _, text = section
    return len(text) >= 16


original_num_sections = len(wikipedia_sections)
wikipedia_sections = [ws for ws in wikipedia_sections if keep_section(ws)]
print(f"Filtered out {original_num_sections-len(wikipedia_sections)} sections, leaving {len(wikipedia_sections)} sections.")

# Display example data
for ws in wikipedia_sections[:5]:
    print(ws[0])
    print(ws[1][:77] + "...")
    print()

Filtered out 89 sections, leaving 1749 sections.
['Speed skating at the 2022 Winter Olympics']
{{Short description|none}}
{{Use dmy dates|date=February 2018}}
{{Infobox Oly...

['Speed skating at the 2022 Winter Olympics', '==Qualification==']
{{Main|Speed skating at the 2022 Winter Olympics – Qualification}}
A total qu...

['Speed skating at the 2022 Winter Olympics', '==Qualification==', '===Qualification times===']
The following qualification times were released on July 1, 2021, and were unc...

['Speed skating at the 2022 Winter Olympics', '==Competition schedule==']
The following was the competition schedule for all speed skating events. With...

['Speed skating at the 2022 Winter Olympics', '==Medal summary==', '===Medal table===']
{{Medals table
 | caption  = 
 | host  = CHN
 | flag_template = flagIOC
 | ev...



**Handle Text Length (Tokens)**

Embeddings work best when the text is not too long. We count tokens (the sub-word units a tokenizer splits text into) to ensure that each section is short enough.

In [10]:
GPT_MODEL = "gpt-4o-mini"  # Just to select tokenizer, does not use OpenAI models

def num_tokens(text: str, model: str = GPT_MODEL) -> int:
    encoding = tiktoken.encoding_for_model(model)
    return len(encoding.encode(text))


This function counts the number of tokens (units of text) in a given string based on a specific tokenizer model (in this case, gpt-4o-mini). 

Here's what each part does:

1. Input: A text string.
2. Output: The number of tokens in the string.

How: It looks up the tokenizer associated with the given model, encodes the text into tokens, and returns how many tokens were produced.

In [11]:
def halved_by_delimiter(string: str, delimiter: str = "\n") -> list[str]:
    chunks = string.split(delimiter)
    if len(chunks) == 1:
        return [string, ""]
    total_tokens = num_tokens(string)
    halfway = total_tokens // 2
    best_diff = halfway
    for i, _ in enumerate(chunks):
        left = delimiter.join(chunks[: i + 1])
        left_tokens = num_tokens(left)
        diff = abs(halfway - left_tokens)
        if diff >= best_diff:
            break
        best_diff = diff
    left = delimiter.join(chunks[:i])
    right = delimiter.join(chunks[i:])
    return [left, right]

This function splits a large string into two parts at a logical breakpoint (such as a paragraph or line break).

Input:
1. string: A large text string.
2. delimiter: The separator at which the string may be split (default is a new line \n).

Output: Two parts of the string, left and right.

How: It tries to find the point where the string should be split into two halves based on token count. It looks for the closest match to half the total number of tokens, and then splits the string into two logical parts.
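As a worked illustration of that search, the sketch below repeats the same loop but counts whitespace-separated words instead of tiktoken tokens, so it runs without downloading a tokenizer. The `halve` function and its `count` parameter are assumptions introduced for this example.

```python
from typing import Callable


def halve(
    string: str,
    delimiter: str = "\n",
    count: Callable[[str], int] = lambda s: len(s.split()),  # stand-in for num_tokens
) -> list[str]:
    """Split a string in two at the delimiter closest to the halfway point."""
    chunks = string.split(delimiter)
    if len(chunks) == 1:
        return [string, ""]  # nothing to split on
    halfway = count(string) // 2
    best_diff = halfway
    for i, _ in enumerate(chunks):
        left = delimiter.join(chunks[: i + 1])
        diff = abs(halfway - count(left))
        if diff >= best_diff:
            break  # moving further right no longer gets closer to the midpoint
        best_diff = diff
    return [delimiter.join(chunks[:i]), delimiter.join(chunks[i:])]


text = "one two\nthree four\nfive six\nseven eight"
left, right = halve(text)
print(repr(left))   # 'one two\nthree four'
print(repr(right))  # 'five six\nseven eight'
```

The text has 8 words, so the target is 4. The first two lines reach exactly 4 words, adding the third line moves away from the target, and the split is placed after the second line.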

In [12]:
def truncated_string(string: str, model: str, max_tokens: int, print_warning: bool = True) -> str:
    encoding = tiktoken.encoding_for_model(model)
    encoded_string = encoding.encode(string)
    truncated_string = encoding.decode(encoded_string[:max_tokens])
    if print_warning and len(encoded_string) > max_tokens:
        print(f"Warning: Truncated string from {len(encoded_string)} tokens to {max_tokens} tokens.")
    return truncated_string

This function shortens a string to a certain number of tokens, if necessary.

Input:

1. string: The text to be truncated.
2. model: The tokenizer model to use.
3. max_tokens: The maximum number of tokens allowed.
4. print_warning: If the string is shortened, a warning will be printed.

Output: The truncated string (cut down to the allowed number of tokens).

How: It encodes the string into tokens, and if the number of tokens exceeds the limit, it trims the string and prints a warning.

In [13]:
def split_strings_from_subsection(subsection: tuple[list[str], str], max_tokens: int = 1000, model: str = GPT_MODEL, max_recursion: int = 5) -> list[str]:
    titles, text = subsection
    string = "\n\n".join(titles + [text])
    if num_tokens(string) <= max_tokens:
        return [string]
    elif max_recursion == 0:
        return [truncated_string(string, model=model, max_tokens=max_tokens)]
    for delimiter in ["\n\n", "\n", ". "]:
        left, right = halved_by_delimiter(text, delimiter=delimiter)
        if left == "" or right == "":
            continue
        results = []
        for half in [left, right]:
            half_subsection = (titles, half)
            half_strings = split_strings_from_subsection(
                half_subsection,
                max_tokens=max_tokens,
                model=model,
                max_recursion=max_recursion - 1,
            )
            results.extend(half_strings)
        return results
    return [truncated_string(string, model=model, max_tokens=max_tokens)]

# Split sections into chunks
MAX_TOKENS = 1600
wikipedia_strings = []
for section in wikipedia_sections:
    wikipedia_strings.extend(split_strings_from_subsection(section, max_tokens=MAX_TOKENS))

print(f"{len(wikipedia_sections)} Wikipedia sections split into {len(wikipedia_strings)} strings.")

1749 Wikipedia sections split into 2052 strings.


This function breaks down a large section of text into smaller pieces that are below a token limit.

Input:
1. subsection: A tuple containing a list of titles (headers) and the main text.
2. max_tokens: The maximum number of tokens each part can have.
3. model: The tokenizer model.
4. max_recursion: How many times the function can call itself to keep splitting the text if it's too big.

Output: A list of smaller text strings, each below the token limit.

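How: It tries to break the text into smaller parts by splitting it at logical places (paragraph breaks first, then line breaks, then sentence breaks) and keeps doing so recursively until the chunks are small enough. If no split is possible or the recursion limit is reached, it truncates the text to the token limit instead.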

### **Generate Embeddings**
**Prepare Text for Embedding**

After splitting the sections, we convert each chunk into an embedding: a list of numbers that represents the meaning of the text, so that chunks with similar content end up with similar vectors.

In [14]:
# Split sections into chunks
MAX_TOKENS = 1600
wikipedia_strings = []
for section in wikipedia_sections:
    wikipedia_strings.extend(split_strings_from_subsection(section, max_tokens=MAX_TOKENS))

print(f"{len(wikipedia_sections)} Wikipedia sections split into {len(wikipedia_strings)} strings.")

# Fetch embeddings from Nugen API
BATCH_SIZE = 100
EMBEDDING_MODEL = "nugen-flash-embed"
embeddings = []

for batch_start in range(0, len(wikipedia_strings), BATCH_SIZE):
    batch_end = batch_start + BATCH_SIZE
    batch = wikipedia_strings[batch_start:batch_end]
    print(f"Processing batch {batch_start} to {batch_end-1}")

    payload = {
        "input": batch,
        "model": EMBEDDING_MODEL
    }

    try:
        response = requests.post(url_api_server, json=payload, headers=headers)
        response.raise_for_status()
        data = response.json()
        batch_embeddings = [e['embedding'] for e in data['data']]
    except requests.exceptions.HTTPError as e:
        print(f"HTTP error occurred: {e}")
        print(f"Response content: {response.content}")
        # Pad with None so embeddings stay aligned with wikipedia_strings
        batch_embeddings = [None] * len(batch)
    except Exception as e:
        print(f"An error occurred: {e}")
        batch_embeddings = [None] * len(batch)
    embeddings.extend(batch_embeddings)

1749 Wikipedia sections split into 2052 strings.
Processing batch 0 to 99
Processing batch 100 to 199
Processing batch 200 to 299
Processing batch 300 to 399
Processing batch 400 to 499
Processing batch 500 to 599
Processing batch 600 to 699
Processing batch 700 to 799
Processing batch 800 to 899
Processing batch 900 to 999
Processing batch 1000 to 1099
Processing batch 1100 to 1199
Processing batch 1200 to 1299
Processing batch 1300 to 1399
Processing batch 1400 to 1499
Processing batch 1500 to 1599
Processing batch 1600 to 1699
Processing batch 1700 to 1799
Processing batch 1800 to 1899
Processing batch 1900 to 1999
Processing batch 2000 to 2099
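Network calls can fail temporarily, so it may help to retry a failed batch before giving up. The sketch below is a generic retry helper with exponential backoff. `call_with_retry` and its parameters are hypothetical names for this example, and nothing here reflects documented Nugen rate limits or retry guidance.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retry(
    send: Callable[[], T],
    max_attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call send() until it succeeds, waiting base_delay * 2**attempt after each failure."""
    for attempt in range(max_attempts):
        try:
            return send()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)
    raise ValueError("max_attempts must be at least 1")


# Demo with a fake request that fails twice, recording delays instead of sleeping
attempts = {"n": 0}


def flaky_request() -> str:
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise RuntimeError("temporary failure")
    return "ok"


recorded_delays: list[float] = []
print(call_with_retry(flaky_request, sleep=recorded_delays.append))  # ok
print(recorded_delays)  # [1.0, 2.0]
```

In the batch loop, `send` could be a small function that posts the payload with `requests.post`, calls `raise_for_status()`, and returns `response.json()`, so that HTTP errors also trigger a retry.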


### **Save the Results**

In [15]:
# Save the embeddings
df = pd.DataFrame({"text": wikipedia_strings, "embedding": embeddings})
SAVE_PATH = "winter_olympics_2022.csv"
df.to_csv(SAVE_PATH, index=False)
print(f"Embeddings saved to {SAVE_PATH}")

Embeddings saved to winter_olympics_2022.csv
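One thing to keep in mind when reading the file back: CSV has no list type, so each embedding is stored as its string representation (for example `"[0.1, -0.2, 0.3]"`) and has to be parsed into a list again. The sketch below illustrates the round trip with the standard-library `csv` module and made-up toy vectors; the values are placeholders, not real Nugen embeddings.

```python
import ast
import csv
import io

# Toy rows with invented three-dimensional vectors, for illustration only
rows = [
    {"text": "chunk one", "embedding": [0.1, -0.2, 0.3]},
    {"text": "chunk two", "embedding": [0.0, 0.5, -1.0]},
]

buffer = io.StringIO()  # in-memory stand-in for the CSV file
writer = csv.DictWriter(buffer, fieldnames=["text", "embedding"])
writer.writeheader()
writer.writerows(rows)  # the list cell is written as str([0.1, -0.2, 0.3])

buffer.seek(0)
loaded = list(csv.DictReader(buffer))
raw = loaded[0]["embedding"]
print(type(raw).__name__, raw)     # str [0.1, -0.2, 0.3]

vector = ast.literal_eval(raw)     # parse the string back into a list of floats
print(type(vector).__name__, vector)  # list [0.1, -0.2, 0.3]
```

With pandas, the equivalent step after `pd.read_csv(SAVE_PATH)` would be along the lines of `df["embedding"] = df["embedding"].apply(ast.literal_eval)`, which parses each stored vector string back into a list.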
