# Search for articles using embedding

This notebook demonstrates a practical use of text embeddings.

As an example, we will extract the articles in a dataset that are most similar to a specified keyword.

## 0. Prerequisites

### Import libraries

To measure how similar the keyword is to each article, we need to calculate the cosine similarity. PowerShell has no built-in function for this, so this example uses the [Math.NET Numerics](https://www.nuget.org/packages/MathNet.Numerics/) library.

In [1]:
# Import
Import-Module ..\PSOpenAI.psd1

In [2]:
# Download Math.NET Numerics library from Nuget
$LibraryUrl = 'https://www.nuget.org/api/v2/package/MathNet.Numerics/5.0.0'
Invoke-WebRequest -Uri $LibraryUrl -OutFile '.\mathnet.numerics.5.0.0.nupkg'
Expand-Archive '.\mathnet.numerics.5.0.0.nupkg' -Force

# Load library
Add-Type -Path '.\mathnet.numerics.5.0.0\lib\netstandard2.0\MathNet.Numerics.dll'

# Define a function to calculate the cosine similarity
function Get-CosineSimilarity {
    param(
        [float[]]$Input1,
        [float[]]$Input2
    )
    $Vector1 = [MathNet.Numerics.LinearAlgebra.Vector[float]]::Build.Dense($Input1)
    $Vector2 = [MathNet.Numerics.LinearAlgebra.Vector[float]]::Build.Dense($Input2)
    # Divide the dot product by the product of both norms (the parentheses are required)
    $Vector1.DotProduct($Vector2) / ($Vector1.L2Norm() * $Vector2.L2Norm())
}
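
As a sanity check on the formula (the dot product divided by the product of the two L2 norms), the following sketch computes the same quantity in plain PowerShell without Math.NET. The function name `Get-CosineSimilarityPlain` is hypothetical and exists only for this illustration; it is meant for small inputs, not as a replacement for the library version.

```powershell
# Hypothetical helper for illustration: cosine similarity without external libraries.
function Get-CosineSimilarityPlain {
    param(
        [double[]]$Input1,
        [double[]]$Input2
    )
    if ($Input1.Length -ne $Input2.Length) {
        throw 'Both vectors must have the same number of elements.'
    }
    $dot = 0.0; $sumSq1 = 0.0; $sumSq2 = 0.0
    for ($i = 0; $i -lt $Input1.Length; $i++) {
        $dot += $Input1[$i] * $Input2[$i]
        $sumSq1 += $Input1[$i] * $Input1[$i]
        $sumSq2 += $Input2[$i] * $Input2[$i]
    }
    # Divide by the product of both norms (note the parentheses).
    $dot / ([Math]::Sqrt($sumSq1) * [Math]::Sqrt($sumSq2))
}

# Vectors pointing the same way, at right angles, and in opposite directions
$same       = Get-CosineSimilarityPlain -Input1 @(1, 2, 3) -Input2 @(2, 4, 6)
$orthogonal = Get-CosineSimilarityPlain -Input1 @(1, 0) -Input2 @(0, 1)
$opposite   = Get-CosineSimilarityPlain -Input1 @(1, 0) -Input2 @(-1, 0)
```

Vectors pointing in the same direction should score close to 1, orthogonal vectors 0, and opposite vectors -1, regardless of their lengths.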



## 1. Prepare a dataset

In this example, `AG_news_samples.csv` in the [openai-cookbook](https://github.com/openai/openai-cookbook/) repository is used as the dataset. This CSV file contains the titles and descriptions of 2000 news articles.

In [3]:
# Download sample dataset
$DatasetUrl = 'https://raw.githubusercontent.com/openai/openai-cookbook/297c53430cad2d05ba763ab9dca64309cb5091e9/examples/data/AG_news_samples.csv'
Invoke-WebRequest -Uri $DatasetUrl -OutFile '.\AG_news_samples.csv'

# Load dataset to memory
$Dataset = Get-Content '.\AG_news_samples.csv' | ConvertFrom-Csv

# Show the first 3 articles
$Dataset | Select-Object -First 3

title                                              description
-----                                              -----------
World Briefings                                    BRITAIN: BLAIR WARNS OF CLIMATE THREAT Prime Mi…
Nvidia Puts a Firewall on a Motherboard (PC World) PC World - Upcoming chip set will include built…
Olympic joy in Greek, Chinese press                Newspapers in Greece reflect a mixture of exhil…



### (Optional) Filter out texts that are too long

In this example, we use the `text-embedding-ada-002` model to embed the articles. This model accepts at most `8191` input tokens.

We use the `ConvertTo-Token` function to count the tokens in each input string and exclude any article that exceeds `8191` tokens.

Note: The sample dataset used in this example contains no articles that exceed the maximum token length, so you can skip this step. Counting the tokens for 2000 articles takes a long time: about 120 seconds even on a Core i9-12900K system.

In [4]:
$Model = 'text-embedding-ada-002'  # OpenAI's best embeddings as of Apr 2023
$MaxTokenLength = 8191

# Exclude articles that are too long
$Dataset = $Dataset | Where-Object {
    # Count tokens for the same text that will be embedded (title and description)
    $text = $_.title + ' : ' + $_.description
    $tokens = ConvertTo-Token -Text $text -Model $Model
    $tokens.Count -le $MaxTokenLength
}

## 2. Embed documents

Now we can compute the embedding of each article with the `Request-Embeddings` function.

For each article in the dataset, we embed the title and description joined by a colon, and add the result to the dataset.

In [7]:
# If you are short on time, you can use the pre-calculated data in this repository.
# Expand-Archive '.\dataset\AG_news_samples_embedded.zip' -Force
# $Dataset = Import-Clixml '.\AG_news_samples_embedded\AG_news_samples_embedded.xml'

# Embed all articles (this may take a long time)
$Dataset | ForEach-Object {
    $text = $_.title + ' : ' + $_.description
    $embeds = Request-Embeddings -Text $text -Model 'text-embedding-ada-002' -MaxRetryCount 2
    $_ | Add-Member -MemberType NoteProperty -Name 'Embedding' -Value $embeds.data[0].embedding
}

# Show the first 3 articles
$Dataset | Select-Object -First 3

title       : World Briefings
description : BRITAIN: BLAIR WARNS OF CLIMATE THREAT Prime Minister Tony Blair urged the
              international community to consider global warming a dire threat and agree on a
              plan of action to curb the  quot;alarming quot; growth of greenhouse gases.
label_int   : 1
label       : World
Embedding   : {-0.01141339, -0.02303488, -0.01050292, -0.02532406…}

title       : Nvidia Puts a Firewall on a Motherboard (PC World)
description : PC World - Upcoming chip set will include built-in security features for your PC.
label_int   : 4
label       : Sci/Tech
Embedding   : {0.001204324, -0.02190714, 0.001971776, -0.02091623…}

title       : Olympic joy in Greek, Chinese press
description : Newspapers in Greece reflect a mixture of exhilaration that the Athens Olympics
              proved successful, and relief

An `Embedding` property holding the calculated embedding has been added to each article.
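
Because embedding every article is slow and calls the API once per article, it can be worth saving the enriched dataset to disk and reloading it in later sessions. The following is a minimal sketch using the built-in `Export-Clixml` and `Import-Clixml` cmdlets. The sample records, the file name `embedded_sample.xml`, and the shortened three-element vectors are made up for the illustration.

```powershell
# Made-up records standing in for the embedded $Dataset (real vectors are much longer).
$Sample = @(
    [pscustomobject]@{ title = 'Article A'; Embedding = [float[]](0.1, 0.2, 0.3) }
    [pscustomobject]@{ title = 'Article B'; Embedding = [float[]](0.4, 0.5, 0.6) }
)

# Save to an XML file (hypothetical path in the temp folder), then load it back.
$CachePath = Join-Path ([System.IO.Path]::GetTempPath()) 'embedded_sample.xml'
$Sample | Export-Clixml -Path $CachePath
$Restored = Import-Clixml -Path $CachePath

# The restored objects keep their properties, including the vector values.
$Restored[1].title
$Restored[1].Embedding.Count
```

With the real data, `$Dataset` would take the place of `$Sample`, so the embedding step only has to run once.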

## 3. Search for articles by keyword

Now we perform a keyword search using the embedded data.

We calculate the embedding of the keyword and compute its cosine similarity with the embedding of each article. The closer the cosine similarity is to 1, the more relevant the article is to the keyword, so we extract the top 3 articles in descending order of similarity.

In [10]:
$Keyword = 'Olympics'

# Calculate the embedding of the search keyword
$SearchVector = (Request-Embeddings -Text $Keyword -Model 'text-embedding-ada-002').data[0].embedding

# Extract the 3 articles with the highest similarity
$Dataset | Sort-Object {
    Get-CosineSimilarity -Input1 $SearchVector -Input2 $_.Embedding
} -Descending | Select-Object -First 3

title       : Olympic joy in Greek, Chinese press
description : Newspapers in Greece reflect a mixture of exhilaration that the Athens Olympics
              proved successful, and relief that they passed off without any major setback.
label_int   : 2
label       : Sports
Embedding   : {-0.004390786, -0.002832924, 0.01760238, -0.02965988…}

title       : China supreme heading for Beijing
description : ATHENS: China, the dominant force in world diving for the best part of 20 years, won
              six out of eight Olympic titles in Athens and prompted speculation about a clean
              sweep when they stage the Games in Beijing in 2008.
label_int   : 2
label       : Sports
Embedding   : {0.006527618, 0.01006661, 0.002017991, -0.01895911…}

title       : Olympic Games 2012 great stake for France #39;s sports, says French &lt;b&gt;...&lt;/b&gt;

Three articles related to the keyword "Olympics" were retrieved.

This example may not show much advantage of embedding-based search over a regular text search.

So, let's change the keyword to "オリンピック" (the Japanese word for "Olympics").

In [11]:
$Keyword = 'オリンピック' # "Olympics" in Japanese

# Calculate the embedding of the search keyword
$SearchVector = (Request-Embeddings -Text $Keyword -Model 'text-embedding-ada-002').data[0].embedding

# Extract the 3 articles with the highest similarity
$Dataset | Sort-Object {
    Get-CosineSimilarity -Input1 $SearchVector -Input2 $_.Embedding
} -Descending | Select-Object -First 3

title       : Olympic joy in Greek, Chinese press
description : Newspapers in Greece reflect a mixture of exhilaration that the Athens Olympics
              proved successful, and relief that they passed off without any major setback.
label_int   : 2
label       : Sports
Embedding   : {-0.004390786, -0.002832924, 0.01760238, -0.02965988…}

title       : ATHENS 2004/Inoue crashes out
description : ATHENS-In one of the biggest shocks in Olympic judo history, defending champion Kosei
              Inoue was defeated by Dutchman Elco van der Geest in the men #39;s 100-kilogram
              category Thursday.
label_int   : 2
label       : Sports
Embedding   : {-0.007725361, 0.0102562, 0.01050862, 0.00970486…}

title       : Olympians out with plenty to prove in NYC Marathon
description : When Paula Radcliffe dropped out of the Olympic marathon miles from

Although the dataset is written entirely in English and does not contain the Japanese word "オリンピック", we were still able to extract articles related to the Olympics. In this way, embeddings enable fuzzy searches that take the meaning of words into account.
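
To see how close each match actually is, the similarity score can be computed once per article and shown next to the title. The following is a minimal sketch with made-up two-dimensional vectors. The `Score` property, the sample records, and the inline `$cosine` script block are assumptions for the illustration; with the real data, `Get-CosineSimilarity`, `$SearchVector`, and `$Dataset` would take their place.

```powershell
# Made-up records with tiny 2-D "embeddings" for illustration only.
$Articles = @(
    [pscustomobject]@{ title = 'Sports story'; Embedding = [double[]](0.9, 0.1) }
    [pscustomobject]@{ title = 'Tech story';   Embedding = [double[]](0.1, 0.9) }
    [pscustomobject]@{ title = 'Mixed story';  Embedding = [double[]](0.6, 0.4) }
)
$Query = [double[]](1.0, 0.0)

# Stand-in for Get-CosineSimilarity so the sketch runs without Math.NET.
$cosine = {
    param([double[]]$a, [double[]]$b)
    $dot = 0.0; $na = 0.0; $nb = 0.0
    for ($i = 0; $i -lt $a.Length; $i++) {
        $dot += $a[$i] * $b[$i]
        $na += $a[$i] * $a[$i]
        $nb += $b[$i] * $b[$i]
    }
    $dot / ([Math]::Sqrt($na) * [Math]::Sqrt($nb))
}

# Attach the score as a calculated property, then sort on it.
$Ranked = $Articles |
    Select-Object title, @{ Name = 'Score'; Expression = { & $cosine $Query $_.Embedding } } |
    Sort-Object Score -Descending

# Show the two closest matches with their scores
$Ranked | Select-Object -First 2
```

Keeping the score in the output makes it easier to judge whether the top results are strong matches or merely the least dissimilar articles in the dataset.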