<font color='green'>
pip install is the command used to install Python packages with pip, Python's package manager.
<br><br>Installing the LangChain package, along with tiktoken and plotly
</font>

In [15]:
!pip install langchain
!pip install tiktoken
!pip install plotly

Collecting plotly
  Downloading plotly-5.18.0-py3-none-any.whl (15.6 MB)
Installing collected packages: plotly
Successfully installed plotly-5.18.0


<font color='green'>
Installing the openai package, which includes the classes we can use to communicate with OpenAI services
</font>

In [2]:
!pip install openai



## Let's use OpenAI

<font color='green'>
The cell below imports Python's built-in os module.
<br>This module provides a way to interact with the operating system, such as accessing environment variables and working with files and directories.
<br><br>
The environ attribute is a dictionary-like object that contains the environment variables of the current process.
<br><br>
Through os.environ, you can read and set environment variables within your Python program. For example, os.environ['VARIABLE_NAME'] returns the value of the environment variable named "VARIABLE_NAME", and assigning to it sets that variable. Here we set OPENAI_API_KEY, which the OpenAI wrapper reads to authenticate; put your own key between the quotes.
</font>

In [2]:
import os
os.environ["OPENAI_API_KEY"] = ""
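Hardcoding a key in a notebook risks leaking it when the notebook is shared. A minimal sketch of the alternative, assuming the key was already exported in the shell before Jupyter was started; `require_env` and `DEMO_API_KEY` are hypothetical names used only for illustration:

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or fail with a clear message."""
    value = os.environ.get(name, "")  # .get avoids a KeyError when the variable is unset
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set or is empty")
    return value


# Demo with a placeholder variable so the sketch runs without a real key.
os.environ["DEMO_API_KEY"] = "placeholder-value"
print(require_env("DEMO_API_KEY"))
```

Failing early with a named variable tends to be easier to debug than an authentication error raised later by the API client.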

<font color='green'>
LangChain provides wrappers around the OpenAI APIs, through which we can access the services OpenAI offers.
<br>
The code snippet below imports the 'OpenAIEmbeddings' class (a wrapper around OpenAI's embedding models) from the 'embeddings' module of the 'langchain' library.

</font>

In [3]:
from langchain.embeddings import OpenAIEmbeddings

<font color='green'>
Initialize the OpenAIEmbeddings object
</font>

In [4]:
embeddings = OpenAIEmbeddings()

<font color='green'>
Let's read our input data and get its embedding representation, so that we can use it in later tasks
</font>

In [8]:
import pandas as pd
df = pd.read_csv('Data.csv')
print(df)

         Words
0     Elephant
1         Lion
2        Tiger
3          Dog
4      Cricket
5      Footbal
6       Tennis
7   Basketball
8        Apple
9       Orange
10      Banana


<font color='green'>
    Because our words are stored in a pandas dataframe, we can use "apply" to call embed_query on each row. We then save the computed embeddings to a new CSV file called "word_embeddings.csv", which saves time and avoids calling OpenAI again to repeat these computations.
    </font>

In [9]:
df['embedding'] = df['Words'].apply(lambda x: embeddings.embed_query(x))
df.to_csv('word_embeddings.csv')

Retrying langchain.embeddings.openai.embed_with_retry.<locals>._embed_with_retry in 4.0 seconds as it raised RateLimitError: Rate limit reached for text-embedding-ada-002 in organization org-m65woVAfJkLXM9KPRL1VqIHi on requests per min. Limit: 3 / min. Please try again in 20s. Visit https://platform.openai.com/account/rate-limits to learn more. You can increase your rate limit by adding a payment method to your account at https://platform.openai.com/account/billing..
Retrying langchain.embeddings.openai.embed_with_retry.<l
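
The retry messages above come from sending one request per row against a limit of 3 requests per minute. One way to reduce the number of requests is to embed several words per call. The sketch below assumes the embedder exposes a method that accepts a list of texts and returns one vector per text, as LangChain's `embed_documents` does; `embed_in_batches` and the `fake_embed` stand-in are hypothetical names so that the example runs without an API key:

```python
from typing import Callable, List, Sequence

Vector = List[float]


def embed_in_batches(
    texts: Sequence[str],
    embed_many: Callable[[List[str]], List[Vector]],
    batch_size: int = 5,
) -> List[Vector]:
    """Embed texts in chunks so each chunk costs a single call to embed_many."""
    vectors: List[Vector] = []
    for start in range(0, len(texts), batch_size):
        batch = list(texts[start:start + batch_size])
        vectors.extend(embed_many(batch))
    return vectors


calls = 0


def fake_embed(batch: List[str]) -> List[Vector]:
    """Stand-in for a real embedding call; returns toy one-dimensional vectors."""
    global calls
    calls += 1
    return [[float(len(word))] for word in batch]


words = ["Elephant", "Lion", "Tiger", "Dog", "Cricket", "Footbal",
         "Tennis", "Basketball", "Apple", "Orange", "Banana"]
result = embed_in_batches(words, fake_embed, batch_size=5)
print(len(result), "vectors from", calls, "calls")
```

With the real object, `embeddings.embed_documents` would be passed in place of `fake_embed`. Whether this stays under the limit depends on the account's quota, so treat the batch size as something to tune.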

<font color='green'>
    Let's load the existing file, which contains the embeddings, so that we can save on charges by not hitting the API repeatedly
    </font>

In [10]:
new_df = pd.read_csv('word_embeddings.csv')
print(new_df)

    Unnamed: 0       Words                                          embedding
0            0    Elephant  [-0.017855134320424067, -0.008739002273680945,...
1            1        Lion  [-0.001514446088710819, -0.010011047775235735,...
2            2       Tiger  [-0.013417221828848814, -0.009594874215361255,...
3            3         Dog  [-0.000993324388174965, -0.01511439587442286, ...
4            4     Cricket  [0.003905495127549336, -0.007206682981601541, ...
5            5     Footbal  [-0.011442505362835323, -0.008127146122306163,...
6            6      Tennis  [-0.0229627046214536, 0.001620174408225852, 0....
7            7  Basketball  [-0.012779986743709601, -0.013293189227440112,...
8            8       Apple  [0.01447704940326135, -0.003934278727982006, -...
9            9      Orange  [0.020745464248294754, -0.029286146214470166, ...
10          10      Banana  [-0.01299976590369686, -0.019983216193919837, ...
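
Note that a CSV file stores each embedding as text, so after `read_csv` the `embedding` column holds strings rather than lists of floats. A minimal sketch of converting one such cell back with the standard library's `ast.literal_eval`; the vector is shortened to the first two values of the Elephant row for readability:

```python
import ast

# A cell as it comes back from the CSV: a string, not a list.
cell = "[-0.017855134320424067, -0.008739002273680945]"

vector = ast.literal_eval(cell)  # parses Python literals without executing code
print(type(cell).__name__, "->", type(vector).__name__, "of length", len(vector))
```

Applied to the whole column, this could look like `new_df['embedding'] = new_df['embedding'].apply(ast.literal_eval)`.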


<font color='green'>
Let's get the embedding for our text
</font>

In [11]:
our_Text = "Mango"

In [12]:
text_embedding = embeddings.embed_query(our_Text)

In [13]:
print(f"Our embedding is {text_embedding}")

Our embedding is [-0.0034055178024402486, -0.019832508079084002, 0.010472126941037743, -0.016190586154939356, 0.00631905655247528, 0.008516989408102721, -0.025122882089670923, -0.014197111660624236, 0.0011077517098275833, -0.03307121898826951, -0.0011668530894996386, 0.003782488764672818, 0.0019183990702938399, 0.016842297734595148, -0.021621524677748073, 0.004792004222854953, 0.03268786055033911, -0.008191132686952277, 0.009385939422740509, -0.01575611114762046, -0.01596056997459406, 0.0027809598630040206, 0.0014703466607885716, -0.012401707120601064, 0.0045587936061688315, 0.009552062219656557, 0.024483948255378432, -0.013660407239818545, 0.011577482940025025, 0.0029918080283205424, 0.04779225825565868, -0.015027725645204474, -0.0043511398771931345, -0.0099226438435462, -0.017596240590382835, 0.004654633681312705, -0.010325171693489196, -0.006373365695559505, -0.006251968267043931, -0.018324624230153727, 0.014452685194341232, -0.008325309257814974, -0.015641102126125263, -0.018107387

<font color='green'>
    Once we have a vector representing a word, we can determine how similar that word is to the other words in our dataframe.
    <br>
We do this by computing the cosine similarity between the vector for our search term and each word embedding in our dataframe.
    </font>
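
Cosine similarity is the dot product of two vectors divided by the product of their lengths, so it measures the angle between them rather than their magnitude. A minimal pure-Python sketch for illustration; `cosine` is a hypothetical helper, not the library function imported in the next cell, and the toy vectors are made up:

```python
import math
from typing import Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


print(cosine([1.0, 0.0], [1.0, 0.0]))  # same direction
print(cosine([1.0, 0.0], [0.0, 1.0]))  # perpendicular
print(cosine([1.0, 2.0], [2.0, 4.0]))  # same direction, different length
```

Vectors pointing the same way score close to 1 regardless of length, while perpendicular vectors score 0.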

In [18]:
# Note: openai.embeddings_utils is only available in openai versions before 1.0
from openai.embeddings_utils import cosine_similarity

print(text_embedding)

df["similarity score"] = df['embedding'].apply(lambda x: cosine_similarity(x, text_embedding))

df

[-0.0034055178024402486, -0.019832508079084002, 0.010472126941037743, -0.016190586154939356, 0.00631905655247528, 0.008516989408102721, -0.025122882089670923, -0.014197111660624236, 0.0011077517098275833, -0.03307121898826951, -0.0011668530894996386, 0.003782488764672818, 0.0019183990702938399, 0.016842297734595148, -0.021621524677748073, 0.004792004222854953, 0.03268786055033911, -0.008191132686952277, 0.009385939422740509, -0.01575611114762046, -0.01596056997459406, 0.0027809598630040206, 0.0014703466607885716, -0.012401707120601064, 0.0045587936061688315, 0.009552062219656557, 0.024483948255378432, -0.013660407239818545, 0.011577482940025025, 0.0029918080283205424, 0.04779225825565868, -0.015027725645204474, -0.0043511398771931345, -0.0099226438435462, -0.017596240590382835, 0.004654633681312705, -0.010325171693489196, -0.006373365695559505, -0.006251968267043931, -0.018324624230153727, 0.014452685194341232, -0.008325309257814974, -0.015641102126125263, -0.018107387657816828, -0.015

Unnamed: 0,Words,embedding,similarity score
0,Elephant,"[-0.017855134320424067, -0.008739002273680945,...",0.830493
1,Lion,"[-0.001514446088710819, -0.010011047775235735,...",0.827275
2,Tiger,"[-0.013417221828848814, -0.009594874215361255,...",0.852021
3,Dog,"[-0.000993324388174965, -0.01511439587442286, ...",0.77311
4,Cricket,"[0.003905495127549336, -0.007206682981601541, ...",0.817943
5,Footbal,"[-0.011442505362835323, -0.008127146122306163,...",0.777473
6,Tennis,"[-0.0229627046214536, 0.001620174408225852, 0....",0.805583
7,Basketball,"[-0.012779986743709601, -0.013293189227440112,...",0.794297
8,Apple,"[0.01447704940326135, -0.003934278727982006, -...",0.813942
9,Orange,"[0.020745464248294754, -0.029286146214470166, ...",0.843932


<font color='green'>
    Sorting the dataframe by similarity score shows that Banana is closest to our search term, Mango, followed by Tiger and Orange. Apple, despite being a fruit, ranks below several of the animal words.
    </font>

In [19]:
df.sort_values("similarity score", ascending=False).head(10)

Unnamed: 0,Words,embedding,similarity score
10,Banana,"[-0.01299976590369686, -0.019983216193919837, ...",0.898764
2,Tiger,"[-0.013417221828848814, -0.009594874215361255,...",0.852021
9,Orange,"[0.020745464248294754, -0.029286146214470166, ...",0.843932
0,Elephant,"[-0.017855134320424067, -0.008739002273680945,...",0.830493
1,Lion,"[-0.001514446088710819, -0.010011047775235735,...",0.827275
4,Cricket,"[0.003905495127549336, -0.007206682981601541, ...",0.817943
8,Apple,"[0.01447704940326135, -0.003934278727982006, -...",0.813942
6,Tennis,"[-0.0229627046214536, 0.001620174408225852, 0....",0.805583
7,Basketball,"[-0.012779986743709601, -0.013293189227440112,...",0.794297
5,Footbal,"[-0.011442505362835323, -0.008127146122306163,...",0.777473
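
The same ranking can be reproduced without pandas. A small sketch using the similarity scores copied from the output above; `top_k` is a hypothetical helper introduced only for this example:

```python
from typing import Dict, List, Tuple

# Similarity scores copied from the notebook output above.
scores: Dict[str, float] = {
    "Elephant": 0.830493, "Lion": 0.827275, "Tiger": 0.852021,
    "Dog": 0.77311, "Cricket": 0.817943, "Footbal": 0.777473,
    "Tennis": 0.805583, "Basketball": 0.794297, "Apple": 0.813942,
    "Orange": 0.843932, "Banana": 0.898764,
}


def top_k(scored: Dict[str, float], k: int) -> List[Tuple[str, float]]:
    """Return the k highest-scoring (word, score) pairs, best first."""
    return sorted(scored.items(), key=lambda item: item[1], reverse=True)[:k]


for word, score in top_k(scores, 3):
    print(f"{word}: {score}")
```

All scores fall in a narrow band between roughly 0.77 and 0.90, so the ordering carries more information than the absolute values.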
