<a href="https://colab.research.google.com/github/juancopi81/chatMLS/blob/main/Text_Embedding.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 1. Creating the text database

## Install required packages

In [1]:
!pip install transformers datasets

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.26.0-py3-none-any.whl (6.3 MB)
Collecting datasets
  Downloading datasets-2.9.0-py3-none-any.whl (462 kB)
Collecting huggingface-hub<1.0,>=0.11.0
  Downloading huggingface_hub-0.12.0-py3-none-any.whl (190 kB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1
  Downloading tokenizers-0.13.2-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.6 MB)

## Load raw dataset from Hugging Face

In [2]:
from datasets import load_dataset
mls_dataset = load_dataset("juancopi81/mls", split="train")
mls_dataset

Downloading and preparing dataset None/None to /root/.cache/huggingface/datasets/juancopi81___parquet/juancopi81--mls-a645ad9f5aee714c/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec...


Dataset parquet downloaded and prepared to /root/.cache/huggingface/datasets/juancopi81___parquet/juancopi81--mls-a645ad9f5aee714c/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec. Subsequent calls will reuse this data.


Dataset({
    features: ['CHANNEL_NAME', 'URL', 'TITLE', 'DESCRIPTION', 'TRANSCRIPTION', 'SEGMENTS'],
    num_rows: 142
})

## Inspect dataset

In [3]:
mls_dataset[80]["TITLE"]

'6.11 Machine Learning development process | Error analysis --[ML | Andrew Ng]'

## Remove unused columns

In [4]:
mls_dataset = mls_dataset.remove_columns(["SEGMENTS", "CHANNEL_NAME", "DESCRIPTION"])

## Convert to pandas dataframe

In [5]:
mls_dataset.set_format("pandas")
df = mls_dataset[:]
df.head()

Unnamed: 0,URL,TITLE,TRANSCRIPTION
0,https://www.youtube.com/watch?v=y8JgiWcUnU8,1.1 Machine Learning Overview | Welcome to mac...,Welcome to machine learning. What is machine ...
1,https://www.youtube.com/watch?v=AISftYVyS50,1.2 Machine Learning Overview | What is machi...,"So, what is machine learning? In this video y..."
2,https://www.youtube.com/watch?v=hHYcNPfbBXQ,1.3 Machine Learning Overview | Applications ...,"In this class, you learn about the state of t..."
3,https://www.youtube.com/watch?v=EZN_uM3J3kI,1.4 Machine Learning Overview | Supervised lea...,Machine learning is creating tremendous econo...
4,https://www.youtube.com/watch?v=l16C3PKiHKg,1.5 Machine Learning Overview | Supervised lea...,"So, supervised learning algorithms learn to p..."


In [6]:
# See max and min number of words in transcriptions
out = df['TRANSCRIPTION'].str.split().str.len().agg(['min','max'])
out

min     394
max    2646
Name: TRANSCRIPTION, dtype: int64

In [7]:
# Check for duplicate rows
duplicateRows = df[df.duplicated(['URL', 'TITLE'])]
duplicateRows

Unnamed: 0,URL,TITLE,TRANSCRIPTION


## Split the transcription column into chunks of num_of_words words

In [8]:
import pandas as pd

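def split_by_number_of_words(df, column_to_split, num_of_words):
    """
    Split the text in column_to_split into chunks of at most num_of_words
    words, producing one row per chunk. All other columns are repeated
    for every chunk.
    :param df: input dataframe
    :param column_to_split: name of the text column to split
    :param num_of_words: maximum number of words per chunk
    :return: new dataframe in which every value of column_to_split has
        num_of_words words or fewer
    """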
    n = num_of_words
    columns_to_duplicate = df.columns.symmetric_difference([column_to_split]).to_list()
    final_cols = columns_to_duplicate + [column_to_split]
    new_df = df.set_index(columns_to_duplicate)[column_to_split].str.split().apply(
        lambda x: pd.Series([' '.join(x[i:i+n]) for i in range(0, len(x), n)])
        ).stack().reset_index()
    new_df = new_df.rename(columns={0: column_to_split})
    final_df = new_df.loc[:, final_cols]
    return final_df
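
To make the chunking step easier to follow, here is a minimal plain-Python sketch of the same idea applied to a single string. The helper name `chunk_words` is hypothetical and not part of the notebook; it only illustrates how the final chunk can end up much shorter than `num_of_words`.

```python
def chunk_words(text, num_of_words):
    """Split text into consecutive chunks of at most num_of_words words."""
    words = text.split()
    return [
        " ".join(words[i:i + num_of_words])
        for i in range(0, len(words), num_of_words)
    ]

# Seven words in chunks of three: the last chunk holds the single leftover word.
chunks = chunk_words("one two three four five six seven", 3)
print(chunks)
```

The leftover words at the end of a transcription always form their own chunk, however few they are.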

In [9]:
# Split the transcription column into chunks of 500 words (chunk size still to be tuned)
new_df = split_by_number_of_words(df, "TRANSCRIPTION", 500)

In [10]:
df

Unnamed: 0,URL,TITLE,TRANSCRIPTION
0,https://www.youtube.com/watch?v=y8JgiWcUnU8,1.1 Machine Learning Overview | Welcome to mac...,Welcome to machine learning. What is machine ...
1,https://www.youtube.com/watch?v=AISftYVyS50,1.2 Machine Learning Overview | What is machi...,"So, what is machine learning? In this video y..."
2,https://www.youtube.com/watch?v=hHYcNPfbBXQ,1.3 Machine Learning Overview | Applications ...,"In this class, you learn about the state of t..."
3,https://www.youtube.com/watch?v=EZN_uM3J3kI,1.4 Machine Learning Overview | Supervised lea...,Machine learning is creating tremendous econo...
4,https://www.youtube.com/watch?v=l16C3PKiHKg,1.5 Machine Learning Overview | Supervised lea...,"So, supervised learning algorithms learn to p..."
...,...,...,...
137,https://www.youtube.com/watch?v=4hlH4TXtNms,10.13 Continuous State Spaces|Algorithm refine...,"In the last video, we saw a neural network ar..."
138,https://www.youtube.com/watch?v=tX7L_441Jlo,10.14 Continuous State Spaces | Algorithm refi...,"In the learning algorithm that we developed, ..."
139,https://www.youtube.com/watch?v=3FkPgerAhXo,10.15 Continuous State Spaces | Algorithm refi...,"In this video, we'll look at two further refi..."
140,https://www.youtube.com/watch?v=pdeGAhJ5pbE,10.16 Continuous State Spaces |The state of re...,Reinforcement learning is an exciting set of ...


In [11]:
new_df

Unnamed: 0,TITLE,URL,TRANSCRIPTION
0,1.1 Machine Learning Overview | Welcome to mac...,https://www.youtube.com/watch?v=y8JgiWcUnU8,Welcome to machine learning. What is machine l...
1,1.2 Machine Learning Overview | What is machi...,https://www.youtube.com/watch?v=AISftYVyS50,"So, what is machine learning? In this video yo..."
2,1.2 Machine Learning Overview | What is machi...,https://www.youtube.com/watch?v=AISftYVyS50,to spend a lot of time on in this specializati...
3,1.3 Machine Learning Overview | Applications ...,https://www.youtube.com/watch?v=hHYcNPfbBXQ,"In this class, you learn about the state of th..."
4,1.3 Machine Learning Overview | Applications ...,https://www.youtube.com/watch?v=hHYcNPfbBXQ,"by McKinsey, AI and machine learning is estima..."
...,...,...,...
420,10.15 Continuous State Spaces | Algorithm refi...,https://www.youtube.com/watch?v=3FkPgerAhXo,so this inner term becomes 1 over 2 m prime su...
421,10.15 Continuous State Spaces | Algorithm refi...,https://www.youtube.com/watch?v=3FkPgerAhXo,is used more common than bash gradient descent...
422,10.15 Continuous State Spaces | Algorithm refi...,https://www.youtube.com/watch?v=3FkPgerAhXo,"times W, in which case you're back to the orig..."
423,10.16 Continuous State Spaces |The state of re...,https://www.youtube.com/watch?v=pdeGAhJ5pbE,Reinforcement learning is an exciting set of t...


In [12]:
out = new_df['TRANSCRIPTION'].str.split().str.len().agg(['min','max'])
out

min      1
max    500
Name: TRANSCRIPTION, dtype: int64
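
The minimum of 1 word comes from leftover tails: a transcription whose length is just over a multiple of 500 produces a very short final chunk. The notebook handles this by dropping short chunks in a later step. One possible alternative, sketched here as an assumption and not what the notebook does, is to fold a short tail into the preceding chunk. The names `chunk_words_merge_tail` and `min_tail` are hypothetical.

```python
def chunk_words_merge_tail(text, num_of_words, min_tail):
    """Chunk text by words, merging a too-short final chunk into the previous one."""
    words = text.split()
    chunks = [words[i:i + num_of_words] for i in range(0, len(words), num_of_words)]
    # Only merge when there is a previous chunk to merge into.
    if len(chunks) > 1 and len(chunks[-1]) < min_tail:
        tail = chunks.pop()
        chunks[-1].extend(tail)
    return [" ".join(chunk) for chunk in chunks]

text = " ".join(str(i) for i in range(11))  # 11 made-up "words": "0 1 ... 10"
merged = chunk_words_merge_tail(text, num_of_words=5, min_tail=3)
print([len(chunk.split()) for chunk in merged])
```

The trade-off is that the merged chunk can exceed `num_of_words` by up to `min_tail - 1` words, in exchange for not discarding any text.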

In [16]:
# Inspect rows with fewer than 40 words
rows = new_df[new_df["TRANSCRIPTION"].str.split().str.len() < 40]

In [14]:
rows

Unnamed: 0,TITLE,URL,TRANSCRIPTION
7,1.4 Machine Learning Overview | Supervised lea...,https://www.youtube.com/watch?v=EZN_uM3J3kI,a number. But there's also a second major type...
14,1.7 Machine Learning Overview | Unsupervised l...,https://www.youtube.com/watch?v=u7Y_b04upmQ,to share with you something that I find really...
22,1.10 Machine Learning Overview | Linear regres...,https://www.youtube.com/watch?v=vrTHO5zRq6s,you can construct a cost function.
83,3.3 Classification | Decision boundary --[Mac...,https://www.youtube.com/watch?v=QJdIpRcL_4U,video.
108,4.1 Advanced Learning Algorithms | Welcome! -...,https://www.youtube.com/watch?v=cuU8pCflXCo,to start by taking a quick look at how the hum...
121,4.4 Neural Networks Intuition | Example Recogn...,https://www.youtube.com/watch?v=3RIUt73mj3Q,one or more layers of a neural network and the...
127,4.6 Neural Networks Model | More complex neura...,https://www.youtube.com/watch?v=4-2FOgsMOpk,of the previous layer. Let's put this into an ...
188,5.11 Additional Neural Network Concepts | Adva...,https://www.youtube.com/watch?v=yo6aW-D7sCM,"video, let's take a look at some alternative l..."
214,6.6 Bias and variance |Establishing a baseline...,https://www.youtube.com/watch?v=8Rl_2WQbmlc,"is doing, there's one other thing that I found..."
219,6.7 Bias and variance | Learning curves --[Mac...,https://www.youtube.com/watch?v=m0QgVaFS6O4,I hope will now make a lot more sense to you. ...


## Convert to Hugging Face dataset

In [15]:
from datasets import Dataset

mls_ds = Dataset.from_pandas(new_df)
mls_ds

Dataset({
    features: ['TITLE', 'URL', 'TRANSCRIPTION'],
    num_rows: 425
})

## Get transcription length 

In [17]:
mls_ds = mls_ds.map(
    lambda x: {"transcription_length": len(x["TRANSCRIPTION"].split())}
)
mls_ds

Dataset({
    features: ['TITLE', 'URL', 'TRANSCRIPTION', 'transcription_length'],
    num_rows: 425
})

## Remove rows with 40 words or fewer

In [18]:
mls_ds = mls_ds.filter(
    lambda x: x["transcription_length"] > 40
)

mls_ds

Dataset({
    features: ['TITLE', 'URL', 'TRANSCRIPTION', 'transcription_length'],
    num_rows: 410
})
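
Note that the inspection cell earlier uses `< 40`, while the filter keeps rows with `> 40`. A chunk of exactly 40 words is therefore not listed by the inspection but is still removed by the filter. A small sketch with made-up word counts shows the difference:

```python
lengths = [1, 39, 40, 41, 500]  # made-up word counts for illustration

inspected = [n for n in lengths if n < 40]     # what the inspection cell would list
kept = [n for n in lengths if n > 40]          # what the filter keeps
removed = [n for n in lengths if not n > 40]   # what the filter drops

print(inspected, kept, removed)
```

If the intent is to keep 40-word chunks, the filter condition would need to be `>= 40`.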

In [19]:
mls_ds[100]

{'TITLE': '3.10 Regularization to Reduce Overfitting | Regularized linear regression-- [ML | Andrew Ng]',
 'URL': 'https://www.youtube.com/watch?v=yRSKygmsvSI',
 'TRANSCRIPTION': "or w dot product x plus b. And it turns out that by the rules of calculus, the derivatives look like this is one over two m times the sum i equals one through m of w dot x plus b minus y times two xj plus the derivative of the regularization term, which is lambda over two m times two wj. Notice that the second term does not have the summation term from j equals one through n anymore. The twos cancel out here and here and also here and here. And so it simplifies to this expression over here. And finally, remember that wx plus b is f of x. And so you can rewrite it as this expression down here. So this is why this expression is used to compute the gradient in regularized linear regression. So you now know how to implement regularized linear regression. Using this, you will reduce overfitting when you have a lot

In [20]:
# Add title to transcription
def concatenate_text(examples):
    return {
        "text": examples["TITLE"]
        + ": "
        + examples["TRANSCRIPTION"]
    }

## Concatenate title and transcription

The title contains relevant information, so it is prepended to each transcription chunk to form the text that gets embedded.

In [21]:
mls_ds = mls_ds.map(concatenate_text)
mls_ds

Dataset({
    features: ['TITLE', 'URL', 'TRANSCRIPTION', 'transcription_length', 'text'],
    num_rows: 410
})

In [23]:
mls_ds[400]

{'TITLE': '10.13 Continuous State Spaces|Algorithm refinement Improved neural network architecture-ML Andrew Ng',
 'URL': 'https://www.youtube.com/watch?v=4hlH4TXtNms',
 'TRANSCRIPTION': "In the last video, we saw a neural network architecture that would input the state in action and attempt to output the Q function, Q of sA. It turns out that there's a change to neural network architecture that makes this algorithm much more efficient. So most implementations of DQN actually use this more efficient architecture that we'll see in this video. Let's take a look. This was the neural network architecture we saw previously where it would input 12 numbers and output Q of sA. Whenever we are in some state s, we would have to carry out inference in the neural network separately four times to compute these four values so as to pick the action A that gives us the largest Q value. This is inefficient because we have to carry out inference four times from every single state. Instead, it turns out 

In [24]:
len(mls_ds["text"])

410

# 2. Embedding the documents of the text database

In [25]:
!pip install -qU openai #pinecone-client

  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Installing backend dependencies ... done
  Preparing metadata (pyproject.toml) ... done
  Building wheel for openai (pyproject.toml) ... done


In [26]:
import openai

# get API key from top-right dropdown on OpenAI website
openai.api_key = "OPEN-AI-KEY"

# Test key

In [27]:
query = "who was the 12th person on the moon and when did they land?"

# now query text-davinci-003 WITHOUT context
res = openai.Completion.create(
    engine='text-davinci-003',
    prompt=query,
    temperature=0,
    max_tokens=400,
    top_p=1,
    frequency_penalty=0,
    presence_penalty=0,
    stop=None
)

res['choices'][0]['text'].strip()

AuthenticationError: ignored
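
The `AuthenticationError` above is what a placeholder key produces. To avoid pasting a real key into the notebook, one option is to read it from an environment variable. This is a minimal sketch; the helper name `read_api_key` is hypothetical, and the variable name `OPENAI_API_KEY` is an assumption based on the usual convention.

```python
import os

def read_api_key(env_var="OPENAI_API_KEY"):
    """Return the key stored in env_var, or raise a clear error if it is missing."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable before calling the API.")
    return key

# Usage in the notebook would look like:
# openai.api_key = read_api_key()
```

Failing early with a clear message makes a missing key easier to diagnose than an authentication error from the first API call.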

In [None]:
# first let's make it simpler to get answers
def complete(prompt):
    # query text-davinci-003
    res = openai.Completion.create(
        engine='text-davinci-003',
        prompt=prompt,
        temperature=0,
        max_tokens=400,
        top_p=1,
        frequency_penalty=0,
        presence_penalty=0,
        stop=None
    )
    return res['choices'][0]['text'].strip()

In [None]:
query = (
    "Which training method should I use for sentence transformers when " +
    "I only have pairs of related sentences?"
)

complete(query)
     

'If you only have pairs of related sentences, then the best training method to use for sentence transformers is the supervised learning approach. This approach involves providing the model with labeled data, such as pairs of related sentences, and then training the model to learn the relationships between the sentences. This approach is often used for tasks such as text classification, sentiment analysis, and natural language understanding.'

In [None]:
embed_model = "text-embedding-ada-002"

res = openai.Embedding.create(
    input=[
        "Sample document text goes here",
        "there will be several phrases in each batch"
    ], engine=embed_model
)

In [None]:
# vector embeddings are stored within the 'data' key
res.keys()

dict_keys(['object', 'data', 'model', 'usage'])

In [None]:
# we have created two vectors (one for each sentence input)
len(res['data'])

2

In [None]:
# we have created two 1536-dimensional vectors
len(res['data'][0]['embedding']), len(res['data'][1]['embedding'])

(1536, 1536)

# Alternative: Use Pinecone to store the embeddings

In [None]:
#import pinecone

#index_name = 'openai-yannic-youtube-transcriptions-v2'

# initialize connection (get API key at app.pinecone.io)
#pinecone.init(
#    api_key="XXXX",
#    environment="us-east1-gcp"
#)

In [None]:
#pinecone.list_indexes()

In [None]:
# check if index already exists (it shouldn't if this is first time)
#if index_name not in pinecone.list_indexes():
    # if does not exist, create index
#    pinecone.create_index(
#        index_name,
#        dimension=len(res['data'][0]['embedding']),
#        metric='cosine'
#    )
# connect to index
#index = pinecone.Index(index_name)
# view index stats
#index.describe_index_stats()

In [28]:
mls_ds.set_format("pandas")
df = mls_ds[:]
df

Unnamed: 0,TITLE,URL,TRANSCRIPTION,transcription_length,text
0,1.1 Machine Learning Overview | Welcome to mac...,https://www.youtube.com/watch?v=y8JgiWcUnU8,Welcome to machine learning. What is machine l...,415,1.1 Machine Learning Overview | Welcome to mac...
1,1.2 Machine Learning Overview | What is machi...,https://www.youtube.com/watch?v=AISftYVyS50,"So, what is machine learning? In this video yo...",500,1.2 Machine Learning Overview | What is machi...
2,1.2 Machine Learning Overview | What is machi...,https://www.youtube.com/watch?v=AISftYVyS50,to spend a lot of time on in this specializati...,397,1.2 Machine Learning Overview | What is machi...
3,1.3 Machine Learning Overview | Applications ...,https://www.youtube.com/watch?v=hHYcNPfbBXQ,"In this class, you learn about the state of th...",500,1.3 Machine Learning Overview | Applications ...
4,1.3 Machine Learning Overview | Applications ...,https://www.youtube.com/watch?v=hHYcNPfbBXQ,"by McKinsey, AI and machine learning is estima...",201,1.3 Machine Learning Overview | Applications ...
...,...,...,...,...,...
405,10.15 Continuous State Spaces | Algorithm refi...,https://www.youtube.com/watch?v=3FkPgerAhXo,so this inner term becomes 1 over 2 m prime su...,500,10.15 Continuous State Spaces | Algorithm refi...
406,10.15 Continuous State Spaces | Algorithm refi...,https://www.youtube.com/watch?v=3FkPgerAhXo,is used more common than bash gradient descent...,500,10.15 Continuous State Spaces | Algorithm refi...
407,10.15 Continuous State Spaces | Algorithm refi...,https://www.youtube.com/watch?v=3FkPgerAhXo,"times W, in which case you're back to the orig...",266,10.15 Continuous State Spaces | Algorithm refi...
408,10.16 Continuous State Spaces |The state of re...,https://www.youtube.com/watch?v=pdeGAhJ5pbE,Reinforcement learning is an exciting set of t...,464,10.16 Continuous State Spaces |The state of re...


In [29]:
def get_embedding(text, model="text-embedding-ada-002"):
    # Replace newlines with spaces before sending the text to the API
    text = text.replace("\n", " ")
    return openai.Embedding.create(input=[text], model=model)['data'][0]['embedding']

In [None]:
df['ada_embedding'] = df.text.apply(lambda x: get_embedding(x, model='text-embedding-ada-002'))
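
The cell above makes one API request per row. Since `openai.Embedding.create` accepts a list of inputs (as in the two-sentence example earlier), the rows could also be sent in batches. The following is a minimal sketch of the batching logic only: the API call is abstracted behind a hypothetical `embed_batch` callable so the code can run without a key, and `embed_in_batches` and `fake_embed_batch` are illustrative names.

```python
def embed_in_batches(texts, embed_batch, batch_size=32):
    """Call embed_batch on slices of texts and return one embedding per text, in order."""
    embeddings = []
    for start in range(0, len(texts), batch_size):
        # Same newline cleanup as get_embedding, applied to each text in the slice.
        batch = [t.replace("\n", " ") for t in texts[start:start + batch_size]]
        embeddings.extend(embed_batch(batch))
    return embeddings

# Stand-in for the real API call: returns a one-number "embedding" per text.
def fake_embed_batch(batch):
    return [[float(len(text))] for text in batch]

vectors = embed_in_batches(["ab", "c\nd", "efgh", "i", "jk"], fake_embed_batch, batch_size=2)
print(vectors)
```

With the client used in this notebook, `embed_batch` could wrap `openai.Embedding.create(input=batch, model="text-embedding-ada-002")` and collect the `'embedding'` field of each item in `'data'`, assuming the response preserves input order.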

In [None]:
df.head()
df.to_csv('mls_ds.csv', index=False)
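
One caveat with saving to CSV: a column of Python lists is written as text, so after reloading, each `ada_embedding` value is a string and has to be parsed back into numbers. A minimal standard-library sketch with a made-up two-row table (the file is held in memory) illustrates the round trip:

```python
import csv
import io
import json

rows = [
    {"text": "first chunk", "ada_embedding": [0.1, 0.2, 0.3]},
    {"text": "second chunk", "ada_embedding": [0.4, 0.5, 0.6]},
]

# Write to an in-memory CSV; each list is stored as its string representation.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["text", "ada_embedding"])
writer.writeheader()
writer.writerows(rows)

# Read it back: every cell is now a string.
buffer.seek(0)
loaded = list(csv.DictReader(buffer))
print(type(loaded[0]["ada_embedding"]))

# Parse each string back into a list of floats.
for row in loaded:
    row["ada_embedding"] = json.loads(row["ada_embedding"])
print(loaded[0]["ada_embedding"])
```

A binary format such as Parquet, or building the dataset directly with `Dataset.from_pandas(df)` as done earlier in this notebook, would avoid the parsing step.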

In [None]:
df = df.reset_index()

In [None]:
#from tqdm.auto import tqdm

#batch_size = 32

#for i in tqdm(range(0, len(df), batch_size)):
#    i_end = min(i+batch_size, len(df))
#    df_slice = df.iloc[i:i_end]
#    embeds = [row['ada_embedding'] for _, row in df_slice.iterrows()]
#    ids_batch = [str(n) for n in range(i, i_end)]
#    meta_data = [{
#        'title': row['TITLE'],
#        'url': row['URL'],
#        'text': row['text']
#    } for _, row in df_slice.iterrows()]
#    to_upsert = list(zip(ids_batch, embeds, meta_data))
#    index.upsert(vectors=to_upsert)

In [None]:
#query = "What does Francois Chollet think about intelligence?"

#xq = openai.Embedding.create(input=query, engine=embed_model)['data'][0]['embedding']

In [None]:
#res = index.query([xq], top_k=5, include_metadata=True)
#res
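
The commented-out cells above delegate similarity search to Pinecone with a cosine metric. For a dataset of this size (410 chunks), the same kind of search can be sketched in plain Python directly over the stored embeddings. The function names and the toy 3-dimensional vectors below are illustrative assumptions; the ada-002 embeddings in this notebook have 1536 dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, doc_vecs, k=5):
    """Return (index, score) pairs for the k documents most similar to query_vec."""
    scored = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(doc_vecs)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

docs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [1.0, 1.0, 0.0]]
query = [1.0, 0.1, 0.0]
results = top_k(query, docs, k=2)
print(results)
```

The returned indices can be used to look up the matching `TITLE`, `URL`, and `text` rows in the dataframe.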

In [None]:
#df_slice

In [None]:
#texts = [x["text"] for _, x in df_slice.iterrows()]

In [None]:
#texts

# Upload embeddings to a Hugging Face dataset
Use Pinecone for production apps.

In [30]:
from datasets import load_dataset

mls_ds = load_dataset("csv", data_files="/content/mls_ds.csv")

FileNotFoundError: ignored

In [31]:
mls_ds["train"][0]

KeyError: ignored

In [None]:
from huggingface_hub import notebook_login
notebook_login()

Token is valid.
Your token has been saved in your configured git credential helpers (store).
Your token has been saved to /root/.cache/huggingface/token
Login successful


In [None]:
mls_ds.push_to_hub("juancopi81/mls_ada_embeddings")


