<a href="https://colab.research.google.com/github/mille055/Rec_Project/blob/main/notebooks/Get_text_embeddings.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Sentence Transformers to embed text columns
In this notebook we will embed the textual columns using document embeddings obtained using a pre-trained [Sentence Transformer](https://www.sbert.net) model.  SentenceTransformers is a framework for sentence / text embeddings which works particularly well for shorter text.  It was developed in 2019 and uses Siamese-BERT to develop semantically meaningful sentence embeddings which can be compared using cosine similarity.  You can use a [pretrained embedding model](https://www.sbert.net/docs/pretrained_models.html) or can train your own on a corpus.

**References:**
- [Sentence-BERT paper](https://arxiv.org/abs/1908.10084) by Reimers & Gurevych

In [1]:
import os
import numpy as np
import pandas as pd
import string
import time
import urllib.request
import zipfile
import torch

from sklearn.linear_model import LogisticRegression
!pip install sentence_transformers
from sentence_transformers import SentenceTransformer
!pip install unidecode
import unidecode
import sys
import warnings
warnings.filterwarnings('ignore')

from sklearn.metrics.pairwise import cosine_similarity


device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting sentence_transformers
  Downloading sentence-transformers-2.2.2.tar.gz (85 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m86.0/86.0 KB[0m [31m3.4 MB/s[0m eta [36m0:00:00[0m
[?25h  Preparing metadata (setup.py) ... [?25l[?25hdone
Collecting transformers<5.0.0,>=4.6.0
  Downloading transformers-4.27.4-py3-none-any.whl (6.8 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m6.8/6.8 MB[0m [31m37.0 MB/s[0m eta [36m0:00:00[0m
Collecting sentencepiece
  Downloading sentencepiece-0.1.97-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m1.3/1.3 MB[0m [31m50.2 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting huggingface-hub>=0.4.0
  Downloading huggingface_hub-0.13.4-py3-none-any.whl (200 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m200.1

## Download and prepare data

In [2]:
# Download the data
!git clone 'https://github.com/mille055/Rec_Project'

Cloning into 'Rec_Project'...
remote: Enumerating objects: 287, done.[K
remote: Counting objects: 100% (41/41), done.[K
remote: Compressing objects: 100% (32/32), done.[K
remote: Total 287 (delta 21), reused 24 (delta 9), pack-reused 246[K
Receiving objects: 100% (287/287), 94.08 MiB | 19.98 MiB/s, done.
Resolving deltas: 100% (142/142), done.
Updating files: 100% (24/24), done.


In [3]:
podcast_df = pd.read_pickle('/content/Rec_Project/data/podcast_df_040423.pkl')
podcast_df

Unnamed: 0,title,producer,genre,description,num_episodes,avg_rating,num_reviews,link,episode_descriptions,itunes_id,rating,user
0,One Strange Thing: Paranormal & True-Weird Mys...,One Strange Thing,History,"Paranormal, unexplainable, and uncanny stories...",105,4.6,499.0,https://podcasts.apple.com/us/podcast/one-stra...,[In celebration of our new premium format—two ...,1526579247,5,RobinFerris
1,One Strange Thing: Paranormal & True-Weird Mys...,One Strange Thing,History,"Paranormal, unexplainable, and uncanny stories...",105,4.6,499.0,https://podcasts.apple.com/us/podcast/one-stra...,[In celebration of our new premium format—two ...,1526579247,1,Pops.99
2,One Strange Thing: Paranormal & True-Weird Mys...,One Strange Thing,History,"Paranormal, unexplainable, and uncanny stories...",105,4.6,499.0,https://podcasts.apple.com/us/podcast/one-stra...,[In celebration of our new premium format—two ...,1526579247,5,ReddEye81
3,One Strange Thing: Paranormal & True-Weird Mys...,One Strange Thing,History,"Paranormal, unexplainable, and uncanny stories...",105,4.6,499.0,https://podcasts.apple.com/us/podcast/one-stra...,[In celebration of our new premium format—two ...,1526579247,2,Keyta7777
4,One Strange Thing: Paranormal & True-Weird Mys...,One Strange Thing,History,"Paranormal, unexplainable, and uncanny stories...",105,4.6,499.0,https://podcasts.apple.com/us/podcast/one-stra...,[In celebration of our new premium format—two ...,1526579247,4,Okkupent
...,...,...,...,...,...,...,...,...,...,...,...,...
657194,Quality Queen Control,Asha Christina,Education,"Sophistication, Psychology, Dating, and Lifest...",111,4.8,470.0,https://podcasts.apple.com/us/podcast/quality-...,[Hey Angels!!! In today's episode of the Quali...,1512702672,5,Monijansand
657195,Quality Queen Control,Asha Christina,Education,"Sophistication, Psychology, Dating, and Lifest...",111,4.8,470.0,https://podcasts.apple.com/us/podcast/quality-...,[Hey Angels!!! In today's episode of the Quali...,1512702672,5,trinityangel13
657196,Quality Queen Control,Asha Christina,Education,"Sophistication, Psychology, Dating, and Lifest...",111,4.8,470.0,https://podcasts.apple.com/us/podcast/quality-...,[Hey Angels!!! In today's episode of the Quali...,1512702672,5,Kweenkeys
657197,Quality Queen Control,Asha Christina,Education,"Sophistication, Psychology, Dating, and Lifest...",111,4.8,470.0,https://podcasts.apple.com/us/podcast/quality-...,[Hey Angels!!! In today's episode of the Quali...,1512702672,5,JoyfulJoyfulWOG


In [4]:
## clean text from the episode_descriptions column

sys.path.append('/content/Rec_Project/scripts')


import clean_dataframe_text
from clean_dataframe_text import join_and_clean_text, clean_text




In [5]:
### reducing the size of the dataframe prior to processing as do not need user ratings and userid for this
def prepare_df(df):
 
  
  df1= podcast_df.copy()
  # get rid of duplicates based on itunes_id
  cols_drop_dup = ['itunes_id']
  df_no_dups = df1.drop_duplicates(subset=cols_drop_dup)
  print('shape of new df without duplicates is ', df_no_dups.shape)
  # remove columns containing the user and user rating
  print('removing user  and rating columns')
  df_no_dups.drop(columns=['user', 'rating'], inplace=True)

  return df_no_dups



In [6]:
podcast_cleaned_df = prepare_df(podcast_df)

# clean the podcast dataframe
podcast_cleaned_df.episode_descriptions = podcast_cleaned_df.episode_descriptions.apply(join_and_clean_text)
podcast_cleaned_df.description = podcast_cleaned_df.description.apply(clean_text)


shape of new df without duplicates is  (3936, 12)
removing user  and rating columns


In [7]:
podcast_cleaned_df

Unnamed: 0,title,producer,genre,description,num_episodes,avg_rating,num_reviews,link,episode_descriptions,itunes_id
0,One Strange Thing: Paranormal & True-Weird Mys...,One Strange Thing,History,"paranormal, unexplainable, and uncanny stories...",105,4.6,499.0,https://podcasts.apple.com/us/podcast/one-stra...,in celebration of our new premium format two ...,1526579247
100,BibleProject,BibleProject Podcast,Religion & Spirituality,the creators of bibleproject have in depth con...,352,4.9,15000.0,https://podcasts.apple.com/us/podcast/biblepro...,"david was israel's greatest king, but even he ...",1050832450
200,The Domonique Foxworth Show,ESPN,Sports,with episodes every tuesday and thursday durin...,70,4.9,1100.0,https://podcasts.apple.com/us/podcast/the-domo...,"domonique, charlie, and ashley foxworth along ...",1642566714
300,Hacking Humans,CyberWire Inc.,Technology,"deception, influence, and social engineering i...",415,4.7,255.0,https://podcasts.apple.com/us/podcast/hacking-...,"kathleen smith, cmo from clearedjobs.net sits ...",1391915810
400,Leader Up,AMSC,Government,"leader up, a podcast by the army management st...",52,5.0,14.0,https://podcasts.apple.com/us/podcast/leader-u...,msc's mr. david howey meets with csm jason c. ...,1378682853
...,...,...,...,...,...,...,...,...,...,...
656189,Tales from the Stinky Dragon,Rooster Teeth,Leisure,a d amp;d podcast from rooster teeth! our brav...,101,4.9,781.0,https://podcasts.apple.com/us/podcast/tales-fr...,"with asafee on his deathbed, the four chosen o...",1563814788
656589,Morning Microdose,Almost 30,Education,the fact that you came across morning microdos...,159,5.0,187.0,https://podcasts.apple.com/us/podcast/morning-...,drop in for this mind expanding conversation w...,1639123211
656989,Presidential,Washington Post Audio,History,the washington post's presidential podcast exp...,52,4.4,3500.0,https://podcasts.apple.com/us/podcast/presiden...,"students, teachers and historians reflect on w...",1072170823
657089,Badlands Cola | A Strange Audio Drama,Renee Taylor Klint,Fiction,badlands cola is a cinematic mystery horror au...,17,4.6,63.0,https://podcasts.apple.com/us/podcast/badlands...,"hi listeners! it's renee, and today we're doin...",1627191206


In [8]:
podcast_cleaned_df.description.tolist()[0]

"paranormal, unexplainable, and uncanny stories aren't just in the fiction section. they happen every day, to people just like you. one strange thing brings you family friendly stories from america's newspaper archives. and they all have something in common: an element that can't be explained by logi..."

In [9]:
podcast_cleaned_df.to_pickle('cleaned_df.pkl')

## Create document embeddings
We will load a pre-trained model [('all-MiniLM-L6-v2')](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) which we will then use to create embeddings for our training and test set text.  The MiniLM-L6-v2 model was trained on 1.1 billion sentence pairs to produce high-quality sentence / short document embeddings in 384 dimensions which can be used for example to calculate similarity between documents.  

In [10]:
# Load pre-trained model
senttrans_model = SentenceTransformer('all-MiniLM-L6-v2',device=device)



Downloading (…)e9125/.gitattributes:   0%|          | 0.00/1.18k [00:00<?, ?B/s]

Downloading (…)_Pooling/config.json:   0%|          | 0.00/190 [00:00<?, ?B/s]

Downloading (…)7e55de9125/README.md:   0%|          | 0.00/10.6k [00:00<?, ?B/s]

Downloading (…)55de9125/config.json:   0%|          | 0.00/612 [00:00<?, ?B/s]

Downloading (…)ce_transformers.json:   0%|          | 0.00/116 [00:00<?, ?B/s]

Downloading (…)125/data_config.json:   0%|          | 0.00/39.3k [00:00<?, ?B/s]

Downloading pytorch_model.bin:   0%|          | 0.00/90.9M [00:00<?, ?B/s]

Downloading (…)nce_bert_config.json:   0%|          | 0.00/53.0 [00:00<?, ?B/s]

Downloading (…)cial_tokens_map.json:   0%|          | 0.00/112 [00:00<?, ?B/s]

Downloading (…)e9125/tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

Downloading (…)okenizer_config.json:   0%|          | 0.00/350 [00:00<?, ?B/s]

Downloading (…)9125/train_script.py:   0%|          | 0.00/13.2k [00:00<?, ?B/s]

Downloading (…)7e55de9125/vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

Downloading (…)5de9125/modules.json:   0%|          | 0.00/349 [00:00<?, ?B/s]

In [11]:
# Create embeddings for columns description, episode descriptions, genre
def embeddings_for_columns(df, cols):
  df1 = df.copy()
  
  new_col_names = []
  for col in cols:
    print('Now embedding column', col)
    col_data = df1[col].values.tolist()
    col_embeds = [senttrans_model.encode(doc) for doc in col_data]
    new_col_name = col + '_embedding'
    df1[new_col_name] = col_embeds
    new_col_names.append(new_col_name)

  embeddings_df = df1[new_col_names]
  embeddings_df['itunes_id'] = df1['itunes_id']

  return df1, embeddings_df



In [12]:
podcast_with_embeds, embeddings_only = embeddings_for_columns(podcast_cleaned_df, cols= ['description', 'genre','episode_descriptions'])


Now embedding column description
Now embedding column genre
Now embedding column episode_descriptions


In [13]:
podcast_with_embeds.episode_descriptions.tolist()[:3]

["in celebration of our new premium format  two premium episodes a month for both apple podcast premium subscribers and patreon subscribers starting april 2023  we are sharing this premium episode with all listeners.  the legend of kentucky's pope lick monster is inexorably tied to the train bridge it supposedly guards  but what is the real danger: the goat sheep cryptid, or our human need to seek out danger?   hosted by laurah norton  written by liv fallon  researched by jessica lee   produced and script edited by maura currie  engineered and scored by chaes gray   pre order laurah's book, lay them to rest:    butcherbox: get free chicken nuggets for a year and 10 percent off your first box when you sign up today. that's a 22 oz bag of gluten free chicken nuggets in every order for a year when you sign up at butcherbox.com ost and use code ost.  join us on patreon for early release and ad free episodes, exclusive stories, and bonus episodes:   find us on twitter:   instagram:    and f

In [14]:
podcast_with_embeds.to_pickle('podcast_base_with_embeds.pkl')

In [15]:
embeddings_only.to_pickle('podcast_embeddings_only.pkl')

In [16]:
embeddings_only.columns

Index(['description_embedding', 'genre_embedding',
       'episode_descriptions_embedding', 'itunes_id'],
      dtype='object')

## Cosine similarity of the embeddings

Finally, we will used our embeddings as features to train a softmax regression model to classify the documents.

In [46]:
embeddings_only.columns

Index(['description_embedding', 'genre_embedding',
       'episode_descriptions_embedding', 'itunes_id'],
      dtype='object')

In [63]:
# combined_features = np.hstack(embeddings_only[['description_embedding', 'genre_embedding',
#        'episode_descriptions_embedding']].values)

In [57]:
def create_cosine_similarity(df, feats = ['genre_embedding', 'description_embedding', 'episode_descriptions_embedding']):
  data = df[feats].values
  print(data.shape)
  # combined_features = np.array([np.concatenate(data, axis=1)])
  # print(combined_features.shape)
  print(data[0,:])

  # Concatenate each row to create a list of vectors
  concatenated_rows = data.reshape(-1)

  # Convert the list of concatenated vectors into a single NumPy array
  concatenated_array = np.array(concatenated_rows)

  print(concatenated_array.shape)


  similarity_matrix = cosine_similarity(concatenated_array)

  return similarity_matrix




In [58]:
maxtrix = create_cosine_similarity(embeddings_only)
matrix

(3936, 3)
[array([-3.26074772e-02,  9.88429487e-02, -2.25651972e-02,  4.51614000e-02,
        -3.72004025e-02,  5.25211096e-02,  3.33803110e-02, -2.78010368e-02,
        -5.08293733e-02,  2.28388347e-02,  1.76503044e-02,  1.51502751e-02,
         2.80257780e-02, -1.53210219e-02, -1.33113623e-01, -1.51657416e-02,
        -6.97596744e-02, -6.23268262e-02, -5.83423674e-02, -9.59453881e-02,
        -4.04164828e-02,  1.35002481e-02,  3.46580893e-02, -1.96127477e-03,
         1.81353949e-02,  1.12053394e-01,  2.13429518e-02, -1.75367878e-03,
         3.27380337e-02, -8.42143968e-02, -6.17187396e-02, -2.50946619e-02,
         5.99764995e-02, -2.84834728e-02, -8.21869075e-02, -7.65414722e-03,
         1.80905126e-02,  1.94747373e-02,  4.81259786e-02, -2.00907458e-02,
        -2.40929071e-02,  2.14603357e-02,  2.55081281e-02, -4.47611101e-02,
        -1.70028210e-02,  8.40701684e-02,  7.53422529e-02, -3.65947522e-02,
        -3.01661361e-02,  9.43538249e-02,  2.87992116e-02,  4.53202352e-02,
  

ValueError: ignored

In [31]:
sample = embeddings_only.iloc[0]
sample

description_embedding             [0.0011661326, -0.02386104, 0.05824091, 0.0531...
genre_embedding                   [-0.032607477, 0.09884295, -0.022565197, 0.045...
episode_descriptions_embedding    [-0.06146148, -0.019384934, -0.014735525, 0.02...
itunes_id                                                                1526579247
Name: 0, dtype: object

In [28]:
feats = ['genre_embeddings', 'description_embeddings', 'episode_descriptions_embeddings']

In [29]:
sample[feats]

KeyError: ignored

In [21]:
matrix = create_cosine_similarity(embeddings_only)

IndexError: ignored

In [65]:
combined_features.shape

(11808,)

## Evaluate model performance

Accuracy on the test set is 0.896
