<a href="https://colab.research.google.com/github/mille055/Rec_Project/blob/main/notebooks/Get_text_embeddings.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Sentence Transformers to embed text columns
In this notebook we embed the text columns as document embeddings using a pre-trained [Sentence Transformer](https://www.sbert.net) model. SentenceTransformers is a framework for sentence and text embeddings that works particularly well for shorter text. It was introduced in 2019 and uses a Siamese BERT network to produce semantically meaningful sentence embeddings that can be compared using cosine similarity. You can use a [pretrained embedding model](https://www.sbert.net/docs/pretrained_models.html) or train your own on a corpus.
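
As a rough illustration of what "compared using cosine similarity" means, the sketch below computes the cosine of the angle between two vectors with NumPy. The three 4-dimensional vectors and their names are made up for the example; they are not outputs of any model.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two 1-D vectors (1.0 means same direction)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy vectors standing in for sentence embeddings (hypothetical values)
emb_cooking = np.array([0.9, 0.1, 0.0, 0.2])
emb_baking = np.array([0.8, 0.2, 0.1, 0.3])
emb_politics = np.array([0.0, 0.9, 0.7, 0.1])

sim_related = cosine_similarity(emb_cooking, emb_baking)
sim_unrelated = cosine_similarity(emb_cooking, emb_politics)
print(sim_related, sim_unrelated)
```

Vectors that point in similar directions score close to 1, while vectors with little overlap score close to 0, regardless of their lengths.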

**References:**
- Read the [Sentence-BERT paper](https://arxiv.org/abs/1908.10084) by Reimers & Gurevych

In [2]:
!pip install sentence_transformers

import os
import string
import time
import urllib.request
import zipfile
import warnings

import numpy as np
import pandas as pd
import torch
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression

warnings.filterwarnings('ignore')

# Use the GPU when one is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting sentence_transformers
  Downloading sentence-transformers-2.2.2.tar.gz (85 kB)
  Preparing metadata (setup.py) ... done
Collecting transformers<5.0.0,>=4.6.0
  Downloading transformers-4.27.4-py3-none-any.whl (6.8 MB)
Collecting sentencepiece
  Downloading sentencepiece-0.1.97-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)
Collecting huggingface-hub>=0.4.0
  Downloading huggingface_hub-0.13.4-py3-none-any.whl (200 kB)

## Download and prepare data

In [3]:
# Download the data
!git clone 'https://github.com/mille055/Rec_Project'

Cloning into 'Rec_Project'...
remote: Enumerating objects: 248, done.
remote: Counting objects: 100% (136/136), done.
remote: Compressing objects: 100% (119/119), done.
remote: Total 248 (delta 63), reused 66 (delta 16), pack-reused 112
Receiving objects: 100% (248/248), 57.35 MiB | 16.92 MiB/s, done.
Resolving deltas: 100% (120/120), done.


In [4]:
podcast_df = pd.read_pickle('/content/Rec_Project/data/podcast_df_040423.pkl')
podcast_df

Unnamed: 0,title,producer,genre,description,num_episodes,avg_rating,num_reviews,link,episode_descriptions,itunes_id,rating,user
0,One Strange Thing: Paranormal & True-Weird Mys...,One Strange Thing,History,"Paranormal, unexplainable, and uncanny stories...",105,4.6,499.0,https://podcasts.apple.com/us/podcast/one-stra...,[In celebration of our new premium format—two ...,1526579247,5,RobinFerris
1,One Strange Thing: Paranormal & True-Weird Mys...,One Strange Thing,History,"Paranormal, unexplainable, and uncanny stories...",105,4.6,499.0,https://podcasts.apple.com/us/podcast/one-stra...,[In celebration of our new premium format—two ...,1526579247,1,Pops.99
2,One Strange Thing: Paranormal & True-Weird Mys...,One Strange Thing,History,"Paranormal, unexplainable, and uncanny stories...",105,4.6,499.0,https://podcasts.apple.com/us/podcast/one-stra...,[In celebration of our new premium format—two ...,1526579247,5,ReddEye81
3,One Strange Thing: Paranormal & True-Weird Mys...,One Strange Thing,History,"Paranormal, unexplainable, and uncanny stories...",105,4.6,499.0,https://podcasts.apple.com/us/podcast/one-stra...,[In celebration of our new premium format—two ...,1526579247,2,Keyta7777
4,One Strange Thing: Paranormal & True-Weird Mys...,One Strange Thing,History,"Paranormal, unexplainable, and uncanny stories...",105,4.6,499.0,https://podcasts.apple.com/us/podcast/one-stra...,[In celebration of our new premium format—two ...,1526579247,4,Okkupent
...,...,...,...,...,...,...,...,...,...,...,...,...
657194,Quality Queen Control,Asha Christina,Education,"Sophistication, Psychology, Dating, and Lifest...",111,4.8,470.0,https://podcasts.apple.com/us/podcast/quality-...,[Hey Angels!!! In today's episode of the Quali...,1512702672,5,Monijansand
657195,Quality Queen Control,Asha Christina,Education,"Sophistication, Psychology, Dating, and Lifest...",111,4.8,470.0,https://podcasts.apple.com/us/podcast/quality-...,[Hey Angels!!! In today's episode of the Quali...,1512702672,5,trinityangel13
657196,Quality Queen Control,Asha Christina,Education,"Sophistication, Psychology, Dating, and Lifest...",111,4.8,470.0,https://podcasts.apple.com/us/podcast/quality-...,[Hey Angels!!! In today's episode of the Quali...,1512702672,5,Kweenkeys
657197,Quality Queen Control,Asha Christina,Education,"Sophistication, Psychology, Dating, and Lifest...",111,4.8,470.0,https://podcasts.apple.com/us/podcast/quality-...,[Hey Angels!!! In today's episode of the Quali...,1512702672,5,JoyfulJoyfulWOG


In [8]:
# Clean text from the episode_descriptions column
!pip install unidecode
import sys

import unidecode

# Make the repo's helper scripts importable
sys.path.append('/content/Rec_Project/scripts')
from clean_dataframe_text import join_and_clean_text

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting unidecode
  Downloading Unidecode-1.3.6-py3-none-any.whl (235 kB)
Installing collected packages: unidecode
Successfully installed unidecode-1.3.6


In [9]:
podcast_cleaned_df = podcast_df.copy()
podcast_cleaned_df.episode_descriptions = podcast_cleaned_df.episode_descriptions.apply(join_and_clean_text)
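
The implementation of `join_and_clean_text` lives in the repo's `clean_dataframe_text` script and is not shown in this notebook. Judging only from the cleaned output further down (episode lists become a single string with punctuation and non-ASCII characters removed), a function of this kind might look roughly like the sketch below. The name `join_and_clean_sketch` and every step in it are assumptions for illustration, and it uses only the standard library rather than `unidecode`.

```python
import string
import unicodedata

def join_and_clean_sketch(episodes):
    """Hypothetical stand-in: join a list of strings and strip it to plain ASCII words."""
    # Join the per-episode descriptions into one document
    text = " ".join(str(e) for e in episodes)
    # Decompose accented characters, then drop anything that is not ASCII
    ascii_text = (
        unicodedata.normalize("NFKD", text)
        .encode("ascii", "ignore")
        .decode("ascii")
    )
    # Remove ASCII punctuation
    no_punct = ascii_text.translate(str.maketrans("", "", string.punctuation))
    # Collapse runs of whitespace
    return " ".join(no_punct.split())

sample = ["Hello, world!", "It's new\u2014two  parts"]
cleaned = join_and_clean_sketch(sample)
print(cleaned)
```

Removing punctuation this aggressively fuses words that were separated only by a dash, which matches artifacts such as "formattwo" in the cleaned column.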

In [19]:
podcast_cleaned_df.to_pickle('cleaned_df.pkl')

## Create document embeddings
We will load the pre-trained model [`all-MiniLM-L6-v2`](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) and use it to create embeddings for the podcast text columns. The model was trained on roughly 1.1 billion sentence pairs to produce sentence and short-document embeddings in 384 dimensions, which can be used, for example, to calculate similarity between documents.

In [11]:
# Load pre-trained model
senttrans_model = SentenceTransformer('all-MiniLM-L6-v2',device=device)




In [46]:
# Create embeddings for the description, genre, and episode_descriptions columns
def embeddings_for_columns(df, cols):
  # The ratings table repeats each podcast once per review, so keep one row
  # per podcast: drop duplicates on the embedding columns plus itunes_id.
  # .copy() avoids pandas' SettingWithCopyWarning when adding columns below.
  cols_drop_dup = cols + ['itunes_id']
  print('columns to reduce: ', cols_drop_dup)
  df_no_dups = df.drop_duplicates(subset=cols_drop_dup).copy()
  print('shape of new df without duplicates is ', df_no_dups.shape)
  new_col_names = []
  for col in cols:
    print('Now embedding column', col)
    col_data = df_no_dups[col].values.tolist()
    # encode() accepts a list and batches internally; list() splits the
    # resulting 2-D array into one vector per row
    col_embeds = list(senttrans_model.encode(col_data))
    new_col_name = col + '_embedding'
    df_no_dups[new_col_name] = col_embeds
    new_col_names.append(new_col_name)

  # Embedding columns plus itunes_id as the join key
  embeddings_df = df_no_dups[new_col_names + ['itunes_id']].copy()

  return df_no_dups, embeddings_df
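
The de-duplication step matters because the source table has one row per review, so the same podcast text would otherwise be embedded once for every user who rated it. A minimal sketch of that step on a made-up five-row table (all values below are hypothetical):

```python
import pandas as pd

# Hypothetical miniature of the ratings table: one row per (podcast, user) review
reviews = pd.DataFrame({
    "itunes_id": [101, 101, 101, 202, 202],
    "description": ["Spooky stories", "Spooky stories", "Spooky stories",
                    "Daily tech news", "Daily tech news"],
    "genre": ["History", "History", "History", "Technology", "Technology"],
    "user": ["a", "b", "c", "d", "e"],
    "rating": [5, 1, 4, 5, 3],
})

text_cols = ["description", "genre"]
# Keep the first row for each unique combination of text columns and itunes_id
unique_podcasts = reviews.drop_duplicates(subset=text_cols + ["itunes_id"]).copy()

print(len(reviews), "review rows ->", len(unique_podcasts), "rows to embed")
print(list(unique_podcasts.index))
```

The surviving rows keep their original index labels, which is why the de-duplicated frame in this notebook has a non-contiguous index.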



In [47]:
podcast_with_embeds, embeddings_only = embeddings_for_columns(podcast_cleaned_df, cols= ['description', 'genre','episode_descriptions'])


columns to reduce:  ['description', 'genre', 'episode_descriptions', 'itunes_id']
shape of new df without duplicates is  (3936, 12)
Now embedding column description
Now embedding column genre
Now embedding column episode_descriptions


In [53]:
podcast_with_embeds.drop(columns=['user', 'rating'], inplace=True)

In [54]:
podcast_with_embeds

Unnamed: 0,title,producer,genre,description,num_episodes,avg_rating,num_reviews,link,episode_descriptions,itunes_id,description_embedding,genre_embedding,episode_descriptions_embedding
0,One Strange Thing: Paranormal & True-Weird Mys...,One Strange Thing,History,"Paranormal, unexplainable, and uncanny stories...",105,4.6,499.0,https://podcasts.apple.com/us/podcast/one-stra...,In celebration of our new premium formattwo pr...,1526579247,"[-0.004768808, -0.021919668, 0.05849519, 0.046...","[-0.032607477, 0.09884295, -0.022565197, 0.045...","[-0.05365885, -0.013311425, -0.01877126, 0.016..."
100,BibleProject,BibleProject Podcast,Religion & Spirituality,The creators of BibleProject have in-depth con...,352,4.9,15000.0,https://podcasts.apple.com/us/podcast/biblepro...,David was Israels greatest king but even he fa...,1050832450,"[-0.026048623, -0.047629233, 0.00784666, -0.04...","[0.04873833, 0.06813718, -0.025173135, 0.03297...","[-0.0540651, 0.07898695, 0.057036802, 0.029304..."
200,The Domonique Foxworth Show,ESPN,Sports,With episodes every Tuesday and Thursday durin...,70,4.9,1100.0,https://podcasts.apple.com/us/podcast/the-domo...,Domonique Charlie and Ashley Foxworth along wi...,1642566714,"[-0.013152028, -0.07279262, -0.044646043, -0.0...","[0.0012439901, 0.07559641, -0.017228436, -0.02...","[-0.08152456, -0.11659737, 0.009398255, -0.108..."
300,Hacking Humans,CyberWire Inc.,Technology,"Deception, influence, and social engineering i...",415,4.7,255.0,https://podcasts.apple.com/us/podcast/hacking-...,Kathleen Smith CMO from ClearedJobsNet sits do...,1391915810,"[-0.027456347, 0.022695206, -0.024370523, -0.0...","[-0.053375818, 0.08707484, -0.026189232, -0.03...","[-0.12779826, -0.029090032, 0.0042918487, -0.0..."
400,Leader Up,AMSC,Government,"Leader Up, a podcast by the Army Management St...",52,5.0,14.0,https://podcasts.apple.com/us/podcast/leader-u...,MSCs Mr David Howey meets with CSM Jason C Por...,1378682853,"[-0.07933211, -0.0022071858, -0.0117073115, 0....","[-0.061360087, 0.04286633, 0.009105529, 0.0259...","[-0.086650096, -0.032117806, -0.069550075, -0...."
...,...,...,...,...,...,...,...,...,...,...,...,...,...
656189,Tales from the Stinky Dragon,Rooster Teeth,Leisure,A D&amp;D podcast from Rooster Teeth! Our brav...,101,4.9,781.0,https://podcasts.apple.com/us/podcast/tales-fr...,With Asafee on his deathbed the four Chosen On...,1563814788,"[-0.058848858, -0.04257265, 0.0013850272, -0.0...","[0.0784131, 0.06623372, 0.044227276, 0.0479960...","[-0.06006845, -0.017107254, 0.0048531545, -0.0..."
656589,Morning Microdose,Almost 30,Education,The fact that you came across Morning Microdos...,159,5.0,187.0,https://podcasts.apple.com/us/podcast/morning-...,Drop in for this mindexpanding conversation wi...,1639123211,"[0.02449756, 0.006267864, 0.1133953, 0.0342372...","[0.030874394, 0.0999365, -0.020643013, 0.07698...","[-0.033951428, -0.072624005, -0.014208139, 0.0..."
656989,Presidential,Washington Post Audio,History,The Washington Post's Presidential podcast exp...,52,4.4,3500.0,https://podcasts.apple.com/us/podcast/presiden...,Students teachers and historians reflect on wh...,1072170823,"[-0.005817534, -0.020287603, 0.078154214, -0.0...","[-0.032607477, 0.09884295, -0.022565197, 0.045...","[-0.029322717, -0.027236685, -0.048946798, 0.0..."
657089,Badlands Cola | A Strange Audio Drama,Renee Taylor Klint,Fiction,Badlands Cola is a cinematic mystery/horror au...,17,4.6,63.0,https://podcasts.apple.com/us/podcast/badlands...,Hi listeners Its Renee and today were doing so...,1627191206,"[-0.05271564, -0.043252505, -0.007970767, 0.02...","[-0.04981938, 0.0027470058, 0.027653951, 0.036...","[-0.041466508, -0.071801305, 0.019250738, -0.0..."


In [55]:
podcast_with_embeds.to_pickle('podcast_base_with_embeds.pkl')

In [50]:
embeddings_only.to_pickle('podcast_embeddings_only.pkl')



## Cosine similarity of the embeddings
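
One way to compare stored embeddings is to stack them into a matrix, normalize each row to unit length, and take the matrix product, which yields all pairwise cosine similarities at once. The sketch below assumes a column of per-podcast vectors like the `*_embedding` columns above; the 3-dimensional vectors, the titles, and the helper `most_similar` are hypothetical stand-ins rather than real model output.

```python
import numpy as np

# Hypothetical stand-ins for a column of per-podcast embeddings
titles = ["ghost stories", "haunted history", "tech news"]
embedding_column = [
    np.array([0.9, 0.1, 0.1]),
    np.array([0.8, 0.3, 0.0]),
    np.array([0.1, 0.2, 0.9]),
]

# Stack the per-row vectors into an (n_items, n_dims) matrix
matrix = np.vstack(embedding_column)
# Scale each row to unit length so that dot products equal cosine similarities
unit = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
similarity = unit @ unit.T

def most_similar(index, k=1):
    """Indices of the k items most similar to the item at `index`."""
    scores = similarity[index].copy()
    scores[index] = -np.inf  # exclude the item itself
    return [int(i) for i in np.argsort(scores)[::-1][:k]]

nearest = most_similar(0)
print(titles[0], "->", titles[nearest[0]])
```

For a few thousand podcasts the full similarity matrix fits comfortably in memory; for much larger catalogs an approximate nearest-neighbor index would likely be a better fit.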

Finally, we will use our embeddings as features to train a softmax regression model to classify the documents.

Accuracy on the training set is 0.900


## Evaluate model performance

Accuracy on the test set is 0.896
