<a href="https://colab.research.google.com/github/mille055/Rec_Project/blob/main/notebooks/Get_text_embeddings.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Sentence Transformers to embed text columns
In this notebook we embed the text columns using document embeddings from a pre-trained [Sentence Transformer](https://www.sbert.net) model. SentenceTransformers is a framework for sentence and text embeddings that works particularly well for shorter text. Introduced in 2019, it uses a Siamese BERT network to produce semantically meaningful sentence embeddings that can be compared using cosine similarity. You can use a [pretrained embedding model](https://www.sbert.net/docs/pretrained_models.html) or train your own on a corpus.

**References:**
- [Sentence-BERT paper](https://arxiv.org/abs/1908.10084) by Reimers & Gurevych
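
For reference, the cosine similarity used to compare two embedding vectors can be written as follows. The symbols $u$ and $v$ are illustrative names for two embeddings of the same dimension, not names used elsewhere in this notebook.

$$
\cos(u, v) = \frac{u \cdot v}{\lVert u \rVert \, \lVert v \rVert} = \frac{\sum_i u_i v_i}{\sqrt{\sum_i u_i^2}\,\sqrt{\sum_i v_i^2}}
$$

As a small worked case, for $u = (1, 0)$ and $v = (1, 1)$ the dot product is $1$ and the norms are $1$ and $\sqrt{2}$, so $\cos(u, v) = 1/\sqrt{2} \approx 0.707$. Values close to 1 indicate vectors pointing in nearly the same direction, and 0 indicates orthogonal vectors.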

In [1]:
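# Install the sentence-transformers package before importing it
!pip install sentence_transformers

import os
import string
import time
import urllib.request
import zipfile
import warnings

import numpy as np
import pandas as pd
import torch
from sklearn.linear_model import LogisticRegression
from sentence_transformers import SentenceTransformer

# Silence library warnings to keep the notebook output readable
warnings.filterwarnings('ignore')

# Use the GPU when one is available, otherwise fall back to the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")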


## Download and prepare data

In [2]:
# Download the data
!git clone 'https://github.com/mille055/Rec_Project'

Cloning into 'Rec_Project'...
remote: Enumerating objects: 263, done.
remote: Counting objects: 100% (17/17), done.
remote: Compressing objects: 100% (14/14), done.
remote: Total 263 (delta 8), reused 7 (delta 3), pack-reused 246
Receiving objects: 100% (263/263), 88.45 MiB | 16.17 MiB/s, done.
Resolving deltas: 100% (129/129), done.
Updating files: 100% (26/26), done.


In [3]:
podcast_df = pd.read_pickle('/content/Rec_Project/data/podcast_df_040423.pkl')
podcast_df

Unnamed: 0,title,producer,genre,description,num_episodes,avg_rating,num_reviews,link,episode_descriptions,itunes_id,rating,user
0,One Strange Thing: Paranormal & True-Weird Mys...,One Strange Thing,History,"Paranormal, unexplainable, and uncanny stories...",105,4.6,499.0,https://podcasts.apple.com/us/podcast/one-stra...,[In celebration of our new premium format—two ...,1526579247,5,RobinFerris
1,One Strange Thing: Paranormal & True-Weird Mys...,One Strange Thing,History,"Paranormal, unexplainable, and uncanny stories...",105,4.6,499.0,https://podcasts.apple.com/us/podcast/one-stra...,[In celebration of our new premium format—two ...,1526579247,1,Pops.99
2,One Strange Thing: Paranormal & True-Weird Mys...,One Strange Thing,History,"Paranormal, unexplainable, and uncanny stories...",105,4.6,499.0,https://podcasts.apple.com/us/podcast/one-stra...,[In celebration of our new premium format—two ...,1526579247,5,ReddEye81
3,One Strange Thing: Paranormal & True-Weird Mys...,One Strange Thing,History,"Paranormal, unexplainable, and uncanny stories...",105,4.6,499.0,https://podcasts.apple.com/us/podcast/one-stra...,[In celebration of our new premium format—two ...,1526579247,2,Keyta7777
4,One Strange Thing: Paranormal & True-Weird Mys...,One Strange Thing,History,"Paranormal, unexplainable, and uncanny stories...",105,4.6,499.0,https://podcasts.apple.com/us/podcast/one-stra...,[In celebration of our new premium format—two ...,1526579247,4,Okkupent
...,...,...,...,...,...,...,...,...,...,...,...,...
657194,Quality Queen Control,Asha Christina,Education,"Sophistication, Psychology, Dating, and Lifest...",111,4.8,470.0,https://podcasts.apple.com/us/podcast/quality-...,[Hey Angels!!! In today's episode of the Quali...,1512702672,5,Monijansand
657195,Quality Queen Control,Asha Christina,Education,"Sophistication, Psychology, Dating, and Lifest...",111,4.8,470.0,https://podcasts.apple.com/us/podcast/quality-...,[Hey Angels!!! In today's episode of the Quali...,1512702672,5,trinityangel13
657196,Quality Queen Control,Asha Christina,Education,"Sophistication, Psychology, Dating, and Lifest...",111,4.8,470.0,https://podcasts.apple.com/us/podcast/quality-...,[Hey Angels!!! In today's episode of the Quali...,1512702672,5,Kweenkeys
657197,Quality Queen Control,Asha Christina,Education,"Sophistication, Psychology, Dating, and Lifest...",111,4.8,470.0,https://podcasts.apple.com/us/podcast/quality-...,[Hey Angels!!! In today's episode of the Quali...,1512702672,5,JoyfulJoyfulWOG


In [19]:
# Clean the text in the description and episode_descriptions columns
!pip install unidecode
import sys
import unidecode

# Make the project's helper scripts importable
sys.path.append('/content/Rec_Project/scripts')
from clean_dataframe_text import join_and_clean_text, clean_text






In [33]:
# Reduce the dataframe before processing: user ratings and user names are not needed for embedding.
# Keep one row per podcast by dropping rows with a duplicate itunes_id.
cols_drop_dup = ['itunes_id']
df_no_dups = podcast_df.drop_duplicates(subset=cols_drop_dup)
print('shape of new df without duplicates is ', df_no_dups.shape)
print('removing user  and rating columns')
# Reassign instead of dropping in place, which avoids modifying a possible view of podcast_df
df_no_dups = df_no_dups.drop(columns=['user', 'rating'])



shape of new df without duplicates is  (3936, 12)
removing user  and rating columns


In [34]:
podcast_cleaned_df = df_no_dups.copy()
podcast_cleaned_df.episode_descriptions = podcast_cleaned_df.episode_descriptions.apply(join_and_clean_text)
podcast_cleaned_df.description = podcast_cleaned_df.description.apply(clean_text)
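
The helpers `clean_text` and `join_and_clean_text` live in the project's `scripts` folder and their source is not shown in this notebook. Judging only from the cleaned output (lowercase, no punctuation, ASCII only), a rough approximation might look like the sketch below. The function names `clean_text_sketch` and `join_and_clean_text_sketch` are hypothetical, the sketch uses the standard library rather than `unidecode`, and it assumes `episode_descriptions` holds a list of strings per podcast.

```python
import string
import unicodedata


def clean_text_sketch(text):
    """Approximate cleaning: ASCII only, lowercase, punctuation removed."""
    # Decompose accented characters, then drop anything that is not ASCII
    ascii_text = (
        unicodedata.normalize("NFKD", text)
        .encode("ascii", "ignore")
        .decode("ascii")
    )
    lowered = ascii_text.lower()
    # Delete every ASCII punctuation character
    return lowered.translate(str.maketrans("", "", string.punctuation))


def join_and_clean_text_sketch(texts):
    """Join a list of episode descriptions into one string, then clean it."""
    return clean_text_sketch(" ".join(texts))


print(clean_text_sketch("Format\u2014two, Kentucky's!"))
print(join_and_clean_text_sketch(["Hello, World!", "Caf\u00e9 #2"]))
```

The real helpers may differ in detail, for example in how they treat whitespace or digits, so this is only meant to show the kind of normalization applied before embedding.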


In [35]:
podcast_cleaned_df

Unnamed: 0,title,producer,genre,description,num_episodes,avg_rating,num_reviews,link,episode_descriptions,itunes_id
0,One Strange Thing: Paranormal & True-Weird Mys...,One Strange Thing,History,paranormal unexplainable and uncanny stories a...,105,4.6,499.0,https://podcasts.apple.com/us/podcast/one-stra...,in celebration of our new premium formattwo pr...,1526579247
100,BibleProject,BibleProject Podcast,Religion & Spirituality,the creators of bibleproject have indepth conv...,352,4.9,15000.0,https://podcasts.apple.com/us/podcast/biblepro...,david was israels greatest king but even he fa...,1050832450
200,The Domonique Foxworth Show,ESPN,Sports,with episodes every tuesday and thursday durin...,70,4.9,1100.0,https://podcasts.apple.com/us/podcast/the-domo...,domonique charlie and ashley foxworth along wi...,1642566714
300,Hacking Humans,CyberWire Inc.,Technology,deception influence and social engineering in ...,415,4.7,255.0,https://podcasts.apple.com/us/podcast/hacking-...,kathleen smith cmo from clearedjobsnet sits do...,1391915810
400,Leader Up,AMSC,Government,leader up a podcast by the army management sta...,52,5.0,14.0,https://podcasts.apple.com/us/podcast/leader-u...,mscs mr david howey meets with csm jason c por...,1378682853
...,...,...,...,...,...,...,...,...,...,...
656189,Tales from the Stinky Dragon,Rooster Teeth,Leisure,a dampd podcast from rooster teeth our brave a...,101,4.9,781.0,https://podcasts.apple.com/us/podcast/tales-fr...,with asafee on his deathbed the four chosen on...,1563814788
656589,Morning Microdose,Almost 30,Education,the fact that you came across morning microdos...,159,5.0,187.0,https://podcasts.apple.com/us/podcast/morning-...,drop in for this mindexpanding conversation wi...,1639123211
656989,Presidential,Washington Post Audio,History,the washington posts presidential podcast expl...,52,4.4,3500.0,https://podcasts.apple.com/us/podcast/presiden...,students teachers and historians reflect on wh...,1072170823
657089,Badlands Cola | A Strange Audio Drama,Renee Taylor Klint,Fiction,badlands cola is a cinematic mysteryhorror aud...,17,4.6,63.0,https://podcasts.apple.com/us/podcast/badlands...,hi listeners its renee and today were doing so...,1627191206


In [36]:
podcast_cleaned_df.to_pickle('cleaned_df.pkl')

## Create document embeddings
We will load the pre-trained [`all-MiniLM-L6-v2`](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) model and use it to create embeddings for the podcast text columns. The model was fine-tuned on over 1 billion sentence pairs and maps sentences and short documents to 384-dimensional embeddings, which can be used, for example, to calculate the similarity between documents.

In [37]:
# Load pre-trained model
senttrans_model = SentenceTransformer('all-MiniLM-L6-v2',device=device)



In [38]:
# Create embeddings for columns description, episode descriptions, genre
def embeddings_for_columns(df, cols):
  # Work on a copy of the dataframe that was passed in (not the global df_no_dups)
  df1 = df.copy()

  new_col_names = []
  for col in cols:
    print('Now embedding column', col)
    col_data = df1[col].values.tolist()
    # Encode the whole column at once; the result has one 384-dim vector per row
    col_embeds = senttrans_model.encode(col_data)
    new_col_name = col + '_embedding'
    df1[new_col_name] = list(col_embeds)
    new_col_names.append(new_col_name)

  # Keep the embedding columns plus itunes_id as the key back to the podcast
  embeddings_df = df1[new_col_names + ['itunes_id']].copy()

  return df1, embeddings_df



In [None]:
podcast_with_embeds, embeddings_only = embeddings_for_columns(podcast_cleaned_df, cols= ['description', 'genre','episode_descriptions'])


Now embedding column description
Now embedding column genre
Now embedding column episode_descriptions


In [18]:
podcast_with_embeds.episode_descriptions.tolist()[:3]

['in celebration of our new premium formattwo premium episodes a month for both apple podcast premium subscribers and patreon subscribers starting april 2023we are sharing this premium episode with all listeners  the legend of kentuckys pope lick monster is inexorably tied to the train bridge it supposedly guardsbut what is the real danger the goatsheep cryptid or our human need to seek out danger   hosted by laurah norton  written by liv fallon  researched by jessica lee   produced and scriptedited by maura currie  engineered and scored by chaes gray   preorder laurahs book lay them to rest   butcherbox get free chicken nuggets for a year and 10 percent off your first box when you sign up today thats a 22 oz bag of glutenfree chicken nuggets in every order for a year when you sign up at butcherboxcomost and use code ost  join us on patreon for early release and adfree episodes exclusive stories and bonus episodes  find us on twitter  instagram   and facebook  interested in advertising

In [55]:
podcast_with_embeds.to_pickle('podcast_base_with_embeds.pkl')

In [50]:
embeddings_only.to_pickle('podcast_embeddings_only.pkl')

In [62]:
embeddings_only.columns

Index(['description_embedding', 'genre_embedding',
       'episode_descriptions_embedding', 'itunes_id'],
      dtype='object')

## Cosine similarity of the embeddings

Finally, we will combine the embeddings of each podcast into a single feature vector so that podcasts can be compared using cosine similarity.
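
As a minimal sketch of how concatenated embeddings could be compared, the example below ranks items by cosine similarity. It uses only the standard library and tiny made-up vectors (2 dimensions per column instead of 384) so that it runs without the model. The names `toy`, `cosine`, and `most_similar` are hypothetical and not part of this notebook.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(query_id, features, top_k=2):
    """Return the top_k (id, score) pairs most similar to query_id."""
    query = features[query_id]
    scores = [
        (other_id, cosine(query, vec))
        for other_id, vec in features.items()
        if other_id != query_id
    ]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:top_k]


# Made-up embeddings for three items (illustrative values only)
toy = {
    "pod_a": {"description": [1.0, 0.0], "genre": [0.0, 1.0]},
    "pod_b": {"description": [0.9, 0.1], "genre": [0.0, 1.0]},
    "pod_c": {"description": [0.0, 1.0], "genre": [1.0, 0.0]},
}

# Concatenate the per-column vectors into one feature vector per item
features = {pid: cols["description"] + cols["genre"] for pid, cols in toy.items()}

ranked = most_similar("pod_a", features)
print(ranked)
```

Here `pod_b` ranks first because its vector points in almost the same direction as `pod_a`, while `pod_c` is orthogonal to it. With the real embeddings stacked into a 2-D array, `sklearn.metrics.pairwise.cosine_similarity` computes the same pairwise scores in one call.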

In [59]:
from sklearn.metrics.pairwise import cosine_similarity



In [63]:
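# Note: .values is a 2-D object array (one row per podcast, 3 vectors per row).
# np.hstack flattens it into a 1-D array holding every vector separately,
# so this does not yet give one combined feature row per podcast.
combined_features = np.hstack(embeddings_only[['description_embedding', 'genre_embedding',
       'episode_descriptions_embedding']].values)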

In [71]:
row = embeddings_only.iloc[0]
# np.concatenate expects a single sequence of arrays, so wrap the two vectors in a list
conc = np.concatenate([row['description_embedding'], row['genre_embedding']])
conc

In [66]:
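# Concatenate the three embedding vectors of each podcast into one feature row
embed_cols = ['description_embedding', 'genre_embedding', 'episode_descriptions_embedding']
combined_features_all = np.array([np.concatenate([row[col] for col in embed_cols])
                                  for _, row in embeddings_only.iterrows()])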

In [65]:
combined_features.shape

(11808,)

## Evaluate model performance

Accuracy on the test set is 0.896
