<a href="https://colab.research.google.com/github/mille055/Rec_Project/blob/main/notebooks/Get_text_embeddings.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Sentence Transformers to embed text columns
In this notebook we embed the text columns of the podcast dataset using document embeddings from a pre-trained [Sentence Transformer](https://www.sbert.net) model. SentenceTransformers is a framework for sentence and text embeddings that works particularly well for shorter text. It was introduced in 2019 and uses a Siamese BERT network to produce semantically meaningful sentence embeddings that can be compared using cosine similarity. You can use a [pretrained embedding model](https://www.sbert.net/docs/pretrained_models.html) or train your own on a corpus.

**References:**
- [Sentence-BERT paper](https://arxiv.org/abs/1908.10084) by Reimers & Gurevych

In [1]:
import os
import numpy as np
import pandas as pd
import string
import time
import urllib.request
import zipfile
import torch

from sklearn.linear_model import LogisticRegression
!pip install sentence_transformers
from sentence_transformers import SentenceTransformer
!pip install unidecode
import unidecode
import sys
import warnings
warnings.filterwarnings('ignore')

from sklearn.metrics.pairwise import cosine_similarity
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting sentence_transformers
  Downloading sentence-transformers-2.2.2.tar.gz (85 kB)
  Preparing metadata (setup.py) ... done
Collecting transformers<5.0.0,>=4.6.0
  Downloading transformers-4.27.4-py3-none-any.whl (6.8 MB)
Collecting sentencepiece
  Downloading sentencepiece-0.1.97-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)
Collecting huggingface-hub>=0.4.0
  Downloading huggingface_hub-0.13.4-py3-none-any.whl (200 kB)

## Download and prepare data

In [2]:
# Clone the repository
!git clone 'https://github.com/mille055/Rec_Project'

Cloning into 'Rec_Project'...
remote: Enumerating objects: 313, done.
remote: Counting objects: 100% (67/67), done.
remote: Compressing objects: 100% (57/57), done.
remote: Total 313 (delta 39), reused 26 (delta 10), pack-reused 246
Receiving objects: 100% (313/313), 122.19 MiB | 12.34 MiB/s, done.
Resolving deltas: 100% (160/160), done.
Updating files: 100% (27/27), done.


In [6]:
# Unpickle the dataset
podcast_df = pd.read_pickle('/content/Rec_Project/data/podcast_df_040423.pkl')
podcast_df = podcast_df.reset_index(drop=True)
podcast_df

Unnamed: 0,title,producer,genre,description,num_episodes,avg_rating,num_reviews,link,episode_descriptions,itunes_id,rating,user
0,One Strange Thing: Paranormal & True-Weird Mys...,One Strange Thing,History,"Paranormal, unexplainable, and uncanny stories...",105,4.6,499.0,https://podcasts.apple.com/us/podcast/one-stra...,[In celebration of our new premium format—two ...,1526579247,5,RobinFerris
1,One Strange Thing: Paranormal & True-Weird Mys...,One Strange Thing,History,"Paranormal, unexplainable, and uncanny stories...",105,4.6,499.0,https://podcasts.apple.com/us/podcast/one-stra...,[In celebration of our new premium format—two ...,1526579247,1,Pops.99
2,One Strange Thing: Paranormal & True-Weird Mys...,One Strange Thing,History,"Paranormal, unexplainable, and uncanny stories...",105,4.6,499.0,https://podcasts.apple.com/us/podcast/one-stra...,[In celebration of our new premium format—two ...,1526579247,5,ReddEye81
3,One Strange Thing: Paranormal & True-Weird Mys...,One Strange Thing,History,"Paranormal, unexplainable, and uncanny stories...",105,4.6,499.0,https://podcasts.apple.com/us/podcast/one-stra...,[In celebration of our new premium format—two ...,1526579247,2,Keyta7777
4,One Strange Thing: Paranormal & True-Weird Mys...,One Strange Thing,History,"Paranormal, unexplainable, and uncanny stories...",105,4.6,499.0,https://podcasts.apple.com/us/podcast/one-stra...,[In celebration of our new premium format—two ...,1526579247,4,Okkupent
...,...,...,...,...,...,...,...,...,...,...,...,...
46706,Quality Queen Control,Asha Christina,Education,"Sophistication, Psychology, Dating, and Lifest...",111,4.8,470.0,https://podcasts.apple.com/us/podcast/quality-...,[Hey Angels!!! In today's episode of the Quali...,1512702672,5,Monijansand
46707,Quality Queen Control,Asha Christina,Education,"Sophistication, Psychology, Dating, and Lifest...",111,4.8,470.0,https://podcasts.apple.com/us/podcast/quality-...,[Hey Angels!!! In today's episode of the Quali...,1512702672,5,trinityangel13
46708,Quality Queen Control,Asha Christina,Education,"Sophistication, Psychology, Dating, and Lifest...",111,4.8,470.0,https://podcasts.apple.com/us/podcast/quality-...,[Hey Angels!!! In today's episode of the Quali...,1512702672,5,Kweenkeys
46709,Quality Queen Control,Asha Christina,Education,"Sophistication, Psychology, Dating, and Lifest...",111,4.8,470.0,https://podcasts.apple.com/us/podcast/quality-...,[Hey Angels!!! In today's episode of the Quali...,1512702672,5,JoyfulJoyfulWOG


In [7]:
## clean text from the episode_descriptions column

sys.path.append('/content/Rec_Project/scripts')


import clean_dataframe_text
from clean_dataframe_text import join_and_clean_text, clean_text




In [8]:
### Reduce the size of the dataframe before processing; user ratings and user ids are not needed for the embeddings
def prepare_df(df):
  # work on a copy of the dataframe that was passed in
  df1 = df.copy()
  # get rid of duplicates based on itunes_id
  cols_drop_dup = ['itunes_id']
  df_no_dups = df1.drop_duplicates(subset=cols_drop_dup)
  print('shape of new df without duplicates is ', df_no_dups.shape)
  # remove columns containing the user and user rating
  print('removing user  and rating columns')
  # drop returns a new dataframe, which avoids modifying a slice in place
  df_no_dups = df_no_dups.drop(columns=['user', 'rating'])

  return df_no_dups



In [9]:
podcast_cleaned_df = prepare_df(podcast_df)

# clean the podcast dataframe
podcast_cleaned_df.episode_descriptions = podcast_cleaned_df.episode_descriptions.apply(join_and_clean_text)
podcast_cleaned_df.description = podcast_cleaned_df.description.apply(clean_text)


shape of new df without duplicates is  (3936, 12)
removing user  and rating columns


In [10]:
podcast_cleaned_df

Unnamed: 0,title,producer,genre,description,num_episodes,avg_rating,num_reviews,link,episode_descriptions,itunes_id
0,One Strange Thing: Paranormal & True-Weird Mys...,One Strange Thing,History,"paranormal, unexplainable, and uncanny stories...",105,4.6,499.0,https://podcasts.apple.com/us/podcast/one-stra...,in celebration of our new premium format two ...,1526579247
10,BibleProject,BibleProject Podcast,Religion & Spirituality,the creators of bibleproject have in depth con...,352,4.9,15000.0,https://podcasts.apple.com/us/podcast/biblepro...,"david was israel's greatest king, but even he ...",1050832450
20,The Domonique Foxworth Show,ESPN,Sports,with episodes every tuesday and thursday durin...,70,4.9,1100.0,https://podcasts.apple.com/us/podcast/the-domo...,"domonique, charlie, and ashley foxworth along ...",1642566714
30,Hacking Humans,CyberWire Inc.,Technology,"deception, influence, and social engineering i...",415,4.7,255.0,https://podcasts.apple.com/us/podcast/hacking-...,"kathleen smith, cmo from clearedjobs.net sits ...",1391915810
40,Leader Up,AMSC,Government,"leader up, a podcast by the army management st...",52,5.0,14.0,https://podcasts.apple.com/us/podcast/leader-u...,msc's mr. david howey meets with csm jason c. ...,1378682853
...,...,...,...,...,...,...,...,...,...,...
46641,Tales from the Stinky Dragon,Rooster Teeth,Leisure,a d amp;d podcast from rooster teeth! our brav...,101,4.9,781.0,https://podcasts.apple.com/us/podcast/tales-fr...,"with asafee on his deathbed, the four chosen o...",1563814788
46661,Morning Microdose,Almost 30,Education,the fact that you came across morning microdos...,159,5.0,187.0,https://podcasts.apple.com/us/podcast/morning-...,drop in for this mind expanding conversation w...,1639123211
46681,Presidential,Washington Post Audio,History,the washington post's presidential podcast exp...,52,4.4,3500.0,https://podcasts.apple.com/us/podcast/presiden...,"students, teachers and historians reflect on w...",1072170823
46691,Badlands Cola | A Strange Audio Drama,Renee Taylor Klint,Fiction,badlands cola is a cinematic mystery horror au...,17,4.6,63.0,https://podcasts.apple.com/us/podcast/badlands...,"hi listeners! it's renee, and today we're doin...",1627191206


In [11]:
podcast_cleaned_df.description.tolist()[0]

"paranormal, unexplainable, and uncanny stories aren't just in the fiction section. they happen every day, to people just like you. one strange thing brings you family friendly stories from america's newspaper archives. and they all have something in common: an element that can't be explained by logi..."

In [12]:
podcast_cleaned_df.to_pickle('cleaned_df.pkl')

## Create document embeddings
We will load a pre-trained model, [all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2), and use it to create embeddings for the podcast text columns (description, genre, and episode descriptions). The all-MiniLM-L6-v2 model was trained on more than 1 billion sentence pairs to produce sentence and short-document embeddings in 384 dimensions, which can be used, for example, to calculate the similarity between documents.
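As a rough illustration of how per-token vectors become a single 384-dimensional vector per document, the sketch below applies mean pooling over the non-padding tokens followed by L2 normalization, which is the pooling shown in the model card's plain-Transformers usage example. The random token vectors and the function name `mean_pool_and_normalize` are hypothetical and only for illustration; in this notebook `SentenceTransformer.encode` does all of this internally.

```python
import numpy as np

def mean_pool_and_normalize(token_embeddings, attention_mask):
    """Average token vectors over real (non-padding) tokens, then L2-normalize."""
    mask = attention_mask[..., None].astype(float)      # (batch, tokens, 1)
    summed = (token_embeddings * mask).sum(axis=1)      # (batch, dim)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)      # (batch, 1)
    pooled = summed / counts
    norms = np.linalg.norm(pooled, axis=1, keepdims=True)
    return pooled / np.clip(norms, 1e-12, None)

# Toy input: 2 "sentences", 4 token slots each, 384-dimensional token vectors
rng = np.random.default_rng(0)
token_embeddings = rng.normal(size=(2, 4, 384))
attention_mask = np.array([[1, 1, 1, 0],   # last slot is padding
                           [1, 1, 0, 0]])  # last two slots are padding

sentence_embeddings = mean_pool_and_normalize(token_embeddings, attention_mask)
print(sentence_embeddings.shape)  # (2, 384)
```

Because padding slots are masked out, the values stored in them do not affect the resulting sentence vector.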

In [13]:
# Load pre-trained model
senttrans_model = SentenceTransformer('all-MiniLM-L6-v2',device=device)




In [14]:
# Create embeddings for columns description, episode descriptions, genre
def create_embeddings(df, cols):
  df1 = df.copy()
  
  new_col_names = []
  for col in cols:
    print('Now embedding column', col)
    col_data = df1[col].values.tolist()
    # encode accepts a list of texts and returns one embedding per text
    col_embeds = list(senttrans_model.encode(col_data))
    new_col_name = col + '_embedding'
    df1[new_col_name] = col_embeds
    new_col_names.append(new_col_name)

  # copy so that adding itunes_id does not write into a slice of df1
  embeddings_df = df1[new_col_names].copy()
  embeddings_df['itunes_id'] = df1['itunes_id']

  return df1, embeddings_df



In [15]:
podcast_with_embeds, embeddings_only = create_embeddings(podcast_cleaned_df, cols= ['description', 'genre','episode_descriptions'])


Now embedding column description
Now embedding column genre
Now embedding column episode_descriptions


In [16]:
podcast_with_embeds.to_pickle('podcast_base_with_embeds.pkl')

In [17]:
embeddings_only.to_pickle('podcast_embeddings_only.pkl')

In [3]:
embeddings_only.columns

NameError: ignored

## Cosine similarity of the embeddings

Next, we use the embeddings to compute the pairwise cosine similarity between podcasts, for individual embedding columns and for concatenations of several columns.
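A small sketch of what concatenating embedding columns before taking the cosine similarity amounts to. It assumes every embedding is unit length, which should hold when the embedding pipeline ends with a normalization step; under that assumption the similarity of the concatenated vectors equals the average of the per-column similarities. The toy arrays and the helpers `unit_rows` and `cosine_sim_matrix` are hypothetical.

```python
import numpy as np

def unit_rows(X):
    """Scale each row of X to unit L2 norm."""
    return X / np.linalg.norm(X, axis=1, keepdims=True)

def cosine_sim_matrix(X):
    """Pairwise cosine similarity between the rows of X."""
    X_unit = unit_rows(X)
    return X_unit @ X_unit.T

# Toy stand-ins for two embedding columns (5 items, 8 dimensions each)
rng = np.random.default_rng(0)
genre_emb = unit_rows(rng.normal(size=(5, 8)))
desc_emb = unit_rows(rng.normal(size=(5, 8)))

# Similarity of the concatenated features, as in a combined feature matrix
combined = cosine_sim_matrix(np.concatenate([genre_emb, desc_emb], axis=1))

# Average of the two single-column similarity matrices
averaged = (cosine_sim_matrix(genre_emb) + cosine_sim_matrix(desc_emb)) / 2

print(np.allclose(combined, averaged))
```

One consequence is that each concatenated column gets equal weight in the combined similarity; weighting one column more heavily would require scaling it before concatenation.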

In [1]:
def create_cosine_similarity(df, feats = ['genre_embedding', 'description_embedding', 'episode_descriptions_embedding']):
  array_list = []
  for feat in feats:
    array_list.append(np.stack(df[feat].values))
  concat_array = np.concatenate((array_list), axis=1)
  print('after concatenate, data size is ', concat_array.shape)
  matrix = cosine_similarity(concat_array)
  
  return matrix



In [2]:
## Calculate cosine similarity matrices for different combinations of features
cs_all = create_cosine_similarity(embeddings_only) # all three (genre, description, episode_descriptions)
cs_genre = create_cosine_similarity(embeddings_only, feats=['genre_embedding'])
cs_desc = create_cosine_similarity(embeddings_only, feats=['description_embedding'])
cs_episo = create_cosine_similarity(embeddings_only, feats=['episode_descriptions_embedding'])
cs_gen_desc = create_cosine_similarity(embeddings_only, feats=['genre_embedding', 'description_embedding'])


NameError: ignored

## Generate predicted ratings
We are now ready to generate predicted ratings for each user-item pair.  The process used to generate each predicted rating is as follows:  
- Filter the similarity matrix to only the podcasts previously rated by the user  
- Find the previously rated podcast that is most similar to the podcast for which we want to generate the predicted rating (nearest neighbor approach)
- Get the user's rating for the most similar previously rated podcast and use that as our prediction
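The three steps above can be sketched on toy data as follows. The similarity table, the podcast ids (`p1` to `p4`), the rating history, and the function name `nearest_neighbor_rating` are all hypothetical and serve only to show the mechanics.

```python
import pandas as pd

# Hypothetical item-item similarity table (rows and columns are podcast ids)
ids = ['p1', 'p2', 'p3', 'p4']
sim = pd.DataFrame(
    [[1.0, 0.8, 0.1, 0.3],
     [0.8, 1.0, 0.2, 0.4],
     [0.1, 0.2, 1.0, 0.9],
     [0.3, 0.4, 0.9, 1.0]],
    index=ids, columns=ids)

# Hypothetical rating history for one user: podcast id -> rating
history = pd.Series({'p1': 5, 'p3': 2})

def nearest_neighbor_rating(target, history, sim):
    """Return the user's rating of the rated podcast most similar to `target`."""
    if history.empty:
        return None  # no history, so there is no neighbor to borrow from
    candidates = sim.loc[target, history.index]  # step 1: filter to rated podcasts
    most_similar = candidates.idxmax()           # step 2: nearest neighbor
    return history[most_similar]                 # step 3: reuse that rating

print(nearest_neighbor_rating('p2', history, sim))  # 5 (p1 is closest to p2)
print(nearest_neighbor_rating('p4', history, sim))  # 2 (p3 is closest to p4)
```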

In [None]:
# Split our data into training and validation sets
from sklearn.model_selection import train_test_split
X = podcast_df[['user', 'itunes_id']]
y = podcast_df['rating']
X_train, X_val, y_train, y_val = train_test_split(X,y,random_state=0, test_size=0.2)

In [None]:
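# First, we'll use the cosine similarity of the genre and description features
# The similarity matrix has one row per unique podcast, so label it with the
# de-duplicated ids in embeddings_only rather than the full podcast_df
sim_matrix = pd.DataFrame(cs_gen_desc, columns=embeddings_only.itunes_id, index=embeddings_only.itunes_id)
sim_matrix.head()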

In [None]:
podcast_df.user.value_counts()
podcast_df[podcast_df.user=='obacker19']

Unnamed: 0,title,producer,genre,description,num_episodes,avg_rating,num_reviews,link,episode_descriptions,itunes_id,rating,user
27785,The Divorce Survival Guide Podcast,"Kate Anthony, CPCC",Education,On the Divorce Survival Guide Podcast we have ...,100,4.4,398.0,https://podcasts.apple.com/us/podcast/the-divo...,[Rebecca Zung returns to the show to share her...,1345075933,5,obacker19
81294,Simple Stories in Spanish,Small Town Spanish Teacher,Education,Simple Stories in Spanish is a weekly producti...,130,4.4,566.0,https://podcasts.apple.com/us/podcast/simple-s...,[This new season of fresh stories all about pe...,1497441201,5,obacker19
105654,The Science of Success,Matt Bodnar,Education,The #1 Evidence Based Growth Podcast on the In...,340,4.7,981.0,https://podcasts.apple.com/us/podcast/the-scie...,[In this episode we discuss how our guest help...,1059509178,5,obacker19
118104,Invest Like the Best with Patrick O'Shaughnessy,Colossus,Business,Conversations with the best investors and busi...,399,4.7,2100.0,https://podcasts.apple.com/us/podcast/invest-l...,"[Hello everyone. A few days ago, we discussed ...",1154105909,5,obacker19
121861,School of Self-Image,Tonya Leigh,Education,School of Self-Image is the go-to podcast for ...,345,4.8,938.0,https://podcasts.apple.com/us/podcast/school-o...,"[""Fostering a Positive Public Image Through De...",1071406906,5,obacker19
127313,Angela Watson's Truth for Teachers,Angela Watson,Education,"Truth for Teachers is designed to speak life, ...",291,4.8,1100.0,https://podcasts.apple.com/us/podcast/angela-w...,"[As a child, I didn’t think I was a “math and ...",954139712,5,obacker19
130445,Risky Business,Patrick Gray,Technology,Risky Business is a weekly information securit...,20,4.7,327.0,https://podcasts.apple.com/us/podcast/risky-bu...,[NOTE: Patrick’s audio is a bit degraded in a ...,216478078,5,obacker19
159150,All It Takes Is A Goal,Jon Acuff,Education,The future belongs to finishers. Join New York...,120,4.9,1200.0,https://podcasts.apple.com/us/podcast/all-it-t...,[Have you ever wondered what it takes to achie...,1547078080,5,obacker19
207922,The Wellness Cafe,Trinity Tondeleir,Education,Welcome to The Wellness Cafe Podcast. Your go ...,96,4.5,755.0,https://podcasts.apple.com/us/podcast/the-well...,[Listen to this episode if you get stressed at...,1571285630,5,obacker19
225275,Let’s Get Vulnerable: Relationship and Dating ...,Dr. Morgan Anderson,Education,Are you ready to take the mystery out of havin...,300,4.7,798.0,https://podcasts.apple.com/us/podcast/lets-get...,[Drumroll please….\n \n Inside of this specia...,1496034764,5,obacker19


In [None]:
X_val

Unnamed: 0,user,itunes_id
649285,reddbul,1539709262
630516,Pindsil,1651669475
248172,aneatsweetir,1671664183
634109,Dream1066,1091448948
80995,Romantic Tulip,1618327687
...,...,...
185959,Cpassb,1332572879
521455,jessmbg,1140333666
579452,newmanlauren,1510056899
116448,Almost Somebody,1021340531


In [None]:
def predict_rating(user_item_pair, simtable=sim_matrix, X_train=X_train, y_train=y_train):
    podcast_to_rate = user_item_pair['itunes_id']
    user_to_assess = user_item_pair['user']
    # Filter similarity matrix to only podcasts already reviewed by user
    user_rows = X_train[X_train['user'] == user_to_assess]
    prior_podcasts = user_rows['itunes_id'].tolist()
    if not prior_podcasts:
        # No rating history in the training split, so there is no neighbor to use
        return None
    simtable_filtered = simtable.loc[podcast_to_rate, prior_podcasts]

    # Get the most similar podcast to current podcast to rate
    most_similar = simtable_filtered.idxmax()
    # Get user's rating for most similar podcast
    idx = user_rows.index[user_rows['itunes_id'] == most_similar][0]
    most_similar_rating = y_train.loc[idx]
    return most_similar_rating

In [None]:
# Get the predicted ratings for each podcast in the validation set and calculate the RMSE
ratings_valset = X_val.apply(lambda x: predict_rating(x),axis=1)
val_rmse = np.sqrt(mean_squared_error(y_val,ratings_valset))
print('RMSE of predicted ratings is {:.3f}'.format(val_rmse))

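This didn't work, likely because there are too few interactions per user: for a user in the validation set with no rated podcasts in the training set, there is no nearest neighbor to draw on and `predict_rating` returns `None`.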
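One possible way to still obtain a score, which is not something this notebook does, is to replace missing predictions with a fallback such as the mean training rating and to report how many predictions needed it. The sketch below uses made-up ratings; `rmse_with_fallback`, `y_val_toy`, `preds_toy`, and `train_mean` are hypothetical names.

```python
import numpy as np

def rmse_with_fallback(y_true, y_pred, fallback):
    """RMSE where missing predictions (None or NaN) are replaced by `fallback`.

    Also returns the fraction of predictions that were missing.
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.array([np.nan if p is None else p for p in y_pred], dtype=float)
    missing = np.isnan(y_pred)
    y_pred = np.where(missing, fallback, y_pred)
    rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))
    return rmse, missing.mean()

# Made-up validation ratings and predictions; None marks a user with no history
y_val_toy = [5, 4, 1, 5]
preds_toy = [5, None, None, 3]
train_mean = 4.0  # assumed mean rating in the training split

rmse, frac_missing = rmse_with_fallback(y_val_toy, preds_toy, train_mean)
print('RMSE with fallback is {:.3f}'.format(rmse))
print('Fraction of missing predictions is {:.2f}'.format(frac_missing))
```

If the fraction of missing predictions is large, the RMSE mostly reflects the fallback value rather than the nearest neighbor approach.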

In [None]:
## getting podcasts for a user
podcast_df[podcast_df.title.str.contains('Verge')]


Unnamed: 0,title,producer,genre,description,num_episodes,avg_rating,num_reviews,link,episode_descriptions,itunes_id,rating,user
363277,The Vergecast,The Verge,Technology,The Vergecast is the flagship podcast from The...,640,4.4,3300.0,https://podcasts.apple.com/us/podcast/the-verg...,"[The Verge's Nilay Patel, Alex Cranz, and Davi...",430333725,3,Mattdockside
363278,The Vergecast,The Verge,Technology,The Vergecast is the flagship podcast from The...,640,4.4,3300.0,https://podcasts.apple.com/us/podcast/the-verg...,"[The Verge's Nilay Patel, Alex Cranz, and Davi...",430333725,1,Outsidah
363279,The Vergecast,The Verge,Technology,The Vergecast is the flagship podcast from The...,640,4.4,3300.0,https://podcasts.apple.com/us/podcast/the-verg...,"[The Verge's Nilay Patel, Alex Cranz, and Davi...",430333725,2,Neverstopsmiling54
363280,The Vergecast,The Verge,Technology,The Vergecast is the flagship podcast from The...,640,4.4,3300.0,https://podcasts.apple.com/us/podcast/the-verg...,"[The Verge's Nilay Patel, Alex Cranz, and Davi...",430333725,3,conservative buyer
363281,The Vergecast,The Verge,Technology,The Vergecast is the flagship podcast from The...,640,4.4,3300.0,https://podcasts.apple.com/us/podcast/the-verg...,"[The Verge's Nilay Patel, Alex Cranz, and Davi...",430333725,3,ajesha
363282,The Vergecast,The Verge,Technology,The Vergecast is the flagship podcast from The...,640,4.4,3300.0,https://podcasts.apple.com/us/podcast/the-verg...,"[The Verge's Nilay Patel, Alex Cranz, and Davi...",430333725,2,Byrd Nick
363283,The Vergecast,The Verge,Technology,The Vergecast is the flagship podcast from The...,640,4.4,3300.0,https://podcasts.apple.com/us/podcast/the-verg...,"[The Verge's Nilay Patel, Alex Cranz, and Davi...",430333725,2,Matt_C29
363284,The Vergecast,The Verge,Technology,The Vergecast is the flagship podcast from The...,640,4.4,3300.0,https://podcasts.apple.com/us/podcast/the-verg...,"[The Verge's Nilay Patel, Alex Cranz, and Davi...",430333725,2,cdobbs7
363285,The Vergecast,The Verge,Technology,The Vergecast is the flagship podcast from The...,640,4.4,3300.0,https://podcasts.apple.com/us/podcast/the-verg...,"[The Verge's Nilay Patel, Alex Cranz, and Davi...",430333725,1,cgclack
363286,The Vergecast,The Verge,Technology,The Vergecast is the flagship podcast from The...,640,4.4,3300.0,https://podcasts.apple.com/us/podcast/the-verg...,"[The Verge's Nilay Patel, Alex Cranz, and Davi...",430333725,1,FTW28


In [None]:
wait_id = '121493804'
superdatascience_id = '1163599059'
thisamerican_id = '1223767856'
collegebball_id = '268800565'
verge_id = '430333725'




In [None]:
userA = {'user': 'A', 'itunes_id': ['121493804', '1163599059', '1223767856', '268800565', '430333725'] }
userAdf = pd.DataFrame(userA)
userAdf

Unnamed: 0,user,itunes_id
0,A,121493804
1,A,1163599059
2,A,1223767856
3,A,268800565
4,A,430333725


## Evaluate model performance

Accuracy on the test set is 0.896
