<a href="https://colab.research.google.com/github/mille055/Rec_Project/blob/main/notebooks/Get_text_embeddings_and_content_filtering.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Sentence Transformers to embed text columns
In this notebook we embed the text columns of the podcast dataset with document embeddings from a pre-trained [Sentence Transformer](https://www.sbert.net) model. SentenceTransformers is a framework for sentence and text embeddings that works particularly well for shorter text. Introduced in 2019, it uses a Siamese BERT architecture to produce semantically meaningful sentence embeddings that can be compared using cosine similarity. You can use a [pretrained embedding model](https://www.sbert.net/docs/pretrained_models.html) or train your own on a corpus.

**References:**
- [Sentence-BERT paper](https://arxiv.org/abs/1908.10084) by Reimers & Gurevych
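
To make the cosine-similarity comparison mentioned above concrete, here is a minimal, dependency-free sketch. The three short vectors are made-up stand-ins for sentence embeddings, and `cosine` is a hypothetical helper written for this illustration, not part of the Sentence Transformers API.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy "embeddings" (assumed values, for illustration only)
true_crime_a = [0.9, 0.1, 0.0]
true_crime_b = [0.8, 0.2, 0.1]
sports = [0.0, 0.2, 0.9]

sim_close = cosine(true_crime_a, true_crime_b)
sim_far = cosine(true_crime_a, sports)
print(f"similar pair: {sim_close:.3f}, dissimilar pair: {sim_far:.3f}")
```

Vectors that point in nearly the same direction score close to 1, while unrelated ones score close to 0, regardless of vector length.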

In [1]:
import os
import numpy as np
import pandas as pd
import string
import time
import urllib.request
import zipfile
import torch

from sklearn.linear_model import LogisticRegression
!pip install sentence_transformers
from sentence_transformers import SentenceTransformer
!pip install unidecode
import unidecode
import sys
import warnings
warnings.filterwarnings('ignore')

from sklearn.metrics.pairwise import cosine_similarity
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting sentence_transformers
  Downloading sentence-transformers-2.2.2.tar.gz (85 kB)
  Preparing metadata (setup.py) ... done
Collecting transformers<5.0.0,>=4.6.0
  Downloading transformers-4.27.4-py3-none-any.whl (6.8 MB)
Collecting sentencepiece
  Downloading sentencepiece-0.1.97-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)
Collecting huggingface-hub>=0.4.0
  Downloading huggingface_hub-0.13.4-py3-none-any.whl (200 kB)

## Download and prepare data

In [2]:
# Clone the repository
!git clone 'https://github.com/mille055/Rec_Project'

Cloning into 'Rec_Project'...
remote: Enumerating objects: 317, done.
remote: Counting objects: 100% (71/71), done.
remote: Compressing objects: 100% (61/61), done.
remote: Total 317 (delta 42), reused 26 (delta 10), pack-reused 246
Receiving objects: 100% (317/317), 122.20 MiB | 17.47 MiB/s, done.
Resolving deltas: 100% (163/163), done.
Updating files: 100% (27/27), done.


In [3]:
# Unpickle the dataset
podcast_df = pd.read_pickle('/content/Rec_Project/data/podcast_df_040423.pkl')
podcast_df = podcast_df.reset_index(drop=True)
podcast_df

Unnamed: 0,title,producer,genre,description,num_episodes,avg_rating,num_reviews,link,episode_descriptions,itunes_id,rating,user
0,One Strange Thing: Paranormal & True-Weird Mys...,One Strange Thing,History,"Paranormal, unexplainable, and uncanny stories...",105,4.6,499.0,https://podcasts.apple.com/us/podcast/one-stra...,[In celebration of our new premium format—two ...,1526579247,5,RobinFerris
1,One Strange Thing: Paranormal & True-Weird Mys...,One Strange Thing,History,"Paranormal, unexplainable, and uncanny stories...",105,4.6,499.0,https://podcasts.apple.com/us/podcast/one-stra...,[In celebration of our new premium format—two ...,1526579247,1,Pops.99
2,One Strange Thing: Paranormal & True-Weird Mys...,One Strange Thing,History,"Paranormal, unexplainable, and uncanny stories...",105,4.6,499.0,https://podcasts.apple.com/us/podcast/one-stra...,[In celebration of our new premium format—two ...,1526579247,5,ReddEye81
3,One Strange Thing: Paranormal & True-Weird Mys...,One Strange Thing,History,"Paranormal, unexplainable, and uncanny stories...",105,4.6,499.0,https://podcasts.apple.com/us/podcast/one-stra...,[In celebration of our new premium format—two ...,1526579247,2,Keyta7777
4,One Strange Thing: Paranormal & True-Weird Mys...,One Strange Thing,History,"Paranormal, unexplainable, and uncanny stories...",105,4.6,499.0,https://podcasts.apple.com/us/podcast/one-stra...,[In celebration of our new premium format—two ...,1526579247,4,Okkupent
...,...,...,...,...,...,...,...,...,...,...,...,...
46706,Quality Queen Control,Asha Christina,Education,"Sophistication, Psychology, Dating, and Lifest...",111,4.8,470.0,https://podcasts.apple.com/us/podcast/quality-...,[Hey Angels!!! In today's episode of the Quali...,1512702672,5,Monijansand
46707,Quality Queen Control,Asha Christina,Education,"Sophistication, Psychology, Dating, and Lifest...",111,4.8,470.0,https://podcasts.apple.com/us/podcast/quality-...,[Hey Angels!!! In today's episode of the Quali...,1512702672,5,trinityangel13
46708,Quality Queen Control,Asha Christina,Education,"Sophistication, Psychology, Dating, and Lifest...",111,4.8,470.0,https://podcasts.apple.com/us/podcast/quality-...,[Hey Angels!!! In today's episode of the Quali...,1512702672,5,Kweenkeys
46709,Quality Queen Control,Asha Christina,Education,"Sophistication, Psychology, Dating, and Lifest...",111,4.8,470.0,https://podcasts.apple.com/us/podcast/quality-...,[Hey Angels!!! In today's episode of the Quali...,1512702672,5,JoyfulJoyfulWOG


In [4]:
## clean text from the episode_descriptions column

sys.path.append('/content/Rec_Project/scripts')


import clean_dataframe_text
from clean_dataframe_text import join_and_clean_text, clean_text




In [5]:
# Reduce the dataframe before processing: user ratings and user IDs are not needed for embedding
def prepare_df(df):
  # Work on a copy of the dataframe passed in, not the global podcast_df
  df1 = df.copy()
  # get rid of duplicates based on itunes_id
  cols_drop_dup = ['itunes_id']
  df_no_dups = df1.drop_duplicates(subset=cols_drop_dup)
  print('shape of new df without duplicates is ', df_no_dups.shape)
  # remove columns containing the user and user rating
  print('removing user  and rating columns')
  # drop() returns a new frame, which avoids modifying a slice in place
  df_no_dups = df_no_dups.drop(columns=['user', 'rating'])

  return df_no_dups



In [6]:
podcast_cleaned_df = prepare_df(podcast_df)

# clean the podcast dataframe
podcast_cleaned_df.episode_descriptions = podcast_cleaned_df.episode_descriptions.apply(join_and_clean_text)
podcast_cleaned_df.description = podcast_cleaned_df.description.apply(clean_text)


shape of new df without duplicates is  (3936, 12)
removing user  and rating columns


In [7]:
podcast_cleaned_df

Unnamed: 0,title,producer,genre,description,num_episodes,avg_rating,num_reviews,link,episode_descriptions,itunes_id
0,One Strange Thing: Paranormal & True-Weird Mys...,One Strange Thing,History,"paranormal, unexplainable, and uncanny stories...",105,4.6,499.0,https://podcasts.apple.com/us/podcast/one-stra...,in celebration of our new premium format two ...,1526579247
10,BibleProject,BibleProject Podcast,Religion & Spirituality,the creators of bibleproject have in depth con...,352,4.9,15000.0,https://podcasts.apple.com/us/podcast/biblepro...,"david was israel's greatest king, but even he ...",1050832450
20,The Domonique Foxworth Show,ESPN,Sports,with episodes every tuesday and thursday durin...,70,4.9,1100.0,https://podcasts.apple.com/us/podcast/the-domo...,"domonique, charlie, and ashley foxworth along ...",1642566714
30,Hacking Humans,CyberWire Inc.,Technology,"deception, influence, and social engineering i...",415,4.7,255.0,https://podcasts.apple.com/us/podcast/hacking-...,"kathleen smith, cmo from clearedjobs.net sits ...",1391915810
40,Leader Up,AMSC,Government,"leader up, a podcast by the army management st...",52,5.0,14.0,https://podcasts.apple.com/us/podcast/leader-u...,msc's mr. david howey meets with csm jason c. ...,1378682853
...,...,...,...,...,...,...,...,...,...,...
46641,Tales from the Stinky Dragon,Rooster Teeth,Leisure,a d amp;d podcast from rooster teeth! our brav...,101,4.9,781.0,https://podcasts.apple.com/us/podcast/tales-fr...,"with asafee on his deathbed, the four chosen o...",1563814788
46661,Morning Microdose,Almost 30,Education,the fact that you came across morning microdos...,159,5.0,187.0,https://podcasts.apple.com/us/podcast/morning-...,drop in for this mind expanding conversation w...,1639123211
46681,Presidential,Washington Post Audio,History,the washington post's presidential podcast exp...,52,4.4,3500.0,https://podcasts.apple.com/us/podcast/presiden...,"students, teachers and historians reflect on w...",1072170823
46691,Badlands Cola | A Strange Audio Drama,Renee Taylor Klint,Fiction,badlands cola is a cinematic mystery horror au...,17,4.6,63.0,https://podcasts.apple.com/us/podcast/badlands...,"hi listeners! it's renee, and today we're doin...",1627191206


In [8]:
podcast_cleaned_df.description.tolist()[0]

"paranormal, unexplainable, and uncanny stories aren't just in the fiction section. they happen every day, to people just like you. one strange thing brings you family friendly stories from america's newspaper archives. and they all have something in common: an element that can't be explained by logi..."

In [9]:
podcast_cleaned_df.to_pickle('cleaned_df.pkl')

## Create document embeddings
We will load a pre-trained model, [all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2), and use it to create embeddings for the podcast text columns. The model was trained on 1.1 billion sentence pairs to produce high-quality sentence and short-document embeddings in 384 dimensions, which can be used, for example, to calculate similarity between documents.

In [10]:
# Load pre-trained model
senttrans_model = SentenceTransformer('all-MiniLM-L6-v2',device=device)




In [11]:
# Create embeddings for columns description, episode descriptions, genre
def create_embeddings(df, cols):
  df1 = df.copy()
  
  new_col_names = []
  for col in cols:
    print('Now embedding column', col)
    col_data = df1[col].values.tolist()
    col_embeds = [senttrans_model.encode(doc) for doc in col_data]
    new_col_name = col + '_embedding'
    df1[new_col_name] = col_embeds
    new_col_names.append(new_col_name)

  embeddings_df = df1[new_col_names]
  embeddings_df['itunes_id'] = df1['itunes_id']

  return df1, embeddings_df



In [12]:
podcast_with_embeds, embeddings_only = create_embeddings(podcast_cleaned_df, cols= ['description', 'genre','episode_descriptions'])


Now embedding column description
Now embedding column genre
Now embedding column episode_descriptions


In [13]:
podcast_with_embeds.to_pickle('podcast_base_with_embeds.pkl')

In [14]:
embeddings_only.to_pickle('podcast_embeddings_only.pkl')

In [15]:
embeddings_only.columns

Index(['description_embedding', 'genre_embedding',
       'episode_descriptions_embedding', 'itunes_id'],
      dtype='object')

## Cosine similarity of the embeddings

Next, we will use our embeddings to compute the pairwise cosine similarity between podcasts, for several combinations of the embedded columns.
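
The similarity computation in this section concatenates the selected embedding columns before comparing them, and one consequence is worth spelling out. If every block is unit-length (an assumption here; check the norms of the model's output before relying on it), the cosine similarity of the concatenated vectors equals the plain average of the per-block cosine similarities, so each column gets equal weight. A small dependency-free check with made-up 2-D blocks and hypothetical helper names:

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def unit(v):
    """Scale a vector to length 1."""
    n = math.sqrt(dot(v, v))
    return [a / n for a in v]

def cosine(u, v):
    return dot(u, v) / math.sqrt(dot(u, u) * dot(v, v))

# Two items, each with two unit-length feature blocks (toy values)
genre_a, desc_a = unit([1.0, 0.0]), unit([1.0, 1.0])
genre_b, desc_b = unit([0.0, 1.0]), unit([1.0, 2.0])

per_block = [cosine(genre_a, genre_b), cosine(desc_a, desc_b)]
mean_sim = sum(per_block) / len(per_block)

# List addition concatenates the blocks into one longer vector
concat_sim = cosine(genre_a + desc_a, genre_b + desc_b)
print(concat_sim, mean_sim)
```

If one column should count for more than another, scaling its block by a factor `w` before concatenation weights its cosine term by `w**2` in the combined score.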

In [16]:
def create_cosine_similarity(df, feats = ['genre_embedding', 'description_embedding', 'episode_descriptions_embedding']):
  array_list = []
  for feat in feats:
    array_list.append(np.stack(df[feat].values))
  concat_array = np.concatenate((array_list), axis=1)
  print('after concatenate, data size is ', concat_array.shape)
  matrix = cosine_similarity(concat_array)
  
  return matrix



In [17]:
## Calculate cosine similarity matrices for different combinations of features
cs_all = create_cosine_similarity(embeddings_only) # all three (genre, description, episode_descriptions)
cs_genre = create_cosine_similarity(embeddings_only, feats=['genre_embedding'])
cs_desc = create_cosine_similarity(embeddings_only, feats=['description_embedding'])
cs_episo = create_cosine_similarity(embeddings_only, feats=['episode_descriptions_embedding'])
cs_gen_desc = create_cosine_similarity(embeddings_only, feats=['genre_embedding', 'description_embedding'])


after concatenate, data size is  (3936, 1152)
after concatenate, data size is  (3936, 384)
after concatenate, data size is  (3936, 384)
after concatenate, data size is  (3936, 384)
after concatenate, data size is  (3936, 768)


## Generate predicted ratings
We are now ready to generate predicted ratings for each user-item pair. The process for generating each predicted rating is as follows:  
- Filter the similarity matrix to only the podcasts previously rated by the user  
- Find the previously rated podcast that is most similar to the podcast for which we want to generate the predicted rating (nearest neighbor approach)
- Get the user's rating for the most similar previously rated podcast and use that as our prediction
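
The steps above can be sketched without pandas as follows. The function name `predict_from_nearest`, the item IDs, and the similarity values are all hypothetical and chosen only to illustrate the nearest-neighbor idea.

```python
def predict_from_nearest(target, user_ratings, similarity):
    """Return the user's rating of the rated item most similar to `target`.

    user_ratings: dict of item_id -> rating for one user (training data)
    similarity:   dict of (item_id, item_id) -> similarity score
    Returns None when the user has no prior ratings.
    """
    if not user_ratings:
        return None
    # Nearest neighbor: the previously rated item with the highest similarity
    nearest = max(user_ratings, key=lambda item: similarity[(target, item)])
    return user_ratings[nearest]

# Hypothetical toy data: similarity of podcast "p3" to two rated podcasts
similarity = {
    ("p3", "p1"): 0.82,
    ("p3", "p2"): 0.35,
}
history = {"p1": 5, "p2": 2}

print(predict_from_nearest("p3", history, similarity))  # rating of p1
print(predict_from_nearest("p3", {}, similarity))       # no history
```

A user with no ratings in the training split gets no prediction at all, which matters for how the validation score is computed.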

In [29]:
# Split our data into training and validation sets
from sklearn.model_selection import train_test_split
X = podcast_df[['user', 'itunes_id']]
y = podcast_df['rating']
X_train, X_val, y_train, y_val = train_test_split(X,y,random_state=0, test_size=0.2)
print('sizes of X_train and X_val are', X_train.shape, X_val.shape)

sizes of X_train and X_val are (37368, 2) (9343, 2)


In [48]:
embeddings = pd.read_pickle('podcast_embeddings_only.pkl').reset_index(drop=True)
embeddings

Unnamed: 0,description_embedding,genre_embedding,episode_descriptions_embedding,itunes_id
0,"[0.0011661326, -0.02386104, 0.05824091, 0.0531...","[-0.032607477, 0.09884295, -0.022565197, 0.045...","[-0.06146148, -0.019384934, -0.014735525, 0.02...",1526579247
1,"[-0.02261931, -0.050708562, 0.005596548, -0.04...","[0.04873833, 0.06813718, -0.025173135, 0.03297...","[-0.076244555, 0.081214406, 0.0711065, 0.02528...",1050832450
2,"[-0.033529382, -0.06339058, -0.024569249, -0.0...","[0.0012439901, 0.07559641, -0.017228436, -0.02...","[-0.06922878, -0.0964089, -0.013171483, -0.110...",1642566714
3,"[-0.027456347, 0.022695206, -0.024370523, -0.0...","[-0.053375818, 0.08707484, -0.026189232, -0.03...","[-0.12552091, 0.0033190064, 0.015544483, -0.02...",1391915810
4,"[-0.06472258, -0.02005646, -0.017680876, -0.01...","[-0.061360087, 0.04286633, 0.009105529, 0.0259...","[-0.091607764, -0.059887405, -0.061568618, -0....",1378682853
...,...,...,...,...
3931,"[-0.060859535, -0.04368852, -0.001683569, -0.0...","[0.0784131, 0.06623372, 0.044227276, 0.0479960...","[-0.06429337, -0.042031836, 0.0018859123, -0.0...",1563814788
3932,"[0.03157413, -0.00085495564, 0.11128992, 0.034...","[0.030874394, 0.0999365, -0.020643013, 0.07698...","[0.020177335, -0.057557356, 0.015965285, 0.059...",1639123211
3933,"[-0.01610995, -0.013187734, 0.09983183, -0.001...","[-0.032607477, 0.09884295, -0.022565197, 0.045...","[-0.020552242, -0.036761332, 0.00485737, 0.020...",1072170823
3934,"[-0.04678496, -0.04345845, 0.0041277464, 0.032...","[-0.04981938, 0.0027470058, 0.027653951, 0.036...","[-0.0187642, -0.061826486, -0.005049128, -0.02...",1627191206


In [49]:
# First, we'll use the cosine similarity of the genre and description features
sim_matrix = pd.DataFrame(cs_gen_desc, columns=embeddings_only.itunes_id,index=embeddings_only.itunes_id)
sim_matrix.head()


itunes_id,1526579247,1050832450,1642566714,1391915810,1378682853,270338166,1564113746,1236581365,1492492083,1664903728,...,820889267,73329284,1316200737,1443417194,1563105123,1563814788,1639123211,1072170823,1627191206,1512702672
1526579247,1.0,0.213808,0.305327,0.342635,0.273094,0.238568,0.601837,0.215632,0.271587,0.274726,...,0.267439,0.339208,0.223705,0.269134,0.290784,0.205596,0.372644,0.499983,0.281771,0.350177
1050832450,0.213808,1.0,0.346491,0.114829,0.283784,0.348318,0.266916,0.244482,0.284012,0.067016,...,0.327292,0.206715,0.33563,0.32496,0.209102,0.231479,0.193162,0.236213,0.177488,0.30892
1642566714,0.305327,0.346491,1.0,0.188584,0.324291,0.368471,0.297782,0.300338,0.311772,0.162914,...,0.354254,0.331577,0.338124,0.36495,0.633313,0.354219,0.267382,0.336096,0.266566,0.391271
1391915810,0.342635,0.114829,0.188584,1.0,0.252362,0.182837,0.232414,0.22994,0.291919,0.19973,...,0.330243,0.360145,0.256898,0.241604,0.202092,0.202132,0.310288,0.189712,0.259304,0.332494
1378682853,0.273094,0.283784,0.324291,0.252362,1.0,0.343525,0.319082,0.263709,0.329524,0.287968,...,0.420794,0.214495,0.285665,0.317171,0.268542,0.31157,0.230132,0.434224,0.220061,0.409019


In [55]:
def predict_rating(user_item_pair,simtable = sim_matrix,X_train=X_train, y_train=y_train):
    podcast_to_rate = user_item_pair['itunes_id']
    user_to_assess = user_item_pair['user']

    # Filter similarity matrix to only podcasts already reviewed by user
    prior_podcasts = X_train.loc[X_train['user']==user_to_assess, 'itunes_id'].tolist()
    # No training ratings for this user: no prediction is possible
    if not prior_podcasts:
      return None

    simtable_filtered = simtable.loc[podcast_to_rate, prior_podcasts]

    # Get the most similar podcast to current podcast to rate
    most_similar = simtable_filtered.idxmax()

    # Get user's rating for most similar podcast
    idx = X_train.loc[(X_train['user']==user_to_assess) & (X_train['itunes_id']==most_similar)].index.values[0]
    most_similar_rating = y_train.loc[idx]

    return most_similar_rating

In [72]:
# Get the predicted ratings for each podcast in the validation set and calculate the RMSE
ratings_valset = X_val.apply(lambda x: predict_rating(x),axis=1)

# Many predictions are NaN (users with no ratings in the training set), so drop those rows before scoring
df = pd.DataFrame({'ratings_valset': ratings_valset, 'y_val':y_val})
df.dropna(inplace=True)

val_rmse = np.sqrt(mean_squared_error(df.y_val.values,df.ratings_valset.values))
print('RMSE of predicted ratings is {:.3f}'.format(val_rmse))


RMSE of predicted ratings is 1.482
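
Because pairs without a prediction are dropped before scoring, the RMSE above covers only part of the validation set. A minimal sketch of reporting coverage next to RMSE, using made-up predictions and a hypothetical helper `rmse_with_coverage`:

```python
import math

def rmse_with_coverage(predicted, actual):
    """RMSE over pairs that have a prediction, plus the share of pairs scored."""
    scored = [(p, a) for p, a in zip(predicted, actual) if p is not None]
    if not scored:
        return None, 0.0
    mse = sum((p - a) ** 2 for p, a in scored) / len(scored)
    return math.sqrt(mse), len(scored) / len(actual)

# None marks a pair whose user had no ratings in the training split
predicted = [5, None, 4, None, 1]
actual = [4, 5, 4, 3, 3]

rmse, coverage = rmse_with_coverage(predicted, actual)
print(f"RMSE {rmse:.3f} on {coverage:.0%} of validation pairs")
```

Reporting both numbers makes it clear how much of the validation set the score actually describes.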


In [74]:
## now trying with episode description as well (ie, all three feats)
sim_matrix_all = pd.DataFrame(cs_all, columns=embeddings_only.itunes_id,index=embeddings_only.itunes_id)
ratings_valset_all = X_val.apply(lambda x: predict_rating(x, simtable=sim_matrix_all),axis=1)
# Many predictions are NaN (users with no ratings in the training set), so drop those rows before scoring
df = pd.DataFrame({'ratings_valset': ratings_valset_all, 'y_val':y_val})
df.dropna(inplace=True)

val_rmse = np.sqrt(mean_squared_error(df.y_val.values,df.ratings_valset.values))
print('RMSE of predicted ratings is {:.3f}'.format(val_rmse))


RMSE of predicted ratings is 1.500


That did not perform any better (RMSE 1.500 versus 1.482). Next, we try genre only.

In [75]:
## now trying with genre only
sim_matrix_g = pd.DataFrame(cs_genre, columns=embeddings_only.itunes_id,index=embeddings_only.itunes_id)
ratings_valset_genre = X_val.apply(lambda x: predict_rating(x, simtable=sim_matrix_g),axis=1)
# Many predictions are NaN (users with no ratings in the training set), so drop those rows before scoring
df = pd.DataFrame({'ratings_valset': ratings_valset_genre, 'y_val':y_val})
df.dropna(inplace=True)

val_rmse = np.sqrt(mean_squared_error(df.y_val.values,df.ratings_valset.values))
print('RMSE of predicted ratings is {:.3f}'.format(val_rmse))

RMSE of predicted ratings is 1.515


Genre alone did not work well either (RMSE 1.515). We will therefore use the first matrix, built from the genre and podcast description embeddings. Overall, performance is likely limited by how few interactions each user has (the large majority have a single rating).
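
The point about sparse interactions can be checked by counting how many users have exactly one rating. A dependency-free sketch with a made-up user column and a hypothetical helper name:

```python
from collections import Counter

def ratings_per_user_histogram(users):
    """Map 'number of ratings' -> 'number of users with that many ratings'."""
    per_user = Counter(users)
    return Counter(per_user.values())

# Hypothetical user column: one entry per (user, podcast) rating
users = ["ann", "bob", "cat", "ann", "dan", "eve"]

hist = ratings_per_user_histogram(users)
single_share = hist[1] / sum(hist.values())
print(dict(hist), f"{single_share:.0%} of users have a single rating")
```

On the notebook's data, an equivalent pandas expression would be `podcast_df['user'].value_counts().value_counts()`.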

## Getting similar podcasts for a user

In [76]:
def predict_new_pair_rating(user,podcast,simtable=sim_matrix,X_train=X_train, y_train=y_train):
    # Filter similarity matrix to only podcasts already consumed by user
    prior_podcast = X_train.loc[X_train['user']==user, 'itunes_id'].tolist()
    simtable_filtered = simtable.loc[podcast,prior_podcast]
    # Get the most similar already-rated podcast to the current podcast to rate
    most_similar = simtable_filtered.index[np.argmax(simtable_filtered)]
    # Get user's rating for most similar podcast
    idx = X_train.loc[(X_train['user']==user) & (X_train['itunes_id']==most_similar)].index.values[0]
    most_similar_rating = y_train.loc[idx]
    return most_similar_rating

In [149]:
def generate_recommendations(user,simtable,df):
    # Get top rated podcast by user
    user_ratings = df.loc[df['user']==user]
    user_ratings = user_ratings.sort_values(by='rating',axis=0,ascending=False)
    topratedpodcast = user_ratings.iloc[0,:]['itunes_id']
    topratedpodcast_title = df.loc[df['itunes_id']==topratedpodcast,'title'].values[0]
    # Find most similar podcasts to the user's top rated podcast
    sims = simtable.loc[topratedpodcast,:]
    mostsimilar = sims.sort_values(ascending=False).index.values
    # Get 10 most similar podcasts excluding the podcast itself
    mostsimilar = mostsimilar[1:11]
    # Get titles and genres of podcasts from ids
    mostsim_podcasts = []
    for m in mostsimilar:
        mostsim_podcasts.append(df.loc[df['itunes_id']==m,['title', 'genre']].values[0])
        #mostsim_podcast_genres.append(df.loc[df['itunes_id']==m, 'genre'].values[0])
    return topratedpodcast_title, mostsim_podcasts



In [155]:
# Get recommendations for an example user (the user in row 100)
user = podcast_df.iloc[100].user
topratedpodcast, recs = generate_recommendations(user,sim_matrix,podcast_df)
print("User's highest rated podcast was {}".format(topratedpodcast))
for i,rec in enumerate(recs):
  print('Recommendation {} (Title, Genre): {}, {}'.format(i,rec[0], rec[1]))

User's highest rated podcast was Stolen Hearts
Recommendation 0 (Title, Genre): My Favorite Murder with Karen Kilgariff and Georgia Hardstark, True Crime
Recommendation 1 (Title, Genre): Devil in the Dorm, True Crime
Recommendation 2 (Title, Genre): Cold, True Crime
Recommendation 3 (Title, Genre): Black Girl Gone: A True Crime Podcast, True Crime
Recommendation 4 (Title, Genre): Up and Vanished, True Crime
Recommendation 5 (Title, Genre): High Strange, True Crime
Recommendation 6 (Title, Genre): The Girl in the Blue Mustang, True Crime
Recommendation 7 (Title, Genre): FBI Retired Case File Review, True Crime
Recommendation 8 (Title, Genre): My Life of Crime with Erin Moriarty, True Crime
Recommendation 9 (Title, Genre): Deep Cover: Never Seen Again, True Crime


In [158]:
## generates a set of new rows for the podcast df based on the input user dictionary
def generate_new_user(name, podcast_list, ratings, df):
    new_user_data = []

    for itunes_id, rating in zip(podcast_list, ratings):
        #print('matching podcast', itunes_id)
        matching_row = df.loc[df.itunes_id == itunes_id].iloc[0]
        #print(matching_row)

        user_row = {'user': name, 'itunes_id': itunes_id, 'rating': rating}
        for column in df.columns:
            if column not in user_row and column != 'episode_descriptions':
                #print('adding column', column)
                user_row[column] = matching_row[column]

        new_user_data.append(user_row)

    new_user_df = pd.DataFrame(new_user_data)
    return new_user_df

In [167]:
## making a new user to generate predictions

wait_id = '121493804'
superdatascience_id = '1163599059'
thisamerican_id = '1223767856'
collegebball_id = '268800565'
verge_id = '430333725'

userA = pd.DataFrame({'user': 'A', 'itunes_id': [wait_id, superdatascience_id, thisamerican_id, collegebball_id, verge_id], 'rating': [4, 5, 4, 4, 5] })




In [169]:
userA = generate_new_user('A', userA['itunes_id'], userA['rating'], podcast_df)
# DataFrame.append was removed in pandas 2.0; pd.concat gives the same result
podcast_dfnew = pd.concat([podcast_df, userA])
podcast_dfnew.tail()

Unnamed: 0,title,producer,genre,description,num_episodes,avg_rating,num_reviews,link,episode_descriptions,itunes_id,rating,user
0,Wait Wait... Don't Tell Me‪!‬,Wait Wait Don't Tell Me!,Comedy,NPR's weekly news quiz. Have a laugh and test ...,318,4.6,34000.0,https://podcasts.apple.com/us/podcast/wait-wai...,,121493804,4,A
1,Super Data Science,"Jon Krohn and Guests on Machine Learning, A.I....",Technology,"The latest machine learning, A.I., and data ca...",500,4.6,231.0,https://podcasts.apple.com/us/podcast/super-da...,,1163599059,5,A
2,This American President,Parthenon Podcast Network,History,This American President delves into the lives ...,97,4.7,479.0,https://podcasts.apple.com/us/podcast/this-ame...,,1223767856,4,A
3,Eye On College Basketball,"CBS Sports, College Basketball, Basketball, Ma...",Sports,CBS Sports’ official college basketball podcas...,932,4.6,2500.0,https://podcasts.apple.com/us/podcast/eye-on-c...,,268800565,4,A
4,The Vergecast,The Verge,Technology,The Vergecast is the flagship podcast from The...,640,4.4,3300.0,https://podcasts.apple.com/us/podcast/the-verg...,,430333725,5,A


Now we will get recommendations for the new user.

In [170]:
# Get recommendations for the new user
user = 'A'
topratedpodcast, recs = generate_recommendations(user,sim_matrix,podcast_dfnew)
print("User's highest rated podcast was {}".format(topratedpodcast))
for i,rec in enumerate(recs):
  print('Recommendations {} (Title, Genre): {}, {}'.format(i,rec[0], rec[1]))

User's highest rated podcast was Super Data Science
Recommendations 0 (Title, Genre): Adventures in Machine Learning, Technology
Recommendations 1 (Title, Genre): The Data Scientist Show, Technology
Recommendations 2 (Title, Genre): Last Week in AI, Technology
Recommendations 3 (Title, Genre): Data Skeptic, Technology
Recommendations 4 (Title, Genre): The AI in Business Podcast, Technology
Recommendations 5 (Title, Genre): The Gradient Podcast, Technology
Recommendations 6 (Title, Genre): Gradient Dissent, Technology
Recommendations 7 (Title, Genre): Behind The Tech with Kevin Scott, Technology
Recommendations 8 (Title, Genre): Microsoft Research Podcast, Technology
Recommendations 9 (Title, Genre): Where the Internet Lives, Technology
