# Building a Podcast Recommendation System

This project follows the Kaggle tutorial https://www.kaggle.com/switkowski/building-a-podcast-recommendation-engine, which is in turn based on Chris Clark's blog post http://blog.untrod.com/2017/02/recommendation-engine-for-trending-products-in-python.md.html

The dataset used in this project is provided by ListenNotes https://www.kaggle.com/listennotes/all-podcast-episodes-published-in-december-2017. It contains both podcast and episode metadata.

## I. Imports first
We import the usual libraries, numpy and pandas, plus two pieces of sklearn: TfidfVectorizer and linear_kernel.

In [1]:
import numpy as np
import pandas as pd 
import os
import re
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

In [3]:
podcasts = pd.read_csv('Dataset/podcasts.csv')
podcasts.head()
podcasts.info()

Unnamed: 0,uuid,title,image,description,language,categories,website,author,itunes_id
0,8d62d3880db2425b890b986e58aca393,"Ecommerce Conversations, by Practical Ecommerce",http://is4.mzstatic.com/image/thumb/Music6/v4/...,Listen in as the Practical Ecommerce editorial...,English,Technology,http://www.practicalecommerce.com,Practical Ecommerce,874457373
1,cbbefd691915468c90f87ab2f00473f9,Eat Sleep Code Podcast,http://is4.mzstatic.com/image/thumb/Music71/v4...,On the show we’ll be talking to passionate peo...,English,Tech News | Technology,http://developer.telerik.com/,Telerik,1015556393
2,73626ad1edb74dbb8112cd159bda86cf,SoundtrackAlley,http://is5.mzstatic.com/image/thumb/Music71/v4...,A podcast about soundtracks and movies from my...,English,Podcasting | Technology,https://soundtrackalley.podbean.com,Randy Andrews,1158188937
3,0f50631ebad24cedb2fee80950f37a1a,The Tech M&A Podcast,http://is1.mzstatic.com/image/thumb/Music71/v4...,The Tech M&A Podcast pulls from the best of th...,English,Business News | Technology | Tech News | Business,http://www.corumgroup.com,Timothy Goddard,538160025
4,69580e7b419045839ca07af06cf0d653,"The Tech Informist - For fans of Apple, Google...",http://is4.mzstatic.com/image/thumb/Music62/v4...,The tech news show with two guys shooting the ...,English,Gadgets | Tech News | Technology,http://techinformist.com,The Tech Informist,916080498


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 121175 entries, 0 to 121174
Data columns (total 9 columns):
uuid           121175 non-null object
title          121173 non-null object
image          121175 non-null object
description    119832 non-null object
language       121175 non-null object
categories     121175 non-null object
website        120005 non-null object
author         118678 non-null object
itunes_id      121175 non-null int64
dtypes: int64(1), object(8)
memory usage: 8.3+ MB


## II. Cleaning the dataset

We focus only on the podcast-level metadata rather than on individual episodes. The dataset contains 121,175 entries. To keep experiments fast, we will reduce it to roughly 15,000 rows.

As a first filter, we can choose one podcast language.

In [6]:
(podcasts
 .language
 .value_counts()
 .to_frame()
 .head(10))

language,count
English,99316
German,4316
French,3874
Spanish,3637
Portuguese,1827
Swedish,1698
Chinese,1329
Japanese,1097
Italian,818
Russian,602


English podcasts are by far the most common in this dataset, so we will work with them.

In [8]:
podcasts = podcasts[podcasts.language == 'English']
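
To put a number on that choice, the counts printed above can be turned into shares. This is a minimal sketch that hard-codes the top four counts from the table and the 121,175 total reported by `podcasts.info()` instead of re-reading the CSV; the variable names are only illustrative.

```python
import pandas as pd

# Counts copied from the value_counts() output above (top 4 languages only).
language_counts = pd.Series(
    {"English": 99316, "German": 4316, "French": 3874, "Spanish": 3637}
)
total_rows = 121175  # RangeIndex size reported by podcasts.info()

# Share of the whole dataset taken by each language.
language_share = language_counts / total_rows
print(language_share.round(3))
```

On the real frame, `podcasts.language.value_counts(normalize=True)` should give the same shares directly, provided it is run before the English-only filter.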

Let's take a look at the missing values.

In [10]:
missing_values_count = podcasts.isnull().sum()
missing_values_count

uuid              0
title             1
image             0
description    1143
language          0
categories        0
website        1093
author         2155
itunes_id         0
dtype: int64

One title is missing and 1,143 rows have no description. These rows aren't going to be very useful, so we can drop them.

In [13]:
podcasts = podcasts.dropna(subset=['description','title'])

In [16]:
missing_values_count = podcasts.isnull().sum()
missing_values_count
podcasts.shape

uuid              0
title             0
image             0
description       0
language          0
categories        0
website        1069
author         2051
itunes_id         0
dtype: int64

(98172, 9)
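
The numbers are consistent: 99,316 English rows minus 1,143 missing descriptions and 1 missing title gives 98,172, which suggests the two groups of incomplete rows do not overlap. The sketch below reproduces the same `dropna(subset=...)` behaviour on a four-row frame whose rows are invented purely for illustration.

```python
import pandas as pd

# Toy frame (invented rows): one missing title, one missing description,
# and missing websites that should NOT cause a row to be dropped.
toy = pd.DataFrame({
    "title": ["Show A", None, "Show C", "Show D"],
    "description": ["About A", "About B", None, "About D"],
    "website": ["http://a.example", None, "http://c.example", None],
})

# Only rows lacking a title or a description are removed.
cleaned = toy.dropna(subset=["description", "title"])
print(cleaned)
print(len(toy) - len(cleaned), "rows dropped")
```

Columns outside `subset`, such as `website` and `author` in the real data, keep their missing values, which matches the second `isnull().sum()` output above.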

Let's check for duplicated entries.

In [17]:
podcasts = podcasts.drop_duplicates('itunes_id')

In [18]:
podcasts.shape

(98172, 9)
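
The shape is unchanged, so no `itunes_id` was repeated. If you would rather count the duplicates before dropping them, something like the following should work; the frame here is a made-up example, not the real data.

```python
import pandas as pd

# Invented rows: id 102 appears twice.
toy = pd.DataFrame({
    "itunes_id": [101, 102, 102, 103],
    "title": ["Show A", "Show B", "Show B (re-listed)", "Show C"],
})

# duplicated() flags every repeat after the first occurrence of an id.
n_duplicates = toy.duplicated(subset="itunes_id").sum()
deduplicated = toy.drop_duplicates(subset="itunes_id")
print(n_duplicates, deduplicated.shape)
```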

Since we're building a recommender system based on podcast descriptions, we want to make sure that the descriptions have enough content in them to serve as useful inputs.

Following the tutorial, it is useful to compute the length (in words) of each podcast description and summarize it.

In [19]:
# Word count per description, computed with vectorized string methods
podcasts['description_length'] = podcasts.description.str.split().str.len()

In [22]:
podcasts.head()
podcasts['description_length'].describe()

Unnamed: 0,uuid,title,image,description,language,categories,website,author,itunes_id,description_length
0,8d62d3880db2425b890b986e58aca393,"Ecommerce Conversations, by Practical Ecommerce",http://is4.mzstatic.com/image/thumb/Music6/v4/...,Listen in as the Practical Ecommerce editorial...,English,Technology,http://www.practicalecommerce.com,Practical Ecommerce,874457373,15
1,cbbefd691915468c90f87ab2f00473f9,Eat Sleep Code Podcast,http://is4.mzstatic.com/image/thumb/Music71/v4...,On the show we’ll be talking to passionate peo...,English,Tech News | Technology,http://developer.telerik.com/,Telerik,1015556393,59
2,73626ad1edb74dbb8112cd159bda86cf,SoundtrackAlley,http://is5.mzstatic.com/image/thumb/Music71/v4...,A podcast about soundtracks and movies from my...,English,Podcasting | Technology,https://soundtrackalley.podbean.com,Randy Andrews,1158188937,11
3,0f50631ebad24cedb2fee80950f37a1a,The Tech M&A Podcast,http://is1.mzstatic.com/image/thumb/Music71/v4...,The Tech M&A Podcast pulls from the best of th...,English,Business News | Technology | Tech News | Business,http://www.corumgroup.com,Timothy Goddard,538160025,59
4,69580e7b419045839ca07af06cf0d653,"The Tech Informist - For fans of Apple, Google...",http://is4.mzstatic.com/image/thumb/Music62/v4...,The tech news show with two guys shooting the ...,English,Gadgets | Tech News | Technology,http://techinformist.com,The Tech Informist,916080498,17


count    98172.000000
mean        39.168918
std        107.099080
min          0.000000
25%         11.000000
50%         26.000000
75%         51.000000
max      30157.000000
Name: description_length, dtype: float64

We observe that at least a quarter of our descriptions have 11 words or fewer. I doubt these will serve as good inputs when we build the recommender system. To be safe, we are only going to keep descriptions that have at least 20 words.

In [81]:
podcasts = podcasts[podcasts.description_length >= 20]
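
Before settling on a cut-off it can help to see how many rows it keeps. The sketch below does this on a handful of invented descriptions; `min_words` is a hypothetical name for the threshold used above.

```python
import pandas as pd

# Invented descriptions of varying length, for illustration only.
toy = pd.DataFrame({"description": [
    "Short blurb",
    "A slightly longer description of a show about technology and startups",
    " ".join(["word"] * 25),
    " ".join(["word"] * 60),
]})
toy["description_length"] = toy.description.str.split().str.len()

min_words = 20  # same cut-off as in the text
kept = toy[toy.description_length >= min_words]
print(f"kept {len(kept)} of {len(toy)} rows ({len(kept) / len(toy):.0%})")
```

Running the same two lines on `podcasts` with a few different values of `min_words` shows how aggressive each threshold would be.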

As mentioned earlier, to keep processing fast we will build the recommender system with 15,000 records sampled from this dataset.

In the end we want to find podcasts similar to our personal favorites, so let's pull those into a separate dataframe and add them back after the sample is created.

To test the recommender we need a list of favorite podcasts, so I will use my own. The list is diverse: it covers music, film and business podcasts. I wasn't sure of the exact titles, so I ran a quick search in the dataset.

In [92]:
podcasts[podcasts.title.str.contains('Blank Check ')]

Unnamed: 0,uuid,title,image,description,language,categories,website,author,itunes_id,description_length
9306,28cb5a9ca7d44b3faa1de9c93e23d806,Blank Check with Griffin & David,http://is4.mzstatic.com/image/thumb/Music128/v...,"Not just another bad movie podcast, Blank Chec...",English,Society & Culture | TV & Film | Comedy,https://audioboom.com/channel/Blank-Check,Audioboom,981330533,55
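
By default `str.contains` is case-sensitive and treats the pattern as a regular expression, so a search can miss a title over capitalization or trip on characters such as `&` or `(`. A more forgiving lookup might look like the sketch below; apart from the two titles taken from the tables above, the rows are invented.

```python
import pandas as pd

toy = pd.DataFrame({"title": [
    "Blank Check with Griffin & David",
    "blank check weekly",      # invented title, lower case on purpose
    "The Tech M&A Podcast",
    None,                      # a missing title would otherwise yield NaN
]})

# Literal, case-insensitive match; missing titles count as "no match".
mask = toy.title.str.contains("blank check", case=False, regex=False, na=False)
print(toy[mask])
```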


In [95]:
# itunes_id is an int64 column, so the ids are written as integers
favorite_podcasts = [928159684, 1052989183, 981330533]
favorites = podcasts[podcasts.itunes_id.isin(favorite_podcasts)]
favorites
# To check a single id: podcasts[podcasts.itunes_id == 981330533].title.values

Unnamed: 0,uuid,title,image,description,language,categories,website,author,itunes_id,description_length
5996,fe5998aec5d046e1b3fb913f9a585935,Rock N Roll Archaeology,http://is5.mzstatic.com/image/thumb/Music111/v...,"Rock And Roll Archaeology is, first and foremo...",English,Performing Arts | Arts | Music | Society & Cul...,http://www.rocknrollarchaeology.com,Diy and HoW Studios,1052989183,98
9306,28cb5a9ca7d44b3faa1de9c93e23d806,Blank Check with Griffin & David,http://is4.mzstatic.com/image/thumb/Music128/v...,"Not just another bad movie podcast, Blank Chec...",English,Society & Culture | TV & Film | Comedy,https://audioboom.com/channel/Blank-Check,Audioboom,981330533,55
79021,fe6864628066420c8103c94e91e72eb3,The GaryVee Audio Experience,http://is2.mzstatic.com/image/thumb/Music127/v...,"Welcome to The Garyvee Audio Experience, hoste...",English,Business | Business News | Management & Marketing,http://www.garyvaynerchuk.com,Alex De Simone,928159684,61
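
One thing to watch with `isin` is the dtype of the ids. `itunes_id` is stored as `int64`, and depending on the pandas version a list of id strings may match nothing. A small sketch of a defensive lookup, using ids and titles from the tables above in a toy frame:

```python
import pandas as pd

toy = pd.DataFrame({
    "itunes_id": [928159684, 1052989183, 981330533, 874457373],
    "title": ["The GaryVee Audio Experience", "Rock N Roll Archaeology",
              "Blank Check with Griffin & David",
              "Ecommerce Conversations, by Practical Ecommerce"],
})

# Ids may arrive as strings (e.g. copied from a URL); cast them to
# integers so they compare equal to the int64 column values.
favorite_ids = ["928159684", "1052989183", "981330533"]
favorite_ids = [int(i) for i in favorite_ids]

favorites = toy[toy.itunes_id.isin(favorite_ids)]
print(favorites.title.tolist())
```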


Let's create the sample

In [96]:
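# Exclude the favorites by id before sampling. Indexing with
# ~podcasts.isin(favorites) compares cell by cell and only masks the
# matching cells with NaN; it does not remove the rows.
podcasts = podcasts[~podcasts.itunes_id.isin(favorites.itunes_id)].sample(15000)
data = pd.concat([podcasts, favorites], sort=True).reset_index(drop=True)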

In [97]:
data.shape

(15003, 10)
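
The "set the favorites aside, sample the rest, add them back" pattern can be checked on a small frame. This is only a sketch: the 100 toy rows, the sample size of 10 and `random_state=42` are arbitrary choices for illustration, with `random_state` added so the draw is repeatable.

```python
import pandas as pd

toy = pd.DataFrame({"itunes_id": range(1, 101),
                    "title": [f"Show {i}" for i in range(1, 101)]})
favorites = toy[toy.itunes_id.isin([3, 50, 99])]

# Exclude favorites by id, then sample; random_state fixes the draw.
rest = toy[~toy.itunes_id.isin(favorites.itunes_id)]
sample = rest.sample(n=10, random_state=42)

# Re-attach the favorites so they are guaranteed to be in the final data.
data = pd.concat([sample, favorites]).reset_index(drop=True)
print(data.shape)
```

Because the favorites are removed before sampling, every id in `data` is unique and the final row count is always the sample size plus the number of favorites, as in the `(15003, 10)` shape above.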

#### IT'S DONE! THE DATASET IS CLEAN

## III. TF-IDF Algorithm