In [1]:
import pandas as pd #importing pandas 
transcripts=pd.read_csv("/home/roronoa/Downloads/ted-data/transcripts.csv") #storing the dataset into transcripts variable
transcripts.head() #displaying the variable

Unnamed: 0,transcript,url
0,Good morning. How are you?(Laughter)It's been ...,https://www.ted.com/talks/ken_robinson_says_sc...
1,"Thank you so much, Chris. And it's truly a gre...",https://www.ted.com/talks/al_gore_on_averting_...
2,"(Music: ""The Sound of Silence,"" Simon & Garfun...",https://www.ted.com/talks/david_pogue_says_sim...
3,If you're here today — and I'm very happy that...,https://www.ted.com/talks/majora_carter_s_tale...
4,"About 10 years ago, I took on the task to teac...",https://www.ted.com/talks/hans_rosling_shows_t...


In [2]:
transcripts['title']=transcripts['url'].map(lambda x:x.split("/")[-1]) 
#creates a new column in transcripts which contains the title of the transcript from the URL
transcripts.head()

Unnamed: 0,transcript,url,title
0,Good morning. How are you?(Laughter)It's been ...,https://www.ted.com/talks/ken_robinson_says_sc...,ken_robinson_says_schools_kill_creativity\n
1,"Thank you so much, Chris. And it's truly a gre...",https://www.ted.com/talks/al_gore_on_averting_...,al_gore_on_averting_climate_crisis\n
2,"(Music: ""The Sound of Silence,"" Simon & Garfun...",https://www.ted.com/talks/david_pogue_says_sim...,david_pogue_says_simplicity_sells\n
3,If you're here today — and I'm very happy that...,https://www.ted.com/talks/majora_carter_s_tale...,majora_carter_s_tale_of_urban_renewal\n
4,"About 10 years ago, I took on the task to teac...",https://www.ted.com/talks/hans_rosling_shows_t...,hans_rosling_shows_the_best_stats_you_ve_ever_...


**Term Frequency-Inverse Document Frequency (Tf-Idf)**

Tf-Idf measures how important a word is to a document, and those importance scores are what let us measure the similarity between transcripts. When should a word count as important?

1. If the word occurs a lot in the document?

2. If the word occurs rarely in the corpus?

3. Both 1 and 2?


A word is important in a document if it occurs a lot in that document but rarely in the other documents in the corpus (the collection of documents). So the answer is both 1 and 2.

Term Frequency measures how often the word appears in a given document, while Inverse Document Frequency measures how rare the word is across the corpus. The product of these two quantities measures the importance of the word and is known as Tf-Idf. With a machine learning framework such as scikit-learn, creating a Tf-Idf matrix representation of text data is fairly straightforward.
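To make the definition concrete, here is a minimal sketch that computes the textbook form of Tf-Idf by hand. The three-document corpus and the helper `tf_idf` are made up for illustration and are not part of the TED dataset. scikit-learn's `TfidfVectorizer` uses a smoothed variant of the inverse document frequency and normalizes each row, so its numbers will differ from these.

```python
import math
from collections import Counter

# Toy corpus, invented for illustration (not taken from the TED dataset).
docs = [
    "climate change is a crisis",
    "schools kill creativity",
    "climate crisis needs new thinking",
]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents does each word occur?
doc_freq = Counter(word for tokens in tokenized for word in set(tokens))

def tf_idf(word, tokens):
    """Textbook tf-idf: (count / document length) * log(N / document frequency)."""
    tf = tokens.count(word) / len(tokens)
    idf = math.log(n_docs / doc_freq[word])
    return tf * idf

# Score every distinct word of the first document, highest score first.
scores = {word: tf_idf(word, tokenized[0]) for word in set(tokenized[0])}
for word, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{word:10s} {score:.3f}")
```

Words that appear only in the first document ("change", "is", "a") score higher than "climate" and "crisis", which also appear in the third document. In a real corpus, stop-word removal would discard words such as "is" and "a" before scoring.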

In [3]:
from sklearn.feature_extraction import text #text module from feature_extraction in sklearn(scikit-learn)
Text=transcripts['transcript'].tolist() #converting the transcripts to a list and loading it into Text
tfidf=text.TfidfVectorizer(stop_words="english") #vectorizer that drops English stop words; the documents go to fit_transform, not to the input parameter
matrix=tfidf.fit_transform(Text) #forming a matrix
print(matrix.shape) #size of the matrix

(2467, 58489)


Now that each transcript is represented as a vector that accounts for the importance of its words, we can tackle the next issue: how do we find out which documents (in our case TED Talk transcripts) are similar to a given document?

To find similar documents, we need to compute a measure of similarity. With Tf-Idf vectors, the usual choice is cosine similarity. Think of cosine similarity as measuring how close one Tf-Idf vector is to another, based on the angle between them. Since each transcript is now represented as a vector, cosine similarity gives us a way to find out how similar the transcript of one TED Talk is to another.
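As a rough illustration of what the measure computes, the sketch below implements cosine similarity for two plain Python lists. The function `cosine` and the three vectors are hypothetical and stand in for rows of the Tf-Idf matrix.

```python
import math

def cosine(u, v):
    """Cosine of the angle between u and v: dot product over the product of norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical tf-idf weights over a three-word vocabulary.
talk_a = [0.8, 0.6, 0.0]
talk_b = [0.4, 0.3, 0.0]  # same direction as talk_a, half the length
talk_c = [0.0, 0.0, 1.0]  # shares no words with talk_a

print(cosine(talk_a, talk_b))  # close to 1: same mix of words
print(cosine(talk_a, talk_c))  # 0.0: no words in common
```

Because the measure depends only on direction, a long talk and a short talk that use the same mix of words still come out as highly similar.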

In [4]:
from sklearn.metrics.pairwise import cosine_similarity #import cosine similarity.
sim_unigram=cosine_similarity(matrix) #applying cosine similarity to the matrix; entry [i, j] is the similarity between transcripts i and j

For each transcript, we need to find the 4 most similar ones, based on cosine similarity. Algorithmically, this amounts to finding, for each row in the cosine matrix constructed above, the indices of the five columns with the highest similarity to the document (transcript in our case) corresponding to that row. The highest of the five is the document itself, so we drop it and keep the other four.
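The slicing step can be sketched on a single made-up similarity row. The values below are hypothetical. The sketch assumes the document's similarity to itself is the strict maximum of the row, which may not hold if the corpus contains duplicate transcripts.

```python
import numpy as np

# Hypothetical similarity row for document 0; the entry for the document
# itself is 1.0, as on the diagonal of a cosine-similarity matrix.
row = np.array([1.0, 0.10, 0.45, 0.30, 0.05, 0.60])

order = row.argsort()      # indices sorted by ascending similarity
top_five = order[-5:]      # five highest scores, the document itself last
neighbours = order[-5:-1]  # drop that last index, keeping four neighbours

print(order)       # [4 1 3 2 5 0]
print(neighbours)  # [1 3 2 5]
```

The four neighbours come back in ascending order of similarity, so the closest match is the last one. Reversing the slice with `neighbours[::-1]` would list the closest match first.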

In [5]:
def get_similar_articles(x):
    return ",".join(transcripts['title'].iloc[x.argsort()[-5:-1]]) # argsort is ascending, so -5:-1 gives the four most similar transcripts and skips the last entry, the transcript itself
transcripts['similar_articles_unigram']=[get_similar_articles(x) for x in sim_unigram] #new column in transcripts

In [6]:
transcripts['title'].str.replace("_"," ").str.upper().str.strip()[1] 
# [1] selects the 2nd transcript in the corpus. Change this to see the titles of other transcripts

'AL GORE ON AVERTING CLIMATE CRISIS'

In [7]:
transcripts['similar_articles_unigram'].str.replace("_"," ").str.upper().str.strip().str.split("\n")[1]
# [1] selects the 2nd transcript in the corpus. Change this to see the recommended list for other transcripts

['RORY BREMNER S ONE MAN WORLD SUMMIT',
 ',ALICE BOWS LARKIN WE RE TOO LATE TO PREVENT CLIMATE CHANGE HERE S HOW WE ADAPT',
 ',TED HALSTEAD A CLIMATE SOLUTION WHERE ALL SIDES CAN WIN',
 ',AL GORE S NEW THINKING ON THE CLIMATE CRISIS']

[Here](https://www.kaggle.com/rounakbanik/ted-talks) is the dataset I used. This walkthrough is based on [this article](https://towardsdatascience.com/how-i-used-text-mining-to-decide-which-ted-talk-to-watch-dfe32e82bffd).