In [40]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer 
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.metrics import pairwise_distances

We'll do topic modelling with our tuned vectorizer and the tuned LDA parameters we found earlier, then go on to build our recommendation system.

## Build the LDA Topic Modeler first

In [4]:
ted = pd.read_csv("ted_clean.csv")

In [5]:
script = ted.transcript

In [21]:
%store -r data
stop_words = data

In [7]:
tf = CountVectorizer(stop_words = stop_words, token_pattern = r'\b[a-zA-Z]{3,}\b',
                     max_df = 0.25, min_df = 250)

In [8]:
doc_term_matrix = tf.fit_transform(script)

Understanding the matrices:
- doc_term_matrix is our X
- lda.transform(X), the document-to-topic matrix, is our V
- lda.components_, the topic-to-term matrix, is our U

Since **X ≈ V x U**, their shapes are going to be: (4005, 1134) = (4005, 15) x (15, 1134)
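To make the shapes concrete, here is a minimal sketch on a made-up four-document corpus. The `toy_*` names and sentences are assumptions for illustration, not part of the TED data. Note that scikit-learn's `components_` holds unnormalized topic-word weights, so the rows are normalized before multiplying; the product then approximates each document's word distribution rather than its raw counts.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Hypothetical mini-corpus, only used to illustrate the matrix shapes.
toy_docs = [
    "solar energy powers the planet",
    "the brain stores memory during sleep",
    "students learn from teachers in schools",
    "energy from the sun reaches the planet",
]

toy_X = CountVectorizer().fit_transform(toy_docs)        # X: documents x terms
toy_lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(toy_X)
toy_V = toy_lda.transform(toy_X)                         # V: documents x topics
toy_U = toy_lda.components_                              # U: topics x terms

# Normalize each topic row so it is a probability distribution over terms.
toy_U_norm = toy_U / toy_U.sum(axis=1, keepdims=True)

# Each row of the product is the model's word distribution for one document.
toy_approx = toy_V @ toy_U_norm

print(toy_X.shape, "=", toy_V.shape, "x", toy_U.shape)
```

The inner dimension (the number of topics) cancels out, which is why the product has the same shape as the document-term matrix.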

In [9]:
doc_term_matrix.shape

(4005, 1134)

In [10]:
vocab = tf.get_feature_names_out()  # get_feature_names() was removed in scikit-learn 1.2

In [11]:
def show_topics(num_words=10):
    """Print and return the top num_words words for each topic of the fitted model tm."""
    topic_words = []
    for idx, topic in enumerate(tm.components_):
        # Select the num_words highest-weighted words in vocab for this topic.
        top_words = [vocab[i] for i in topic.argsort()[:-num_words-1:-1]]
        print(f"Topic {idx}:\n", ", ".join(top_words), "\n")
        topic_words.append(top_words)
    return topic_words
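The slice `argsort()[:-num_words-1:-1]` is easy to misread, so here is a small sketch of what it does on a made-up weight vector (the `weights` values are an assumption for illustration). `argsort()` returns indices in ascending order of weight, and the negative-step slice walks backwards from the end, which yields the indices of the largest values first.

```python
import numpy as np

# Hypothetical topic-word weights for a five-word vocabulary.
weights = np.array([0.1, 0.7, 0.3, 0.9, 0.5])
n = 3

ascending = weights.argsort()      # indices from smallest to largest weight
top_idx = ascending[:-n - 1:-1]    # last n indices, reversed: largest weight first

print(top_idx)  # [3 1 4]
```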

In [12]:
tm = LatentDirichletAllocation(n_components = 15, random_state = 0,
                               doc_topic_prior = 5, topic_word_prior = 2)
tm.fit(doc_term_matrix)
topic_words = show_topics()

Topic 0:
 city, food, countries, china, cities, dollars, economic, global, business, growth 

Topic 1:
 data, car, online, internet, cars, phone, company, software, stuff, project 

Topic 2:
 brain, light, sleep, memory, brains, fly, blue, image, surface, animal 

Topic 3:
 cells, cancer, blood, heart, patients, cell, disease, patient, oil, dna 

Topic 4:
 music, play, kids, sound, piece, hear, playing, game, sounds, audience 

Topic 5:
 design, climate, ideas, stories, ted, action, community, built, carbon, project 

Topic 6:
 computer, god, machine, book, art, computers, stuff, design, science, machines 

Topic 7:
 students, kids, education, learning, teachers, schools, teacher, read, student, parents 

Topic 8:
 war, government, internet, political, global, media, security, data, states, public 

Topic 9:
 energy, earth, universe, science, planet, sun, theory, dark, solar, physics 

Topic 10:
 women, men, woman, girls, mother, sex, girl, black, violence, stories 

Topic 11:
 health,

In [13]:
doc_topic = tm.transform(doc_term_matrix)
doc_topic.shape

(4005, 15)

In [14]:
topic_word = tm.components_
topic_word.shape

(15, 1134)

## Now: Rec Sys

Put the document-topic matrix (**V**) into a DataFrame, then use the pairwise_distances function with the cosine metric to rank talks by how similar their topic mixes are.

In [65]:
doc_topic_lda = pd.DataFrame(doc_topic.round(3),
                             index = ted.title)
doc_topic_lda

title,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14
Averting the climate crisis,0.152,0.070,0.044,0.035,0.045,0.213,0.042,0.044,0.061,0.049,0.036,0.053,0.052,0.046,0.059
The best stats you've ever seen,0.479,0.106,0.024,0.027,0.022,0.029,0.024,0.060,0.045,0.030,0.031,0.054,0.023,0.026,0.021
Simplicity sells,0.040,0.243,0.029,0.026,0.101,0.068,0.127,0.040,0.039,0.027,0.032,0.034,0.057,0.026,0.111
Greening the ghetto,0.423,0.030,0.025,0.025,0.025,0.062,0.025,0.036,0.032,0.040,0.085,0.081,0.046,0.036,0.030
Do schools kill creativity?,0.026,0.032,0.041,0.031,0.123,0.047,0.060,0.236,0.040,0.052,0.147,0.033,0.047,0.038,0.049
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
"Crisis support for the world, one text away",0.085,0.082,0.042,0.040,0.043,0.063,0.043,0.061,0.067,0.032,0.118,0.145,0.053,0.035,0.090
The dark history of IQ tests,0.061,0.047,0.060,0.056,0.038,0.061,0.059,0.096,0.188,0.057,0.044,0.071,0.054,0.047,0.061
"How ""policing for profit"" undermines your rights",0.149,0.099,0.029,0.043,0.029,0.037,0.036,0.038,0.109,0.029,0.152,0.144,0.043,0.030,0.031
The electrifying speeches of Sojourner Truth,0.047,0.044,0.035,0.037,0.051,0.055,0.063,0.052,0.086,0.043,0.249,0.054,0.064,0.042,0.077


In [66]:
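# Swap the title index for a positional one so rows line up with the argsort output used below.
a = np.arange(len(doc_topic_lda))
doc_topic_lda.set_index(a, inplace = True)
doc_topic_lda["title"] = ted.title 
doc_topic_lda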

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,title
0,0.152,0.070,0.044,0.035,0.045,0.213,0.042,0.044,0.061,0.049,0.036,0.053,0.052,0.046,0.059,Averting the climate crisis
1,0.479,0.106,0.024,0.027,0.022,0.029,0.024,0.060,0.045,0.030,0.031,0.054,0.023,0.026,0.021,The best stats you've ever seen
2,0.040,0.243,0.029,0.026,0.101,0.068,0.127,0.040,0.039,0.027,0.032,0.034,0.057,0.026,0.111,Simplicity sells
3,0.423,0.030,0.025,0.025,0.025,0.062,0.025,0.036,0.032,0.040,0.085,0.081,0.046,0.036,0.030,Greening the ghetto
4,0.026,0.032,0.041,0.031,0.123,0.047,0.060,0.236,0.040,0.052,0.147,0.033,0.047,0.038,0.049,Do schools kill creativity?
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4000,0.085,0.082,0.042,0.040,0.043,0.063,0.043,0.061,0.067,0.032,0.118,0.145,0.053,0.035,0.090,"Crisis support for the world, one text away"
4001,0.061,0.047,0.060,0.056,0.038,0.061,0.059,0.096,0.188,0.057,0.044,0.071,0.054,0.047,0.061,The dark history of IQ tests
4002,0.149,0.099,0.029,0.043,0.029,0.037,0.036,0.038,0.109,0.029,0.152,0.144,0.043,0.030,0.031,"How ""policing for profit"" undermines your rights"
4003,0.047,0.044,0.035,0.037,0.051,0.055,0.063,0.052,0.086,0.043,0.249,0.054,0.064,0.042,0.077,The electrifying speeches of Sojourner Truth


In [67]:
doc_topic_lda["talk_title"] = ted.title

Let's look at the cosine distance from the first talk to all the others. Sorting with argsort puts the closest talks first.

In [68]:
dist_pairs = pairwise_distances(doc_topic[0].reshape(1,-1),doc_topic,metric='cosine').argsort()[0]
dist_pairs

array([   0, 3134, 3807, ..., 3357, 2375,  655])
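As a sanity check on the ranking above, here is a minimal sketch with made-up topic vectors (the `toy_topics` values are assumptions for illustration). With `metric="cosine"`, `pairwise_distances` returns cosine distance, which is 1 minus cosine similarity, so an ascending `argsort` lists the most similar rows first and the query row itself comes out at position 0.

```python
import numpy as np
from sklearn.metrics import pairwise_distances

# Hypothetical document-topic rows for four documents and three topics.
toy_topics = np.array([
    [0.8, 0.1, 0.1],
    [0.1, 0.8, 0.1],
    [0.7, 0.2, 0.1],
    [0.0, 0.1, 0.9],
])

# Distance from document 0 to every document (including itself).
toy_query = toy_topics[0].reshape(1, -1)
toy_dists = pairwise_distances(toy_query, toy_topics, metric="cosine")[0]

# Ascending sort: smallest distance (most similar) first.
toy_order = toy_dists.argsort()

print(toy_order)  # [0 2 1 3]
```

Document 2 has almost the same topic mix as document 0, so it ranks right after the query itself.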

In [69]:
first_talk_rec = doc_topic_lda.iloc[dist_pairs]
first_talk_rec

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,title,talk_title
0,0.152,0.070,0.044,0.035,0.045,0.213,0.042,0.044,0.061,0.049,0.036,0.053,0.052,0.046,0.059,Averting the climate crisis,Averting the climate crisis
3134,0.112,0.039,0.045,0.052,0.044,0.195,0.041,0.053,0.062,0.086,0.040,0.052,0.052,0.074,0.053,The disarming case to act right now on climate...,The disarming case to act right now on climate...
3807,0.203,0.027,0.024,0.026,0.025,0.223,0.028,0.040,0.073,0.046,0.051,0.077,0.039,0.089,0.030,Climate change will displace millions. Here's ...,Climate change will displace millions. Here's ...
2847,0.101,0.061,0.038,0.061,0.042,0.168,0.060,0.044,0.064,0.059,0.038,0.075,0.065,0.083,0.043,Bridges should be beautiful,Bridges should be beautiful
2779,0.155,0.069,0.056,0.043,0.066,0.133,0.064,0.051,0.052,0.060,0.033,0.060,0.062,0.047,0.050,A playful solution to the housing crisis,A playful solution to the housing crisis
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1463,0.019,0.020,0.022,0.018,0.022,0.032,0.022,0.026,0.034,0.018,0.665,0.027,0.023,0.020,0.032,Violence against women -- it's a men's issue,Violence against women -- it's a men's issue
3626,0.025,0.078,0.018,0.655,0.019,0.022,0.019,0.022,0.021,0.018,0.020,0.025,0.021,0.018,0.018,"A ""living drug"" that could change the way we t...","A ""living drug"" that could change the way we t..."
3357,0.023,0.022,0.043,0.659,0.020,0.022,0.022,0.021,0.021,0.022,0.021,0.025,0.026,0.029,0.024,Can we regenerate heart muscle with stem cells?,Can we regenerate heart muscle with stem cells?
2375,0.017,0.019,0.014,0.018,0.019,0.022,0.023,0.047,0.019,0.017,0.682,0.020,0.026,0.017,0.039,We should all be feminists,We should all be feminists


In [70]:
first_talk_rec.iloc[0:5].title

0                             Averting the climate crisis
3134    The disarming case to act right now on climate...
3807    Climate change will displace millions. Here's ...
2847                          Bridges should be beautiful
2779             A playful solution to the housing crisis
Name: title, dtype: object

Give the recommendation system a try: Let's insert our own sentence and see what it comes up with. 

In [71]:
t = ["I've read a lot of art history over my life"]

In [72]:
vt = tf.transform(t)

In [73]:
tt = tm.transform(vt)

In [74]:
dist_pairs = pairwise_distances(tt,doc_topic,metric='cosine').argsort()[0]
dist_pairs

array([2344,   95,   84, ..., 2375, 1160,  655])

In [75]:
first_talk_rec = doc_topic_lda.iloc[dist_pairs]
first_talk_rec.iloc[0:5].title

2344                                      "Rollercoaster"
95                                       "La Vie en Rose"
84      Global warming's theme song, "Manhattan in Jan...
899                                 How to tie your shoes
3516                                 "Three Months After"
Name: title, dtype: object

Write a function for this: 

In [76]:
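def recommend_ted(string, num_talks): 
    """Return the titles of the num_talks talks whose topic mix is closest to string."""
    t = [string]
    vt = tf.transform(t)   # word counts using the fitted vocabulary
    tt = tm.transform(vt)  # topic distribution of the query text
    # Sort all talks by cosine distance to the query, closest first.
    dist_pairs = pairwise_distances(tt,doc_topic,metric='cosine').argsort()[0]
    recs = doc_topic_lda.iloc[dist_pairs]
    top_talks = recs.iloc[0:num_talks].title
    return top_talks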
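The function above reads `tf`, `tm`, `doc_topic` and `doc_topic_lda` from the notebook's global state. A possible variant, sketched here on a made-up corpus, passes those objects in as arguments so the function can be reused or tested in isolation. The name `recommend` and all `toy_*` data are hypothetical and not part of the original notebook.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.metrics import pairwise_distances


def recommend(text, num_talks, vectorizer, model, doc_topics, titles):
    """Return the num_talks titles whose topic mix is closest to text."""
    query_topics = model.transform(vectorizer.transform([text]))
    order = pairwise_distances(query_topics, doc_topics, metric="cosine").argsort()[0]
    return titles.iloc[order[:num_talks]]


# Hypothetical stand-ins for the TED transcripts and titles.
toy_docs = [
    "music sound piano play music",
    "piano sound audience music play",
    "brain memory sleep neurons brain",
    "sleep memory brain neurons dream",
    "energy solar planet sun energy",
    "planet sun solar energy climate",
]
toy_titles = pd.Series(["Music A", "Music B", "Brain A", "Brain B", "Energy A", "Energy B"])

toy_tf = CountVectorizer()
toy_X = toy_tf.fit_transform(toy_docs)
toy_tm = LatentDirichletAllocation(n_components=3, random_state=0).fit(toy_X)
toy_doc_topic = toy_tm.transform(toy_X)

toy_recs = recommend("piano music", 2, toy_tf, toy_tm, toy_doc_topic, toy_titles)
print(toy_recs)
```

With the notebook's own objects, the equivalent call would be `recommend(my_interest, 5, tf, tm, doc_topic, ted.title)`.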

In [83]:
my_interest = "I'm also really into sports"

In [84]:
recommendation = recommend_ted(my_interest, 5)
recommendation

95                                       "La Vie en Rose"
2344                                      "Rollercoaster"
84      Global warming's theme song, "Manhattan in Jan...
3516                                 "Three Months After"
145                                        My magic moves
Name: title, dtype: object
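Both queries return several of the same titles. One possible explanation, stated as an assumption rather than something this notebook verifies: a short query contains very few words from the fitted vocabulary, and with a large `doc_topic_prior` the inferred topic mix then stays close to uniform, so the nearest talks are whichever ones also have a near-uniform mix. The sketch below reproduces that effect on a made-up corpus (all `toy_*` names and sentences are hypothetical).

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Hypothetical corpus with three clearly separated themes.
toy_docs = [
    "music sound play music piano sound",
    "brain memory sleep brain neurons memory",
    "energy planet solar energy sun planet",
] * 5

toy_tf = CountVectorizer()
toy_X = toy_tf.fit_transform(toy_docs)

# Same large document-topic prior as in the notebook (doc_topic_prior = 5).
toy_tm = LatentDirichletAllocation(
    n_components=3, doc_topic_prior=5, random_state=0
).fit(toy_X)

# A short query: only "music" is in the fitted vocabulary.
short_query = toy_tf.transform(["I like music"])
n_known = int(short_query.sum())            # tokens that survive the vocabulary
short_topics = toy_tm.transform(short_query)[0]

print("in-vocabulary tokens:", n_known)
print("topic mix:", short_topics.round(3))  # every entry stays near 1/3
```

In the notebook itself, `int(tf.transform([my_interest]).sum())` gives the same token count for a real query, which is a quick way to check whether a query carries enough signal.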