# Latent Dirichlet Allocation

## Introduction

We're going to use Latent Dirichlet Allocation (LDA) to do topic modeling on a collection of Wikipedia articles about horror movies and paranormal events. The first goal is to discern the distinct topics within the dataset and the features that describe them. Then we'll write a function that takes a given article and returns the most similar articles.

## Basic
### Part 1: Prepare data

1. Load the data from `spooky_wikipedia.csv`. Since this is a Wikipedia dump, there are some pages (such as lists) that we're not interested in, so remove those. There are also some pages that have no text, so remove those as well. There are about 24,000 articles right now, so take a smaller sample (~1000) to start with. When you take a sample, pay attention to the indices, as they might not look like you expect.

    Hint: the `title` contains information about whether the page is a list.

In [1]:
import pandas as pd

df = pd.read_csv('data/spooky_wikipedia.csv', index_col=0)
df = df.iloc[:1000]
list_articles = df.title.str.lower().str.contains('list of')
df = df[~list_articles]
df = df.dropna()
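
The note about indices can be illustrated with a small made-up frame (the titles and text below are placeholders, not rows from the real dataset). After filtering, the index labels keep their original values, so label `i` is no longer row position `i` unless you renumber:

```python
import pandas as pd

# Toy stand-in for the real data; titles and text are invented.
toy = pd.DataFrame({
    "title": ["Alchemy", "List of ghosts", "Banshee", "Crop circle"],
    "text": ["old science", "a list", None, "field patterns"],
})

# Drop list pages and rows with missing text, as in the step above.
is_list = toy.title.str.lower().str.contains("list of")
toy = toy[~is_list].dropna()
print(toy.index.tolist())  # labels keep their original values, with gaps

# Renumber so that label i is also row position i.
toy = toy.reset_index(drop=True)
print(toy.index.tolist())
```

Whether to renumber is a design choice; the alternative is to be careful about using `.loc` (labels) versus `.iloc` (positions) everywhere downstream.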

2. Vectorize the corpus. Note that LDA generally does not take a TF-IDF matrix, but a bag-of-words matrix of raw term counts (you can use sklearn's <a href="http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html">count vectorizer</a>). You can start with the default stopwords, but you'll probably want to update those later. We'll tune some of these hyperparameters later, but start with `max_df=0.85`, `min_df=2` and `max_features=1000`.

In [2]:
from sklearn.feature_extraction.text import CountVectorizer

tf_vectorizer = CountVectorizer(stop_words='english', max_df=0.85, min_df=2, max_features=1000)
word_vec = tf_vectorizer.fit_transform(df.text)

### Part 2: Build LDA model
3. Create an <a href="http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.LatentDirichletAllocation.html">LDA instance</a> and think about what each of the parameters means. In our use case, what does `n_components` represent? How do we input our alpha and beta priors? Use the 'online' learning method and `n_jobs=-1` (all cores) or `n_jobs=-2` (all cores but one) to speed up your processing.

In [3]:
from sklearn.decomposition import LatentDirichletAllocation

lda = LatentDirichletAllocation(learning_method='online', n_jobs=-2, random_state=1659)
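
As a hedged sketch of how the priors are passed: in scikit-learn the alpha prior corresponds to `doc_topic_prior` and the beta prior (often written eta) to `topic_word_prior`. The count matrix below is invented, and the fitted attributes `doc_topic_prior_` and `topic_word_prior_` are assumed to be available, which requires a reasonably recent scikit-learn:

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

# Tiny made-up document-term count matrix: 4 documents, 5 terms.
X = np.array([[3, 1, 0, 0, 2],
              [2, 2, 0, 1, 0],
              [0, 0, 4, 2, 1],
              [0, 1, 3, 3, 0]])

lda_toy = LatentDirichletAllocation(
    n_components=2,        # number of topics
    doc_topic_prior=0.2,   # alpha
    topic_word_prior=0.1,  # beta (eta)
    learning_method="online",
    random_state=0,
).fit(X)

# Leaving the priors as None falls back to 1 / n_components.
default_toy = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)

print(lda_toy.doc_topic_prior_, lda_toy.topic_word_prior_)
print(default_toy.doc_topic_prior_)
```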

4. Fit the LDA model on your vectorized corpus.

In [4]:
lda.fit(word_vec)

LatentDirichletAllocation(learning_method='online', n_jobs=-2,
                          random_state=1659)

5. Examine the generated topics. What does `lda.components_` represent? How do we determine the most important features in a topic? Write a function that takes the most important features for each topic in `lda.components_`, then uses the feature names from the vectorizer to print out the most important words for each topic. What do you think each topic describes? Try adding some words to your stopwords to make your categories more specific to spooky topics and less specific to Wikipedia topics.

> Checkpoint 1: Nice work; you've learned how to fit an LDA model and examine the topics to gain an intuitive understanding of the latent associations in a set of documents.


Helpful hint: if you don't want to keep fitting your vectorizer and LDA model over and over again, you can persist them (save them to a file) with joblib (similar to pickle but optimized for large data).

```python
import joblib

joblib.dump(lda, 'lda_model.joblib')
joblib.dump(tf_vectorizer, 'tf_vec.joblib')
lda = joblib.load('lda_model.joblib')
tf_vectorizer = joblib.load('tf_vec.joblib')
# It's that easy!
```

In [5]:
import joblib

joblib.dump(lda, 'lda_model.joblib')
joblib.dump(tf_vectorizer, 'tf_vec.joblib')

['tf_vec.joblib']

In [6]:
import joblib

lda = joblib.load('lda_model.joblib')
tf_vectorizer = joblib.load('tf_vec.joblib')

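`lda.components_` holds the word weights for each generated topic, where rows are topics and columns are words. The values are unnormalized pseudo-counts; dividing each row by its sum gives that topic's distribution over words. The most important features in a topic are the words with the highest values.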

In [7]:
lda.components_

array([[ 0.13109778,  6.08632059,  7.15945968, ..., 21.56583322,
         0.73597785,  4.9738055 ],
       [ 0.10769533,  6.82274724,  2.24793448, ..., 12.8841689 ,
         0.14241819,  2.75489192],
       [ 7.76355189,  0.61276257,  0.12555321, ..., 24.1864745 ,
         4.02857495,  3.93901932],
       ...,
       [ 0.256852  ,  0.63753218,  0.10746563, ..., 10.77513433,
         0.11253416,  4.89832834],
       [ 0.3887244 ,  0.1027463 ,  0.10192989, ...,  0.10292673,
         0.39191822,  0.11213927],
       [21.82286025,  5.21510422, 11.5115133 , ..., 51.65488355,
        17.05453642,  8.84890863]])
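
A minimal sketch of that normalization, using an invented 2 x 4 array in place of the real `lda.components_`:

```python
import numpy as np

# Made-up stand-in for lda.components_: 2 topics x 4 words of pseudo-counts.
components = np.array([[8.0, 1.0, 0.5, 0.5],
                       [0.2, 0.3, 4.5, 5.0]])

# Divide each row by its sum to get P(word | topic).
topic_word = components / components.sum(axis=1, keepdims=True)

print(topic_word.round(2))
print(topic_word.sum(axis=1))  # each row now sums to 1
```

The ranking of words within a topic is the same before and after normalizing, so the raw values are enough for picking top words.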

In [8]:
def top_topic_features(model, feature_names, num_features=10):
    sorted_topics = feature_names[model.components_.argsort(axis=1)[:, ::-1][:, :num_features]]
    return sorted_topics
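
The indexing in `top_topic_features` is dense, so here is the same idea unpacked on an invented 2-topic, 4-word example (the words and weights are made up):

```python
import numpy as np

words = np.array(["ghost", "house", "vampire", "blood"])
weights = np.array([[5.0, 3.0, 0.1, 0.2],
                    [0.3, 0.1, 4.0, 6.0]])

# argsort sorts ascending, so reverse each row to put the largest weight first.
order = weights.argsort(axis=1)[:, ::-1]

# Keep the first two column indices per topic and look up the words.
top2 = words[order[:, :2]]
print(top2)
```

Indexing a 1-D array of names with a 2-D array of column indices returns a 2-D array of names, one row per topic.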

In [9]:
import numpy as np

# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out().
feature_names = np.array(tf_vectorizer.get_feature_names_out())
top_topic_features(lda, feature_names=feature_names)

array([['series', 'star', 'character', 'television', 'trek', 'wars',
        'season', 'fictional', 'episode', 'franchise'],
       ['known', 'war', 'world', 'greek', 'water', 'castle', 'states',
        'south', 'light', 'black'],
       ['house', 'king', 'century', 'known', 'play', 'james', 'opera',
        'published', 'story', 'work'],
       ['character', 'comic', 'game', 'fictional', 'book', 'published',
        'story', 'books', 'man', 'universe'],
       ['term', 'used', 'century', 'magic', 'religious', 'word', 'world',
        'people', 'human', 'spiritual'],
       ['buffy', 'harry', 'music', 'series', 'vampire', 'angel',
        'slayer', 'potter', 'born', 'president'],
       ['film', 'films', 'american', 'horror', 'best', 'directed',
        'released', 'novel', 'award', 'fiction'],
       ['god', 'greek', 'hebrew', 'jesus', 'goddess', 'book',
        'according', 'wicca', 'jewish', 'king'],
       ['loa', 'vodou', 'haitian', 'sun', 'moon', 'french', 'baron',
        'sail

In [10]:
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

In [11]:
# Recent scikit-learn versions expect stop_words to be a list, not a frozenset.
stop_words = sorted(ENGLISH_STOP_WORDS.union({'film', 'directed', 'fictional',
                                              'work', 'books', 'released',
                                              'written', 'born', 'characters',
                                              'television', 'episodes',
                                              'director', 'novel', 'story',
                                              'book', 'list', 'element',
                                              'redirect', 'starring',
                                              'fiction', 'produced', 'based',
                                              'character', 'game', 'comic',
                                              'animated', 'tv', 'series',
                                              'redirects', 'mentions',
                                              'locations'}))

In [12]:
tf_vectorizer_2_5 = CountVectorizer(stop_words=stop_words, max_df=0.85, min_df=2, max_features=1000)
word_vec_2_5 = tf_vectorizer_2_5.fit_transform(df.text)
lda_2_5 = LatentDirichletAllocation(learning_method='online', n_jobs=-2, random_state=1659)
lda_2_5.fit(word_vec_2_5)
joblib.dump(lda_2_5, 'lda_model_2-5.joblib')
joblib.dump(tf_vectorizer_2_5, 'tf_vec_2-5.joblib')
lda_2_5 = joblib.load('lda_model_2-5.joblib')
tf_vectorizer_2_5 = joblib.load('tf_vec_2-5.joblib')
feature_names_2_5 = np.array(tf_vectorizer_2_5.get_feature_names_out())
top_topic_features(lda_2_5, feature_names=feature_names_2_5)

array([['star', 'trek', 'version', 'universe', 'created', 'wicca',
        'space', 'appeared', 'time', 'role'],
       ['films', 'american', 'best', 'time', 'award', 'million',
        'awards', 'horror', 'received', 'academy'],
       ['church', 'saint', 'christian', 'loa', 'catholic', 'roman',
        'vodou', 'jesus', 'known', 'pope'],
       ['wars', 'buffy', 'season', 'star', 'angel', 'stories', 'novels',
        'tale', 'appears', 'episode'],
       ['term', 'used', 'world', 'magic', 'century', 'word', 'human',
        'ancient', 'religious', 'greek'],
       ['god', 'king', 'hebrew', 'end', 'century', 'greek', 'jewish',
        'house', 'bible', 'bc'],
       ['scientology', 'church', 'hubbard', 'mind', 'halloween', 'salem',
        'witch', 'astrology', 'jack', 'dianetics'],
       ['horror', 'dead', 'dawn', 'black', 'baron', 'creature',
        'campbell', 'ghede', 'evil', 'ring'],
       ['known', 'united', 'society', 'american', 'states', 'years',
        'war', 'later', 'w

## Advanced
### Part 3: Build recommender

6. Let's now work on creating a function that will take the name of an article and return the names of n articles most closely related to it. First we need to turn our vectorized corpus into an array of topic probabilities for each document. Which method of our model will return this?

In [13]:
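def find_article_idx(df, article_title):
    # Return the row *position* of the article, not its index label: after
    # filtering, labels have gaps, while the topic matrix is indexed by position.
    return (df.title == article_title).to_numpy().nonzero()[0][0]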

In [14]:
find_article_idx(df, 'Alchemy')

0

In [15]:
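def predict_proba(model, vectorizer, text):
    # Accept a single document as well as an iterable of documents.
    if isinstance(text, str):
        text = [text]
    vec_text = vectorizer.transform(text)
    # Each row of the result is one document's distribution over topics.
    doc_probs = model.transform(vec_text)
    return doc_probs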

In [16]:
predict_proba(lda_2_5, tf_vectorizer_2_5, df.text)

array([[1.00027375e-03, 1.00015114e-03, 1.00027679e-03, ...,
        1.00006104e-03, 1.00020069e-03, 1.00008565e-03],
       [1.01025954e-03, 7.98591683e-01, 1.01020428e-03, ...,
        1.01020424e-03, 1.93325920e-01, 1.01040307e-03],
       [7.14363731e-04, 7.14370961e-04, 7.14493125e-04, ...,
        6.29865513e-03, 8.59948227e-01, 7.14362348e-04],
       ...,
       [5.99897115e-01, 4.54630067e-02, 2.38106583e-03, ...,
        2.73231553e-02, 2.38128048e-03, 2.38106076e-03],
       [9.83016919e-01, 1.88688948e-03, 1.88692015e-03, ...,
        1.88702352e-03, 1.88702719e-03, 1.88688332e-03],
       [4.76230503e-03, 4.76202272e-03, 9.57139194e-01, ...,
        4.76205530e-03, 4.76221006e-03, 4.76221696e-03]])
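
A quick sanity check on what `transform` returns, sketched on an invented count matrix: one row per document, one column per topic, with each row summing to 1 in current scikit-learn versions.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

# Tiny made-up document-term count matrix: 4 documents, 5 terms.
X = np.array([[3, 1, 0, 0, 2],
              [2, 2, 0, 1, 0],
              [0, 0, 4, 2, 1],
              [0, 1, 3, 3, 0]])

lda_toy = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)
doc_topics = lda_toy.transform(X)

print(doc_topics.shape)        # (n_documents, n_topics)
print(doc_topics.sum(axis=1))  # each row is a probability distribution
```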

7. Next, given a certain article, we need to compute the distance between it and every other document. `sklearn.metrics.pairwise` provides `cosine_distances` and `euclidean_distances` for this.

In [17]:
from sklearn.metrics.pairwise import cosine_distances, euclidean_distances

8. Use cosine distance to create a vector that contains the distance from our document to every other document. Use argsort to determine the closest top 10.

In [18]:
def sort_by_distance(doc_index, doc_probs, distance_func=cosine_distances, num_documents=10):
    distances = distance_func(doc_probs[doc_index].reshape(1, -1), doc_probs).ravel()
    return distances.argsort()[:num_documents]
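
Here is the same ranking step on four invented topic vectors. Note that the query document is always its own nearest neighbor, so this sketch drops it before taking the top results; that filtering is an illustrative choice, not something the function above does:

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_distances

# Made-up topic distributions for 4 documents over 3 topics.
doc_topics = np.array([[0.9, 0.1, 0.0],
                       [0.1, 0.8, 0.1],
                       [0.8, 0.2, 0.0],
                       [0.0, 0.1, 0.9]])

query = 0
dists = cosine_distances(doc_topics[query].reshape(1, -1), doc_topics).ravel()

ranked = dists.argsort()                   # closest first; the query itself leads
neighbors = ranked[ranked != query][:2]    # two closest other documents
print(ranked, neighbors)
```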

In [19]:
doc_index = find_article_idx(df, 'Alchemy')
doc_probs = predict_proba(lda_2_5, tf_vectorizer_2_5, df.text)
sort_by_distance(doc_index, doc_probs)

array([  0, 214, 387, 361,  54,  61, 372, 901,  55,  98], dtype=int64)

9. We now have an array that contains the indices of the most similar articles, so we're almost there! Write a function that takes this array and returns the name of the input article as well as its most similar articles.

In [20]:
def find_closest_document_titles(sorted_distances, titles):
    name_array = titles.iloc[sorted_distances]
    return {name_array.iloc[0]: name_array.iloc[1:]}

In [21]:
doc_index = find_article_idx(df, 'Alchemy')
doc_probs = predict_proba(lda_2_5, tf_vectorizer_2_5, df.text)
article_similarities = sort_by_distance(doc_index, doc_probs)
find_closest_document_titles(article_similarities, df.title)

{'Alchemy': 214               Shamanism
 388            Magic square
 361    Magic (supernatural)
 54               Dalai Lama
 61               Divination
 372    Dream interpretation
 935    Pow-wow (folk magic)
 55                    Demon
 98                 Grimoire
 Name: title, dtype: object}

> Checkpoint 2: Congratulations! You've just created a very useful recommender using LDA. This is a practical use-case; websites often use a similar approach to determine the articles for recommended reading that appear below the article text or in sidebars.

### Part 4: Evaluate and improve
10. Do your recommendations make sense? Try changing hyperparameters of your count vectorizer and your LDA model to try to improve them!
I had pretty good results using the full dataset and these parameters:
```python
lda = LatentDirichletAllocation(n_components=20, learning_offset=50., verbose=1,
                                doc_topic_prior=0.9, topic_word_prior=0.9,
                                n_jobs=-1, learning_method='online')
tf_vectorizer = CountVectorizer(max_df=0.85, min_df=2, max_features=1000,
                                stop_words=stop_words, ngram_range=(1, 3))
```

In [22]:
lda = LatentDirichletAllocation(n_components=20, learning_offset=50., verbose=1,
                                doc_topic_prior=0.9, topic_word_prior=0.9,
                                n_jobs=-1, learning_method='online', random_state=1659)
tf_vectorizer = CountVectorizer(max_df=0.85, min_df=2, max_features=1000,
                                stop_words=stop_words, ngram_range=(1, 3))

word_vec_4 = tf_vectorizer.fit_transform(df.text)
lda.fit(word_vec_4)
joblib.dump(lda, 'lda_model_4.joblib')
joblib.dump(tf_vectorizer, 'tf_vec_4.joblib')
lda = joblib.load('lda_model_4.joblib')
tf_vectorizer = joblib.load('tf_vec_4.joblib')

iteration: 1 of max_iter: 10
iteration: 2 of max_iter: 10
iteration: 3 of max_iter: 10
iteration: 4 of max_iter: 10
iteration: 5 of max_iter: 10
iteration: 6 of max_iter: 10
iteration: 7 of max_iter: 10
iteration: 8 of max_iter: 10
iteration: 9 of max_iter: 10
iteration: 10 of max_iter: 10


In [23]:
def top_closest_features(model, vectorizer, df, article_title):
    doc_index = find_article_idx(df, article_title)
    doc_probs = predict_proba(model, vectorizer, df.text)
    article_similarities = sort_by_distance(doc_index, doc_probs)
    return find_closest_document_titles(article_similarities, df.title)

In [24]:
top_closest_features(lda, tf_vectorizer, df, 'Alchemy')

{'Alchemy': 408    Fortune-telling
 181             Occult
 70         Eschatology
 4              Animism
 186         Pythagoras
 83       Faith healing
 388       Magic square
 214          Shamanism
 103         Gnosticism
 Name: title, dtype: object}

In [25]:
top_closest_features(lda, tf_vectorizer, df, 'Chupacabra')

{'Chupacabra': 857                   Majestic 12
 824                 Donald Keyhoe
 693                          Vayu
 501                    Buckriders
 277    Unidentified flying object
 772                    Candy corn
 702               Newton, Alabama
 951                   Foo fighter
 33      Earth (classical element)
 Name: title, dtype: object}

In [26]:
top_closest_features(lda, tf_vectorizer, df, 'Ghost')

{'Ghost': 719             Damnation
 235                  Soul
 422            Necromancy
 920                  Omen
 550              Caduceus
 811     Western astrology
 889    Magic and religion
 874               Saṃsāra
 685              Orunmila
 Name: title, dtype: object}

11. Since we don't have traditional error metrics like we would in a supervised learning approach, it's hard to tune these hyperparameters in the same way. We can, however, use log-likelihood as a scoring function for the LDA model. We split our data, train our model, and then estimate the likelihood that our model of the documents could have generated the unseen text. The higher this value, the "better" we have modeled our corpus.
Using sklearn <a href="http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html">GridSearchCV</a> or <a href="https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html">RandomizedSearchCV</a>, tune the number of topics using cross-validation on log-likelihood (the `score` method of the sklearn LDA model returns an approximate log-likelihood, and the search uses it by default when no `scoring` argument is given; higher is better).

In [27]:
from scipy import stats
from sklearn.model_selection import RandomizedSearchCV, train_test_split

X_train, X_test = train_test_split(word_vec_4, random_state=1659)

params = {'n_components': [2, 5, 10, 50, 75, 100, 200], 
          'doc_topic_prior': stats.uniform(),
          'topic_word_prior': stats.uniform(),
          'learning_offset': stats.uniform(10, 90)}
# Silence per-iteration logging while the search runs.
lda.set_params(verbose=0, n_jobs=-2)
lda_cv = RandomizedSearchCV(lda, params, n_iter=1, n_jobs=-2)

results = {'mean_test_score': [],
           'std_test_score': [],
           'params': []}

In [28]:
n_iter = 2

for _ in range(n_iter):
    lda_cv.fit(X_train)
    results['mean_test_score'].append(lda_cv.cv_results_['mean_test_score'][0])
    results['std_test_score'].append(lda_cv.cv_results_['std_test_score'][0])
    results['params'].append(lda_cv.cv_results_['params'][0])

In [29]:
import pandas as pd

df_results = pd.DataFrame(results)
df_results.to_csv('lda_tuning.csv', index=False)
df_results.head()

| | mean_test_score | std_test_score | params |
|---|---|---|---|
| 0 | -58863.411374 | 4363.600763 | {'doc_topic_prior': 0.036429455021845025, 'lea... |
| 1 | -76467.16052 | 4633.829837 | {'doc_topic_prior': 0.7833607791416924, 'learn... |
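
For comparison, a grid search over the number of topics alone might look like the sketch below. It runs on random Poisson counts rather than the article corpus, so the scores themselves are meaningless; the point is only the mechanics. With no `scoring` argument, the search falls back on the estimator's own `score` method, the approximate log-likelihood.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.model_selection import GridSearchCV

# Random counts standing in for a document-term matrix: 40 docs, 30 terms.
rng = np.random.default_rng(0)
X = rng.poisson(1.0, size=(40, 30))

search = GridSearchCV(
    LatentDirichletAllocation(learning_method="online", random_state=0),
    param_grid={"n_components": [2, 5, 10]},
    cv=3,  # no `scoring` given, so the estimator's score() is used
)
search.fit(X)

print(search.best_params_)
print(search.cv_results_["mean_test_score"])  # higher (less negative) is better
```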


## Extra credit
### Part 5: Classes

Put this all into a class for easy usage!

See `solution.py`.
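
As a rough idea of the shape such a class could take, here is a minimal sketch. It is not the contents of `solution.py`; the class name `TopicRecommender`, its methods, and the four toy documents are all hypothetical and chosen only for illustration.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_distances


class TopicRecommender:
    """Hypothetical wrapper: vectorize, fit LDA, recommend similar titles."""

    def __init__(self, n_components=10, random_state=None, **vectorizer_kwargs):
        self.vectorizer = CountVectorizer(**vectorizer_kwargs)
        self.model = LatentDirichletAllocation(
            n_components=n_components, random_state=random_state)

    def fit(self, titles, texts):
        self.titles_ = np.asarray(titles)
        counts = self.vectorizer.fit_transform(texts)
        # One row of topic probabilities per document.
        self.doc_topics_ = self.model.fit_transform(counts)
        return self

    def recommend(self, title, n=5):
        matches = np.flatnonzero(self.titles_ == title)
        if matches.size == 0:
            raise KeyError(title)
        query = matches[0]
        dists = cosine_distances(
            self.doc_topics_[query].reshape(1, -1), self.doc_topics_).ravel()
        ranked = dists.argsort()
        ranked = ranked[ranked != query][:n]  # drop the query article itself
        return self.titles_[ranked].tolist()


# Made-up corpus, far too small for meaningful topics.
titles = ["Ghost", "Poltergeist", "Vampire", "Dracula"]
texts = ["ghost spirit haunting house spirit",
         "poltergeist spirit haunting noise house",
         "vampire blood undead fangs blood",
         "dracula vampire blood castle undead"]

rec = TopicRecommender(n_components=2, random_state=0).fit(titles, texts)
print(rec.recommend("Ghost", n=2))
```

Storing the titles as a positional array sidesteps the label-versus-position question that comes up when filtering a DataFrame.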