# This notebook experiments with building topic models for the reviews - can we find some useful topics, assign reviews to these topics and use those to classify the reviews somehow?

# For example, does some topic discuss specific types of wine?

Using Gensim for topic modelling and NLTK for some basic text processing. Something like spaCy or MALLET with Gensim would also be interesting to investigate, but is not necessary for this exercise.

In [None]:
import pandas as pd
from gensim import corpora
from nltk.corpus import stopwords
from nltk import word_tokenize
from nltk.stem import WordNetLemmatizer
import nltk

This dataset contains several files, the main variants being a 150k-review set and a 130k-review set. Since the 130k set was described (or so I understood) as having duplicates removed and additional information added, I used that.

When running some of these analyses, the best-matching documents I got turned out to be exact duplicates of each other, so I had to drop duplicates in any case. However, the added information about the reviewer turned out to be very useful, as we will see at the end of this notebook.

In [None]:
df = pd.read_csv('../input/winemag-data-130k-v2.csv')
df.shape

In [None]:
df = df.drop_duplicates('description') 
df.shape

So that dropped about 10k rows with duplicate descriptions. 
The reviews are in this "description" column, so pick that up.

In [None]:
descriptions = df['description']


Collect stopwords for removal. These are the NLTK English stopwords, the punctuation characters listed in Python's `string.punctuation`, and a set of custom tokens I found still floating around in the results after all the NLTK stopwords and punctuation were removed. The last ones are possibly artefacts of how the word tokenization (or lemmatization) is done.

In [None]:
from string import punctuation
stop_words = set(stopwords.words('english')) 
stop_words = stop_words.union(set(punctuation)) 
stop_words.update(["\'s", "n't"])


To better analyze the remaining words, I lemmatize them. Stemming is another option, but I prefer lemmatizing, so I can actually look at the resulting topics/features and understand what they are. In some languages, lemmas might also overlap less. Of course, POS tagging could also be useful, but let's see how this works for now.

In [None]:
lemmatizer = WordNetLemmatizer()
texts = [[lemmatizer.lemmatize(word) for word in word_tokenize(description.lower()) if word not in stop_words] for description in descriptions]


In [None]:
print(texts[4])

Here I set up bi-gram and tri-gram identification for gensim. Most approaches I found seem to just want to make everything a bi-gram (so bi-gram for every word pair there is, or maybe I just misunderstand). I don't see that as useful, however, so this takes the most often co-occurring ones only.

For more discussion and formula on the threshold value: [StackOverflow](https://stackoverflow.com/questions/35716121/how-to-extract-phrases-from-corpus-using-gensim), [Radim](https://radimrehurek.com/gensim/models/phrases.html)

Seems a bit complicated, but 100 was giving me good results so I used that for threshold.
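
To make the threshold a bit less mysterious, here is a minimal sketch of the scoring formula as I understand it from the Gensim documentation: `(bigram_count - min_count) * vocab_size / (count_a * count_b)`, with a pair joined when the score exceeds the threshold. The function name `phrase_score` and the toy token lists are made up for illustration, and using the number of distinct single words as `vocab_size` is a simplifying assumption, so the numbers will not match Gensim exactly.

```python
from collections import Counter

def phrase_score(count_a, count_b, count_ab, vocab_size, min_count):
    """Collocation score for the pair (a, b); higher means a stronger phrase candidate."""
    return (count_ab - min_count) * vocab_size / (count_a * count_b)

# Toy, made-up token lists (not taken from the wine reviews).
toy_texts = [
    ["black", "cherry", "aroma"],
    ["black", "cherry", "finish"],
    ["black", "cherry", "palate"],
    ["ripe", "cherry", "aroma"],
    ["black", "pepper", "finish"],
]

# Count single words and adjacent word pairs.
unigrams = Counter(tok for text in toy_texts for tok in text)
bigrams = Counter(pair for text in toy_texts for pair in zip(text, text[1:]))
vocab_size = len(unigrams)  # assumption: distinct single words only

score = phrase_score(
    unigrams["black"], unigrams["cherry"], bigrams[("black", "cherry")],
    vocab_size, min_count=1,
)
print(score)
```

Since the score scales with the vocabulary size, a threshold like 100 only makes sense on a large corpus; on toy data like this nothing would pass it.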

In [None]:
import gensim
#https://www.machinelearningplus.com/nlp/topic-modeling-gensim-python/
# Build the bigram and trigram models
bigram = gensim.models.Phrases(texts, min_count=5, threshold=100)
trigram = gensim.models.Phrases(bigram[texts], threshold=100)
# Faster way to get a sentence clubbed as a trigram/bigram
bigram_mod = gensim.models.phrases.Phraser(bigram) 
trigram_mod = gensim.models.phrases.Phraser(trigram)


So that trained a bi-gram and tri-gram analyzer/generator for me. Now to see how it handles the description at index 4 as I printed out above:

In [None]:
print(trigram_mod[bigram_mod[texts[4]]])

So it has identified "winter stew" in that text as one word (a bi-gram). I don't really know what that is, but Google does give lots of dishes for it, so it seems good.

Now let's replace all the identified word pairs and triplets in the text with their bi-gram and tri-gram representations (i.e., make them single features for the algorithms).

In [None]:
texts = [trigram_mod[bigram_mod[text]] for text in texts]

In [None]:
# id-to-word mapping for gensim
id2word = corpora.Dictionary(texts)

Initial goal for me was to build a binary classifier. That would make this a two-topic affair. Let's see how this goes.

In [None]:
from gensim.models import LdaModel

corpus = [id2word.doc2bow(text) for text in texts] 
test_lda = LdaModel(corpus,num_topics=2, id2word=id2word) 
sentence = 'i like red wine with steak'
sentence2 = [word for word in sentence.lower().split()] 
test_lda[id2word.doc2bow(sentence2)]

So the above trained the Gensim LDA model for two topics and used it to classify a given sentence with regard to those two topics. In a real system, the sentence being classified (variable "sentence" above) should have stop words removed and be lemmatized, but in this case the words are pretty much in lemma form already. Good enough for this experiment.
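
Before classification the sentence is turned into a bag of words against the dictionary. Below is a rough stdlib sketch of what that conversion amounts to. It is a simplification and not Gensim's implementation: the helper names `build_vocab` and `to_bow` and the toy texts are hypothetical, and the ids will differ from what `corpora.Dictionary` assigns. The point is that words the dictionary has never seen simply drop out.

```python
from collections import Counter

def build_vocab(texts):
    """Assign an integer id to every distinct token, in order of first appearance."""
    vocab = {}
    for text in texts:
        for token in text:
            vocab.setdefault(token, len(vocab))
    return vocab

def to_bow(tokens, vocab):
    """Return sorted (token_id, count) pairs, ignoring tokens missing from vocab."""
    counts = Counter(vocab[t] for t in tokens if t in vocab)
    return sorted(counts.items())

# Toy training texts (made up).
toy_texts = [["red", "wine", "cherry"], ["white", "wine", "citrus"]]
vocab = build_vocab(toy_texts)

# "i", "like", "with" and "steak" are unknown to this toy vocab and are dropped.
bow = to_bow("i like red wine with red steak".split(), vocab)
print(bow)
```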

The result shows the sentence ranked as 11.8% in topic 1 and 88.2% in topic 2. And what are those topics?

In [None]:
test_lda.print_topics(num_words=20)

Above are the 20 top words for each of these two topics. Typically I try to look at these words to figure out what the topics might be about. These two topics do not seem very cohesive, so cannot say. Maybe a wine specialist could see something there.

Gensim has a measure called topic coherence, which gives a measure of how good it thinks the topics are. So let's try that.
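
To get a feel for what a coherence score measures, here is a small stdlib sketch of the simpler UMass-style coherence. Note this is a different measure from the `c_v` used in this notebook, swapped in only because it fits in a few lines. The function name `umass_coherence` and the toy documents are made up. The idea is that the top words of a good topic should tend to show up in the same documents.

```python
import math

def umass_coherence(top_words, documents):
    """Sum of log((D(w_m, w_l) + 1) / D(w_l)) over ordered pairs of top words.

    D(...) is the number of documents containing all the given words.
    Assumes every top word occurs in at least one document.
    """
    doc_sets = [set(doc) for doc in documents]

    def doc_freq(*words):
        return sum(1 for d in doc_sets if all(w in d for w in words))

    score = 0.0
    for m in range(1, len(top_words)):
        for l in range(m):
            co_occurrences = doc_freq(top_words[m], top_words[l])
            score += math.log((co_occurrences + 1) / doc_freq(top_words[l]))
    return score

# Toy documents (made up): two "red fruit" texts and two "citrus" texts.
toy_docs = [
    ["cherry", "plum", "tannin"],
    ["cherry", "plum", "oak"],
    ["citrus", "apple", "acidity"],
    ["citrus", "apple", "mineral"],
]

coherent = umass_coherence(["cherry", "plum"], toy_docs)   # words that co-occur
mixed = umass_coherence(["cherry", "citrus"], toy_docs)    # words that never co-occur
print(coherent, mixed)
```

The topic whose top words appear together gets the higher score.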

To try this with different topic counts, I borrowed some code from the internets (as usual), maybe it was here: [Datascienceplus](https://datascienceplus.com/evaluation-of-topic-modeling-topic-coherence/)

Wherever it was, thanks! :)


In [None]:
from gensim.models import CoherenceModel
def compute_coherence_values(dictionary, corpus, texts, limit, start=2, step=3): 
    """
    Compute c_v coherence for various number of topics
    Parameters:
    ----------
    dictionary : Gensim dictionary
    corpus : Gensim corpus
    texts : List of input texts
    limit : Max num of topics
    Returns:
    -------
    model_list : List of LDA topic models
    coherence_values : Coherence values corresponding to the LDA model with respective number of topics
    """
    coherence_values = []
    model_list = []
    for num_topics in range(start, limit, step):
        # use the dictionary passed in as a parameter, not the global id2word
        model = LdaModel(corpus=corpus, num_topics=num_topics, id2word=dictionary)
        model_list.append(model)
        coherencemodel = CoherenceModel(model=model, texts=texts, dictionary=dictionary, coherence='c_v')
        coherence_values.append(coherencemodel.get_coherence())
    return model_list, coherence_values

In [None]:
# Can take a long time to run.
model_list, coherence_values = compute_coherence_values(dictionary=id2word, corpus=corpus, texts=texts, limit=40, start=2, step=6)

In [None]:
coherence_values

In [None]:
import matplotlib.pyplot as plt 
%matplotlib inline
# Show graph
limit=40; start=2; step=6;
x = range(start, limit, step)
plt.plot(x, coherence_values) 
plt.xlabel("Num Topics") 
plt.ylabel("Coherence score") 
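# the trailing comma makes this a one-element tuple; without it the string is split into characters
plt.legend(("coherence_values",), loc='best')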
plt.show()

This indicates that the 2 topics I started with were a pretty poor choice, so no binary classification with this, it seems. The score seems to peak somewhere around 10-15 topics, so from a quick eyeballing of the chart I will try 14 topics. I actually ran this with a few slightly different configurations; in all cases the beginning of the chart looked about the same, and the score went down after about 20 topics. So 14 it is for now.
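
Instead of eyeballing, the list index can be mapped back to a topic count and the best score picked programmatically. A minimal sketch follows; the scores are made-up placeholders and not results from this notebook, and `placeholder_scores` is a hypothetical name standing in for `coherence_values`.

```python
# Same sampling as in the coherence run: start=2, limit=40, step=6.
start, limit, step = 2, 40, 6
topic_counts = list(range(start, limit, step))

# Made-up placeholder scores, one per sampled topic count (NOT real results).
placeholder_scores = [0.30, 0.38, 0.44, 0.41, 0.39, 0.37, 0.36]

# Index of the highest score, then the topic count it belongs to.
best_index = max(range(len(placeholder_scores)), key=placeholder_scores.__getitem__)
best_num_topics = topic_counts[best_index]
print(best_index, best_num_topics)
```

With this sampling, index 2 is the 14-topic model, which is why that index is used when picking the model.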

First pick the topic model:

In [None]:
print(coherence_values)

In [None]:
# range(2, 40, 6) gives 2, 8, 14, ... so index 2 is the 14-topic model
print(coherence_values[2])
test_lda = model_list[2]

Top words for this model:

In [None]:
test_lda.print_topics(num_words=20)

There are some more intuitive ones here. For example, one seems to be about wine age, vintage wines, and so on. Another seems to be largely about different wine types. The topic numbers seem to change a bit across kernel runs, so I cannot put a number here: I cannot predict what the topic number will be in the final Kaggle commit/run, after which this is uneditable. Probably there is some randomness setting I should fix to get exactly deterministic results (Gensim's `LdaModel` accepts a `random_state` argument, which I did not set here). Since the topics generally seem the same across runs (just minor differences), I did not bother this time. Sorry.

In any case, someone with more wine knowledge would likely be able to say something deeper about these topics and how sensible they are for wines. They could probably also suggest how to iterate from here, for example with more stopwords.

To investigate a bit deeper myself, I take the top documents for each of these topics. This means the documents that the LDA model classifies as belonging most heavily to that specific topic:
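
The trick for keeping only the best few documents per topic is a small min-heap: push each new item and pop the smallest, so only the largest ones survive. Here is a toy sketch of that idea in isolation, with a hypothetical helper `top_k` and made-up (weight, doc_id) pairs:

```python
import heapq

def top_k(scored_items, k=3):
    """Keep the k largest (score, id) tuples using a min-heap of size k."""
    heap = []
    for item in scored_items:
        if len(heap) < k:
            heapq.heappush(heap, item)
        else:
            # pushes the new item, then pops the smallest one
            # (which is the new item itself if it is the smallest)
            heapq.heappushpop(heap, item)
    return sorted(heap, reverse=True)

# Made-up (topic_weight, doc_id) pairs.
toy = [(0.10, 0), (0.97, 1), (0.55, 2), (0.80, 3), (0.30, 4)]
best = top_k(toy)
print(best)
```

This way the full list of documents never has to be sorted or held per topic.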

In [None]:
import heapq 

top_docs = {} 
n_topics = 14
#first create placeholder lists for top 3 docs in each topic 
for t in range(0, n_topics):
    doc_list = [(-1,-1),(-1,-1),(-1,-1)] 
    heapq.heapify(doc_list)
    top_docs[t] = doc_list
#count variable in following is practically doc_id since the index is from 0 with increments of 1
count = 0
for doc in corpus:
    if count % 10000 == 0:
        #this is just to see it progresses, as it sometimes seems slow 
        print(count)
    topics = test_lda[doc] 
    for topic_prob in topics:
        topic_n = topic_prob[0]
        topic_p = topic_prob[1]
        top_list = top_docs[topic_n]
        #count is document id, heapq sorts by first item in tuple
        heapq.heappushpop(top_list, (topic_p, count))
        #above pushes new item, pops lowest item. so pop itself if lowest..
    count += 1

In [None]:
print(top_docs)

For example, topic 3 (or at least that was the number in my run) has 3 documents that are classified as 97% in that topic. So by looking at those documents, maybe we can get an idea of what the topic is about?

In [None]:
#topic 3:
doc_ids = [23682, 35855, 79546] 
temp_df = df.iloc[doc_ids, :] 
temp_df.head()


That is one reviewer, who is trying a lot of Italian wines in this topic (at least when I printed this in my run). Let's take a clearer look at the top docs for all topics:

In [None]:
top_sorted = {}
for topic_id in top_docs:
    heap = top_docs[topic_id]
    sorted_topics = [heapq.heappop(heap) for _ in range(len(heap))] 
    print(str(topic_id)+": "+str(sorted_topics)) 
    top_sorted[topic_id] = sorted_topics


Topic 2 was also all just under 97% when I ran this:

In [None]:
#topic 2:
doc_ids = [58978, 1195, 82788] 
temp_df = df.iloc[doc_ids, :] 
temp_df.head()


Again one reviewer, but this time with some Argentinian wines and one Spanish wine. So let's just look at all the topics then:

(This Kaggle environment seems to cut the size of the results box, and I cannot find how to expand it, so scroll in the results below to see all the topics.)

In [None]:
from IPython.display import display

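for topic_id in top_sorted:
    print("Topic:"+str(topic_id))
    # skip any (-1, -1) placeholders left over if a topic had fewer than 3 documents
    doc_tuples = [doc_tuple for doc_tuple in top_sorted[topic_id] if doc_tuple[1] >= 0]
    doc_ids = [doc_tuple[1] for doc_tuple in doc_tuples]
    doc_weights = [doc_tuple[0] for doc_tuple in doc_tuples]
    # copy so the new column goes into an independent frame (avoids a possible SettingWithCopyWarning)
    temp_df = df.iloc[doc_ids, :].copy()
    temp_df["topic_weight"] = doc_weights
    display(temp_df.head())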



So this is why I said in the beginning that it is nice to have the reviewer information available. The topics actually seem to map to reviewers, especially for the documents that rank highly in a specific topic.
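
One way to check this impression beyond the top 3 documents would be to count, per topic, how large a share of its documents comes from the most frequent reviewer. A rough stdlib sketch, assuming each document has already been given a single dominant topic; the pairs and reviewer names below are made up, not taken from the data:

```python
from collections import Counter, defaultdict

# Hypothetical (dominant_topic, reviewer) pairs; names are placeholders.
assignments = [
    (0, "reviewer_a"), (0, "reviewer_a"), (0, "reviewer_b"),
    (1, "reviewer_b"), (1, "reviewer_b"), (1, "reviewer_b"),
]

# Count reviewers separately for each topic.
per_topic = defaultdict(Counter)
for topic_id, reviewer in assignments:
    per_topic[topic_id][reviewer] += 1

# For each topic: the most frequent reviewer and their share of the topic's documents.
top_share = {}
for topic_id, counts in per_topic.items():
    reviewer, n = counts.most_common(1)[0]
    top_share[topic_id] = (reviewer, n / sum(counts.values()))

print(top_share)
```

Shares close to 1.0 across many topics would support the idea that the topics track reviewers rather than wines.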

I was hoping to find topics revealing something interesting about the wines, such as what best describes some types of wine. Instead, this seems to have identified something about the reviewers' writing styles. Or perhaps there is something deeper in there to be found?