# Word Cloud Generator

## Pre-processing

Text preprocessing and normalization are crucial before building an NLP model. Some of the important steps are:

1. converting words to lower/upper case
2. removing special characters
3. removing stopwords and high/low-frequency words
4. stemming/lemmatization
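
Before going through the steps one by one, here is a minimal end-to-end sketch of the four of them using only the standard library. The stopword set and the `toy_stem` helper are hypothetical stand-ins made up for illustration; the notebook itself relies on NLTK's English stopword list and the Porter stemmer.

```python
import re

# Tiny illustrative stopword set (an assumption; the notebook uses NLTK's English list).
TOY_STOPWORDS = {"the", "is", "a", "and", "of", "to"}

def toy_stem(word):
    """Very crude suffix stripper, for illustration only (not the Porter algorithm)."""
    for suffix in ("ing", "ed", "s"):
        # Only strip when at least three characters remain
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    text = text.lower()                                           # step 1: consistent case
    text = re.sub(r"[^a-z0-9 ]+", " ", text)                      # step 2: drop special characters
    tokens = [t for t in text.split() if t not in TOY_STOPWORDS]  # step 3: drop stopwords
    return [toy_stem(t) for t in tokens]                          # step 4: (toy) stemming

print(preprocess("The service is GREAT, and the staff answered quickly!"))
# ['service', 'great', 'staff', 'answer', 'quickly']
```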

In [None]:
import numpy as np
import pandas as pd
import os
import nltk
from nltk import word_tokenize
from sklearn.feature_extraction.text import CountVectorizer
from collections import Counter
import matplotlib.pyplot as plt
from wordcloud import WordCloud
#from pylab import rcParams
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.metrics import f1_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
#rcParams['figure.figsize'] = 30, 60

%matplotlib inline

In [None]:
os.getcwd()

In [None]:
## Testing data for text, ideally one column with text strings
data = pd.read_excel('Comments.xlsx')
data.shape

In [None]:
data.head(5)

### 1. converting words to lower/upper case:

Let's start by converting all of the words to a consistent case, say lowercase.

In [None]:
data.Comments=data.Comments.astype(str)

In [None]:
## Getting the number of words by splitting them by a space
words = data.Comments.apply(lambda x: len(x.split(" ")))
words.hist(bins = 100)

Summary statistics for the number of words per comment:

In [None]:
words.describe()

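Initial word cloud from the raw comments, before any normalization:

In [None]:
# Join with a space so the last word of one comment does not fuse with the first word of the next
word_cloud = ' '.join(map(str, data.Comments))
print(len(word_cloud))
wordcloud = WordCloud(max_font_size=200, background_color="white",
                      scale=5, width=700, height=400).generate(word_cloud)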
plt.figure()
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.show()

In [None]:
data['text_review'] = data.Comments.apply(lambda x: x.lower())

In [None]:
## Unique tokens in the raw comments
token_lists = [word_tokenize(each) for each in data.Comments]
tokens = [item for sublist in token_lists for item in sublist]
print("Number of unique tokens before lowercasing: ",len(set(tokens)))

## Unique tokens after lowercasing
token_lists_lower = [word_tokenize(each) for each in data.text_review]
tokens_lower = [item for sublist in token_lists_lower for item in sublist]
print("Number of unique tokens after lowercasing: ",len(set(tokens_lower)))

The number of unique tokens has gone down just from normalizing the case.
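
To see why, here is a small sketch on three made-up comments (hypothetical data, split on whitespace rather than with `word_tokenize`): variants such as "Great", "great" and "GREAT" collapse into a single token once lowercased.

```python
# Hypothetical comments, used only to illustrate the effect of case normalization
comments = ["Great service", "great Service", "GREAT value"]

tokens = [t for c in comments for t in c.split()]
tokens_lower = [t.lower() for t in tokens]

print("Unique tokens before lowercasing:", len(set(tokens)))       # 6
print("Unique tokens after lowercasing:", len(set(tokens_lower)))  # 3
```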

### 2. removing special characters:

For the sake of simplicity, we will proceed by removing all of the special characters; keep in mind, however, that this choice may need revisiting depending on the results we get later. The following lists all the special characters in our dataset:

In [None]:
### Selecting non-alphanumeric characters that are not spaces
spl_chars = data.text_review.apply(lambda x: [each for each in list(x) if not each.isalnum() and each != ' '])

## Getting list of list into a single list
flat_list = [item for sublist in spl_chars for item in sublist]

## Unique special characters
set(flat_list)

In [None]:
import re
review_backup = data.text_review.copy()
data.text_review = data.text_review.apply(lambda x: re.sub('[^A-Za-z0-9 ]+', ' ', x))

We can see how our reviews change after removing these:

In [None]:
print("Old Review:")
review_backup.values[0]

In [None]:
print("New Review:")
data.text_review[0]
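
The same regular expression can be tried on a made-up sentence to see what it costs us. This is only a sketch on hypothetical input: note how contractions are split in two and the accented letter is lost, which is the kind of side effect worth revisiting later.

```python
import re

def strip_special(text):
    """Replace every run of characters other than ASCII letters, digits and spaces with one space."""
    return re.sub(r"[^A-Za-z0-9 ]+", " ", text)

# Hypothetical review text, not taken from the dataset
cleaned = strip_special("don't stop, it's 100% café-quality!")
print(cleaned.split())
# ['don', 't', 'stop', 'it', 's', '100', 'caf', 'quality']
```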

The number of unique tokens has dropped further:

In [None]:
token_lists = [word_tokenize(each) for each in data.Comments]
tokens = [item for sublist in token_lists for item in sublist]
print("Number of unique tokens then: ",len(set(tokens)))

token_lists = [word_tokenize(each) for each in data.text_review]
tokens = [item for sublist in token_lists for item in sublist]
print("Number of unique tokens now: ",len(set(tokens)))

Word Cloud for step 2 excluding special characters:

In [None]:
# Join with a space so that adjacent reviews do not fuse into a single token
word_cloud_review = ' '.join(map(str, data.text_review))
print(len(word_cloud_review))
wordcloud = WordCloud(max_font_size=500, background_color="white",
                      scale=6, width=1600, height=800).generate(word_cloud_review)
plt.figure(figsize=(20,10))
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
# Save before plt.show(): once the figure has been shown, a later savefig can write a blank image
plt.savefig('wordcloud.png', bbox_inches='tight')
plt.show()

### 3. removing stopwords and high/low-frequency words (ENGLISH):

Stopwords occur very frequently in English without adding any context-specific insight, so it makes sense to remove them:

In [None]:
## Requires the NLTK stopword corpus; run nltk.download('stopwords') once if it is missing
noise_words = []
stopwords_corpus = nltk.corpus.stopwords
eng_stop_words = stopwords_corpus.words('english')
noise_words.extend(eng_stop_words)
noise_words
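
The heading also mentions high- and low-frequency words. One possible way to filter them is sketched below with `collections.Counter`. The function name, the thresholds (`min_count`, `max_doc_ratio`) and the sample documents are all assumptions chosen for illustration, not values tuned on our data.

```python
from collections import Counter

def drop_extreme_frequency(token_lists, min_count=2, max_doc_ratio=0.9):
    """Drop tokens that are too rare overall or that appear in almost every document."""
    term_counts = Counter(t for tokens in token_lists for t in tokens)
    doc_counts = Counter(t for tokens in token_lists for t in set(tokens))
    n_docs = len(token_lists)
    keep = {
        t for t in term_counts
        if term_counts[t] >= min_count and doc_counts[t] / n_docs <= max_doc_ratio
    }
    return [[t for t in tokens if t in keep] for tokens in token_lists]

# Hypothetical tokenized reviews
docs = [
    ["product", "great", "battery"],
    ["product", "poor", "battery"],
    ["product", "great", "screen"],
]
filtered = drop_extreme_frequency(docs)
print(filtered)
# [['great', 'battery'], ['battery'], ['great']]
```

Here "product" is dropped because it appears in every document, while "poor" and "screen" are dropped because they occur only once.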

### 4. stemming/lemmatization:

Stemming algorithms work by cutting off the end or the beginning of a word, using a list of common prefixes and suffixes.
Lemmatization, on the other hand, relies on morphological analysis: it takes the grammar of the word into account and maps it to its dictionary form (the lemma) instead of reaching a root by brute-force trimming.

Now we are ready for the last part of our pre-processing - **stemming & lemmatization**.

Different forms of a word often communicate essentially the same meaning. For example, there’s probably no difference in intent between a search for `shoe` and a search for `shoes`. The same word may also appear in different tenses; e.g. "run", "ran", and "running". These syntactic differences between word forms are called **inflections**. In general, we probably want to treat inflections identically when featurizing our text.

Sometimes this process is nearly reversible and quite safe (e.g. replacing verbs with their infinitive, so that "run", "runs", and "running" all become "run"). Other times it is a bit dangerous and context-dependent (e.g. replacing comparatives and superlatives with their base form, so that "good", "better", and "best" all become "good"). The more aggressive you are, the greater the potential rewards and risks. For a very aggressive example, you might choose to replace "Zeus" and "Jupiter" with "Zeus" only; this might be OK if you are summarizing myths, confusing if you are working on astronomy, and disastrous if you are working on comparative mythology.
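
The difference between the two approaches can be sketched with toy code. Both helpers below are hypothetical: `toy_stem` is a crude suffix stripper (not the Porter algorithm) and `TOY_LEMMAS` is a hand-written lookup table standing in for the vocabulary a real lemmatizer would consult.

```python
# Hand-written lookup standing in for a real lemmatizer's dictionary (an assumption)
TOY_LEMMAS = {"runs": "run", "running": "run", "ran": "run", "better": "good", "best": "good"}

def toy_stem(word):
    """Strip a known suffix by rule, without checking that the result is a real word."""
    for suffix in ("ing", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def toy_lemmatize(word):
    """Look the word up; fall back to the word itself when it is unknown."""
    return TOY_LEMMAS.get(word, word)

words = ["run", "runs", "running", "ran", "better"]
stems = [toy_stem(w) for w in words]
lemmas = [toy_lemmatize(w) for w in words]
print(stems)   # ['run', 'run', 'runn', 'ran', 'better']
print(lemmas)  # ['run', 'run', 'run', 'run', 'good']
```

The rule-based version leaves "ran" and "better" untouched and produces the non-word "runn", whereas the lookup handles irregular forms but only for words it already knows.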

In [None]:
### Creating a stemming analyzer: tokenize with CountVectorizer's default analyzer, then stem each token
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()
analyzer = CountVectorizer().build_analyzer()
def stemmed_words(doc):
    return (stemmer.stem(w) for w in analyzer(doc))

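Let's create a bag of words from the reviews, excluding the noise words we identified earlier. The vectorizer below tokenizes with `word_tokenize` and does not apply `stemmed_words`, so the features are unstemmed: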

In [None]:
### Bag-of-words counts over 1- to 4-grams, with the NLTK stop words removed
bow_counts = CountVectorizer(tokenizer=word_tokenize, stop_words=noise_words,
                             ngram_range=(1, 4))
bow_data = bow_counts.fit_transform(data.text_review)
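
To make `ngram_range` concrete, here is a rough standard-library sketch of what counting n-grams means: every run of consecutive tokens of each length in the range becomes its own feature. The `ngrams` helper and the sample sentence are made up for illustration, and the range is limited to 1-2 to keep the output short.

```python
from collections import Counter

def ngrams(tokens, n_min=1, n_max=2):
    """Yield every run of n_min..n_max consecutive tokens, joined by a space."""
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            yield " ".join(tokens[i:i + n])

# Hypothetical review, already lowercased and cleaned
counts = Counter(ngrams("good food good service".split()))
print(counts["good"])       # 2
print(counts["good food"])  # 1
print(len(counts))          # 6 distinct features: 3 unigrams + 3 bigrams
```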

In [None]:
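## get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() is its replacement
final=bow_counts.get_feature_names_out()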

In [None]:
# Join with a space so that separate features do not fuse into a single token
word_cloud_review_F = ' '.join(map(str, final))
print(len(word_cloud_review_F))
wordcloud_F = WordCloud(max_font_size=500, background_color="white",
                        scale=6, width=1600, height=800).generate(word_cloud_review_F)
plt.figure(figsize=(20,10))
plt.imshow(wordcloud_F, interpolation="bilinear")
plt.axis("off")
# Save before plt.show(): once the figure has been shown, a later savefig can write a blank image
plt.savefig('wordcloudF.png', bbox_inches='tight')
plt.show()

In [None]:
print(bow_counts.get_stop_words())

Let's re-featurize our original set of reviews with TF-IDF, which down-weights terms that appear in many reviews:

In [None]:
### Creating a python object of the class TfidfVectorizer
### Changes: including 1-4 grams in the tf-idf data (no stop words are removed here)

tfidf_counts = TfidfVectorizer(tokenizer=word_tokenize,
                               ngram_range=(1,4))
tfidf_data = tfidf_counts.fit_transform(data.text_review)
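
For intuition about the weights, here is a hand-rolled sketch of TF-IDF on three made-up documents. It assumes raw term counts, the smoothed IDF `ln((1 + n) / (1 + df)) + 1` and L2 normalization, which is how scikit-learn documents the `TfidfVectorizer` defaults; the `tfidf` function itself is a hypothetical helper, not part of any library.

```python
import math
from collections import Counter

def tfidf(token_lists):
    """Return one {term: weight} dict per document, L2-normalized."""
    n_docs = len(token_lists)
    doc_freq = Counter(t for tokens in token_lists for t in set(tokens))
    # Smoothed inverse document frequency (assumed to match scikit-learn's default)
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}
    rows = []
    for tokens in token_lists:
        weights = {t: count * idf[t] for t, count in Counter(tokens).items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        rows.append({t: w / norm for t, w in weights.items()} if norm else {})
    return rows

# Hypothetical tokenized reviews: "good" appears everywhere, the other words only once
docs = [["good", "food"], ["good", "service"], ["good", "price"]]
rows = tfidf(docs)
print(rows[0])
```

In each document the word shared by all reviews ("good") ends up with a lower weight than the word unique to that review.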

In [None]:
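## get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() is its replacement
final=tfidf_counts.get_feature_names_out()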

In [None]:
# Join with a space so that separate features do not fuse into a single token
word_cloud_review_F = ' '.join(map(str, final))
print(len(word_cloud_review_F))
wordcloud_F = WordCloud(max_font_size=500, background_color="white",
                        scale=6, width=1600, height=800).generate(word_cloud_review_F)
plt.figure(figsize=(20,10))
plt.imshow(wordcloud_F, interpolation="bilinear")
plt.axis("off")
# Save before plt.show(): once the figure has been shown, a later savefig can write a blank image
plt.savefig('wordcloudF.png', bbox_inches='tight')
plt.show()