# NLP Workshop Part-1
### Text as Data

---

Text is an extremely rich source of data. Tapping into it can yield great insights, but building a pipeline from raw text to metrics and visualizations often feels like it takes more grit and luck than anything else.

In this first workshop we'll take a high-level look at some of the fundamentals of Natural Language Processing, or NLP, from cleaning data to extracting insights.

---

**Contents:**
* Cleaning/Preprocessing Text (roughly 80-95% of the effort)
    - tokenization, removal (punctuation, stopwords), stemming/lemmatization
* Analyzing Text (roughly 5-20% of the effort)
    - word clouds, TF-IDF, clustering

---

# Phase 1: Initial Approach

In [None]:
# Example install, un-comment the line below to run

# !pip install wordcloud

In [None]:
# standard imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# data storage
import pickle

# elemental libraries for text data
import string
import re
from collections import Counter

# fancy NLP libraries
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.util import ngrams
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from nltk import WordNetLemmatizer
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# clustering imports
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

# visuals
from wordcloud import WordCloud

In [None]:
# this is copy paste from a google search of making a wordcloud

def make_cloud(freq_dict):
    '''make a wordcloud from a word frequency dictionary'''
    
    wordcloud = WordCloud()
    wordcloud.generate_from_frequencies(frequencies=freq_dict)

    plt.figure(figsize=(12, 10))
    plt.imshow(wordcloud, interpolation="bilinear")
    plt.axis("off")
    plt.show()

In [None]:
# reading in pickled data
# note: the with-block closes the file once loading is done

with open('job_descriptions.pkl', 'rb') as f:
    job_text = pickle.load(f)
len(job_text)

In [None]:
# first job description

example_text = job_text[0]

print(example_text)

# Tokenization

Tokens are the elemental - or indivisible - pieces of a text document. We can break text into words, n-grams, and sentences.
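
To make the three kinds of tokens concrete, here is a minimal pure-Python sketch. It is only a rough stand-in for what a real tokenizer does: the helper names (`simple_word_tokens`, `simple_ngrams`), the regex rules, and the sample text are all assumptions made up for illustration.

```python
import re

def simple_word_tokens(text):
    """Split text into lowercase word tokens (a rough stand-in for a real tokenizer)."""
    return re.findall(r"[a-z0-9']+", text.lower())

def simple_ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return list(zip(*(tokens[i:] for i in range(n))))

sample = "We build data pipelines. We clean data."  # hypothetical text

words = simple_word_tokens(sample)
bigram_tokens = simple_ngrams(words, 2)
# naive sentence split: break after ., ! or ? when followed by whitespace
sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", sample) if s.strip()]

print(words)
print(bigram_tokens)
print(sentences)
```

The naive sentence split breaks on abbreviations like "Dr." or "e.g.", which is a big part of why sentence tokenization is the tricky one.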

In [None]:
# tokens generated from single words

tokens = word_tokenize(example_text)
tokens[:10]

In [None]:
# using Counter to count words

Counter(tokens).most_common(10)

In [None]:
# take a look

make_cloud(Counter(tokens))

In [None]:
# using "tokens" we can call ngrams to get chunks of words as tokens

list(ngrams(tokens[:20], 2))

In [None]:
# combining the words within each token, and calling wordcloud

bigrams = [' '.join(gram) for gram in ngrams(tokens, 2)]

make_cloud(Counter(bigrams))

In [None]:
# tokenizing sentences is tricky

sent_tokenize(example_text)[:3]

---
# Phase 2: Cleaning Up

Know how to clean your data.

# Cleaning Up - identifying and removing stuff

Something something, data science is art. This step will always be subject to your data, your business problem, and politics. 

"Knowing what you can and can't do makes you a great analyst, knowing what you should and shouldn't do makes you the manager of analysts."

-Dr. Hugh Watson, probably
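
Before removing anything, it helps to see what is actually in the text. The sketch below is one possible way to take inventory of everything that is not a letter or a space, using only the standard library. The helper name `char_inventory` and the sample string are assumptions for illustration.

```python
import string
import unicodedata
from collections import Counter

sample = "We’re hiring: 3 “senior” analysts!\n"  # hypothetical text

def char_inventory(text):
    """Map each non-letter, non-space character to (Unicode category, count)."""
    counts = Counter(ch for ch in text if not ch.isalpha() and ch != " ")
    return {ch: (unicodedata.category(ch), n) for ch, n in counts.items()}

inventory = char_inventory(sample)
for ch, (category, n) in sorted(inventory.items()):
    print(repr(ch), category, n)

# characters that string.punctuation, string.digits, and '\n' would NOT catch
ascii_removals = string.punctuation + string.digits + "\n"
missed = sorted(ch for ch in inventory if ch not in ascii_removals)
print(missed)
```

`string.punctuation` only covers ASCII, so curly quotes and apostrophes slip through unless they are added by hand.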

In [None]:
# check every character in our document

unique_characters = {i for i in example_text}
unique_characters

In [None]:
# characters in string.punctuation

string.punctuation

In [None]:
# removing everything but letters
# also, moving to lower-case only

letters = {i.lower() for i in unique_characters if i not in string.punctuation + string.digits}
letters

In [None]:
# dropping the leftover non-letters (newline and curly quotes)

really_only_letters = {i for i in letters if i not in ['\n', '’', '“', '”']}
really_only_letters  # and spaces, because a blob of text without spaces would be even worse

In [None]:
# bringing it all together, let's reduce our text to lower-case letters and spaces

non_char = string.punctuation + string.digits + '\n' + '’' + '“' + '”'

cleaner_text = re.sub('[%s]' % re.escape(non_char), ' ', example_text.lower())
cleaner_text

# Cleaning Up - stop words

Stop words: common words like "and, this, that" which have little to no value for semantic analysis.

Removing stop words cuts the noise so the relevant terms stand out.

"The dog ran really fast." -> "dog ran fast"

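The "dog ran fast" example can be reproduced with a few lines of plain Python. This is only a sketch: `toy_stop_words` is a hand-picked, hypothetical list (a real stop word list is much longer and may not contain every word used here), and `remove_stop_words` is a made-up helper name.

```python
# Hypothetical stop list for illustration only
toy_stop_words = {"the", "really", "a", "an", "and", "this", "that"}

def remove_stop_words(text, stop_words):
    """Lowercase, strip surrounding punctuation from each word, and drop stop words."""
    words = (w.strip(".,!?;:\"'") for w in text.lower().split())
    return " ".join(w for w in words if w and w not in stop_words)

result = remove_stop_words("The dog ran really fast.", toy_stop_words)
print(result)
```

Keeping the stop words in a set makes each membership check fast, which starts to matter once there are thousands of documents.
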
In [None]:
# some stop words

sorted(stopwords.words('english'))[:10]

In [None]:
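# simply remove the stopwords
# note: using split() and join together also removes extra spaces
# note: build the stop word set once instead of reloading the list for every word

stop_words = set(stopwords.words('english'))

even_cleaner_text = ' '.join([i for i in cleaner_text.split() 
                              if i not in stop_words])
even_cleaner_text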

# Cleaning Up - stemming and lemmatization

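Stemming: chopping off word endings with fixed rules to reach a base form (the result is not always a real word)

Lemmatization: mapping a word to its dictionary form (its lemma) using a vocabulary and the word's part of speech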

Essentially, the goal here is to make our lexicon significantly smaller - i.e. reduce complexity.
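
To see the difference between the two ideas without any library, here is a toy sketch. The suffix rules in `toy_stem` are deliberately crude and are not the Porter algorithm, and `toy_lemmas` is a hypothetical mini-dictionary standing in for a full lexicon.

```python
def toy_stem(word):
    """Strip the first matching suffix; a crude rule-based stemmer, not Porter."""
    for suffix in ("ning", "ners", "ner", "ing", "s"):
        # only strip when at least three characters would remain
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Hypothetical mini-dictionary; a real lemmatizer consults a full lexicon
toy_lemmas = {"ran": "run", "runs": "run", "running": "run", "runners": "runner"}

def toy_lemmatize(word):
    """Look the word up in the dictionary, falling back to the word itself."""
    return toy_lemmas.get(word, word)

words = ["ran", "run", "runs", "running", "runner", "runners"]
stems = [toy_stem(w) for w in words]
lemmas = [toy_lemmatize(w) for w in words]

for w, s, l in zip(words, stems, lemmas):
    print('{:8} stem: {:6} lemma: {}'.format(w, s, l))
```

Both approaches shrink six forms down to two here, but differently: the rules cannot connect the irregular "ran" to "run", while the dictionary can, and the dictionary keeps "runner" as its own word.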

In [None]:
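# quick example
# note: lemmatize() treats every word as a noun unless a part of speech is passed,
# e.g. lemmer.lemmatize('running', pos='v')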

stemmer = PorterStemmer()  # stemming class
lemmer = WordNetLemmatizer()  # lemmatization class

same_things = ['ran', 'run', 'runs', 'running', 'runner', 'runners']

for i in same_things:
    print('Stem - {}  ;  Lemma - {}'.format(stemmer.stem(i), lemmer.lemmatize(i)))

In [None]:
# squeaky clean

cleanest_text = ' '.join([lemmer.lemmatize(i) for i in even_cleaner_text.split()])
cleanest_text

In [None]:
# time for another cloud

tokens = word_tokenize(cleanest_text)
bigrams = [' '.join(gram) for gram in ngrams(tokens, 2)]

make_cloud(Counter(tokens))

In [None]:
make_cloud(Counter(bigrams))

In [None]:
# compiling all of the prior stuff into a function

def clean(text):
    '''
    Lowercase the text, replace punctuation, digits, and curly quotes with
    spaces, drop English stop words, and lemmatize whatever is left.

    Requires: string, re, nltk.corpus.stopwords, nltk.WordNetLemmatizer
    '''
    # remove stuff
    non_char = string.punctuation + string.digits + '\n' + '’' + '“' + '”'
    cleaner_text = re.sub('[%s]' % re.escape(non_char), ' ', text.lower())
    # set up the lemmatizer and load the stop words once per call
    lemma = WordNetLemmatizer()
    stop_words = set(stopwords.words('english'))
    # combine lemma and stopword removal
    clean_text = ' '.join([lemma.lemmatize(i) for i in cleaner_text.split() 
                           if i not in stop_words])
    
    return clean_text


# clean all text

clean_jobs = [clean(i) for i in job_text]

---
# Phase 3: Analysis

# Analysis - CountVectorizer

It's easy enough to apply the above to all of our text with a function, but we need some way to manipulate it all as data. Counter handles one document; CountVectorizer handles many, building a matrix with one row per document and one column per term.
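
As a rough picture of what such a document-term matrix looks like, here is a sketch built from nothing but Counter. The three documents are hypothetical, and the 0.1 / 0.9 document-frequency cut-offs are an assumption meant to mirror a `min_df` / `max_df` style filter on the fraction of documents containing each term.

```python
from collections import Counter

docs = ["data analyst sql", "data engineer python sql", "python data data"]  # hypothetical

doc_counts = [Counter(doc.split()) for doc in docs]
vocabulary = sorted(set().union(*doc_counts))

# one row per document, one column per vocabulary term
matrix = [[counts[term] for term in vocabulary] for counts in doc_counts]

# fraction of documents that contain each term
doc_freq = {term: sum(term in counts for counts in doc_counts) / len(docs)
            for term in vocabulary}

# keep terms that are neither too rare nor too common
kept = [term for term in vocabulary if 0.1 <= doc_freq[term] <= 0.9]

print(vocabulary)
for row in matrix:
    print(row)
print(kept)
```

"data" shows up in every document, so the upper cut-off drops it: a word that appears everywhere says little about any one document.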

In [None]:
# spin up CountVectorizer and test it out on the cleanest text

cv = CountVectorizer()

X = cv.fit_transform([cleanest_text])
pd.DataFrame(X.toarray(), columns=cv.get_feature_names_out())

In [None]:
# CV has ngrams builtin

cv = CountVectorizer(ngram_range=(2, 2))

X = cv.fit_transform([cleanest_text])
pd.DataFrame(X.toarray(), columns=cv.get_feature_names_out())

In [None]:
# now run it on everything

cv = CountVectorizer(min_df=0.1, max_df=0.9)

X = cv.fit_transform(clean_jobs)
counts = pd.DataFrame(X.toarray(), columns=cv.get_feature_names_out())
counts

In [None]:
make_cloud(counts.sum(axis=0))

In [None]:
cv = CountVectorizer(ngram_range=(2, 2), min_df=0.1, max_df=0.9)

X = cv.fit_transform(clean_jobs)
bigram_counts = pd.DataFrame(X.toarray(), columns=cv.get_feature_names_out())

make_cloud(bigram_counts.sum(axis=0))

# Analysis - TFIDF

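Counts are a good source of information, but Term Frequency-Inverse Document Frequency (TF-IDF) goes a step further: it weights each term by how often it appears in a document, discounted by how many documents contain it. Terms that show up everywhere score low; terms that are frequent in only a few documents score high.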
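
The weighting can be worked out by hand on a tiny example. The sketch below assumes a smoothed inverse document frequency, `ln((1 + n) / (1 + df)) + 1`, followed by scaling each document's row to unit length; to the best of my knowledge that matches scikit-learn's default settings, but treat it as an assumption. The documents are hypothetical.

```python
import math
from collections import Counter

docs = ["data analyst sql", "data engineer python", "data data python"]  # hypothetical

doc_counts = [Counter(doc.split()) for doc in docs]
n_docs = len(docs)
vocabulary = sorted(set().union(*doc_counts))

# number of documents containing each term
doc_freq = {t: sum(t in counts for counts in doc_counts) for t in vocabulary}

# smoothed inverse document frequency: rarer terms get a larger weight
idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocabulary}

def tfidf_row(counts):
    """Multiply raw counts by idf, then scale the row to unit (L2) length."""
    raw = [counts[t] * idf[t] for t in vocabulary]
    norm = math.sqrt(sum(v * v for v in raw)) or 1.0
    return [v / norm for v in raw]

tfidf = [tfidf_row(counts) for counts in doc_counts]

for term in vocabulary:
    print(term, round(idf[term], 3))
```

In the first document "analyst" and "data" each appear once, yet "analyst" ends up with the larger weight because only one document contains it.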

In [None]:
# straight to the df

tf = TfidfVectorizer(ngram_range=(1, 2), min_df=0.1, max_df=0.9)

X = tf.fit_transform(clean_jobs)
scores = pd.DataFrame(X.toarray(), columns=tf.get_feature_names_out())
scores

In [None]:
# reduce feature-space and cluster

pca = PCA(n_components=4)
pca.fit(scores)
pca_scores = pca.transform(scores)

k = 3
km = KMeans(n_clusters=k, random_state=42)
km.fit(pca_scores)

In [None]:
# cluster labels

km.labels_

In [None]:
# clouds of clusters

scores['cluster_label'] = km.labels_

for i in range(k):
    cluster = scores[scores['cluster_label'] == i].iloc[:,:-1]
    make_cloud(cluster.sum(axis=0))