# NLP-2

---

**text preprocessing, document similarity, and a project**

Our first foray into Natural Language Processing didn't go very deep, so now it's time to dig in a little. We're returning to the jobs data - I'll be using the pickle file in the data folder, so if something I'm writing doesn't make sense, it's probably because we're using different data.

For NLP projects, you should expect to spend most of your time on data prep (80% is the rule of thumb that gets quoted a lot). Why so much? Text data can be extremely messy, and if you haven't worked with a particular dataset before then you'll probably have to circle back to the preprocessing stage multiple times. It's definitely worth it though, because extracting insights from text data can have significant value for a business, whether you're analyzing the sentiment of tweets linked to followers of your business, using topic modeling on call transcripts, or detecting fraudulent transactions.

If you find that this notebook piques your interest, then I'd recommend that you explore the docs of [sklearn](https://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html), [spacy](https://spacy.io/), [nltk](https://www.nltk.org/), and [gensim](https://radimrehurek.com/gensim/). There are a ton of topics to cover (from dimensionality reduction to cosine similarity), each one of them will take some time to sink in, and learning how and when to use them all is truly an art. I can only go so far into the bits and pieces that are covered below.

At the end of this one there's a project. It's on you to do it or not - obviously there aren't any grades to go around - but if you do choose to tackle it, I'd be happy to give guidance or feedback.

**Contents:**
1. text preprocessing: 
  * tokenization, removal, stemming
  * tfidf
2. text analysis:
  * document similarity
  * word importances
  * word clouds
3. project idea

---

### Text Preprocessing

Always start off by reading through some of your documents. If you don't know the context of your text, then your analysis will suffer, your models won't make sense, blah blah blah

In [None]:
# standard imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# data storage
import pickle

In [None]:
# read in the raw data
# you can move the file to this folder, or rename 'jobs.pkl' to its full path
# something like "~/Documents/Cult-Terry/data/jobs.pkl"

with open('jobs.pkl', 'rb') as f:  # the with-block closes the file for us
    raw_jobs_data = pickle.load(f)

example = raw_jobs_data[0][1]  # [first job][job description]
print(example)

In [None]:
# each word, symbol, number is its own entity (token)
# this step can be modified to exclude words of certain length,
# non alphanumeric symbols, etc

from nltk.tokenize import word_tokenize

word_tokenize(example)[:10]
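
If you do want to filter tokens by length or drop the non-alphanumeric ones, a minimal sketch could look like the cell below. `filter_tokens` and `sample_tokens` are made-up names for illustration (they aren't part of nltk), and the hand-written list just stands in for the output of `word_tokenize(example)`.

```python
# hypothetical helper (not part of nltk): keep only alphabetic tokens
# that are at least `min_len` characters long
def filter_tokens(tokens, min_len=3):
    return [t for t in tokens if t.isalpha() and len(t) >= min_len]

# a hand-written token list standing in for word_tokenize(example)
sample_tokens = ['We', 'are', 'hiring', '3', 'data', 'scientists', '(', 'remote', ')', '!']

filtered = filter_tokens(sample_tokens)
print(filtered)
```

Whether a length cutoff helps depends on the data: short tokens like 'R' or 'C' might matter a lot in job descriptions, so treat `min_len` as something to experiment with.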

In [None]:
# part-of-speech tagging is a difficult task. It's case sensitive, so
# try printing more tokens and you'll see that most capitalized words
# are labeled as proper nouns (NNP)

from nltk.tag import pos_tag

tokens = pos_tag(word_tokenize(example))
tokens[:10]

In [None]:
import string

# removing proper nouns and punctuation
clean_tokens = [t for t in tokens if t[1] != 'NNP' and t[0] not in string.punctuation]
clean_text = ' '.join([t[0] for t in clean_tokens])  # stitching the text back together
clean_text

In [None]:
# a bunch of non-alphanumeric characters made it through
import re

# this regex statement will get rid of all non-letters, and make our text lower-case
cleaner_text = re.sub('[^a-zA-Z]', ' ', clean_text).lower()
cleaner_text

In [None]:
# we already saw this in NLP-1, CountVectorizer gives counts of all the words
from sklearn.feature_extraction.text import CountVectorizer

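cv = CountVectorizer()  # initialize the class
X = cv.fit_transform([cleaner_text])  # cv reads iterables, so pass the string as a list object
# get_feature_names() was removed in newer sklearn releases; get_feature_names_out() replaces it
pd.DataFrame(X.toarray(), columns=cv.get_feature_names_out())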

In [None]:
# stemming is the process of using an algorithm for suffix stripping
# e.g. cats -> cat  running -> run
# the result isn't always a real word (industrialized comes out as
# something like industri), but related words end up sharing a stem
# there are various algorithms out there, but nltk's PorterStemmer
# is derived from Martin Porter's algorithm that he proposed in 1980
# (and then continued to update for decades)

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# split() with no argument skips the empty strings left behind by repeated spaces
stemmed = [stemmer.stem(word) for word in cleaner_text.split()]
stemmed[:10]

In [None]:
# stemming is a brute-force approach to dimensionality reduction, or
# normalization, of text data. We can also drop stop-words for further
# d-reduction

stemmed_text = ' '.join(stemmed)

cv = CountVectorizer(stop_words='english')  # the default is stop_words=None, so we have to ask for this
X = cv.fit_transform([stemmed_text])
pd.DataFrame(X.toarray(), columns=cv.get_feature_names_out())

In [None]:
# let's break to a toy dataset to gently segue into tfidf

# say we have the following 5 sentences

sent_1 = 'cold weather events are bad for roads'
sent_2 = 'she thinks that cold sandwiches are best'
sent_3 = 'he thinks hot sandwiches with mustard taste better'
sent_4 = 'it was a cold and dreary morning'
sent_5 = 'we need freezers that keep ice cold'
all_sentences = [sent_1, sent_2, sent_3, sent_4, sent_5]

# with simple counts our data looks like this
cv = CountVectorizer()
X = cv.fit_transform(all_sentences)
pd.DataFrame(X.toarray(), columns=cv.get_feature_names_out())

In [None]:
# to compare our documents we can use cosine similarity.
# cosine similarity is the cosine of the angle between two vectors: if they
# point in exactly the same direction (same words in the same proportions)
# we get a 1, and if they share no words at all we get a 0
# ...hit shift+tab on cosine_similarity(X) below to read more

from sklearn.metrics.pairwise import cosine_similarity

cosine_similarity(X)
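
To see what's going on under the hood, here's a hand-rolled version for a single pair of vectors. This is only a sketch: `cosine_sim` and the toy vectors are made up for illustration, and sklearn's `cosine_similarity` does the same thing for every pair of rows at once.

```python
import numpy as np

def cosine_sim(a, b):
    """cosine of the angle between two 1-D vectors"""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# toy count vectors over a hypothetical 4-word vocabulary
doc_a = [1, 1, 0, 0]
doc_b = [2, 2, 0, 0]   # same words in the same proportions as doc_a
doc_c = [0, 0, 1, 1]   # no words in common with doc_a

sim_ab = cosine_sim(doc_a, doc_b)
sim_ac = cosine_sim(doc_a, doc_c)
print(sim_ab, sim_ac)
```

Notice that `doc_b` is twice as "long" as `doc_a` but still scores (up to floating point) a 1: cosine similarity only cares about direction, which is why it works well for comparing documents of different lengths.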

In [None]:
# the closest matches to 'she thinks that cold sandwiches are best' are
# 'cold weather events are bad for roads' and 'we need freezers that keep ice cold'
# even though contextually we know that 'he thinks hot sandwiches with mustard taste better'
# is closer.
# we need to weight the words so that 'sandwiches' is more important than 'cold'

# tf-idf: term frequency-inverse document frequency, is just the tool we need
# cv gave us the term frequency, so the new part is the -idf
# you can find the math here: https://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction
# but essentially we're going to score each word by how rare it is across all
# sentences and multiply the term frequency by that score, so words that show
# up everywhere (like 'cold') count for less

from sklearn.feature_extraction.text import TfidfVectorizer

# corrected data
tfidf = TfidfVectorizer()
X = tfidf.fit_transform(all_sentences)
pd.DataFrame(X.toarray(), columns=tfidf.get_feature_names_out()).round(2)  # compare scores of 'cold' and 'sandwiches'
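
If you'd like to check the idf part by hand, the sketch below applies the smoothed formula that the sklearn docs give for the default settings, idf = ln((1 + n) / (1 + df)) + 1, where n is the number of documents and df is how many of them contain the word. `smooth_idf` and `sentences` are names I made up for this cell, and splitting on whitespace is a simplification of sklearn's tokenizer.

```python
import math

sentences = [
    'cold weather events are bad for roads',
    'she thinks that cold sandwiches are best',
    'he thinks hot sandwiches with mustard taste better',
    'it was a cold and dreary morning',
    'we need freezers that keep ice cold',
]

def smooth_idf(term, docs):
    """idf = ln((1 + n_docs) / (1 + doc_freq)) + 1"""
    n_docs = len(docs)
    doc_freq = sum(term in doc.split() for doc in docs)  # docs containing the term
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1

idf_cold = smooth_idf('cold', sentences)
idf_sandwiches = smooth_idf('sandwiches', sentences)
print(round(idf_cold, 3), round(idf_sandwiches, 3))
```

'cold' shows up in four of the five sentences, so it gets a smaller idf than 'sandwiches', which shows up in two. The numbers in the table above won't match these exactly, because `TfidfVectorizer` also scales each row to unit length by default.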

In [None]:
# and now we can see that the sandwich sentences are closest

cosine_similarity(X)

In [None]:
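# back to our original data, using tfidf
# (with only one document the idf is the same for every word, so these
# scores are just normalized counts; idf kicks in once we pass in all the docs)

tfidf = TfidfVectorizer(stop_words='english')

X = tfidf.fit_transform([stemmed_text])
pd.DataFrame(X.toarray(), columns=tfidf.get_feature_names_out())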

In [None]:
# to gather more context we can include bigrams, or groups of two words
# ex: 'the dog ran' -> ['the dog', 'dog ran']

tfidf = TfidfVectorizer(stop_words='english',
                        ngram_range=(1, 2))

X = tfidf.fit_transform([stemmed_text])
pd.DataFrame(X.toarray(), columns=tfidf.get_feature_names_out())

# this more than doubles the number of features that we have, but if we
# exclude words and bigrams that appear in almost every document or in
# very few, then we can reduce the noise
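
The n-gram idea itself is tiny, and can be sketched in plain Python. `ngrams` here is a hypothetical helper, not sklearn's implementation; as far as I know the vectorizer also removes stop words before it builds the n-grams, which this version doesn't do.

```python
def ngrams(tokens, n):
    """return every run of n consecutive tokens, joined by spaces"""
    return [' '.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

words = 'the dog ran'.split()

unigrams = ngrams(words, 1)
bigrams = ngrams(words, 2)

# ngram_range=(1, 2) keeps both lists as features
print(unigrams + bigrams)
```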

In [None]:
# time to clean all of our documents. We can copy-paste our previous
# code into a function, and even save it off in a .py script for other
# projects, modifying parts if we need

def text_cleaner(raw_docs):
    '''
    general text cleaning for a list of documents:
    drops proper nouns and punctuation, keeps letters only, lower-cases, stems
    '''
    # imports live inside the function so it still works when copied to a script
    from nltk.tokenize import word_tokenize
    from nltk.tag import pos_tag
    import string
    import re
    from nltk.stem import PorterStemmer

    stemmer = PorterStemmer()  # build once and reuse for every document
    clean_docs = []

    for doc in raw_docs:

        tokens = pos_tag(word_tokenize(doc))

        clean_tokens = [t for t in tokens if t[1] != 'NNP' and t[0] not in string.punctuation]
        clean_text = ' '.join([t[0] for t in clean_tokens])

        cleaner_text = re.sub('[^a-zA-Z]', ' ', clean_text).lower()

        # split() with no argument skips the empty strings left by repeated spaces
        clean_docs.append(' '.join([stemmer.stem(word) for word in cleaner_text.split()]))

    return clean_docs

In [None]:
# loop over all of the documents
clean_docs = text_cleaner([doc[1] for key, doc in raw_jobs_data.items()])

# vectorize
tfidf = TfidfVectorizer(stop_words='english',
                        max_df=0.8,  # remove features that appear in > 80% of documents
                        min_df=0.1,  # and features that appear in < 10%
                        ngram_range=(1, 3))  # uni, bi, and trigrams

X = tfidf.fit_transform(clean_docs)
tfidf_df = pd.DataFrame(X.toarray(), columns=tfidf.get_feature_names_out())
tfidf_df
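
To make `min_df` and `max_df` a little less magical, here's a rough sketch of the same filtering done by hand on a toy corpus. `document_frequency_filter` and `toy_docs` are made-up names, whitespace splitting stands in for the real tokenizer, and the cutoffs are different from the ones above so that the tiny example has something to show.

```python
def document_frequency_filter(docs, min_df=0.1, max_df=0.8):
    """keep terms whose share of documents lies within [min_df, max_df]"""
    n_docs = len(docs)
    doc_freq = {}
    for doc in docs:
        for term in set(doc.split()):  # count each term once per document
            doc_freq[term] = doc_freq.get(term, 0) + 1
    return sorted(t for t, df in doc_freq.items()
                  if min_df <= df / n_docs <= max_df)

toy_docs = [
    'python sql data',
    'python data excel',
    'python data tableau',
    'python data sql',
    'python r',
]

kept = document_frequency_filter(toy_docs, min_df=0.3, max_df=0.8)
print(kept)
```

'python' is in every document so it can't help tell them apart, and 'excel', 'tableau', and 'r' each appear only once, so all four get dropped.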

---

### Text Analysis

In [None]:
# now we can grab cosine similarity to find which jobs are similar

similarities = pd.DataFrame(cosine_similarity(tfidf_df))
similarities

In [None]:
# if we go through each row and find the top two numbers we'll get
# the document we're looking at (because it's closest to itself) and
# the most similar document to it

top_pairs = []

for i in similarities.index:  # iterate over the indices
    top = similarities.nlargest(2, i)[i].drop(i)  # take the top two in column i, then drop self to keep the closest match
    top_pairs.append((i, top.index[0], top.values[0]))  # store self, index of match, and cosine
    
sorted(top_pairs, key=lambda x: x[2], reverse=True)[:5]  # sort by cosine and show the top 5

In [None]:
# hmm... something is fishy

len(similarities.drop_duplicates())

In [None]:
# some jobs were posted multiple times, so drop the duplicates
similarities = similarities.drop_duplicates()


# and try again
top_pairs = []

for i in similarities.index:
    top = similarities.nlargest(2, i)[i].drop(i)
    top_pairs.append((i, top.index[0], top.values[0]))
    
sorted(top_pairs, key=lambda x: x[2], reverse=True)[:5]
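
For bigger datasets the loop can get slow, so here's an alternative sketch that does the same lookup with numpy: mask the diagonal so a document can't match itself, then take the largest value in each row. The 4x4 matrix is made up for illustration, and this isn't the approach used above, just another way to get there.

```python
import numpy as np

# hypothetical 4x4 cosine-similarity matrix (symmetric, ones on the diagonal)
sims = np.array([
    [1.0, 0.2, 0.7, 0.1],
    [0.2, 1.0, 0.3, 0.9],
    [0.7, 0.3, 1.0, 0.4],
    [0.1, 0.9, 0.4, 1.0],
])

masked = sims.copy()
np.fill_diagonal(masked, -np.inf)    # a document can no longer match itself
best_match = masked.argmax(axis=1)   # closest other document for each row
best_score = masked.max(axis=1)

# (document, closest match, cosine), sorted by cosine
pairs = sorted(zip(range(len(sims)), best_match.tolist(), best_score.tolist()),
               key=lambda p: p[2], reverse=True)
print(pairs)
```

Note that this works on row positions, so if you've dropped rows from a DataFrame you'd need to map the positions back to the original index.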

In [None]:
# the top two seem remarkably close, check that they aren't duplicates

raw_jobs_data[67][0], raw_jobs_data[188][0]

In [None]:
# time for a visualization of our words - word clouds!

from wordcloud import WordCloud

# copy pasted from notebook 05

def make_cloud(freq_dict):
    '''make a wordcloud from a word frequency dictionary'''
    
    wordcloud = WordCloud()
    wordcloud.generate_from_frequencies(frequencies=freq_dict)

    plt.figure(figsize=(12, 10))
    plt.imshow(wordcloud, interpolation="bilinear")
    plt.axis("off")
    plt.show()

In [None]:
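# word importances can be pulled from tfidf directly by taking the idf score
# and matching that to the word or n-gram. Keep in mind that idf measures how
# rare a feature is across documents, so the top of this list is the rarest
# features that survived the min_df cutoff

word_importances = list(zip(tfidf.get_feature_names_out(), tfidf.idf_))

word_importances = sorted(word_importances, key=lambda x: x[1], reverse=True)

word_importances[:10]  # the 10 highest-idf features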

In [None]:
# some of these words are helpful, and some are just from the EEO statements

make_cloud(dict(word_importances))

In [None]:
# now compare with counts (and similar parameters)

cv = CountVectorizer(ngram_range=(1, 3),
                     max_df=0.8,
                     min_df=0.1,
                     stop_words='english')
vocab = cv.fit_transform(clean_docs)
wordcount_df = pd.DataFrame(data=vocab.toarray(), columns=cv.get_feature_names_out())

sorted_words = wordcount_df.sum().sort_values(ascending=False)

make_cloud(sorted_words.to_dict())

---

### Project Idea

The best way to learn coding, or anything, is practice. So, here's a project idea for those of you who want to put it on your resume, or want to improve your skills, or are just bored. It will require knowledge from the previous notebooks.

**Project:**

What insight can you derive from job descriptions using NLP, and supervised or unsupervised learning methods? The result can be clusters of job descriptions, a classification model, etc.

**Data:**

* whatever you already have
* more jobs scraped from the same or a different source

**Tools:**

* selenium
* pandas/numpy
* matplotlib/seaborn/wordcloud
* sklearn/nltk
* any others you need

**Methods:**

* data scraping
* data visualization
* text data preprocessing
* supervised/unsupervised modeling of text