# NLP-1

---

**word clouds, intro to NLP and sklearn**

Way, way, way back, before the MSBA bootcamp, we were supposed to use some word-cloud generating software. Now we're going to do the same thing with Python, and by the end we'll be able to create our own word clouds and even manipulate text data.

**Contents:**
1. Words as Data
2. Word Clouds

---

### Words as Data:

In the data scraping notebook, we grabbed a bunch of job descriptions from Monster and saved them in a pickle file. If you skipped that notebook, use the file "jobs.pkl" in the data folder.

Let's check out some of the text, and see how we can potentially turn it into something useful.

In [None]:
# standard imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# data storage
import pickle

In [None]:
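# read in the raw data
# if you need the file in data, change the 'jobs.pkl' path
# it'll look something like "~/Documents/uga/Cult-Terry/data/jobs.pkl"
# (open() won't expand "~" for you; wrap the path in os.path.expanduser)

with open('jobs.pkl', 'rb') as f:  # the with-block closes the file for us
    raw_jobs_data = pickle.load(f)

raw_jobs_data[0][0]  # [first job][title chunk]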

In [None]:
# '\n' is the symbol for a line break
# use the built-in str.split method to split the string on this character

first_job = raw_jobs_data[0]
title_chunk = first_job[0]  # index 0 holds the title text, index 1 the description

title_chunk.split('\n')

In [None]:
# you should see the first chunk as "job title at company"
# ideally everything is formatted this way

# using a list comprehension to view all job titles

all_titles = [raw_jobs_data[job][0].split('\n')[0]
              for job in raw_jobs_data.keys()]

all_titles[:5], all_titles[-5:]

In [None]:
# looks like the data I got isn't entirely "data scientist" roles
# since that's the keyword I'm looking for, I'll filter my results down to those

data_scientist_titles = [job for job in all_titles
                         if 'data scientist' in job]
len(data_scientist_titles)

In [None]:
# huh, no jobs

# let's check some things

print('data scientist' in 'Senior Data Scientist at')
print('Data Scientist' in 'Senior Data Scientist at')
print('Data Scientist' in 'DATA SCIENTIST')

In [None]:
# this shouldn't be surprising, text is case sensitive after all
# all we have to do is standardize before we operate

print('data scientist' in 'Senior Data Scientist at'.lower())

In [None]:
# making everything lower case
# keep the index of each job so we can look up its description later
# (this assumes the keys of raw_jobs_data are 0, 1, 2, ... in order)

data_scientist_titles = [(job.lower(), idx) for idx, job in enumerate(all_titles)
                         if 'data scientist' in job.lower()]
len(data_scientist_titles)

In [None]:
# we've cut down our data substantially, but seeing as everything removed
# was unrelated to our outcome, that means the data is substantially cleaner

# moving on to the job descriptions
# we can use the indices attached to each title to grab descriptions

last_index = data_scientist_titles[-1][1]

last_description = raw_jobs_data[last_index][1]

last_description.split('\n')

In [None]:
# let's work towards this goal: find the most frequent words for data scientist jobs

# our first step is to "clean" last_description by turning it into a list of words

# clean = remove punctuation, numbers, and white space, lower-case everything 

import string

print(string.punctuation)
print(string.digits)

clean_description = last_description.lower()
for i in string.punctuation + string.digits:  # full list of digits and punctuation
    clean_description = clean_description.replace(i, ' ')
    
clean_description

In [None]:
# still need to remove '\n' and the curly apostrophe ’ (it isn't in string.punctuation), just copy paste it into a replace

clean_description = clean_description.replace('\n', ' ').replace('’', ' ')

# now split on whitespace

clean_description = clean_description.split(' ')

clean_description[:10]

In [None]:
# every word that isn't blank

clean_description = [word for word in clean_description if len(word) > 0]

clean_description[:10]

In [None]:
# Counter, pretty self explanatory
# most_common gives the most common items

from collections import Counter

Counter(clean_description).most_common(10)

---

**a few points to make here**

1. As you can see, words like "and," "of," "to" are very common but don't give us any information at all. Noise words like this are called "stop words."
2. Additionally, single words on their own don't give us much to go off of. Bi-grams and tri-grams (sequences of two and three words) would surely give us more information.
3. Some non-stop words like "data" are probably not going to be that helpful.
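To make the first two points concrete, here is a minimal sketch in plain Python. The stop word list, the `top_ngrams` helper, and the sample sentence are all made up for illustration. Note that this sketch simply drops any n-gram that contains a stop word, which is a simplification and not necessarily how a library would handle it.

```python
from collections import Counter

# a tiny, hand-picked stop word list (an assumption for this sketch;
# real lists are much longer)
STOP_WORDS = {"and", "of", "to", "the", "a", "in", "with"}

def top_ngrams(words, n=1, k=3):
    """Count n-grams in a list of words, skipping any that contain a stop word."""
    grams = zip(*(words[i:] for i in range(n)))  # sliding window of n words
    kept = [" ".join(gram) for gram in grams
            if not any(word in STOP_WORDS for word in gram)]
    return Counter(kept).most_common(k)

# made-up text, not from jobs.pkl
sample = ("experience with machine learning and knowledge of "
          "machine learning tools").split()

print(top_ngrams(sample, n=1))  # single words
print(top_ngrams(sample, n=2))  # bigrams
```

The bigram "machine learning" says a lot more than "machine" or "learning" does alone, which is the whole argument for looking past single words.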

**sklearn**

scikit-learn (imported as `sklearn`) can help us handle all of these concerns. This marks a pretty big point for us analytically: sklearn permeates almost every corner of data analysis/science where Python is concerned (unless you want to build everything from scratch).

If you ever struggle with it, **read the docs**. I can't emphasize enough how well most of the documentation is done for sklearn. Most of the examples are solid, most come with visualizations, and almost all of them are accessible with very little background knowledge.

In [None]:
# CountVectorizer - the frontline tool for turning text into data

from sklearn.feature_extraction.text import CountVectorizer

In [None]:
# using the raw description, just lower-cased (CountVectorizer does its own tokenizing)

test_text = [last_description.lower()]  # CV takes a list of documents, start with one

In [None]:
# starting off with single words

cv = CountVectorizer()  # CV is a class, so we have to instantiate it before we can call its methods

vocab = cv.fit_transform(test_text)  # fit on the text, transform into a matrix of counts

# words are called "feature names"
# (get_feature_names() was removed in scikit-learn 1.2; on versions older than 1.0, use it instead)
single_grams = cv.get_feature_names_out()

print(len(single_grams))
single_grams[:10]

In [None]:
# on their own, most of these words just don't say anything to us. Certainly some words
# like python and hadoop are fine, but without context most words are meaningless

# moving on to bigrams - two words

cv = CountVectorizer(ngram_range=(2,2))

vocab = cv.fit_transform(test_text)

double_grams = cv.get_feature_names_out()  # get_feature_names() on scikit-learn older than 1.0

print(len(double_grams))
double_grams[:10]

In [None]:
# okay, so, some of these word pairs are starting to make sense BUT we still need frequencies

vocab.toarray()  # toarray returns a dense array of counts, one column per feature - in this case bigrams

---

Now we have a single row made of all the counts of bigrams in this one job description. What happens when we run CountVectorizer with this description and another?

Some of the bigrams will be overlapping (adding values into the same column), some will be new (adding new columns). The end result will be a sparse matrix with as many rows as we have documents, and as many columns as there are unique bigrams.
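A quick way to see this is to run CountVectorizer on two tiny documents. This is only a sketch on made-up text (the `docs` list and the `cv_demo` name are assumptions, not part of the jobs data):

```python
from sklearn.feature_extraction.text import CountVectorizer

# two made-up "job descriptions" (toy data, not from jobs.pkl)
docs = ["machine learning with python",
        "machine learning with spark"]

cv_demo = CountVectorizer(ngram_range=(2, 2))
counts = cv_demo.fit_transform(docs)

# vocabulary_ maps each bigram to its column index; list them in column order
columns = sorted(cv_demo.vocabulary_, key=cv_demo.vocabulary_.get)

print(counts.shape)      # (number of documents, number of unique bigrams)
print(columns)
print(counts.toarray())  # one row per document
```

Both documents share "machine learning" and "learning with", so those columns are filled in for both rows. "with python" and "with spark" each get their own column, with a 0 in the row of the document that doesn't contain them.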

Sparse matrices are "mostly empty." In this case most jobs aren't with any one company, don't require exactly the same skills, and aren't worded the same way. The end result is lots of 0's in each row, and without compression the matrix would take far more memory and far longer for your machine to load. Compression means that each non-zero cell is stored as a reference to its row, column, and value, which can be used to rebuild the full table (or a dataframe) where non-referenced cells are known to be 0.
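The row, column, value idea can be sketched with SciPy's coordinate ("COO") format. This is an illustration on a made-up array; CountVectorizer itself returns a different SciPy sparse format (CSR), but COO maps most directly onto the description above.

```python
import numpy as np
from scipy import sparse

# a made-up, mostly-empty count matrix: 2 documents x 4 features
dense = np.array([[0, 2, 0, 0],
                  [0, 0, 0, 1]])

coo = sparse.coo_matrix(dense)  # COO = "coordinate" format

# only the non-zero cells are stored, as (row, column, value) triples
triples = list(zip(coo.row.tolist(), coo.col.tolist(), coo.data.tolist()))
print(triples)

# rebuilding the dense array fills every unreferenced cell with 0
print(coo.toarray())
```

Eight cells went in, but only two triples are stored. With thousands of bigram columns and a handful of non-zero counts per job, the savings get large.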

Since we want to build clouds of the most frequent unigrams and bigrams, all we have to do is throw CountVectorizer at our entire document-space, and then sum up our columns.
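As a side note, the column sums can be taken on the sparse matrix directly, without converting it to a dense array first. A minimal sketch on made-up documents (`toy_docs`, `cv_toy`, and `word_totals` are names invented for this example):

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

toy_docs = ["python and sql", "python and spark", "python"]  # toy data

cv_toy = CountVectorizer()
toy_counts = cv_toy.fit_transform(toy_docs)  # stays sparse

# sum each column; the result has shape 1 x n_features, so flatten it
totals = np.asarray(toy_counts.sum(axis=0)).ravel()

# vocabulary_ maps each word to its column index
word_totals = {word: int(totals[col])
               for word, col in cv_toy.vocabulary_.items()}

print(sorted(word_totals.items(), key=lambda kv: -kv[1]))
```

For a dataset the size of ours, going through `toarray()` and a dataframe is fine and easier to inspect. Summing the sparse matrix is mainly worth knowing about for when the dense version no longer fits comfortably in memory.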

In [None]:
# retrieve all descriptions from our raw data using the data scientist keys

all_text = [raw_jobs_data[key[1]][1] for key in data_scientist_titles]

cv = CountVectorizer(ngram_range=(1, 2), stop_words='english')  # both uni and bigrams, and remove stopwords

vocab = cv.fit_transform(all_text)  # fit

In [None]:
# number of jobs x number of uni and bigrams

vocab

In [None]:
# not easy to tell what's what

count_matrix = vocab.toarray()

count_matrix

In [None]:
# we can put this all into a dataframe though, and make it a little easier to comprehend

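# get_feature_names() was removed in scikit-learn 1.2; on versions older than 1.0, use it instead
wordcount_df = pd.DataFrame(data=count_matrix, columns=cv.get_feature_names_out())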

wordcount_df.head()

In [None]:
# we've certainly captured some interesting looking word combinations
# let's sum-up our columns and look at the most frequent ones

sorted_words = wordcount_df.sum().sort_values(ascending=False)

sorted_words[:10]  # all of the words

In [None]:
# quick way to reference our bigrams

bigram_indices = [word for word in sorted_words.index if ' ' in word]

sorted_words[bigram_indices][:10]  # only bigrams

---

### Word Clouds

Time to use the eponymous wordcloud package!

In [None]:
# here's how to install a package inline

# if you run into "ModuleNotFoundError: No module named 'whatever'"
# this is your first line of defence

!pip install wordcloud

In [None]:
# import 

from wordcloud import WordCloud

In [None]:
# this is copy paste from a google search of making a wordcloud

def make_cloud(freq_dict):
    '''make a wordcloud from a word frequency dictionary'''
    
    wordcloud = WordCloud()
    wordcloud.generate_from_frequencies(frequencies=freq_dict)

    plt.figure(figsize=(12, 10))
    plt.imshow(wordcloud, interpolation="bilinear")
    plt.axis("off")
    plt.show()

In [None]:
# make a dictionary of frequencies from our sorted vocab, and call the function

freq_dict = sorted_words.to_dict()

make_cloud(freq_dict)

In [None]:
# cool, we can do the same thing with just our bigrams

make_cloud(sorted_words[bigram_indices].to_dict())  # all in one line

Looks like a lot of our frequent bigrams come from EEO statements. If the formats are all the same we can hack those out pretty easily.

However, for little effort and no manual cleaning this looks pretty good. 

To recap, we can get to this point with 6 lines of code:

```
cv = CountVectorizer(ngram_range=(1, 2), stop_words='english')
vocab = cv.fit_transform(all_text)
wordcount_df = pd.DataFrame(data=vocab.toarray(), columns=cv.get_feature_names_out())

sorted_words = wordcount_df.sum().sort_values(ascending=False)
bigram_indices = [word for word in sorted_words.index if ' ' in word]

make_cloud(sorted_words[bigram_indices].to_dict())
```

From here, a little extra work (stripping out the EEO boilerplate, adding our own stop words) will go a long way toward sharpening what these clouds tell us.
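As one example of that extra work, the EEO statements noted above could be filtered out before vectorizing. This is a rough sketch only: the marker phrases and the `drop_eeo_lines` helper are hypothetical, and it assumes each EEO statement sits on its own line, which may not hold for the real descriptions.

```python
# hypothetical marker phrases; check your own data for the real wording
EEO_MARKERS = ("equal opportunity", "equal employment", "affirmative action")

def drop_eeo_lines(description):
    """Remove any line of a description that mentions an EEO marker phrase."""
    kept = [line for line in description.split('\n')
            if not any(marker in line.lower() for marker in EEO_MARKERS)]
    return '\n'.join(kept)

# made-up description, not from jobs.pkl
sample = ("Build models in Python.\n"
          "We are an Equal Opportunity Employer.\n"
          "Work with large datasets.")

print(drop_eeo_lines(sample))
```

To try it on the real data, something like `all_text = [drop_eeo_lines(text) for text in all_text]` before the `fit_transform` call would do it.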