# Text Analytics / Natural Language Processing
Text-based data like comments from employee listening surveys, performance review write-ups, development goals, and many other data sources in the HR world is often underutilized, or not utilized at all.  If an HR function is utilizing its text-based data today, it is often by having employees go through the comments and manually code them.  This is extremely time consuming for larger companies and introduces human bias into the analysis from the start.  In this chapter, I’ll walk through the techniques you can use to turn your text-based data into data you can use for analysis or predictive modeling.

I’ll start by commenting on the various names for text analytics.  In the data science world, it is referred to as Natural Language Processing (NLP).  In more traditional contexts it is referred to as text mining.  For the purposes of this chapter, I’ll refer to it as NLP as the term is easier to reference 😊

When hearing about NLP, many people think it is an extremely complex topic that has to either be vended out (companies that offer this service will certainly tell you that) or done by data scientists.  In the HR space, this is particularly true given the lack of advanced analytics skillsets HR organizations have traditionally had.  I’m here to tell you that actually executing the basics of NLP is incredibly easy!  So easy it’s a bit scary how little you would actually need to understand about what is occurring on the backend to use it.  At its core, a basic NLP approach is to take a word or string of words from your text-based data and turn them into columns.  While there are nuances that need to be considered, requiring some knowledge of what the algorithm is doing, much of the complexity is abstracted away by the algorithm.  If you're looking for your faith in humanity to be restored, the people who own and contribute to open source projects like sklearn are a good place to start.  The hard work people put into these open source projects makes the lives of people trying to utilize concepts like NLP significantly easier.

#### About the data for this chapter
Luckily there are a few HR specific text-based data sources online to pick from for this chapter.  While I was able to get away with generating a dataset for the turnover prediction model chapter, I wouldn’t have been able to do the same for a text-based dataset.  While an employee engagement survey would have been the ideal type of text-based dataset, [Kaggle has a dataset containing job posting information](https://www.kaggle.com/shivamb/real-or-fake-fake-jobposting-prediction).  While you may not have a direct use case in your role pertaining to job postings, this dataset will provide us an HR focused example to run the NLP techniques on.

#### Objective with this data
The use case for this chapter will be to categorize the various job postings into similar postings using a topic model.

#### Outline of chapter
1. We’ll briefly discuss when you need to clean your data beforehand and when you can simply let the algorithm take care of it.

2. Vectorizing our data, or splitting our data into the various columns.  This is the “core” of the NLP process we will utilize.  This section has the most parameter exploration, as the vectorization process is where we have the most options to adjust our process.

3. Examining our output and understanding what it means.

4. How can we utilize our output?  We can leave it as is and put it into an existing model or analysis, or we can leverage a topic model.

5. Topic model approaches and concept explanation.


## Data cleaning: to clean or not to clean?
This will be one of the few times anyone tells you you’re able to err on the side of not cleaning your data vs. cleaning your data (gasp).  If your data is already structured, meaning it's already in a cell in an Excel or CSV file, you probably won’t need to do any additional cleaning.  This is because the vectorizing step has mechanisms to remove stop words, words that don’t occur very frequently, and words that occur very frequently.

The instances where you will likely need to do some data cleaning are when your data already has HTML tags in it for display purposes, or when you’re pulling data from a Word file.  I have had to deal with both and it can take some time, but there are plenty of resources available online for handling this.  At least in the beginning, we will not tackle the steps required for cleaning your data from a Word file or when it contains HTML tags.



## Loading the data


In [1]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer 

In [2]:
df = pd.read_excel("fake_job_postings.xlsx")
df = df.fillna(" ")
df.head()

Unnamed: 0,job_id,title,location,department,salary_range,company_profile,description,requirements,benefits,telecommuting,has_company_logo,has_questions,employment_type,required_experience,required_education,industry,function,fraudulent
0,1,Marketing Intern,"US, NY, New York",Marketing,,"We're Food52, and we've created a groundbreaki...","Food52, a fast-growing, James Beard Award-winn...",Experience with content management systems a m...,,0,1,0,Other,Internship,,,Marketing,0
1,2,Customer Service - Cloud Video Production,"NZ, , Auckland",Success,,"90 Seconds, the worlds Cloud Video Production ...",Organised - Focused - Vibrant - Awesome!Do you...,What we expect from you:Your key responsibilit...,What you will get from usThrough being part of...,0,1,0,Full-time,Not Applicable,,Marketing and Advertising,Customer Service,0
2,3,Commissioning Machinery Assistant (CMA),"US, IA, Wever",,,Valor Services provides Workforce Solutions th...,"Our client, located in Houston, is actively se...",Implement pre-commissioning and commissioning ...,,0,1,0,,,,,,0
3,4,Account Executive - Washington DC,"US, DC, Washington",Sales,,Our passion for improving quality of life thro...,THE COMPANY: ESRI – Environmental Systems Rese...,"EDUCATION: Bachelor’s or Master’s in GIS, busi...",Our culture is anything but corporate—we have ...,0,1,0,Full-time,Mid-Senior level,Bachelor's Degree,Computer Software,Sales,0
4,5,Bill Review Manager,"US, FL, Fort Worth",,,SpotSource Solutions LLC is a Global Human Cap...,JOB TITLE: Itemization Review ManagerLOCATION:...,QUALIFICATIONS:RN license in the State of Texa...,Full Benefits Offered,0,1,1,Full-time,Mid-Senior level,Bachelor's Degree,Hospital & Health Care,Health Care Provider,0


While this dataset has a number of columns, since the goal here is to utilize the text information, we'll select only the columns we actually want to utilize to keep our dataset a bit more tidy.  We'll stick with leveraging the description and requirements fields as we are hoping to group these jobs into similar types of jobs based on the text of these columns.

In [4]:
df_skinny = df[['description', 'requirements']]
df_skinny.head()

Unnamed: 0,description,requirements
0,"Food52, a fast-growing, James Beard Award-winn...",Experience with content management systems a m...
1,Organised - Focused - Vibrant - Awesome!Do you...,What we expect from you:Your key responsibilit...
2,"Our client, located in Houston, is actively se...",Implement pre-commissioning and commissioning ...
3,THE COMPANY: ESRI – Environmental Systems Rese...,"EDUCATION: Bachelor’s or Master’s in GIS, busi..."
4,JOB TITLE: Itemization Review ManagerLOCATION:...,QUALIFICATIONS:RN license in the State of Texa...


## Vectorizing
Vectorizing is just a fancy word for "splitting apart" the words from the text into columns.  There are a few parameters we need to define ourselves to get the algorithm to function.

* ngram_range
* max_df and min_df
* stop_words

But first, we need to understand the two primary types of vectorizing.  The first and easiest to understand is the count vectorizer.  Each time a word or combination of words appears in a piece of text, its count goes up by 1.  If the same word shows up 3 times, it shows as a count of 3.  In practice this is rarely used as the TFIDF vectorizer will typically produce superior performance.  TFIDF stands for "Term Frequency Inverse Document Frequency".  While the math behind this is beyond the scope of this resource, know that TFIDF is trying to account for how important the word is to the document.  For example, if you work for Amazon and are analyzing your job profile text, the word Amazon is relatively unimportant given it's the company name and shows up nearly everywhere.  However, if you work for Walmart and are doing the same exercise, Amazon may be more important as it could relate to AWS.  In short, use the TFIDF vectorizer because you'll usually get better results.

Within the TFIDF vectorizor, we'll utilize the various parameters I mentioned above.  Let's talk about each of them a bit prior to diving into the TFIDF function.

The n-gram range controls how many words you want strung together.  For example, if your n-gram range is only 1 to 1, then you'd only be getting each word by itself.  While this can be helpful, you will likely lose a lot of important context by only looking at each word in isolation.  If you set your n-gram range as 1 to 3, you'll get each single word, pair of words, and trio of words.  This helps retain more of the context of your text, since words that sit next to each other in a sentence can be kept together throughout your vectorization process.

The min_df and max_df parameters ensure very common and very rare words don't stay in your dataset for consideration.  When the value is a decimal between 0 and 1, it is treated as a proportion of your documents (a whole number is treated as a count of documents instead).  For example, if min_df is set to .05, a word has to show up in at least 5% of your documents to be kept.  For max_df, if it is set to .95, any word that appears in more than 95% of your documents is removed.

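The last parameter isn't much of an option.  For stop_words, you want to remove the stop words (very common words like "the" and "and") from your comments.  As of this writing, the only built-in stop word list sklearn provides is for English; for other languages you would need to supply your own list.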

Now onto our example.  Since we're using multiple columns, this will require some programming to execute the process on each of our columns independently.

In [5]:
#create a list of the 2 column positions for the dataset
col_nums = [0,1]

#create a blank dataframe so we can put the vectorized columns in it
df_vect = pd.DataFrame()

#set vectorizer parameters
tfidf_vec = TfidfVectorizer(stop_words = 'english', ngram_range = (1,3), max_df = 0.75, min_df = .005)

#run vectorization process on each column
for col in col_nums:
    
    #select the single column
    temp_df = df_skinny[df_skinny.columns[col]]
    
    #execute vectorizer on data
    after_tfidf = tfidf_vec.fit_transform(temp_df)
    
    #get the column names (get_feature_names() was removed in scikit-learn 1.2)
    col_names = tfidf_vec.get_feature_names_out()
    
    #convert into df
    after_tfidf = pd.DataFrame(after_tfidf.toarray(), columns = col_names)
    
    #put together each iteration of the vectorization process into a single dataframe
    df_vect = pd.concat([df_vect, after_tfidf], axis = 1)

As you can see below, we are now able to see that our dataset has 6001 columns.  Each column represents a word or consecutive string of words and the number is how important the word(s) is/are to the text.

At least for me, this is where I can more clearly see how a computer or model can utilize text information to understand it.  We've taken a bunch of words and turned them into a tabular dataset that can go into a model!

In [7]:
print(df_vect.shape)
df_vect.head()

(17880, 6001)


Unnamed: 0,00,000,10,10 years,100,1000,11,12,12 month,12 month contract,...,years professional,years professional experience,years related,years relevant,years sales,years software,years work,years work experience,years working,yes
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.106598,0.0,0.0,0.056939,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


Now the next question is, what do you do with this information?

## What the heck do I do with 6000 columns?
By themselves, this many columns are pretty useless.  With most basic modeling or analytics techniques, having this many variables becomes a problem.  Each column by itself doesn't tell us much, but when you start interacting the columns together, you may start getting somewhere.  You can also leverage dimensionality reduction techniques to reduce the number of variables you have.  This will allow you to utilize more traditional analysis or modeling techniques.  I'll speak about two paths you can take to put this text analysis into action.  As always, what to do should go back to your use case.

1. Topic Modeling: If you're familiar with Principal Component Analysis or other dimensionality reduction techniques, this is a similar concept.  Reduce the number of variables while minimizing information loss.

2. Build Tree-Based Model: One significant benefit of tree-based models is how well they can handle high dimension datasets.  Feeding a random forest or GBM 6000 columns generally won't cause any issues because a tree-based model tends to ignore columns that don't contribute to the prediction.  This approach can also use a topic model, which may produce a better or worse result depending on your data.

#### Topic Modeling
Topic modeling is a dimensionality reduction technique that can be a way to identify themes or topics within text data.  This is an unsupervised approach.  You input how many topics to create and the output is that number of columns, each with a score (for some methods, a probability) for how strongly the text fits into the topic.  Unfortunately the magic stops here: there isn't a name for the topic provided.  A common approach is for an analyst to review the words or phrases within that topic to name it.  If you're familiar with clustering, this is similar.  The clusters are created, but they still need to be given a name.

In the HR space, building a topic model has many uses.  From an employee survey perspective, any open comments can be grouped via a topic model to get a general idea of what employees are discussing.  If you pair this with your more standard HR data elements like tenure, job level, age, department, etc., this can provide a new perspective on your survey data you didn't have previously.  With your performance goal write-ups, you could use a topic model to identify the themes of goals across teams and/or departments within your company.
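As a rough sketch of what pairing topics with standard HR data can look like, the example below joins topic scores to a department column and averages them.  All of the numbers, the `topic_0` and `topic_1` column names, and the departments are made up purely for illustration.

```python
import pandas as pd

# Made-up topic scores for four survey comments (one row per comment)
topic_scores = pd.DataFrame({
    "topic_0": [0.8, 0.1, 0.7, 0.2],
    "topic_1": [0.1, 0.9, 0.2, 0.6],
})

# Made-up HR data for the same four employees, in the same row order
hr_data = pd.DataFrame({
    "department": ["Sales", "Engineering", "Sales", "Engineering"],
})

# Put the two side by side, then average each topic by department
combined = pd.concat([hr_data, topic_scores], axis=1)
by_department = combined.groupby("department").mean()

print(by_department)
```

In this toy example, Sales comments lean toward the first topic and Engineering comments toward the second.  With real data, the same kind of summary could be done by tenure band, job level, or any other field you already have.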

#### Build Tree-Based Model
If you're hoping to build a predictive model, you can build a tree-based model (decision tree, random forest, GBM) using the high number of features that get output from the vectorization process.  You have many options on how to approach this.  You can simply use all the vectorized columns to predict whatever you're trying to predict, you can add in other more standard HR data elements (tenure, age, job level, etc.), you can put your vectorized data through a topic model, or you can do any combination of the three.  As long as you have time, it's best to try the various approaches to see which provides the best results!

In the HR space, a good example for this is predicting turnover.  Utilizing your structured data, survey comments, performance reviews, peer feedback, etc. all together will almost certainly provide a better result than each by itself.  A tree-based model can find some of these complex interactions, which is something regression cannot solve for without a significant amount of additional time and complexity in the modeling process.

## Topic Model Example
There are a handful of different approaches to topic modeling.  In this example we'll use LSA, or Latent Semantic Analysis.  Under the hood it uses SVD, or Singular Value Decomposition; because we apply it to data vectorized with TFIDF, it's considered LSA.  Know that this is using linear algebra on the backend, but the actual math behind this is beyond the scope of this chapter.  Once you've done all the work to get your text vectorized, you'll find that topic modeling is quite easy!  Topic modeling definitely fits the classic analysis stereotype that most of the time is spent prepping the data rather than on the actual analysis.

In [8]:
from sklearn.decomposition import TruncatedSVD

In [30]:
topic_model = TruncatedSVD(n_components = 5)
topic_model_df = topic_model.fit_transform(df_vect)

Building the topic model is as easy as that!  I've found it valuable to play with how many topics are created.  Make sure to use your use case as context to help inform that.  If you're trying to identify themes from an onboarding survey, it's not likely that 25 themes will be helpful!  If you're just using the topics to be fed into a model, the actual meaning of the topics isn't as important as having a means to reduce the number of columns you're feeding into your model.
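One way to play with the number of topics is to fit the model a few times and compare how much of the variation in the data each version retains.  The sketch below does this on a handful of made-up documents; the candidate topic counts (2, 3, 4) are arbitrary, and on the real dataset you would try larger values.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up documents standing in for job descriptions
docs = [
    "sales account manager client revenue",
    "sales representative quota client",
    "registered nurse patient care hospital",
    "nurse practitioner patient clinic",
    "software engineer python cloud",
    "data engineer python pipelines cloud",
]

X = TfidfVectorizer().fit_transform(docs)

# Fit one model per candidate topic count and record the variance retained
explained = {}
for n_topics in (2, 3, 4):
    svd = TruncatedSVD(n_components=n_topics, random_state=0)
    svd.fit(X)
    explained[n_topics] = svd.explained_variance_ratio_.sum()

for n_topics, share in explained.items():
    print(n_topics, round(share, 3))
```

More topics retain more of the variation, so this number alone won't pick the answer for you.  It is one input to weigh against how many themes are actually useful for your use case.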

In [32]:
topic_model_df = pd.DataFrame(topic_model_df)
topic_model_df.head()

Unnamed: 0,0,1,2,3,4
0,0.256389,-0.06256,-0.026146,-0.016824,-0.025928
1,0.385037,-0.086439,0.000473,-0.014787,-0.040057
2,0.149374,-0.037078,0.036085,-0.002194,0.029768
3,0.440844,-0.097447,-0.049344,-0.00359,-0.050383
4,0.294612,-0.063121,0.042366,-0.006927,0.018472


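Each number is a score for how strongly the text relates to that topic.  There are two important things to note.

1. If using the numbers, take the absolute value.  A negative number doesn't mean less of a chance the text is related to a topic; the sign only indicates direction along the component, and the size of the number is what indicates how strongly the text relates to it.

2. The numbers do not add up to 100.  This is because LSA scores are not true probabilities; they are the text's coordinates along each component that SVD finds.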