<font face="serif" size="6" color="scarlet">Natural Language Processing</font>

NLP is a field within machine learning and deep learning that deals with understanding, analyzing, manipulating and generating language. Humans now communicate through language across many mediums, and it gets complicated: meaning depends on context, intonation, inflection and body language. The first major advancement in machine language processing came in 1950, when Alan Turing published "Computing Machinery and Intelligence". This paper established the Turing Test, a criterion for how well a computer could impersonate a human. In 1957, Noam Chomsky's book Syntactic Structures revolutionized our understanding of linguistics. A few decades then passed with little real progress. It wasn't until the late 1980s, when ML algorithms were introduced, that NLP showed real promise.

        

_NLP here does not mean Neuro-linguistic programming (a pseudoscience; think changing behavior through hypnosis). Natural Language Understanding (NLU) is similar to NLP but a bit different: NLP focuses on turning unstructured data into structured data, while NLU focuses on content or sentiment analysis._

 <font face="script" size="4">"Learn a language and you'll avoid a war" - Arab proverb</font>

 <font face="script" size="6" color="scarlet">Learning Objectives</font>
 - Understand what NLP is and how it is being used today to solve problems
 - Understand Regex and how it's used to pattern match and filter
 - Understand common feature engineering techniques like stemming, lemmatization and bigrams
 - Understand POS tagging and parse trees 

 <font face="script" size="6" color="scarlet">NLP in the Real World</font>
 
 Lots of everyday things we take for granted rely completely on NLP to function: spell check and auto-complete, voice recognition and voice-to-text, spam filters, search engines, Siri/Alexa and Google Translate.
 
 - [AI having a convo](https://youtu.be/WnzlbyTZsQY)
 - [Summarize text](https://smmry.com/)
 - [Jennings vs. Watson](https://www.ted.com/talks/ken_jennings_watson_jeopardy_and_me_the_obsolete_know_it_all)

<font face="serif" size="4">Definitions</font>
<font face="serif" size="4"> 
* **Corpus/corpora** - a collection of written texts (corpora is the plural)
* **Linguistics** - the scientific study of language: its form, meaning and context
* **Information theory** - the study of how information is stored, quantified and communicated
* **Morphology** - the study of words, their formation and their relationship to other words
* **Syntax** - the study of the rules that govern how words combine into sentences in a language
* **Semantics** - the study of meaning in a language
</font>

In [None]:
import pandas as pd 
import numpy as np 
import matplotlib.pyplot as plt 
%matplotlib inline 

import nltk
from nltk.tokenize import word_tokenize, sent_tokenize 
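nltk.download('punkt') # tokenizer models used by word_tokenize/sent_tokenize; newer NLTK releases may also ask for 'punkt_tab'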

In [None]:
df = pd.read_csv('job_scrape6.csv')
df.info()

In [None]:
df.head()

<font face="script" size="6" color="scarlet">Preprocessing</font>

<font face="serif" size="4">Regular Expressions</font>

![](regex_cheat_sheet.png)

<a href="https://www.debuggex.com/cheatsheet/regex/python">Regex Cheatsheet</a>
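
As a small illustration of pattern matching and filtering, the sketch below uses Python's built-in `re` module. The sample string and variable names are made up for illustration and are not taken from the job dataset.

```python
import re

# Hypothetical sample text, not taken from job_scrape6.csv
sample = "Data Scientist, Austin TX 78701. Salary: $120,000! Apply at jobs@example.com"

# Pattern match: find every run of digits
digits = re.findall(r"\d+", sample)

# Filter: drop every character that is not a word character or whitespace
no_punct = re.sub(r"[^\w\s]", "", sample)

# Search: pull out the first simple email-like pattern
email = re.search(r"[\w.]+@[\w.]+", sample)

print(digits)
print(no_punct)
print(email.group())
```

Note that stripping punctuation also glues `jobs@example.com` into a single word, so the order of cleaning steps matters when you want to keep things like emails or URLs.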

In [None]:
#Lowercasing everything. This avoids counting "Python" and "python" as different words
df['lower_desc'] = df['description'].apply(lambda text: " ".join(word.lower() for word in text.split()))
df['lower_desc'].head()

In [None]:
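#Removing punctuation. It helps us reduce the size of the data
#regex=True is required in recent pandas versions, where str.replace treats the pattern literally by default
df['lower_desc'] = df['lower_desc'].str.replace(r'[^\w\s]', '', regex=True)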
df['lower_desc'].head()

<font face="serif" size="4">**Natural Language Tool Kit (NLTK)** - a popular open-source Python package for working with text data</font>

[documentation](https://www.nltk.org/)

<font face="serif" size="4">**Tokenization** - splitting the whole into pieces, or tokens. A word is a token in a whole sentence. A sentence is a token in a whole paragraph.</font>
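
To make the two levels concrete, here is a rough sketch that tokenizes with plain regular expressions. It is only an approximation for illustration: the patterns are assumptions, and NLTK's tokenizers handle abbreviations, contractions and other edge cases that this does not.

```python
import re

paragraph = "I am Jon Snow of House Stark. Winter is coming."

# Sentence tokens: split on whitespace that follows ., ! or ?
sentences = re.split(r"(?<=[.!?])\s+", paragraph)

# Word tokens: runs of word characters, or single punctuation marks
words = re.findall(r"\w+|[^\w\s]", sentences[0])

print(sentences)
print(words)
```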

In [None]:
#how to split a sentence into a list of words 
word_tokenize('I am Jon Snow of House Stark.') #how do you pass something like a df.column? 

<font face="serif" size="4">**Stop Words Removal** - removing common words (such as "the", "is" and "of") that contribute little to the significance or meaning of a document</font>
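
The idea can be sketched without any downloads by using a tiny hand-picked stop list. The list below is an assumption for illustration only; NLTK's English list is much longer.

```python
# Tiny hand-picked stop list for illustration; not NLTK's list
toy_stop = {"i", "am", "of", "the", "a", "is"}

sentence = "I am Jon Snow of House Stark"

# Lowercase first so "I" matches "i", then keep only non-stop words
kept = [word for word in sentence.lower().split() if word not in toy_stop]

print(kept)
```

Using a `set` for the stop list keeps each membership check fast, which matters once the filter runs over every word in a large corpus.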

In [None]:
from nltk.corpus import stopwords
nltk.download('stopwords') # the stop word lists have to be downloaded once
stop = stopwords.words('english')

In [None]:
df['char_count'] = df['description'].str.len() #how many characters do we have in description? 
df[['description','char_count']].head()

In [None]:
df['char_count'].sum() #including spaces 

In [None]:
#how many stop words do we have? (lowercase each word first, since NLTK's stop list is all lowercase)
df['stopwords'] = df['description'].apply(lambda text: sum(word.lower() in stop for word in text.split()))
df[['description','stopwords']].head()

In [None]:
df['stopwords'].sum()

In [None]:
#removing stopwords
df['lower_desc'] = df['lower_desc'].apply(lambda text: " ".join(word for word in text.split() if word not in stop))
df['lower_desc'].head()

In [None]:
#the 20 most frequent words
freq = pd.Series(' '.join(df['lower_desc']).split()).value_counts().head(20)
freq

In [None]:
#df['location'].value_counts()

In [None]:
#df['no_zip_location'] = df['location'].str.replace('\d+', '')
#locations = df['location'].str.split(' ', n=2, expand=True)
#df['no_zip_location'].value_counts()

In [None]:
desc_str = ' '.join(df['lower_desc'].tolist())
print(desc_str[:1000]) #preview only; the full string can be very long

In [None]:
tokens = nltk.word_tokenize(desc_str) #back to tokenizing 
print(len(tokens))

<font face="serif" size="4">**POS Tagging** - tagging each word in the corpus with a part of speech.
- CC: coordinating conjunction
- CD: cardinal digit
- DT: determiner
- EX: existential there (like: “there is” … think of it like “there exists”)
- FW: foreign word
- IN: preposition/subordinating conjunction
- JJ: adjective (‘big’)
- JJR: adjective, comparative (‘bigger’)
- JJS: adjective, superlative (‘biggest’)
- LS: list marker (‘1)’)
- MD: modal (‘could’, ‘will’)
- NN: noun, singular (‘desk’)
- NNS: noun, plural (‘desks’)
- NNP: proper noun, singular (‘Harrison’)
- NNPS: proper noun, plural (‘Americans’)
- PDT: predeterminer (‘all the kids’)
- POS: possessive ending (‘parent’s’)
- PRP: personal pronoun (‘I’, ‘he’, ‘she’) </font>

In [None]:
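nltk.download('averaged_perceptron_tagger') # tagger model; newer NLTK releases may ask for 'averaged_perceptron_tagger_eng'
tokens_pos = nltk.pos_tag(tokens)
pos_df = pd.DataFrame(tokens_pos, columns=['word', 'POS'])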
pos_sum = pos_df.groupby('POS', as_index=False).count() # group by POS tags
pos_sum.sort_values(['word'], ascending=[False]) # in descending order of number of words per tag

In [None]:
#getting just the nouns
noun_tags = {'NN', 'NNS', 'NNP', 'NNPS'}
filtered_pos = [(word, tag) for word, tag in tokens_pos if tag in noun_tags]
print(len(filtered_pos))


In [None]:
#the 100 most common nouns
fdist_pos = nltk.FreqDist(filtered_pos)
top_100_words = fdist_pos.most_common(100)
print(top_100_words)

In [None]:
top_words_df = pd.DataFrame(top_100_words, columns = ('pos','count'))
top_words_df['Word'] = top_words_df['pos'].apply(lambda x: x[0]) # split the tuple of POS
top_words_df = top_words_df.drop(columns='pos') # drop the previous column (positional axis arguments were removed in pandas 2.0)
top_words_df.head(10)

In [None]:
fig, ax = plt.subplots(figsize=(15,18))
top_words_df.sort_values(by='count').plot.barh(x='Word',
                      y='count',
                      ax=ax,
                      color="purple")

ax.set_title("Most Common Nouns Found in DS Job Descriptions (Without Stop Words)")

plt.show()

**N-grams**
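
An n-gram is a contiguous sequence of n tokens: bigrams are pairs of adjacent words, trigrams are triples. Before using TextBlob, it may help to see the idea as a sliding window over a token list. The helper name `ngrams` and the sample phrase below are hypothetical, chosen only for illustration.

```python
def ngrams(tokens, n):
    """Return every contiguous window of n tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "machine learning engineer with python experience".split()

bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)

print(bigrams)
print(trigrams)
```

A list of k tokens yields k - n + 1 n-grams, so the six tokens here give five bigrams and four trigrams.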

In [None]:
#for n-grams 
from textblob import TextBlob, Word


In [None]:
TextBlob(desc_str).ngrams(3)

In [None]:
word_counts = ' '.join(top_words_df['Word'].tolist())
print(type(word_counts))

In [None]:
from wordcloud import WordCloud

In [None]:
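# Weight each word by its count so that frequent nouns are drawn larger
# (generating from the space-joined string gives almost every word the same weight)
word_freqs = top_words_df.groupby('Word')['count'].sum().to_dict()
wordcloud = WordCloud().generate_from_frequencies(word_freqs)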
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.show()

<font face="serif" size="6" color="scarlet">Feature Engineering</font>

<font face="serif" size="4">**Stemming** - a technique that removes affixes from a word, leaving only the stem. For example, "play" is the stem of "playing" and the "ing" is an affix. This process maps similar words to the same form, so the algorithm only has to learn the stem of a word instead of the stem and all its variants.</font>

<font face="serif" size="4">**Lemmatization** - similar to stemming, but it brings context to the words through morphological analysis (how words relate to other words). A lemma is the base form shared by all of a word's inflectional forms, where inflections are the endings added to the stem of a word.</font>
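
The affix-removal idea behind stemming can be sketched in a few lines. This is a deliberately naive toy, not the Porter algorithm used below: the function name `toy_stem` and its suffix list are assumptions made up for illustration.

```python
def toy_stem(word, suffixes=("ing", "ed", "s")):
    """Strip the first matching suffix, keeping at least 3 characters of stem."""
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

words = ["playing", "played", "plays", "play"]
stems = [toy_stem(word) for word in words]

print(stems)
print(toy_stem("running"))  # crude rules leave a non-word behind
```

All four variants collapse to the same stem, which is the point of the technique. The last line shows why rule-based stems are not always real words, and why lemmatization, which looks words up in a dictionary, can give cleaner results.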

In [None]:
from nltk.stem import PorterStemmer, WordNetLemmatizer
nltk.download('wordnet') # the lemmatizer looks words up in the WordNet corpus
porter = PorterStemmer() #instantiate
lemma = WordNetLemmatizer() #instantiate

In [None]:
print('Porter Stemmer')
print(porter.stem(""))


In [None]:
print('Lemmatizer')
print(lemma.lemmatize(""))

<font face="serif" size="4">Count Vectorizing</font>
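
Count vectorizing represents each document as a vector of word counts over a shared vocabulary. The sketch below builds those vectors by hand on a made-up two-document corpus (the documents and variable names are assumptions for illustration); in practice a library class such as scikit-learn's `CountVectorizer` would usually do this.

```python
from collections import Counter

# Hypothetical mini-corpus, not taken from the job dataset
docs = ["data science jobs", "data engineer jobs data"]

# Shared vocabulary, sorted so every document uses the same column order
vocab = sorted({word for doc in docs for word in doc.split()})

# One row of counts per document, one column per vocabulary word
vectors = []
for doc in docs:
    counts = Counter(doc.split())
    vectors.append([counts[word] for word in vocab])

print(vocab)
print(vectors)
```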

<font face="serif" size="4">Term Frequency-Inverse Document Frequency (TF-IDF)</font>

<font face="serif" size="4" color="scarlet">**_Term Frequency_** is calculated with the following formula:

$$\large Term\ Frequency(t) = \frac{number\ of\ times\ t\ appears\ in\ a\ document} {total\ number\ of\ terms\ in\ the\ document} $$ 

**_Inverse Document Frequency_** is calculated with the following formula:

$$\large IDF(t) = \log_e\left(\frac{Total\ Number\ of\ Documents}{Number\ of\ Documents\ containing\ t}\right)$$

The **_TF-IDF_** value for a given word in a given document is just found by multiplying the two!</font>
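
The two formulas above can be worked through on a tiny made-up corpus. This is a minimal sketch that follows the formulas exactly as written; the documents and function names are assumptions for illustration, and library implementations often use smoothed variants that give slightly different numbers.

```python
import math

# Hypothetical corpus: three already-tokenized documents
docs = [
    "data science jobs".split(),
    "data engineer jobs".split(),
    "python data analyst".split(),
]

def tf(term, doc):
    """Times the term appears in the document / total terms in the document."""
    return doc.count(term) / len(doc)

def idf(term, docs):
    """Natural log of (total documents / documents containing the term).
    Assumes the term appears in at least one document."""
    containing = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / containing)

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

score_data = tf_idf("data", docs[0], docs)        # "data" is in every document
score_science = tf_idf("science", docs[0], docs)  # "science" is in only one

print(score_data, score_science)
```

Because "data" appears in all three documents, its IDF is log(1) = 0 and its TF-IDF is 0, while the rarer "science" scores (1/3) × log(3). Words that appear everywhere carry little information about any single document.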
