# Introduction to Natural Language Processing (NLP)

The reference video is **[here](https://www.youtube.com/watch?v=fOvTtapxa9c)**.

## Why is NLP Important?

Large volumes of largely unstructured text data are generated every second. Humans cannot read through all of it to extract the underlying knowledge and make decisions from it. This is what makes Natural Language Processing (NLP) important.

The infographic below shows the amount of data generated per minute in 2019.
![Per Minute Data Generation in 2019](https://2oqz471sa19h3vbwa53m33yj-wpengine.netdna-ssl.com/wp-content/uploads/2019/07/big-data-getting-bigger.jpg)
**Image source https://bit.ly/3dh0oUW**

With the exponential increase in devices and cheaper internet connectivity, the volume of data, and text data in particular, is bound to grow. Making sense of this data to support business decisions is why NLP will remain important.




## History of NLP

NLP is a field of Artificial Intelligence (AI) that helps computers understand, interpret, and use human languages, so that computers can communicate with people. The history of NLP is often traced back to 1957, when Noam Chomsky published the book "[Syntactic Structures](https://doubleoperative.files.wordpress.com/2009/12/chomsky-syntactic-structures-2ed.pdf)". One conclusion drawn from it was that, for a computer to understand a language, the sentence structure had to be changed. This early research called for more innovation in making human languages understandable to computers. Today, neural networks can model the structure of sentences and, to a large extent, correctly predict the next word a person would say or write in a sentence. A brief history of NLP can be found [**here**](https://en.wikipedia.org/wiki/History_of_natural_language_processing).

## Applications of NLP

Some application areas of NLP are listed below:

1.   Content Categorization
2.   Document Summarization
3.   Sentiment Analysis and opinion mining
4.   Text-to-Speech and Speech-to-Text Conversion
5.   Topic Discovery and Modeling
6.   Machine Translation











# Introduction to Natural Language Toolkit (NLTK)

**Paper:** Loper, E., & Bird, S. (2002). NLTK: the natural language toolkit. arXiv preprint cs/0205028

## Text Wrangling

Text wrangling refers to the pre-processing applied to raw text to make it clean and easier for a computer to use in training.



### Dataset

We'll use the sentiment140 [Twitter Dataset](https://www.kaggle.com/kazanova/sentiment140), which is annotated with sentiment labels. It contains 1,600,000 tweets extracted using the Twitter API. The tweets are annotated (0 = negative, 4 = positive) and can be used to detect sentiment. We'll mostly work with the "text" field of the dataset.



In [None]:
# Import all the packages for data wrangling

import re
import sys

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from tqdm import tqdm
tqdm.pandas(desc="progress-bar")

import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer, SnowballStemmer
from nltk.stem.porter import PorterStemmer
from nltk.tokenize import TweetTokenizer  # tokenizer for tweets
stop = stopwords.words('english')

# Select TensorFlow (TF) 1.x in Colab; this version works with the BertLibrary package.
%tensorflow_version 1.x
import tensorflow
print(tensorflow.__version__)



[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.




TensorFlow 1.x selected.
1.15.2


In [None]:
#Google Colab Specific to access the location with the notebook
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
%ls

drive/  sample_data/


In [None]:
%cd drive/My Drive/Torrens/NLP/

/content/drive/My Drive/Torrens/NLP


In [None]:
# Read the dataset with pandas, wrapped in a function
dataset_columns = ["target", "ids", "date", "flag", "user", "text"]  # dataset columns
def read_data():
    dataset = pd.read_csv("training.1600000.processed.noemoticon.csv", encoding="ISO-8859-1", names=dataset_columns)  # Enter your file location
    dataset.drop_duplicates(inplace=True)
    dataset = dataset[dataset['text'].notnull()]  # keep only rows that have tweet text
    dataset.reset_index(drop=True, inplace=True)  # renumber the rows without keeping the old index
    print('Dataset loaded with shape', dataset.shape)
    return dataset

dataset = read_data()  # Call the function defined above

Dataset loaded with shape (1600000, 6)


In [None]:
dataset.head()  # Show the first records; use "tail" instead of "head" to view the last records in the dataset

Unnamed: 0,target,ids,date,flag,user,text
0,0,1467810369,Mon Apr 06 22:19:45 PDT 2009,NO_QUERY,_TheSpecialOne_,"@switchfoot http://twitpic.com/2y1zl - Awww, t..."
1,0,1467810672,Mon Apr 06 22:19:49 PDT 2009,NO_QUERY,scotthamilton,is upset that he can't update his Facebook by ...
2,0,1467810917,Mon Apr 06 22:19:53 PDT 2009,NO_QUERY,mattycus,@Kenichan I dived many times for the ball. Man...
3,0,1467811184,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,ElleCTF,my whole body feels itchy and like its on fire
4,0,1467811193,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,Karoli,"@nationwideclass no, it's not behaving at all...."


### Cleaning and Stop Word Removal



In [None]:
# Cleaning the text part of the tweets will take some time on a slow computer.
# Note: the steps run in this order, so the first substitution has already turned '#', '@' and '://' into spaces before the later patterns run.
dataset['text'] = dataset['text'].map(lambda x:re.sub('[^a-zA-Z]',' ',str(x))) #Replace every non-letter character (digits, punctuation, symbols) with a space
dataset['text'] = dataset['text'].map(lambda x:re.sub('http.*','',str(x))) #Remove hyperlinks; this deletes everything from "http" to the end of the tweet
dataset['text'] = dataset['text'].map(lambda x:re.sub(r'#','',str(x))) #Remove hashtag symbols. Not of interest
dataset['text'] = dataset['text'].map(lambda x:re.sub(r'@\w*','',str(x))) #Remove user mentions
dataset['text'] = dataset['text'].map(lambda x:str(x).lower()) #Lower-case everything
dataset['text'] = dataset['text'].str.split().map(lambda sl: " ".join(s for s in sl if len(s) > 3)) #Remove words with three or fewer characters
dataset['text'] = dataset['text'].apply(lambda x: ' '.join([word for word in x.split() if word not in (stop)])) #Stop word removal using the NLTK stop-word list defined above
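
The order of the substitutions above matters: once every non-letter has been replaced by a space, the `@`, `#`, and `://` markers that the later patterns look for no longer exist, and `http.*` removes the rest of the tweet after the first `http`. A minimal sketch of one alternative ordering follows. The function name `clean_tweet`, the `http\S+` pattern, the tiny stop-word set, and the example tweet are all assumptions made for illustration; they are not part of the original notebook.

```python
import re

# Tiny illustrative stop-word set (assumption); the notebook uses NLTK's English list.
STOP_WORDS = {"that", "this", "with", "your"}

def clean_tweet(text, stop_words=STOP_WORDS):
    """Clean one tweet, removing URLs and mentions before stripping symbols."""
    text = re.sub(r"http\S+", " ", text)    # drop each URL up to the next whitespace
    text = re.sub(r"@\w+", " ", text)       # drop user mentions while '@' still exists
    text = text.replace("#", " ")           # keep the hashtag word, drop the '#'
    text = re.sub(r"[^a-zA-Z]", " ", text)  # replace digits and punctuation with spaces
    words = [w for w in text.lower().split() if len(w) > 3]  # keep words of 4+ letters
    return " ".join(w for w in words if w not in stop_words)

# Hypothetical tweet, not taken from the dataset
print(clean_tweet("@someone http://example.com/abc - Loving this weather 2day #sunny"))
```

With this ordering, only the URL itself is dropped and the words after it survive, which is the main difference from the cell above.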


In [None]:
dataset.tail() # View cleaner tweets in the text part

Unnamed: 0,target,ids,date,flag,user,text
1599995,4,2193601966,Tue Jun 16 08:40:49 PDT 2009,NO_QUERY,AmandaMarie1028,woke school best feeling ever
1599996,4,2193601969,Tue Jun 16 08:40:49 PDT 2009,NO_QUERY,TheWDBoards,thewdb cool hear walt interviews
1599997,4,2193601991,Tue Jun 16 08:40:49 PDT 2009,NO_QUERY,bpbabe,ready mojo makeover details
1599998,4,2193602064,Tue Jun 16 08:40:49 PDT 2009,NO_QUERY,tinydiamondz,happy birthday alll time tupac amaru shakur
1599999,4,2193602129,Tue Jun 16 08:40:50 PDT 2009,NO_QUERY,RyanTrevMorris,happy charitytuesday thenspcc sparkscharity sp...


### Tokenization

Tokenization breaks a string of text into words, keywords, phrases, symbols, and other elements called tokens.



In [None]:
# Tokenization function
def tokenization(text):
  tokens = re.split(r'\W+', text)  # split on runs of non-word characters
  return tokens

# Call the tokenization function
dataset['tokens'] = dataset['text'].apply(tokenization)
dataset.head()

Unnamed: 0,target,ids,date,flag,user,text,tokens
0,0,1467810369,Mon Apr 06 22:19:45 PDT 2009,NO_QUERY,_TheSpecialOne_,switchfoot,[switchfoot]
1,0,1467810672,Mon Apr 06 22:19:49 PDT 2009,NO_QUERY,scotthamilton,upset update facebook texting might result sch...,"[upset, update, facebook, texting, might, resu..."
2,0,1467810917,Mon Apr 06 22:19:53 PDT 2009,NO_QUERY,mattycus,kenichan dived many times ball managed save re...,"[kenichan, dived, many, times, ball, managed, ..."
3,0,1467811184,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,ElleCTF,whole body feels itchy like fire,"[whole, body, feels, itchy, like, fire]"
4,0,1467811193,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,Karoli,nationwideclass behaving,"[nationwideclass, behaving]"
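
One detail of `re.split` is worth knowing: it returns empty strings when the text starts or ends with a non-word character, or when the text is empty, so a tweet that was cleaned down to nothing still yields one empty token. The sketch below contrasts this with `re.findall`; both function names are hypothetical and only serve the comparison.

```python
import re

def tokenize_split(text):
    """Split on runs of non-word characters, as in the tokenizer above."""
    return re.split(r"\W+", text)

def tokenize_findall(text):
    """Collect runs of word characters instead, which never yields empty tokens."""
    return re.findall(r"\w+", text)

print(tokenize_split("whole body feels itchy"))
print(tokenize_split(""))               # one empty token
print(tokenize_split(" whole body "))   # empty tokens at both ends
print(tokenize_findall(" whole body "))
```

Whether empty tokens matter depends on the downstream step; filtering them out is one option if they do.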


### Stemming and Lemmatisation

Stemming and lemmatisation are closely related: both reduce similar words to a common form. Words like ***fishing***, ***fished***, and ***fisher*** can be stemmed to "***fish***". A stemmer operates on a single word without knowledge of the context, and therefore cannot discriminate between words which have different meanings depending on part of speech.
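
To make the "no knowledge of context" point concrete, here is a deliberately naive suffix stripper. It is a toy written for this illustration (the name `toy_stem` and its three suffix rules are assumptions); it is not the Porter algorithm and will behave differently from NLTK's stemmers.

```python
def toy_stem(word, suffixes=("ing", "ed", "s")):
    """Strip the first matching suffix if at least three letters remain."""
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# "meeting" is reduced to "meet" whether it was used as a noun or a verb,
# because the function only ever sees the single word.
for w in ["fishing", "fished", "fish", "meeting"]:
    print(w, "->", toy_stem(w))
```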




In [None]:
# Lemmatization
nltk.download('wordnet')
lemmatizer = nltk.WordNetLemmatizer()
def lemmatize_text(text):
  # lemmatize() treats each word as a noun unless a part of speech is passed
  output = [lemmatizer.lemmatize(word) for word in text]
  return output
dataset["Lemmatized_text"] = dataset["tokens"].apply(lemmatize_text)
dataset.head()

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


Unnamed: 0,target,ids,date,flag,user,text,tokens,Lemmatized_text
0,0,1467810369,Mon Apr 06 22:19:45 PDT 2009,NO_QUERY,_TheSpecialOne_,switchfoot,[switchfoot],[switchfoot]
1,0,1467810672,Mon Apr 06 22:19:49 PDT 2009,NO_QUERY,scotthamilton,upset update facebook texting might result sch...,"[upset, update, facebook, texting, might, resu...","[upset, update, facebook, texting, might, resu..."
2,0,1467810917,Mon Apr 06 22:19:53 PDT 2009,NO_QUERY,mattycus,kenichan dived many times ball managed save re...,"[kenichan, dived, many, times, ball, managed, ...","[kenichan, dived, many, time, ball, managed, s..."
3,0,1467811184,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,ElleCTF,whole body feels itchy like fire,"[whole, body, feels, itchy, like, fire]","[whole, body, feel, itchy, like, fire]"
4,0,1467811193,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,Karoli,nationwideclass behaving,"[nationwideclass, behaving]","[nationwideclass, behaving]"


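Little changes when the tokens are lemmatized. The lemmatizer is called without part-of-speech tags, so each word is treated as a noun and mostly plural forms are reduced, e.g. "times" becomes "time" and "feels" becomes "feel".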

In [None]:
# Stemming
stemmer = nltk.PorterStemmer()
def stem_text(text):
  output = [stemmer.stem(word) for word in text]
  return output
dataset["Stemmed_text"] = dataset["tokens"].apply(stem_text)
dataset.head()

Unnamed: 0,target,ids,date,flag,user,text,tokens,Lemmatized_text,Stemmed_text
0,0,1467810369,Mon Apr 06 22:19:45 PDT 2009,NO_QUERY,_TheSpecialOne_,switchfoot,[switchfoot],[switchfoot],[switchfoot]
1,0,1467810672,Mon Apr 06 22:19:49 PDT 2009,NO_QUERY,scotthamilton,upset update facebook texting might result sch...,"[upset, update, facebook, texting, might, resu...","[upset, update, facebook, texting, might, resu...","[upset, updat, facebook, text, might, result, ..."
2,0,1467810917,Mon Apr 06 22:19:53 PDT 2009,NO_QUERY,mattycus,kenichan dived many times ball managed save re...,"[kenichan, dived, many, times, ball, managed, ...","[kenichan, dived, many, time, ball, managed, s...","[kenichan, dive, mani, time, ball, manag, save..."
3,0,1467811184,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,ElleCTF,whole body feels itchy like fire,"[whole, body, feels, itchy, like, fire]","[whole, body, feel, itchy, like, fire]","[whole, bodi, feel, itchi, like, fire]"
4,0,1467811193,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,Karoli,nationwideclass behaving,"[nationwideclass, behaving]","[nationwideclass, behaving]","[nationwideclass, behav]"


Stemming cuts characters off many tokens and often leaves non-words (e.g. "updat", "bodi", "behav"). Such tokens are harder to read and may not match the vocabulary of pre-trained models, so we'll stick to the tokens in their original form. They are good enough.

## Statistical Language Modelling

### Bag of Words (BoW) and Count Vectorizer

The **CountVectorizer** provides a simple way to tokenize a collection of text documents, build a vocabulary of known words, and encode new documents using that vocabulary.

In [None]:
text = list(dataset["text"][:100])  # Use only the first 100 tweets; vectorizing all 1.6M tweets is impractical for this demonstration.

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

# Initialize a CountVectorizer object: count_vec_tweets
count_vec_tweets = CountVectorizer(stop_words="english", analyzer='word', ngram_range=(1, 1), max_df=1.0, min_df=1, max_features=None)

# Learn the vocabulary, then transform the data into a bag of words
count_train = count_vec_tweets.fit(text)  # fit() returns the fitted vectorizer itself
bag_of_words = count_vec_tweets.transform(text)

In [None]:
print("Vocabulary:\n {}".format(count_train.vocabulary_))

Vocabulary:
 {'switchfoot': 352, 'upset': 397, 'update': 396, 'facebook': 120, 'texting': 363, 'result': 295, 'school': 307, 'today': 374, 'blah': 44, 'kenichan': 198, 'dived': 102, 'times': 373, 'ball': 34, 'managed': 229, 'save': 303, 'rest': 294, 'bounds': 49, 'body': 47, 'feels': 131, 'itchy': 188, 'like': 209, 'nationwideclass': 246, 'behaving': 39, 'kwesidei': 201, 'crew': 89, 'need': 247, 'loltrish': 216, 'long': 217, 'time': 371, 'rains': 287, 'fine': 135, 'thanks': 365, 'tatiana': 356, 'nope': 254, 'twittera': 386, 'muera': 244, 'spring': 338, 'break': 51, 'plain': 270, 'city': 72, 'snowing': 328, 'pierced': 269, 'ears': 110, 'caregiving': 62, 'bear': 38, 'watch': 410, 'thought': 368, 'loss': 221, 'embarrassing': 112, 'octolinz': 256, 'counts': 86, 'talk': 355, 'anymore': 16, 'smarrison': 325, 'really': 290, 'snyder': 329, 'doucheclown': 105, 'iamjazzyfizzle': 185, 'wish': 417, 'miss': 236, 'iamlilnicki': 186, 'premiere': 278, 'hollis': 176, 'death': 95, 'scene': 306, 'hurt': 

The vocabulary maps each word to its column index in the bag-of-words matrix (the number is a position, not a count).
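
The relationship between the vocabulary and the count matrix can be sketched in plain Python. This is a simplified illustration under stated assumptions: it splits on whitespace only and ignores the stop-word filtering and token pattern that `CountVectorizer` applies; the function name `bag_of_words_counts` and the three toy documents are hypothetical.

```python
from collections import Counter

def bag_of_words_counts(documents):
    """Return (vocabulary, count_matrix) for whitespace-tokenized documents."""
    terms = sorted({word for doc in documents for word in doc.split()})
    vocabulary = {term: index for index, term in enumerate(terms)}  # word -> column index
    matrix = []
    for doc in documents:
        row = [0] * len(vocabulary)
        for term, count in Counter(doc.split()).items():
            row[vocabulary[term]] = count  # the count goes into that word's column
        matrix.append(row)
    return vocabulary, matrix

toy_docs = ["whole body feels itchy", "whole crew", "watch watch film"]
vocabulary, matrix = bag_of_words_counts(toy_docs)
print(vocabulary)
for row in matrix:
    print(row)
```

Each row is one document and each column is one vocabulary word, which is why the matrix gets wide and sparse as the vocabulary grows.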

### Term Frequency Inverse Document Frequency (TF-IDF) Vector

![Term Frequency-Inverse Document Frequency ](https://miro.medium.com/max/1400/1*V9ac4hLVyms79jl65Ym_Bw.jpeg)
Image from https://bit.ly/3dmDDyS

TF-IDF is a metric that weighs the importance of a word in a document relative to the corpus. A word that is frequent in one document (a high bag-of-words count) but rare across the rest of the collection gets a higher TF-IDF score, which marks it as distinctive for that document. Conversely, words that appear frequently across the whole collection, such as stop words, are less informative and get a lower TF-IDF score. TF-IDF values can be used as a feature representation in model building.
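
As a worked example of the arithmetic: with `smooth_idf=False` and `norm=None`, the settings used for the `TfidfVectorizer` in this section, scikit-learn computes the inverse document frequency as `ln(N / df) + 1` and multiplies it by the raw term count. The helper names below are hypothetical; the sketch only reproduces that formula for a corpus of 100 documents.

```python
import math

def inverse_document_frequency(n_documents, document_frequency):
    """IDF without smoothing: ln(N / df) + 1."""
    return math.log(n_documents / document_frequency) + 1

def tf_idf(term_count, n_documents, document_frequency):
    """Unnormalized TF-IDF: raw term count times IDF."""
    return term_count * inverse_document_frequency(n_documents, document_frequency)

# A word found in 1 of 100 tweets versus a word found in 2 of 100 tweets
print(inverse_document_frequency(100, 1))
print(inverse_document_frequency(100, 2))
# The same rare word occurring twice within one tweet
print(tf_idf(2, 100, 1))
```

The rarer word gets the larger IDF, and repeating a word inside a document scales its score linearly because no sublinear scaling or normalization is applied.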

In [None]:
# TF-IDF vectorizer (term frequency weighted by inverse document frequency)
from sklearn.feature_extraction.text import TfidfVectorizer

tf = TfidfVectorizer(smooth_idf=False, sublinear_tf=False, norm=None, analyzer='word')
fitted_text = tf.fit(text)
transformed_text = fitted_text.transform(text)
print("Listed Text: ", text)

Listed Text:  ['switchfoot', 'upset update facebook texting might result school today also blah', 'kenichan dived many times ball managed save rest bounds', 'whole body feels itchy like fire', 'nationwideclass behaving', 'kwesidei whole crew', 'need', 'loltrish long time rains fine thanks', 'tatiana nope', 'twittera muera', 'spring break plain city snowing', 'pierced ears', 'caregiving bear watch thought loss embarrassing', 'octolinz counts either never talk anymore', 'smarrison would first really though snyder doucheclown', 'iamjazzyfizzle wish watch miss iamlilnicki premiere', 'hollis death scene hurt severely watch film directors', 'file taxes', 'lettya always wanted rent love soundtrack', 'fakerpattypattz dear drinking forgotten table drinks', 'alydesigns much done', 'friend called asked meet valley today time sigh', 'angry barista baked cake ated', 'week going hoped', 'blagh class tomorrow', 'hate call wake people', 'going sleep watching marley', 'miss lilly', 'ooooh leslie leslie

In [None]:
tf.vocabulary_  # Learned corpus vocabulary (word -> column index)

{'able': 0,
 'account': 1,
 'actually': 2,
 'added': 3,
 'adidas': 4,
 'afternoon': 5,
 'agreed': 6,
 'algonquin': 7,
 'alielayus': 8,
 'allllll': 9,
 'almost': 10,
 'already': 11,
 'also': 12,
 'always': 13,
 'alydesigns': 14,
 'anaheim': 15,
 'andy': 16,
 'andywana': 17,
 'angry': 18,
 'annoys': 19,
 'another': 20,
 'anymore': 21,
 'anything': 22,
 'arms': 23,
 'around': 24,
 'asap': 25,
 'ashleyac': 26,
 'asian': 27,
 'asked': 28,
 'asleep': 29,
 'assets': 30,
 'astros': 31,
 'ated': 32,
 'attention': 33,
 'attire': 34,
 'away': 35,
 'awol': 36,
 'awww': 37,
 'babe': 38,
 'babies': 39,
 'back': 40,
 'baked': 41,
 'ball': 42,
 'bands': 43,
 'barista': 44,
 'batmanyng': 45,
 'bear': 46,
 'behaving': 47,
 'bill': 48,
 'birthday': 49,
 'black': 50,
 'blackberry': 51,
 'blagh': 52,
 'blah': 53,
 'blast': 54,
 'blood': 55,
 'body': 56,
 'booked': 57,
 'bounds': 58,
 'bracket': 59,
 'break': 60,
 'breaking': 61,
 'breaks': 62,
 'broadband': 63,
 'broken': 64,
 'burnt': 65,
 'business': 66,

In [None]:
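# Let's get the Inverse Document Frequency (IDF) part
idf = tf.idf_
# Note: scikit-learn 1.0+ provides get_feature_names_out(); get_feature_names() was removed in 1.2
print(dict(zip(fitted_text.get_feature_names(), idf)))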

{'able': 5.605170185988092, 'account': 5.605170185988092, 'actually': 5.605170185988092, 'added': 5.605170185988092, 'adidas': 5.605170185988092, 'afternoon': 5.605170185988092, 'agreed': 5.605170185988092, 'algonquin': 5.605170185988092, 'alielayus': 5.605170185988092, 'allllll': 5.605170185988092, 'almost': 5.605170185988092, 'already': 4.912023005428146, 'also': 4.912023005428146, 'always': 4.912023005428146, 'alydesigns': 5.605170185988092, 'anaheim': 5.605170185988092, 'andy': 5.605170185988092, 'andywana': 5.605170185988092, 'angry': 5.605170185988092, 'annoys': 5.605170185988092, 'another': 5.605170185988092, 'anymore': 5.605170185988092, 'anything': 5.605170185988092, 'arms': 5.605170185988092, 'around': 5.605170185988092, 'asap': 5.605170185988092, 'ashleyac': 5.605170185988092, 'asian': 5.605170185988092, 'asked': 5.605170185988092, 'asleep': 4.912023005428146, 'assets': 5.605170185988092, 'astros': 5.605170185988092, 'ated': 5.605170185988092, 'attention': 5.605170185988092,

In [None]:
feature_names = np.array(tf.get_feature_names())
sorted_by_idf = np.argsort(tf.idf_)
print("Features with lowest IDF:\n{}".format(feature_names[sorted_by_idf[:10]]))
print("\nFeatures with highest idf:\n{}".format(feature_names[sorted_by_idf[-10:]]))

Features with lowest IDF:
['like' 'think' 'tomorrow' 'really' 'sorry' 'time' 'today' 'miss' 'hate'
 'though']

Features with highest idf:
['forgotten' 'forget' 'forever' 'follow' 'fleurylis' 'first' 'fire' 'fine'
 'gear' 'wutcha']


In [None]:
#TF-IDF - Maximum token value throughout the whole dataset

tfidf_value = tf.transform(text)

# find maximum value for each of the features over all of dataset:
max_val = tfidf_value.max(axis=0).toarray().ravel()

#sort weights from smallest to biggest and extract their indices
sort_by_tfidf = max_val.argsort()

print("Features with lowest tfidf:\n{}".format(feature_names[sort_by_tfidf[:10]]))
print("\nFeatures with highest tfidf: \n{}".format(feature_names[sort_by_tfidf[-10:]]))


Features with lowest tfidf:
['like' 'think' 'tomorrow' 'really' 'hate' 'sorry' 'time' 'today' 'miss'
 'though']

Features with highest tfidf: 
['first' 'fire' 'fine' 'friend' 'still' 'class' 'never' 'sick' 'nite'
 'leslie']


### Co-Occurrence Vector

In [None]:
import collections  # specialized container datatypes that complement Python's built-in dict, list, set, and tuple
def co_occurrence(sentences, window_size):
    d = collections.defaultdict(int)  # dict subclass that supplies 0 for missing keys
    vocab = set()
    for text in sentences:
        # preprocessing (a proper tokenizer could be used instead)
        text = text.lower().split()
        # iterate over the tokens of the sentence
        for i in range(len(text)):
            token = text[i]
            vocab.add(token)  # add to vocab
            next_token = text[i+1 : i+1+window_size]  # the next `window_size` tokens
            for t in next_token:
                key = tuple( sorted([t, token]) )  # order-independent pair
                d[key] += 1

    # convert the dictionary into a symmetric dataframe
    vocab = sorted(vocab) # sort vocab
    df = pd.DataFrame(data=np.zeros((len(vocab), len(vocab)), dtype=np.int16),
                      index=vocab,
                      columns=vocab)
    for key, value in d.items():
        df.at[key[0], key[1]] = value
        df.at[key[1], key[0]] = value
    return df

In [None]:
text_co = list(dataset["text"][:50]) #Just the top 50 tweets
co_occurence_df = co_occurrence(text_co, 2) #dataframe

In [None]:
co_occurence_df.head(50)

Unnamed: 0,account,afternoon,alielayus,almost,also,always,alydesigns,anaheim,angry,another,anymore,asian,asked,asleep,ated,awww,back,baked,ball,barista,bear,behaving,blackberry,blagh,blah,body,bounds,break,breaking,breaks,burnt,cake,call,called,came,caregiving,cause,champ,checked,city,...,tatiana,taxes,teardrops,tell,texting,thanks,think,though,thought,three,time,timeline,times,today,tomorrow,tomorrows,track,tracy,twanking,twittera,uids,unfornately,update,upset,user,valley,viennah,wake,wanna,want,wanted,watch,watching,wear,week,whole,wish,work,workin,would
account,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
afternoon,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0
alielayus,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0
almost,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
also,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0
always,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0
alydesigns,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
anaheim,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
angry,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
another,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0


The co-occurrence matrix above is quite sparse: the many 0s mean that very few word pairs co-occurred within the specified window (2 words in our case).
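
To see where the few non-zero cells come from, the window logic can be traced on a single cleaned tweet from the sample, "angry barista baked cake ated". The sketch below is a simplified restatement of the counting step only; the function name `window_pairs` is hypothetical and no DataFrame is built.

```python
from collections import Counter

def window_pairs(tokens, window_size):
    """Count unordered pairs of tokens that are at most `window_size` positions apart."""
    counts = Counter()
    for i, token in enumerate(tokens):
        for neighbour in tokens[i + 1 : i + 1 + window_size]:
            counts[tuple(sorted((token, neighbour)))] += 1
    return counts

pairs = window_pairs("angry barista baked cake ated".split(), 2)
for pair, count in sorted(pairs.items()):
    print(pair, count)
```

Words that are more than two positions apart, such as "angry" and "cake", never form a pair, so their cell stays 0. With a vocabulary of hundreds of words and tweets of only a few tokens, most cells stay empty.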

### Continuous Bag of Words (CBoW)

**Word Embedding** is a modeling technique in which words are mapped to vectors of real numbers in a vector space with a fixed number of dimensions. Neural networks and other probabilistic models generate them. **[Word2Vec](https://code.google.com/archive/p/word2vec/)** is one such technique. CBOW is one of the two Word2Vec architectures; it learns embeddings by predicting a word from its surrounding context.

The CBOW model predicts the current word given the context words within a specific window. The input layer contains the context words and the output layer contains the current word. The size of the hidden layer is the number of dimensions in which we want to represent the current word at the output layer.

![CBOW](https://cdn-images-1.medium.com/max/800/1*UVe8b6CWYykcxbBOR6uCfg.png)
      
The CBOW model framework (Source: https://arxiv.org/pdf/1301.3781.pdf, Mikolov et al.)
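
The way CBOW frames its prediction task can be sketched by listing the (context, target) examples for one sentence. This is a simplification for illustration only: the function name `cbow_examples` is hypothetical, the window is fixed, and no neural network is trained.

```python
def cbow_examples(tokens, window):
    """Return (context_words, target_word) examples for a fixed window size."""
    examples = []
    for i, target in enumerate(tokens):
        # words up to `window` positions to the left and right of the target
        context = tokens[max(0, i - window) : i] + tokens[i + 1 : i + 1 + window]
        if context:
            examples.append((context, target))
    return examples

tokens = "whole body feels itchy like fire".split()
for context, target in cbow_examples(tokens, 2):
    print(context, "->", target)
```

Each example asks the model to recover the word in the middle from its neighbours, which matches the input and output layers described above.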


**Reference:**

Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., & Dean, J. (2013). Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems (pp. 3111-3119).

In [None]:
from nltk.tokenize import sent_tokenize, word_tokenize
import gensim  #Gensim makes it very easy to train complicated models with very few lines of  code
from gensim.models import Word2Vec
nltk.download('punkt')
import warnings
warnings.filterwarnings(action = 'ignore')


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [None]:
text_w2v =  dataset['tokens'][:10000]

In [None]:
CBOW_Model = gensim.models.Word2Vec(text_w2v, min_count = 1, size = 100, window = 5) #The default architecture is CBOW unless Skip-gram is requested with sg=1. Note: gensim 4.0+ renames `size` to `vector_size`

In [None]:
print("Most Similar Word by CBOW  to 'tomorrow': \n")
CBOW_Model.wv.most_similar("tomorrow") #The score is the cosine similarity score

Most Similar Word by CBOW  to 'tomorrow': 



[('going', 0.9997029304504395), ('still', 0.9996705651283264), ('time', 0.9996705651283264), ('today', 0.9996482133865356), ('hope', 0.9996415972709656), ('work', 0.9996351599693298), ('miss', 0.999625563621521), ('tonight', 0.9996063709259033), ('think', 0.9996018409729004), ('week', 0.9995994567871094)]
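
The scores returned by `most_similar` are cosine similarities between word vectors. A minimal sketch of that measure in plain Python is given below; the function name and the toy two-dimensional vectors are assumptions for illustration, whereas the model above uses 100-dimensional vectors.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

print(cosine_similarity([1.0, 2.0], [2.0, 4.0]))   # same direction
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))   # perpendicular
print(cosine_similarity([1.0, 0.0], [-1.0, 0.0]))  # opposite direction
```

The measure depends only on direction, not on vector length. Scores that are all very close to 1, as in the output above, mean the vectors point in almost the same direction.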

### Skip Gram
The Skip-gram model, on the other hand, predicts the surrounding context words within a specific window given the current word.

![SkipGram Representation](https://cdn-images-1.medium.com/max/800/1*SR6l59udY05_bUICAjb6-w.png)

The Skip-gram model framework (Source: https://arxiv.org/pdf/1301.3781.pdf, Mikolov et al.)

The input layer contains the current word, while the output layer contains the context words. The size of the hidden layer is the number of dimensions in which we want to represent the current word at the input layer.
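
The Skip-gram task can be sketched in the same spirit by listing (current word, context word) pairs for one sentence. The function name `skipgram_pairs` is hypothetical and the fixed window is a simplification; this is only an illustration of the training pairs, not of the training itself.

```python
def skipgram_pairs(tokens, window):
    """Return (current_word, context_word) pairs for a fixed window size."""
    pairs = []
    for i, current in enumerate(tokens):
        start = max(0, i - window)
        stop = min(len(tokens), i + window + 1)
        for j in range(start, stop):
            if j != i:  # a word is not its own context
                pairs.append((current, tokens[j]))
    return pairs

for pair in skipgram_pairs("whole body feels itchy".split(), 1):
    print(pair)
```

Compared with CBOW, the direction is reversed: one input word produces several prediction targets, one per neighbour in the window.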

**Reference:**

Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., & Dean, J. (2013). Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems (pp. 3111-3119).

In [None]:
# Create Skip-gram model
Skp_Gram_Model = gensim.models.Word2Vec(text_w2v, min_count = 1, size = 100, window = 5, sg = 1) #sg=1 changes the architecture from CBOW to Skip-gram. Note: gensim 4.0+ renames `size` to `vector_size`

In [None]:
print("Most Similar Word by SkipGram  to 'tomorrow': \n")
Skp_Gram_Model.wv.most_similar("tomorrow") #The score is the cosine similarity score

Most Similar Word by SkipGram  to 'tomorrow': 



[('going', 0.9997503757476807), ('today', 0.999640941619873), ('tired', 0.9996140003204346), ('gonna', 0.9996077418327332), ('early', 0.9996066093444824), ('gotta', 0.9995886087417603), ('time', 0.9995882511138916), ('class', 0.9995787143707275), ('long', 0.9995675086975098), ('school', 0.9995666146278381)]

# Deep Learning in NLP

### Bidirectional Encoder Representations from Transformers (BERT)

BERT is a method of pre-training language representations, meaning that we train a general-purpose "language understanding" model on a large text corpus (like Wikipedia), and then use that model for downstream NLP tasks that we care about (e.g. question answering). BERT outperforms previous methods because it is the first unsupervised, deeply bidirectional system for pre-training NLP.

Reference paper: [BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/abs/1810.04805)

We'll use a pre-trained BERT to generate the embedding vectors. We'll set up a BERT layer as a hidden layer, which requires token_ids, mask_ids, and segment_ids as its input sequence. More information on this can be found [here](https://github.com/google-research/bert/blob/master/run_classifier.py).
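
A rough sketch of what these three inputs look like for a single sentence is shown below. It rests on several assumptions: the function name `build_bert_inputs` is hypothetical, the toy vocabulary and its ids are invented (the real ids come from the model's `vocab.txt`), and whole words are looked up directly, whereas BERT splits text into WordPiece sub-tokens first.

```python
def build_bert_inputs(tokens, vocab, max_seq_len):
    """Build padded token_ids, mask_ids and segment_ids for one sentence."""
    # Reserve two positions for the special [CLS] and [SEP] markers
    tokens = ["[CLS]"] + tokens[: max_seq_len - 2] + ["[SEP]"]
    token_ids = [vocab.get(token, vocab["[UNK]"]) for token in tokens]
    mask_ids = [1] * len(token_ids)      # 1 marks a real token
    segment_ids = [0] * len(token_ids)   # single-sentence input: all segment 0
    padding = max_seq_len - len(token_ids)
    token_ids += [vocab["[PAD]"]] * padding
    mask_ids += [0] * padding            # 0 marks padding
    segment_ids += [0] * padding
    return token_ids, mask_ids, segment_ids

# Hypothetical toy vocabulary
toy_vocab = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3, "whole": 4, "body": 5}
token_ids, mask_ids, segment_ids = build_bert_inputs(["whole", "body", "itchy"], toy_vocab, 8)
print(token_ids)
print(mask_ids)
print(segment_ids)
```

All three lists have the same fixed length, so a batch of sentences can be stacked into rectangular arrays.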


In [None]:
dataset_bert = dataset[["target","text"]] #Dataset for the BERT model

In [None]:
dataset_bert.head() #Sample records

Unnamed: 0,target,text
0,0,switchfoot
1,0,upset update facebook texting might result sch...
2,0,kenichan dived many times ball managed save re...
3,0,whole body feels itchy like fire
4,0,nationwideclass behaving


In [None]:
#Split the dataset into training, validation and testing sets for BERT modelling.
from sklearn.model_selection import train_test_split
TRAIN_SIZE = 0.75
VAL_SIZE = 0.05
dataset_count = len(dataset_bert)

df_train_val, df_test = train_test_split(dataset_bert, test_size=1-TRAIN_SIZE-VAL_SIZE, random_state=42)
df_train, df_val = train_test_split(df_train_val, test_size=VAL_SIZE / (VAL_SIZE + TRAIN_SIZE), random_state=42)

print("TRAIN size:", len(df_train))
print("VALIDATION size:", len(df_val))
print("TEST size:", len(df_test))

TRAIN size: 1200000
VALIDATION size: 80000
TEST size: 320000
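
The second `train_test_split` call uses `VAL_SIZE / (VAL_SIZE + TRAIN_SIZE)` because it operates on what is left after the test set was removed, not on the full dataset. The arithmetic can be checked with a short sketch; the function name `split_sizes` is hypothetical and `round` is used here as an approximation of how scikit-learn turns fractions into row counts.

```python
def split_sizes(n_rows, train_frac, val_frac):
    """Approximate row counts produced by the two-stage split."""
    test_frac = 1 - train_frac - val_frac
    n_test = round(n_rows * test_frac)             # first split: carve off the test set
    n_train_val = n_rows - n_test
    val_share = val_frac / (val_frac + train_frac)  # validation share of the remainder
    n_val = round(n_train_val * val_share)         # second split: carve off validation
    n_train = n_train_val - n_val
    return n_train, n_val, n_test

print(split_sizes(1_600_000, 0.75, 0.05))
```

Here the remainder is 80% of the data, so 5% of the full dataset corresponds to 0.05 / 0.80 = 6.25% of the remainder.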


In [None]:
df_val.head()

Unnamed: 0,target,text
1309287,4,heat brought letter summer slain fists raised ...
569311,0,missing days felt inside
133752,0,rebeccao dear well
1087939,4,chalkbored thank love bright colours could don...
1378591,4,thinking eating another doughnut


In [None]:
#Write the DataFrames to tab-separated (TSV) files
!mkdir -p dataset
df_train.sample(frac=1.0).reset_index(drop=True).to_csv('dataset/train.tsv', sep='\t', index=None, header=None)
df_val.to_csv('dataset/dev.tsv', sep='\t', index=None, header=None)
df_test.to_csv('dataset/test.tsv', sep='\t', index=None, header=None)
! cd dataset && ls

dev.tsv  test.tsv  train.tsv


In [None]:
#!wget https://storage.googleapis.com/bert_models/2018_10_18/uncased_L-12_H-768_A-12.zip  # large model; takes some time to train
# Smaller version, ideal for students learning without lots of resources
!wget https://storage.googleapis.com/bert_models/2020_02_20/uncased_L-2_H-128_A-2.zip
!unzip uncased_L-2_H-128_A-2.zip

--2020-04-28 09:41:37--  https://storage.googleapis.com/bert_models/2020_02_20/uncased_L-2_H-128_A-2.zip
Resolving storage.googleapis.com (storage.googleapis.com)... 74.125.20.128, 2607:f8b0:400e:c07::80
Connecting to storage.googleapis.com (storage.googleapis.com)|74.125.20.128|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 16529104 (16M) [application/zip]
Saving to: ‘uncased_L-2_H-128_A-2.zip’


2020-04-28 09:41:41 (4.92 MB/s) - ‘uncased_L-2_H-128_A-2.zip’ saved [16529104/16529104]

Archive:  uncased_L-2_H-128_A-2.zip
  inflating: bert_model.ckpt.data-00000-of-00001  
  inflating: bert_config.json        
  inflating: vocab.txt               
  inflating: bert_model.ckpt.index   


In [None]:
!pip install BertLibrary #Tensorflow library for quick and easy training and finetuning of models based on Bert

Collecting BertLibrary
  Downloading https://files.pythonhosted.org/packages/a5/f6/62c112afb62265d980e44db418094e11950a47b79ea8d71d14a2a9c6f6d8/BertLibrary-0.0.4.tar.gz (57kB)
     |████████████████████████████████| 61kB 3.1MB/s 
Building wheels for collected packages: BertLibrary
  Building wheel for BertLibrary (setup.py) ... done
  Created wheel for BertLibrary: filename=BertLibrary-0.0.4-cp36-none-any.whl size=75016 sha256=f3957b435ebcb574c3f9ea11d97868eea44107c18e6697dab79626fde9a24936
  Stored in directory: /root/.cache/pip/wheels/63/3d/ab/990438ec53e97a0203d2be35ad77fcdcb0750bee7057ddf25f
Successfully built BertLibrary
Ins

In [None]:
from BertLibrary import BertFTModel
import numpy as np






In [None]:
!mkdir output
ft_model = BertFTModel( model_dir='uncased_L-2_H-128_A-2',
                        ckpt_name="bert_model.ckpt",
                        labels=['0','1','2','3','4'], #Labels in your dataset (sentiment scores in our case; sentiment140 itself only uses 0 and 4)
                        lr=1e-05,
                        num_train_steps=10000, #Relatively few steps. Increase the number as per your preference
                        num_warmup_steps=1000,
                        ckpt_output_dir='output',
                        save_check_steps=1000,
                        do_lower_case=False,
                        max_seq_len=50,
                        batch_size=32,
                        )
ft_trainer = ft_model.get_trainer()
ft_evaluator = ft_model.get_evaluator()

INFO:tensorflow:Using config: {'_model_dir': 'output', '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': 1000, '_save_checkpoints_secs': None, '_session_config': device_count {
  key: "GPU"
  value: 1
}
gpu_options {
  per_process_gpu_memory_fraction: 0.5
  allow_growth: true
}
, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': 100, '_train_distribute': None, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_experimental_max_worker_delay_secs': None, '_session_creation_timeout_secs': 7200, '_service': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7fccb05a2f98>, '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1}


In [None]:
%ls

 dataset/                             training.1600000.processed.noemoticon.csv
'Natural Language Processing.ipynb'   uncased_L-2_H-128_A-2/
 output/                              uncased_L-2_H-128_A-2.zip


In [None]:
%cd ..

/content/drive/My Drive/Torrens/NLP


In [None]:
%ls

 dataset/                                    uncased_L-12_H-768_A-12/
'Natural Language Processing.ipynb'          uncased_L-12_H-768_A-12.zip
 output/                                     uncased_L-12_H-768_A-12.zip.1
 training.1600000.processed.noemoticon.csv


In [None]:
ft_trainer.train_from_file('dataset/', 35000)  # Train the model on the split data in the dataset folder. Make sure this folder exists, or change the path to your own

INFO:tensorflow:Not using Distribute Coordinator.
INFO:tensorflow:Running training and evaluation locally (non-distributed).
INFO:tensorflow:Start train and evaluate loop. The evaluate will happen after every checkpoint. Checkpoint frequency is determined based on RunConfig arguments: save_checkpoints_steps 1000 or save_checkpoints_secs None.
INFO:tensorflow:Calling model_fn.
INFO:tensorflow:*** Features ***
INFO:tensorflow:  name = input_ids, shape = (32, 50)
INFO:tensorflow:  name = input_mask, shape = (32, 50)
INFO:tensorflow:  name = is_real_example, shape = (32,)
INFO:tensorflow:  name = label_ids, shape = (32,)
INFO:tensorflow:  name = segment_ids, shape = (32, 50)
INFO:tensorflow:**** Trainable Variables ****
INFO:tensorflow:  name = bert/embeddings/word_embeddings:0, shape = (30522, 128), *INIT_FROM_CKPT*
INFO:tensorflow:  name = bert/embeddings/token_type_embeddings:0, shape = (2, 128), *INIT_FROM_CKPT*
INFO:tensorflow:  name = bert/embeddings/position_embeddings:0, shape = (5

In [None]:
ft_evaluator.evaluate_from_file('dataset', checkpoint="output/model.ckpt-35000")

INFO:tensorflow:Writing example 0 of 319999
INFO:tensorflow:*** Example ***
INFO:tensorflow:guid: 1
INFO:tensorflow:tokens: [CLS] miss ##tori ##bla ##ck cool t ##wee ##t apps ra ##z ##r [SEP]
INFO:tensorflow:input_ids: 101 3335 29469 28522 3600 4658 1056 28394 2102 18726 10958 2480 2099 102 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
INFO:tensorflow:input_mask: 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
INFO:tensorflow:segment_ids: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
INFO:tensorflow:label: 0 (id = 0)
INFO:tensorflow:*** Example ***
INFO:tensorflow:guid: 2
INFO:tensorflow:tokens: [CLS] tian ##nac ##ha ##os know family drama lame next time hang guys like sleep ##over whatever call [SEP]
INFO:tensorflow:input_ids: 101 23401 18357 3270 2891 2113 2155 3689 20342 2279 2051 6865 4364 2066 3637 7840 3649 2655 102 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 

Evaluation metrics: *eval_accuracy = 0.49841717, global_step = 35000, loss = 4.158586*. The scores are low because we fine-tuned a very small BERT model. For better results, fine-tune one of the larger pretrained BERT models available at https://github.com/google-research/bert .
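To judge whether an accuracy figure like the one above is meaningful, it can help to compare it with a trivial baseline that always predicts the most frequent label. The sketch below is a minimal illustration in plain Python. The helper names `accuracy` and `majority_baseline` and the label lists are hypothetical and are not part of BertLibrary or the notebook's dataset.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def majority_baseline(y_true):
    """Accuracy obtained by always predicting the most frequent label."""
    most_common_count = Counter(y_true).most_common(1)[0][1]
    return most_common_count / len(y_true)


# Hypothetical labels for illustration only (not taken from the dataset).
y_true = ['0', '4', '4', '0', '4', '0']
y_pred = ['0', '4', '0', '0', '0', '0']

print("model accuracy:   ", accuracy(y_true, y_pred))
print("majority baseline:", majority_baseline(y_true))
```

If a model's accuracy is close to the majority baseline, it has probably learned little beyond the label distribution.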

## Part of Speech (POS) tagging
POS tagging is the process of marking up each word in a corpus with a corresponding part-of-speech tag, based on its context and definition. The primary goal of Part-of-Speech (POS) tagging is to identify the grammatical group of a given word (NOUN, PRONOUN, ADJECTIVE, VERB, ADVERB, etc.) from its context. A POS tagger looks at the relationships within the sentence and assigns a corresponding tag to each word.

In [None]:
sentence = "This is the first tweet about Torrens University"

In [None]:
import nltk
nltk.download('punkt')  # Tokenizer models used by word_tokenize
tokens = nltk.word_tokenize(sentence)  # Split the sentence into word tokens
print(tokens)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
['This', 'is', 'the', 'first', 'tweet', 'about', 'Torrens', 'University']


In [None]:
nltk.pos_tag(tokens)

[('This', 'DT'),
 ('is', 'VBZ'),
 ('the', 'DT'),
 ('first', 'JJ'),
 ('tweet', 'NN'),
 ('about', 'IN'),
 ('Torrens', 'NNP'),
 ('University', 'NNP')]
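The tags in the output above come from the Penn Treebank tag set, which `nltk.pos_tag` uses by default. As a reading aid, the sketch below maps the tags that appear in this output to short descriptions. The `PENN_TAGS` dictionary and the `describe` helper are illustrative names introduced here, and the dictionary covers only these six tags, not the full tag set. Note also that `nltk.pos_tag` needs a tagger resource in addition to `punkt`; in NLTK versions from this period that is assumed to be `nltk.download('averaged_perceptron_tagger')`.

```python
# Descriptions of the Penn Treebank tags that appear in the output above.
PENN_TAGS = {
    'DT': 'determiner',
    'VBZ': 'verb, 3rd person singular present',
    'JJ': 'adjective',
    'NN': 'noun, singular or mass',
    'IN': 'preposition or subordinating conjunction',
    'NNP': 'proper noun, singular',
}

# Tagged tokens copied from the nltk.pos_tag output above.
tagged = [('This', 'DT'), ('is', 'VBZ'), ('the', 'DT'), ('first', 'JJ'),
          ('tweet', 'NN'), ('about', 'IN'), ('Torrens', 'NNP'),
          ('University', 'NNP')]


def describe(tagged_tokens):
    """Attach a human-readable description to each (word, tag) pair."""
    return [(word, tag, PENN_TAGS.get(tag, 'unknown tag'))
            for word, tag in tagged_tokens]


for word, tag, meaning in describe(tagged):
    print(f"{word:<12}{tag:<5}{meaning}")
```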

## Named Entity Recognition (NER) Using spaCy

Named entity recognition (NER) is a technique in information extraction that seeks to locate named entities in text and classify them into pre-defined categories. Such categories can include names of persons, organizations, locations, times, currencies, etc. We'll use [spaCy](https://spacy.io/), a versatile Python package designed for real, production-level NLP work. It's a great alternative to NLTK.
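To make the "locate and classify" idea concrete, here is a deliberately simple rule-based sketch using only regular expressions. It is not how spaCy works (spaCy uses trained statistical models). The patterns, the `toy_ner` function and the sample sentence are all hypothetical and cover only a handful of cases.

```python
import re

# Hypothetical hand-written patterns, one per entity category.
PATTERNS = [
    ('MONEY', re.compile(r'[$£]\d+(?:\.\d+)?(?:bn|m)?')),
    ('PERCENT', re.compile(r'\d+(?:\.\d+)?%')),
    ('ORG', re.compile(r'\b(?:HSBC|BBC)\b')),
]


def toy_ner(text):
    """Return (start, entity_text, label) tuples sorted by position."""
    found = []
    for label, pattern in PATTERNS:
        for match in pattern.finditer(text):
            found.append((match.start(), match.group(), label))
    return sorted(found)


sample = "HSBC reported a 50% fall in profits, with earnings of $3.2bn."
for start, entity_text, label in toy_ner(sample):
    print(start, entity_text, label)
```

Hand-written rules like these break down quickly on unseen names and ambiguous wording, which is why trained models are generally preferred in practice.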

In [None]:
!pip install spacy
!python -m spacy download en_core_web_sm
import spacy
# Load the small English pipeline: tokenizer, tagger, parser and NER
nlp = spacy.load("en_core_web_sm")

✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_sm')


In [None]:
bbc_news = ("HSBC has paused plans to cut 35,000 jobs, saying it does not want to leave staff unable to find work elsewhere during the coronavirus outbreak. "
            "The bank announced the cuts in February as part of a massive cost-cutting programme. But boss Noel Quinn said the vast majority of redundancies "
            "would now be put on hold due to the exceptional circumstances. It came as HSBC reported a 50% fall in profits linked to the pandemic. "
            "Pre-tax earnings for the first three months came in at $3.2bn (£2.6bn), down from $6.2bn a year ago.")

document = nlp(bbc_news)

In [None]:
# Analyze syntax
print("Noun phrases:", [chunk.text for chunk in document.noun_chunks])
print("Verbs:", [token.lemma_ for token in document if token.pos_ == "VERB"])

# Find named entities, phrases and concepts
for entity in document.ents:
    print(entity.text, entity.label_)

Noun phrases: ['HSBC', 'plans', '35,000 jobs', 'it', 'staff', 'work', 'the coronavirus outbreak', 'The bank', 'the cuts', 'February', 'part', 'a massive cost-cutting programme', 'boss Noel Quinn', 'the vast majority', 'redundancies', 'hold', 'the exceptional circumstances', 'It', 'HSBC', 'a 50% fall', 'profits', 'the pandemic', 'Pre-tax earnings', 'the first three months']
Verbs: ['pause', 'cut', 'say', 'want', 'leave', 'find', 'announce', 'cut', 'say', 'would', 'put', 'come', 'report', 'link', 'come']
HSBC ORG
35,000 CARDINAL
February DATE
Noel Quinn PERSON
HSBC ORG
50% PERCENT
the first three months DATE
3.2bn MONEY
2.6bn MONEY
6.2bn MONEY
a year ago DATE


spaCy's NER model correctly identifies the entity categories in this text: for example, HSBC is labelled ORG, Noel Quinn PERSON, and the dollar and pound amounts MONEY.
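For downstream use it is often convenient to group the recognised entities by category. The sketch below does this in plain Python on the (text, label) pairs copied from the output above, so it runs without spaCy. With the pipeline loaded, the same list could be built as `[(ent.text, ent.label_) for ent in document.ents]`. The `group_by_label` helper is an illustrative name introduced here.

```python
from collections import defaultdict

# (text, label) pairs copied from the spaCy output above.
entities = [
    ('HSBC', 'ORG'), ('35,000', 'CARDINAL'), ('February', 'DATE'),
    ('Noel Quinn', 'PERSON'), ('HSBC', 'ORG'), ('50%', 'PERCENT'),
    ('the first three months', 'DATE'), ('3.2bn', 'MONEY'),
    ('2.6bn', 'MONEY'), ('6.2bn', 'MONEY'), ('a year ago', 'DATE'),
]


def group_by_label(pairs):
    """Group entity texts by label, dropping repeats and keeping first-seen order."""
    grouped = defaultdict(list)
    for text, label in pairs:
        if text not in grouped[label]:
            grouped[label].append(text)
    return dict(grouped)


for label, texts in group_by_label(entities).items():
    print(f"{label}: {texts}")
```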