# NLP

Natural Language Processing (NLP) deals with analyzing, understanding, and deriving information from text data in a smart and efficient manner.


Typical tasks include automatic summarization, machine translation, named entity recognition, relationship extraction, sentiment analysis, speech recognition, and topic segmentation.


# Noise Removal

In [2]:
noise_list = ["is", "a", "this", "..."] 
def _remove_noise(input_text):
    words = input_text.split() 
    noise_free_words = [word for word in words if word not in noise_list] 
    noise_free_text = " ".join(noise_free_words) 
    return noise_free_text

_remove_noise("this is a sample text")

'sample text'

# Lexicon Normalization

## Stemming

In [4]:
from nltk.stem.porter import PorterStemmer 
stem = PorterStemmer()

word = "multiplying" 
stem.stem(word)

'multipli'

## Lemmatization

In [3]:
from nltk.stem.wordnet import WordNetLemmatizer 
lem = WordNetLemmatizer()

word = "multiplying" 
lem.lemmatize(word, "v")

'multiply'

# Part-of-Speech (POS) Tagging

In [1]:
from nltk import word_tokenize, pos_tag
text = "FANO Labs is awesome!"
tokens = word_tokenize(text)
print(pos_tag(tokens))

[('FANO', 'NNP'), ('Labs', 'NNP'), ('is', 'VBZ'), ('awesome', 'JJ'), ('!', '.')]


# Statistical Features

## Term Frequency – Inverse Document Frequency (TF – IDF)

The vectorizer builds a vocabulary dictionary and assigns an index to each word. Each row of the printed output shows a tuple (i, j) followed by the tf-idf value of the word at index j in document i.

In [3]:
from sklearn.feature_extraction.text import TfidfVectorizer
obj = TfidfVectorizer()
corpus = ['This is sample document.', 'another random document.', 'third sample document text']
X = obj.fit_transform(corpus)
print(X)

  (0, 7)	0.58448290102
  (0, 2)	0.58448290102
  (0, 4)	0.444514311537
  (0, 1)	0.345205016865
  (1, 1)	0.385371627466
  (1, 0)	0.652490884513
  (1, 3)	0.652490884513
  (2, 4)	0.444514311537
  (2, 1)	0.345205016865
  (2, 6)	0.58448290102
  (2, 5)	0.58448290102
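To make the numbers above easier to interpret, here is a minimal standard-library sketch of the weighting. It assumes a smoothed idf of `ln((1 + n) / (1 + df)) + 1` followed by L2 normalization of each row, which is believed to match the vectorizer's default settings, and it approximates the default tokenization with a regular expression. The helper names `tokenize` and `tfidf` are illustrative and not part of scikit-learn.

```python
import math
import re
from collections import Counter

corpus = ['This is sample document.', 'another random document.', 'third sample document text']

def tokenize(text):
    # Lowercase and keep tokens of two or more word characters
    # (an approximation of the vectorizer's default tokenization).
    return re.findall(r"\b\w\w+\b", text.lower())

def tfidf(documents):
    tokenized = [tokenize(doc) for doc in documents]
    # Index j of a word is its position in the sorted vocabulary.
    vocabulary = sorted({term for doc in tokenized for term in doc})
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for doc in tokenized for term in set(doc))
    # Assumed smoothed idf: ln((1 + n) / (1 + df)) + 1.
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocabulary}
    rows = []
    for doc in tokenized:
        counts = Counter(doc)
        weights = {t: counts[t] * idf[t] for t in counts}
        # L2-normalize so every document vector has unit length.
        norm = math.sqrt(sum(w * w for w in weights.values()))
        rows.append({t: w / norm for t, w in weights.items()})
    return vocabulary, rows

vocabulary, rows = tfidf(corpus)
for i, row in enumerate(rows):
    for term, weight in row.items():
        print((i, vocabulary.index(term)), term, round(weight, 6))
```

Under these assumptions, "document" gets the lowest weight in every row because it occurs in all three documents, while words unique to one document get the highest.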


# Word Embedding (text vectors)

## Word2Vec

Word2Vec consists of a preprocessing module and two shallow neural network models: Continuous Bag of Words (CBOW) and skip-gram. These models are widely used across NLP tasks. Word2Vec first constructs a vocabulary from the training corpus and then learns word embedding representations. The following code uses the gensim package to produce word embeddings as vectors.

The embeddings can be used as feature vectors for ML models, for measuring text similarity with cosine similarity, and for word clustering and text classification.

See also: https://nlp.stanford.edu/projects/glove/
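The difference between the two training objectives can be sketched by generating the training examples each one would see. This is a simplified illustration, not gensim's implementation: it uses a fixed symmetric window and omits details such as subsampling and negative sampling. The function names `skipgram_pairs` and `cbow_examples` are hypothetical.

```python
def skipgram_pairs(tokens, window=2):
    """Skip-gram: predict each context word from the center word."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        stop = min(len(tokens), i + window + 1)
        for j in range(start, stop):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

def cbow_examples(tokens, window=2):
    """CBOW: predict the center word from its surrounding context words."""
    examples = []
    for i, center in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + window + 1]
        if context:
            examples.append((context, center))
    return examples

sentence = ['FANO', 'Labs', 'data', 'analytics']
print(skipgram_pairs(sentence, window=1))
print(cbow_examples(sentence, window=1))
```

Skip-gram yields one (center, context) pair per neighbor, whereas CBOW yields one example per position with all neighbors grouped as the input.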

In [12]:
from gensim.models import Word2Vec
sentences = [['nlp', 'speech'], ['FANO', 'Labs', 'data', 'analytics'],['machine', 'learning'], ['deep', 'learning']]

model = Word2Vec(sentences, min_count = 1)

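# Query the trained vectors through model.wv; direct access on the model
# object is deprecated in recent gensim releases.
print(model.wv.similarity('data', 'learning'))
print(model.wv['speech'])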

-0.135327061591
[ 0.00284716 -0.00244435 -0.00298516  0.0013916  -0.00094614  0.00126659
  0.00277956 -0.00388456  0.00461074  0.0003646  -0.00472697 -0.00419046
 -0.00041253 -0.00245799  0.00374663  0.00022313  0.00261072 -0.00355409
  0.00377335 -0.00278561  0.00313254 -0.00410708  0.00196203  0.00402066
 -0.00281726 -0.00068327  0.00210072 -0.00385393 -0.00310453 -0.00057065
  0.00455011  0.00248685  0.00110605 -0.00466762  0.00039222  0.00476353
  0.00277293  0.00323656 -0.00017605 -0.00301211 -0.002616   -0.00077974
 -0.00154974 -0.00197718  0.00218694  0.00284263 -0.00378764 -0.00401743
 -0.0041434  -0.00105117  0.00099916  0.00451826  0.00453743 -0.00183378
  0.00043845  0.00342977  0.00047013  0.0043246   0.0026336   0.00305885
  0.00136497 -0.00094814  0.00012126  0.00451916 -0.00469958  0.00289007
  0.00095014 -0.00297767 -0.00263465 -0.00269361  0.0006098   0.00124558
 -0.0031856  -0.00194441 -0.00098953 -0.00429033 -0.00284154 -0.00325313
  0.00375388 -0.00092285  0.0003386

  


# NLP Applications

## Text Classification

### Naive Bayes Model

In [14]:
from textblob.classifiers import NaiveBayesClassifier as NBC
from textblob import TextBlob
training_corpus = [
                   ('I am exhausted of this work.', 'Class_B'),
                   ("I can't cooperate with this", 'Class_B'),
                   ('He is my badest enemy!', 'Class_B'),
                   ('My management is poor.', 'Class_B'),
                   ('I love this burger.', 'Class_A'),
                   ('This is an brilliant place!', 'Class_A'),
                   ('I feel very good about these dates.', 'Class_A'),
                   ('This is my best work.', 'Class_A'),
                   ("What an awesome view", 'Class_A'),
                   ('I do not like this dish', 'Class_B')]
test_corpus = [
                ("I am not feeling well today.", 'Class_B'), 
                ("I feel brilliant!", 'Class_A'), 
                ('Gary is a friend of mine.', 'Class_A'), 
                ("I can't believe I'm doing this.", 'Class_B'), 
                ('The date was good.', 'Class_A'), ('I do not enjoy my job', 'Class_B')]

model = NBC(training_corpus) 
print(model.classify("Their codes are amazing."))
print(model.classify("I don't like their computer."))
print(model.accuracy(test_corpus))

Class_A
Class_B
0.8333333333333334


### Support Vector Machine (SVM)

In [23]:
from sklearn.feature_extraction.text import TfidfVectorizer 
from sklearn.metrics import classification_report
from sklearn import svm 

train_data = []
train_labels = []
for row in training_corpus:
    train_data.append(row[0])
    train_labels.append(row[1])

test_data = [] 
test_labels = [] 
for row in test_corpus:
    test_data.append(row[0]) 
    test_labels.append(row[1])

vectorizer = TfidfVectorizer(min_df=4, max_df=0.9)
train_vectors = vectorizer.fit_transform(train_data)
test_vectors = vectorizer.transform(test_data)

model = svm.SVC(kernel='linear') 
model.fit(train_vectors, train_labels) 
prediction = model.predict(test_vectors)

print(classification_report(test_labels, prediction))

             precision    recall  f1-score   support

    Class_A       0.50      0.67      0.57         3
    Class_B       0.50      0.33      0.40         3

avg / total       0.50      0.50      0.49         6



## Text Matching

### Levenshtein Distance 

The Levenshtein distance between two strings is defined as the **minimum number of edits needed to transform one string into the other**, with the allowable edit operations being insertion, deletion, or substitution of a single character.

In [24]:
def levenshtein(s1,s2): 
    if len(s1) > len(s2):
        s1,s2 = s2,s1 
    distances = range(len(s1) + 1) 
    for index2,char2 in enumerate(s2):
        newDistances = [index2+1]
        for index1,char1 in enumerate(s1):
            if char1 == char2:
                newDistances.append(distances[index1]) 
            else:
                newDistances.append(1 + min((distances[index1], distances[index1+1], newDistances[-1])))
        distances = newDistances 
    return distances[-1]

print(levenshtein("analyze","analyse"))

1
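To show which insertions, deletions, and substitutions make up the distance, the sketch below keeps the full dynamic-programming table and walks it backwards to recover one optimal edit script. It is an illustrative variant, not a replacement for the function above; the name `edit_script` is hypothetical, and when several optimal scripts exist it returns only one of them.

```python
def edit_script(source, target):
    """Return (distance, operations) for turning source into target."""
    rows, cols = len(source) + 1, len(target) + 1
    # dist[i][j] = edits needed to turn source[:i] into target[:j].
    dist = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        dist[i][0] = i  # delete every character of source[:i]
    for j in range(cols):
        dist[0][j] = j  # insert every character of target[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if source[i - 1] == target[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,         # deletion
                             dist[i][j - 1] + 1,         # insertion
                             dist[i - 1][j - 1] + cost)  # match or substitution

    # Walk back from the bottom-right cell to recover the operations.
    ops = []
    i, j = len(source), len(target)
    while i > 0 or j > 0:
        if i > 0 and j > 0 and source[i - 1] == target[j - 1] and dist[i][j] == dist[i - 1][j - 1]:
            i, j = i - 1, j - 1  # characters match, no edit
        elif i > 0 and j > 0 and dist[i][j] == dist[i - 1][j - 1] + 1:
            ops.append(("substitute", source[i - 1], target[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            ops.append(("delete", source[i - 1], None))
            i -= 1
        else:
            ops.append(("insert", None, target[j - 1]))
            j -= 1
    ops.reverse()
    return dist[-1][-1], ops

print(edit_script("analyze", "analyse"))
```

For "analyze" and "analyse" the script contains a single substitution of "z" by "s", which agrees with the distance of 1 printed above.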


### Phonetic Matching 

A phonetic matching algorithm takes a keyword as input (a person’s name, a location name, etc.) and produces a character string that identifies a set of words that are (roughly) phonetically similar. It is very useful for searching large text corpora, correcting spelling errors, and matching relevant names.

In [55]:
import fuzzy 
soundex = fuzzy.Soundex(4) 
dmeta = fuzzy.DMetaphone()
dmeta('fuzzy')

[b'FS', None]
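To illustrate how a phonetic key is built, here is a simplified pure-Python sketch of the classic American Soundex rules. It is not the `fuzzy` library's implementation and may differ from it on edge cases. It assumes plain ASCII names, and the names `SOUNDEX_CODES` and `simple_soundex` are illustrative.

```python
# Consonant groups that sound alike share a digit.
SOUNDEX_CODES = {}
for letters, digit in [("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")]:
    for letter in letters:
        SOUNDEX_CODES[letter] = digit

def simple_soundex(name, length=4):
    letters = [c for c in name.lower() if c.isalpha()]
    if not letters:
        return ""
    # Keep the first letter as is; remember its code to skip repeats.
    key = letters[0].upper()
    previous = SOUNDEX_CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate equal codes
        code = SOUNDEX_CODES.get(ch, "")  # vowels map to "" and reset the repeat check
        if code and code != previous:
            key += code
        previous = code
    # Pad with zeros or truncate to the requested length.
    return (key + "0" * length)[:length]

print(simple_soundex("Robert"), simple_soundex("Rupert"))
```

Names that sound alike, such as "Robert" and "Rupert", map to the same key, which is what makes the key usable for lookup and matching.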

### Cosine Similarity 

When texts are represented as vectors, cosine similarity can be applied to measure how similar the two vectors are.

In [61]:
import math
from collections import Counter

def get_cosine(vec1, vec2):
    common = set(vec1.keys()) & set(vec2.keys())
    numerator = sum([vec1[x] * vec2[x] for x in common])

    sum1 = sum([vec1[x]**2 for x in vec1.keys()]) 
    sum2 = sum([vec2[x]**2 for x in vec2.keys()]) 
    denominator = math.sqrt(sum1) * math.sqrt(sum2)
   
    if not denominator:
        return 0.0 
    else:
        return float(numerator) / denominator

def text_to_vector(text): 
    words = text.split() 
    return Counter(words)

text1 = 'she sells sea shells on the sea shore' 
text2 = 'sea she on shells sells'

vector1 = text_to_vector(text1) 
vector2 = text_to_vector(text2) 
cosine = get_cosine(vector1, vector2)
print(cosine)

0.8485281374238569
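The printed value can be checked by hand. With word counts as vector components, the shared words are "she", "sells", "sea", "shells", and "on"; "sea" occurs twice in the first text, so the squared norm of the first vector is 10 and that of the second is 5:

```latex
\cos(\theta)
  = \frac{\mathbf{a} \cdot \mathbf{b}}{\lVert \mathbf{a} \rVert \, \lVert \mathbf{b} \rVert}
  = \frac{1 \cdot 1 + 1 \cdot 1 + 2 \cdot 1 + 1 \cdot 1 + 1 \cdot 1}{\sqrt{10} \, \sqrt{5}}
  = \frac{6}{\sqrt{50}}
  \approx 0.8485
```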
