## Introduction to Natural Language Processing

### Purpose of Text Mining

- Extract information from text
- Identify similarity between sentences or documents
- Summarize the intent of a text article or review
- Translate human language into a numerical representation that computers can analyze

### Flow of NLP projects
 <img src="images/nlpflow.png" alt="nlp_flow" style="width:80%;height:200px;">

### Load Libraries

In [1]:
import string
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.datasets import fetch_20newsgroups
import pandas as pd

#### Transform processed text into a Document-Term Matrix / Term-Document Matrix

In [6]:
sample = ['Mr. Toad saw the car approching him at a distance. The distance to my house is quite large as compared to yours.' ,
         'When you finish your coding you can have your dinner. You need to embrace hard work',
         'The time has come for humans to embrace Computers as equal partners.',
         'Cricket is not just a sport but a passion in India']

In [7]:
count_vect = CountVectorizer(stop_words="english")  # Initialize the vectorizer, dropping English stop words
dtm = count_vect.fit_transform(sample)  # Learn the vocabulary and build the document-term matrix
dtm

<4x28 sparse matrix of type '<class 'numpy.int64'>'
	with 29 stored elements in Compressed Sparse Row format>

In [8]:
dtm.todense()

matrix([[1, 1, 0, 0, 1, 0, 0, 0, 2, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0,
         0, 1, 1, 0, 0, 1, 0],
        [0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0,
         0, 0, 0, 0, 0, 0, 1],
        [0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1,
         0, 0, 0, 0, 1, 0, 0],
        [0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0,
         1, 0, 0, 1, 0, 0, 0]])

In [5]:
count_vect.vocabulary_

{'mr': 15,
 'toad': 22,
 'saw': 19,
 'car': 1,
 'approching': 0,
 'distance': 7,
 'finish': 10,
 'coding': 2,
 'dinner': 6,
 'need': 16,
 'embrace': 8,
 'hard': 11,
 'work': 23,
 'time': 21,
 'come': 3,
 'humans': 12,
 'computers': 4,
 'equal': 9,
 'partners': 17,
 'cricket': 5,
 'just': 14,
 'sport': 20,
 'passion': 18,
 'india': 13}
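
The `vocabulary_` mapping goes from word to column index, so reading a row of the matrix means inverting it. The sketch below shows one way to do that in plain Python, and also how a term-document matrix is just the transpose of the document-term matrix. The vocabulary and counts are hypothetical toy values, not the output above. Recent scikit-learn versions expose the same column ordering through `count_vect.get_feature_names_out()`.

```python
# Hypothetical miniature vocabulary in the same word -> column-index form
# that CountVectorizer stores in `vocabulary_`
vocabulary = {"embrace": 2, "computers": 0, "humans": 3, "distance": 1}

# Hypothetical document-term matrix: one row per document, one column per word
dtm = [
    [1, 0, 1, 1],
    [0, 2, 1, 0],
]

# Order the words by column index so that position j names column j
feature_names = sorted(vocabulary, key=vocabulary.get)

# Label the non-zero counts of the first document
first_doc_counts = {word: n for word, n in zip(feature_names, dtm[0]) if n > 0}

# Term-document matrix: transpose, giving one row per word
tdm = [list(column) for column in zip(*dtm)]

print(feature_names)
print(first_doc_counts)
print(tdm)
```

With a real sparse matrix, the transpose is available as `dtm.T`.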

### Building a sample predictive model on the 20 Newsgroups data

In [9]:
twenty_train = fetch_20newsgroups(subset='train', shuffle=True, random_state=42)

In [10]:
type(twenty_train.data) , twenty_train.data[0]

(list,
 "From: lerxst@wam.umd.edu (where's my thing)\nSubject: WHAT car is this!?\nNntp-Posting-Host: rac3.wam.umd.edu\nOrganization: University of Maryland, College Park\nLines: 15\n\n I was wondering if anyone out there could enlighten me on this car I saw\nthe other day. It was a 2-door sports car, looked to be from the late 60s/\nearly 70s. It was called a Bricklin. The doors were really small. In addition,\nthe front bumper was separate from the rest of the body. This is \nall I know. If anyone can tellme a model name, engine specs, years\nof production, where this car is made, history, or whatever info you\nhave on this funky looking car, please e-mail.\n\nThanks,\n- IL\n   ---- brought to you by your neighborhood Lerxst ----\n\n\n\n\n")

In [11]:
twenty_train.target_names

['alt.atheism',
 'comp.graphics',
 'comp.os.ms-windows.misc',
 'comp.sys.ibm.pc.hardware',
 'comp.sys.mac.hardware',
 'comp.windows.x',
 'misc.forsale',
 'rec.autos',
 'rec.motorcycles',
 'rec.sport.baseball',
 'rec.sport.hockey',
 'sci.crypt',
 'sci.electronics',
 'sci.med',
 'sci.space',
 'soc.religion.christian',
 'talk.politics.guns',
 'talk.politics.mideast',
 'talk.politics.misc',
 'talk.religion.misc']

In [12]:
len(twenty_train.data), len(twenty_train.filenames)

(11314, 11314)

In [None]:
# Target labels
twenty_train.target[:10]

#### Converting to Bag of Words
- Assign a fixed integer id to each word occurring in any document of the training set (for instance by building a dictionary from words to integer indices).
- For each document #i, count the number of occurrences of each word w and store it in X[i, j] as the value of feature #j where j is the index of word w in the dictionary.
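
The two steps above can be sketched by hand in plain Python. This is only an illustration on a hypothetical toy corpus. `CountVectorizer` additionally lowercases the text, tokenizes with a regular expression, can drop stop words, orders the vocabulary alphabetically, and stores the result as a sparse matrix.

```python
from collections import Counter

# Hypothetical toy corpus (not the newsgroups data)
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
]

# Step 1: assign a fixed integer id to every word seen in the corpus
vocabulary = {}
for doc in docs:
    for word in doc.split():
        if word not in vocabulary:
            vocabulary[word] = len(vocabulary)

# Step 2: X[i][j] = number of times word j occurs in document i
X = [[0] * len(vocabulary) for _ in docs]
for i, doc in enumerate(docs):
    for word, count in Counter(doc.split()).items():
        X[i][vocabulary[word]] = count

print(vocabulary)
print(X)
```

Most entries in such a matrix are zero once the vocabulary grows, which is why scikit-learn returns a sparse matrix rather than a dense one.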

In [13]:
count_vect = CountVectorizer(stop_words= "english")
X_train_counts = count_vect.fit_transform(twenty_train.data)
X_train_counts.shape

(11314, 129796)

#### Training a classifier

In [14]:
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

In [20]:
# Alternative classifiers to try:
# from sklearn.naive_bayes import MultinomialNB
# from sklearn.tree import DecisionTreeClassifier
clf = LogisticRegression().fit(X_train_counts, twenty_train.target)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
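
The warning above means the solver stopped at its iteration limit before converging. A minimal sketch of one remedy the message itself suggests, raising `max_iter`, is shown below on a hypothetical toy corpus with made-up labels. The value 1000 is an assumption for illustration, not a tuned setting for the newsgroups data.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical toy corpus and labels (0 = autos, 1 = graphics)
texts = [
    "the engine and the brakes of this car",
    "new tyres for my old car",
    "rendering polygons with OpenGL shaders",
    "the GPU draws textured polygons",
]
labels = [0, 0, 1, 1]

vect = CountVectorizer(stop_words="english")
X = vect.fit_transform(texts)

# max_iter defaults to 100; a larger budget gives the solver room to converge
clf = LogisticRegression(max_iter=1000).fit(X, labels)

# n_iter_ reports how many iterations the solver actually used
print(clf.n_iter_)
print(clf.predict(vect.transform(["a car with a loud engine"])))
```

Comparing `n_iter_` with `max_iter` is a quick way to check whether the limit was reached.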


### Let's try to predict

In [21]:
docs_new = ['mortor cycle being launched in December', 'OpenGL on the GPU is fast', 
            'apple to launch new mobile soon', 'new motor bikes to be launched soon']

In [22]:
X_new_counts = count_vect.transform(docs_new)
predicted = clf.predict(X_new_counts)
for doc, category in zip(docs_new, predicted):
    print('%r => %s' % (doc, twenty_train.target_names[category]))

'mortor cycle being launched in December' => soc.religion.christian
'OpenGL on the GPU is fast' => soc.religion.christian
'apple to launch new mobile soon' => comp.sys.mac.hardware
'new motor bikes to be launched soon' => rec.motorcycles


### Further Reading
- https://medium.com/civis-analytics/an-intro-to-natural-language-processing-in-python-framing-text-classification-in-familiar-terms-33778d1aa3ca