## Introduction to Natural Language Processing

### Purpose of Text Mining

- Extract information from text
- Identify similarity between sentences or documents
- Summarize the intent of an article or review
- Translate human language into a numerical representation that computers can analyze

### Flow of NLP projects
 <img src="images/nlpflow.png" alt="nlp_flow" style="width:80%;height:200px;">

### Load Libraries

In [None]:
import string
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.datasets import fetch_20newsgroups
import pandas as pd

#### Transform processed text into a Document-Term Matrix / Term-Document Matrix

In [None]:
sample = ['Mr. Toad saw the car approaching him at a distance.',
         'When you finish your coding you can have your dinner. You need to embrace hard work',
         'The time has come for humans to embrace Computers as equal partners.',
         'Cricket is not just a sport but a passion in India']

In [None]:
count_vect = CountVectorizer(stop_words="english")  # initialize vectorizer, dropping English stop words
dtm = count_vect.fit_transform(sample)  # learn the vocabulary and build the document-term matrix
dtm

In [None]:
dtm.todense()

In [None]:
count_vect.vocabulary_

### Building a sample predictive model for newsgroup data

In [None]:
twenty_train = fetch_20newsgroups(subset='train', shuffle=True, random_state=42)

In [None]:
type(twenty_train.data), twenty_train.data[0]

In [None]:
twenty_train.target_names

In [None]:
len(twenty_train.data), len(twenty_train.filenames)

In [None]:
# Target labels
twenty_train.target[:10]

#### Converting to Bag of Words
- Assign a fixed integer id to each word occurring in any document of the training set (for instance by building a dictionary from words to integer indices).
- For each document #i, count the number of occurrences of each word w and store it in X[i, j] as the value of feature #j where j is the index of word w in the dictionary.
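
The two steps above can be sketched by hand in plain Python. This is only an illustration under simplifying assumptions: the toy documents are made up, words are split on whitespace, and the counts are stored in a dense list of lists rather than a sparse matrix.

```python
from collections import Counter

# Toy documents, for illustration only
docs = ["the cat sat on the mat", "the dog sat"]

# Step 1: assign a fixed integer id to each distinct word in the training set
vocab = {}
for doc in docs:
    for word in doc.split():
        if word not in vocab:
            vocab[word] = len(vocab)

# Step 2: X[i][j] holds how often the word with id j occurs in document i
X = [[0] * len(vocab) for _ in docs]
for i, doc in enumerate(docs):
    for word, count in Counter(doc.split()).items():
        X[i][vocab[word]] = count

print(vocab)  # {'the': 0, 'cat': 1, 'sat': 2, 'on': 3, 'mat': 4, 'dog': 5}
print(X)      # [[2, 1, 1, 1, 1, 0], [1, 0, 1, 0, 0, 1]]
```

`CountVectorizer` follows the same idea but also lowercases the text, applies its own tokenizer, orders the vocabulary alphabetically, and returns a sparse matrix, so its indices will differ from this hand-rolled version.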

In [None]:
count_vect = CountVectorizer(stop_words="english")
X_train_counts = count_vect.fit_transform(twenty_train.data)
X_train_counts.shape

#### Training a classifier

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

In [None]:
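from sklearn.naive_bayes import MultinomialNB
# Any of the imported classifiers can be swapped in here, e.g. MultinomialNB() or LogisticRegression()
# random_state is fixed so that repeated runs give the same forest
clf = RandomForestClassifier(random_state=42).fit(X_train_counts, twenty_train.target)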

### Let's try to predict

In [None]:
docs_new = ['motorcycle being launched in December', 'OpenGL on the GPU is fast',
            'apple to launch new mobile soon']

In [None]:
X_new_counts = count_vect.transform(docs_new)
predicted = clf.predict(X_new_counts)
for doc, category in zip(docs_new, predicted):
    print('%r => %s' % (doc, twenty_train.target_names[category]))
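
New documents must pass through the same fitted vectorizer as the training data, which is why `transform` (not `fit_transform`) is used above. One way to make that harder to get wrong is to chain the vectorizer and the classifier in a scikit-learn pipeline. The sketch below is an assumption-laden illustration rather than the notebook's method: it uses a tiny made-up corpus and `MultinomialNB` instead of the 20 newsgroups data and random forest, and the names `toy_docs`, `toy_labels`, `toy_names` and `text_clf` are hypothetical.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny made-up corpus; label 0 = sports, label 1 = computing
toy_docs = ['cricket match won by india',
            'football team scored a goal',
            'gpu renders opengl graphics',
            'new laptop with fast processor']
toy_labels = [0, 0, 1, 1]
toy_names = ['sports', 'computing']

# The pipeline vectorizes and classifies in one step
text_clf = make_pipeline(CountVectorizer(stop_words="english"), MultinomialNB())
text_clf.fit(toy_docs, toy_labels)

# Raw strings go straight in; the pipeline applies transform() before predict()
toy_new = ['opengl on the gpu is fast']
toy_predicted = text_clf.predict(toy_new)
for doc, category in zip(toy_new, toy_predicted):
    print('%r => %s' % (doc, toy_names[category]))
```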

### Further Reading
- https://medium.com/civis-analytics/an-intro-to-natural-language-processing-in-python-framing-text-classification-in-familiar-terms-33778d1aa3ca