### Natural Language Processing
- Natural Language Processing (NLP) applies machine learning models to text and speech. Its focus is teaching machines to understand spoken and written language.
- Speech-to-text is a familiar example: whenever you dictate something into your iPhone or Android device and it is converted to text, that is an NLP algorithm in action.
- You can use NLP on a text review to predict whether the review is good or bad, on an article to predict which category it belongs to, or on a book to predict its genre. Going further, you can use NLP to build a machine translator or a speech recognition system; in that last example, classification algorithms are used to classify language. Many classical NLP algorithms are classification models, including Logistic Regression, Naive Bayes, CART (a model based on decision trees), Maximum Entropy (closely related to Logistic Regression), and Hidden Markov Models (based on Markov processes).
- A very well-known model in NLP is the Bag of Words model. It is used to preprocess texts before classification: each text is turned into a vector of word counts, and the classification algorithm is then fitted on those vectors.
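
To make the Bag of Words idea concrete, here is a minimal standard-library sketch. The three short reviews are made up for illustration and are not taken from the dataset; `bag_of_words` is a hypothetical helper, not part of any library.

```python
from collections import Counter

# Hypothetical, already-cleaned reviews (not from Restaurant_Reviews.tsv).
docs = ["wow love place", "crust not good", "love crust love place"]

# Vocabulary: every distinct word, sorted so the column order is reproducible.
vocabulary = sorted({word for doc in docs for word in doc.split()})

def bag_of_words(doc, vocabulary):
    """Return how often each vocabulary word occurs in `doc`."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocabulary]

# One row per review, one column per vocabulary word.
matrix = [bag_of_words(doc, vocabulary) for doc in docs]

print(vocabulary)
for row in matrix:
    print(row)
```

Each row is the feature vector a classifier would be fitted on. Word order inside a review is discarded; only the counts remain.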

In [1]:
import os, sys

sys.path.append(os.path.abspath("Datasets"))
sys.path.append(os.path.abspath("Images"))

In [2]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

%matplotlib inline

In [3]:
dataset = pd.read_csv("Datasets/ML_a_z/Restaurant_Reviews.tsv",
                      delimiter='\t', quoting=3)  # quoting=3 (csv.QUOTE_NONE): treat double quotes as ordinary characters

In [4]:
dataset.head()

Unnamed: 0,Review,Liked
0,Wow... Loved this place.,1
1,Crust is not good.,0
2,Not tasty and the texture was just nasty.,0
3,Stopped by during the late May bank holiday of...,1
4,The selection on the menu was great and so wer...,1


In [5]:
# Cleaning the texts: keep letters only, lowercase, drop stop words, strip word endings (stemming)
import re  # regular expressions


In [6]:
dataset['Review'][0]

'Wow... Loved this place.'

In [7]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\HuuTanVu\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [8]:
from nltk.corpus import stopwords

In [9]:
review = re.sub('[^a-zA-Z]', ' ', dataset['Review'][0])
# Replace non-letters with spaces (not empty strings) so adjacent words are not glued together
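
A small illustration of why the substitution uses a space. The review text here is a made-up example with no space after the comma, not a row from the dataset.

```python
import re

# Hypothetical review with punctuation directly between two words.
text = "Great food,great service!"

# Deleting non-letters glues neighbouring words into one meaningless token.
glued = re.sub('[^a-zA-Z]', '', text)

# Replacing them with a space keeps the word boundaries intact.
spaced = re.sub('[^a-zA-Z]', ' ', text)

print(glued.split())
print(spaced.split())
```

The first call yields a single fused token, while the second yields four separate words that the later stop-word and stemming steps can handle.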

In [10]:
review = review.lower() # Change to lowercase

In [11]:
stop_words = set(stopwords.words('english'))  # build the set once instead of once per word
review = [word for word in review.split() if word not in stop_words]

In [12]:
review  # Only the relevant (non-stop-word) English words remain.

['wow', 'loved', 'place']

In [13]:
# Stemming: reduce each word to its root (e.g. 'loved' -> 'love')
from nltk.stem.porter import PorterStemmer
ps = PorterStemmer()

In [14]:
review = [ps.stem(word) for word in review]

In [15]:
review

[u'wow', u'love', u'place']

In [16]:
def preprocessing(sentence):
    """Clean one review: keep letters only, lowercase, drop stop words, stem."""
    ps = PorterStemmer()
    stop_words = set(stopwords.words('english'))  # build the set once per call
    review = re.sub('[^a-zA-Z]', ' ', sentence).lower()
    review = [ps.stem(word) for word in review.split() if word not in stop_words]
    return ' '.join(review)


In [17]:
# Apply the cleaning function to every review
corpus = [preprocessing(review) for review in dataset['Review']]

In [28]:
# Creating the Bag of Words model:
# each distinct word becomes a column and each of the 1000 reviews becomes a row.
# max_features=1500 keeps only the 1500 most frequent words, which filters out
# words that appear rarely, for instance people's names.
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(stop_words='english', max_features=1500)

In [29]:
X = cv.fit_transform(corpus).toarray()

In [32]:
y = dataset.iloc[:, 1].values  # second column: the 'Liked' label

In [33]:
# train_test_split lives in sklearn.model_selection (sklearn.cross_validation was removed in 0.20)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.25,
                                                    random_state=0)

In [34]:
from sklearn.naive_bayes import GaussianNB
classifier = GaussianNB()
classifier.fit(X_train, y_train)

GaussianNB(priors=None)

In [35]:
y_pred = classifier.predict(X_test)

In [36]:
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)

In [37]:
cm

array([[ 66,  51],
       [ 18, 115]])
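
The confusion matrix can be summarized with a few standard metrics. The sketch below hard-codes the matrix printed above and assumes scikit-learn's layout (rows are true labels, columns are predicted labels, with class 0 first); the variable names are illustrative.

```python
# Confusion matrix copied from the output above.
# Assumed layout (scikit-learn convention): rows = true label, columns = predicted label.
cm = [[66, 51],
      [18, 115]]

tn, fp = cm[0]  # true label 0: correctly rejected, wrongly predicted as liked
fn, tp = cm[1]  # true label 1: wrongly predicted as not liked, correctly predicted

total = float(tn + fp + fn + tp)  # float() keeps the division exact on Python 2 as well

accuracy = (tp + tn) / total
precision = tp / float(tp + fp)   # of the reviews predicted "liked", how many really were
recall = tp / float(tp + fn)      # of the reviews that were "liked", how many were found
f1 = 2 * precision * recall / (precision + recall)

print("accuracy=%.3f precision=%.3f recall=%.3f f1=%.3f"
      % (accuracy, precision, recall, f1))
```

On this test set of 250 reviews the accuracy comes out at 0.724. The 51 false positives show that the classifier leans toward predicting "liked", so recall is noticeably higher than precision.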