# Natural Language Processing

### Goals
* Data preprocessing<br/>
* Create the Bag of Words model<br/>
* Build a model with Naive Bayes<br/>
* Predict y and generate the Confusion Matrix<br/>
* TF-IDF<br/>

In [2]:
# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

In [3]:
# Importing the dataset
dataset = pd.read_csv('/Users/bicc/Documents/kyo/Natural_Language_Processing/Restaurant_Reviews.tsv', delimiter = '\t', quoting = 3)

In [4]:
print(dataset)

                                                Review  Liked
0                             Wow... Loved this place.      1
1                                   Crust is not good.      0
2            Not tasty and the texture was just nasty.      0
3    Stopped by during the late May bank holiday of...      1
4    The selection on the menu was great and so wer...      1
5       Now I am getting angry and I want my damn pho.      0
6                Honeslty it didn't taste THAT fresh.)      0
7    The potatoes were like rubber and you could te...      0
8                            The fries were great too.      1
9                                       A great touch.      1
10                            Service was very prompt.      1
11                                  Would not go back.      0
12   The cashier had no care what so ever on what I...      0
13   I tried the Cape Cod ravoli, chicken, with cra...      1
14   I was disgusted because I was pretty sure that...      0
15   I w

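* Import the regular expression module (re)<br/>
* Import the Natural Language Toolkit (NLTK):<br/>
 nltk.corpus: access and process corpora<br/>
 nltk.stem.porter.PorterStemmer: reduce English words to their stems (a stem need not be a dictionary word, e.g. "tasty" becomes "tasti")<br/>
* Parameters of re.sub:<br/>
re.sub(pattern, repl, string, count=0, flags=0)<br/>
>pattern: the regular expression pattern to match<br/>
repl: the replacement string<br/>
string: the string to process<br/>
count: the maximum number of matches to replace; the default 0 replaces all of them<br/>
flags: optional flags that modify matching, such as re.IGNORECASE<br/>
* PorterStemmer( ).stem: extract the stem of a word<br/>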
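
A minimal sketch of how `count` and `flags` change the result of `re.sub`. The sample sentence is the first review in the dataset; the replacement word `'Yay'` is only an illustration and is not part of the notebook.

```python
import re

text = "Wow... Loved this place."

# Replace every non-letter character with a space (count=0 replaces all matches)
letters_only = re.sub('[^a-zA-Z]', ' ', text)
print(letters_only.split())   # ['Wow', 'Loved', 'this', 'place']

# count=1 replaces only the first match
first_only = re.sub('[^a-zA-Z]', ' ', text, count=1)
print(first_only)             # Wow .. Loved this place.

# flags=re.IGNORECASE lets the pattern 'wow' match 'Wow'
ignore_case = re.sub('wow', 'Yay', text, flags=re.IGNORECASE)
print(ignore_case)            # Yay... Loved this place.
```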

In [5]:
# Cleaning the texts
import re
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
corpus = []
ps = PorterStemmer()
# Build the stop word set once instead of once per word
stop_words = set(stopwords.words('english'))
for i in range(len(dataset)):
    # Keep letters only, lowercase, and split into words
    review = re.sub('[^a-zA-Z]', ' ', dataset['Review'][i])
    review = review.lower()
    review = review.split()
    # Drop stop words and stem the remaining words
    review = [ps.stem(word) for word in review if word not in stop_words]
    review = ' '.join(review)
    corpus.append(review)

[nltk_data] Downloading package stopwords to /Users/bicc/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Step-by-step example:

In [6]:
review1 = re.sub('[^a-zA-Z]', ' ', dataset['Review'][6])
review2 = review1.lower()
review3 = review2.split()
ps = PorterStemmer()
# Stem every word, then drop the stop words (no need to stem a second time)
review4 = [ps.stem(word) for word in review3]
review5 = [word for word in review4 if word not in set(stopwords.words('english'))]
review6 = ' '.join(review5)
print(review2)
print(review3)
print(review4)
print(review5)
print(review6)

honeslty it didn t taste that fresh  
['honeslty', 'it', 'didn', 't', 'taste', 'that', 'fresh']
['honeslti', 'it', 'didn', 't', 'tast', 'that', 'fresh']
['honeslti', 'tast', 'fresh']
honeslti tast fresh
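
The cleaning steps can also be wrapped in one reusable function. This is only a sketch: the name `clean_review` and the tiny `demo_stop_words` set are hypothetical, chosen so the example runs without downloading the NLTK stop word list. The notebook itself uses `stopwords.words('english')`.

```python
import re
from nltk.stem.porter import PorterStemmer

def clean_review(text, stop_words):
    """Keep letters only, lowercase, drop stop words, and stem what is left."""
    stemmer = PorterStemmer()
    words = re.sub('[^a-zA-Z]', ' ', text).lower().split()
    return ' '.join(stemmer.stem(w) for w in words if w not in stop_words)

# Hypothetical stop word set for illustration only
demo_stop_words = {'this', 'is', 'the'}
cleaned = clean_review("Wow... Loved this place.", demo_stop_words)
print(cleaned)  # wow love place
```

Passing the stop words as an argument keeps the function independent of how the list is obtained.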


### Bag of Words model

In [5]:
from sklearn.feature_extraction.text import CountVectorizer
help(CountVectorizer)

Help on class CountVectorizer in module sklearn.feature_extraction.text:

class CountVectorizer(sklearn.base.BaseEstimator, VectorizerMixin)
 |  Convert a collection of text documents to a matrix of token counts
 |  
 |  This implementation produces a sparse representation of the counts using
 |  scipy.sparse.coo_matrix.
 |  
 |  If you do not provide an a-priori dictionary and you do not use an analyzer
 |  that does some kind of feature selection then the number of features will
 |  be equal to the vocabulary size found by analyzing the data.
 |  
 |  Read more in the :ref:`User Guide <text_feature_extraction>`.
 |  
 |  Parameters
 |  ----------
 |  input : string {'filename', 'file', 'content'}
 |      If 'filename', the sequence passed as an argument to fit is
 |      expected to be a list of filenames that need reading to fetch
 |      the raw content to analyze.
 |  
 |      If 'file', the sequence items must have a 'read' method (file-like
 |      object) that is called to fetc

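#### * TF (Term Frequency): how often a given term occurs in a document<br/>
#### * CountVectorizer: converts the words in the texts into a term-frequency matrix<br/>
The parameters of CountVectorizer are described below:<br/>
* strip_accents : {‘ascii’, ‘unicode’, None}<br/>
* lowercase : boolean, True by default: convert all characters to lowercase before counting. This is usually left as True.<br/>
* preprocessor : callable or None (default)<br/>
* tokenizer : callable or None (default)<br/>
* stop_words : string {‘english’}, list, or None (default)<br/>
* token_pattern : string: a regular expression; the default selects tokens of two or more alphanumeric characters. Only used when analyzer is set to 'word'.<br/>
* ngram_range : tuple (min_n, max_n): the lower and upper bounds of n. The default is ngram_range=(1, 1); every n-gram with n in this range is extracted as a feature, so adjust it to your needs.<br/>
* analyzer : string, {‘word’, ‘char’, ‘char_wb’} or callable: whether features are built from word n-grams or character n-grams. A callable is a user-supplied function that extracts the features from the raw, unprocessed input.<br/>
* max_df : float in range [0.0, 1.0] or int, default=1.0<br/>
* min_df : float in range [0.0, 1.0] or int, default=1: drop word tokens whose document frequency (df) is above max_df or below min_df, given as a proportion of documents (float) or an absolute count (int). Ignored unless vocabulary is None.<br/>
* max_features : int or None, default=None: keep only the max_features terms with the highest term frequency across the corpus. Ignored unless vocabulary is None.<br/>
* vocabulary : Mapping or iterable, optional: a user-defined set of word tokens. If it is not None, only the terms in vocabulary are counted. Leaving it as None is usually the safer choice.<br/>
* binary : boolean, default=False: if True, every non-zero count is set to 1, so the values only indicate whether a term is present.<br/>
* dtype : type, optional: Type of the matrix returned by fit_transform() or transform().<br/>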
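
The effect of a few of these parameters can be seen on a toy corpus. This is a sketch under assumptions: the three short documents and the helper `fit_counts` are made up for illustration and do not come from the notebook's dataset.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus (already cleaned and stemmed)
docs = ["wow love place", "crust not good", "love crust love place"]

def fit_counts(**params):
    """Fit a CountVectorizer and return (feature names in column order, count matrix)."""
    cv = CountVectorizer(**params)
    X = cv.fit_transform(docs).toarray()
    features = sorted(cv.vocabulary_, key=cv.vocabulary_.get)
    return features, X

default_features, X_default = fit_counts()
binary_features, X_binary = fit_counts(binary=True)
bigram_features, X_bigram = fit_counts(ngram_range=(1, 2))
min_df_features, X_min_df = fit_counts(min_df=2)
top_features, X_top = fit_counts(max_features=1)

print(default_features)   # ['crust', 'good', 'love', 'not', 'place', 'wow']
print(X_default[2])       # 'love' occurs twice in the third document
print(X_binary[2])        # with binary=True the same entry becomes 1
print(len(bigram_features))
print(min_df_features)    # terms that occur in at least two documents
print(top_features)       # the single most frequent term
```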

In [13]:
# Creating the Bag of Words model
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(max_features = 1500)
X = cv.fit_transform(corpus).toarray()
y = dataset.iloc[:, 1].values
print(cv.vocabulary_)

{'wow': 1482, 'love': 777, 'place': 963, 'crust': 310, 'good': 569, 'tasti': 1297, 'textur': 1309, 'nasti': 870, 'stop': 1246, 'late': 737, 'may': 809, 'bank': 92, 'holiday': 642, 'rick': 1084, 'steve': 1239, 'recommend': 1050, 'select': 1140, 'menu': 827, 'great': 583, 'price': 997, 'get': 553, 'angri': 33, 'want': 1432, 'damn': 319, 'pho': 951, 'honeslti': 645, 'tast': 1295, 'fresh': 528, 'potato': 989, 'like': 760, 'rubber': 1098, 'could': 283, 'tell': 1302, 'made': 788, 'ahead': 15, 'time': 1331, 'kept': 720, 'warmer': 1434, 'fri': 529, 'touch': 1349, 'servic': 1149, 'prompt': 1009, 'would': 1480, 'go': 563, 'back': 83, 'cashier': 204, 'care': 196, 'ever': 447, 'say': 1125, 'still': 1241, 'end': 431, 'wayyy': 1442, 'overpr': 916, 'tri': 1359, 'cape': 192, 'cod': 245, 'ravoli': 1040, 'chicken': 222, 'cranberri': 296, 'mmmm': 847, 'disgust': 370, 'pretti': 996, 'sure': 1281, 'human': 662, 'hair': 603, 'shock': 1158, 'sign': 1169, 'indic': 684, 'cash': 202, 'highli': 635, 'waitress': 

In [14]:
print(X)

[[0 0 0 ..., 0 0 0]
 [0 0 0 ..., 0 0 0]
 [0 0 0 ..., 0 0 0]
 ..., 
 [0 0 0 ..., 0 0 0]
 [0 0 0 ..., 0 0 0]
 [0 0 0 ..., 0 0 0]]


### Splitting the data

In [9]:
# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state = 0)



### Building the model with Naive Bayes

In [10]:
# Fitting Naive Bayes to the Training set
from sklearn.naive_bayes import GaussianNB
classifier = GaussianNB()
classifier.fit(X_train, y_train)

GaussianNB(priors=None)

### Predicting y and generating the Confusion Matrix

In [11]:
# Predicting the Test set results
y_pred = classifier.predict(X_test)

In [12]:
# Making the Confusion Matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
print(cm)

[[55 42]
 [12 91]]
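
The usual summary metrics can be read off this matrix. The sketch below hard-codes the values printed above and assumes scikit-learn's layout (rows are true labels, columns are predicted labels); the variable names are illustrative.

```python
import numpy as np

# Confusion matrix printed above: rows = true label, columns = predicted label
cm = np.array([[55, 42],
               [12, 91]])

tn, fp, fn, tp = cm.ravel()
accuracy = (tp + tn) / cm.sum()        # share of all reviews classified correctly
precision = tp / (tp + fp)             # share of predicted "liked" that were liked
recall = tp / (tp + fn)                # share of liked reviews that were found
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
# accuracy=0.730 precision=0.684 recall=0.883 f1=0.771
```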


### TF-IDF(Term Frequency - Inverse Document Frequency)

In [15]:
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
count = CountVectorizer()
docs = np.array([
        'The sun is shining',
        'The weather is sweet',
        'The sun is shining, the weather is sweet, and one and one is two'])

In [16]:
from sklearn.feature_extraction.text import TfidfTransformer
tfidf = TfidfTransformer(use_idf=True, norm='l2', smooth_idf=True)
print(tfidf.fit_transform(count.fit_transform(docs)).toarray())

[[ 0.          0.43370786  0.          0.55847784  0.55847784  0.
   0.43370786  0.          0.        ]
 [ 0.          0.43370786  0.          0.          0.          0.55847784
   0.43370786  0.          0.55847784]
 [ 0.50238645  0.44507629  0.50238645  0.19103892  0.19103892  0.19103892
   0.29671753  0.25119322  0.19103892]]
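
The numbers above can be reproduced by hand. With `smooth_idf=True`, scikit-learn computes idf(t) = ln((1 + n) / (1 + df(t))) + 1, multiplies it by the raw term count, and then scales each row to unit L2 length. The sketch below follows those steps with NumPy only; the tokenizer regex is an approximation of CountVectorizer's default, and the variable names are illustrative.

```python
import re
import numpy as np

docs = ['The sun is shining',
        'The weather is sweet',
        'The sun is shining, the weather is sweet, and one and one is two']

# Tokenize roughly like CountVectorizer's defaults: lowercase, words of 2+ characters
tokens = [re.findall(r'\b\w\w+\b', d.lower()) for d in docs]
vocab = sorted(set(w for doc in tokens for w in doc))

# Raw term counts: one row per document, one column per vocabulary term
tf = np.array([[doc.count(w) for w in vocab] for doc in tokens], dtype=float)

n_docs = len(docs)
df = (tf > 0).sum(axis=0)                       # number of documents containing each term
idf = np.log((1 + n_docs) / (1 + df)) + 1       # smooth_idf=True
tfidf = tf * idf
tfidf /= np.linalg.norm(tfidf, axis=1, keepdims=True)   # norm='l2'

print(vocab)
print(np.round(tfidf, 8))
```

Terms that occur in every document, such as "is" and "the", get idf = 1, the lowest possible weight, while terms found in only one document ("and", "one", "two") get the highest.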
