## Data Preprocessing Using NLTK
The training data contains about 8,000 comments, each with a star rating (1 to 5). Assume we want to train a model that predicts the rating from the comment. In this homework, we implement the data preprocessing with NLTK. The steps are listed below.
<br>
<li> Import the packages you need and read the CSV file.
<li> Turn each comment into a bag of words. The bag should contain only verbs and adjectives; stop words and punctuation are excluded.
<li> Turn the bags of words into numbers using one-hot encoding (here, each entry holds the word's count in that comment). Each row represents a sample, and each column represents a word.
<li> Finally, use the train_test_split function in sklearn.model_selection to split the data into a training set and a testing set, then fit the model on the training set. (This part of the code is provided.)

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [3]:
%cd /content/drive/MyDrive/Colab Notebooks/bigdata/miniHW

/content/drive/MyDrive/Colab Notebooks/bigdata/miniHW


In [4]:
import pandas as pd
import numpy as np
from scipy.sparse import csr_matrix
import nltk
from nltk.stem import WordNetLemmatizer 
from nltk.corpus import wordnet
import string

df = pd.read_csv("training_data.csv")
df.head()

Unnamed: 0,review_id,business_id,user_id,text,date,stars
0,3223,2055,2533,"Sometimes things happen, and when they do this...",2010/12/30,5
1,9938,4165,6371,I know Kerrie through my networking and we ben...,2011/4/26,5
2,7123,869,4929,Love their pizza!!!\r\nVery fresh. Their canno...,2012/9/28,5
3,3601,1603,2789,Being from NJ I am always on the prowl for my ...,2009/6/7,4
4,3948,2347,1245,We have tried this spot a few times and each v...,2011/2/20,4


<li>Write a function that turns all the comments into bags of words, keeping only verbs and adjectives.
<br>1. Take the "text" column of df (i.e. df.text) as input and tokenize every comment (nltk.word_tokenize()).
<br>2. Remove all stop words and punctuation (string.punctuation and nltk.corpus.stopwords.words('english')).
<br>3. POS-tag the remaining words and keep only the lemmatized verbs and lemmatized adjectives (nltk.pos_tag() and wnl.lemmatize()).
<br>4. Return a list of dictionaries, one per comment, i.e.
<br>[{'happen': 1, 'want': 1, 'take': 1, 'best': 1, 'nice': 1, 'find': 1},
 {'know': 1,
  'kerrie': 2,
  'benefit': 1,
  'need': 3,
  'plan': 1,
  'remind': 1,
1},
 {'love': 1, 'fresh': 1, 'good': 1, 'seem': 1, 'great': 1},
 {'hometown': 1,
  'italian': 1,
  'best': 1,
  'pizza': 1,
  'big': 1,}]
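
The filtering and counting in steps 2 to 4 can be sketched without NLTK. The sketch below assumes the tokens have already been tagged: the `tagged` list is hand-written for illustration (it is not real `nltk.pos_tag()` output), and `STOPWORDS`, `KEEP_PREFIXES`, and `bag_of_words` are hypothetical names. Lemmatization is left out to keep the example short.

```python
from collections import Counter
import string

# Hand-written (word, tag) pairs standing in for nltk.pos_tag() output (an assumption).
tagged = [("Love", "VBP"), ("their", "PRP$"), ("pizza", "NN"), ("!", "."),
          ("Very", "RB"), ("fresh", "JJ"), ("and", "CC"), ("is", "VBZ"), ("fresh", "JJ")]

# Tiny illustrative stop-word list; the homework uses nltk.corpus.stopwords.words('english').
STOPWORDS = {"their", "and", "is", "very"}
PUNCT = set(string.punctuation)
# Penn Treebank verb tags start with "VB", adjective tags with "JJ".
KEEP_PREFIXES = ("VB", "JJ")

def bag_of_words(tagged_tokens):
    """Count the verbs and adjectives that are neither stop words nor punctuation."""
    kept = []
    for word, tag in tagged_tokens:
        word = word.lower()  # lowercase so "Love" and "love" share one key
        if word in STOPWORDS or word in PUNCT:
            continue
        if tag.startswith(KEEP_PREFIXES):
            kept.append(word)
    return dict(Counter(kept))

bag = bag_of_words(tagged)
print(bag)
```

Lowercasing before the stop-word check is a design choice of this sketch: it lets capitalized stop words such as "Very" be filtered and merges "Love" with "love".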

In [5]:
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('stopwords')
nltk.download('wordnet')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


True

In [6]:
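wnl = WordNetLemmatizer()
def tokenize_document(list_text):
    output = []
    punct = string.punctuation
    stopwords = nltk.corpus.stopwords.words('english')
    # Penn Treebank tags for verbs (VB*) and adjectives (JJ*)
    tags_need = ['VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ', 'JJ', 'JJR', 'JJS']
    for sentence in list_text:
      words = []
      # Remove punctuation: replace every punctuation character with '.'
      for p in punct:
        sentence = sentence.replace(p, '.')
      # Remove stop words and keep only verbs and adjectives
      for word, tag in nltk.pos_tag(nltk.word_tokenize(sentence)):
        # Drop everything from the first '.' onward in a token
        if '.' in word:
          word = word[:word.index('.')]
        # Note: words are not lowercased, so the stop-word check is case-sensitive
        if (word not in stopwords) and (tag in tags_need):
          # Lemmatize as a verb ('v') or as an adjective ('a'), depending on the tag
          word = wnl.lemmatize(word, 'v') if tag[0] == 'V' else wnl.lemmatize(word, 'a')
          words.append(word)
      # Count how often each kept word occurs in this comment
      dic = {}
      for word in words:
        dic[word] = 1 if word not in dic else dic[word] + 1
      output.append(dic)
    return output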

In [7]:
output = tokenize_document(df['text'][:3])
output

[{'best': 1, 'find': 1, 'happen': 1, 'nice': 1, 'take': 1, 'want': 1},
 {'benefit': 1,
  'come': 1,
  'good': 1,
  'help': 1,
  'know': 1,
  'look': 1,
  'mention': 1,
  'need': 3,
  'remind': 1,
  'troubled': 1,
  'true': 1},
 {'Love': 1, 'fresh': 1, 'good': 1, 'great': 1, 'seem': 1}]

<li>Write a function that turns the bags of words into a numeric numpy array using one-hot encoding.
<br>1. Take the list returned by the function above as input.
<br>2. Create a Python set called "features" containing every word that appears in any comment, e.g. {"I", "have", "a", "dog", "cat"}.
<br>3. Create a nested list called "mat" containing the count of each word in each comment.
<br>e.g. [{"I":1, "have":1, "a":1, "dog":1}, {"I":1, "have":1, "a":1, "cat":1}] --> [[1,1,1,1,0], [1,1,1,0,1]]
<br>4. Pass the nested list to numpy.array() and return the array together with the "features" set.
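
As a rough illustration of steps 2 and 3, here is a minimal pure-Python sketch that builds the nested list for the two example dictionaries above. The names `vectorize` and `column_of` are hypothetical. The sketch sorts the feature set to get a reproducible column order, so its columns come out in a different order than in the hand-written example.

```python
def vectorize(dics):
    """Return (sorted feature list, nested list of per-comment word counts)."""
    # Union of all dictionary keys = every word seen in any comment.
    features = sorted(set().union(*dics))
    # Map each word to its column index once, instead of searching the list each time.
    column_of = {word: j for j, word in enumerate(features)}
    mat = []
    for dic in dics:
        row = [0] * len(features)
        for word, count in dic.items():
            row[column_of[word]] = count
        mat.append(row)
    return features, mat

bags = [{"I": 1, "have": 1, "a": 1, "dog": 1},
        {"I": 1, "have": 1, "a": 1, "cat": 1}]
features, mat = vectorize(bags)
print(features)
print(mat)
```

Passing `mat` to `numpy.array()` would then give the numeric array asked for in step 4.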

In [8]:
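def vectorize_mat(dics):
    # Collect every word that appears in any comment
    features = set()
    for dic in dics:
      features.update(dic)
    # Sort so that the column order is reproducible
    features = sorted(features)
    # Map each word to its column index (avoids repeated list.index() scans)
    col = {word: j for j, word in enumerate(features)}
    mat = []
    for dic in dics:
      # One row per comment, one column per feature word
      row = [0]*len(features)
      for k in dic:
        row[col[k]] = dic[k]
      mat.append(row)
    return features, np.array(mat)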

In [9]:
vectorize_mat(tokenize_document(df['text'][:3]))

(['Love',
  'benefit',
  'best',
  'come',
  'find',
  'fresh',
  'good',
  'great',
  'happen',
  'help',
  'know',
  'look',
  'mention',
  'need',
  'nice',
  'remind',
  'seem',
  'take',
  'troubled',
  'true',
  'want'],
 array([[0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1],
        [0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 3, 0, 1, 0, 0, 1, 1, 0],
        [1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0]]))

Here, we choose the Multinomial Naive Bayes classifier as the model.

In [10]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split

In [19]:
features, one_hot_mat = vectorize_mat(tokenize_document(df['text'][:100]))

In [20]:
csr_matrix(one_hot_mat)

<100x749 sparse matrix of type '<class 'numpy.longlong'>'
	with 1760 stored elements in Compressed Sparse Row format>
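
The summary above reports 1760 stored elements in a 100x749 matrix, so only about 2.3% of the entries are non-zero, which is why a sparse format pays off. The following pure-Python sketch illustrates the idea behind the Compressed Sparse Row layout using the three arrays `data`, `indices`, and `indptr`. The helper `to_csr` and the small `dense` matrix are hypothetical and only for illustration; this is not how SciPy builds the matrix internally.

```python
def to_csr(dense):
    """Convert a nested list into CSR-style (data, indices, indptr) lists."""
    data, indices, indptr = [], [], [0]
    for row in dense:
        for j, value in enumerate(row):
            if value != 0:
                data.append(value)   # the non-zero value itself
                indices.append(j)    # the column it sits in
        indptr.append(len(data))     # where the next row starts in data
    return data, indices, indptr

# Small made-up matrix (not taken from the review data).
dense = [[0, 1, 0, 3],
         [0, 0, 0, 0],
         [2, 0, 0, 1]]
data, indices, indptr = to_csr(dense)
density = len(data) / (len(dense) * len(dense[0]))
print(data, indices, indptr, density)
```

Row `i` of the matrix is recovered from `data[indptr[i]:indptr[i+1]]` and the matching slice of `indices`, so an all-zero row costs only one extra entry in `indptr`.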

In [24]:
train_data, test_data, train_lab, test_lab = train_test_split(csr_matrix(one_hot_mat), df.stars[:100] , train_size = 0.8, random_state = 123)
MNB_model = MultinomialNB(alpha = 0.5)
MNB_model.fit(train_data, train_lab)
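# np.std of the prediction errors is their standard deviation, not the mean squared error.
ERR_STD = np.std(MNB_model.predict(test_data) - test_lab)
ACC = MNB_model.score(test_data, test_lab)
print("Under MNB model, the error standard deviation and accuracy are %.3f and %.3f, respectively." % (ERR_STD, ACC) )

Under MNB model, the error standard deviation and accuracy are 1.072 and 0.450, respectively.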
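
The cell above applies `np.std` to the prediction errors. That quantity is the standard deviation of the errors, and it equals the root of the mean squared error only when the mean error is zero. The sketch below shows the relationship on a small list of hypothetical errors (made up for illustration, not taken from the model above); the variable names are assumptions.

```python
import math

# Hypothetical prediction errors (predicted stars minus true stars).
errors = [0, 1, -1, 2, 0, 1]
n = len(errors)

mean_error = sum(errors) / n                      # bias of the predictions
mse = sum(e * e for e in errors) / n              # mean squared error
# Population standard deviation, matching np.std's default (ddof=0).
std = math.sqrt(sum((e - mean_error) ** 2 for e in errors) / n)

# MSE decomposes into variance plus squared bias.
print(mse, std ** 2 + mean_error ** 2)
```

With numpy arrays, the mean squared error could be computed as `np.mean((pred - true) ** 2)`, where `pred` and `true` stand for the predicted and true labels.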
