### Student Information
Name: 姚瀚宇

Student ID: 110062542

GitHub ID: 51129597

## 4. Improve the data preprocessing

#### First, there are some small issues regarding the code.

1. In 3.1 Converting Dictionary into Pandas Dataframe, there is a cell that adds a `category_name` column to the DataFrame. This cell calls the function `format_labels(target, docs)` from data_mining_helpers.py.

In [None]:
# add category label also
X['category_name'] = X.category.apply(lambda t: dmh.format_labels(t, twenty_train))

However, after taking a look at the function `format_labels(target, docs)`, I think that this function is redundant. The cell can directly be written as below without calling the function.

In [None]:
# add category label also
X['category_name'] = X.category.apply(lambda t: twenty_train.target_names[t])
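As a small check of the direct lookup, the sketch below uses a hypothetical three-class `target_names` list and a tiny DataFrame (both stand-ins, not the lab data). It also shows one possible vectorized variant that indexes a NumPy array of names with the integer labels instead of calling `apply`.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for twenty_train.target_names and X.category
target_names = ["alt.atheism", "comp.graphics", "sci.med"]
X_demo = pd.DataFrame({"category": [2, 0, 1, 2]})

# Element-wise lookup, as in the simplified cell
by_apply = X_demo["category"].apply(lambda t: target_names[t])

# Vectorized lookup: index an array of names with the integer labels
by_index = np.asarray(target_names)[X_demo["category"].to_numpy()]

print(list(by_apply))
print(list(by_index))
```

Both variants produce the same names in the same order; the vectorized form may be preferable on large frames, but that is a convenience rather than a requirement.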

---
2. In 5.5 Attribute Transformation / Aggregation, there is one cell that calculates the term frequencies, and it takes some time to compute.

In [None]:
# note this takes time to compute. You may want to reduce the amount of terms you want to compute frequencies for
term_frequencies = []
for j in range(0,X_counts.shape[1]):
    term_frequencies.append(sum(X_counts[:,j].toarray()))

However, a more efficient way to perform the same computation is listed right below this cell.

In [None]:
term_frequencies = np.asarray(X_counts.sum(axis=0))[0]

These two cells compute the same thing (the term frequencies), but the second one replaces the column-by-column Python loop with a single sum over the sparse matrix and is therefore much faster. We only need to keep the second cell.
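The equivalence of the two cells can be checked on a small example. The sketch below uses a hypothetical 3-document by 4-term count matrix in place of `X_counts`; the variable names are illustrative only.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical 3-document x 4-term count matrix standing in for X_counts
X_counts_demo = csr_matrix(np.array([[1, 0, 2, 0],
                                     [0, 3, 0, 0],
                                     [4, 0, 1, 0]]))

# Column-by-column loop, as in the first cell
loop_freq = []
for j in range(X_counts_demo.shape[1]):
    loop_freq.append(sum(X_counts_demo[:, j].toarray()))

# Single column sum over the sparse matrix, as in the second cell
vec_freq = np.asarray(X_counts_demo.sum(axis=0))[0]

print(np.array(loop_freq).ravel())
print(vec_freq)
```

One difference worth noting: the loop builds a list of one-element arrays, while the second form returns a flat array directly.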

---
3. In 6. Data Exploration, there is one cell retrieving sentences from the original dataset. 

In [None]:
# We retrieve 2 sentences for a random record, here, indexed at 50 and 100
document_to_transform_1 = []
random_record_1 = X.iloc[50]
random_record_1 = random_record_1['text']
document_to_transform_1.append(random_record_1)

document_to_transform_2 = []
random_record_2 = X.iloc[100]
random_record_2 = random_record_2['text']
document_to_transform_2.append(random_record_2)

document_to_transform_3 = []
random_record_3 = X.iloc[150]
random_record_3 = random_record_3['text']
document_to_transform_3.append(random_record_3)

Nevertheless, I think this cell can be written more concisely as the three lines below.

In [None]:
# .iloc keeps the positional lookup of the original cell
document_to_transform_1 = [X['text'].iloc[50]]
document_to_transform_2 = [X['text'].iloc[100]]
document_to_transform_3 = [X['text'].iloc[150]]
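A caveat when shortening this cell: positional access with `.iloc` and label access with `[]` or `.loc` return the same row only while the DataFrame still has its default 0..n-1 index. The sketch below, on a hypothetical four-row frame, shows how the two diverge once rows have been dropped.

```python
import pandas as pd

# Hypothetical frame; dropping the duplicate leaves index labels 0, 1, 3
X_demo = pd.DataFrame({"text": ["a", "b", "b", "c"]})
X_demo = X_demo.drop_duplicates(keep="first")

third_by_position = X_demo["text"].iloc[2]   # third remaining row
same_row_by_label = X_demo["text"].loc[3]    # same row, by its original label
label_2_present = 2 in X_demo.index          # label 2 was dropped

print(third_by_position, same_row_by_label, label_2_present)
```

If label-based access is preferred, calling `reset_index(drop=True)` after dropping rows restores a contiguous index.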

---

#### Second, when preprocessing text data, we can do something to make the data cleaner.

1. Clean the text data by removing punctuation marks and converting all characters to lowercase. (implemented with re)
2. Perform word stemming (implemented with nltk), which reduces the different inflected forms of a word to a common root form. In this way, all forms of a word are counted as the same term when calculating frequency or TF-IDF.
3. Remove stop-words (implemented with nltk), which are words that are extremely common in all sorts of text and therefore carry little information for distinguishing between different classes of documents. For example, "is", "and", "has", and "the".
4. Remove numbers if they are not relevant to the analysis. (implemented with re)
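Step 4 is the only step that can be shown in isolation with the standard library alone. The sketch below is one possible way to do it; `remove_numbers` is a hypothetical helper name, not a function from the notebook.

```python
import re

def remove_numbers(text):
    """Drop standalone digit runs, then collapse repeated whitespace."""
    no_digits = re.sub(r"\b\d+\b", " ", text)
    return re.sub(r"\s+", " ", no_digits).strip()

cleaned = remove_numbers("Battery lasted 2 days after 30 charges")
print(cleaned)
```

Whether numbers should be removed depends on the task; in review text a rating such as "10/10" may carry sentiment, so this step is best treated as optional.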

The new dataset is used as an example below.

In [1]:
# Read the dataset
text = []
score = []
files = ['amazon_cells_labelled.txt', 'imdb_labelled.txt', 'yelp_labelled.txt']

for file in files:
    # each line has the form "<sentence>\t<label>"; the with-block closes the file
    with open('sentiment labelled sentences/' + file) as f:
        for line in f:
            line_lst = line.split('\t')
            text.append(line_lst[0])
            score.append(int(line_lst[1][0]))

print(len(text), len(score))

3000 3000


In [2]:
# Construct DataFrame
import pandas as pd

dic = {"text":text, "score":score}
X = pd.DataFrame(dic)
X.head()

| | text | score |
|---|---|---|
| 0 | So there is no way for me to plug it in here i... | 0 |
| 1 | Good case, Excellent value. | 1 |
| 2 | Great for the jawbone. | 1 |
| 3 | Tied to charger for conversations lasting more... | 0 |
| 4 | The mic is great. | 1 |


In [3]:
# Drop duplicated records
X.drop_duplicates(keep='first', inplace=True)

In [4]:
# remove punctuation marks except emoticons, and convert all characters to lowercase
import re

def preprocessor(text):
    # regex for matching emoticons so they can be kept, e.g. :), :-P, :-D
    # (raw strings avoid invalid-escape warnings for \) and \W)
    r = r'(?::|;|=|X)(?:-)?(?:\)|\(|D|P)'
    emoticons = re.findall(r, text)
    text = re.sub(r, '', text)
    # convert to lowercase and append all emoticons at the end (separated by spaces)
    # replace('-','') removes the nose of the emoticons
    text = re.sub(r'[\W]+', ' ', text.lower()) + ' ' + ' '.join(emoticons).replace('-','')
    return text

X['text'] = X['text'].apply(preprocessor)
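To see what the emoticon pattern actually captures, it can be run on a couple of sample sentences. The sentences below are made up for illustration; the pattern is the one used in `preprocessor`.

```python
import re

# Same pattern as in preprocessor, written as a raw string
EMOTICON_RE = r'(?::|;|=|X)(?:-)?(?:\)|\(|D|P)'

found_emoticons = re.findall(EMOTICON_RE, "Loved it :-) but the battery died :( XD")
found_in_words = re.findall(EMOTICON_RE, "Runs on Windows XP")

print(found_emoticons)   # [':-)', ':(', 'XD']
print(found_in_words)    # ['XP']
```

The second result suggests a limitation: because `X` counts as a pair of eyes, ordinary tokens such as "XP" are also treated as emoticons. Requiring whitespace around the match would be one way to tighten the pattern, if that matters for the data at hand.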

In [5]:
# Word stemming
import nltk
from nltk.stem.porter import PorterStemmer

def tokenizer_stem(text):
    porter = PorterStemmer()
    # split on whitespace and stem every token
    return " ".join([porter.stem(word) for word in re.split(r'\s+', text.strip())])

X['text'] = X['text'].apply(tokenizer_stem)

In [6]:
# Stop-word removal
from nltk.corpus import stopwords
nltk.download('stopwords')
stop = stopwords.words('english')

def tokenizer_stem_nostop(text):
    porter = PorterStemmer()
    # keep tokens that are not stop-words and start with a letter
    # (this also drops numbers and the emoticons appended earlier);
    # the text was already stemmed in the previous cell, so stemmed
    # stop-words such as 'thi' or 'wa' no longer match the list
    return " ".join([porter.stem(w) for w in re.split(r'\s+', text.strip())
                     if w not in stop and re.match('[a-zA-Z]+', w)])

X['text'] = X['text'].apply(tokenizer_stem_nostop)

[nltk_data] Downloading package stopwords to /Users/yao/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [7]:
# Create the term-document matrix for frequency
from sklearn.feature_extraction.text import CountVectorizer

count_vect = CountVectorizer()
doc_counts = count_vect.fit_transform(X.text)
doc_counts = doc_counts.toarray()

In [8]:
# Create a TF-IDF vectorizer and fit_transform it to X.text
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np

vectorizer = TfidfVectorizer()
doc_tfidf = vectorizer.fit_transform(X.text)
doc_tfidf = doc_tfidf.toarray()

In [9]:
# combine frequency feature and tf-idf feature together
XX = np.append(doc_counts, doc_tfidf, axis=1)
print(XX.shape)

(2983, 7750)
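Converting both matrices with `toarray()` before concatenating stores every zero explicitly. A possible alternative, sketched below on a hypothetical three-document corpus, is to keep the features sparse and join them with `scipy.sparse.hstack`; `MultinomialNB` accepts sparse input, so the later cells would not need dense arrays.

```python
from scipy.sparse import hstack
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical already-stemmed documents standing in for X.text
docs = ["good case excel valu", "great jawbon", "mic great"]

counts = CountVectorizer().fit_transform(docs)   # sparse term counts
tfidf = TfidfVectorizer().fit_transform(docs)    # sparse TF-IDF weights

# Column-wise concatenation without converting to dense arrays
features = hstack([counts, tfidf]).tocsr()
print(features.shape)
```

With seven distinct terms, each vectorizer contributes seven columns, so the combined matrix has shape (3, 14).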


In [10]:
# Multinomial naive Bayes classifier
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

acc_mnb = []
y = X['score']

for i in range(10):
    X_train, X_test, y_train, y_test = train_test_split(XX, y, test_size=0.2)
    mnb = MultinomialNB(alpha=1.0)
    y_pred = mnb.fit(X_train, y_train).predict(X_test)

    total_test_data = X_test.shape[0]
    wrong_test_data = (y_test != y_pred).sum()
    correct_test_data = (y_test == y_pred).sum()
    print("Number of mislabeled points out of a total %d points : %d"
          % (total_test_data, wrong_test_data))
    print("Accuracy: {:.2f}%".format(correct_test_data/total_test_data*100))
    acc_mnb.append(correct_test_data/total_test_data*100)

Number of mislabeled points out of a total 597 points : 107
Accuracy: 82.08%
Number of mislabeled points out of a total 597 points : 116
Accuracy: 80.57%
Number of mislabeled points out of a total 597 points : 107
Accuracy: 82.08%
Number of mislabeled points out of a total 597 points : 114
Accuracy: 80.90%
Number of mislabeled points out of a total 597 points : 112
Accuracy: 81.24%
Number of mislabeled points out of a total 597 points : 116
Accuracy: 80.57%
Number of mislabeled points out of a total 597 points : 123
Accuracy: 79.40%
Number of mislabeled points out of a total 597 points : 106
Accuracy: 82.24%
Number of mislabeled points out of a total 597 points : 108
Accuracy: 81.91%
Number of mislabeled points out of a total 597 points : 103
Accuracy: 82.75%


In [11]:
print("Average accuracy of Multinomial naive Bayes classifier: {:.2f}%".format(sum(acc_mnb)/10))

Average accuracy of Multinomial naive Bayes classifier: 81.37%
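In the cells above, both vectorizers are fitted on all records before the train/test split, so the vocabulary and IDF weights are computed with the test rows included, and the ten splits are not seeded. One way to address both points, sketched here under the assumption that a scikit-learn pipeline is acceptable for this lab, is to wrap the features and classifier in a pipeline and evaluate it with seeded cross-validation. The toy corpus and labels are hypothetical.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import FeatureUnion, make_pipeline

# Hypothetical toy corpus standing in for X.text and X['score']
texts = ["great phone", "bad battery", "love it", "poor sound",
         "excellent value", "terrible case", "works great", "waste of money"]
labels = [1, 0, 1, 0, 1, 0, 1, 0]

# Count and TF-IDF features side by side, refitted on each training fold
features = FeatureUnion([("counts", CountVectorizer()),
                         ("tfidf", TfidfVectorizer())])
model = make_pipeline(features, MultinomialNB(alpha=1.0))

# Seeded, stratified folds make the reported accuracy reproducible
cv = StratifiedKFold(n_splits=4, shuffle=True, random_state=0)
scores = cross_val_score(model, texts, labels, cv=cv, scoring="accuracy")
print(scores.mean())
```

On the real data, `texts` would be `X.text` and `labels` would be `X['score']`; the number of folds and the seed are arbitrary choices here.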


However, after applying these text preprocessing techniques, the accuracy is nearly the same as before (without this preprocessing). Further feature engineering, beyond using only term frequency and TF-IDF scores as features, may be needed to improve the results on this dataset.