# Week 06 Hands-on - Text Classifier
In this week's hands-on, we will build a sentiment analyzer for Twitter data using the classification and text pre-processing concepts covered earlier. We will cover:<br>
a. text pre-processing,<br>
b. splitting the data into training and testing sets and converting them into numerical features,<br>
c. training a classifier model and making predictions on the testing dataset,<br>
d. evaluating the performance of the algorithm.<br>

## Read dataset "tweets.csv"

In [2]:
import numpy as np
import pandas as pd

tweets = pd.read_csv('./tweets.csv', sep=",")  # adjust to your own path
tweets.head()

Unnamed: 0,ItemID,Sentiment,SentimentSource,SentimentText
0,1038,1,Sentiment140,that film is fantastic #brilliant
1,1804,1,Sentiment140,this music is really bad #myband
2,1693,0,Sentiment140,winter is terrible #thumbs-down
3,1477,0,Sentiment140,this game is awful #nightmare
4,45,1,Sentiment140,I love jam #loveit


## Milestone 01 (M01)
The given dataset is still a 'raw dataset' that includes unwanted features, unwanted characters, etc.<br>
a. Select the `SentimentText` column as the attribute and the `Sentiment` column as the label (ground truth) for this case study.<br><br>
b. For M01.b, you are given a function template, `pre_process` (see below), that performs all pre-processing steps on a single tweet. Complete the function using the text pre-processing techniques from the earlier hands-on week (W03), so that each raw text attribute is converted to `pre-processed text`, e.g., by applying: (i) tokenization, (ii) normalization, (iii) cleaning, (iv) stemming or lemmatization. The function returns a `list of words`.<br><br>
c. Apply the function completed in M01.b to every row of the `SentimentText` column. Each call returns a `list of words`; append each result to a list so that you end up with a `list of list`.<br><br>

d. Split (randomly and stratified) the `list of list` from M01.c into `training data` and `testing data`. The testing dataset must be 20% of the overall dataset. Print the total size of the initial dataset, the training dataset, and the testing dataset.<br>


In [9]:
#put your code here for M01.a
# Select the two columns directly; this keeps the integer dtype of the labels
df = tweets[["SentimentText", "Sentiment"]].rename(
    columns={"SentimentText": "attribute", "Sentiment": "label"})
df.head(10)

Unnamed: 0,attribute,label
0,that film is fantastic #brilliant,1
1,this music is really bad #myband,1
2,winter is terrible #thumbs-down,0
3,this game is awful #nightmare,0
4,I love jam #loveit,1
5,I dislike skiing #rubbish,0
6,I like pop music #toptastic,1
7,this game is awful good,1
8,rock music is terrible #worstever,0
9,that movie is great #favorite,1


In [17]:
#put your code here for M01.b

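import string

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

# The NLTK resources used below (tokenizer models, 'stopwords', 'wordnet')
# must be downloaded once with nltk.download(...) before this cell will run.

# Translation table that deletes every ASCII punctuation character
table = str.maketrans('', '', string.punctuation)
stop_words = set(stopwords.words("english"))
lemmatize = WordNetLemmatizer().lemmatize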


def pre_process(input_ori):
    '''
    Write code implementation for text pre-processing here. 
    Use what you have learned before about text pre-processing.
    
    Parameter:
    input_ori = raw data text (single tweet data)
    
    Return value:
    processed_tweet = 'list of words'
    '''
    # (i) tokenization: word_tokenize, (ii) normalization: lower-case
    tokens = [token.lower() for token in word_tokenize(input_ori)]
    # (iii) cleaning: keep only alphabetic, non-stopword tokens. isalpha() already
    # rejects any token containing punctuation, so translate() is just a safeguard.
    # (iv) lemmatization
    processed_tweet = [lemmatize(token.translate(table)) for token in tokens if token.isalpha() and token not in stop_words]
    
    return processed_tweet


pre_process(df["attribute"][6])

['like', 'pop', 'music', 'toptastic']

In [41]:
#put your code here for M01.c
list_of_list = df["attribute"].apply(pre_process)
list_of_label = df["label"]

for x, y in zip(list_of_list, list_of_label[:10]):
    print("{}: {}".format(x, y))

['film', 'fantastic', 'brilliant']: 1
['music', 'really', 'bad', 'myband']: 1
['winter', 'terrible']: 0
['game', 'awful', 'nightmare']: 0
['love', 'jam', 'loveit']: 1
['dislike', 'skiing', 'rubbish']: 0
['like', 'pop', 'music', 'toptastic']: 1
['game', 'awful', 'good']: 1
['rock', 'music', 'terrible', 'worstever']: 0
['movie', 'great', 'favorite']: 1


In [78]:
#put your code here for M01.d
from sklearn.model_selection import StratifiedShuffleSplit

sss = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
# sss.get_n_splits(list_of_list, list_of_label)
train_idx, test_idx = next(sss.split(list_of_list, list_of_label))

# split() yields positional indices, so index by position with .iloc
training_data, training_label = list_of_list.iloc[train_idx], list(list_of_label.iloc[train_idx])
test_data, test_label = list_of_list.iloc[test_idx], list(list_of_label.iloc[test_idx])

print("length of init data: {}, training_data={}, test_data={}".format(
    len(list_of_list), len(training_data), len(test_data)))

length of init data: 1932, training_data=1545, test_data=387
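
As a cross-check on the split above, the same random, stratified 80/20 split can also be sketched with `train_test_split` and its `stratify` argument. This is only an illustrative alternative, not the notebook's method; `docs` and `labels` are hypothetical toy stand-ins for `list_of_list` and `list_of_label`.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 pre-tokenized "tweets" with balanced labels
docs = [["good"], ["bad"]] * 50
labels = [1, 0] * 50

# stratify=labels keeps the class proportions equal in both subsets
train_docs, test_docs, train_labels, test_labels = train_test_split(
    docs, labels, test_size=0.2, stratify=labels, random_state=42)

print("total={}, train={}, test={}".format(
    len(docs), len(train_docs), len(test_docs)))
print("train classes:", Counter(train_labels))
print("test classes:", Counter(test_labels))
```

Because the toy labels are balanced, each subset keeps the same 50/50 class ratio, which is the property the stratified split is meant to guarantee.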


## M02
a. Build `tfidf_model` using the code below and the `training data` from M01.d. (`TfidfVectorizer` is from scikit-learn.)
```
def dummy(doc):
    return doc
tfidf = TfidfVectorizer(
    analyzer='word',
    tokenizer=dummy,
    preprocessor=dummy,
    token_pattern=None)
```
b. Transform the `training data` and `testing data` from M01.d using the `tfidf_model` from M02.a. This gives you numerical features for both the `training data` and the `testing data`.<br><br>
c. Choose a classification algorithm (you may use a library such as scikit-learn), and explain why you chose it.<br><br>
d. Train a classifier model, based on the algorithm you have chosen, on the numerical features of the `training data` from M02.b.<br><br>
e. Make predictions on the numerical features of the `testing dataset` from M02.b using the classifier model you have trained.

In [61]:
# put your code here for M02.a

from sklearn.feature_extraction.text import TfidfVectorizer


def dummy(doc):
    return doc

tfidf = TfidfVectorizer(
    analyzer='word',
    tokenizer=dummy,
    preprocessor=dummy,
    token_pattern=None)

tfidf_model = tfidf.fit(training_data)
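# get_feature_names() was removed in newer scikit-learn releases; use get_feature_names_out()
tfidf_feats = tfidf_model.get_feature_names_out().tolist()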
print("tfidf model shape={}\n{}, ...".format(len(tfidf_feats), tfidf_feats[:10]))

tfidf model shape=44
['adore', 'awesome', 'awful', 'bad', 'band', 'best', 'bestever', 'book', 'brilliant', 'cheese'], ...


In [62]:
# put your code here for M02.b
train_numfeats = tfidf_model.transform(training_data).toarray()
test_numfeats = tfidf_model.transform(test_data).toarray()
print("doc term matrix shape train={}, test={}".format(train_numfeats.shape, test_numfeats.shape))
train_numfeats[0]

doc term matrix shape train=(1545, 44), test=(387, 44)


array([0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.47193179, 0.        ,
       0.        , 0.66637412, 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.57725723, 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        ])

Your explanation (M02.c):

**This experiment uses an SVM (support vector machine) supervised model with the SVC (support vector classification) estimator.** SVM is among the models most commonly used for sentiment analysis. Several studies have also shown that SVM is well suited to interpreting social data, and to sentiment analysis in particular. SVM fits text classification because it copes with a large number of features or dimensions. Another advantage is that SVM remains robust on very sparse data, because most problems of this kind are *linearly separable*.
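
To illustrate the points about sparse, high-dimensional features and linear separability, here is a minimal sketch that trains an `SVC` with a linear kernel directly on the sparse TF-IDF matrix, without calling `.toarray()`. The corpus, labels, and the names `toy_docs`, `toy_labels`, and `identity` are hypothetical; the notebook itself uses the RBF kernel on dense arrays.

```python
from scipy.sparse import issparse
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC


def identity(doc):
    # Documents are already lists of tokens
    return doc


# Hypothetical toy corpus: pre-tokenized tweets and their sentiment labels
toy_docs = [["film", "fantastic"], ["music", "great"],
            ["game", "awful"], ["winter", "terrible"]] * 5
toy_labels = [1, 1, 0, 0] * 5

vectorizer = TfidfVectorizer(analyzer="word", tokenizer=identity,
                             preprocessor=identity, token_pattern=None)
X_sparse = vectorizer.fit_transform(toy_docs)  # stays a scipy sparse matrix

# SVC accepts sparse input, so no dense conversion is needed
linear_svm = SVC(kernel="linear", C=1.0).fit(X_sparse, toy_labels)

# Tokens unseen during fit ("movie") are simply ignored by transform()
query = vectorizer.transform([["movie", "fantastic"]])
print(issparse(X_sparse), linear_svm.predict(query))
```

Keeping the matrix sparse matters little with 44 features, but it saves memory once the vocabulary grows to thousands of tokens.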

In [86]:
# put your code here for M02.d
from sklearn.svm import SVC
from sklearn.metrics import classification_report, confusion_matrix

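# These are the SVC default settings; degree is ignored by the RBF kernel
svc_model = SVC(C=1.0, kernel='rbf', degree=3, gamma='scale').fit(train_numfeats, training_label)
# Predictions on the training set itself (a sanity check, not a generalization estimate)
all_predictions = svc_model.predict(train_numfeats)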

print("Classification report :\n", classification_report(training_label, all_predictions))
print("\n\nConfusion matrix: \n", confusion_matrix(training_label, all_predictions))

Classification report :
               precision    recall  f1-score   support

           0       0.99      0.98      0.98       779
           1       0.98      0.99      0.98       766

    accuracy                           0.98      1545
   macro avg       0.98      0.98      0.98      1545
weighted avg       0.98      0.98      0.98      1545



Confusion matrix: 
 [[762  17]
 [  7 759]]


In [90]:
# put your code here for M02.e
test_predictions = svc_model.predict(test_numfeats)

## M03
After training the classification model and making predictions with it, you will now evaluate the model's performance on the testing dataset.<br>
a. Calculate and print the accuracy of your model's predictions from M02.e against the testing dataset ground truth.<br>
b. What can you infer from the result?<br>

In [91]:
#put your code here for M03.a

print("Classification report :\n", classification_report(test_label, test_predictions))
print("\n\nConfusion matrix: \n", confusion_matrix(test_label, test_predictions))

Classification report :
               precision    recall  f1-score   support

           0       0.99      0.99      0.99       195
           1       0.99      0.99      0.99       192

    accuracy                           0.99       387
   macro avg       0.99      0.99      0.99       387
weighted avg       0.99      0.99      0.99       387



Confusion matrix: 
 [[194   1]
 [  1 191]]
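
M03.a asks for the accuracy as a single number. It can be read off the confusion matrix above as the sum of the diagonal (correct predictions) divided by the total number of test samples: (194 + 191) / 387 ≈ 0.9948. A minimal sketch of that arithmetic, with the matrix values copied from the printed output:

```python
import numpy as np

# Confusion matrix copied from the output above
# (rows = true labels, columns = predicted labels)
cm = np.array([[194, 1],
               [1, 191]])

# Accuracy = correctly classified samples / all samples
accuracy = np.trace(cm) / cm.sum()
print("accuracy = {:.4f}".format(accuracy))
```

With the original variables still in memory, `accuracy_score(test_label, test_predictions)` from `sklearn.metrics` should give the same value.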


Your answer (M03.b) :

The accuracy in this experiment is 0.99. The accuracy of this SVM model could possibly be improved further by tuning its hyperparameters on a validation set, and by applying POS tagging before training the model. This high accuracy is consistent with published experiments, which find that SVM is well suited to sentiment analysis.
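
The hyperparameter tuning mentioned in the answer could be sketched with `GridSearchCV`, which scores each parameter combination by cross-validation on the training data only. This is an illustrative sketch: the synthetic `X`, `y` from `make_classification` and the values in `param_grid` are assumptions, standing in for `train_numfeats` and `training_label`.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in for the TF-IDF training features and labels
X, y = make_classification(n_samples=200, n_features=20, random_state=42)

# Hypothetical search space; the values are illustrative, not recommendations
param_grid = {"C": [0.1, 1.0, 10.0], "kernel": ["linear", "rbf"]}

# 5-fold cross-validation on the training data selects the best combination
search = GridSearchCV(SVC(gamma="scale"), param_grid, cv=5, scoring="accuracy")
search.fit(X, y)

print("best parameters:", search.best_params_)
print("mean CV accuracy: {:.3f}".format(search.best_score_))
```

The test set should be used only once, after the search, so that the reported accuracy remains an unbiased estimate.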