### Student Information
Name: Yu-Han Zhao 趙宇涵

Student ID: 110033635

GitHub ID: honey0703

Kaggle name: honey0703

Kaggle private scoreboard snapshot:

![Snapshot](img/pic0.png)

---

### Instructions

1. First: __This part is worth 30% of your grade.__ Do the **take home** exercises in the [DM2021-Lab2-master Repo](https://github.com/fhcalderon87/DM2021-Lab2-master). You may need to copy some cells from the Lab notebook to this notebook. 


2. Second: __This part is worth 30% of your grade.__ Participate in the in-class [Kaggle Competition](https://www.kaggle.com/c/dm2021-lab2-hw2/) regarding Emotion Recognition on Twitter. The scoring will be given according to your place in the Private Leaderboard ranking: 
    - **Bottom 40%**: Get 20% of the 30% available for this section.

    - **Top 41% - 100%**: Get (60-x)/6 + 20 points, where x is your ranking on the leaderboard (e.g., if you rank 3rd, your score will be (60-3)/6 + 20 = 29.5% out of 30%).  
    Submit your last submission __BEFORE the deadline (Dec. 24th 11:59 pm, Friday)__. Make sure to take a screenshot of your position at the end of the competition, store it as `pic0.png` under the **img** folder of this repository, and rerun the cell **Student Information**.
    

3. Third: __This part is worth 30% of your grade.__ A report of your work developing the model for the competition (you can use code and comment it). This report should include your preprocessing steps, your feature engineering steps, and an explanation of your model. You can also mention different things you tried and insights you gained. 


4. Fourth: __This part is worth 10% of your grade.__ It's hard for us to follow if your code is messy :'(, so please **tidy up your notebook** and **add minimal comments where needed**.


Upload your files to your repository then submit the link to it on the corresponding e-learn assignment.

Make sure to commit and save your changes to your repository __BEFORE the deadline (Dec. 29th 11:59 pm, Wednesday)__. 

## 1. Data Preparation
### 1.1 Load data
1. Read the JSON files
2. Split into train and test df
3. Append emotions to the train df
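
The three steps above can be tried on a toy example first. This is only a sketch: the two records and the identification table below are invented, and the nested `_source` / `tweet` layout is assumed from the column names used in the cells that follow.

```python
import pandas as pd

# Two made-up records shaped like the nested layout assumed above
# (a "_source" dict holding a "tweet" dict); the values are invented.
records = [
    {"_source": {"tweet": {"tweet_id": "0x1", "hashtags": ["a"], "text": "hello"}}},
    {"_source": {"tweet": {"tweet_id": "0x2", "hashtags": [], "text": "world"}}},
]
toy = pd.DataFrame(records)

# json_normalize flattens nested dicts into dotted column names
flat = pd.json_normalize(toy["_source"].tolist())
flat = flat.rename(columns={"tweet.tweet_id": "tweet_id",
                            "tweet.text": "text",
                            "tweet.hashtags": "hashtags"})

# Toy identification table, then a left merge and a train/test split
iden_toy = pd.DataFrame({"tweet_id": ["0x1", "0x2"],
                         "identification": ["train", "test"]})
merged = flat.merge(iden_toy, on="tweet_id", how="left")
train_toy = merged[merged["identification"] == "train"]
test_toy = merged[merged["identification"] == "test"]

print(sorted(flat.columns))
print(len(train_toy), len(test_toy))
```

Checking the flattened column names on a couple of records like this is a cheap way to confirm the rename mapping before running it on the full file.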

In [None]:
# Read json data to pd
import numpy as np 
import pandas as pd
import json

df = pd.read_json("dm2021-lab2-hw2/tweets_DM.json",lines=True, orient='columns')

In [None]:
# Show df
df.head()

In [None]:
# select '_source' column as 'source'
source = df._source

# normalize 'source'
twit = pd.json_normalize(source)

# show twit
twit.head()

In [None]:
# rename column for merge later

twit = twit.rename(columns={"tweet.hashtags": "hashtags", "tweet.tweet_id": "tweet_id", "tweet.text": "text"})
twit.head()

In [None]:
# to split train test
# read file 'data_identification.csv'

iden = pd.read_csv('dm2021-lab2-hw2/data_identification.csv')
iden.head()

In [None]:
# merge twit dataframe with data_identification.csv

total = pd.merge(twit, iden, on="tweet_id", how="left")
total.head()

In [None]:
# split the data into two sets, train and test

train = total[total["identification"] == "train"]
test = total[total["identification"] == "test"]

In [None]:
# drop the hashtags and identification columns, since we will not use them

train = train.drop(columns=['hashtags', 'identification'])
test = test.drop(columns=['hashtags', 'identification'])

In [None]:
train.head()

In [None]:
test.head()

In [None]:
# read the labels
emo = pd.read_csv('dm2021-lab2-hw2/emotion.csv')
emo.head()

In [None]:
# merge train with emotion.csv

train = pd.merge(train, emo, on="tweet_id", how="left")
train.head()

In [None]:
# see the shape of the data

print("train shape :", train.shape)
print("test shape :", test.shape)

### 1.2 Save data

In [None]:
# save to pickle file

train.to_pickle("train_df.pkl") 
test.to_pickle("test_df.pkl")

In [None]:
# load a pickle file

train_df = pd.read_pickle("train_df.pkl")
test_df = pd.read_pickle("test_df.pkl")

### 1.3 Exploratory data analysis (EDA)

In [None]:
#group to find distribution

train_df.groupby(['emotion']).count()['text']

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt

# the histogram of the data
labels = train_df['emotion'].unique()
post_total = len(train_df)
df1 = train_df.groupby(['emotion']).count()['text']
df1 = df1.apply(lambda x: round(x*100/post_total,3))

#plot
fig, ax = plt.subplots(figsize=(10,5))
plt.bar(df1.index,df1.values)

#arrange
plt.ylabel('% of instances')
plt.xlabel('Emotion')
plt.title('Emotion distribution')
plt.grid(True)
plt.show()

## 2. Feature Engineering
### 2.0 Sample and train/val split 

In [None]:
# sample 500,000 rows

train_sample = train_df.sample(n=500000, random_state=1)

In [None]:
train_sample.shape

In [None]:
# split data into training and validation sets

from sklearn.model_selection import train_test_split

X = train_sample['text']
y = train_sample['emotion']

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.20, random_state=1)

In [None]:
# see the shape of the data

print('X_train', X_train.shape)
print('y_train', y_train.shape)
print('X_val', X_val.shape)
print('y_val', y_val.shape)
# X_test is only defined later (in Output Result), so read the shape from test_df here
print('X_test', test_df['text'].shape)
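
The split above is purely random. If the emotion classes turn out to be imbalanced (section 1.3 plots their distribution), one option is to pass `stratify` to `train_test_split` so both parts keep the same class proportions. The following is a minimal sketch on invented toy data, not the split used in this notebook.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Invented toy data: 80 "joy" rows and 20 "anger" rows
toy = pd.DataFrame({
    "text": [f"tweet {i}" for i in range(100)],
    "emotion": ["joy"] * 80 + ["anger"] * 20,
})

X_tr, X_va, y_tr, y_va = train_test_split(
    toy["text"], toy["emotion"],
    test_size=0.20, random_state=1,
    stratify=toy["emotion"],  # keep class proportions in both parts
)

# Both parts should show the same 80/20 class ratio
print(y_tr.value_counts(normalize=True))
print(y_va.value_counts(normalize=True))
```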

### 2.1 Using Bag of Words

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

In [None]:
# # build analyzers (bag-of-words)
# BOW_vectorizer = CountVectorizer() 

In [None]:
# # 1. Learn a vocabulary dictionary of all tokens in the raw documents.
# BOW_vectorizer.fit(X_train)

# # 2. Transform documents to document-term matrix.
# train_data_BOW_features = BOW_vectorizer.transform(X_train)
# val_data_BOW_features = BOW_vectorizer.transform(X_val)
# test_data_BOW_features = BOW_vectorizer.transform(X_test)

In [None]:
# # check the result
# train_data_BOW_features

In [None]:
# type(train_data_BOW_features)

In [None]:
# # # add .toarray() to show
# train_data_BOW_features.toarray()

In [None]:
# # check the dimension
# train_data_BOW_features.shape

In [None]:
# # observe some feature names
# feature_names = BOW_vectorizer.get_feature_names_out()  # scikit-learn >= 1.0
# feature_names[100:110]

In [None]:
# "😂" in feature_names

### 2.2 Use nltk tokenize

In [None]:
import nltk

# build analyzers (bag-of-words)
BOW_500 = CountVectorizer(max_features=500, tokenizer=nltk.word_tokenize) 

# apply analyzer to training data
BOW_500.fit(X_train)

train_data_BOW_features_500 = BOW_500.transform(X_train)

## check dimension
train_data_BOW_features_500.shape

In [None]:
train_data_BOW_features_500.toarray()

In [None]:
# observe some feature names
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() needs >= 1.0
feature_names_500 = BOW_500.get_feature_names_out()
feature_names_500[100:110]

In [None]:
"😂" in feature_names_500
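
The check above asks whether the emoji survived tokenization. As a rough illustration of why the tokenizer matters: `CountVectorizer`'s default `token_pattern` only matches runs of word characters, so emoji are dropped unless a custom tokenizer is supplied. The sketch below uses `str.split` as a simple stand-in for `nltk.word_tokenize` (an assumption made so that no NLTK data download is needed), and the example sentences are invented.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["so funny 😂", "what a day 😂 😂", "not funny"]  # invented examples

# Default settings: token_pattern r"(?u)\b\w\w+\b" only matches word characters
default_vec = CountVectorizer()
default_vec.fit(docs)

# Whitespace split, used here as a stand-in for nltk.word_tokenize;
# token_pattern=None just tells scikit-learn the pattern is unused
split_vec = CountVectorizer(tokenizer=str.split, token_pattern=None)
split_vec.fit(docs)

print("😂" in default_vec.vocabulary_)
print("😂" in split_vec.vocabulary_)
```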

### 2.3 TFIDF

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
TFIDF_500 = TfidfVectorizer(max_features=500, tokenizer=nltk.word_tokenize)
train_data_TFIDF_features_500 = TFIDF_500.fit_transform(X_train)
train_data_TFIDF_features_500.shape

In [None]:
train_data_TFIDF_features_500.toarray()

In [None]:
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() needs >= 1.0
feature_names_500 = TFIDF_500.get_feature_names_out()
feature_names_500[100:110]

In [None]:
"😂" in feature_names_500
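
To see how TF-IDF differs from the raw counts in 2.2, here is a small sketch on three invented sentences. It uses the vectorizers' default settings rather than the `max_features=500` / NLTK configuration above. A word that appears in every document ("day") gets a lower weight than a word that appears in only one ("sad"), even though both occur once in the row being inspected.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["happy happy day", "sad day", "angry day"]  # invented examples

counts = CountVectorizer().fit(docs)
tfidf = TfidfVectorizer().fit(docs)

# Look at the second document, "sad day"
count_row = counts.transform(docs[1:2]).toarray()[0]
tfidf_row = tfidf.transform(docs[1:2]).toarray()[0]

# Print raw count and TF-IDF weight side by side for each term
for term, idx in sorted(counts.vocabulary_.items()):
    print(term, count_row[idx], round(tfidf_row[tfidf.vocabulary_[term]], 3))
```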

## 3. Models

### 3.1 Decision Tree
### 3.1.1 Build model

In [None]:
from sklearn.tree import DecisionTreeClassifier

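# Choose feature (BOW_500 / TFIDF_500)
# featureType = BOW_500 

# for a classification problem, you need to provide both training & validation data
# (note: this overwrites the raw-text X_train / X_val with TF-IDF matrices,
#  so rerun the split cell in 2.0 before rerunning this cell)
X_train = TFIDF_500.transform(X_train)
X_val = TFIDF_500.transform(X_val)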

In [None]:
## taking a look at the data dimensions is a good habit  :)
print('X_train.shape: ', X_train.shape)
print('y_train.shape: ', y_train.shape)
print('X_val.shape: ', X_val.shape)
print('y_val.shape: ', y_val.shape)

In [None]:
## build DecisionTree model
DT_model = DecisionTreeClassifier(random_state=0)

## training!
DT_model = DT_model.fit(X_train, y_train)

## predict!
y_train_pred = DT_model.predict(X_train)
y_val_pred = DT_model.predict(X_val)

## so we get the pred result
y_val_pred[:10]

### 3.1.2 Evaluation of DT

In [None]:
## accuracy
from sklearn.metrics import accuracy_score

acc_train = accuracy_score(y_true=y_train, y_pred=y_train_pred)
acc_val = accuracy_score(y_true=y_val, y_pred=y_val_pred)

print('training accuracy: {}'.format(round(acc_train, 2)))
print('validation accuracy: {}'.format(round(acc_val, 2)))

In [None]:
## precision, recall, f1-score,
from sklearn.metrics import classification_report

print(classification_report(y_true=y_val, y_pred=y_val_pred))

In [None]:
## check by confusion matrix
from sklearn.metrics import confusion_matrix

cm = confusion_matrix(y_true=y_val, y_pred=y_val_pred) 
print(cm)
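
When reading the printed matrix, note that scikit-learn puts true labels on the rows and predicted labels on the columns. A tiny sketch with invented labels makes the orientation easy to check:

```python
from sklearn.metrics import confusion_matrix

y_true = ["joy", "joy", "joy", "anger"]      # invented labels
y_pred = ["joy", "anger", "anger", "anger"]  # invented predictions

# labels= fixes the row/column order explicitly
cm_toy = confusion_matrix(y_true, y_pred, labels=["anger", "joy"])
print(cm_toy)

# cm_toy[i, j] counts samples whose true label is labels[i]
# and whose predicted label is labels[j]
joy_predicted_as_anger = cm_toy[1, 0]
print(joy_predicted_as_anger)
```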

In [None]:
# Function for visualizing confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import itertools

def plot_confusion_matrix(cm, classes, title='Confusion matrix',
                          cmap=sns.cubehelix_palette(as_cmap=True)):
    """
    This function is modified from: 
    http://scikit-learn.org/stable/auto_examples/model_selection/plot_confusion_matrix.html
    """
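    # sorted copy (does not mutate the caller's list); matches the alphabetical
    # label order confusion_matrix uses by default
    classes = sorted(classes)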
    tick_marks = np.arange(len(classes))    
    
    fig, ax = plt.subplots(figsize=(8,8))
    im = ax.imshow(cm, interpolation='nearest', cmap=cmap)
    ax.figure.colorbar(im, ax=ax)
    ax.set(xticks=np.arange(cm.shape[1]),
           yticks=np.arange(cm.shape[0]),
           xticklabels = classes,
           yticklabels = classes,
           title = title,
           xlabel = 'Predicted label',  # columns of cm are predictions
           ylabel = 'True label')       # rows of cm are true labels

    fmt = 'd'
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], fmt), horizontalalignment="center", color="white" if cm[i, j] > thresh else "black")
    ylim_top = len(classes) - 0.5
    plt.ylim([ylim_top, -.5])
    plt.tight_layout()
    plt.show()

In [None]:
# plot your confusion matrix
my_tags = ['anger', 'anticipation', 'disgust', 'fear', 'sadness', 'surprise', 'trust', 'joy']
plot_confusion_matrix(cm, classes=my_tags, title='Confusion matrix')

## 4. Output Result

In [None]:
# Load data
X_test = test['text']

# Transform X
X_test = TFIDF_500.transform(X_test)
print('X_test.shape: ', X_test.shape)

# Predict
y_test_pred = DT_model.predict(X_test)
print (y_test_pred[:10])

In [None]:
# Submission
submission = pd.DataFrame({'id':test['tweet_id'],'emotion':y_test_pred})
submission.head()

In [None]:
# Check the shape
submission.shape

In [None]:
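import os

path = 'Result/'
filename = 'DT_TFIDF500.csv'
os.makedirs(path, exist_ok=True)  # create the output folder if it does not exist yet
submission.to_csv(path+filename, index=False)
print('Saved file: ' + path + filename)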