## Background

As one of the largest social media websites in the world, Facebook is an attractive platform for businesses to reach their consumers. Almost all consumer-facing businesses have a virtual presence on Facebook in the form of Facebook business pages (e.g., see [here](https://www.facebook.com/target/) for Target's Facebook business page). Every day, Facebook users who visit these business pages generate a large number of posts. These user posts may represent customer complaints, questions, or appreciation directed at the focal businesses.

For businesses, these user posts contain valuable information about customers' needs and preferences, and understanding what the posts are about is an important opportunity to get to know customers in real time.

## Dataset and Task

For this assignment, you will use a **labeled dataset** named "FB_posts_labeled.txt". It is a **tab-delimited** file with the following fields:
- postId: this is a unique identifier for each user post. There are 7961 posts in total;
- message: this is the text of each post;
- Appreciation: this is a binary (0/1) indicator of whether a post is an appreciation;
- Complaint: this is a binary (0/1) indicator of whether a post is a customer complaint;
- Feedback: this is a binary (0/1) indicator of whether a post is customer feedback (e.g., questions and suggestions).

Appreciation, Complaint, and Feedback are the three mutually exclusive content categories / classes in this dataset. They were labeled by humans, and the labeling isn't perfect (i.e., there may be ambiguous cases where the labels are not appropriate). However, for the sake of this assignment, let's treat them as the ground truth. **Your task is to build a text classifier to predict the content category of a post based on its textual content.** 
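
Because the three indicators are mutually exclusive, each row can be collapsed into a single class label. The following is a minimal sketch using only the standard library. The sample rows are invented for illustration and are not taken from the dataset, and `row_label` is a hypothetical helper name.

```python
import csv
import io
from collections import Counter

CLASSES = ["Appreciation", "Complaint", "Feedback"]

# Invented sample rows in the same tab-delimited layout (not real data).
sample = (
    "postId\tmessage\tAppreciation\tComplaint\tFeedback\n"
    "1\tLove the new store!\t1\t0\t0\n"
    "2\tMy order never arrived.\t0\t1\t0\n"
    "3\tDo you ship to Canada?\t0\t0\t1\n"
)

def row_label(row):
    """Return the single class whose indicator is 1; fail if the row is not one-hot."""
    active = [c for c in CLASSES if row[c] == "1"]
    if len(active) != 1:
        raise ValueError(f"post {row['postId']} has {len(active)} active labels")
    return active[0]

reader = csv.DictReader(io.StringIO(sample), delimiter="\t")
labels = [row_label(row) for row in reader]
class_counts = Counter(labels)
print(class_counts)
```

For the real file, the `io.StringIO(sample)` handle would be replaced by an opened "FB_posts_labeled.txt". Depending on how quotation marks appear in the messages, the reader's quoting settings may need adjusting.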

To evaluate the out-of-sample performance of your model, you will use it to make predictions for 2039 posts in an **unlabeled dataset** named "FB_posts_unlabeled.txt". It is also a tab-delimited file, but only has postId and message fields. I keep the ground truth labels for these posts in a private place, in order to objectively evaluate your model's performance. The performance metric I will use is **averaged F-measure** across the three categories.
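
As a rough illustration of the metric, the sketch below computes the F-measure of each class and then takes their unweighted mean. It assumes that "averaged F-measure" means this macro average; the exact computation used by the evaluation system is not specified here. The labels in the example are invented.

```python
def per_class_f1(y_true, y_pred, cls):
    """F-measure for one class, treating that class as the positive label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == cls and p == cls)
    fp = sum(1 for t, p in pairs if t != cls and p == cls)
    fn = sum(1 for t, p in pairs if t == cls and p != cls)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def averaged_f1(y_true, y_pred, classes):
    """Unweighted mean of the per-class F-measures (macro average)."""
    return sum(per_class_f1(y_true, y_pred, c) for c in classes) / len(classes)

classes = ["Appreciation", "Complaint", "Feedback"]
# Invented ground truth and predictions for four posts.
y_true = ["Appreciation", "Complaint", "Complaint", "Feedback"]
y_pred = ["Appreciation", "Complaint", "Feedback", "Feedback"]
score = averaged_f1(y_true, y_pred, classes)
print(round(score, 4))
```

Under this averaging every class counts equally, so mistakes on a smaller class cost as much as mistakes on a larger one.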

## Submit your Predictions

Throughout this assignment, you are encouraged to build different models and submit their predictions as many times as you'd like. To submit a set of predictions, you MUST adhere to the following format (a sample submission file that adheres to all the following requirements is provided on Canvas):

1. The submission must be a CSV file, with exactly four columns and 2040 rows (one header row plus one row for each of the 2039 posts);
2. The first row must be the headers, specifically, "postId,Appreciation_pred,Complaint_pred,Feedback_pred". Spellings are case-sensitive;
3. The first column must contain postId. The order of the posts doesn't matter - I will do a join between your predictions and the ground truth table based on postId;
4. The remaining three columns contain your model's predictions for each post. Note that you must generate **binary predictions** for each category. In other words, the numbers in each of those three columns must be either 0 or 1. Also, a post can only belong to one category, so only 1 category can have value 1 and all the others must have value 0.
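
The requirements above can be checked locally before uploading. The following is a minimal sketch, not the evaluation system's own validation: `validate_submission` is a hypothetical helper, and the two sample files are invented and shortened to two prediction rows.

```python
import csv
import io

EXPECTED_HEADER = ["postId", "Appreciation_pred", "Complaint_pred", "Feedback_pred"]

def validate_submission(handle, expected_posts=2039):
    """Return a list of problems found; an empty list means these checks passed."""
    problems = []
    rows = list(csv.reader(handle))
    if not rows or rows[0] != EXPECTED_HEADER:
        problems.append("header row does not match the required column names")
        return problems
    body = rows[1:]
    if len(body) != expected_posts:
        problems.append(f"expected {expected_posts} prediction rows, found {len(body)}")
    for line_no, row in enumerate(body, start=2):
        if len(row) != 4:
            problems.append(f"line {line_no}: expected 4 columns, found {len(row)}")
            continue
        preds = row[1:]
        if any(p not in ("0", "1") for p in preds):
            problems.append(f"line {line_no}: predictions must be 0 or 1")
        elif preds.count("1") != 1:
            problems.append(f"line {line_no}: exactly one category must be 1")
    return problems

# Invented two-row files: one well-formed, one with two kinds of mistakes.
good = "postId,Appreciation_pred,Complaint_pred,Feedback_pred\n101,1,0,0\n102,0,0,1\n"
bad = "postId,Appreciation_pred,Complaint_pred,Feedback_pred\n101,1,1,0\n102,True,False,False\n"
print(validate_submission(io.StringIO(good), expected_posts=2))
print(validate_submission(io.StringIO(bad), expected_posts=2))
```

For a real file, pass `open('Prediction_on_Test.csv', newline='')` and keep the default `expected_posts=2039`.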

Because I use an automated system to evaluate prediction performance, a prediction file that does not follow the above format won't be recognized. I suggest adapting the following code to generate the prediction file:

In [1]:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.callbacks import EarlyStopping
from sklearn.metrics import classification_report
import numpy as np
import pandas as pd
import os
# Change to the folder that contains the data files (adjust this path for your machine)
os.chdir('/Users/haochunniu/Desktop/Python/Advance AI/HW1')

In [2]:
#1.Import train and test data
raw=pd.read_table('FB_posts_labeled.txt')
raw['label']=np.where(raw['Appreciation']==1,'Appreciation',np.where(raw['Complaint']==1,'Complaint','Feedback'))
label=raw[['Appreciation','Complaint','Feedback']]
test=pd.read_table('FB_posts_unlabeled.txt')

In [3]:
train_text=np.array(raw['message'])
test_text=np.array(test['message'])

In [4]:
#2-1.Text pre-processing for train data
vectorize_layer = keras.layers.experimental.preprocessing.TextVectorization(
    max_tokens = None,
    standardize = 'lower_and_strip_punctuation',
    split = 'whitespace',
    ngrams = None,
    output_mode = 'int',
    output_sequence_length = None)

In [5]:
#2-2. Apply it to the text data with "adapt"
vectorize_layer.adapt(train_text)

In [6]:
len(vectorize_layer.get_vocabulary())

19465
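
To make the vocabulary size above less opaque, here is a rough pure-Python approximation of what `lower_and_strip_punctuation` standardization, whitespace splitting, and `int` output do. It is a sketch, not the Keras implementation. It assumes index 0 is reserved for padding and index 1 for out-of-vocabulary tokens, orders the remaining tokens by frequency, and breaks ties alphabetically, which is an arbitrary choice made here. The example posts are invented.

```python
import string
from collections import Counter

def standardize(text):
    """Lowercase and drop ASCII punctuation, roughly like 'lower_and_strip_punctuation'."""
    return text.lower().translate(str.maketrans("", "", string.punctuation))

def build_vocabulary(texts):
    """Index 0 = padding, index 1 = unknown token, then tokens by descending frequency."""
    counts = Counter(tok for t in texts for tok in standardize(t).split())
    ordered = sorted(counts, key=lambda tok: (-counts[tok], tok))
    return ["", "[UNK]"] + ordered

def encode(text, vocab):
    """Map each token to its vocabulary index; unseen tokens map to 1."""
    index = {tok: i for i, tok in enumerate(vocab)}
    return [index.get(tok, 1) for tok in standardize(text).split()]

# Invented example posts, not taken from the dataset.
posts = ["Great service, thank you!", "Terrible service. Never again."]
vocab = build_vocabulary(posts)
encoded = encode("Great service again, Target!", vocab)
print(len(vocab), encoded)
```

The word "Target" never appears in the example posts, so it is encoded as the unknown-token index. Words that occur only in the unlabeled posts are handled the same way by a layer adapted on the training text alone.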

In [25]:
#3-1. Classification model of simple RNN
model_rnn = keras.Sequential()

model_rnn.add(vectorize_layer)

model_rnn.add(keras.layers.Embedding(
    input_dim = len(vectorize_layer.get_vocabulary()),
    output_dim = 128,
    mask_zero = True
))

# return_sequences=True passes the full output sequence to the next recurrent layer
model_rnn.add(keras.layers.SimpleRNN(128,return_sequences=True,dropout=0.2))
model_rnn.add(keras.layers.SimpleRNN(128,dropout=0.2))
model_rnn.add(keras.layers.Dense(3, activation = 'softmax'))

In [129]:
#3-2. Configure training / optimization
model_rnn.compile(loss = keras.losses.categorical_crossentropy,
                  optimizer='adam',
                  metrics=['accuracy'])

In [130]:
#3-3. Add early stopping callback
early_stopping = EarlyStopping(monitor='val_loss',patience=1)

#3-4. Fit the model
model_rnn.fit(x = train_text, y = label,
              validation_split = 0.2,
              epochs=10,
              batch_size = 64,
              callbacks=[early_stopping])

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10


<keras.callbacks.History at 0x7f81aab9bcd0>

In [131]:
#3-5. Summary report of prediction on train data
train_data_result=pd.DataFrame(model_rnn.predict(train_text), columns = ['Appreciation','Complaint','Feedback'])
# Pick the class with the highest predicted probability for each post
train_data_result['label']=train_data_result[['Appreciation','Complaint','Feedback']].idxmax(axis = 1)
print(classification_report(raw['label'], train_data_result['label']))

              precision    recall  f1-score   support

Appreciation       0.93      0.93      0.93      2062
   Complaint       0.96      0.94      0.95      4255
    Feedback       0.89      0.93      0.91      1644

    accuracy                           0.94      7961
   macro avg       0.93      0.93      0.93      7961
weighted avg       0.94      0.94      0.94      7961



In [26]:
#4-1. Classification model of RNN with LSTM units
model_lstm = keras.Sequential()

model_lstm.add(vectorize_layer)

model_lstm.add(keras.layers.Embedding(
    input_dim = len(vectorize_layer.get_vocabulary()),
    output_dim = 128,
    mask_zero = True
))

model_lstm.add(keras.layers.LSTM(128,
                                 dropout=0.2,
                                 return_sequences=True))

model_lstm.add(keras.layers.LSTM(128,
                                 dropout=0.2,
                                 return_sequences=True))

model_lstm.add(keras.layers.LSTM(128,
                                 dropout=0.2))

model_lstm.add(keras.layers.Dense(3, activation = 'softmax'))

In [27]:
#4-2. Configure training / optimization

#Create F1-Score metrics
from tensorflow.keras import backend as K  # same Keras namespace as the imports above

def recall_m(y_true, y_pred):
    true_positives = K.sum(K.round(K.clip(y_true * y_pred, 0, 1)))
    possible_positives = K.sum(K.round(K.clip(y_true, 0, 1)))
    recall = true_positives / (possible_positives + K.epsilon())
    return recall

def precision_m(y_true, y_pred):
    true_positives = K.sum(K.round(K.clip(y_true * y_pred, 0, 1)))
    predicted_positives = K.sum(K.round(K.clip(y_pred, 0, 1)))
    precision = true_positives / (predicted_positives + K.epsilon())
    return precision

def f1_m(y_true, y_pred):
    precision = precision_m(y_true, y_pred)
    recall = recall_m(y_true, y_pred)
    return 2*((precision*recall)/(precision+recall+K.epsilon()))

#Compile the loss function for the model
model_lstm.compile(loss = keras.losses.categorical_crossentropy,
                   optimizer='adam',
                   metrics=[f1_m])

In [28]:
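#4-3. Add early stopping callback
#mode='max' because a higher F1 is better; with the default mode='auto', Keras may treat this custom metric as a quantity to minimize
early_stopping = EarlyStopping(monitor='val_f1_m',mode='max',patience=1)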

#4-4. Fit the model & manually set class weights
#Keys follow the column order of label: 0 = Appreciation, 1 = Complaint, 2 = Feedback
weight = {0: 1.,
          1: 1.,
          2: 2.}

model_lstm.fit(x = train_text, y = label,
               validation_split = 0.2,
               epochs=10,
               batch_size = 128,
               callbacks=[early_stopping],
               class_weight=weight
               )

Epoch 1/10
Epoch 2/10


<keras.callbacks.History at 0x7fd3d9d55110>

In [29]:
#4-5. Summary report of prediction on train data
train_data_result=pd.DataFrame(model_lstm.predict(train_text), columns = ['Appreciation','Complaint','Feedback'])
# Pick the class with the highest predicted probability for each post
train_data_result['label']=train_data_result[['Appreciation','Complaint','Feedback']].idxmax(axis = 1)
print(classification_report(raw['label'], train_data_result['label']))

              precision    recall  f1-score   support

Appreciation       0.91      0.88      0.89      2062
   Complaint       0.92      0.87      0.90      4255
    Feedback       0.74      0.88      0.81      1644

    accuracy                           0.87      7961
   macro avg       0.86      0.88      0.87      7961
weighted avg       0.88      0.87      0.88      7961
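
The class weights in step 4-4 were set by hand. One common alternative, shown here only as a sketch, is to weight each class inversely to its frequency using n_samples / (n_classes * class_count), the heuristic behind scikit-learn's "balanced" option. The class counts are taken from the support column of the report above. `balanced_class_weights` is a hypothetical helper, and whether these weights improve the averaged F-measure is not something this sketch establishes.

```python
# Class counts from the "support" column of the classification report above.
# Keys follow the label column order: 0 = Appreciation, 1 = Complaint, 2 = Feedback.
class_counts = {0: 2062, 1: 4255, 2: 1644}

def balanced_class_weights(counts):
    """Weight each class by n_samples / (n_classes * class_count)."""
    n_samples = sum(counts.values())
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * n) for cls, n in counts.items()}

weights = balanced_class_weights(class_counts)
for cls, w in weights.items():
    print(cls, round(w, 3))
```

With these counts, Feedback receives the largest weight and Complaint the smallest. The resulting dictionary has the same shape as `weight` and could be passed to `class_weight` in its place.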



In [20]:
#Prediction on test data
test_data_result=pd.DataFrame(model_lstm.predict(test_text), columns = ['Appreciation','Complaint','Feedback'])
# Pick the class with the highest predicted probability for each post
test_data_result['label']=test_data_result[['Appreciation','Complaint','Feedback']].idxmax(axis = 1)


In [23]:
test['label']=test_data_result['label']
test=test.drop(columns=['message'])
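#dtype=int keeps the indicator columns as 0/1 (pandas 2.0 and later default to True/False)
test=pd.get_dummies(test,
                    columns=['label'],
                    dtype=int)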
test.columns=['postId','Appreciation_pred','Complaint_pred','Feedback_pred']

In [24]:
test.to_csv('Prediction_on_Test.csv',index = False)
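
One caveat with `pd.get_dummies` is that it creates a column only for classes that actually appear in the predictions, so the column rename fails if the model never predicts one of the three classes. As an alternative sketch, the file can be written directly from the predicted probabilities with an argmax per row, using only the standard library. `one_hot_argmax` and `write_submission` are hypothetical helpers, and the post IDs and probabilities below are invented.

```python
import csv
import io

HEADER = ["postId", "Appreciation_pred", "Complaint_pred", "Feedback_pred"]

def one_hot_argmax(probs):
    """Return a 0/1 list with a single 1 at the position of the largest probability."""
    best = max(range(len(probs)), key=lambda i: probs[i])
    return [1 if i == best else 0 for i in range(len(probs))]

def write_submission(handle, post_ids, probabilities):
    """Write the header and one row of binary predictions per post."""
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(HEADER)
    for post_id, probs in zip(post_ids, probabilities):
        writer.writerow([post_id, *one_hot_argmax(probs)])

# Invented post IDs and probabilities; in the notebook these would come from
# test['postId'] and model_lstm.predict(test_text).
buffer = io.StringIO()
write_submission(buffer, [101, 102], [[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]])
submission_text = buffer.getvalue()
print(submission_text)
```

Because every row is built from a fixed three-element list, all four columns are always present, and each row contains exactly one 1.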

**To use the submission system**:
1. Visit [http://3.22.117.95:3838/FBapp](http://3.22.117.95:3838/FBapp) to access the prediction submission system;
2. Enter your x500 ID (because I need to keep track of who submitted what). You should see a text display "welcome!" after you enter your ID;
3. Upload the prediction file with the correct format as discussed above. After the file is uploaded, the performance metrics will be shown automatically, including the precision/recall/F-measure of each class and the average F-measure. The entire confusion matrix is not provided to prevent gaming behavior.

If the submission system is not working at any point during this assignment, please contact me via email.

## Grading

Your grade for this assignment (out of 25 points) is determined as follows:
1. I rank everyone based on their highest performance. Say your rank is $A$;
2. I rank everyone based on their second-highest performance. Say your rank is $B$;
3. I rank everyone based on their third-highest performance. Say your rank is $C$;
4. I compute a score ("weighted average ranking") $S = \frac{1}{2}A + \frac{1}{3}B + \frac{1}{6}C$.
5. The person(s) with the lowest $S$ gets 25 points, the person(s) with the second-lowest $S$ gets 24.5 points, and so on.

The design of this grading scheme **encourages consistent effort that leads to steady performance improvement**, and reduces the relative importance of a single lucky high score.
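
To see how the weighting plays out, the sketch below evaluates $S$ for two hypothetical students with invented ranks. Exact fractions are used to avoid rounding.

```python
from fractions import Fraction

def weighted_average_ranking(a, b, c):
    """S = A/2 + B/3 + C/6, where A, B, C are the ranks of the three best submissions."""
    return Fraction(1, 2) * a + Fraction(1, 3) * b + Fraction(1, 6) * c

# Hypothetical students with invented ranks.
steady = weighted_average_ranking(3, 3, 3)    # ranked third on all three submissions
lucky = weighted_average_ranking(1, 10, 12)   # one top result, two much weaker ones
print(steady, lucky, float(lucky))
```

The steady student ends up with the lower (better) $S$ even though the other student holds the single best submission.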