<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#packages" data-toc-modified-id="packages-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>packages</a></span></li><li><span><a href="#function" data-toc-modified-id="function-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>function</a></span></li><li><span><a href="#get-clean-data" data-toc-modified-id="get-clean-data-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>get clean data</a></span><ul class="toc-item"><li><span><a href="#get-raw-data" data-toc-modified-id="get-raw-data-3.1"><span class="toc-item-num">3.1&nbsp;&nbsp;</span>get raw data</a></span></li><li><span><a href="#get-doc_list" data-toc-modified-id="get-doc_list-3.2"><span class="toc-item-num">3.2&nbsp;&nbsp;</span>get <code>doc_list</code></a></span></li><li><span><a href="#get-bi_weapon_array" data-toc-modified-id="get-bi_weapon_array-3.3"><span class="toc-item-num">3.3&nbsp;&nbsp;</span>get <code>bi_weapon_array</code></a></span></li><li><span><a href="#get-raw-embedding-dictionary" data-toc-modified-id="get-raw-embedding-dictionary-3.4"><span class="toc-item-num">3.4&nbsp;&nbsp;</span>get raw embedding dictionary</a></span><ul class="toc-item"><li><span><a href="#test" data-toc-modified-id="test-3.4.1"><span class="toc-item-num">3.4.1&nbsp;&nbsp;</span>test</a></span></li></ul></li></ul></li><li><span><a href="#cv" data-toc-modified-id="cv-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>cv</a></span><ul class="toc-item"><li><span><a href="#main" data-toc-modified-id="main-4.1"><span class="toc-item-num">4.1&nbsp;&nbsp;</span>main</a></span></li></ul></li><li><span><a href="#test" data-toc-modified-id="test-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>test</a></span></li></ul></div>

In [1]:
## Pre-setting
# automatically adjust the width of the notebook code cell
# from IPython.core.display import display, HTML
# display(HTML("<style>.container { width:100% !important; }</style>"))
# if one module is changed, this line will automatically reload that module
%load_ext autoreload
%autoreload 2
# display the figure in the notebook
%matplotlib inline
# embed fonts as TrueType (Type 42) in saved PDFs so the text stays editable in Acrobat
import matplotlib as mpl
mpl.rcParams['pdf.fonttype'] = 42

## packages

In [2]:
## Add path
import os
import sys
src_dir = os.path.abspath(os.path.join(os.pardir, 'src'))
if src_dir not in sys.path:
    sys.path.insert(0, src_dir)

In [3]:
import json
import numpy as np
import gensim as gs
from collections import Counter

import pandas as pd
pd.options.display.max_columns = 500

In [4]:
from sklearn.model_selection import train_test_split, KFold
from sklearn.metrics import f1_score
from sklearn.metrics import label_ranking_average_precision_score, label_ranking_loss


In [6]:
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
# import numpy as np

Using TensorFlow backend.


In [7]:
from keras.models import Sequential
from keras.layers import Embedding, Flatten, Dense, SpatialDropout1D, LSTM, Bidirectional


In [8]:
from evaluation.evaluation import get_label_via_training, doc_class_evaluation_fscore, baseline_doc_class_evaluation_fscore

## function

In [9]:
def prepare_raw_doc_array_train_test(doc_array_train, doc_array_test, doc_len_qtl, raw_embedding_dic, maxlen=100, max_words=10000):
    
    # tokenize raw doc based on training
    
    # # train tokenizer
    tokenizer_train = Tokenizer(num_words=max_words)
    tokenizer_train.fit_on_texts(doc_array_train)

    # # get dictionary of {word: wordID}
    dic_word_wordID_train = tokenizer_train.word_index
    num_unique_word_in_train = len(dic_word_wordID_train.keys())

    # # transfer the word in raw doc to wordID
    sequences_train = tokenizer_train.texts_to_sequences(doc_array_train)
    sequences_test = tokenizer_train.texts_to_sequences(doc_array_test)

    # # make all doc have the same length
    data_array_train = pad_sequences(sequences_train, maxlen=maxlen)
    data_array_test = pad_sequences(sequences_test, maxlen=maxlen)
    
    
    # create embedding_matrix for dnn
    
    # # get word embedding dimension from the raw_embedding_dic
    # # raw_embedding_dic is from pre-trained dataset from other researchers
    embedding_dim = len(list(raw_embedding_dic.values())[0])

    # # get embedding_matrix from raw_embedding_dic for later proposed dnn
    embedding_matrix = np.zeros((max_words, embedding_dim))

    for word, i in dic_word_wordID_train.items():
        # wordIDs at or above max_words have no row in embedding_matrix
        if i >= max_words:
            continue
        embedding_vector = raw_embedding_dic.get(word)
        # words missing from raw_embedding_dic keep their all-zero row
        if embedding_vector is not None:
            embedding_matrix[i] = embedding_vector
                
    return data_array_train, data_array_test, embedding_matrix
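The three steps in `prepare_raw_doc_array_train_test` (index words by training frequency, pad to a fixed length, copy pre-trained vectors into a matrix) can be sketched on toy data without Keras. This is a dependency-free illustration, not the Keras implementation: the helper names `build_word_index`, `to_padded_ids`, and `build_embedding_rows` and the toy vectors are hypothetical. It assumes the Keras defaults of padding and truncating at the front of the sequence, and of silently dropping words the tokenizer did not see in training.

```python
from collections import Counter


def build_word_index(docs):
    """Rank words by training frequency; ID 1 is the most frequent, 0 is reserved for padding."""
    counts = Counter(word for doc in docs for word in doc)
    ranked = sorted(counts.items(), key=lambda kv: -kv[1])  # stable sort keeps first-seen order on ties
    return {word: rank for rank, (word, _) in enumerate(ranked, start=1)}


def to_padded_ids(doc, word_index, maxlen, max_words):
    """Map words to IDs, drop unseen or out-of-range words, then pad/truncate at the front."""
    ids = [word_index[w] for w in doc if w in word_index and word_index[w] < max_words]
    ids = ids[-maxlen:]                      # keep only the last maxlen tokens
    return [0] * (maxlen - len(ids)) + ids   # left-pad with zeros


def build_embedding_rows(word_index, pretrained, max_words, dim):
    """Row i holds the pre-trained vector of the word with ID i, or zeros if none exists."""
    rows = [[0.0] * dim for _ in range(max_words)]
    for word, i in word_index.items():
        vec = pretrained.get(word)
        if i < max_words and vec is not None:
            rows[i] = list(vec)
    return rows


# toy data (made up for illustration)
train_docs = [["fresh", "review", "fresh"], ["rotten", "review", "fresh"]]
test_doc = ["fresh", "unseen", "rotten"]
toy_vectors = {"fresh": [0.1, 0.2], "review": [0.3, 0.4]}  # "rotten" deliberately missing

word_index = build_word_index(train_docs)
padded = to_padded_ids(test_doc, word_index, maxlen=4, max_words=10)
matrix = build_embedding_rows(word_index, toy_vectors, max_words=4, dim=2)
```

Because the index is built from the training documents only, a test-only word such as `"unseen"` contributes nothing to the padded sequence, and a training word without a pre-trained vector (`"rotten"`) maps to an all-zero row.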

In [10]:
def fit_dnn_model_predict(data_train, label_train, data_test, label_test, max_words, embedding_dim, maxlen, embedding_matrix, epochs=10):
    
    # one output node per label (multi-label classification)
    end_layer_node_num = label_train.shape[1]
    
    # build the dnn model
    model = Sequential()
    model.add(Embedding(max_words, embedding_dim, input_length=maxlen))
    model.add(Bidirectional(LSTM(100, dropout=0.2, recurrent_dropout=0.2)))
    model.add(Dense(end_layer_node_num, activation='sigmoid'))
    
    # set word embedding matrix
    model.layers[0].set_weights([embedding_matrix])
    model.layers[0].trainable = False
    
    # model compile
    model.compile(optimizer='rmsprop',
                  loss='mse',
                  metrics=['acc'])

    # training the model
    model.fit(data_train, label_train, epochs=epochs, batch_size=32)
    
    # predict on testing
    
    prob_predict_test = model.predict(data_test)
    
    label_test_predict = get_label_via_training(prob_predict_test, label_train)
    
    tmp_f1_score_dic = doc_class_evaluation_fscore(label_test_predict, label_test)

    return tmp_f1_score_dic
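The internals of `doc_class_evaluation_fscore` are not shown in this notebook, but the result columns are named `f1_score_macro` and `f1_score_micro`. As a rough guide to how those two averages differ for multi-label output, here is a plain-Python sketch of the standard definitions. The function names `f1_from_counts` and `multilabel_f1` and the toy labels are hypothetical, and the evaluation module may compute its scores differently.

```python
def f1_from_counts(tp, fp, fn):
    """F1 = 2*TP / (2*TP + FP + FN); defined as 0 when there are no positives at all."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def multilabel_f1(y_true, y_pred):
    """Return (macro_f1, micro_f1) for binary indicator rows of equal width."""
    n_labels = len(y_true[0])
    per_label = []
    tot_tp = tot_fp = tot_fn = 0
    for j in range(n_labels):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 1)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 0 and p[j] == 1)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 0)
        per_label.append(f1_from_counts(tp, fp, fn))
        tot_tp, tot_fp, tot_fn = tot_tp + tp, tot_fp + fp, tot_fn + fn
    macro = sum(per_label) / n_labels              # average of per-label F1
    micro = f1_from_counts(tot_tp, tot_fp, tot_fn)  # F1 of the pooled counts
    return macro, micro


# toy example: 4 documents, 2 labels (made up for illustration)
y_true = [[1, 0], [1, 1], [0, 1], [1, 0]]
y_pred = [[1, 0], [1, 0], [0, 1], [0, 0]]
macro_f1, micro_f1 = multilabel_f1(y_true, y_pred)
```

Macro F1 weights every label equally, so rare labels count as much as common ones; micro F1 pools all decisions and is dominated by the frequent labels.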

## get clean data

### get raw data

In [11]:
email_6p2_folder = os.path.abspath(os.path.join(os.pardir, 'data', 's2021_lumen_clean_data'))
email_6p2_file = 's2021_06_20_01_lumen_clean_doc_sia_liwc_classify.csv'
email_6p2_location = os.path.join(email_6p2_folder, email_6p2_file)

In [12]:
email_6p2_df = pd.read_csv(email_6p2_location)

In [13]:
email_6p2_df = email_6p2_df.rename(columns={'scarcity_time':'scarcity'})

In [14]:
print(email_6p2_df.shape)
email_6p2_df.head()

(2771, 35)


Unnamed: 0,raw_text_id,raw_text,text_type,nostop_stem_doc,nostop_stem_doc_len,clean_doc,clean_doc_len,pos_sia,compound_sia,neu_sia,neg_sia,posemo_liwc,negemo_liwc,anx_liwc,anger_liwc,sad_liwc,reward_liwc,risk_liwc,time_liwc,money_liwc,Authority or Expertise/Source Credibility,Blame/guilt,Commitment,Commitment- Call to Action,Commitment- Indignation,Emphasis,Gain framing,Liking,Loss framing,Objectivity,Reciprocation,Scarcity/Urgency/Opportunity,Social Proof,Social Proof- Admonition,Subjectivity
0,0,"""A Baker Swept By,"" by Edward Hirsch Audio: Re...",news left,baker swept edward hirsch audio read author al...,69,a baker swept by by edward hirsch audio read b...,137,0.075,0.7506,0.905,0.02,4,1,0,0,1,0,1,14,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1
1,1,"""Get Out"" Won’t Have A 100% Rating On Rotten T...",news left,get rate rotten tomato ever fact jordan peel g...,99,get out won t have a rating on rotten tomatoes...,194,0.115,-0.5607,0.772,0.113,6,6,0,0,0,7,1,10,0,1,1,1,0,0,0,0,1,0,1,0,0,0,0,1
2,3,"""Know Your Rights or Your Safety Is At Risk In...",russian ad,know right safeti risk interact polic shock vi...,76,know your rights or your safety is at risk in ...,146,0.068,-0.8751,0.79,0.142,4,2,1,0,0,2,4,3,1,0,1,1,1,1,1,0,0,0,1,0,0,1,1,1
3,4,"""Nancy Pelosi was drunk again today,"" begins a...",fake news,nanci pelosi drunk today begin post recent sha...,90,nancy pelosi was drunk again today begins a po...,172,0.078,-0.1027,0.831,0.092,6,1,0,0,0,5,0,11,1,1,0,0,0,0,0,0,0,0,1,0,0,1,0,1
4,5,"""Obama out"": POTUS ends speech with viral mic ...",news left,obama potu end speech viral mic drop presid ba...,58,obama out potus ends speech with viral mic dro...,105,0.047,0.1779,0.913,0.04,2,1,0,0,0,1,0,6,0,1,0,1,0,0,0,0,1,0,1,0,0,1,0,1


### get `doc_list`

In [15]:
doc_list = [i.split() for i in email_6p2_df.nostop_stem_doc]

In [16]:
len(doc_list)

2771

In [17]:
doc_list[1]

['get',
 'rate',
 'rotten',
 'tomato',
 'ever',
 'fact',
 'jordan',
 'peel',
 'get',
 'one',
 'film',
 'thriller',
 'sit',
 'impress',
 'fresh',
 'rotten',
 'tomato',
 'one',
 'point',
 'film',
 'held',
 'steadi',
 'fresh',
 'score',
 'came',
 'halt',
 'one',
 'review',
 'rip',
 'project',
 'nation',
 'review',
 'armond',
 'white',
 'describ',
 'get',
 'headlin',
 'review',
 'return',
 'movi',
 'get',
 'actor',
 'lakeith',
 'stanfield',
 'definit',
 'hide',
 'thought',
 'white',
 'review',
 'twitter',
 'movi',
 'still',
 'sit',
 'fresh',
 'rotten',
 'tomato',
 'even',
 'though',
 'ton',
 'glow',
 'review',
 'sinc',
 'come',
 'fresh',
 'review',
 'one',
 'rotten',
 'one',
 'never',
 'get',
 'back',
 'round',
 'nearest',
 'whole',
 'number',
 'two',
 'except',
 'repres',
 'rotten',
 'tomato',
 'told',
 'buzzfe',
 'news',
 'round',
 'round',
 'reserv',
 'absolut',
 'movi',
 'everi',
 'singl',
 'review',
 'fresh',
 'everi',
 'singl',
 'review',
 'rotten',
 'armond',
 'white',
 'say']

In [18]:
email_6p2_df.nostop_stem_doc_len.mean(), email_6p2_df.nostop_stem_doc_len.median()

(66.20064958498737, 66.0)

In [19]:
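# the dnn (deep neural network) needs fixed-length input, so every document is padded or truncated to one length
# here we use the 80% quantile of document length, so about 80% of documents fit without truncation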
doc_len_qtl = int(email_6p2_df.nostop_stem_doc_len.quantile(.80))
doc_len_qtl

88
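To make the cutoff choice concrete, the following sketch computes a linearly interpolated quantile by hand and counts how many documents would be truncated at that length. It assumes the linear interpolation that pandas uses by default; `linear_quantile` and the toy lengths are hypothetical and only illustrate the idea.

```python
import math


def linear_quantile(values, q):
    """Quantile with linear interpolation between the two nearest order statistics."""
    ordered = sorted(values)
    pos = q * (len(ordered) - 1)        # fractional index into the sorted values
    lo, hi = math.floor(pos), math.ceil(pos)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


# toy document lengths (made up); one long outlier
toy_lengths = [20, 40, 60, 80, 300]
cutoff = int(linear_quantile(toy_lengths, 0.80))
n_truncated = sum(1 for n in toy_lengths if n > cutoff)
```

A quantile-based cutoff keeps one very long document from forcing every other document to be padded to its length, at the cost of truncating the longest documents.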

### get `bi_weapon_array`

In [20]:
# 2021-06-30
influence_list = [
    'Authority or Expertise/Source Credibility',
    'Commitment',
    'Commitment- Call to Action',
    'Subjectivity',
    'Gain framing',
    'Blame/guilt',
    'Emphasis',
]

influence_list

['Authority or Expertise/Source Credibility',
 'Commitment',
 'Commitment- Call to Action',
 'Subjectivity',
 'Gain framing',
 'Blame/guilt',
 'Emphasis']

In [21]:
bi_weapon_array = email_6p2_df[influence_list].values.astype(int)

In [22]:
print(bi_weapon_array.shape)
bi_weapon_array

(2771, 7)


array([[0, 0, 0, ..., 0, 0, 0],
       [1, 1, 0, ..., 0, 1, 0],
       [0, 1, 1, ..., 0, 1, 1],
       ...,
       [0, 0, 0, ..., 0, 0, 1],
       [1, 1, 0, ..., 0, 0, 0],
       [1, 0, 0, ..., 0, 0, 1]])

In [23]:
bi_weapon_array[0]

array([0, 0, 0, 1, 0, 0, 0])

### get raw embedding dictionary

In [24]:
%%time
glove_dir = os.path.abspath(os.path.join(os.pardir, 'data', 'tmp_glove'))

raw_embedding_dic = {}
# alternative file with 100-dimensional vectors: 'glove.6B.100d.txt'
with open(os.path.join(glove_dir, 'glove.6B.50d.txt'), encoding='utf-8') as f:
    for line in f:
        # each line is: word followed by its space-separated vector components
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        raw_embedding_dic[word] = coefs

CPU times: user 5.03 s, sys: 182 ms, total: 5.21 s
Wall time: 5.22 s


In [25]:
embedding_dim = len(list(raw_embedding_dic.values())[0])

In [26]:
print('Found %s word vectors.' % len(raw_embedding_dic))

Found 400000 word vectors.
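The GloVe text format read above is one word per line followed by its vector components. The sketch below parses the same layout from an in-memory sample so it can run without the data file. `load_glove` is a hypothetical helper, the vectors are cut to three components, and the values for `of` are made up.

```python
import io


def load_glove(handle):
    """Parse 'word v1 v2 ...' lines into a {word: [floats]} dictionary."""
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:   # skip blank or malformed lines
            continue
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


# in-memory stand-in for a GloVe file (3 components per word, toy values)
sample = io.StringIO("the 0.418 0.24968 -0.41242\nof 0.1 0.2 0.3\n")
toy_embeddings = load_glove(sample)
toy_dim = len(next(iter(toy_embeddings.values())))
```

Reading the dimension from the first parsed vector, as the notebook does for `embedding_dim`, lets the same code work unchanged when switching between the 50d and 100d files.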


#### test

In [27]:
len(raw_embedding_dic.keys())

400000

In [28]:
list(raw_embedding_dic.keys())[:10]

['the', ',', '.', 'of', 'to', 'and', 'in', 'a', '"', "'s"]

In [29]:
raw_embedding_dic['the']

array([ 4.1800e-01,  2.4968e-01, -4.1242e-01,  1.2170e-01,  3.4527e-01,
       -4.4457e-02, -4.9688e-01, -1.7862e-01, -6.6023e-04, -6.5660e-01,
        2.7843e-01, -1.4767e-01, -5.5677e-01,  1.4658e-01, -9.5095e-03,
        1.1658e-02,  1.0204e-01, -1.2792e-01, -8.4430e-01, -1.2181e-01,
       -1.6801e-02, -3.3279e-01, -1.5520e-01, -2.3131e-01, -1.9181e-01,
       -1.8823e+00, -7.6746e-01,  9.9051e-02, -4.2125e-01, -1.9526e-01,
        4.0071e+00, -1.8594e-01, -5.2287e-01, -3.1681e-01,  5.9213e-04,
        7.4449e-03,  1.7778e-01, -1.5897e-01,  1.2041e-02, -5.4223e-02,
       -2.9871e-01, -1.5749e-01, -3.4758e-01, -4.5637e-02, -4.4251e-01,
        1.8785e-01,  2.7849e-03, -1.8411e-01, -1.1514e-01, -7.8581e-01],
      dtype=float32)

In [30]:
len(raw_embedding_dic['the'])

50

## cv

### main

In [31]:
maxlen = doc_len_qtl  # pad or truncate each document to the 80% quantile length (88 tokens)
max_words = 10000  # We will only consider the top 10,000 words in the dataset
epochs = 10
n_splits = 5

In [32]:
kf = KFold(n_splits=n_splits, shuffle=True, random_state=0)
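With 2771 documents and 5 folds the test folds cannot all be the same size. The sketch below reproduces the fold sizes using the rule scikit-learn documents for `KFold`: the first `n_samples % n_splits` folds get one extra sample. `kfold_sizes` is a hypothetical helper for illustration; it only computes sizes and does no shuffling.

```python
def kfold_sizes(n_samples, n_splits):
    """Return a list of (train_size, test_size) pairs, one per fold."""
    base, extra = divmod(n_samples, n_splits)
    test_sizes = [base + 1 if i < extra else base for i in range(n_splits)]
    return [(n_samples - t, t) for t in test_sizes]


fold_sizes = kfold_sizes(2771, 5)
for train_size, test_size in fold_sizes:
    print("TRAIN:", train_size, "TEST:", test_size)
```

These sizes match the `TRAIN: 2216 TEST: 555` and `TRAIN: 2217 TEST: 554` lines printed by the cross-validation loop in this notebook.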

In [33]:
# documents have different lengths, so store them as a 1-D object array of token lists
doc_array = np.array(doc_list, dtype=object)


In [34]:
%%time
df_result_f1_score_save = pd.DataFrame()

for train_index, test_index in kf.split(doc_array):
    print("TRAIN:", len(train_index), "TEST:", len(test_index))
    
    # get raw doc train & test
    doc_array_train = doc_array[train_index]
    doc_array_test = doc_array[test_index]
    
    # get label train & test
    label_train = bi_weapon_array[train_index]
    label_test = bi_weapon_array[test_index]

    # prepare raw doc train & test for dnn
    data_array_train, data_array_test, embedding_matrix = prepare_raw_doc_array_train_test(doc_array_train, doc_array_test, doc_len_qtl, raw_embedding_dic, maxlen=maxlen, max_words=max_words)
    
    # train & predict & evaluate
    tmp_f1_score_dic = fit_dnn_model_predict(data_array_train, label_train, data_array_test, label_test, max_words, embedding_dim, maxlen, embedding_matrix, epochs=epochs)
    
    # save result
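    # DataFrame.append was removed in pandas 2.0; pd.concat is the supported equivalent
    df_result_f1_score_save = pd.concat([df_result_f1_score_save, pd.DataFrame([tmp_f1_score_dic])], ignore_index=True)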
    

TRAIN: 2216 TEST: 555




Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
TRAIN: 2217 TEST: 554
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
TRAIN: 2217 TEST: 554
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
TRAIN: 2217 TEST: 554
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
TRAIN: 2217 TEST: 554
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
CPU times: user 20min 33s, sys: 4min 3s, total: 24min 37s
Wall time: 6min 12s


In [35]:
df_result_f1_score_save

Unnamed: 0,acc_score,f1_score_macro,f1_score_micro
0,0.723552,0.645914,0.696953
1,0.718154,0.629712,0.68887
2,0.729242,0.64841,0.701705
3,0.714286,0.633222,0.681243
4,0.731821,0.652931,0.705549


In [36]:
df_result_f1_score_save.mean()

acc_score         0.723411
f1_score_macro    0.642038
f1_score_micro    0.694864
dtype: float64

## test

In [37]:
1

1