# FakeNewsDetector
Project from Kaggle :https://www.kaggle.com/clmentbisaillon/fake-and-real-news-dataset

## 1. Step 1: Preprocessing Part

The DataSet was separated into `True.csv` and `Fake.csv` already. After importing the data, we can see the data structure is like below:

| index | title  | text | subject | date |
| ------------- | ------------- | ------------- | ------------- |------------- |
| type: integer | type: string | type: string | type: string | type: string|

(1) It's obvious that the `subject` containing limited choices, which should be taken as categorical attribute. Therefore, in preprocessing part, I transfered `subject` by using one hot encoding.

(2) And I also transfered date into different attributes: year, month, day and the judge of April Fool's Day. Because people are prone to produce fake news on April Fool's Day.

(3) Next step is to clean the title and text, in order to achieve high coverage of text that can be transfered into vectors based on GloVe pre-trained DataSet.

(4) Finally, for the purpose of making deep learning model easier to determine the different part of dataframe, I merged the cleaned data together by the order of: date related attributes, subject related attributes, title, text and label.

## 2. Step 2: GloVe Word-Embedding

For most machine learning algorithms such as SVM, XGBoost, each attribute must contain some significance for comparison. Thus purely changing the word into vectors with different length of word list is not a good idea for further training. So I transfered the title and text into vectors, and only store the mean of their vectors as new attributes, and merge them with date related attributes and subject related attributes.

For Deep Leaning Model such as LSTM, CNN, it's better to store all the vectors with different length of word list. But for better learning, I set one new attribute named as `mark for title` with same content `end of title` to separate the title part and text part easily, and purly transfered date related attributes and subject related attributes into vectors.

## 3. Step 3: XGBoost, 1dCNN, BiLSTM

For training models part, I chose XGBoost as baseline model, because normally XGBoost works pretty well in such topics by categorizing attributes as multiple trees for classification. And I chose BiLSTM based on my previous experience, because time-series deep learning model can always work well in NLP projects. And if we take text as the sentence spoken by someone, it's obvious that the context changed by the time. Therefore, considering the former part of text can be very important for training. Choosing 1dCNN is also according to the same reason, because if we take text as a whole picture, we can different parts of the sentence make up the final significance of the sentence. Thus, CNN, especially 1dCNN can work pretty well in NLP project.



# 1. Step 1 - Preprocessing
## 1.1. Data Importion

In [200]:
import pandas as pd
import numpy as np
import operator
import pickle
import re
from tqdm import tqdm

In [201]:
df_true = pd.read_csv('.\DataSet\True.csv',sep=',')
df_fake = pd.read_csv('.\DataSet\Fake.csv',sep=',')

In [202]:
df_true

Unnamed: 0,title,text,subject,date
0,"As U.S. budget fight looms, Republicans flip t...",WASHINGTON (Reuters) - The head of a conservat...,politicsNews,"December 31, 2017"
1,U.S. military to accept transgender recruits o...,WASHINGTON (Reuters) - Transgender people will...,politicsNews,"December 29, 2017"
2,Senior U.S. Republican senator: 'Let Mr. Muell...,WASHINGTON (Reuters) - The special counsel inv...,politicsNews,"December 31, 2017"
3,FBI Russia probe helped by Australian diplomat...,WASHINGTON (Reuters) - Trump campaign adviser ...,politicsNews,"December 30, 2017"
4,Trump wants Postal Service to charge 'much mor...,SEATTLE/WASHINGTON (Reuters) - President Donal...,politicsNews,"December 29, 2017"
...,...,...,...,...
21412,'Fully committed' NATO backs new U.S. approach...,BRUSSELS (Reuters) - NATO allies on Tuesday we...,worldnews,"August 22, 2017"
21413,LexisNexis withdrew two products from Chinese ...,"LONDON (Reuters) - LexisNexis, a provider of l...",worldnews,"August 22, 2017"
21414,Minsk cultural hub becomes haven from authorities,MINSK (Reuters) - In the shadow of disused Sov...,worldnews,"August 22, 2017"
21415,Vatican upbeat on possibility of Pope Francis ...,MOSCOW (Reuters) - Vatican Secretary of State ...,worldnews,"August 22, 2017"


In [203]:
df_fake

Unnamed: 0,title,text,subject,date
0,Donald Trump Sends Out Embarrassing New Year’...,Donald Trump just couldn t wish all Americans ...,News,"December 31, 2017"
1,Drunk Bragging Trump Staffer Started Russian ...,House Intelligence Committee Chairman Devin Nu...,News,"December 31, 2017"
2,Sheriff David Clarke Becomes An Internet Joke...,"On Friday, it was revealed that former Milwauk...",News,"December 30, 2017"
3,Trump Is So Obsessed He Even Has Obama’s Name...,"On Christmas day, Donald Trump announced that ...",News,"December 29, 2017"
4,Pope Francis Just Called Out Donald Trump Dur...,Pope Francis used his annual Christmas Day mes...,News,"December 25, 2017"
...,...,...,...,...
23476,McPain: John McCain Furious That Iran Treated ...,21st Century Wire says As 21WIRE reported earl...,Middle-east,"January 16, 2016"
23477,JUSTICE? Yahoo Settles E-mail Privacy Class-ac...,21st Century Wire says It s a familiar theme. ...,Middle-east,"January 16, 2016"
23478,Sunnistan: US and Allied ‘Safe Zone’ Plan to T...,Patrick Henningsen 21st Century WireRemember ...,Middle-east,"January 15, 2016"
23479,How to Blow $700 Million: Al Jazeera America F...,21st Century Wire says Al Jazeera America will...,Middle-east,"January 14, 2016"


## 1.2.Data Cleaning for subject, date
### 1.2.1.form new column as classifier
 (1) form a new column named as label as classifer
 
 (2) concatenate df_true & df_fake as df_all

In [204]:
# True is 1, Fake is 0
df_true['label'] = 1
df_fake['label'] = 0

In [205]:
# concat 2 dataframes
df_all = pd.concat([df_true,df_fake])

In [206]:
df_all

Unnamed: 0,title,text,subject,date,label
0,"As U.S. budget fight looms, Republicans flip t...",WASHINGTON (Reuters) - The head of a conservat...,politicsNews,"December 31, 2017",1
1,U.S. military to accept transgender recruits o...,WASHINGTON (Reuters) - Transgender people will...,politicsNews,"December 29, 2017",1
2,Senior U.S. Republican senator: 'Let Mr. Muell...,WASHINGTON (Reuters) - The special counsel inv...,politicsNews,"December 31, 2017",1
3,FBI Russia probe helped by Australian diplomat...,WASHINGTON (Reuters) - Trump campaign adviser ...,politicsNews,"December 30, 2017",1
4,Trump wants Postal Service to charge 'much mor...,SEATTLE/WASHINGTON (Reuters) - President Donal...,politicsNews,"December 29, 2017",1
...,...,...,...,...,...
23476,McPain: John McCain Furious That Iran Treated ...,21st Century Wire says As 21WIRE reported earl...,Middle-east,"January 16, 2016",0
23477,JUSTICE? Yahoo Settles E-mail Privacy Class-ac...,21st Century Wire says It s a familiar theme. ...,Middle-east,"January 16, 2016",0
23478,Sunnistan: US and Allied ‘Safe Zone’ Plan to T...,Patrick Henningsen 21st Century WireRemember ...,Middle-east,"January 15, 2016",0
23479,How to Blow $700 Million: Al Jazeera America F...,21st Century Wire says Al Jazeera America will...,Middle-east,"January 14, 2016",0


In [207]:
# reset index
df_all = df_all.reset_index()
df_all = df_all.drop(columns=['index'])

In [208]:
df_all

Unnamed: 0,title,text,subject,date,label
0,"As U.S. budget fight looms, Republicans flip t...",WASHINGTON (Reuters) - The head of a conservat...,politicsNews,"December 31, 2017",1
1,U.S. military to accept transgender recruits o...,WASHINGTON (Reuters) - Transgender people will...,politicsNews,"December 29, 2017",1
2,Senior U.S. Republican senator: 'Let Mr. Muell...,WASHINGTON (Reuters) - The special counsel inv...,politicsNews,"December 31, 2017",1
3,FBI Russia probe helped by Australian diplomat...,WASHINGTON (Reuters) - Trump campaign adviser ...,politicsNews,"December 30, 2017",1
4,Trump wants Postal Service to charge 'much mor...,SEATTLE/WASHINGTON (Reuters) - President Donal...,politicsNews,"December 29, 2017",1
...,...,...,...,...,...
44893,McPain: John McCain Furious That Iran Treated ...,21st Century Wire says As 21WIRE reported earl...,Middle-east,"January 16, 2016",0
44894,JUSTICE? Yahoo Settles E-mail Privacy Class-ac...,21st Century Wire says It s a familiar theme. ...,Middle-east,"January 16, 2016",0
44895,Sunnistan: US and Allied ‘Safe Zone’ Plan to T...,Patrick Henningsen 21st Century WireRemember ...,Middle-east,"January 15, 2016",0
44896,How to Blow $700 Million: Al Jazeera America F...,21st Century Wire says Al Jazeera America will...,Middle-east,"January 14, 2016",0


### 1.2.1. Categorical Data "Subject"

In [209]:
subject_list = df_all['subject'].unique()
subject_list

array(['politicsNews', 'worldnews', 'News', 'politics', 'Government News',
       'left-news', 'US_News', 'Middle-east'], dtype=object)

In [210]:
# turn the subject into 1 hot encoding, make it from norminal category into binary vector
df_subject= pd.get_dummies(df_all.subject,prefix = 'subject')
df_all=df_all.drop(columns=['subject'])
df_subject

Unnamed: 0,subject_Government News,subject_Middle-east,subject_News,subject_US_News,subject_left-news,subject_politics,subject_politicsNews,subject_worldnews
0,0,0,0,0,0,0,1,0
1,0,0,0,0,0,0,1,0
2,0,0,0,0,0,0,1,0
3,0,0,0,0,0,0,1,0
4,0,0,0,0,0,0,1,0
...,...,...,...,...,...,...,...,...
44893,0,1,0,0,0,0,0,0
44894,0,1,0,0,0,0,0,0
44895,0,1,0,0,0,0,0,0
44896,0,1,0,0,0,0,0,0


### 1.2.2. Turn date into year, month, day 3 columns

In [211]:
date = df_all['date'].iloc[1]
print(date)
type(date)

December 29, 2017 


str

In [212]:

def formDataFrame(dictData):
    # form df from dict
    df = pd.DataFrame.from_dict(dictData)
    return df

In [213]:
from datetime import date


def turnDate(array):
    """
    Turning Date into new 5 attributes: year, month, day, weekday, aprilfool
    note: aprilfool -> 1: it's on April Fools' Day, 0: it's not on April Fools' Day, 2: not sure
    """
    dictMon = {
        'January':1,'Jan':1,
        'February':2,'Feb':2,
        'March':3,'Mar':3,
        'April':4,'Apr':4,
        'May':5,
        'June':6,'Jun':6,
        'July':7,'Jul':7,
        'August':8,'Aug':8,
        'September':9,'Sep':9,
        'October':10,'Oct':10,
        'November':11,'Nov':11,
        'December':12,'Dec':12
        }
    dictDate = dict()
    dictDate['year']=list()
    dictDate['month'] = list()
    dictDate['day'] = list()
    dictDate['weekday']=list()
    dictDate['aprilfool']=list()
    for each in tqdm(array):
        matchObj = re.match(r'([a-zA-Z]+) ([0-9]+), ([0-9]+)',each,flags=re.I)
        if matchObj is not None:
            recordedDate = date(int(matchObj.group(3)),int(dictMon[matchObj.group(1)]),int(matchObj.group(2)))
            dictDate['year'].append(recordedDate.year)
            dictDate['month'].append(recordedDate.month)
            dictDate['day'].append(recordedDate.day)
            dictDate['weekday'].append(recordedDate.weekday())
            if recordedDate.month == 4 and recordedDate.day == 1:
                dictDate['aprilfool'].append('yes')
            elif (recordedDate.month == 3 and recordedDate.day >=24) or (recordedDate.month == 4 and recordedDate.day <=8):
                dictDate['aprilfool'].append('close')
            else:
                dictDate['aprilfool'].append('no')
        else:
            dictDate['year'].append(0)
            dictDate['month'].append(0)
            dictDate['day'].append(0)
            dictDate['weekday'].append(0)
            dictDate['aprilfool'].append('not sure')
    
    return dictDate

In [214]:
dictDate = turnDate(df_all['date'])
df_all = df_all.drop(columns=['date'])
df_date = formDataFrame(dictDate)

100%|████████████████████████████████████████████████████████████████████████| 44898/44898 [00:00<00:00, 252102.61it/s]


In [215]:
df_date

Unnamed: 0,year,month,day,weekday,aprilfool
0,2017,12,31,6,no
1,2017,12,29,4,no
2,2017,12,31,6,no
3,2017,12,30,5,no
4,2017,12,29,4,no
...,...,...,...,...,...
44893,2016,1,16,5,no
44894,2016,1,16,5,no
44895,2016,1,15,4,no
44896,2016,1,14,3,no


In [216]:
df_april= pd.get_dummies(df_date.aprilfool,prefix = 'aprilfool')
df_date = df_date.drop(columns=['aprilfool'])

In [217]:
df_april

Unnamed: 0,aprilfool_close,aprilfool_no,aprilfool_not sure,aprilfool_yes
0,0,1,0,0
1,0,1,0,0
2,0,1,0,0
3,0,1,0,0
4,0,1,0,0
...,...,...,...,...
44893,0,1,0,0
44894,0,1,0,0
44895,0,1,0,0
44896,0,1,0,0


In [218]:
df_date = pd.merge(df_date, df_april, left_index=True, right_index=True)

In [219]:
df_date

Unnamed: 0,year,month,day,weekday,aprilfool_close,aprilfool_no,aprilfool_not sure,aprilfool_yes
0,2017,12,31,6,0,1,0,0
1,2017,12,29,4,0,1,0,0
2,2017,12,31,6,0,1,0,0
3,2017,12,30,5,0,1,0,0
4,2017,12,29,4,0,1,0,0
...,...,...,...,...,...,...,...,...
44893,2016,1,16,5,0,1,0,0
44894,2016,1,16,5,0,1,0,0
44895,2016,1,15,4,0,1,0,0
44896,2016,1,14,3,0,1,0,0


## 1.3.Data Cleaning for Text Part
### 1.3.1. lower the words, insert space into adhered combination of words and punctuations
Since after lower and insertion of space, the coverage of words in GloVe has rised to 99.14%, and the rest words are mostly special names or messy code, it's ok now to use the data for further analysis.

In [220]:
# get the array from dataframe fot text cleaning
array_text = list(df_all['text'])
array_title = list(df_all['title'])

In [221]:
# becuase GloVe can also represent punctuations as vector, I take punctuation as important part of determining the meaning of text
# thus I didn't choose to remove the punctuations, instead I chose to insert space into the adhered combinations of words and punctuations
# such as "$30", "play for fun, "
def lowerAinsertSpace(array):
    length = len(array)
    for i in tqdm(range(length)):
        # 1st: lower the word
        array[i]=array[i].lower()
        
        # 2nd: insert space between punctuations

        # (1) add space between normal punctuation and word
        array[i] = re.sub(r'[\'a-zA-Z0-9]+',r' \g<0> ',array[i]).strip()
        # (2) add space making word such as 'i'm' to 'i 'm'
        array[i] = re.sub(r'([a-zA-Z]+)(\'[a-zA-Z])+',r'\g<1> \g<2>',array[i]).strip()
        # (3) add space between multiple punctuations
        array[i] = re.sub(r'([^a-zA-Z\'0-9])([^\'a-zA-Z0-9])',r' \g<1> \g<2> ',array[i]).strip()
        # (4) add space between combination of words and numbers
        array[i] = re.sub(r'([\'a-zA-Z])([0-9])',r'\g<1> \g<2>',array[i]).strip()
        array[i] = re.sub(r'([0-9])([\'a-zA-Z])',r'\g<1> \g<2>',array[i]).strip()
        # (5) remove the redundant space
        array[i] = re.sub(r' +',r' ',array[i])
        
    return array

In [222]:
array_text2 = lowerAinsertSpace(array_text)

100%|███████████████████████████████████████████████████████████████████████████| 44898/44898 [01:09<00:00, 649.21it/s]


In [223]:
array_title2 = lowerAinsertSpace(array_title)

100%|█████████████████████████████████████████████████████████████████████████| 44898/44898 [00:02<00:00, 16174.72it/s]


## 1.4. Check the percentage of words in data can be processed by GloVe pre-trained Data

In [224]:
# import GloVe pre-trained DataSet
# link for downloading the pre-trained GloVe DataSet: https://nlp.stanford.edu/projects/glove/
# import processed words list from glove
GloVe_path=".\DataSet\glove.840B.300d_words.pkl"
def open_pkl(path):
    pickle_file = open(path,mode='rb')
    data = pickle.load(pickle_file)
    pickle_file.close()

    return data

def form_dict(words_list):
    words_dict = dict()
    content="just for data check"
    for word in words_list:
        words_dict[word]=content
    
    return words_dict

In [225]:
glo_words=open_pkl(GloVe_path)
glo_dict = form_dict(glo_words)# because process dict is much faster than processing list

In [226]:
glo_dict["it's"]

'just for data check'

In [227]:
glo_dict["$"]

'just for data check'

In [228]:
glo_dict["0"]

'just for data check'

In [229]:
glo_dict["'m"]

'just for data check'

In [230]:
"""
1. check the distinct words number;
2. check the coverage of GloVe on the dataset;
"""

class CheckCovered:
    def __init__(self,embedding_source):
        self.embedding_source = embedding_source
    
    def distinctWords(self,text_array):
        # gain the distinct words as dict
        vocab = dict()
        for each in tqdm(text_array):
            words_list = each.split(sep=" ")
            for word in words_list:
                if word == '':
                    continue
                try:
                    vocab[word] +=1
                except KeyError:
                    vocab[word] = 1
        return vocab
    
    def checkCoverage(self, text_array):
        # in order to get the coverage of GloVe on the text of DataSet
        vocab = self.distinctWords(text_array)
        cov_vocab =0
        num_vocab = len(vocab)# gain the number of distinct words in all text
        cov_text = 0
        not_cov = dict()
        not_text = 0
        for word in tqdm(vocab.keys()):
            try:
                x = self.embedding_source[word]
                cov_vocab += 1
                cov_text += vocab[word]
            except:
                not_cov[word]=vocab[word]
                not_text += vocab[word]
                pass
        percent_cov_vocab = cov_vocab/num_vocab
        percent_cov_text = cov_text/(cov_text+not_text)
        print("the number of distinct vocabulary of data is {a}".format(a=len(vocab)))
        print("In Embedding Index we have {:.2%} coverage of distinct vocabulary".format(percent_cov_vocab))
        print("And we have {:.2%} coverage of all text".format(percent_cov_text))
        sorted_not_cov = sorted(not_cov.items(),key= operator.itemgetter(1),reverse = True)
        print("The number of words which are not covered in GloVe resource is: {0}".format(len(sorted_not_cov)))
        return sorted_not_cov

In [231]:
cc = CheckCovered(glo_dict)

In [232]:
cc.checkCoverage(array_text)

100%|██████████████████████████████████████████████████████████████████████████| 44898/44898 [00:05<00:00, 8240.49it/s]
100%|█████████████████████████████████████████████████████████████████████| 118935/118935 [00:00<00:00, 1019262.61it/s]

the number of distinct vocabulary of data is 118935
In Embedding Index we have 61.26% coverage of distinct vocabulary
And we have 99.14% coverage of all text
The number of words which are not covered in GloVe resource is: 46076





[('\xa0', 5173),
 ('realdonaldtrump', 4627),
 ('tillerson', 2783),
 ('brexit', 2147),
 ('rohingya', 2055),
 ('manafort', 1235),
 ("'t", 1002),
 ('mnuchin', 980),
 ('gorsuch', 965),
 ('rakhine', 897),
 ('priebus', 872),
 ('duterte', 868),
 ('raqqa', 694),
 ('puigdemont', 645),
 ('filessupport', 594),
 ('jinping', 567),
 ('kellyanne', 553),
 ('abedin', 542),
 ('lavrov', 511),
 ('scaramucci', 504),
 ('reince', 482),
 ('somodevilla', 466),
 ('rosenstein', 447),
 ('chaffetz', 413),
 ('mnangagwa', 403),
 ('mulvaney', 391),
 ('henningsen', 390),
 ('blasio', 384),
 ('houthis', 384),
 ('finicum', 361),
 ('gulen', 354),
 ('houthi', 351),
 ('peskov', 350),
 ('juncker', 347),
 ('scalise', 340),
 ('tmsnrt', 338),
 ('kuczynski', 338),
 ('kislyak', 338),
 ('cfpb', 334),
 ('rouhani', 325),
 ('rauner', 323),
 ('ramaphosa', 288),
 ('barzani', 286),
 ('barnier', 284),
 ('angerer', 281),
 ('idlib', 276),
 ('hillaryclinton', 271),
 ('veselnitskaya', 269),
 ('yellen', 263),
 ('macri', 263),
 ('zinke', 261),

In [233]:
cc.checkCoverage(array_text2)

100%|██████████████████████████████████████████████████████████████████████████| 44898/44898 [00:05<00:00, 8087.12it/s]
100%|█████████████████████████████████████████████████████████████████████| 118935/118935 [00:00<00:00, 1010344.57it/s]

the number of distinct vocabulary of data is 118935
In Embedding Index we have 61.26% coverage of distinct vocabulary
And we have 99.14% coverage of all text
The number of words which are not covered in GloVe resource is: 46076





[('\xa0', 5173),
 ('realdonaldtrump', 4627),
 ('tillerson', 2783),
 ('brexit', 2147),
 ('rohingya', 2055),
 ('manafort', 1235),
 ("'t", 1002),
 ('mnuchin', 980),
 ('gorsuch', 965),
 ('rakhine', 897),
 ('priebus', 872),
 ('duterte', 868),
 ('raqqa', 694),
 ('puigdemont', 645),
 ('filessupport', 594),
 ('jinping', 567),
 ('kellyanne', 553),
 ('abedin', 542),
 ('lavrov', 511),
 ('scaramucci', 504),
 ('reince', 482),
 ('somodevilla', 466),
 ('rosenstein', 447),
 ('chaffetz', 413),
 ('mnangagwa', 403),
 ('mulvaney', 391),
 ('henningsen', 390),
 ('blasio', 384),
 ('houthis', 384),
 ('finicum', 361),
 ('gulen', 354),
 ('houthi', 351),
 ('peskov', 350),
 ('juncker', 347),
 ('scalise', 340),
 ('tmsnrt', 338),
 ('kuczynski', 338),
 ('kislyak', 338),
 ('cfpb', 334),
 ('rouhani', 325),
 ('rauner', 323),
 ('ramaphosa', 288),
 ('barzani', 286),
 ('barnier', 284),
 ('angerer', 281),
 ('idlib', 276),
 ('hillaryclinton', 271),
 ('veselnitskaya', 269),
 ('yellen', 263),
 ('macri', 263),
 ('zinke', 261),

## 1.5. Concatenate the DataFrame and Storage

In [234]:
df_all['title'] = array_title2
df_all['text'] = array_text2

In [235]:
# in order to make sure attributes other than text and title can be taken systematically, I put the other attributes ahead of title and text
df_all = df_date.merge(right = df_all, how = 'inner',left_index=True, right_index=True)

In [236]:
df_all

Unnamed: 0,year,month,day,weekday,aprilfool_close,aprilfool_no,aprilfool_not sure,aprilfool_yes,title,text,label
0,2017,12,31,6,0,1,0,0,"as u . s . budget fight looms , republicans fl...",washington ( reuters ) - the head of a conserv...,1
1,2017,12,29,4,0,1,0,0,u . s . military to accept transgender recruit...,washington ( reuters ) - transgender people wi...,1
2,2017,12,31,6,0,1,0,0,senior u . s . republican senator : 'let mr . ...,washington ( reuters ) - the special counsel i...,1
3,2017,12,30,5,0,1,0,0,fbi russia probe helped by australian diplomat...,washington ( reuters ) - trump campaign advise...,1
4,2017,12,29,4,0,1,0,0,trump wants postal service to charge 'much mor...,seattle / washington ( reuters ) - president d...,1
...,...,...,...,...,...,...,...,...,...,...,...
44893,2016,1,16,5,0,1,0,0,mcpain : john mccain furious that iran treated...,21 st century wire says as 21 wire reported ea...,0
44894,2016,1,16,5,0,1,0,0,justice ? yahoo settles e - mail privacy class...,21 st century wire says it s a familiar theme ...,0
44895,2016,1,15,4,0,1,0,0,sunnistan : us and allied ‘ safe zone ’ plan t...,patrick henningsen 21 st century wireremember ...,0
44896,2016,1,14,3,0,1,0,0,how to blow $ 700 million : al jazeera america...,21 st century wire says al jazeera america wil...,0


In [237]:
df_all = df_subject.merge(right = df_all, how = 'inner',left_index=True, right_index=True)

In [238]:
df_all

Unnamed: 0,subject_Government News,subject_Middle-east,subject_News,subject_US_News,subject_left-news,subject_politics,subject_politicsNews,subject_worldnews,year,month,day,weekday,aprilfool_close,aprilfool_no,aprilfool_not sure,aprilfool_yes,title,text,label
0,0,0,0,0,0,0,1,0,2017,12,31,6,0,1,0,0,"as u . s . budget fight looms , republicans fl...",washington ( reuters ) - the head of a conserv...,1
1,0,0,0,0,0,0,1,0,2017,12,29,4,0,1,0,0,u . s . military to accept transgender recruit...,washington ( reuters ) - transgender people wi...,1
2,0,0,0,0,0,0,1,0,2017,12,31,6,0,1,0,0,senior u . s . republican senator : 'let mr . ...,washington ( reuters ) - the special counsel i...,1
3,0,0,0,0,0,0,1,0,2017,12,30,5,0,1,0,0,fbi russia probe helped by australian diplomat...,washington ( reuters ) - trump campaign advise...,1
4,0,0,0,0,0,0,1,0,2017,12,29,4,0,1,0,0,trump wants postal service to charge 'much mor...,seattle / washington ( reuters ) - president d...,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
44893,0,1,0,0,0,0,0,0,2016,1,16,5,0,1,0,0,mcpain : john mccain furious that iran treated...,21 st century wire says as 21 wire reported ea...,0
44894,0,1,0,0,0,0,0,0,2016,1,16,5,0,1,0,0,justice ? yahoo settles e - mail privacy class...,21 st century wire says it s a familiar theme ...,0
44895,0,1,0,0,0,0,0,0,2016,1,15,4,0,1,0,0,sunnistan : us and allied ‘ safe zone ’ plan t...,patrick henningsen 21 st century wireremember ...,0
44896,0,1,0,0,0,0,0,0,2016,1,14,3,0,1,0,0,how to blow $ 700 million : al jazeera america...,21 st century wire says al jazeera america wil...,0


In [239]:
# mix the order of dataframe, in order to make sure the order of label are mixed
df_all = df_all.sample(frac=1, replace=True, random_state=1)

In [240]:
df_all['label']

33003    0
12172    1
5192     1
32511    0
43723    0
        ..
11419    1
36294    0
31438    0
32646    0
33429    0
Name: label, Length: 44898, dtype: int64

In [241]:
# reset index
df_all = df_all.reset_index()
df_all = df_all.drop(columns=['index'])

In [242]:
df_all

Unnamed: 0,subject_Government News,subject_Middle-east,subject_News,subject_US_News,subject_left-news,subject_politics,subject_politicsNews,subject_worldnews,year,month,day,weekday,aprilfool_close,aprilfool_no,aprilfool_not sure,aprilfool_yes,title,text,label
0,0,0,0,0,0,1,0,0,2017,2,24,4,0,1,0,0,standing ovation ! nigel farage trolls cnn dur...,. @ nigel _ farage tells the # cpac 2017 crowd...,0
1,0,0,0,0,0,0,0,1,2017,12,15,4,0,1,0,0,'congratulations' : eu moves to brexit phase t...,brussels ( reuters ) - the european union agre...,1
2,0,0,0,0,0,0,1,0,2017,3,2,3,0,1,0,0,white house aides told to preserve materials i...,washington ( reuters ) - the white house couns...,1
3,0,0,0,0,0,1,0,0,2017,4,20,3,0,1,0,0,democrats sell promo t - shirt : “ democrats g...,"yes , the democrats think it s a good thing to...",0
4,0,0,0,1,0,0,0,0,2016,12,25,6,0,1,0,0,this year : let ’ s make christmas great again …,"this year , let s try something a little diffe...",0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
44893,0,0,0,0,0,0,0,1,2017,12,25,0,0,1,0,0,driver rams german party headquarters in appar...,berlin ( reuters ) - a man drove a car at the ...,1
44894,0,0,0,0,0,1,0,0,2015,11,27,4,0,1,0,0,chicago thugs watched 9 yr old play on swings ...,update : no # blacklivesmatter protests planne...,0
44895,0,0,0,0,0,1,0,0,2017,8,28,0,0,1,0,0,one beer company praised for shutting down bus...,the anheuser - busch brewery put beer producti...,0
44896,0,0,0,0,0,1,0,0,2017,4,2,6,1,0,0,0,republican turns tables on fbi : deputy direct...,gchq director robert hannigan is stepping down...,0


In [243]:
save_path='.\DataSet\preprocessed_data.csv'
df_all.to_csv(save_path, sep='|', index=False)

In [244]:
df_all = pd.read_csv(save_path,sep='|')

## 1.6. Interception for Train and Test Set 

In [245]:
from sklearn.model_selection import train_test_split

train, test = train_test_split(df_all, test_size=0.2)

In [246]:
train['label']

7687     0
26994    1
39284    1
33434    1
35037    0
        ..
29677    1
39098    0
58       1
36869    1
23348    1
Name: label, Length: 35918, dtype: int64

In [247]:
train.to_csv('.\DataSet\Train.csv', sep='|', index=False)
test.to_csv('.\DataSet\Test.csv', sep='|', index=False)

# 2. Step 2 - GloVe Word-Embedding Vectorization

In [248]:
import pandas as pd
import numpy as np
from tqdm import tqdm
import pickle
import re

## 2.1. Data Sample And Elicitation

In [324]:
# sampling if needed for shorten the test process
df_train = train.sample(frac=0.1, replace=True, random_state=1)
df_test = test.sample(frac=0.1, replace=True, random_state=1)

In [325]:
# add a mark between title and text for separating title and text when doing deep learning model
df_train['mark for title'] = 'end of title'
df_test['mark for title'] ='end of title'

In [326]:
# reassign position of a column
def reassignPos(df,ab_pos, bis_pos):
    cols = df.columns.tolist()
    col_change = cols.pop(ab_pos)
    cols.insert(bis_pos,col_change)
    df = df[cols]
    return df

In [327]:
df_train = reassignPos(df_train,19,17)
df_test = reassignPos(df_test,19,17)

In [328]:
print(len(df_train))
print(len(df_test))

3592
898


In [329]:
df_train

Unnamed: 0,subject_Government News,subject_Middle-east,subject_News,subject_US_News,subject_left-news,subject_politics,subject_politicsNews,subject_worldnews,year,month,day,weekday,aprilfool_close,aprilfool_no,aprilfool_not sure,aprilfool_yes,title,mark for title,text,label
22878,0,0,1,0,0,0,0,0,2016,7,29,4,0,1,0,0,"oh good grief , now the rnc is accusing hillar...",end of title,melania trump plagiarized part of her republic...,0
31424,0,0,0,0,0,1,0,0,2017,8,2,2,0,1,0,0,verified # fakenews ap attempts to discredit f...,end of title,ap published an article today that attempts to...,0
43534,0,0,1,0,0,0,0,0,2016,5,11,2,0,1,0,0,watch : target ceo tells anti - transgender bo...,end of title,"brian cornell , the ceo of the target chain of...",0
17366,0,0,0,0,0,0,0,1,2017,12,6,2,0,1,0,0,u . s . officials warn of isis' new caliphate ...,end of title,washington ( reuters ) - the collapse of islam...,1
25016,0,0,1,0,0,0,0,0,2017,3,3,4,0,1,0,0,chuck schumer responds to trump with tweet of ...,end of title,there s nothing sadder than when a person uses...,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10502,0,0,0,1,0,0,0,0,2017,10,29,6,0,1,0,0,lest we forget : ‘ independent ’ mueller is pa...,end of title,21 st century wire says while the mainstream p...,0
31021,0,0,0,0,0,0,1,0,2017,3,24,4,1,0,0,0,exclusive : trump to approve keystone xl at me...,end of title,( reuters ) - u . s . president donald trump w...,1
21712,0,0,1,0,0,0,0,0,2016,7,27,2,0,1,0,0,this scary message from obama is why everyone ...,end of title,when donald trump first announced that he was ...,0
24719,0,0,1,0,0,0,0,0,2016,6,12,6,0,1,0,0,hillary hits trump and his so - called ‘ unive...,end of title,hillary clinton is pulling out all the stops i...,0


In [330]:
def getXY(df):
    X = df.iloc[:, :-1].values
    y = df.iloc[:, -1].values
    return X,y

In [331]:
X_train,y_train = getXY(df_train)
X_test, y_test = getXY(df_test)

In [332]:
len(X_test[0])

19

In [333]:
X_test[10][17]

'end of title'

In [334]:
len(X_test)

898

In [335]:
y_test

array([1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0,
       1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0,
       0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0,
       0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0,
       1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0, 1, 0,
       0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0,
       1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0,
       1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0,
       1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 1, 1,
       1, 1, 1, 0, 0, 0, 1, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0,
       0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0,
       0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0, 1, 1, 1, 0, 0, 0, 1,
       0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1,
       1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 0, 1, 0,

## 2.2. Vectorization
### 2.2.1. Import GloVe pre-trained Data 

In [315]:
# import learned embedding resource glove pretrained data
GloVe_path=".\DataSet\glove.840B.300d.txt"

In [316]:
# function for loading GloVe
def load_glove(path):
    glove_dict = dict()
    words = list()
    with open(path, mode='r', encoding="utf-8") as vec_file:
        for line in tqdm(vec_file):
            values = line.split()
            word = values[0]
            vec = np.array(values[1:])
            glove_dict[word] = vec

    print("There are {0} distinct word vectors in this pretrained GloVe DataSet.".format(len(glove_dict)))
    return glove_dict

In [317]:
glove_dict = load_glove(GloVe_path)

2196017it [04:06, 8899.05it/s] 


There are 2195884 distinct word vectors in this pretrained GloVe DataSet.


In [318]:
# function for get vectorization of sentence and output as matrix
def getVec(sentence):
    'function for getting vector of the sentence'
    try:
        sentence = re.sub(r' +',r' ',sentence)
        word_list=sentence.split()
        sentence_matrix = list()
        for word in word_list:
            try:
                vec = glove_dict[word]
                vec = np.array(vec).astype(np.float)
                sentence_matrix.append(vec)
            except:
                pass
        sentence_matrix = np.array(sentence_matrix)
        
    except TypeError:# if there is np.nan value
        sentence_matrix = np.zeros((1,300))
    return sentence_matrix

In [319]:
# test
sen1="i like milk"
sen2='ok'
ma1=getVec(sen1)
ma2=getVec(sen2)
ma3 = np.append(ma1,ma2,0)
print(len(ma3))
ma3

4


array([[ 0.18733 ,  0.40595 , -0.51174 , ...,  0.16495 ,  0.18757 ,
         0.53874 ],
       [-0.18417 ,  0.055115, -0.36953 , ..., -0.23808 ,  0.37132 ,
         0.36197 ],
       [-0.62388 ,  0.21805 ,  0.29327 , ..., -0.96244 ,  0.28478 ,
         0.27701 ],
       [-0.13199 ,  0.15186 , -0.65313 , ..., -0.33259 , -0.17585 ,
         0.12933 ]])

### 2.2.2. Vectorization for Deep Learning

In [336]:
# function for vectorization of X
def vectorization(X):
    X_out = list()
    for eachRow in tqdm(X):
        text_row = ''
        for eachEle in eachRow:# concatenate the attributes into one sentence
            text_row = text_row +" " +str(eachEle)
        
        text_row = text_row.strip()
        text_matrix = getVec(text_row)
        X_out.append(text_matrix)
    X_out = np.array(X_out)
    return X_out

In [337]:
X_test_vec = vectorization(X_test)

100%|████████████████████████████████████████████████████████████████████████████████| 898/898 [00:58<00:00, 15.36it/s]


In [338]:
X_train_vec = vectorization(X_train)

100%|██████████████████████████████████████████████████████████████████████████████| 3592/3592 [03:38<00:00, 16.46it/s]


In [339]:
print(len(X_test_vec))
print(len(X_train_vec))

898
3592


In [340]:
len(X_test_vec[0])

859

### 2.2.3. Vectorization for Machine Learning

In [341]:
# function for vectorization only the text part of X, and use the average vector as the new attributes to replace the text column
def vectorizationPart(X,pos_title=16,pos_text=18):
    X_out = list()
    for eachRow in tqdm(X):
        list_row=list()
        for i in range(pos_title):
            # make categorical attributes
            list_row.append(eachRow[i])
        #print(eachRow[pos_title])
        #try:
        title_vec = getVec(eachRow[pos_title])
        title_vec_mean = list(np.mean(title_vec,axis=0))
        #except:
        #    title_vec_mean = np.zeros((1,300))
        #    title_vec_mean = list(title_vec_mean[0])
        list_row = list_row + title_vec_mean
        #print(eachRow[pos_text])
        #try:
        text_vec = getVec(eachRow[pos_text])
        text_vec_mean = list(np.mean(text_vec,axis=0))
            
        #except:
        #    text_vec_mean = np.zeros((1,300))
        #    text_vec_mean = list(title_vec_mean[0])
        list_row = list_row + text_vec_mean
        
        list_row = np.array(list_row)

            
        X_out.append(list_row)
    X_out = np.array(X_out)
    return X_out

In [342]:
X_test_cols = vectorizationPart(X_test)
X_train_cols = vectorizationPart(X_train)

100%|████████████████████████████████████████████████████████████████████████████████| 898/898 [00:54<00:00, 16.57it/s]
100%|██████████████████████████████████████████████████████████████████████████████| 3592/3592 [03:36<00:00, 16.59it/s]


In [343]:
X_test_cols

array([[ 0.        ,  0.        ,  0.        , ..., -0.04039734,
         0.0130999 , -0.01040898],
       [ 0.        ,  0.        ,  1.        , ..., -0.04934576,
        -0.01580824,  0.02186732],
       [ 0.        ,  0.        ,  0.        , ..., -0.0248349 ,
         0.04446531, -0.00751571],
       ...,
       [ 0.        ,  0.        ,  0.        , ..., -0.10365419,
         0.04221718,  0.06005114],
       [ 0.        ,  0.        ,  0.        , ..., -0.02165765,
         0.02012159,  0.06221512],
       [ 0.        ,  0.        ,  0.        , ..., -0.06825698,
        -0.00831745,  0.04705567]])

## 2.3. Storage

In [344]:
# function for saving as pickle file
def saveData(name,data):
    pickle_file = open('DataSet\\'+name,mode='wb')
    pickle.dump(data,pickle_file)
    pickle_file.close()

In [345]:
saveData('y_train.pkl',y_train)
saveData('y_test.pkl',y_test)
saveData('X_train_vec.pkl',X_train_vec)
saveData('X_test_vec.pkl',X_test_vec)

In [346]:
saveData('X_train_cols.pkl',X_train_cols)
saveData('X_test_cols.pkl',X_test_cols)

# 3. Step 3 - XGBoost
## 3.1. Data Importion

In [347]:
import numpy as np
import pickle
import sklearn
from tqdm import tqdm

In [348]:
# function for loading pickle file
def loadPickle(file_name):
    picklefile = open('.\\DataSet\\'+file_name+".pkl",mode="rb")
    data = pickle.load(picklefile)
    picklefile.close()
    return data

In [349]:
y_train = loadPickle('y_train')
y_test = loadPickle('y_test')
X_train = loadPickle('X_train_cols')
X_test = loadPickle('X_test_cols')

In [360]:
len(X_train)

3592

In [361]:
len(X_test)

898

## 3.2. Train on XGBoost
Initialize the Model and train Model

In [350]:
import xgboost
from xgboost import XGBClassifier 

In [351]:
model = XGBClassifier()
model.fit(X_train, y_train)





XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, enable_categorical=False,
              gamma=0, gpu_id=-1, importance_type=None,
              interaction_constraints='', learning_rate=0.300000012,
              max_delta_step=0, max_depth=6, min_child_weight=1, missing=nan,
              monotone_constraints='()', n_estimators=100, n_jobs=12,
              num_parallel_tree=1, predictor='auto', random_state=0,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=1, subsample=1,
              tree_method='exact', validate_parameters=1, verbosity=None)

### 3.2.1. Storage for Model

In [352]:
# global parameter
modelname="XGBoost"

In [353]:
# save model
def saveModel(modelname,data):
    pickle_file = open('DataSet\\'+modelname+".pkl",mode='wb')
    pickle.dump(data,pickle_file)
    pickle_file.close()

In [354]:
saveModel(modelname,model)

### 3.2.2. Load Model

In [355]:
model = loadPickle(modelname)

## 3.3. Prediction

In [356]:
y_preds = model.predict(X_test)

In [357]:
y_preds

array([1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0,
       1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0,
       0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0,
       0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0,
       1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0, 1, 0,
       0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0,
       1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0,
       1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0,
       1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 1, 1,
       1, 1, 1, 0, 0, 0, 1, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0,
       0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0,
       0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0, 1, 1, 1, 0, 0, 0, 1,
       0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1,
       1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 0, 1, 0,

## 3.4. Evaluation

In [358]:
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import f1_score
from sklearn.metrics import accuracy_score

def getEvaluation(y_test,y_pred):
    ev = dict()
    ev["confusion_matrix"]=confusion_matrix(y_test,y_pred)
    ev["accuracy"] = accuracy_score(y_test,y_pred)
    ev["precision"]=precision_score(y_test,y_pred,average="binary")
    ev["recall"]=recall_score(y_test,y_pred,average="binary")
    ev["F1 score"]=f1_score(y_test,y_pred,average="binary")
    ev['Accuracy']=accuracy_score(y_test,y_pred)
    
    for key in ev.keys():
        if key !="confusion_matrix":
            print("{a} is: {b}".format(a=key, b=ev[key]))
        else:
            print(ev[key])
    
    return ev

In [359]:
ev = getEvaluation(y_test,y_preds)

[[470   0]
 [  0 428]]
accuracy is: 1.0
precision is: 1.0
recall is: 1.0
F1 score is: 1.0
Accuracy is: 1.0


# 4. Step 3 - BiLSTM

## 4.1.Import Data

In [1]:
import numpy as np
import pickle
import sklearn
import tensorflow as tf
from tqdm import tqdm

In [2]:
# function for loading pickle file
def loadPickle(file_name):
    picklefile = open('.\\DataSet\\'+file_name+".pkl",mode="rb")
    data = pickle.load(picklefile)
    picklefile.close()
    return data

In [3]:
y_train = loadPickle('y_train')
y_test = loadPickle('y_test')
X_train = loadPickle('X_train_vec')
X_test = loadPickle('X_test_vec')

## 4.2. Preprocessing for padding the matrix to same size
### 4.2.1. Get Max Length of Rows in X

In [4]:
# increase the length of each text in each row till it matches the max length of all the texts
def getlenMax(input_data):
    'function for estimating the max length of text in data'
    max_no = 0
    for text in input_data:
        num_word = len(text)
        if max_no <= num_word:
            max_no = num_word

    return max_no

def compare(train,test):
    'function for comparison between test and train'
    no_train = getlenMax(train)
    no_test = getlenMax(test)
    if no_train > no_test:
        return no_train
    else:
        return no_test

In [5]:
max_len = compare(X_train,X_test)
print(max_len) # get the max length of rows in X

5845


## 4.2.2. Padding the format
because LSTM is time-series model, pre-padding make more sense 

In [6]:
def extendText(input_data, len_max,padding="pre",dimension=300):
    content=list()
    if padding == "post":
        for text in tqdm(input_data):
            add_no = len_max - len(text)
            x = np.zeros((add_no,dimension))
            new_text = np.vstack((text,x))#padding is post
            content.append(new_text)
            
    elif padding == "pre":
        for text in tqdm(input_data):
            add_no = len_max - len(text)
            x = np.zeros((add_no,dimension))
            new_text = np.vstack((x,text))#padding is pre
            content.append(new_text)

    content = np.array(content)
    return content

In [7]:
test = [[[1,2,3],[3,4,5]],[[1,2,3],[4,5,6]]]
test = np.array(test)
test = extendText(test,max_len,dimension=3)
print(test)
print(len(test[0]))

100%|██████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 2004.45it/s]

[[[0. 0. 0.]
  [0. 0. 0.]
  [0. 0. 0.]
  ...
  [0. 0. 0.]
  [1. 2. 3.]
  [3. 4. 5.]]

 [[0. 0. 0.]
  [0. 0. 0.]
  [0. 0. 0.]
  ...
  [0. 0. 0.]
  [1. 2. 3.]
  [4. 5. 6.]]]
5845





In [8]:
X_train = extendText(X_train,max_len)
X_test = extendText(X_test,max_len)

100%|██████████████████████████████████████████████████████████████████████████████| 3592/3592 [02:16<00:00, 26.26it/s]
100%|████████████████████████████████████████████████████████████████████████████████| 898/898 [00:23<00:00, 38.38it/s]


## 4.3. Build BiLSTM
### 4.3.1. Initialize the Model and Add the Layers

In [9]:
# import libraries
import keras
from keras.layers import LSTM
from keras.layers import MaxPooling1D
from keras.layers import Dropout
from keras.layers import Dense
from keras.layers import Bidirectional
from keras_self_attention import SeqSelfAttention

In [10]:
# firstly we need to make sure our input share the same format
input_shape = X_train[0].shape
print(input_shape) # this means 300 length vectors with 162 timesteps

(5845, 300)


In [11]:
def modelBuild(input_shape):
    # initialize model
    model = keras.Sequential()
    
    # add 1st BiLSTM layer, I need to set the input_shape directly, 
    # which should be 2 dimensions, one for the timesteps, one for the indicators inside
    # because sentence can be complex, here I should set the LSTM cells number as 50, but in order to lower the consuming time, I only set 5
    forward_layer = LSTM(units = 5, activation="tanh",dropout=0.2, recurrent_activation="sigmoid", return_sequences = True)
    #backward_layer = LSTM(units = 10, activation='relu',dropout=0.2, return_sequences=True, go_backwards=True)
    model.add(Bidirectional(layer = forward_layer,merge_mode="concat",input_shape = input_shape))
    
    forward_layer2 = LSTM(units = 5, activation="tanh",dropout=0.2, recurrent_activation="sigmoid")
    # add 2nd LSTM layer
    model.add(Bidirectional(layer = forward_layer2))
    
   
    # add output layer, since the data has 2 output [0,1] as label, I set the final neurons number as 1
    model.add(Dense(units=1, activation='sigmoid'))
    
    return model

## 4.3.2. Compilation

In [12]:
model = modelBuild(input_shape)

In [13]:
# compilation
# we use stochastic gradient descent as our optimizer, cross entropy as our loss function
# it's because I use integer binary result [0,1] as label, so I use binary crossentropy as my loss function
model.compile(optimizer = 'adam', loss = 'binary_crossentropy', metrics = ['accuracy'])

In [14]:
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
bidirectional (Bidirectional (None, 5845, 10)          12240     
_________________________________________________________________
bidirectional_1 (Bidirection (None, 10)                640       
_________________________________________________________________
dense (Dense)                (None, 1)                 11        
Total params: 12,891
Trainable params: 12,891
Non-trainable params: 0
_________________________________________________________________


### 4.3.3. Function for Storage of Model

In [15]:
# global parameter
modelname="BiLSTM"

In [16]:
def saveModel(model,modelname):
    model_json=model.to_json()
    # specialize model to json
    name_path = "DataSet\\{a}".format(a=modelname)
    with open(name_path+".json","w") as json_file:
        json_file.write(model_json)
        
    # sepcialize weights to HDF5
    model.save_weights(name_path+".h5")
    print("Save model to DataSet archive successfully")

## 4.4. Train Model

In [17]:
# in order to shorten the cost time, I only set epochs = 2
epochs = 2
batch = 50
vali_split=0.1
# though validation_data is not used in propagation, the model will be gradually familier with validation data. There can be information leak.
# therefore, normally use validation_split for testing model to avoid overfitting

In [18]:
history = model.fit(x = X_train,y=y_train,validation_split=vali_split, epochs =epochs, batch_size=batch)
saveModel(model,modelname)

Epoch 1/2
Epoch 2/2
Save model to DataSet archive successfully


## 4.5. Load Model

In [19]:
def loadModel(modelname):
    # load json file
    load_path = ".\\DataSet\\{a}".format(a=modelname)
    json_file = open(load_path+".json",'r')
    model_json = json_file.read()
    json_file.close()
    model = keras.models.model_from_json(model_json)
    # load weights and assign them to the model
    model.load_weights(load_path+".h5")
    print("loaded {a} model successfully".format(a=modelname))
    
    return model

In [20]:
model = loadModel(modelname)

loaded BiLSTM model successfully


## 4.6. Prediction 

In [21]:
y_possible = model.predict(X_test)

In [22]:
y_possible

array([[0.8065181 ],
       [0.12902287],
       [0.16234314],
       [0.7873483 ],
       [0.8087996 ],
       [0.13635454],
       [0.16339886],
       [0.80341756],
       [0.8088933 ],
       [0.14090991],
       [0.81483513],
       [0.19146952],
       [0.8082435 ],
       [0.7983103 ],
       [0.145408  ],
       [0.14510137],
       [0.4848252 ],
       [0.80281115],
       [0.1398944 ],
       [0.17815602],
       [0.30631027],
       [0.13695174],
       [0.81620085],
       [0.14707956],
       [0.13743979],
       [0.17318967],
       [0.19190821],
       [0.24233684],
       [0.7995265 ],
       [0.223856  ],
       [0.1402162 ],
       [0.14059532],
       [0.8121161 ],
       [0.4848252 ],
       [0.14012495],
       [0.14364243],
       [0.16175118],
       [0.14251223],
       [0.20067525],
       [0.79938686],
       [0.8040622 ],
       [0.81182194],
       [0.13624674],
       [0.19564337],
       [0.14528435],
       [0.29098338],
       [0.80889547],
       [0.807

In [23]:
def setResult(y_possible,standard):
    y_preds = list()
    for each in y_possible:
        if each >= standard:
            y_preds.append(1)
        else:
            y_preds.append(0)
    y_preds=np.array(y_preds)
    return y_preds

In [24]:
y_preds=setResult(y_possible,standard=0.5)

In [25]:
y_preds

array([1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0,
       1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0,
       0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0,
       0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0,
       1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 1, 0, 1, 0,
       0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0,
       1, 0, 1, 0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0,
       1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0,
       1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 1, 1,
       1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0,
       0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0,
       0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 1, 1, 1, 0, 0, 0, 1,
       0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1,
       1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 0, 1, 0,

## 4.7. Evaluation

In [26]:
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import f1_score

def getEvaluation(y_test,y_pred,X_test, model):
    ev = dict()
    ev["confusion_matrix"]=confusion_matrix(y_test,y_pred)
    ev["accuracy"] = accuracy_score(y_test,y_pred)
    ev["precision"]=precision_score(y_test,y_pred,average="binary")
    ev["recall"]=recall_score(y_test,y_pred,average="binary")
    ev["F1 score"]=f1_score(y_test,y_pred,average="binary")
    
    for key in ev.keys():
        if key !="confusion_matrix":
            print("{a} is: {b}".format(a=key, b=ev[key]))
        else:
            print(ev[key])
    
    # and also the basic evaluation from keras
    model.compile(optimizer = 'adam', loss = 'binary_crossentropy', metrics = ['accuracy'])
    eval_ = model.evaluate(X_test,y_test)
    print("Loss: {0:.7}".format(eval_[0]))
    print("Accuracy: {0:.2%}".format(eval_[1]))
    
    return ev

In [27]:
ev = getEvaluation(y_test,y_preds,X_test,model)

[[443  27]
 [ 30 398]]
accuracy is: 0.9365256124721604
precision is: 0.9364705882352942
recall is: 0.9299065420560748
F1 score is: 0.9331770222743259
Loss: 0.2792055
Accuracy: 93.65%


# 5. Step 3 - CNN

## 5.1. Data Importion

In [28]:
import numpy as np
import pickle
import sklearn
import tensorflow as tf
from tqdm import tqdm

In [29]:
# function for loading pickle file
def loadPickle(file_name):
    picklefile = open('.\\DataSet\\'+file_name+".pkl",mode="rb")
    data = pickle.load(picklefile)
    picklefile.close()
    return data

In [30]:
y_train = loadPickle('y_train')
y_test = loadPickle('y_test')
X_train = loadPickle('X_train_vec')
X_test = loadPickle('X_test_vec')

## 5.2. Preprocessing for padding the matrix to same size
### 5.2.1. Get Max Length of Rows in X 

In [31]:
# increase the length of each text in each row till it matches the max length of all the texts
def getlenMax(input_data):
    'function for estimating the max length of text in data'
    max_no = 0
    for text in input_data:
        num_word = len(text)
        if max_no <= num_word:
            max_no = num_word

    return max_no

def compare(train,test):
    'function for comparison between test and train'
    no_train = getlenMax(train)
    no_test = getlenMax(test)
    if no_train > no_test:
        return no_train
    else:
        return no_test

In [32]:
max_len = compare(X_train,X_test)
print(max_len) # get the max length of rows in X

5845


### 5.2.2. Padding the format
because LSTM is time-series model, pre-padding make more sense

In [33]:
def extendText(input_data, len_max,padding="pre",dimension=300):
    content=list()
    if padding == "post":
        for text in tqdm(input_data):
            add_no = len_max - len(text)
            x = np.zeros((add_no,dimension))
            new_text = np.vstack((text,x))#padding is post
            content.append(new_text)
            
    elif padding == "pre":
        for text in tqdm(input_data):
            add_no = len_max - len(text)
            x = np.zeros((add_no,dimension))
            new_text = np.vstack((x,text))#padding is pre
            content.append(new_text)

    content = np.array(content)
    return content

In [34]:
X_train = extendText(X_train,max_len,padding = 'post')
X_test = extendText(X_test,max_len,padding = 'post')

100%|██████████████████████████████████████████████████████████████████████████████| 3592/3592 [02:48<00:00, 21.26it/s]
100%|████████████████████████████████████████████████████████████████████████████████| 898/898 [00:46<00:00, 19.15it/s]


## 5.3. Build 1dCNN
### 5.3.1. Initialize the Model and Add the Layers

In [35]:
# import libraries
import keras
from tensorflow.keras.layers import Conv1D
from tensorflow.keras.layers import MaxPooling1D
from tensorflow.keras.layers import Dropout
from tensorflow.keras.layers import Flatten
from tensorflow.keras.layers import Dense

In [36]:
# firstly we need to make sure our input share the same format
input_shape = X_train[0].shape
print(input_shape) # this means 300 length vectors with 162 timesteps

(5845, 300)


In [37]:
def modelBuild(input_shape):
    # initialize CNN model
    model = tf.keras.Sequential()
    
    # add 1st convolution layer. here particularly need input_shape
    model.add(Conv1D(filters=150, kernel_size=2, activation="relu",padding="same", input_shape=input_shape))
    
    # add 2nd convolution layer with different kernal_size 3
    model.add(Conv1D(filters=150, kernel_size=3, activation="relu",padding="same"))
    
    # add 3rd convolution layer with different kernal_size 4
    model.add(Conv1D(filters=150, kernel_size=4, activation="relu",padding="same"))
    # add 1st max pooling, because sentences share different length, so I only count max value from each feature map.
    # therefore ,finally when I get to the ann input layer, I can have fixed dimensionality of input data.
    model.add(MaxPooling1D(pool_size=max_len,padding="valid"))
    model.add(Dropout(0.2))
    
    # add flattern
    model.add(Flatten())
    
    # from now on it's ANN analysis layer, we set enough neurons for full connection, add 1st hidden layer
    # because sentences are built based on fluent thinking style, so I use rectifier as activation function to avoid gradual change
    model.add(Dense(units=5, activation='relu'))
    
    # here I add 2nd hidden layer
    model.add(Dense(units=5, activation='relu'))
    
    # add output layer, since the category of label is "happy, sad, angry, others", i use 4 neurons as output outcome
    # it's because I also need the possibility for the prediction, so I set sigmoid as activation function
    model.add(Dense(units=1, activation='sigmoid'))
    
    return model

### 5.3.2.Compilation

In [38]:
model = modelBuild(input_shape)

In [39]:
# compilation
# we use stochastic gradient descent as our optimizer, cross entropy as our loss function
# it's because I use integer binary result [0,1] as label, so I use binary crossentropy as my loss function
model.compile(optimizer = 'adam', loss = 'binary_crossentropy', metrics = ['accuracy'])

In [40]:
model.summary()

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
conv1d (Conv1D)              (None, 5845, 150)         90150     
_________________________________________________________________
conv1d_1 (Conv1D)            (None, 5845, 150)         67650     
_________________________________________________________________
conv1d_2 (Conv1D)            (None, 5845, 150)         90150     
_________________________________________________________________
max_pooling1d (MaxPooling1D) (None, 1, 150)            0         
_________________________________________________________________
dropout (Dropout)            (None, 1, 150)            0         
_________________________________________________________________
flatten (Flatten)            (None, 150)               0         
_________________________________________________________________
dense_1 (Dense)              (None, 5)                

### 5.3.3. Function for Storage of Model

In [41]:
# global parameter
modelname="1dCNN"

In [42]:
def saveModel(model,modelname):
    model_json=model.to_json()
    # specialize model to json
    name_path = "DataSet\\{a}".format(a=modelname)
    with open(name_path+".json","w") as json_file:
        json_file.write(model_json)
        
    # sepcialize weights to HDF5
    model.save_weights(name_path+".h5")
    print("Save model to DataSet archive successfully")

## 5.4. Train Model

In [43]:
# in order to shorten the cost time, I only set epochs = 2
epochs = 2
batch = 50
vali_split=0.1
# though validation_data is not used in propagation, the model will be gradually familier with validation data. There can be information leak.
# therefore, normally use validation_split for testing model to avoid overfitting

In [44]:
history = model.fit(x = X_train,y=y_train,validation_split=vali_split, epochs =epochs, batch_size=batch)
saveModel(model,modelname)

Epoch 1/2
Epoch 2/2
Save model to DataSet archive successfully


## 5.5. Load Model

In [45]:
def loadModel(modelname):
    # load json file
    load_path = ".\\DataSet\\{a}".format(a=modelname)
    json_file = open(load_path+".json",'r')
    model_json = json_file.read()
    json_file.close()
    model = keras.models.model_from_json(model_json)
    # load weights and assign them to the model
    model.load_weights(load_path+".h5")
    print("loaded {a} model successfully".format(a=modelname))
    
    return model

In [46]:
model = loadModel(modelname)

loaded 1dCNN model successfully


## 5.6. Prediction

In [47]:
y_possible = model.predict(X_test)

In [48]:
y_possible

array([[9.99980688e-01],
       [6.53200868e-06],
       [1.97356094e-05],
       [9.99968886e-01],
       [9.99962211e-01],
       [2.36159940e-05],
       [4.70864052e-06],
       [9.99981284e-01],
       [9.99982357e-01],
       [6.08891132e-05],
       [9.99969423e-01],
       [7.18705360e-06],
       [9.99972582e-01],
       [9.99981284e-01],
       [5.90951868e-06],
       [6.06274671e-06],
       [9.99981999e-01],
       [9.99965906e-01],
       [1.21898483e-05],
       [1.95871235e-05],
       [2.78308999e-05],
       [2.53389208e-05],
       [9.99981880e-01],
       [2.14221254e-05],
       [1.93891356e-05],
       [1.83540615e-05],
       [1.69848208e-05],
       [2.51985730e-05],
       [7.99300142e-06],
       [2.86092472e-05],
       [1.20805475e-04],
       [2.45512911e-05],
       [9.99982595e-01],
       [9.99981999e-01],
       [1.29079699e-05],
       [2.27603250e-05],
       [5.15969532e-06],
       [6.62807142e-05],
       [5.88815283e-06],
       [9.99983490e-01],


In [49]:
def setResult(y_possible,standard):
    y_preds = list()
    for each in y_possible:
        if each >= standard:
            y_preds.append(1)
        else:
            y_preds.append(0)
    y_preds=np.array(y_preds)
    return y_preds

In [50]:
y_preds=setResult(y_possible,standard=0.5)

In [51]:
y_preds

array([1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0,
       1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0,
       0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0,
       0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0,
       1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0, 1, 0,
       0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0,
       1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0,
       1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0,
       1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 1, 1,
       1, 1, 1, 0, 0, 0, 1, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0,
       0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0,
       0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0, 1, 1, 1, 0, 0, 0, 1,
       0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1,
       1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 0, 1, 0,

## 5.7. Evaluation

In [52]:
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import f1_score

def getEvaluation(y_test,y_pred,X_test, model):
    ev = dict()
    ev["confusion_matrix"]=confusion_matrix(y_test,y_pred)
    ev["accuracy"] = accuracy_score(y_test,y_pred)
    ev["precision"]=precision_score(y_test,y_pred,average="binary")
    ev["recall"]=recall_score(y_test,y_pred,average="binary")
    ev["F1 score"]=f1_score(y_test,y_pred,average="binary")
    
    for key in ev.keys():
        if key !="confusion_matrix":
            print("{a} is: {b}".format(a=key, b=ev[key]))
        else:
            print(ev[key])
    
    # and also the basic evaluation from keras
    model.compile(optimizer = 'adam', loss = 'binary_crossentropy', metrics = ['accuracy'])
    eval_ = model.evaluate(X_test,y_test)
    print("Loss: {0:.7}".format(eval_[0]))
    print("Accuracy: {0:.2%}".format(eval_[1]))
    
    return ev

In [53]:
ev = getEvaluation(y_test,y_preds,X_test,model)

[[470   0]
 [  0 428]]
accuracy is: 1.0
precision is: 1.0
recall is: 1.0
F1 score is: 1.0
Loss: 0.0001031947
Accuracy: 100.00%
