### Detecting Stance for Fake News Identification

We tackle the problem of fake news with machine learning. According to [fakenewschallenge.org](fakenewschallenge.org), the first step towards identifying fake news is to see what other news sources are saying about the same story.

We use the dataset provided by fakenewschallenge.org. The task is to identify whether a given article body is 'related' or 'unrelated' to a headline and, if it is related, whether its stance towards the headline is 'discuss', 'agree', or 'disagree'.

I follow the approach of Benjamin Riedel et al. described in their [paper](https://arxiv.org/abs/1707.03264), "A simple but tough-to-beat baseline for the Fake News Challenge stance detection task". The approach uses a single-layer neural network as the classifier. The feature vector is the concatenation of the TF vector of the headline, the TF vector of the body, and the TF-IDF cosine similarity between the headline and the body.
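
To make the feature layout concrete, here is a minimal pure-Python sketch of the same idea on a toy pair. It is not the notebook's implementation, which uses scikit-learn further down. The two sentences, the tiny vocabulary, and the helper names `tf_vector`, `idf_weights`, and `cosine` are assumptions made for illustration, and the IDF smoothing follows my understanding of scikit-learn's default.

```python
import math
from collections import Counter


def tf_vector(tokens, vocab):
    """L2-normalised term-frequency vector over a fixed vocabulary."""
    counts = Counter(tokens)
    vec = [float(counts[term]) for term in vocab]
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


def idf_weights(docs, vocab):
    """Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1."""
    n = len(docs)
    return [math.log((1 + n) / (1 + sum(term in doc for doc in docs))) + 1
            for term in vocab]


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Hypothetical headline/body pair, already tokenised on whitespace.
headline = "police find missing hiker".split()
body = "the missing hiker was found by police on friday".split()
docs = [headline, body]
vocab = sorted(set(headline) | set(body))

tf_head = tf_vector(headline, vocab)
tf_body = tf_vector(body, vocab)
idf = idf_weights(docs, vocab)
tfidf_head = [t * w for t, w in zip(tf_head, idf)]
tfidf_body = [t * w for t, w in zip(tf_body, idf)]

# Final feature vector: [TF(headline), TF(body), cosine(TF-IDF)].
features = tf_head + tf_body + [cosine(tfidf_head, tfidf_body)]
print(len(vocab), len(features))
```

The feature vector always has `2 * len(vocab) + 1` entries, which is why the network defined later takes `(top_N*2)+1` inputs.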

That approach was tuned to score well under fakenewschallenge.org's scoring system. Because the scoring system gives more weight to the distinction between 'unrelated' and 'related', the approach is highly accurate on the 'unrelated' class but very poor on data labelled 'disagree' (6.6%). I have tried to modify the algorithm so that it performs comparably on all classes.
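
For reference, the weighting can be sketched as below. This assumes the rules as I understand them from the scorer published with the FNC-1 baseline: 0.25 for deciding related versus unrelated correctly, plus a further 0.75 for the exact stance of a related pair. `fnc_score` is a hypothetical helper name, not the official script.

```python
RELATED = {"agree", "disagree", "discuss"}


def fnc_score(gold, predicted):
    """Weighted score under the assumed FNC-1 rules (see lead-in)."""
    score = 0.0
    for g, p in zip(gold, predicted):
        if (g in RELATED) == (p in RELATED):
            score += 0.25  # related vs unrelated decided correctly
            if g in RELATED and g == p:
                score += 0.75  # exact stance of a related pair
    return score


# Hypothetical labels, for illustration only.
gold = ["unrelated", "agree", "disagree", "discuss"]
pred = ["unrelated", "agree", "agree", "unrelated"]
print(fnc_score(gold, pred), "out of", fnc_score(gold, gold))
```

Because 'unrelated' pairs make up most of the data, a model can collect a large share of the available points while still doing poorly on a rare class such as 'disagree'.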

I use undersampling and oversampling to balance the training data. I also pass class weights to the model so that every class counts equally during training.
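
As a rough illustration of what 'balanced' class weights do, the sketch below computes them by hand as `n_samples / (n_classes * class_count)`, which is my understanding of the heuristic scikit-learn applies. The counts are the training-split label counts printed later in this notebook; `balanced_class_weights` is a hypothetical helper name.

```python
def balanced_class_weights(counts):
    """Weight each class by n_samples / (n_classes * class_count)."""
    total = sum(counts.values())
    n_classes = len(counts)
    return {label: total / (n_classes * n) for label, n in counts.items()}


# Label counts of the training split as printed further down in the notebook.
counts = {"agree": 2422, "disagree": 677, "discuss": 2389, "unrelated": 2384}
weights = balanced_class_weights(counts)
for label in sorted(weights):
    print(f"{label:10s} {weights[label]:.3f}")
```

With these weights every class contributes the same total weight to the loss, so the rare 'disagree' examples are not drowned out by the larger classes.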

In [256]:
from comet_ml import Experiment

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction import text
from sklearn.preprocessing import LabelEncoder
import numpy as np
import pandas as pd
from keras.utils import np_utils
from sklearn.metrics import confusion_matrix
from keras.models import Sequential
from keras.layers import Dense, Dropout
from keras import optimizers
from nltk.stem.snowball import SnowballStemmer
from sklearn.utils import class_weight
import nltk
from keras.optimizers import Adam


### Prepare The Training data:

The training data consists of 49972 instances. The labels are distributed as follows:

```
unrelated    discuss    agree    disagree
73.13%       17.82%     7.36%    1.68%
```

This is highly imbalanced data. Hence we will try to balance it.

We will pick 3000 instances from each of the 'unrelated', 'agree', and 'discuss' classes.
fakenewschallenge.org also provides additional data for testing. We will take 348 (about 50%) of its 'disagree' instances and add them to the training data.
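
The undersampling step can be sketched without pandas as follows. This is only an illustration of the idea: the toy rows, the `cap` value, and the helper name `undersample` are assumptions, and unlike a plain `sample(3000)` call the sketch keeps a class whole when it has fewer rows than the cap.

```python
import random
from collections import Counter, defaultdict


def undersample(rows, label_of, cap, seed=0):
    """Keep at most `cap` randomly chosen rows per label, then shuffle."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for row in rows:
        by_label[label_of(row)].append(row)
    balanced = []
    for group in by_label.values():
        # Small classes are kept whole; large ones are sampled down.
        balanced.extend(group if len(group) <= cap else rng.sample(group, cap))
    rng.shuffle(balanced)
    return balanced


# Hypothetical (headline, stance) rows with a strong imbalance.
rows = [("h%d" % i, "unrelated") for i in range(50)]
rows += [("d%d" % i, "disagree") for i in range(5)]

balanced = undersample(rows, label_of=lambda r: r[1], cap=10)
label_counts = Counter(label for _, label in balanced)
print(label_counts)
```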



In [257]:
original_data = pd.read_csv('train_stances.csv')
original_test_data = pd.read_csv('competition_test_stances.csv')

# Oversample 'disagree': all training rows plus the first 348 'disagree' rows of the test set.
disagree_data = original_data.loc[original_data['Stance'] == 'disagree']
disagree_data_test = original_test_data.loc[original_test_data['Stance'] == 'disagree'].head(348)

# DataFrame.append was removed in pandas 2.0, so collect the pieces and concatenate once.
frames = [disagree_data, disagree_data_test]

stances = original_test_data['Stance'].unique().tolist()

# Undersample every other class to 3000 rows.
for stance in stances:
    if stance != 'disagree':
        stance_data = original_data.loc[original_data['Stance'] == stance]
        frames.append(stance_data.sample(3000))

balance_data = pd.concat(frames, ignore_index=True)


balance_data= balance_data.sample(frac=1)
print('Balanced Train Data:')
print(balance_data.groupby('Stance').count())

balance_data.to_csv('my_train_stances.csv')


Balanced Train Data:
           Headline  Body ID
Stance                      
agree          3000     3000
disagree       1188     1188
discuss        3000     3000
unrelated      3000     3000


In [258]:
def readFiles(bodies,stances):
    bodiesFile=pd.read_csv(bodies)
    stancesFile = pd.read_csv(stances)
    consolidated_data=pd.merge(bodiesFile,stancesFile, on=['Body ID'])
    consolidated_data['merged'] = consolidated_data['Headline']+' '+consolidated_data['articleBody']
    return {
        'consolidated_data':consolidated_data,
        'bodiesFile':bodiesFile,
        'stancesFile':stancesFile
    }

def getCorpus(data):
    unique_headlines = data['Headline'].unique().tolist()
    unique_bodies = data['articleBody'].unique().tolist()
    corpus = unique_headlines + unique_bodies
    return corpus


In [259]:
from sklearn.model_selection import train_test_split

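# Note: pd.merge defaults to an inner join on 'Body ID', so stance rows whose body is
# missing from train_bodies.csv are dropped. The counts printed below contain 840
# 'disagree' rows (677 + 163) rather than 1188, which suggests the 348 rows taken
# from competition_test_stances.csv were lost at this step.
merged_data=readFiles('train_bodies.csv','my_train_stances.csv')['consolidated_data']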
train_data, test_data = train_test_split(merged_data, test_size=0.2)

print('Distribution of the labels in Training Data:')
train_labels = train_data.groupby('Stance').size()
for stance in stances:
    print(str(stance), '-',int(train_labels[stance]))

print('\n\nDistribution of the labels in Test Data:')
test_labels = test_data.groupby('Stance').size()
for stance in stances:
    print(stance, test_labels[stance])


Distribution of the labels in Training Data:
unrelated - 2384
agree - 2422
discuss - 2389
disagree - 677


Distribution of the labels in Test Data:
unrelated 616
agree 578
discuss 611
disagree 163


In [260]:
train_corpus=getCorpus(train_data)
test_corpus=getCorpus(test_data)

In [261]:
stemmer = nltk.stem.snowball.SnowballStemmer('english')
top_N = 3000


def preprocess(data_frame, corpus, countVectorer, train=False):

    # Stem every whitespace-separated token of every document.
    corpus = [' '.join(stemmer.stem(word) for word in doc.split(' ')) for doc in corpus]
    if train:
        # Fit the vocabulary (top_N most frequent 1- to 4-grams) on the training corpus only.
        # stop_words='english' selects the same list as text.ENGLISH_STOP_WORDS.
        cvectorizer = CountVectorizer(lowercase=True,
                                      stop_words='english', token_pattern=r"\w+[\-']?\w+", max_features=top_N, ngram_range=(1, 4))
        countVectorer = cvectorizer.fit(corpus)

    cvec = countVectorer.transform(corpus)

    # Map each stemmed document to its row index in the count matrix.
    idxToContentMap = {}
    for index, element in enumerate(corpus):
        idxToContentMap[element] = index

    tfreq_transformer = TfidfTransformer(use_idf=False).fit(cvec)  # transform the counts to normalized tf.
    tfreq = tfreq_transformer.transform(cvec).toarray()

    # The TF-IDF vectorizer is refit on whichever corpus is passed in (train or test).
    tfidf_vector = TfidfVectorizer(max_features=top_N, lowercase=True,
                                   stop_words='english', token_pattern=r"\w+[\-']?\w+", ngram_range=(1, 4)) \
        .fit_transform(corpus)
    features = []
    feat = np.array( [] )
    cosineMap = {}
    for index, row in data_frame.iterrows():
        body_id = row['Body ID']
        Headline = ' '.join(stemmer.stem(word) for word in row['Headline'].split(" "))
        body = ' '.join(stemmer.stem(word) for word in row['articleBody'].split(" "))
        head_index = idxToContentMap[Headline]
        body_index = idxToContentMap[body]
        tf_head = tfreq[head_index]
        tf_body = tfreq[body_index]
        tfidf_head = tfidf_vector[head_index]
        tfidf_body = tfidf_vector[body_index]
        cos = 0.0
        if (Headline, body) not in cosineMap:
            cos = cosine_similarity( tfidf_head, tfidf_body )[0]
            cosineMap[(Headline, body)] = cos
        else:
            cos = cosineMap[(Headline, body)]
        features.append( np.concatenate( [tf_head,tf_body,cos ] ) )

    featuresndarray = np.array( features )

    
    return {
        'X': featuresndarray,
        'countVectorer': countVectorer
    }


def encodeLabels(rawStances):
# encode the labels
    lencoder = LabelEncoder()
    lencoder = lencoder.fit(rawStances)
    print('Sample class input:','[0,1,2,3]')
    y_int = lencoder.transform(rawStances)
    print('encoded as',lencoder.inverse_transform([0,1,2,3]))
    y_hot_encoded = np_utils.to_categorical( y_int )
    y_encoded = y_int
    return {
        'encoder':lencoder,
        'y_hot_encoded':y_hot_encoded,
        'y_encoded':y_encoded
    }

In [262]:
processed_train = preprocess(train_data,train_corpus,None,train=True)
encodedLabels_train = encodeLabels(train_data['Stance'])
train_X = processed_train['X']
train_Y = encodedLabels_train['y_encoded']
train_Y_hot = encodedLabels_train['y_hot_encoded']

countVectorer = processed_train['countVectorer']


processed_test = preprocess(test_data,test_corpus,countVectorer,train=False)
encodedLabels_test = encodeLabels(test_data['Stance'])
test_X = processed_test['X']
test_Y = encodedLabels_test['y_encoded']
test_Y_hot = encodedLabels_test['y_hot_encoded']

Sample class input: [0,1,2,3]
encoded as ['agree' 'disagree' 'discuss' 'unrelated']
Sample class input: [0,1,2,3]
encoded as ['agree' 'disagree' 'discuss' 'unrelated']


In [264]:

# Keyword arguments are required by recent scikit-learn releases.
c_weights = class_weight.compute_class_weight(class_weight='balanced', classes=np.unique(train_Y), y=train_Y)
le = encodedLabels_train['encoder']

class_weights_dict = dict(zip(le.transform(list(le.classes_)), c_weights))

# Read the Comet API key from the environment instead of hard-coding it in the notebook.
import os
experiment = Experiment(api_key=os.environ["COMET_API_KEY"], project_name="general", workspace="ptambvekar")


model = Sequential()
model.add(Dense(100, input_dim=(top_N*2)+1, activation='relu'))
model.add(Dense(40, activation='relu'))
model.add(Dense(20, activation='relu'))
model.add(Dropout(0.6))
model.add(Dense(4, activation='softmax'))


sgd = optimizers.SGD(lr=0.01)
adam = optimizers.Adam(lr=0.001)

model.compile(loss='categorical_crossentropy', optimizer=adam, metrics=['accuracy'])
model.fit(train_X,train_Y_hot,epochs=60,batch_size=32,shuffle=True,verbose=2,class_weight=class_weights_dict)


COMET INFO: Experiment is live on comet.ml https://www.comet.ml/ptambvekar/general/43c0bbd943554bb7b3084ef5f201bf2b

COMET INFO: old comet version (1.0.37) detected. current: 1.0.38 please update your comet lib with command: `pip install --no-cache-dir --upgrade comet_ml`
COMET INFO: Experiment is live on comet.ml https://www.comet.ml/ptambvekar/general/9bd6fe046fe34350a1e1465c512db82f



Epoch 1/60
 - 2s - loss: 1.1225 - acc: 0.5005
Epoch 2/60
 - 2s - loss: 0.6805 - acc: 0.7156
Epoch 3/60
 - 2s - loss: 0.5301 - acc: 0.7734
Epoch 4/60
 - 2s - loss: 0.4372 - acc: 0.8189
Epoch 5/60
 - 2s - loss: 0.3919 - acc: 0.8370
Epoch 6/60
 - 2s - loss: 0.3404 - acc: 0.8594
Epoch 7/60
 - 2s - loss: 0.2938 - acc: 0.8712
Epoch 8/60
 - 2s - loss: 0.2530 - acc: 0.8890
Epoch 9/60
 - 2s - loss: 0.2621 - acc: 0.8801
Epoch 10/60
 - 2s - loss: 0.2293 - acc: 0.8874
Epoch 11/60
 - 2s - loss: 0.2155 - acc: 0.8914
Epoch 12/60
 - 2s - loss: 0.2018 - acc: 0.8901
Epoch 13/60
 - 2s - loss: 0.1829 - acc: 0.8974
Epoch 14/60
 - 2s - loss: 0.1644 - acc: 0.9094
Epoch 15/60
 - 2s - loss: 0.1715 - acc: 0.9094
Epoch 16/60
 - 2s - loss: 0.1593 - acc: 0.9131
Epoch 17/60
 - 2s - loss: 0.1470 - acc: 0.9295
Epoch 18/60
 - 2s - loss: 0.1423 - acc: 0.9296
Epoch 19/60
 - 2s - loss: 0.1359 - acc: 0.9298
Epoch 20/60
 - 2s - loss: 0.1354 - acc: 0.9356
Epoch 21/60
 - 2s - loss: 0.1190 - acc: 0.9439
Epoch 22/60
 - 2s - lo

<keras.callbacks.History at 0x1cb011e9780>

In [265]:
loss_and_metrics = model.evaluate(test_X, test_Y_hot)
pred_Y = model.predict(test_X)

cm=confusion_matrix(test_Y_hot.argmax(axis=1),pred_Y.argmax(axis=1))
print('the confusion matrix:\n',cm)
print('\nmodel metrics are:',model.metrics_names)
print(loss_and_metrics)

class_accu = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
print('\nThe class wise accuracy is (agree,disagree,discuss,unrelated):',class_accu.diagonal())

the confusion matrix:
 [[530  25  17   6]
 [ 18 140   4   1]
 [ 35   9 535  32]
 [  5   3   8 600]]

model metrics are: ['loss', 'acc']
[0.793403940260168, 0.9171747967479674]

The class wise accuracy is (agree,disagree,discuss,unrelated): [0.91695502 0.85889571 0.87561375 0.97402597]


In [60]:
model.save('split_model_85accuracy.h5')

### Try on another dataset:

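Half of the 'disagree' rows in the test set provided by fakenewschallenge were already added to the training data. To build a second test set, we drop the first 50% of the rows of every label from that test set, which removes the 'disagree' rows used for training, and evaluate on the remaining half.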
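
The per-label split can be illustrated on plain Python lists as below. This is a sketch of the idea only: the toy rows and the helper name `drop_first_half_per_label` are assumptions, and the notebook itself does the same thing with pandas in the next cell.

```python
from collections import Counter


def drop_first_half_per_label(rows, label_of):
    """Drop the first floor(n / 2) rows of each label, keep the rest in order."""
    sizes = Counter(label_of(r) for r in rows)
    to_drop = {label: n // 2 for label, n in sizes.items()}
    kept = []
    for row in rows:
        label = label_of(row)
        if to_drop[label] > 0:
            to_drop[label] -= 1  # still inside the first half of this label
        else:
            kept.append(row)
    return kept


# Hypothetical (id, stance) rows: 7 'disagree' and 4 'agree'.
rows = [(i, "disagree") for i in range(7)] + [(i, "agree") for i in range(4)]
kept = drop_first_half_per_label(rows, label_of=lambda r: r[1])
kept_counts = Counter(label for _, label in kept)
print(kept_counts)
```

With an odd class size the larger half is kept, which is consistent with 349 'disagree' rows remaining in the second test set after 348 were moved to training.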

In [266]:
original_test_data = pd.read_csv( 'competition_test_stances.csv' )
test_sizes=original_test_data.groupby('Stance').size()
for stance in stances:
    stance_data = original_test_data.loc[original_test_data['Stance']==str(stance)]
    length = int(test_sizes[stance]/2)
    stance_data_index=stance_data.head(length).index.tolist()
    original_test_data=original_test_data.drop(stance_data_index)
print('The Test Data #2 :')
print(original_test_data.groupby('Stance').count())
original_test_data.to_csv('my_test_stances.csv',index=False)


The Test Data #2 :
           Headline  Body ID
Stance                      
agree           952      952
disagree        349      349
discuss        2232     2232
unrelated      9175     9175


In [267]:
test_data_two=readFiles('test_bodies.csv','my_test_stances.csv')
test_corpus_two=getCorpus(test_data_two['consolidated_data'])
processed_test_two = preprocess(test_data_two['consolidated_data'],test_corpus_two,countVectorer,train=False)
encodedLabels_test_two = encodeLabels(test_data_two['consolidated_data']['Stance'])
test_two_X = processed_test_two['X']
test_two_Y = encodedLabels_test_two['y_encoded']
test_two_Y_hot = encodedLabels_test_two['y_hot_encoded']

Sample class input: [0,1,2,3]
encoded as ['agree' 'disagree' 'discuss' 'unrelated']


In [268]:
#experiment = Experiment(api_key=os.environ["COMET_API_KEY"], project_name="general", workspace="ptambvekar")

loss_and_metrics_test_two = model.evaluate(test_two_X, test_two_Y_hot)
pred_Y_test_two = model.predict(test_two_X)

cm_test_two=confusion_matrix(test_two_Y_hot.argmax(axis=1),pred_Y_test_two.argmax(axis=1))
print('the confusion matrix:\n',cm_test_two)
print('model metrics are:\n',model.metrics_names)
print('\n',loss_and_metrics_test_two)

test_two_class_accu = cm_test_two.astype('float') / cm_test_two.sum(axis=1)[:, np.newaxis]
print('\nThe class wise accuracy is (agree,disagree,discuss,unrelated):',test_two_class_accu.diagonal())

the confusion matrix:
 [[ 498  147  233   74]
 [ 118   92   92   47]
 [ 602  252 1141  237]
 [ 374  266  432 8103]]
model metrics are:
 ['loss', 'acc']

 [2.2398043678596578, 0.7738432483474976]

The class wise accuracy is (agree,disagree,discuss,unrelated): [0.52310924 0.26361032 0.51120072 0.88316076]
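
The class-wise figures above are each diagonal entry of the confusion matrix divided by its row sum, that is, the recall of each class. The sketch below recomputes them from the matrix printed above without NumPy. `per_class_recall` is a hypothetical helper name, and the macro average is an extra summary I add here, not a number reported by the notebook.

```python
def per_class_recall(cm):
    """Diagonal entry divided by row sum for each row of a confusion matrix."""
    return [row[i] / sum(row) if sum(row) else 0.0 for i, row in enumerate(cm)]


# Confusion matrix of the second test set, copied from the output above.
# Rows are true labels, columns are predictions (agree, disagree, discuss, unrelated).
cm = [
    [498, 147, 233, 74],
    [118, 92, 92, 47],
    [602, 252, 1141, 237],
    [374, 266, 432, 8103],
]

recall = per_class_recall(cm)
overall = sum(cm[i][i] for i in range(len(cm))) / sum(sum(row) for row in cm)
macro = sum(recall) / len(recall)

print("per-class recall:", [round(r, 4) for r in recall])
print("overall accuracy:", round(overall, 4))
print("macro-averaged recall:", round(macro, 4))
```

The overall accuracy is dominated by the large 'unrelated' class, so the macro average gives a fairer picture of how evenly the model performs across the four stances.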
