# Stance Detection for the Fake News Challenge

## Identifying Textual Relationships with Deep Neural Nets

### Check the problem context [here](https://drive.google.com/open?id=1KfWaZyQdGBw8AUTacJ2yY86Yxgw2Xwq0).

### Download files required for the project from [here](https://drive.google.com/open?id=10yf39ifEwVihw4xeJJR60oeFBY30Y5J8).

## Step 1: Load the given dataset

1. Mount Google Drive

2. Import GloVe embeddings

3. Import the test and train datasets

### Mount Google Drive to access the required project files

Run the commands below.

In [0]:
from google.colab import drive

In [6]:
drive.mount('/content/drive/')

Go to this URL in a browser: https://accounts.google.com/o/oauth2/auth?client_id=947318989803-6bn6qk8qdgf4n4g3pfee6491hc0brc4i.apps.googleusercontent.com&redirect_uri=urn%3aietf%3awg%3aoauth%3a2.0%3aoob&response_type=code&scope=email%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdocs.test%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive.photos.readonly%20https%3a%2f%2fwww.googleapis.com%2fauth%2fpeopleapi.readonly

Enter your authorization code:
··········
Mounted at /content/drive/


#### Path for project files on Google Drive

**Note:** Change this path to match where you have kept the files in Google Drive.

In [0]:
project_path = "/content/drive/My Drive/Project 12/Dataset/"

We will reuse the same dataset that we loaded for the imbalanced-dataset run.


In [8]:
import pandas as pd

%cd /content/drive/My Drive/Project\ 12/Dataset/
%ls
bodies = pd.read_csv("./train_bodies.csv")
stances = pd.read_csv("./train_stances.csv")

/content/drive/My Drive/Project 12/Dataset
glove.6B.zip  train_bodies.csv     train_stances.csv
glovefiles/   train_bodies.gsheet


In [9]:
bodies.head()

Unnamed: 0,Body ID,articleBody
0,0,A small meteorite crashed into a wooded area i...
1,4,Last week we hinted at what was to come as Ebo...
2,5,(NEWSER) – Wonder how long a Quarter Pounder w...
3,6,"Posting photos of a gun-toting child online, I..."
4,7,At least 25 suspected Boko Haram insurgents we...



<h2> Check1:</h2>
  
<h3> You should see the output below when you run `dataset.head()` as shown </h3>

In [10]:
dataset = pd.merge(bodies, stances, on='Body ID')
dataset.head()

Unnamed: 0,Body ID,articleBody,Headline,Stance
0,0,A small meteorite crashed into a wooded area i...,"Soldier shot, Parliament locked down after gun...",unrelated
1,0,A small meteorite crashed into a wooded area i...,Tourist dubbed ‘Spider Man’ after spider burro...,unrelated
2,0,A small meteorite crashed into a wooded area i...,Luke Somers 'killed in failed rescue attempt i...,unrelated
3,0,A small meteorite crashed into a wooded area i...,BREAKING: Soldier shot at War Memorial in Ottawa,unrelated
4,0,A small meteorite crashed into a wooded area i...,Giant 8ft 9in catfish weighing 19 stone caught...,unrelated
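
To see why each headline ends up next to its article text, here is a small sketch of the same `pd.merge` call on toy data. The two miniature frames and their contents are invented for illustration; only the column names follow the dataset.

```python
import pandas as pd

# Toy stand-ins for train_bodies.csv and train_stances.csv (contents are invented).
toy_bodies = pd.DataFrame({
    "Body ID": [0, 4],
    "articleBody": ["Body text A", "Body text B"],
})
toy_stances = pd.DataFrame({
    "Headline": ["Headline 1", "Headline 2", "Headline 3"],
    "Body ID": [0, 0, 4],
    "Stance": ["unrelated", "agree", "discuss"],
})

# Inner join on the shared key: one output row per (headline, body) pair.
toy_dataset = pd.merge(toy_bodies, toy_stances, on="Body ID")
print(toy_dataset)
```

A body that is referenced by several headlines is repeated once per headline, which is why the first five rows above all carry the same `articleBody`.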


In [11]:
dataset.Stance.value_counts()

unrelated    36545
discuss       8909
agree         3678
disagree       840
Name: Stance, dtype: int64

In [21]:
from sklearn.utils import resample

df_unrelated = dataset[dataset.Stance == 'unrelated']
df_discuss = dataset[dataset.Stance == 'discuss']
df_agree = dataset[dataset.Stance == 'agree']
df_disagree = dataset[dataset.Stance == 'disagree']
 
df_unrelated = resample(df_unrelated, 
                                 replace=True, 
                                 n_samples=10000,
                                 random_state=5)
 
df_discuss = resample(df_discuss, 
                                 replace=True, 
                                 n_samples=10000,
                                 random_state=5)

df_agree = resample(df_agree, 
                                 replace=True, 
                                 n_samples=10000,
                                 random_state=5)

df_disagree = resample(df_disagree, 
                                 replace=True, 
                                 n_samples=10000,
                                 random_state=5)

dataset = pd.concat([df_unrelated, df_discuss, df_agree, df_disagree])
# Display new class counts
dataset.Stance.value_counts()

dataset.head(10)

Unnamed: 0,Body ID,articleBody,Headline,Stance
49054,2498,"In case you missed it, Vogue Magazine, one of ...",Pumpkin Spice Condoms Could Be The Only Thing ...,unrelated
40454,2127,"KANSAS CITY, Mo. - Kansas City health official...",ISIS Reportedly Beheads American Photojournali...,unrelated
13322,800,"Well, here’s the creepiest thing you’ll read a...",Nigeria Boko Haram blamed for raids despite tr...,unrelated
18525,1094,"On Saturday, the entire Internet watched in ho...",Say 'eh-oh!' to the Teletubbies SUN BABY - can...,unrelated
10340,633,Knightscope co-founder Stacy Stephens said rum...,Media outlets identify 'Jihadi John',unrelated
4690,269,Islamic State militants have released a video ...,Did Kim Yo-Jong Take Kim Jong Un’s Role? North...,unrelated
8564,527,A hallucinogenic fungi has been found growing ...,IRAQI AND KURDISH MEDIA REPORTS: ISIS FIGHTERS...,unrelated
37591,2009,Twitter is abuzz with rumours that Cuba's form...,U.S. accidentally delivered weapons to the Isl...,unrelated
33151,1826,One passenger at Dulles International Airport ...,Tiger Woods prices private island at $7.1 million,unrelated
47307,2412,Reporting in the Telegraph states that US dron...,That powerful Lego letter to parents from the ...,unrelated
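
The cell above draws 10,000 rows per class with replacement, which shrinks `unrelated` and enlarges the three smaller classes. The sketch below shows the same idea on toy data. It uses pandas' own `DataFrame.sample` rather than `sklearn.utils.resample`, and the toy frame and the target of 5 rows per class are assumptions made for illustration.

```python
import pandas as pd

# Invented, deliberately imbalanced toy data.
toy = pd.DataFrame({
    "Headline": [f"h{i}" for i in range(10)],
    "Stance": ["unrelated"] * 7 + ["discuss"] * 2 + ["agree"],
})

N_PER_CLASS = 5  # hypothetical target size per class

# Sample every class to the same size, with replacement.
balanced = pd.concat([
    group.sample(n=N_PER_CLASS, replace=True, random_state=5)
    for _, group in toy.groupby("Stance")
])

print(balanced.Stance.value_counts())
```

Because sampling is done with replacement, rows from the smaller classes appear several times in the balanced set.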


## Step 2: Data pre-processing and setting the hyperparameters needed for the model


#### Run the code given below to set the required parameters.

1. `MAX_SENTS` = Maximum number of sentences to consider in an article.

2. `MAX_SENT_LENGTH` = Maximum number of words to consider in a sentence.

3. `MAX_NB_WORDS` = Maximum number of words in the total vocabulary.

4. `MAX_SENTS_HEADING` = Maximum number of sentences to consider in the heading of an article.

In [0]:
MAX_NB_WORDS = 20000
MAX_SENTS = 20
MAX_SENTS_HEADING = 1
MAX_SENT_LENGTH = 20


### Download the `punkt` package from NLTK using the commands given below. It is used for sentence tokenization.

For more information on how to use it, read [this](https://stackoverflow.com/questions/35275001/use-of-punktsentencetokenizer-in-nltk).



In [15]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

# Tokenizing the text and loading the pre-trained Glove word embeddings for each token  [5 marks] 

Keras provides [Tokenizer API](https://keras.io/preprocessing/text/) for preparing text. Read it before going any further.

#### Import the Tokenizer from keras preprocessing text

In [16]:
from tensorflow.keras.preprocessing.text import Tokenizer

#### Initialize the Tokenizer class with maximum vocabulary count as `MAX_NB_WORDS` initialized at the start of step2. 

In [0]:
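# Use the same filter characters as the text_to_word_sequence calls further down,
# so that every token looked up in t.word_index has an entry.
t = Tokenizer(num_words=MAX_NB_WORDS, filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n“')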

#### Now, using fit_on_texts() from the Tokenizer class, let's encode the data

Note: Fit on both `articleBody` and `Headline` so that the vocabulary covers all the words.

In [0]:
t.fit_on_texts(dataset['articleBody'])
t.fit_on_texts(dataset['Headline'])

#### After fit_on_texts(), the Tokenizer exposes the following attributes, as described [here](https://faroit.github.io/keras-docs/1.2.2/preprocessing/text/).

* **word_counts:** dictionary mapping words (str) to the number of times they appeared during fit. Only set after fit_on_texts was called.

* **word_docs:** dictionary mapping words (str) to the number of documents/texts they appeared in during fit. Only set after fit_on_texts was called.

* **word_index:** dictionary mapping words (str) to their rank/index (int). Only set after fit_on_texts was called.

* **document_count:** int. Number of documents (texts/sequences) the tokenizer was trained on. Only set after fit_on_texts or fit_on_sequences was called.
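
To make these four attributes concrete, here is a simplified pure-Python sketch that computes the same kind of statistics for two toy sentences. It is not the Keras implementation: it only lowercases and splits on whitespace, and the toy documents are invented.

```python
from collections import Counter

docs = ["the cat sat", "the dog sat on the mat"]

word_counts = Counter()  # total occurrences of each word
word_docs = Counter()    # number of documents containing each word
for doc in docs:
    words = doc.lower().split()
    word_counts.update(words)
    word_docs.update(set(words))

# The most frequent word gets index 1; index 0 stays free for padding.
ranked = sorted(word_counts.items(), key=lambda kv: kv[1], reverse=True)
word_index = {word: rank for rank, (word, _) in enumerate(ranked, start=1)}
document_count = len(docs)

print(word_index)
```

Here `the` occurs three times in total but in only two documents, so its entry differs between `word_counts` and `word_docs`.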



### Now, tokenize the sentences using nltk sent_tokenize() and encode the sentences with the ids we got from the above `t.word_index`

Initialise 2 lists with names `texts` and `articles`.

```
texts = [] to store text of article as it is.

articles = [] split the above text into a list of sentences.
```

In [0]:
from nltk.tokenize import sent_tokenize,word_tokenize

import numpy as np
texts = np.array(dataset.articleBody)

articles = []

for wholeArticle in texts:
  articles.append(sent_tokenize(wholeArticle))

## Check 2:

The first elements of `texts` and `articles` should be as given below.

In [38]:

texts[0]

"In case you missed it, Vogue Magazine, one of the most glamorous institutions in the country has been dealing with the least glamorous issue ever: a rat infestation.\n\nThe rodents have literally been living it up in Vogue’s new luxurious digs at 1 World Trade Center in New York City. Reportedly, the rats took up residence in Anna Wintour’s office and have moved into the magazine’s world famous accessories closet.\n\nGawker reported that the critters have made Vogue’s infamous editor-in-chief scared to enter her office without taking precautions first.\n\n“The infestation is so acute, one source said, that the fashion title’s editor-in-chief, Anna Wintour, recently issued a standing order: Staffers must ensure that her personal office is rat-free before she enters it…”\n\nA source told People that: “the girls that work there see the droppings everywhere. It’s nasty.”\n\nThe rats have also reportedly eaten holes into shoe boxes and left droppings on the floor of the accessories closet.

In [39]:
articles[0]

['In case you missed it, Vogue Magazine, one of the most glamorous institutions in the country has been dealing with the least glamorous issue ever: a rat infestation.',
 'The rodents have literally been living it up in Vogue’s new luxurious digs at 1 World Trade Center in New York City.',
 'Reportedly, the rats took up residence in Anna Wintour’s office and have moved into the magazine’s world famous accessories closet.',
 'Gawker reported that the critters have made Vogue’s infamous editor-in-chief scared to enter her office without taking precautions first.',
 '“The infestation is so acute, one source said, that the fashion title’s editor-in-chief, Anna Wintour, recently issued a standing order: Staffers must ensure that her personal office is rat-free before she enters it…”\n\nA source told People that: “the girls that work there see the droppings everywhere.',
 'It’s nasty.”\n\nThe rats have also reportedly eaten holes into shoe boxes and left droppings on the floor of the accesso

# Now iterate through each article and each sentence to encode the words into ids using t.word_index  [5 marks] 

Here, to get the words from a sentence you can use `text_to_word_sequence` from keras preprocessing text.

1. Import text_to_word_sequence

2. Initialize a variable named `data` of shape (number of articles, MAX_SENTS, MAX_SENT_LENGTH) with zeros first (you can use numpy [np.zeros](https://docs.scipy.org/doc/numpy/reference/generated/numpy.zeros.html) for this), and then update it while iterating through the words and sentences in each article.

In [0]:
from tensorflow.keras.preprocessing.text import text_to_word_sequence

data = np.zeros((len(articles),MAX_SENTS,MAX_SENT_LENGTH),dtype=int)

In [42]:
data.shape


(40000, 20, 20)

In [0]:
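for i, article in enumerate(articles):
  # keep at most MAX_SENTS sentences per article
  for j, sentence in enumerate(article[:MAX_SENTS]):
    wordArr = text_to_word_sequence(sentence, filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n“', lower=True, split=' ')
    # keep at most MAX_SENT_LENGTH words per sentence; unused positions stay 0 (padding)
    for k, word in enumerate(wordArr[:MAX_SENT_LENGTH]):
      data[i, j, k] = t.word_index[word]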
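
The truncation and zero-padding that this loop produces are easier to see on a tiny example. The sketch below uses a hand-written word index and smaller limits than the notebook; all names ending in `_TOY` or starting with `toy_` are hypothetical.

```python
import numpy as np

MAX_SENTS_TOY = 2        # hypothetical, smaller than MAX_SENTS
MAX_SENT_LENGTH_TOY = 4  # hypothetical, smaller than MAX_SENT_LENGTH

toy_word_index = {"the": 1, "sat": 2, "cat": 3, "dog": 4, "on": 5, "mat": 6}
toy_articles = [
    ["the cat sat", "the dog sat on the mat", "the cat"],  # 3 sentences
    ["the mat"],                                           # 1 sentence
]

toy_data = np.zeros((len(toy_articles), MAX_SENTS_TOY, MAX_SENT_LENGTH_TOY), dtype=int)
for i, article in enumerate(toy_articles):
    # extra sentences beyond the limit are dropped
    for j, sentence in enumerate(article[:MAX_SENTS_TOY]):
        # extra words beyond the limit are dropped
        for k, word in enumerate(sentence.split()[:MAX_SENT_LENGTH_TOY]):
            toy_data[i, j, k] = toy_word_index[word]

print(toy_data)
```

The third sentence of the first article and the last two words of its second sentence are cut off, while the short second article is padded with zeros.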



### Check 3:

Accessing the first element of `data` should give something like the output below.

In [44]:
data[0, :, :]

array([[    5,   326,    50,  2016,    13,  1085,   884,    41,     4,
            1,   196,  7055,  3702,     5,     1,   305,    21,    29,
         6008,    15],
       [    1,  3240,    17,  2165,    29,   680,    13,    42,     5,
         4767,    62,  7317,  5868,    22,   339,   144,  1882,   700,
            5,    62],
       [  226,     1,  1235,   248,    42,  3962,     5,  1180,  7145,
          390,     6,    17,  1108,    80,     1,  9443,   144,  1377,
         5479,  5321],
       [ 3407,    95,     7,     1, 13222,    17,   112,  4767,  3352,
         1147,     5,   489,  2509,     3,  2269,    67,   390,   556,
          572,  6554],
       [    1,  2377,     9,    69,  4509,    41,   251,    14,     7,
            1,  2327, 11783,  1147,     5,   489,  1180,  1298,   506,
          934,     2],
       [  156,  5484,    28,     1,  1235,    17,    53,   226,  3353,
         5201,    80,  8512,  1943,     6,   208,  3498,    10,     1,
         2350,     4],
       [  

# Repeat the same process for the `Headline` column as well. Use variables with names `texts_headings` and `article_headings` accordingly. [5 marks] 

In [0]:
texts_headings = dataset["Headline"]

article_headings = []

for text_heading in texts_headings:
  article_headings.append(sent_tokenize(text_heading))

data_headline = np.zeros((len(article_headings),MAX_SENTS_HEADING,MAX_SENT_LENGTH),dtype=int)

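for i, article_heading in enumerate(article_headings):
  # keep at most MAX_SENTS_HEADING sentences per headline
  for j, sentence in enumerate(article_heading[:MAX_SENTS_HEADING]):
    wordArr = text_to_word_sequence(sentence, filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n“', lower=True, split=' ')
    # keep at most MAX_SENT_LENGTH words; unused positions stay 0 (padding)
    for k, word in enumerate(wordArr[:MAX_SENT_LENGTH]):
      data_headline[i, j, k] = t.word_index[word]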


In [46]:
data_headline[0,:,:]

array([[ 904,  812, 2104,   76,   24,    1,  125,  453,    3, 1788,    1,
         144,   20, 3846,  505,    0,    0,    0,    0,    0]])

### Now that the features are ready, let's prepare the labels for the model.

### Convert labels into one-hot vectors

You can use [get_dummies](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.get_dummies.html) in pandas to create one-hot vectors.
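
As a quick illustration of what `get_dummies` returns for stance labels, here is a minimal sketch on an invented five-element series. The `dtype=int` argument is only there to print 0/1 instead of booleans.

```python
import pandas as pd

toy_stances = pd.Series(["unrelated", "agree", "discuss", "disagree", "agree"])

# One column per class, in sorted order; exactly one 1 per row.
toy_one_hot = pd.get_dummies(toy_stances, dtype=int)
print(toy_one_hot)
```

The column order is alphabetical, so column 0 corresponds to `agree` and column 3 to `unrelated`.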

### Check 4:

The shape of data and labels should match the revised data set.

### Shuffle the data

In [0]:
indices = np.arange(data.shape[0])
## shuffle the numbers
np.random.shuffle(indices)

In [0]:
## shuffle the data
data = data[indices]
data_headline = data_headline[indices]
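## build the one-hot labels, then shuffle them with the same indices as the data
## (assumption: the labels are meant to come from the resampled `dataset`,
## not from the raw `stances` frame, so that they line up with `data`)

targets = dataset['Stance']
one_hot = pd.get_dummies(targets)
labels = np.asarray(one_hot, dtype='float32')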

labels = labels[indices]
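
Indexing every array with the same permutation is what keeps features and labels aligned. A minimal sketch with invented arrays follows; it uses a seeded `np.random.default_rng` only to be repeatable, whereas the notebook uses `np.random.shuffle`.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded only to make the sketch repeatable

toy_features = np.arange(5) * 10  # [0, 10, 20, 30, 40]
toy_labels = np.arange(5)         # label i belongs to feature i * 10

perm = rng.permutation(len(toy_features))
shuffled_features = toy_features[perm]
shuffled_labels = toy_labels[perm]

# Every feature is still paired with its own label after shuffling.
print(np.array_equal(shuffled_features, shuffled_labels * 10))
```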

In [50]:
print('Shape of data tensor:', data.shape)
print('Shape of label tensor:', labels.shape)

Shape of data tensor: (40000, 20, 20)
Shape of label tensor: (40000, 4)


### Split into train and validation sets, using an 80:20 ratio.


Use the variable names as given below:

X_train, X_val - for body of articles.

X_heading_train, X_heading_val - for heading of articles.

y_train - for training labels.

y_val - for validation labels.



In [0]:
from sklearn.model_selection import train_test_split
X_train, X_val, X_heading_train, X_heading_val, y_train,y_val = train_test_split(data,data_headline, labels, test_size = 0.20, random_state=1)


### Check 5:

The shapes of X_train, X_val, y_train and y_val should match the revised dataset.

In [52]:
print(X_train.shape)
print(X_heading_train.shape)
print(y_train.shape)

print(X_val.shape)
print(X_heading_val.shape)
print(y_val.shape)

(32000, 20, 20)
(32000, 1, 20)
(32000, 4)
(8000, 20, 20)
(8000, 1, 20)
(8000, 4)


### Create the embedding matrix with the GloVe embeddings


Run the code below to create `embedding_matrix`, which holds the GloVe embedding of every vocabulary word that is present in the GloVe word list.

In [53]:
# load the whole embedding into memory
embeddings_index = dict()
# each line holds a word followed by the components of its vector
with open(project_path+'glovefiles/glove.6B.100d.txt', encoding='utf8') as f:
	for line in f:
		values = line.split()
		word = values[0]
		coefs = np.asarray(values[1:], dtype='float32')
		embeddings_index[word] = coefs
print('Loaded %s word vectors.' % len(embeddings_index))

vocab_size = len(t.word_index) + 1

# create a weight matrix for words in training docs
embedding_matrix = np.zeros((vocab_size, 100))


for word, i in t.word_index.items():
	embedding_vector = embeddings_index.get(word)
	if embedding_vector is not None:
		embedding_matrix[i] = embedding_vector

Loaded 400000 word vectors.
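
The lookup loop leaves an all-zero row for any vocabulary word that has no pre-trained vector, and row 0 stays zero because index 0 is reserved for padding. The sketch below shows this on a toy scale; the 3-dimensional vectors, the word index and the out-of-vocabulary word are all invented.

```python
import numpy as np

EMBED_DIM = 3  # hypothetical; the notebook uses 100-dimensional GloVe vectors

toy_embeddings_index = {
    "the": np.array([0.1, 0.2, 0.3], dtype="float32"),
    "cat": np.array([0.4, 0.5, 0.6], dtype="float32"),
}
toy_word_index = {"the": 1, "cat": 2, "zzyzx": 3}  # "zzyzx" has no pre-trained vector

toy_matrix = np.zeros((len(toy_word_index) + 1, EMBED_DIM))
for word, i in toy_word_index.items():
    vector = toy_embeddings_index.get(word)
    if vector is not None:
        toy_matrix[i] = vector

print(toy_matrix)
```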


# Try the sequential model approach and report the accuracy score. [10 marks]  

### Import layers from Keras to build the model

In [0]:
from tensorflow.keras.models import Sequential, Model
from tensorflow.keras.layers import Embedding, Dense, TimeDistributed, Activation,Bidirectional,Dropout, concatenate, LSTM
from tensorflow.keras.layers import Flatten, Permute, Input, Add
from tensorflow.keras.optimizers import SGD


### Model

In [55]:
sentence_input = Input(shape=(MAX_SENT_LENGTH,), dtype='int32')
embedded_sequences = Embedding(vocab_size, 100, input_length=(MAX_SENT_LENGTH,), weights=[embedding_matrix])(sentence_input)

bi_lstm = Bidirectional(LSTM(64, dropout=0.3, activation='tanh', recurrent_dropout=0.3, return_sequences=True))(embedded_sequences)
l_dense = Flatten()(TimeDistributed(Dense(100))(bi_lstm))
sentenceEncoder = Model(sentence_input, l_dense)

article_input = Input(shape=(MAX_SENTS,MAX_SENT_LENGTH,), dtype='int32')
article_encoder = TimeDistributed(sentenceEncoder)(article_input)
bi_lstm_article = Bidirectional(LSTM(64, dropout=0.3, activation='tanh', recurrent_dropout=0.3, return_sequences=True))(article_encoder)
article_dense_sent = Flatten()((TimeDistributed(Dense(100))(bi_lstm_article)))

heading_input = Input(shape=(MAX_SENTS_HEADING,MAX_SENT_LENGTH,), dtype='int32')
heading_encoder = TimeDistributed(sentenceEncoder)(heading_input)
bi_lstm_heading = LSTM(64, dropout=0.3, activation='tanh', recurrent_dropout=0.3, return_sequences=True)(heading_encoder)
heading_dense_sent = Flatten()((TimeDistributed(Dense(100))(bi_lstm_heading)))

article_output = concatenate([article_dense_sent, heading_dense_sent], name='concatenate_heading')

news_vector = Dense(100, activation='relu')(article_output)
preds = Dense(4, activation='softmax')(news_vector)
merged_model = Model([article_input, heading_input], [preds])



In [56]:
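# Rebuild the model with the softmax layer applied directly to the concatenated
# article and heading features. This replaces the `merged_model` defined in the
# previous cell, so the intermediate `news_vector` Dense layer is not part of
# the model that gets trained.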

mergedOut = Dense(4, activation='softmax')(article_output)
merged_model = Model([article_input,heading_input], mergedOut)
merged_model.summary()

Model: "model_2"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_2 (InputLayer)            [(None, 20, 20)]     0                                            
__________________________________________________________________________________________________
input_3 (InputLayer)            [(None, 1, 20)]      0                                            
__________________________________________________________________________________________________
time_distributed_1 (TimeDistrib (None, 20, 2000)     2786780     input_2[0][0]                    
__________________________________________________________________________________________________
time_distributed_3 (TimeDistrib (None, 1, 2000)      2786780     input_3[0][0]                    
____________________________________________________________________________________________
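
Both `TimeDistributed` rows report the same 2,786,780 parameters because they wrap the same shared `sentenceEncoder`. As a rough cross-check, the sketch below reproduces that number from the standard parameter formulas for an embedding, a bidirectional LSTM and a dense layer. The vocabulary size is never printed in the notebook; the value 26,894 is an assumption obtained by working backwards from the summary.

```python
EMBED_DIM = 100
LSTM_UNITS = 64
DENSE_UNITS = 100
MAX_SENT_LENGTH = 20
VOCAB_SIZE = 26894  # assumption: inferred from the summary, not printed in the notebook

embedding_params = VOCAB_SIZE * EMBED_DIM

# One LSTM direction: 4 gates, each with input weights, recurrent weights and a bias.
lstm_params = 4 * (LSTM_UNITS * (EMBED_DIM + LSTM_UNITS) + LSTM_UNITS)
bilstm_params = 2 * lstm_params

# Dense layer applied to the 128-dimensional BiLSTM output at every time step.
dense_params = (2 * LSTM_UNITS) * DENSE_UNITS + DENSE_UNITS

total = embedding_params + bilstm_params + dense_params
print(total)

# Width of the flattened sentence vector: time steps x dense features.
flattened_width = MAX_SENT_LENGTH * DENSE_UNITS
print(flattened_width)
```

The flattened width of 2000 matches the last dimension shown for both `TimeDistributed` outputs, and almost all of the parameters sit in the embedding matrix.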

### Compile and fit the model

In [61]:
from tensorflow.keras.callbacks import ModelCheckpoint
filepath = "saved-model-{epoch:02d}-{val_acc:.2f}.hdf5"
checkpoint = ModelCheckpoint(filepath, monitor='val_acc', verbose=1, save_best_only=False, mode='max')

merged_model.compile(loss='categorical_crossentropy', optimizer='Adam', metrics=['accuracy'])

merged_model.fit([X_train,X_heading_train], y_train, epochs=5, validation_data=([X_val,X_heading_val], y_val),batch_size = 128,callbacks=[checkpoint])  

Train on 32000 samples, validate on 8000 samples
Epoch 1/5
Epoch 00001: saving model to saved-model-01-0.77.hdf5
Epoch 2/5
Epoch 00002: saving model to saved-model-02-0.82.hdf5
Epoch 3/5
Epoch 00003: saving model to saved-model-03-0.84.hdf5
Epoch 4/5
Epoch 00004: saving model to saved-model-04-0.84.hdf5
Epoch 5/5
Epoch 00005: saving model to saved-model-05-0.86.hdf5


<tensorflow.python.keras.callbacks.History at 0x7f8e5045d9b0>
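
The checkpoint file names in the log are produced by filling the placeholders in `filepath` with the epoch number and the logged metric. A minimal sketch of that formatting is shown here; the metric value 0.7712 is hypothetical, since the log only shows it rounded.

```python
filepath = "saved-model-{epoch:02d}-{val_acc:.2f}.hdf5"

# Hypothetical metric value; the notebook output only shows it rounded to 0.77.
name = filepath.format(epoch=1, val_acc=0.7712)
print(name)
```

The placeholder name has to match a key that is actually logged during training, which is why `val_acc` appears both in `filepath` and in `monitor`.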

In [62]:
score = merged_model.evaluate([X_val,X_heading_val], y_val, verbose=0)
print("Accuracy: %.2f%%" % (score[1]*100))

Accuracy: 85.55%


## Conclusion:
We can observe that the accuracy has improved and the loss has reduced with balanced data: the model reaches 85.55% validation accuracy after 5 epochs. Running it for a few more epochs might show further improvement. Changing some of the parameters and introducing dropout layers might also improve the model.