Your final assignment is based on the Fake News Challenge, stage 1 (FNC-1, 2017).
Each example in the dataset consists of a headline and an article body. You need to build a classifier that predicts the stance of the body relative to the headline: agree, disagree, discuss (the body covers the headline's topic without taking a position), or unrelated.

Here are two baselines you can refer to when you start building your model:


1.   https://arxiv.org/pdf/1707.03264.pdf
2.   https://github.com/FakeNewsChallenge/fnc-1-baseline



In [1]:
import pandas as pd
import numpy as np
import re
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.stem import PorterStemmer 
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('stopwords')
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer,TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity



[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


Implement BoW and TF-IDF as a baseline ***in pure Python*** to convert the headline and body into vectors.
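
As a starting point, here is a minimal sketch of what a pure-Python BoW and TF-IDF could look like. All function names (`build_vocab`, `bow_vector`, `idf_weights`, `tfidf_vector`) are hypothetical, the input is assumed to be already tokenized, and the idf used is the plain `log(N / df)` form, which differs from the smoothed variant in scikit-learn.

```python
import math
from collections import Counter

def build_vocab(tokenized_docs):
    """Map each distinct token to a column index, in sorted order."""
    tokens = sorted({tok for doc in tokenized_docs for tok in doc})
    return {tok: i for i, tok in enumerate(tokens)}

def bow_vector(tokens, vocab):
    """Raw term counts for one document; out-of-vocabulary tokens are ignored."""
    vec = [0] * len(vocab)
    for tok, count in Counter(tokens).items():
        if tok in vocab:
            vec[vocab[tok]] = count
    return vec

def idf_weights(tokenized_docs, vocab):
    """Plain idf = log(N / df). Assumes vocab was built from the same documents,
    so every token has a document frequency of at least 1."""
    n_docs = len(tokenized_docs)
    df = Counter(tok for doc in tokenized_docs for tok in set(doc))
    return [math.log(n_docs / df[tok]) for tok in vocab]

def tfidf_vector(tokens, vocab, idf):
    """Term frequency (count / document length) times idf."""
    counts = bow_vector(tokens, vocab)
    total = sum(counts) or 1  # guard against empty documents
    return [(c / total) * w for c, w in zip(counts, idf)]

# Toy corpus, purely illustrative
docs = [["fake", "news", "news"], ["real", "news"]]
vocab = build_vocab(docs)
idf = idf_weights(docs, vocab)
headline_bow = bow_vector(docs[0], vocab)
headline_tfidf = tfidf_vector(docs[0], vocab, idf)
```

Note that "news" appears in every toy document, so its idf (and hence its TF-IDF weight) is zero under this unsmoothed definition.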

In [2]:
def stem_word(vocab):
    Mystemmer = PorterStemmer() 
    vocab = [Mystemmer.stem(word) for word in vocab]
    return vocab

def lemmatize_word(vocab):

    lemmatizer = WordNetLemmatizer()
    vocab = [lemmatizer.lemmatize(word) for word in vocab]
    return vocab

def remove_stopwords(text):

    stop_words = set(stopwords.words('english'))  # set gives fast membership tests
    word_tokens = nltk.word_tokenize(text)

    filtered_text = [word for word in word_tokens if word not in stop_words]

    return filtered_text

def my_tokenize(text):

    text = text.lower() # Converting to lower case
    text = re.sub(r'[^\w]', ' ', text) # Replace special characters with spaces (raw string avoids an invalid-escape warning)

    vocab = remove_stopwords(text)  # Removal of StopWords
    vocab = stem_word(vocab)        # Stemming
    vocab = lemmatize_word(vocab)   # Lemmatize
    vocab = set(vocab)              # Unique tokens only, so a repeated word is counted once per document

    return vocab


def bow(dataframe,column_name,vectorizer):
    bag = vectorizer.transform(dataframe[column_name]).toarray()
    return bag

def tfidf(dataframe,column_name,vectorizer):
    tf_idf = vectorizer.transform(dataframe[column_name]).toarray() 
    return tf_idf

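def similarity(X,Y):
    # Row-wise cosine similarity between matching rows of X and Y
    n_rows = X.shape[0]  # renamed to avoid shadowing the built-in len

    Z = np.zeros((n_rows,1))

    for i in range(n_rows):
        denom = np.linalg.norm(X[i]) * np.linalg.norm(Y[i])
        if denom > 0:  # an all-zero vector would otherwise produce NaN
            Z[i] = np.dot(X[i],Y[i]) / denom
    return Z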

In [3]:
class feature_engineering():
  """
  If you want you can perform some feature engineering. Refer to the second link 
  above to find ways to do this. You can implement different feature_engineering
  functions in this class.
  An example is given below.
  """
  def __init__(self,dataset):
    pass
  def binary_co_occurence(self,headline, body):
    # Count how many times a token in the title
    # appears in the body text.
    pass
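
One possible way to fill in the co-occurrence feature is sketched below. This is an illustration rather than the baseline's implementation: the helper `simple_tokenize` and the standalone function `binary_co_occurrence` are hypothetical names, tokenization is a plain regex split, and every headline token (repeats included) is checked for presence in the body.

```python
import re

def simple_tokenize(text):
    """Lower-case the text and split it into word tokens (hypothetical helper)."""
    return re.findall(r"\w+", text.lower())

def binary_co_occurrence(headline, body):
    """Count the headline tokens that also occur somewhere in the body."""
    body_tokens = set(simple_tokenize(body))
    return sum(1 for tok in simple_tokenize(headline) if tok in body_tokens)

# Illustrative strings, not taken from the dataset
related = binary_co_occurrence(
    "Apple unveils new Apple Watch",
    "The watch was unveiled by Apple on Monday.",
)
unrelated = binary_co_occurrence(
    "Apple unveils new Apple Watch",
    "Heavy rain flooded several streets.",
)
```

A count like this can be appended as one extra column to the feature matrix, alongside the cosine similarity.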
  

In [19]:
from sklearn.utils import shuffle

db = pd.read_csv('train_bodies.csv')
ds = pd.read_csv('train_stances.csv')
ds = ds.merge(db,on = 'Body ID')
ds.drop(['Body ID'], axis = 1,inplace= True)

ds = shuffle(ds)
print(ds)

stance_mapper = {'unrelated' : 0, 'discuss' : 1 , 'agree' : 2 , 'disagree' : 3}


                                                Headline  ...                                        articleBody
12     Isis claims US hostage Kayla Mueller killed in...  ...  Danny Boyle is directing the untitled film\r\n...
7454      Christian Bale to Play Steve Jobs in New Movie  ...  New Delhi: AK Verma, an executive engineer at ...
35734          For sale: Tiger's former island in Sweden  ...  First famine and war, and now this? Is nothing...
25299  Islamic State accused of using chlorine gas on...  ...  Apple is hosting its ‘Spring Forward’ event to...
18893  Lego letter from the 1970s still offers a powe...  ...  Reporting in the Telegraph states that US dron...
...                                                  ...  ...                                                ...
43518  Pentagon: ISIS seized materials airdropped to ...  ...  The gunman in a fatal shooting that rocked Ott...
17692  Durex on Rumored ‘Pumpkin Spice’ Condom: No Co...  ...  Google Inc. plans to buy about ha

In [None]:
# Combining Text to generate common Vocabulary
text = []
ds = ds.head(2000)
for sentence in ds['Headline']:
    text.append(sentence)
for sentence in ds['articleBody']:
    text.append(sentence)

bag_vectorizer = CountVectorizer(tokenizer=my_tokenize, max_features= 5000)
bag_vectorizer = bag_vectorizer.fit(text)

tfidf_vectorizer = TfidfVectorizer(tokenizer=my_tokenize, max_features= 5000)
tfidf_vectorizer = tfidf_vectorizer.fit(text)



In [21]:
bag_headline = bow(ds,'Headline',bag_vectorizer)
bag_body = bow(ds,'articleBody',bag_vectorizer)

tfidf_headline = tfidf(ds,'Headline',tfidf_vectorizer)
tfidf_body = tfidf(ds,'articleBody',tfidf_vectorizer)

# Warning: this cell takes a long time to execute

print(bag_headline.shape)
print(bag_body.shape)

print(tfidf_headline.shape)
print(tfidf_body.shape)

(2000, 5000)
(2000, 5000)
(2000, 5000)
(2000, 5000)


In [27]:
# Computation of Cosine Similarity for tfidf vectors
tfidf_similar = similarity(tfidf_body,tfidf_headline)
tfidf_similar.shape



(2000, 1)
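
The per-row cosine similarity can also be computed without a Python loop. The sketch below is one way to do it with NumPy; `rowwise_cosine` and its `eps` parameter are hypothetical names, and rows with a zero norm are assumed to map to a similarity of 0.

```python
import numpy as np

def rowwise_cosine(X, Y, eps=1e-12):
    """Cosine similarity between X[i] and Y[i] for every row i."""
    dots = np.einsum("ij,ij->i", X, Y)  # row-wise dot products
    norms = np.linalg.norm(X, axis=1) * np.linalg.norm(Y, axis=1)
    # Clamp the denominator so all-zero rows give 0 instead of NaN
    return (dots / np.maximum(norms, eps)).reshape(-1, 1)

# Small illustrative arrays
X = np.array([[1.0, 0.0], [1.0, 1.0], [0.0, 0.0]])
Y = np.array([[2.0, 0.0], [1.0, -1.0], [3.0, 4.0]])
sims = rowwise_cosine(X, Y)
```

The result has shape `(n_rows, 1)`, so it can be concatenated with the other feature blocks in the same way as `tfidf_similar`.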

In [28]:
# TRAIN AND TEST DATA SPLIT

from sklearn.model_selection import train_test_split

x = np.concatenate((bag_headline,tfidf_similar,bag_body),axis = 1)
n_ex = x.shape[0] # no. of examples
y = np.zeros((n_ex,4))

# Mapping every output to a one-hot encoded vector
for i,stance in enumerate(ds['Stance']):
    y[i][stance_mapper[stance]] = 1


print(y.shape)
xTrain, xTest, yTrain, yTest = train_test_split(x, y, test_size = 0.3, random_state = 42)


(2000, 4)


### Model building.

1.   You can train ML models such as gradient boosting on whatever features you engineer, or feed the BoW/TF-IDF vectors into a standard feed-forward neural network (FNN). Again, refer to the links above for ideas.
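
For the gradient boosting route, a minimal sketch using scikit-learn's `GradientBoostingClassifier` might look like the following. The feature matrix here is synthetic random data standing in for hand-crafted features, the variable names are hypothetical, and the hyperparameters are arbitrary rather than tuned.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in for engineered features (e.g. overlap counts, cosine similarity)
features = rng.normal(size=(200, 3))
# Integer class ids; scikit-learn classifiers expect these rather than one-hot rows
labels = (features[:, 0] + features[:, 1] > 0).astype(int)

f_train, f_test, l_train, l_test = train_test_split(
    features, labels, test_size=0.3, random_state=42
)

gb = GradientBoostingClassifier(n_estimators=50, random_state=0)
gb.fit(f_train, l_train)
gb_predictions = gb.predict(f_test)
gb_accuracy = gb.score(f_test, l_test)
```

If the labels are stored one-hot encoded, they would need to be converted back to class ids first, for example with `argmax(axis=1)`.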




In [36]:

import keras
from keras.models import Sequential
from keras.layers import Dense,Dropout
from keras import regularizers

n_cols = xTrain.shape[1]

model = Sequential()
model.add(Dense(100, activation='relu', input_shape=(n_cols,),kernel_regularizer = regularizers.l1(1e-4)))
model.add(Dropout(0.5) )
model.add(Dense(4, activation='relu',kernel_regularizer = regularizers.l1(1e-4)))
model.add(Dropout(0.5))
model.add(Dense(4, activation='softmax'))

my_optimizer = keras.optimizers.Adam(learning_rate = 0.001,decay = 1e-4)  # `lr` is the deprecated name for `learning_rate`
model.compile(optimizer = my_optimizer, loss='categorical_crossentropy',metrics=['accuracy'])
model.summary()

Model: "sequential_7"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_19 (Dense)             (None, 100)               1000200   
_________________________________________________________________
dropout_13 (Dropout)         (None, 100)               0         
_________________________________________________________________
dense_20 (Dense)             (None, 4)                 404       
_________________________________________________________________
dropout_14 (Dropout)         (None, 4)                 0         
_________________________________________________________________
dense_21 (Dense)             (None, 4)                 20        
Total params: 1,000,624
Trainable params: 1,000,624
Non-trainable params: 0
_________________________________________________________________


In [37]:
num_epochs = 100
history = model.fit(xTrain,yTrain,batch_size = 200, epochs = num_epochs, verbose = 2, validation_split=0.2)

Train on 1120 samples, validate on 280 samples
Epoch 1/100
 - 0s - loss: 2.4084 - accuracy: 0.5964 - val_loss: 1.9924 - val_accuracy: 0.7536
Epoch 2/100
 - 0s - loss: 1.9804 - accuracy: 0.7232 - val_loss: 1.6827 - val_accuracy: 0.7536
Epoch 3/100
 - 0s - loss: 1.7235 - accuracy: 0.7330 - val_loss: 1.5038 - val_accuracy: 0.7536
Epoch 4/100
 - 0s - loss: 1.5730 - accuracy: 0.7312 - val_loss: 1.4277 - val_accuracy: 0.7536
Epoch 5/100
 - 0s - loss: 1.5104 - accuracy: 0.7339 - val_loss: 1.3722 - val_accuracy: 0.7536
Epoch 6/100
 - 0s - loss: 1.4548 - accuracy: 0.7330 - val_loss: 1.2918 - val_accuracy: 0.7536
Epoch 7/100
 - 0s - loss: 1.3537 - accuracy: 0.7321 - val_loss: 1.2483 - val_accuracy: 0.7536
Epoch 8/100
 - 0s - loss: 1.3255 - accuracy: 0.7339 - val_loss: 1.2112 - val_accuracy: 0.7536
Epoch 9/100
 - 0s - loss: 1.2804 - accuracy: 0.7321 - val_loss: 1.1809 - val_accuracy: 0.7536
Epoch 10/100
 - 0s - loss: 1.2305 - accuracy: 0.7339 - val_loss: 1.1374 - val_accuracy: 0.7536
Epoch 11/100

In [38]:
score,acc = model.evaluate(xTest,yTest,batch_size = 64,verbose = 2)
print(" TEST ACCURACY = " , acc)

 TEST ACCURACY =  0.7633333206176758
