# Detecting Fake Tweets

This is a very simple ensemble model for detecting fake tweets in the Kaggle competition https://www.kaggle.com/c/nlp-getting-started. Accuracy is about 70-72%, depending on a few hyperparameters. We start with a Random Forest classifier and a Logistic Regression classifier, combine them with the Voting ensemble method, and finally wrap the result in a Bagging ensemble to reduce variance. This is mostly boilerplate. To improve on this model, you could use neural network models, or word2vec vectors that capture word meaning better than the TF-IDF/bag-of-words approach. The competition is interesting because fake news is a very relevant topic nowadays, and Twitter data is challenging for NLP: tweets are short, so context is hard to build or learn.

# Importing Packages

In [129]:
import warnings
from collections import defaultdict

import numpy as np
import pandas as pd
from nltk.corpus import stopwords
from sklearn.ensemble import BaggingClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

warnings.filterwarnings("ignore")
# Use a separate name so the nltk `stopwords` module is not shadowed.
stop_words = set(stopwords.words("english"))

In [117]:
import re
from string import punctuation

# Matches any single ASCII punctuation character.
regex = re.compile('[%s]' % re.escape(punctuation))


def clean_string(string):
    """Lowercase the text and strip ASCII punctuation."""
    string = string.lower()
    string = regex.sub("", string)
    return string
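
To make the effect of the cleaning step concrete, here is a small self-contained check. It repeats the definition so it runs on its own. The first sample is a row from the training data; the second is a made-up tweet for illustration.

```python
import re
from string import punctuation

# Same pattern as above: any single ASCII punctuation character.
regex = re.compile('[%s]' % re.escape(punctuation))


def clean_string(string):
    """Lowercase the text and strip ASCII punctuation."""
    return regex.sub("", string.lower())


print(clean_string("Forest fire near La Ronge Sask. Canada"))
# forest fire near la ronge sask canada

# Made-up example: the hashtag marker and the thousands separator are removed.
print(clean_string("#wildfires: 13,000 evacuated!"))
# wildfires 13000 evacuated
```

Note that stripping punctuation merges hashtags with ordinary words (`#wildfires` becomes `wildfires`) and joins digit groups (`13,000` becomes `13000`), which may or may not be desirable for this task.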

In [118]:
df = pd.read_csv("train.csv", encoding='utf-8')

df.head()

|   | id | keyword | location | text | target |
|---|----|---------|----------|------|--------|
| 0 | 1 |  |  | Our Deeds are the Reason of this #earthquake M... | 1 |
| 1 | 4 |  |  | Forest fire near La Ronge Sask. Canada | 1 |
| 2 | 5 |  |  | All residents asked to 'shelter in place' are ... | 1 |
| 3 | 6 |  |  | 13,000 people receive #wildfires evacuation or... | 1 |
| 4 | 7 |  |  | Just got sent this photo from Ruby #Alaska as ... | 1 |


In [158]:
def get_sentence_vectors(df):
    """
    Convert the tweets into TF-IDF vectors over unigrams and bigrams.

    Note: the "text" column of df is cleaned in place.

    :param df: DataFrame
        with "text" and "target" columns
    :return: tuple (X, Y)
        dense feature array and int64 label array
    """
    df["text"] = df["text"].apply(clean_string)
    all_data = df[["text", "target"]].values
    sentences, Y = all_data[:, 0], np.array(all_data[:, 1], dtype='int64')
    # min_df=0.005 drops terms that occur in fewer than 0.5% of the tweets; tune as needed.
    vectorizer = TfidfVectorizer(min_df=0.005, ngram_range=(1, 2))
    X_fitted = vectorizer.fit_transform(sentences)
    X = X_fitted.toarray()

    return X, Y
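
The two vectorizer settings are easier to see on a toy corpus. The sketch below uses three made-up sentences and an integer `min_df=2` (an absolute document count) instead of the proportion `0.005` used above, since a float `min_df` is a fraction of documents and would filter nothing on such a small corpus.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up sentences, already lowercased and free of punctuation.
docs = [
    "forest fire near town",
    "forest fire spreading fast",
    "lovely sunny day",
]

# Keep unigrams and bigrams that appear in at least 2 documents.
vectorizer = TfidfVectorizer(min_df=2, ngram_range=(1, 2))
X = vectorizer.fit_transform(docs).toarray()

print(sorted(vectorizer.vocabulary_))  # ['fire', 'forest', 'forest fire']
print(X.shape)                         # (3, 3)
print(X[2])                            # all zeros: no surviving term occurs in the third sentence
```

A tweet whose terms are all filtered out ends up as an all-zero vector, so an aggressive `min_df` can leave the classifier with nothing to work with for rare or unusual tweets.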



# Get sentence vectors

In [159]:
X_train, Y_train = get_sentence_vectors(df)


# Build Model

In [160]:
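# Soft voting averages the predicted class probabilities of the two base models.
model = VotingClassifier([("lr", LogisticRegression()), ("rf", RandomForestClassifier())], voting="soft")
# Bagging refits the voting ensemble on bootstrap samples to reduce variance.
model = BaggingClassifier(model)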
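
Soft voting can be illustrated with plain arithmetic. The sketch below is not scikit-learn's implementation, only the unweighted idea behind `voting="soft"`: average each class's probability across the models and predict the class with the highest average. The helper `soft_vote` and the probabilities are hypothetical.

```python
def soft_vote(*model_probas):
    """Average per-class probabilities across models and pick the top class."""
    n_models = len(model_probas)
    predictions = []
    for rows in zip(*model_probas):  # one probability row per model, per sample
        averaged = [sum(col) / n_models for col in zip(*rows)]
        predictions.append(averaged.index(max(averaged)))
    return predictions


# Hypothetical probabilities [P(class 0), P(class 1)] for two tweets.
logreg_proba = [[0.40, 0.60], [0.20, 0.80]]
forest_proba = [[0.70, 0.30], [0.45, 0.55]]

print(soft_vote(logreg_proba, forest_proba))  # [0, 1]
```

For the first tweet the two models disagree (one favors class 1, the other class 0), so a majority vote between them would be a tie. Soft voting resolves it in favor of the more confident model, because the averaged probabilities are roughly 0.55 versus 0.45.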

## Evaluate the model with 5-fold cross-validation


In [161]:
scores = cross_val_score(model, X_train, Y_train, cv=5)
scores.mean()

0.7082548954199692
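
One caveat about this score: `get_sentence_vectors` fits the TF-IDF vocabulary on all rows before cross-validation, so each validation fold has already influenced the features. A possible way around this, sketched here under the assumption that scikit-learn's `make_pipeline` is acceptable for this workflow, is to put the vectorizer inside a pipeline so it is refit on the training part of every fold. The toy texts, labels, and reduced estimator counts are made up to keep the example small and fast; they are not the competition data or the settings used above.

```python
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier, VotingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Made-up stand-ins for the cleaned tweets and their labels.
disaster = ["forest fire near the town", "earthquake damages homes",
            "flood evacuation ordered", "wildfire spreads quickly"]
other = ["lovely sunny day", "great pizza tonight",
         "my new phone arrived", "watching a funny movie"]
texts = (disaster + other) * 5
labels = ([1] * len(disaster) + [0] * len(other)) * 5

voter = VotingClassifier(
    [("lr", LogisticRegression()),
     ("rf", RandomForestClassifier(n_estimators=10, random_state=0))],
    voting="soft",
)

# The vectorizer is now part of the estimator, so cross_val_score
# fits it on the training folds only and transforms the held-out fold.
pipeline = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    BaggingClassifier(voter, n_estimators=3, random_state=0),
)

scores = cross_val_score(pipeline, texts, labels, cv=5)
print(scores.mean())
```

On the real data the pipeline would take the cleaned `df["text"]` column and `df["target"]` directly, and the resulting score may differ slightly from the one reported above.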