# I. Introduction

## 1. Domain-specific area (rewrite)
In this work, we present a text classifier for detecting fake news. Fake news, a form of misinformation, refers to false or misleading information presented as if it were real news. It has become a major issue in recent years, with the proliferation of social media platforms and the ease with which false information can be disseminated. The negative impact of fake news cannot be overstated, as it can harm both individuals and society as a whole.

To address this problem, we have developed a machine learning-based text classifier that can accurately identify fake news articles. The classifier is trained on a large dataset of real and fake news articles, and uses various features of the text, such as word counts and sentiment, to make predictions.

We evaluate the performance of the classifier using several metrics, and show that it is able to achieve high accuracy in detecting fake news. We also discuss the potential applications of the classifier, including its use by news organizations to fact-check articles, and by social media platforms to combat the spread of fake news.

## 2. Objectives (add references)
This project aims to find a suitable way to classify news articles as fake or not fake. To adapt to the social media era and limit the spread of misinformation, we need better ways to validate what is true and what is not. Historically, fake news has contributed to problems such as:
1. Damaging people's reputations by spreading misinformation. [reference]
2. Promoting false propaganda in order to influence elections and/or election results. [reference]
3. Reinforcing confirmation bias and manipulating people's perception of reality. [reference]
4. Fueling conflict where polarization is rising in society. [reference]

Fake news also spread widely during the COVID-19 pandemic and, according to studies [reference], was one of the causes of vaccine hesitancy, which has led to unnecessary deaths all over the world.

This work proposes an automated way to fact-check news in order to tackle the problems above, among others.

## 3. Dataset

### 3.1. Description
In this work we explore a dataset consisting of two CSV files containing news articles labeled as fake or real, and we use it to train our machine learning model so that it can classify other news articles. The articles are in English and the dataset has the following features:
1. title: The title of the news article.
2. text: The article itself.
3. subject: Examples of a subject could be: politics, middle-east and news.
4. date: The date that the article was published.
### 3.2. Dataset size
The first CSV file called 'True.csv' holding the articles categorized as not fake news consists of 21417 articles. The second one called 'Fake.csv' consists of 23481 articles.
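The two file sizes also fix the class balance, which is worth knowing before reading any accuracy figure. As a minimal sketch (the variable names are illustrative; only the two counts come from the dataset), the share of each class and the accuracy of a trivial classifier that always answers with the larger class can be computed as:

```python
# Article counts reported above for the two CSV files
TRUE_ARTICLES = 21417
FAKE_ARTICLES = 23481

total_articles = TRUE_ARTICLES + FAKE_ARTICLES
fake_share = FAKE_ARTICLES / total_articles
true_share = TRUE_ARTICLES / total_articles

# A classifier that always predicts the larger class would be right
# exactly as often as that class occurs in the data.
majority_class_accuracy = max(fake_share, true_share)

print(f"Total articles: {total_articles}")
print(f"Fake share: {fake_share:.3f}, true share: {true_share:.3f}")
print(f"Majority-class accuracy: {majority_class_accuracy:.3f}")
```

Any trained classifier should be judged against this majority-class figure rather than against 50%.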
### 3.3. Data types
All columns are strings, except for the last column ('date'), which holds a date (pandas loads it as a string unless it is parsed explicitly).
### 3.4. Source
Source: 'Fake and real news - Classifying the news', taken from Kaggle. [link]

## 4. Evaluation methodology

To evaluate the model we use accuracy, since it summarizes performance in a single number, is quick to compute, and is straightforward to apply to the classification algorithm we use (logistic regression). We compute it with numpy from the predictions of the model.
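As a minimal sketch of this metric, accuracy is the fraction of predictions that match the true labels. The helper name `accuracy_from_labels` and the four example labels are hypothetical; the 'True'/'False' strings mirror the labels used in this notebook.

```python
import numpy as np

def accuracy_from_labels(y_true, y_pred):
    """Return the fraction of predictions that match the true labels."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    return float(np.mean(y_true == y_pred))

# Hypothetical labels: three of the four predictions are correct
example_true = ['True', 'False', 'False', 'True']
example_pred = ['True', 'False', 'True', 'True']

print(accuracy_from_labels(example_true, example_pred))
```

A single number like this hides which kind of mistake the model makes, so it is best read together with the class balance of the dataset.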

# II. Implementation

## 5. Preprocessing

### 5.1. Text representation
For text representation and lexical analysis we use a Word2Vec model from the gensim library. We chose it because it learns each word's vector from the words that appear around it within a context window, so the vectors capture which words tend to occur together. This is useful for the later analysis of bigrams (pairs of words that keep appearing together), which can be informative about the properties of fake news articles.
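To make concrete what a context-window model such as Word2Vec sees during training, the sketch below lists the (target, context) pairs that a window yields for one tokenized sentence. It is an illustration in plain Python, not gensim's implementation; the function name `context_pairs` and the example sentence are assumptions, and a window of 1 is used only to keep the output short.

```python
def context_pairs(tokens, window=2):
    """Return (target, context) pairs for every token within `window` positions."""
    pairs = []
    for position, target in enumerate(tokens):
        start = max(0, position - window)
        end = min(len(tokens), position + window + 1)
        for context_position in range(start, end):
            if context_position != position:
                pairs.append((target, tokens[context_position]))
    return pairs

# Hypothetical tokenized headline
example_tokens = ["senate", "passes", "tax", "bill"]
for pair in context_pairs(example_tokens, window=1):
    print(pair)
```

Words that share many context words across the corpus end up with similar vectors, which is what `most_similar` queries rely on.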
### 5.2 Pre-processing the data
For the preprocessing and text normalization step, we use the following techniques:
1. Tokenizing
Using nltk to split each sentence into tokens.
2. Removing stopwords
Using nltk's English stopword list to remove words that carry little meaning on their own (such as 'is' and 'are').
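The steps above can be sketched without any downloads as follows. This is a simplified stand-in, not the nltk pipeline itself: the regular-expression tokenizer, the helper name `tokenize_and_filter` and the five-word stopword set are assumptions made to keep the example self-contained, whereas the notebook uses `nltk.word_tokenize` and nltk's full English stopword list.

```python
import re

# Tiny illustrative stopword set (assumption); nltk's English list is much longer
EXAMPLE_STOPWORDS = {"is", "are", "the", "a", "of"}

def tokenize_and_filter(sentence):
    """Lowercase a sentence, split it into word tokens and drop stopwords."""
    tokens = re.findall(r"[a-z0-9']+", sentence.lower())
    return [token for token in tokens if token not in EXAMPLE_STOPWORDS]

print(tokenize_and_filter("The senator is a member of the committee"))
```

Lowercasing happens before the stopword check because the stopword list is lowercase, so 'The' would otherwise slip through.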

### 5.3 File type format
The raw data comes in two CSV files. We add a label column to each and merge them into one dataframe, from which the features for the classifier are extracted.

## Loading and inspecting the dataset: https://www.kaggle.com/code/arund8888/titanic-classification-models-score-73
TO DO:
1. Create dataframes from CSVs (DONE)
2. Add column label to both dataframes (DONE)
3. Merge both the dataframes into one (DONE)
4. Describe columns/find and fix missing data (is it worth it?)
5. Plot two graphics: number of articles per subject and number of fake news per month/year

In [1]:
# Using pandas to load the dataset

import pandas as pd

fake_df = pd.read_csv('Fake.csv')
true_df = pd.read_csv('True.csv')

In [17]:
# Adding a label to the true and fake dataframes so we can train the classifier next

fake_df['label'] = 'False'
true_df['label'] = 'True'

# Merging the two dataframes into one (we are going to need this in order to train the model)
data = pd.concat([fake_df, true_df])

# Shuffling the rows
data = data.sample(frac=1)

# Since there are too many rows (44898) in this dataframe and operations such as iterating through it are too costly, I am going to use a subset of it
# .copy() makes the subset an independent dataframe, so adding columns to it later does not trigger a SettingWithCopyWarning
data_copy = data.head(20000).copy()  # This is what we are going to be using from now on
data_copy

| | title | text | subject | date | label |
|---|---|---|---|---|---|
| 20650 | MALIA OBAMA TO ATTEND UNIVERSITY With 5.9% Acc... | While both of her parents travel around the co... | left-news | May 1, 2016 | False |
| 1200 | Trump Just Offered To Testify Under Oath And ... | After James Comey called him a liar five times... | News | June 9, 2017 | False |
| 10743 | WOW! SEAN SPICER Destroys BBC, New York Times ... | Here is the Gateway Pundit s real news accou... | politics | May 30, 2017 | False |
| 16427 | Catalan leader's address canceled: regional go... | MADRID (Reuters) - A statement by the leader o... | worldnews | October 26, 2017 | True |
| 2704 | Arkansas Republicans Pass Bill Giving Rapists... | A vicious new law in Arkansas proves that Repu... | News | February 3, 2017 | False |
| ... | ... | ... | ... | ... | ... |
| 192 | Factbox: Trump on Twitter (December 12) - Demo... | The following statements were posted to the ve... | politicsNews | December 13, 2017 | True |
| 3990 | Trump could target 'carried interest' tax loop... | WASHINGTON (Reuters) - The Trump administratio... | politicsNews | April 30, 2017 | True |
| 14271 | WATCH VETERAN Embarrass Trump Hater In Kansas ... | This veteran exposes Trump hating protester in... | politics | Mar 17, 2016 | False |
| 12496 | HYSTERICAL! TRUMP LIFE: “There ain’t a brother... | #SourcesHaveConfirmed that Trump will lock her... | politics | Nov 5, 2016 | False |


### Exploratory data analysis

In [3]:
# Printing the unique subjects in each dataset
print("Subjects: ", data_copy.subject.unique())

Subjects:  ['News' 'left-news' 'politicsNews' 'politics' 'worldnews'
 'Government News' 'US_News' 'Middle-east']


In [4]:
# Printing the different columns
print("Columns:", data_copy.columns)

Columns: Index(['title', 'text', 'subject', 'date', 'label'], dtype='object')


In [4]:
# Plotting the number of articles per subject, split into fake and true news
import matplotlib.pyplot as plt
import seaborn as sb

plt.figure(figsize=(10, 8))
# One group of bars per subject, colored by the 'label' column
sb.countplot(data=data_copy, x="subject", hue="label")
plt.title("Number of articles per subject")
plt.xlabel("Subject")
plt.ylabel("Number of articles")
# Rotate the subject names so they do not overlap
plt.xticks(rotation=45)
plt.show()

## Preprocessing the dataset

In [10]:
# DELETE
# Generating a list of sentences so that we can preprocess and analyze later
fake_articles = fake_df['title']
true_articles = true_df['title']

In [5]:
import nltk
from gensim.models import Word2Vec
from nltk.corpus import stopwords

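# Preprocess the sentences
def preprocess(sentences):
    # Tokenize the sentences (requires nltk's 'punkt' tokenizer data to be downloaded)
    sentences = [nltk.word_tokenize(sentence) for sentence in sentences]
    # Lowercase the words
    sentences = [[word.lower() for word in sentence] for sentence in sentences]
    # Removing stopwords from each tokenized sentence (requires nltk's 'stopwords' corpus);
    # a set makes the membership test fast
    stop_words = set(stopwords.words('english'))
    sentences = [[word for word in sentence if word not in stop_words] for sentence in sentences]

    return sentences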

In [None]:
# Here we are creating a new column in the dataframe called 'clean_text' and adding the results of the preprocess function

sentences = data_copy['text']
sentences = preprocess(sentences)

data_copy['clean_text'] = sentences
data_copy.head(10)

## Lexical Analysis

## Text representation: Word2Vec (Re-do: https://www.kaggle.com/code/hamishdickson/training-and-plotting-word2vec-with-bigrams)
TO DO:
1. Generate the model according to the link
2. Find a good way to plot the model

In [None]:
from gensim.models import Word2Vec

# Tokenize and clean the sentences of each class
# (assumption: the article titles are used, as in the cell that builds fake_articles/true_articles)
preprocessed_fake_sentences = preprocess(fake_df['title'])
preprocessed_true_sentences = preprocess(true_df['title'])

# Create a Word2Vec model for each class. Passing the corpus to the constructor
# builds the vocabulary and trains the model, so no separate train() call is needed.
fake_lexical_model = Word2Vec(preprocessed_fake_sentences, window=5, min_count=1, workers=4, epochs=10)

true_lexical_model = Word2Vec(preprocessed_true_sentences, window=5, min_count=1, workers=4, epochs=10)

In [None]:
# Inspecting the fake lexical model

# Vocabulary size of the model:
print("Model length: ", len(fake_lexical_model.wv.key_to_index))

# How many dimensions does each word vector have?
print("Model dimensions:", fake_lexical_model.wv.vector_size)

# Finding similar terms
fake_lexical_model.wv.most_similar('corruption', topn=20)

In [None]:
# Inspecting the true lexical model

# Vocabulary size of the model:
print("Model length: ", len(true_lexical_model.wv.key_to_index))

# How many dimensions does each word vector have?
print("Model dimensions:", true_lexical_model.wv.vector_size)

# Finding similar terms
true_lexical_model.wv.most_similar('rating', topn=20)

In [None]:
from gensim.models.phrases import Phrases, ENGLISH_CONNECTOR_WORDS

# Create training corpus: a small sample of tokenized fake-news sentences.
# Must be a sequence of token lists (e.g. an iterable or a generator).
training_sentences = preprocessed_fake_sentences[:10]

# Train a toy phrase model on our training corpus.
phrase_model = Phrases(training_sentences, min_count=1, threshold=1, connector_words=ENGLISH_CONNECTOR_WORDS)

# Apply the trained phrases model to new, unseen sentences. Indexing with a list
# of sentences returns a lazy iterable, so list() is used to display the result.
new_sentences = preprocessed_fake_sentences[11:20]
list(phrase_model[new_sentences])

# Token pairs that score above the threshold are joined into a single
# "phrase" token, using the default `_` delimiter.
# Apply the trained model to each sentence of a corpus, using the same [] syntax:
for sent in phrase_model[training_sentences]:
    pass

# Update the model with two new sentences on the fly.
# phrase_model.add_vocab([["hello", "world"], ["meow"]])

# Export the trained model = use less RAM, faster processing. Model updates no longer possible.
frozen_model = phrase_model.freeze()

# Apply the frozen model; same results as before:
list(frozen_model[new_sentences])


# Save / load models.
frozen_model.save("/tmp/my_phrase_model.pkl")
model_reloaded = Phrases.load("/tmp/my_phrase_model.pkl")
model_reloaded[preprocessed_fake_sentences[21]]  # apply the reloaded model to a sentence
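
As a rough picture of what a phrase model looks for, the sketch below counts adjacent token pairs in a tiny hand-made corpus. It is not gensim's scoring function, which also takes into account how often each word occurs on its own; the corpus and the helper name `count_bigrams` are hypothetical.

```python
from collections import Counter

def count_bigrams(sentences):
    """Count every pair of adjacent tokens across a list of tokenized sentences."""
    counts = Counter()
    for tokens in sentences:
        counts.update(zip(tokens, tokens[1:]))
    return counts

# Hypothetical tokenized headlines
toy_corpus = [
    ["white", "house", "denies", "report"],
    ["white", "house", "issues", "statement"],
    ["report", "cites", "white", "house"],
]

bigram_counts = count_bigrams(toy_corpus)
print(bigram_counts.most_common(1))
```

Pairs that occur far more often than the rest, such as ('white', 'house') here, are the candidates a phrase model would join into a single token.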

## 7. Classification approach

### 7.1 Features and Labels
The classifier uses one input feature and one target. The input is the 'text' column, which holds the article itself (this is the column the classifier code vectorizes), and the target is the 'label' column, which tells whether each article is true or fake.

### 7.2 Classifier
For the classifier we are using the logistic regression algorithm.
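
For intuition, logistic regression scores a document by summing a learned weight for each word count, adding a bias, and squashing the result through the sigmoid function into a probability. The sketch below does this by hand; the weights, the bias, the example words and the helper name `fake_probability` are made up for illustration and are not values learned from the dataset.

```python
import math
from collections import Counter

# Hypothetical weights: positive values push towards the 'fake' class
EXAMPLE_WEIGHTS = {"shocking": 1.5, "reuters": -2.0, "wow": 1.0}
EXAMPLE_BIAS = 0.0

def fake_probability(tokens):
    """Return the sigmoid of the weighted word counts plus the bias."""
    counts = Counter(tokens)
    score = EXAMPLE_BIAS + sum(
        EXAMPLE_WEIGHTS.get(word, 0.0) * count for word, count in counts.items()
    )
    return 1.0 / (1.0 + math.exp(-score))

print(round(fake_probability(["wow", "shocking", "video"]), 3))
print(round(fake_probability(["reuters", "statement"]), 3))
```

A probability above 0.5 maps to one class and below 0.5 to the other; training consists of finding the weights that make these probabilities match the labels as closely as possible.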

## Creating the model for the classifier
TO DO:
1. Make it work
2. Add column 'prediction' to dataframe and add prediction information for each sentence
3. Plot the number of fake news per subject

In [None]:
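# Fake news classifier
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Extract the raw article texts and their labels
texts = data['text'].values
labels = data['label'].values

# Split the data into a training set and a test set before vectorizing,
# so the vocabulary is learned from the training set only
texts_train, texts_test, y_train, y_test = train_test_split(texts, labels, test_size=0.33, random_state=42)

# Convert the texts into bag-of-words count vectors
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(texts_train)
X_test = vectorizer.transform(texts_test)

# Train the model (logistic regression, as described in section 7.2;
# max_iter is raised so the solver has room to converge on sparse count data)
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Test the model
y_pred = model.predict(X_test)

# Print the first test articles (truncated) next to their predicted label
for x in range(10):
    print(texts_test[x][:80] + ' | ' + y_pred[x])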

## 6. Baseline performance
Describe and justify the baseline against which you are going to compare the performance
of your chosen approach. This can be an already published baseline (e.g. cited in the
literature) or the results of a basic algorithm that you implement yourself. The baseline
should represent a meaningful benchmark for comparison.

## 8. Coding style
Your code is expected to meet certain standards as described by accepted coding
conventions. This includes code indentation, avoiding unnamed numerical constants and
undue use of string literals, assigning meaningful names to variables and subroutines, etc.
The code is expected to be fully commented, including variables, sub-routines and calls to
library methods.

# III. Conclusions

## 9. Evaluation

In [None]:
import numpy as np

# Evaluate the model
accuracy = np.mean(y_pred == y_test)
print(f'Accuracy: {accuracy:.2f}')

In [None]:
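# Using the model above on new data

# Pre-process the new data. transform() expects a list of documents, so the single article is wrapped in a list.
# (Assumption: article 10 of true_df stands in for a new article here.)
new_texts = [true_df['text'].iloc[10]]
X_new = vectorizer.transform(new_texts)

# Use the model to make predictions on the new data
y_new_pred = model.predict(X_new)

# Print the predictions
y_new_pred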

## 10. Summary and conclusions
Provide a reflective evaluation of the project in light of your results. Describe its
contributions to the problem area, and discuss the extent to which your solution is
transferable to other domain-specific areas. Discuss the extent to which your approach can
be replicated by others, e.g. using different programming languages, development
environments, libraries and algorithms. Review the potential benefits and drawbacks of
alternative approaches.