# ICS 438 Project: Fake News
## by: Leilani Reich

### GitHub Repo: https://github.com/leilani-reich/ICS438-FinalProject-FakeNews

## Introduction

#### Problem Domain

The problem I'm tackling is classifying fake news. The domain of the problem includes news, natural language processing, and earth and nature.

#### Data Source and Description

Link to Dataset: https://www.kaggle.com/datasets/clmentbisaillon/fake-and-real-news-dataset?select=True.csv

Note: The dataset I'm using is titled "Fake and real news dataset" from user Clément Bisaillon on Kaggle.


Data Description:
There are two comma-separated values (CSV) data files: Fake.csv (62.79 MB) and True.csv (53.58 MB).

Each data file contains the following 4 attributes for each record:

- title: the title of the article

- text: the text within the article 

- subject: the subject of the article

- date: the date on which the article was posted, formatted as month day, year
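
Because the date attribute is stored as text, comparing dates chronologically needs the strings parsed first. Below is a minimal sketch of that step, assuming values look like "December 31, 2017". The sample strings and the `parse_news_date` helper are made up for illustration and are not part of the notebook or the dataset.

```python
from datetime import datetime

def parse_news_date(raw):
    """Parse a 'Month day, year' string; return None if it does not match."""
    try:
        # strip() guards against stray leading or trailing whitespace
        return datetime.strptime(raw.strip(), "%B %d, %Y").date()
    except ValueError:
        return None

# Made-up sample values (assumption: the real column uses this layout)
samples = ["December 31, 2017", "January 5, 2016 ", "not a date"]
parsed = [parse_news_date(s) for s in samples]
print(parsed)
```

Returning None for values that do not match keeps malformed rows visible instead of raising in the middle of a pass over the data.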


#### Problems to tackle

I want to learn more about fake news and the characteristics that set it apart from true news.
This includes the length of fake news articles, the most prominent words in fake vs. true news, and
the dates on which fake vs. true news is published. In the end, I want to use approximate nearest
neighbors to detect and classify fake news accurately.


## Install Libraries

In [None]:
!pip install pyspark
!python -m pip install -U gensim
%pip install -U sentence-transformers
!pip install faiss-cpu --no-cache
!pip install autofaiss
!python -m pip install "dask[complete]"

## Imports used

In [None]:
# Just in case, I put all imports used at the beginning of the notebook

# Standard library
import glob
import os
import re
from collections import Counter

# Third-party
import dask.array as da
import dask.dataframe as dd
import faiss
import matplotlib.pyplot as plt
import numpy as np
import pyspark.sql.functions as F
import seaborn as sn
from gensim.parsing.preprocessing import remove_stopwords, strip_punctuation, strip_short, strip_numeric
from pyspark.ml import Pipeline
from pyspark.ml.feature import HashingTF, IDF, Tokenizer, NGram, BucketedRandomProjectionLSH
from pyspark.sql.functions import explode, row_number, desc, col
from pyspark.sql.window import Window
from sentence_transformers import SentenceTransformer
from sklearn import metrics
from sklearn.metrics import confusion_matrix

In [None]:
# Create new Spark Context
from pyspark import SparkContext
sc = SparkContext()

In [None]:
# Create new Spark Session
from pyspark.sql import SparkSession
session = SparkSession(sc)

## Load in Data

In [None]:
# Some of the data like the text contains double quotes, which really cause a lot of issues!
# So I need escape='"'
fake_df = session.read.csv("Fake.csv", inferSchema = True, header=True, multiLine=True, escape='"')

print(type(fake_df))

# show() prints the table itself and returns None, so it should not be wrapped in print()
fake_df.show()
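
The escape option matters because CSV files conventionally write a literal double quote inside a quoted field as two double quotes, while Spark's CSV reader uses a backslash as its default escape character. The sketch below uses Python's standard csv module on a made-up record (not taken from the dataset) to show how doubled quotes and an embedded line break are meant to be read. It only illustrates the file convention and is not how Spark parses internally.

```python
import csv
import io

# Made-up sample: one record whose fields contain doubled quotes and a line break
raw = 'title,text\n"He said ""no"" twice","first line\nsecond line"\n'

rows = list(csv.reader(io.StringIO(raw)))
print(rows)
```

The second field spans two physical lines but stays a single value, which is the situation that multiLine=True is there to handle when reading with Spark.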

In [None]:
fake_df.printSchema()

In [None]:
true_df = session.read.csv("True.csv", inferSchema = True, header=True, multiLine=True, escape='"')

print(type(true_df))

# show() prints the table itself and returns None, so it should not be wrapped in print()
true_df.show()

In [None]:
true_df.printSchema()

## Preprocess Data

### Remove missing info

In [None]:
# Remove missing info if any

# Count null and NaN values in every column of both datasets
for name, df in [("True", true_df), ("Fake", fake_df)]:
    for column in ["title", "text", "subject", "date"]:
        print(f"{name} {column} null:", df.filter(F.col(column).isNull()).count())
        print(f"{name} {column} nan:", df.filter(F.isnan(F.col(column))).count())

# I didn't detect any missing data but just to be safe

fake_df = fake_df.dropna()
true_df = true_df.dropna()


### Visualize the subjects/types of fake news by frequency

In [None]:
import matplotlib.pyplot as plt

# Visualize the types of fake news by frequency

# Get each subject and its article count in a single query so names and counts stay paired
# (separate distinct() and groupBy() queries are not guaranteed to return rows in the same order)
fake_subject_counts = fake_df.groupBy("subject").count()

# Show subject names and corresponding counts in table
fake_subject_counts.show()

# Create dictionary with subjects as keys and counts as values
fake_news_dict = {row["subject"]: row["count"] for row in fake_subject_counts.collect()}
print("fake news subject counts:", fake_news_dict)

# Sort in descending order by count
fake_news_by_frequency = sorted(fake_news_dict.items(), key=lambda x: x[1], reverse=True)

# Get sorted keys and values
fn_subjects, fn_counts = zip(*fake_news_by_frequency)

# Show subject names and corresponding counts in barchart
plt.bar(x = fn_subjects, height = fn_counts)

plt.xticks(rotation=-45)

plt.tight_layout()

plt.xlabel("Subject")
plt.ylabel("Count")

plt.title("Subjects of Fake News by Frequency")

plt.show()

### Visualize the subjects/types of true news by frequency

In [None]:
# Visualize the types of true news by frequency

# Get each subject and its article count in a single query so names and counts stay paired
# (separate distinct() and groupBy() queries are not guaranteed to return rows in the same order)
true_subject_counts = true_df.groupBy("subject").count()

# Show subject names and corresponding counts in table
true_subject_counts.show()

# Create dictionary with subjects as keys and counts as values
true_news_dict = {row["subject"]: row["count"] for row in true_subject_counts.collect()}
print("true news subject counts:", true_news_dict)

# Sort in descending order by count
true_news_by_frequency = sorted(true_news_dict.items(), key=lambda x: x[1], reverse=True)

# Get sorted keys and values
fn_subjects, fn_counts = zip(*true_news_by_frequency)

# Show subject names and corresponding counts in barchart
plt.bar(x = fn_subjects, height = fn_counts)

plt.xticks(rotation=-45)

plt.tight_layout()

plt.xlabel("Subject")
plt.ylabel("Count")

plt.title("Subjects of True News by Frequency")

plt.show()


### Visualizing the top 20 most prominent dates of fake news

In [None]:
# Visualizing the top 20 most prominent dates of fake news

# Get each date and its article count in a single query so dates and counts stay paired
fake_date_counts = fake_df.groupBy("date").count()

# Show dates and corresponding counts in table
fake_date_counts.show()

# Create dictionary with dates as keys and counts as values
fake_news_dict = {row["date"]: row["count"] for row in fake_date_counts.collect()}

# Sort in descending order by count
fake_news_dates_by_frequency = sorted(fake_news_dict.items(), key=lambda x: x[1], reverse=True)

print("Top 20 most prevalent dates of fake news posts", fake_news_dates_by_frequency[:20])

# Get sorted keys and values
fn_dates, fn_counts = zip(*fake_news_dates_by_frequency)

# Show the 20 most frequent dates and corresponding counts in barchart
plt.bar(x = fn_dates[:20], height = fn_counts[:20])

plt.xticks(rotation=-90)

plt.title("Top 20 most prevalent dates of fake news posts")

plt.xlabel("Date")
plt.ylabel("Number of posts")

plt.tight_layout()

plt.show()

### Visualizing the top 20 most prominent dates of true news

In [None]:
# Visualizing the top 20 most prominent dates of true news

# Get each date and its article count in a single query so dates and counts stay paired
true_date_counts = true_df.groupBy("date").count()

# Show dates and corresponding counts in table
true_date_counts.show()

# Create dictionary with dates as keys and counts as values
true_news_dict = {row["date"]: row["count"] for row in true_date_counts.collect()}

# Sort in descending order by count
true_news_dates_by_frequency = sorted(true_news_dict.items(), key=lambda x: x[1], reverse=True)

print("Top 20 most prevalent dates of true news posts", true_news_dates_by_frequency[:20])

# Get sorted keys and values
fn_dates, fn_counts = zip(*true_news_dates_by_frequency)

# Show the 20 most frequent dates and corresponding counts in barchart
plt.bar(x = fn_dates[:20], height = fn_counts[:20])

plt.xticks(rotation=-90)

plt.title("Top 20 most prevalent dates of true news posts")

plt.xlabel("Date")
plt.ylabel("Number of posts")

plt.tight_layout()

plt.show()

## Lengths of fake vs real news
- Assumption: fake news is longer

### Comparing lengths of titles for fake and true news

In [None]:
# Comparing lengths of title of posts for fake and true news
import pyspark.sql.functions as F

# Fake news average title length
fake_news_length = fake_df.withColumn("title_length", F.length(fake_df.title))
fake_news_title_avg = fake_news_length.agg(F.avg(F.col("title_length"))).first()[0]

# True news average title length
true_news_length = true_df.withColumn("title_length", F.length(true_df.title))
true_news_title_avg = true_news_length.agg(F.avg(F.col("title_length"))).first()[0]

print("Fake news title average length:", fake_news_title_avg)
print("True news title average length:", true_news_title_avg)

# Show the average title length of fake and true news in barchart
plt.bar(x = ["fake news title avg length", "true news title avg length"], height = [fake_news_title_avg, true_news_title_avg])

plt.xticks(rotation=0)

plt.title("Comparing average length of titles for fake and true news")

plt.xlabel("Type of news")
plt.ylabel("Average title length in characters")

plt.tight_layout()

plt.show()

### Comparing lengths of text for fake and true news

In [None]:
# Comparing lengths of text of posts for fake and true news

# Fake news average text length
fake_news_length = fake_df.withColumn("text_length", F.length(fake_df.text))
fake_news_text_avg = fake_news_length.agg(F.avg(F.col("text_length"))).first()[0]

# True news average text length
true_news_length = true_df.withColumn("text_length", F.length(true_df.text))
true_news_text_avg = true_news_length.agg(F.avg(F.col("text_length"))).first()[0]

print("Fake news text average length:", fake_news_text_avg)
print("True news text average length:", true_news_text_avg)

# Show the average text length of fake and true news in barchart
plt.bar(x = ["fake news text avg length", "true news text avg length"], height = [fake_news_text_avg, true_news_text_avg])

plt.xticks(rotation=0)

plt.title("Comparing average length of text for fake and true news")

plt.xlabel("Type of news")
plt.ylabel("Average text length in characters")

plt.tight_layout()

plt.show()

## Preprocessing Text Data

### Cleaning function

In [None]:
# Preprocess the text data
from gensim.parsing.preprocessing import remove_stopwords, strip_punctuation, strip_short, strip_numeric
import re

# Do some common cleaning options to remove noise from text
def clean_text(text):
    
    # Code from: https://stackoverflow.com/questions/640001/how-can-i-remove-text-within-parentheses-with-a-regex
    # by user Can Berk Güder, and this regex removes words in parentheses, ex. @(twitterName)
    text_reg1 = re.sub(r'\([^)]*\)', '', text)
     
    # This code comes from https://stackoverflow.com/questions/53071255/how-to-remove-urls-without-http-in-a-text-document-using-r
    # from user Wiktor Stribiżew and is used to remove URLs which may not have http in them
    text_reg2 = re.sub("\\s*[^ /]+/[^ /]+","", text_reg1)
    
    # Remove punctuation
    text_p1 = strip_punctuation(text_reg2)
    
    # Remove short words
    text_p2 = strip_short(text_p1)
    
    # A lot of dates repeating like (Dec 2017), so I removed numbers to try and remove this redundancy
    text_p3 = strip_numeric(text_p2)
    
    # Finally remove stopwords and make text lowercase
    return remove_stopwords(text_p3.lower())
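
To see what the two regular expressions in clean_text do on their own, here is a small sketch that applies only those two patterns to made-up strings. The helper name `strip_parens_and_slashed` is hypothetical and the gensim steps are left out. One thing it makes visible is that the second pattern removes any token containing a slash, so ordinary text such as "and/or" is dropped along with URLs.

```python
import re

def strip_parens_and_slashed(text):
    # Same two patterns as in clean_text, without the gensim steps
    no_parens = re.sub(r'\([^)]*\)', '', text)
    return re.sub("\\s*[^ /]+/[^ /]+", "", no_parens)

# Made-up examples for illustration
url_example = strip_parens_and_slashed("photo (via Reuters) at pic.twitter.com/abc123 now")
slash_example = strip_parens_and_slashed("yes and/or no")

print(repr(url_example))
print(repr(slash_example))
```

Removing the parenthesized text leaves the surrounding spaces in place, so extra whitespace can remain in the result.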


### Cleaning and arranging text data for fake news

In [None]:
# Cleaning and arranging text data for fake news

print("Before Cleaning:\n", fake_df.select("text").first())

# I combined the title and text and am considering them together and applying cleaning to each
fake_text = fake_df.rdd.map(lambda x: clean_text(x["title"]) + " " + clean_text(x["text"]))
print(type(fake_text))

print("\n After Cleaning:\n", fake_text.first())

### Cleaning and arranging text data for true news

In [None]:
# Cleaning and arranging text data for true news

print("Before Cleaning:\n", true_df.select("text").first())

# I combined the title and text and am considering them together
true_text = true_df.rdd.map(lambda x: clean_text(x["title"])+ " " + clean_text(x["text"]))
print(type(true_text))

print("\n After Cleaning:\n", true_text.first())

### Load in news text as spark dataframes and add column for the type of news (fake or true)

In [None]:
# Load in news text as spark dataframes and add column for the type of news (fake or true)
from pyspark.sql.functions import lit
from pyspark.sql import Row

fake_text_df = fake_text.map(Row("value")).toDF()
# adding new column for class_name, which is all "fake"
fake_text_df = fake_text_df.withColumn("class_name", lit("fake"))

true_text_df = true_text.map(Row("value")).toDF()
# adding new column for class_name, which is all "true"
true_text_df = true_text_df.withColumn("class_name", lit("true"))

print(fake_text_df.columns)
print(true_text_df.columns)

### Combine the fake news and true news dataframes into one

In [None]:
# Combine the dataframes into one
import pyspark.sql.functions as F

news_text_df = fake_text_df.union(true_text_df)

news_text_df = news_text_df.coalesce(4)

# make the order of fake/true news random
news_text_df = news_text_df.select("*").orderBy(F.rand())

#print(news_text_df.rdd.getNumPartitions())

news_text_df.show()

## Getting frequent words for types of news

### Tokenizing words

In [None]:
# Start by tokenizing words

from pyspark.ml.feature import Tokenizer
# https://spark.apache.org/docs/latest/mllib-feature-extraction.html

tokenizer = Tokenizer(inputCol="value", outputCol="tokens")
news_text_tokenized = tokenizer.transform(news_text_df)
news_text_tokenized.show()

### Getting frequent words

In [None]:
from pyspark.sql.functions import explode, row_number, desc, col
from pyspark.sql.window import Window

# Getting most frequent words using code
# from stackoverflow https://stackoverflow.com/questions/72523404/get-topn-keywords-with-pyspark-countvectorizer
# by a user named walking
news_text_frequent_words = news_text_tokenized.select("class_name", explode("tokens").alias("word"))\
    .groupBy("class_name", "word").count()\
    .withColumn("rn", row_number()\
                .over(Window.partitionBy("class_name").orderBy(desc("count")))) \
    .filter(col("rn") <= 20)

news_text_frequent_words.show()
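
The Spark query above ranks words within each class by count and keeps the top 20. The same idea can be sketched in plain Python on a few made-up rows, which may help when checking the logic on a small example. The function name `top_words_per_class` and the toy rows are assumptions for illustration only.

```python
from collections import Counter, defaultdict

def top_words_per_class(rows, n=3):
    """rows: iterable of (class_name, text) pairs with whitespace-separated words."""
    counters = defaultdict(Counter)
    for class_name, text in rows:
        counters[class_name].update(text.split())
    # most_common(n) returns the n highest counts, similar to the rn <= n filter
    return {name: counter.most_common(n) for name, counter in counters.items()}

# Made-up toy rows, not taken from the dataset
toy_rows = [
    ("fake", "trump said trump video"),
    ("fake", "video trump people"),
    ("true", "said reuters president said"),
]
top_words = top_words_per_class(toy_rows, n=2)
print(top_words)
```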


## Top 20 most frequent words of fake news posts

In [None]:
import matplotlib.pyplot as plt

# Visualizing the top 20 most prominent words of fake news

# Get the top words and their counts for fake news, sorted by descending count
fake_news_details = news_text_frequent_words.filter(F.col("class_name") == "fake").select(F.col("word"), F.col("count")).sort("count", ascending=False)
# Convert to pandas once so words and counts come from the same result
fake_news_pdf = fake_news_details.toPandas()
fake_news_words = list(fake_news_pdf["word"])
print("fake news words", fake_news_words)

fake_news_counts = list(fake_news_pdf["count"])
print("fake news counts", fake_news_counts)

plt.bar(x = fake_news_words, height = fake_news_counts)

plt.xticks(rotation=-90)

plt.title("Top 20 most prevalent words of fake news posts")

plt.xlabel("Word")
plt.ylabel("Count in fake news posts")

plt.tight_layout()

plt.show()

## Top 20 most frequent words of true news posts

In [None]:
import matplotlib.pyplot as plt

# Visualizing the top 20 most prominent words of true news

# Get the top words and their counts for true news, sorted by descending count
true_news_details = news_text_frequent_words.filter(F.col("class_name") == "true").select(F.col("word"), F.col("count")).sort("count", ascending=False)
# Convert to pandas once so words and counts come from the same result
true_news_pdf = true_news_details.toPandas()
true_news_words = list(true_news_pdf["word"])
print("true news words", true_news_words)

true_news_counts = list(true_news_pdf["count"])
print("true news counts", true_news_counts)

plt.bar(x = true_news_words, height = true_news_counts)

plt.xticks(rotation=-90)

plt.title("Top 20 most prevalent words of true news posts")

plt.xlabel("Word")
plt.ylabel("Count in true news posts")

plt.tight_layout()

plt.show()

## Split Data

In [None]:
import numpy as np

# Setting up for embedding for news text:

# Take a random 10% sample of the combined data
# Learned code from user prudenko at https://stackoverflow.com/questions/43637625/how-to-shuffle-the-rows-in-a-spark-dataframe
# cache() helps keep the random sample stable across the actions below
news_text_df = news_text_df.orderBy(F.rand()).sample(0.1).cache()

# Splitting data into train and test
news_text_train_df, news_text_test_df = news_text_df.randomSplit([0.9, 0.1])

# Double check randomsplit gives what we expect
news_train_len = news_text_train_df.count()
news_test_len = news_text_test_df.count()
total_len = news_train_len + news_test_len

print("Percent training:", round(news_train_len / total_len, 2))
print("Percent testing:", round(news_test_len / total_len, 2))


In [None]:
# Checking if data is random order

news_text_train_df.show()

In [None]:
# Checking if data is random order

news_text_test_df.show()

## Can we classify news as being fake or true?

- Which sentences are closest to the query, and do their class_names match?
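
The classification idea here is a nearest-neighbor vote: find the training texts closest to a query and predict the class that most of them share. Below is a minimal brute-force sketch of that vote in plain Python, using made-up 2-D vectors in place of real text features. The `knn_predict` helper is hypothetical, and the exact search shown is only a stand-in for the approximate methods used in this notebook.

```python
from collections import Counter

def knn_predict(query, vectors, labels, k=5):
    """Brute-force k-nearest-neighbor vote using squared Euclidean distance."""
    distances = [
        (sum((q - v) ** 2 for q, v in zip(query, vec)), label)
        for vec, label in zip(vectors, labels)
    ]
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical 2-D "embeddings" purely for illustration
toy_vectors = [(0.0, 0.1), (0.1, 0.0), (0.2, 0.1), (1.0, 1.0), (0.9, 1.1)]
toy_labels = ["fake", "fake", "fake", "true", "true"]

pred_near_fake = knn_predict((0.05, 0.05), toy_vectors, toy_labels, k=3)
pred_near_true = knn_predict((1.0, 0.9), toy_vectors, toy_labels, k=3)
print(pred_near_fake, pred_near_true)
```

An odd k avoids tied votes when there are only two classes.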

### Using BucketedRandomProjectionLSH for approximate nearest neighbors

In [None]:
from pyspark.ml import Pipeline
from pyspark.ml.feature import HashingTF, IDF, Tokenizer, NGram, BucketedRandomProjectionLSH

# Create a pipeline
model = Pipeline(stages=[
    # Create tokens from words
    Tokenizer(inputCol="value", outputCol="tokens"),
    # Get ngrams from tokens (speeds up computation)
    NGram(n=8, inputCol="tokens", outputCol="ngrams"),
    # Get feature vectors to input to LSH
    HashingTF(inputCol="ngrams", outputCol="vectors"),
]).fit(news_text_train_df)

news_text_trans = model.transform(news_text_train_df)

In [None]:
# Create LSH model Bucket Random Projection (https://spark.apache.org/docs/2.2.3/ml-features.html#lsh-operations)
LSH_model = BucketedRandomProjectionLSH(inputCol="vectors", outputCol="lsh", bucketLength=2.0, numHashTables=3).fit(news_text_trans)

LSH_model.transform(news_text_trans).show()

In [None]:
# Double check what columns were for test set
print(news_text_test_df.columns)

In [None]:
keys = model.transform(news_text_test_df)

print(type(keys.first()[4]))
print(keys.columns)

In [None]:
result = LSH_model.approxNearestNeighbors(news_text_trans, keys.first()[4], 5)

result.groupBy("class_name").count().show()

In [None]:
# Get counts of how many neighbors were from the fake news class and the true news class,
# sorted so the class with the most neighbors comes first
class_name_counts = result.groupBy("class_name").count().orderBy(desc("count"))

# Index 1 of keys.first() is class_name
print("Real class:", keys.first()[1])
# Get class with max count from neighbors
print("Predicted class:", class_name_counts.first()[0])

In [None]:
key_list = keys.take(10)

correct_preds = 0
for key_row in key_list:
    # Index 4 is the "vectors" column, index 1 is class_name
    result = LSH_model.approxNearestNeighbors(news_text_trans, key_row[4], 5)
    # Count neighbors per class for this key, sorted so the most common class comes first
    class_name_count = result.groupBy("class_name").count().orderBy(desc("count"))
    pred_class = class_name_count.first()[0]
    real_class = key_row[1]
    print("Predicted class:", pred_class)
    print("Actual class:", real_class)
    if (pred_class == real_class):
        correct_preds += 1
    print("correct predictions:", correct_preds, "\n")


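This method is very slow at finding nearest neighbors. In particular, the ".first()" call that reads the predicted class
seems to slow things down a lot, most likely because Spark evaluates lazily and only runs the nearest-neighbor search
at that point. So I will try using Faiss for quicker results.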

### Using Autofaiss for approximate nearest neighbors

In [None]:
# Create SentenceTransformer model for embedding text in 384 dimensions

from sentence_transformers import SentenceTransformer

# I chose a small, speedy model
# https://www.sbert.net/docs/pretrained_models.html
ST_model = SentenceTransformer('paraphrase-MiniLM-L3-v2')

In [None]:
# Test embeddings

print("Embedding Dimension:", ST_model.encode(fake_text.first()).reshape(1, -1).shape)

In [None]:
# Save data to csv, load csv with dask

news_text_train_df.coalesce(1).write.option("header", "true").csv("news_text_train.csv")

In [None]:
import dask.dataframe as dd
from glob import glob

path = glob('news_text_train.csv/*.csv')
print(path)

news_train_dask_df = dd.read_csv(path)

In [None]:
import dask.array as da
import os

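# Note: np.empty allocates one uninitialized placeholder row, so the embedding of
# training sample j ends up in row j + 1 of my_arr
my_arr = np.empty((1,384), dtype='float64')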
my_labels = []

def nump(arr, x):
    x = ST_model.encode(x)
    
    return np.append(arr, x.reshape(1,384), axis=0)
    
    
# Converts to pandas - try batching?

# For training, get vectors and corresponding class labels
for i in news_train_dask_df['value'].compute():    
    my_arr = nump(my_arr, i)
    
for i in news_train_dask_df['class_name'].compute():    
    my_labels.append(i)

    
# For testing, keep the raw text (it is embedded later, at query time) and the class labels
news_test = news_text_test_df.toPandas()

my_arr_test = list(news_test["value"])
my_labels_test = list(news_test["class_name"])



In [None]:
# I am using autofaiss and I got guidance on setting up code from the documentation
# at https://github.com/criteo/autofaiss

import os
import numpy as np
import dask.array as da

# exist_ok=True lets this cell be rerun without raising FileExistsError
os.makedirs("news_train_embeddings", exist_ok=True)

# Save my sentence embeddings to a file because in next block of code
# will create index using these embeddings
np.save('news_train_embeddings/embeddings.npy', my_arr)

os.makedirs("my_index_folder", exist_ok=True)

In [None]:
!autofaiss build_index --embeddings="news_train_embeddings/" --index_path="my_index_folder/knn.index" --index_infos_path="my_index_folder/index_infos.json" --metric_type="ip"
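
The index above is built with metric_type="ip", which ranks neighbors by inner product. Inner product matches cosine similarity only when the vectors have unit length, and this notebook does not check whether the sentence embeddings are normalized. The sketch below uses made-up 2-D vectors to show how the ranking can differ. It illustrates the metric itself and says nothing about how autofaiss handles the embeddings.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def l2_normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

# Hypothetical vectors: long_vec points away from the query but has a large norm
query = [1.0, 0.0]
aligned = [0.9, 0.1]
long_vec = [5.0, 5.0]

raw_scores = {"aligned": dot(query, aligned), "long_vec": dot(query, long_vec)}
unit_scores = {
    "aligned": dot(l2_normalize(query), l2_normalize(aligned)),
    "long_vec": dot(l2_normalize(query), l2_normalize(long_vec)),
}
print(raw_scores)   # raw inner product favors the long vector
print(unit_scores)  # after normalization the aligned vector scores higher
```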

In [None]:
# Using autoFAISS, documentation here https://github.com/criteo/autofaiss

import faiss
import glob
import numpy as np

# Read the index that was just built
my_index = faiss.read_index(glob.glob("my_index_folder/*.index")[0])

# Print first key as a test
key = news_text_test_df.select("value").take(1)[0][0]

print(key)


In [None]:
# Print first label as a test

label = news_text_test_df.select("class_name").take(1)[0][0]

print(label)

In [None]:
from collections import Counter

# Testing if faiss returns the labels of the correct class for the majority of
# nearest neighbors (with k=5) for each test sample

preds = []
for key in my_arr_test:
    distances, indices = my_index.search(ST_model.encode(key).reshape(1, 384), 5)

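    classes = []
    for i in indices[0]:
        # Row 0 of my_arr is the uninitialized placeholder row from np.empty, so index row i
        # holds training sample i - 1; skip the placeholder (and any -1 that faiss returns
        # when it finds fewer than k neighbors)
        if i > 0:
            classes.append(my_labels[i-1])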

    class_counter = Counter(classes)
    preds.append(class_counter.most_common(1)[0][0])

### Metrics for ANN Model

In [None]:
# A confusion matrix can help give important metrics

from sklearn.metrics import confusion_matrix

conf_mat = confusion_matrix(my_labels_test, preds)
print(conf_mat)
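
When no label order is given, scikit-learn's confusion_matrix sorts the labels, so "fake" comes before "true" and is treated as the negative class. Rows are the actual classes and columns are the predicted ones. The plain-Python sketch below rebuilds that layout on made-up labels to show why flattening the matrix gives TN, FP, FN, TP in that order. The `binary_confusion` helper is hypothetical and only meant for illustration.

```python
def binary_confusion(actual, predicted, negative="fake", positive="true"):
    """Return [[TN, FP], [FN, TP]] with `negative` as the first row and column."""
    order = [negative, positive]
    matrix = [[0, 0], [0, 0]]
    for a, p in zip(actual, predicted):
        # Row index comes from the actual class, column index from the prediction
        matrix[order.index(a)][order.index(p)] += 1
    return matrix

# Made-up labels for illustration only
actual = ["fake", "fake", "fake", "true", "true", "true"]
predicted = ["fake", "fake", "true", "true", "true", "fake"]

matrix = binary_confusion(actual, predicted)
tn, fp = matrix[0]
fn, tp = matrix[1]
print(matrix)
```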

In [None]:
import seaborn as sn

# Using https://medium.com/@dtuk81/confusion-matrix-visualization-fc31e3f30fea
# by Dennis T for formatting the confusion matrix

group_names = ['True Neg','False Pos','False Neg','True Pos']
group_counts = ["{0:0.0f}".format(value) for value in
                conf_mat.flatten()]
group_percentages = ["{0:.2%}".format(value) for value in
                     conf_mat.flatten()/np.sum(conf_mat)]
labels = [f"{v1}\n{v2}\n{v3}" for v1, v2, v3 in
          zip(group_names,group_counts,group_percentages)]
labels = np.asarray(labels).reshape(2,2)

sn.heatmap(conf_mat/np.sum(conf_mat), fmt='', annot=labels)

In [None]:
from sklearn import metrics

TN, FP, FN, TP = conf_mat.flatten()

# Formulas from https://towardsdatascience.com/accuracy-recall-precision-f-score-specificity-which-to-optimize-on-867d3f11124
# by Salma Ghoneim

accuracy = (TP + TN) / (TP + FP + FN + TN)

precision = TP / (TP + FP)

recall = TP / (TP + FN)

specificity = TN / (TN + FP)

f1_score = 2 * (recall * precision) / (recall + precision)


# Convert fake to 0, true to 1 to pass to roc_curve and auc functions
my_labels_test_num = [0 if i == "fake" else 1 for i in my_labels_test]

my_preds_num = [0 if i == "fake" else 1 for i in preds]

fpr, tpr, thresholds = metrics.roc_curve(my_labels_test_num, my_preds_num, pos_label = 1)

auc_score = metrics.auc(fpr, tpr)

print("Metrics:\n",
      f"   Accuracy: {accuracy}\n",
      f"  Precision: {precision}\n",
      f"     Recall: {recall}\n",
      f"Specificity: {specificity}\n",
      f"   F1 Score: {f1_score}\n",
      f"        AUC: {auc_score}")
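
As a worked example of the formulas above, the sketch below plugs in hypothetical counts (not results from this project) so each metric can be checked by hand. The `classification_metrics` helper is made up for illustration and does not guard against division by zero.

```python
def classification_metrics(tn, fp, fn, tp):
    """Compute the same metrics as above from the four confusion matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    specificity = tn / (tn + fp)
    f1 = 2 * (recall * precision) / (recall + precision)
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "f1": f1,
    }

# Hypothetical counts chosen only to make the arithmetic easy to follow
scores = classification_metrics(tn=40, fp=10, fn=5, tp=45)
for name, value in scores.items():
    print(f"{name}: {value:.4f}")
```

With these counts, accuracy is 85/100, precision is 45/55, recall is 45/50, and specificity is 40/50.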

## Conclusions

I drew several conclusions from my project. First, fake news is wordier than true news (measured by average length in
characters), which was what I expected. With fake news, people can spew a lot of nonsense and run on and on, but true news
is generally more concise. In addition, I found that the most frequent words in both the fake news and true news datasets
relate to politics, which makes sense.