<div class="alert">  
    <center><h1><strong>Final Project: NLP Fake news detection</strong></h1></center>

<div class="alert alert-block alert-info">  
    <center><h2><strong>Analysis / exploration of the data set</strong></h2></center>

<div class="alert alert-block alert-warning">  
<b>Summary of the section </b> 
<hr>
<ul>
    <li>Analyze the chosen dataset and its properties</li>
</ul>
    
<hr>
<b>Source of Data: </b> 
<hr> 
 <a href="https://www.cs.ucsb.edu/~william/data/liar_dataset">https://www.cs.ucsb.edu/~william/data/liar_dataset</a>
   
</div>

In [None]:
#Importing the required libraries
import numpy as np
import pandas as pd

## 1. Loading the dataset

In [None]:
# Load the dataset into a pandas DataFrame so we can analyze it further
df = pd.read_csv("datasets/Liar_Dataset.csv")

# Look at the first 5 records to get an idea of what we are working with
df.head(5)

## 2. Exploring the dataset

In [None]:
# Let us first take a look at what we are dealing with
df.info()

<div class="alert alert-block alert-warning">  
<b>From the output above we can point out a few things about the dataset:</b> 
<hr>
<ul>
    <li>1. It has 14 columns and 12787 rows.</li>
    <li>2. It has two datatypes: object (strings) and int64 (integers).</li>
    <li>3. The first column, which seems to be some kind of ID, is actually a string. Looking at the data inside the CSV, it appears to be a filename.</li>
    <li>4. Some columns contain nulls: for 'speaker's job title', 'state info' and 'venue' the Non-Null Count is lower than the number of rows.</li>
</ul>
</div>

In [None]:
# Now take a look at the columns in the dataset to get an idea of the features we can use later in our model
df.columns

<div class="alert alert-block alert-warning">  
<b>Further description of each column, based on the rows shown above</b> 
<hr>
<ul>
    <li>1. '[ID].json' - As already stated, this seems to be a filename. It is unlikely to be a useful feature, so we drop it later.</li>
    <li>2. 'label' - The classification of the statement. It can be TRUE, FALSE, half-true, mostly-true, barely-true or pants-fire.</li>
    <li>3. 'statement' - The text of the statement being fact-checked.</li>
    <li>4. 'subject(s)' - The subjects that the statement talks about. We analyze them further below.</li>
    <li>5. 'speaker' - The person who said or wrote the statement. We analyze them further below.</li>
    <li>6. 'speaker's job title' - The speaker's job. We analyze them further below.</li>
    <li>7. 'state info' - The U.S. state associated with the statement (the dataset documentation only describes it as "the state info").</li>
    <li>8. 'party affiliation' - The political party the speaker is related to.</li>
    <li>9. 'barely true counts' - The number of the speaker's statements rated barely true. As far as we understand the dataset documentation, this and the four count columns below form the speaker's credit history.</li>
    <li>10. 'false counts' - The number of the speaker's statements rated false.</li>
    <li>11. 'half true counts' - The number of the speaker's statements rated half true.</li>
    <li>12. 'mostly true counts' - The number of the speaker's statements rated mostly true.</li>
    <li>13. 'pants on fire counts' - The number of the speaker's statements rated pants on fire.</li>
    <li>14. 'venue' - The "place" where the statement was made. It could be a radio station, a blog or any other media venue.</li>
</ul>
</div>

## 3. Cleaning the data

In [None]:
# We already noticed above that we have some null values. Now let us confirm how many and in which columns
df.isnull().sum()

In [None]:
np.sum(df.isnull().any(axis=1))

<div class="alert alert-block alert-warning">  
<b>In the above we can see that:</b> 
<hr>
<ul>
    <li>1. We have 3565 nulls in the speaker's job title column, 2747 nulls in the state info column, and 129 nulls in the venue column.</li>
    <li>2. In total, 4351 rows contain at least one null.</li>
    <li>3. Missing values can show up in two forms: as NaN (np.nan) or as empty strings, so we normalize both below.</li>
</ul>
<h3>Below we will apply some strategies to clean these nulls in order to not mess with our analysis.</h3>
</div>
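
The two checks above measure different things: `isnull().sum()` counts missing cells per column, while `isnull().any(axis=1)` flags rows that have at least one missing cell. The sketch below shows the difference on a small made-up frame (the values are illustrative assumptions, not rows from the dataset) and the effect of filling with 'Unknown'.

```python
import numpy as np
import pandas as pd

# Toy frame with hypothetical values, only to illustrate the two counts
toy = pd.DataFrame({
    "speaker's job title": ["Governor", np.nan, np.nan, "Senator"],
    "state info": ["Texas", np.nan, "Ohio", "Florida"],
    "venue": ["a speech", "a tweet", np.nan, "an interview"],
})

# Missing cells per column, and their grand total
nulls_per_column = toy.isnull().sum()
total_null_cells = int(nulls_per_column.sum())

# Rows that contain at least one missing cell
rows_with_any_null = int(toy.isnull().any(axis=1).sum())

# Replace every missing cell with a placeholder
filled = toy.fillna("Unknown")

print(nulls_per_column)
print(total_null_cells, rows_with_any_null)
```

A row with two missing cells is counted twice in the first total but only once in the second, which is why the two numbers differ.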

In [None]:
# Replace empty strings with NaN so that every missing value is represented the same way
df.replace('', np.nan, inplace=True)

# Fill the missing values in the three affected columns with the placeholder 'Unknown'
df["speaker's job title"] = df["speaker's job title"].fillna('Unknown')
df['venue'] = df['venue'].fillna('Unknown')
df["state info"] = df["state info"].fillna('Unknown')

In [None]:
# Now let us check again if we were able to completely remove the nulls
df.isnull().sum()

In [None]:
# We can also remove the [ID].json column, since it is not useful to us
df.drop(columns=['[ID].json'], inplace=True)

## 4. Data visualization on the rows

<div class="alert alert-block alert-warning">  
<b>Below we try to understand more about the rows we have, so we can start reasoning about the information they contain and which of it we can later use to extract features for our Machine Learning models.</b> 
<hr>
</div>

In [None]:
# Defining some helper functions and variables for generating the visualizations using matplotlib, wordcloud and seaborn
from matplotlib import pyplot as plt 
from wordcloud import WordCloud
import seaborn as sns

top10 = 10


# Plots a pie chart of the most frequent values in a column
def plot_pie(column, number_of_values):
    df[column].value_counts().head(number_of_values).plot(kind='pie', autopct='%1.1f%%', figsize=(8, 8)).legend(bbox_to_anchor=(1, 1))

# Plots a bar chart of the most frequent values in a column
def plot_bar(column, number_of_values):
    df[column].value_counts().head(number_of_values).plot(kind='bar', rot=0)

# TODO: Fix because we need a way to show the data correlation between different columns.
# Plots a wordcloud based on the relationship between two columns
## column: 
## value:
## word:
# def plot_wordcloud(column, value, word):
#     data1=df[df[column]==value]
#     d =data1[word]
#     string_ = []
#     for t in d:
#         string_.append(t)
#     string_ = pd.Series(string_).map(str)
#     string_=str(string_)
#     wc = WordCloud(width=1500, height=700,max_font_size=250, background_color ='white').generate(string_)
#     plt.figure(figsize=(12,10))
#     plt.imshow(wc)
#     plt.axis("off")
#     plt.show()

In [None]:
# Plotting a pie graph based on the label column
plot_pie('label', top10)

In [None]:
# Plotting a pie graph based on the subjects column
plot_pie('subject(s)', top10)

In [None]:
# Plotting a pie graph based on the speaker column
plot_pie('speaker', top10)

In [None]:
# Plotting a pie graph based on the speaker's job title column
plot_pie("speaker's job title", top10)

In [None]:
# Plotting a pie graph based on the venue
plot_pie("venue", top10)

In [None]:
# Plotting a pie graph based on the state info column
plot_pie("state info", top10)

In [None]:
# Plotting a pie graph based on the party affiliation column
plot_pie("party affiliation", top10)

<div class="alert alert-block alert-warning">  
<b>Based on the above pie charts we can already draw some interesting conclusions:</b> 
<hr>
<ul>
<li>1. For the <b>'label'</b> column, which defines the category each statement belongs to, the largest class is half-true (19.6% of the statements). At the opposite end, pants-fire accounts for only 8.2%.</li>
<li>2. As for the <b>subjects</b>, the chart shows the 10 most frequent ones. The largest of them are (from most to least frequent):
<ul>
<li>a. health-care</li>
<li>b. education</li>
<li>c. elections</li>
<li>d. immigration</li>
<li>e. candidates-biography</li>
<li>f. economy</li>
</ul>
</li>
<li>3. As for the <b>job titles</b> of the speakers (the people who said or wrote the statements), we have:
<ul>
<li>a. Unknown - the statements for which we had no job title data</li>
<li>b. President</li>
<li>c. U.S. Senator</li>
<li>d. Governor</li>
<li>e. President-Elect (this is a duplicate, we will deal with it later)</li>
<li>f. U.S. Senator (this is a duplicate, we will deal with it later)</li>
<li>g. Presidential Candidate</li>
</ul>
</li>
<li>4. For the <b>venue</b> column, meaning the source of the statement, we have (percentages are relative to the slices shown in the chart):
<ul>
<li>a. 18% come from a news release.</li>
<li>b. 16.7% come from an interview.</li>
<li>c. 16.5% come from a press release.</li>
<li>d. 15.1% come from a speech.</li>
<li>e. 13.0% come from a TV ad.</li>
<li>f. 11.1% come from a tweet.</li>
<li>g. 9.6% come from a campaign ad.</li>
</ul>
</li>
<li>5. For the <b>state info</b> column, the most frequent values are:
<ul>
<li>a. Unknown</li>
<li>b. Texas</li>
<li>c. Florida</li>
<li>d. Wisconsin</li>
<li>e. New York</li>
<li>f. Illinois</li>
<li>g. Ohio</li>
</ul>
</li>
<li>6. Finally, for the <b>party affiliation</b> column we have:
<ul>
<li>a. 45.2% of the statements come from speakers affiliated with the Republican party.</li>
<li>b. A large share of the statements is affiliated with Democrats, and roughly 17% are unknown.</li>
<li>c. A small number of statements have no party affiliation or come from libertarians, independents, organizations or newsmakers.</li>
</ul>
</li>
</ul>
</div>
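
One caveat when reading the percentages: a pie chart drawn from only the top N values shows each slice as a share of the plotted values, not of the whole column. The sketch below makes the difference explicit on a made-up list of venues (the values and the helper name `top_n_shares` are assumptions for illustration).

```python
from collections import Counter

def top_n_shares(values, n):
    """Return {label: (share among the top n, share of all values)} in percent."""
    counts = Counter(values)
    total = sum(counts.values())
    top = counts.most_common(n)
    shown = sum(count for _, count in top)
    return {
        label: (100 * count / shown, 100 * count / total)
        for label, count in top
    }

# Hypothetical venue column: 10 statements, 5 distinct venues
venues = (["a news release"] * 4 + ["an interview"] * 3
          + ["a speech", "a tweet", "a mailer"])

shares = top_n_shares(venues, 2)
for label, (among_top, overall) in shares.items():
    print(f"{label}: {among_top:.1f}% of the chart, {overall:.1f}% of the column")
```

For columns with thousands of distinct values, such as the venue, the two percentages can differ a lot.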

<div class="alert alert-block alert-info">  
    <center><h2><strong>Identification of suitable features and implementation of a suitable feature extractor, e.g.
TfidfVectorizer</strong></h2></center>

## 5. Preparing the features

<div class="alert alert-block alert-warning">  
<b>Below we prepare the features with a few text-cleaning techniques so that we can later fit our models on them.</b>

<ul>
<li>1. We will remove the stopwords from the news statements, since these very common words carry little information and mostly add noise to the model.</li>
<li>2. We will remove special characters.</li>
<li>3. We will find and remove words that are repeated within a statement.</li>
<li>4. We will split the texts into tokens (tokenization) so that we can later vectorize them and fit our model.</li>
<li>5. We will apply stemming to the tokens to strip common suffixes from the end of each word.</li>
<li>6. Finally, we will apply lemmatization so that each output word is an existing, normalized word.</li>
</ul>
<hr>
</div>
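
The first four steps can be sketched on a single sentence using only the standard library. The stopword set here is a tiny hand-written stand-in (an assumption for illustration, not the NLTK corpus), and stemming and lemmatization are left out because they need NLTK resources.

```python
import re

# Hypothetical mini stopword list, only for this illustration
TOY_STOPWORDS = {"the", "a", "an", "of", "is", "in", "to", "and", "that"}

def preprocess(text):
    """Lowercase, strip special characters, drop stopwords and repeated words."""
    # Replace every run of non-word characters with a single space
    cleaned = re.sub(r"\W+", " ", text.lower())
    # Whitespace tokenization, then stopword removal
    tokens = [tok for tok in cleaned.split() if tok not in TOY_STOPWORDS]
    # Keep only the first occurrence of each token, preserving order
    seen = set()
    unique_tokens = []
    for tok in tokens:
        if tok not in seen:
            seen.add(tok)
            unique_tokens.append(tok)
    return unique_tokens

tokens = preprocess("The economy is growing, and the economy is STRONG!")
print(tokens)
```

Lowercasing before the stopword check matters: a capitalized "The" would otherwise slip through a lowercase stopword list.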

### 5.1 Preparing the statement column

In [None]:
# Let us first define some helper functions that will allow us to prepare the data
import re
import nltk
nltk.download('stopwords')
nltk.download('punkt_tab')
nltk.download('wordnet')  # required by WordNetLemmatizer
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem.porter import PorterStemmer
from nltk.stem import WordNetLemmatizer

''' 
Removes the words that are included in the English stopwords corpus (case-insensitive).

Parameters:
- column(pd.Series): The dataframe column to be processed 

Returns:
- column(pd.Series): The processed dataframe column
'''
def remove_stopwords(column):
    # Build the set once: set membership is much faster than scanning a list for every word
    stopwords_set = set(stopwords.words('english'))
    return column.apply(lambda x: ' '.join(word for word in x.split() if word.lower() not in stopwords_set))


''' 
Replaces every run of non-word characters (anything other than letters, digits and underscores, e.g. !@#$%^&*()+{}/) with a single space in each row of the dataframe column.

Parameters:
- column(pd.Dataframe): The dataframe column to be processed 

Returns:
- column(pd.Dataframe): The processed dataframe column
'''
def remove_special_characters(column):
    # Using regex to remove every non word character
    return column.map(lambda x: re.sub(r'\W+', ' ', x))


''' 
Removes duplicated words in the dataframe column.

Parameters:
- column(pd.Dataframe): The dataframe column to be processed 

Returns:
- column(pd.Dataframe): The processed dataframe column
'''
def remove_repeated_words(column):
    ''' 
    Adds all the words in a string sentence as unique words in a set.
    
    Parameters:
    - text(String): A string value representing the sentence to be processed
    
    Returns:
    - text(String): The processed sentence
    '''
    def remove_duplicates(text):
        words = text.split()
        seen = set()
        unique_words = []
        for word in words:
            if word not in seen:
                unique_words.append(word)
                seen.add(word)
        return ' '.join(unique_words)

    return column.apply(remove_duplicates)


''' 
Tokenizes the words in the rows belonging to the dataframe column using word_tokenize. 

Parameters:
- column(pd.Dataframe): The dataframe column to be processed

Returns:
- column(pd.Dataframe): The processed dataframe column
'''
def tokenize(column):
    return column.apply(word_tokenize)


''' 
Applies stemming to the rows belonging to the dataframe column using the Porter Stemmer technique.

Parameters:
- column(pd.Dataframe): The dataframe column to be processed 

Returns:
- column(pd.Dataframe): The processed dataframe column
'''
def apply_stemming(column):
    stemmer = PorterStemmer()
    return column.apply(lambda x : [stemmer.stem(y) for y in x])

''' 
Applies lemmatization to the rows belonging to the dataframe column using WordNetLemmatizer.
Each row may be a raw string or a list of tokens; the result is a space-joined string.

Parameters:
- column(pd.Series): The dataframe column to be processed 

Returns:
- column(pd.Series): The processed dataframe column
'''
def apply_lemmatization(column):
    lemmatizer = WordNetLemmatizer()

    def lemmatize_text(text):
        # After tokenize() and apply_stemming() each row is a list of tokens, not a string
        if isinstance(text, str):
            words = word_tokenize(text)
        elif isinstance(text, list):
            words = text
        else:
            return text
        return ' '.join(lemmatizer.lemmatize(word) for word in words)

    return column.apply(lemmatize_text)
    
# Uncomment the below for debugging
# # df["statement"] = remove_stopwords(df["statement"])
# # df["statement"] = remove_special_characters(df["statement"])
# # df["statement"] = remove_repeated_words(df["statement"])
# # df["statement"] = tokenize(df["statement"])
# # df["statement"] = apply_stemming(df["statement"])
# # df["statement"] = apply_lemmatization(df["statement"])
# # df["statement"].head()

In [None]:
# Preparing the statement column

# Uncomment the below for debugging
# df = pd.read_csv("datasets/Liar_Dataset.csv")

df["statement"] = remove_stopwords(df["statement"])
df["statement"] = remove_special_characters(df["statement"])
df["statement"] = remove_repeated_words(df["statement"])
df["statement"] = tokenize(df["statement"])
df["statement"] = apply_stemming(df["statement"])
df["statement"] = apply_lemmatization(df["statement"])
df["statement"].head(5)

### 5.2. Preparing the subject(s) column

In [None]:
# Let us define some helper functions to allow us to prepare the subject(s) column
from fuzzywuzzy import fuzz

''' 
Uses the fuzzywuzzy library to group similar values together based on a similarity threshold.

Parameters:
- column(pd.Dataframe): The dataframe column to be processed 

Returns:
- column(pd.Dataframe): The processed dataframe column
'''
def group_by_similar_lists(column, threshold=80):
    unique_values = column.dropna().unique()
    groups = {}

    # Convert list of strings to a single string for comparison
    def list_to_string(lst):
        return ''.join(lst)

    # Iterate through each unique value (list of strings) and assign it to a group
    for value in unique_values:
        found_group = False
        str_value = list_to_string(value)
        
        for group in groups:
            # If the value is similar to the representative value of the group, add it to the group
            if fuzz.ratio(str_value, group) > threshold:
                groups[group].append(value)
                found_group = True
                break
        
        # If no similar group was found, create a new group
        if not found_group:
            groups[str_value] = [value]

    # Map each original value to its corresponding group
    def map_to_group(value):
        str_value = list_to_string(value)
        for group, group_values in groups.items():
            if any(fuzz.ratio(str_value, list_to_string(existing_value)) > threshold for existing_value in group_values):
                return group
        return str_value

    return column.apply(map_to_group)

In [None]:
# Uncomment the below for debugging. Do not forget to comment again so to not mess with our data samples!
# df = pd.read_csv("datasets/Liar_Dataset.csv")

df["subject(s)"] = remove_stopwords(df["subject(s)"])
df["subject(s)"] = remove_special_characters(df["subject(s)"])
df["subject(s)"] = remove_repeated_words(df["subject(s)"])

df["subject(s)"].value_counts()

In [None]:
# The output above shows that, after removing the stopwords, special characters and repeated words from each row in the subject(s) column, we are left with
# 4533 unique values. Let us group similar values so that the model does not have to deal with thousands of near-identical categories.

df["subject(s)"] = group_by_similar_lists(df["subject(s)"])

df["subject(s)"].value_counts()
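
The grouping idea can be sketched without fuzzywuzzy: each value is compared with the representative of every existing group and joins the first one whose similarity exceeds the threshold, otherwise it starts a new group. The sketch below uses `difflib.SequenceMatcher` from the standard library as a stand-in for `fuzz.ratio` (an assumption, the two scores are not guaranteed to be identical), and the labels are made up.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Similarity score between 0 and 100, used here in place of fuzz.ratio."""
    return 100 * SequenceMatcher(None, a, b).ratio()

def group_similar(values, threshold=80):
    """Map every value to the representative of the first group it resembles."""
    representatives = []
    mapping = {}
    for value in values:
        for rep in representatives:
            if similarity(value, rep) > threshold:
                mapping[value] = rep
                break
        else:
            # No existing group is close enough: the value starts its own group
            representatives.append(value)
            mapping[value] = value
    return mapping

labels = ["health care", "health-care", "taxes", "healthcare", "education"]
mapping = group_similar(labels)
print(mapping)
```

Because a value joins the first matching group, the result depends on the order of the values. Every new value is also compared against all groups, so the cost grows quickly with the number of distinct values.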

In [None]:
# now let us plot the top 10 values in our subject(s) list
plot_pie("subject(s)", 10)

### 5.3 Preparing the speaker column

In [None]:
# Let us see how many distinct speakers we have (3308)
df['speaker'].value_counts()

In [None]:
# Now let us remove the stopwords, special characters and repeated words, and then group similar values.

df['speaker'] = remove_stopwords(df['speaker'])
df['speaker'] = remove_special_characters(df['speaker'])
df['speaker'] = remove_repeated_words(df['speaker'])
df['speaker'] = group_by_similar_lists(df['speaker'])

# Now we ended up with 3120 values.
df['speaker'].value_counts()

### 5.4 Preparing the speaker's job title column

In [None]:
# Let us see how many values we have here in our speaker's job title column. We have 1355 values, 
# let us try to apply our pipeline in order to reduce its dimensionality and remove unneeded words.
df["speaker's job title"].value_counts()

In [None]:
# Helper functions for the speaker's job title

def tokenize_column(column):
    # Cast to str first so that non-string values do not break word_tokenize
    return [word_tokenize(value) for value in column.apply(str)]

def remove_stopwords_from_speakers_job_title(speakers_job_title_list):
    # Load the stopword list once instead of once per word
    stopwords_set = set(stopwords.words('english'))
    return [
        [word for word in word_list if word.lower() not in stopwords_set]
        for word_list in speakers_job_title_list
    ]

def apply_stemming_to_speakers_job_title(df, column_name, speakers_token):
    ps = PorterStemmer()
    # Build one stemmed, space-joined string per row, in the original row order
    df[column_name] = [' '.join(ps.stem(w) for w in words) for words in speakers_token]
    return df[column_name]


# Uncomment below to debug.
'''
df = pd.read_csv("datasets/Liar_Dataset.csv")

speakers_token = tokenize_column(df["speaker's job title"])
speakers_token = remove_stopwords_from_speakers_job_title(speakers_token)

df["speaker's job title"] = apply_stemming_to_speakers_job_title(df, "speaker's job title", speakers_token)
df["speaker's job title"] = group_by_similar_lists(df["speaker's job title"])

df["speaker's job title"].head(10)
df["speaker's job title"].value_counts().head(12).plot(kind='bar')
'''

In [None]:
# Now let us apply the functions above to the column

speakers_token = tokenize_column(df["speaker's job title"])
speakers_token = remove_stopwords_from_speakers_job_title(speakers_token)

df["speaker's job title"] = apply_stemming_to_speakers_job_title(df, "speaker's job title", speakers_token)
df["speaker's job title"] = group_by_similar_lists(df["speaker's job title"])

plot_pie("speaker's job title", 10)

In [None]:
# Now let us count how many values we have:
df["speaker's job title"].value_counts()

### 5.5 Preparing the venue column

In [None]:
df['venue'].value_counts()

In [None]:
# Defining some helper function for the venue column

def remove_stopwords_from_venue(venue_token_lists):
    # Load the stopword list once instead of once per word
    stopwords_set = set(stopwords.words('english'))
    return [
        [word for word in word_list if word.lower() not in stopwords_set]
        for word_list in venue_token_lists
    ]

def apply_stemming_to_venue(df, column_name, tokens):
    ps = PorterStemmer()
    # Build one stemmed, space-joined string per row, in the original row order
    df[column_name] = [' '.join(ps.stem(w) for w in words) for words in tokens]
    return df[column_name]
    

# Uncomment below to debug.
'''
df = pd.read_csv("datasets/Liar_Dataset.csv")

venues_tokens = tokenize_column(df['venue'])
venues_tokens = remove_stopwords_from_venue(venues_tokens)

df['venue'] = apply_stemming_to_venue(df, 'venue', venues_tokens)

df['venue'].value_counts()
'''

In [None]:
# Now let us tokenize the venue column, remove the stopwords and apply stemming.
venues_tokens = tokenize_column(df['venue'])
venues_tokens = remove_stopwords_from_venue(venues_tokens)

df['venue'] = apply_stemming_to_venue(df, 'venue', venues_tokens)

# Now we can see that we have 4591 values left in the column
df['venue'].value_counts()

<div class="alert alert-block alert-info">  
    <center><h2><strong>Feature extraction</strong></h2></center>

<div class="alert alert-block alert-warning">  
<b>In this section we convert the raw text data from our features into vectors that we can later feed to our machine learning models. We do this using the TF-IDF (Term Frequency-Inverse Document Frequency) technique, which weights the count of each term by its importance: terms that are frequent in a document but rare across the entire dataset get the highest weights.</b>
</div>
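
To make the weighting concrete, here is a minimal hand-rolled sketch on three made-up sentences. It assumes the smoothed formula idf(t) = ln((1 + n) / (1 + df(t))) + 1, which is what the scikit-learn documentation describes as its default. It leaves out the normalization and sublinear scaling that TfidfVectorizer can also apply, so the numbers will not match the vectorizer's output exactly.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per document, using raw tf * smoothed idf."""
    n_docs = len(docs)
    tokenized = [doc.lower().split() for doc in docs]
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {
        term: math.log((1 + n_docs) / (1 + freq)) + 1
        for term, freq in doc_freq.items()
    }
    return [
        {term: count * idf[term] for term, count in Counter(tokens).items()}
        for tokens in tokenized
    ]

docs = ["taxes went up", "taxes went down", "crime went down"]
weights = tfidf(docs)
for doc, doc_weights in zip(docs, weights):
    print(doc, {term: round(w, 3) for term, w in doc_weights.items()})
```

The word "went" appears in every document and gets the lowest weight, while "up" appears only once in the corpus and gets the highest.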

## 6. Feature extraction

In [None]:
# First let us define some helper functions for the vectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

# Let us instantiate our vectorizer using TF IDF vectorizer
word_vectorizer = TfidfVectorizer(
        sublinear_tf=True,
        strip_accents='unicode',
        analyzer='word',
        token_pattern=r'\w{1,}',
        ngram_range=(3, 3),
        max_features =5000)

def vectorize_column(column):
    return word_vectorizer.fit_transform(column.astype('str'))
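
With `ngram_range=(3, 3)` the vectorizer keeps only sequences of exactly three consecutive tokens, so values with fewer than three tokens contribute no features at all. The sketch below reproduces just the n-gram step with whitespace tokenization (a simplification of the vectorizer's own tokenizer, and the helper name `word_ngrams` is hypothetical).

```python
def word_ngrams(text, n):
    """Return all runs of n consecutive whitespace-separated tokens."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

print(word_ngrams("a press release today", 3))
print(word_ngrams("republican", 3))
print(word_ngrams("republican", 1))
```

For short categorical columns such as the party affiliation, a range like `(1, 1)` or `(1, 3)` may therefore be a better fit than trigrams only.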

### 6.1 Feature extraction on the statements column

In [None]:
statements_vectors = vectorize_column(df['statement'])
statements_vectors = statements_vectors.toarray()

# Get the feature names (the n-grams) learned by the vectorizer
output_feature_names = word_vectorizer.get_feature_names_out()

statement_feature_df=pd.DataFrame(np.round(statements_vectors, 1), columns=output_feature_names)
statement_feature_df.head()

In [None]:
# Adding the vectorized statement column to the dataframe
df = pd.concat([df, statement_feature_df], axis=1)

### 6.2 Feature extraction on the subject(s) column

In [None]:
subjects_vectors = vectorize_column(df["subject(s)"])
subjects_vectors = subjects_vectors.toarray()

# Get output feature names for transformation.
output_feature_names_subject = word_vectorizer.get_feature_names_out()

subjects_feature_df=pd.DataFrame(np.round(subjects_vectors, 1), columns=output_feature_names_subject)
subjects_feature_df.head()

### 6.3 Feature extraction on the speaker's job title column

In [None]:
speakers_job_title_vectors = vectorize_column(df["speaker's job title"])
speakers_job_title_vectors = speakers_job_title_vectors.toarray()

# Get output feature names for transformation.
output_feature_names_speakers_job_title = word_vectorizer.get_feature_names_out()

speakers_job_title_feature_df=pd.DataFrame(np.round(speakers_job_title_vectors, 1), columns=output_feature_names_speakers_job_title)
speakers_job_title_feature_df.head()

### 6.4 Feature extraction on the party affiliation column

In [None]:
party_affiliation_vectors = vectorize_column(df["party affiliation"])
party_affiliation_vectors = party_affiliation_vectors.toarray()

# Get output feature names for transformation.
output_feature_names_party_affiliation = word_vectorizer.get_feature_names_out()

party_affiliation_feature_df=pd.DataFrame(np.round(party_affiliation_vectors, 1), columns=output_feature_names_party_affiliation)
party_affiliation_feature_df.head()

### 6.5 Feature extraction on the venue column

In [None]:
venue_vectors = vectorize_column(df["venue"])
venue_vectors = venue_vectors.toarray()

# Get output feature names for transformation.
output_feature_names_venue = word_vectorizer.get_feature_names_out()

venue_feature_df=pd.DataFrame(np.round(venue_vectors, 1), columns=output_feature_names_venue)
venue_feature_df.head()

### 6.6 Adding the features to the dataframe so we can feed them to our model

In [None]:
speaker_vectors = pd.Categorical(df['speaker'])               
df['speaker']=speaker_vectors.codes

label_vectors = pd.Categorical(df['label'])               
df['label']=label_vectors.codes

state_info_vectors = pd.Categorical(df['state info'])               
df['state info']=state_info_vectors.codes

subjects_info_vectors = pd.Categorical(df["subject(s)"])               
df["subject(s)"]=subjects_info_vectors.codes

speakers_job_title_info_vectors = pd.Categorical(df["speaker's job title"])               
df["speaker's job title"]=speakers_job_title_info_vectors.codes

party_affiliation_info_vectors = pd.Categorical(df["party affiliation"])               
df["party affiliation"]=party_affiliation_info_vectors.codes

venue_info_vectors = pd.Categorical(df["venue"])               
df["venue"]=venue_info_vectors.codes

# Drop the raw statement column: its content is already represented by the TF-IDF columns added above, and keeping the text would only slow down our models
df.drop(columns=['statement'], inplace=True)

# Let us check the dataframe now with all the features already vectorized
df.head(50)
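
The integer codes produced by `pd.Categorical` follow the sorted order of the distinct values, and missing values get the code -1. Keeping the `categories` around makes it possible to translate predictions back into readable labels. A minimal sketch on made-up label values:

```python
import pandas as pd

# Hypothetical label values, including one missing entry
labels = pd.Series(["TRUE", "FALSE", "half-true", "FALSE", None])

categorical = pd.Categorical(labels)
codes = categorical.codes

# Lookup table from integer code back to the original label
code_to_label = dict(enumerate(categorical.categories))

print(list(codes))
print(code_to_label)
```

Note that the codes carry an artificial ordering: a model may read code 2 as "larger than" code 0 even though the categories have no such relationship, which is one reason one-hot encoding is often preferred for nominal features.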

<div class="alert alert-block alert-info">  
    <center><h2><strong>Building and evaluating our models</strong></h2></center>

<div class="alert alert-block alert-warning">  
<b>In this section we build and train our machine learning models and compare them using K-Fold cross-validation. We chose three supervised classification models from sklearn: Gaussian Naive Bayes, a neural network (MLPClassifier) and Logistic Regression. K-Fold cross-validation splits the data into train and test sets over 5 iterations, so that every row is used for testing exactly once.</b>
</div>
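
To show what the splitting does, the sketch below generates contiguous, unshuffled folds by hand. It mirrors the behaviour of `KFold` without shuffling as we understand it from the scikit-learn documentation (the first `n_samples % n_splits` folds get one extra sample); the function name `kfold_indices` is hypothetical.

```python
def kfold_indices(n_samples, n_splits):
    """Yield (train, test) index lists for contiguous, unshuffled folds."""
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        # The first `extra` folds are one sample larger than the rest
        size = base + (1 if fold < extra else 0)
        test = list(range(start, start + size))
        train = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train, test
        start += size

folds = list(kfold_indices(12, 5))
for number, (train, test) in enumerate(folds, start=1):
    print(f"Fold {number}: test={test}, train size={len(train)}")
```

Because the folds are contiguous, any ordering in the file (for example rows sorted by speaker or date) ends up in the folds as well; passing `shuffle=True` and a `random_state` to `KFold` avoids that.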

## 7. Machine Learning models

In [None]:
# Let us define some helper variables and functions in order to better evaluate our models

# Importing the necessary libraries
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import KFold
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import time
from sklearn.neural_network import MLPClassifier
from sklearn.linear_model import LogisticRegression

# Global variables
kfold_splits_5 = 5



### Gaussian Naive Bayes

<div class="alert alert-block alert-warning">  
<b>We start with Gaussian Naive Bayes, trained and evaluated with the K-Fold cross-validation described above.</b>
</div>

In [None]:
# Initialize Naive Bayes model
nb = GaussianNB(var_smoothing=1e-08)

# Split features and labels (the label column must not be part of the features)
X = df.drop(columns=['label']).values
y = df['label'].values

# Set up K-Fold cross-validation
kf = KFold(n_splits=kfold_splits_5)
outcomes = []
conf_matrix_list = []

# Cross-validation loop
for fold, (train_index, test_index) in enumerate(kf.split(X, y), start=1):
    print(f"KFold Split: {fold}")
    print(f"Train indices: {train_index}")
    print(f"Test indices: {test_index}\n")
    
    # Split the data
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    
    # Train the model and measure time
    print('Running time of algorithm')
    start_time = time.time()
    nb.fit(X_train, y_train)
    elapsed_time = time.time() - start_time
    print(f"Elapsed time: {elapsed_time:.4f} seconds")
    
    # Make predictions
    predictions = nb.predict(X_test)
    
    # Calculate accuracy
    accuracy = accuracy_score(y_test, predictions)
    outcomes.append(accuracy)
    print(f"Accuracy of KFold {fold} is: {accuracy}\n")
    
    # Generate and print classification report
    print(f"Classification Report of KFold {fold} is following:\n")
    classification_rep = classification_report(y_test, predictions)
    print(classification_rep)
    
    # Generate and print confusion matrix
    print(f"Confusion Matrix of KFold {fold} is following:\n")
    confusion_mat = confusion_matrix(y_test, predictions)
    conf_matrix_list.append(confusion_mat)
    print(confusion_mat)
    print('\n' + '='*50 + '\n')

# Print the average results
mean_accuracy = np.mean(outcomes)
print(f"Total Average Accuracy of Naive Bayes is: {mean_accuracy:.4f}")
print("\nAverage Confusion Matrix:\n")
avg_conf_matrix = np.mean(conf_matrix_list, axis=0)
print(avg_conf_matrix)

<div class="alert alert-block alert-warning">  
<b>By analysing the report above we conclude the following.</b>
<ul>
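<li>1. The reported average accuracy of its predictions was 0.9984. A score this high on a six-class problem most likely reflects label leakage (the label column being part of the feature matrix) rather than genuine predictive power, so it should be re-measured with the label excluded from X before it is used as a baseline for the other models.</li>
</ul>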
</div>
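
Positional slicing such as `df.iloc[:, :-1]` only excludes the target when the target is the last column. The sketch below shows the difference on a toy frame whose label is the first column (the column names and values are made up for illustration).

```python
import pandas as pd

# Toy frame: the label is the FIRST column, as in our dataframe after cleaning
toy = pd.DataFrame({
    "label": [0, 1, 0, 1],
    "feature_a": [0.1, 0.9, 0.2, 0.8],
    "feature_b": [5, 3, 4, 1],
})

# Positional slicing drops the last column, so the label stays in X
X_positional = toy.iloc[:, :-1]

# Dropping by name removes exactly the target column
X_by_name = toy.drop(columns=["label"])
y = toy["label"]

print(list(X_positional.columns))
print(list(X_by_name.columns))
```

When the label is part of X, a model can simply read the answer from its input, which produces near-perfect scores that say nothing about real performance.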

### Neural Networks

In [None]:
# Initialize MLPClassifier model
mlp = MLPClassifier(hidden_layer_sizes=(100,), max_iter=300, alpha=0.0001, solver='adam', random_state=1)

# Split features and labels (the label column must not be part of the features)
X = df.drop(columns=['label']).values
y = df['label'].values

# Set up K-Fold cross-validation
kf = KFold(n_splits=kfold_splits_5)
outcomes = []
conf_matrix_list = []

# Cross-validation loop
for fold, (train_index, test_index) in enumerate(kf.split(X, y), start=1):
    print(f"KFold Split: {fold}")
    print(f"Train indices: {train_index}")
    print(f"Test indices: {test_index}\n")
    
    # Split the data
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    
    # Train the model and measure time
    print('Running time of algorithm')
    start_time = time.time()
    mlp.fit(X_train, y_train)
    elapsed_time = time.time() - start_time
    print(f"Elapsed time: {elapsed_time:.4f} seconds")
    
    # Make predictions
    predictions = mlp.predict(X_test)
    
    # Calculate accuracy
    accuracy = accuracy_score(y_test, predictions)
    outcomes.append(accuracy)
    print(f"Accuracy of KFold {fold} is: {accuracy}\n")
    
    # Generate and print classification report
    print(f"Classification Report of KFold {fold} is following:\n")
    classification_rep = classification_report(y_test, predictions)
    print(classification_rep)
    
    # Generate and print confusion matrix
    print(f"Confusion Matrix of KFold {fold} is following:\n")
    confusion_mat = confusion_matrix(y_test, predictions)
    conf_matrix_list.append(confusion_mat)
    print(confusion_mat)
    print('\n' + '='*50 + '\n')

# Print the average results
mean_accuracy = np.mean(outcomes)
print(f"Total Average Accuracy of MLP Classifier is: {mean_accuracy:.4f}")
print("\nAverage Confusion Matrix:\n")
avg_conf_matrix = np.mean(conf_matrix_list, axis=0)
print(avg_conf_matrix)

<div class="alert alert-block alert-warning">  
<b>By analysing the report above we conclude the following.</b>
<ul>
<li>1. The average accuracy of its predictions was 0.5789. We compare this figure against the Naive Bayes baseline above.</li>
</ul>
</div>
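
For reference, the accuracy and the confusion matrix printed in each fold can be computed by hand. The sketch below uses the same convention as `confusion_matrix` in scikit-learn (rows are true classes, columns are predicted classes) on made-up predictions for three classes.

```python
def confusion_matrix_counts(y_true, y_pred, n_classes):
    """Count how often true class i was predicted as class j."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for true_label, predicted_label in zip(y_true, y_pred):
        matrix[true_label][predicted_label] += 1
    return matrix

def accuracy_from_matrix(matrix):
    """Accuracy is the share of samples on the diagonal."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total

# Hypothetical true and predicted class codes
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]

matrix = confusion_matrix_counts(y_true, y_pred, n_classes=3)
acc = accuracy_from_matrix(matrix)
for row in matrix:
    print(row)
print(f"Accuracy: {acc:.4f}")
```

The off-diagonal cells show which classes get confused with each other, which a single accuracy number hides.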

### Logistic Regression

In [None]:
import numpy as np
from sklearn.model_selection import KFold
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.linear_model import LogisticRegression
import time

# Initialize Logistic Regression model
log_reg = LogisticRegression(solver='lbfgs', max_iter=7600, random_state=1)

# Split features and labels (the label column must not be part of the features)
X = df.drop(columns=['label']).values
y = df['label'].values

# Set up K-Fold cross-validation
kf = KFold(n_splits=kfold_splits_5)
outcomes = []
conf_matrix_list = []

# Cross-validation loop
for fold, (train_index, test_index) in enumerate(kf.split(X, y), start=1):
    print(f"KFold Split: {fold}")
    print(f"Train indices: {train_index}")
    print(f"Test indices: {test_index}\n")
    
    # Split the data
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    
    # Train the model and measure time
    print('Running time of algorithm')
    start_time = time.time()
    log_reg.fit(X_train, y_train)
    elapsed_time = time.time() - start_time
    print(f"Elapsed time: {elapsed_time:.4f} seconds")
    
    # Make predictions
    predictions = log_reg.predict(X_test)
    
    # Calculate accuracy
    accuracy = accuracy_score(y_test, predictions)
    outcomes.append(accuracy)
    print(f"Accuracy of KFold {fold} is: {accuracy}\n")
    
    # Generate and print classification report
    print(f"Classification Report of KFold {fold} is following:\n")
    classification_rep = classification_report(y_test, predictions)
    print(classification_rep)
    
    # Generate and print confusion matrix
    print(f"Confusion Matrix of KFold {fold} is following:\n")
    confusion_mat = confusion_matrix(y_test, predictions)
    conf_matrix_list.append(confusion_mat)
    print(confusion_mat)
    print('\n' + '='*50 + '\n')

# Print the average results
mean_accuracy = np.mean(outcomes)
print(f"Total Average Accuracy of Logistic Regression is: {mean_accuracy:.4f}")
print("\nAverage Confusion Matrix:\n")
avg_conf_matrix = np.mean(conf_matrix_list, axis=0)
print(avg_conf_matrix)

<div class="alert alert-block alert-warning">  
<b>By analysing the report above we conclude the following.</b>
<ul>
<li>1. The average accuracy of its predictions was 0.5789. We compare this figure against the Naive Bayes baseline and the MLP result above.</li>
</ul>
</div>

### Model evaluation

In [None]:
# TODO: Use Wikipedia pip module in order to evaluate the model precision

In [None]:
import wikipedia 
 
# Fetch the Wikipedia page we will use as sample text
result = wikipedia.page(title="Elections_in_the_United_States")
# result.content
 
# Print the "Criticisms" section of the page
print(result.section("Criticisms")) 

In [None]:
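import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Example text to classify
random_text = result.section("Criticisms")

# Step 1: Train a text-only model. The classifiers above were trained on the full
# feature matrix (metadata codes plus statement TF-IDF columns), so they cannot score
# a lone piece of text: the vectorizer and the classifier must share one feature space.
raw_df = pd.read_csv("datasets/Liar_Dataset.csv")
text_model = make_pipeline(
    TfidfVectorizer(sublinear_tf=True, strip_accents='unicode', stop_words='english', max_features=5000),
    LogisticRegression(solver='lbfgs', max_iter=7600, random_state=1),
)
text_model.fit(raw_df["statement"].astype(str), raw_df["label"].astype(str))

# Step 2: Vectorize the input text and make the prediction (the pipeline does both)
prediction = text_model.predict([random_text])

# Optionally, get the probability of each class
prediction_proba = text_model.predict_proba([random_text])

# Output the result
print(f"Predicted class: {prediction[0]}")
print(f"Prediction probabilities: {dict(zip(text_model.classes_, prediction_proba[0]))}")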
