This code reads in two CSV files named 'test.csv' and 'train.csv', cleans the data by filling in missing values with empty strings, creates a new column in the training dataframe called 'text_corpus' by concatenating the 'author', 'title', and 'text' columns, and then creates several dataframes for different columns in the training dataframe. It also creates a date table for article dates and loads all the dataframes into a PostgreSQL database using SQLAlchemy.

After loading the data into the database, it uses pd.read_sql_query() to query the database and display the first 15 rows of each table.

#### Dataset used - https://www.kaggle.com/fake-news/data

### Dataset Description

train.csv: A full training dataset with the following attributes:

* id: unique id for a news article
* title: the title of a news article
* author: author of the news article
* text: the text of the article; could be incomplete
* label: a label that marks the article as potentially unreliable
  * 1: FAKE
  * 0: TRUE


First, the code downloads and loads all stop words from the NLTK corpus. Then, it establishes a connection to a PostgreSQL database named "Project_4" using the psycopg2 library. A SQL query is executed to retrieve data from two tables named "article_id" and "text_corpus" in the database. The data is limited to the first 1000 rows using the LIMIT keyword. The results are stored in a Pandas DataFrame named "news_dataset". Finally, the database connection is closed using the close() method.

Set the dependencies

In [233]:
# Import necessary libraries and packages
import numpy as np
import pandas as pd
import psycopg2
from sqlalchemy import create_engine
from sqlalchemy import inspect
from api_keys import postgres_p
import matplotlib.pyplot as plt
import re 
import nltk 
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression, PassiveAggressiveClassifier
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report, confusion_matrix
from sklearn import metrics
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
import itertools
import pickle
import winsound

The Natural Language Toolkit (NLTK) is a platform for building Python programs that work with human language data, for use in statistical natural language processing (NLP). It contains text processing libraries for tokenization, parsing, classification, stemming, tagging and semantic reasoning.
The nltk.corpus package defines a collection of corpus reader classes, which can be used to access the contents of a diverse set of corpora. The list of available corpora is given at: https://www.nltk.org/nltk_data/ Each corpus reader class is specialized to handle a specific corpus format.

Load and check the stop words (STW). Stop words are words that occur frequently in a corpus, e.g. a, an, the, in. They are removed from the corpus as part of text normalization.

In [234]:
# Download and load all stop words
nltk.download('stopwords')
stop_words = stopwords.words('english')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\jonow\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
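As a rough illustration of what stop-word removal does to a sentence, the sketch below filters tokens against a small hand-picked set. The set `SAMPLE_STOP_WORDS` and the helper `remove_stop_words` are hypothetical names used only for this example; the notebook itself relies on NLTK's full English list.

```python
# Hypothetical, hand-picked subset; the notebook uses stopwords.words('english') instead.
SAMPLE_STOP_WORDS = {"a", "an", "the", "in", "is", "of"}


def remove_stop_words(text, stop_words=SAMPLE_STOP_WORDS):
    """Return the lowercased tokens of `text` that are not stop words."""
    return [token for token in text.lower().split() if token not in stop_words]


filtered = remove_stop_words("The aide is in the House")
print(filtered)
```

Using a set rather than a list keeps each membership check cheap, which matters once the filter runs over every word of every article.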


In [235]:
# check if Stopwords loaded in english
print(stopwords.words('english'))

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

Data Pre-processing and Analysis


Regular Expression Syntax. A regular expression (or RE) specifies a set of strings that matches it; the functions in the re module let you check if a particular string matches a given regular expression (or if a given regular expression matches a particular string, which comes down to the same thing).
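A minimal sketch of the kind of regular-expression cleanup used later in this notebook: every character that is not an ASCII letter is replaced by a space, and the result is lowercased. The helper name `keep_letters` and the sample headline are assumptions made for illustration.

```python
import re


def keep_letters(text):
    """Replace every character outside a-z/A-Z with a space, then tidy whitespace."""
    letters_only = re.sub(r"[^a-zA-Z]", " ", text)
    # split() with no arguments collapses the runs of spaces left behind
    return " ".join(letters_only.lower().split())


cleaned = keep_letters("15 Civilians Killed, 2016!")
print(cleaned)
```

Note that this pattern also discards digits and accented letters, which may or may not be desirable depending on the corpus.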

In [236]:
# Establish a connection to the PostgreSQL database
conn = psycopg2.connect(database="Project_4", user="postgres", password=postgres_p) #host="your_host_address", port="your_port_number"

In [237]:
# SQL query to retrieve the data - Limit to 1000 records for testing and evaluation
query = "SELECT a.article_id, a.article_label, t.text_corpus FROM article_id a  JOIN text_corpus t ON a.article_id = t.article_id LIMIT 1000"

In [238]:
# Execute the query and store the results in a Pandas DataFrame
news_dataset = pd.read_sql_query(query, conn)

In [239]:
# Close the database connection
conn.close()
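The connect, query and close sequence above can be tried without a PostgreSQL server. The sketch below swaps in Python's built-in `sqlite3` with an in-memory database; the table and column names mirror the query above, while the three sample rows are invented for illustration.

```python
import sqlite3

# In-memory SQLite database standing in for psycopg2.connect(...)
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE article_id (article_id INTEGER PRIMARY KEY, article_label INTEGER);
    CREATE TABLE text_corpus (article_id INTEGER, text_corpus TEXT);
    INSERT INTO article_id VALUES (0, 1), (1, 0), (2, 1);
    INSERT INTO text_corpus VALUES (0, 'first text'), (1, 'second text'), (2, 'third text');
    """
)

# Same JOIN shape as the notebook query, with an ORDER BY added so LIMIT is deterministic
query = (
    "SELECT a.article_id, a.article_label, t.text_corpus "
    "FROM article_id a JOIN text_corpus t ON a.article_id = t.article_id "
    "ORDER BY a.article_id LIMIT 2"
)
rows = conn.execute(query).fetchall()
conn.close()
print(rows)
```

Without an `ORDER BY`, a database is free to return any rows that satisfy the `LIMIT`, so adding one is worth considering if the 1000-row sample needs to be reproducible.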

In [241]:
# Check dataset
news_dataset.head()

Unnamed: 0,article_id,article_label,text_corpus
0,0,1,Darrell Lucus House Dem Aide: We Didn’t Even S...
1,1,0,"Daniel J. Flynn FLYNN: Hillary Clinton, Big Wo..."
2,2,1,Consortiumnews.com Why the Truth Might Get You...
3,3,1,Jessica Purkiss 15 Civilians Killed In Single ...
4,4,1,Howard Portnoy Iranian woman jailed for fictio...


In [None]:
# Now we will separate the data and label i.e. text_corpus and label fields
X = news_dataset['text_corpus']
Y = news_dataset['article_label']

In [None]:
# Define a function for stemming the content
port_stem = PorterStemmer()
# Build the stop-word set once; calling stopwords.words('english') for every word is very slow
english_stop_words = set(stopwords.words('english'))
def stemming(content):
    # Keep only alphabetic characters (upper and lower case); digits and punctuation are replaced by a space
    stemmed_content = re.sub('[^a-zA-Z]',' ',content)
    # Convert all letters to lowercase
    stemmed_content = stemmed_content.lower()
    # Split the text into a list of words
    stemmed_content = stemmed_content.split()
    # Drop stop words, then reduce each remaining word to its stem
    stemmed_content = [port_stem.stem(word) for word in stemmed_content if word not in english_stop_words]
    # Join the stemmed words back into a single string
    stemmed_content = ' '.join(stemmed_content)
    return stemmed_content

In [None]:
# Apply stemming to the text_corpus column
X = X.apply(stemming)

In [None]:
# Play a sound to let you know it's done (winsound is only available on Windows)
duration = 2000  # milliseconds
freq = 440  # Hz
winsound.Beep(freq, duration)

In [None]:
# Print the X and Y variables
print(X)
print(Y)

In [None]:
# Create a single instance of CountVectorizer
vectorizer = CountVectorizer(stop_words='english')

In [None]:
vectorizer.fit(X)

In [None]:
X_transformed = vectorizer.transform(X)

In [None]:
pickle.dump(vectorizer, open('../Pickles/tfidfvect2.pkl', 'wb'))

In [None]:
TEST_model = pickle.load(open('../Pickles/tfidfvect2.pkl', 'rb'))

print(TEST_model)

In [None]:
print(X_transformed)
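To make the sparse output above easier to read, here is a small sketch of what `CountVectorizer` produces for two toy documents. The documents and the variable names are invented for illustration, and default settings are used rather than the `stop_words='english'` option from the notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer

toy_docs = ["fake news story", "true news"]  # hypothetical documents
toy_vectorizer = CountVectorizer()
toy_counts = toy_vectorizer.fit_transform(toy_docs)

# vocabulary_ maps each term to its column; columns follow alphabetical order
vocabulary = sorted(toy_vectorizer.vocabulary_)
dense_counts = toy_counts.toarray().tolist()

print(vocabulary)
print(dense_counts)
```

Each row is one document and each column one vocabulary term, so the shared word "news" shows up as a 1 in the same column of both rows.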

---

Modeling & Model Evaluation

### Splitting the data into test and train datasets

In [None]:
# Splitting the data into test and train datasets
X_train_transformed, X_test, Y_train, Y_test = train_test_split(X_transformed, Y, test_size=0.18, random_state=42)
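A quick way to see what `test_size=0.18` means for the 1000 rows loaded earlier is to split a placeholder array of the same length. The arrays `features` and `labels` below are synthetic stand-ins, not the notebook's data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

features = np.arange(1000).reshape(-1, 1)  # synthetic stand-in for the 1000 vectorized rows
labels = np.arange(1000) % 2               # synthetic alternating 0/1 labels

f_train, f_test, l_train, l_test = train_test_split(
    features, labels, test_size=0.18, random_state=42
)
split_sizes = (len(f_train), len(f_test))
print(split_sizes)

# Re-running with the same random_state reproduces the same partition
f_train_again, _, _, _ = train_test_split(
    features, labels, test_size=0.18, random_state=42
)
same_partition = bool(np.array_equal(f_train, f_train_again))
print(same_partition)
```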

We train two models, compare their accuracy, and then select the more accurate one to use in HEREKO.
The first model: Logistic Regression

In [None]:
# # Training the model
# logisticreg_model = LogisticRegression(random_state=42)

# logisticreg_model.fit(X_train_transformed, Y_train)

In [None]:
# Define the parameter grid for Logistic Regression
logreg_param_grid = {
    'C': np.logspace(-4, 4, 20),
    'penalty': ['l1', 'l2'],
    'solver': ['liblinear']
}
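It can help to know how large this grid is before running the search. The sketch below enumerates the same combinations with `itertools.product`; the fold count of 5 is taken from the `cv=5` argument used in the grid-search cell, and the variable names are chosen for this example only.

```python
import itertools

import numpy as np

c_values = np.logspace(-4, 4, 20)  # 20 values from 1e-4 to 1e4, evenly spaced on a log scale
penalties = ["l1", "l2"]
solvers = ["liblinear"]

candidates = list(itertools.product(c_values, penalties, solvers))
n_candidates = len(candidates)
n_cv_fits = n_candidates * 5  # one fit per candidate per fold, assuming cv=5

print(n_candidates, n_cv_fits)
```

With the default `refit=True`, the search trains one additional model on the full input after the cross-validated fits.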

In [None]:
# Initialize the model
logreg = LogisticRegression()

In [None]:
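# Perform GridSearchCV for Logistic Regression on the training split only,
# so the held-out test rows do not influence the choice of hyperparameters
logreg_grid_search = GridSearchCV(logreg, logreg_param_grid, cv=5, verbose=0)
logreg_grid_search.fit(X_train_transformed, Y_train)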

In [None]:

# Get the best parameters
best_logreg_params = logreg_grid_search.best_params_

In [None]:
# Train the model with best parameters
best_logreg_model = LogisticRegression(**best_logreg_params)

In [None]:
# Calculate the accuracies and record the changes
logreg_accuracies = logreg_grid_search.cv_results_['mean_test_score']

In [None]:
best_logreg_model.fit(X_train_transformed, Y_train)

In [None]:
# # Model 1: Logistic Regression
y_pred1 = best_logreg_model.predict(X_test)
accuracy1 = np.mean(y_pred1 == Y_test) * 100

print("Logistic Regression Model Results")
print("----------------------------------")
print("Prediction accuracy: {:.2f}%".format(accuracy1))
print("\nClassification Report:")
print("--------------------------------------------")
print(classification_report(Y_test, y_pred1))
print("Confusion Matrix:")
print(confusion_matrix(Y_test, y_pred1))
print("\n")

In [None]:
# Plot changes
plt.figure(figsize=(12, 6))
plt.plot(logreg_accuracies, label="Logistic Regression", linestyle="-", marker="o")
plt.xlabel("Parameter Set")
plt.ylabel("Accuracy")
plt.title("Logistic Regression Model Optimization")
plt.legend()
plt.savefig("logreg_model_optimization.png")
plt.show()

### Model Evaluation

In [None]:
# Accuracy Score on Training Data (accuracy_score expects y_true first, then y_pred)
X_train_prediction = best_logreg_model.predict(X_train_transformed)
training_data_accuracy = accuracy_score(Y_train, X_train_prediction)

print('Accuracy score on the training data: ',training_data_accuracy)

# Accuracy Score on Test Data
X_test_prediction = best_logreg_model.predict(X_test)
test_data_accuracy = accuracy_score(Y_test, X_test_prediction)

print('Accuracy score on the test data: ',test_data_accuracy)

In [None]:
# Save the tuned Logistic Regression model (logisticreg_model is only defined in a commented-out cell)
pickle.dump(best_logreg_model, open('../Pickles/logisticreg_model.pkl', 'wb'))


In [None]:
# Classification report for test data
classification_report(Y_test, X_test_prediction)
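The accuracy, classification report and confusion matrix used above are all derived from four counts. The sketch below computes them by hand with NumPy on eight invented labels, using the same row and column layout as `sklearn.metrics.confusion_matrix` (rows are true classes, columns are predicted classes).

```python
import numpy as np

# Invented labels for illustration: 1 = FAKE, 0 = TRUE
y_true = np.array([1, 1, 1, 0, 0, 0, 0, 1])
y_pred = np.array([1, 0, 1, 0, 0, 1, 0, 1])

tp = int(np.sum((y_true == 1) & (y_pred == 1)))  # fake articles flagged as fake
tn = int(np.sum((y_true == 0) & (y_pred == 0)))  # true articles left alone
fp = int(np.sum((y_true == 0) & (y_pred == 1)))  # true articles wrongly flagged
fn = int(np.sum((y_true == 1) & (y_pred == 0)))  # fake articles missed

matrix = [[tn, fp], [fn, tp]]
accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp)
recall = tp / (tp + fn)

print(matrix, accuracy, precision, recall)
```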

**CLASSIFICATION MODEL : PASSIVE AGGRESSIVE CLASSIFIER**

* The Passive Aggressive Classifier stays passive when a sample is classified correctly (with sufficient margin) and responds aggressively, by updating its weights, to any misclassification.
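The passive and aggressive behaviour can be sketched as a single weight update. The code below is a minimal NumPy version of the PA-I rule with hinge loss, assuming labels encoded as -1/+1 and a non-zero feature vector; the function name `passive_aggressive_step` is hypothetical, and this is not scikit-learn's internal implementation.

```python
import numpy as np


def passive_aggressive_step(weights, x, y, C=1.0):
    """Apply one PA-I update for sample x (non-zero vector) with label y in {-1, +1}."""
    loss = max(0.0, 1.0 - y * float(np.dot(weights, x)))  # hinge loss
    if loss == 0.0:
        return weights  # passive: the margin is already satisfied
    tau = min(C, loss / float(np.dot(x, x)))  # aggressive step size, capped by C
    return weights + tau * y * x


sample = np.array([1.0, 0.0])
w_first = passive_aggressive_step(np.zeros(2), sample, +1)  # margin violated, so weights move
w_second = passive_aggressive_step(w_first, sample, +1)     # margin met, so weights stay put
print(w_first, w_second)
```

The cap `C` plays the same role as the `C` values searched in the parameter grid: smaller values make each corrective step more cautious.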

---

In [None]:
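# Create a function to preprocess and stem the text
english_stop_words = set(stopwords.words('english'))  # built once instead of once per word
def stemming(text):
    ps = PorterStemmer()
    # Keep letters only, lowercase, and split into words
    review = re.sub('[^a-zA-Z]', ' ', text)
    review = review.lower()
    review = review.split()
    # Drop stop words and stem the remaining words
    review = [ps.stem(word) for word in review if word not in english_stop_words]
    review = ' '.join(review)
    return review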


In [None]:

# Apply stemming to each text in the array
X_preprocessed = [stemming(text) for text in X]

In [None]:
# Fit and transform the preprocessed data
X_transformed = vectorizer.fit_transform(X_preprocessed)

In [None]:
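# Splitting dataset into train and test sets
X_train, X_test, Y_train, Y_test = train_test_split(X_transformed, Y, test_size=0.33, random_state=42)
# The Passive Aggressive cells below refer to this split as X2_*/Y2_*, so bind those names as well
X2_train, X2_test, Y2_train, Y2_test = X_train, X_test, Y_train, Y_test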


In [None]:
# Define the parameter grid for Passive Aggressive Classifier
pac_param_grid = {
    'C': np.logspace(-4, 4, 20),
    'loss': ['hinge', 'squared_hinge']
}

In [None]:
# Initialize the model
pac = PassiveAggressiveClassifier()

In [None]:
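# Perform GridSearchCV for Passive Aggressive Classifier on the training split only,
# so the held-out test rows do not influence the choice of hyperparameters
pac_grid_search = GridSearchCV(pac, pac_param_grid, cv=5, verbose=0)
pac_grid_search.fit(X_train, Y_train)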

In [None]:

# Get the best parameters
best_pac_params = pac_grid_search.best_params_

In [None]:
# Train the model with best parameters
best_pac_model = PassiveAggressiveClassifier(**best_pac_params)


In [None]:
# Calculate the accuracies and record the changes
pac_accuracies = pac_grid_search.cv_results_['mean_test_score']

In [None]:
best_pac_model.fit(X2_train, Y2_train)

In [None]:
# Model 2: Passive Aggressive Classifier
y_pred2 = best_pac_model.predict(X2_test)
accuracy2 = np.mean(y_pred2 == Y2_test) * 100

print("Passive Aggressive Classifier Model Results")
print("--------------------------------------------")
print("Prediction accuracy: {:.2f}%".format(accuracy2))
print("\nClassification Report:")
print("--------------------------------------------")
print(classification_report(Y2_test, y_pred2))
print("Confusion Matrix:")
print(confusion_matrix(Y2_test, y_pred2))

In [None]:
# Plot changes
plt.figure(figsize=(12, 6))
plt.plot(pac_accuracies, label="Passive Aggressive Classifier", linestyle="--", marker="x")
plt.xlabel("Parameter Set")
plt.ylabel("Accuracy")
plt.title("Passive Aggressive Model Optimization")
plt.legend()
plt.savefig("pac_model_optimization.png")
plt.show()

In [None]:
# Making prediction on test set
test_pred = best_pac_model.predict(X2_test)

In [None]:
# Save the vectorizer
pickle.dump(vectorizer, open('../Pickles/tfidf_vectorizer.pkl', 'wb'))

In [None]:
# Save the model
pickle.dump(best_pac_model, open('../Pickles/passive_aggressive_model.pkl', 'wb'))

In [None]:
# Use the trained models to make predictions on new data
vectorizer = pickle.load(open('../Pickles/tfidf_vectorizer.pkl', 'rb'))
passive_aggressive_model = pickle.load(open('../Pickles/passive_aggressive_model.pkl', 'rb'))
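The save and load steps above follow the usual pickle round trip. A minimal sketch is shown below with a plain dictionary standing in for the vectorizer or model, and a temporary directory in place of `../Pickles/`; both substitutions are assumptions made so the example runs anywhere.

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for a fitted vectorizer or model
artifact = {"vocabulary": {"fake": 0, "news": 1}, "C": 0.5}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "artifact.pkl")
    # 'with' closes the file handle even if dump or load raises
    with open(path, "wb") as handle:
        pickle.dump(artifact, handle)
    with open(path, "rb") as handle:
        restored = pickle.load(handle)

print(restored == artifact)
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.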

In [None]:
# Assuming X_new is a new text input
X_new = X[5]
X_new_preprocessed = stemming(X_new)
X_new_transformed = vectorizer.transform([X_new_preprocessed])

In [None]:

# Make prediction for the new text input using the trained model
prediction = passive_aggressive_model.predict(X_new_transformed)

In [None]:
# Print the prediction result
print("Prediction for the new text input: ", prediction[0])
if (prediction[0] == 0):
    print('Jono says it\'s True')
else:
    print('Johan Says it is a porky:)')

---

Testing the two models

In [None]:
# Fit and transform the input data
X_transformed = vectorizer.fit_transform(X)

# Splitting dataset into train and test sets
X2_train, X2_test, Y2_train, Y2_test = train_test_split(X_transformed, Y, test_size=0.33, random_state=42)

# Creating model
passiveagressive_model = PassiveAggressiveClassifier(C=0.5, random_state=5)

# Fitting model
passiveagressive_model.fit(X2_train, Y2_train)

# Making prediction on test set
test_pred = passiveagressive_model.predict(X2_test)

# Model evaluation
print(f"Test Set Accuracy : {accuracy_score(Y2_test, test_pred) * 100} %\n\n")

# # Save the vectorizer
# pickle.dump(vectorizer, open('../Pickles/tfidf_vectorizer.pkl', 'wb'))

# # Save the model
# pickle.dump(passiveagressive_model, open('../Pickles/passiveagressive_model.pkl', 'wb'))

In [None]:
y_pred = best_logreg_model.predict(X2_test)

# Calculate the prediction accuracy against the labels that belong to X2_test
accuracy = np.mean(y_pred == Y2_test) * 100

# Print the accuracy
print("Prediction accuracy: {:.2f}%".format(accuracy))

# Print the prediction for a single example
X_new = X2_test[5]
prediction = best_logreg_model.predict(X_new.reshape(1, -1))
print("Prediction for example 5: ", prediction[0])
if (prediction[0] == 0):
  print('Jono says its True')
else:
  print('Johan Says it is a porky:)')

In [None]:
y2_pred = best_pac_model.predict(X2_test)

# Calculate the prediction accuracy
accuracy = np.mean(y2_pred == Y2_test) * 100

# Print the accuracy
print("Prediction accuracy: {:.2f}%".format(accuracy))

# Print the prediction for a single example
X2_new = X2_test[5]
prediction2 = best_pac_model.predict(X2_new.reshape(1, -1))
print("Prediction for example 5: ", prediction2[0])
if (prediction2[0] == 0):
  print('Jono says its True')
else:
  print('Johan Says it is a porky:)')

In [None]:
news_dataset[10:11]

In [None]:
print(Y_test)

In [None]:
news_dataset[3:4]

In [None]:
# Assuming X is an array of text inputs
X_preprocessed = [stemming(text) for text in X]  # Apply stemming to each text in the array
X_vectorized = vectorizer.transform(X_preprocessed)  # Convert to numerical format using the trained vectorizer



In [None]:
pickled_model1 = pickle.load(open('../Pickles/logisticreg_model.pkl', 'rb'))
pickled_model1.predict(X_vectorized)

In [None]:
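# Load the Passive Aggressive model under the file name it was saved with earlier in this notebook
pickled_model2 = pickle.load(open('../Pickles/passive_aggressive_model.pkl', 'rb'))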
pickled_model2.predict(X2_test)

FAngo tested a single article to clarify the vector model

In [None]:
ps = PorterStemmer()

In [None]:
review = re.sub('[^a-zA-Z]', ' ', news_dataset['text_corpus'][10])
review = review.lower()
review = review.split()
review = [ps.stem(word) for word in review if not word in stopwords.words('english')]
review = ' '.join(review)
review

In [None]:
val = vectorizer.transform([review]).toarray()

In [None]:
tfidfvect2_model2 = pickle.load(open('../Pickles/tfidfvect2.pkl', 'rb'))


In [None]:
# Plot both models
plt.figure(figsize=(12, 6))
plt.plot(logreg_accuracies, label="Logistic Regression", linestyle="-", marker="o")
plt.plot(pac_accuracies, label="Passive Aggressive Classifier", linestyle="--", marker="x")
plt.xlabel("Parameter Set")
plt.ylabel("Accuracy")
plt.title("Model Optimization")
plt.legend()
plt.savefig("model_optimization.png")
plt.show()