This code reads in two CSV files named 'test.csv' and 'train.csv', cleans the data by filling in missing values with empty strings, creates a new column in the training dataframe called 'text_corpus' by concatenating the 'author', 'title', and 'text' columns, and then creates several dataframes for different columns in the training dataframe. It also creates a date table for article dates and loads all the dataframes into a PostgreSQL database using SQLAlchemy.

After loading the data into the database, it uses pd.read_sql_query() to query the database and display the first 15 rows of each table.

#### Dataset used - https://www.kaggle.com/fake-news/data

### Dataset Description

train.csv: A full training dataset with the following attributes:

* id: unique id for a news article
* title: the title of a news article
* author: author of the news article
* text: the text of the article; could be incomplete
* label: a label that marks the article as potentially unreliable
  * 1: FAKE
  * 0: TRUE


Ensure the file creator (pickle) is available so the models can be saved for use in the Heroku app

Set the dependencies

<table class="tfo-notebook-buttons" align="left">
  <td>
    <a target="_blank" href="https://colab.research.google.com/github/tensorflow/docs/blob/master/site/en/tutorials/quickstart/beginner.ipynb"><img src="https://www.tensorflow.org/images/colab_logo_32px.png" /> Run in Google Colab</a>
  </td>
  <td>
    <a target="_blank" href="https://github.com/jonowood/Project-4-A-Team/blob/JonBranch/ML/JONO_PREDICT_MASTER.ipynb"><img src="https://www.tensorflow.org/images/GitHub-Mark-32px.png" /> View source on GitHub</a>
  </td>
  <td>
    <a href="https://github.com/jonowood/Project-4-A-Team/blob/JonBranch/ML/JONO_PREDICT_MASTER.ipynb"><img src="https://www.tensorflow.org/images/download_logo_32px.png" /> Download notebook</a>
  </td>
</table>

In [1]:
import numpy as np
import pandas as pd
import psycopg2
from sqlalchemy import create_engine
from sqlalchemy import inspect
from api_keys import postgres_p

import matplotlib.pyplot as plt

import re 
import nltk 
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report
from sklearn import metrics
import itertools

import pickle


The Natural Language Toolkit (NLTK) is a platform for building Python programs that work with human language data, with a focus on statistical natural language processing (NLP). It contains text processing libraries for tokenization, parsing, classification, stemming, tagging and semantic reasoning.
The nltk.corpus package defines a collection of corpus reader classes, which can be used to access the contents of a diverse set of corpora. The list of available corpora is given at: https://www.nltk.org/nltk_data/ Each corpus reader class is specialized to handle a specific corpus format.

Load and test the stopwords. Stopwords are words that occur frequently in a corpus, e.g. a, an, the, in. These frequently occurring words are removed from the corpus for the purpose of text normalization.
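As a minimal illustration of stopword removal, the sketch below filters a sentence against a small hand-written stopword set. The names `SAMPLE_STOPWORDS` and `remove_stopwords` are hypothetical and exist only for this example; the notebook itself uses NLTK's `stopwords.words('english')`.

```python
# Hypothetical mini stopword list; the notebook uses NLTK's full English list.
SAMPLE_STOPWORDS = {"a", "an", "the", "in", "is", "of"}

def remove_stopwords(text, stopword_set=SAMPLE_STOPWORDS):
    """Lowercase the text, split on whitespace, and drop stopwords."""
    return [word for word in text.lower().split() if word not in stopword_set]

tokens = remove_stopwords("The author of the article is in a hurry")
print(tokens)
```

Only the content-bearing words remain, which is the form the later vectorization step works on.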

In [2]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\jonow\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [3]:
# Check that the English stopwords loaded

print(stopwords.words('english'))

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

Data Pre-processing and Analysis


Regular Expression Syntax. A regular expression (or RE) specifies a set of strings that matches it; the functions in this module let you check if a particular string matches a given regular expression (or if a given regular expression matches a particular string, which comes down to the same thing).
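A short sketch of the two ideas in that description: substituting everything a pattern matches, and checking whether a string matches a pattern. The sample strings are made up; the character class `[^a-zA-Z]` is the one the stemming function in this notebook relies on.

```python
import re

# Replace every character that is not an ASCII letter with a space.
raw = "Breaking: 15 civilians, 3 soldiers!"
letters_only = re.sub(r"[^a-zA-Z]", " ", raw)
words = letters_only.split()
print(words)

# Check whether a whole string matches a pattern.
is_word = re.fullmatch(r"[a-zA-Z]+", "civilians") is not None
is_number_word = re.fullmatch(r"[a-zA-Z]+", "15") is not None
print(is_word, is_number_word)
```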

# insert SQL here

In [4]:
# Establish a connection to the PostgreSQL database
conn = psycopg2.connect(database="Project_4", user="postgres", password=postgres_p) #host="your_host_address", port="your_port_number"

In [5]:
# SQL query to retrieve the data
query = "SELECT a.article_id, a.article_label, t.text_corpus FROM article_id a  JOIN text_corpus t ON a.article_id = t.article_id LIMIT 100"

In [6]:
# Execute the query and store the results in a Pandas DataFrame
news_dataset = pd.read_sql_query(query, conn)

In [7]:
# Close the database connection
conn.close()
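The connect, query, close sequence above can also be written so the connection is closed even when the query fails. The sketch below uses an in-memory SQLite database as a stand-in for PostgreSQL (an assumption made so it runs without a server), with made-up rows in tables shaped like the ones the query joins.

```python
import sqlite3
from contextlib import closing

# In-memory SQLite stands in for the PostgreSQL database (an assumption);
# the two tables mirror the ones joined by the notebook's query.
with closing(sqlite3.connect(":memory:")) as conn:
    conn.execute("CREATE TABLE article_id (article_id INTEGER PRIMARY KEY, article_label INTEGER)")
    conn.execute("CREATE TABLE text_corpus (article_id INTEGER, text_corpus TEXT)")
    conn.executemany("INSERT INTO article_id VALUES (?, ?)", [(0, 1), (1, 0)])
    conn.executemany(
        "INSERT INTO text_corpus VALUES (?, ?)",
        [(0, "made-up fake article"), (1, "made-up real article")],
    )
    query = (
        "SELECT a.article_id, a.article_label, t.text_corpus "
        "FROM article_id a JOIN text_corpus t ON a.article_id = t.article_id "
        "ORDER BY a.article_id LIMIT ?"
    )
    rows = conn.execute(query, (100,)).fetchall()

# The connection is already closed here, even if the query had raised.
print(rows)
```

Passing the row limit as a query parameter rather than formatting it into the string keeps the SQL text constant.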

In [8]:
news_dataset

Unnamed: 0,article_id,article_label,text_corpus
0,0,1,Darrell Lucus House Dem Aide: We Didn’t Even S...
1,1,0,"Daniel J. Flynn FLYNN: Hillary Clinton, Big Wo..."
2,2,1,Consortiumnews.com Why the Truth Might Get You...
3,3,1,Jessica Purkiss 15 Civilians Killed In Single ...
4,4,1,Howard Portnoy Iranian woman jailed for fictio...
...,...,...,...
95,72,0,Jacey Fortin Dress Like a Woman? What Does Tha...
96,73,0,"Brett Anderson At 91, Ella Brennan Still Feeds..."
97,74,0,"Jane Perlez Pressing Asia Agenda, Obama Treads..."
98,75,0,Josh Katz Democrats Have a 60 Percent Chance t...


In [9]:
# Now we will separate the data and label i.e. text_corpus and label fields
X = news_dataset['text_corpus']
Y = news_dataset['article_label']

In [10]:
# Define a function for stemming the content
port_stem = PorterStemmer()
# Build the stopword set once; a set lookup is much faster than rescanning the list for every word
english_stopwords = set(stopwords.words('english'))
def stemming(content):
    # Keep only letters (a-z, A-Z); numbers and punctuation are replaced by a space
    stemmed_content = re.sub('[^a-zA-Z]', ' ', content)
    # Convert all letters to lowercase
    stemmed_content = stemmed_content.lower()
    # Split the text into a list of words
    stemmed_content = stemmed_content.split()
    # Remove stopwords and reduce the remaining words to their stems
    stemmed_content = [port_stem.stem(word) for word in stemmed_content if word not in english_stopwords]
    # Join the stemmed words back into a single string
    stemmed_content = ' '.join(stemmed_content)
    return stemmed_content

In [11]:
# Apply stemming to the text_corpus column
X = X.apply(stemming)

In [12]:
import winsound
duration = 1000  # milliseconds
freq = 440  # Hz
winsound.Beep(freq, duration)

In [13]:
# Print the X and Y variables
print(X)
print(Y)

0     darrel lucu hous dem aid even see comey letter...
1     daniel j flynn flynn hillari clinton big woman...
2     consortiumnew com truth might get fire truth m...
3     jessica purkiss civilian kill singl us airstri...
4     howard portnoy iranian woman jail fiction unpu...
                            ...                        
95    jacey fortin dress like woman mean new york ti...
96    brett anderson ella brennan still feed lead ne...
97    jane perlez press asia agenda obama tread ligh...
98    josh katz democrat percent chanc retak senat n...
99    news pr disast presid panason forc resign pana...
Name: text_corpus, Length: 100, dtype: object
0     1
1     0
2     1
3     1
4     1
     ..
95    0
96    0
97    0
98    0
99    1
Name: article_label, Length: 100, dtype: int64


TF-IDF (Term Frequency, Inverse Document Frequency)

### Converting Textual data to Numerical data

* The TF-IDF Vectorizer
* TF-IDF Vectorizer converts textual data to numerical data

This is still a bit messy and needs to be cleaned up. I studied the Heroku app file: the vectorizer is used to translate the input text into the numerical form the ML model needs to do the comparison.
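To make the conversion concrete, here is a plain-Python sketch of TF-IDF weighting on three made-up stemmed texts. It assumes the smoothed formula idf = ln((1 + n) / (1 + df)) + 1 followed by L2 normalization, which is how scikit-learn documents the defaults of `TfidfVectorizer`; it is an illustration of the arithmetic, not the library's implementation.

```python
import math
from collections import Counter

docs = ["truth fire truth", "fire senat", "senat vote"]  # made-up stemmed texts

tokenized = [doc.split() for doc in docs]
vocabulary = sorted({word for tokens in tokenized for word in tokens})
n_docs = len(docs)
# Document frequency: in how many documents each term appears.
doc_freq = Counter(word for tokens in tokenized for word in set(tokens))

def tfidf_vector(tokens):
    """Return one L2-normalized TF-IDF row, one value per vocabulary term."""
    counts = Counter(tokens)
    weights = [
        counts[term] * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
        for term in vocabulary
    ]
    norm = math.sqrt(sum(w * w for w in weights))
    return [w / norm if norm else 0.0 for w in weights]

matrix = [tfidf_vector(tokens) for tokens in tokenized]
print(vocabulary)
for row in matrix:
    print([round(value, 3) for value in row])
```

Each text becomes a row with one column per vocabulary term, so the number of columns depends on which texts the vectorizer was fitted on. A model trained on those rows expects exactly that many features at prediction time.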

In [14]:
vectorizer = TfidfVectorizer()
vectorizer.fit(X)

X_transformed = vectorizer.transform(X)

In [15]:
pickle.dump(vectorizer, open('../Pickles/tfidfvect2.pkl', 'wb'))

In [16]:
TEST_model = pickle.load(open('../Pickles/tfidfvect2.pkl', 'rb'))

print(TEST_model)

TfidfVectorizer()


In [17]:
print(X_transformed)

  (0, 8093)	0.03694408830646858
  (0, 8089)	0.021628144918573823
  (0, 8050)	0.0497274539521895
  (0, 8049)	0.11016364582333155
  (0, 8042)	0.04022767162925188
  (0, 8035)	0.013089773516581497
  (0, 7986)	0.017691705351612422
  (0, 7934)	0.01861195231446561
  (0, 7931)	0.015043445572472646
  (0, 7925)	0.03694408830646858
  (0, 7919)	0.01374287660665248
  (0, 7901)	0.013089773516581497
  (0, 7846)	0.025654557793032627
  (0, 7844)	0.03597218816753848
  (0, 7768)	0.026538631430371946
  (0, 7694)	0.06780137453754952
  (0, 7636)	0.03006645541414307
  (0, 7613)	0.018945605044700796
  (0, 7530)	0.03390068726877476
  (0, 7501)	0.01278336564572092
  (0, 7500)	0.05487750586388943
  (0, 7492)	0.15392734675819578
  (0, 7486)	0.08993047041884619
  (0, 7398)	0.03694408830646858
  (0, 7348)	0.026538631430371946
  :	:
  (99, 1600)	0.04946733759450423
  (99, 1581)	0.042484792480827906
  (99, 1507)	0.03625065040130079
  (99, 1476)	0.033199456795716224
  (99, 1446)	0.02629924016155174
  (99, 1441)	0.0389

---

Modeling & Model Evaluation

### Splitting the data into test and train datasets

In [18]:
# Splitting the data into test and train datasets
X_train, X_test, Y_train, Y_test = train_test_split(X_transformed, Y, test_size=0.18, random_state=42)

We use two models, compare their accuracy, and will then select the more accurate model to use in Heroku.
The first model: logistic regression

In [19]:
# Training the model
logisticreg_model = LogisticRegression()

logisticreg_model.fit(X_train, Y_train)

LogisticRegression()

### Model Evaluation

In [20]:
# Accuracy Score on Training Data
X_train_prediction = logisticreg_model.predict(X_train)
training_data_accuracy = accuracy_score(Y_train, X_train_prediction)

print('Accuracy score on the training data: ',training_data_accuracy)

# Accuracy Score on Test Data
X_test_prediction = logisticreg_model.predict(X_test)
test_data_accuracy = accuracy_score(Y_test, X_test_prediction)

print('Accuracy score on the test data: ',test_data_accuracy)

Accuracy score on the training data:  1.0
Accuracy score on the test data:  0.7222222222222222


In [21]:
import pickle
pickle.dump(logisticreg_model, open('../Pickles/logisticreg_model.pkl', 'wb'))


In [22]:
# Classification report for test data
print(classification_report(Y_test, X_test_prediction))

              precision    recall  f1-score   support

           0       0.88      0.64      0.74        11
           1       0.60      0.86      0.71         7

    accuracy                           0.72        18
   macro avg       0.74      0.75      0.72        18
weighted avg       0.77      0.72      0.72        18
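The figures in the report can be reproduced from a confusion matrix. The counts below are not printed anywhere in the notebook; they are inferred from the rounded report (support of 11 and 7, accuracy of 13/18) and should be read as an assumption.

```python
# Confusion counts inferred from the report above (an assumption, not notebook output).
# Keys are (true label, predicted label).
confusion = {
    (0, 0): 7, (0, 1): 4,   # true 0: 7 predicted as 0, 4 predicted as 1
    (1, 0): 1, (1, 1): 6,   # true 1: 1 predicted as 0, 6 predicted as 1
}

def precision(label):
    """Share of predictions for this label that were correct."""
    predicted = sum(n for (_, pred), n in confusion.items() if pred == label)
    return confusion[(label, label)] / predicted

def recall(label):
    """Share of true instances of this label that were found."""
    actual = sum(n for (true, _), n in confusion.items() if true == label)
    return confusion[(label, label)] / actual

def f1(label):
    """Harmonic mean of precision and recall."""
    p, r = precision(label), recall(label)
    return 2 * p * r / (p + r)

accuracy = (confusion[(0, 0)] + confusion[(1, 1)]) / sum(confusion.values())
for label in (0, 1):
    print(label, f"{precision(label):.2f} {recall(label):.2f} {f1(label):.2f}")
print(f"accuracy {accuracy:.2f}")
```

With only 18 test rows, each misclassified article moves accuracy by more than five percentage points, so these scores are rough.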

**CLASSIFICATION MODEL : PASSIVE AGGRESSIVE CLASSIFIER**

* The Passive Aggressive Classifier stays passive (leaves its weights unchanged) when a classification is correct and responds aggressively (updates its weights) on any misclassification.
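The passive and aggressive behaviour can be sketched as a single update step. This is the textbook PA-I rule for labels in {-1, +1} written in plain Python, not scikit-learn's implementation; the function name `pa_update` and the sample data are hypothetical.

```python
def pa_update(weights, features, label, C=0.5):
    """One passive-aggressive (PA-I) step; label must be -1 or +1."""
    margin = label * sum(w * x for w, x in zip(weights, features))
    loss = max(0.0, 1.0 - margin)          # hinge loss
    if loss == 0.0:
        return weights                     # passive: correct with enough margin
    norm_sq = sum(x * x for x in features)
    if norm_sq == 0.0:
        return weights                     # an all-zero vector carries no signal
    tau = min(C, loss / norm_sq)           # aggressive step size, capped by C
    return [w + tau * label * x for w, x in zip(weights, features)]

weights = [0.0, 0.0]
samples = [([1.0, 0.0], 1), ([0.0, 1.0], -1), ([1.0, 0.0], 1)]
for features, label in samples:
    weights = pa_update(weights, features, label)
    print(weights)
```

The parameter `C` plays the same role as in the `PassiveAggressiveClassifier(C=0.5, ...)` call used in this notebook: it limits how far one example can move the weights.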

In [51]:
X2_train, X2_test, Y2_train, Y2_test = train_test_split(X_transformed, Y, test_size=0.33, random_state=42)

In [52]:
pickle.dump(passiveagressive_model, open('../Pickles/passiveagressive_model.pkl', 'wb'))

Testing the two models

In [53]:
# Importing modules
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.metrics import classification_report, accuracy_score
from sklearn.model_selection import train_test_split

# Create the vectorizer
vectorizer = TfidfVectorizer()

# Fit and transform the input data
X_transformed = vectorizer.fit_transform(X)

# Splitting dataset into train and test sets
X2_train, X2_test, Y2_train, Y2_test = train_test_split(X_transformed, Y, test_size=0.33, random_state=42)

# Creating model
passiveagressive_model = PassiveAggressiveClassifier(C=0.5, random_state=5)

# Fitting model
passiveagressive_model.fit(X2_train, Y2_train)

# Making prediction on test set
test_pred = passiveagressive_model.predict(X2_test)

# Model evaluation
print(f"Test Set Accuracy : {accuracy_score(Y2_test, test_pred) * 100} %\n\n")

# Save the vectorizer
pickle.dump(vectorizer, open('../Pickles/tfidf_vectorizer.pkl', 'wb'))

# Save the model
pickle.dump(passiveagressive_model, open('../Pickles/passiveagressive_model.pkl', 'wb'))

Test Set Accuracy : 69.6969696969697 %




In [54]:
y_pred = logisticreg_model.predict(X_test)

# Calculate the prediction accuracy
accuracy = np.mean(y_pred == Y_test) * 100

# Print the accuracy
print("Prediction accuracy: {:.2f}%".format(accuracy))

# Print the prediction for a single example
X_new = X_test[5]
prediction = logisticreg_model.predict(X_new.reshape(1, -1))
print("Prediction for example 5: ", prediction[0])
if (prediction[0] == 0):
  print('Jono says its True')
else:
  print('Johan Says it is a porky:)')

Prediction accuracy: 72.22%
Prediction for example 5:  0
Jono says its True


In [55]:
# Note: logisticreg_model was fitted on the 82/18 split, so X2_test likely contains rows it was trained on
y2_pred = logisticreg_model.predict(X2_test)

# Calculate the prediction accuracy
accuracy = np.mean(y2_pred == Y2_test) * 100

# Print the accuracy
print("Prediction accuracy: {:.2f}%".format(accuracy))

# Print the passive aggressive prediction for a single example
X2_new = X2_test[5]
prediction2 = passiveagressive_model.predict(X2_new.reshape(1, -1))
print("Prediction for example 5: ", prediction2[0])
if (prediction2[0] == 0):
  print('Jono says its True')
else:
  print('Johan Says it is a porky:)')

Prediction accuracy: 84.85%
Prediction for example 500:  0
Jono says its True
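The 84.85% figure deserves a sanity check before it is used to pick a model. The arithmetic below assumes that `train_test_split` with the same `random_state` draws both test sets from the same shuffled order, so the 18-row test set is contained in the 33-row one and the remaining 15 rows were part of the logistic regression training data. That is an assumption about the split, not something the notebook prints.

```python
# Numbers taken from the outputs recorded in this notebook.
n_small_test = 18                            # rows in X_test (test_size=0.18 of 100)
n_large_test = 33                            # rows in X2_test (test_size=0.33 of 100)
small_test_accuracy = 0.7222222222222222     # logistic regression on X_test
train_accuracy = 1.0                         # logistic regression on its training rows

# Assumption: X2_test = X_test plus 15 rows the logistic regression was trained on.
overlap_with_training = n_large_test - n_small_test
correct = round(small_test_accuracy * n_small_test) + train_accuracy * overlap_with_training
expected_accuracy = correct / n_large_test * 100
print(f"{expected_accuracy:.2f}%")
```

Under that assumption the result matches the recorded 84.85%, which suggests the higher score comes mostly from rows the model had already seen rather than from better generalization. Comparing the two models on the same held-out rows would give a fairer picture.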


In [56]:
news_dataset[10:11]

Unnamed: 0,article_id,article_label,text_corpus
10,10,0,Aaron Klein Obama’s Organizing for Action Part...


In [57]:
print(Y_test)

83    1
53    0
70    0
45    0
44    0
39    0
22    1
80    0
10    0
0     1
18    1
30    0
73    1
33    0
90    0
4     1
76    1
77    0
Name: article_label, dtype: int64


In [58]:
X_new = X_test[3]

prediction = logisticreg_model.predict(X_new)
print(prediction)

if (prediction[0] == 0):
  print('Jono says its True')
else:
  print('Johan Says it is a porky:)')

[1]
Johan Says it is a porky:)


In [60]:
news_dataset[300:301]

Unnamed: 0,article_id,article_label,text_corpus


In [61]:
print(X)
vectorizer.fit(X)

0     darrel lucu hous dem aid even see comey letter...
1     daniel j flynn flynn hillari clinton big woman...
2     consortiumnew com truth might get fire truth m...
3     jessica purkiss civilian kill singl us airstri...
4     howard portnoy iranian woman jail fiction unpu...
                            ...                        
95    jacey fortin dress like woman mean new york ti...
96    brett anderson ella brennan still feed lead ne...
97    jane perlez press asia agenda obama tread ligh...
98    josh katz democrat percent chanc retak senat n...
99    news pr disast presid panason forc resign pana...
Name: text_corpus, Length: 100, dtype: object


TfidfVectorizer()

---

In [65]:
# vectorizer = TfidfVectorizer()
# vectorizer.fit(X)

vectorizer = pickle.load(open('../Pickles/tfidfvect2.pkl', 'rb'))

In [66]:
# Assuming X is an array of text inputs
X_preprocessed = [stemming(text) for text in X]  # Apply stemming to each text in the array
X_vectorized = vectorizer.transform(X_preprocessed)  # Convert to numerical format using the trained vectorizer



In [67]:
pickled_model1 = pickle.load(open('logisticreg_model.pkl', 'rb'))
predictions = pickled_model1.predict(X_vectorized)

ValueError: X has 8134 features, but LogisticRegression is expecting 52004 features as input.

In [45]:
pickled_model2 = pickle.load(open('passiveagressive_model.pkl', 'rb'))
pickled_model2.predict(X2_test)

ValueError: X has 8134 features, but PassiveAggressiveClassifier is expecting 52004 features as input.
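Both errors say the same thing: the unpickled models were trained on a 52004-column TF-IDF matrix, while the vectorizer in memory produces 8134 columns, so the model and vectorizer files come from different fits. It may also be relevant that these cells load the models from the working directory, whereas the earlier cells saved them under `../Pickles/`. One way to keep the two in step is to pickle them together and check the shapes before predicting. The sketch below shows the idea with toy stand-ins (a vocabulary dict and a weight list, both hypothetical) rather than real scikit-learn objects.

```python
import io
import pickle

# Toy stand-ins (hypothetical): a vocabulary mapping terms to column indices
# and one weight per column, saved together so they cannot drift apart.
vocabulary = {"fire": 0, "senat": 1, "truth": 2}
weights = [0.4, -0.7, 0.9]

buffer = io.BytesIO()   # in-memory file; a real run would use open(path, "wb")
pickle.dump({"vocabulary": vocabulary, "weights": weights}, buffer)

buffer.seek(0)
bundle = pickle.load(buffer)

def score(text, bundle):
    """Vectorize with the bundled vocabulary and refuse mismatched shapes."""
    vocab, model_weights = bundle["vocabulary"], bundle["weights"]
    if len(vocab) != len(model_weights):
        raise ValueError(
            f"X has {len(vocab)} features, but the model expects {len(model_weights)}"
        )
    counts = [0] * len(vocab)
    for word in text.split():
        if word in vocab:   # words outside the vocabulary are ignored
            counts[vocab[word]] += 1
    return sum(c * w for c, w in zip(counts, model_weights))

result = score("truth fire truth unknown", bundle)
print(result)
```

In scikit-learn terms, a comparable approach would be to place the `TfidfVectorizer` and the classifier in a single `sklearn.pipeline.Pipeline` and pickle that one object, so the Heroku app always loads a matching pair.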

FAngo tested a single data point to clarify how the vectorizer model works

In [None]:
ps = PorterStemmer()

In [None]:
review = re.sub('[^a-zA-Z]', ' ', news_dataset['text_corpus'][100])
review = review.lower()
review = review.split()
    
review = [ps.stem(word) for word in review if not word in stopwords.words('english')]
review = ' '.join(review)
review

In [None]:
val = vectorizer.transform([review]).toarray()

In [None]:
tfidfvect2_model2 = pickle.load(open('tfidfvect2.pkl', 'rb'))
