This code reads in two CSV files named 'test.csv' and 'train.csv', cleans the data by filling in missing values with empty strings, creates a new column in the training dataframe called 'text_corpus' by concatenating the 'author', 'title', and 'text' columns, and then creates several dataframes for different columns in the training dataframe. It also creates a date table for article dates and loads all the dataframes into a PostgreSQL database using SQLAlchemy.

After loading the data into the database, it uses pd.read_sql_query() to query the database and display the first 15 rows of each table.

#### Dataset used - https://www.kaggle.com/fake-news/data

### Dataset Description

train.csv: A full training dataset with the following attributes:

* id: unique id for a news article
* title: the title of a news article
* author: author of the news article
* text: the text of the article; could be incomplete
* label: a label that marks the article as potentially unreliable
  * 1: FAKE
  * 0: TRUE


Ensure the file creator (`pickle`) is available so that the models can be saved for use in the Heroku app

Set the dependencies

<table class="tfo-notebook-buttons" align="left">
  <td>
    <a target="_blank" href="https://colab.research.google.com/github/tensorflow/docs/blob/master/site/en/tutorials/quickstart/beginner.ipynb"><img src="https://www.tensorflow.org/images/colab_logo_32px.png" /> Run in Google Colab</a>
  </td>
  <td>
    <a target="_blank" href="https://github.com/jonowood/Project-4-A-Team/blob/JonBranch/ML/JONO_PREDICT_MASTER.ipynb"><img src="https://www.tensorflow.org/images/GitHub-Mark-32px.png" /> View source on GitHub</a>
  </td>
  <td>
    <a href="https://github.com/jonowood/Project-4-A-Team/blob/JonBranch/ML/JONO_PREDICT_MASTER.ipynb"><img src="https://www.tensorflow.org/images/download_logo_32px.png" /> Download notebook</a>
  </td>
</table>

In [1]:
import numpy as np
import pandas as pd
import psycopg2
from sqlalchemy import create_engine
from sqlalchemy import inspect
from api_keys import postgres_p

import matplotlib.pyplot as plt

import re 
import nltk 
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report
from sklearn import metrics
import itertools

import pickle


The Natural Language Toolkit (NLTK) is a platform for building Python programs that work with human language data, with a focus on statistical natural language processing (NLP). It contains text processing libraries for tokenization, parsing, classification, stemming, tagging and semantic reasoning.
The nltk.corpus package defines a collection of corpus reader classes, which can be used to access the contents of a diverse set of corpora. The list of available corpora is given at: https://www.nltk.org/nltk_data/. Each corpus reader class is specialized to handle a specific corpus format.

Load and test the stopwords. Stopwords are words that occur frequently in a corpus, e.g. a, an, the, in. These frequently occurring words are removed from the corpus as part of text normalization.
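As a minimal sketch of stopword removal, the snippet below filters a sentence against a small hand-written stopword set. The set `SAMPLE_STOPWORDS` and the helper `remove_stopwords` are illustrative assumptions, not NLTK's full English list, so the example runs without downloading any corpus.

```python
# Hypothetical, hand-picked subset of English stopwords
# (NLTK's real English list is much longer).
SAMPLE_STOPWORDS = {"a", "an", "the", "in", "is", "of", "and"}


def remove_stopwords(text, stop_words=SAMPLE_STOPWORDS):
    """Lowercase the text, split on whitespace, and drop stopwords."""
    return [word for word in text.lower().split() if word not in stop_words]


print(remove_stopwords("The author of the article is unknown"))
# ['author', 'article', 'unknown']
```

Using a `set` keeps each membership test cheap, which matters when the filter runs over thousands of articles.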

In [2]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\jonow\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [3]:
# Check that the English stopwords loaded

print(stopwords.words('english'))

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

Data Pre-processing and Analysis


Regular expression syntax: a regular expression (or RE) specifies a set of strings that matches it; the functions in the `re` module let you check if a particular string matches a given regular expression (or if a given regular expression matches a particular string, which comes down to the same thing).
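A short sketch of how a regular expression can strip everything except letters from a headline. The sample string `raw` is made up for illustration; the pattern `[^a-zA-Z]` is the same one used later in this notebook's `stemming` function.

```python
import re

# Made-up headline used only for illustration.
raw = "Trump's 2016 win: 'a shock', say 3 experts!"

# Replace every character that is not an ASCII letter with a space.
letters_only = re.sub(r"[^a-zA-Z]", " ", raw)

# Lowercase, then split on whitespace; split() collapses the runs of
# spaces left behind by the substitution.
tokens = letters_only.lower().split()
print(tokens)
# ['trump', 's', 'win', 'a', 'shock', 'say', 'experts']
```

Note that the apostrophe in "Trump's" becomes a space, so a stray `s` token survives; stopword removal is what would normally clean that up.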

# insert SQL here

In [10]:
# Establish a connection to the PostgreSQL database
conn = psycopg2.connect(database="Project_4", user="postgres", password=postgres_p) #host="your_host_address", port="your_port_number"

In [11]:
# SQL query to retrieve the data
query = "SELECT a.article_id, a.article_label, t.text_corpus FROM article_id a  JOIN text_corpus t ON a.article_id = t.article_id"

In [12]:
# Execute the query and store the results in a Pandas DataFrame
news_dataset = pd.read_sql_query(query, conn)

In [13]:
# Close the database connection
conn.close()

In [14]:
news_dataset

Unnamed: 0,article_id,article_label,text_corpus
0,0,1,Darrell Lucus House Dem Aide: We Didn’t Even S...
1,1,0,"Daniel J. Flynn FLYNN: Hillary Clinton, Big Wo..."
2,2,1,Consortiumnews.com Why the Truth Might Get You...
3,3,1,Jessica Purkiss 15 Civilians Killed In Single ...
4,4,1,Howard Portnoy Iranian woman jailed for fictio...
...,...,...,...
20795,20795,0,Jerome Hudson Rapper T.I.: Trump a ’Poster Chi...
20796,20796,0,"Benjamin Hoffman N.F.L. Playoffs: Schedule, Ma..."
20797,20797,0,Michael J. de la Merced and Rachel Abrams Macy...
20798,20798,1,"Alex Ansary NATO, Russia To Hold Parallel Exer..."


In [15]:
# Now we will separate the data and label i.e. text_corpus and label fields
X = news_dataset['text_corpus']
Y = news_dataset['article_label']

In [16]:
# Define a function for stemming the content
port_stem = PorterStemmer()
# Build the stopword set once; calling stopwords.words() for every word is slow
english_stopwords = set(stopwords.words('english'))
def stemming(content):
    # Keep only alphabetic characters (lowercase and uppercase); numbers and punctuation are replaced by a space
    stemmed_content = re.sub('[^a-zA-Z]',' ',content)
    # Convert all letters to lowercase
    stemmed_content = stemmed_content.lower()
    # Split the text into a list of words
    stemmed_content = stemmed_content.split()
    # Stem each word to its root form where possible and drop stopwords
    stemmed_content = [port_stem.stem(word) for word in stemmed_content if word not in english_stopwords]
    # Join the words back into a single string
    stemmed_content = ' '.join(stemmed_content)
    return stemmed_content

In [17]:
# Apply stemming to the text_corpus column
X = X.apply(stemming)

In [18]:
import winsound
duration = 1000  # milliseconds
freq = 440  # Hz
winsound.Beep(freq, duration)

In [19]:
# Print the X and Y variables
print(X)
print(Y)

0        darrel lucu hous dem aid even see comey letter...
1        daniel j flynn flynn hillari clinton big woman...
2        consortiumnew com truth might get fire truth m...
3        jessica purkiss civilian kill singl us airstri...
4        howard portnoy iranian woman jail fiction unpu...
                               ...                        
20795    jerom hudson rapper trump poster child white s...
20796    benjamin hoffman n f l playoff schedul matchup...
20797    michael j de la merc rachel abram maci said re...
20798    alex ansari nato russia hold parallel exercis ...
20799    david swanson keep f aliv david swanson author...
Name: text_corpus, Length: 20800, dtype: object
0        1
1        0
2        1
3        1
4        1
        ..
20795    0
20796    0
20797    0
20798    1
20799    1
Name: article_label, Length: 20800, dtype: int64


TF-IDF (Term Frequency, Inverse Document Frequency)

### Converting Textual data to Numerical data

* The TF-IDF Vectorizer
* TF-IDF Vectorizer converts textual data to numerical data
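To make the text-to-numbers step concrete, here is a pure-Python sketch of TF-IDF weighting. It assumes the smoothed idf, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization, which is the variant scikit-learn documents as the `TfidfVectorizer` default. The `tfidf` helper, the whitespace tokenizer and the two sample documents are illustrative assumptions, so the numbers will not match the notebook's matrix.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per document, L2-normalized."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents each term appears in.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Raw term count times smoothed inverse document frequency.
        weights = {
            t: tf[t] * (math.log((1 + n_docs) / (1 + df[t])) + 1) for t in tf
        }
        # Scale the document vector to unit length.
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({t: w / norm for t, w in weights.items()})
    return vectors


docs = ["fake news spreads fast", "real news spreads slowly"]
vectors = tfidf(docs)
for term, weight in sorted(vectors[0].items()):
    print(f"{term}: {weight:.3f}")
```

Terms that appear in only one document ("fake", "fast") receive a higher weight than terms shared by both ("news", "spreads"), which is the point of the inverse document frequency factor.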

This is still a bit messy and needs to be cleaned up. I studied the Heroku app file: the vectorizer is used to translate the input text into the numerical form the ML model needs to do the comparison.

In [20]:
vectorizer = TfidfVectorizer()
vectorizer.fit(X)

TfidfVectorizer()

In [21]:
X = vectorizer.transform(X)

In [22]:
# Save the fitted vectorizer; the with-block closes the file handle
with open('../Pickles/tfidfvect2.pkl', 'wb') as f:
    pickle.dump(vectorizer, f)

In [23]:
# Reload the vectorizer to confirm the pickle round-trips
with open('../Pickles/tfidfvect2.pkl', 'rb') as f:
    TEST_model = pickle.load(f)

print(TEST_model)

TfidfVectorizer()


In [24]:
print(X)

  (0, 109752)	0.049158312425168854
  (0, 109697)	0.0190646711515277
  (0, 108742)	0.04416544119908134
  (0, 108738)	0.09477494042884232
  (0, 108695)	0.03758488097939004
  (0, 108658)	0.01130614774071694
  (0, 108007)	0.017092546683505856
  (0, 107190)	0.017105936674103112
  (0, 107099)	0.012543234221230963
  (0, 107013)	0.029126417104928328
  (0, 106934)	0.012863319680563097
  (0, 106734)	0.011771716334271506
  (0, 105884)	0.025727197929110487
  (0, 105848)	0.031296701378124764
  (0, 104837)	0.02153649554212262
  (0, 103422)	0.06544555398259812
  (0, 102736)	0.03314918847150756
  (0, 102485)	0.01639612818098454
  (0, 101717)	0.038071924979380216
  (0, 101077)	0.011082403436475742
  (0, 101067)	0.0432044670628921
  (0, 101014)	0.13602128375819167
  (0, 100866)	0.0713092337063475
  (0, 99577)	0.03944988916619374
  (0, 99009)	0.027120358929731154
  :	:
  (20799, 7470)	0.010635431711878486
  (20799, 7143)	0.02816704434978389
  (20799, 6848)	0.03959171777516513
  (20799, 6810)	0.0253655855

Modeling & Model Evaluation

### Splitting the data into test and train datasets

In [25]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.18, random_state=42)

We train two models, compare their accuracy, and then select the more accurate one for use in the Heroku app.
The first model: logistic regression

In [26]:
# Training the model
logisticreg_model = LogisticRegression()

logisticreg_model.fit(X_train, Y_train)

LogisticRegression()

### Model Evaluation

In [27]:
# Accuracy score on training data (accuracy_score takes y_true first, then y_pred)
X_train_prediction = logisticreg_model.predict(X_train)
training_data_accuracy = accuracy_score(Y_train, X_train_prediction)

print('Accuracy score on the training data: ',training_data_accuracy)

# Accuracy score on test data
X_test_prediction = logisticreg_model.predict(X_test)
test_data_accuracy = accuracy_score(Y_test, X_test_prediction)

print('Accuracy score on the test data: ',test_data_accuracy)

Accuracy score on the training data:  0.9788930581613509
Accuracy score on the test data:  0.9599358974358975


In [28]:
# Save the trained logistic regression model (pickle is already imported above)
with open('../Pickles/logisticreg_model.pkl', 'wb') as f:
    pickle.dump(logisticreg_model, f)


In [29]:
# Classification report for test data (print it so the newlines render as a table)
print(classification_report(Y_test, X_test_prediction))

              precision    recall  f1-score   support

           0       0.97      0.95      0.96      1851
           1       0.95      0.97      0.96      1893

    accuracy                           0.96      3744
   macro avg       0.96      0.96      0.96      3744
weighted avg       0.96      0.96      0.96      3744
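The precision, recall and f1-score columns of a classification report can be derived from simple counts of true positives, false positives and false negatives. A minimal sketch on made-up labels follows; the `binary_report` helper and the toy `y_true` / `y_pred` lists are illustrative assumptions, not values from this notebook.

```python
def binary_report(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    return {
        "accuracy": correct / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Toy labels: 1 = FAKE, 0 = TRUE (same encoding as the dataset description).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
report = binary_report(y_true, y_pred)
print(report)
# {'accuracy': 0.75, 'precision': 0.75, 'recall': 0.75, 'f1': 0.75}
```

Here one fake article is missed (a false negative) and one true article is flagged (a false positive), so precision and recall both come out at 3/4.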

**CLASSIFICATION MODEL : PASSIVE AGGRESSIVE CLASSIFIER**

* The Passive Aggressive Classifier stays passive when a sample is classified correctly and responds aggressively, by updating its weights, when a sample is misclassified.
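The passive/aggressive behaviour can be sketched as a single weight update. This assumes the PA-I variant of the Passive-Aggressive algorithm with hinge loss and labels encoded as -1 / +1; the `pa_update` helper and the toy vectors are hypothetical, and scikit-learn's internal implementation is not shown here.

```python
def pa_update(weights, x, y, C=0.5):
    """One PA-I step for a sample x with label y in {-1, +1}."""
    margin = y * sum(w * xi for w, xi in zip(weights, x))
    loss = max(0.0, 1.0 - margin)  # hinge loss
    sq_norm = sum(xi * xi for xi in x)
    if loss == 0.0 or sq_norm == 0.0:
        # Passive: the sample is already classified with enough margin.
        return list(weights)
    # Aggressive: step just far enough to fix the sample, capped by C.
    tau = min(C, loss / sq_norm)
    return [w + tau * y * xi for w, xi in zip(weights, x)]


w0 = [0.0, 0.0]
# The first sample is not yet classified with margin 1, so the weights move.
w1 = pa_update(w0, x=[1.0, 2.0], y=+1)
# The second sample already has margin 2 under w1, so nothing changes.
w2 = pa_update(w1, x=[2.0, 4.0], y=+1)
print(w1, w2)
```

The parameter `C` plays the same role as the `C = 0.5` passed to `PassiveAggressiveClassifier` in this notebook: it caps how large a single corrective step may be.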

In [30]:
X2_train, X2_test, Y2_train, Y2_test = train_test_split(X, Y, test_size=0.33, random_state=42)

In [31]:
# Importing modules
# from sklearn.datasets import load_iris
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.metrics import classification_report, accuracy_score
from sklearn.model_selection import train_test_split

# Splitting dataset into train and test sets
X2_train, X2_test, Y2_train, Y2_test = train_test_split(X, Y, test_size=0.33, random_state=42)

# Creating model
passiveagressive_model = PassiveAggressiveClassifier(C = 0.5, random_state = 5)

# Fitting model
passiveagressive_model.fit(X2_train, Y2_train)

# Making prediction on test set
test_pred = passiveagressive_model.predict(X2_test)

# Model evaluation
print(f"Test Set Accuracy : {accuracy_score(Y2_test, test_pred) * 100} %\n\n")

#print(f"Classification Report : \n\n{classification_report(Y2_test, test_pred)}")


Test Set Accuracy : 96.69289044289044 %




In [32]:
# Save the trained passive aggressive model
with open('../Pickles/passiveagressive_model.pkl', 'wb') as f:
    pickle.dump(passiveagressive_model, f)

Testing the two models

In [33]:
y_pred = logisticreg_model.predict(X_test)

# Calculate the prediction accuracy
accuracy = np.mean(y_pred == Y_test) * 100

# Print the accuracy
print("Prediction accuracy: {:.2f}%".format(accuracy))

# Print the prediction for a single example
X_new = X_test[501]
prediction = logisticreg_model.predict(X_new.reshape(1, -1))
print("Prediction for example 500: ", prediction[0])
if (prediction[0] == 0):
  print('Jono says its True')
else:
  print('Johan Says it is a porky:)')

Prediction accuracy: 95.99%
Prediction for example 500:  0
Jono says its True


In [34]:
y2_pred = logisticreg_model.predict(X2_test)

# Calculate the prediction accuracy
accuracy = np.mean(y2_pred == Y2_test) * 100

# Print the accuracy
print("Prediction accuracy: {:.2f}%".format(accuracy))

# Print the prediction for a single example
X2_new = X2_test[501]
prediction2 = passiveagressive_model.predict(X2_new.reshape(1, -1))
print("Prediction for example 500: ", prediction2[0])
if (prediction2[0] == 0):
  print('Jono says its True')
else:
  print('Johan Says it is a porky:)')

Prediction accuracy: 96.74%
Prediction for example 500:  0
Jono says its True


In [35]:
news_dataset[100:101]

Unnamed: 0,article_id,article_label,text_corpus
100,77,1,Redflag Newsdesk Judge spanks transgender-obse...


In [38]:
print(Y_test)

14649    1
9231     0
6473     1
18736    0
12347    0
        ..
4351     0
8423     0
14014    1
14587    0
4918     0
Name: article_label, Length: 3744, dtype: int64


In [39]:
X_new = X_test[301]

prediction = logisticreg_model.predict(X_new)
print(prediction)

if (prediction[0] == 0):
  print('Jono says its True')
else:
  print('Johan Says it is a porky:)')

[1]
Johan Says it is a porky:)


In [41]:
print(Y_test[9231])

0


In [42]:
news_dataset[300:301]

Unnamed: 0,article_id,article_label,text_corpus
300,245,0,"Andrew Ross Sorkin C.E.O.s Ponder a New Game, ..."


In [45]:
print(X)
# vectorizer.fit(X)

  (0, 109752)	0.049158312425168854
  (0, 109697)	0.0190646711515277
  (0, 108742)	0.04416544119908134
  (0, 108738)	0.09477494042884232
  (0, 108695)	0.03758488097939004
  (0, 108658)	0.01130614774071694
  (0, 108007)	0.017092546683505856
  (0, 107190)	0.017105936674103112
  (0, 107099)	0.012543234221230963
  (0, 107013)	0.029126417104928328
  (0, 106934)	0.012863319680563097
  (0, 106734)	0.011771716334271506
  (0, 105884)	0.025727197929110487
  (0, 105848)	0.031296701378124764
  (0, 104837)	0.02153649554212262
  (0, 103422)	0.06544555398259812
  (0, 102736)	0.03314918847150756
  (0, 102485)	0.01639612818098454
  (0, 101717)	0.038071924979380216
  (0, 101077)	0.011082403436475742
  (0, 101067)	0.0432044670628921
  (0, 101014)	0.13602128375819167
  (0, 100866)	0.0713092337063475
  (0, 99577)	0.03944988916619374
  (0, 99009)	0.027120358929731154
  :	:
  (20799, 7470)	0.010635431711878486
  (20799, 7143)	0.02816704434978389
  (20799, 6848)	0.03959171777516513
  (20799, 6810)	0.0253655855


In [53]:
# Fitting fails here: X2_new is already a TF-IDF sparse row, while fit() expects raw text strings
vectorizer = TfidfVectorizer()
vectorizer.fit(X2_new)

AttributeError: lower not found

In [49]:

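# Load the logistic regression pickle from the working directory and predict.
# The ValueError below indicates that the input was vectorized with a different
# vocabulary than the one this pickled model was trained on.
pickled_model1 = pickle.load(open('logisticreg_model.pkl', 'rb'))
pickled_model1.predict(X2)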

ValueError: X has 111501 features, but LogisticRegression is expecting 52004 features as input.
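A feature-count error of this kind typically means the vectorizer and the model came from different fits. The pure-Python sketch below reproduces the idea with a toy bag-of-words vocabulary; `fit_vocabulary`, `to_counts`, `check_compatible` and the sample strings are all hypothetical stand-ins, not scikit-learn APIs.

```python
import pickle


def fit_vocabulary(docs):
    """Map each distinct term to a column index."""
    terms = sorted({word for doc in docs for word in doc.lower().split()})
    return {term: index for index, term in enumerate(terms)}


def to_counts(doc, vocab):
    """Turn a document into a count vector with len(vocab) columns."""
    vector = [0] * len(vocab)
    for word in doc.lower().split():
        if word in vocab:
            vector[vocab[word]] += 1
    return vector


def check_compatible(vector, weights):
    """Raise if the input width differs from what the model was trained on."""
    if len(vector) != len(weights):
        raise ValueError(
            f"X has {len(vector)} features, but model expects {len(weights)}"
        )


vocab_old = fit_vocabulary(["fake news"])
weights_old = [0.0] * len(vocab_old)  # stand-in for a model trained with vocab_old
vocab_new = fit_vocabulary(["fake news spreads fast"])

# Pickling the matching pair together keeps them in sync.
bundle = pickle.loads(pickle.dumps({"vocab": vocab_old, "weights": weights_old}))
check_compatible(to_counts("fake story", bundle["vocab"]), bundle["weights"])

# Mixing a newer vocabulary with the older weights fails.
try:
    check_compatible(to_counts("fake story", vocab_new), weights_old)
    message = None
except ValueError as err:
    message = str(err)
print(message)
# X has 4 features, but model expects 2
```

Saving the vectorizer and the classifier from the same training run, and loading both in the app, avoids the mismatch.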

In [None]:
pickled_model2 = pickle.load(open('passiveagressive_model.pkl', 'rb'))
pickled_model2.predict(X2_test)

FAngo tested a single point to clarify the vectorizer model

In [None]:
ps = PorterStemmer()

In [None]:
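# Clean one article the same way as the training data: letters only, lowercase, split
review = re.sub('[^a-zA-Z]', ' ', news_dataset['text_corpus'][100])
review = review.lower()
review = review.split()

# Stem each word and drop stopwords, then join back into one string
review = [ps.stem(word) for word in review if word not in stopwords.words('english')]
review = ' '.join(review)
review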

In [None]:
val = vectorizer.transform([review]).toarray()

In [None]:
tfidfvect2_model2 = pickle.load(open('tfidfvect2.pkl', 'rb'))
