# Negative News Neural Nets Project: Classifying Adverse Media Articles using Machine Learning Algorithms

In this notebook, a conda environment with Python 3.8.6 is used. Some libraries, such as spaCy and NLTK, may need to be installed if your machine does not already have them.

You can use the steps below to install spaCy. If something goes wrong, fall back to pip or search Stack Overflow to complete the installation. The last two packages are not core spaCy packages, but they are required later in the notebook.

 - conda install -c conda-forge spacy
 
 - conda install -c conda-forge spacy-lookups-data
 
 - python -m spacy download en_core_web_sm
 
 - pip install spacy-langdetect
 
 - conda install -c conda-forge wordcloud
 
Installing the NLTK data packages, on the other hand, is easy: read the error message to see what needs to be downloaded with nltk.download(...). I have already provided the download code for the punkt package, and I don't think anything else is required.
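
Before running the rest of the notebook, it can help to check which of the packages listed above are actually importable. The snippet below is a minimal sketch that uses only the standard library; the `REQUIRED` list and the `missing_packages` helper are illustrative names, not part of the original notebook, and the import names (`spacy_langdetect` for `spacy-langdetect`) are the ones those packages are normally imported under.

```python
import importlib.util

# Import names of the packages installed in the steps above (assumed list).
REQUIRED = ["spacy", "nltk", "wordcloud", "spacy_langdetect"]

def missing_packages(names):
    """Return the names that cannot be located in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]

missing = missing_packages(REQUIRED)
print("Missing packages:", missing or "none")

# NLTK data is separate from the nltk package itself; if a later cell
# complains about a missing resource, download it by name, e.g.:
#   import nltk
#   nltk.download("punkt")
```

Anything reported as missing can then be installed with the matching conda or pip command from the list above.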

## TF-IDF & Baseline Logistic Regression Model

This will be a short notebook, reserved only for the Logistic Regression model. We will use the cleaned and lemmatized dataset that we exported as a .csv file during preprocessing.

In [1]:
import warnings
warnings.simplefilter("ignore", UserWarning)

In [2]:
import pandas as pd
import numpy as np
import json
import math
import matplotlib.pyplot as plt
%matplotlib inline

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, auc, roc_auc_score, f1_score, confusion_matrix

import scipy
from scipy.sparse import hstack

# This module will be for saving the trained model for later use
import joblib

In [3]:
# Uncomment this if you're using linux
# !ls 

In [4]:
# Let's get an overview of what our folder contains..
!dir

 Volume in drive C is Win 10
 Volume Serial Number is CA9A-F06E

 Directory of C:\Users\canberk\Desktop\ut-ml-adverse-media-main

12/08/2020  07:12 AM    <DIR>          .
12/08/2020  07:12 AM    <DIR>          ..
11/23/2020  06:08 PM           110,455 .ipynb
12/08/2020  06:59 AM    <DIR>          .ipynb_checkpoints
11/21/2020  04:43 PM         3,752,073 adverse_media_training.csv.zip
12/08/2020  06:39 AM         2,174,223 cleaned_lemmatized_text.csv
12/08/2020  06:07 AM           110,996 Data Preprocessing&Baselines-Original.ipynb
12/08/2020  06:06 AM           115,641 Data Preprocessing&Baselines.ipynb
11/21/2020  04:43 PM         3,630,748 EDA - Kristjan's Original.ipynb
11/23/2020  06:04 PM         3,740,422 EDA.ipynb
10/24/2015  07:35 PM     5,646,236,541 glove.840B.300d.txt
12/07/2020  04:36 PM     2,176,768,927 glove.840B.300d.zip
11/21/2020  04:43 PM             1,073 LICENSE
12/08/2020  07:12 AM            66,944 Logistic Regression.ipynb
12/08/2020  06:59 AM            68,615 

In [5]:
df = pd.read_csv('./cleaned_lemmatized_text.csv')
df.head()

Unnamed: 0,is_adverse_media,lemmatized_articles
0,0,zimbabweans wake news agriculture minister per...
1,1,singapore founder singapore oil trade company ...
2,1,fraudster offer green tax efficient investment...
3,1,buenos aire reuter judicial probe possible cor...
4,0,ukraines constitutional court appear strike bl...


In [6]:
x_train, x_val, y_train, y_val = train_test_split(df['lemmatized_articles'], 
                                                    df['is_adverse_media'], 
                                                    test_size=0.1, 
                                                    random_state=42,
                                                    stratify=df['is_adverse_media'])

print(x_train.shape, x_val.shape, y_train.shape, y_val.shape)

(656,) (73,) (656,) (73,)
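
The `stratify` argument in the split above keeps the share of adverse and non-adverse articles roughly equal in both sets. The sketch below illustrates the idea in plain Python on hypothetical toy labels; it is not scikit-learn's implementation, and `stratified_split` and `toy_labels` are names made up for this example.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_size=0.1, seed=42):
    """Split indices so that each class contributes ~test_size of its rows."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, val_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)                      # shuffle within the class
        n_val = round(len(indices) * test_size)   # per-class validation quota
        val_idx.extend(indices[:n_val])
        train_idx.extend(indices[n_val:])
    return sorted(train_idx), sorted(val_idx)

# Hypothetical toy data: 60 positive and 40 negative labels.
toy_labels = [1] * 60 + [0] * 40
train_idx, val_idx = stratified_split(toy_labels)

val_positive_share = sum(toy_labels[i] for i in val_idx) / len(val_idx)
print(len(train_idx), len(val_idx), val_positive_share)
```

Because the validation quota is taken per class, the validation set keeps the same 60/40 balance as the full toy dataset.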


In [7]:
x_train.head()

26     late september joe tone young editor dallas ob...
257    mexicos attorney general alejandro gertz maner...
0      zimbabweans wake news agriculture minister per...
279    article write yash singhal vivekananda institu...
108    singapore reuters singapores central bank impo...
Name: lemmatized_articles, dtype: object

The train and validation sets are ready for a vectorizer. Instead of building the document-term matrix by simply counting word occurrences (i.e., the bag-of-words approach), I will fit a TF-IDF vectorizer on the training data.
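
To see how TF-IDF differs from raw counts, here is a minimal pure-Python sketch of the weighting. It assumes the smoothed formula that scikit-learn documents as its default, idf = ln((1 + n) / (1 + df)) + 1, followed by L2 normalisation of each row. The whitespace tokenisation, the `tfidf` helper, and the toy documents are simplifications for illustration only; the notebook's vectorizer additionally applies stop-word removal, n-grams, and the `min_df`/`max_df` filters.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return (vocabulary, rows) with L2-normalised, smoothed TF-IDF weights."""
    tokenised = [doc.split() for doc in docs]
    vocab = sorted({tok for doc in tokenised for tok in doc})
    n_docs = len(tokenised)

    # Document frequency: in how many documents each term appears.
    df = Counter(tok for doc in tokenised for tok in set(doc))
    idf = {tok: math.log((1 + n_docs) / (1 + df[tok])) + 1 for tok in vocab}

    rows = []
    for doc in tokenised:
        counts = Counter(doc)
        raw = [counts[tok] * idf[tok] for tok in vocab]
        norm = math.sqrt(sum(v * v for v in raw)) or 1.0
        rows.append([v / norm for v in raw])
    return vocab, rows

# Hypothetical toy corpus, not taken from the dataset.
toy_docs = ["bank fraud probe", "bank open branch", "fraud fraud charge"]
vocab, rows = tfidf(toy_docs)

for term, weight in zip(vocab, rows[0]):
    print(f"{term:8s} {weight:.3f}")
```

In the first toy document, "probe" occurs in only one document and therefore receives a higher weight than "bank" or "fraud", which each occur in two. A plain count matrix would give all three terms the same value of 1.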

In [8]:
ngram_vectorizer = TfidfVectorizer(max_features=40000,
                             min_df=5, 
                             max_df=0.5, 
                             analyzer='word', 
                             stop_words='english', 
                             ngram_range=(1, 3))
print(ngram_vectorizer)

TfidfVectorizer(max_df=0.5, max_features=40000, min_df=5, ngram_range=(1, 3),
                stop_words='english')


Let's fit the vectorizer to x_train and take a look at the feature names.

In [9]:
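import random
ngram_vectorizer.fit(x_train)
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out()
# is available from 1.0 onwards (on older versions, use get_feature_names()).
features = ngram_vectorizer.get_feature_names_out().tolist()

# Show a random sample of the learned vocabulary
random.sample(features, k=20)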

['harry',
 'case include',
 'dire',
 'make',
 'attorney southern',
 'construct',
 'americans',
 'aml',
 'epidemic',
 'pump',
 'early week',
 'reach million',
 'westminster magistrates',
 'food drug',
 'pakistani',
 'weekly',
 'testing available',
 'dub',
 'hamas',
 'vague']
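
The sample above mixes single words with multi-word phrases because the vectorizer was created with `ngram_range=(1, 3)`. The sketch below shows how word n-grams of length one to three can be built from a token sequence. It is a simplified illustration with a hypothetical `word_ngrams` helper: it splits on whitespace only and skips the stop-word filtering that the real vectorizer performs.

```python
def word_ngrams(text, ngram_range=(1, 3)):
    """Return all word n-grams of text for n in the inclusive ngram_range."""
    tokens = text.split()
    min_n, max_n = ngram_range
    grams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of n tokens across the sequence.
        for start in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[start:start + n]))
    return grams

example_grams = word_ngrams("westminster magistrates court")
print(example_grams)
```

A three-word phrase yields three unigrams, two bigrams, and one trigram, which is why the vocabulary grows quickly and why `max_features` and `min_df` are used to keep it manageable.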

In [10]:
tfidf_train = ngram_vectorizer.transform(x_train)
tfidf_validation = ngram_vectorizer.transform(x_val)

In [11]:
doc_array = tfidf_train.toarray()
doc_array

array([[0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.1394311 , 0.07193199,
        0.        ],
       ...,
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ]])

In [12]:
frequency_matrix = pd.DataFrame(doc_array, 
                                columns = features)
frequency_matrix.head(10)

Unnamed: 0,abandon,abdul,abdullah,abide,ability,able,able use,abolish,abroad,absence,...,zanupf,zealand,zero,zetas,zimbabwe,zimbabwe anticorruption,zimbabwe anticorruption commission,zimbabwean,zimbabwes,zone
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.074467,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.129174,0.0,0.0,0.139431,0.071932,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.016777,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,0.0,0.0,0.0,0.0,0.0,0.014276,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


Both the train and validation datasets are now transformed. Next, we fit a basic logistic regression model to see how far it can get in terms of F1 score and accuracy.

In [13]:
lr = LogisticRegression(solver='sag')
lr.fit(tfidf_train, y_train)

LogisticRegression(solver='sag')

In [14]:
train_preds_lr = lr.predict(tfidf_train)
val_preds_lr = lr.predict(tfidf_validation)

train_f1_score_lr = f1_score(y_train, train_preds_lr)
val_f1_score_lr = f1_score(y_val, val_preds_lr)

train_accuracy_lr = accuracy_score(y_train, train_preds_lr)
val_accuracy_lr = accuracy_score(y_val, val_preds_lr)

In [15]:
print('Prediction accuracy for logistic regression model on train data:', round(train_accuracy_lr*100, 3))
print('Prediction accuracy for logistic regression model on validation data:', round(val_accuracy_lr*100, 3))

print()

print('F1 score for logistic regression model on train data:', round(train_f1_score_lr*100, 3))
print('F1 score for logistic regression model on validation data:', round(val_f1_score_lr*100, 3))

Prediction accuracy for logistic regression model on train data: 96.799
Prediction accuracy for logistic regression model on validation data: 87.671

F1 score for logistic regression model on train data: 97.219
F1 score for logistic regression model on validation data: 89.888
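
To make the two metrics reported above concrete, the following sketch computes accuracy and F1 by hand from true and predicted labels. The `accuracy_and_f1` helper and the toy labels are hypothetical and are not drawn from the notebook's data; `accuracy_score` and `f1_score` from scikit-learn compute the same quantities for binary labels.

```python
def accuracy_and_f1(y_true, y_pred):
    """Compute accuracy and F1 (positive class = 1) for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return accuracy, f1

# Hypothetical toy labels for illustration.
toy_true = [1, 1, 1, 1, 0, 0, 0, 0, 1, 0]
toy_pred = [1, 1, 1, 0, 0, 0, 1, 0, 1, 1]

toy_accuracy, toy_f1 = accuracy_and_f1(toy_true, toy_pred)
print(round(toy_accuracy, 3), round(toy_f1, 3))
```

Accuracy counts every correct prediction equally, while F1 ignores true negatives and balances precision against recall for the adverse class, so the two numbers can diverge when the classes are imbalanced.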


The results look good, but there is room for improvement: the gap between the train and validation scores (about 97% vs. 88% accuracy) suggests some overfitting, which we will most probably see on the test data as well. **Yet, for now, I am skipping the regularization part, since I would like to see the results on public test data before doing any serious regularization & tuning.**

Let's save the untuned LR model for later modifications.

In [16]:
filename = 'log_regression_model.sav'
joblib.dump(lr, filename)

['log_regression_model.sav']

In [20]:
#loaded_model = joblib.load(filename)
#val_preds_lr = loaded_model.predict(tfidf_validation)
#result = f1_score(y_val, val_preds_lr)
#print(result)