## Second Model - Naive Bayes classifier : Supervised 


**What is Naive Bayes?**

In this portion of the project, we build a predictive model. Naive Bayes is a supervised machine learning model based on Bayes' theorem, which uses knowledge of related events to calculate the probability of another event. Naive Bayes classifiers are widely used with text data because they treat each feature (here, each word) as independent of the others given the class, which keeps the calculation simple even with thousands of features.

**How does the Naive Bayes classifier work?**

The algorithm uses Bayes' theorem to calculate the posterior probability. A posterior probability, in Bayesian statistics, is the revised or updated probability of an event occurring after taking into consideration new information. [4] We have a certain number of features, and each one needs a probability. The algorithm estimates the probability of each feature given each class; because the features are assumed to be independent given the class, the chain rule reduces to a simple product of these individual probabilities. The individual probabilities are then combined with the class prior to give the posterior probability of each class. The classifier's objective is to choose, for each article, the class that maximizes the posterior probability, with the probabilities estimated from the training data. [1]
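As a rough illustration of the steps above, the sketch below computes the posterior by hand for a toy two-class problem. The priors and word likelihoods are made-up numbers, and `naive_bayes_posterior` is a hypothetical helper written for this example; it is not part of scikit-learn and not the code used later in this notebook.

```python
import math

def naive_bayes_posterior(words, priors, likelihoods):
    """Return P(class | words) for each class, assuming words are independent given the class."""
    # Work in log space: log P(class) + sum of log P(word | class).
    log_scores = {}
    for label, prior in priors.items():
        log_scores[label] = math.log(prior) + sum(
            math.log(likelihoods[label][w]) for w in words
        )
    # Normalise so that the posteriors sum to 1.
    max_log = max(log_scores.values())
    unnormalised = {c: math.exp(s - max_log) for c, s in log_scores.items()}
    total = sum(unnormalised.values())
    return {c: v / total for c, v in unnormalised.items()}

# Illustrative (made-up) numbers: 0 = non-licensing, 1 = licensing.
priors = {0: 0.5, 1: 0.5}
likelihoods = {
    0: {"license": 0.1, "agreement": 0.2},
    1: {"license": 0.6, "agreement": 0.5},
}

posterior = naive_bayes_posterior(["license", "agreement"], priors, likelihoods)
predicted = max(posterior, key=posterior.get)  # class with the highest posterior
print(posterior, predicted)
```

With these numbers the unnormalised scores are 0.5 × 0.1 × 0.2 = 0.01 for class 0 and 0.5 × 0.6 × 0.5 = 0.15 for class 1, so the classifier picks class 1.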

#### 1. Import necessary packages 

In [2]:
import pandas as pd 
import matplotlib.pyplot as plt
from string import punctuation
import string
import re  # used by clean_text
import io
import nltk
from nltk.corpus import stopwords, wordnet
from nltk import pos_tag
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize,sent_tokenize
from wordcloud import WordCloud
from sklearn.preprocessing import OneHotEncoder
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import classification_report,confusion_matrix,accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression,SGDClassifier, LinearRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC
from keras.layers import Dense, Conv1D, MaxPool1D, Flatten, Dropout
from keras.models import Sequential
from tensorflow.keras.utils import to_categorical

### 2. Import Data

The dataframe used for the next two models was built in our Jupyter notebook with the code from the first half of our project. The dataset has been sorted into news articles that contain licensing agreement language and news articles that do not. A label column has been created, with 0 for non-licensing and 1 for licensing. We have taken a sample of roughly 5,000 articles instead of training on all 40,000; the remaining articles can be used for further model testing. To reduce the class imbalance, we selected only 3,000 of the non-licensing articles and all of the licensing articles, which totaled a little over 2,500.

In [3]:
from google.colab import files
 
 
uploaded = files.upload()


Saving df_equal_labels.csv to df_equal_labels.csv


In [4]:
df = pd.read_csv(io.BytesIO(uploaded['df_equal_labels.csv']))

### 3. Clean data for models

In this section, we clean the data specifically for the machine learning models. We follow similar logic and procedure to the first half of the project: making the text lowercase, removing numbers, splitting contractions, and removing punctuation.


In [5]:
def clean_text(text):
    '''Make text lowercase, remove text in square brackets, remove curly quotes and ellipses, remove newlines, remove punctuation and remove words containing numbers.'''
    text = str(text)                      # make sure we have a string before applying any regex
    text = text.lower()
    text = re.sub(r'\[.*?\]', '', text)   # text in square brackets
    text = re.sub(r'\w*\d\w*', '', text)  # words containing digits
    text = re.sub('[‘’“”…]', '', text)    # curly quotes and ellipses
    text = text.replace('\n', '')         # newlines
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)
    return text


#### 3.0.1 Remove stop words

In [6]:
nltk.download('stopwords')
stop = set(nltk.corpus.stopwords.words('english'))
stop.update(punctuation)

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


### 3.0.2 Add part-of-speech tagging

This section is somewhat similar to tagging the nouns during topic modeling. NLTK's `pos_tag` assigns a part-of-speech tag to each word, and the function below maps that tag to the matching WordNet category. WordNet is a lexical database of semantic relations between words in more than 200 languages. It was hand-coded by linguists.

In [7]:
def get_simple_pos(tag):
    if tag.startswith('J'):
        return wordnet.ADJ
    elif tag.startswith('V'):
        return wordnet.VERB
    elif tag.startswith('N'):
        return wordnet.NOUN
    elif tag.startswith('R'):
        return wordnet.ADV
    else:
        return wordnet.NOUN

### 3.0.3 Word Lemmatizing 

This is the process of reducing each word to its base form, or lemma. It shrinks the vocabulary, which saves processing time. NLTK has a lemmatizer that we will use. Sometimes the same word can have multiple different lemmas, so the part-of-speech (POS) tag for the word in its specific context should be identified in order to extract the appropriate lemma. [3]

In [8]:
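lemmatizer = WordNetLemmatizer()
def clean_text_2(text):
    # Tokenize, drop stop words and punctuation, then lemmatize each remaining word.
    tokens = []  # named 'tokens' so it does not shadow the clean_text function
    for w in word_tokenize(text):
        if w.lower() not in stop:
            # Tag the single word and map the tag to a WordNet part of speech.
            pos = pos_tag([w])
            new_w = lemmatizer.lemmatize(w, pos=get_simple_pos(pos[0][1]))
            tokens.append(new_w)
    return tokens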

def join_text(text):
    return " ".join(text)

### 3.0.4 Apply user-defined preprocessing functions to text

This section applies the functions defined above to the dataframe's text.

In [9]:
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')
nltk.download('omw-1.4')
df['Content'] = df['Content'].apply(clean_text_2)
df['Content'] = df['Content'].apply(join_text)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...


### 4. Train and Test Split 

The data has been cleaned and preprocessed. Now it needs to be split so we can train the model and then test it. A common split is 80% for training and 20% for testing. During training, a model can overfit or underfit the data, so we test it on data it has not seen to get a more reliable accuracy score.
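To make the 80/20 idea concrete, here is a minimal sketch of a shuffled split over row indices. `train_test_indices` is a hypothetical helper, not scikit-learn's `train_test_split`, so it will not select the same rows; the row count of 5,524 is the total implied by the shapes printed after the split in this notebook.

```python
import math
import random

def train_test_indices(n_samples, test_size=0.2, seed=0):
    """Shuffle the row indices and hold out a fraction of them for testing."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # fixed seed makes the split repeatable
    # Round the test share up (an assumption of this sketch).
    n_test = math.ceil(n_samples * test_size)
    return indices[n_test:], indices[:n_test]

train_idx, test_idx = train_test_indices(5524)
print(len(train_idx), len(test_idx))
```

Fixing the seed plays the same role as `random_state`: rerunning the cell gives the same split, so results stay comparable between runs.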

In [12]:
# splitting data.
x_train,x_test,y_train,y_test = train_test_split(df.Content,df.label,test_size = 0.2 , random_state = 0)

In [13]:
x_train.shape, y_train.shape, x_test.shape, y_test.shape

((4419,), (4419,), (1105,), (1105,))

#### 4.0.1 Identifying Labels

In [14]:
non_licensing = x_train[y_train[y_train==0].index]
licensing = x_train[y_train[y_train==1].index]


### 5. Vectorize the Text Data

Now we need to put the data into a form our model can understand. We also did this during topic modeling, and it happened behind the scenes in some of the keyword extraction methods. `CountVectorizer` transforms each text into a vector based on the frequency (count) of each word that occurs in it.
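The counting idea can be sketched in plain Python. This is a simplified illustration with a whitespace tokenizer and no frequency filtering; `count_vectorize` and `ngrams` are hypothetical helpers and do not reproduce every behaviour of scikit-learn's `CountVectorizer`.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return the n-grams (joined by spaces) found in a list of tokens."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def count_vectorize(docs, ngram_range=(1, 2)):
    """Build a sorted vocabulary and a document-term count matrix (list of rows)."""
    low, high = ngram_range
    doc_counts = []
    for doc in docs:
        tokens = doc.split()  # simplified whitespace tokenizer
        terms = []
        for n in range(low, high + 1):
            terms.extend(ngrams(tokens, n))
        doc_counts.append(Counter(terms))
    vocabulary = sorted({term for counts in doc_counts for term in counts})
    # A Counter returns 0 for terms that a document does not contain.
    matrix = [[counts[term] for term in vocabulary] for counts in doc_counts]
    return vocabulary, matrix

docs = ["not happy", "happy happy customer"]
vocabulary, matrix = count_vectorize(docs)
print(vocabulary)
print(matrix)
```

Each row of `matrix` is one document and each column is one unigram or bigram, which is the shape of input the classifier expects.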

In [15]:
# Count vectorizer that builds features (word counts) from the full text of each article.
count_vec = CountVectorizer(max_features=4000, ngram_range=(1,2), max_df=0.9, min_df=0)
# max_df=0.9 drops terms that appear in more than 90% of the articles (the most frequent words, as discussed earlier).
# ngram_range=(1,2) keeps single words and two-word phrases, so 'not happy' is captured as a phrase as well as the separate words 'not' and 'happy'.


In [16]:
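# toarray() returns plain NumPy arrays; todense() returns np.matrix, which recent scikit-learn releases no longer accept.
x_train_features = count_vec.fit_transform(x_train).toarray()
x_test_features = count_vec.transform(x_test).toarray()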
x_train_features.shape, x_test_features.shape

((4419, 4000), (1105, 4000))

### 6. Create and fit the model

Now, the model is ready to be created and fitted. Once we train the model, we will assess its accuracy with the test data. 

In [17]:
nb_clf = MultinomialNB()
nb_clf.fit(x_train_features, y_train)
y_pred = nb_clf.predict(x_test_features)
print(accuracy_score(y_test,y_pred)*100)



69.32126696832579




### 7. Examine the Results 

To get a deeper understanding of our model's performance, we will print out the classification report. A classification report is similar to a confusion matrix: it tells us the precision and recall based on the true and false positives and negatives. A 'true' result was classified correctly; a 'false' result was labeled incorrectly. Precision is the number of true positives divided by the total number of items classified as positive (true and false positives). Recall is the number of true positives divided by the total number of elements that actually belong to the positive class (i.e. the sum of true positives and false negatives, which are items that were not labelled as belonging to the positive class but should have been). This is done for each category.

From our classification report, we can see that the model had a more difficult time with the 0, or non-licensing, articles: it recovered only 45% of them (recall 0.45), while it found 99% of the licensing articles. In other words, it labeled many non-licensing articles as licensing, which is why precision for class 1 is only 0.60. Overall, our model did OK. This is not something we could provide to a company to execute a sorting task or any high-confidence predictive measure. However, we can fine-tune the parameters or adjust the preprocessing technique to try to improve upon our model.
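The definitions above can be written out as a short sketch. The counts below are made up for illustration (they are not this model's confusion matrix), and `binary_metrics` is a hypothetical helper rather than a scikit-learn function.

```python
def binary_metrics(tp, fp, fn, tn):
    """Compute precision, recall, F1 and accuracy for the positive class."""
    precision = tp / (tp + fp)   # of everything predicted positive, how much was right
    recall = tp / (tp + fn)      # of everything actually positive, how much was found
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return precision, recall, f1, accuracy

# Made-up counts showing high recall but low precision for the positive class.
precision, recall, f1, accuracy = binary_metrics(tp=90, fp=60, fn=10, tn=40)
print(precision, recall, f1, accuracy)
```

A model with this pattern finds almost every positive item but also labels many negative items as positive, so recall is high while precision and accuracy suffer.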


#### 7.0.1 Classification report

In [18]:
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.98      0.45      0.62       606
           1       0.60      0.99      0.74       499

    accuracy                           0.69      1105
   macro avg       0.79      0.72      0.68      1105
weighted avg       0.81      0.69      0.67      1105



Sources:
1. [Naive Bayes Classifier (NB)](https://medium.com/@akshayc123/naive-bayes-classifier-nb-7429a1bdb2c0#:~:text=Multinomial%20NB%20is%20used%20for%20multinomial%20distribution%20that,It%20is%20used%20when%20data%20has%20Gaussian%20distribution.)
2. [Code Source](https://www.kaggle.com/code/amananandrai/nlp-using-ml-algorithms-news-articles/notebook)
3. [Word Lemmatization](https://www.machinelearningplus.com/nlp/lemmatization-examples-python/#:~:text=Lemmatization%20is%20the%20process%20of%20converting%20a%20word,outputs%20from%20these%20packages.%20Skip%20to%20content%20Blogs)
4. [Posterior Probability](https://www.investopedia.com/terms/p/posterior-probability.asp)

## Third Model - Gradient Boost : Supervised

**What is Gradient Boosting?**

Our third and last model of the project uses the Gradient Boosting Classifier. This is a very popular machine learning algorithm and can be used for both regression and classification. The classifier is a boosting type of ensemble learning method. Boosting works by combining many weak predictors, each trained to correct the errors of the ones before it, into a single strong predictor.

**How does Gradient Boosting work?**

To make its initial prediction, the algorithm computes the log of the odds of the target class; this single value acts as the first leaf. Next, it converts the log(odds) into a probability, calculates a residual (the observed label minus the predicted probability) for each training sample, and fits a small decision tree to those residuals. Unlike regression Gradient Boost, the values in a leaf cannot simply be added together without some sort of transformation, because the predictions are in terms of the log of the odds while the residuals are in terms of probability. In the transformation, the numerator is the sum of the residuals in the leaf, and the denominator is the sum, over the same samples, of the previous predicted probability multiplied by 1 minus the previous predicted probability. This transformation is applied to every leaf of the tree. [2] Now the prediction can be updated: the new log(odds) prediction is the previous log(odds) prediction plus the learning rate multiplied by the output of the leaf the sample falls in. This process is repeated with new trees until a predefined number of trees is reached or the residuals are negligible. [1]
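A single boosting update can be sketched as follows. This is a simplified illustration under stated assumptions: the leaf assignments in `leaf_ids` are given by hand instead of being learned by a regression tree, and the helper names are hypothetical; it is not scikit-learn's implementation.

```python
import math
from collections import defaultdict

def sigmoid(log_odds):
    """Convert a log(odds) value into a probability."""
    return 1.0 / (1.0 + math.exp(-log_odds))

def initial_log_odds(labels):
    """Initial prediction: log of (number of positives / number of negatives)."""
    positives = sum(labels)
    negatives = len(labels) - positives
    return math.log(positives / negatives)

def boosting_step(labels, log_odds, leaf_ids, learning_rate=0.1):
    """Update each sample's log(odds) using the output value of its leaf."""
    probs = [sigmoid(z) for z in log_odds]
    numerator = defaultdict(float)    # sum of residuals per leaf
    denominator = defaultdict(float)  # sum of p * (1 - p) per leaf
    for y, p, leaf in zip(labels, probs, leaf_ids):
        numerator[leaf] += y - p
        denominator[leaf] += p * (1 - p)
    return [
        z + learning_rate * numerator[leaf] / denominator[leaf]
        for z, leaf in zip(log_odds, leaf_ids)
    ]

labels = [1, 1, 1, 0]        # toy labels: three licensing, one non-licensing
leaf_ids = [0, 0, 1, 1]      # assumed leaf assignment from a hypothetical tree
start = initial_log_odds(labels)
updated = boosting_step(labels, [start] * len(labels), leaf_ids)
print(start, updated)
```

Samples in the leaf that holds only positive labels have their log(odds) pushed up, while the mixed leaf is pushed down; repeating the step with new trees keeps shrinking the residuals.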




In [19]:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import GradientBoostingClassifier
from pprint import pprint
from sklearn.model_selection import RandomizedSearchCV
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from sklearn.model_selection import ShuffleSplit
import matplotlib.pyplot as plt
import seaborn as sns
import pickle
import pandas as pd
from sklearn.feature_selection import chi2
import numpy as np

### 2. Define Parameters
We have to define the different parameters:

* ngram_range: This is where the n-gram sizes are set. We want to consider both unigrams and bigrams.

* max_df: When building the vocabulary, ignore terms that have a document frequency strictly higher than the given threshold.

* min_df: When building the vocabulary, ignore terms that have a document frequency strictly lower than the given threshold.

* max_features: If not None, build a vocabulary that only considers the top max_features ordered by term frequency across the corpus. [3]

In [20]:
# Parameter selection
ngram_range = (1,2)
min_df = 10
max_df = 1.
max_features = 300


### 3. Vectorize the Features

This is the same process as for our previous model, but with a different formula. Tf stands for term frequency and idf for inverse document frequency; the method multiplies tf by idf. Some data scientists prefer TF-IDF because it considers not only how frequently a word appears in a document but also how informative the word is: words that appear in many documents receive a lower weight.




Please note that we have fitted and then transformed the training set, but we have only transformed the test set.
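The TF-IDF weighting described above can be sketched by hand. This is a simplified illustration that assumes the smoothed idf, sublinear tf scaling and L2 normalisation described in scikit-learn's documentation; `tfidf_vectors` is a hypothetical helper, not the library's implementation, and it uses a plain whitespace tokenizer on made-up documents.

```python
import math

def tfidf_vectors(docs):
    """Return the vocabulary, idf weights and L2-normalised TF-IDF rows."""
    tokenised = [doc.split() for doc in docs]
    vocabulary = sorted({t for tokens in tokenised for t in tokens})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = {t: sum(1 for tokens in tokenised if t in set(tokens)) for t in vocabulary}
    # Smoothed idf: rarer terms receive a larger weight.
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocabulary}
    vectors = []
    for tokens in tokenised:
        weights = []
        for term in vocabulary:
            tf = tokens.count(term)
            # Sublinear scaling: 1 + ln(tf) for terms that occur, 0 otherwise.
            scaled_tf = 1 + math.log(tf) if tf > 0 else 0.0
            weights.append(scaled_tf * idf[term])
        norm = math.sqrt(sum(w * w for w in weights))
        vectors.append([w / norm if norm else 0.0 for w in weights])
    return vocabulary, idf, vectors

docs = ["license agreement signed", "market news", "license news"]
vocabulary, idf, vectors = tfidf_vectors(docs)
print(vocabulary)
print(idf)
```

Here "license" appears in two of the three documents and "agreement" in only one, so "agreement" receives the higher idf weight.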

In [21]:
tfidf = TfidfVectorizer(encoding='utf-8',
                        ngram_range=ngram_range,
                        stop_words=None,
                        lowercase=False,
                        max_df=max_df,
                        min_df=min_df,
                        max_features=max_features,
                        norm='l2',
                        sublinear_tf=True)
                        
features_train = tfidf.fit_transform(x_train).toarray()
labels_train = y_train
print(features_train.shape)

features_test = tfidf.transform(x_test).toarray()
labels_test = y_test
print(features_test.shape)

(4419, 300)
(1105, 300)


### 3. Hyperparameter tuning

**What is hyperparameter tuning?**

Hyperparameters are the variables that govern the training process itself and are set before training, whereas parameters are the variables that the chosen machine learning technique adjusts to fit your data. Hyperparameters are tuned so we can find the best fit for our model and data and produce the highest accuracy.

Let's check the current model parameters. Then, we will use cross validation to tune the hyperparameters.

In [22]:
gb_0 = GradientBoostingClassifier(random_state = 8)

print('Parameters currently in use:\n')
## pretty print
pprint(gb_0.get_params())

Parameters currently in use:

{'ccp_alpha': 0.0,
 'criterion': 'friedman_mse',
 'init': None,
 'learning_rate': 0.1,
 'loss': 'deviance',
 'max_depth': 3,
 'max_features': None,
 'max_leaf_nodes': None,
 'min_impurity_decrease': 0.0,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'n_estimators': 100,
 'n_iter_no_change': None,
 'random_state': 8,
 'subsample': 1.0,
 'tol': 0.0001,
 'validation_fraction': 0.1,
 'verbose': 0,
 'warm_start': False}


**Tree-related hyperparameters**:

* n_estimators = number of trees (boosting stages) in the ensemble.
* max_features = max number of features considered for splitting a node
* max_depth = max number of levels in each decision tree
* min_samples_split = min number of data points placed in a node before the node is split
* min_samples_leaf = min number of data points allowed in a leaf node


**Boosting-related hyperparameters**:

* learning_rate= learning rate shrinks the contribution of each tree by learning_rate.
* subsample= the fraction of samples to be used for fitting the individual base learners.


#### 3.0.1 Random Search Cross Validation

When performing cross validation, either random search or grid search is used to try out different hyperparameters in the hope of finding the best fit for our model and data. We will start with random search.

**What is random search?**

Random search is a randomized search over hyperparameters. In contrast to GridSearchCV, not all parameter values are tried out, but rather a fixed number of parameter settings is sampled from the specified distributions. [4] Because grid search is an exhaustive approach, it is very time consuming, whereas random search is effective and time-efficient. The logic behind random search is that by randomly picking different hyperparameter combinations, we will most likely pick one similar to the exhaustive solution of grid search in less time.

**What is Cross Validation? (CV)**

Cross validation is very commonly used when building a machine learning model. It can help when there is not a large amount of data to train and test on. It also helps guard against choosing hyperparameters that overfit. Cross validation splits the training data into subsets called folds, and the user can designate how many folds are used. In each round, all folds but one are used to train, and the remaining fold is used for validation; the rounds repeat until every fold has served as the validation set once.
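The fold logic can be sketched in a few lines. `k_fold_indices` is a hypothetical helper that splits row indices in order, without shuffling; it only illustrates the idea and is not scikit-learn's splitter.

```python
def k_fold_indices(n_samples, n_folds):
    """Return a list of (train_indices, validation_indices) pairs, one per fold."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [
        n_samples // n_folds + (1 if i < n_samples % n_folds else 0)
        for i in range(n_folds)
    ]
    splits = []
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        splits.append((train, validation))
        start += size
    return splits

splits = k_fold_indices(n_samples=10, n_folds=3)
for train, validation in splits:
    print("train:", train, "validation:", validation)
```

Every sample is used for validation exactly once and for training in all the other rounds, which is why the averaged score is a steadier estimate than a single split.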




#### 3.0.2 Define Search Parameters

In [24]:
# n_estimators
n_estimators = [200, 800]

# max_features
max_features = ['auto', 'sqrt']

# max_depth
max_depth = [10, 40]
max_depth.append(None)

# min_samples_split
min_samples_split = [10, 30, 50]

# min_samples_leaf
min_samples_leaf = [1, 2, 4]

# learning rate
learning_rate = [.1, .5]

# subsample
subsample = [.5, 1.]

# Create the random grid
random_grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf,
               'learning_rate': learning_rate,
               'subsample': subsample}

pprint(random_grid)

{'learning_rate': [0.1, 0.5],
 'max_depth': [10, 40, None],
 'max_features': ['auto', 'sqrt'],
 'min_samples_leaf': [1, 2, 4],
 'min_samples_split': [10, 30, 50],
 'n_estimators': [200, 800],
 'subsample': [0.5, 1.0]}
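To put the size of this search space in perspective, the sketch below enumerates every combination in the grid and then draws a random subset. It assumes 10 sampled settings and 3 cross-validation folds; the sampling uses Python's `random` module purely for illustration, so it will not pick the same settings as `RandomizedSearchCV`.

```python
import itertools
import random

# Same candidate values as the random grid printed above.
random_grid = {
    'learning_rate': [0.1, 0.5],
    'max_depth': [10, 40, None],
    'max_features': ['auto', 'sqrt'],
    'min_samples_leaf': [1, 2, 4],
    'min_samples_split': [10, 30, 50],
    'n_estimators': [200, 800],
    'subsample': [0.5, 1.0],
}

# An exhaustive search would have to try every combination.
all_combinations = list(itertools.product(*random_grid.values()))
total = len(all_combinations)

# A random search tries only a fixed number of sampled settings.
n_iter, n_folds = 10, 3  # assumed search settings for this sketch
sampled = random.Random(8).sample(all_combinations, n_iter)

exhaustive_fits = total * n_folds
sampled_fits = n_iter * n_folds
print(total, exhaustive_fits, sampled_fits)
```

The grid holds 2 × 3 × 2 × 3 × 3 × 2 × 2 = 432 combinations, so an exhaustive search with 3 folds would need 1,296 model fits, compared with 30 for the sampled search.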


### 3.0.3 Create Base Gradient Boosting Model

In [25]:
# First create the base model to tune
gbc = GradientBoostingClassifier(random_state=8)

# Definition of the random search
random_search = RandomizedSearchCV(estimator=gbc,
                                   param_distributions=random_grid,
                                   n_iter=10,
                                   scoring='accuracy',
                                   cv=3, 
                                   verbose=1, 
                                   random_state=8)

# Fit the random search model
random_search.fit(features_train, labels_train)

Fitting 3 folds for each of 10 candidates, totalling 30 fits


RandomizedSearchCV(cv=3, estimator=GradientBoostingClassifier(random_state=8),
                   param_distributions={'learning_rate': [0.1, 0.5],
                                        'max_depth': [10, 40, None],
                                        'max_features': ['auto', 'sqrt'],
                                        'min_samples_leaf': [1, 2, 4],
                                        'min_samples_split': [10, 30, 50],
                                        'n_estimators': [200, 800],
                                        'subsample': [0.5, 1.0]},
                   random_state=8, scoring='accuracy', verbose=1)

#### 3.0.4 Print Out Hyperparameters and Accuracy Score

In [26]:
print("The best hyperparameters from Random Search are:")
print(random_search.best_params_)
print("")
print("The mean accuracy of a model with these hyperparameters is:")
print(random_search.best_score_)

The best hyperparameters from Random Search are:
{'subsample': 0.5, 'n_estimators': 800, 'min_samples_split': 30, 'min_samples_leaf': 4, 'max_features': 'sqrt', 'max_depth': 10, 'learning_rate': 0.1}

The mean accuracy of a model with these hyperparameters is:
0.949988685222901


#### 3.0.5 Grid Search CV 

**What is Grid Search?**


Like random search, grid search is a way of searching over hyperparameters, but it is exhaustive: it tries every combination in the grid we supply. This will help us build a high-performing model. Grid search is great for spot-checking combinations that are known to perform well generally. Random search is great for discovery and getting hyperparameter combinations that you would not have guessed intuitively, although it often requires more time to execute. We will take what we discovered in random search and refine it with grid search.


In [27]:
# Create the parameter grid based on the results of random search 
max_depth = [5, 10, 15]
max_features = ['sqrt']
min_samples_leaf = [2]
min_samples_split = [50, 100]
n_estimators = [800]
learning_rate = [.1, .5]
subsample = [1.]

param_grid = {
    'max_depth': max_depth,
    'max_features': max_features,
    'min_samples_leaf': min_samples_leaf,
    'min_samples_split': min_samples_split,
    'n_estimators': n_estimators,
    'learning_rate': learning_rate,
    'subsample': subsample

}

# Create a base model
gbc = GradientBoostingClassifier(random_state=8)

# Manually create the splits in CV in order to be able to fix a random_state (GridSearchCV doesn't have that argument)
cv_sets = ShuffleSplit(n_splits = 3, test_size = .33, random_state = 8)

# Instantiate the grid search model
grid_search = GridSearchCV(estimator=gbc, 
                           param_grid=param_grid,
                           scoring='accuracy',
                           cv=cv_sets,
                           verbose=1)

# Fit the grid search to the data
grid_search.fit(features_train, labels_train)

Fitting 3 folds for each of 12 candidates, totalling 36 fits


GridSearchCV(cv=ShuffleSplit(n_splits=3, random_state=8, test_size=0.33, train_size=None),
             estimator=GradientBoostingClassifier(random_state=8),
             param_grid={'learning_rate': [0.1, 0.5], 'max_depth': [5, 10, 15],
                         'max_features': ['sqrt'], 'min_samples_leaf': [2],
                         'min_samples_split': [50, 100], 'n_estimators': [800],
                         'subsample': [1.0]},
             scoring='accuracy', verbose=1)

#### 3.0.6 Print Out Best Hyperparameters from Grid Search

In [28]:
print("The best hyperparameters from Grid Search are:")
print(grid_search.best_params_)
print("")
print("The mean accuracy of a model with these hyperparameters is:")
print(grid_search.best_score_)

The best hyperparameters from Grid Search are:
{'learning_rate': 0.1, 'max_depth': 15, 'max_features': 'sqrt', 'min_samples_leaf': 2, 'min_samples_split': 100, 'n_estimators': 800, 'subsample': 1.0}

The mean accuracy of a model with these hyperparameters is:
0.9529358007767877


In [29]:
best_gbc = grid_search.best_estimator_

### 4. Create and Fit Gradient Boost Model

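In this section, we are ready to create and train our model. We will use the best parameters from grid search. Grid search scored about 0.3 percentage points higher than random search in mean cross-validation accuracy (0.953 vs. 0.950), although the two searches used different cross-validation splits, so the comparison is only approximate.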

In [30]:
best_gbc.fit(features_train, labels_train)

GradientBoostingClassifier(max_depth=15, max_features='sqrt',
                           min_samples_leaf=2, min_samples_split=100,
                           n_estimators=800, random_state=8)

### 5. Evaluate the Model 

We will now see how well the model performs on our unseen test data.

In [31]:
gbc_pred = best_gbc.predict(features_test)

In [32]:
# Training accuracy
print("The training accuracy is: ")
print(accuracy_score(labels_train, best_gbc.predict(features_train)))

The training accuracy is: 
1.0


In [33]:
# Test accuracy
print("The test accuracy is: ")
print(accuracy_score(labels_test, gbc_pred))


The test accuracy is: 
0.9638009049773756


### 6. Examine Results 

Similar to our previous model, we will print out a classification report to identify strengths and weaknesses. For the overall accuracy on the test data, our model scored **96 percent**. This is very good and successfully achieves our task. The training accuracy of 1.0 against a test accuracy of about 0.96 suggests some overfitting, but the gap is small. This is something we could use if continuing this project and could offer to a business as a successful predictive measure.







In [34]:
# Classification report
print("Classification report")
print(classification_report(labels_test,gbc_pred))

Classification report
              precision    recall  f1-score   support

           0       0.96      0.98      0.97       606
           1       0.97      0.95      0.96       499

    accuracy                           0.96      1105
   macro avg       0.96      0.96      0.96      1105
weighted avg       0.96      0.96      0.96      1105



#### 6.0.1 Classification Report
Now, let's break down our classification report for the licensing class (1):

**Precision**: Out of all the articles that the model predicted were licensing articles, 97% actually are.

**Recall**: Out of all the articles that actually were licensing, the model predicted this outcome correctly for 95% of those articles.

**F1 Score**: This value is calculated as:

F1 Score: 2 * (Precision * Recall) / (Precision + Recall)
F1 Score: 2 * (.97 * .95) / (.97 + .95)
F1 Score: 0.96


Since this value is very close to 1, it tells us that the model does a good job of predicting whether or not an article is a licensing agreement article.

#### Sources: 

1. [Gradient Boosting Classification ](https://towardsdatascience.com/gradient-boosting-classification-explained-through-python-60cc980eeb3d#:~:text=Boosting%20is%20a%20special%20type,attention%20to%20its%20predecessor's%20mistakes.)
2. [Gradient Boost Classification Video](https://www.youtube.com/watch?v=jxuNLH5dXCs&t=346s)
3. [Code source Github](https://github.com/miguelfzafra/Latest-News-Classifier/blob/master/0.%20Latest%20News%20Classifier/03.%20Feature%20Engineering/03.%20Feature%20Engineering.ipynb)
4. [Random Search CV](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html)

