<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Project-Objective" data-toc-modified-id="Project-Objective-1">Project Objective</a></span></li><li><span><a href="#Project-setup" data-toc-modified-id="Project-setup-2">Project setup</a></span><ul class="toc-item"><li><ul class="toc-item"><li><span><a href="#python-pickle" data-toc-modified-id="python-pickle-2.0.1">python pickle</a></span></li><li><span><a href="#Number-of-articles-in-each-category" data-toc-modified-id="Number-of-articles-in-each-category-2.0.2">Number of articles in each category</a></span></li><li><span><a href="#News-length-by-category" data-toc-modified-id="News-length-by-category-2.0.3">News length by category</a></span></li></ul></li></ul></li><li><span><a href="#Pre-processing-&amp;-Feature-Engineering" data-toc-modified-id="Pre-processing-&amp;-Feature-Engineering-3">Pre-processing &amp; Feature Engineering</a></span><ul class="toc-item"><li><span><a href="#Coding-the-labels" data-toc-modified-id="Coding-the-labels-3.1">Coding the labels</a></span></li><li><span><a href="#Create-the-train-and-test-sets" data-toc-modified-id="Create-the-train-and-test-sets-3.2">Create the train and test sets</a></span></li><li><span><a href="#Create-features-using-TF-IDF" data-toc-modified-id="Create-features-using-TF-IDF-3.3">Create features using TF-IDF</a></span></li><li><span><a href="#Creating-the-features" data-toc-modified-id="Creating-the-features-3.4">Creating the features</a></span></li><li><span><a href="#Looking-at-the-result" data-toc-modified-id="Looking-at-the-result-3.5">Looking at the result</a></span></li><li><span><a href="#Base-Models" data-toc-modified-id="Base-Models-3.6">Base Models</a></span><ul class="toc-item"><li><span><a href="#Random-Forest" data-toc-modified-id="Random-Forest-3.6.1">Random Forest</a></span></li></ul></li></ul></li></ul></div>

https://towardsdatascience.com/text-classification-in-python-dd95d264c802

# Project Objective

Classify (categorize) news articles based on their content. There are five categories: business, entertainment, politics, sport and technology.

- A labeled dataset of approximately 2,200 BBC articles has been provided.
    - Each individual article was provided in its own file.
    - The data has already been processed into one large file - News_dataset.csv
    - Every article became a row in the corpus.
- This is a supervised learning problem

# Project setup

- Connect to Google Drive
- Inside of your NLP_data folder, create a new folder named Pickles
- News_dataset.csv will be read from GitHub. Output files will be placed in the Pickles folder.

### python pickle

Pickle in Python is primarily used in serializing and deserializing a Python object structure. In other words, it's the process of converting a Python object into a byte stream to store it in a file/database, maintain program state across sessions, or transport data over the network.
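
As a minimal sketch of that round trip, the example below serializes a small object to a file and reads it back. The `record` dictionary and the `example.pickle` file name are hypothetical stand-ins for the DataFrame and paths used later in this notebook.

```python
import pickle
import tempfile
from pathlib import Path

# Hypothetical object standing in for a DataFrame or a fitted model
record = {"category": "sport", "length": 1234, "tokens": ["match", "win"]}

# Hypothetical file name, placed in a temporary directory for the demo
pickle_path = Path(tempfile.mkdtemp()) / "example.pickle"

# Serialize: convert the object into a byte stream and write it to disk
with open(pickle_path, "wb") as output:
    pickle.dump(record, output)

# Deserialize: read the byte stream back into a new Python object
with open(pickle_path, "rb") as data:
    restored = pickle.load(data)

print(restored == record)   # equal content
print(restored is record)   # but a distinct object
```

Only unpickle files from a source you trust, since loading a pickle can execute arbitrary code.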

https://towardsdatascience.com/stop-using-csvs-for-storage-pickle-is-an-80-times-faster-alternative-832041bbc199

In [None]:
#from google.colab import drive
#drive.mount('/content/drive')

In [61]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import pickle
import seaborn as sns
sns.set_style("whitegrid")
#import altair as alt
#alt.renderers.enable("notebook")

# Code for hiding seaborn warnings
import warnings
warnings.filterwarnings("ignore")


import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import chi2

from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn import svm
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import GradientBoostingClassifier

from pprint import pprint
from sklearn.model_selection import RandomizedSearchCV
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from sklearn.model_selection import ShuffleSplit

nltk.download('punkt')
nltk.download('wordnet')
nltk.download('stopwords')


[nltk_data] Downloading package punkt to /Users/jimcody/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /Users/jimcody/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/jimcody/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

Loading the dataset:

In [None]:
url = 'https://raw.githubusercontent.com/jimcody2014/nlp_cdc/main/data/News_dataset.csv'
df = pd.read_csv(url, sep=';',usecols = [1,2])

In [None]:
df.head()

### Number of articles in each category

In [None]:
sns.countplot(x="Category", data=df)

### News length by category

We define the news length field as the number of characters in each article. Although the text contains special characters (``\r, \n``), the count is still useful as an approximation.

In [None]:
df['News_length'] = df['Content'].str.len()

In [None]:
plt.figure(figsize=(12.8,6))
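# histplot replaces the deprecated distplot; kde=True overlays the density curve
sns.histplot(df['News_length'], kde=True, stat='density').set_title('News length distribution');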

In [None]:
df['News_length'].describe()

Let's drop the articles at or above the 95th percentile so the histogram is easier to read:

In [None]:
quantile_95 = df['News_length'].quantile(0.95)
df_95 = df[df['News_length'] < quantile_95]

In [None]:
plt.figure(figsize=(12.8,6))
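# histplot replaces the deprecated distplot; kde=True overlays the density curve
sns.histplot(df_95['News_length'], kde=True, stat='density').set_title('News length distribution');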

We can get the number of news articles with more than 10,000 characters:

In [None]:
df_more10k = df[df['News_length'] > 10000]
len(df_more10k)

Let's see one:

In [None]:
#df_more10k['Content'].iloc[0]

It's just a large news article.

Let's now plot a boxplot:

In [None]:
plt.figure(figsize=(12.8,6))
sns.boxplot(data=df, x='Category', y='News_length', width=.5);

Now let's plot the same boxplot without the longest documents (`df_95`) so the categories are easier to compare:

In [None]:
plt.figure(figsize=(12.8,6))
sns.boxplot(data=df_95, x='Category', y='News_length');

We can see that, although the length distribution differs between categories, the difference is not large. If article lengths varied widely between categories we would have a problem, since the feature creation process may take raw word counts into account. However, when creating the features with TF-IDF scoring, we will normalize the feature vectors to avoid this.
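
To illustrate the normalization point, here is a small sketch on made-up documents (not the BBC data). It assumes scikit-learn's `TfidfVectorizer` with `norm="l2"`, the same norm used later in this notebook. The second toy document is the first one repeated five times, so it is five times longer but has the same word proportions.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy documents (hypothetical): long_doc is short_doc repeated five times
short_doc = "market shares rise as profits grow"
long_doc = " ".join([short_doc] * 5)
other_doc = "team wins the final match"

tfidf_demo = TfidfVectorizer(norm="l2")
matrix = tfidf_demo.fit_transform([short_doc, long_doc, other_doc]).toarray()

# With L2 normalization every row has unit length, whatever the document length
row_norms = np.linalg.norm(matrix, axis=1)
print(row_norms)

# The short and the long document end up with the same feature vector
print(np.allclose(matrix[0], matrix[1]))
```

Because each row is scaled to unit length, a longer article does not get larger feature values just for containing more words.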

This is as far as the exploratory data analysis goes. We now turn to the **Feature Engineering** section.

We'll save the dataset:

In [None]:
#with open('News_dataset.pickle', 'wb') as output:
#   pickle.dump(df, output)

# Pre-processing & Feature Engineering

In [None]:
#path_df = "/home/lnc/0. Latest News Classifier/02. Exploratory Data Analysis/News_dataset.pickle"

#path_df ='/Users/jimcody/Documents/2021Python/nlp/data/News_dataset.pickle'
#with open(path_df, 'rb') as data:
#    df = pickle.load(data)

In [None]:
df.head()

In [None]:
# \r is a carriage return and \n is a new line; replace both with a space
df['Content_Parsed_1'] = df['Content'].str.replace("\r", " ", regex=False)
df['Content_Parsed_1'] = df['Content_Parsed_1'].str.replace("\n", " ", regex=False)
df['Content_Parsed_1'] = df['Content_Parsed_1'].str.replace("    ", " ", regex=False)

df['Content_Parsed_1'] = df['Content_Parsed_1'].str.replace('"', '', regex=False)  # remove double quotes
df['Content_Parsed_2'] = df['Content_Parsed_1'].str.lower()                         # make lowercase

# Remove punctuation; regex=False treats "?" and "." as literal characters
punctuation_signs = list("?:!.,;")
df['Content_Parsed_3'] = df['Content_Parsed_2']
for punct_sign in punctuation_signs:
    df['Content_Parsed_3'] = df['Content_Parsed_3'].str.replace(punct_sign, '', regex=False)

df['Content_Parsed_4'] = df['Content_Parsed_3'].str.replace("'s", "", regex=False)  # remove possessive 's

In [None]:
# Lemmatize
wordnet_lemmatizer = WordNetLemmatizer()

# Iterate over every word in order to lemmatize
nrows = len(df)
lemmatized_text_list = []

for row in range(0, nrows):
    
    # Create an empty list containing lemmatized words
    lemmatized_list = []
    
    # Save the text and its words into an object
    text = df.loc[row, 'Content_Parsed_4']
    text_words = text.split(" ")

    # Iterate through every word to lemmatize
    for word in text_words:
        lemmatized_list.append(wordnet_lemmatizer.lemmatize(word, pos="v"))
        
    # Join the list
    lemmatized_text = " ".join(lemmatized_list)
    
    # Append to the list containing the texts
    lemmatized_text_list.append(lemmatized_text)
    
df['Content_Parsed_5'] = lemmatized_text_list

In [None]:
# Stop words
stop_words = list(stopwords.words('english'))

df['Content_Parsed_6'] = df['Content_Parsed_5']     # Put 5 into 6

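for stop_word in stop_words:                        # Remove each stop word from column 6

    # \b matches word boundaries, so only whole words are removed.
    # regex=True is required: pandas 2.0 changed the default to literal matching.
    regex_stopword = r"\b" + re.escape(stop_word) + r"\b"
    df['Content_Parsed_6'] = df['Content_Parsed_6'].str.replace(regex_stopword, '', regex=True)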

# This might result in some double and triple spacing between words.  
# That will be corrected when the content is tokenized.

stop_words

In [None]:
df.head(1)

In [None]:
list_columns = ['Category', 'Content', 'Content_Parsed_6']
df = df[list_columns]

df = df.rename(columns={'Content_Parsed_6': 'Content_Parsed'})

In [None]:
df.head()

## Coding the labels

In [None]:
category_codes = {
    'business': 0,
    'entertainment': 1,
    'politics': 2,
    'sport': 3,
    'tech': 4
}

# Category mapping: translate each category name into its integer code
df['Category_Code'] = df['Category'].map(category_codes)
df.head()

## Create the train and test sets

In [None]:
X_train, X_test, y_train, y_test = train_test_split(df['Content_Parsed'], 
                                                    df['Category_Code'], 
                                                    test_size=0.15, 
                                                    random_state=8)

## Create features using TF-IDF

We have various options:

* Count Vectors as features
* TF-IDF Vectors as features
* Word Embeddings as features
* Text / NLP based features
* Topic Models as features

*Count vectors and TF-IDF are considered 'bag of words' methods that do not take the order of words in a sentence into consideration.*
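
A quick sketch of what "bag of words" means in practice, using two made-up sentences and scikit-learn's `CountVectorizer`: with unigrams only, sentences that contain the same words in a different order get identical vectors. Adding bigrams, as this notebook does later, recovers a little local word order.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two toy sentences (hypothetical) with the same words in a different order
sentences = ["dog bites man", "man bites dog"]

# Unigram counts: word order is lost, so both rows are identical
unigram_counts = CountVectorizer().fit_transform(sentences).toarray()
print((unigram_counts[0] == unigram_counts[1]).all())

# Unigrams plus bigrams: "dog bites" and "man bites" are now separate features
bigram_counts = CountVectorizer(ngram_range=(1, 2)).fit_transform(sentences).toarray()
print((bigram_counts[0] == bigram_counts[1]).all())
```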

An almost understandable word embedding article - https://jalammar.github.io/illustrated-word2vec/

Another... https://medium.com/geekculture/word-embeddings-in-ai-10a9e430cb59

We have to define the different parameters:

* `ngram_range`: The range of n-gram sizes to extract. We want to consider both unigrams and bigrams, so we use `(1, 2)`.
* `max_df`: When building the vocabulary, ignore terms that have a document frequency strictly higher than the given threshold.
* `min_df`: When building the vocabulary, ignore terms that have a document frequency strictly lower than the given threshold.
* `max_features`: If not None, build a vocabulary that only considers the top `max_features` terms ordered by term frequency across the corpus.
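
The effect of these parameters is easier to see on a tiny corpus. The sketch below uses five made-up documents and deliberately small thresholds (`min_df=2`, `max_features=2`), which are illustrative values only and not the ones chosen for this project. A float `max_df=1.0` is a proportion of documents, so it filters nothing.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus (hypothetical). Document frequencies:
# "stock" = 4, "market" = 3, "stock market" = 2, every other term = 1
toy_corpus = [
    "stock market falls",
    "stock market rallies",
    "stock prices rise",
    "market opens higher",
    "stock tips",
]

# min_df=2 drops every unigram and bigram that appears in fewer than 2 documents
by_min_df = TfidfVectorizer(ngram_range=(1, 2), min_df=2, max_df=1.0)
by_min_df.fit(toy_corpus)
vocab_min_df = sorted(by_min_df.vocabulary_)
print(vocab_min_df)

# max_features=2 then keeps only the 2 most frequent of the surviving terms
by_max_features = TfidfVectorizer(ngram_range=(1, 2), min_df=2, max_df=1.0,
                                  max_features=2)
by_max_features.fit(toy_corpus)
vocab_max_features = sorted(by_max_features.vocabulary_)
print(vocab_max_features)
```

On the real corpus the same mechanism applies at a larger scale: `min_df` removes rare terms and `max_features` caps the number of columns in the feature matrix.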

In [None]:
# Parameter selection
ngram_range = (1,2)
min_df = 10
max_df = 1.
max_features = 300

## Creating the features

In [None]:
tfidf = TfidfVectorizer(encoding='utf-8',
                        ngram_range=ngram_range,
                        stop_words=None,
                        lowercase=False,
                        max_df=max_df,
                        min_df=min_df,
                        max_features=max_features,
                        norm='l2',
                        sublinear_tf=True)
                        
features_train = tfidf.fit_transform(X_train).toarray()
labels_train = y_train
print(features_train.shape)

features_test = tfidf.transform(X_test).toarray()
labels_test = y_test
print(features_test.shape)

## Looking at the result

In [None]:
from sklearn.feature_selection import chi2
import numpy as np

for category, category_id in sorted(category_codes.items()):
    # Chi-squared score of every feature against a one-vs-rest label for this category
    features_chi2 = chi2(features_train, labels_train == category_id)
    # Sort features by ascending score, so the most correlated terms come last
    indices = np.argsort(features_chi2[0])
    # get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
    feature_names = np.array(tfidf.get_feature_names_out())[indices]
    unigrams = [v for v in feature_names if len(v.split(' ')) == 1]
    bigrams = [v for v in feature_names if len(v.split(' ')) == 2]
    print("# '{}' category:".format(category))
    print("  . Most correlated unigrams:\n. {}".format('\n. '.join(unigrams[-5:])))
    print("  . Most correlated bigrams:\n. {}".format('\n. '.join(bigrams[-2:])))
    print("")


In [None]:
bigrams

## Base Models

### Random Forest

In [54]:
base_rfc = RandomForestClassifier(random_state = 8)
base_rfc.fit(features_train, labels_train)
print(accuracy_score(labels_train, base_rfc.predict(features_train)))
print(accuracy_score(labels_test, base_rfc.predict(features_test)))

1.0
0.9281437125748503


In [56]:
base_knn = KNeighborsClassifier()
base_knn.fit(features_train, labels_train)
print(accuracy_score(labels_train, base_knn.predict(features_train)))
print(accuracy_score(labels_test, base_knn.predict(features_test)))

0.9603384452670545
0.9341317365269461


In [50]:
base_lr = LogisticRegression(random_state = 8)
base_lr.fit(features_train, labels_train)
print(accuracy_score(labels_train, base_lr.predict(features_train)))
print(accuracy_score(labels_test, base_lr.predict(features_test)))

0.9809624537281861
0.9401197604790419


In [57]:
base_svm = svm.SVC(random_state = 8)
base_svm.fit(features_train, labels_train)
print(accuracy_score(labels_train, base_svm.predict(features_train)))
print(accuracy_score(labels_test, base_svm.predict(features_test)))

0.9989423585404548
0.9550898203592815


In [60]:
base_mnbc = MultinomialNB()
base_mnbc.fit(features_train, labels_train)
print(accuracy_score(labels_train, base_mnbc.predict(features_train)))
print(accuracy_score(labels_test, base_mnbc.predict(features_test)))

0.9539925965097832
0.9341317365269461


In [62]:
base_gb = GradientBoostingClassifier(random_state = 8)
base_gb.fit(features_train, labels_train)
print(accuracy_score(labels_train, base_gb.predict(features_train)))
print(accuracy_score(labels_test, base_gb.predict(features_test)))

1.0
0.9311377245508982
