# Classification -- All The News 2.1 Articles

- Text Vectorization (from scikit-learn):
  - TF-IDF Vectorizer
- Classifiers (from scikit-learn):
  - Random Forest
  - Logistic Regression
  - K-Nearest Neighbors
  - Simple Decision Tree
  - Gaussian Naive Bayes
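As a rough sketch of how the pieces listed above fit together, a TF-IDF vectorizer can be chained with any of the classifiers in a scikit-learn pipeline. The toy corpus, the `toy_` names, and the choice of Logistic Regression here are illustrative assumptions, not the configuration used later in this notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy corpus standing in for the news articles (hypothetical data).
toy_articles = [
    "the team won the championship game last night",
    "the striker scored twice in the final match",
    "the central bank raised interest rates again",
    "stocks fell as investors worried about inflation",
]
toy_categories = ["sports", "sports", "business", "business"]

# Vectorize the raw text with TF-IDF, then fit a classifier on the result.
toy_pipeline = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
toy_pipeline.fit(toy_articles, toy_categories)

# Predict the category of an unseen sentence.
toy_prediction = toy_pipeline.predict(["the bank cut interest rates"])
print(toy_prediction)
```

One caveat worth keeping in mind: `TfidfVectorizer` returns a sparse matrix, and `GaussianNB` expects dense input, so that classifier would likely need a densifying step (for example `.toarray()` on a small sample) before fitting.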

## Import Libraries and Set Settings

In [1]:
!pip install wordcloud
!pip install plotly
!pip install nltk

Successfully installed wordcloud-1.8.1
Successfully installed plotly-4.14.3 retrying-1.3.3
Collecting nltk

In [2]:
import os                              # Python standard library
import numpy as np
import pandas as pd
import ipywidgets as widgets

# from sqlalchemy import create_engine   # conda install -c anaconda sqlalchemy
from wordcloud import WordCloud          # conda install -c conda-forge wordcloud
from ipywidgets import interact, fixed

# For visualizations
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go

# Azure ML Specific
from azureml.core import Workspace, Dataset

# SK-Learn
from sklearn.model_selection import train_test_split

# Text Manipulation
import spacy

In [3]:
sns.set_theme(style="whitegrid")
pd.options.display.max_rows = 3000

## Import Dataset

This notebook works with the `AllTheNews21_Training` dataset.

In [None]:
# Azure ML workspace details used to locate the registered dataset
subscription_id = '546d9c91-7fcf-4547-836c-10b640e06628'
resource_group = 'NSSCapstoneProject'
workspace_name = 'AllTheNews21'

# Connect to the existing workspace
workspace = Workspace(subscription_id, resource_group, workspace_name)

# Look up the registered dataset by name
dataset = Dataset.get_by_name(workspace, name='AllTheNews21_Training')

# Optional: download the dataset and read it from the local file instead
# dataset.download(target_path='./data/', overwrite=True)
# news = pd.read_csv("data/AllTheNews21_Training.csv")

# Load the dataset directly into a pandas DataFrame
news = dataset.to_pandas_dataframe()


Using alternate reader. Inconsistent or mixed schemas detected across partitions: partition had different number of columns. The first partition has 4 columns. Found partition has 135 columns.
First partition columns (ordered): ['article', 'category', 'article_length', 'word_count']
Found Partition has columns (ordered): ['article', 'category', 'article_length', 'word_count', 'Column6', 'Column7', 'Column8', 'Column9', 'Column10', 'Column11', 'Column12', 'Column13', 'Column14', 'Column15', 'Column16', 'Column17', 'Column18', 'Column19', 'Column20', 'Column21', 'Column22', 'Column23', 'Column24', 'Column25', 'Column26', 'Column27', 'Column28', 'Column29', 'Column30', 'Column31', 'Column32', 'Column33', 'Column34', 'Column35', 'Column36', 'Column37', 'Column38', 'Column39', 'Column40', 'Column41', 'Column42', 'Column43', 'Column44', 'Column45', 'Column46', 'Column47', 'Column48', 'Column49', 'Column50', 'Column51', 'Column52', 'Column53', 'Column54', 'Column55', 'Column56', 'Column57', 
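The warning above suggests that some rows spilled into extra, auto-named columns (`Column6`, `Column7`, ...), most likely because of unescaped delimiters in the article text. Below is a minimal sketch of one way to find and discard such rows, shown on a toy frame. The `toy_` names are hypothetical, and dropping the affected rows (rather than repairing them) is an assumption, not necessarily what this notebook does.

```python
import pandas as pd

# Toy frame mimicking the mixed schema: four real columns plus spill-over columns.
toy_news = pd.DataFrame({
    "article": ["first article", "second article"],
    "category": ["sports", "business"],
    "article_length": [13, 14],
    "word_count": [2, 2],
    "Column6": [None, "stray text"],
    "Column7": [None, None],
})

# Columns whose names match the auto-generated "Column<N>" pattern.
spill_columns = toy_news.columns[toy_news.columns.str.match(r"^Column\d+$")]

# Rows with any value in a spill-over column are likely malformed.
malformed_rows = toy_news[spill_columns].notna().any(axis=1)
print(f"{malformed_rows.sum()} of {len(toy_news)} rows have spill-over values")

# Keep only the well-formed rows and the expected columns.
toy_clean = toy_news.loc[~malformed_rows].drop(columns=spill_columns)
print(toy_clean)
```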

In [None]:
display(news.shape)
display(news.head())

### Some Post-Import Cleanups

In [None]:
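# Drop the leftover CSV index column; errors="ignore" skips it if the column is absent
news = news.drop(columns=["Unnamed: 0"], errors="ignore")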

display(news.shape)
display(news.head())

## EDA & Feature Engineering

### How many unique classes are in `category`?

In [None]:
display(len(news["category"].unique()))
display(news["category"].unique())

### How many rows and how many columns?

In [None]:
news.shape

### What is the data type of each column?

In [None]:
news.dtypes

### Do we have any missing values?

In [None]:
news.isnull().any()
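`isnull().any()` only says whether a column has gaps. A small sketch of how the count and share of missing values per column could be obtained as well, using a toy frame with hypothetical values:

```python
import pandas as pd

# Toy frame with a few deliberately missing entries (hypothetical data).
toy_news = pd.DataFrame({
    "article": ["some text", None, "more text"],
    "category": ["sports", "business", None],
    "article_length": [9, 0, 9],
})

# Number of missing values per column.
missing_counts = toy_news.isnull().sum()

# Fraction of rows that are missing, per column.
missing_share = toy_news.isnull().mean()

print(missing_counts)
print(missing_share.round(2))
```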

### Is the target variable balanced or not?

In [None]:
sns.countplot(x=news["category"]);
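Alongside the count plot, the class balance can be read off numerically. A minimal sketch on a toy series follows; the category names and counts are made up for illustration.

```python
import pandas as pd

# Toy target column with an uneven class distribution (hypothetical data).
toy_categories = pd.Series(
    ["sports"] * 6 + ["business"] * 3 + ["health"] * 1, name="category"
)

# Absolute and relative frequency of each class.
class_counts = toy_categories.value_counts()
class_share = toy_categories.value_counts(normalize=True)

# Ratio between the largest and smallest class as a simple imbalance indicator.
imbalance_ratio = class_counts.max() / class_counts.min()

print(class_share.round(2))
print(f"largest/smallest class ratio: {imbalance_ratio:.1f}")
```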

### What is the distribution of `article_length`?

In [None]:
sns.histplot(x=news["article_length"]).set_title("News Content Lengths Distribution");

In [None]:
news["article_length"].max()

In [None]:
news["article_length"].min()

## Cleaning And Pre-Processing

## Classification Groundwork

### Split Data into Training and Testing Sets

In [None]:
# Features and Target
X = news["article"]
y = news["category"]

In [None]:
# Train/test split with a 20% test size
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.2,
    random_state=777,
    stratify=y  # Preserve the class proportions of y in both splits
)
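To see what `stratify` does, the class proportions of the two splits can be compared. This is a small sketch on toy data, with hypothetical `toy_` names and an 80/20 class mix chosen for illustration:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy target with an 80/20 class mix and matching placeholder features.
toy_y = pd.Series(["sports"] * 80 + ["business"] * 20)
toy_X = pd.Series([f"article {i}" for i in range(len(toy_y))])

# Same split settings as above, applied to the toy data.
toy_X_train, toy_X_test, toy_y_train, toy_y_test = train_test_split(
    toy_X,
    toy_y,
    test_size=0.2,
    random_state=777,
    stratify=toy_y,
)

# With stratification, both splits should mirror the original class mix.
train_share = toy_y_train.value_counts(normalize=True)
test_share = toy_y_test.value_counts(normalize=True)
print(train_share)
print(test_share)
```

Running the same check on `y_train` and `y_test` would show whether the real split kept the category proportions.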