# News Classification with Feed Forward Neural Networks (Multi-Layer Perceptrons)

© Data Trainers LLC. GPL v 3.0.

**Author:** Axel Sirota


In this notebook we will use the cnn headline dataset that has 130000 news titles and their category between: Sports, Business, Sci-Tec and World.

Take it slow and notice we will leverage the embeddings we learned before, and notice the last layer will need a softmax and as many cells as categories we have.

You can run this lab both locally or in Colab.

- To run in Colab just go to `https://colab.research.google.com`, sign-in and you upload this notebook. Colab has GPU access for free.
- To run locally just run `jupyter notebook` and access the notebook in this lab. You would need to first install the requirements in `requirements.txt`

Follow the instructions. Good luck!

In [1]:
!nvidia-smi

Tue Jan 28 14:48:44 2025       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  Tesla T4                       Off | 00000000:00:04.0 Off |                    0 |
| N/A   53C    P8              10W /  70W |      0MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                    

In [2]:
!pip install --upgrade  textblob gensim pytorch-nlp swifter

Collecting textblob
  Downloading textblob-0.19.0-py3-none-any.whl.metadata (4.4 kB)
Collecting pytorch-nlp
  Downloading pytorch_nlp-0.5.0-py3-none-any.whl.metadata (9.0 kB)
Collecting swifter
  Downloading swifter-1.4.0.tar.gz (1.2 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m1.2/1.2 MB[0m [31m21.3 MB/s[0m eta [36m0:00:00[0m
[?25h  Preparing metadata (setup.py) ... [?25l[?25hdone
Collecting dask-expr<1.2,>=1.1 (from dask[dataframe]>=2.10.0->swifter)
  Downloading dask_expr-1.1.21-py3-none-any.whl.metadata (2.6 kB)
INFO: pip is looking at multiple versions of dask-expr to determine which version is compatible with other requirements. This could take a while.
  Downloading dask_expr-1.1.20-py3-none-any.whl.metadata (2.6 kB)
  Downloading dask_expr-1.1.19-py3-none-any.whl.metadata (2.6 kB)
  Downloading dask_expr-1.1.18-py3-none-any.whl.metadata (2.6 kB)
  Downloading dask_expr-1.1.16-py3-none-any.whl.metadata (2.5 kB)
Downloading textblob-0.19.0-py3-none-

In [3]:
import multiprocessing
import sys

import numpy as np
import random
import os
import pandas as pd
import gensim
import warnings
import nltk
import pickle
import torch
import torch.nn as nn
import torch.optim as optim
import re
import warnings
from sklearn.model_selection import train_test_split
from textblob import TextBlob

embedding_dim = 100
epochs=10
batch_size = 250
corpus_size=5000

def set_seeds_and_trace():
  os.environ['PYTHONHASHSEED'] = '0'
  np.random.seed(42)
  random.seed(42)


set_seeds_and_trace()
warnings.filterwarnings('ignore')
nltk.download('punkt')
textblob_tokenizer = lambda x: TextBlob(x).words

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


In [4]:
%%writefile get_data.sh
if [ ! -f news.csv ]; then
  wget -O news.csv https://www.dropbox.com/s/352x7xzivf60zgc/news.csv?dl=0
fi

Writing get_data.sh


In [5]:
!bash get_data.sh

--2025-01-28 14:50:16--  https://www.dropbox.com/s/352x7xzivf60zgc/news.csv?dl=0
Resolving www.dropbox.com (www.dropbox.com)... 162.125.2.18, 2620:100:6017:18::a27d:212
Connecting to www.dropbox.com (www.dropbox.com)|162.125.2.18|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://www.dropbox.com/scl/fi/7p0lipgmk5b1rf6wwmb9m/news.csv?rlkey=i3u8u5n432kdxob7txco82dht&dl=0 [following]
--2025-01-28 14:50:16--  https://www.dropbox.com/scl/fi/7p0lipgmk5b1rf6wwmb9m/news.csv?rlkey=i3u8u5n432kdxob7txco82dht&dl=0
Reusing existing connection to www.dropbox.com:443.
HTTP request sent, awaiting response... 302 Found
Location: https://uc7090c9aad3f575dabf8e761704.dl.dropboxusercontent.com/cd/0/inline/CjAZW7ncfrMpo7O7rdhmivBRJ5vkHSzZLiHCsvc1EyDxq1akVVXSod7wNNsa_DeAb-hxfc2DSShZKjUC_Ospgn7zo4DhnQvsx5nUB83FTgQ1BnDbOW26HFzvpcWNzn1lVOQ8mn3RhTtdziNdpp4BxlVm/file# [following]
--2025-01-28 14:50:16--  https://uc7090c9aad3f575dabf8e761704.dl.dropboxusercontent.com/cd/0/inlin

In [15]:
path = './news.csv'
news_pre = pd.read_csv(path, header=0).sample(n=corpus_size).reset_index(drop=True)

In [16]:
news_pre

Unnamed: 0,category,title
0,Sports,"It #39;s been swell, Pedro"
1,World,US soldiers flock to laser eye clinic
2,Business,"MSU med school plan to move has flaws, study says"
3,Business,Viacom in China Tie-Up with Beijing TV (Reuters)
4,Sports,Coulthard has one race to prove his worth
...,...,...
4995,World,"NGOs working to topple regime, says Mugabe"
4996,Sci/Tech,IBM's PC unit lost money from 2001 onwards
4997,World,Into the abyss
4998,Sports,Lucchino thinks it was a bad move


In [21]:
def preprocess_text(text, should_join=True):
    # Use the tokenizer to tokenize into words, lowercase them, remove punctuation, and finally use gensim.utils.simple_preprocess(text)
    text = ' '.join(gensim.utils.tokenize(text, lowercase=True))
    text = re.sub(r'[.,!?]',r" ", text)
    if should_join:
      return ' '.join(gensim.utils.simple_preprocess(text))
    else:
      return gensim.utils.simple_preprocess(text)

In [22]:
preprocess_text('This is the best night of my life! Is it ? well , maybe')

'this is the best night of my life is it well maybe'

In [23]:
import swifter
# Use swifter to apply the preprocessin and save that pandas series to a file
news = news_pre.title.swifter.apply(preprocess_text)

Pandas Apply:   0%|          | 0/5000 [00:00<?, ?it/s]

In [24]:
news

Unnamed: 0,title
0,it been swell pedro
1,us soldiers flock to laser eye clinic
2,msu med school plan to move has flaws study says
3,viacom in china tie up with beijing tv reuters
4,coulthard has one race to prove his worth
...,...
4995,ngos working to topple regime says mugabe
4996,ibm pc unit lost money from onwards
4997,into the abyss
4998,lucchino thinks it was bad move


In [25]:
news.to_csv('news_processed.csv', index=False)

In [27]:
!head -n 5 news_processed.csv

title
it been swell pedro
us soldiers flock to laser eye clinic
msu med school plan to move has flaws study says
viacom in china tie up with beijing tv reuters


In [30]:

class MyCorpus:
    """An iterator that yields sentences (lists of str)."""

    def __iter__(self):
        corpus_path = 'news_processed.csv'
        for line in open(corpus_path):
            # assume there's one document per line, tokens separated by whitespace
            yield preprocess_text(line, should_join=False)

from gensim.models import Word2Vec

word2vec_model = Word2Vec(sentences = MyCorpus(), vector_size=embedding_dim)
# Get a word2vec model using gensim.models and passing the sentences using MyCorpus()

In [31]:
wv_model = word2vec_model.wv

array([[-3.99890006e-01,  4.10344183e-01,  1.24052331e-01, ...,
        -7.35639393e-01,  2.50209779e-01, -8.90491344e-03],
       [-3.56799871e-01,  3.60813409e-01,  1.08187512e-01, ...,
        -6.35201395e-01,  2.06186280e-01, -8.20607040e-03],
       [-3.13642412e-01,  3.26474965e-01,  8.48005265e-02, ...,
        -5.71801066e-01,  2.03827068e-01, -1.74986739e-02],
       ...,
       [-1.21246725e-02,  1.65891610e-02,  7.62920175e-03, ...,
        -2.12490018e-02,  1.32388738e-03,  1.55974168e-03],
       [-1.56067060e-02, -3.15120706e-04,  1.04339141e-02, ...,
        -7.19414605e-03,  1.06007028e-02,  6.80401316e-03],
       [-1.22097144e-02,  4.31641936e-03,  6.22730702e-03, ...,
        -1.48746781e-02,  4.04775422e-03,  4.24864050e-03]], dtype=float32)

In [33]:
weights = torch.Tensor(wv_model.vectors)  # Get the weights of the model (the embedding) and convert to tensor. Hint: Check word2vec_model.wv
vocab_size = len(wv_model.index_to_key)  # get vocab size from index_to_key in word2vec_model.wv

In [34]:
weights.shape

torch.Size([1413, 100])

In [35]:
news_preprocessed = pd.DataFrame()
news_preprocessed['label'] = news_pre.category.map({'Business': 0, 'Sports': 1, 'Sci/Tech': 2, 'World': 3})
news_preprocessed['title'] = news
news_preprocessed

Unnamed: 0,label,title
0,1,it been swell pedro
1,3,us soldiers flock to laser eye clinic
2,0,msu med school plan to move has flaws study says
3,0,viacom in china tie up with beijing tv reuters
4,1,coulthard has one race to prove his worth
...,...,...
4995,3,ngos working to topple regime says mugabe
4996,2,ibm pc unit lost money from onwards
4997,3,into the abyss
4998,1,lucchino thinks it was bad move


In [37]:
import nltk
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


True

In [38]:
def get_maximum_review_length(df):
    maximum = 0
    for ix, row in df.iterrows():
        candidate = len(textblob_tokenizer(row.title))
        if candidate > maximum:
            maximum = candidate
    return maximum


maximum = get_maximum_review_length(news_preprocessed)   # Since 2 titles may have different number of words, we have to find the max length and fill with 0s if a title is shorter

In [39]:
maximum

17

In [42]:
from hashlib import new
X = np.zeros((len(news_preprocessed), maximum))   # Here we do what we said above
# Iterate through the news df and for every word, if it exists in the word2vec model, put into X for that review and that word the index of the embedding (check index_to_key)
# HINT: to iterate through a column of a pandas dataframe you do:

for index, value in news_preprocessed.iterrows():
    word_ix = 0
    for word in textblob_tokenizer(value.title):
        token = vocab_size + 1
        if word in wv_model.index_to_key:
           token = wv_model.key_to_index[word]
        word_ix += 1

y = news_preprocessed.label

In [43]:
y

Unnamed: 0,label
0,1
1,3
2,0
3,0
4,1
...,...
4995,3
4996,2
4997,3
4998,1


In [44]:
X[:2]

array([[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0.]])

In [45]:
import torch.nn.functional as F
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Convert all datasets to tensors

X_train = torch.Tensor(X_train).to(torch.long)
X_test = torch.Tensor(y_train).to(torch.long)

# Convert y_train and y_test from an array of values between 0-3 to a one hot matrix tensor
y_train = F.one_hot(torch.Tensor(y_train.to_numpy()).to(torch.long))
y_test = F.one_hot(torch.Tensor(y_test.to_numpy()).to(torch.long))

In [48]:
class MeanLayer(nn.Module):

  def forward(self, x):
    return torch.mean(x, dim=1)

In [50]:
# Create a sequential model like we have been doing, apply softmax.

model = nn.Sequential(
    nn.Embedding(vocab_size, embedding_dim),
    nn.Linear(embedding_dim, 100),
    nn.ReLU(),
    nn.Linear(100, 50),
    nn.ReLU(),
    MeanLayer(),
    nn.Linear(50, 10),
    nn.Softmax()
)

In [51]:
loss_function = nn.CrossEntropyLoss()
optimizer = optim.SGD(model.parameters(), lr=0.001)

In [52]:
def train_cbow(X, y, model, loss_function, optimizer, epochs):
    for epoch in range(epochs):
        total_loss = 0
        optimizer.zero_grad()
        log_probs = model(X)


        loss = loss_function(log_probs, y.to(torch.float))
        # Do backword pass and update the gradients
        loss.backward()
        optimizer.step()

        total_loss += loss.item()
        if( epoch + 1) % 10 == 0:
            print(f'Epoch: {epoch}, Loss: {total_loss}')

    return model

    # implement

In [54]:
trained_model = train_cbow(X_train, y_train, model, loss_function, optimizer, epochs=epochs)


RuntimeError: 0D or 1D target tensor expected, multi-target not supported

## Exercise extra-credit: Make X and y a DataLoader, add batching, and validate the performance with the test set