# Tech/Business News Article Classification using ALBERT-V2: A Deep Learning Approach

### Team : Harshwardhan Patil, Avinash Pawar, Aoi Minamoto

#### Brief Description:
For this question, we generated two text categories with 100 texts each (200 in total). The categories are Tech Articles and Business Articles.
To generate the texts, we used the News API to fetch English-language top headlines for the Technology and Business categories.

In [1]:
# Suppresses non-critical TensorFlow warnings
import os
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'

In [2]:
# Import TensorFlow and enable memory growth on any available GPU.
# This helps avoid some GPU memory runtime errors during model execution.
import tensorflow as tf

# Enable memory growth
gpus = tf.config.list_physical_devices('GPU')
if gpus:
    try:
        # Currently, memory growth needs to be the same across GPUs
        for gpu in gpus:
            tf.config.experimental.set_memory_growth(gpu, True)
        logical_gpus = tf.config.experimental.list_logical_devices('GPU')
        print(len(gpus), "Physical GPUs,", len(logical_gpus), "Logical GPUs")
    except RuntimeError as e:
        # Memory growth must be set before GPUs have been initialized
        print(e)

1 Physical GPUs, 1 Logical GPUs


### Import required libraries

In [3]:
import requests
import pandas as pd
import numpy as np
from sklearn.utils import shuffle
from sklearn.model_selection import train_test_split
from transformers import AutoTokenizer, TFAutoModelForSequenceClassification
from tensorflow.keras.optimizers import Adam
import evaluate

### Data Fetching through API

In [4]:
# set the API endpoint URL
url = 'https://newsapi.org/v2/top-headlines'
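
The request cells that follow embed the API key directly in the notebook and repeat the same parameter dictionary for each category. A minimal sketch of an alternative, assuming the key is stored in an environment variable: the variable name `NEWSAPI_KEY` and the helper `build_params` are hypothetical and not part of the original notebook.

```python
import os

def build_params(category, page_size=100, env_var="NEWSAPI_KEY"):
    """Return query parameters for the top-headlines endpoint.

    `NEWSAPI_KEY` is an assumed environment variable name; set it to your own key.
    """
    api_key = os.environ.get(env_var)
    if not api_key:
        raise RuntimeError(f"Set the {env_var} environment variable first")
    return {
        "category": category,
        "language": "en",
        "pageSize": page_size,
        "apiKey": api_key,
    }
```

With this helper, each request could be written as `requests.get(url, params=build_params('technology'))`, and the key would no longer appear in the saved notebook.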

In [5]:
# set the request parameters
params = {
    'category': 'technology',
    'language': 'en',
    'pageSize': 100,
    'apiKey': 'adefd8b37ba649039de179bdb9a70985'
}

# send the request and get the response
response = requests.get(url, params=params)

# get the 'articles' list from the response JSON data
articles = response.json()['articles']

# extract the titles from the articles, also stripping the source name from each title
Tech_titles = [article['title'].split(' - ')[0].split(' | ')[0] for article in articles]

# print the titles
for title in Tech_titles:
    print(title)
    
#check length
len(Tech_titles)

Funko Fusion teaser trailer, screenshots
Grounded unearths three new achievements with Super Duper Update
AMD's new Ryzen 7000 X3D CPUs have burnt out for some, and a BIOS update could prevent it
Use WhatsApp on multiple phones, new feature to help businesses
'Star Trek: Resurgence' launches May 23rd on most platforms
iOS 17 Rumored to Add New Lock Screen, Apple Music, and App Library Features
Haier QLED TV launched in India: Price, features and more
Samsung's Galaxy Watch 5 has dropped to a new all-time low
AMD’s Ryzen Z1 chips could power a new wave of handheld Steam Deck clones
First Ride: The New GT Sensor Loses Weight, Gains Travel
Google makes a Pixel Watch stand, sort of...
Nokia G11 Plus starts receiving Android 13 update
No, Croma is not selling iPhone 13 for Rs 38990 in India: Here are the details
AMD announces Ryzen Z1 series chipsets for handheld gaming consoles
Strayed Lights – Launch Trailer – Nintendo Switch
Official Nintendo Switch SD Card Line Expands With 1TB Zelda C

100
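
The list comprehension above keeps only the text before the first `' - '` or `' | '`, so a headline that contains one of these separators in the middle is cut short. A minimal sketch of a variant that removes only the last separator-delimited chunk is shown here; `clean_title` is a hypothetical helper, and the guard for a missing title is a defensive assumption rather than documented API behavior.

```python
def clean_title(raw_title):
    """Strip a trailing ' - Source' or ' | Source' suffix from a headline."""
    if not raw_title:  # defensive: treat a missing or empty title as ""
        return ""
    title = raw_title
    for sep in (" - ", " | "):
        # rpartition splits on the LAST occurrence of the separator
        head, found, _tail = title.rpartition(sep)
        if found:
            title = head
    return title.strip()

print(clean_title("Strayed Lights - Launch Trailer - Nintendo Switch - Nintendo"))
```

This still misfires on a headline that contains a separator but carries no source suffix, so it is a trade-off rather than a complete fix.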

In [6]:
# set the request parameters
params = {
    'category': 'business',
    'language': 'en',
    'pageSize': 100,
    'apiKey': 'adefd8b37ba649039de179bdb9a70985'
}

# send the request and get the response
response = requests.get(url, params=params)

# get the 'articles' list from the response JSON data
articles = response.json()['articles']

# extract the titles from the articles, also stripping the source name from each title
Business_titles = [article['title'].split(' - ')[0].split(' | ')[0] for article in articles]

# print the titles
for title in Business_titles:
    print(title)
    
#check length
len(Business_titles)

We spent $100,000 on an abandoned high school, and $3.3 million to convert it into apartments—take a look inside
3M to cut 6,000 jobs in second round of layoffs this year
First Republic Stock Plummets to a New Low, Drags Down Other Regionals
Dow Jones Losses Deepen In Afternoon Trading As Google, Microsoft Earnings Loom
FDA grants accelerated approval for Biogen ALS drug that treats rare form of the disease
Terra co-founder Daniel Shin charged with fraud in South Korea
GM, Hyundai announce EV battery plants for the US
Kim Foxx won't seek reelection as Cook County State's Attorney, Chicago DA
U.S. regulators warn they already have the power to go after A.I. bias — and they're ready to use it
‘Vampire’ straw found hidden in traveler’s backpack at Boston airport, cops say
Nate Silver Out at ABC News as Disney Layoffs Once Again Hit News Division
UPS shares fall after delivery giant reports disappointing earnings
Gap Plans to Lay Off Hundreds of Corporate Workers in Latest Cuts
Halliburton

100

### Loading Data

In [7]:
data_list = []
for title in Tech_titles:
    data_list.append({'sentence': str(title), 'label': 0}) 

for title in Business_titles:
    data_list.append({'sentence': str(title), 'label': 1}) 


In [8]:
data = pd.DataFrame(data_list)
print(data)

                                              sentence  label
0             Funko Fusion teaser trailer, screenshots      0
1    Grounded unearths three new achievements with ...      0
2    AMD's new Ryzen 7000 X3D CPUs have burnt out f...      0
3    Use WhatsApp on multiple phones, new feature t...      0
4    'Star Trek: Resurgence' launches May 23rd on m...      0
..                                                 ...    ...
195     Ford Jump Starts Its Attempt to Revive Detroit      1
196  Rail Vikas Nigam zooms 20% after 12% equity ch...      1
197  Indian banks unlikely to go SVB, Credit Suisse...      1
198  ITC hits all-time high, pips Infosys to become...      1
199  ICICI Securities downgrades Yes Bank to ‘reduc...      1

[200 rows x 2 columns]
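
Because both categories come from the same top-headlines endpoint, the same story could in principle be returned under both, which would put one sentence in the frame with two different labels. This notebook does not check for that case. A minimal sketch of such a check follows; `find_shared_titles` is a hypothetical helper and the sample lists are made up for illustration.

```python
def normalise(title):
    """Lower-case a title and collapse runs of whitespace."""
    return " ".join(title.lower().split())

def find_shared_titles(tech_titles, business_titles):
    """Return normalised headlines that occur in both lists."""
    tech_seen = {normalise(t) for t in tech_titles}
    business_seen = {normalise(t) for t in business_titles}
    return sorted(tech_seen & business_seen)

# Made-up sample data, not the fetched headlines
tech = ["AMD announces Ryzen Z1 series", "Nokia G11 Plus starts receiving Android 13 update"]
business = ["UPS shares fall", "amd announces  Ryzen Z1 series"]
shared = find_shared_titles(tech, business)
print(shared)
```

In the notebook, the call would be `find_shared_titles(Tech_titles, Business_titles)`; an empty result means no headline carries both labels.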


### Shuffling the data

In [9]:
data = shuffle(data, random_state=987654321)
data.head()

                                              sentence  label
155         Wall Street analysts' top calls on Tuesday      1
175  Updated: Biogen chops certain stroke and RNA t...      1
124    The 29 best Mother’s Day gifts for moms in 2023      1
11   Nokia G11 Plus starts receiving Android 13 update      0
23   You can now use one WhatsApp account on two or...      0


### Splitting into Test and Train data

In [10]:
X_train = data.drop('label', axis=1)
y_train = data['label']

In [11]:
X_train, X_test, y_train, y_test = train_test_split(X_train, y_train, test_size=40, random_state=987654321)
X_train.head()

                                              sentence
128  Spotify CEO Ek: 'We'd like to raise prices in ...
66   AI Can Be a Tool for More Efficient Game Devel...
153  Tesla Drops Model Y Starting Price Below the A...
139  Jeff Shell investigated for CNBC's Hadley Gamb...
71   Slack’s Canvas feature puts docs inside your c...
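
The split above draws 40 test rows at random, so the two classes are not guaranteed to be equally represented in the test set. scikit-learn's `train_test_split` accepts a `stratify` argument for this purpose. The standard-library sketch below only illustrates the idea: `stratified_split` is a hypothetical helper that takes the same number of test examples from each class, which matches proportional stratification when the classes are balanced, as they are here.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_size, seed=987654321):
    """Return (train_idx, test_idx) with the same number of test rows per class.

    Assumes test_size is divisible by the number of classes and that every
    class has at least test_size // n_classes examples.
    """
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)
    per_class = test_size // len(by_label)
    test_idx = []
    for label in sorted(by_label):
        test_idx.extend(rng.sample(by_label[label], per_class))
    test_set = set(test_idx)
    train_idx = [i for i in range(len(labels)) if i not in test_set]
    return train_idx, sorted(test_idx)

labels = [0] * 100 + [1] * 100           # 100 tech rows, then 100 business rows
train_idx, test_idx = stratified_split(labels, test_size=40)
print(len(train_idx), len(test_idx))     # 160 40
print(sum(labels[i] for i in test_idx))  # 20 business headlines in the test set
```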


## Pretrained Model
Since we are using Hugging Face Transformers, we had many pretrained models to choose from. After testing three models (GPT-2, albert-base-v2, and albert-large-v2), we chose albert-base-v2. GPT-2 was too large for our system: it failed with an "OOM" (Out Of Memory) error, indicating that our GPU did not have enough memory to allocate the required tensor.

### Model description : albert-base-v2 (A Lite BERT) 
For more information, see https://huggingface.co/albert-base-v2 

ALBERT is a Transformer model that was pretrained on a large corpus of English text in a self-supervised fashion. That is, it was pretrained on raw text only, with no human labeling; an automatic process generated the inputs and labels from the text. The base model has the following configuration: 12 repeating layers, an embedding dimension of 128, a hidden dimension of 768, 12 attention heads, and 11 million parameters.
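
The 11 million figure follows from ALBERT's two parameter-reduction techniques: a factorized embedding (vocabulary to 128 dimensions, then a projection to 768) and one set of layer weights shared by all 12 layers. The rough arithmetic below is a sketch that counts weight matrices only. It assumes a vocabulary of 30,000 tokens and a feed-forward width of 3072, and it ignores biases, layer normalization, position embeddings, and the classifier head, so it will not match the exact parameter count.

```python
VOCAB_SIZE = 30_000        # assumed albert-base-v2 vocabulary size
EMBED_DIM = 128
HIDDEN_DIM = 768
FFN_DIM = 4 * HIDDEN_DIM   # 3072, assumed feed-forward width
NUM_LAYERS = 12

# Factorized embedding: vocab -> 128, then a 128 -> 768 projection.
factorized = VOCAB_SIZE * EMBED_DIM + EMBED_DIM * HIDDEN_DIM
# Unfactorized alternative: vocab -> 768 directly.
direct = VOCAB_SIZE * HIDDEN_DIM

# One layer: query, key, value, and output projections plus two feed-forward matrices.
one_layer = 4 * HIDDEN_DIM * HIDDEN_DIM + 2 * HIDDEN_DIM * FFN_DIM

shared_total = factorized + one_layer             # one set of layer weights, reused 12 times
unshared_total = direct + NUM_LAYERS * one_layer  # separate weights for every layer

print(f"embeddings: {factorized:,} vs {direct:,}")      # 3,938,304 vs 23,040,000
print(f"total:      {shared_total:,} vs {unshared_total:,}")  # 11,016,192 vs 107,974,656
```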

In [12]:
# Clearing Session
tf.keras.backend.clear_session()
tf.random.set_seed(987654321)
np.random.seed(987654321)

In [13]:
# initializing a tokenizer and a pre-trained model for sequence classification using the ALBERT-base-v2 architecture
tokenizer = AutoTokenizer.from_pretrained("albert-base-v2")

model = TFAutoModelForSequenceClassification.from_pretrained("albert-base-v2")

All model checkpoint layers were used when initializing TFAlbertForSequenceClassification.

Some layers of TFAlbertForSequenceClassification were not initialized from the model checkpoint at albert-base-v2 and are newly initialized: ['classifier']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


### Preprocessing
#### We used the AutoTokenizer from Hugging Face Transformers

In [14]:
#tokenizing the input sentences using the tokenizer object
X_train = dict(tokenizer([str(i) for i in X_train['sentence']], return_tensors='np', padding=True))
X_test = dict(tokenizer([str(i) for i in X_test['sentence']], return_tensors='np', padding=True))
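
With `padding=True`, the tokenizer pads every sequence in the call to the length of the longest one and returns an attention mask marking which positions hold real tokens. The toy sketch below illustrates that behavior on made-up token ids; `pad_batch` is a hypothetical helper and does not reproduce ALBERT's actual tokenization.

```python
def pad_batch(token_id_lists, pad_id=0):
    """Pad every sequence to the longest one and build the matching attention mask."""
    max_len = max(len(ids) for ids in token_id_lists)
    input_ids, attention_mask = [], []
    for ids in token_id_lists:
        n_pad = max_len - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Made-up token ids for two headlines of different lengths
batch = pad_batch([[2, 15, 27, 3], [2, 98, 3]])
print(batch["input_ids"])       # [[2, 15, 27, 3], [2, 98, 3, 0]]
print(batch["attention_mask"])  # [[1, 1, 1, 1], [1, 1, 1, 0]]
```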

### Compiling the model

In [15]:
# Compile and fine-tune the pretrained model for sequence classification using the Adam optimizer
# with a learning rate of 1e-5, for 4 epochs with a batch size of 80.
model.compile(optimizer=Adam(1e-5))
model.fit(X_train, y_train, epochs=4, batch_size=80)

No loss specified in compile() - the model's internal loss computation will be used as the loss. Don't panic - this is a common way to train TensorFlow models in Transformers! To disable this behaviour please pass a loss argument, or explicitly pass `loss=None` if you do not want your model to compute a loss.


Epoch 1/4
Epoch 2/4
Epoch 3/4
Epoch 4/4


<keras.callbacks.History at 0x7fac94140b90>
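
With 160 training headlines and a batch size of 80, each epoch consists of only two weight updates. The short calculation below makes the number of optimizer steps explicit; `optimizer_steps` is a hypothetical helper used only for this illustration.

```python
import math

def optimizer_steps(n_samples, batch_size, epochs):
    """Number of weight updates: ceil(n_samples / batch_size) per epoch."""
    return math.ceil(n_samples / batch_size) * epochs

print(optimizer_steps(160, 80, 4))    # 8
print(optimizer_steps(160, 80, 10))   # 20
```

A smaller batch size would give more updates per epoch at the same learning rate, which is one knob to consider alongside the number of epochs.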

### Predictions using the trained model

In [16]:
# Use the trained model to make predictions on the test set X_test
preds = model.predict(X_test)["logits"]



In [17]:
y_pred = np.argmax(preds, axis=1)
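
The model returns one pair of logits per headline, and `np.argmax` picks the index of the larger one: 0 for tech, 1 for business. The standard-library sketch below shows the same step on made-up logits, with a softmax added to turn the logits into probabilities; the helper names are hypothetical.

```python
import math

def softmax(logits):
    """Convert one row of logits to probabilities (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict_label(logits):
    """Index of the largest logit: 0 = tech, 1 = business."""
    return max(range(len(logits)), key=lambda i: logits[i])

example_logits = [[1.2, -0.3], [-0.5, 0.9]]   # made-up values, not model output
predicted = [predict_label(row) for row in example_logits]
print(predicted)                               # [0, 1]
print([round(p, 3) for p in softmax(example_logits[0])])
```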

### Calculating the test Accuracy

In [18]:
metric = evaluate.load('accuracy')
metric.compute(predictions=y_pred, references=np.array(y_test))

{'accuracy': 0.8}
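
An accuracy of 0.8 on 40 test headlines means 32 correct predictions, so the estimate carries considerable sampling uncertainty. The sketch below computes a rough 95% interval using the normal approximation to the binomial; `accuracy_interval` is a hypothetical helper, and the approximation is coarse for a test set this small.

```python
import math

def accuracy_interval(accuracy, n, z=1.96):
    """Approximate 95% interval for an accuracy measured on n test examples."""
    stderr = math.sqrt(accuracy * (1 - accuracy) / n)
    return accuracy - z * stderr, accuracy + z * stderr

low, high = accuracy_interval(0.8, 40)
print(f"{low:.3f} to {high:.3f}")   # 0.676 to 0.924
```

Differences of a few percentage points between runs fall well inside this range.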

### Observation 
After training the model for 4 epochs, the training loss decreased at each epoch.
The test accuracy is 80%, measured on the 40 held-out headlines.
Training for more epochs could improve accuracy, but it also increases the risk of overfitting.
Hence, we next run training with epochs = 10 to evaluate the model further.

### Re-running with more epochs

In [19]:
# Clearing Session
tf.keras.backend.clear_session()
tf.random.set_seed(987654321)
np.random.seed(987654321)

In [20]:
# Train for 10 epochs with the same optimizer settings
model.compile(optimizer=Adam(1e-5))
model.fit(X_train, y_train, epochs=10, batch_size=80)

No loss specified in compile() - the model's internal loss computation will be used as the loss. Don't panic - this is a common way to train TensorFlow models in Transformers! To disable this behaviour please pass a loss argument, or explicitly pass `loss=None` if you do not want your model to compute a loss.


Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x7fac504f1b90>
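
Calling `fit` again on an existing Keras model continues from its current weights. Re-compiling creates a new optimizer, and `tf.keras.backend.clear_session()` resets Keras's global state, but as far as we can tell neither reloads the pretrained checkpoint into the existing `model` object. The toy sketch below illustrates the difference between reusing a model and rebuilding it before each run; `ToyModel` and `run_fresh` are hypothetical stand-ins that only count epochs.

```python
class ToyModel:
    """Stand-in for a Keras model: it only counts the epochs it has been trained for."""

    def __init__(self):
        self.epochs_seen = 0

    def fit(self, epochs):
        self.epochs_seen += epochs

def run_fresh(build_model, epochs):
    """Build a new model, train it, and return it, so runs do not share weights."""
    model = build_model()
    model.fit(epochs)
    return model

# Reusing one object, as a second fit() call on the same model does:
reused = ToyModel()
reused.fit(4)
reused.fit(10)
print(reused.epochs_seen)                    # 14

# Rebuilding before each run keeps the experiments independent:
print(run_fresh(ToyModel, 4).epochs_seen)    # 4
print(run_fresh(ToyModel, 10).epochs_seen)   # 10
```

In the notebook, `build_model` would wrap the `TFAutoModelForSequenceClassification.from_pretrained("albert-base-v2")` call followed by `compile`.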

In [21]:
preds = model.predict(X_test)["logits"]
y_pred = np.argmax(preds, axis=1)
metric = evaluate.load('accuracy')
metric.compute(predictions=y_pred, references=np.array(y_test))



{'accuracy': 0.9}

### Conclusion:
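After running training with 10 epochs, we again observed a decrease in the training loss at each epoch. The test accuracy was 90%, higher than the 80% obtained after the first 4 epochs. Note that the second run reused the already fine-tuned model rather than reloading the pretrained checkpoint, so the 90% figure reflects 14 epochs of training in total. With only 40 test headlines, the gap corresponds to 4 additional correct predictions. A larger model might achieve higher accuracy, but the results above do not establish this.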