This notebook explores zero-shot text classification using the Hugging Face Transformers pipeline on the `all-the-news-25K.csv` dataset.

## Setup

In [21]:
import pandas as pd

df = pd.read_csv('../baseline/data/all-the-news-25K.csv', low_memory=False, parse_dates=['date'])
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25000 entries, 0 to 24999
Data columns (total 8 columns):
 #   Column       Non-Null Count  Dtype         
---  ------       --------------  -----         
 0   date         25000 non-null  datetime64[ns]
 1   year         25000 non-null  int64         
 2   month        25000 non-null  float64       
 3   day          25000 non-null  int64         
 4   title        25000 non-null  object        
 5   article      25000 non-null  object        
 6   section      25000 non-null  object        
 7   publication  25000 non-null  object        
dtypes: datetime64[ns](1), float64(1), int64(2), object(4)
memory usage: 1.5+ MB


In [22]:
df.sample(5)

Unnamed: 0,date,year,month,day,title,article,section,publication
9888,2019-06-12 00:00:00,2019,6.0,12,Agent Paul to Celtics: Davis won't sign,Agent Rich Paul warned the Boston Celtics: Tra...,sports,Reuters
6739,2016-06-28 15:43:00,2016,6.0,28,Can Lolo Jones Run It Back to the Rio Olympics?,Lolo Jones in slow motion is a rare sight. But...,sports,Vice
6990,2019-06-22 00:00:00,2019,6.0,22,Velodyne Lidar hires bankers for an IPO: Busin...,(Reuters) - Autonomous vehicle technology comp...,business,Reuters
20862,2018-06-09 10:00:00,2018,6.0,9,5 Things We Learned About Mister Rogers from N...,Long before the cacophony of today’s TV with i...,movies,People
4097,2018-04-04 00:00:00,2018,4.0,4,White House says hopes China changes trade pra...,WASHINGTON (Reuters) - The White House said on...,business,Reuters


Create two truncated copies of each article: the first 500 characters (~100 words) and the first 1000 characters (~200 words):

In [23]:
# Slicing past the end of a string is safe, so shorter articles are kept whole
df['article_trunc500'] = df['article'].str[:500]
df['article_trunc1000'] = df['article'].str[:1000]
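
The slices above cut at a fixed character offset, so the last word of a truncated article can be split in half. If that matters, a minimal sketch of truncating at a word boundary instead could look like the following. `truncate_at_word` is a hypothetical helper, not part of pandas or this notebook, and it splits only on plain spaces.

```python
def truncate_at_word(text, limit):
    """Return at most `limit` characters of `text` without splitting the last word."""
    if len(text) <= limit:
        return text
    cut = text[:limit]
    # If the next character is whitespace, the cut already ends on a whole word
    if text[limit].isspace():
        return cut.rstrip()
    # Otherwise drop the partial trailing word (everything after the last space)
    head, sep, _ = cut.rpartition(" ")
    # With no space at all there is nothing to back up to, so keep the hard cut
    return head.rstrip() if sep else cut


print(truncate_at_word("The quick brown fox", 12))
```

It could be applied with `df['article'].apply(lambda x: truncate_at_word(x, 500))` in place of the plain slice; whether this changes the classification results has not been tested here.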

Draw a random sample of 1000 articles (with a fixed seed so the sample is reproducible) to evaluate with zero-shot classification, then convert each truncated column to a list:

In [25]:
df_sample = df.sample(n=1000, random_state=42)

articles_trunc500 = df_sample['article_trunc500'].values.tolist()
articles_trunc1000 = df_sample['article_trunc1000'].values.tolist()

## Zero-Shot Classification

Use Hugging Face Transformers [pipeline](https://huggingface.co/docs/transformers/v4.24.0/en/main_classes/pipelines):

In [28]:
%%time

from transformers import pipeline
classifier = pipeline("zero-shot-classification")
sections = ["technology", "healthcare", "movies", "business", "sports"]

results_trunc500 = classifier(articles_trunc500, candidate_labels=sections)
results_trunc1000 = classifier(articles_trunc1000, candidate_labels=sections)

No model was supplied, defaulted to facebook/bart-large-mnli and revision c626438 (https://huggingface.co/facebook/bart-large-mnli).
Using a pipeline without specifying a model name and revision in production is not recommended.


CPU times: user 7h 56min 18s, sys: 13min 59s, total: 8h 10min 18s
Wall time: 1h 1min 48s
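
Some background on the runtime above: the zero-shot pipeline frames classification as natural language inference. Each text is paired with one hypothesis per candidate label, and the model scores every pair, so the number of model inputs grows as texts × labels. The sketch below only builds those pairs in plain Python, with no model involved. The template string is the pipeline's default `hypothesis_template`; `build_nli_pairs` and the sample texts are hypothetical and exist only for illustration.

```python
sections = ["technology", "healthcare", "movies", "business", "sports"]

# Default hypothesis template used by the zero-shot-classification pipeline
HYPOTHESIS_TEMPLATE = "This example is {}."


def build_nli_pairs(texts, candidate_labels, template=HYPOTHESIS_TEMPLATE):
    """Return one (premise, hypothesis) pair per text/label combination."""
    return [
        (text, template.format(label))
        for text in texts
        for label in candidate_labels
    ]


# Hypothetical stand-ins for two truncated articles
sample_texts = [
    "Shares rose after the earnings call.",
    "The striker scored twice in the final.",
]
pairs = build_nli_pairs(sample_texts, sections)
print(len(pairs), pairs[0])

# Rough count of model inputs for the run above:
# 1000 articles x 5 labels x 2 truncation lengths
n_inputs = 1000 * len(sections) * 2
print(n_inputs)
```

By this count the two calls above score about 10,000 premise/hypothesis pairs with a large model, which is consistent with an hour of wall time on CPU. Fewer candidate labels or fewer articles would reduce that roughly in proportion.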


Convert classifier results to predictions:

In [44]:
# Walk both result lists in parallel (one result dict per article)
pred_trun500, pred_trun1000 = [], []
for d500, d1000 in zip(results_trunc500, results_trunc1000):
    # 'labels' is sorted by descending score, so index 0 is the top prediction
    pred_trun500.append(d500['labels'][0])
    pred_trun1000.append(d1000['labels'][0])
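
For reference, each element of the classifier output is a dict with `sequence`, `labels`, and `scores` keys, where `labels` and `scores` are ordered from most to least likely. The sketch below uses a made-up result (the text and score values are hypothetical, not taken from this run) to show why taking index 0 yields the predicted label, and how the score could be kept alongside it.

```python
# Hypothetical result for a single article, shaped like the pipeline's output
example_result = {
    "sequence": "The team won the championship game last night.",
    "labels": ["sports", "business", "movies", "technology", "healthcare"],
    "scores": [0.91, 0.04, 0.02, 0.02, 0.01],
}


def top_label(result):
    """Return the highest-scoring label, relying on the descending sort order."""
    return result["labels"][0]


def top_label_with_score(result):
    """Return (label, score) for the highest score without assuming any ordering."""
    best = max(range(len(result["scores"])), key=lambda i: result["scores"][i])
    return result["labels"][best], result["scores"][best]


print(top_label(example_result))
print(top_label_with_score(example_result))
```

Keeping the score as well makes it possible to inspect low-confidence predictions later, which the label-only lists above do not allow.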

In [56]:
df_sample['pred_trun500'] = pred_trun500
df_sample['pred_trun1000'] = pred_trun1000

df_sample.sample(3)

Unnamed: 0,date,year,month,day,title,article,section,publication,article_trunc500,article_trunc1000,pred_trun500,pred_trun1000,section_num
21589,2018-10-16 01:30:00,2018,10.0,16,Chris Evans Slams Piers Morgan for Dad-Shaming...,There will be no-dad shaming on Chris Evans’ w...,movies,People,There will be no-dad shaming on Chris Evans’ w...,There will be no-dad shaming on Chris Evans’ w...,movies,business,2
11448,2017-10-01 00:00:00,2017,10.0,1,Three Miami Dolphins players defy Trump order ...,LONDON (Reuters) - Three Miami Dolphins player...,sports,Reuters,LONDON (Reuters) - Three Miami Dolphins player...,LONDON (Reuters) - Three Miami Dolphins player...,sports,sports,4
10422,2018-11-19 15:35:00,2018,11.0,19,Late Night Twitter Sessions Are Bad for NBA Sh...,"Sometimes that thing you think is bad for you,...",sports,Vice,"Sometimes that thing you think is bad for you,...","Sometimes that thing you think is bad for you,...",sports,sports,4


## Results

Convert `section` labels to integers:

In [57]:
mapping_dict = {"technology": 0, "healthcare": 1, "movies": 2, "business": 3, "sports": 4}

df_sample['section_num'] = df_sample['section'].map(mapping_dict)
df_sample['pred_trun500'] = df_sample['pred_trun500'].map(mapping_dict)
df_sample['pred_trun1000'] = df_sample['pred_trun1000'].map(mapping_dict)

df_sample.sample(3)

Unnamed: 0,date,year,month,day,title,article,section,publication,article_trunc500,article_trunc1000,pred_trun500,pred_trun1000,section_num
3169,2016-07-12 20:07:00,2016,7.0,12,Watch Lightning Strike Inside the Carolina Pan...,"On Monday evening, a rain storm hit Charlotte ...",sports,Vice,"On Monday evening, a rain storm hit Charlotte ...","On Monday evening, a rain storm hit Charlotte ...",4,4,4
18746,2017-03-06 00:00:00,2017,3.0,6,The Moar is the electric bike Michael Bay woul...,The Moar is a new bike launching through an In...,technology,The Verge,The Moar is a new bike launching through an In...,The Moar is a new bike launching through an In...,4,0,0
20198,2019-04-04 15:44:00,2019,4.0,4,Jeff Bezos's Ex-Wife MacKenzie Lands $36 Billi...,Jeff Bezos and his ex-wife MacKenzie have fina...,movies,People,Jeff Bezos and his ex-wife MacKenzie have fina...,Jeff Bezos and his ex-wife MacKenzie have fina...,3,3,2


Compute per-class precision, recall, and f1-score for both truncation lengths:

In [63]:
from sklearn.metrics import classification_report

print('\narticles truncated to 500 chars:\n', classification_report(df_sample['section_num'].values, df_sample['pred_trun500'].values, target_names=sections))
print('\narticles truncated to 1000 chars:\n', classification_report(df_sample['section_num'].values, df_sample['pred_trun1000'].values, target_names=sections))


articles truncated to 500 chars:
               precision    recall  f1-score   support

  technology       0.64      0.75      0.69       190
  healthcare       0.81      0.33      0.46       193
      movies       0.91      0.48      0.63       199
    business       0.44      0.87      0.59       197
      sports       0.91      0.85      0.88       221

    accuracy                           0.66      1000
   macro avg       0.74      0.66      0.65      1000
weighted avg       0.75      0.66      0.66      1000


articles truncated to 1000 chars:
               precision    recall  f1-score   support

  technology       0.67      0.74      0.70       190
  healthcare       0.78      0.37      0.50       193
      movies       0.92      0.43      0.59       199
    business       0.44      0.88      0.58       197
      sports       0.92      0.86      0.89       221

    accuracy                           0.66      1000
   macro avg       0.75      0.66      0.65      1000
weight
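
The summary rows in these reports can be approximately reproduced from the per-class rows: `macro avg` is the plain mean of the per-class scores, and `weighted avg` weights each class by its support. The sketch below does this for the f1-score column using the values printed above. The helper names are hypothetical, and because the inputs are already rounded to two decimals the results are approximations, not the exact values scikit-learn computed.

```python
# Per-class f1-score and support, copied from the two reports above
f1_500 = {"technology": 0.69, "healthcare": 0.46, "movies": 0.63,
          "business": 0.59, "sports": 0.88}
f1_1000 = {"technology": 0.70, "healthcare": 0.50, "movies": 0.59,
           "business": 0.58, "sports": 0.89}
support = {"technology": 190, "healthcare": 193, "movies": 199,
           "business": 197, "sports": 221}


def macro_avg(scores):
    """Unweighted mean over classes."""
    return sum(scores.values()) / len(scores)


def weighted_avg(scores, support):
    """Mean over classes, weighted by the number of true examples per class."""
    total = sum(support.values())
    return sum(scores[label] * support[label] for label in scores) / total


for name, scores in [("500 chars", f1_500), ("1000 chars", f1_1000)]:
    print(name,
          "macro:", round(macro_avg(scores), 2),
          "weighted:", round(weighted_avg(scores, support), 2))
```

For both truncation lengths this gives a macro avg f1-score of about 0.65 and a weighted avg f1-score of about 0.66, matching the rows that are visible in the reports.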

Interesting to note:
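- Not much of a difference in the weighted avg f1-score between articles truncated at 500 and 1000 characters; accuracy (0.66) and macro avg f1-score (0.65) are identical across the two runs
- A weighted avg f1-score of 0.66 is much worse than Curtis's baseline result of 0.95 using logistic regression!
- `sports` scores well above the other categories (f1-score of 0.88 to 0.89, versus 0.46 to 0.70 for the rest)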
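
The per-class rows also hint at where the errors go. Since recall = TP / support and precision = TP / predicted, the number of articles assigned to each label can be estimated as recall × support / precision. The sketch below applies this to the 500-character report. `estimated_predicted` is a hypothetical helper, and the estimates are rough because the report values are rounded to two decimals.

```python
# (precision, recall, support) per class, copied from the 500-character report
report_500 = {
    "technology": (0.64, 0.75, 190),
    "healthcare": (0.81, 0.33, 193),
    "movies": (0.91, 0.48, 199),
    "business": (0.44, 0.87, 197),
    "sports": (0.91, 0.85, 221),
}


def estimated_predicted(precision, recall, support):
    """Estimate how many items were assigned to a label.

    true positives ~= recall * support, and predicted ~= true positives / precision.
    """
    return recall * support / precision


estimates = {
    label: estimated_predicted(p, r, n) for label, (p, r, n) in report_500.items()
}

for label, count in estimates.items():
    true_count = report_500[label][2]
    print(f"{label:<12} predicted ~{count:.0f}  actual {true_count}")

# Sanity check: the estimates should add up to roughly the 1000 sampled articles
print("total ~", round(sum(estimates.values())))
```

Under these estimates roughly 390 of the 1000 articles are labeled `business`, about twice its true support of 197, while `healthcare` and `movies` are predicted far less often than they occur. That pattern would account for the low `business` precision (0.44) alongside the low recall for `healthcare` (0.33) and `movies` (0.48).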