In [40]:
from embedders.classification.contextual import TransformerSentenceEmbedder
import pandas as pd
import numpy as np
import json

# Load Model and Raw Data
For the Model we use the embedders library which make embedding generation really easy to use. (alternatively you can for example use "sentence-transformers")

We are also using the **kern export format** here, which is a simple json that can be read from pandas directly. 

If you're using a csv from an Excel export, just modify this code here!

In [41]:
embedder = TransformerSentenceEmbedder("distilbert-base-cased")

Some weights of the model checkpoint at C:\Users\Moe/.cache\torch\sentence_transformers\distilbert-base-cased were not used when initializing DistilBertModel: ['vocab_projector.weight', 'vocab_layer_norm.bias', 'vocab_transform.weight', 'vocab_projector.bias', 'vocab_layer_norm.weight', 'vocab_transform.bias']
- This IS expected if you are initializing DistilBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [42]:
path = "labeled_data_v1.json"
with open(path, "r") as f:
    data = json.load(f)
    
df = pd.DataFrame(data)

In [43]:
df.head()

Unnamed: 0,newsletter,date,headline,body,__Interesting__MANUAL,__Interesting__WEAK_SUPERVISION,__Interesting__WEAK_SUPERVISION__confidence,__Topic__MANUAL,__Topic__WEAK_SUPERVISION,__Topic__WEAK_SUPERVISION__confidence
0,Box of Amazing,2022-05-08 06:01:12+00:00,Refind – Get smarter every day,Every day we pick 7 links from around the web ...,,,,,,
1,Box of Amazing,2022-05-08 06:01:12+00:00,"Fast, Cheap, and Out of Control: Inside Shein’...","Last fall, in the stagnation of pandemic life,...",,,,,,
2,Box of Amazing,2022-05-08 06:01:12+00:00,"Pity the Billionaire, Marc Andreessen Edition",Silicon Valley’s oligarch class can’t stop fee...,,,,,,
3,Box of Amazing,2022-05-08 06:01:12+00:00,Inside Elon Musk’s Big Plans for Twitter,Here’s what Mr. Musk is projecting for Twitter...,no,,,social media,,
4,Box of Amazing,2022-05-08 06:01:12+00:00,A visit to the human factory,"Will Jackson, CEO of robotics company Engineer...",,,,,,


In [44]:
# to get all the context information we have to merge headline and body
df["merged_texts"] = df["headline"] + ". "+ df["body"]

# Embedd the texts

In [45]:
embeddings = np.array(embedder.transform(df["merged_texts"].values.tolist()))

Initializing model, might take some time...


Encoding batches ...: 100%|██████████| 5/5 [00:25<00:00,  5.16s/it]


In [46]:
# you have the option to save the embeddings so you don't need to re-calculate them
np.save("embeddings", embeddings)

# Recommendations using Vector Calculations

In [47]:
from scipy.spatial.distance import cdist

In [48]:
# average the interesting vector
interesting_idxs = df[df["__Interesting__MANUAL"] == "yes"].index
interesting_vector_avg = embeddings[interesting_idxs].mean(axis=0)

In [49]:
# calculate the distances to the unlabeled data
non_labeled_idxs = df[df["__Interesting__MANUAL"].isnull()].index
dist_to_unlabeled = cdist(interesting_vector_avg.reshape(1,-1), embeddings[non_labeled_idxs], metric="cosine")[0]

In [50]:
# sort the indices ascending
sorted_unlabeled_idxs = dist_to_unlabeled.argsort()

# translate them back to the original dataframe
sorted_original_idxs = non_labeled_idxs[sorted_unlabeled_idxs]

In [51]:
top_10_recommendations = df.loc[sorted_original_idxs[0:10]]
top_10_recommendations.head()

Unnamed: 0,newsletter,date,headline,body,__Interesting__MANUAL,__Interesting__WEAK_SUPERVISION,__Interesting__WEAK_SUPERVISION__confidence,__Topic__MANUAL,__Topic__WEAK_SUPERVISION,__Topic__WEAK_SUPERVISION__confidence,merged_texts
98,datascienceweekly,2022-04-21 22:34:45+00:00,Real World Recommendation System - Part 1,Training a collaborative filtering based recom...,,,,,,,Real World Recommendation System - Part 1. Tra...
422,TLDR,2022-05-04 10:19:09+00:00,Meta has built a massive new language AI—and i...,Meta's AI lab has created a new language model...,,,,,,,Meta has built a massive new language AI—and i...
244,datascienceweekly,2022-03-03 23:50:48+00:00,Weaviate Podcast #9: Karen Beckers about the r...,"Karen Beckers, Data Scientist from Squadra Mac...",,,,,,,Weaviate Podcast #9: Karen Beckers about the r...
203,datascienceweekly,2022-03-17 22:52:02+00:00,Building systems to securely reason over priva...,People today rely on AI systems such as assist...,,,,,,,Building systems to securely reason over priva...
82,datascienceweekly,2022-04-28 23:21:02+00:00,Data Science at Stitch Fix,"Podcast Interview with Olivia Liao, Senior Dir...",,,,,,,Data Science at Stitch Fix. Podcast Interview ...


# Predict Topics covered by the article
There are some problems when it comes to classifying topics in this setting. The most dominant one is that we don't know if the topics we selected are even all the topics that exist, for that we'd have to label every datapoint and also make sure that in the future there are no topics coming up that weren't in the training. Second, this data is rather unbalanced. We will not deal with these problems here and continue with our baseline usecase.

Instead we will choose the topics that we want to have classified and which have enough support. We then introduce a "catch-all" class, where we map all other labels to.

We will split the data into train and test set, train the model, and then evaluate it very quickly. We will not go into too much detail of the whole pipeline (it takes companies months to make sense of their data and models!) as this is not the aim of this workshop.

## Prepare Train and Test split

In [52]:
from sklearn.model_selection import train_test_split

In [53]:
# get all the labeled instances
labeled_idxs = df[~df["__Topic__MANUAL"].isnull()].index.tolist()

In [54]:
# look what labels have enough support
df.loc[labeled_idxs]["__Topic__MANUAL"].value_counts()

big tech                 27
research  and science    20
library/code             14
social media              9
jobs                      9
data science              9
ai art                    6
programming               6
mobile                    5
advice                    4
society                   2
event                     1
Name: __Topic__MANUAL, dtype: int64

In [55]:
# choose the labels that have enough support
topics = ["big tech", "research  and science", "library/code", "social media"]

In [56]:
labeled_df = df.loc[labeled_idxs]
labeled_df = labeled_df[labeled_df["__Topic__MANUAL"].isin(topics)]

In [57]:
train_idx, test_idx = train_test_split(labeled_df.index.tolist(), test_size = 0.2)

In [58]:
X_train = embeddings[train_idx]
X_test = embeddings[test_idx]
y_train = df.loc[train_idx]["__Topic__MANUAL"]
y_test = df.loc[test_idx]["__Topic__MANUAL"]

## Train and evaluate a classifier
We can simulate the classification layer of a typical BERT pipeline with a LogisticRegression sklearn model.

In [59]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

In [60]:
clf = LogisticRegression().fit(X_train, y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


In [61]:
print(classification_report(y_test, clf.predict(X_test)))

                       precision    recall  f1-score   support

             big tech       0.80      1.00      0.89         4
         library/code       1.00      1.00      1.00         3
research  and science       1.00      0.80      0.89         5
         social media       1.00      1.00      1.00         2

             accuracy                           0.93        14
            macro avg       0.95      0.95      0.94        14
         weighted avg       0.94      0.93      0.93        14



## Predict the topics

In [62]:
clf = LogisticRegression().fit(embeddings[labeled_df.index], labeled_df["__Topic__MANUAL"])

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


In [75]:
# get the unlabeled data
unlabeled_df = df.drop(labeled_df.index)["__Topic__MANUAL"]
unlabeled_idxs = unlabeled_df.index

X = embeddings[unlabeled_idxs]

In [76]:
probs = clf.predict_proba(X)

In [90]:
pred_idx, pred_class = np.where(probs > 0.75)
pred_class_text = list(map(lambda x: clf.classes_[x],pred_class))

In [106]:
df["topic"] = "Unknown"
df.loc[labeled_df.index,"topic"] = df.loc[labeled_df.index]["__Topic__MANUAL"]
df.loc[pred_idx,"topic"] = pred_class_text

In [107]:
df["topic"].value_counts()

Unknown                  314
research  and science     91
big tech                  83
library/code              57
social media               8
Name: topic, dtype: int64

In [111]:
# save the data with predicted topics to disk
df[['newsletter', 'date', 'headline', 'body', '__Interesting__MANUAL', 'merged_texts', 'topic']].to_csv("output.csv", index=False, quoting=1)