In [1]:
%load_ext lab_black
%load_ext autotime
import pandas as pd
import numpy as np

time: 502 ms (started: 2023-05-28 21:30:38 -07:00)


Based on a [notebook](https://github.com/dkobak/iclr-tsne/blob/main/iclr-tsne.ipynb) by [Dmitry Kobak](https://github.com/dkobak) (I originally found it via [a tweet](https://twitter.com/hippopedoid/status/1575879260216373249)). The data is a TF-IDF representation of ICLR submissions (title plus abstract) from 2018 to 2023.

In [2]:
import requests


def download_iclr():
    """Fetch abstracts, titles and years of ICLR 2018-2023 submissions."""
    titles = []
    abstracts = []
    years = []

    for year in [2018, 2019, 2020, 2021, 2022, 2023]:
        for query in [
            "Blind_Submission",
            "Withdrawn_Submission",
            "Desk_Rejected_Submission",
        ]:
            url = f"https://api.openreview.net/notes?invitation=ICLR.cc%2F{year}%2FConference%2F-%2F{query}"
            for offset in [0, 1000, 2000, 3000, 4000]:
                df = pd.DataFrame(
                    requests.get(url + f"&offset={offset}").json()["notes"]
                )
                if len(df) > 0:
                    titles += [d["title"].strip() for d in df["content"].values]
                    abstracts += [d["abstract"].strip() for d in df["content"].values]
                    years += [year] * len(df)

    return np.array(abstracts), np.array(titles), np.array(years)

time: 91.5 ms (started: 2023-05-28 21:30:38 -07:00)
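
The fixed offset list in `download_iclr` covers at most five pages per query. A minimal sketch of an open-ended alternative follows, assuming (as the `len(df) > 0` check above already does) that an offset past the last result yields an empty `notes` list. `fetch_page` is a hypothetical stand-in for the `requests.get(...).json()["notes"]` call, so the loop can be tried without network access; `max_pages` is a safety cap I made up, not an API parameter.

```python
def fetch_all(fetch_page, page_size=1000, max_pages=100):
    """Collect notes page by page until an empty page comes back.

    fetch_page is a hypothetical callable: offset -> list of notes.
    """
    notes = []
    for page in range(max_pages):
        batch = fetch_page(page * page_size)
        if not batch:
            # An empty page means we have walked past the last result
            break
        notes.extend(batch)
    return notes


# Fake endpoint holding 2500 notes, served in pages of at most 1000
fake_notes = [
    {"content": {"title": f"t{i}", "abstract": f"a{i}"}} for i in range(2500)
]


def fake_fetch(offset):
    return fake_notes[offset : offset + 1000]


all_notes = fetch_all(fake_fetch)
print(len(all_notes))
```

For the real download, `fetch_page` would wrap the URL built above; whether that is worth it depends on how many submissions a single query can return.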


In [3]:
abstracts, titles, years = download_iclr()

time: 59.4 s (started: 2023-05-28 21:30:41 -07:00)


In [4]:
len(titles), len(abstracts)

(16577, 16577)

time: 7.01 ms (started: 2023-05-28 21:31:41 -07:00)


In [5]:
# Keep only submissions whose abstract has at least 200 characters
mask = np.array([len(a) >= 200 for a in abstracts])
docs_to_keep = np.where(mask)

time: 142 ms (started: 2023-05-28 21:31:41 -07:00)


In [6]:
abstracts = abstracts[docs_to_keep]
titles = titles[docs_to_keep]
years = years[docs_to_keep]
len(titles)

16554

time: 117 ms (started: 2023-05-28 21:31:41 -07:00)


In [7]:
text = np.empty_like(titles, dtype=object)
for i in range(len(titles)):
    text[i] = titles[i] + " " + abstracts[i]
text[:3]

array(["Predicting Floor-Level for 911 Calls with Neural Networks and Smartphone Sensor Data In cities with tall buildings, emergency responders need an accurate floor level location to find 911 callers quickly. We introduce a system to estimate a victim's floor level via their mobile device's sensor data in a two-step process. First, we train a neural network to determine when a smartphone enters or exits a building via GPS signal changes. Second, we use a barometer equipped smartphone to measure the change in barometric pressure from the entrance of the building to the victim's indoor location. Unlike impractical previous approaches, our system is the first that does not require the use of beacons, prior knowledge of the building infrastructure, or knowledge of user behavior. We demonstrate real-world feasibility through 63 experiments across five different tall buildings throughout New York City where our system predicted the correct floor level with 100% accuracy.",
       "Some Co

time: 156 ms (started: 2023-05-28 21:31:41 -07:00)


In [11]:
import sklearn.feature_extraction.text

iclr_l2s = sklearn.feature_extraction.text.TfidfVectorizer(
    norm="l2", sublinear_tf=True
).fit_transform(text)

time: 2.02 s (started: 2023-05-28 21:37:47 -07:00)
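
To make the two vectorizer options concrete, here is a rough pure-Python sketch of the weighting as I understand it: `sublinear_tf=True` replaces a raw count `tf` with `1 + ln(tf)`, the default smoothed IDF is `ln((1 + n) / (1 + df)) + 1`, and `norm="l2"` scales each document vector to unit length. The whitespace tokenizer and the `tfidf_l2` helper are simplifications of mine, so the numbers will not match `TfidfVectorizer` on real text.

```python
import math
from collections import Counter


def tfidf_l2(docs):
    """Sublinear TF times smoothed IDF, L2-normalized per document."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term
    df = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1.0 for t, c in df.items()}
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights = {t: (1.0 + math.log(c)) * idf[t] for t, c in counts.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        rows.append({t: w / norm for t, w in weights.items()})
    return rows


toy_docs = [
    "graph neural networks",
    "neural networks for language",
    "neural language models",
]
rows = tfidf_l2(toy_docs)
print(rows[0])
```

In the first toy document, "graph" (which appears in one document) ends up with a larger weight than "neural" (which appears in all three), which is the behavior TF-IDF is meant to give.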


We'll use 100 components here, as Dmitry does in his notebook. My own permutation-based tests suggest that up to about 150 components would also be a reasonable choice, but not many more than that.
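
For readers unfamiliar with permutation tests for choosing the number of components, here is one common variant as a toy sketch. It is not necessarily the procedure behind the numbers quoted above: each column is shuffled independently to destroy correlations between features, and a component is kept while its singular value exceeds what the shuffled data produces. The function name, the 95% quantile and the dense toy matrix are all assumptions of mine; running this on the full sparse TF-IDF matrix would need a truncated solver instead of `np.linalg.svd`.

```python
import numpy as np


def n_components_by_permutation(X, n_permutations=20, quantile=0.95, seed=0):
    """Count leading singular values that exceed a column-permutation null."""
    rng = np.random.default_rng(seed)
    observed = np.linalg.svd(X, compute_uv=False)
    null = np.empty((n_permutations, observed.size))
    for p in range(n_permutations):
        # Shuffle every column independently: column norms are kept,
        # correlations between columns are destroyed
        shuffled = np.column_stack([rng.permutation(col) for col in X.T])
        null[p] = np.linalg.svd(shuffled, compute_uv=False)
    threshold = np.quantile(null, quantile, axis=0)
    above = observed > threshold
    # Keep components only up to the first one that fails the test
    return int(above.size) if above.all() else int(np.argmin(above))


# Toy data: three planted components plus a little noise
rng = np.random.default_rng(42)
scores = rng.normal(size=(500, 3))
loadings = rng.choice([-1.0, 1.0], size=(3, 30))
X = scores @ loadings + 0.1 * rng.normal(size=(500, 30))

n_keep = n_components_by_permutation(X)
print(n_keep)
```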

In [15]:
import sklearn.decomposition

tsvd = sklearn.decomposition.TruncatedSVD(n_components=100, algorithm="arpack").fit(
    iclr_l2s
)

time: 5.05 s (started: 2023-05-28 21:38:40 -07:00)


In [16]:
np.sum(tsvd.explained_variance_ratio_)

0.13176556719987437

time: 3.98 ms (started: 2023-05-28 21:38:45 -07:00)


The 100 components explain only about 13% of the variance.
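
As a reminder of what this number measures: as far as I understand it, `TruncatedSVD` reports the variance of each projected column divided by the total variance of the original features, and it does not center the data first. The NumPy sketch below reproduces that calculation on a small dense toy matrix; `explained_variance_ratio` is a helper written for this illustration, not a scikit-learn function.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((200, 20))  # toy non-negative data, standing in for TF-IDF

# Right singular vectors of the uncentered matrix
_, _, Vt = np.linalg.svd(X, full_matrices=False)


def explained_variance_ratio(X, Vt, k):
    """Variance of the first k projected columns over total feature variance."""
    projected = X @ Vt[:k].T
    return projected.var(axis=0) / X.var(axis=0).sum()


ratio_5 = explained_variance_ratio(X, Vt, 5)
ratio_all = explained_variance_ratio(X, Vt, X.shape[1])
print(ratio_5.sum(), ratio_all.sum())
```

With every component retained the ratios sum to one, because an orthogonal rotation preserves total variance; with fewer components the sum is the fraction kept.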

In [17]:
data = tsvd.transform(iclr_l2s)

time: 196 ms (started: 2023-05-28 21:38:56 -07:00)


In [18]:
keywords = [
    "network",
    "graph",
    "reinforcement",
    "language",
    "adversarial",
    "federated",
    "contrastive",
    "domain",
    "diffusion",
    "out-of-dis",
    "continual",
    "distillation",
    "architecture",
    "privacy",
    "protein",
    "fair",
    "attention",
    "video",
    "meta-learning",
    "generative adv",
    "autoencoder",
    "game",
    "semi-sup",
    "pruning",
    "physics",
    "3d",
    "translation",
    "optimization",
    "recurrent",
    "word",
    "bayesian",
]

time: 7.17 ms (started: 2023-05-28 21:39:15 -07:00)


In [19]:
# Most frequent words in the titles (at least 5 letters)

words, counts = np.unique(" ".join(titles).split(), return_counts=True)
ind = np.argsort(counts)[::-1][:50]
for i in ind:
    if len(words[i]) >= 5:
        print(f"{words[i]:20} {counts[i]:4}")

Learning             4545
Neural               2067
Networks             1668
Reinforcement         868
Adversarial           803
Graph                 797
Models                784
Training              635
Network               561
Representation        534
Model                 481
Optimization          465
Efficient             460
Generative            436
Language              407
Representations       399
Robust                370
Image                 348
Generalization        347
using                 343
Gradient              338
Towards               338
Unsupervised          334
learning              329
Federated             321
Detection             312
Generation            309
Classification        302
Robustness            288
Policy                274
Adaptive              270
Contrastive           269
time: 90.9 ms (started: 2023-05-28 21:39:16 -07:00)


Some processing of the text is necessary. To do this properly, consider a library such as [nltk](https://www.nltk.org/) for more robust text handling (see for example this [article](https://towardsdatascience.com/tf-idf-for-document-ranking-from-scratch-in-python-on-real-world-dataset-796d339a4089)), but I am just going to lower-case everything and replace some symbols with spaces. This seems good enough for this dataset.

In [20]:
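def remove_symbols(text):
    # Replace punctuation and newlines with spaces. The backslash is escaped
    # explicitly so the literal contains no invalid escape sequence.
    symbols = '!"#$%&()*+-,./:;<=>?@[\\]^_`{|}~\n'
    for symbol in symbols:
        text = np.char.replace(text, symbol, " ")
    return text.tolist()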


processed_titles = [remove_symbols(title.lower()) for title in titles]

time: 2.15 s (started: 2023-05-28 21:39:19 -07:00)
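
The same cleaning can also be done with the standard library alone. The sketch below uses `str.translate` with the same symbol set; `clean_title`, `SYMBOLS` and `TABLE` are names I made up for the illustration, and I have not timed it against the `np.char.replace` loop on this dataset.

```python
SYMBOLS = '!"#$%&()*+-,./:;<=>?@[\\]^_`{|}~\n'
# Translation table mapping every symbol to a single space
TABLE = str.maketrans(SYMBOLS, " " * len(SYMBOLS))


def clean_title(title):
    """Lower-case a title and replace each symbol with a space."""
    return title.lower().translate(TABLE)


cleaned = clean_title(
    "Predicting Floor-Level for 911 Calls with Neural Networks and Smartphone Sensor Data"
)
print(cleaned)
```

Because each symbol becomes one space, runs of symbols leave runs of spaces, which is harmless here since the titles are later split on whitespace.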


In [21]:
titles[:5], processed_titles[:5]

(array(['Predicting Floor-Level for 911 Calls with Neural Networks and Smartphone Sensor Data',
        'Some Considerations on Learning to Explore via Meta-Reinforcement Learning',
        'MACH: Embarrassingly parallel $K$-class classification in $O(d\\log{K})$ memory and $O(K\\log{K} + d\\log{K})$ time, instead of $O(Kd)$',
        'Deterministic Policy Imitation Gradient Algorithm',
        'Searching for Activation Functions'], dtype='<U176'),
 ['predicting floor level for 911 calls with neural networks and smartphone sensor data',
  'some considerations on learning to explore via meta reinforcement learning',
  'mach  embarrassingly parallel  k  class classification in  o d log k    memory and  o k log k    d log k    time  instead of  o kd  ',
  'deterministic policy imitation gradient algorithm',
  'searching for activation functions'])

time: 4.51 ms (started: 2023-05-28 21:39:21 -07:00)


In [22]:
words, counts = np.unique(" ".join(processed_titles).split(), return_counts=True)
ind = np.argsort(counts)[::-1]
for i in ind:
    if len(words[i]) >= 5:
        if words[i] in keywords:
            print(f"{words[i]:20} {counts[i]:4}")

reinforcement         946
graph                 934
adversarial           902
network               728
optimization          538
language              525
domain                385
attention             334
federated             334
contrastive           305
architecture          210
continual             201
bayesian              175
distillation          174
recurrent             173
video                 172
translation           167
pruning               157
diffusion             152
privacy               107
physics                78
autoencoder            77
protein                66
time: 118 ms (started: 2023-05-28 21:39:34 -07:00)


As a rough way to see which titles we are missing due to the crude processing, here is the same procedure, but allowing through any word that begins with one of the keywords, so we pick up plurals and compound adjectives:

In [23]:
words, counts = np.unique(" ".join(processed_titles).split(), return_counts=True)
ind = np.argsort(counts)[::-1]
for i in ind:
    if len(words[i]) >= 5:
        for k in keywords:
            if words[i].startswith(k):
                print(f"{words[i]:20} {counts[i]:4}")
                break

networks             2059
reinforcement         946
graph                 934
adversarial           902
network               728
optimization          538
language              525
domain                385
attention             334
federated             334
contrastive           305
architecture          210
continual             201
bayesian              175
distillation          174
recurrent             173
video                 172
translation           167
pruning               157
diffusion             152
graphs                144
autoencoders          138
privacy               107
games                  89
physics                78
autoencoder            77
fairness               76
protein                66
architectures          60
adversarially          46
domains                40
videos                 29
languages              16
words                   9
attentional             7
graphical               7
graphics                6
networked               5
translations

In [60]:
ordered_keywords = []
words, counts = np.unique(" ".join(processed_titles).split(), return_counts=True)
ind = np.argsort(counts)[::-1]
for i in ind:
    if len(words[i]) >= 5:
        if words[i] in keywords:
            ordered_keywords.append(words[i])
ordered_keywords

['reinforcement',
 'graph',
 'adversarial',
 'network',
 'optimization',
 'language',
 'domain',
 'attention',
 'federated',
 'contrastive',
 'architecture',
 'continual',
 'bayesian',
 'distillation',
 'recurrent',
 'video',
 'translation',
 'pruning',
 'diffusion',
 'privacy',
 'physics',
 'autoencoder',
 'protein']

time: 103 ms (started: 2023-05-25 22:11:04 -07:00)


In [24]:
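# Default every title to the index that "unknown" will occupy once appended below
levels = np.repeat(len(keywords), len(processed_titles))
for level_int, keyword in enumerate(keywords):
    # Substring match on the raw title; later keywords overwrite earlier ones
    ind = [i for i, t in enumerate(titles) if keyword.lower() in t.lower()]
    levels[ind] = level_int
# Note: this mutates `keywords`, so re-running the cell appends "unknown" again
keywords.append("unknown")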

time: 387 ms (started: 2023-05-28 21:40:14 -07:00)
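
Two properties of this labelling are easy to miss: the match is a case-insensitive substring test (so "network" also catches "Networks"), and when several keywords match a title, the one that comes last in `keywords` wins. A toy sketch of the same logic, with a hypothetical `assign_levels` helper and made-up titles:

```python
import numpy as np


def assign_levels(titles, keywords):
    """Label each title with the index of the last keyword it contains."""
    # Titles that match no keyword keep the extra index len(keywords)
    levels = np.full(len(titles), len(keywords))
    for level, keyword in enumerate(keywords):
        hits = np.array([keyword.lower() in t.lower() for t in titles], dtype=bool)
        levels[hits] = level
    return levels


toy_titles = ["Graph Neural Networks", "Deep Networks", "A Study of Cats"]
toy_levels = assign_levels(toy_titles, ["network", "graph"])
print(toy_levels)
```

Here the first title matches both keywords and ends up labelled "graph", which is why the "network" count further down is lower than the raw word counts might suggest.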


In [25]:
levels

array([ 0,  2, 31, ...,  1, 16,  0])

time: 4.72 ms (started: 2023-05-28 21:40:16 -07:00)


In [26]:
from drnb.util import codes_to_categories

description = codes_to_categories(levels, keywords, "description")
description

0              network
1        reinforcement
2              unknown
3              unknown
4              unknown
             ...      
16549           domain
16550          unknown
16551            graph
16552        attention
16553          network
Name: description, Length: 16554, dtype: category
Categories (32, object): ['3d', 'adversarial', 'architecture', 'attention', ..., 'translation', 'unknown', 'video', 'word']

time: 133 ms (started: 2023-05-28 21:40:20 -07:00)


In [27]:
description.value_counts()

unknown           7581
network           1602
graph              823
reinforcement      823
adversarial        651
optimization       506
language           414
domain             338
attention          320
federated          254
contrastive        247
architecture       232
autoencoder        196
3d                 191
continual          179
recurrent          171
bayesian           170
video              169
translation        160
semi-sup           157
distillation       150
pruning            150
diffusion          134
meta-learning      130
fair               126
generative adv     123
game               122
out-of-dis         116
word                99
privacy             95
physics             74
protein             51
Name: description, dtype: int64

time: 8.11 ms (started: 2023-05-28 21:40:24 -07:00)


## Pipeline

In [28]:
target = pd.concat(
    [
        pd.Series(years, name="year", dtype="category"),
        pd.Series(levels, name="class"),
        description,
    ],
    axis=1,
)
target

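| | year | class | description |
|---|---|---|---|
| 0 | 2018 | 0 | network |
| 1 | 2018 | 2 | reinforcement |
| 2 | 2018 | 31 | unknown |
| 3 | 2018 | 31 | unknown |
| 4 | 2018 | 31 | unknown |
| ... | ... | ... | ... |
| 16549 | 2023 | 7 | domain |
| 16550 | 2023 | 31 | unknown |
| 16551 | 2023 | 1 | graph |
| 16552 | 2023 | 16 | attention |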


time: 23.3 ms (started: 2023-05-28 21:40:35 -07:00)


`glasbey` generates the colors for the descriptions, except for the final `unknown` description, which will be grey:

In [32]:
import glasbey

colors = glasbey.create_palette(len(keywords) - 1) + ["#aaaaaa"]

time: 525 ms (started: 2023-05-28 21:41:41 -07:00)


In [33]:
palette = dict(
    description=dict(
        zip(
            keywords,
            colors,
        )
    )
)
palette

{'description': {'network': '#d21820',
  'graph': '#1869ff',
  'reinforcement': '#008a00',
  'language': '#f36dff',
  'adversarial': '#710079',
  'federated': '#aafb00',
  'contrastive': '#00bec2',
  'domain': '#ffa235',
  'diffusion': '#5d3d04',
  'out-of-dis': '#08008a',
  'continual': '#005d5d',
  'distillation': '#9a7d82',
  'architecture': '#a2aeff',
  'privacy': '#96b675',
  'protein': '#9e28ff',
  'fair': '#4d0014',
  'attention': '#ffaebe',
  'video': '#ce0092',
  'meta-learning': '#00ffb6',
  'generative adv': '#002d00',
  'autoencoder': '#9e7500',
  'game': '#3d3541',
  'semi-sup': '#f3eb92',
  'pruning': '#65618a',
  'physics': '#8a3d4d',
  '3d': '#5904ba',
  'translation': '#558a71',
  'optimization': '#b2bec2',
  'recurrent': '#ff5d82',
  'word': '#1cc600',
  'bayesian': '#92f7ff',
  'unknown': '#aaaaaa'}}

time: 11.2 ms (started: 2023-05-28 21:41:42 -07:00)


In [34]:
from drnb.io.pipeline import create_default_pipeline

_ = create_default_pipeline(check_for_duplicates=True, metric=["euclidean"]).run(
    "iclr",
    data=data,
    target=target,
    target_palette=palette,
    url="https://github.com/dkobak/iclr-tsne",
    verbose=True,
)

OMP: Info #276: omp_set_nested routine deprecated, please use omp_set_max_active_levels instead.


time: 25.8 s (started: 2023-05-28 21:42:32 -07:00)


## Renormalize

I also recommend renormalizing each row to unit L2 norm after the SVD, so let's save that as a separate dataset.

In [35]:
def renormalize_l2(data):
    return data / np.linalg.norm(data, axis=1)[:, np.newaxis]

time: 1.75 ms (started: 2023-05-28 21:43:45 -07:00)
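
One consequence of renormalizing is worth spelling out: for unit-length rows, the squared Euclidean distance is `2 - 2 * cos(a, b)`, so Euclidean nearest neighbors are the same as cosine nearest neighbors. The sketch below checks that identity on toy data. `renormalize_l2_safe` and its `eps` guard are hypothetical additions of mine that keep an all-zero row from turning into NaN; I have not checked whether such rows occur in this dataset.

```python
import numpy as np


def renormalize_l2_safe(data, eps=1e-12):
    """Scale rows to unit L2 norm, leaving all-zero rows as zeros."""
    norms = np.linalg.norm(data, axis=1, keepdims=True)
    return data / np.maximum(norms, eps)


rng = np.random.default_rng(0)
toy = rng.normal(size=(5, 4))
unit = renormalize_l2_safe(toy)

a, b = unit[0], unit[1]
squared_distance = np.sum((a - b) ** 2)
cosine_similarity = a @ b
print(squared_distance, 2 - 2 * cosine_similarity)
```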


In [36]:
_ = create_default_pipeline(check_for_duplicates=True, metric=["euclidean"]).run(
    "iclr-l2r",
    data=renormalize_l2(data),
    target=target,
    target_palette=palette,
    url="https://github.com/dkobak/iclr-tsne",
    verbose=True,
)

time: 22.3 s (started: 2023-05-28 21:43:48 -07:00)


To experiment with different ways of processing TF-IDF data in `renorm-prep.ipynb`, let's also save the text that was used for the TF-IDF analysis.

In [37]:
from drnb.io import write_pickle

_ = write_pickle(
    text,
    "iclr",
    suffix="text",
    verbose=False,
    compression="gzip",
    overwrite=True,
)

time: 1.24 s (started: 2023-05-28 22:04:44 -07:00)
