In [1]:
%load_ext lab_black
%load_ext autotime
import pandas as pd
import numpy as np

time: 357 ms (started: 2023-06-10 18:24:59 -07:00)


This notebook prepares the [20 Newsgroups dataset](http://qwone.com/~jason/20Newsgroups/). The treatment is based on the material in <https://umap-learn.readthedocs.io/en/latest/sparse.html>, which leaves the data sparse. Not all dimensionality reduction methods can handle sparse data, so here it is converted to a 2500-dimensional dense matrix via truncated SVD. **Warning**: the truncated SVD will cause this notebook to take up a fair bit of RAM (around 11GB).

In [2]:
import sklearn.datasets
import sklearn.feature_extraction.text

ng20v = sklearn.datasets.fetch_20newsgroups_vectorized(subset="all")
ng20tfidf = sklearn.feature_extraction.text.TfidfTransformer(norm="l1").fit_transform(
    ng20v.data
)

time: 1.51 s (started: 2023-06-10 18:25:00 -07:00)


In [3]:
ng20tfidf.shape

(18846, 130107)

time: 5.22 ms (started: 2023-06-10 18:25:02 -07:00)


In [4]:
ng20tfidf

<18846x130107 sparse matrix of type '<class 'numpy.float64'>'
	with 2895521 stored elements in Compressed Sparse Row format>

time: 4.13 ms (started: 2023-06-10 18:25:03 -07:00)
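To see why this matrix cannot simply be densified as-is, a quick back-of-the-envelope calculation helps. The sketch below uses only the shape and stored-element count printed above and assumes 8-byte `float64` values. It ignores the index arrays that the CSR format also stores.

```python
# Shape and stored-element count are taken from the outputs above.
n_rows, n_cols = 18846, 130107
n_stored = 2895521
bytes_per_float64 = 8  # assumption: values are stored as float64

# Fraction of entries that are explicitly stored in the sparse matrix.
density = n_stored / (n_rows * n_cols)

# Memory needed if the full matrix were stored densely.
dense_full_gb = n_rows * n_cols * bytes_per_float64 / 1e9

# Memory needed for a dense matrix with 2500 columns.
dense_reduced_gb = n_rows * 2500 * bytes_per_float64 / 1e9

print(f"density: {density:.4%}")
print(f"full dense matrix: {dense_full_gb:.1f} GB")
print(f"2500D dense matrix: {dense_reduced_gb:.2f} GB")
```

The full dense matrix would need tens of gigabytes, while the 2500-column version fits comfortably in memory.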


In [5]:
import sklearn.decomposition

time: 67.9 ms (started: 2023-06-10 18:25:05 -07:00)


Apart from eating up a fair bit of RAM, this next step is also pretty slow (around ten minutes on my machine). For a more accurate SVD, you probably want to set `algorithm="arpack"`, but that makes the SVD take a lot longer (around an hour on my machine), and getting a low(ish)-rank dense representation of the data matters more to me than getting the exact SVD.

Why did I choose 2500 components? I carried out a permutation test: I randomly shuffled the contents of each column, repeated the SVD for 10 different shuffles, and looked for the point at which the variance extracted from the unshuffled data fell below that of the shuffled versions. Yes, I should have used a lot more permutations to be sure about this, but I still think it's better than just picking an arbitrary number.
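The permutation test is not part of this notebook, but the idea can be sketched on a small scale. The following is a minimal illustration, not the code used to arrive at 2500: it runs a full dense SVD via NumPy on toy data instead of `TruncatedSVD` on the real matrix, and the helper names (`component_variances`, `shuffle_columns`, `permutation_n_components`) as well as the rule of comparing against the maximum over the shuffles are assumptions made for this example.

```python
import numpy as np


def component_variances(X, n_components):
    """Variance of the data projected onto each of the top singular vectors.

    X is not centered first, mirroring how TruncatedSVD treats its input.
    """
    U, s, _ = np.linalg.svd(X, full_matrices=False)
    projected = U[:, :n_components] * s[:n_components]
    return projected.var(axis=0)


def shuffle_columns(X, rng):
    """Permute each column independently, destroying between-column structure."""
    shuffled = X.copy()
    for j in range(shuffled.shape[1]):
        shuffled[:, j] = rng.permutation(shuffled[:, j])
    return shuffled


def permutation_n_components(X, max_components, n_shuffles=10, seed=0):
    """Count the leading components whose variance beats every shuffled run."""
    rng = np.random.default_rng(seed)
    real = component_variances(X, max_components)
    null = np.array(
        [
            component_variances(shuffle_columns(X, rng), max_components)
            for _ in range(n_shuffles)
        ]
    )
    # Assumed rule: stop at the first component that no longer exceeds
    # the largest variance seen at that position in any shuffle.
    threshold = null.max(axis=0)
    below = np.flatnonzero(real <= threshold)
    return int(below[0]) if below.size else max_components


# Toy data: three latent factors, each driving its own block of ten columns.
rng = np.random.default_rng(42)
factors = rng.normal(size=(200, 3))
loadings = np.zeros((3, 30))
for k in range(3):
    loadings[k, 10 * k : 10 * (k + 1)] = 1.0
X_toy = factors @ loadings + 0.1 * rng.normal(size=(200, 30))

n_keep = permutation_n_components(X_toy, max_components=10)
print(n_keep)
```

On the toy data the structure lives in three directions, so only the first three components should clear the shuffled baseline. On the real data each shuffle means another full SVD, which is why only 10 shuffles were affordable.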

In [6]:
tsvd = sklearn.decomposition.TruncatedSVD(n_components=2500).fit(ng20tfidf)

time: 7min 46s (started: 2023-06-10 18:25:09 -07:00)


How much variance do 2500 components explain?

In [8]:
np.sum(tsvd.explained_variance_ratio_)

0.6634874569068352

time: 7.46 ms (started: 2023-06-10 18:33:58 -07:00)


66%? Not terrible.
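If a particular fraction of variance were the goal instead, the cumulative sum of `explained_variance_ratio_` would show how many components are needed to reach it. A minimal sketch, where `components_for_variance` is a hypothetical helper and the toy ratios stand in for `tsvd.explained_variance_ratio_`:

```python
import numpy as np


def components_for_variance(explained_variance_ratio, target):
    """Smallest number of leading components whose ratios sum to >= target.

    Returns None if even all the components together fall short.
    """
    cumulative = np.cumsum(explained_variance_ratio)
    if cumulative[-1] < target:
        return None
    # searchsorted finds the first index where cumulative >= target.
    return int(np.searchsorted(cumulative, target)) + 1


# Toy ratios, made up for illustration only.
toy_ratios = np.array([0.5, 0.25, 0.125])
print(components_for_variance(toy_ratios, 0.6))
print(components_for_variance(toy_ratios, 0.9))
```

With the 2500 components fitted here, any target above roughly 0.66 would return `None`, meaning more components would have to be extracted.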

In [9]:
data = tsvd.transform(ng20tfidf)

time: 6.78 s (started: 2023-06-10 18:34:04 -07:00)


In [10]:
from drnb.io import write_npy

_ = write_npy(data, "ng20", verbose=True)

time: 778 ms (started: 2023-06-10 18:34:51 -07:00)


## Pipeline

First, prepare the target labels.

In [11]:
ng20v.target_names

['alt.atheism',
 'comp.graphics',
 'comp.os.ms-windows.misc',
 'comp.sys.ibm.pc.hardware',
 'comp.sys.mac.hardware',
 'comp.windows.x',
 'misc.forsale',
 'rec.autos',
 'rec.motorcycles',
 'rec.sport.baseball',
 'rec.sport.hockey',
 'sci.crypt',
 'sci.electronics',
 'sci.med',
 'sci.space',
 'soc.religion.christian',
 'talk.politics.guns',
 'talk.politics.mideast',
 'talk.politics.misc',
 'talk.religion.misc']

time: 4.64 ms (started: 2023-06-10 18:34:54 -07:00)


In [12]:
ng20v.target

array([17,  7, 10, ..., 10, 18,  9])

time: 4.26 ms (started: 2023-06-10 18:34:57 -07:00)


Use the `codes_to_categories` function to convert the numeric codes to a category column with the actual newsgroup names:

In [13]:
from drnb.util import codes_to_categories

description = codes_to_categories(
    ng20v.target, ng20v.target_names, col_name="description"
)
description

0        talk.politics.mideast
1                    rec.autos
2             rec.sport.hockey
3             rec.sport.hockey
4                    rec.autos
                 ...          
18841       talk.politics.misc
18842       talk.politics.guns
18843         rec.sport.hockey
18844       talk.politics.misc
18845       rec.sport.baseball
Name: description, Length: 18846, dtype: category
Categories (20, object): ['alt.atheism', 'comp.graphics', 'comp.os.ms-windows.misc', 'comp.sys.ibm.pc.hardware', ..., 'talk.politics.guns', 'talk.politics.mideast', 'talk.politics.misc', 'talk.religion.misc']

time: 29 ms (started: 2023-06-10 18:34:58 -07:00)
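The implementation of `codes_to_categories` is not shown here. Assuming it does nothing more than map integer codes onto names, a plain-pandas equivalent could look like the sketch below. `codes_to_categories_sketch` is a hypothetical stand-in, not the `drnb` function.

```python
import pandas as pd


def codes_to_categories_sketch(codes, category_names, col_name):
    """Map integer codes to a named categorical Series.

    Code i is mapped to category_names[i].
    """
    categorical = pd.Categorical.from_codes(codes, categories=category_names)
    return pd.Series(categorical, name=col_name)


# Toy input rather than the real newsgroup targets.
example = codes_to_categories_sketch(
    [1, 0, 1], ["alt.atheism", "comp.graphics"], col_name="description"
)
print(example)
```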


In [14]:
target = pd.concat([pd.Series(ng20v.target, name="class"), description], axis=1)

time: 3.55 ms (started: 2023-06-10 18:34:59 -07:00)


In [15]:
target

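| | class | description |
|---|---|---|
| 0 | 17 | talk.politics.mideast |
| 1 | 7 | rec.autos |
| 2 | 10 | rec.sport.hockey |
| 3 | 10 | rec.sport.hockey |
| 4 | 7 | rec.autos |
| ... | ... | ... |
| 18841 | 18 | talk.politics.misc |
| 18842 | 16 | talk.politics.guns |
| 18843 | 10 | rec.sport.hockey |
| 18844 | 18 | talk.politics.misc |
| 18845 | 9 | rec.sport.baseball |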


time: 11.5 ms (started: 2023-06-10 18:35:00 -07:00)


### Renormalizing

The initial TF-IDF step L1-normalizes every row, but applying SVD removes that structure. [I find that it is beneficial to renormalize](https://github.com/jlmelville/drnb/blob/master/notebooks/tfidf-renorm.ipynb) so that the rows are L1-normalized again after SVD, so we will also do that here.
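For reference, L1 row normalization amounts to dividing each row by the sum of the absolute values of its entries. SVD output can contain negative values, so the absolute value matters here. This is a minimal NumPy sketch; `l1_normalize_rows` is a hypothetical helper, and leaving all-zero rows unchanged is an assumption rather than a statement about how `drnb` behaves.

```python
import numpy as np


def l1_normalize_rows(X):
    """Scale each row so that its absolute values sum to 1."""
    X = np.asarray(X, dtype=float)
    norms = np.abs(X).sum(axis=1, keepdims=True)
    # Assumption: leave all-zero rows untouched rather than dividing by zero.
    norms[norms == 0] = 1.0
    return X / norms


toy = np.array([[1.0, -3.0], [0.0, 0.0], [2.0, 2.0]])
normalized = l1_normalize_rows(toy)
print(normalized)
```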

In [18]:
from drnb.preprocess import normalize_l1

time: 845 µs (started: 2023-06-10 18:36:23 -07:00)


In [17]:
from drnb.io.pipeline import create_default_pipeline

data_result = create_default_pipeline(
    check_for_duplicates=True,
    metric=["euclidean"],
).run(
    "ng20",
    data=normalize_l1(data),
    target=target,
    tags=["highdim"],
    url="http://qwone.com/~jason/20Newsgroups/",
    verbose=True,
)

OMP: Info #276: omp_set_nested routine deprecated, please use omp_set_max_active_levels instead.


time: 1min 6s (started: 2023-06-10 18:35:12 -07:00)
