In [1]:
%load_ext lab_black
%load_ext autotime
import pandas as pd
import numpy as np

time: 242 ms (started: 2022-09-30 13:26:52 -07:00)


This notebook prepares the [20 Newsgroups dataset](http://qwone.com/~jason/20Newsgroups/). The treatment is based on the material in <https://umap-learn.readthedocs.io/en/latest/sparse.html>, which leaves the data sparse. Not all dimensionality reduction methods can handle sparse data, so here it is converted to a 3000D dense matrix via truncated SVD (similar to PCA, but without centering the data, so the input can stay sparse). **Warning**: the truncated SVD will cause this notebook to take up a fair bit of RAM (around 11GB).

In [2]:
import sklearn.datasets
import sklearn.feature_extraction.text

ng20v = sklearn.datasets.fetch_20newsgroups_vectorized(subset="all")
ng20tfidf = sklearn.feature_extraction.text.TfidfTransformer(norm="l1").fit_transform(
    ng20v.data
)

time: 1.18 s (started: 2022-09-30 13:26:53 -07:00)
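
As a small illustration of what `TfidfTransformer(norm="l1")` does in the cell above, the sketch below applies it to a made-up 3x4 count matrix. The counts are invented for this example and have nothing to do with the newsgroup data. With `norm="l1"`, each row of the output is scaled so that its absolute values sum to 1.

```python
import numpy as np
import scipy.sparse
from sklearn.feature_extraction.text import TfidfTransformer

# Toy term counts: 3 "documents" x 4 "terms" (invented values, no empty rows).
toy_counts = scipy.sparse.csr_matrix(
    np.array([[3, 0, 1, 0], [0, 2, 0, 2], [1, 1, 1, 1]], dtype=float)
)

# Same transformer settings as the notebook cell.
toy_tfidf = TfidfTransformer(norm="l1").fit_transform(toy_counts)

# Sum each row; with L1 normalization every non-empty row sums to 1.
row_sums = np.asarray(toy_tfidf.sum(axis=1)).ravel()
print(row_sums)
```

The output stays sparse, which is why the full-size matrix in the next cells is still in Compressed Sparse Row format.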


In [3]:
ng20tfidf.shape

(18846, 130107)

time: 7.25 ms (started: 2022-09-30 13:26:54 -07:00)


In [4]:
ng20tfidf

<18846x130107 sparse matrix of type '<class 'numpy.float64'>'
	with 2895521 stored elements in Compressed Sparse Row format>

time: 3.4 ms (started: 2022-09-30 13:26:54 -07:00)
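
To see why the data is reduced before being made dense, here is some back-of-the-envelope arithmetic using only the shape and stored-element count reported above. It assumes 8 bytes per `float64` value and ignores the index arrays that the sparse format also stores.

```python
# Values copied from the outputs above.
n_rows, n_cols = 18846, 130107
n_stored = 2895521
bytes_per_float64 = 8  # assumption: values are float64, as the repr reports

# Fraction of entries that are actually stored (non-zero).
density = n_stored / (n_rows * n_cols)

# Memory needed if every entry were stored densely, at full and reduced width.
full_dense_gb = n_rows * n_cols * bytes_per_float64 / 1e9
reduced_dense_gb = n_rows * 3000 * bytes_per_float64 / 1e9

print(f"density: {density:.4%}")
print(f"dense at full width: {full_dense_gb:.1f} GB")
print(f"dense at 3000 columns: {reduced_dense_gb:.2f} GB")
```

Roughly 0.12% of the entries are stored, and a dense copy at full width would need about 19.6 GB, compared with about 0.45 GB for the 3000-column output.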


In [5]:
import sklearn.decomposition

time: 30.7 ms (started: 2022-09-30 13:26:56 -07:00)


Apart from eating up a fair bit of RAM, this next step is also pretty slow (around seven minutes in the run recorded here):

In [6]:
tsvd = sklearn.decomposition.TruncatedSVD(n_components=3000).fit(ng20tfidf)

time: 6min 59s (started: 2022-09-30 13:26:58 -07:00)


How much variance does 3000 components explain?

In [7]:
np.sum(tsvd.explained_variance_ratio_)

0.7099557593781698

time: 3.65 ms (started: 2022-09-30 13:33:58 -07:00)


About 71% of the variance is retained, which is not terrible given the reduction from 130,107 to 3000 columns.
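
The cumulative sum of `explained_variance_ratio_` shows how the retained variance grows with the number of components, which is one way to judge whether a different `n_components` would be worth trying. The sketch below uses a small random sparse matrix as a stand-in (not the newsgroup data), so the numbers it prints say nothing about 20 Newsgroups.

```python
import numpy as np
import scipy.sparse
from sklearn.decomposition import TruncatedSVD

# Stand-in data: 200 x 50 random sparse matrix, about 10% non-zero.
toy = scipy.sparse.random(200, 50, density=0.1, format="csr", random_state=42)

# Fit a truncated SVD directly on the sparse input.
toy_svd = TruncatedSVD(n_components=20, random_state=42).fit(toy)

# Running total of the fraction of variance explained.
cumulative = np.cumsum(toy_svd.explained_variance_ratio_)

# Report the retained fraction at a few component counts.
for k in (5, 10, 20):
    print(k, round(float(cumulative[k - 1]), 3))
```

The last entry of `cumulative` is the same quantity as the `np.sum(...)` computed in the cell above.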

In [8]:
data = tsvd.transform(ng20tfidf)

time: 5.3 s (started: 2022-09-30 13:33:58 -07:00)


## Pipeline

First, prepare the target labels.

In [9]:
ng20v.target_names

['alt.atheism',
 'comp.graphics',
 'comp.os.ms-windows.misc',
 'comp.sys.ibm.pc.hardware',
 'comp.sys.mac.hardware',
 'comp.windows.x',
 'misc.forsale',
 'rec.autos',
 'rec.motorcycles',
 'rec.sport.baseball',
 'rec.sport.hockey',
 'sci.crypt',
 'sci.electronics',
 'sci.med',
 'sci.space',
 'soc.religion.christian',
 'talk.politics.guns',
 'talk.politics.mideast',
 'talk.politics.misc',
 'talk.religion.misc']

time: 3.66 ms (started: 2022-09-30 13:34:04 -07:00)


In [10]:
ng20v.target

array([17,  7, 10, ..., 10, 18,  9])

time: 4.61 ms (started: 2022-09-30 13:34:04 -07:00)


Use the `codes_to_categories` function to convert the numeric codes to a category column with the actual newsgroup names:

In [12]:
from drnb.util import codes_to_categories

description = codes_to_categories(
    ng20v.target, ng20v.target_names, col_name="description"
)
description

0        talk.politics.mideast
1                    rec.autos
2             rec.sport.hockey
3             rec.sport.hockey
4                    rec.autos
                 ...          
18841       talk.politics.misc
18842       talk.politics.guns
18843         rec.sport.hockey
18844       talk.politics.misc
18845       rec.sport.baseball
Name: description, Length: 18846, dtype: category
Categories (20, object): ['alt.atheism', 'comp.graphics', 'comp.os.ms-windows.misc', 'comp.sys.ibm.pc.hardware', ..., 'talk.politics.guns', 'talk.politics.mideast', 'talk.politics.misc', 'talk.religion.misc']

time: 10.8 ms (started: 2022-09-30 13:35:53 -07:00)
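
For readers without `drnb` installed, a similar result can be obtained with plain pandas. This is only a sketch of the idea, not `drnb`'s implementation: the helper name `codes_to_categories_sketch` is hypothetical, and it assumes the codes are zero-based indexes into the list of names.

```python
import numpy as np
import pandas as pd


def codes_to_categories_sketch(codes, category_names, col_name):
    # Hypothetical stand-in for drnb.util.codes_to_categories.
    # Map integer codes to names and keep the result as a categorical Series.
    categorical = pd.Categorical.from_codes(codes, categories=category_names)
    return pd.Series(categorical, name=col_name)


# Toy input: three names and four coded labels (invented for this example).
toy_names = ["alt.atheism", "comp.graphics", "rec.autos"]
toy_codes = np.array([2, 0, 1, 1])

labels = codes_to_categories_sketch(toy_codes, toy_names, col_name="description")
print(labels)
```

Using a categorical dtype keeps the full list of names attached to the column even when some of them do not occur in the data.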


In [18]:
target = pd.concat([pd.Series(ng20v.target, name="class"), description], axis=1)

time: 2.34 ms (started: 2022-09-30 12:57:54 -07:00)


In [19]:
target

|       | class | description           |
|-------|-------|-----------------------|
| 0     | 17    | talk.politics.mideast |
| 1     | 7     | rec.autos             |
| 2     | 10    | rec.sport.hockey      |
| 3     | 10    | rec.sport.hockey      |
| 4     | 7     | rec.autos             |
| ...   | ...   | ...                   |
| 18841 | 18    | talk.politics.misc    |
| 18842 | 16    | talk.politics.guns    |
| 18843 | 10    | rec.sport.hockey      |
| 18844 | 18    | talk.politics.misc    |


time: 9.64 ms (started: 2022-09-30 12:57:55 -07:00)


In [20]:
from drnb.io.pipeline import create_default_pipeline

data_result = create_default_pipeline(
    check_for_duplicates=True, metric=["euclidean", "cosine"]
).run(
    "ng20",
    data=data,
    target=target,
    tags=["highdim"],
    url="http://qwone.com/~jason/20Newsgroups/",
    verbose=True,
)

time: 1min 12s (started: 2022-09-30 12:59:19 -07:00)
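
The pipeline above is asked to check for duplicates. How `drnb` does this internally is not shown here, so the following is only a sketch of one way to count exactly repeated rows in a dense matrix with NumPy. The helper name `count_duplicate_rows` is hypothetical and not part of `drnb`.

```python
import numpy as np


def count_duplicate_rows(X):
    # Hypothetical helper, not drnb's implementation.
    # np.unique with axis=0 collapses identical rows into one.
    n_unique = np.unique(X, axis=0).shape[0]
    return X.shape[0] - n_unique


# Toy data: the first row appears three times in total.
toy = np.array([[1.0, 2.0], [3.0, 4.0], [1.0, 2.0], [1.0, 2.0]])
n_dups = count_duplicate_rows(toy)
print(n_dups)
```

This only catches rows that are bit-for-bit identical; near-duplicates would need a distance threshold instead.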
