In [6]:
%load_ext lab_black
%load_ext autotime
import pandas as pd
import numpy as np

time: 118 ms (started: 2022-09-11 23:31:44 -07:00)


This notebook prepares the [20 Newsgroups dataset](http://qwone.com/~jason/20Newsgroups/). The treatment is based on the material in <https://umap-learn.readthedocs.io/en/latest/sparse.html>, which leaves the data sparse. Not all dimensionality reduction methods can handle sparse data, so here it is converted to a 3000D dense matrix via truncated SVD (similar to PCA, but without centering the data, so it can work on the sparse matrix directly). **Warning**: the truncated SVD will cause this notebook to take up a fair bit of RAM (around 11GB).

In [1]:
import sklearn.datasets
import sklearn.feature_extraction.text

ng20v = sklearn.datasets.fetch_20newsgroups_vectorized(subset="all")
ng20tfidf = sklearn.feature_extraction.text.TfidfTransformer(norm="l1").fit_transform(
    ng20v.data
)

In [2]:
ng20tfidf.shape

(18846, 130107)

In [3]:
ng20tfidf

<18846x130107 sparse matrix of type '<class 'numpy.float64'>'
	with 2895521 stored elements in Compressed Sparse Row format>
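
To see why the matrix cannot simply be densified as-is, a back-of-the-envelope calculation from the shape and stored-element count above helps. This is only a sketch: it assumes 8 bytes per `float64` value, ignores the index arrays the sparse format also stores, and the variable names are mine.

```python
# Figures copied from the sparse matrix repr above.
n_rows, n_cols = 18846, 130107
n_stored = 2895521
bytes_per_float64 = 8  # assumption: values are float64, as the repr reports

# Fraction of entries that are actually stored (non-zero).
density = n_stored / (n_rows * n_cols)

# Memory for a dense array with all columns vs. only 3000 columns.
dense_full_gb = n_rows * n_cols * bytes_per_float64 / 1e9
dense_reduced_gb = n_rows * 3000 * bytes_per_float64 / 1e9

print(f"density: {density:.4%}")
print(f"full dense matrix: {dense_full_gb:.1f} GB")
print(f"3000-column dense matrix: {dense_reduced_gb:.2f} GB")
```

Roughly 0.12% of the entries are non-zero, and a full dense copy would need close to 20 GB, whereas a 3000-column dense matrix needs under half a gigabyte.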

In [7]:
import sklearn.decomposition

time: 751 µs (started: 2022-09-11 23:31:48 -07:00)


Apart from eating up a fair bit of RAM, this next step is also pretty slow (around ten minutes on my machine):

In [9]:
tsvd = sklearn.decomposition.TruncatedSVD(n_components=3000).fit(ng20tfidf)

time: 9min 31s (started: 2022-09-11 23:31:56 -07:00)
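
To experiment with this step without the memory and time cost, a scaled-down sketch on a random sparse matrix may be useful. The sizes, the `toy_` names, and the `random_state` values are my own choices for illustration and are not taken from the notebook.

```python
import numpy as np
import scipy.sparse
import sklearn.decomposition

# Small random sparse matrix standing in for the tf-idf data (hypothetical sizes).
toy_sparse = scipy.sparse.random(200, 500, density=0.01, format="csr", random_state=0)

# Same estimator as in the notebook, with far fewer components.
toy_tsvd = sklearn.decomposition.TruncatedSVD(n_components=10, random_state=0)
toy_dense = toy_tsvd.fit_transform(toy_sparse)

# The input is sparse, but the transformed output is an ordinary dense array.
print(type(toy_dense).__name__, toy_dense.shape)
print(toy_tsvd.explained_variance_ratio_.sum())
```

The point of the sketch is only that `TruncatedSVD` accepts a sparse matrix directly and returns a dense `ndarray` with `n_components` columns.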


How much variance do 3000 components explain?

In [11]:
np.sum(tsvd.explained_variance_ratio_)

0.7099729670131367

time: 4.36 ms (started: 2022-09-11 23:43:01 -07:00)


71%? Not terrible.
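
If a different variance target were wanted, one way to pick the number of components is to look at the cumulative sum of the ratios. The helper below is a hypothetical sketch (the function name and the toy ratios are made up); in this notebook the input would be `tsvd.explained_variance_ratio_`.

```python
import numpy as np


def components_for(ratios, threshold):
    """Return the smallest number of leading components whose cumulative
    explained variance ratio reaches `threshold`, or None if it never does."""
    cumulative = np.cumsum(ratios)
    # Index of the first cumulative value >= threshold.
    idx = int(np.searchsorted(cumulative, threshold, side="left"))
    return idx + 1 if idx < len(cumulative) else None


# Made-up ratios for illustration only.
toy_ratios = np.array([0.4, 0.3, 0.2, 0.05])
print(components_for(toy_ratios, 0.8))
print(components_for(toy_ratios, 0.99))
```

With the toy ratios, three components are needed to reach 0.8, and 0.99 is never reached, so the second call returns `None`.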

In [12]:
data = tsvd.transform(ng20tfidf)

time: 6.55 s (started: 2022-09-11 23:43:55 -07:00)


In [16]:
ng20v.target_names

['alt.atheism',
 'comp.graphics',
 'comp.os.ms-windows.misc',
 'comp.sys.ibm.pc.hardware',
 'comp.sys.mac.hardware',
 'comp.windows.x',
 'misc.forsale',
 'rec.autos',
 'rec.motorcycles',
 'rec.sport.baseball',
 'rec.sport.hockey',
 'sci.crypt',
 'sci.electronics',
 'sci.med',
 'sci.space',
 'soc.religion.christian',
 'talk.politics.guns',
 'talk.politics.mideast',
 'talk.politics.misc',
 'talk.religion.misc']

time: 4.84 ms (started: 2022-09-11 23:46:23 -07:00)


In [17]:
ng20v.target

array([17,  7, 10, ..., 10, 18,  9])

time: 4.13 ms (started: 2022-09-11 23:46:29 -07:00)


In [18]:
description = pd.Series(
    [ng20v.target_names[i] for i in ng20v.target],
    name="description",
)

time: 5.47 ms (started: 2022-09-11 23:47:41 -07:00)


In [19]:
description

0        talk.politics.mideast
1                    rec.autos
2             rec.sport.hockey
3             rec.sport.hockey
4                    rec.autos
                 ...          
18841       talk.politics.misc
18842       talk.politics.guns
18843         rec.sport.hockey
18844       talk.politics.misc
18845       rec.sport.baseball
Name: description, Length: 18846, dtype: object

time: 4.57 ms (started: 2022-09-11 23:47:46 -07:00)
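
An alternative way to map integer class codes to names is `pd.Categorical.from_codes`. The sketch below uses a hypothetical miniature of `ng20v.target_names` and `ng20v.target`; note that it yields a `category` dtype rather than the `object` dtype shown above, which may or may not suit the export step later on.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of ng20v.target_names and ng20v.target.
toy_names = ["alt.atheism", "comp.graphics", "rec.autos"]
toy_codes = np.array([2, 0, 0, 1])

# Each code is looked up in `categories`; the result is a categorical Series.
toy_description = pd.Series(
    pd.Categorical.from_codes(toy_codes, categories=toy_names),
    name="description",
)
print(toy_description.tolist())
print(toy_description.dtype)
```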


In [23]:
target = pd.concat([pd.Series(ng20v.target, name="class"), description], axis=1)

time: 3.41 ms (started: 2022-09-11 23:49:45 -07:00)


In [24]:
target

| | class | description |
|---|---|---|
| 0 | 17 | talk.politics.mideast |
| 1 | 7 | rec.autos |
| 2 | 10 | rec.sport.hockey |
| 3 | 10 | rec.sport.hockey |
| 4 | 7 | rec.autos |
| ... | ... | ... |
| 18841 | 18 | talk.politics.misc |
| 18842 | 16 | talk.politics.guns |
| 18843 | 10 | rec.sport.hockey |
| 18844 | 18 | talk.politics.misc |


time: 8.88 ms (started: 2022-09-11 23:49:47 -07:00)


## Pipeline

In [25]:
from drnb.dataset import create_data_pipeline

data_pipe = create_data_pipeline(
    data_export=["csv", "npy"],
    target_export=["csv", "pkl"],
    neighbors=dict(
        n_neighbors=[15, 50, 150],
        method="exact",
        metric=["euclidean"],
        file_types=["csv", "npy"],
    ),
    triplets=dict(
        n_triplets_per_point=5,
        seed=1337,
        file_types=["csv", "npy"],
    ),
    verbose=True,
)

INFO:rich:Requesting one extra neighbor to account for self-neighbor


time: 3.87 s (started: 2022-09-11 23:51:44 -07:00)


In [26]:
data_result = data_pipe.run("ng20", data=data, target=target, verbose=True)

INFO:rich:initial data shape: (18846, 3000)
INFO:rich:Removing rows with NAs
INFO:rich:data shape after filtering NAs: (18846, 3000)
INFO:rich:Keeping all columns
INFO:rich:data shape after filtering columns: (18846, 3000)
INFO:rich:No scaling
INFO:rich:Converting to numpy with {'dtype': 'float32', 'layout': 'c'}
INFO:rich:Writing data for ng20
INFO:rich:Processing target with initial shape (18846, 2)
INFO:rich:Keeping all columns
INFO:rich:Writing target for ng20
INFO:rich:Calculating nearest neighbors
INFO:rich:Finding 151 neighbors using faiss with euclidean metric and params: {}
INFO:rich:Calculating triplets
INFO:rich:Writing csv format to triplets/ng20.5.1337.idx.csv
INFO:rich:Writing csv format to triplets/ng20.5.1337.l2.csv
INFO:rich:Writing numpy format to triplets/ng20.5.1337.idx.npy
INFO:rich:Writing numpy format to triplets/ng20.5.1337.l2.npy
INFO:rich:Writing pipeline result for ng20


time: 58.9 s (started: 2022-09-11 23:52:08 -07:00)
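
The log reports a search for 151 neighbors although the largest requested `n_neighbors` is 150, which matches the earlier "one extra neighbor to account for self-neighbor" message. The sketch below illustrates my reading of that message on toy data. It uses plain NumPy brute force rather than the faiss search named in the log, it is not drnb's implementation, and whether drnb keeps or drops the self column is not shown in the log.

```python
import numpy as np

rng = np.random.default_rng(0)
points = rng.normal(size=(6, 3))  # hypothetical toy data
k = 2

# Pairwise squared Euclidean distances via broadcasting.
diff = points[:, None, :] - points[None, :, :]
sq_dist = (diff**2).sum(axis=-1)

# Ask for k + 1 neighbors: each point's nearest neighbor is itself (distance 0).
order = np.argsort(sq_dist, axis=1)[:, : k + 1]
print(order[:, 0])  # the self-neighbor column

# Dropping the first column leaves k genuine neighbors per point.
neighbors = order[:, 1:]
print(neighbors.shape)
```

Because every point is at distance zero from itself, a query for exactly `k` neighbors on the training data would return only `k - 1` other points, hence the extra one.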
