In [1]:
import numpy as np
import pandas as pd

This notebook prepares the [20 Newsgroups dataset](http://qwone.com/~jason/20Newsgroups/). The treatment is based on the 
material in <https://umap-learn.readthedocs.io/en/latest/sparse.html>, which leaves the data sparse.
Not all dimensionality reduction methods can handle sparse data, so it will be converted to a 2500D 
dense matrix via truncated SVD (effectively PCA without centering). **Warning**: the truncated SVD will 
cause this notebook to take up a fair bit of RAM (around 11GB).

This is also a good example of the perils of applying Euclidean distances in high-dimensional space.
Expect the results to look terrible, although row-normalization can help.
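
To make the point about Euclidean distances concrete, here is a minimal sketch of distance concentration. It uses uniformly random points rather than the newsgroups data, and the `relative_contrast` helper is a name made up for this illustration: it measures how much further the farthest point is than the nearest one, relative to the nearest.

```python
import numpy as np

rng = np.random.default_rng(42)


def relative_contrast(n_dims, n_points=200):
    """(max - min) / min of the Euclidean distances from one point to all others."""
    x = rng.random((n_points, n_dims))
    dist = np.linalg.norm(x[1:] - x[0], axis=1)
    return (dist.max() - dist.min()) / dist.min()


# Hypothetical comparison: 2 dimensions versus the 2500 used in this notebook.
contrast_low = relative_contrast(2)
contrast_high = relative_contrast(2500)
print(f"2D: {contrast_low:.2f}, 2500D: {contrast_high:.2f}")
```

In the high-dimensional case the nearest and farthest points end up at nearly the same distance, which is why neighbor-based methods struggle. Real data is not uniform noise, so treat this as an upper bound on how bad things can get rather than a prediction for this dataset.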

In [2]:
import sklearn.datasets
import sklearn.feature_extraction.text

ng20v = sklearn.datasets.fetch_20newsgroups_vectorized(subset="all")
ng20tfidf = sklearn.feature_extraction.text.TfidfTransformer(norm="l1").fit_transform(
    ng20v.data
)

In [3]:
ng20tfidf.shape

(18846, 130107)

In [4]:
ng20tfidf

<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 2895521 stored elements and shape (18846, 130107)>

In [5]:
import sklearn.decomposition

Apart from eating up a fair bit of RAM, this next step is also pretty slow (around ten minutes on my machine). For a more accurate SVD, you probably want to set `algorithm="arpack"`, but that will cause the SVD process to take a lot longer (around an hour on my machine). Getting a low(ish)-rank dense representation of the data is more important to me than getting the actual SVD.

Why did I choose 2500 components? I carried out a permutation test: I randomly shuffled the contents of each column, repeated the SVD with 10 different shuffles, and then looked for the point at which the amount of variance extracted in the unshuffled case fell below that of the shuffled versions. Yes, I should have used a lot more permutations to be sure about this, but I still think it's better than just picking an arbitrary number.
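
The permutation test described above could look roughly like the following. This is a sketch on small dense toy data, not the code that produced the 2500 figure: the helper names (`shuffle_columns`, `variance_ratios`), the toy matrix, and the choice to compare against the maximum over the shuffles are all assumptions made for illustration. The real 20NG matrix is sparse and much larger, so it would need a sparse-aware shuffle.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD


def shuffle_columns(x, rng):
    """Return a copy of x with the entries of each column independently permuted."""
    shuffled = x.copy()
    for j in range(shuffled.shape[1]):
        shuffled[:, j] = rng.permutation(shuffled[:, j])
    return shuffled


def variance_ratios(x, n_components):
    """Explained variance ratio per component from a randomized truncated SVD."""
    svd = TruncatedSVD(n_components=n_components, random_state=0)
    return svd.fit(x).explained_variance_ratio_


rng = np.random.default_rng(0)

# Toy data (an assumption, not the 20NG matrix): rank-5 signal plus noise.
n_rows, n_cols, true_rank = 200, 50, 5
signal = rng.normal(size=(n_rows, true_rank)) @ rng.normal(size=(true_rank, n_cols))
toy = signal + 0.5 * rng.normal(size=(n_rows, n_cols))

n_components = 20
real = variance_ratios(toy, n_components)

# Shuffling each column destroys correlations between columns but keeps
# each column's own distribution, giving a "no structure" baseline.
null = np.array(
    [variance_ratios(shuffle_columns(toy, rng), n_components) for _ in range(10)]
)
threshold = null.max(axis=0)

# Keep components up to the first one that fails to beat the shuffled baseline.
not_better = real <= threshold
n_keep = int(np.argmax(not_better)) if not_better.any() else n_components
print(n_keep)
```

Comparing against the maximum over shuffles is a conservative choice; using the mean or a high percentile would be equally defensible, and with only 10 shuffles none of these thresholds is very precise.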

In [6]:
tsvd = sklearn.decomposition.TruncatedSVD(n_components=2500).fit(ng20tfidf)

How much variance does 2500 components explain?

In [7]:
np.sum(tsvd.explained_variance_ratio_)

np.float64(0.6634987588034158)

66%? Not terrible.
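
If you would rather pick the number of components from a target fraction of variance, the cumulative sum of `explained_variance_ratio_` gives that directly. The sketch below uses a small random matrix and an arbitrary 50% target, both of which are assumptions for illustration only.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

rng = np.random.default_rng(0)
toy = rng.normal(size=(100, 30))  # stand-in data, not the TF-IDF matrix

svd = TruncatedSVD(n_components=20, random_state=0).fit(toy)

# Running total of the variance explained by the first k components.
cumulative = np.cumsum(svd.explained_variance_ratio_)

# Smallest number of components whose cumulative ratio reaches the target.
target_fraction = 0.5
n_needed = int(np.searchsorted(cumulative, target_fraction) + 1)
print(n_needed, cumulative[n_needed - 1])
```

This only works if the fitted components reach the target at all, so check that `cumulative[-1]` is at least `target_fraction` before trusting `n_needed`.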

In [8]:
data = tsvd.transform(ng20tfidf)

## Pipeline

First, prepare the target labels.

In [9]:
ng20v.target_names

['alt.atheism',
 'comp.graphics',
 'comp.os.ms-windows.misc',
 'comp.sys.ibm.pc.hardware',
 'comp.sys.mac.hardware',
 'comp.windows.x',
 'misc.forsale',
 'rec.autos',
 'rec.motorcycles',
 'rec.sport.baseball',
 'rec.sport.hockey',
 'sci.crypt',
 'sci.electronics',
 'sci.med',
 'sci.space',
 'soc.religion.christian',
 'talk.politics.guns',
 'talk.politics.mideast',
 'talk.politics.misc',
 'talk.religion.misc']

In [10]:
ng20v.target

array([17,  7, 10, ..., 10, 18,  9])

Use the `codes_to_categories` function to convert the numeric codes to a category column with the actual newsgroup names:

In [11]:
from drnb.util import codes_to_categories

newsgroup = codes_to_categories(ng20v.target, ng20v.target_names, col_name="newsgroup")
newsgroup

0        talk.politics.mideast
1                    rec.autos
2             rec.sport.hockey
3             rec.sport.hockey
4                    rec.autos
                 ...          
18841       talk.politics.misc
18842       talk.politics.guns
18843         rec.sport.hockey
18844       talk.politics.misc
18845       rec.sport.baseball
Name: newsgroup, Length: 18846, dtype: category
Categories (20, object): ['alt.atheism', 'comp.graphics', 'comp.os.ms-windows.misc', 'comp.sys.ibm.pc.hardware', ..., 'talk.politics.guns', 'talk.politics.mideast', 'talk.politics.misc', 'talk.religion.misc']
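
If you don't have `drnb` to hand, the same kind of conversion can be done with plain pandas. The function below is a hypothetical stand-in written for this note, not the actual implementation of `codes_to_categories`, which may behave differently in its details.

```python
import pandas as pd


def codes_to_categories_sketch(codes, names, col_name):
    """Map integer codes to a named categorical Series (illustrative only)."""
    categorical = pd.Categorical.from_codes(codes, categories=names)
    return pd.Series(categorical, name=col_name)


demo = codes_to_categories_sketch(
    [2, 0, 1, 1],
    ["alt.atheism", "comp.graphics", "rec.autos"],
    col_name="newsgroup",
)
print(demo)
```

`pd.Categorical.from_codes` keeps the full list of categories even if some codes never appear, which is handy when a palette is keyed on the category names.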

In [12]:
target = pd.concat([pd.Series(ng20v.target, name="class"), newsgroup], axis=1)

In [13]:
target

| | class | newsgroup |
|---|---|---|
| 0 | 17 | talk.politics.mideast |
| 1 | 7 | rec.autos |
| 2 | 10 | rec.sport.hockey |
| 3 | 10 | rec.sport.hockey |
| 4 | 7 | rec.autos |
| ... | ... | ... |
| 18841 | 18 | talk.politics.misc |
| 18842 | 16 | talk.politics.guns |
| 18843 | 10 | rec.sport.hockey |
| 18844 | 18 | talk.politics.misc |


### Palette

We can also come up with some custom colors that map to the broader subjects in the different
newsgroups. Details on how these were created can be found in the 
[20NG PaCMAP notebook](https://github.com/jlmelville/drnb/blob/master/notebooks/data-pipeline/ng20pacmap.ipynb).

In [14]:
target_palette = {
    "newsgroup": {
        "comp.graphics": "#590000",
        "comp.os.ms-windows.misc": "#96000c",
        "comp.sys.ibm.pc.hardware": "#d23d20",
        "comp.sys.mac.hardware": "#f3823d",
        "comp.windows.x": "#fbb655",
        "sci.crypt": "#0400ba",
        "sci.electronics": "#0c4deb",
        "sci.med": "#4d96f7",
        "sci.space": "#86bef3",
        "talk.politics.guns": "#003100",
        "talk.politics.mideast": "#006500",
        "talk.politics.misc": "#459a10",
        "alt.atheism": "#b631ba",
        "soc.religion.christian": "#ff65ff",
        "talk.religion.misc": "#ffb6ff",
        "rec.sport.baseball": "#494549",
        "rec.sport.hockey": "#928a82",
        "rec.autos": "#412000",
        "rec.motorcycles": "#8a6104",
        "misc.forsale": "#007982",
    }
}

### Renormalizing

The initial TF-IDF procedure L1-normalizes every row, but the SVD projection does not preserve that normalization. [I find that it is beneficial to renormalize](https://github.com/jlmelville/drnb/blob/master/notebooks/tfidf-renorm.ipynb) so that the rows are L1-normalized again after SVD, so we will also do that here.
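
For reference, row-wise L1 normalization of a dense matrix can be sketched with scikit-learn's `normalize`. This is an assumption about what such a step does in general; `drnb`'s `normalize_l1`, used in the cells that follow, may handle details such as all-zero rows differently.

```python
import numpy as np
from sklearn.preprocessing import normalize

rng = np.random.default_rng(0)

# Stand-in for SVD output: dense, with both positive and negative entries.
dense = rng.normal(size=(5, 4))

# Divide each row by the sum of the absolute values of its entries.
renormed = normalize(dense, norm="l1")
row_l1 = np.abs(renormed).sum(axis=1)
print(row_l1)
```

Unlike the original TF-IDF rows, the SVD output contains negative values, so "L1 normalized" here means the absolute values in each row sum to one, not the raw values.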

In [15]:
from drnb.preprocess import normalize_l1

In [16]:
from drnb.io.pipeline import create_default_pipeline

data_result = create_default_pipeline(
    check_for_duplicates=True,
    metric=["euclidean"],
).run(
    "ng20",
    data=normalize_l1(data),
    target=target,
    target_palette=target_palette,
    tags=["highdim"],
    url="http://qwone.com/~jason/20Newsgroups/",
    verbose=True,
)