# Contents
* [Introduction](#Introduction)
* [Imports and configuration](#Imports-and-configuration)
* [Setup](#Setup)
* [Load data](#Load-data)
* [Linear discriminant analysis components of FRILL embeddings](#Linear-discriminant-analysis-components-of-FRILL-embeddings)
* [One-class linear discriminant analysis of FRILL embeddings](#One-class-linear-discriminant-analysis-of-FRILL-embeddings)
* [Local outlier factor of FRILL embeddings](#Local-outlier-factor-of-FRILL-embeddings)
* [Local outlier factor of LDA components of FRILL embeddings](#Local-outlier-factor-of-LDA-components-of-FRILL-embeddings)
* [One-class SVM scores of FRILL embeddings (and their LDA components) and of the LDA components of the FRILL embeddings](#One-class-SVM-scores-of-FRILL-embeddings-(and-their-LDA-components)-and-of-the-LDA-components-of-the-FRILL-embeddings)
* [Spherical coordinates](#Spherical-coordinates)
* [LDA components of spherical FRILL-based features](#LDA-components-of-spherical-FRILL-based-features)
* [One-class LDA components of spherical FRILL-based features](#One-class-LDA-components-of-spherical-FRILL-based-features)
* [Uniform scaling with quantile transformation (omitted)](#Uniform-scaling-with-quantile-transformation-(omitted))
* [Discussion](#Discussion)

# Introduction

Three holdout datasets have been identified. This notebook loads the holdout FRILL embeddings and labels, then applies the feature extractors exported from the full training data to produce the same FRILL-based features for the holdout samples.

# Imports and configuration

In [1]:
from time import time

notebook_begin_time = time()

# set random seeds

from os import environ
from random import seed as random_seed
from numpy.random import seed as np_seed
from tensorflow.random import set_seed


def reset_seeds(seed: int) -> None:
    """Utility function for resetting random seeds"""
    environ["PYTHONHASHSEED"] = str(seed)
    random_seed(seed)
    np_seed(seed)
    set_seed(seed)


reset_seeds(SEED := 2021)
del environ
del random_seed
del np_seed
del set_seed
del reset_seeds
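
To illustrate what reseeding buys, here is a minimal sketch of the same pattern without TensorFlow. The `draw` helper is hypothetical and only exists for this demonstration.

```python
import random

import numpy as np


def reset_seeds(seed: int) -> None:
    """Reseed the stdlib and NumPy global generators (TensorFlow omitted here)."""
    random.seed(seed)
    np.random.seed(seed)


def draw() -> tuple:
    """Hypothetical helper: take one draw from each global generator."""
    return random.random(), float(np.random.rand())


reset_seeds(2021)
first = draw()
reset_seeds(2021)
second = draw()
print(first == second)  # identical draws after reseeding with the same value
```

Note that `PYTHONHASHSEED` is read when the interpreter starts, so setting it from a running kernel mainly affects subprocesses rather than the kernel itself.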

In [2]:
# extensions
%load_ext autotime
%load_ext lab_black
%load_ext nb_black

In [3]:
# core
import numpy as np
import pandas as pd

# utility
from copy import deepcopy
from joblib import load as joblib_load
from pathlib import Path
from gc import collect as gc_collect
from tqdm.notebook import tqdm

# faster
import swifter
from sklearnex import patch_sklearn

patch_sklearn()
del patch_sklearn

# spherical coordinates
from numpy import arctan2, hypot, sqrt

# typing
from typing import List

# display outputs w/o print calls
from IPython.core.interactiveshell import InteractiveShell

InteractiveShell.ast_node_interactivity = "all"
del InteractiveShell

time: 3.62 s


Intel(R) Extension for Scikit-learn* enabled (https://github.com/intel/scikit-learn-intelex)


In [4]:
# Location of exported feature extractors
FEATURE_EXTRACTORS = "../19.0-mic-extract_FRILL-based_features_from_full_data"

# Location where this notebook will output
OUT_FOLDER = "."

_ = gc_collect()

time: 120 ms


# Setup

In [5]:
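# valence label codes
VALENCE = {"neg": 0, "neu": 1, "pos": 2}
VALENCES = ("neg", "neu", "pos")
# neighborhood sizes of the exported local outlier factor models
NEIGHBORS = (10, 20, 30)
# descriptors of the exported one-class SVM models
OC_SVM = ("sgdlinear", "rbf", "sigmoid", "poly5", "poly6")


def feature_trio(prefix, suffix) -> List[str]:
    """Return the per-valence feature names for a prefix and suffix, e.g. LOF_neg_10"""
    return [
        f"{f'{prefix}_' if prefix else ''}{valence}{f'_{suffix}' if suffix else ''}"
        for valence in VALENCES
    ]


def load_extractor(feature_file: str):
    """Load an exported feature extractor by file name (without the extension)"""
    return joblib_load(f"{FEATURE_EXTRACTORS}/{feature_file}.joblib")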


def spot_check(data: pd.DataFrame, labels: pd.DataFrame) -> None:
    """Spot check feature extraction process"""
    assert all(data.index == labels.index)
    assert not data.isnull().values.any()
    print(data.info())
    print(data.head())


_ = gc_collect()

time: 126 ms
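
The naming rule in `feature_trio` drives every later cell, so a quick standalone check may help. This sketch re-declares the helper as a plain function with the same format string; the example prefixes are taken from later cells.

```python
VALENCES = ("neg", "neu", "pos")


def feature_trio(prefix, suffix):
    """Same naming rule as the notebook's feature_trio, re-declared for a standalone run."""
    return [
        f"{f'{prefix}_' if prefix else ''}{valence}{f'_{suffix}' if suffix else ''}"
        for valence in VALENCES
    ]


print(feature_trio("ocLDA", ""))  # ['ocLDA_neg', 'ocLDA_neu', 'ocLDA_pos']
print(feature_trio("LOF", 10))    # ['LOF_neg_10', 'LOF_neu_10', 'LOF_pos_10']
print(feature_trio("", "PCA"))    # ['neg_PCA', 'neu_PCA', 'pos_PCA']
```

An empty prefix or suffix is simply dropped along with its underscore.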


# Load data

In [6]:
frill = pd.read_feather("holdout_FRILL.feather")
labels = pd.read_feather("holdout_labels.feather").set_index("id")
spot_check(frill, labels)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1957 entries, 0 to 1956
Columns: 2048 entries, 0 to 2047
dtypes: float32(2048)
memory usage: 15.3 MB
None
          0         1         2         3         4         5         6  \
0 -0.035624  0.148898 -0.109508 -0.016199 -0.084427 -0.067872  0.174007   
1 -0.103035  0.082011  0.110948 -0.060209 -0.073397  0.009772 -0.020830   
2 -0.065101  0.038458 -0.029423  0.005429 -0.023002  0.152135  0.053979   
3  0.083481  0.138969 -0.030347  0.062699  0.073357 -0.039354 -0.089715   
4  0.019141  0.170539 -0.056348  0.032532 -0.089482  0.079703  0.037175   

          7         8         9  ...      2038      2039      2040      2041  \
0 -0.033383 -0.017896 -0.024012  ... -0.153698  0.041551 -0.003445 -0.059062   
1  0.044276  0.046892 -0.115516  ...  0.065289  0.075569 -0.089013 -0.161597   
2 -0.039011  0.017791 -0.077413  ... -0.042246 -0.004320 -0.040200 -0.049798   
3  0.113456  0.039636 -0.041954  ...  0.022920 -0.181787  0.011134 -0.073
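
The index assertion in `spot_check` only passes when the `id` values line up row for row with the default integer index of the embeddings. A toy sketch of that behavior, with made-up frames standing in for the real data:

```python
import pandas as pd

# toy stand-ins: the embeddings keep a default RangeIndex, labels are indexed by id
data = pd.DataFrame({"f0": [0.1, 0.2, 0.3]})
labels = pd.DataFrame({"id": [0, 1, 2], "valence": [0, 2, 1]}).set_index("id")
reordered = labels.iloc[::-1]  # same ids, reversed row order

aligned = all(data.index == labels.index)
misaligned = all(data.index == reordered.index)
has_missing = bool(data.isnull().values.any())

print(aligned, misaligned, has_missing)
```

The comparison is elementwise, so it checks order as well as membership.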

# Linear discriminant analysis components of FRILL embeddings

In [7]:
data = pd.DataFrame(
    load_extractor("LDA1_-_LDA2").transform(frill), columns=["LDA1", "LDA2"]
)
spot_check(data, labels)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1957 entries, 0 to 1956
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   LDA1    1957 non-null   float64
 1   LDA2    1957 non-null   float64
dtypes: float64(2)
memory usage: 30.7 KB
None
       LDA1      LDA2
0 -1.001496  1.190528
1 -0.614103  0.555738
2 -1.283616  0.439167
3 -0.174906  1.891036
4 -1.165987  1.732043
time: 340 ms
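
For context, here is a rough sketch of the export/load round trip that `load_extractor` relies on. The data is synthetic and the fitting step is an assumption about what the training notebook did; only the file name and the `transform` call mirror this notebook.

```python
import tempfile
from pathlib import Path

import numpy as np
from joblib import dump, load
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
# synthetic stand-in for training embeddings: 3 classes, 16 dimensions
X_train = rng.normal(size=(300, 16))
y_train = np.repeat([0, 1, 2], 100)
X_train[y_train == 1] += 1.0
X_train[y_train == 2] -= 1.0

# with 3 classes, LDA yields at most 2 discriminant components
lda = LinearDiscriminantAnalysis(n_components=2).fit(X_train, y_train)

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "LDA1_-_LDA2.joblib"
    dump(lda, path)          # what the training notebook presumably did
    restored = load(path)    # what load_extractor does here

X_holdout = rng.normal(size=(5, 16))
components = restored.transform(X_holdout)
print(components.shape)
```

The holdout data is only ever transformed, never used for fitting.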


# One-class linear discriminant analysis of FRILL embeddings

In [8]:
df = deepcopy(frill)
for valence in VALENCES:
    feature = f"ocLDA_{valence}"
    df[feature] = np.squeeze(
        load_extractor(feature).transform(df.loc[:, frill.columns])
    )
    del feature
    _ = gc_collect()
data = pd.concat(
    [data, df.loc[:, feature_trio("ocLDA", "")]],
    axis="columns",
)
spot_check(data, labels)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1957 entries, 0 to 1956
Data columns (total 5 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   LDA1       1957 non-null   float64
 1   LDA2       1957 non-null   float64
 2   ocLDA_neg  1957 non-null   float64
 3   ocLDA_neu  1957 non-null   float64
 4   ocLDA_pos  1957 non-null   float64
dtypes: float64(5)
memory usage: 76.6 KB
None
       LDA1      LDA2  ocLDA_neg  ocLDA_neu  ocLDA_pos
0 -1.001496  1.190528  -0.258172  -0.947540   1.478330
1 -0.614103  0.555738  -0.027168  -0.588753   0.747540
2 -1.283616  0.439167   0.486838  -1.262689   0.903614
3 -0.174906  1.891036  -1.311590  -0.090789   1.793279
4 -1.165987  1.732043  -0.560913  -1.087816   2.036688
time: 778 ms


In [9]:
del df
_ = gc_collect()

time: 117 ms
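
The fitting of the `ocLDA_*` extractors is not shown in this notebook. One plausible reading, and it is only an assumption, is a binary one-vs-rest LDA per valence, which yields a single component that `np.squeeze` flattens into a column. A sketch on synthetic data:

```python
import numpy as np
import pandas as pd
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
X = rng.normal(size=(150, 8))               # synthetic stand-in for embeddings
y = np.repeat(["neg", "neu", "pos"], 50)

features = pd.DataFrame(index=range(len(X)))
for valence in ("neg", "neu", "pos"):
    # ASSUMPTION: "one-class" LDA means LDA on a one-vs-rest binary target
    target = (y == valence).astype(int)
    oc_lda = LinearDiscriminantAnalysis().fit(X, target)
    projected = oc_lda.transform(X)         # shape (n_samples, 1)
    features[f"ocLDA_{valence}"] = np.squeeze(projected)  # shape (n_samples,)

print(features.shape)
```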


# Local outlier factor of FRILL embeddings

In [10]:
for n_neighbors in tqdm(NEIGHBORS):
    for valence in tqdm(VALENCES):
        feature = f"LOF_{valence}_{n_neighbors}"
        data[feature] = load_extractor(feature).score_samples(frill)

spot_check(data, labels)

for valence in tqdm(VALENCES):
    feature = f"LOF_{valence}_PCA"
    data[feature] = load_extractor(feature).transform(
        data.loc[:, [f"LOF_{valence}_{k}" for k in NEIGHBORS]]
    )

spot_check(data, labels)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1957 entries, 0 to 1956
Data columns (total 14 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   LDA1        1957 non-null   float64
 1   LDA2        1957 non-null   float64
 2   ocLDA_neg   1957 non-null   float64
 3   ocLDA_neu   1957 non-null   float64
 4   ocLDA_pos   1957 non-null   float64
 5   LOF_neg_10  1957 non-null   float32
 6   LOF_neu_10  1957 non-null   float32
 7   LOF_pos_10  1957 non-null   float32
 8   LOF_neg_20  1957 non-null   float32
 9   LOF_neu_20  1957 non-null   float32
 10  LOF_pos_20  1957 non-null   float32
 11  LOF_neg_30  1957 non-null   float32
 12  LOF_neu_30  1957 non-null   float32
 13  LOF_pos_30  1957 non-null   float32
dtypes: float32(9), float64(5)
memory usage: 145.4 KB
None
       LDA1      LDA2  ocLDA_neg  ocLDA_neu  ocLDA_pos  LOF_neg_10  \
0 -1.001496  1.190528  -0.258172  -0.947540   1.478330   -1.163084   
1 -0.614103  0.555738  -0.027168  -0.5

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1957 entries, 0 to 1956
Data columns (total 17 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   LDA1         1957 non-null   float64
 1   LDA2         1957 non-null   float64
 2   ocLDA_neg    1957 non-null   float64
 3   ocLDA_neu    1957 non-null   float64
 4   ocLDA_pos    1957 non-null   float64
 5   LOF_neg_10   1957 non-null   float32
 6   LOF_neu_10   1957 non-null   float32
 7   LOF_pos_10   1957 non-null   float32
 8   LOF_neg_20   1957 non-null   float32
 9   LOF_neu_20   1957 non-null   float32
 10  LOF_pos_20   1957 non-null   float32
 11  LOF_neg_30   1957 non-null   float32
 12  LOF_neu_30   1957 non-null   float32
 13  LOF_pos_30   1957 non-null   float32
 14  LOF_neg_PCA  1957 non-null   float32
 15  LOF_neu_PCA  1957 non-null   float32
 16  LOF_pos_PCA  1957 non-null   float32
dtypes: float32(12), float64(5)
memory usage: 168.3 KB
None
       LDA1      LDA2  ocLDA_neg  o
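
A minimal sketch of the two steps above on synthetic data: score unseen samples with local outlier factor models at several neighborhood sizes, then collapse the three scores into one component with PCA. How the exported models were actually fit is an assumption; in scikit-learn, `novelty=True` is what allows `score_samples` on new data.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 4))  # stand-in for one valence class
# five ordinary points plus one far-away point in the last row
X_new = np.vstack([rng.normal(size=(5, 4)), np.full((1, 4), 8.0)])

neighbors = (10, 20, 30)
models = {
    k: LocalOutlierFactor(n_neighbors=k, novelty=True).fit(X_train)
    for k in neighbors
}

# training scores come from the fitted attribute; new data uses score_samples
train_scores = np.column_stack([models[k].negative_outlier_factor_ for k in neighbors])
new_scores = np.column_stack([models[k].score_samples(X_new) for k in neighbors])

# ASSUMPTION: the *_PCA extractors keep a single principal component
pca = PCA(n_components=1).fit(train_scores)
combined = pca.transform(new_scores).ravel()
print(new_scores.round(2))
```

Scores are the negated local outlier factor, so inliers sit near -1 and the far-away point is much more negative.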

# Local outlier factor of LDA components of FRILL embeddings

In [11]:
for n_neighbors in tqdm(NEIGHBORS):
    for valence in tqdm(VALENCES):
        feature = f"LDA-LOF_{valence}_{n_neighbors}"
        data[feature] = load_extractor(feature).score_samples(
            data.loc[:, ["LDA1", "LDA2"]]
        )

spot_check(data, labels)

for valence in tqdm(VALENCES):
    feature = f"LDA-LOF_{valence}_PCA"
    data[feature] = load_extractor(feature).transform(
        data.loc[:, [f"LDA-LOF_{valence}_{k}" for k in NEIGHBORS]]
    )

spot_check(data, labels)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1957 entries, 0 to 1956
Data columns (total 26 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   LDA1            1957 non-null   float64
 1   LDA2            1957 non-null   float64
 2   ocLDA_neg       1957 non-null   float64
 3   ocLDA_neu       1957 non-null   float64
 4   ocLDA_pos       1957 non-null   float64
 5   LOF_neg_10      1957 non-null   float32
 6   LOF_neu_10      1957 non-null   float32
 7   LOF_pos_10      1957 non-null   float32
 8   LOF_neg_20      1957 non-null   float32
 9   LOF_neu_20      1957 non-null   float32
 10  LOF_pos_20      1957 non-null   float32
 11  LOF_neg_30      1957 non-null   float32
 12  LOF_neu_30      1957 non-null   float32
 13  LOF_pos_30      1957 non-null   float32
 14  LOF_neg_PCA     1957 non-null   float32
 15  LOF_neu_PCA     1957 non-null   float32
 16  LOF_pos_PCA     1957 non-null   float32
 17  LDA-LOF_neg_10  1957 non-null   f

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1957 entries, 0 to 1956
Data columns (total 29 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   LDA1             1957 non-null   float64
 1   LDA2             1957 non-null   float64
 2   ocLDA_neg        1957 non-null   float64
 3   ocLDA_neu        1957 non-null   float64
 4   ocLDA_pos        1957 non-null   float64
 5   LOF_neg_10       1957 non-null   float32
 6   LOF_neu_10       1957 non-null   float32
 7   LOF_pos_10       1957 non-null   float32
 8   LOF_neg_20       1957 non-null   float32
 9   LOF_neu_20       1957 non-null   float32
 10  LOF_pos_20       1957 non-null   float32
 11  LOF_neg_30       1957 non-null   float32
 12  LOF_neu_30       1957 non-null   float32
 13  LOF_pos_30       1957 non-null   float32
 14  LOF_neg_PCA      1957 non-null   float32
 15  LOF_neu_PCA      1957 non-null   float32
 16  LOF_pos_PCA      1957 non-null   float32
 17  LDA-LOF_neg_10

# One-class SVM scores of FRILL embeddings (and their LDA components) and of the LDA components of the FRILL embeddings

In [12]:
for oc_svm in tqdm(OC_SVM):
    prefix = f"ocSVM_{oc_svm}"
    # one-class scores of FRILL embeddings
    for valence in VALENCES:
        feature = f"{prefix}_{valence}"
        data[feature] = load_extractor(feature).score_samples(frill)
    # LDA components of one-class scores of FRILL embeddings
    df = data.loc[:, feature_trio(prefix, "")]
    features = [f"{prefix}_LDA1", f"{prefix}_LDA2"]
    df = pd.DataFrame(
        load_extractor("_-_".join(features)).transform(df), columns=features
    )
    data = pd.concat([data, df], axis="columns")
    del df
    del features
    _ = gc_collect()
    # one-class scores of LDA components of FRILL embeddings
    for feature in feature_trio(f"LDA-{prefix}", ""):
        data[feature] = load_extractor(feature).score_samples(
            data.loc[:, ["LDA1", "LDA2"]]
        )
spot_check(data, labels)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1957 entries, 0 to 1956
Data columns (total 69 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   LDA1                     1957 non-null   float64
 1   LDA2                     1957 non-null   float64
 2   ocLDA_neg                1957 non-null   float64
 3   ocLDA_neu                1957 non-null   float64
 4   ocLDA_pos                1957 non-null   float64
 5   LOF_neg_10               1957 non-null   float32
 6   LOF_neu_10               1957 non-null   float32
 7   LOF_pos_10               1957 non-null   float32
 8   LOF_neg_20               1957 non-null   float32
 9   LOF_neu_20               1957 non-null   float32
 10  LOF_pos_20               1957 non-null   float32
 11  LOF_neg_30               1957 non-null   float32
 12  LOF_neu_30               1957 non-null   float32
 13  LOF_pos_30               1957 non-null   float32
 14  LOF_neg_PCA             
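
As a small illustration of what `score_samples` returns for a one-class SVM, here is a sketch with an RBF kernel on synthetic data. The hyperparameters are placeholders, and reading descriptors such as `poly5` as polynomial kernels of that degree is an assumption.

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 2))          # stand-in for one valence class
X_new = np.array([[0.0, 0.0], [6.0, 6.0]])   # a central point and a distant one

# placeholder hyperparameters, not the ones used for the exported models
oc_svm = OneClassSVM(kernel="rbf", gamma="scale", nu=0.1).fit(X_train)
scores = oc_svm.score_samples(X_new)
print(scores)
```

Higher scores indicate samples that look more like the training class, so the central point should outscore the distant one.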

# Spherical coordinates

In [13]:
features_ = [["LDA1", "LDA2"], feature_trio("ocLDA", "")]

# local outlier factor trios for each neighborhood size
for n_neighbors in NEIGHBORS:
    features_.extend(
        [
            feature_trio("LDA-LOF", n_neighbors),
            feature_trio("LOF", n_neighbors),
        ]
    )
# PCA-combined local outlier factor trios
features_.extend(
    [
        feature_trio("LDA-LOF", "PCA"),
        feature_trio("LOF", "PCA"),
    ]
)
for descriptor in OC_SVM:
    features_.extend(
        [
            feature_trio(f"ocSVM_{descriptor}", ""),
            [f"ocSVM_{descriptor}_LDA1", f"ocSVM_{descriptor}_LDA2"],
            feature_trio(f"LDA-ocSVM_{descriptor}", ""),
        ]
    )

all_features = []
for feature_set in features_:
    all_features.extend(feature_set)
assert sorted(all_features) == sorted(list(data.columns))

_ = gc_collect()

time: 117 ms
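
The completeness check above can be sketched in isolation. The group and column names here are a small subset chosen for illustration.

```python
from itertools import chain

groups = [["LDA1", "LDA2"], ["ocLDA_neg", "ocLDA_neu", "ocLDA_pos"]]
columns = ["ocLDA_pos", "LDA2", "LDA1", "ocLDA_neu", "ocLDA_neg"]

flat = list(chain.from_iterable(groups))
# comparing sorted lists also catches duplicates, which a set comparison would hide
assert sorted(flat) == sorted(columns)

# set differences are handy for reporting what went wrong when the assertion fails
missing = set(columns) - set(flat)
unexpected = set(flat) - set(columns)
print(missing, unexpected)
```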


In [14]:
sphericals = {}
for features in features_:
    combo = "+".join(features)
    df = data.loc[:, features]
    x, y = df[features[0]], df[features[1]]
    theta, phi = f"theta_{combo}", f"phi_{combo}"
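    # azimuthal angle in the plane of the first two features
    sphericals[theta] = arctan2(y, x)
    # feature trios also get a polar angle measured from the third feature's axis
    if len(features) == 3:
        sphericals[phi] = arctan2(sqrt(x ** 2 + y ** 2), df[features[2]])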
    del df
    del x
    del y
    del combo
    _ = gc_collect()

sphericals = pd.DataFrame(sphericals)
sphericals.head()
sphericals.info()

Unnamed: 0,theta_LDA1+LDA2,theta_ocLDA_neg+ocLDA_neu+ocLDA_pos,phi_ocLDA_neg+ocLDA_neu+ocLDA_pos,theta_LDA-LOF_neg_10+LDA-LOF_neu_10+LDA-LOF_pos_10,phi_LDA-LOF_neg_10+LDA-LOF_neu_10+LDA-LOF_pos_10,theta_LOF_neg_10+LOF_neu_10+LOF_pos_10,phi_LOF_neg_10+LOF_neu_10+LOF_pos_10,theta_LDA-LOF_neg_20+LDA-LOF_neu_20+LDA-LOF_pos_20,phi_LDA-LOF_neg_20+LDA-LOF_neu_20+LDA-LOF_pos_20,theta_LOF_neg_20+LOF_neu_20+LOF_pos_20,...,theta_ocSVM_poly5_neg+ocSVM_poly5_neu+ocSVM_poly5_pos,phi_ocSVM_poly5_neg+ocSVM_poly5_neu+ocSVM_poly5_pos,theta_ocSVM_poly5_LDA1+ocSVM_poly5_LDA2,theta_LDA-ocSVM_poly5_neg+LDA-ocSVM_poly5_neu+LDA-ocSVM_poly5_pos,phi_LDA-ocSVM_poly5_neg+LDA-ocSVM_poly5_neu+LDA-ocSVM_poly5_pos,theta_ocSVM_poly6_neg+ocSVM_poly6_neu+ocSVM_poly6_pos,phi_ocSVM_poly6_neg+ocSVM_poly6_neu+ocSVM_poly6_pos,theta_ocSVM_poly6_LDA1+ocSVM_poly6_LDA2,theta_LDA-ocSVM_poly6_neg+LDA-ocSVM_poly6_neu+LDA-ocSVM_poly6_pos,phi_LDA-ocSVM_poly6_neg+LDA-ocSVM_poly6_neu+LDA-ocSVM_poly6_pos
0,2.270171,-1.836805,0.586375,-2.372771,2.203119,-2.349876,2.32276,-2.327462,2.172485,-2.359419,...,0.236955,1.206338,-0.574844,-1.570796,1.570796,0.22344,1.204146,2.513645,0.23317,1.532792
1,2.406045,-1.616909,0.667643,-2.366813,2.218946,-2.369632,2.220962,-2.350703,2.185424,-2.341869,...,0.238272,1.217777,-0.353619,-1.570796,1.570796,0.226308,1.204854,2.50607,0.25182,1.53349
2,2.811944,-1.202803,0.982068,-2.37446,2.178971,-2.338157,2.203633,-2.316605,2.153088,-2.334869,...,0.248262,1.20812,-0.689257,-1.570796,1.570796,0.237793,1.203616,2.424746,0.288371,1.538617
3,1.663026,-3.072482,0.632624,-2.353551,2.233436,-2.447289,2.249757,-2.351299,2.20623,-2.428448,...,0.229963,1.232911,-0.103535,-1.570805,1.570783,0.21788,1.232009,2.904429,0.176373,1.53905
4,2.163299,-2.046872,0.541106,-2.349717,2.205294,-2.322934,2.221186,-2.286041,2.13762,-2.368039,...,0.215748,1.219682,-0.227366,-1.570796,1.570796,0.202858,1.217804,2.774251,0.218571,1.532823


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1957 entries, 0 to 1956
Data columns (total 44 columns):
 #   Column                                                                         Non-Null Count  Dtype  
---  ------                                                                         --------------  -----  
 0   theta_LDA1+LDA2                                                                1957 non-null   float64
 1   theta_ocLDA_neg+ocLDA_neu+ocLDA_pos                                            1957 non-null   float64
 2   phi_ocLDA_neg+ocLDA_neu+ocLDA_pos                                              1957 non-null   float64
 3   theta_LDA-LOF_neg_10+LDA-LOF_neu_10+LDA-LOF_pos_10                             1957 non-null   float64
 4   phi_LDA-LOF_neg_10+LDA-LOF_neu_10+LDA-LOF_pos_10                               1957 non-null   float64
 5   theta_LOF_neg_10+LOF_neu_10+LOF_pos_10                                         1957 non-null   float32
 6   phi_LOF_neg_10+LOF_neu_1
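
To make the angle convention concrete, this sketch computes the same two angles on random points and maps them back to Cartesian coordinates. It uses `hypot` in place of the explicit square root, which is mathematically equivalent, and it computes the radius only for the round trip since the notebook keeps the angles alone.

```python
import numpy as np

rng = np.random.default_rng(0)
x, y, z = rng.normal(size=(3, 1000))

theta = np.arctan2(y, x)             # azimuth in the x-y plane, within [-pi, pi]
phi = np.arctan2(np.hypot(x, y), z)  # polar angle from the +z axis, within [0, pi]
r = np.sqrt(x**2 + y**2 + z**2)      # radius, not kept as a feature in the notebook

# round trip back to Cartesian coordinates to confirm the convention
x_back = r * np.sin(phi) * np.cos(theta)
y_back = r * np.sin(phi) * np.sin(theta)
z_back = r * np.cos(phi)
print(np.allclose([x_back, y_back, z_back], [x, y, z]))
```

Feature pairs such as `LDA1` and `LDA2` only have the azimuth, which is why they contribute a `theta_*` column and no `phi_*` column.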

In [15]:
assert all(sphericals.index == labels.index)
assert all(sphericals.index == data.index)
spot_check(data, labels)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1957 entries, 0 to 1956
Data columns (total 69 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   LDA1                     1957 non-null   float64
 1   LDA2                     1957 non-null   float64
 2   ocLDA_neg                1957 non-null   float64
 3   ocLDA_neu                1957 non-null   float64
 4   ocLDA_pos                1957 non-null   float64
 5   LOF_neg_10               1957 non-null   float32
 6   LOF_neu_10               1957 non-null   float32
 7   LOF_pos_10               1957 non-null   float32
 8   LOF_neg_20               1957 non-null   float32
 9   LOF_neu_20               1957 non-null   float32
 10  LOF_pos_20               1957 non-null   float32
 11  LOF_neg_30               1957 non-null   float32
 12  LOF_neu_30               1957 non-null   float32
 13  LOF_pos_30               1957 non-null   float32
 14  LOF_neg_PCA             

# LDA components of spherical FRILL-based features

In [16]:
features = ["spherical-LDA1", "spherical-LDA2"]
df = pd.DataFrame(
    load_extractor("_-_".join(features)).transform(sphericals), columns=features
)
assert all(df.index == labels.index)
assert all(df.index == data.index)
_ = gc_collect()

time: 128 ms


In [17]:
df_ = pd.concat(
    [
        data,
        df,
    ],
    axis="columns",
)
spot_check(df_, labels)
_ = gc_collect()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1957 entries, 0 to 1956
Data columns (total 71 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   LDA1                     1957 non-null   float64
 1   LDA2                     1957 non-null   float64
 2   ocLDA_neg                1957 non-null   float64
 3   ocLDA_neu                1957 non-null   float64
 4   ocLDA_pos                1957 non-null   float64
 5   LOF_neg_10               1957 non-null   float32
 6   LOF_neu_10               1957 non-null   float32
 7   LOF_pos_10               1957 non-null   float32
 8   LOF_neg_20               1957 non-null   float32
 9   LOF_neu_20               1957 non-null   float32
 10  LOF_pos_20               1957 non-null   float32
 11  LOF_neg_30               1957 non-null   float32
 12  LOF_neu_30               1957 non-null   float32
 13  LOF_pos_30               1957 non-null   float32
 14  LOF_neg_PCA             

In [18]:
data = df_
spot_check(data, labels)
del df
del df_
_ = gc_collect()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1957 entries, 0 to 1956
Data columns (total 71 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   LDA1                     1957 non-null   float64
 1   LDA2                     1957 non-null   float64
 2   ocLDA_neg                1957 non-null   float64
 3   ocLDA_neu                1957 non-null   float64
 4   ocLDA_pos                1957 non-null   float64
 5   LOF_neg_10               1957 non-null   float32
 6   LOF_neu_10               1957 non-null   float32
 7   LOF_pos_10               1957 non-null   float32
 8   LOF_neg_20               1957 non-null   float32
 9   LOF_neu_20               1957 non-null   float32
 10  LOF_pos_20               1957 non-null   float32
 11  LOF_neg_30               1957 non-null   float32
 12  LOF_neu_30               1957 non-null   float32
 13  LOF_pos_30               1957 non-null   float32
 14  LOF_neg_PCA             

# One-class LDA components of spherical FRILL-based features

In [19]:
df = pd.concat(
    [
        data,
        pd.DataFrame(
            {
                feature: np.squeeze(load_extractor(feature).transform(sphericals))
                for feature in feature_trio("spherical-ocLDA", "")
            }
        ),
    ],
    axis="columns",
)
spot_check(df, labels)
_ = gc_collect()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1957 entries, 0 to 1956
Data columns (total 74 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   LDA1                     1957 non-null   float64
 1   LDA2                     1957 non-null   float64
 2   ocLDA_neg                1957 non-null   float64
 3   ocLDA_neu                1957 non-null   float64
 4   ocLDA_pos                1957 non-null   float64
 5   LOF_neg_10               1957 non-null   float32
 6   LOF_neu_10               1957 non-null   float32
 7   LOF_pos_10               1957 non-null   float32
 8   LOF_neg_20               1957 non-null   float32
 9   LOF_neu_20               1957 non-null   float32
 10  LOF_pos_20               1957 non-null   float32
 11  LOF_neg_30               1957 non-null   float32
 12  LOF_neu_30               1957 non-null   float32
 13  LOF_pos_30               1957 non-null   float32
 14  LOF_neg_PCA             

In [20]:
data = pd.concat([df, sphericals], axis="columns")
spot_check(data, labels)
_ = gc_collect()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1957 entries, 0 to 1956
Columns: 118 entries, LDA1 to phi_LDA-ocSVM_poly6_neg+LDA-ocSVM_poly6_neu+LDA-ocSVM_poly6_pos
dtypes: float32(20), float64(98)
memory usage: 1.6 MB
None
       LDA1      LDA2  ocLDA_neg  ocLDA_neu  ocLDA_pos  LOF_neg_10  \
0 -1.001496  1.190528  -0.258172  -0.947540   1.478330   -1.163084   
1 -0.614103  0.555738  -0.027168  -0.588753   0.747540   -1.332296   
2 -1.283616  0.439167   0.486838  -1.262689   0.903614   -1.078094   
3 -0.174906  1.891036  -1.311590  -0.090789   1.793279   -1.503324   
4 -1.165987  1.732043  -0.560913  -1.087816   2.036688   -1.188440   

   LOF_neu_10  LOF_pos_10  LOF_neg_20  LOF_neu_20  ...  \
0   -1.177876   -1.548195   -1.121199   -1.113991  ...   
1   -1.296963   -1.413960   -1.240734   -1.276801  ...   
2   -1.117705   -1.139015   -1.076433   -1.123352  ...   
3   -1.251665   -1.578528   -1.404248   -1.214691  ...   
4   -1.270247   -1.323458   -1.170957   -1.143541  ...   

   

In [21]:
# a single row of the training features is enough to recover the column names and order
training_data = (
    pd.read_feather(
        f"{FEATURE_EXTRACTORS}/unscaled_features_ready_for_selection.feather"
    )
    .set_index("id")
    .head(1)
)
assert len(data.columns) == len(training_data.columns)
assert set(data.columns) == set(training_data.columns)
# the assertions above guarantee identical column sets, so this only reorders
data = data.loc[:, training_data.columns]
assert all(data.columns == training_data.columns)
del sphericals
del training_data
_ = gc_collect()

time: 209 ms
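
The column alignment can be sketched on toy frames. The names `reference` and the three columns are illustrative only.

```python
import pandas as pd

reference = pd.DataFrame({"b": [0.0], "a": [0.0], "c": [0.0]})  # training column order
data = pd.DataFrame({"a": [1.0, 2.0], "b": [3.0, 4.0], "c": [5.0, 6.0]})

# verify the column sets match before reordering
assert len(data.columns) == len(reference.columns)
assert set(data.columns) == set(reference.columns)

data = data.loc[:, reference.columns]  # reorder only; values are untouched
print(list(data.columns))  # ['b', 'a', 'c']
```

Selecting with `.loc` raises on a missing column, whereas `reindex(columns=...)` would silently fill it with NaN, so the explicit assertions plus `.loc` are the stricter choice.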


# Uniform scaling with quantile transformation (omitted)

In [22]:
# scaler = load_extractor("preselect_QTuni")
# data = pd.DataFrame(scaler.transform(data), columns=scaler.feature_names_in_)
# del scaler
_ = gc_collect()
spot_check(data, labels)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1957 entries, 0 to 1956
Columns: 118 entries, theta_LDA1+LDA2 to LDA-ocSVM_poly6_pos
dtypes: float32(20), float64(98)
memory usage: 1.6 MB
None
   theta_LDA1+LDA2  theta_ocLDA_neg+ocLDA_neu+ocLDA_pos  \
0         2.270171                            -1.836805   
1         2.406045                            -1.616909   
2         2.811944                            -1.202803   
3         1.663026                            -3.072482   
4         2.163299                            -2.046872   

   phi_ocLDA_neg+ocLDA_neu+ocLDA_pos  \
0                           0.586375   
1                           0.667643   
2                           0.982068   
3                           0.632624   
4                           0.541106   

   theta_LDA-LOF_neg_10+LDA-LOF_neu_10+LDA-LOF_pos_10  \
0                                          -2.372771    
1                                          -2.366813    
2                                      

In [23]:
data.columns = data.columns.astype(str)
data.to_feather(f"{OUT_FOLDER}/holdout_featurized.feather")
data = pd.read_feather(f"{OUT_FOLDER}/holdout_featurized.feather")
data.info()
data.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1957 entries, 0 to 1956
Columns: 118 entries, theta_LDA1+LDA2 to LDA-ocSVM_poly6_pos
dtypes: float32(20), float64(98)
memory usage: 1.6 MB


Unnamed: 0,theta_LDA1+LDA2,theta_ocLDA_neg+ocLDA_neu+ocLDA_pos,phi_ocLDA_neg+ocLDA_neu+ocLDA_pos,theta_LDA-LOF_neg_10+LDA-LOF_neu_10+LDA-LOF_pos_10,phi_LDA-LOF_neg_10+LDA-LOF_neu_10+LDA-LOF_pos_10,theta_LOF_neg_10+LOF_neu_10+LOF_pos_10,phi_LOF_neg_10+LOF_neu_10+LOF_pos_10,theta_LDA-LOF_neg_20+LDA-LOF_neu_20+LDA-LOF_pos_20,phi_LDA-LOF_neg_20+LDA-LOF_neu_20+LDA-LOF_pos_20,theta_LOF_neg_20+LOF_neu_20+LOF_pos_20,...,LDA-ocSVM_poly5_neu,LDA-ocSVM_poly5_pos,ocSVM_poly6_neg,ocSVM_poly6_neu,ocSVM_poly6_pos,ocSVM_poly6_LDA1,ocSVM_poly6_LDA2,LDA-ocSVM_poly6_neg,LDA-ocSVM_poly6_neu,LDA-ocSVM_poly6_pos
0,2.270171,-1.836805,0.586375,-2.372771,2.203119,-2.349876,2.32276,-2.327462,2.172485,-2.359419,...,-208.929305,3.9e-05,3.070262,0.697668,1.209082,-0.392961,0.28528,1664.602412,395.325571,65.052503
1,2.406045,-1.616909,0.667643,-2.366813,2.218946,-2.369632,2.220962,-2.350703,2.185424,-2.341869,...,-11.617114,1e-06,2.70745,0.623396,1.064647,-0.328475,0.242286,38.293673,9.852249,1.475788
2,2.811944,-1.202803,0.982068,-2.37446,2.178971,-2.338157,2.203633,-2.316605,2.153088,-2.334869,...,-218.639273,1e-05,3.560355,0.862953,1.409045,-0.394055,0.343421,793.769504,235.463271,26.652357
3,1.663026,-3.072482,0.632624,-2.353551,2.233436,-2.447289,2.249757,-2.351299,2.20623,-2.428448,...,-12.043912,0.000163,6.571688,1.454933,2.37176,-1.222963,0.295605,5763.655093,1027.228634,185.923277
4,2.163299,-2.046872,0.541106,-2.349717,2.205294,-2.322934,2.221186,-2.286041,2.13762,-2.368039,...,-708.255804,0.000229,5.016795,1.03189,1.887,-0.926832,0.356653,9696.335102,2153.740325,377.354692


time: 121 ms


# Discussion

The holdout data now has the same 118 unscaled features as the training data, saved to `holdout_featurized.feather`. The next notebook evaluates them.

In [24]:
print(f"Time elapsed since notebook_begin_time: {time() - notebook_begin_time} s")
_ = gc_collect()

Time elapsed since notebook_begin_time: 486.60900688171387 s
time: 104 ms


[^top](#Contents)