In [1]:
!pip install datasets



In [1]:
import polars as pl
import datasets

import src.extractor as ex

# Preliminaries
This package assumes, and is optimized for, 'tabular' text datasets, i.e. datasets in which a text column holds several items/instances.
The default text column name is `text`. To use a different name, edit the config as described below.

# Extraction

In [2]:
dataset = datasets.load_dataset('Multilingual-Perspectivist-NLU/EPIC')
df = pl.from_pandas(dataset["train"].to_pandas())[:100]

Initialize the extractor to extract all currently supported base-level features. The specification can be found in `src/config.py`.

In [3]:
extractor = ex.Extractor(data=df)

Extract all features of the default config:

In [4]:
df1 = extractor.extract_features()

Extracting raw_sequence_length...
Extracting n_tokens...
Extracting n_sentences...
Extracting n_tokens_per_sentence...
Extracting n_characters...
Extracting avg_word_length...
Extracting n_types...
Extracting n_long_words...
Extracting n_lemmas...
Extracting sentiment_score...
Extracting n_negative_sentiment...
Extracting n_positive_sentiment...
Extracting avg_valence...
Extracting n_low_valence...
Extracting n_high_valence...
Extracting avg_arousal...
Extracting n_low_arousal...
Extracting n_high_arousal...
Extracting avg_dominance...
Extracting n_low_dominance...
Extracting n_high_dominance...
Extracting avg_emotion_intensity...
Extracting n_low_intensity...
Extracting n_high_intensity...
Extracting compressibility...
Extracting entropy...
Extracting lemma_token_ratio...
Extracting ttr...
Extracting rttr...
Extracting cttr...
Extracting herdan_c...
Extracting summer_index...
Extracting dougast_u...
Extracting maas_index...
Extracting n_hapax_legomena...
Extracting n_hapax_legomena_to

  return _methods._mean(a, axis=axis, dtype=dtype,
  ret = ret.dtype.type(ret / rcount)


Extracting n_low_concreteness...
Extracting n_high_concreteness...
Extracting avg_aoa...
Extracting n_low_aoa...
Extracting n_high_aoa...
Extracting avg_prevalence...
Extracting n_low_prevalence...
Extracting n_high_prevalence...
Extracting n_syllables...
Extracting n_monosyllables...
Extracting n_polysyllables...
Extracting flesch_reading_ease...
Extracting flesch_kincaid_grade...
Extracting gunning_fog...
Extracting ari...
Extracting smog...
Extracting cli...
Extracting lix...
Extracting rix...
Extracting n_lemmas...
Extracting n_hedges...
Extracting avg_num_synsets...
Extracting avg_num_synsets_per_pos...
Extracting n_low_synsets...
Extracting n_high_synsets...
Extracting n_entities...
Extracting n_entities_token_ratio...
Extracting n_entities_sentence_ratio...
Extracting n_per_entity_type...


To add a ratio, append the ratio name to the name of the base-level feature, e.g. `n_low_valence` becomes `n_low_valence_token_ratio` (the available ratios are listed in `src/ratios.py`). Edit the default config as follows:

In [5]:
# note: this is a reference to the default config, not a copy
config = ex.CONFIG_ALL
# option 1: base feature without any additional configuration; use defaults
config["features"]["emotion"]["n_low_valence_token_ratio"] = {}
# option 2: base feature with additional configuration (overwrites option 1)
config["features"]["emotion"]["n_low_valence_token_ratio"] = {"threshold": 0.5,
                                                              "lexicon": "vad_nrc"}

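The naming rule can be sketched as a small helper. `add_ratio` is hypothetical and not part of the package, and `default_config` is a stand-in dict that only mirrors the nesting used in the cell above. Joining the base feature and the ratio name with an underscore is an assumption read off the `n_low_valence_token_ratio` example. The helper works on a deep copy so that the shared default is left untouched.

```python
import copy


def add_ratio(config, group, base_feature, ratio, **params):
    """Return a copy of `config` with `<base_feature>_<ratio>` added to `group`."""
    new_config = copy.deepcopy(config)  # do not mutate the caller's config
    name = f"{base_feature}_{ratio}"
    new_config["features"].setdefault(group, {})[name] = dict(params)
    return new_config


# Stand-in for the real default config (assumed structure).
default_config = {
    "text_column": "text",
    "features": {"emotion": {"n_low_valence": {}}},
}

edited = add_ratio(
    default_config, "emotion", "n_low_valence", "token_ratio",
    threshold=0.5, lexicon="vad_nrc",
)
print(sorted(edited["features"]["emotion"]))
# ['n_low_valence', 'n_low_valence_token_ratio']
print("n_low_valence_token_ratio" in default_config["features"]["emotion"])
# False
```

Copying first keeps a later run with the default config from silently picking up the extra feature.
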
If the text column has a different name, either rename the column in the input data frame or specify the text column name in the config:

In [9]:
config["text_column"]

'text'

Then run the extraction again with the edited config:

In [6]:
dataset = datasets.load_dataset('Multilingual-Perspectivist-NLU/EPIC')
df = pl.from_pandas(dataset["train"].to_pandas())[:100]

extractor = ex.Extractor(data=df, config=config)
df2 = extractor.extract_features()

Extracting raw_sequence_length...
Extracting n_tokens...
Extracting n_sentences...
Extracting n_tokens_per_sentence...
Extracting n_characters...
Extracting avg_word_length...
Extracting n_types...
Extracting n_long_words...
Extracting n_lemmas...
Extracting sentiment_score...
Extracting n_negative_sentiment...
Extracting n_positive_sentiment...
Extracting avg_valence...
Extracting n_low_valence...
Extracting n_high_valence...
Extracting avg_arousal...
Extracting n_low_arousal...
Extracting n_high_arousal...
Extracting avg_dominance...
Extracting n_low_dominance...
Extracting n_high_dominance...
Extracting avg_emotion_intensity...
Extracting n_low_intensity...
Extracting n_high_intensity...
Extracting n_low_valence_token_ratio...
Extracting compressibility...
Extracting entropy...
Extracting lemma_token_ratio...
Extracting ttr...
Extracting rttr...
Extracting cttr...
Extracting herdan_c...
Extracting summer_index...
Extracting dougast_u...
Extracting maas_index...
Extracting n_hapax_le

  return _methods._mean(a, axis=axis, dtype=dtype,
  ret = ret.dtype.type(ret / rcount)


Extracting n_low_concreteness...
Extracting n_high_concreteness...
Extracting avg_aoa...
Extracting n_low_aoa...
Extracting n_high_aoa...
Extracting avg_prevalence...
Extracting n_low_prevalence...
Extracting n_high_prevalence...
Extracting n_syllables...
Extracting n_monosyllables...
Extracting n_polysyllables...
Extracting flesch_reading_ease...
Extracting flesch_kincaid_grade...
Extracting gunning_fog...
Extracting ari...
Extracting smog...
Extracting cli...
Extracting lix...
Extracting rix...
Extracting n_lemmas...
Extracting n_hedges...
Extracting avg_num_synsets...
Extracting avg_num_synsets_per_pos...
Extracting n_low_synsets...
Extracting n_high_synsets...
Extracting n_entities...
Extracting n_entities_token_ratio...
Extracting n_entities_sentence_ratio...
Extracting n_per_entity_type...


In [10]:
len(df1.columns), len(df2.columns)

(165, 166)
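
The two counts differ by one. To see which column was added, the column lists can be compared directly. The helper `new_columns` and the short stand-in lists below are illustrative; with the frames from this notebook the call would be `new_columns(df1.columns, df2.columns)`.

```python
def new_columns(before, after):
    """Return the names in `after` that are missing from `before`, in order."""
    seen = set(before)
    return [name for name in after if name not in seen]


# Stand-in column lists (the real frames have 165 and 166 columns).
cols_default = ["text", "n_tokens", "n_low_valence"]
cols_edited = ["text", "n_tokens", "n_low_valence", "n_low_valence_token_ratio"]

print(new_columns(cols_default, cols_edited))
# ['n_low_valence_token_ratio']
```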

In [None]:
df1.select("n_low_valence_token_ratio").head() # fails as expected: df1 was extracted before the ratio was added to the config

ColumnNotFoundError: n_low_valence_token_ratio

In [8]:
df2.select("n_low_valence_token_ratio").head()

n_low_valence_token_ratio
f64
0.294118
0.294118
0.294118
0.294118
0.294118
