# The Codecs and PersonCodecs Classes

`Codecs` is a container class for managing ParlaCLARIN categorical (key/value) data, i.e. mappings of integer IDs to text values. It exposes the data itself (as a number of Pandas dataframes) and a set of utility functions for working with it, e.g. for encoding and decoding data.

| Table           | Id                 | Description              |
| --------------- | ------------------ | ------------------------ |
| chamber         | chamber_id         | List of chambers         |
| gender          | gender_id          | List of genders          |
| government      | government_id      | List of governments      |
| party           | party_id           | List of parties          |
| office_type     | office_type_id     | List of office types     |
| sub_office_type | sub_office_type_id | List of sub office types |

`PersonCodecs` is a derived class that also includes individual data from `persons_of_interest`. This is a processed version of the information found in the `person.csv` metadata: it includes only persons who have speeches in the corpus, and it adds some additional columns.
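
To make the idea of categorical key/value data concrete, here is a minimal, library-independent sketch of what decoding and encoding an ID column amounts to. The `PARTY_ID2ABBREV` table and the `decode`/`encode` helpers are illustrative assumptions, not the actual `Codecs` API, and the three party codes are only a small sample.

```python
from typing import Any

# Hypothetical code table: integer ID -> text label (a small illustrative sample).
PARTY_ID2ABBREV: dict[int, str] = {5: "L", 7: "M", 9: "S"}
PARTY_ABBREV2ID: dict[str, int] = {abbrev: pid for pid, abbrev in PARTY_ID2ABBREV.items()}


def decode(rows: list[dict[str, Any]], drop: bool = True) -> list[dict[str, Any]]:
    """Add a `party_abbrev` label for each `party_id`; drop the ID if requested."""
    decoded = []
    for row in rows:
        out = dict(row)
        # Unknown IDs are labelled "?" in this sketch.
        out["party_abbrev"] = PARTY_ID2ABBREV.get(row["party_id"], "?")
        if drop:
            del out["party_id"]
        decoded.append(out)
    return decoded


def encode(rows: list[dict[str, Any]], drop: bool = True) -> list[dict[str, Any]]:
    """Inverse of `decode`: add `party_id` for each `party_abbrev`."""
    encoded = []
    for row in rows:
        out = dict(row)
        out["party_id"] = PARTY_ABBREV2ID[row["party_abbrev"]]
        if drop:
            del out["party_abbrev"]
        encoded.append(out)
    return encoded


rows = [{"year": 1960, "party_id": 5}, {"year": 1970, "party_id": 7}]
print(decode(rows))  # [{'year': 1960, 'party_abbrev': 'L'}, {'year': 1970, 'party_abbrev': 'M'}]
```

The `drop` flag mirrors the choice between replacing the ID column and keeping it next to the decoded label.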


# The TrendsData class

The `riksprot.TrendsData` class computes word trends using filters and pivot keys found in the `PersonCodecs` container. It is based on the `penelope.TrendsData` class with some minor additions.

The class has the following data members:

| Member             | Type              | Description                         |
| ------------------ | ----------------- | ----------------------------------- |
| corpus             | VectorizedCorpus  | Original source corpus              |
| compute_opts       | TrendsComputeOpts | Current compute options (see below) |
| transformed_corpus | VectorizedCorpus  | Transformed (computed) corpus       |
| n_top              | int               | Top count constraint                |
| person_codecs      | PersonCodecs      | Codecs helper class                 |
| tabular_compiler   | TabularCompiler   | Result compiler                     |
| _gof_data_         | _GofData_         | _Goodness of fit data (ignore)_     |

And the following methods:

| Method            | Signature                             | Description                                                                                                                                              |
| ----------------- | ------------------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------- |
| transform         | _opts: TrendsComputeOpts -> self_     | Transforms `corpus` to `transformed_corpus` using `opts`                                                                                                 |
| extract           | _(indices, filters) -> pd.DataFrame_  | Extract pd.DataFrame using current compiler.                                                                                                             |
| reset             | _\_ -> self_                          | Reset corpus and compute opts to default                                                                                                                 |
| find_word_indices | _opts -> sequence[int]_               | Find indices for matching words (accepts wildcards and regex). Delegates to `transformed_corpus.find_matching_words_indices(opts.words, opts.top_count)`. |
| find_words        | _opts -> sequence[str]_               | Find matching words (accepts wildcards and regex). Delegates to `transformed_corpus.find_matching_words(opts.words, opts.top_count)`.                    |
| get_top_terms     | _(int,kind,category) -> pd.DataFrame_ |                                                                                                                                                          |
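
As a rough illustration of what "accepts wildcards and regex" can mean for word lookup, the following sketch matches patterns against a small vocabulary. Everything in it is an assumption made for illustration: the function names, the toy vocabulary, and in particular the convention that a pattern wrapped in `|...|` is treated as a regular expression. It is not the `penelope` implementation.

```python
import fnmatch
import re


def find_matching_words(vocabulary: dict[str, int], patterns: list[str], n_max: int) -> list[str]:
    """Return at most `n_max` vocabulary words matching any of `patterns`."""
    matches: list[str] = []
    for pattern in patterns:
        if len(pattern) > 2 and pattern.startswith("|") and pattern.endswith("|"):
            # Hypothetical convention: |...| marks a regular expression.
            regex = re.compile(pattern[1:-1])
            found = [word for word in vocabulary if regex.fullmatch(word)]
        elif "*" in pattern or "?" in pattern:
            # Shell-style wildcards.
            found = fnmatch.filter(vocabulary, pattern)
        else:
            # Plain word: keep it only if it exists in the vocabulary.
            found = [pattern] if pattern in vocabulary else []
        for word in found:
            if word not in matches:
                matches.append(word)
    return matches[:n_max]


def find_matching_word_indices(vocabulary: dict[str, int], patterns: list[str], n_max: int) -> list[int]:
    """Same as `find_matching_words`, but return the words' column indices."""
    return [vocabulary[word] for word in find_matching_words(vocabulary, patterns, n_max)]


vocabulary = {"sverige": 0, "svensk": 1, "jag": 2, "finland": 3}
print(find_matching_words(vocabulary, ["sv*", "jag", "du"], n_max=10))  # ['sverige', 'svensk', 'jag']
```

Words that are not in the vocabulary (here `du`) simply produce no match, so the result can contain fewer words than were requested.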

The available `TrendsComputeOpts` attributes are:

| Attribute           | Type                       | Description                                                    |
| ------------------- | -------------------------- | -------------------------------------------------------------- |
| normalize           | bool                       | Normalize data flag.                                           |
| keyness             | pk.KeynessMetric           | Keyness metric to use `TF`, `TF (norm)` or `TF-IDF`            |
| temporal_key        | str                        | Temporal pivot key: `year`, `lustrum` or `decade`              |
| pivot_keys_id_names | list[str]                  | List of pivot key ID names                                     |
| filter_opts         | `PropertyValueMaskingOpts` | Key/value filter of resulting data (extract)                   |
| unstack_tabular     | bool                       | Unstack result i.e. return columns instead of categorical rows |
| fill_gaps           | bool                       | Fill empty/missing temporal category values                    |
| smooth              | bool                       | Return smoothed, interpolated data (for line plot)             |
| top_count           | int                        | Maximum number of words to return                              |
| words               | list[str]                  | List of word/patterns of interest                              |
| descending          | bool                       | Result sort order                                              |
| keyness_source      | pk.KeynessMetricSource     | Ignore (only valid for co-occurrence trends)                   |
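
The effect of `temporal_key` and `fill_gaps` can be sketched without the library. The example below groups yearly counts into years, lustra (five-year periods) or decades, and optionally fills missing periods with zeros. The function names and the toy counts are hypothetical, and labelling each period by its first year is an assumption about the convention.

```python
from collections import Counter


def temporal_category(year: int, temporal_key: str) -> int:
    """Map a year to the first year of its period (assumed labelling convention)."""
    if temporal_key == "year":
        return year
    if temporal_key == "lustrum":
        return year - year % 5
    if temporal_key == "decade":
        return year - year % 10
    raise ValueError(f"unknown temporal key: {temporal_key}")


def group_counts(yearly_counts: dict[int, int], temporal_key: str) -> dict[int, int]:
    """Sum yearly counts per temporal category."""
    grouped: Counter = Counter()
    for year, count in yearly_counts.items():
        grouped[temporal_category(year, temporal_key)] += count
    return dict(sorted(grouped.items()))


def fill_missing(grouped: dict[int, int], step: int) -> dict[int, int]:
    """Insert zero counts for periods between the first and last observed one."""
    first, last = min(grouped), max(grouped)
    return {period: grouped.get(period, 0) for period in range(first, last + 1, step)}


yearly = {1960: 1, 1961: 1, 1966: 2, 1970: 3}  # illustrative counts only
print(group_counts(yearly, "lustrum"))  # {1960: 2, 1965: 2, 1970: 3}
print(group_counts(yearly, "decade"))   # {1960: 4, 1970: 3}
```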

`TrendsComputeOpts` also has the following methods:

| Method             | Signature                          | Description                                           |
| ------------------ | ---------------------------------- | ----------------------------------------------------- |
| invalidates_corpus | _other: TrendsComputeOpts -> bool_ | Checks if `other` opts invalidates transformed corpus |
| clone              | _\_ -> TrendsComputeOpts_          | Creates a clone                                       |
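
The purpose of `invalidates_corpus` is to decide whether the transformed corpus must be recomputed after the options change. The stand-in class below sketches that idea with a plain dataclass. The class name, the reduced set of fields, and above all the choice of which fields count as corpus-invalidating (`CORPUS_FIELDS`) are assumptions for illustration, not the actual `TrendsComputeOpts` logic.

```python
import copy
from dataclasses import dataclass, field
from typing import List, Optional

# Assumption: only these options change the transformed corpus itself.
CORPUS_FIELDS = ("normalize", "keyness", "temporal_key", "pivot_keys_id_names", "fill_gaps")


@dataclass
class ComputeOptsSketch:
    """Hypothetical stand-in for a compute-options class."""

    normalize: bool = False
    keyness: str = "TF"
    temporal_key: str = "year"
    pivot_keys_id_names: List[str] = field(default_factory=list)
    fill_gaps: bool = False
    top_count: int = 100
    words: Optional[List[str]] = None

    def invalidates_corpus(self, other: "ComputeOptsSketch") -> bool:
        """True if `other` differs in any option that affects the transformed corpus."""
        return any(getattr(self, name) != getattr(other, name) for name in CORPUS_FIELDS)

    def clone(self) -> "ComputeOptsSketch":
        """Return an independent deep copy."""
        return copy.deepcopy(self)


current = ComputeOptsSketch(pivot_keys_id_names=["party_id"])
changed = current.clone()
changed.words = ["sverige", "jag"]
print(current.invalidates_corpus(changed))  # False: only the picked words differ
changed.temporal_key = "decade"
print(current.invalidates_corpus(changed))  # True: the grouping has changed
```

Under this assumption, changing only `words` leaves the transformed corpus valid, so new words can be extracted without transforming the corpus again.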


### Example

Load a DTM corpus.


```python
import __paths__

from parlaclarin import codecs as md
from parlaclarin.trends_data import SweDebTrendsData, SweDebComputeOpts
from penelope.common.keyness import KeynessMetric
from penelope.corpus import VectorizedCorpus
from penelope.utility import PropertyValueMaskingOpts

import pandas as pd

# Widen the console output so that printed data frames are not wrapped
pd.set_option("display.max_columns", 500)
pd.set_option("display.width", 1500)

# Folder and tag of the stored document-term matrix (DTM) corpus
dtm_folder: str = "../data/dataset-01/v0.6.0/dtm/lemma"
dtm_tag: str = "lemma"

corpus: VectorizedCorpus = VectorizedCorpus.load(folder=dtm_folder, tag=dtm_tag)
```

```text
2023-05-24 16:05:00.230 | INFO     | penelope.vendor.gensim_api._gensim._models:<module>:52 - gensim not included in current installment
```


Load corpus metadata (speakers and codes)


```python
codecs: md.PersonCodecs = md.PersonCodecs().load(source="../data/dataset-01/v0.6.0/riksprot_metadata.db")

# Speakers' metadata with encoded ids
persons: pd.DataFrame = codecs.persons_of_interest.head().copy()

# Decoded speakers' metadata
print(codecs.decode(persons))

# Decoded speakers' metadata, keeping the ids
print(codecs.decode(persons, drop=False))

# Specification of metadata properties (and actual data). Can be used to create a GUI
# for selecting metadata properties and values for display, grouping and filtering.
# print(codecs.property_values_specs)
```

```text
                         name  year_of_birth  year_of_death  has_multiple_parties   gender party_abbrev person_id
person_id
Q53707          Tage Erlander           1901           1985                     0      man            S    Q53707
Q5887636    Rune B. Johansson           1915           1982                     0      man            S  Q5887636
unknown                                    0              0                     0  unknown            ?   unknown
Q1606431         Henry Allard           1911           1996                     0      man            S  Q1606431
Q707581    Ingemund Bengtsson           1919           2000                     0      man            S   Q707581
                         name  year_of_birth  year_of_death  has_multiple_parties   gender party_abbrev person_id
person_id
```

Compute word trends.


```python
# Compute trends:
#   - group by "year" (temporal key) and party
#   - filter the result on gender_id == 2 (filter_opts)
#   - return absolute frequencies (keyness=KeynessMetric.TF)
#   - do not normalize
#   - do not fill temporal gaps (items with zero frequency)
#   - do not smooth (interpolate values, adds additional categories)
#   - return at most 100 words
#   - do not unstack tabular data (keep party as column)

trends_data: SweDebTrendsData = SweDebTrendsData(corpus=corpus, person_codecs=codecs, n_top=100000)

opts: SweDebComputeOpts = SweDebComputeOpts(
    fill_gaps=False,
    keyness=KeynessMetric.TF,
    normalize=False,
    pivot_keys_id_names=["party_id"],
    filter_opts=PropertyValueMaskingOpts(gender_id=2),
    smooth=False,
    temporal_key="year",
    top_count=100,
    unstack_tabular=False,
    words=None,
)

trends_data.transform(opts)

# Pick the words of interest after the corpus has been transformed
opts.words = ["sverige", "jag"]

# FIXME: Extend API so that extract can take a list of words
trends: pd.DataFrame = trends_data.extract(indices=trends_data.find_word_indices(opts))

print(trends.head())

# Pick other words and extract again
opts.words = ["finland", "du"]
print(trends_data.extract(indices=trends_data.find_word_indices(opts)).head())

# trends_data.transformed_corpus.find_matching_words_indices(
#     word_or_regexp=["sverige", "jag"], n_max_count=100, descending=False
# )
```

```text
   year  party_id  sverige  jag
0  1960         5        1    1
1  1960         9        0    5
2  1961         9        1    3
3  1970         5        3   32
4  1970         7        0   50
   year  party_id  finland
0  1960         5        0
1  1960         9        0
2  1961         9        0
3  1970         5        1
4  1970         7        0
```


```python
# Extract trends for the picked words (the output below is for "sverige" and "jag")
opts.words = ["sverige", "jag"]
picked_indices = trends_data.find_word_indices(opts)

trends: pd.DataFrame = trends_data.extract(indices=picked_indices)

print(trends.head())

# Decode any encoded ids in the trends data frame
print(trends_data.person_codecs.decode(trends).head())

# Find indices of picked words from corpus
trends_data.transformed_corpus.find_matching_words_indices(
    word_or_regexp=["sverige", "jag"], n_max_count=100, descending=False
)
```

```text
   year  party_id  jag  sverige
0  1960         5    1        1
1  1960         9    5        0
2  1961         9    3        1
3  1970         5   32        3
4  1970         7   50        0
   year  jag  sverige party_abbrev
0  1960    1        1            L
1  1960    5        0            S
2  1961    3        1            S
3  1970   32        3            L
4  1970   50        0            M

[6, 167]
```
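
The decoded result above is in stacked form: one row per year and party. With `unstack_tabular=True` the categories are returned as columns instead. The library's actual unstacked layout is not shown in this document, so the following is only a plain-Python sketch of the reshaping, using the `jag` counts from the decoded output. The `unstack` helper and the zero fill value are assumptions.

```python
# Stacked rows taken from the decoded output: (year, party_abbrev, count of "jag")
STACKED = [
    (1960, "L", 1),
    (1960, "S", 5),
    (1961, "S", 3),
    (1970, "L", 32),
    (1970, "M", 50),
]


def unstack(rows: list[tuple[int, str, int]], fill: int = 0) -> dict[int, dict[str, int]]:
    """Pivot (year, category, value) rows into one row per year, one column per category."""
    categories = sorted({category for _, category, _ in rows})
    table: dict[int, dict[str, int]] = {}
    for year, category, value in rows:
        # Create the year's row on first use, with every category set to `fill`.
        row = table.setdefault(year, {c: fill for c in categories})
        row[category] = value
    return table


for year, columns in unstack(STACKED).items():
    print(year, columns)
# 1960 {'L': 1, 'M': 0, 'S': 5}
# 1961 {'L': 0, 'M': 0, 'S': 3}
# 1970 {'L': 32, 'M': 50, 'S': 0}
```

The unstacked form has one series per party, which is the shape most plotting tools expect for a line chart.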