# Classifying Reports: Character (CHAR) Recognition

In the Hall and Van de Castle (HVDC) framework, a key aspect of each dream report is the set of characters that appear in it. In this notebook, we will see how to use `dreamy`'s CHAR/NER (named entity recognition) models to annotate reports with the relevant characters appearing in each of them. At the moment, `dreamy` supports NER in the generation format. That is, characters are spelt out (e.g., `individual female adult wise`), always following the HVDC notation.

Please note that the CHAR data used in training is not linked to any specific feature. In accordance with the HVDC framework, the predictions should not include the dreamer in the list of characters.

In [None]:

! pip install dreamy

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting dreamy
  Downloading dreamy-0.0.9.5-py3-none-any.whl (13 kB)
Collecting datasets
  Downloading datasets-2.10.1-py3-none-any.whl (469 kB)
Collecting transformers[tokenizers,torch]
  Downloading transformers-4.26.1-py3-none-any.whl (6.3 MB)
Collecting multiprocess
  Downloading multiprocess-0.70.14-py38-none-any.whl (132 kB)
Collecting responses<0.19
  Downloading responses-0.18.0-py3-none-any.whl (38 kB)
Collecting huggingface-hub<1.0.0,>=0.2.0
  Downloading huggingface_hub-0.12.1-py3-none-any.whl (190 kB)

In [None]:
import dreamy

Let's start by getting some reports. Once again, we will use the set of English-only dream-reports scraped from the DreamBank database, freely available on DReAMy's Hugging Face page!

In [None]:
language   = "english" # choose between english/multi
dream_bank = dreamy.get_HF_DreamBank(as_dataframe=True, language=language)

Downloading readme:   0%|          | 0.00/2.45k [00:00<?, ?B/s]

Downloading and preparing dataset None/None to /root/.cache/huggingface/datasets/DReAMy-lib___parquet/DReAMy-lib--DreamBank-dreams-en-98a9abc92d226c3a/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec...


Downloading data files:   0%|          | 0/1 [00:00<?, ?it/s]

Downloading data:   0%|          | 0.00/12.0M [00:00<?, ?B/s]

Extracting data files:   0%|          | 0/1 [00:00<?, ?it/s]

Generating train split:   0%|          | 0/22415 [00:00<?, ? examples/s]

Dataset parquet downloaded and prepared to /root/.cache/huggingface/datasets/DReAMy-lib___parquet/DReAMy-lib--DreamBank-dreams-en-98a9abc92d226c3a/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec. Subsequent calls will reuse this data.


  0%|          | 0/1 [00:00<?, ?it/s]

As you can see, the file, downloaded directly as a pandas DataFrame, has five columns:
- series: the different collections of DreamBank
- description: a brief description of each series
- dreams: the dream-reports
- gender: the gender of the participant
- year: the time window of the series


In [None]:
dream_bank.sample(5)

|       | series    | description                   | dreams                                             | gender | year      |
|-------|-----------|-------------------------------|----------------------------------------------------|--------|-----------|
| 13046 | vickie    | Vickie: a 10-year-old girl    | Nancy, Mom, and I were looking in a store for ...  | female | 1995      |
| 11510 | peru-f    | Peruvian women                | Last night I first dreamed about a neighbor wh...  | female | 1970      |
| 14934 | peru-m    | Peruvian men                  | I was in the house of a friend. I don't rememb...  | male   | 1970      |
| 9738  | pegasus   | Pegasus: a factory worker     | Charley C. and I were watching a cow's rear en...  | male   | 1949-1964 |
| 17217 | elizabeth | Elizabeth: a woman in her 40s | The mailer for my 1 mil campaign went out and ...  | female | 1999-     |


Let's now sample a small set of dreams. If you have a more powerful machine (or you are working on Colab), you can increase the number of reports. Note that the whole dataset contains roughly 22k reports (22,415 in the English split loaded above).

In [None]:
n_samples = 10
dream_sample = dream_bank.sample(n_samples).reset_index(drop=True)

dream_as_list = dream_sample["dreams"].tolist()

With the `get_ner_model_specifications` function, we can check which NER models are available and how to call them.

As you can see, at the moment dreamy only has one model for NER 😅. We'll have to work with that for now 😃. Let's start by setting up the needed spec.

In [None]:
dreamy.get_ner_model_specifications()

{'Base En-only, generats full cahractes descriptions (T5-base)': ['full',
  'base-en']}
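
Rather than copying the two strings by hand, the specification pair can also be read from the returned dictionary. The sketch below assumes the return value has exactly the shape printed above (a description mapped to a `[classification_type, model_type]` list). The dictionary is reproduced literally so that the snippet runs without downloading anything.

```python
# Literal copy of the dictionary printed above (key text kept exactly as printed).
specs = {
    "Base En-only, generats full cahractes descriptions (T5-base)": ["full", "base-en"],
}

# Assumption: each value is a [classification_type, model_type] pair.
for description, (classification_type, model_type) in specs.items():
    print(f"{description!r}")
    print(f"  classification_type = {classification_type!r}")
    print(f"  model_type          = {model_type!r}")
```

With a single entry in the dictionary, the loop leaves `classification_type` and `model_type` set to the only available pair.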

In [None]:
classification_type = "full"
model_type          = "base-en"
device              = "cpu" # set "cuda" to use GPU
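
If you are unsure whether a GPU is available, the device string can be chosen at run time. This is a small sketch and not part of `dreamy`: `pick_device` is a hypothetical helper that relies on PyTorch's `torch.cuda.is_available()` and falls back to `"cpu"` when PyTorch cannot be imported.

```python
def pick_device():
    """Return "cuda" when PyTorch reports a usable GPU, otherwise "cpu"."""
    try:
        import torch  # installed alongside transformers[torch]
    except ImportError:
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


device = pick_device()
print(device)
```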

In [None]:
predictions = dreamy.get_CHAR(
    dream_as_list, 
    classification_type, 
    model_type,
    device=device,
    max_new_token=60,
)

Downloading (…)lve/main/config.json:   0%|          | 0.00/1.48k [00:00<?, ?B/s]

Downloading (…)"pytorch_model.bin";:   0%|          | 0.00/892M [00:00<?, ?B/s]

Downloading (…)okenizer_config.json:   0%|          | 0.00/2.35k [00:00<?, ?B/s]

Downloading (…)"spiece.model";:   0%|          | 0.00/792k [00:00<?, ?B/s]

Downloading (…)/main/tokenizer.json:   0%|          | 0.00/2.42M [00:00<?, ?B/s]

Downloading (…)cial_tokens_map.json:   0%|          | 0.00/2.20k [00:00<?, ?B/s]

Finally, let's have a look at the predicted characters...

In [None]:
predictions

['individual female mother adult; individual male brother baby; group joint reative adult;',
 'individual male known adult;',
 'individual male known adult; individual male known adult; unknown character;',
 'individual male known adult; individual female known adult; individual male known adult; group joint reative adult; individual male uncertian adult; individual male uncertian adult;',
 'individual female known adult; group male known adult; individual male known adult; individual male known adult; individual male known adult;',
 '',
 'individual imaginary male prominent adult;',
 'individual male stranger adult; individual female stranger adult; individual male stranger adult; individual female mother adult; individual female stranger adult; individual male father adult;',
 '',
 'individual male father adult; individual female reative adult; individual female reative adult; individual female reative adult; individual female reative adult; individual female reative adult;']

As you can see, some reports yield empty predictions (i.e., no character was identified). This can actually be desirable: since the dreamer is never listed as a character, a report that features only the dreamer should produce an empty prediction.
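
Each prediction is a single string in which characters are separated by `;`, so it may help to split it into a list before any further analysis. The following is a minimal sketch, assuming the separator is always `;` as in the output above. `split_characters` is a hypothetical helper (not a `dreamy` function), and the example strings are copied from four of the predictions shown earlier.

```python
from collections import Counter

# Four prediction strings copied from the output above (spelling kept as generated).
predictions = [
    "individual male known adult;",
    "individual male known adult; individual male known adult; unknown character;",
    "",
    "individual imaginary male prominent adult;",
]


def split_characters(prediction):
    """Split one generated string into a list of character descriptions."""
    return [chunk.strip() for chunk in prediction.split(";") if chunk.strip()]


# One list of characters per report; empty predictions become empty lists.
characters_per_report = [split_characters(p) for p in predictions]

# Number of reports in which no character was identified.
n_empty = sum(1 for chars in characters_per_report if not chars)

# How often each character description occurs across all reports.
character_counts = Counter(c for chars in characters_per_report for c in chars)

print(characters_per_report)
print("reports without characters:", n_empty)
print(character_counts.most_common(3))
```

Keeping empty predictions as empty lists (rather than dropping them) preserves the alignment between `predictions` and the sampled reports.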