# Classifying Reports: NER for character (CHAR) recognition.

An important aspect of each dream report is the characters that appear in it. In this notebook, we will see how to use `dreamy` to extract the characters appearing in each report. As always, characters are defined with respect to the Hall & Van de Castle system. In this case, CHAR are spelled out in full and do not (and should not) include the dreamer themselves. Please note that the CHAR data used in training is not linked to any specific feature. In other words, predictions should not be interpreted as anything other than the presence of a character in the report.

In [1]:
! pip install dreamy

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting dreamy
  Downloading dreamy-0.0.5-py3-none-any.whl (12 kB)
Collecting datasets
  Downloading datasets-2.9.0-py3-none-any.whl (462 kB)
Collecting transformers[tokenizers,torch]
  Downloading transformers-4.26.1-py3-none-any.whl (6.3 MB)
Collecting multiprocess
  Downloading multiprocess-0.70.14-py38-none-any.whl (132 kB)
Collecting responses<0.19
  Downloading responses-0.18.0-py3-none-any.whl (38 kB)
Collecting huggingface-hub<1.0.0,>=0.2.0
  Downloading huggingface_hub-0.12.0-py3-none-any.whl (190 kB)

In [2]:
import dreamy

Let's start by getting some dreams. You can start by downloading a collection of dream reports scraped from the DreamBank database, freely available from DReAMy's Hugging Face page!

In [3]:
dream_bank = dreamy.get_HF_DreamBank(as_dataframe=True)




Downloading and preparing dataset None/None to /root/.cache/huggingface/datasets/DReAMy-Library___parquet/DReAMy-Library--DreamBank-dreams-34a268b12519d660/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec...


Generating train split:   0%|          | 0/29345 [00:00<?, ? examples/s]

Dataset parquet downloaded and prepared to /root/.cache/huggingface/datasets/DReAMy-Library___parquet/DReAMy-Library--DreamBank-dreams-34a268b12519d660/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec. Subsequent calls will reuse this data.



As you can see, the file, downloaded directly as a pandas DataFrame, has three columns:
- dreams, the dream reports
- series, the DreamBank collection each report belongs to
- description, a brief description of each series

In [4]:
dream_bank.sample(5)

|  | dreams | series | description |
|---|---|---|---|
| 10636 | Some thread unraveled from my clothes and beca... | norman | Norman: a child molester |
| 19222 | It is winter: fresh snow; we have a dog: white... | emma | Emma: 48 years of dreams |
| 24360 | (03/08/98)["Center of Attention"] I have a bl... | b2 | Barb Sanders #2 |
| 15738 | Being Tested I'm traveling on train tracks. I... | kenneth | Kenneth |
| 11119 | I was putting together rectangular pieces of v... | norman | Norman: a child molester |
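Since every report carries its `series` label, it can help to check how many reports each series contributes, or to restrict the analysis to a single series, before sampling. The sketch below uses a small placeholder DataFrame (`toy_bank`, a hypothetical stand-in with made-up reports) that only mirrors the three columns described above; with the real data you would apply the same pandas calls to `dream_bank`.

```python
import pandas as pd

# Toy stand-in for the DataFrame returned by dreamy.get_HF_DreamBank(as_dataframe=True).
# The reports are placeholders; only the column names mirror the real data.
toy_bank = pd.DataFrame(
    {
        "dreams": ["placeholder report a", "placeholder report b", "placeholder report c"],
        "series": ["emma", "norman", "emma"],
        "description": [
            "Emma: 48 years of dreams",
            "Norman: a child molester",
            "Emma: 48 years of dreams",
        ],
    }
)

# Number of reports per series
reports_per_series = toy_bank["series"].value_counts()

# Keep only the reports from one series
emma_reports = toy_bank[toy_bank["series"] == "emma"]

print(reports_per_series)
print(len(emma_reports))
```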


Let's now sample a small set of dreams. If you have a more powerful machine (or you are working on Colab), you can increase the number of reports. Note that the whole dataset contains ~29k reports.

In [5]:
n_samples = 10
dream_sample = dream_bank.sample(n_samples).reset_index(drop=True)

dream_as_list = dream_sample["dreams"].tolist()
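Because `sample` draws different rows on every run, the predictions further down will change each time the notebook is executed. If you need the same reports across runs, one option is to pass pandas' `random_state` argument. The sketch below is only an illustration: `toy_bank` is a placeholder DataFrame and the seed value is arbitrary.

```python
import pandas as pd

# Placeholder data standing in for dream_bank (not real DreamBank reports)
toy_bank = pd.DataFrame({"dreams": [f"placeholder report {i}" for i in range(100)]})

n_samples = 10
seed = 42  # arbitrary value, chosen only for this illustration

# With the same random_state, both draws select the same rows
sample_a = toy_bank.sample(n_samples, random_state=seed).reset_index(drop=True)
sample_b = toy_bank.sample(n_samples, random_state=seed).reset_index(drop=True)

dreams_as_list = sample_a["dreams"].tolist()
print(sample_a.equals(sample_b))
```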

Let's check which models are currently available.

In [6]:
dreamy.get_NER_names()

{'NER via text generation of full cahractes descriptions': 'DReAMy-lib/t5-base-DreamBank-Generation-NER-Char'}


In [7]:
dreamy.get_NER_maps()

{'full-base-en': ['DReAMy-lib/t5-base-DreamBank-Generation-NER-Char', 'summarization']}


As you can see, at the moment dreamy only has one model for NER 😅. We'll have to work with that for now 😃. Let's start by setting up the needed specs.

In [8]:
classification_type = "full"
model_type          = "base-en"
device              = "cpu"
max_length          = 512
truncation          = True


model_name, task = dreamy.ner_model_maps[
    "{}-{}".format(classification_type, model_type)
]
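To make the lookup above explicit: the key is the string `"full-base-en"`, built from `classification_type` and `model_type`, and the value is a `[model_name, task]` pair. The sketch below reproduces that step on a plain dictionary copied from the `get_NER_maps()` output shown earlier. That `dreamy.ner_model_maps` holds this same dictionary is an assumption here, and the error message is only illustrative.

```python
# Dictionary copied from the dreamy.get_NER_maps() output above.
# Assumption: dreamy.ner_model_maps holds the same mapping.
ner_maps = {
    "full-base-en": [
        "DReAMy-lib/t5-base-DreamBank-Generation-NER-Char",
        "summarization",
    ],
}

classification_type = "full"
model_type = "base-en"

# Build the key and fail with a readable message if the combination is unknown
key = f"{classification_type}-{model_type}"
if key not in ner_maps:
    raise KeyError(f"No NER model for '{key}'. Available keys: {sorted(ner_maps)}")

# Each value is a [model_name, task] pair, so it unpacks into two variables
model_name, task = ner_maps[key]
print(model_name, task)
```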

Much like in the emotion-generation tutorial, do not worry about the `Your max_length is set to...` warning. It is just 🤗 being nice and telling us that we could use a smaller `max_length`.

In [9]:
predictions = dreamy.get_CHAR(
    dream_as_list, 
    model_name, 
    task,
    max_length=max_length, 
    truncation=truncation, 
    device=device,
)


Your max_length is set to 512, but you input_length is only 114. You might consider decreasing max_length manually, e.g. summarizer('...', max_length=57)
Your max_length is set to 512, but you input_length is only 106. You might consider decreasing max_length manually, e.g. summarizer('...', max_length=53)
Your max_length is set to 512, but you input_length is only 178. You might consider decreasing max_length manually, e.g. summarizer('...', max_length=89)
Your max_length is set to 512, but you input_length is only 30. You might consider decreasing max_length manually, e.g. summarizer('...', max_length=15)
Your max_length is set to 512, but you input_length is only 310. You might consider decreasing max_length manually, e.g. summarizer('...', max_length=155)
Your max_length is set to 512, but you input_length is only 80. You might consider decreasing max_length manually, e.g. summarizer('...', max_length=40)
Your max_length is set to 512, but you input_length is only 193. You might co

Finally, let's have a look at the predicted characters...

In [10]:
predictions

[{'summary_text': 'group joint known child; group indefinite uncertian child; individual indefinite known adult; group male known child . group joint occupational adult;'},
 {'summary_text': 'individual female known adult; individual male uncertian adult; reative adult; original form male known adult . changed form female known child; changed form male stranger adult;'},
 {'summary_text': 'individual dead male prominent adult; individual female known adult; original form indefinite uncertian adult; changed form in late 19th century .'},
 {'summary_text': 'individual male stranger adult; group joint uncertian adult; individual female known adult; original form joint occupational adult; changed form female occupational adult .'},
 {'summary_text': "'i was in einer Gesellschaft, where ein Gericht vorkam, das Kawatanken hiess' ''"},
 {'summary_text': 'individual female known adult; group indefinite uncertian child; individual male occupational adult; individual indefinite stranger adult; u
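Each `summary_text` packs all the characters of a report into one string, so a small post-processing step can turn it into a list with one entry per character. The helper below (`split_characters`, a hypothetical name and not part of `dreamy`) is a minimal sketch that assumes `;` and `.` act as separators, as they appear to in the outputs above. It does not check that an entry is a valid Hall & Van de Castle description, so off-target generations, such as the one produced for the non-English report, pass through untouched.

```python
import re


def split_characters(summary_text):
    """Split one generated string into individual character descriptions.

    Assumes ';' and '.' separate characters, as in the outputs above.
    """
    parts = re.split(r"[;.]", summary_text)
    return [part.strip() for part in parts if part.strip()]


# First prediction from the output above, copied verbatim
example_predictions = [
    {
        "summary_text": (
            "group joint known child; group indefinite uncertian child; "
            "individual indefinite known adult; group male known child . "
            "group joint occupational adult;"
        )
    },
]

characters = [split_characters(p["summary_text"]) for p in example_predictions]
print(characters[0])
```

With the real output, you would apply the same list comprehension to `predictions`, which yields one list of characters per report.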