# Annotate with DReAMy: NER, SA, RE

In this notebook, we will use DReAMy with its default settings to quickly annotate a set of dream reports.

In [None]:
! pip install dreamy


In [None]:
import dreamy

Let's start by getting some dreams. You can start by downloading a collection of dream reports scraped from the DreamBank database, freely available from DReAMy's Hugging Face page.

DReAMy has direct access to two datasets: a smaller English-only one (~22k reports) and a larger multilingual one (EN/DE, ~29k reports). We will start with the English-only one, as it has more descriptive variables (such as gender and year).

In [None]:
# choose between "base" (~22k reports, EN-only, more descriptive variables)
# and "large" (~29k reports, EN and DE, only series as descriptive variable)
database = "base"
dream_bank = dreamy.get_HF_DreamBank(database=database, as_dataframe=True)




As you can see, the file, downloaded directly as a pandas DataFrame, has three main columns:
- dreams, the dream reports
- series, the DreamBank collection each report belongs to
- description, a brief description of each series

The `base` version also includes the descriptive variables gender and year.

In [None]:
dream_bank.sample(2)

| | series | description | dreams | gender | year |
|---|---|---|---|---|---|
| 3577 | dorothea | Dorothea: 53 years of dreams | With someone planning a party. | female | 1912-1965 |
| 8510 | b2 | Barb Sanders #2 | (08/03/00)["Poison frog prisoner bites me."] I... | female | 1997-2001 |
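
Because the reports arrive as a regular pandas DataFrame, the descriptive columns can be used to summarise or filter the collection before annotating. The snippet below is a minimal sketch on a toy DataFrame that only mimics the column names shown above; the third row is made up for illustration and is not taken from DreamBank.

```python
import pandas as pd

# Toy stand-in for the DreamBank DataFrame (same column names as above).
# The last row is invented purely for illustration.
toy_bank = pd.DataFrame(
    {
        "series": ["dorothea", "b2", "b2"],
        "description": [
            "Dorothea: 53 years of dreams",
            "Barb Sanders #2",
            "Barb Sanders #2",
        ],
        "dreams": [
            "With someone planning a party.",
            "Poison frog prisoner bites me.",
            "I am in a house.",
        ],
        "gender": ["female", "female", "female"],
        "year": ["1912-1965", "1997-2001", "1997-2001"],
    }
)

# Number of reports in each series.
reports_per_series = toy_bank["series"].value_counts()

# Keep only the reports that belong to a single series.
b2_reports = toy_bank[toy_bank["series"] == "b2"]
```

With the real data, the same two lines can be applied to `dream_bank` directly.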


Let's now sample a small set of dreams. If you have a more powerful machine (or you are working on Colab), you can increase the number of reports. Note that the base dataset contains ~22k reports, and the large one ~29k.

In [None]:
n_samples = 1000
dream_sample = dream_bank.sample(n_samples).reset_index(drop=True)

list_of_reports = dream_sample["dreams"].tolist()
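
The sampling step above draws a different set of reports on every run. If you want the same sample each time, one option is to pass a seed to pandas. The following is a small sketch on a toy DataFrame; the `seed` value is arbitrary, and the `min(...)` guard is an addition that avoids asking for more rows than exist.

```python
import pandas as pd

# Toy stand-in for the DreamBank DataFrame.
toy_bank = pd.DataFrame({"dreams": [f"report {i}" for i in range(10)]})

n_samples = 4
seed = 42  # arbitrary value, chosen only for illustration

# Never request more rows than the DataFrame contains.
n_samples = min(n_samples, len(toy_bank))

# The same random_state yields the same rows on every call.
sample_a = toy_bank.sample(n_samples, random_state=seed).reset_index(drop=True)
sample_b = toy_bank.sample(n_samples, random_state=seed).reset_index(drop=True)

list_of_toy_reports = sample_a["dreams"].tolist()
```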

First, we can check the suite of models you can call via dreamy to annotate dream reports. Remember that dreamy has three main annotation tasks:

- NER (named entity recognition): annotates reports with respect to the characters appearing in a report.

- SA (sentiment analysis): annotates reports with respect to which of the five Hall & Van de Castle emotions (anger, apprehension, confusion, sadness, happiness) appear in a report (and, possibly, which character is experiencing them).

- RE (relation extraction): extracts the entities (characters) in a report and the relations between them. At the moment, the only RE task available refers to the activity feature of the HVDC framework.

To check which models are available for download, just call the `show_models()` function. This will return a dictionary of `Task, model_description: model_name` entries. As we will see shortly, `dreamy` sets a default model for each task, chosen based on performance, but you can always switch between them.

In [None]:
dreamy.show_models()

{'NER, EN-only, base, generation of list of (interpretable characters). T5)': 'DReAMy-lib/t5-base-DreamBank-Generation-NER-Char',
 'SA, En-only, large, multi-label classification. Custom architecture from Bertolini et al., 2023': 'DReAMy-lib/DB-custom-architecture',
 'SA, Multilingual (94), large, multi-label classification. XLM-RoBERTa': 'DReAMy-lib/xlm-roberta-large-DreamBank-emotion-presence',
 'SA, En-only, base, multi-label classification. BERT-base-cased': 'DReAMy-lib/bert-base-cased-DreamBank-emotion-presence',
 'SA, En-only, base, generation of emotion and character experiencing them. T5': 'DReAMy-lib/t5-base-DreamBank-Generation-Emot-Char',
 'SA, En-only, base, generation of emotion with ammount of presence. T5': 'DReAMy-lib/t5-base-DreamBank-Generation-Emot-EmotNn',
 'RE, EN-only, base, generation of  (initialiser : activity type : receiver) list. T5)': 'DReAMy-lib/t5-base-DreamBank-Generation-Act-Char'}
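
Since every description in the dictionary above starts with the task name, the entries can be narrowed down to a single task with a few lines of plain Python. This is only a sketch: `models_for_task` is a hypothetical helper, not part of `dreamy`, and the dictionary is copied verbatim from the output above rather than fetched from the library.

```python
# Copied verbatim from the `show_models()` output above.
models = {
    "NER, EN-only, base, generation of list of (interpretable characters). T5)": "DReAMy-lib/t5-base-DreamBank-Generation-NER-Char",
    "SA, En-only, large, multi-label classification. Custom architecture from Bertolini et al., 2023": "DReAMy-lib/DB-custom-architecture",
    "SA, Multilingual (94), large, multi-label classification. XLM-RoBERTa": "DReAMy-lib/xlm-roberta-large-DreamBank-emotion-presence",
    "SA, En-only, base, multi-label classification. BERT-base-cased": "DReAMy-lib/bert-base-cased-DreamBank-emotion-presence",
    "SA, En-only, base, generation of emotion and character experiencing them. T5": "DReAMy-lib/t5-base-DreamBank-Generation-Emot-Char",
    "SA, En-only, base, generation of emotion with ammount of presence. T5": "DReAMy-lib/t5-base-DreamBank-Generation-Emot-EmotNn",
    "RE, EN-only, base, generation of  (initialiser : activity type : receiver) list. T5)": "DReAMy-lib/t5-base-DreamBank-Generation-Act-Char",
}


def models_for_task(models, task):
    """Return the {description: model_name} entries whose description starts with `task`."""
    prefix = task.upper() + ","
    return {
        description: name
        for description, name in models.items()
        if description.upper().startswith(prefix)
    }


sa_models = models_for_task(models, "SA")
for description, name in sa_models.items():
    print(f"{name}  <-  {description}")
```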

To check the default model for each task, call `show_default_models()`.

In [None]:
dreamy.show_default_models()

{'NER': 'DReAMy-lib/t5-base-DreamBank-Generation-NER-Char',
 'SA': 'DReAMy-lib/xlm-roberta-large-DreamBank-emotion-presence',
 'RE': 'DReAMy-lib/t5-base-DreamBank-Generation-Act-Char'}

# Annotation with *default* settings
## NER

Now we can start annotating our dream reports. We will start with the main task: NER.

First, we need to specify the task (the default is `SA`) and set other minor variables, like the batch size and which device to use. Lastly, we need a list of reports. We will use the one previously defined.

In [None]:
task       = "NER"
batch_size = 16
device     = 0  # "cpu" for local, "cuda" or device number (e.g., 0) for GPU
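
Hard-coding `device = 0` fails on machines without a GPU. A possible workaround is to pick the device at run time. The sketch below assumes PyTorch is the backend (the downloaded `pytorch_model.bin` files suggest so); `pick_device` is a hypothetical helper and not part of `dreamy`.

```python
def pick_device():
    """Return 0 (first GPU) if CUDA is usable, otherwise "cpu"."""
    try:
        import torch  # assumption: PyTorch is the installed backend
    except ImportError:
        return "cpu"
    return 0 if torch.cuda.is_available() else "cpu"


device = pick_device()
print(device)
```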

In [None]:
NER_annotations = dreamy.annotate_reports(
    list_of_reports, 
    task=task, 
    device=device,
    batch_size=batch_size,
)


Token indices sequence length is longer than the specified maximum sequence length for this model (1203 > 512). Running this sequence through the model will result in indexing errors
1000it [02:42,  6.15it/s]


In [None]:
NER_annotations[:10]

['individual female known adult; group joint uncertian adult;',
 'individual male husband adult; unknown character;',
 'group joint uncertian adult;',
 'individual male known adult; individual male known adult;',
 'individual female mother adult; individual male father adult; unknown character; individual female known teenager; individual male stranger adult; group joint known teenager;',
 'individual female known adult; individual male uncertian adult; individual male known adult; individual male known adult; group male uncertian adult; group joint parents adult;',
 'individual female mother adult; individual female daughter adult;',
 'individual male known adult; group joint uncertian adult; individual female uncertian adult;',
 'individual male stranger adult;',
 'group female stranger adult; individual female occupational adult;']
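
Each NER annotation is a single string in which characters are separated by `;`. For counting or filtering it is usually handier to have one list of characters per report. The following is a minimal sketch based only on the output format shown above; `split_characters` is a hypothetical helper, and the three strings are copied from that output.

```python
from collections import Counter

# First three annotations from the output above (model spelling kept as-is).
ner_annotations = [
    "individual female known adult; group joint uncertian adult;",
    "individual male husband adult; unknown character;",
    "group joint uncertian adult;",
]


def split_characters(annotation):
    """Split one annotation string into a list of character descriptions."""
    return [part.strip() for part in annotation.split(";") if part.strip()]


# One list of characters per report.
characters_per_report = [split_characters(a) for a in ner_annotations]

# How often each character type appears across all reports.
character_counts = Counter(
    character for characters in characters_per_report for character in characters
)
print(character_counts.most_common(3))
```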

## SA

We can do exactly the same for the SA task. However, note that the SA task allows for both a generation setting (e.g., `individual male occupational adult experienced sadness, ...`) and a multi-label classification setting (e.g., `{label: AN, score:.98,...`). The classification setting has two output types. The default one, `distribution`, returns each of the five emotions (i.e., `label`) *and* the probability the model assigns to each emotion (i.e., `score`). The second one, `present`, returns a list containing only those emotions that the model predicts as truly present, that is, labels with a `score` > .5.

In [None]:
task        = "SA"
output_type = "distribution"
batch_size = 16
device     = 0  # "cpu" for local, "cuda" or device number (e.g., 0) for GPU

In [None]:
SA_predictions = dreamy.annotate_reports(
    list_of_reports, 
    task=task, 
    device=device,
    batch_size=batch_size, 
    return_type=output_type,
)


1000it [01:32, 10.83it/s]


In [None]:
SA_predictions[:10]

[[{'label': 'CO', 'score': 0.03959300369024277},
  {'label': 'AN', 'score': 0.019475437700748444},
  {'label': 'HA', 'score': 0.01309686154127121},
  {'label': 'AP', 'score': 0.011727494187653065},
  {'label': 'SD', 'score': 0.004194930661469698}],
 [{'label': 'HA', 'score': 0.9959113597869873},
  {'label': 'AP', 'score': 0.9071735143661499},
  {'label': 'CO', 'score': 0.8760898113250732},
  {'label': 'AN', 'score': 0.1336604654788971},
  {'label': 'SD', 'score': 0.04185082018375397}],
 [{'label': 'AP', 'score': 0.9984740614891052},
  {'label': 'HA', 'score': 0.9069458842277527},
  {'label': 'SD', 'score': 0.03238131105899811},
  {'label': 'AN', 'score': 0.00880750734359026},
  {'label': 'CO', 'score': 0.007660091854631901}],
 [{'label': 'AP', 'score': 0.9968907237052917},
  {'label': 'HA', 'score': 0.21021009981632233},
  {'label': 'SD', 'score': 0.005786903668195009},
  {'label': 'CO', 'score': 0.0030503757297992706},
  {'label': 'AN', 'score': 0.002490844577550888}],
 [{'label': 'SD
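
The relation between the two output types can be illustrated by thresholding a `distribution` output by hand. This is a sketch of the rule described above (keep labels with `score` > .5), not the library's own code; `present_emotions` is a hypothetical helper, and the two predictions are copied from the output above.

```python
# First two predictions from the output above.
sa_predictions = [
    [
        {"label": "CO", "score": 0.03959300369024277},
        {"label": "AN", "score": 0.019475437700748444},
        {"label": "HA", "score": 0.01309686154127121},
        {"label": "AP", "score": 0.011727494187653065},
        {"label": "SD", "score": 0.004194930661469698},
    ],
    [
        {"label": "HA", "score": 0.9959113597869873},
        {"label": "AP", "score": 0.9071735143661499},
        {"label": "CO", "score": 0.8760898113250732},
        {"label": "AN", "score": 0.1336604654788971},
        {"label": "SD", "score": 0.04185082018375397},
    ],
]


def present_emotions(distribution, threshold=0.5):
    """Keep only the labels whose score is above `threshold`."""
    return [item["label"] for item in distribution if item["score"] > threshold]


present = [present_emotions(d) for d in sa_predictions]
print(present)  # [[], ['HA', 'AP', 'CO']]
```

The first report has no emotion above the threshold, so its list is empty.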

## RE
Lastly, we look at the relation extraction task. The procedure is once again the same: we just have to set the `task` argument, like so.

In [None]:
task       = "RE"
batch_size = 16
device     = 0  # "cpu" for local, "cuda" or device number (e.g., 0) for GPU

In [None]:
RE_annotations = dreamy.annotate_reports(
    list_of_reports, 
    task=task, 
    device=device,
    batch_size=batch_size,
)

Token indices sequence length is longer than the specified maximum sequence length for this model (519 > 512). Running this sequence through the model will result in indexing errors
1000it [04:56,  3.37it/s]


In [None]:
RE_annotations[:10]

['(dreamer : alone location : none), (dreamer : alone location : none), (dreamer : alone physical : none), (dreamer : verbal towards : individual female mother adult), (individual female mother adult : verbal reciprocated : dreamer), (dreamer : alone verbal : none)',
 '(dreamer : verbal mutual : individual female wife adult), (dreamer : alone auditory : none), (individual male stranger adult : alone physical : none), (individual male stranger adult : verbal towards : dreamer), (dreamer : alone auditory : none), (individual female stranger adult : alone physical : none), (dreamer : alone physical : none), (dreamer : verbal towards : individual female stranger adult), (individual female stranger adult : verbal reciprocated : dreamer), (',
 '(dreamer : alone movement : none), (dreamer : alone movement : none), (dreamer : alone visual : none), (dreamer : alone visual : none), (dreamer : alone physical : none), (dreamer : alone visual : none)',
 '(dreamer : alone physical : none), (dreamer 
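
The RE output is a string of `(initialiser : activity type : receiver)` triples, and long reports may end with an incomplete triple when generation is cut off. A regular expression that only matches complete triples is one way to turn these strings into tuples. This is a sketch based on the output format shown above; `parse_relations` is a hypothetical helper, not a `dreamy` function.

```python
import re

# Matches one complete "(a : b : c)" triple; dangling fragments are skipped.
TRIPLE = re.compile(r"\(([^():]+):([^():]+):([^():]+)\)")


def parse_relations(annotation):
    """Return a list of (initialiser, activity, receiver) tuples."""
    return [
        tuple(part.strip() for part in match.groups())
        for match in TRIPLE.finditer(annotation)
    ]


# Shortened example in the format of the output above, ending in a cut-off "(".
example = (
    "(dreamer : verbal towards : individual female mother adult), "
    "(individual female mother adult : verbal reciprocated : dreamer), ("
)

relations = parse_relations(example)
for initialiser, activity, receiver in relations:
    print(f"{initialiser} -> {activity} -> {receiver}")
```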