# Classifying Reports: Sentiment Analysis

In this tutorial, we will look at how to classify dream reports with respect to their emotional content using DReAMy, following the Hall and Van de Castle (HVDC) scoring system. The task is further divided into two settings: 
- labelling
- generation. 

In the first setting, an LLM of choice annotates a set of reports in a multi-label classification setup. That is, given one or more textual reports, the model outputs the probability that each HVDC emotion is present in each report. The second setting performs a similar analysis in a "spelt-out" form, using text-to-text generation. This setting also lets us identify the character each emotion is associated with.

If you wish to know more about the general research aspect behind the tuned LLMs, please refer to the paper ([Bertolini et al., 2023](https://arxiv.org/abs/2302.14828)).

To learn more about the tuning performance of each model, visit [DReAMy's Hugging Face repo](https://huggingface.co/DReAMy-lib), and refer to each model's card for hyperparameters and validation scores.

In [None]:
! pip install dreamy

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting dreamy
  Downloading dreamy-0.0.9.5-py3-none-any.whl (13 kB)
Collecting transformers[tokenizers,torch]
  Downloading transformers-4.26.1-py3-none-any.whl (6.3 MB)
Collecting datasets
  Downloading datasets-2.10.1-py3-none-any.whl (469 kB)
Collecting multiprocess
  Downloading multiprocess-0.70.14-py38-none-any.whl (132 kB)
Collecting xxhash
  Downloading xxhash-3.2.0-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (213 kB)

In [None]:
import dreamy

Let's start by getting some dreams: we download one of the two collections of dream reports scraped from the DreamBank database, freely available from DReAMy's Hugging Face page.

In [None]:
language   = "english" # choose between english/multi
dream_bank = dreamy.get_HF_DreamBank(as_dataframe=True, language=language)

Downloading and preparing dataset None/None to /root/.cache/huggingface/datasets/DReAMy-lib___parquet/DReAMy-lib--DreamBank-dreams-en-98a9abc92d226c3a/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec...

Generating train split:   0%|          | 0/22415 [00:00<?, ? examples/s]

Dataset parquet downloaded and prepared to /root/.cache/huggingface/datasets/DReAMy-lib___parquet/DReAMy-lib--DreamBank-dreams-en-98a9abc92d226c3a/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec. Subsequent calls will reuse this data.

As you can see, the dataset, downloaded directly as a pandas DataFrame, has five columns:
- series: the different collections of DreamBank
- description: a brief description of each series
- dreams: the dream-reports
- gender: the gender of the participant
- year: the time window of the series

In [None]:
dream_bank.sample(5)

| | series | description | dreams | gender | year |
|---|---|---|---|---|---|
| 16112 | madeline4-postgrad | Madeline 4: After College | I saw two hands, representing gay marriage. T... | female | 2003-2004 |
| 14489 | hall_female | College women, late 1940s | I dreamt last week that I was in the dining ro... | female | 1946-1950 |
| 7221 | norms-f | Hall/VdC Norms: Female | It seems I went to a certain party where there... | female | 1940s-1950s |
| 12438 | emma | Emma: 48 years of dreams | All the people who work at the Counseling Cent... | female | 1949-1997 |
| 20668 | b | Barb Sanders | I'm watching TV and had seen the previews of c... | female | 1960-1997 |


Let's now sample a small set of dreams. If you have a more powerful machine (or you are working with a subscription-based Colab), you can increase the number of reports. Note that the whole dataset contains ~ 20k reports.

In [None]:
n_samples = 10
dream_sample = dream_bank.sample(n_samples).reset_index(drop=True)

dream_as_list = dream_sample["dreams"].tolist()
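
If you do increase `n_samples`, one option for keeping memory use in check is to feed the reports to the model in smaller batches. The sketch below is a generic, standard-library chunking helper; `chunked`, `example_reports`, and the batch size are illustrative names and values, not part of `dreamy`, and whether the library already batches its inputs internally is not covered here.

```python
def chunked(items, batch_size):
    """Yield consecutive slices of `items`, each with at most `batch_size` elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be a positive integer")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# Stand-in reports so the example runs on its own; in the tutorial,
# `dream_as_list` would take their place.
example_reports = [f"dream report {i}" for i in range(10)]

batches = list(chunked(example_reports, batch_size=4))
print([len(batch) for batch in batches])
```

Each batch could then be passed to the prediction function in place of the full list, and the per-batch results concatenated.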

We now set some parameters to decide which model to start with.

## Labelling
Here, we query a model only for the probability that each emotion is present. We start by asking `dreamy` which models are available and collecting the necessary arguments. We can do so with the `get_emotions_model_specifications` function, which returns a dictionary mapping each model's description to the arguments needed to use that model.

In [None]:
dreamy.get_emotions_model_specifications()

{'Custom architecture, best for multi-labell classification. (Large) En-only': ['presence',
  'large-en'],
 'Large Multilingual model (XLM-R)': ['presence', 'large-multi'],
 'English-Only base-model (BERT-base)': ['presence', 'base-en'],
 'Base En-only text-generation: Characters + emotions (T5-base)': ['generation',
  'char-en'],
 'Base En-only text-generation: numbered emotions (T5-base)': ['generation',
  'nmbr-en']}
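
Because the dictionary is keyed by description, it can be handy to look a model up by its `model_type` instead. The sketch below copies the output above into a local variable rather than calling `dreamy`; `model_specs` and `find_model` are hypothetical names introduced for illustration, not part of the library.

```python
# Copy of the dictionary printed above. In a live session this would be the
# return value of dreamy.get_emotions_model_specifications().
model_specs = {
    "Custom architecture, best for multi-labell classification. (Large) En-only": ["presence", "large-en"],
    "Large Multilingual model (XLM-R)": ["presence", "large-multi"],
    "English-Only base-model (BERT-base)": ["presence", "base-en"],
    "Base En-only text-generation: Characters + emotions (T5-base)": ["generation", "char-en"],
    "Base En-only text-generation: numbered emotions (T5-base)": ["generation", "nmbr-en"],
}


def find_model(specs, model_type):
    """Return (description, task) for the entry whose second argument is `model_type`."""
    for description, (task, name) in specs.items():
        if name == model_type:
            return description, task
    raise KeyError(f"Unknown model_type: {model_type!r}")


description, task = find_model(model_specs, "base-en")
print(f"{task} | {description}")
```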

For this tutorial, we will use the third model, a smaller, English-only model (implemented as a BERT-base LLM). As you can see, to use this model we need to set `classification_type = "presence"` and `model_type = "base-en"`. We can also set other useful options, such as whether we want the scores for all classes (rather than just those above threshold), and which device (CPU or GPU) to run the classification on.

In [None]:
classification_type = "presence"
model_type          = "base-en"
return_type         = "distribution" # "present" for above-threshold only
device              = "cpu" # "cpu" for CPU, "cuda" for GPU

We can now obtain the predicted emotions for a list of dreams by simply calling the `predict_emotions` function with the selected specifications.

In [None]:
predictions = dreamy.predict_emotions(
    dream_as_list, 
    classification_type, 
    model_type,
    return_type=return_type, 
    device=device,
)

In [None]:
predictions

[[{'label': 'CO', 'score': 0.6409539580345154},
  {'label': 'HA', 'score': 0.5593757033348083},
  {'label': 'AN', 'score': 0.028343047946691513},
  {'label': 'SD', 'score': 0.020923538133502007},
  {'label': 'AP', 'score': 0.006222642958164215}],
 [{'label': 'SD', 'score': 0.9714974164962769},
  {'label': 'AN', 'score': 0.08177998661994934},
  {'label': 'AP', 'score': 0.07258966565132141},
  {'label': 'CO', 'score': 0.043272484093904495},
  {'label': 'HA', 'score': 0.03988633304834366}],
 [{'label': 'SD', 'score': 0.9406083822250366},
  {'label': 'AP', 'score': 0.3144640326499939},
  {'label': 'AN', 'score': 0.15911339223384857},
  {'label': 'HA', 'score': 0.02027391456067562},
  {'label': 'CO', 'score': 0.015820380300283432}],
 [{'label': 'SD', 'score': 0.9691457748413086},
  {'label': 'AP', 'score': 0.39766016602516174},
  {'label': 'HA', 'score': 0.07225380837917328},
  {'label': 'AN', 'score': 0.034352272748947144},
  {'label': 'CO', 'score': 0.022598588839173317}],
 [{'label': 'HA

And here are the predictions. As you can see, they form a list with one entry per report, and each entry is a list of dictionaries. Each dictionary contains two items: `label`, which is a specific emotion, and `score`, the probability that this emotion is present in the report.
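
For downstream analysis, it is often easier to work with one `{label: score}` mapping per report. The sketch below does this in plain Python on the first two entries of the output above, copied by hand; `example_predictions` and `to_score_table` are illustrative names, not part of `dreamy`.

```python
# First two entries of the output above, copied by hand for illustration.
example_predictions = [
    [
        {"label": "CO", "score": 0.6409539580345154},
        {"label": "HA", "score": 0.5593757033348083},
        {"label": "AN", "score": 0.028343047946691513},
        {"label": "SD", "score": 0.020923538133502007},
        {"label": "AP", "score": 0.006222642958164215},
    ],
    [
        {"label": "SD", "score": 0.9714974164962769},
        {"label": "AN", "score": 0.08177998661994934},
        {"label": "AP", "score": 0.07258966565132141},
        {"label": "CO", "score": 0.043272484093904495},
        {"label": "HA", "score": 0.03988633304834366},
    ],
]


def to_score_table(predictions):
    """Turn each report's list of {label, score} dicts into one {label: score} dict."""
    return [{item["label"]: item["score"] for item in report} for report in predictions]


score_table = to_score_table(example_predictions)
for index, scores in enumerate(score_table):
    top_label = max(scores, key=scores.get)
    print(f"report {index}: top label = {top_label}, sum of scores = {sum(scores.values()):.2f}")
```

Note that the scores of a report do not need to sum to 1: in a multi-label setting each emotion is scored independently, so several emotions can be likely at once. A list of dictionaries such as `score_table` can also be passed directly to `pandas.DataFrame` if a tabular view is preferred.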

As mentioned, you can choose to obtain only the labels of the detected emotions (i.e., those with a `score` >= 0.5) by setting `return_type="present"`. The return format will be a list of lists of the form:
```
[['HA', 'CO'],
 ['CO', 'AP', 'AN'],
 ['AP'],
 ['CO'],
 [],
 ['CO'],
 ['CO'],
 ['AP'],
 ['CO', 'AP'],
 ['AP']]
 ```
Note that one entry contains an empty list. This just means that the model has identified no emotion for that specific report.
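
The relationship between the two return types can be sketched in plain Python. The snippet below applies the 0.5 cut-off described above to two entries copied from the `"distribution"` output; `present_labels` and `sample_distribution` are hypothetical names, and the assumption that `dreamy` performs exactly this comparison internally is not confirmed here.

```python
# Two entries copied by hand from the "distribution" output shown earlier.
sample_distribution = [
    [
        {"label": "CO", "score": 0.6409539580345154},
        {"label": "HA", "score": 0.5593757033348083},
        {"label": "AN", "score": 0.028343047946691513},
        {"label": "SD", "score": 0.020923538133502007},
        {"label": "AP", "score": 0.006222642958164215},
    ],
    [
        {"label": "SD", "score": 0.9406083822250366},
        {"label": "AP", "score": 0.3144640326499939},
        {"label": "AN", "score": 0.15911339223384857},
        {"label": "HA", "score": 0.02027391456067562},
        {"label": "CO", "score": 0.015820380300283432},
    ],
]


def present_labels(predictions, threshold=0.5):
    """Keep, for each report, the labels whose score reaches the threshold."""
    return [
        [item["label"] for item in report if item["score"] >= threshold]
        for report in predictions
    ]


print(present_labels(sample_distribution))
print(present_labels(sample_distribution, threshold=0.3))
```

Working from the full distribution this way makes it possible to experiment with stricter or more lenient thresholds without re-running the model.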

To interpret the emotion labels, use the decoding table included in DReAMy:

In [None]:
dreamy.Coding_emotions

{'AN': 'anger',
 'AP': 'apprehension',
 'SD': 'sadness',
 'CO': 'confusion',
 'HA': 'happiness'}
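
With this table, label codes can be mapped to readable names. The sketch below copies the dictionary above into a local variable so that it runs without `dreamy` installed; `coding_emotions` and `present_example` are illustrative names, and the label lists are taken from the `"present"` example shown earlier.

```python
# Copy of the decoding table printed above (dreamy.Coding_emotions).
coding_emotions = {
    "AN": "anger",
    "AP": "apprehension",
    "SD": "sadness",
    "CO": "confusion",
    "HA": "happiness",
}

# A few entries in the "present" return format, including an empty one.
present_example = [["HA", "CO"], ["CO", "AP", "AN"], ["AP"], []]

decoded = [[coding_emotions[label] for label in labels] for labels in present_example]
for labels, names in zip(present_example, decoded):
    print(labels, "->", names)
```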

## Generation

Let's now try to *generate* the emotion encodings. We will use the same data and general specifications; the key change is the task type.

In this case, we will use the `'char-en'` model, a (T5) text-to-text LLM that, as previously described, was tuned to identify HVDC emotions as well as the character experiencing them.

Please note that the training data are limited to English, for both pre-training and tuning.

In [1]:
classification_type = "generation"
model_type          = "char-en"

# The device setting is the same as before
device              = "cpu"

As you can see, we now have a different model name and task for the pipeline. Moreover, we need to call a slightly different function, `generate_emotions`, which takes slightly different inputs (here, no `return_type` is passed).

In [None]:
predictions = dreamy.generate_emotions(
    dream_as_list, 
    classification_type, 
    model_type,
    device=device,
)

As you can see, in this case `dreamy` simply returns a list containing one generated description per report.

In [None]:
predictions

['The dreamer experienced happiness.',
 'The dreamer experienced sadness.',
 'The dreamer experienced apprehension.',
 'The group joint uncertian adult experienced sadness.',
 'The dreamer experienced happiness.',
 'The dreamer experienced apprehension and excitement.',
 'The dreamer experienced apprehension.',
 'The dreamer experienced apprehension.',
 'The dreamer experienced confusion.',
 'The dreamer experienced happiness and apprehension.']
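
Since the generated output is free text, a small parsing step may be needed before counting characters or emotions. The sketch below assumes every sentence follows the "The <character> experienced <emotions>." pattern seen in the output above; `parse_generated` is a hypothetical helper, not part of `dreamy`, and outputs that deviate from this pattern would simply yield no match.

```python
import re
from collections import Counter

# Assumed sentence shape, inferred from the outputs above:
# "The <character> experienced <emotion>[ and <emotion> ...]."
SENTENCE = re.compile(r"The (?P<character>.+?) experienced (?P<emotions>[^.]+)\.")
SEPARATOR = re.compile(r"\s*,\s*(?:and\s+)?|\s+and\s+")


def parse_generated(text):
    """Return a list of (character, [emotions]) pairs found in one generated string."""
    pairs = []
    for match in SENTENCE.finditer(text):
        emotions = [e for e in SEPARATOR.split(match.group("emotions")) if e]
        pairs.append((match.group("character"), emotions))
    return pairs


# A few strings copied verbatim from the output above.
generated = [
    "The dreamer experienced happiness.",
    "The group joint uncertian adult experienced sadness.",
    "The dreamer experienced apprehension and excitement.",
]

parsed = [parse_generated(text) for text in generated]
emotion_counts = Counter(
    emotion for pairs in parsed for _, emotions in pairs for emotion in emotions
)
print(parsed)
print(emotion_counts)
```

Note that generated emotions are not guaranteed to fall within the five codes of the decoding table ("excitement" appears in the output above), so it may be worth checking parsed emotions against that table before aggregating.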