# PrAACT: Predictive Augmentative and Alternative Communication with Transformers

This notebook presents the process used to annotate the dataset for the PrAACT project. As mentioned in the paper, the dataset annotation step adapts a text corpus to the context of Augmentative and Alternative Communication (AAC).

The dataset used is [AACText](https://aactext.org/). We also use the ARASAAC keywords as a vocabulary to adapt the text corpus to the AAC context.

## Install dependencies

In [None]:
!pip install transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.26.1-py3-none-any.whl (6.3 MB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1
  Downloading tokenizers-0.13.2-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.6 MB)
Collecting huggingface-hub<1.0,>=0.11.0
  Downloading huggingface_hub-0.13.1-py3-none-any.whl (199 kB)
Installing collected packages: tokenizers, huggingface-hub, transformers
Successfully installed huggingface-hub-0.13.1 tokenizers-0.13.2 transformers-4.26.1

## Download the data

First, we download the AACText dataset.

In [2]:
## download the dataset to `data` folder
!wget https://aactext.org/imagine/aac_comm.zip -P data
## extract the dataset
!unzip data/aac_comm.zip -d data

--2023-10-23 13:44:02--  https://aactext.org/imagine/aac_comm.zip
Resolving aactext.org (aactext.org)... 2606:4700:3036::6815:4f16, 2606:4700:3031::ac43:a8b4, 172.67.168.180, ...
Connecting to aactext.org (aactext.org)|2606:4700:3036::6815:4f16|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1176451 (1.1M) [application/zip]
Saving to: ‘data/aac_comm.zip.1’


2023-10-23 13:44:03 (4.76 MB/s) - ‘data/aac_comm.zip.1’ saved [1176451/1176451]

Archive:  data/aac_comm.zip
   creating: data/aac_comm/
  inflating: data/aac_comm/american_words.txt  
  inflating: data/aac_comm/lm_test_comm.txt  
  inflating: data/aac_comm/lm_test_switch.txt  
  inflating: data/aac_comm/readme.txt  
  inflating: data/aac_comm/sent_dev_aac.txt  
  inflating: data/aac_comm/sent_test_aac.txt  
  inflating: data/aac_comm/sent_train_aac.txt  
  inflating: data/aac_comm/vocab_aac_twitter.txt  


Then, we download the ARASAAC keywords. Here, we are interested in the keywords that have 2 or 3 words, which are the multi-word expressions from ARASAAC. Expressions with more than 3 words are not considered, as they usually consist of a full sentence.

In [4]:
import requests

# Fetch the English ARASAAC keywords and keep only the 2- and 3-word expressions
keywords = requests.get("https://api.arasaac.org/api/keywords/en").json()
keywords = [k.strip().lower() for k in keywords["words"] if 1 < len(k.strip().split(" ")) <= 3]
len(keywords)

7326
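
To make the length filter concrete, the sketch below applies the same 2-to-3-word rule to a small hand-written list. The sample entries and the helper name `keep_multiword` are hypothetical and are not taken from the ARASAAC API response.

```python
def keep_multiword(words, min_words=2, max_words=3):
    """Keep normalized entries whose word count lies in [min_words, max_words]."""
    kept = []
    for w in words:
        normalized = w.strip().lower()
        n_words = len(normalized.split(" "))
        if min_words <= n_words <= max_words:
            kept.append(normalized)
    return kept

# Hypothetical sample entries, for illustration only
sample = ["dog", "Green olives", " ice cream cone ", "I want to go home"]
filtered = keep_multiword(sample)
print(filtered)  # ['green olives', 'ice cream cone']
```

Single words are dropped because they need no merging, and the five-word entry is dropped because it reads as a sentence rather than an expression.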

## Process Dataset

We use a multi-word expression (MWE) tokenizer to tokenize the text corpus. We use the [MWETokenizer](https://www.nltk.org/_modules/nltk/tokenize/mwe.html) from NLTK.

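For this, we add each ARASAAC keyword to the tokenizer as a tuple of its words. The tokenizer then merges any matching word sequence into a single token joined by an underscore (for example, an expression such as `green olives` would become `green_olives`).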

In [5]:
from nltk.tokenize import MWETokenizer
vocab = []

for k in keywords:
  vocab.append(tuple(k.strip().lower().split(" ")))
MWEtokenizer = MWETokenizer(vocab)
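
The effect of the tokenizer can be pictured with a small stand-alone sketch. The function below is a simplified greedy merger written for illustration; it is not NLTK's implementation, and the name `merge_mwes` and the two sample expressions are assumptions. Like the default of `MWETokenizer`, it joins the words of a matched expression with an underscore.

```python
def merge_mwes(tokens, expressions, separator="_"):
    """Greedily merge the longest known multi-word expression at each position."""
    known = {tuple(e) for e in expressions}
    max_len = max((len(e) for e in known), default=1)
    merged = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate first so longer expressions take priority
        for size in range(min(max_len, len(tokens) - i), 1, -1):
            candidate = tuple(tokens[i:i + size])
            if candidate in known:
                merged.append(separator.join(candidate))
                i += size
                break
        else:
            # No expression starts here: keep the single token
            merged.append(tokens[i])
            i += 1
    return merged

# Hypothetical vocabulary in the same tuple format as `vocab`
sample_vocab = [("green", "olives"), ("ice", "cream")]
result = merge_mwes(["I", "like", "green", "olives", "."], sample_vocab)
print(result)  # ['I', 'like', 'green_olives', '.']
```

Matching is case-sensitive in this sketch, which is one reason the keywords are lowercased before they are added to the vocabulary.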

In [10]:
!python3 -m spacy download en_core_web_sm

Defaulting to user installation because normal site-packages is not writeable
Collecting en-core-web-sm==3.5.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.5.0/en_core_web_sm-3.5.0-py3-none-any.whl (12.8 MB)
Installing collected packages: en-core-web-sm
Successfully installed en-core-web-sm-3.5.0
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


In [11]:
import en_core_web_sm
nlp = en_core_web_sm.load()

def lemmatize(sentence):
  # Run spaCy on the lowercased sentence and record text, lemma, and POS tag per token
  doc = nlp(sentence.lower())
  tokens = []
  for t in doc:
    tokens.append({
        "text": t.text,
        "lemma": t.lemma_,
        "pos": t.pos_
    })
  return {
      "sentence": sentence,
      "tokens": tokens
  }
lemmatize(" I like to eat green_olives.")

{'sentence': ' I like to eat green_olives.',
 'tokens': [{'text': ' ', 'lemma': ' ', 'pos': 'SPACE'},
  {'text': 'i', 'lemma': 'I', 'pos': 'PRON'},
  {'text': 'like', 'lemma': 'like', 'pos': 'VERB'},
  {'text': 'to', 'lemma': 'to', 'pos': 'PART'},
  {'text': 'eat', 'lemma': 'eat', 'pos': 'VERB'},
  {'text': 'green_olives', 'lemma': 'green_olive', 'pos': 'NOUN'},
  {'text': '.', 'lemma': '.', 'pos': 'PUNCT'}]}

In [12]:
import re
import string

def prepare_text(sentences):
  sentences = [s.rstrip() for s in sentences]
  # Split into words and punctuation marks, then merge multi-word expressions
  tokenized_sentences = [MWEtokenizer.tokenize(re.findall(r"[\w']+|[.,!?;]", s)) for s in sentences]
  # Keep only sentences with at least 3 tokens
  tokenized_sentences = [s for s in tokenized_sentences if len(s) >= 3]
  lemmatized_sentences = [lemmatize(" ".join(s)) for s in tokenized_sentences]
  telegraphic_sentences = []
  # Translation table that deletes ASCII punctuation
  strip_punct = str.maketrans('', '', string.punctuation)
  for sentence in lemmatized_sentences:
    # Telegraphic form: lemmas only, without determiners (DET) and adpositions (ADP)
    tokens = [t["lemma"] for t in sentence['tokens'] if t["pos"] not in ["DET","ADP"]]
    n_sentence = " ".join(tokens)
    # Skip sentences that contain a comma
    if "," not in tokens:
      telegraphic_sentences.append(n_sentence.replace("_"," ").translate(strip_punct))
      # Also keep the original sentence when it differs from the telegraphic form
      if n_sentence != sentence["sentence"].lower():
        telegraphic_sentences.append(sentence["sentence"].replace("_"," ").translate(strip_punct))
  return telegraphic_sentences
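
As a rough illustration of the telegraphic step, the sketch below works on hand-written tokens instead of spaCy output, so it runs without any model. The token dictionaries, their POS tags and lemmas, and the helper name `to_telegraphic` are assumptions made for this example; in the notebook, tags and lemmas come from `lemmatize`.

```python
import re
import string

# Same pattern as in prepare_text: words (with apostrophes) or single punctuation marks
TOKEN_PATTERN = r"[\w']+|[.,!?;]"
raw_tokens = re.findall(TOKEN_PATTERN, "The cats sleep on the sofa.")
print(raw_tokens)  # ['The', 'cats', 'sleep', 'on', 'the', 'sofa', '.']

def to_telegraphic(tokens):
    """Join lemmas, skipping determiners (DET) and adpositions (ADP), then strip punctuation."""
    lemmas = [t["lemma"] for t in tokens if t["pos"] not in ("DET", "ADP")]
    text = " ".join(lemmas).replace("_", " ")
    return text.translate(str.maketrans("", "", string.punctuation))

# Hand-written annotations (hypothetical), mimicking the output format of lemmatize
tagged = [
    {"text": "the", "lemma": "the", "pos": "DET"},
    {"text": "cats", "lemma": "cat", "pos": "NOUN"},
    {"text": "sleep", "lemma": "sleep", "pos": "VERB"},
    {"text": "on", "lemma": "on", "pos": "ADP"},
    {"text": "the", "lemma": "the", "pos": "DET"},
    {"text": "sofa", "lemma": "sofa", "pos": "NOUN"},
    {"text": ".", "lemma": ".", "pos": "PUNCT"},
]
telegraphic = to_telegraphic(tagged)
print(repr(telegraphic))  # 'cat sleep sofa '
```

The trailing space appears because the final period is joined as a token and then removed. Since `prepare_text` keeps both the telegraphic form and the original sentence whenever the two differ, the prepared splits can contain more lines than the raw ones.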

In [14]:
import re

# Read the raw AACText splits, one sentence per line
with open("./data/aac_comm/sent_train_aac.txt",'r') as f:
  train = f.readlines()
with open("./data/aac_comm/sent_test_aac.txt",'r') as f:
  test = f.readlines()
with open("./data/aac_comm/sent_dev_aac.txt",'r') as f:
  dev = f.readlines()

print("Train size: ",len(train))
print("test size: ",len(test))
print("dev size: ",len(dev))

Train size:  5019
test size:  566
dev size:  557


In [15]:
prepared_train = prepare_text(train)
prepared_test = prepare_text(test)
prepared_dev = prepare_text(dev)

print("Train size: ",len(prepared_train))
print("test size: ",len(prepared_test))
print("dev size: ",len(prepared_dev))

Train size:  7824
test size:  936
dev size:  889


## Save prepared data

In [17]:
## save the prepared dataset
with open("./data/aac_comm/prepared_train.txt",'w') as f:
  f.write("\n".join(prepared_train))
with open("./data/aac_comm/prepared_test.txt",'w') as f:
  f.write("\n".join(prepared_test))
with open("./data/aac_comm/prepared_dev.txt",'w') as f:
  f.write("\n".join(prepared_dev))