# Pitt Data Preparation

This notebook assumes that the Pitt text data is available in `../data/Pitt`; adjust the path below if your copy lives elsewhere.

Refer to the `CHA files processing` notebook for more details on `.cha` file parsing.

In [1]:
import pylangacq as pla
import pandas as pd
from pathlib import Path

In [2]:
Path.ls = lambda p: list(p.iterdir())

## Total utterances by participant

In [3]:
pitt_path = Path('../data/Pitt/')

In [4]:
pitt_path.ls()

[PosixPath('../data/Pitt/Dementia'),
 PosixPath('../data/Pitt/0metadata.cdc'),
 PosixPath('../data/Pitt/Control')]

In [5]:
(pitt_path/'Control').ls()

[PosixPath('../data/Pitt/Control/fluency'),
 PosixPath('../data/Pitt/Control/cookie')]

In [6]:
(pitt_path/'Dementia').ls()

[PosixPath('../data/Pitt/Dementia/recall'),
 PosixPath('../data/Pitt/Dementia/sentence'),
 PosixPath('../data/Pitt/Dementia/fluency'),
 PosixPath('../data/Pitt/Dementia/cookie')]

There are four Dementia task folders, but only two of them (`fluency` and `cookie`) have a corresponding Control folder. Let's use only the `cookie` experiment.
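The comparison of the two listings can also be done programmatically. The sketch below is only an illustration: `shared_tasks` is a hypothetical helper, and the demo runs on a temporary directory tree that mimics the layout shown above rather than on the real data.

```python
from pathlib import Path
import tempfile

def shared_tasks(root):
    """Return the task folder names present under both Control and Dementia."""
    control = {p.name for p in (root / "Control").iterdir() if p.is_dir()}
    dementia = {p.name for p in (root / "Dementia").iterdir() if p.is_dir()}
    return sorted(control & dementia)

# Demo on a throwaway tree that mirrors the directory listings above.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for task in ("fluency", "cookie"):
        (root / "Control" / task).mkdir(parents=True)
    for task in ("recall", "sentence", "fluency", "cookie"):
        (root / "Dementia" / task).mkdir(parents=True)
    tasks = shared_tasks(root)

print(tasks)  # ['cookie', 'fluency']
```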

In [9]:
control_path = pitt_path/'Control'/'cookie'
dementia_path = pitt_path/'Dementia'/'cookie'

In [10]:
control_chats = pla.read_chat((control_path/'*.cha').as_posix(), encoding='utf-8')

In [11]:
control_chats.number_of_utterances(participant='PAR')

3145

In [12]:
dementia_chats = pla.read_chat((dementia_path/'*.cha').as_posix(), encoding='utf-8')
dementia_chats.number_of_utterances(participant='PAR')

3908

In [13]:
control_tagged_sents = control_chats.tagged_sents(participant='PAR')
dementia_tagged_sents = dementia_chats.tagged_sents(participant='PAR')
print(len(control_tagged_sents), len(dementia_tagged_sents))

3145 3908
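For reference, the class balance implied by these counts can be worked out directly. The numbers below are copied from the outputs above; nothing is recomputed from the corpus.

```python
# PAR utterance counts reported in the cells above
n_control, n_dementia = 3145, 3908
total = n_control + n_dementia

print(total)  # 7053
print(f"control: {n_control / total:.1%}, dementia: {n_dementia / total:.1%}")
# control: 44.6%, dementia: 55.4%
```

A classifier that always predicts the larger group would therefore reach about 55% sentence-level accuracy, which is a useful baseline to keep in mind.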


## Data preparation

Let's prepare sentence-level data with the samples we have for now. In my opinion, the sensible thing to do would be to train and classify on whole documents rather than isolated sentences, but let's go with sentences so we can compare results against the paper's.

In [14]:
control_sents = control_chats.sents(participant='PAR')
dementia_sents = dementia_chats.sents(participant='PAR')
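For the document-level alternative mentioned above, one possible approach is to join all sentences from the same transcript into a single text. This is a sketch on toy data, not what this notebook does: the `doc_id` column is hypothetical (for example, the source `.cha` file name) and would have to be collected while reading the transcripts.

```python
import pandas as pd

# Toy sentence-level rows; `doc_id` is a hypothetical per-transcript identifier.
sent_df = pd.DataFrame({
    "doc_id": ["a.cha", "a.cha", "b.cha"],
    "group": ["control", "control", "dementia"],
    "sentence": [
        ["the", "scene", "is", "in", "the", "kitchen", "."],
        ["it", "seems", "to", "be", "summer", "out", "."],
        ["you", "want", "more", "?"],
    ],
})

# Join tokens into sentences, then sentences into one text per document.
doc_df = (
    sent_df.assign(text=sent_df["sentence"].map(" ".join))
    .groupby(["doc_id", "group"], as_index=False)["text"]
    .agg(" ".join)
)
print(doc_df)
```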

In [15]:
control_df = pd.DataFrame({
    'sentence': control_sents,
    'tagged': control_tagged_sents,
    'group': 'control'
})
control_df.head()

|   | sentence | tagged | group |
|---|---|---|---|
| 0 | [the, scene, is, in, the, kitchen, .] | [(the, DET:ART, the, (1, 2, DET)), (scene, N, ... | control |
| 1 | [the, mother, is, wiping, dishes, and, the, wa... | [(the, DET:ART, the, (1, 2, DET)), (mother, N,... | control |
| 2 | [a, boy, is, trying, to, get, cookies, out, of... | [(a, DET:ART, a, (1, 2, DET)), (boy, N, boy, (... | control |
| 3 | [the, little, girl, is, reacting, to, his, fal... | [(the, DET:ART, the, (1, 3, DET)), (little, AD... | control |
| 4 | [it, seems, to, be, summer, out, .] | [(it, PRO:PER, it, (1, 2, SUBJ)), (seems, COP,... | control |


In [16]:
dementia_df = pd.DataFrame({
    'sentence': dementia_sents,
    'tagged': dementia_tagged_sents,
    'group': 'dementia'
})
dementia_df.head()

|   | sentence | tagged | group |
|---|---|---|---|
| 0 | [mhm, .] | [(mhm, CO, mhm=yes, (1, 0, INCROOT)), (., ., ,... | dementia |
| 1 | [alright, .] | [(alright, CO, alright, (1, 0, INCROOT)), (., ... | dementia |
| 2 | [there's, a, young, boy, that's, getting, a, c... | [(there's, PRO:EXIST, there, (1, 2, SUBJ)), (C... | dementia |
| 3 | [and, he's, in, bad, shape, because, the, thin... | [(and, COORD, and, (1, 3, LINK)), (he's, PRO:S... | dementia |
| 4 | [and, in, the, picture, the, mother, is, washi... | [(and, COORD, and, (1, 8, LINK)), (in, PREP, i... | dementia |


Assign roughly 20% of each group to validation, drawing independently for each group. This split is not used in the `pitt-cookie-ilmfit-sentence` notebook.

In [17]:
import random
val_pct = 20

In [18]:
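def assign_validation(df, val_pct=val_pct):
    # Flag each row as validation with probability val_pct / 100 (modifies df in place).
    # random.random() is uniform on [0, 1); randint(0, 100) covers 101 values and skews the rate.
    val_values = [random.random() * 100 < val_pct for _ in range(len(df))]
    df['is_validation'] = val_values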

In [19]:
assign_validation(control_df)
assign_validation(dementia_df)
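Because each row is drawn independently and no seed is set, the validation share is only approximately 20% and changes from run to run. If an exact, reproducible split is preferred, a variant based on `DataFrame.sample` could look like the sketch below. `assign_validation_exact` and its `seed` parameter are hypothetical names, and the demo uses toy data.

```python
import pandas as pd

def assign_validation_exact(df, val_pct=20, seed=42):
    """Return a copy of df with a fixed-size, seeded validation flag.

    Assumes df has a unique index, since rows are matched by index label.
    """
    out = df.copy()
    val_idx = out.sample(frac=val_pct / 100, random_state=seed).index
    out["is_validation"] = out.index.isin(val_idx)
    return out

# Toy frame with 50 rows: exactly 10 of them end up in validation.
toy = pd.DataFrame({"text": [f"s{i}" for i in range(50)], "group": "control"})
toy = assign_validation_exact(toy)
print(toy["is_validation"].sum())
```

Calling it once per group, as above, keeps the split stratified by group.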

### Flatten word lists to phrases

In [20]:
df = pd.concat([control_df, dementia_df])
len(df)

7053
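One detail worth knowing here: `pd.concat` keeps each frame's original index labels by default, so the combined frame contains repeated labels. Positional access with `iloc` is unaffected, but label-based access and the index written to CSV are. A small illustration on toy frames (not the notebook's data):

```python
import pandas as pd

a = pd.DataFrame({"group": ["control"] * 2})
b = pd.DataFrame({"group": ["dementia"] * 3})

stacked = pd.concat([a, b])
print(stacked.index.tolist())     # [0, 1, 0, 1, 2]: labels restart for the second frame

renumbered = pd.concat([a, b], ignore_index=True)
print(renumbered.index.tolist())  # [0, 1, 2, 3, 4]
```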

Note that the last token of each sentence is usually a terminator such as '.'. Keep it for now.

In [21]:
df['text'] = df.apply(lambda row: ' '.join(row.sentence), axis=1)

In [22]:
df.iloc[1].text

'the mother is wiping dishes and the water is running on the floor .'

In [23]:
df.iloc[15].text

'you want more ?'
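If the trailing terminator needs to be dropped later, it can be done at join time. The helper below is a hypothetical sketch; the set of terminators is an assumption based on the two examples shown above ('.' and '?').

```python
TERMINATORS = {".", "?", "!"}  # assumed set of utterance terminators

def join_tokens(tokens, keep_terminator=True):
    """Join tokens with spaces, optionally dropping a final terminator token."""
    if not keep_terminator and tokens and tokens[-1] in TERMINATORS:
        tokens = tokens[:-1]
    return " ".join(tokens)

with_mark = join_tokens(["you", "want", "more", "?"])
without_mark = join_tokens(["you", "want", "more", "?"], keep_terminator=False)
print(with_mark)     # you want more ?
print(without_mark)  # you want more
```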

Reorder columns

In [24]:
df = df[['group', 'is_validation', 'sentence', 'tagged', 'text']]
df.head()

|   | group | is_validation | sentence | tagged | text |
|---|---|---|---|---|---|
| 0 | control | False | [the, scene, is, in, the, kitchen, .] | [(the, DET:ART, the, (1, 2, DET)), (scene, N, ... | the scene is in the kitchen . |
| 1 | control | False | [the, mother, is, wiping, dishes, and, the, wa... | [(the, DET:ART, the, (1, 2, DET)), (mother, N,... | the mother is wiping dishes and the water is r... |
| 2 | control | False | [a, boy, is, trying, to, get, cookies, out, of... | [(a, DET:ART, a, (1, 2, DET)), (boy, N, boy, (... | a boy is trying to get cookies out of a jar an... |
| 3 | control | False | [the, little, girl, is, reacting, to, his, fal... | [(the, DET:ART, the, (1, 3, DET)), (little, AD... | the little girl is reacting to his falling . |
| 4 | control | False | [it, seems, to, be, summer, out, .] | [(it, PRO:PER, it, (1, 2, SUBJ)), (seems, COP,... | it seems to be summer out . |


Save to `csv`

In [25]:
model_path = pitt_path.parent/'models'
model_path.mkdir(exist_ok=True)

In [26]:
df.to_csv(model_path/'pitt-cookie.csv')
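When this file is read back, two things are easy to miss: the saved index comes back as an extra `Unnamed: 0` column unless `index_col=0` is passed, and list-valued columns such as `sentence` come back as plain strings. The sketch below shows one way to handle both, using an in-memory buffer and a small stand-in frame instead of the real file. Parsing with `ast.literal_eval` is an assumption that holds for lists of plain strings; it is not verified here for the `tagged` column.

```python
import ast
import io
import pandas as pd

# Small stand-in for the frame saved above.
demo_df = pd.DataFrame({
    "group": ["control"],
    "is_validation": [False],
    "sentence": [["you", "want", "more", "?"]],
    "text": ["you want more ?"],
})

buffer = io.StringIO()
demo_df.to_csv(buffer)  # the index is written as an unnamed first column
buffer.seek(0)

loaded = pd.read_csv(
    buffer,
    index_col=0,                                # restore the saved index
    converters={"sentence": ast.literal_eval},  # parse the list back from its string form
)
print(loaded.loc[0, "sentence"])
```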