# Prolexitim Exploratory Analysis (Prolex-Explore)
## Visualizing Doc2Vec Feature Distribution and Discrimination
### Dataset Preparation: Classes, Subclasses, and Word2Vec Features
<hr>
Dataset from Prolexitim TAS-20 Spain, Prolexitim NLP.
Sept 2019.<br> Prolexitim dataset version 1.2 (MPGS-TFM-Submission).<br> 
<a target="_blank" href="http://www.conscious-robots.com/papers/TFM_MPGS_Arrabales_vWeb.pdf">Arrabales, R. 2019. Artificial Intelligence Tools for the Evaluation and Treatment of Alexithymia.</a><br> <br>
Raúl Arrabales Moreno (Psicobótica / Serendeepia Research)<br>
<a target="_blank" href="http://www.conscious-robots.com/">http://www.conscious-robots.com/</a> <br>
<hr>


## Load Documents (texts + labels)
- Documents are obtained from the Prolexitim Pilot Study.
    - Texts are narratives from Prolexitim NLP.
    - Labels are categorical values from Prolexitim TAS-20.


### Loading the Prolexitim TAS-20 + NLP Dataset

In [6]:
import pandas as pd 

In [7]:
# Local copy of the Prolexitim merged table: TAS-20 categorical labels joined with the narratives from Prolexitim NLP
tasnlp_dataset_path = "D:\\Dropbox-Array2001\\Dropbox\\UNI\\MPGS\\2_TFM\\Datos\\prolexitim-merged-1.3.csv"

In [8]:
docs_df = pd.read_csv(tasnlp_dataset_path, header=0, delimiter="\t")

In [9]:
docs_df.columns

Index(['RowId', 'code', 'card', 'hum', 'mode', 'time', 'G-score',
       'G-magnitude', 'Azure-TA', 'Text', 'Text-EN', 'nlu-sentiment',
       'nlu-label', 'nlu-joy', 'nlu-anger', 'nlu-disgust', 'nlu-sadness',
       'nlu-fear', 'es-len', 'en-len', 'NLP', 'TAS20', 'F1', 'F2', 'F3',
       'Tas20Time', 'Sex', 'Gender', 'Age', 'Dhand', 'Studies', 'SClass',
       'Siblings', 'SibPos', 'Origin', 'Resid', 'Rtime', 'Ethnic', 'Job',
       'alex-a', 'alex-b'],
      dtype='object')

In [10]:
docs_df.sample(2)

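|     | RowId | code | card | hum | mode | time | G-score | G-magnitude | Azure-TA | Text | ... | SClass | Siblings | SibPos | Origin | Resid | Rtime | Ethnic | Job | alex-a | alex-b |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 21 | 22 | 29d3a4409792f6a28f9eefbdd7ebcd37 | 3VH | 1 | T | 200000 | -0.8 | 0.8 | 0.0 | Esto es una señora que le acaban de dar una ma... | ... | 2.0 | 1.0 | 1.0 | ES | ES | -1.0 | Iberic | Teacher | NoAlex | NoAlex |
| 99 | 100 | 1384903a9d83e20ccd863fe23876e919 | 9VH | 4 | W | 264669 | 0.1 | 0.2 | 0.37 | Un soldado que llevaba mucho tiempo fuera de c... | ... | 2.0 | 3.0 | 1.0 | ES | ES | -1.0 | Iberic | Psychology | NoAlex | NoAlex |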


In [11]:
# We're only interested in the Spanish text, the corresponding alexithymia label, and the visual stimulus presented (card)
docs_df = docs_df.dropna()
docs_df = docs_df[['Text', 'alex-a', 'card']]
docs_df.columns = ['Text', 'AlexLabel', 'card']
docs_df.sample(n=6)

|     | Text | AlexLabel | card |
| --- | --- | --- | --- |
| 283 | Distintos forajidos del medio oeste se encuent... | NoAlex | 9VH |
| 322 | Érase una vez un niño que debía acudir a clase... | NoAlex | 1 |
| 5 | Erase una vez un niño que se encontraba muy tr... | Alex | 1 |
| 50 | Esto es una canoa de unas personas que han ido... | NoAlex | 12VN |
| 252 | Érase algo | PosAlex | 11 |
| 222 | Un niño al que sus padres le han obligado a to... | NoAlex | 1 |
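
Note that `dropna()` without arguments discards a row when any column has a missing value, including columns that are dropped right afterwards. The following is a minimal sketch of an alternative that restricts the check to the three columns of interest. The toy frame is hypothetical: the column names follow the dataset, but the rows and the `Job` values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: column names follow the dataset, values are made up.
toy_df = pd.DataFrame({
    "Text": ["Un niño triste", "Un paisaje", None],
    "alex-a": ["Alex", "NoAlex", "NoAlex"],
    "card": ["1", "11", "13HM"],
    "Job": ["Teacher", np.nan, "Nurse"],  # unrelated column with a gap
})

cols = ["Text", "alex-a", "card"]

# Drop rows with a missing value in ANY column (what the cell above does).
strict = toy_df.dropna()[cols]

# Drop rows only when one of the three columns of interest is missing.
lenient = toy_df.dropna(subset=cols)[cols]

print(len(strict), len(lenient))
```

Whether the stricter or the more lenient filter is appropriate depends on the study design; the sketch only shows that the two can keep different numbers of rows.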


In [12]:
# As expected, we have a very unbalanced dataset (approx. 10% alexithymia)
docs_df.groupby(by='AlexLabel').count()

| AlexLabel | Text | card |
| --- | --- | --- |
| Alex | 31 | 31 |
| NoAlex | 240 | 240 |
| PosAlex | 45 | 45 |
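
To make the imbalance explicit, the counts above can be turned into percentages. This is a small sketch that re-enters the three counts by hand rather than reading the dataset; on the full frame, `docs_df['AlexLabel'].value_counts(normalize=True)` should give the same proportions.

```python
import pandas as pd

# Label counts copied from the groupby output above.
counts = pd.Series({"Alex": 31, "NoAlex": 240, "PosAlex": 45})

# Share of each label as a percentage of all texts.
shares = (100 * counts / counts.sum()).round(1)
print(shares)
```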


In [13]:
# We treat Possible Alexithymia (PosAlex) and Alexithymia (Alex) as a single positive class
docs_df['AlexLabel'] = docs_df['AlexLabel'].apply(lambda x: x.replace('PosAlex', 'Alex'))

In [14]:
docs_df.groupby(by='AlexLabel').count()

| AlexLabel | Text | card |
| --- | --- | --- |
| Alex | 76 | 76 |
| NoAlex | 240 | 240 |
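
The label merge can also be expressed with `Series.replace`, which matches whole values instead of substrings. A minimal sketch on a made-up label series (the four values below are illustrative only):

```python
import pandas as pd

# Hypothetical label series using the three label values seen in the dataset.
labels = pd.Series(["Alex", "NoAlex", "PosAlex", "NoAlex"], name="AlexLabel")

# Map PosAlex onto Alex; every other value is left untouched.
merged = labels.replace({"PosAlex": "Alex"})
print(merged.tolist())
```

For these three labels the result is the same as with the `apply`/`str.replace` version; the value-based form simply avoids accidental substring matches if new labels are added later.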


In [15]:
# Check cards distribution
docs_df.groupby(by='card').count()

| card | Text | AlexLabel |
| --- | --- | --- |
| 1 | 67 | 67 |
| 10 | 6 | 6 |
| 11 | 70 | 70 |
| 12VN | 9 | 9 |
| 13HM | 65 | 65 |
| 13N | 6 | 6 |
| 13V | 6 | 6 |
| 18NM | 4 | 4 |
| 3VH | 7 | 7 |
| 7VH | 6 | 6 |


In [16]:
# Keep only the four main cards (the ones actually presented to all participants)
viz_df = docs_df[(docs_df.card == '9VH') | 
                 (docs_df.card == '13HM') |
                 (docs_df.card == '11') | 
                 (docs_df.card == '1')]

In [17]:
viz_df.groupby(by='card').count()

| card | Text | AlexLabel |
| --- | --- | --- |
| 1 | 67 | 67 |
| 11 | 70 | 70 |
| 13HM | 65 | 65 |
| 9VH | 69 | 69 |
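
The chained `|` comparisons used for the card filter can be written more compactly with `Series.isin`. A minimal sketch on a made-up frame (the `main_cards` name and the toy rows are assumptions for illustration):

```python
import pandas as pd

# Hypothetical toy frame with a mix of main and secondary cards.
toy_df = pd.DataFrame({
    "card": ["1", "10", "11", "13HM", "9VH", "3VH"],
    "AlexLabel": ["Alex", "NoAlex", "NoAlex", "Alex", "NoAlex", "NoAlex"],
})

# The four cards kept for visualization.
main_cards = ["1", "11", "13HM", "9VH"]

# Keep only the rows whose card is one of the four main cards.
subset = toy_df[toy_df["card"].isin(main_cards)]
print(subset["card"].tolist())
```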


In [18]:
viz_df.sample(4)

|     | Text | AlexLabel | card |
| --- | --- | --- | --- |
| 132 | Me recuerda al paisaje donde yo vivo, esos bar... | NoAlex | 11 |
| 215 | un grupo de amigos que se fueron a comer y a p... | NoAlex | 9VH |
| 145 | un hombre que quisiéra seguir compartiendo el ... | NoAlex | 13HM |
| 322 | Érase una vez un niño que debía acudir a clase... | NoAlex | 1 |


In [19]:
# Generate subclasses from (AlexLabel, card) pairs
viz_df = viz_df.copy()
viz_df['SubClass'] = viz_df['AlexLabel'] + '-' + viz_df['card']

In [20]:
# There should be 8 subclasses (4 positives and 4 negatives)
viz_df.groupby(by='SubClass').count()

| SubClass | Text | AlexLabel | card |
| --- | --- | --- | --- |
| Alex-1 | 18 | 18 | 18 |
| Alex-11 | 18 | 18 | 18 |
| Alex-13HM | 16 | 16 | 16 |
| Alex-9VH | 17 | 17 | 17 |
| NoAlex-1 | 49 | 49 | 49 |
| NoAlex-11 | 52 | 52 | 52 |
| NoAlex-13HM | 49 | 49 | 49 |
| NoAlex-9VH | 52 | 52 | 52 |
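
As a quick consistency check, the eight subclass counts can be arranged per card to confirm that they add up to the per-card totals and to see the share of positive texts for each stimulus. The sketch below re-enters the counts from the output above by hand; the `Total` and `AlexShare` column names are introduced here for illustration.

```python
import pandas as pd

# Subclass counts copied from the groupby output above, one row per card.
counts = pd.DataFrame(
    {"Alex": [18, 18, 16, 17], "NoAlex": [49, 52, 49, 52]},
    index=["1", "11", "13HM", "9VH"],
)

# Texts per card and percentage of positive (Alex) texts per card.
counts["Total"] = counts["Alex"] + counts["NoAlex"]
counts["AlexShare"] = (100 * counts["Alex"] / counts["Total"]).round(1)
print(counts)
```

On the real frame, `pd.crosstab(viz_df['card'], viz_df['AlexLabel'])` should produce the same Alex/NoAlex layout directly.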


In [21]:
viz_df.sample(4)

|     | Text | AlexLabel | card | SubClass |
| --- | --- | --- | --- | --- |
| 114 | Erase una vez un niño llamado Antonio que se a... | Alex | 1 | Alex-1 |
| 240 | Un acantilado rocoso y solitario | Alex | 11 | Alex-11 |
| 303 | Obreros adescando tras un lardo día de arar la... | Alex | 9VH | Alex-9VH |
| 268 | Las cuevas de Altamira 2.0 | NoAlex | 11 | NoAlex-11 |


In [22]:
viz_df.shape

(271, 4)

## Save Visualization Dataset

In [23]:
viz_dataset_path = "D:\\Dropbox-Array2001\\Dropbox\\UNI\\MPGS\\2_TFM\\Datos\\prolexitim-viz-1.2.csv"

In [24]:
viz_df.to_csv(viz_dataset_path, sep='\t', encoding='utf-8', index=False)
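
When the saved file is read back, it may be worth checking that the `card` column is still text, since a column holding only values such as `1` and `11` would be parsed as integers by default. The following is a hedged sketch of a round trip through an in-memory buffer instead of the real file; the two rows reuse short narratives shown earlier, and passing `dtype={"card": str}` is a suggestion, not something the original notebook does.

```python
import io

import pandas as pd

# Two-row toy frame; the narratives are short samples shown earlier.
toy_df = pd.DataFrame({
    "Text": ["Un acantilado rocoso y solitario", "Las cuevas de Altamira 2.0"],
    "AlexLabel": ["Alex", "NoAlex"],
    "card": ["11", "11"],
    "SubClass": ["Alex-11", "NoAlex-11"],
})

# Write tab-separated text to an in-memory buffer, as with to_csv above.
buffer = io.StringIO()
toy_df.to_csv(buffer, sep="\t", index=False)

# Read it back, forcing the card column to stay a string.
buffer.seek(0)
restored = pd.read_csv(buffer, sep="\t", dtype={"card": str})
print(restored["card"].tolist())
```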