# Tuning the dataset and splitting it into train and test sets

We are going to use the following modules:

In [1]:
import pandas as pd

First, we load the data:

In [2]:
data = pd.read_csv('../data/prolexitim-merged-1.3.csv', header=0, delimiter="\t")

In [3]:
data.head()

Unnamed: 0,RowId,code,card,hum,mode,time,G-score,G-magnitude,Azure-TA,Text,...,SClass,Siblings,SibPos,Origin,Resid,Rtime,Ethnic,Job,alex-a,alex-b
0,1,b7adde8a9eec8ce92b5ee0507ce054a4,13V,1,T,200000,-0.2,0.2,0.62,Era un niño pensando en el granero pensando a ...,...,2.0,5.0,3.0,ES,ES,-1.0,Iberic,Manager,NoAlex,NoAlex
1,2,b7adde8a9eec8ce92b5ee0507ce054a4,18NM,2,T,200000,-0.5,0.5,0.41,"Una madre que está consolando a su hijo, despu...",...,2.0,5.0,3.0,ES,ES,-1.0,Iberic,Manager,NoAlex,NoAlex
2,3,b7adde8a9eec8ce92b5ee0507ce054a4,12VN,0,T,200000,0.0,1.2,0.63,Un pantanal con una barca abandonada. A ver qu...,...,2.0,5.0,3.0,ES,ES,-1.0,Iberic,Manager,NoAlex,NoAlex
3,4,76ef63369f7d5b6597a543017e1ef578,12VN,0,T,200000,0.0,0.1,0.89,"Era un paraje muy bonito, con una barca, un po...",...,2.0,3.0,3.0,ES,ES,-1.0,Iberic,Retired,Alex,Alex
4,5,76ef63369f7d5b6597a543017e1ef578,10,2,T,200000,0.3,0.1,0.24,"Era una vez un matrimonio, que se quería muchí...",...,2.0,3.0,3.0,ES,ES,-1.0,Iberic,Retired,Alex,Alex


In [6]:
data.columns

Index(['RowId', 'code', 'card', 'hum', 'mode', 'time', 'G-score',
       'G-magnitude', 'Azure-TA', 'Text', 'Text-EN', 'nlu-sentiment',
       'nlu-label', 'nlu-joy', 'nlu-anger', 'nlu-disgust', 'nlu-sadness',
       'nlu-fear', 'es-len', 'en-len', 'NLP', 'TAS20', 'F1', 'F2', 'F3',
       'Tas20Time', 'Sex', 'Gender', 'Age', 'Dhand', 'Studies', 'SClass',
       'Siblings', 'SibPos', 'Origin', 'Resid', 'Rtime', 'Ethnic', 'Job',
       'alex-a', 'alex-b'],
      dtype='object')

We drop the rows that have empty fields and copy the alex-a tag into a new AlexLabel column:

In [4]:
data = data.dropna()
data['AlexLabel'] = data['alex-a']
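
Called without arguments, `dropna()` discards a row as soon as any of its columns is missing, so it can be worth counting the missing values per column first. The following is a minimal sketch on a toy frame: the values are made up for illustration, and only the column names `Text`, `alex-a` and `Rtime` come from the dataset.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values; this is not the real dataset.
toy = pd.DataFrame({
    "Text": ["a story", None, "another story", "a third story"],
    "alex-a": ["Alex", "NoAlex", None, "NoAlex"],
    "Rtime": [1.0, np.nan, 2.0, np.nan],
})

# Number of missing values in each column.
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Default behaviour: drop a row if any column is missing.
strict = toy.dropna()

# Alternative: drop a row only if one of the selected columns is missing.
lenient = toy.dropna(subset=["Text", "alex-a"])

print(len(toy), len(strict), len(lenient))
```

Whether the stricter or the more lenient variant is appropriate depends on which columns the later analysis actually needs.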

We merge the texts classified as alexithymic (Alex) with those classified as possibly alexithymic (PosAlex):

In [5]:
data['AlexLabel'] = data['AlexLabel'].apply(lambda x: x.replace('PosAlex', 'Alex'))
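
The same merge can also be written without a lambda by mapping whole values with `Series.replace`. This is a small sketch on made-up labels, not the notebook's own cell:

```python
import pandas as pd

# Made-up labels for illustration; 'PosAlex' marks a possible case.
labels = pd.Series(["Alex", "PosAlex", "NoAlex", "PosAlex"])

# Map whole values: 'PosAlex' becomes 'Alex', everything else is unchanged.
merged = labels.replace({"PosAlex": "Alex"})
print(merged.tolist())
```

Unlike the substring-based `str.replace`, this variant only touches cells whose entire value equals `'PosAlex'`.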

We check how many alexithymic cases there are:

In [6]:
data.groupby(by='AlexLabel').count()['RowId']

AlexLabel
Alex       76
NoAlex    240
Name: RowId, dtype: int64

Merging the possible and positive cases balances the dataset a little, although it is still unbalanced: 76 of the 316 rows (about 24%) are labelled as alexithymic.
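
The class proportions can be computed directly with `value_counts(normalize=True)`. The sketch below rebuilds the label column from the counts printed above rather than reading the real file:

```python
import pandas as pd

# Label counts taken from the output above: 76 Alex and 240 NoAlex rows.
labels = pd.Series(["Alex"] * 76 + ["NoAlex"] * 240, name="AlexLabel")

# Relative frequency of each label.
proportions = labels.value_counts(normalize=True)
print(proportions.round(3))
```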

We generate subclasses from the pairs (AlexLabel, card):

In [7]:
data['SubClass'] = data['AlexLabel'] + '-' + data['card']

We inspect the result and save this dataset:

In [8]:
data.head()

Unnamed: 0,RowId,code,card,hum,mode,time,G-score,G-magnitude,Azure-TA,Text,...,SibPos,Origin,Resid,Rtime,Ethnic,Job,alex-a,alex-b,AlexLabel,SubClass
0,1,b7adde8a9eec8ce92b5ee0507ce054a4,13V,1,T,200000,-0.2,0.2,0.62,Era un niño pensando en el granero pensando a ...,...,3.0,ES,ES,-1.0,Iberic,Manager,NoAlex,NoAlex,NoAlex,NoAlex-13V
1,2,b7adde8a9eec8ce92b5ee0507ce054a4,18NM,2,T,200000,-0.5,0.5,0.41,"Una madre que está consolando a su hijo, despu...",...,3.0,ES,ES,-1.0,Iberic,Manager,NoAlex,NoAlex,NoAlex,NoAlex-18NM
2,3,b7adde8a9eec8ce92b5ee0507ce054a4,12VN,0,T,200000,0.0,1.2,0.63,Un pantanal con una barca abandonada. A ver qu...,...,3.0,ES,ES,-1.0,Iberic,Manager,NoAlex,NoAlex,NoAlex,NoAlex-12VN
3,4,76ef63369f7d5b6597a543017e1ef578,12VN,0,T,200000,0.0,0.1,0.89,"Era un paraje muy bonito, con una barca, un po...",...,3.0,ES,ES,-1.0,Iberic,Retired,Alex,Alex,Alex,Alex-12VN
4,5,76ef63369f7d5b6597a543017e1ef578,10,2,T,200000,0.3,0.1,0.24,"Era una vez un matrimonio, que se quería muchí...",...,3.0,ES,ES,-1.0,Iberic,Retired,Alex,Alex,Alex,Alex-10


In [9]:
data.shape

(316, 43)

In [11]:
data.to_csv('../data/prolexitim_analytics.csv', index=False)

### Splitting into train and test sets

Since there are only 316 samples, we use 90% of the rows for training and the remaining 10% for testing. This may change in the future when we have a bigger dataset:

In [12]:
data_train = data.sample(frac=0.9, random_state=233)

In [13]:
data_test = data.drop(data_train.index)
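
With an unbalanced label, a plain random sample can leave the test set with a label ratio that differs from the full data. One possible alternative, which is not what this notebook does, is to sample 90% of the rows within each `AlexLabel` group using `DataFrameGroupBy.sample` (available since pandas 1.1). This is a sketch on a synthetic frame with the same label counts; `toy`, `toy_train` and `toy_test` are illustrative names.

```python
import pandas as pd

# Synthetic stand-in with the same label counts as the dataset (76 / 240).
toy = pd.DataFrame({
    "RowId": range(1, 317),
    "AlexLabel": ["Alex"] * 76 + ["NoAlex"] * 240,
})

# Sample 90% of the rows inside each label group.
toy_train = toy.groupby("AlexLabel").sample(frac=0.9, random_state=233)

# Everything that was not sampled goes to the test set.
toy_test = toy.drop(toy_train.index)

# Compare the label counts in both parts.
print(toy_train["AlexLabel"].value_counts())
print(toy_test["AlexLabel"].value_counts())
```

The same idea could be applied to the `SubClass` column, provided every subclass has enough rows for some of them to end up in the test set.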

We store the results:

In [14]:
data_train.to_csv('../data/train.csv', index=False)

In [15]:
data_test.to_csv('../data/test.csv', index=False)