This notebook is heavily inspired by [*Working With Text Data*](https://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html).

Note: predictive performance is poor (the test-set correlations at the end of this notebook range from about -0.03 to 0.20). This notebook is only meant to build up a workflow.

Potential areas of improvement:

- try other text representations (ways of turning words into numbers)
- try different models
- use the responses to all five questions instead of "one-on-one" prediction
- if keeping "one-on-one" prediction, explore building different models for different traits
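
As a rough illustration of the third idea, the sketch below feeds all five response columns into one model with `ColumnTransformer`, giving each question its own term-frequency vocabulary. The toy sentences, the toy scores, and the names `toy_X`, `toy_y`, `all_questions`, and `combined_model` are made up for this example; this is one possible setup, not a tested improvement.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import SVR

# Hypothetical toy data standing in for the five renamed response columns.
toy_X = pd.DataFrame({
    'A': ["I would help my colleague", "I would refuse politely",
          "we should talk it over", "I would stay quiet"],
    'C': ["I plan every task", "I finish things late",
          "I make a checklist", "I improvise mostly"],
    'E': ["I enjoy meeting people", "I prefer working alone",
          "I like big events", "I avoid large groups"],
    'N': ["I rarely worry", "I get stressed easily",
          "I stay calm", "I feel anxious often"],
    'O': ["I love new ideas", "I stick to routine",
          "I try unusual approaches", "I prefer familiar methods"],
})
toy_y = pd.Series([4.0, 2.5, 3.5, 3.0])  # hypothetical scores for one trait

# One vectorizer per question; passing the column name as a plain string
# hands each vectorizer a 1-D column of text.
all_questions = ColumnTransformer([
    (col, TfidfVectorizer(use_idf=False), col) for col in toy_X.columns
])

combined_model = Pipeline([
    ('features', all_questions),
    ('model', SVR(kernel='linear')),
])

combined_model.fit(toy_X, toy_y)
features = all_questions.fit_transform(toy_X)
pred = combined_model.predict(toy_X)
print(features.shape, pred.shape)
```

With the real data, `toy_X` and `toy_y` would be replaced by the training frame and a single trait column.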

## Import necessary packages

In [7]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.decomposition import TruncatedSVD
from sklearn.svm import SVR

random_seed = 8424  # fixed seed so the train/test split and the SVD step are reproducible

## Load and prepare data

In [8]:
data = pd.read_csv("training_data_participant/siop_ml_train_participant.csv")

In [9]:
# .copy() avoids a possible SettingWithCopyWarning when renaming in place below
X = data.iloc[:, 1:6].copy()
y = data.iloc[:, 6:11].copy()

# rename columns for easier access
X.rename(columns={'open_ended_1': 'A',
                  'open_ended_2': 'C',
                  'open_ended_3': 'E',
                  'open_ended_4': 'N', 
                  'open_ended_5': 'O'}, inplace=True)

y.rename(columns={'E_Scale_score': 'E',
                  'A_Scale_score': 'A',
                  'O_Scale_score': 'O',
                  'C_Scale_score': 'C',
                  'N_Scale_score': 'N'}, inplace=True)

In [10]:
X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                    random_state=random_seed,
                                                    test_size=0.2,
                                                    shuffle=True)

In [29]:
X_train.shape

(870, 5)

## Procedure

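**1. Transform text into numbers:** Each response is represented by its L2-normalized term frequencies (`use_idf=False`, so no inverse-document-frequency weighting), then reduced to 3 components with truncated SVD

**2. Train models:** Support vector regression with a linear kernel

**3. Evaluation:** Pearson correlation between predicted and observed scores on the test set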
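
To make step 1 concrete, here is a minimal sketch on a hypothetical two-document corpus (the sentences and the names `docs`, `counts`, and `tf` are illustrative only). It shows that `CountVectorizer` followed by `TfidfTransformer(use_idf=False)` yields raw word counts scaled to unit L2 length per document.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Hypothetical toy corpus, not taken from the competition data.
docs = ["good good team", "team work"]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs)           # raw word counts
tf = TfidfTransformer(use_idf=False).fit_transform(counts).toarray()

# Vocabulary is sorted alphabetically: good, team, work
print(sorted(vectorizer.vocabulary_))
print(counts.toarray())   # [[2 1 0], [0 1 1]]
print(tf)                 # each row divided by its L2 norm
```

The first row becomes `[2, 1, 0] / sqrt(5)` and the second `[0, 1, 1] / sqrt(2)`, so longer responses do not get larger feature values just for being longer.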

In [75]:
model = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer(use_idf=False)),
    ('svd', TruncatedSVD(random_state=random_seed, n_components=3)),
    ('model', SVR(kernel='linear'))
])

In [76]:
# Here I use only the responses to one question to predict the corresponding trait
for trait in ['O', 'C', 'E', 'A', 'N']:
    model.fit(X_train[trait], y_train[trait])
    y_pred = model.predict(X_test[trait])
    r = np.corrcoef(y_pred, y_test[trait])[0, 1] 
    print((trait, r))

('O', 0.12216577668758352)
('C', 0.11644799035988904)
('E', 0.06505485299866633)
('A', 0.20101985663402858)
('N', -0.03496962446907218)
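
For reference, the quantity reported above is the off-diagonal entry of `np.corrcoef`, which is the Pearson correlation. The sketch below recomputes it from the definition on made-up numbers (`y_pred_toy`, `y_true_toy`, and `pearson_r` are hypothetical names, not part of the notebook) to show what the metric measures.

```python
import numpy as np

# Hypothetical predictions and observed scores, for illustration only.
y_pred_toy = np.array([3.1, 2.9, 3.4, 3.0, 3.6])
y_true_toy = np.array([3.0, 2.5, 4.0, 3.5, 3.8])

def pearson_r(a, b):
    """Pearson correlation: covariance divided by the product of spreads."""
    a_c = a - a.mean()
    b_c = b - b.mean()
    return (a_c * b_c).sum() / np.sqrt((a_c ** 2).sum() * (b_c ** 2).sum())

r_manual = pearson_r(y_pred_toy, y_true_toy)
r_numpy = np.corrcoef(y_pred_toy, y_true_toy)[0, 1]
print(r_manual, r_numpy)  # the two values agree up to floating-point error
```

Because both vectors are centered and scaled, the metric is unchanged if the predictions are shifted or multiplied by a positive constant. It rewards getting the ordering and relative spacing of scores right, not their absolute level.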
