## Calculate Leisure Enjoyment Index

Here we calculate the final indicator, the Leisure Enjoyment Index (LE), which enters the econometric models presented in the paper. The LE is a linear combination of personality traits, each with an optimally chosen weight. To find the weights, we treat the task as a least-squares optimization problem with two constraints: (1) the weights must be non-negative, and (2) they must sum to 1.
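
The two constraints can also be imposed directly in a single solve. The sketch below is not the procedure used in this notebook (which fits non-negative regressions and rescales the coefficients afterwards); it is a minimal illustration on synthetic data, and the helper name `simplex_least_squares` and the demo arrays are hypothetical.

```python
import numpy as np
from scipy.optimize import minimize


def simplex_least_squares(X, y):
    """Least squares with weights constrained to be >= 0 and to sum to 1."""
    n_features = X.shape[1]

    def loss(w):
        residual = X @ w - y
        return residual @ residual

    def grad(w):
        return 2 * X.T @ (X @ w - y)

    # Start from equal weights, which already satisfy both constraints
    w0 = np.full(n_features, 1.0 / n_features)
    result = minimize(
        loss,
        w0,
        jac=grad,
        method="SLSQP",
        bounds=[(0.0, 1.0)] * n_features,  # constraint (1): non-negative
        constraints=[{"type": "eq", "fun": lambda w: np.sum(w) - 1.0}],  # constraint (2)
    )
    return result.x


# Synthetic data (assumption: stands in for normalized traits and outcome)
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 1, size=(200, 5))
true_w = np.array([0.0, 0.1, 0.0, 0.3, 0.6])
y_demo = X_demo @ true_w + rng.normal(0, 0.01, size=200)

w_hat = simplex_least_squares(X_demo, y_demo)
print(w_hat, w_hat.sum())
```

Solving under both constraints at once avoids the separate rescaling step, at the cost of assuming the inputs and the outcome are on comparable scales.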

In [168]:
import numpy as np
import pandas as pd
import scipy.optimize

from sklearn import linear_model
from sklearn.impute import SimpleImputer
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold, cross_val_score, train_test_split

In [5]:
# Read data
main = pd.read_csv('main_demo.csv')

### Filter and preprocess the data

We restrict the main dataset, which contains variables related to leisure engagement, to the inputs and output needed here. The five personality traits form the input matrix 'X', and the output vector 'y' holds the value of the Leisure Index for each observation.

In [129]:
# Filter personality traits and outcome variable
X = main[['ND8EXT', 'ND8AGR', 'ND8CON', 'ND8EMO', 'ND8INT']].values
y = main['LeisureIndex'].values

In [130]:
# Impute missing values with the column mean
imputer = SimpleImputer(strategy='mean')

# SimpleImputer expects 2-D input, so reshape y into a single column
y = y.reshape(-1, 1)

# Fit and apply the imputer separately to the inputs and the outcome
X_imputed = imputer.fit_transform(X)
y_imputed = imputer.fit_transform(y)

In [199]:
# Normalize X and y, then check the largest normalized trait value
norm_X = (X_imputed - 5) / 45
norm_y = (y_imputed - 1) / 5
np.max(norm_X)

0.8888888888888888
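
The constants in the cell above read like a fixed-range rescaling. The sketch below assumes trait scores lie on a 5 to 50 scale, which is inferred from the offsets 5 and 45 and is not stated in the notebook; the helper `rescale` and the sample scores are hypothetical. Under that assumption a raw score of 45 maps to the maximum printed above.

```python
import numpy as np


def rescale(values, low, high):
    """Map values from the interval [low, high] onto [0, 1]."""
    values = np.asarray(values, dtype=float)
    return (values - low) / (high - low)


# Assumed scale bounds: 5 (lowest possible score) and 50 (highest possible score)
trait_scores = np.array([5.0, 26.0, 45.0, 50.0])
scaled = rescale(trait_scores, low=5, high=50)
print(scaled)
```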

## Solve the Optimization Problem

### 1. SciPy's nnls()

Since this is a prediction problem and we want weights that capture the true relationship between personality traits and leisure engagement, we need cross-validation to obtain generalizable results. Initial analysis suggests that the coefficients are sensitive to the data sample, which further increases the need for cross-validation.

In [188]:
# Split the imputed data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X_imputed, y_imputed, test_size=0.2, random_state=42)

# Repeat a hold-out evaluation five times
# (random_state is fixed, so every iteration draws the same split)
fold_scores = []
for _ in range(5):
    # Split the training set into training and validation sets
    X_train_fold, X_val_fold, y_train_fold, y_val_fold = train_test_split(
        X_train, y_train, test_size=0.2, random_state=61)

    # Train the model using nnls on the training fold
    coefficients, _ = scipy.optimize.nnls(X_train_fold, np.squeeze(y_train_fold))

    # Predict the target variable for the validation fold
    y_pred_val_fold = np.dot(X_val_fold, coefficients)

    # Calculate the performance metric for the validation fold
    fold_score = mean_squared_error(y_val_fold, y_pred_val_fold)
    fold_scores.append(fold_score)

# Calculate the average performance across the folds
mean_score = np.mean(fold_scores)

# Print the mean performance
print("Mean squared error:", mean_score)

coefs_nnls = np.asarray([0, 0.01260824, 0, 0.03263554, 0.0529556])
print(np.sum(coefs_nnls))
type(coefs_nnls)

Mean squared error: 0.22692153658812378
0.09819938


numpy.ndarray
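
For comparison, a K-fold variant of the same idea is sketched below, so that each iteration validates on a different part of the data. It runs on synthetic data; the helper `nnls_kfold` and the demo arrays are hypothetical and not part of the notebook, and the numbers it prints are unrelated to the results above.

```python
import numpy as np
from scipy.optimize import nnls
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold


def nnls_kfold(X, y, n_splits=5, seed=42):
    """Fit NNLS on each training fold and score it on the held-out fold."""
    kfold = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    fold_mse, fold_coefs = [], []
    for train_idx, val_idx in kfold.split(X):
        # nnls returns the coefficient vector and the residual norm
        coefs, _ = nnls(X[train_idx], y[train_idx])
        fold_mse.append(mean_squared_error(y[val_idx], X[val_idx] @ coefs))
        fold_coefs.append(coefs)
    return np.array(fold_mse), np.array(fold_coefs)


# Synthetic data (assumption: five traits on a 5-50 scale, 1-D outcome)
rng = np.random.default_rng(0)
X_demo = rng.uniform(5, 50, size=(300, 5))
y_demo = X_demo @ np.array([0.0, 0.01, 0.0, 0.03, 0.05]) + rng.normal(0, 0.5, size=300)

mse_per_fold, coefs_per_fold = nnls_kfold(X_demo, y_demo)
print("Mean MSE:", mse_per_fold.mean())
print("Coefficient spread across folds:", coefs_per_fold.std(axis=0))
```

Keeping the per-fold coefficients makes it possible to inspect how sensitive the weights are to the sample, which is the concern raised at the start of this section.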

### 2. Linear Regression using scikit-learn

In [183]:
# Create a linear regression model with non-negative coefficients and fit it on the full data
model = linear_model.LinearRegression(positive=True)
model.fit(X_imputed, y_imputed)

# Define the cross-validation method (e.g., 5-fold cross-validation)
kfold = KFold(n_splits=5, shuffle=True, random_state=42)

# Perform cross-validation and calculate the mean performance
scores = cross_val_score(model, X_imputed, y_imputed, cv=kfold, scoring='r2')
mean_score = scores.mean()

# Print the mean performance
print("Mean R^2 score:", mean_score)
coefs = model.coef_

print(coefs)
print(np.sum(coefs))
type(coefs)

Mean R^2 score: 0.3980397938171841
[[0.         0.01088561 0.         0.0324213  0.05334985]]
0.09665675238553065


numpy.ndarray
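
One difference between the first two methods is the intercept: `nnls()` fits none, while `LinearRegression` fits one by default, which may account for the small gap between the two coefficient vectors. The sketch below illustrates this on synthetic data (the demo arrays are hypothetical): with `fit_intercept=False`, the scikit-learn estimator should reproduce the SciPy coefficients.

```python
import numpy as np
from scipy.optimize import nnls
from sklearn.linear_model import LinearRegression

# Synthetic data (assumption: stands in for the trait matrix and outcome)
rng = np.random.default_rng(1)
X_demo = rng.uniform(5, 50, size=(200, 5))
y_demo = X_demo @ np.array([0.0, 0.01, 0.0, 0.03, 0.05]) + rng.normal(0, 0.5, size=200)

# SciPy: non-negative least squares without an intercept
coefs_scipy, _ = nnls(X_demo, y_demo)

# scikit-learn: same constraint, with and without an intercept
no_intercept = LinearRegression(positive=True, fit_intercept=False).fit(X_demo, y_demo)
with_intercept = LinearRegression(positive=True).fit(X_demo, y_demo)

print(coefs_scipy)
print(no_intercept.coef_)
print(with_intercept.coef_, with_intercept.intercept_)
```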

### 3. Ridge Regression using scikit-learn

In [177]:
# Create a ridge regression model with non-negative coefficients
model_ridge = linear_model.Ridge(positive=True, alpha=.1)
model_ridge.fit(X_imputed, y_imputed)

# Define the cross-validation method (e.g., 5-fold cross-validation)
kfold_ridge = KFold(n_splits=5, shuffle=True, random_state=42)

# Perform cross-validation and calculate the mean performance
scores_ridge = cross_val_score(model_ridge, X_imputed, y_imputed, cv=kfold_ridge, scoring='r2')
mean_score_ridge = scores_ridge.mean()

# Print the mean performance
print("Mean R^2 score:", mean_score_ridge)
coefs_ridge = model_ridge.coef_

print(coefs_ridge)
print(np.sum(coefs_ridge))


Mean R^2 score: 0.39803977176459093
[[0.         0.01088455 0.         0.03242437 0.05334881]]
0.09665772813504822


In [190]:
# Average the coefficients produced by the three methods
mean_coefs = (coefs_nnls + coefs + coefs_ridge) / 3
print(mean_coefs)

[[0.         0.01145947 0.         0.03249374 0.05321809]]


In [193]:
# Rescale the averaged coefficients so that the weights sum to 1
weights = mean_coefs / np.sum(mean_coefs)
print(weights)
print(np.sum(weights))
weights.shape

[[0.         0.11793057 0.         0.33439647 0.54767295]]
1.0


(1, 5)

### Calculate the final Leisure Enjoyment Index

With the weights obtained and normalized, we calculate the Leisure Enjoyment Index as the dot product of the matrix norm_X (the normalized personality trait values) and the weight vector.

In [200]:
# Finally calculate the Leisure Enjoyment Index
main['LeisureEnjoy'] = np.dot(norm_X, np.transpose(weights))

In [201]:
main.head()

| Unnamed: 0 | NCDSID | Essay Text | Preprocessed Text | ND8EXT | ND8AGR | ND8CON | ND8EMO | ND8INT | ND8WEMWB | ND8PHHE | ... | Sentiment Continuous | anger | disgust | fear | joy | neutral | sadness | surprise | Sentiment | LeisureEnjoy |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | N28280Y | I am happily married, we are grand-parents. Ou... | ['happily', 'married', 'grandparent', 'two', '... | 44.0 | 45.0 | 41.0 | 26.0 | 37.0 | 58.0 | 95.0 | ... | 0.9542 | 0.001291 | 0.000894 | 0.000268 | 0.982044 | 0.004638 | 0.007959 | 0.002906 | 1 | 0.650335 |
| 1 | N13960Q | I am retired, not living in London, probably i... | ['retired', 'living', 'london', 'probably', 'n... | 26.0 | 25.0 | 25.0 | 23.0 | 32.0 | 40.0 | 90.0 | ... | -0.7001 | 0.004131 | 0.001903 | 0.004476 | 0.009218 | 0.007038 | 0.972608 | 0.000626 | 0 | 0.514776 |
| 2 | N23786Z | I imagine I'll still be teaching french at Pri... | ['imagine', 'ill', 'still', 'teaching', 'frenc... | 36.0 | 44.0 | 42.0 | 30.0 | 32.0 | 59.0 | 85.0 | ... | 0.7345 | 0.024202 | 0.011763 | 0.140934 | 0.312525 | 0.410469 | 0.064782 | 0.035325 | 0 | 0.616586 |
| 3 | N17606R | I am retired from work. I enjoy leisurely time... | ['retired', 'work', 'enjoy', 'leisurely', 'tim... | 22.0 | 41.0 | 29.0 | 38.0 | 26.0 | 54.0 | 100.0 | ... | 0.9432 | 0.001595 | 0.001737 | 0.00034 | 0.974898 | 0.003598 | 0.016691 | 0.001141 | 1 | 0.595149 |
| 4 | N19466F | Retired and moved further away from London, Su... | ['retired', 'moved', 'away', 'london', 'sussex... | 31.0 | 40.0 | 32.0 | 30.0 | 34.0 | 51.0 | 85.0 | ... | -0.7615 | 0.001112 | 0.001434 | 0.001872 | 0.014851 | 0.005902 | 0.973415 | 0.001414 | 0 | 0.630444 |
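
As a quick check, the index for the first row can be recomputed by hand from the printed weights and that row's trait scores. The sketch below copies both from the outputs above; the variable names are illustrative, and the result should land close to the LeisureEnjoy value shown for row 0.

```python
import numpy as np

# Normalized weights as printed above (order: EXT, AGR, CON, EMO, INT)
weights = np.array([0.0, 0.11793057, 0.0, 0.33439647, 0.54767295])

# Trait scores of the first row: ND8EXT, ND8AGR, ND8CON, ND8EMO, ND8INT
first_row_traits = np.array([44.0, 45.0, 41.0, 26.0, 37.0])

# Same normalization as applied to X in the notebook
norm_traits = (first_row_traits - 5) / 45

# The index is the weighted sum of the normalized traits
leisure_enjoy = float(norm_traits @ weights)
print(round(leisure_enjoy, 6))
```

Because the weights for ND8EXT and ND8CON are zero, only agreeableness, emotional stability, and intellect contribute to the index.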
