## Natural Selection (Walmart) - SIOP 2019 Competition

- Natural Selection = Natural Language Processing + Global Selection and Assessment

This competition provided a data set containing open-ended responses to 5 situational judgment items and 5 aggregated personality trait scores. The goal was to generate the best mean prediction across all 5 traits using only these open-ended responses.
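
As a rough sketch of how a "mean prediction" score across traits could be computed, the snippet below averages the Pearson correlation between observed and predicted scores over the five traits. The choice of metric is an assumption (the description above does not spell it out, although `pearsonr` is imported under the Evaluation comment in the imports cell), and `mean_trait_correlation` and the toy data are hypothetical names used only for illustration.

```python
import numpy as np
from scipy.stats import pearsonr

TRAITS = ["A", "E", "O", "N", "C"]

def mean_trait_correlation(y_true, y_pred):
    """Average the per-trait Pearson r between observed and predicted scores.

    y_true and y_pred are dicts mapping a trait letter to a 1-D sequence.
    """
    correlations = [pearsonr(y_true[t], y_pred[t])[0] for t in TRAITS]
    return float(np.mean(correlations))

# Toy data, made up for illustration only
rng = np.random.default_rng(0)
y_true = {t: rng.normal(size=50) for t in TRAITS}
# Perfectly correlated predictions: a positive linear rescaling of the truth
y_pred = {t: 2.0 * y_true[t] + 1.0 for t in TRAITS}

score = mean_trait_correlation(y_true, y_pred)
print(round(score, 3))  # 1.0
```

Because Pearson r is invariant to positive linear rescaling, a metric of this kind rewards getting the rank order and spacing of respondents right rather than matching the raw scale of the trait scores.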

We used three approaches:
- Key Words: we read a sample of responses from the high and low ends of each trait distribution and extracted key words that seemed to occur more often at one end than the other
- Machine learning: we trained machine learning models on features derived from the Key Words lists and on other data extracted from the text
- Deep learning: we applied deep learning techniques. This is the most refined code and the part experienced data scientists will likely find most valuable to review

The winning submission resulted from combining the methods.

Note on the code contained in this notebook:
- We removed most of the exploratory code from this notebook to focus on what we actually used in the final predictions. Some irrelevant and duplicate elements remain. This code was written by different people with different levels of coding expertise, so the coding style varies widely and may seem disjointed or incoherent at times.

## Dependencies

- pandas (https://pandas.pydata.org/)
- numpy (http://www.numpy.org/)
- seaborn (https://seaborn.pydata.org/)
- scikit-learn (https://scikit-learn.org/)
- scipy (https://www.scipy.org/)
- pyspellchecker (https://github.com/barrust/pyspellchecker)
- textblob (https://textblob.readthedocs.io/en/dev/)
- spacy (https://spacy.io/)
- tpot (http://epistasislab.github.io/tpot/)
- xgboost (https://xgboost.readthedocs.io/en/latest/)

In [82]:
# Python's best-known DataFrame implementation
import pandas as pd

# Fast, flexible array and numerical linear algebra subroutines
import numpy as np

# OS utilities (e.g. path module)
import os

# Plots & other visualization
import seaborn as sns; sns.set()
import matplotlib.pyplot as plt

# Pretty printing of complex datatypes
from pprint import pprint
import json

# Preprocessing and modeling utilities
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction import DictVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.svm import SVR, LinearSVR
from sklearn.ensemble import RandomForestRegressor
from sklearn.decomposition import SparsePCA, TruncatedSVD
from sklearn.linear_model import Ridge, Lasso, ElasticNet, BayesianRidge
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer

# Evaluation
from scipy.stats import pearsonr

# Text processing tools
from spellchecker import SpellChecker
from textblob import TextBlob

# import language_check


from textstat.textstat import textstatistics, legacy_round 


# For word embeddings and syntactic features
import spacy
import en_core_web_md

nlp = spacy.load('en_core_web_md')

# AutoML
from tpot import TPOTRegressor

# xgboost
from xgboost import XGBRegressor
import scipy


## Constants

Here we set some constant values related to local paths to data files as well as lists containing the various predictor and target features.

In [83]:
df = pd.read_csv('/Users/I745133/Desktop/git/NLP/dissertation/data/MLdata.csv')
types = df['Dataset'].value_counts()
print(types)
train_df = df[df['Dataset'] == 'Train']
valid_df = df[df['Dataset'] == 'Dev']
test_df = df[df['Dataset'] == 'Test']
train_df.to_csv('/Users/I745133/Desktop/git/NLP/dissertation/data/train_rep.csv', index=False)
valid_df.to_csv('/Users/I745133/Desktop/git/NLP/dissertation/data/valid_rep.csv', index=False)
test_df.to_csv('/Users/I745133/Desktop/git/NLP/dissertation/data/test_rep.csv', index=False)

Dataset
Train    1088
Dev       300
Test      300
Name: count, dtype: int64


In [84]:
# Paths to various data targets for the competition. 

# Update to reflect the directory hierarchy of your machine 
DATA_DIR = "/Users/I745133/Desktop/git/NLP/dissertation/data"

# The Dev split (valid_rep.csv) serves as the test set here;
# the Test split (test_rep.csv) is held out as the final set
TRAIN_CSV_DATA_NAME = os.path.join(DATA_DIR, "train_rep.csv")
TEST_CSV_DATA_NAME = os.path.join(DATA_DIR, "valid_rep.csv")
FINAL_CSV_DATA_NAME = os.path.join(DATA_DIR, "test_rep.csv")

# Set some DataFrame-specific constants
TARGET_COLUMN_NAMES = [attribute + "_Scale_score" for attribute in ["A", "E", "O", "N", "C"]]
PREDICTOR_TEXT_COLUMN_NAMES = ["open_ended_" + str(idx) for idx in range(1, 6)]
PREDICTOR_CONCAT_COLUMN_NAME = "open_ended_6"

## Reading In Data

Here we use `pandas` to read our CSV data sets into DataFrames, a common and convenient data structure for the workflows we will be implementing. `df_train_temp` will be used for training purposes, `df_test_temp` will be used for public leaderboard submissions, and `df_final_temp` will be used for private leaderboard submissions.

In [85]:
# Read csv data to base DataFrame
df_train_temp = pd.read_csv(TRAIN_CSV_DATA_NAME)
df_test_temp = pd.read_csv(TEST_CSV_DATA_NAME)
df_final_temp = pd.read_csv(FINAL_CSV_DATA_NAME)

df_train_temp['Source']='Train'
df_test_temp['Source']='Test'
df_final_temp['Source']='Final'

# Combine datasets
df_total=pd.concat([df_train_temp,df_test_temp,df_final_temp],ignore_index=True, sort=True)

# Check data load
df_total['Source'].value_counts()

Source
Train    1088
Test      300
Final     300
Name: count, dtype: int64

In [86]:
df_train_temp.head()

Unnamed: 0,Respondent_ID,open_ended_1,open_ended_2,open_ended_3,open_ended_4,open_ended_5,E_Scale_score,A_Scale_score,O_Scale_score,C_Scale_score,N_Scale_score,Dataset,Source
0,10446116527,"I would change my vacation week, because I am ...",I would reach out to my boss and ask him or he...,I would not go. I am a not a social person. I ...,I would ask my manager why he/she gave me such...,I would find this experience super enjoyable. ...,2.25,3.75,3.166667,3.75,2.916667,Train,Train
1,10440100535,I would talk to my colleague and see if they w...,I would continue to work on the project that w...,I would talk to my colleague and try to talk t...,I would feel upset about the negative feedback...,I would find this experience enjoyable. I feel...,4.666667,4.416667,4.583333,5.0,1.333333,Train,Train
2,10462850071,I would feel upset because perhaps I already b...,I would start working on the project now and g...,I would feel guilty about thinking about not g...,I would feel really defensive about it. I woul...,I would find it enjoyable because I would be r...,2.25,4.75,4.083333,4.666667,2.166667,Train,Train
3,10460008027,I would suggest that whoever requested the tim...,I would try to finish early because it's alway...,"I wouldn't want to go, but I'd have to weigh t...",I would first wait until I'm calm. Then I wou...,I would love to have the opportunity to learn ...,2.916667,4.083333,3.916667,4.916667,1.333333,Train,Train
4,10459746373,I would talk to my colleague to see if he has ...,I would remind my boss that I am working on th...,I would go anyway. Networking is a good way to...,I would talk to my manager first and get some ...,I would find this experience enjoyable as I am...,3.75,4.75,3.666667,4.916667,1.583333,Train,Train


## Data Preprocessing Modules

Here we define various preprocessing utilities (simple Python functions that operate on a single input) as well as preprocessing transformers that operate on an entire column of data. Transformers should be implemented as Python classes that inherit from `sklearn.base.BaseEstimator` and `sklearn.base.TransformerMixin` and should implement `fit` and `transform` methods.
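
A minimal sketch of the transformer pattern described above is shown here. The class `KeywordCountTransformer` is hypothetical and is not one of the transformers used for the submission; it only illustrates the `fit`/`transform` contract on a column of text.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class KeywordCountTransformer(BaseEstimator, TransformerMixin):
    """Count keyword occurrences in each text of a column (hypothetical example)."""

    def __init__(self, keywords=()):
        self.keywords = keywords

    def fit(self, X, y=None):
        # Nothing to learn; return self so the transformer works in a Pipeline
        return self

    def transform(self, X):
        # One row per text, one column holding the total keyword count
        counts = [sum(text.count(kw) for kw in self.keywords) for text in X]
        return np.array(counts, dtype=float).reshape(-1, 1)

texts = ["I would help my team", "I would not go"]
features = KeywordCountTransformer(keywords=["help", "team"]).fit_transform(texts)
print(features.ravel().tolist())  # [2.0, 0.0]
```

Storing constructor arguments unchanged on `self` keeps the class compatible with `Pipeline` and `GridSearchCV`, which rely on `get_params` and `set_params` inherited from `BaseEstimator`.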

In [87]:
df_total[PREDICTOR_CONCAT_COLUMN_NAME] = df_total.apply(
    lambda row: " ".join([row[col_name] for col_name in PREDICTOR_TEXT_COLUMN_NAMES]),
    axis=1
)

PREDICTOR_TEXT_COLUMN_NAMES_ALL =['open_ended_1','open_ended_2','open_ended_3',
                                  'open_ended_4','open_ended_5','open_ended_6']

In [88]:
# Count spelling errors and compute the share of misspelled words

spell_checker = SpellChecker()

def tokenize(text):
    return TextBlob(text).words

def compute_num_spelling_errors(text):
    return len(spell_checker.unknown(tokenize(text)))

def divide(x, y):
    return x / y

def word_count(text): 
    return textstatistics().lexicon_count(text, removepunct=True)

for predictor_col in PREDICTOR_TEXT_COLUMN_NAMES_ALL:
    df_total[predictor_col + "_num_words"] = df_total[predictor_col].apply(word_count)
    df_total[predictor_col + "_num_misspelled"] = df_total[predictor_col].apply(compute_num_spelling_errors)
    df_total[predictor_col + "_percent_misspelled"] = df_total[[predictor_col + "_num_misspelled",
                              predictor_col + "_num_words"
    ]].apply(lambda x: divide(*x), axis=1)

## Building word lists 1

We build the word lists twice because we were lazy. The lists diverged due to different team members refining them and we never got around to reconciling the differences.
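
The lists in the next cell were compiled by reading responses from the tails of each trait distribution. One way such samples could be pulled with `pandas` is sketched below; the helper `tail_samples`, the 5% default, and the toy frame are assumptions for illustration and are not the code the team actually ran.

```python
import pandas as pd

def tail_samples(df, score_col, text_col, fraction=0.05):
    """Return (low, high) Series of texts from the bottom and top `fraction` of scores."""
    low_cut = df[score_col].quantile(fraction)
    high_cut = df[score_col].quantile(1 - fraction)
    low = df.loc[df[score_col] <= low_cut, text_col]
    high = df.loc[df[score_col] >= high_cut, text_col]
    return low, high

# Toy frame, made up: 20 respondents with scores 1..20
toy = pd.DataFrame({
    "E_Scale_score": range(1, 21),
    "open_ended_3": [f"response {i}" for i in range(1, 21)],
})

low, high = tail_samples(toy, "E_Scale_score", "open_ended_3")
print(low.tolist(), high.tolist())  # ['response 1'] ['response 20']
```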

In [89]:
# Lists were compiled by reading a sample of comments at either the top or bottom 5% of each trait distribution

O_high_5_LIST = ["accept","allow","apply","benefit","better","career","client","comfortable","contact","contribute","convince",
"correct","enjoy","excited","fair","first","fun","great time","grow","happy to go","help","immediately","improve","insist",
"leader","learn","let","mad","negative","no problem","offer","personal issue","respect","right away","show","team"]

C_high_2_LIST = ["family","report","stress","question","convince","job","deserve","longest","comfortable","win","great time",
"negative","fair","check in","short time","accus","short ","respect","willing","lie","correct","as soon","positive","impres",
"review","problems","immediately","hate networking","anger","proof","upset","prove","open","explain","improve","time ",
"confident","right away","let"]

A_high_1_LIST = ["agree","benefit","best","bonus","change","compromise","considerate","correct","defer","easy going","family",
"flexible","fun","good","help","hurt","incorrect","leader","let","misunderstand","no problem","not interested","obligation",
"paid","pick","priority","problems","quickly","respect","review","show","willing","win"]

E_high_3_LIST = ["career","good","frustrated","nice","best","deny","reflect","confident","grow","consequence","missed out",
"connect","rage","importan","worry","I am sociable","party","right away","priority","sociable","accept","focus","plan ",
"report","excited","reward","contribute","allow","success","contact","review","absolutely go","for sure go","meet","great",
"colleagues","social","not need anyone","regardless","fool","surely attend","leader","network","I like parties","no problem",
"learn","friendship","definitely go","introduce","let","competition","client", "make new friends"]

A_low_2_LIST = ["wrong","question","busy","probably","resent","not go","importan","fun","enjoy","bad","first","problems",
"refuse","better","short time","good","anxiety","avoid going","respect","compromise","losing","angry","regardless",
"social anxiety","rage","decline","pretend","focus","connect","no problem","priority","excuse","procrastinate","fool","sick",
"personal issue","anticipat","deadline","anyone","lose","difficult","meet","judg","worry","plan ","trouble","show","nervous",
"reflect","help","pressure","compensate","bonus","get along","flexible","colleagues","accus","fire","consequence","demand",
"not back down","stand my ground"]
            
A_low_3_LIST = ["I like parties","fire","unpleasant","would not go","sales","quit","discomfort","money","hate networking",
"worthwhile","obligation","panic","emotion","unlikely","hell","skip","social anxious","pick","cold","decline","not go","paid",
"get out of it","hate","wouldn't go","reward","rage","short ","negotiate","beg","difficult","trouble","resent","time ",
"immediately","stress","stressed","stressed out","reconsider","short time","grow","extremely uncomfortable","willing",
"get along","apply","if i had to","risk","anxiety","great","forc","allow","socially awkward","dislike","I am sociable",
"great time","missed out","compensate","oppurtunity","anger","benefit","plan ","confirm","avoid","social anxiety","fair",
"pressure","mad","deserve","not a social person"]
A_low_4_LIST = ["rage","cold","fool","marked","depend","demand","quit","report","probably","career","accept","not go",
"compensate","pressure","quiet","angry","afraid","confront","emotion","job","benefit","mad","threaten","money","unpleasant",
"anxiety","pissed","anyone","obligation","confident","short ","regardless","refuse","appeal","hesitate","examples","immediately",
"bad","suck it up","resent","respect","wrong","harm"]
A_low_5_LIST = ["paid","refuse","avoid going","alone","emotion","pretend","resent","bonus","win","rage","difficult","probably",
"afraid","anger","forc","hate networking","change","agree","depend","wouldn't go","pick","focus","obligation","frustrated",
"considerate","right away","time ","money","negative","colleagues","awkward","improve","success","explain","bad","best",
"respect","let","better","nice","nervous"]

N_low_1_LIST = ["as soon","report","show","problems","best","quickly","bonus","tense","social","correct","win","concede",
"leader","misunderstand","unlikely","incorrect","fire","easy going","paid","hesitate","human resources","time ","emotion",
"worried","racist","slash","fun","valid","stubborn","flexible","review","beg","respect","benefit","open","threaten","short ",
"change","first","trouble","agree","compromise","defend","defer","mad","harm","worry"]
N_low_2_LIST = ["hate networking","client","responsible","longest","unhappy","willing","accus","proof","difficult","family",
"anger","team","correct","consequence","comfortable","stick","trouble","job","pressure","benefit","mad","report","deserve",
"accept","positive","review","open","as soon","risk","time ","let","feel pressure","check in","depend","dislike","judg",
"social anxiety","resent","lie","explain","upset","hard ","leader","frustrated"]
N_low_3_LIST = ["learn","no problem","regardless","network","good","introduce","anyone","definitely go","confident","meet",
"competition","contact","lie","client","I like parties","not need anyone","great","social","party","worry","friendship",
"review","contribute","stretch myself","surely attend","fool","plan ","help","leader","missed out","for sure go","fair",
"let","reluctance","absolutely go","excited","happy to go","priority","excuse","hard ","report","job","anger"]
N_low_5_LIST = ["learn","anyone","short time","help","leader","client","great time","enjoy","importan","excited","hesitate",
"correct","lie","team","losing","career","responsible","insist","immediately","bad","happy to go","pretend","willing",
"emotion","short ","stress","confus","trouble","time ","worry","success","regardless","report","hurt","show","money",
"contact","stick","mad","unlikely"]

C_high_1_LIST = ["question","stubborn","would change","reconsider","as soon","human resources","disagree","defer","risk",
"unpleasant","immediately","worry","argue","petty","explain","mad","proof","hurt","correct","obligated","not go","harm",
"unhappy","leader","misunderstand","win","fire","unlikely","first","pick","angry","priority","bonus","quickly","short time",
"hesitate","tense","social","switch","success","problems","not interested","easy going","no problem","report","reflect",
"upset","anger","team","valid","paid","review","agree","short ","willing","fun","concede","show","seniority","flexible",
"change"]
C_high_3_LIST = ["no problem","introvert","frustrated","win","sad","missed out","alone","dislike","appeal","depend","insist",
"sick","success","not comfortable","stretch myself","report","benefit","accept","hard ","contribute","responsible","compensate",
"fool","social","absolutely go","regardless","anyone","focus","pretend","worried","nightmare","not need anyone","surely attend",
"for sure go","colleagues","competition","let","help","best","great","deny","importan","learn","network","client","uncomfortable",
"priority","lie","improve","good","not attend","fun","definitely go","mad","comfortable","reluctant","excited","excuse","meet"]
C_high_4_LIST = ["willing","change","respect","connect","fun","paid","hard ","immediately","terrible","grow","incorrect","refuse",
"open","resent","quickly","contact","calm","party","short ","contribute","my right","stubborn","rebut","problems","worried",
"as soon","compromise","hurt","good","proof","not true","early","human resources","obligation","colleagues","meet","demand",
"success","negative","allow","concerned","disagree","let","agree"]
C_high_5_LIST = ["success","appeal","worry","fun","busy","hesitate","problems","allow","hurt","improve","excited","good","bad",
"leader","stress","importan","excuse","introduce","lose","enjoy","prove","personal issue","fair","quickly","correct","stick",
"accus","unlikely","comfortable","sad","willing","contact","confus","career","show","losing","immediately","compensate",
"anyone","lie","client","help","learn"]

A_high_2_LIST = ["agree","negative","benefit","overwhelmed","quiet","I had to","lie","team","check in","early","stick",
"feel pressure","allow","family","sacrifice","stressed","learn","frustrated","right away","convince","best","let","fair",
"client","longest","responsible","mad","stressed out","report","time ","upset","confident","dislike","unhappy","anger",
"explain","positive","stress","proof","avoid confrontation","more than willing","don't want conflict","easy going",
"hate conflict","keep people happy","team player"]
A_high_3_LIST = ["change","reluctant","angry","quickly","right away","excuse","stick","would change","early","compromise",
"not comfortable","learn","positive","avoid going","anxious","colleagues","fool","reluctance","absolutely go","fun",
"not attend","tired","losing","worry","busy","no problem","contribute","explain","hurt","network","uncomfortable","consequence",
"social","not need anyone","surely attend","regardless","help","better","excited","importan","priority","responsible",
"outside of my comfort zone","party","stretch myself","for sure go","hard ","report","focus","client","alone","lie","introduce",
"friendship","comfortable","contact","best","good","definitely go","anyone","meet"]
A_high_4_LIST = ["convince","defend","lose","accus","worthwhile","agitated","personal","consequence","concerned","impres",
"anger","success","correct","win","confus","argue","proof","incorrect","focus","terrible","best","negative","not justified",
"as soon","plead","confirm","lie","unfair","early","judg","stressed out","hard ","organize","risk","improve","worried","quickly",
"my right","open","frustrated","contact","meet","compromise","pretend","rebut","stress","reconsider","hurt","would not go",
"importan","positive","problems","agree","let","negotiate","allow","explain","learn","prove","better","anxious","colleagues",
"not true","upset","grow"]
A_high_5_LIST = ["accept","short time","question","happy to go","excited","hard ","impres","good","grow","losing","reward",
"show","contribute","convince","accus","willing","concerned","dislike","contact","hesitate","network","comfortable","apply",
"leader","immediately","stress","correct","importan","great time","hurt","offer","confus","help","anyone","lie","client",
"enjoy","learn"]

N_high_1_LIST = ["willing","regardless","losing","great","shy","career","obligated","organize","stick","forc","appeal","anger",
"unfair","positive","early","reward","my right","I had to","refuse","money","negotiate","personal issue","wrong","anyone",
"family","enjoy","pissed","hard ","team","deny","insist","busy","sacrifice","skip","proof","fair","client","better","contact",
"meet","question","fool","get even","profanity","cold","unhappy","angry","call in","awkward","excuse","upset","get along",
"demand","lose","avoid","deadline","stressed","unpleasant","terrible","difficult","frustrated","confront","hell","plead",
"alone","improve","stare","concerned","hardship","nice","pressure","sad","reflect","probably","friendship","reluctant",
"sick","obligation","quit","hate","offer","hard stance"]
N_high_2_LIST = ["calm","panic","stress","enjoy","bonus","show","learn","question","decline","sick","importan","colleagues",
"worried","worry","connect","meet","rage","paid","pretend","anxiety","avoid going","lose","early","bad","angry","better",
"deadline","losing","hurt","priority","no problem","wrong","demand","beg","I had to","busy","compromise","negotiate","probably"]
N_high_3_LIST = ["wrong","compromise","respect","risk","show","afraid","bonus","worried","tense","worthwhile","dislike","valid",
"confirm","socially awkward","introvert","losing","deserve","quickly","beg","plead","mad","change","better","angry","apply",
"not comfortable","lose","get out of it","agree","paid","outside of my comfort zone","not a social person","miss out",
"time ","family","reconsider","anxious","short time","prove","negative","money","fire","tired","negotiate","I had to","harm",
"appeal","sacrifice","hell","stress","awkward","forc","hesitate","pressure","trouble","willing","deadline","short ","suck it up",
"get along","loner","stressed","resent","skip","social anxiety","bad","not great at networking","nightmare","shy","avoid",
"impres","concerned","difficult","probably","compensate","emotion","unpleasant","obligation","nervous","feel pressure",
"extremely uncomfortable","nerve-wracking","hate networking","immediately","hate","would not go","social anxious","panic",
"unlikely","discomfort","not go","anxiety"]
N_high_5_LIST = ["negative","allow","best","hate networking","let","positive","apply","anger","beg","bonus","comfortable",
"dislike","oppurtunity","obligation","improve","concerned","pick","open","right away","job","rage","probably","refuse",
"upset","afraid","risk","alone","social anxiety","consequence","agree","prove","fair","colleagues","awkward","paid","grow",
"avoid going","early","nervous","forc","depend","resent","frustrated","difficult"]

O_high_1_LIST = ["accept","anyone","as soon","best","bonus","defer","easy going","enjoy","excuse","flexible","fool","get even",
"good","harm","hesitate","hurt","importan","leader","marked","meet","negotiate","obligation","paid","petty","pick","plead",
"positive","probably","problems","quickly","quit","reflect","respect","reward","short ","short time","stubborn","suck it up",
"suffer","tense","terrible","threaten","time ","upset","willing","win","worried"]
O_high_2_LIST = ["agree","allow","anger","best","better","calm","correct","deserve","difficult","excuse","explain","fair",
"forc","frustrated","fun","great time","immediately","importan","improve","learn","let","nervous","offer","pick","positive",
"pressure","problems","proof","prove","respect","responsible","review","short ","short time","show","suffer","team","time ",
"trouble","upset","worried"]
O_high_3_LIST = ["absolutely go","accept","alone","angry","anyone","better","career","client","cold","comfortable","consequence",
"contact","contribute","definitely go","deny","depend","difficult","early","emotion","excited","excuse","feel pressure","focus",
"for sure go","forc","friendship","fun","good","hesitate","I like parties","insist","introduce","let","lie","lose","losing",
"meet","miss out","missed out","money","nerve-wracking","nervous","network","nice","no problem","not comfortable","not need anyone",
"obligated","oppurtunity","outside of my comfort zone","party","plan ","positive","priority","quickly","regardless","responsible",
"success","trouble","worried","worry","would change"]
O_high_4_LIST = ["allow","anxious","as soon","benefit","best","better","bonus","client","cold","colleagues","comfortable",
"concerned","confirm","connect","consequence","deserve","early","explain","fool","forc","fun","great time","grow","help",
"importan","impres","improve","judg","learn","let","lie","losing","marked","meet","negotiate","nervous","nice","not justified",
"not true","offer","paid","party","personal","personal issue","plan ","positive","pretend","problems","prove","quiet",
"reconsider","resent","respect","review","risk","stick","stubborn","team","threaten","trouble"]

A_low_1_LIST = ["stare","responsible","fool","get even","profanity","call in","sick","refuse","emotion","hard stance","racist",
"slash","hardship","demand","compensate","first","stick","quit","personal issue","excuse","trouble","deny","hell","depend",
"money","cold","hard ","marked","pissed","client","deserve","unfair","fair","resent","reconsider","offer","my right","hate",
"forc","worry","reward","reluctant","concerned","organize","sad","losing","rage","bad","insist","busy","difficult","appeal",
"stressed out","stressed","wrong","early","longest","proof","better","petty","improve","contact","avoid","accept","entitle",
"meet","if i had to","seniority","suffer","comfortable","regardless","personal"]

E_low_3_LIST = ["social anxiety","extremely uncomfortable","nervous","social anxious","panic","unlikely","impres","anxiety",
"probably","introvert","immediately","feel pressure","anxious","decline","emotion","nerve-wracking","loner","pressure","avoid",
"stressed out","stressed","dislike","shy","hesitate","losing","bad","difficult","not great at networking","obligation","unfair",
"stretch myself","hate networking","willing","hell","stress","lose","nightmare","quit","avoid going","mad","paid","sad",
"reluctant","get out of it","fair","not a social person","reluctance","quiet","upset","sacrifice","change","not comfortable",
"money","tired","family","appeal","confirm","harm","prove","short ","skip","stick","hate","compensate","deserve","short time",
"sick","outside of my comfort zone","deadline","pretend","discomfort","socially awkward","show","angry","win","convince",
"not interested","apply","get along","negotiate","unpleasant","quickly","awkward","not attend","concerned","plead","fire",
"suck it up","forc","comfortable","pick","uncomfortable","unhappy","excuse","compromise","afraid","do not interact well with strangers",
"don't like being in social situations","don't like networking","don't like socializing","very shy"]

GO_3_LIST = ["absolutely go","all in","attend","attend that meeting","certainly go","cheerfully go","decide to go",
"definitely attend","definitely be in attendance_1","definitely go","definitely still go","go for it","go for sure",
"go to the event","go to the meeting","go to the networking meeting","just go","make an appearance","make sure I go",
"make time to attend","still attend","still go","still opt in","time and go","would attend","would go","would still go"]

NOGO_3_LIST = ["avoid","backing out","bow out of the meeting","choose not to go","come","consider not going","decide to go",
"decline","ditch","get out of it","go home","happy to go","hate going","hesitate to go","in attendance","likely go",
"likely not go","might consider going","no interest","not attend","not come","not consider going","not feel like going",
"not going","not interested","not show up","not volunteer","not want to go","politely decline","probably attend","probably go",
"probably not go","probably still go","probably would not","probably wouldn't","skip","stay at home","try to go","unlikely to go",
"will not go","would be going","would not go","wouldn't be going","wouldn't do it","wouldn't go","wouldn't want to go"]

GO_5_LIST = ['would go','probably go']

NOGO_5_LIST = ['not go','not to go',"n't go"]

NOT_LIST = [" not "]

NO_LIST = [" no "]

In [90]:
# Define function for counting keyword occurrences

def write_keyword_count_column(df, target_column, source_column, keyword_list):
    def compute_keyword_list_count(text):
        return sum([text.count(kw) for kw in keyword_list])    
    df[target_column] = df[source_column].apply(compute_keyword_list_count)
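
# The counting above relies on `str.count`, which matches case-sensitive,
# non-overlapping substrings. The standalone sketch below (toy text and a
# hypothetical two-argument version of the helper) shows the consequences:
# stems such as "importan" match longer words, entries with a trailing space
# such as "time " only match when a space follows, and capitalized words are
# missed unless the text is lowercased first. Lowercasing is shown only as a
# possible variation; the notebook does not do it.

```python
def compute_keyword_list_count(text, keyword_list):
    # Same logic as the inner helper: sum of non-overlapping substring counts
    return sum(text.count(kw) for kw in keyword_list)

text = "It is important to learn. Learning is fun, and time is short."
keywords = ["importan", "learn", "fun", "time "]

raw_count = compute_keyword_list_count(text, keywords)
lowered_count = compute_keyword_list_count(text.lower(), keywords)
print(raw_count, lowered_count)  # 4 5
```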

In [91]:
# Specify key word list features

write_keyword_count_column(df_total, 'O_high_5', 'open_ended_5', O_high_5_LIST)

write_keyword_count_column(df_total, 'C_high_2', 'open_ended_2', C_high_2_LIST)

write_keyword_count_column(df_total, 'A_high_1', 'open_ended_1', A_high_1_LIST)

write_keyword_count_column(df_total, 'E_high_3', 'open_ended_3', E_high_3_LIST)

write_keyword_count_column(df_total, 'A_low_2', 'open_ended_2', A_low_2_LIST)
write_keyword_count_column(df_total, 'A_low_3', 'open_ended_3', A_low_3_LIST)
write_keyword_count_column(df_total, 'A_low_4', 'open_ended_4', A_low_4_LIST)
write_keyword_count_column(df_total, 'A_low_5', 'open_ended_5', A_low_5_LIST)

write_keyword_count_column(df_total, 'N_low_1', 'open_ended_1', N_low_1_LIST)
write_keyword_count_column(df_total, 'N_low_2', 'open_ended_2', N_low_2_LIST)
write_keyword_count_column(df_total, 'N_low_3', 'open_ended_3', N_low_3_LIST)
write_keyword_count_column(df_total, 'N_low_5', 'open_ended_5', N_low_5_LIST)

write_keyword_count_column(df_total, 'C_high_1', 'open_ended_1', C_high_1_LIST)
write_keyword_count_column(df_total, 'C_high_3', 'open_ended_3', C_high_3_LIST)
write_keyword_count_column(df_total, 'C_high_4', 'open_ended_4', C_high_4_LIST)
write_keyword_count_column(df_total, 'C_high_5', 'open_ended_5', C_high_5_LIST)

# Note: A_high_2 is counted in open_ended_1, not open_ended_2
write_keyword_count_column(df_total, 'A_high_2', 'open_ended_1', A_high_2_LIST)
write_keyword_count_column(df_total, 'A_high_3', 'open_ended_3', A_high_3_LIST)
write_keyword_count_column(df_total, 'A_high_4', 'open_ended_4', A_high_4_LIST)
write_keyword_count_column(df_total, 'A_high_5', 'open_ended_5', A_high_5_LIST)

write_keyword_count_column(df_total, 'N_high_1', 'open_ended_1', N_high_1_LIST)
write_keyword_count_column(df_total, 'N_high_2', 'open_ended_2', N_high_2_LIST)
write_keyword_count_column(df_total, 'N_high_3', 'open_ended_3', N_high_3_LIST)
write_keyword_count_column(df_total, 'N_high_5', 'open_ended_5', N_high_5_LIST)

write_keyword_count_column(df_total, 'O_high_1', 'open_ended_1', O_high_1_LIST)
write_keyword_count_column(df_total, 'O_high_2', 'open_ended_2', O_high_2_LIST)
write_keyword_count_column(df_total, 'O_high_3', 'open_ended_3', O_high_3_LIST)
write_keyword_count_column(df_total, 'O_high_4', 'open_ended_4', O_high_4_LIST)

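# Note: the E_high_* columns below reuse the O_high_* lists, and the first call
# overwrites the E_high_3 column computed earlier from E_high_3_LIST
write_keyword_count_column(df_total, 'E_high_3', 'open_ended_3', O_high_2_LIST)
write_keyword_count_column(df_total, 'E_high_4', 'open_ended_4', O_high_3_LIST)
write_keyword_count_column(df_total, 'E_high_5', 'open_ended_5', O_high_4_LIST)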

write_keyword_count_column(df_total, 'A_low_1', 'open_ended_1', A_low_1_LIST)

write_keyword_count_column(df_total, 'E_low_3', 'open_ended_3', E_low_3_LIST)

write_keyword_count_column(df_total, 'GO_3', 'open_ended_3', GO_3_LIST)
write_keyword_count_column(df_total, 'NOGO_3', 'open_ended_3', NOGO_3_LIST)

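# Note: item 5 reuses the item-3 lists; GO_5_LIST and NOGO_5_LIST are defined above but not used here
write_keyword_count_column(df_total, 'GO_5', 'open_ended_5', GO_3_LIST)
write_keyword_count_column(df_total, 'NOGO_5', 'open_ended_5', NOGO_3_LIST)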

write_keyword_count_column(df_total, 'NOT_1', 'open_ended_1', NOT_LIST)
write_keyword_count_column(df_total, 'NOT_2', 'open_ended_2', NOT_LIST)
write_keyword_count_column(df_total, 'NOT_3', 'open_ended_3', NOT_LIST)
write_keyword_count_column(df_total, 'NOT_4', 'open_ended_4', NOT_LIST)
write_keyword_count_column(df_total, 'NOT_5', 'open_ended_5', NOT_LIST)

write_keyword_count_column(df_total, 'NO_5', 'open_ended_5', NO_LIST)

In [92]:
# Generating aggregate features--combinations were derived in part from feedback from the public leaderboard

df_total['A_low_comb'] = df_total['A_low_2']+df_total['A_low_3']+df_total['A_low_4']+df_total['A_low_5']
df_total['N_low_comb'] = df_total['N_low_1']+df_total['N_low_2']+df_total['N_low_3']+df_total['N_low_5']
df_total['C_high_comb'] = df_total['C_high_1']+df_total['C_high_3']+df_total['C_high_4']+df_total['C_high_5']
df_total['A_high_comb'] = df_total['A_high_2']+df_total['A_high_3']+df_total['A_high_4']+df_total['A_high_5']
df_total['N_high_comb'] = df_total['N_high_1']+df_total['N_high_2']+df_total['N_high_3']+df_total['N_high_5']
df_total['O_high_comb'] = df_total['O_high_1']+df_total['O_high_2']+df_total['O_high_3']+df_total['O_high_4']
df_total['E_high_3to5'] = df_total['E_high_3']+df_total['E_high_4']+df_total['E_high_5']

df_total['A_not_comb'] = df_total['NOT_1']+df_total['NOT_2']+df_total['NOT_3']+df_total['NOT_4']+df_total['NOT_5']

df_total['O_go_comb'] = df_total['GO_5']-df_total['NOGO_5']

## Building word lists 2

In [93]:
df_total['char_count_3'] = df_total['open_ended_3'].str.len() 
df_total['char_count_4'] = df_total['open_ended_4'].str.len()

In [94]:
def avg_word(sentence):
    """Return the mean word length in characters (0.0 for an empty response)."""
    words = sentence.split()
    if not words:
        return 0.0
    return sum(len(word) for word in words) / len(words)

df_total['avg_word_1'] = df_total['open_ended_1'].apply(avg_word)
df_total['avg_word_2'] = df_total['open_ended_2'].apply(avg_word)
df_total['avg_word_3'] = df_total['open_ended_3'].apply(avg_word)
df_total['avg_word_4'] = df_total['open_ended_4'].apply(avg_word)
df_total['avg_word_5'] = df_total['open_ended_5'].apply(avg_word)


In [95]:
not_list = [" not "]

no_list = [" no "] #apply to 5 only

e_high_3_list=['benefit','best','better','career','client','competition','confident','connect',
'contact','contribute','convince','drink','enjoy','friendship','good','great','grow','importan',
'impres','introduce','leader','learn','meet','miss out','missed out','network','open','oppurtunity',
'party','positive','regardless','reward','sales','show','sociable','success','worthwhile']

e_high_4_list=['benefit','best','better','career','client','competition','confident','connect',
'contact','contribute','convince','drink','enjoy','friendship','good','great','grow','importan',
'impres','introduce','leader','learn','meet','miss out','missed out','network','open','oppurtunity',
'party','positive','regardless','reward','sales','show','sociable','success','worthwhile']

e_high_5_list =['benefit','best','better','career','client','competition','confident','connect',
'contact','contribute','convince','drink','enjoy','friendship','good','great','grow','importan',
'impres','introduce','leader','learn','meet','miss out','missed out','network','open','oppurtunity',
'party','positive','regardless','reward','sales','show','sociable','success','worthwhile']

n_low_comb_emp_1_list=['as soon','report','show','problems','best','quickly','bonus','tense',
'social','correct','win','concede','leader','misunderstand','unlikely','incorrect','fire',
'easy going','paid','hesitate','human resources','time ','emotion','worried','racist','slash',
'fun','valid','stubborn','flexible','review','beg','respect','benefit','open','threaten','short ',
'change','first','trouble','agree','compromise','defend','defer','mad','harm']

n_low_comb_emp_2_list=['worry','hate networking','client','responsible','longest','unhappy',
'willing','accus','proof','difficult','family','anger','team','correct','consequence','comfortable',
'stick','trouble','job','pressure','benefit','mad','report','deserve','accept','positive','review',
'open','as soon','risk','time ','let','feel pressure','check in','depend','dislike','judg','social anxiety',
'resent','lie','explain','upset','hard ','leader']

n_low_comb_emp_3_list=['frustrated','learn','no problem','regardless','network','good','introduce',
'anyone','definitely go','confident','meet','competition','contact','lie','client','I like parties',
'not need anyone','great','social','party','worry','friendship','review','contribute','stretch myself',
'surely attend','fool','plan ','help','leader','missed out','for sure go','fair','let','reluctance',
'absolutely go','excited','happy to go','priority','excuse','hard ','report','job']

n_low_comb_emp_5_list=['anger','learn','anyone','short time','help','leader','client','great time',
'enjoy','importan','excited','hesitate','correct','lie','team','losing','career','responsible',
'insist','immediately','bad','happy to go','pretend','willing','emotion','short ','stress','confus',
'trouble','time ','worry','success','regardless','report','hurt','show','money','contact','stick',
'mad','unlikely']

n_high_comb_emp_1_list=['willing','regardless','losing','great','shy','career','obligated','organize',
'stick','forc','appeal','anger','unfair','positive','early','reward','my right','I had to','refuse',
'money','negotiate','personal issue','wrong','anyone','family','enjoy','pissed','hard ','team',
'deny','insist','busy','sacrifice','skip','proof','fair','client','better','contact','meet','question',
'fool','get even','profanity','cold','unhappy','angry','call in','awkward','excuse','upset','get along',
'demand','lose','avoid','deadline','stressed','unpleasant','terrible','difficult','frustrated','confront',
'hell','plead','alone','improve','stare','concerned','hardship','nice','pressure','sad','reflect','probably',
'friendship','reluctant','sick','obligation','quit','hate','offer','hard stance']

n_high_comb_emp_2_list=['calm','panic','stress','enjoy','bonus','show','learn','question','decline',
'sick','importan','colleagues','worried','worry','connect','meet','rage','paid','pretend','anxiety',
'avoid going','lose','early','bad','angry','better','deadline','losing','hurt','priority','no problem',
'wrong','demand','beg','I had to','busy','compromise','negotiate','probably']

n_high_comb_emp_3_list=['wrong','compromise','respect','risk','show','afraid','bonus','worried','tense',
'worthwhile','dislike','valid','confirm','socially awkward','introvert','losing','deserve','quickly',
'beg','plead','mad','change','better','angry','apply','not comfortable','lose','get out of it',
'agree','paid','outside of my comfort zone','not a social person','miss out','time ','family','reconsider',
'anxious','short time','prove','negative','money','fire','tired','negotiate','I had to','harm','appeal',
'sacrifice','hell','stress','awkward','forc','hesitate','pressure','trouble','willing','deadline','short ',
'suck it up','get along','loner','stressed','resent','skip','social anxiety','bad','not great at networking',
'nightmare','shy','avoid','impres','concerned','difficult','probably','compensate','emotion','unpleasant',
'obligation','nervous','feel pressure','extremely uncomfortable','nerve-wracking','hate networking','immediately',
'hate','would not go','social anxious','panic','unlikely','discomfort','not go','anxiety']

n_high_comb_emp_5_list=['negative','allow','best','hate networking','let','positive','apply','anger',
'beg','bonus','comfortable','dislike','oppurtunity','obligation','improve','concerned','pick','open',
'right away','job','rage','probably','refuse','upset','afraid','risk','alone','social anxiety','consequence',
'agree','prove','fair','colleagues','awkward','paid','grow','avoid going','early','nervous','forc',
'depend','resent','frustrated','difficult']

e_low_3_emp_list=['social anxiety','extremely uncomfortable','nervous','social anxious','panic','unlikely',
'impres','anxiety','probably','introvert','immediately','feel pressure','anxious','decline','emotion',
'nerve-wracking','loner','pressure','avoid','stressed out','stressed','dislike','shy','hesitate','losing',
'bad','difficult','not great at networking','obligation','unfair','stretch myself','hate networking',
'willing','hell','stress','lose','nightmare','quit','avoid going','mad','paid','sad','reluctant','get out of it',
'fair','not a social person','reluctance','quiet','upset','sacrifice','change','not comfortable','money',
'tired','family','appeal','confirm','harm','prove','short ','skip','stick','hate','compensate','deserve',
'short time','sick','outside of my comfort zone','deadline','pretend','discomfort','socially awkward',
'show','angry','win','convince','not interested','apply','get along','negotiate','unpleasant','quickly',
'awkward','not attend','concerned','plead','fire','suck it up','forc','comfortale','pick','uncomfortable',
'unhappy','excuse','compromise','afraid','do not interact well with strangers',"don't like being in social situation",
"don't like networking","don't like socializing",'very shy']    

e_high_3_emp_list=['career','good','frustrated','nice','best','deny','reflect','confident','grow','consequence',
'missed out','connect','rage','importan','worry','I am sociable','party','right away','priority','sociable',
'accept','focus','plan ','report','excited','reward','contribute','allow','success','contact','review',
'absolutely go','for sure go','meet','great','colleagues','social','not need anyone','regardless','fool',
'surely attend','leader','network','I like parties','no problem','learn','friendship','definitely go','introduce',
'let','competition','client','make new friends']

c_high_2_emp_list=['family','report','stress','question','convince','job','deserve','longest','comfortable','win','great time',
'negative','fair','check in','short time','accus','short ','respect','willing','lie','correct','as soon',
'positive','impres','review','problems','immediately','hate networking','anger','proof','upset','prove',
'open','explain','improve','time ','confident','right away','let']    

c_low_comb_emp_1_list=['hard stance','hardship','my right','call in','stare','contact','shy','quit','entitle','sad','reluctant','refuse',
'demand','fool','wrong','get even','profanity','compensate','pressure','convince','responsible','fair','prove',
'client','skip','get along','negotiate','sick','family','hell','difficult','sacrifice','obligation','awkward',
'offer','friendship','hard ','career','threaten','party','allow','avoid','stick','hate','improve','terrible',
'deadline','plan ','great','personal','plead','trouble','marked','respect','probably','early','confront',
'lose','suck it up','better','excuse','suffer','busy','money','stress','beg','unfair','time ','personal issue',
'deny','stressed']
    
c_low_comb_emp_3_list=['obligation','unlikely','panic','would not go','discomfort','deserve','awkward','immediately','unhappy',
'social anxiety','skip','forc','social anxious','harm','probably','unpleasant','worthwhile','oppurtunity',
'fire','negative','get along','shy','team','short ','open','pick','feel pressure','show','quit','angry',
'avoid','hell','confirm','decline','money','socially awkward','negotiate','extremely uncomfortable','time ',
'short time','not go','better','avoid going','not a social person','loner','valid','anger','hesitate',
'hate networking','great time','positive','sociable','prove','emotion','compromise','allow','rage','concerned',
'anxiety','outside of my comfort zone','wrong','deadline','stress','resent','personal issue','reconsider',
'cold','not interested','unfair','early','drink','stressed']

c_low_comb_emp_4_list=['probably','career','appeal','anxiety','depend','wrong','sad','cold','fool','marked','reflect','not go',
'harm','bad','hate networking','job','mad','money','would change','reward','quit','focus','organize',
'stressed','hesitate','leader','valid','difficult','consequence','emotion','personal issue','bonus',
'accept','review','nice','avoid going','plan ','great time','lose','fire','frustrated','pressure','confront']
    
c_low_comb_emp_5_list=['forc','network','resent','considerate','difficult','I had to','frustrated','paid','obligation','nervous',
'refuse','explain','rage','anger','grow','meet','respect','oppurtunity','bonus','nice','insist','job',
'depend','avoid going','upset','awkward','positive','apply','short ','connect','probably','plan ','win',
'open','concerned','question','negative','change','friendship','dislike','focus','alone']
    
c_high_comb_emp_1_list=['question','stubborn','would change','reconsider','as soon','human resources','disagree','defer','risk',
'unpleasant','immediately','worry','argue','petty','explain','mad','proof','hurt','correct','obligated',
'not go','harm','unhappy','leader','misunderstand','win','fire','unlikely','first','pick','angry','priority',
'bonus','quickly','short time','hesitate','tense','social','switch','success','problems','not interested',
'easy going','no problem','report','reflect','upset','anger','team','valid','paid','review','agree',
'short ','willing','fun','concede','show','seniority','flexible','change']
    
c_high_comb_emp_3_list=['no problem','introvert','frustrated','win','sad','missed out','alone','dislike','appeal','depend','insist',
'sick','success','not comfortable','stretch myself','report','benefit','accept','hard ','contribute','responsible',
'compensate','fool','social','absolutely go','regardless','anyone','focus','pretend','worried','nightmare',
'not need anyone','surely attend','for sure go','colleagues','competition','let','help','best','great',
'deny','importan','learn','network','client','uncomfortable','priority','lie','improve','good','not attend',
'fun','definitely go','mad','comfortable','reluctant','excited','excuse','meet']
    
c_high_comb_emp_4_list=['willing','change','respect','connect','fun','paid','hard ','immediately','terrible','grow','incorrect',
'refuse','open','resent','quickly','contact','calm','party','short ','contribute','my right','stubborn',
'rebut','problems','worried','as soon','compromise','hurt','good','proof','not true','early','human resources',
'obligation','colleagues','meet','demand','success','negative','allow','concerned','disagree','let','agree']    
    

c_high_comb_emp_5_list=['success','appeal','worry','fun','busy','hesitate','problems','allow','hurt','improve','excited','good',
'bad','leader','stress','importan','excuse','introduce','lose','enjoy','prove','personal issue','fair',
'quickly','correct','stick','accus','unlikely','comfortable','sad','willing','contact','confus','career',
'show','losing','immediately','compensate','anyone','lie','client','help','learn']
    
a_low_emp_1_list=['stare','responsible','fool','get even','profanity','call in','sick','refuse','emotion','hard stance','racist',
'slash','hardship','demand','compensate','first','stick','quit','personal issue','excuse','trouble','deny',
'hell','depend','money','cold','hard ','marked','pissed','client','deserve','unfair','fair',
'resent','reconsider','offer','my right','hate','forc','worry','reward','reluctant','concerned','organize',
'sad','losing','rage','bad','insist','busy','difficult','appeal','stressed out','stressed','wrong',
'early','longest','proof','better','petty','improve','contact','avoid','accept','entitle','meet',
'if I had to','seniority','suffer','comfortable','regardless','personal','not back down','stand my ground']  
    
a_high_emp_1_list=['agree','benefit','best','bonus','change','compromise','considerate','correct','defer','easy going','family',
'flexible','fun','good','help','hurt','incorrect','leader','let','misunderstand','no problem','not interested',
'obligation','paid','pick','priority','problems','quickly','respect','review','show','willing','win',
'more than willing',"don't want conflict",'easy going','hate conflict','team player','avoid confrontation',
'keep people happy']  
    
a_low_comb_emp_2_list=['wrong','question','busy','probably','resent','not go','importan','fun','enjoy','bad','first','problems',
'refuse','better','short time','good','anxiety','avoid going','respect','compromise','losing','angry',
'regardless','social anxiety','rage','decline','pretend','focus','connect','no problem','priority','excuse',
'procrastinate','fool','sick','personal issue','anticipat','deadline','anyone','lose','difficult','meet',
'judg','worry','plan ','trouble','show','nervous','reflect','help','pressure','compensate','bonus','get along',
'flexible','colleagues','accus','fire','consequence','demand']
    
a_low_comb_emp_3_list=['I like parties','fire','unpleasant','would not go','sales','quit','discomfort','money','hate networking',
'worthwhile','obligation','panic','emotion','unlikely','hell','skip','social anxious','pick','cold',
'decline','not go','paid','get out of it','hate','reward','rage','short ','negotiate','beg','difficult',
'trouble','resent','time ','immediately','stress','stressed','stressed out','reconsider','short time',
'grow','extremely uncomfortable','willing','get along','apply','I had to','risk','anxiety','great',
'forc','allow','socially awkward','dislike','I am sociable','great time','missed out','compensate','oppurtunity',
'anger','benefit','plan ','confirm','avoid','social anxiety','fair','pressure','mad','deserve',
'not a social person']
    
a_low_comb_emp_4_list=['rage','cold','fool','marked','depend','demand','quit','report','probably','career','accept','not go',
'compensate','pressure','quiet','angry','afraid','confront','emotion','job','benefit','mad','threaten',
'money','unpleasant','anxiety','pissed','anyone','obligation','confident','short ','regardless','refuse','appeal',
'hesitate','examples','immediately','bad','suck it up','resent','respect','wrong','harm']
    
a_low_comb_emp_5_list=['paid','refuse','avoid going','alone','emotion','pretend','resent','bonus','win','rage','difficult',
'probably','afraid','anger','forc','hate networking','change','agree','depend','pick','focus','obligation',
'frustrated','considerate','right away','time ','money','negative','colleagues','awkward','improve','success',
'explain','bad','best','respect','let','better','nice','nervous']
    
a_high_comb_emp_2_list=['agree','negative','benefit','overwhelmed','quiet','I had to','lie','team','check in','early','stick',
'feel pressure','allow','family','sacrifice','stressed','learn','frustrated','right away','convince',
'best','let','fair','client','longest','responsible','mad','stressed out','report','time ','upset',
'confident','dislike','unhappy','anger','explain','positive','stress','proof']
    
a_high_comb_emp_3_list=['change','reluctant','angry','quickly','right away','excuse','stick','would change','early','compromise',
'not comfortable','learn','positive','avoid going','anxious','colleagues','fool','reluctance','absolutely go',
'fun','not attend','tired','losing','worry','busy','no problem','contribute','explain','hurt','network',
'uncomfortable','consequence','social','not need anyone','surely attend','regardless','help','better',
'excited','importan','priority','responsible','outside of my comfort zone','party','stretch myself','for sure go',
'hard ','report','focus','client','alone','lie','introduce','friendship','comfortable','contact',
'best','good','definitely go','anyone','meet']  
    
a_high_comb_emp_4_list=['convince','defend','lose','accus','worthwhile','agitated','personal','consequence','concerned','impres',
'anger','success','correct','win','confus','argue','proof','incorrect','focus','terrible','best',
'negative','not justified','as soon','plead','confirm','lie','unfair','early','judg','stressed out','hard ',
'organize','risk','improve','worried','quickly','my right','open','frustrated','contact','meet',
'compromise','pretend','rebut','stress','reconsider','hurt','would not go','importan','positive','problems',
'agree','let','negotiate','allow','explain','learn','prove','better','anxious','colleagues','not true',
'upset','grow']
       
a_high_comb_emp_5_list=['accept','short time','question','happy to go','excited','hard ','impres','good','grow','losing',
'reward','show','contribute','convince','accus','willing','concerned','dislike','contact','hesitate',
'network','comfortable','apply','leader','immediately','stress','correct','importan','great time','hurt',
'offer','confus','help','anyone','lie','client','enjoy','learn']

go_v2_3_list=['go for it','make an appearance','certainly go','would attend','just go','attend','still attend','still go',
'would still go','definitely go','would go','all in','definitely be in attendance','absolutely go',
'attend that meeting','cheerfully go','decide to go','definitely attend','definitely still go','go for sure',
'go to the event','go to the meeting','go to the networking meeting','make sure I go','make time to attend',
'still opt in','time and go']
    
not_go_v2_3_list=['would not go',"wouldn't go",'probably not go',"wouldn't want to go",'unlikely to go','decline',
'not show up','hesitate to go','go home','ditch','avoid','get out of it','probably still go',
'probably go','skip','decide to go','try to go','not going','in attendance','not interested',
'come','not attend','would be going','not come','likely go','happy to go','probably attend',
'bow out of the meeting','choose not to go','consider not going','hate going','likely not go',
'not consider going','not feel like going','not want to go','politely decline','probably would not',
"probably wouldn't",'stay at home','will not go']
    
not_go_5_list=['not go','not to go',"n't go"]

go_5_list=['would go','probably go']

## Word List Section

This section includes the code that was used in the word list prediction. The optimal weights were derived from feedback on the public leaderboard.
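
As a rough illustration of how a word list becomes a numeric feature, here is a minimal sketch. The helper `count_keywords` is hypothetical and only stands in for the notebook's `write_keyword_count_column`; it assumes case-insensitive substring matching that counts every occurrence, which the stem-like entries (e.g. 'importan') suggest but which may differ from the real helper.

```python
import pandas as pd

def count_keywords(text, keywords):
    """Hypothetical helper: total occurrences of any keyword in text (substring match)."""
    lowered = text.lower()
    return sum(lowered.count(keyword.lower()) for keyword in keywords)

# Toy responses standing in for the open_ended_5 column
demo = pd.DataFrame({"open_ended_5": [
    "I would go and probably go again.",
    "I would not go.",
]})

go_words = ["would go", "probably go"]
not_go_words = ["not go", "not to go", "n't go"]

# One count column per word list, then a net "go minus not go" feature
demo["go_5"] = demo["open_ended_5"].apply(count_keywords, keywords=go_words)
demo["not_go_5"] = demo["open_ended_5"].apply(count_keywords, keywords=not_go_words)
demo["go_comb_5"] = demo["go_5"] - demo["not_go_5"]
```

Substring matching is what lets a stem such as 'impres' catch both "impress" and "impression", at the cost of occasional false hits inside unrelated words.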

In [96]:
# Compute new word count variables from word lists

write_keyword_count_column(df_total, 'not_go_5', 'open_ended_5', not_go_5_list)
write_keyword_count_column(df_total, 'go_5', 'open_ended_5', go_5_list)

df_total['go_comb_5']=df_total['go_5']-df_total['not_go_5']

write_keyword_count_column(df_total, 'not_1', 'open_ended_1', not_list)
write_keyword_count_column(df_total, 'not_2', 'open_ended_2', not_list)
write_keyword_count_column(df_total, 'not_3', 'open_ended_3', not_list)
write_keyword_count_column(df_total, 'not_4', 'open_ended_4', not_list)
write_keyword_count_column(df_total, 'not_5', 'open_ended_5', not_list)

df_total['sum_not']=df_total['not_1']+df_total['not_2']+df_total['not_3']+df_total['not_4']+df_total['not_5']

write_keyword_count_column(df_total, 'no_5', 'open_ended_5', no_list)

write_keyword_count_column(df_total, 'n_low_comb_emp_1', 'open_ended_1', n_low_comb_emp_1_list)
write_keyword_count_column(df_total, 'n_low_comb_emp_2', 'open_ended_2', n_low_comb_emp_2_list)
write_keyword_count_column(df_total, 'n_low_comb_emp_3', 'open_ended_3', n_low_comb_emp_3_list)
write_keyword_count_column(df_total, 'n_low_comb_emp_5', 'open_ended_5', n_low_comb_emp_5_list)

df_total['n_low_comb_emp']=df_total['n_low_comb_emp_1']+df_total['n_low_comb_emp_2']+df_total['n_low_comb_emp_3']+df_total['n_low_comb_emp_5']

write_keyword_count_column(df_total, 'n_high_comb_emp_1', 'open_ended_1', n_high_comb_emp_1_list)
write_keyword_count_column(df_total, 'n_high_comb_emp_2', 'open_ended_2', n_high_comb_emp_2_list)
write_keyword_count_column(df_total, 'n_high_comb_emp_3', 'open_ended_3', n_high_comb_emp_3_list)
write_keyword_count_column(df_total, 'n_high_comb_emp_5', 'open_ended_5', n_high_comb_emp_5_list)

df_total['n_high_comb_emp']=df_total['n_high_comb_emp_1']+df_total['n_high_comb_emp_2']+df_total['n_high_comb_emp_3']+df_total['n_high_comb_emp_5']

write_keyword_count_column(df_total, 'e_high_3', 'open_ended_3', e_high_3_list)
write_keyword_count_column(df_total, 'e_high_4', 'open_ended_4', e_high_4_list)
write_keyword_count_column(df_total, 'e_high_5', 'open_ended_5', e_high_5_list)

df_total['e_high_3to5'] = df_total['e_high_3']+df_total['e_high_4']+df_total['e_high_5']

write_keyword_count_column(df_total, 'go_v2', 'open_ended_3', go_v2_3_list)
write_keyword_count_column(df_total, 'not_go_v2', 'open_ended_3', not_go_v2_3_list)

write_keyword_count_column(df_total, 'c_high_2_emp', 'open_ended_2', c_high_2_emp_list)

write_keyword_count_column(df_total, 'c_low_comb_emp_1', 'open_ended_1', c_low_comb_emp_1_list)
write_keyword_count_column(df_total, 'c_low_comb_emp_3', 'open_ended_3', c_low_comb_emp_3_list)
write_keyword_count_column(df_total, 'c_low_comb_emp_4', 'open_ended_4', c_low_comb_emp_4_list)
write_keyword_count_column(df_total, 'c_low_comb_emp_5', 'open_ended_5', c_low_comb_emp_5_list)

df_total['c_low_comb_emp']=df_total['c_low_comb_emp_1']+df_total['c_low_comb_emp_3']+df_total['c_low_comb_emp_4']+df_total['c_low_comb_emp_5']

write_keyword_count_column(df_total, 'c_high_comb_emp_1', 'open_ended_1', c_high_comb_emp_1_list)
write_keyword_count_column(df_total, 'c_high_comb_emp_3', 'open_ended_3', c_high_comb_emp_3_list)
write_keyword_count_column(df_total, 'c_high_comb_emp_4', 'open_ended_4', c_high_comb_emp_4_list)
write_keyword_count_column(df_total, 'c_high_comb_emp_5', 'open_ended_5', c_high_comb_emp_5_list)

df_total['c_high_comb_emp']=df_total['c_high_comb_emp_1']+df_total['c_high_comb_emp_3']+df_total['c_high_comb_emp_4']+df_total['c_high_comb_emp_5']

write_keyword_count_column(df_total, 'e_low_3_emp', 'open_ended_3', e_low_3_emp_list)
write_keyword_count_column(df_total, 'e_high_3_emp', 'open_ended_3', e_high_3_emp_list)

write_keyword_count_column(df_total, 'a_low_1_emp', 'open_ended_1', a_low_emp_1_list)
write_keyword_count_column(df_total, 'a_high_1_emp', 'open_ended_1', a_high_emp_1_list)

write_keyword_count_column(df_total, 'a_low_comb_emp_2', 'open_ended_2', a_low_comb_emp_2_list)
write_keyword_count_column(df_total, 'a_low_comb_emp_3', 'open_ended_3', a_low_comb_emp_3_list)
write_keyword_count_column(df_total, 'a_low_comb_emp_4', 'open_ended_4', a_low_comb_emp_4_list)
write_keyword_count_column(df_total, 'a_low_comb_emp_5', 'open_ended_5', a_low_comb_emp_5_list)

df_total['a_low_comb_emp']=df_total['a_low_comb_emp_2']+df_total['a_low_comb_emp_3']+df_total['a_low_comb_emp_4']+df_total['a_low_comb_emp_5']

write_keyword_count_column(df_total, 'a_high_comb_emp_2', 'open_ended_2', a_high_comb_emp_2_list)
write_keyword_count_column(df_total, 'a_high_comb_emp_3', 'open_ended_3', a_high_comb_emp_3_list)
write_keyword_count_column(df_total, 'a_high_comb_emp_4', 'open_ended_4', a_high_comb_emp_4_list)
write_keyword_count_column(df_total, 'a_high_comb_emp_5', 'open_ended_5', a_high_comb_emp_5_list)

df_total['a_high_comb_emp']=df_total['a_high_comb_emp_2']+df_total['a_high_comb_emp_3']+df_total['a_high_comb_emp_4']+df_total['a_high_comb_emp_5']

In [97]:
# Create list of features to standardize

zvarlist=['char_count_3',
 'char_count_4',
 'avg_word_1',
 'avg_word_2',
 'avg_word_3',
 'avg_word_4',
 'avg_word_5',
 'not_go_5',
 'go_5',
 'go_comb_5',
 'sum_not',
 'no_5',
 'not_5',
 'n_low_comb_emp',
 'n_high_comb_emp',
 'e_high_3',
 'e_high_4',
 'e_high_5',
 'e_high_3to5',
 'go_v2',
 'not_go_v2',
 'c_high_2_emp',
 'c_low_comb_emp',
 'c_high_comb_emp',
 'e_low_3_emp',
 'e_high_3_emp',
 'a_low_1_emp',
 'a_high_1_emp',
 'a_low_comb_emp',
 'a_high_comb_emp']

In [98]:
# Standardize list

cols = zvarlist
for col in cols:
    col_zscore = 'Z'+ col
    df_total[col_zscore] = (df_total[col] - df_total[col].mean())/df_total[col].std(ddof=0)
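
To make the standardization step concrete, here is a small sketch on toy data (the values are invented for illustration). It uses the same population standard deviation (`ddof=0`) as the loop above, so each `Z` column ends up with mean 0 and standard deviation 1.

```python
import pandas as pd

# Toy feature column; mean is 5.0 and population standard deviation is 2.0
demo = pd.DataFrame({"char_count_3": [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]})

col = "char_count_3"
col_mean = demo[col].mean()
col_std = demo[col].std(ddof=0)  # ddof=0 divides by N rather than N - 1

# Same naming convention as the notebook: prefix the column name with 'Z'
demo["Z" + col] = (demo[col] - col_mean) / col_std
```

One caveat worth keeping in mind: a feature that is constant across all rows has a standard deviation of zero, so its z-score column would be all NaN.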

In [99]:
# Weighting features
# Negative weights reverse-score features built from low-end word lists

feature_weights = {
    'Zno_5': -1,
    'Znot_5': -1,
    'Zn_low_comb_emp': -1,
    'Zc_high_2_emp': 1.25,
    'Zc_low_comb_emp': -1,
    'Zc_high_comb_emp': 1.5,
    'Ze_low_3_emp': -1.25,
    'Ze_high_3_emp': 1.25,
    'Znot_go_v2': -1,
    'Zsum_not': -1,
    'Za_low_1_emp': -1.5,
    'Za_low_comb_emp': -1,
    'Za_high_comb_emp': 1.5,
}

for col, weight in feature_weights.items():
    df_total[col] = df_total[col] * weight


In [100]:
df_total['o_pred']=df_total[['Zchar_count_3', 'Zchar_count_4', 'Zno_5', 'Znot_5', 'Ze_high_3to5',  'Zavg_word_2', 
                         'Zavg_word_4', 'Zgo_comb_5']].mean(axis=1)                                


In [101]:
# Recode o_pred: cap extreme high predictions at 0.75

df_total['o_pred'] = df_total['o_pred'].clip(upper=0.75)

In [102]:
df_total['n_pred'] = df_total[['Zn_low_comb_emp', 'Zn_high_comb_emp']].mean(axis=1)


In [103]:
df_total['c_pred'] = df_total[['Zc_high_2_emp','Zc_low_comb_emp','Zc_high_comb_emp','Zavg_word_5']].mean(axis=1)


In [104]:
df_total['e_pred'] = df_total[['Ze_low_3_emp', 'Ze_high_3_emp', 'Zgo_v2', 'Znot_go_v2']].mean(axis=1)


In [105]:
df_total['a_pred'] = df_total[['Zavg_word_4', 'Zsum_not', 'Zavg_word_5', 'Za_low_1_emp', 'Za_high_1_emp', 'Za_low_comb_emp','Za_high_comb_emp']].mean(axis=1)
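
Each trait prediction above is the row-wise mean of weighted z-scores. The sketch below shows that pattern on toy data; the column names `Zlow` and `Zhigh` and the weights are placeholders chosen for illustration, not values from the competition.

```python
import pandas as pd

# Toy standardized features: one from a low-end word list, one from a high-end list
demo = pd.DataFrame({
    "Zlow": [1.0, -1.0, 0.0],
    "Zhigh": [0.5, 1.0, -2.0],
})

# Placeholder weights: a negative weight reverse-scores the low-end feature
weights = {"Zlow": -1.0, "Zhigh": 1.5}

# Multiply each column by its weight, then average across columns per row
weighted = demo[list(weights)].mul(pd.Series(weights), axis=1)
demo["pred"] = weighted.mean(axis=1)
```

Because `mean(axis=1)` skips NaN values by default, a row with one missing feature is still scored from the features that remain.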


##  Machine Learning Section

Much of the machine learning that we applied did not result in stronger predictions on the public leaderboard compared to the word lists. Therefore, many of these exploratory features have been removed. We retained what was ultimately submitted to the private leaderboard.

Not all the features that are defined here are important to the prediction. Again, we were lazy and did not prune.

In [106]:
def syllables_count(text): 
    return textstatistics().syllable_count(text) 

def difficult_word_count(text):
    return textstatistics().difficult_words(text)

def sentence_count(text):
    return textstatistics().sentence_count(text)

def avg_syllables_per_word(text): 
    nsyllables=syllables_count(text)
    nwords=word_count(text)
    ASPW=float(nsyllables)/float(nwords)
    return legacy_round(ASPW,2)

def avg_sentence_length(text): 
    nwords = word_count(text) 
    nsentences = sentence_count(text) 
    average_sentence_length = float(nwords / nsentences) 
    return legacy_round(average_sentence_length,2)
  
def flesch_ease_score(text):
    return textstatistics().flesch_reading_ease(text)
    
def flesch_grade_score(text):
    return textstatistics().flesch_kincaid_grade(text)

def linsear_write_score(text):
    return textstatistics().linsear_write_formula(text)

def dale_chall_score(text):
    return textstatistics().dale_chall_readability_score(text)

def gunning_fog_score(text):
    return textstatistics().gunning_fog(text)

def smog_score(text):
    return textstatistics().smog_index(text)

def automated_readability_score(text):
    return textstatistics().automated_readability_index(text)

def coleman_liau_score(text):
    return textstatistics().coleman_liau_index(text)

# Original grammar-error counter (disabled). The active lang_checker below counts spelling errors instead.
# def lang_checker(text):
#     tool = language_check.LanguageTool('en-US')
#     count=0
#     matches = tool.check(text)
#     for i in range(len(matches)-1):
#         if matches[i].ruleId == 'WHITESPACE_RULE':
#             pass
#         else:
#             count+=1
#     return count

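# Build the spell checker once; constructing it on every call reloads the dictionary.
SPELL_CHECKER = SpellChecker()

def lang_checker(text):
    # Despite the name, this counts distinct misspelled tokens, not grammar errors.
    words = text.split()
    misspelled = SPELL_CHECKER.unknown(words)

    # Count misspelled words
    spell_errors = len(misspelled)

    return spell_errors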

def tokenize(text):
    return TextBlob(text).words
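
The ratio helpers in this cell (`avg_syllables_per_word`, `avg_sentence_length`) divide by a word or sentence count, which would raise `ZeroDivisionError` if that count were zero (for example, for an empty response). A minimal sketch of a guarded ratio, assuming that returning 0.0 for empty text is acceptable for these features (the `safe_ratio` name and its default are illustrative, not part of the original notebook):

```python
def safe_ratio(numerator, denominator, default=0.0, ndigits=2):
    """Return numerator / denominator rounded to ndigits, or default when the denominator is 0."""
    if not denominator:
        return default
    return round(numerator / denominator, ndigits)

# Example: 7 syllables over 5 words, and an empty response with 0 words.
print(safe_ratio(7, 5))
print(safe_ratio(0, 0))
```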

In [107]:
# Compute some spelling-based features
for predictor_col in PREDICTOR_TEXT_COLUMN_NAMES_ALL:
    df_total[predictor_col + "_num_chars"] = df_total[predictor_col].apply(len)
    df_total[predictor_col + "_num_words"] = df_total[predictor_col].apply(word_count)
    df_total[predictor_col + "_num_misspelled"] = df_total[predictor_col].apply(compute_num_spelling_errors)
    df_total[predictor_col + "_flesch_grade"] = df_total[predictor_col].apply(flesch_grade_score) 
    df_total[predictor_col + "_percent_misspelled"] = df_total[[predictor_col + "_num_misspelled",
                              predictor_col + "_num_words"
    ]].apply(lambda x: divide(*x), axis=1)

# Compute readability features
df_total["readability_syllables_count"] = df_total['open_ended_6'].apply(syllables_count) 
df_total["readability_word_count"] = df_total['open_ended_6'].apply(word_count) 
df_total["readability_difficult_count"] = df_total['open_ended_6'].apply(difficult_word_count) 
df_total["readability_sentence_count"] = df_total['open_ended_6'].apply(sentence_count) 
df_total["readability_avg_syllables_per_word"] = df_total['open_ended_6'].apply(avg_syllables_per_word)
df_total["readability_avg_sentence_length"] = df_total['open_ended_6'].apply(avg_sentence_length) 
df_total["readability_flesch_ease_score"] = df_total['open_ended_6'].apply(flesch_ease_score) 
df_total["readability_flesch_grade_score"] = df_total['open_ended_6'].apply(flesch_grade_score) 
df_total["readability_linsear_write_score"] = df_total['open_ended_6'].apply(linsear_write_score) 
df_total["readability_dale_chall_score"] = df_total['open_ended_6'].apply(dale_chall_score) 
df_total["readability_smog_score"] = df_total['open_ended_6'].apply(smog_score) 
df_total["readability_coleman_liau_score"] = df_total['open_ended_6'].apply(coleman_liau_score) 

# Compute the "number_grammar_errors" variable using Nick's lang_checker function.
# Note: lang_checker currently counts misspelled words, not grammar errors.
df_total["number_grammar_errors"] = df_total['open_ended_6'].apply(lang_checker)
  
# Compute average word length for each open-ended response (total characters, spaces included, divided by word count).
df_total['Avg_word_length_1']=df_total['open_ended_1_num_chars']/df_total['open_ended_1_num_words']
df_total['Avg_word_length_2']=df_total['open_ended_2_num_chars']/df_total['open_ended_2_num_words']
df_total['Avg_word_length_3']=df_total['open_ended_3_num_chars']/df_total['open_ended_3_num_words']
df_total['Avg_word_length_4']=df_total['open_ended_4_num_chars']/df_total['open_ended_4_num_words']
df_total['Avg_word_length_5']=df_total['open_ended_5_num_chars']/df_total['open_ended_5_num_words']
df_total['Avg_word_length_6']=df_total['open_ended_6_num_chars']/df_total['open_ended_6_num_words']
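
The `Avg_word_length_*` features divide the total character count of a response (spaces and punctuation included, since `_num_chars` comes from `len`) by its word count, so they run somewhat higher than the true mean word length. A small sketch of a token-based alternative, assuming whitespace tokenization and that stripping surrounding punctuation is desirable (the function name is hypothetical):

```python
import string

def mean_token_length(text):
    """Average length of whitespace-separated tokens, ignoring surrounding punctuation."""
    tokens = [tok.strip(string.punctuation) for tok in text.split()]
    tokens = [tok for tok in tokens if tok]  # drop tokens that were only punctuation
    if not tokens:
        return 0.0
    return sum(len(tok) for tok in tokens) / len(tokens)

response = "I would ask my boss."
print(len(response) / len(response.split()))  # character-based ratio, as in the cell above
print(mean_token_length(response))            # token-based mean
```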


In [108]:
# Get list of the numeric columns to paste into the Z-Score variable list
FEATURES = df_total.select_dtypes(include=[np.number]).columns.tolist()
FEATURES

['A_Scale_score',
 'C_Scale_score',
 'E_Scale_score',
 'N_Scale_score',
 'O_Scale_score',
 'Respondent_ID',
 'open_ended_1_num_words',
 'open_ended_1_num_misspelled',
 'open_ended_1_percent_misspelled',
 'open_ended_2_num_words',
 'open_ended_2_num_misspelled',
 'open_ended_2_percent_misspelled',
 'open_ended_3_num_words',
 'open_ended_3_num_misspelled',
 'open_ended_3_percent_misspelled',
 'open_ended_4_num_words',
 'open_ended_4_num_misspelled',
 'open_ended_4_percent_misspelled',
 'open_ended_5_num_words',
 'open_ended_5_num_misspelled',
 'open_ended_5_percent_misspelled',
 'open_ended_6_num_words',
 'open_ended_6_num_misspelled',
 'open_ended_6_percent_misspelled',
 'O_high_5',
 'C_high_2',
 'A_high_1',
 'E_high_3',
 'A_low_2',
 'A_low_3',
 'A_low_4',
 'A_low_5',
 'N_low_1',
 'N_low_2',
 'N_low_3',
 'N_low_5',
 'C_high_1',
 'C_high_3',
 'C_high_4',
 'C_high_5',
 'A_high_2',
 'A_high_3',
 'A_high_4',
 'A_high_5',
 'N_high_1',
 'N_high_2',
 'N_high_3',
 'N_high_5',
 'O_high_1',
 'O_h

In [109]:
# Create Z scores for all new features

cols = FEATURES
for col in cols:
    col_zscore = 'Z_'+ col
    df_total[col_zscore] = (df_total[col] - df_total[col].mean())/df_total[col].std(ddof=0)
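
The loop above standardizes each column with the population standard deviation (`ddof=0`). A constant column has zero standard deviation, so the division yields NaN for every row. A minimal, dependency-free sketch of the same computation with that case guarded, assuming constant columns should map to all zeros (an assumption; the notebook does not handle this case):

```python
import statistics

def zscores(values):
    """Standardize values using the population standard deviation (ddof=0)."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    if sd == 0:
        # Constant column: avoid dividing by zero.
        return [0.0 for _ in values]
    return [(v - mean) / sd for v in values]

print(zscores([2.0, 4.0, 6.0]))
print(zscores([3.0, 3.0, 3.0]))
```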


In [110]:
Z_Var_List=[
 'Z_open_ended_1_num_words',
 'Z_open_ended_1_num_misspelled',
 'Z_open_ended_1_percent_misspelled',
 'Z_open_ended_2_num_words',
 'Z_open_ended_2_num_misspelled',
 'Z_open_ended_2_percent_misspelled',
 'Z_open_ended_3_num_words',
 'Z_open_ended_3_num_misspelled',
 'Z_open_ended_3_percent_misspelled',
 'Z_open_ended_4_num_words',
 'Z_open_ended_4_num_misspelled',
 'Z_open_ended_4_percent_misspelled',
 'Z_open_ended_5_num_words',
 'Z_open_ended_5_num_misspelled',
 'Z_open_ended_5_percent_misspelled',
 'Z_open_ended_6_num_words',
 'Z_open_ended_6_num_misspelled',
 'Z_open_ended_6_percent_misspelled',
 'Z_O_high_5',
 'Z_C_high_2',
 'Z_A_high_1',
 'Z_E_high_3',
 'Z_A_low_2',
 'Z_A_low_3',
 'Z_A_low_4',
 'Z_A_low_5',
 'Z_N_low_1',
 'Z_N_low_2',
 'Z_N_low_3',
 'Z_N_low_5',
 'Z_C_high_1',
 'Z_C_high_3',
 'Z_C_high_4',
 'Z_C_high_5',
 'Z_A_high_2',
 'Z_A_high_3',
 'Z_A_high_4',
 'Z_A_high_5',
 'Z_N_high_1',
 'Z_N_high_2',
 'Z_N_high_3',
 'Z_N_high_5',
 'Z_O_high_1',
 'Z_O_high_2',
 'Z_O_high_3',
 'Z_O_high_4',
 'Z_E_high_4',
 'Z_E_high_5',
 'Z_A_low_1',
 'Z_E_low_3',
 'Z_GO_3',
 'Z_NOGO_3',
 'Z_GO_5',
 'Z_NOGO_5',
 'Z_NOT_1',
 'Z_NOT_2',
 'Z_NOT_3',
 'Z_NOT_4',
 'Z_NOT_5',
 'Z_NO_5',
 'Z_A_low_comb',
 'Z_N_low_comb',
 'Z_C_high_comb',
 'Z_A_high_comb',
 'Z_N_high_comb',
 'Z_O_high_comb',
 'Z_E_high_3to5',
 'Z_A_not_comb',
 'Z_O_go_comb',
 'Z_open_ended_1_num_chars',
 'Z_open_ended_1_flesch_grade',
 'Z_open_ended_2_num_chars',
 'Z_open_ended_2_flesch_grade',
 'Z_open_ended_3_num_chars',
 'Z_open_ended_3_flesch_grade',
 'Z_open_ended_4_num_chars',
 'Z_open_ended_4_flesch_grade',
 'Z_open_ended_5_num_chars',
 'Z_open_ended_5_flesch_grade',
 'Z_open_ended_6_num_chars',
 'Z_open_ended_6_flesch_grade',
 'Z_readability_syllables_count',
 'Z_readability_word_count',
 'Z_readability_difficult_count',
 'Z_readability_sentence_count',
 'Z_readability_avg_syllables_per_word',
 'Z_readability_avg_sentence_length',
 'Z_readability_flesch_ease_score',
 'Z_readability_flesch_grade_score',
 'Z_readability_linsear_write_score',
 'Z_readability_dale_chall_score',
 'Z_readability_smog_score',
 'Z_readability_coleman_liau_score',
 'Z_number_grammar_errors',
 'Z_Avg_word_length_1',
 'Z_Avg_word_length_2',
 'Z_Avg_word_length_3',
 'Z_Avg_word_length_4',
 'Z_Avg_word_length_5',
 'Z_Avg_word_length_6']

In [112]:
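# Create subset dataframes. Use .copy() so prediction columns can be added later
# without triggering pandas' SettingWithCopyWarning.

df_train = df_total.loc[df_total['Source'] == 'Train'].copy()
df_test = df_total.loc[df_total['Source'] == 'Test'].copy()
df_final = df_total.loc[df_total['Source'] == 'Final'].copy()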

In [113]:
df_total['Source'].value_counts()

Source
Train    1088
Test      300
Final     300
Name: count, dtype: int64

In [114]:
X = df_train[Z_Var_List]
Y = df_train['O_Scale_score']
X_train, X_test, y_train, y_test = train_test_split(X, Y, train_size=0.60, test_size=0.40)
O_pipeline_optimizer = TPOTRegressor(generations=20, population_size=20, cv=5,random_state=42, verbosity=2, n_jobs = -1)
O_pipeline_optimizer.fit(X_train,y_train)

                                                                              
Generation 1 - Current best internal CV score: -0.45993245043907355
Generation 2 - Current best internal CV score: -0.45993245043907355
Generation 3 - Current best internal CV score: -0.459039883890072
Generation 4 - Current best internal CV score: -0.45555577086707466
Generation 5 - Current best internal CV score: -0.45378904790778307
Generation 6 - Current best internal CV score: -0.45378904790778307
Generation 7 - Current best internal CV sc
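
The score TPOT prints is its internal cross-validation score; for `TPOTRegressor` the default scoring is negative mean squared error, so values closer to zero are better. The hold-out split created in this cell (`X_test`, `y_test`) is not used by the fit. As a rough sketch of how hold-out predictions could be checked with a Pearson correlation (the statistic the deep learning section imports as `scipy.stats.pearsonr`), here is a dependency-free version; the numbers are invented for illustration and the helper name is hypothetical:

```python
import math

def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Invented hold-out scores and predictions, for illustration only.
y_holdout = [3.5, 4.0, 4.5, 3.0, 4.25]
y_predicted = [3.8, 3.9, 4.1, 3.6, 4.0]
print(round(pearson_r(y_holdout, y_predicted), 3))
```

In the notebook, the second argument would be something like the fitted optimizer's `predict` output on the hold-out features.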

In [115]:
X = df_train[Z_Var_List]
Y = df_train['C_Scale_score']
X_train, X_test, y_train, y_test = train_test_split(X, Y, train_size=0.60, test_size=0.40)
C_pipeline_optimizer = TPOTRegressor(generations=20, population_size=20, cv=5,random_state=42, verbosity=2, n_jobs = -1)
C_pipeline_optimizer.fit(X_train,y_train)

                                                                             
Generation 1 - Current best internal CV score: -0.2820844364049678
Generation 2 - Current best internal CV score: -0.2820348344037934
Generation 3 - Current best internal CV score: -0.2820348344037934
Generation 4 - Current best internal CV score: -0.2820348344037934
Generation 5 - Current best internal CV score: -0.2820348344037934
Generation 6 - Current best internal CV score: -0.28104864293570725
Generation 7 - Current best internal CV score: -

In [116]:
X = df_train[Z_Var_List]
Y = df_train['E_Scale_score']
X_train, X_test, y_train, y_test = train_test_split(X, Y, train_size=0.60, test_size=0.40)
E_pipeline_optimizer = TPOTRegressor(generations=20, population_size=20, cv=5,random_state=42, verbosity=2, n_jobs = -1)
E_pipeline_optimizer.fit(X_train,y_train)

                                                                             
Generation 1 - Current best internal CV score: -0.542754539535164
Generation 2 - Current best internal CV score: -0.5411311434691033
Generation 3 - Current best internal CV score: -0.5356820304161364
Generation 4 - Current best internal CV score: -0.5316581617030952
Generation 5 - Current best internal CV score: -0.5316200616199164
Generation 6 - Current best internal CV score: -0.5316200616199164
Generation 7 - Current best internal CV score: -0.

In [117]:
X = df_train[Z_Var_List]
Y = df_train['A_Scale_score']
X_train, X_test, y_train, y_test = train_test_split(X, Y, train_size=0.60, test_size=0.40)
A_pipeline_optimizer = TPOTRegressor(generations=20, population_size=20, cv=5,random_state=42, verbosity=2)
A_pipeline_optimizer.fit(X_train,y_train)

                                                                               
Generation 1 - Current best internal CV score: -0.29736157592028345
Generation 2 - Current best internal CV score: -0.29736157592028345
Generation 3 - Current best internal CV score: -0.29732741318582306
Generation 4 - Current best internal CV score: -0.29732741318582306
Generation 5 - Current best internal CV score: -0.29732741318582306
Generation 6 - Current best internal CV score: -0.29637548798815383
Generation 7 - Current best internal CV 

In [118]:
# Auto ML with feature set predicting Neuroticism
X = df_train[Z_Var_List]
Y = df_train['N_Scale_score']
X_train, X_test, y_train, y_test = train_test_split(X, Y, train_size=0.60, test_size=0.40)
N_pipeline_optimizer = TPOTRegressor(generations=20, population_size=20, cv=5,random_state=42, verbosity=2, n_jobs = -1)
N_pipeline_optimizer.fit(X_train,y_train)

                                                                              
Generation 1 - Current best internal CV score: -0.48104831075238186
Generation 2 - Current best internal CV score: -0.48104831075238186
Generation 3 - Current best internal CV score: -0.47922569308107477
Generation 4 - Current best internal CV score: -0.47861565312888354
Generation 5 - Current best internal CV score: -0.47861565312888354
Generation 6 - Current best internal CV score: -0.47861565312888354
Generation 7 - Current best internal CV sc

In [120]:
df_test

Unnamed: 0,A_Scale_score,C_Scale_score,Dataset,E_Scale_score,N_Scale_score,O_Scale_score,Respondent_ID,Source,open_ended_1,open_ended_2,...,Z_readability_dale_chall_score,Z_readability_smog_score,Z_readability_coleman_liau_score,Z_number_grammar_errors,Z_Avg_word_length_1,Z_Avg_word_length_2,Z_Avg_word_length_3,Z_Avg_word_length_4,Z_Avg_word_length_5,Z_Avg_word_length_6
1088,4.666667,4.583333,Dev,3.833333,1.583333,4.416667,10460010474,Test,I would look into changing my vacation plans t...,I would work on the project little by little d...,...,-0.601579,-0.248236,-0.050278,-0.465280,0.547747,-0.220849,-0.374477,-0.216930,1.372646,0.082688
1089,4.583333,5.000000,Dev,4.083333,3.166667,4.750000,10440103178,Test,"I have always been a team player, but this wou...",I would first address my concerns with my boss...,...,-0.044436,-0.416939,-0.122402,1.870369,-0.254109,1.056723,-0.501801,0.907520,-0.446863,0.316143
1090,4.166667,4.000000,Dev,2.500000,1.750000,2.416667,10440099430,Test,I would try to come to a compromise with my co...,I would go to my boss and ask him if he has an...,...,0.347627,0.595275,-0.253536,-1.079924,-0.313317,-1.927906,-1.610265,-0.069316,0.409290,-1.161596
1091,4.166667,4.750000,Dev,4.500000,1.916667,4.500000,10460189074,Test,I would explain to the supervisor why I need t...,I would try to finish the project as soon as I...,...,-0.711632,-0.653122,-1.040341,-0.956996,-0.208275,-0.359690,-1.356688,-0.375642,-0.573661,-0.808626
1092,3.666667,4.000000,Dev,3.833333,2.333333,3.833333,10459700329,Test,I would tell them I will work this time if nex...,I would let him know I have a big project and ...,...,1.943394,-3.554802,-0.174856,-2.678000,-2.676775,-2.248792,0.973655,-1.616769,-0.373999,-1.822368
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1383,2.500000,4.416667,Dev,2.666667,1.333333,4.083333,10460095413,Test,I would avoid doing anything. It's likely I pu...,I would plan out how long it would take for me...,...,-0.188880,-0.518160,-0.915764,0.641080,-0.234286,-0.320575,-0.979713,-0.600724,-0.468943,-0.916312
1384,4.833333,4.916667,Dev,4.416667,1.083333,4.583333,10459955025,Test,If my colleague was not willing to change thei...,I would complete the project the earliest poss...,...,-0.360838,-0.450679,-0.469908,0.026436,0.365695,0.164579,-1.362333,0.655415,0.197926,0.109735
1385,4.750000,5.000000,Dev,3.833333,1.166667,4.000000,10460413642,Test,I would start by finding out if they have plan...,I would complete my project with plenty of tim...,...,-0.209515,-0.147015,0.107083,0.886938,1.434860,-0.049981,0.398559,-0.279103,-0.896192,0.179590
1386,4.333333,4.000000,Dev,3.750000,2.166667,3.750000,10460105559,Test,I would ask my supervisor that since I am not ...,I would get this project done and over with. ...,...,0.189426,0.089168,-0.181413,-0.219422,-0.493101,-0.435618,0.474347,-0.950852,0.271063,-0.476115


In [121]:
test = O_pipeline_optimizer.predict(df_test[Z_Var_List])
test

array([4.13700114, 3.98529496, 3.76850874, 3.81241361, 3.95560496,
       4.10650083, 3.64369744, 4.12097779, 3.69105403, 3.91327018,
       3.82800388, 3.86913752, 3.72453714, 4.13069866, 3.99793406,
       3.93480533, 3.82143547, 3.83433478, 3.88208745, 4.13401051,
       4.07582464, 3.65871354, 3.96590469, 4.00245045, 3.7130331 ,
       4.07624827, 3.88403707, 3.7134188 , 3.92119074, 3.65718481,
       3.80179018, 3.76754705, 3.85769076, 3.89545335, 3.82257067,
       3.83625501, 3.99208356, 4.17172616, 4.19650189, 4.08437071,
       3.91580636, 3.9714718 , 3.8360542 , 3.87428101, 3.85792152,
       3.86125305, 3.82124171, 4.04952698, 3.63575167, 3.91672943,
       4.04653174, 3.80335184, 3.8273403 , 3.80841406, 3.63404248,
       3.83324893, 3.95141601, 4.14061789, 4.01209586, 4.0769265 ,
       3.82025955, 3.87364002, 4.01114598, 3.60664977, 4.05031941,
       3.91050771, 3.77805306, 3.93974303, 4.00936678, 4.00870947,
       3.94334511, 3.92773253, 4.06169633, 3.86935908, 3.71830

In [122]:
# Save predicted values

df_test['O_Scale_pred'] = O_pipeline_optimizer.predict(df_test[Z_Var_List])
df_test['C_Scale_pred'] = C_pipeline_optimizer.predict(df_test[Z_Var_List])
df_test['E_Scale_pred'] = E_pipeline_optimizer.predict(df_test[Z_Var_List])
df_test['A_Scale_pred'] = A_pipeline_optimizer.predict(df_test[Z_Var_List])
df_test['N_Scale_pred'] = N_pipeline_optimizer.predict(df_test[Z_Var_List])

df_final['O_Scale_pred'] = O_pipeline_optimizer.predict(df_final[Z_Var_List])
df_final['C_Scale_pred'] = C_pipeline_optimizer.predict(df_final[Z_Var_List])
df_final['E_Scale_pred'] = E_pipeline_optimizer.predict(df_final[Z_Var_List])
df_final['A_Scale_pred'] = A_pipeline_optimizer.predict(df_final[Z_Var_List])
df_final['N_Scale_pred'] = N_pipeline_optimizer.predict(df_final[Z_Var_List])

df_test.to_csv("ML Test.csv", index = False)
df_final.to_csv("ML Final.csv", index = False)


## Deep Learning Section

We wrote this section to run independently of the previous sections (i.e., it has no dependencies on them). This way, you can see how we used deep learning without getting confused by all the other junk.

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns; sns.set()
import matplotlib.pyplot as plt

# import tensorflow as tf
# import tensorflow_hub as hub

# from keras.layers import Dense, Dropout, Embedding, Flatten, Input, MaxPooling1D
# from keras.optimizers import Adam, SGD
# from keras.models import Sequential
# from keras.preprocessing.sequence import pad_sequences
# # from keras.wrappers.scikit_learn import KerasRegressor

# from keras import backend as K 
# import keras.layers as layers
# from keras.models import Model, load_model
# from keras.engine import Layer

from sklearn.model_selection import cross_val_score, train_test_split, KFold
from sklearn.preprocessing import StandardScaler
from scipy.stats import pearsonr

# Initialize session
# config = tf.ConfigProto()
# config.gpu_options.allow_growth = True
# sess = tf.Session(config=config)
# K.set_session(sess)

# numpy, cross_val_score, KFold, and pearsonr are already imported above.
import tensorflow as tf
import tensorflow_hub as hub
from tensorflow.keras.layers import Input, Dropout, Dense, concatenate
from tensorflow.keras.models import Model
from sklearn.metrics import mean_squared_error

from scikeras.wrappers import KerasRegressor




In [2]:
from tensorflow.keras.layers import Layer
import tensorflow as tf
from tensorflow.keras import layers, Model
from tensorflow.keras.layers import Dropout
from tensorflow_hub import KerasLayer

In [3]:
import logging
logging.getLogger("tensorflow").setLevel(logging.WARNING)

In [4]:
# Adjust these paths so they match your own directory layout
train_raw_df = pd.read_csv("/home/ubuntu/git/NLP/dissertation/data/train_rep.csv")
df_test = pd.read_csv("/home/ubuntu/git/NLP/dissertation/data/test_rep.csv")
df_dev = pd.read_csv("/home/ubuntu/git/NLP/dissertation/data/valid_rep.csv")

In [5]:
import torch
import torch.nn as nn
import torch.nn.functional as F  # needed for F.relu in ElmoConcatRegressionModel
import torch.optim as optim
from allennlp.modules.elmo import Elmo, batch_to_ids
from tqdm import tqdm




In [6]:
# Configuration for ELMo
options_file = "https://allennlp.s3.amazonaws.com/models/elmo/2x4096_512_2048cnn_2xhighway/elmo_2x4096_512_2048cnn_2xhighway_options.json"
weight_file = "https://allennlp.s3.amazonaws.com/models/elmo/2x4096_512_2048cnn_2xhighway/elmo_2x4096_512_2048cnn_2xhighway_weights.hdf5"


In [7]:
# Create the ELMo model
elmo = Elmo(options_file, weight_file, num_output_representations=1, dropout=0).cuda()

In [79]:
dir(elmo)

['T_destination',
 '__annotations__',
 '__call__',
 '__class__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattr__',
 '__getattribute__',
 '__getstate__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__setstate__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 '_apply',
 '_backward_hooks',
 '_backward_pre_hooks',
 '_buffers',
 '_call_impl',
 '_compiled_call_impl',
 '_dropout',
 '_elmo_lstm',
 '_forward_hooks',
 '_forward_hooks_always_called',
 '_forward_hooks_with_kwargs',
 '_forward_pre_hooks',
 '_forward_pre_hooks_with_kwargs',
 '_get_backward_hooks',
 '_get_backward_pre_hooks',
 '_get_name',
 '_has_cached_vocab',
 '_is_full_backward_hook',
 '_keep_sentence_boundaries',
 '_load_from_state_dict',
 '_load_state_dict_post_hooks',
 '_load_state_dict_pre_hooks',
 '_maybe_warn_non_f

In [80]:
elmo

Elmo(
  (_elmo_lstm): _ElmoBiLm(
    (_token_embedder): _ElmoCharacterEncoder(
      (char_conv_0): Conv1d(16, 32, kernel_size=(1,), stride=(1,))
      (char_conv_1): Conv1d(16, 32, kernel_size=(2,), stride=(1,))
      (char_conv_2): Conv1d(16, 64, kernel_size=(3,), stride=(1,))
      (char_conv_3): Conv1d(16, 128, kernel_size=(4,), stride=(1,))
      (char_conv_4): Conv1d(16, 256, kernel_size=(5,), stride=(1,))
      (char_conv_5): Conv1d(16, 512, kernel_size=(6,), stride=(1,))
      (char_conv_6): Conv1d(16, 1024, kernel_size=(7,), stride=(1,))
      (_highways): Highway(
        (_layers): ModuleList(
          (0-1): 2 x Linear(in_features=2048, out_features=4096, bias=True)
        )
      )
      (_projection): Linear(in_features=2048, out_features=512, bias=True)
    )
    (_elmo_lstm): ElmoLstm(
      (forward_layer_0): LstmCellWithProjection(
        (input_linearity): Linear(in_features=512, out_features=16384, bias=False)
        (state_linearity): Linear(in_features=512, ou

In [8]:
import numpy as np
import pandas as pd
import tensorflow as tf
from tensorflow.keras import layers, Model
from tensorflow.keras.layers import Dropout
import tensorflow_hub as hub
from sklearn.model_selection import train_test_split
from scipy.stats import pearsonr
from scikeras.wrappers import KerasRegressor



In [109]:
# Assuming you have train_raw_df, df_test, and df_dev DataFrames

ATTRIBUTE_LIST = ["E", "A", "O", "C", "N"]

X = train_raw_df[['open_ended_' + str(idx) for idx in range(1, 6)]]
Y = np.array(train_raw_df[[att + "_Scale_score" for att in ATTRIBUTE_LIST]].values)

X_train, X_test, Y_train, Y_test = train_test_split(
    X,
    Y,
    test_size=0.2,
    random_state=23
)

# X_train = [X_train['open_ended_' + str(idx)].values for idx in range(1, 6)]
# X_test = [X_test['open_ended_' + str(idx)].values for idx in range(1, 6)]


# X_dev = [df_test['open_ended_' + str(idx)].values for idx in range(1, 6)]
# X_dev_ = [df_dev['open_ended_' + str(idx)].values for idx in range(1, 6)]


# X_train = np.array(X_train).T
# X_test = np.array(X_test).T

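# Build one Series of responses per open-ended item (5 items) for each split.
# Note that X_test is reused: it first holds the 20% hold-out from train_test_split,
# which is saved as X_train_dev before X_test is overwritten with the test file.
X_train = [X_train['open_ended_' + str(idx)] for idx in range(1, 6)]      # training split
X_train_dev = [X_test['open_ended_' + str(idx)] for idx in range(1, 6)]   # hold-out from the training file
X_test = [df_test['open_ended_' + str(idx)] for idx in range(1, 6)]       # test file
X_dev = [df_dev['open_ended_' + str(idx)] for idx in range(1, 6)]         # validation file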

In [110]:
y_test = np.array(df_test[[att + "_Scale_score" for att in ATTRIBUTE_LIST]].values)

In [111]:
# Convert sentences (nested lists) to character IDs
X_train_ids = [batch_to_ids([sentence.split() for sentence in X_train[idx]]).cuda() for idx in range(5)]
X_train_dev_ids = [batch_to_ids([sentence.split() for sentence in X_train_dev[idx]]).cuda() for idx in range(5)]
X_dev_ids = [batch_to_ids([sentence.split() for sentence in X_dev[idx]]).cuda() for idx in range(5)]
X_test_ids = [batch_to_ids([sentence.split() for sentence in X_test[idx]]).cuda() for idx in range(5)]


In [113]:
class ElmoConcatRegressionModel(nn.Module):
    def __init__(self, elmo_model, input_dim, dense_dropout_rate=0.5, include_hidden_layer=False, hidden_layer_size=64):
        super(ElmoConcatRegressionModel, self).__init__()
        self.elmo = elmo_model
        self.dropout = nn.Dropout(dense_dropout_rate)
        self.include_hidden_layer = include_hidden_layer
        
        # Define layers
        if include_hidden_layer:
            self.hidden_layer = nn.Linear(5 * input_dim, hidden_layer_size)
            self.output_layer = nn.Linear(hidden_layer_size, 1)
        else:
            self.output_layer = nn.Linear(5 * input_dim, 1)

    def forward(self, *inputs):
        embeddings = []
        
        # Pass each input through ELMo and compute the mean embedding
        for input_tensor in inputs:
            elmo_output = self.elmo(input_tensor)['elmo_representations'][0]
            mean_embedding = torch.mean(elmo_output, dim=1)  # Mean pooling
            embeddings.append(mean_embedding)
        
        # Concatenate embeddings from all inputs
        concat_embeddings = torch.cat(embeddings, dim=1)
        
        # Apply dropout and hidden layer if specified
        x = self.dropout(concat_embeddings)
        if self.include_hidden_layer:
            x = F.relu(self.hidden_layer(x))
            x = self.dropout(x)
        
        # Final output
        output = self.output_layer(x)
        return output


In [108]:
X_train_ids

[tensor([[[259,  74, 260,  ..., 261, 261, 261],
          [259, 120, 112,  ..., 261, 261, 261],
          [259, 117, 102,  ..., 261, 261, 261],
          ...,
          [  0,   0,   0,  ...,   0,   0,   0],
          [  0,   0,   0,  ...,   0,   0,   0],
          [  0,   0,   0,  ...,   0,   0,   0]],
 
         [[259,  74, 260,  ..., 261, 261, 261],
          [259, 120, 112,  ..., 261, 261, 261],
          [259, 100, 105,  ..., 261, 261, 261],
          ...,
          [  0,   0,   0,  ...,   0,   0,   0],
          [  0,   0,   0,  ...,   0,   0,   0],
          [  0,   0,   0,  ...,   0,   0,   0]],
 
         [[259,  74, 260,  ..., 261, 261, 261],
          [259, 103, 106,  ..., 261, 261, 261],
          [259,  98, 116,  ..., 261, 261, 261],
          ...,
          [  0,   0,   0,  ...,   0,   0,   0],
          [  0,   0,   0,  ...,   0,   0,   0],
          [  0,   0,   0,  ...,   0,   0,   0]],
 
         ...,
 
         [[259,  74, 260,  ..., 261, 261, 261],
          [259, 12

In [98]:
def process_batches(data_loader, data_tensor):
    all_embeddings = []
    for batch_indices in data_loader:
        batch_data = data_tensor[batch_indices]  # Index tensors using Python lists for proper slicing
        embeddings = elmo(batch_data)['elmo_representations'][0]  # Get ELMo embeddings
        mean_embeddings = torch.mean(embeddings, dim=1).cuda()  # Mean across sequence length dimension
        all_embeddings.append(mean_embeddings)
    return torch.cat(all_embeddings, dim=0)

In [116]:
import torch.nn.functional as F


In [117]:
# Training loop setup
num_epochs = 1
batch_size = 32

# Per-attribute Pearson correlations on the test set, appended to inside the loop
test_scores = []

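for idx, att in enumerate(ATTRIBUTE_LIST):
    print(f"Training for attribute {att}")

    # Targets for the current attribute. targets_train is not defined elsewhere in
    # this section; taking column idx of Y_train is an assumption.
    targets_train = torch.tensor(Y_train[:, idx], dtype=torch.float32).cuda()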

    # Initialize the model
    model = ElmoConcatRegressionModel(elmo, input_dim=1024, include_hidden_layer=True, hidden_layer_size=64).cuda()
    criterion = nn.MSELoss()
    optimizer = optim.Adam(model.parameters(), lr=0.001)

    # Create DataLoaders
    train_loader = DataLoader(TensorDataset(torch.arange(X_train_ids[0].size(0))), batch_size=batch_size, shuffle=True)
    val_loader = DataLoader(TensorDataset(torch.arange(X_dev_ids[0].size(0))), batch_size=batch_size, shuffle=False)
    test_loader = DataLoader(TensorDataset(torch.arange(X_test_ids[0].size(0))), batch_size=batch_size, shuffle=False)

    # Training loop
    for epoch in range(num_epochs):
        model.train()
        running_loss = 0.0
        for batch_indices in tqdm(train_loader):
            optimizer.zero_grad()
            # Collect batch data for all inputs at once
            batch_data = [X_train_ids[i][batch_indices].cuda() for i in range(5)]

            # Forward pass
            outputs = model(*batch_data).squeeze()

            # Get the corresponding target values for this batch
            batch_targets = targets_train[batch_indices].squeeze()

            # Compute loss
            loss = criterion(outputs, batch_targets)
            loss.backward()
            optimizer.step()

            running_loss += loss.item()

        if (epoch + 1) % 2 == 0:
            print(f'Epoch [{epoch + 1}/{num_epochs}], Loss: {running_loss/len(train_loader):.4f}')

    # Evaluate in batches
    model.eval()
    preds_test = []
    preds_val = []

    with torch.no_grad():
        print('making test predictions')
        for batch_indices in tqdm(test_loader):
            batch_data = [X_test_ids[i][batch_indices].cuda() for i in range(5)]
            preds_batch = model(*batch_data).squeeze()
            preds_test.append(preds_batch.cpu().numpy())

        print('making val predictions')
        for batch_indices in tqdm(val_loader):
            batch_data = [X_dev_ids[i][batch_indices].cuda() for i in range(5)]
            preds_batch = model(*batch_data).squeeze()
            preds_val.append(preds_batch.cpu().numpy())

    # Combine predictions for final evaluation
    preds_test = np.concatenate(preds_test)
    preds_val = np.concatenate(preds_val)

    # Save predictions
    df_test[f'{att}_Pred'] = preds_test
    df_dev[f'{att}_Pred'] = preds_val

    pearson_r_test = pearsonr(y_test[:, idx], preds_test)[0]
    test_scores.append(pearson_r_test)

    print(f"{att} - Test r: {pearson_r_test}")
    print("")

print(f"Average Test r: {np.mean(test_scores)}")
# train_scores is never populated in this loop, so only the test average is printed


Training for attribute E


100%|██████████| 109/109 [01:20<00:00,  1.36it/s]


making test predictions


100%|██████████| 38/38 [00:29<00:00,  1.29it/s]


making val predictions


100%|██████████| 38/38 [00:27<00:00,  1.38it/s]


E - Test r: 0.015322170992865492

Training for attribute A


100%|██████████| 109/109 [01:19<00:00,  1.36it/s]


making test predictions


100%|██████████| 38/38 [00:29<00:00,  1.30it/s]


making val predictions


100%|██████████| 38/38 [00:27<00:00,  1.38it/s]


A - Test r: -0.04467315286122029

Training for attribute O


100%|██████████| 109/109 [01:19<00:00,  1.37it/s]


making test predictions


100%|██████████| 38/38 [00:29<00:00,  1.29it/s]


making val predictions


100%|██████████| 38/38 [00:27<00:00,  1.36it/s]


O - Test r: 0.0067910596279656685

Training for attribute C


 47%|████▋     | 51/109 [00:37<00:36,  1.58it/s]

In [104]:
Y_test.shape

(218, 5)

In [101]:
preds_test.shape

(300,)

In [82]:
df_test

Unnamed: 0,Respondent_ID,open_ended_1,open_ended_2,open_ended_3,open_ended_4,open_ended_5,E_Scale_score,A_Scale_score,O_Scale_score,C_Scale_score,N_Scale_score,Dataset
0,10440136230,I would re-schedule my vacation time because I...,I would start on the project as soon as possib...,I would go by myself or convince my colleague ...,I do not feel good about the situation. I woul...,I would find this experience enjoyable. I like...,3.000000,4.333333,4.166667,4.166667,2.333333,Test
1,10459740644,I would likely complain privately to someone o...,I would start working on my project during all...,I want to make sure that my personal life is n...,I would be very upset particularly if the cons...,I would be very interested in learning about N...,2.916667,3.500000,3.750000,4.250000,1.666667,Test
2,10446110785,I would most likely be willing to change. I am...,I would start immediately. I am not a procrast...,I would take the time and go. I would stay for...,"I would feel scared, anxious, frustrated and a...",I would find this experience enjoyable. I love...,3.333333,4.166667,3.583333,5.000000,1.500000,Test
3,10446118106,I would give my partner the vacation week. I w...,I will tell my boss in a gentle yet firm way t...,I would understand my client and not force the...,I would ask my boss to discuss this important ...,I would find this enjoyable. Norway is one of ...,2.833333,4.000000,4.000000,4.500000,1.666667,Test
4,10460409624,It would depend on the plans made for my trip....,I would attempt to finish the project before t...,"If I am looking to advance in my career, then ...",I would not be happy. I would listen to my man...,I would not volunteer to get involved on the p...,2.916667,4.583333,3.500000,4.583333,1.916667,Test
...,...,...,...,...,...,...,...,...,...,...,...,...
295,10446128741,I would first try to talk and negotiate with m...,I would work on that project by breaking it up...,I would go to the meeting. That point of netwo...,I would feel upset which would motivate me to ...,I would find it enjoyable. Any chance to learn...,4.166667,4.583333,3.250000,5.000000,1.333333,Test
296,10459975899,"I would probably be ok with this. Generally, I...",I would have no problem with project completio...,I would see if I could 'lean' him in the other...,"I would push him for details. Who, what,... Th...",I would enjoy the experience. I've lived and w...,4.250000,3.750000,4.166667,4.916667,1.750000,Test
297,10459749226,I would first have a conversation directly wit...,I would chart out my plan to be sure I had eno...,I would probably not want to go either. Howev...,I would take some time to think it over. When...,I think it would be interesting. I am Norwegi...,2.416667,4.166667,3.416667,4.833333,2.333333,Test
298,10459961050,I would look for an alternative solution. Wor...,I would get the project done with the time tha...,I would go to the party. Even if I do not kno...,I would ask for additional information. I wou...,I would find it enjoyable. Learning new thing...,3.750000,3.500000,4.333333,4.333333,3.083333,Test


In [None]:
df_test.to_csv(
    "preds_test_01.csv",
    columns=["Respondent_ID", *[sym + "_Pred" for sym in ATTRIBUTE_LIST]],
    index=False
)

df_dev.to_csv(
    "preds_dev_01.csv",
    columns=["Respondent_ID", *[sym + "_Pred" for sym in ATTRIBUTE_LIST]],
    index=False
)

## Winning submission

Each of the three sets of predicted values generated by the code above was submitted to the private leaderboard. With the exception of Openness, the best predictors from those submissions were then averaged to form a fourth submission. Our Openness predictor was poor, so we continued to tinker with it for the fourth submission. In all cases, the averaged values had stronger correlations than the individual, unaveraged values.

The final submission was as follows:
- Openness: Word List
- Conscientiousness: averaged the z-transformed predicted values from Word List and Deep Learning
- Agreeableness: averaged the z-transformed predicted values from Machine Learning and Deep Learning
- Extraversion: averaged the z-transformed predicted values from Word List and Deep Learning
- Neuroticism: averaged the z-transformed predicted values from Word List and Deep Learning
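
The z-transform-then-average step used for four of the five traits can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used for the submission: `average_z_scores` is a hypothetical helper, and the arrays are synthetic stand-ins for two models' predictions of a single trait.

```python
import numpy as np
from scipy.stats import pearsonr, zscore


def average_z_scores(*prediction_sets):
    """Z-transform each set of predictions, then average them row by row.

    Hypothetical helper for illustration; every input must hold one
    prediction per respondent, in the same respondent order.
    """
    standardized = [zscore(np.asarray(p, dtype=float)) for p in prediction_sets]
    return np.mean(standardized, axis=0)


# Synthetic data (assumption): two noisy predictors of the same trait,
# deliberately placed on different scales.
rng = np.random.default_rng(0)
true_scores = rng.normal(3.5, 0.6, size=200)
word_list_pred = true_scores + rng.normal(0.0, 1.0, size=200)
deep_learning_pred = 10 * true_scores + rng.normal(0.0, 8.0, size=200)

ensemble = average_z_scores(word_list_pred, deep_learning_pred)

# Compare each predictor's correlation with the trait scores
for name, pred in [
    ("Word List", word_list_pred),
    ("Deep Learning", deep_learning_pred),
    ("Averaged", ensemble),
]:
    print(f"{name}: r = {pearsonr(true_scores, pred)[0]:.3f}")
```

Pearson's r is unchanged by shifting or rescaling a single predictor, so the z-transform does not alter either model's own correlation. Its role is to put the two sets of predictions on a common scale so that neither dominates the average.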