<h1>Predicting Author Personality Traits from Social Media Posts</h1>
<h3>Maxwell Fredenburgh</h3>

<h2>Part 1: Introduction</h2>

<p>In this notebook, I walk through using machine learning and natural language processing to predict
the Big-5 personality traits of a user from their social media posts and other social media data.</p>

<p>The Big-5 personality model is the most popular measure of personality used by psychologists. It describes an individual's personality through five key traits: Conscientiousness, Agreeableness, Neuroticism, Openness, and Extraversion.</p>

<p><strong>Conscientiousness</strong> (messy vs. organized): characterized by one’s organization and setting of goals.</p>
<p><strong>Agreeableness</strong> (uncooperative vs. cooperative): characterized by one’s cooperativeness with others.</p>
<p><strong>Neuroticism</strong> (emotionally stable vs. emotionally unstable): characterized by one’s emotional instability.</p>
<p><strong>Openness</strong> (unimaginative vs. insightful): characterized by one’s insightfulness, imagination, and ability to consider abstract ideas.</p>
<p><strong>Extraversion</strong> (shy vs. sociable): characterized by one’s sociability with others.</p>


<h2>Part 2: The Data</h2>

<h3>A. The default dataset</h3>
<p>The default dataset is a sample of the myPersonality dataset collected by Michal Kosinski and David Stillwell at the University of Cambridge.</p>
<p>Celli, F., Pianesi, F., Stillwell, D., & Kosinski, M. (2013). Workshop on
Computational Personality Recognition: Shared Task. AAAI Workshop - Technical
Report. Association for the Advancement of Artificial Intelligence.</p>
<a href="https://www.researchgate.net/profile/Fabio_Celli/publication/258045642_Workshop_on_Computational_Personality_Recognition_Shared_Task/links/00b49526b90298373b000000/Workshop-on-Computational-Personality-Recognition-Shared-Task.pdf">https://www.researchgate.net/profile/Fabio_Celli/publication/258045642_Workshop_on_Computational_Personality_Recognition_Shared_Task/links/00b49526b90298373b000000/Workshop-on-Computational-Personality-Recognition-Shared-Task.pdf</a>

In [1]:
import pandas as pd
pd.options.mode.chained_assignment = None
df = pd.read_csv("data/mypersonality_final.csv", encoding="ISO-8859-1")
df.head(5)

Unnamed: 0,#AUTHID,STATUS,sEXT,sNEU,sAGR,sCON,sOPN,cEXT,cNEU,cAGR,cCON,cOPN,DATE,NETWORKSIZE,BETWEENNESS,NBETWEENNESS,DENSITY,BROKERAGE,NBROKERAGE,TRANSITIVITY
0,b7b7764cfa1c523e4e93ab2a79a946c4,likes the sound of thunder.,2.65,3.0,3.15,3.25,4.4,n,y,n,n,y,06/19/09 03:21 PM,180.0,14861.6,93.29,0.03,15661.0,0.49,0.1
1,b7b7764cfa1c523e4e93ab2a79a946c4,is so sleepy it's not even funny that's she ca...,2.65,3.0,3.15,3.25,4.4,n,y,n,n,y,07/02/09 08:41 AM,180.0,14861.6,93.29,0.03,15661.0,0.49,0.1
2,b7b7764cfa1c523e4e93ab2a79a946c4,is sore and wants the knot of muscles at the b...,2.65,3.0,3.15,3.25,4.4,n,y,n,n,y,06/15/09 01:15 PM,180.0,14861.6,93.29,0.03,15661.0,0.49,0.1
3,b7b7764cfa1c523e4e93ab2a79a946c4,likes how the day sounds in this new song.,2.65,3.0,3.15,3.25,4.4,n,y,n,n,y,06/22/09 04:48 AM,180.0,14861.6,93.29,0.03,15661.0,0.49,0.1
4,b7b7764cfa1c523e4e93ab2a79a946c4,is home. <3,2.65,3.0,3.15,3.25,4.4,n,y,n,n,y,07/20/09 02:31 AM,180.0,14861.6,93.29,0.03,15661.0,0.49,0.1


In [2]:
print('Number of rows: '+str(df.shape[0]))
print('Number of columns: '+str(df.shape[1]))

Number of rows: 9917
Number of columns: 20


<h3>B. Data cleaning</h3>
<p>Convert the y/n trait labels to binary 1/0 values and remove rows with N/A values.</p>

In [3]:
# Keep the status text, the social network features, and the binary trait labels
trait_cols = ['cEXT','cNEU','cAGR','cCON','cOPN']
included_features = ['NETWORKSIZE','BETWEENNESS','NBETWEENNESS','DENSITY','BROKERAGE','NBROKERAGE','TRANSITIVITY']
# .copy() gives an independent dataframe, so the assignments below do not write to a view of df
myp = df[['STATUS'] + included_features + trait_cols].copy()

# Map the y/n labels to 1/0
for col in trait_cols:
    myp[col] = myp[col].map({'y': 1, 'n': 0})

# Drop rows with NaN values in any social network feature
myp = myp.dropna(subset=included_features)

bigdata = myp
bigdata.head(5)

Unnamed: 0,STATUS,NETWORKSIZE,BETWEENNESS,NBETWEENNESS,DENSITY,BROKERAGE,NBROKERAGE,TRANSITIVITY,cEXT,cNEU,cAGR,cCON,cOPN
0,likes the sound of thunder.,180.0,14861.6,93.29,0.03,15661.0,0.49,0.1,0,1,0,0,1
1,is so sleepy it's not even funny that's she ca...,180.0,14861.6,93.29,0.03,15661.0,0.49,0.1,0,1,0,0,1
2,is sore and wants the knot of muscles at the b...,180.0,14861.6,93.29,0.03,15661.0,0.49,0.1,0,1,0,0,1
3,likes how the day sounds in this new song.,180.0,14861.6,93.29,0.03,15661.0,0.49,0.1,0,1,0,0,1
4,is home. <3,180.0,14861.6,93.29,0.03,15661.0,0.49,0.1,0,1,0,0,1


In [4]:
print('Number of rows: '+str(bigdata.shape[0]))
print('Number of columns: '+str(bigdata.shape[1]))

Number of rows: 9916
Number of columns: 13


<h2>Part 3: Baseline Analysis</h2>
<p>Through data exploration, we can compute the number of positive and negative classifications
for each class (CANOE) and therefore the prior probabilities of each class being predicted.</p>

In [5]:
import numpy as np
import matplotlib.pyplot as plt
value_counts = pd.DataFrame({'Trait':['cCON', 'cAGR', 'cNEU', 'cOPN', 'cEXT'],
                   'N count':[bigdata["cCON"].value_counts()[0],bigdata["cAGR"].value_counts()[0],bigdata["cNEU"].value_counts()[0],bigdata["cOPN"].value_counts()[0],bigdata["cEXT"].value_counts()[0]],
                   'Y count': [bigdata["cCON"].value_counts()[1],bigdata["cAGR"].value_counts()[1],bigdata["cNEU"].value_counts()[1],bigdata["cOPN"].value_counts()[1],bigdata["cEXT"].value_counts()[1]]})
value_counts.plot(x="Trait", y=["N count", "Y count"], kind="bar")

[Figure: bar chart of N count and Y count for each trait (cCON, cAGR, cNEU, cOPN, cEXT)]

<p>From these counts, we can work out the accuracy that a hypothetical untrained classifier would reach
for each personality trait by always predicting the more frequent class.</p>

In [6]:
for index, row in value_counts.iterrows():
    # Use "total" rather than "sum" so the built-in sum() is not shadowed
    total = row['N count']+row['Y count']
    ppN = round((row['N count']/total)*100, 2)
    ppY = round((row['Y count']/total)*100, 2)
    hyp_acc = max(ppN, ppY)
    value_counts.loc[index, 'prior probability N'] = ppN
    value_counts.loc[index, 'prior probability Y'] = ppY
    value_counts.loc[index, 'Hypothetical Accuracy %'] = hyp_acc
value_counts.head(10)

Unnamed: 0,Trait,N count,Y count,prior probability N,prior probability Y,Hypothetical Accuracy %
0,cCON,5361,4555,54.06,45.94,54.06
1,cAGR,4649,5267,46.88,53.12,53.12
2,cNEU,6199,3717,62.52,37.48,62.52
3,cOPN,2547,7369,25.69,74.31,74.31
4,cEXT,5707,4209,57.55,42.45,57.55


<p>For example, if we were to create a model
that classifies every status as positive for the trait Openness, we could expect the model
to have an accuracy score of 74.31%. Our goal is to improve on these hypothetical accuracies.</p>
<p>If the models trained on linguistic features classify more accurately than these baselines, then we can conclude that personality traits are expressed through different linguistic cues within text, and that identifying these cues can aid in classifying the Big-5 personality traits of the author.</p>
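
<p>The majority-class baseline described above can also be reproduced with scikit-learn's DummyClassifier. The sketch below is illustrative only: it rebuilds a label vector from the cOPN counts in the table (2547 N, 7369 Y) instead of using the real dataframe, and the feature matrix is a placeholder because the dummy model ignores its inputs.</p>

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Label vector rebuilt from the cOPN counts reported above: 2547 N (0), 7369 Y (1)
y = np.array([0] * 2547 + [1] * 7369)
# Placeholder features: DummyClassifier never looks at them
X = np.zeros((len(y), 1))

# "most_frequent" always predicts the majority class seen during fit
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X, y)

# Accuracy of always predicting the majority class, as a percentage
baseline_acc = round(baseline.score(X, y) * 100, 2)
print("Majority-class accuracy for cOPN:", baseline_acc)
```

<p>Any trained model for a trait should be compared against this number rather than against 50%.</p>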

<h2>Part 4: Natural Language Processing and Feature Creation</h2>
<p>Pennebaker and King (1999) found many correlations between the Big-5 personality traits of University of Texas students and the linguistic cues observed in their essays. These correlations are summarized in the table below.</p>
<table>
  <tr>
    <th>Big-5 Trait</th>
    <th>Associated Linguistic Cues</th>
  </tr>
  <tr>
    <td>Conscientiousness</td>
    <td>
        <ul>
            <li>Avoidance of negations</li>
            <li>Avoidance of negative emotion words</li>
        </ul>
    </td>
  </tr>
  <tr>
    <td>Agreeableness</td>
    <td>
        <ul>
            <li>More frequent use of positive emotion words</li>
            <li>Avoidance of negative emotion words</li>
        </ul>
    </td>
  </tr>
  <tr>
    <td>Neuroticism</td>
    <td>
        <ul>
            <li>More frequent use of negative emotion words</li>
            <li>Less frequent use of positive emotion words</li>
            <li>More frequent use of first-person singular pronouns</li>
        </ul>
    </td>
  </tr>
  <tr>
    <td>Openness</td>
    <td>
        <ul>
            <li>Avoidance of first-person singular pronouns</li>
            <li>Avoidance of present-tense forms</li>
            <li>Tendency to use longer words</li>
        </ul>
    </td>
  </tr>
  <tr>
    <td>Extraversion</td>
    <td>
        <ul>
            <li>More frequent use of positive words</li>
            <li>More frequent use of pronouns, verbs, adverbs, and interjections</li>
        </ul>
    </td>
  </tr>
</table>
<p>Pennebaker, J. W., & King, L. A. (1999). Linguistic styles: Language use as an individual
difference. Journal of Personality and Social Psychology, 77, 1296–1312.</p>
<a href="https://www.researchgate.net/publication/12688664_Linguistic_styles_Language_use_as_an_individual_difference">https://www.researchgate.net/publication/12688664_Linguistic_styles_Language_use_as_an_individual_difference</a>

<p>Based on the findings of Pennebaker & King (1999), we can use NLP techniques to identify relevant linguistic features within statuses:</p>
<p><strong>Step 1.</strong> Use scikit-learn’s (sklearn) CountVectorizer to convert the statuses to a
matrix of token counts.</p>
<p><strong>Step 2.</strong> Use the Natural Language Toolkit’s (NLTK) word_tokenize to split each status into tokens, then drop
non-alphabetic tokens. (The stopword list is imported, but the feature-extraction code in this notebook does not apply it.)</p>
<p><strong>Step 3. </strong>Count the number of tokens and compute the average token length in each status.</p>
<p><strong>Step 4. </strong>Use AFINN sentiment analysis to determine the frequency of positive emotion
words, the frequency of negative emotion words, and the overall sentiment of each status.</p>
<p><strong>Step 5.</strong> Use NLTK’s part-of-speech (POS) tagger to determine the frequencies of
grammatical categories (e.g. noun, verb, adjective) and number (plural,
singular). This creates a frequency feature for each POS tag (e.g. NN,
VB, JJ).</p>
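
<p>Steps 3 to 5 can be sketched for a single status as follows. This is a simplified illustration, not the notebook's implementation: a regular expression stands in for word_tokenize, TOY_LEXICON is a hypothetical three-word stand-in for AFINN, and the POS tags are assigned by hand instead of coming from nltk.pos_tag.</p>

```python
import re
from collections import Counter

# Hypothetical mini-lexicon standing in for AFINN (word -> valence score)
TOY_LEXICON = {"likes": 2, "funny": 2, "sore": -1}

def status_features(status, tagged_tokens):
    """Compute length, emotion, and POS-frequency features for one status.

    tagged_tokens is a list of (token, tag) pairs in the shape returned by
    nltk.pos_tag; in this sketch it is supplied by hand.
    """
    # Keep alphabetic runs only (a simplification of tokenizing and filtering)
    tokens = re.findall(r"[a-z]+", status.lower())
    n = len(tokens)
    if n == 0:
        return {"numTokens": 0, "avgTokenLen": 0.0}
    feats = {
        "numTokens": n,
        "avgTokenLen": sum(len(t) for t in tokens) / n,
        # Share of tokens with a positive / negative lexicon score
        "posEmo": sum(TOY_LEXICON.get(t, 0) > 0 for t in tokens) / n,
        "negEmo": sum(TOY_LEXICON.get(t, 0) < 0 for t in tokens) / n,
    }
    # Relative frequency of each POS tag within the status
    for tag, count in Counter(tag for _, tag in tagged_tokens).items():
        feats[tag] = count / len(tagged_tokens)
    return feats

# Hand-assigned Penn Treebank tags for illustration (not actual tagger output)
tagged = [("likes", "VBZ"), ("the", "DT"), ("sound", "NN"), ("of", "IN"), ("thunder", "NN")]
feats = status_features("likes the sound of thunder.", tagged)
print(feats)
```

<p>Dividing each count by the number of tokens keeps long and short statuses comparable, which is why the features are frequencies and not raw counts.</p>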

In [7]:
#Import NLP Libraries
import nltk
from nltk import word_tokenize
import re
#nltk.download('stopwords')
# Import the stopwords
from nltk.corpus import stopwords
from afinn import Afinn
af = Afinn(emoticons=True)

In [8]:
features = {}

for status in bigdata["STATUS"]:
    tokens = word_tokenize(status.lower())
    tok_alpha = [t for t in tokens if re.match("^[a-zA-Z]+$", t)]
    posTok = nltk.pos_tag(tok_alpha)
    for tokenPOS in posTok:
        token = tokenPOS[0]
        pos = tokenPOS[1]
        if not pos in features:
            features[pos] = []

features["numTokens"] = []
features["avgTokenLen"] = []
features["sentimentScore"] = []
features["posEmo"] = []
features["negEmo"] = []
features["neuEmo"] = []

for status in bigdata["STATUS"]:
    tokens = word_tokenize(status.lower())
    tok_alpha = [t for t in tokens if re.match("^[a-zA-Z]+$", t)]
    posTok = nltk.pos_tag(tok_alpha)
    numTokens = len(tok_alpha)
    totalCharLen = 0
    posEmo = 0
    negEmo = 0
    neuEmo = 0
    
    statusFeatures = {}
    
    for tokenPOS in posTok:
        token = tokenPOS[0]
        pos = tokenPOS[1]
        totalCharLen+=len(token)
        emo = af.score(token)
        if emo==0:
            neuEmo+=1
        elif emo>0:
            posEmo+=1
        else:
            negEmo+=1
        
        if pos in statusFeatures:
            statusFeatures[pos] += (1/numTokens)
            #statusFeatures[pos] += (1)
        else:
            statusFeatures[pos] = (1/numTokens)
            #statusFeatures[pos] += (1)
            
    if numTokens != 0:
        statusFeatures["numTokens"] = numTokens
        statusFeatures["avgTokenLen"] = totalCharLen/numTokens
        statusFeatures["sentimentScore"] = af.score(status)
        statusFeatures["posEmo"] = posEmo/numTokens
        statusFeatures["negEmo"] = negEmo/numTokens
        statusFeatures["neuEmo"] = neuEmo/numTokens
    else:
        statusFeatures["numTokens"] = 0
        statusFeatures["avgTokenLen"] = 0
        statusFeatures["sentimentScore"] = 0
    
    for key in features:
        if key in statusFeatures:
            features[key].append(statusFeatures[key])
        else:
            features[key].append(0)

# Shift sentimentScore so that its minimum is 0 and no value is negative
min_score = min(features["sentimentScore"])
features["sentimentScore"] = [t - min_score for t in features["sentimentScore"]]
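
<p>The shift to non-negative values matters for the Naive Bayes model used later in the notebook, because scikit-learn's MultinomialNB rejects negative feature values. A minimal sketch with made-up scores and labels (not taken from the dataset):</p>

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Made-up sentiment scores (one feature column) and binary labels
scores = np.array([[-3.0], [0.0], [2.0], [5.0]])
labels = np.array([0, 0, 1, 1])

# Fitting on the raw scores fails because one value is negative
try:
    MultinomialNB().fit(scores, labels)
    raised = False
except ValueError:
    raised = True

# Subtracting the minimum moves the smallest score to 0
shifted = scores - scores.min()
clf = MultinomialNB().fit(shifted, labels)
print("raw scores rejected:", raised)
print("shifted scores:", shifted.ravel())
```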

<p>Add the features and their values to the dataframe.</p>

In [9]:
for key in features:
    bigdata[key] = np.asarray(list(map(float, features[key])))
bigdata = bigdata[bigdata.numTokens != 0]
bigdata.head(5)

Unnamed: 0,STATUS,NETWORKSIZE,BETWEENNESS,NBETWEENNESS,DENSITY,BROKERAGE,NBROKERAGE,TRANSITIVITY,cEXT,cNEU,...,NNP,RBS,WP$,POS,numTokens,avgTokenLen,sentimentScore,posEmo,negEmo,neuEmo
0,likes the sound of thunder.,180.0,14861.6,93.29,0.03,15661.0,0.49,0.1,0,1,...,0.0,0.0,0.0,0.0,5.0,4.4,38.0,0.2,0.0,0.8
1,is so sleepy it's not even funny that's she ca...,180.0,14861.6,93.29,0.03,15661.0,0.49,0.1,0,1,...,0.0,0.0,0.0,0.0,13.0,3.307692,40.0,0.076923,0.0,0.923077
2,is sore and wants the knot of muscles at the b...,180.0,14861.6,93.29,0.03,15661.0,0.49,0.1,0,1,...,0.0,0.0,0.0,0.0,25.0,3.56,35.0,0.0,0.12,0.88
3,likes how the day sounds in this new song.,180.0,14861.6,93.29,0.03,15661.0,0.49,0.1,0,1,...,0.0,0.0,0.0,0.0,9.0,3.666667,38.0,0.111111,0.0,0.888889
4,is home. <3,180.0,14861.6,93.29,0.03,15661.0,0.49,0.1,0,1,...,0.0,0.0,0.0,0.0,2.0,3.0,39.0,0.0,0.0,1.0


<p>Create 3 different feature sets for training classification models: </p>
<p><strong>Linguistic Cues (LC): </strong>CountVectorizer matrix of token counts, number of tokens, average token
length, positive emotion word frequency, negative emotion word
frequency, sentiment score of status, POS-tag frequency.</p>
<p><strong>Social Network Metadata (SNM): </strong>Network size, betweenness, normalized betweenness (NBETWEENNESS), density, brokerage,
normalized brokerage (NBROKERAGE), and transitivity.</p>
<p><strong>All (A): </strong>All LC features plus all SNM features.</p>
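
<p>The LC set mixes a sparse token-count matrix with dense per-status features. A minimal sketch of how the two can be placed side by side with scipy.sparse.hstack, using two of the statuses shown earlier and their numTokens and avgTokenLen values from the output above (the variable names are illustrative):</p>

```python
import numpy as np
from scipy import sparse
from sklearn.feature_extraction.text import CountVectorizer

statuses = ["likes the sound of thunder.", "is home. <3"]
# Dense features per status: [numTokens, avgTokenLen]
dense = np.array([[5.0, 4.4], [2.0, 3.0]])

vectorizer = CountVectorizer()
# Sparse matrix with one row per status and one column per vocabulary term
token_counts = vectorizer.fit_transform(statuses)

# Stack the sparse token counts and the dense features column-wise
X = sparse.hstack((token_counts, sparse.csr_matrix(dense)), format="csr")

print(sorted(vectorizer.vocabulary_))
print(X.shape)
```

<p>With its default settings, CountVectorizer lowercases the text and ignores single-character tokens, so "3" does not become a column here.</p>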

In [10]:
#Defining different feature sets for training
lingustic_features = list(features.keys())
all_features = list(features.keys())
social_network_metadata = ['NETWORKSIZE','BETWEENNESS','NBETWEENNESS','DENSITY','BROKERAGE','NBROKERAGE','TRANSITIVITY']
all_features.extend(social_network_metadata)
print("Feature Sets")
print("linguistic features: "+ str(lingustic_features))
print("social network features: "+ str(social_network_metadata))
print("all features: "+ str(all_features))

Feature Sets
linguistic features: ['VBZ', 'DT', 'NN', 'IN', 'RB', 'JJ', 'PRP', 'VBP', 'MD', 'VB', 'TO', 'RBR', 'CC', 'NNS', 'PRP$', 'WRB', 'VBD', 'VBG', 'VBN', 'CD', 'WP', 'RP', 'WDT', 'JJR', 'JJS', 'PDT', 'EX', 'FW', 'UH', 'NNP', 'RBS', 'WP$', 'POS', 'numTokens', 'avgTokenLen', 'sentimentScore', 'posEmo', 'negEmo', 'neuEmo']
social network features: ['NETWORKSIZE', 'BETWEENNESS', 'NBETWEENNESS', 'DENSITY', 'BROKERAGE', 'NBROKERAGE', 'TRANSITIVITY']
all features: ['VBZ', 'DT', 'NN', 'IN', 'RB', 'JJ', 'PRP', 'VBP', 'MD', 'VB', 'TO', 'RBR', 'CC', 'NNS', 'PRP$', 'WRB', 'VBD', 'VBG', 'VBN', 'CD', 'WP', 'RP', 'WDT', 'JJR', 'JJS', 'PDT', 'EX', 'FW', 'UH', 'NNP', 'RBS', 'WP$', 'POS', 'numTokens', 'avgTokenLen', 'sentimentScore', 'posEmo', 'negEmo', 'neuEmo', 'NETWORKSIZE', 'BETWEENNESS', 'NBETWEENNESS', 'DENSITY', 'BROKERAGE', 'NBROKERAGE', 'TRANSITIVITY']


<h2>Part 5: Creation of Classification Models</h2>
<p>We train and test the 3 different feature sets with 4 different classification models: Naive Bayes, Logistic Regression, Multi-Layer Perceptron, and Gradient Boosting.</p>

In [11]:
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import GradientBoostingClassifier

In [12]:
def train_test(X,y):
    """Fit each classifier on one train/test split and print its train and test accuracy."""
    train, test, train_tags, test_tags = train_test_split(X, y,test_size=0.145,random_state=10)

    # The same four models and hyperparameters are used for every trait and feature set
    models = [
        ("NB: ", MultinomialNB()),
        ("LR: ", LogisticRegression(solver='lbfgs', multi_class="multinomial", max_iter=28000, random_state=0)),
        ("MLP:", MLPClassifier(solver='lbfgs', alpha=1e-4, hidden_layer_sizes=(24, 12), random_state=0, max_iter=10000, learning_rate_init=0.01, warm_start=True)),
        ("GB: ", GradientBoostingClassifier(random_state=0)),
    ]

    for name, clf in models:
        clf.fit(train, train_tags)
        print(name + "train accuracy - " + str(round(clf.score(train,train_tags)*100,2))+"%",
              "test accuracy - " + str(round(clf.score(test,test_tags)*100,2))+"%")

<h2>Part 6: Training and Testing Different Feature Sets</h2>
<p>Train each model in classifying each personality trait for each feature set.</p>

In [13]:
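# Import the sparse submodule explicitly; "import scipy" alone does not guarantee scipy.sparse is loaded
from scipy import sparse
from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer()

traits = ['cCON','cAGR','cNEU','cOPN','cEXT']

def train_test_label(feature_set_label,feature_set, labels):
    print("----------"+feature_set_label+"----------")
    # The token-count matrix is stacked in front of whichever feature set is passed in
    X = sparse.hstack((count_vect.fit_transform(bigdata['STATUS']),
                       bigdata[feature_set]),
                      format='csr')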
    for label in labels:
        print("-----"+label+"-----")
        y = bigdata[[label]].values.ravel()
        train_test(X,y)  
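
<p>train_test splits individual rows at random. In the sample rows shown in Part 2, several statuses share one #AUTHID and identical network values, so a row-level split can place statuses from the same author in both the training and the test set. If an author-level split is wanted instead, one option is scikit-learn's GroupShuffleSplit. The sketch below uses toy data with hypothetical author labels; applying it here would require keeping the #AUTHID column, which the cleaning step does not do.</p>

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Toy data: 8 statuses written by 4 hypothetical authors (2 statuses each)
X = np.arange(16).reshape(8, 2)
y = np.array([0, 0, 1, 1, 0, 0, 1, 1])
authors = np.array(["a", "a", "b", "b", "c", "c", "d", "d"])

# test_size is a fraction of the groups (authors), not of the rows
splitter = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=10)
train_idx, test_idx = next(splitter.split(X, y, groups=authors))

train_authors = set(authors[train_idx])
test_authors = set(authors[test_idx])
print("train authors:", sorted(train_authors))
print("test authors:", sorted(test_authors))
```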
        

<p>Linguistic Cues Feature Set</p>

In [14]:
train_test_label('lingustic features',lingustic_features, traits)

----------lingustic features----------
-----cCON-----
NB: train accuracy - 86.28% test accuracy - 61.12%
LR: train accuracy - 93.44% test accuracy - 60.77%
MLP:train accuracy - 94.52% test accuracy - 59.58%
GB: train accuracy - 64.38% test accuracy - 57.06%
-----cAGR-----
NB: train accuracy - 84.82% test accuracy - 60.98%
LR: train accuracy - 93.44% test accuracy - 59.16%
MLP:train accuracy - 85.27% test accuracy - 57.27%
GB: train accuracy - 63.67% test accuracy - 55.52%
-----cNEU-----
NB: train accuracy - 81.68% test accuracy - 64.34%
LR: train accuracy - 93.4% test accuracy - 62.24%
MLP:train accuracy - 71.53% test accuracy - 62.38%
GB: train accuracy - 66.54% test accuracy - 63.01%
-----cOPN-----
NB: train accuracy - 82.38% test accuracy - 76.78%
LR: train accuracy - 93.13% test accuracy - 75.59%
MLP:train accuracy - 77.12% test accuracy - 74.9%
GB: train accuracy - 75.79% test accuracy - 76.71%
-----cEXT-----
NB: train accuracy - 84.05% test accuracy - 61.54%
LR: train accuracy - 

<p>Best Results for the Linguistic Cues Feature Set compared to Hypothetical Accuracies:</p>
<table>
  <tr>
    <th>Big-5 Trait</th>
    <th>Best Accuracy % (NB)</th>
    <th>Hypothetical Accuracy %</th>
    <th>Difference %</th>
  </tr>
  <tr>
    <td>Conscientiousness</td>
    <td>61.12</td>
    <td>54.06</td>
    <td>+7.06</td>
  </tr>
  <tr>
    <td>Agreeableness</td>
    <td>60.98</td>
    <td>53.12</td>
    <td>+7.86</td>
  </tr>
  <tr>
    <td>Neuroticism</td>
    <td>64.34</td>
    <td>62.52</td>
    <td>+1.82</td>
  </tr>
  <tr>
    <td>Openness</td>
    <td>76.78</td>
    <td>74.31</td>
    <td>+2.47</td>
  </tr>
  <tr>
    <td>Extraversion</td>
    <td>61.54</td>
    <td>57.55</td>
    <td>+3.99</td>
  </tr>
  <tr>
    <td><strong>Average</strong></td>
    <td>64.95</td>
    <td>60.31</td>
    <td>+4.64</td>
  </tr>
</table>
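
<p>The comparison columns are plain arithmetic on the NB test accuracies printed in the output above and the hypothetical accuracies from Part 3. A short sketch of that calculation (the dictionary names are illustrative):</p>

```python
# NB test accuracies from the output above and the majority-class baselines from Part 3
best = {"cCON": 61.12, "cAGR": 60.98, "cNEU": 64.34, "cOPN": 76.78, "cEXT": 61.54}
baseline = {"cCON": 54.06, "cAGR": 53.12, "cNEU": 62.52, "cOPN": 74.31, "cEXT": 57.55}

# Per-trait improvement over the baseline, in percentage points
diff = {trait: round(best[trait] - baseline[trait], 2) for trait in best}

# Averages across the five traits
avg_best = round(sum(best.values()) / len(best), 2)
avg_baseline = round(sum(baseline.values()) / len(baseline), 2)
avg_diff = round(avg_best - avg_baseline, 2)

print(diff)
print(avg_best, avg_baseline, avg_diff)
```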

<p>Social Network Metadata Feature Set</p>

In [15]:
train_test_label('social network metadata',social_network_metadata, traits)

----------social network metadata----------
-----cCON-----
NB: train accuracy - 59.06% test accuracy - 59.58%
LR: train accuracy - 58.05% test accuracy - 58.53%
MLP:train accuracy - 55.78% test accuracy - 55.45%
GB: train accuracy - 94.88% test accuracy - 94.62%
-----cAGR-----
NB: train accuracy - 59.51% test accuracy - 56.57%
LR: train accuracy - 57.77% test accuracy - 57.27%
MLP:train accuracy - 46.32% test accuracy - 46.29%
GB: train accuracy - 97.0% test accuracy - 96.29%
-----cNEU-----
NB: train accuracy - 53.16% test accuracy - 54.83%
LR: train accuracy - 68.97% test accuracy - 67.13%
MLP:train accuracy - 62.82% test accuracy - 61.82%
GB: train accuracy - 97.79% test accuracy - 97.2%
-----cOPN-----
NB: train accuracy - 67.48% test accuracy - 65.8%
LR: train accuracy - 73.68% test accuracy - 76.15%
MLP:train accuracy - 26.0% test accuracy - 23.57%
GB: train accuracy - 95.47% test accuracy - 94.97%
-----cEXT-----
NB: train accuracy - 59.96% test accuracy - 62.24%
LR: train accuracy

<p>Best Results for Social Network Metadata Feature Set compared to Hypothetical Accuracies:</p>
<table>
  <tr>
    <th>Big-5 Trait</th>
    <th>Best Accuracy % (GB)</th>
    <th>Hypothetical Accuracy %</th>
    <th>Difference %</th>
  </tr>
  <tr>
    <td>Conscientiousness</td>
    <td>94.62</td>
    <td>54.06</td>
    <td>+40.56</td>
  </tr>
  <tr>
    <td>Agreeableness</td>
    <td>96.29</td>
    <td>53.12</td>
    <td>+43.17</td>
  </tr>
  <tr>
    <td>Neuroticism</td>
    <td>97.2</td>
    <td>62.52</td>
    <td>+34.68</td>
  </tr>
  <tr>
    <td>Openness</td>
    <td>94.97</td>
    <td>74.31</td>
    <td>+20.66</td>
  </tr>
  <tr>
    <td>Extraversion</td>
    <td>97.06</td>
    <td>57.55</td>
    <td>+39.51</td>
  </tr>
  <tr>
    <td><strong>Average</strong></td>
    <td>96.03</td>
    <td>60.31</td>
    <td>+35.72</td>
  </tr>
</table>

<p>All features</p>

In [16]:
train_test_label('all features',all_features, traits)

----------all features----------
-----cCON-----
NB: train accuracy - 59.02% test accuracy - 59.65%
LR: train accuracy - 58.4% test accuracy - 58.95%
MLP:train accuracy - 54.02% test accuracy - 54.2%
GB: train accuracy - 95.29% test accuracy - 95.03%
-----cAGR-----
NB: train accuracy - 59.23% test accuracy - 56.5%
LR: train accuracy - 57.61% test accuracy - 56.92%
MLP:train accuracy - 46.71% test accuracy - 46.92%
GB: train accuracy - 95.85% test accuracy - 95.38%
-----cNEU-----
NB: train accuracy - 53.11% test accuracy - 54.76%
LR: train accuracy - 69.14% test accuracy - 67.27%
MLP:train accuracy - 62.47% test accuracy - 60.77%
GB: train accuracy - 97.8% test accuracy - 97.2%
-----cOPN-----
NB: train accuracy - 66.26% test accuracy - 64.83%
LR: train accuracy - 73.68% test accuracy - 76.15%
MLP:train accuracy - 26.01% test accuracy - 23.57%
GB: train accuracy - 95.43% test accuracy - 95.03%
-----cEXT-----
NB: train accuracy - 60.03% test accuracy - 62.24%
LR: train accuracy - 63.92% te

<p>Best Results for All Feature Set compared to Hypothetical Accuracies:</p>
<table>
  <tr>
    <th>Big-5 Trait</th>
    <th>Best Accuracy % (GB)</th>
    <th>Hypothetical Accuracy %</th>
    <th>Difference %</th>
  </tr>
  <tr>
    <td>Conscientiousness</td>
    <td>95.03</td>
    <td>54.06</td>
    <td>+40.97</td>
  </tr>
  <tr>
    <td>Agreeableness</td>
    <td>95.38</td>
    <td>53.12</td>
    <td>+42.26</td>
  </tr>
  <tr>
    <td>Neuroticism</td>
    <td>97.2</td>
    <td>62.52</td>
    <td>+34.68</td>
  </tr>
  <tr>
    <td>Openness</td>
    <td>95.03</td>
    <td>74.31</td>
    <td>+20.72</td>
  </tr>
  <tr>
    <td>Extraversion</td>
    <td>97.06</td>
    <td>57.55</td>
    <td>+39.51</td>
  </tr>
  <tr>
    <td><strong>Average</strong></td>
    <td>95.94</td>
    <td>60.31</td>
    <td>+35.63</td>
  </tr>
</table>