In [1]:
!pip install xgboost





In [2]:
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import xgboost as xgb

sentences = pd.read_csv('Sentence_Annotation_Assignments_Final_Dataset.tsv', sep='\t')
sentences.drop(['id', 'text', 'source'], axis=1, inplace=True)

### Trying to predict each bucket label based on everything else

In [3]:
targets = ['satisfaction_bucket', 'considerateness_bucket', 'dedication_bucket', 'emotion_bucket']
accuracies = {}

for i in targets:
    # Target is one bucket column; features are ALL remaining columns
    y_ = sentences[i]
    X_ = sentences.drop([i], axis=1)
    X_train_, X_test_, y_train_, y_test_ = train_test_split(X_, y_)
    knn_ = KNeighborsClassifier()
    knn_.fit(X_train_, y_train_)

    knn_preds_ = knn_.predict(X_test_)
    # accuracy_score takes (y_true, y_pred)
    accuracies[i] = accuracy_score(y_test_, knn_preds_)

In [4]:
accuracies

{'satisfaction_bucket': 0.9399477806788512,
 'considerateness_bucket': 0.9046997389033943,
 'dedication_bucket': 0.9138381201044387,
 'emotion_bucket': 0.9255874673629243}

In [5]:
xgb_targets = ['satisfaction_bucket', 'considerateness_bucket', 'dedication_bucket', 'emotion_bucket']
xgb_accuracies = {}

for i in xgb_targets:
    # Shift labels to start at 0, as XGBoost expects for class labels
    xgb_y_ = sentences[i] - 1
    xgb_X_ = sentences.drop([i], axis=1)
    xgb_X_train_, xgb_X_test_, xgb_y_train_, xgb_y_test_ = train_test_split(xgb_X_, xgb_y_)
    xgb_ = xgb.XGBClassifier()
    xgb_.fit(xgb_X_train_, xgb_y_train_)

    xgb_preds_ = xgb_.predict(xgb_X_test_)
    # accuracy_score takes (y_true, y_pred)
    xgb_accuracies[i] = accuracy_score(xgb_y_test_, xgb_preds_)

In [6]:
xgb_accuracies

{'satisfaction_bucket': 1.0,
 'considerateness_bucket': 1.0,
 'dedication_bucket': 1.0,
 'emotion_bucket': 1.0}

In [7]:
print(sentences.columns)

Index(['satisfaction', 'satisfaction_std', 'satisfaction_bucket',
       'satisfaction_support', 'considerateness', 'considerateness_std',
       'considerateness_bucket', 'considerateness_support', 'dedication',
       'dedication_std', 'dedication_bucket', 'dedication_support', 'emotion',
       'emotion_std', 'emotion_bucket', 'emotion_support'],
      dtype='object')


## DATA LEAKAGE
Unfortunately, there is a big problem with the label predictors above: far too much information is being given to the learners (see the columns printed above). The learners (KNN and XGB) were just learning David's bucket technique. The raw average of the labels is stored in the column {signal}, and it was being used to predict {signal_bucket}.
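
One way to guard against this kind of leakage is to drop every column that shares the target's signal name before fitting. The sketch below is only an illustration: `non_leaky_features` is a hypothetical helper (not part of the original notebook), and it assumes the column naming shown in the printed index, where each signal has `{signal}`, `{signal}_std`, `{signal}_bucket` and `{signal}_support` columns.

```python
def non_leaky_features(columns, target):
    """Return the columns that do not belong to the target's own signal.

    Hypothetical helper: assumes targets are named '<signal>_bucket' and that
    all columns derived from a signal start with '<signal>'.
    """
    suffix = '_bucket'
    signal = target[:-len(suffix)] if target.endswith(suffix) else target
    return [c for c in columns
            if c != signal and not c.startswith(signal + '_')]


# Column names copied from the printed index above
columns = ['satisfaction', 'satisfaction_std', 'satisfaction_bucket',
           'satisfaction_support', 'considerateness', 'considerateness_std',
           'considerateness_bucket', 'considerateness_support', 'dedication',
           'dedication_std', 'dedication_bucket', 'dedication_support',
           'emotion', 'emotion_std', 'emotion_bucket', 'emotion_support']

features = non_leaky_features(columns, 'satisfaction_bucket')
print(features)
```

With the notebook's data this could be used as `sentences[non_leaky_features(sentences.columns, i)]` in place of `sentences.drop([i], axis=1)`. Note that the result still contains the raw averages of the other signals, which is a looser restriction than using only the other bucket labels.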

## SKELETON PREDICTOR
Below is XGBoost using ONLY the other bucket labels (3 columns to predict 1, the "skeleton").
It still does pretty well, with considerateness and dedication getting around 75%. This is good news: if the final transformer/classifier model is very poor at predicting, say, considerateness but good at predicting the other three, then we would still have a decent model for considerateness.
There is good news and bad news here. The good news is that we picked and defined useful signals that can predict the others. The bad news is that the other group's signals are not very good at predicting ours. I don't find this very surprising.

In [33]:
skel_targets = ['satisfaction_bucket', 'considerateness_bucket', 'dedication_bucket', 'emotion_bucket']
skel_accuracies = {}

for i in skel_targets:
    # Shift labels to start at 0, as XGBoost expects for class labels
    skel_y_ = sentences[i] - 1
    # Features: only the three other bucket columns
    skel_X_ = sentences[[x for x in skel_targets if x != i]]
    skel_X_train_, skel_X_test_, skel_y_train_, skel_y_test_ = train_test_split(skel_X_, skel_y_)
    xgb_skel_ = xgb.XGBClassifier()
    xgb_skel_.fit(skel_X_train_, skel_y_train_)

    skel_preds_ = xgb_skel_.predict(skel_X_test_)
    # accuracy_score takes (y_true, y_pred)
    skel_accuracies[i] = accuracy_score(skel_y_test_, skel_preds_)

In [34]:
print(skel_accuracies)

{'satisfaction_bucket': 0.6788511749347258, 'considerateness_bucket': 0.7545691906005222, 'dedication_bucket': 0.7532637075718016, 'emotion_bucket': 0.577023498694517}
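
Because `train_test_split` is called without a fixed seed, each accuracy above comes from a single random split and will shift a little from run to run. A minimal sketch of a steadier estimate uses k-fold cross-validation. The data here is synthetic and made up for illustration (`X_demo`, `y_demo`), and `KNeighborsClassifier` stands in for whichever learner is being evaluated.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data: 200 rows, 3 "bucket" features with values 1..5
rng = np.random.default_rng(0)
X_demo = rng.integers(1, 6, size=(200, 3))
# Made-up target that simply copies the first feature, so it is learnable
y_demo = X_demo[:, 0]

# 5-fold cross-validation: one accuracy per fold
scores = cross_val_score(KNeighborsClassifier(), X_demo, y_demo,
                         cv=5, scoring='accuracy')
print(scores.mean(), scores.std())
```

On the notebook's data the same call could take `sentences[other_buckets]` as features and `sentences[target] - 1` as labels with `xgb.XGBClassifier()` as the estimator, where `other_buckets` and `target` are placeholder names for the columns being used.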


## BABY SKELETON PREDICTOR
Now to try a predictive model using only 2 of the 3 remaining labels to predict 1.
This does surprisingly well: accuracy stays close to the 3-label skeleton results.
So, if we only get two decently accurate models, we can use those two labels to get some idea of the other two.

In [35]:
targets = ['satisfaction_bucket', 'considerateness_bucket', 'dedication_bucket', 'emotion_bucket']
baby_accuracies = {}

for i in targets:
    # Shift labels to start at 0, as XGBoost expects for class labels
    baby_y_ = sentences[i] - 1
    rem_buckets = [x for x in targets if x != i]
    baby_accuracies[i] = []
    for j in range(3):
        # Feature pairs, by index into rem_buckets: (0, 1), (1, 2), (2, 0)
        baby_X_ = sentences[[rem_buckets[j], rem_buckets[(j+1)%3]]]
        baby_X_train_, baby_X_test_, baby_y_train_, baby_y_test_ = train_test_split(baby_X_, baby_y_)
        xgb_baby_ = xgb.XGBClassifier()
        xgb_baby_.fit(baby_X_train_, baby_y_train_)

        baby_preds_ = xgb_baby_.predict(baby_X_test_)
        # accuracy_score takes (y_true, y_pred)
        baby_accuracies[i].append(accuracy_score(baby_y_test_, baby_preds_))

In [36]:
print(baby_accuracies)

{'satisfaction_bucket': [0.6618798955613577, 0.7114882506527415, 0.6631853785900783], 'considerateness_bucket': [0.7650130548302873, 0.7911227154046997, 0.7088772845953003], 'dedication_bucket': [0.7428198433420365, 0.720626631853786, 0.7193211488250653], 'emotion_bucket': [0.587467362924282, 0.49738903394255873, 0.5848563968668408]}


## BONE PREDICTOR
Using only 1 label to predict 1 label!
Again, this does much better than expected.
Baseline accuracy is 20% (uniform random guess over 5 buckets = 1/5).
With only 1 other (perfectly accurate) label we can get:
> Satisfaction: ~65%
> Considerateness: ~71%
> Dedication: ~71%
> Emotion: ~50%

Emotion is the hardest to predict using the other signals.
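
The 20% figure assumes all five buckets are equally likely. If the buckets are imbalanced, always predicting the most common bucket gives a stricter baseline to compare against. The sketch below shows that calculation; `majority_baseline` is a hypothetical helper and `example_labels` is made-up data, not taken from the dataset.

```python
from collections import Counter


def majority_baseline(labels):
    """Accuracy obtained by always predicting the most frequent label."""
    labels = list(labels)
    counts = Counter(labels)
    most_common_count = counts.most_common(1)[0][1]
    return most_common_count / len(labels)


# Made-up labels for illustration: bucket 3 appears 5 times out of 10
example_labels = [3, 3, 3, 2, 4, 3, 1, 5, 3, 2]
baseline = majority_baseline(example_labels)
print(baseline)
```

On the notebook's data this could be called as `majority_baseline(sentences['emotion_bucket'])` for each target, and the accuracies below compared against that value rather than against 20%.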

In [31]:
targets = ['satisfaction_bucket', 'considerateness_bucket', 'dedication_bucket', 'emotion_bucket']
bone_accuracies = {}

for i in targets:
    # Shift labels to start at 0, as XGBoost expects for class labels
    bone_y_ = sentences[i] - 1
    rem_buckets = [x for x in targets if x != i]
    bone_accuracies[i] = []
    for j in range(3):
        # Single feature: one of the other three bucket columns
        bone_X_ = sentences[[rem_buckets[j]]]
        bone_X_train_, bone_X_test_, bone_y_train_, bone_y_test_ = train_test_split(bone_X_, bone_y_)
        xgb_bone_ = xgb.XGBClassifier()
        xgb_bone_.fit(bone_X_train_, bone_y_train_)

        bone_preds_ = xgb_bone_.predict(bone_X_test_)
        # accuracy_score takes (y_true, y_pred)
        bone_accuracies[i].append(accuracy_score(bone_y_test_, bone_preds_))

In [32]:
print(bone_accuracies)

{'satisfaction_bucket': [0.6449086161879896, 0.6723237597911227, 0.6488250652741514], 'considerateness_bucket': [0.7232375979112271, 0.7245430809399478, 0.706266318537859], 'dedication_bucket': [0.7127937336814621, 0.7114882506527415, 0.7049608355091384], 'emotion_bucket': [0.5391644908616188, 0.5026109660574413, 0.48433420365535246]}
