In [250]:
import pandas as pd
import requests
import os
import time
import operator
from sklearn.metrics import classification_report

### Read raw data file
A copy of the raw datafile can be found on [Open Science Framework](https://mfr.osf.io/render?url=https://osf.io/gyrc7/?action=download%26mode=render).

In [251]:
df_pickle_path = "../../pickles/dataframe_survey_2018-01-23_enriched.pickle"
indata = pd.read_pickle(df_pickle_path)
# remove non-English texts
indata = indata[indata.lang == "en"]
print("Number of rows: {}".format(len(indata)))
print(indata.actual_temp.value_counts())
# print(indata.func.value_counts()) # dominant function for each Myers-Briggs type, not used in this experiment
print()

Number of rows: 22588
nf    9028
nt    7836
sf    3035
st    2189
Name: actual_temp, dtype: int64



# Perceiving functions
## Filter out sensing (s) and intuition (n) samples into the perceiving samples dataset
Sensing (s) is the smaller class with 5224 (3035 + 2189) examples, compared with 16 864 (9028 + 7836) examples of intuition (n). Since we want both classes to be trained on a similar amount of text, we draw 5000 examples from each class and later divide them into training, evaluation and external example sets.

First we create a new column, perc_func, that separates the perceiving function (s/n) from the judging function (t/f).

In [309]:
indata["perc_func"] = indata.actual_temp.str.extract(r"(\w)\w", expand=False)  # first letter = perceiving function
indata.perc_func.value_counts()

n    16864
s     5224
Name: perc_func, dtype: int64
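The pattern `(\w)\w` captures the first of two word characters, so a temperament code such as `nf` maps to its perceiving letter `n`. The following is a minimal sketch using only the standard `re` module rather than pandas. The helper names `perceiving_letter` and `judging_letter` and the second pattern are made up for illustration and are not part of the notebook.

```python
import re

# Same pattern as in the cell above: capture the first of two word characters.
PERCEIVING = re.compile(r"(\w)\w")
# Assumed counterpart for the judging function: capture the second character.
JUDGING = re.compile(r"\w(\w)")

def perceiving_letter(temperament):
    """Return 's' or 'n' for a two-letter temperament code such as 'sf'."""
    match = PERCEIVING.match(temperament)
    return match.group(1) if match else None

def judging_letter(temperament):
    """Return 't' or 'f' for a two-letter temperament code such as 'sf'."""
    match = JUDGING.match(temperament)
    return match.group(1) if match else None

codes = ["nf", "nt", "sf", "st"]
print([perceiving_letter(c) for c in codes])  # ['n', 'n', 's', 's']
print([judging_letter(c) for c in codes])     # ['f', 't', 'f', 't']
```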

Then we sample 5000 from each class. We keep the column "actual_temp" (st, sf, nt or nf) to be able to sanity check the values in the perc_func column.

In [310]:
perc_samples = pd.concat([
            indata[indata.perc_func == "s"].sample(5000, random_state=123456)[["text","perc_func","actual_temp"]],
            indata[indata.perc_func == "n"].sample(5000, random_state=123456)[["text","perc_func","actual_temp"]]
            ])
perc_samples.perc_func.value_counts()

n    5000
s    5000
Name: perc_func, dtype: int64
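The idea behind the cell above is undersampling: draw the same number of rows from each class with a fixed seed so the draw can be repeated. A rough plain-Python sketch of that idea follows. It uses toy labels and a hypothetical helper `balanced_sample`, and the standard `random` module will not reproduce the rows that pandas picks for the same seed.

```python
import random
from collections import Counter

def balanced_sample(labels, per_class, seed):
    """Return sorted positions of `per_class` randomly chosen items per label."""
    rng = random.Random(seed)  # fixed seed makes the draw repeatable
    by_label = {}
    for position, label in enumerate(labels):
        by_label.setdefault(label, []).append(position)
    chosen = []
    for label in sorted(by_label):
        chosen.extend(rng.sample(by_label[label], per_class))
    return sorted(chosen)

# Toy data with the same kind of imbalance as in the notebook (more n than s).
labels = ["n"] * 30 + ["s"] * 12
picked = balanced_sample(labels, per_class=10, seed=123456)
print(Counter(labels[i] for i in picked))
```

Because each class is sampled separately, no position can be picked twice, which is the property the extra checks further down rely on.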

In [311]:
perc_samples.head(2)

Unnamed: 0,text,perc_func,actual_temp
8623,Sonny Jooooooooon INDEX ASK PAST THEME Sonny J...,s,sf
11987,Log in | Tumblr Sign up Terms Privacy Posted b...,s,st


In [312]:
perc_samples.tail(2)

Unnamed: 0,text,perc_func,actual_temp
1034,A small dose of life. A small dose of life. A ...,n,nf
6086,ehhhhh whatever ehhhhh whatever whining about ...,n,nf


### Extra check 1: Assert that the indices of the s and n examples do not overlap

In [313]:
# Using the index method unique(), expected value 10000
len(perc_samples.index.unique())

10000

In [314]:
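# Select the s and n rows from perc_samples and intersect their indices, expected value 0
len(perc_samples[perc_samples.perc_func == "s"].index.intersection(perc_samples[perc_samples.perc_func == "n"].index))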

0

In [317]:
# Store the perceiving samples in a separate file, preserving the original index values in 'origIx'
perc_samples.to_csv("perceiving_samples_n10000.csv", sep=";", index=True, index_label="origIx")

In [326]:
# perc_samples_test is the DataFrame read back from the CSV file in In [328] below
perc_samples_test.origIx.values

array([ 8623, 11987,  5340, ...,  5822,  1034,  6086])

### Extra check 2: Compare origIx with perc_samples.index

In [328]:
# Read in the file to make sure ix and origIx look reasonable
perc_samples_test = pd.read_csv("perceiving_samples_n10000.csv", sep=";")
print("Unique ix: {}".format(len(perc_samples_test.index.unique())))
print("Unique origIx: {}".format(len(perc_samples_test.origIx.unique())))

Unique ix: 10000
Unique origIx: 10000


Something seems to be wrong below

In [333]:
# Intersection of the samples' ix and the original indices. Expected value 0.
len(perc_samples_test.index.intersection(perc_samples_test.origIx))

3738

In [350]:
# Adding .values to origIx just to explore if there is something peculiar with how Pandas uses .intersection
len(perc_samples_test.index.intersection(perc_samples_test.origIx.values))

3738

Why is it 3738? Shouldn't it be all 10000 - or 0 if I confused intersection with union? 

Could it be the way index.intersection works?

In [351]:
s1 = pd.Series([1,2,3])
s2 = pd.Series([2,3,4])
s3 = pd.Series([3,4,5])
print(s1.index.values)
print(s2.index.values)
print(len(s1.index.intersection(s2.index)))
print(len(s1.index.intersection(s2.values))) # Expected value 1 since 2 is also in s1.index.values
print(len(s1.index.intersection(s3.values))) # Expected value 0 since no values in [3,4,5] are in [0,1,2]

[0 1 2]
[0 1 2]
3
1
0


Nope, index.intersection behaves as expected.

Or is it in perc_samples.index?

In [355]:
perc_samples.index

Int64Index([ 8623, 11987,  5340, 23420, 18909,  7557, 17158, 20508,  4604,
            14753,
            ...
             8081, 13099, 16591, 21953, 14180,  4113, 24786,  5822,  1034,
             6086],
           dtype='int64', length=10000)

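Yes. There it was. The indices in perc_samples come in *their* turn from indata.index, which has length 22588 but holds labels well above 10000. After the round trip through the CSV file, perc_samples_test.index is a fresh range from 0 to 9999, so the intersection simply counts the origIx values that happen to fall below 10000. That is where 3738 comes from.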

In [357]:
indata.index

Int64Index([    1,     2,     3,     5,    10,    11,    14,    15,    16,
               17,
            ...
            25425, 25426, 25428, 25429, 25430, 25431, 25432, 25435, 25436,
            25437],
           dtype='int64', length=22588)
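The 3738 can be reasoned about without pandas. Once the rows are re-read from CSV they are numbered 0, 1, 2, ... while origIx still holds labels from the larger, gappy original index, so the two overlap exactly where a sampled label is smaller than the number of sampled rows. The sketch below uses made-up numbers (1000 rows sampled from labels below 2500) purely to illustrate this, so it does not reproduce the figure from the notebook.

```python
import random

rng = random.Random(0)

# Hypothetical original labels: a sparse, gappy index like indata.index.
original_labels = [i for i in range(1, 2500) if i % 9 != 0]
# Sample 1000 rows and keep their original labels (the role of origIx).
orig_ix = rng.sample(original_labels, 1000)
# After writing and re-reading a CSV, the row index is simply 0..999.
new_index = range(len(orig_ix))

overlap = set(new_index) & set(orig_ix)
below_n = [label for label in orig_ix if label < len(orig_ix)]

# The overlap is exactly the set of sampled labels smaller than the sample size.
print(len(overlap), len(below_n))
```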

Maybe we should reset index for perc_samples?

**No.** As long as origIx is sure not to have duplicates we can still ensure that training and evaluation datasets don't overlap.

## Separate training, evaluation and external data and store it to separate files
From each class, select 2100 examples for training, 900 for evaluation and 2000 as an external set for an extra check.

Also store original index number to be able to **ensure that training and evaluation examples do not overlap**.

In [358]:
# sensing 
s_examples = perc_samples[perc_samples.perc_func == "s"]
print("Sensing total: {}".format(len(s_examples)))

s_training = s_examples.iloc[:2100] # select first 2100
print("training: {}".format(len(s_training)))

s_evaluation = s_examples.iloc[2100:3000] # select next 900
print("evaluation: {}".format(len(s_evaluation)))

s_external = s_examples.iloc[3000:5000] # select remaining 2000
print("external: {}".format(len(s_external)))
print()

# intuition
n_examples = perc_samples[perc_samples.perc_func == "n"]
print("Intuition total: {}".format(len(n_examples)))

n_training = n_examples.iloc[:2100] # select first 2100
print("training: {}".format(len(n_training)))

n_evaluation = n_examples.iloc[2100:3000] # select next 900
print("evaluation: {}".format(len(n_evaluation)))

n_external = n_examples.iloc[3000:5000] # select remaining 2000
print("external: {}".format(len(n_external)))

# Combine sensing + intuition to create perceiving classification training, evaluation and external datasets
perc_training = pd.concat([s_training, n_training])
perc_evaluation = pd.concat([s_evaluation, n_evaluation])
perc_external = pd.concat([s_external, n_external])

# Store each dataset to separate CSV-files
perc_training.to_csv("perc_trainingdata_n4200.csv", sep=";", index=True, index_label="origIx")
perc_evaluation.to_csv("perc_evaluationdata_n1800.csv", sep=";", index=True, index_label="origIx")
perc_external.to_csv("perc_externaldata_n4000.csv", sep=";", index=True, index_label="origIx")

Sensing total: 5000
training: 2100
evaluation: 900
external: 2000

Intuition total: 5000
training: 2100
evaluation: 900
external: 2000
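The slicing above cuts each class into consecutive, non-overlapping ranges. A small plain-Python sketch of the same pattern, using a hypothetical helper `split_by_sizes` and stand-in row ids instead of the real DataFrame, might look like this:

```python
def split_by_sizes(items, sizes):
    """Cut `items` into consecutive, non-overlapping chunks of the given sizes."""
    if sum(sizes) > len(items):
        raise ValueError("sizes add up to more than the number of items")
    chunks, start = [], 0
    for size in sizes:
        chunks.append(items[start:start + size])
        start += size
    return chunks

# 5000 stand-in row ids for one class, split 2100 / 900 / 2000 as above.
row_ids = list(range(5000))
training_ids, evaluation_ids, external_ids = split_by_sizes(row_ids, [2100, 900, 2000])
print(len(training_ids), len(evaluation_ids), len(external_ids))  # 2100 900 2000
```

Deriving every boundary from the list of sizes avoids hand-written slice limits that can drift out of sync.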


In [359]:
# perc_training and perc_evaluation still carry the original index values (written out as origIx)
trainingset = set(perc_training.index.values)
evaluationset = set(perc_evaluation.index.values)
print("Evaluationset is subset of trainingset (element-wise): {}".format(evaluationset.issubset(trainingset)))

# Expected union of training and evaluation is 4200 + 1800 = 6000
print("The length of the union of training set ix and evaluation set ix: {}".format(len(trainingset.union(evaluationset))))

# Expected intersection of training and evaluation is 0
print("The length of the intersection of training set and evaluation set: {}".format(len(trainingset.intersection(evaluationset))))


Evaluationset is subset of trainingset (element-wise): False
The length of the union of training set ix and evaluation set ix: 6000
The length of the intersection of training set and evaluation set: 0
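The same overlap check can be extended to all three datasets at once. The sketch below is one possible way to do it with plain sets. The helper names `assert_disjoint` and `has_duplicates` and the ids are invented for illustration.

```python
from itertools import combinations

def assert_disjoint(named_id_sets):
    """Raise AssertionError if any two of the named id collections share an id."""
    for (name_a, ids_a), (name_b, ids_b) in combinations(named_id_sets.items(), 2):
        shared = set(ids_a) & set(ids_b)
        assert not shared, "{} and {} share {} ids".format(name_a, name_b, len(shared))

def has_duplicates(ids):
    """Return True if any id occurs more than once."""
    ids = list(ids)
    return len(ids) != len(set(ids))

# Made-up ids standing in for the origIx columns of the three files.
splits = {
    "training": [11, 12, 13],
    "evaluation": [21, 22, 23],
    "external": [31, 32, 33],
}
assert_disjoint(splits)
print(any(has_duplicates(ids) for ids in splits.values()))  # False
```

Checking for duplicates inside each set matters as well, since a repeated id within one set would not show up in a pairwise intersection.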


# Train uClassify perceiving classifier 
Make sure to load the training data from the correct file containing training data *only*.

In [259]:
trainingdata = pd.read_csv("perc_trainingdata_n4200.csv", sep=";")
print("Training data rows: {}".format(len(trainingdata)))
print("Training top 5 rows: \n{}".format(trainingdata.head(5)))
print()

Training data rows: 4200
Training top 5 rows: 
   origIx                                               text perc_func  \
0    8623  Sonny Jooooooooon INDEX ASK PAST THEME Sonny J...         s   
1   11987  Log in | Tumblr Sign up Terms Privacy Posted b...         s   
2    5340  a thing of blood © hi im logan and i love the ...         s   
3   23420  Nobody can be uncheered with a baloon (^ v ^) ...         s   
4   18909  Wit Beyond Measure Wit Beyond Measure Aug 14, ...         s   

  actual_temp  
0          sf  
1          st  
2          sf  
3          sf  
4          st  



In [260]:
def train_jung_cognitive_functions_en_classes(text, func, classifier):
    """Presupposes that the classifier is created and that setup_jung_functions_en_classes() has already been run.
    text: the text to train on
    func: expects one of ["s","n","t","f"]
    classifier: expects one of ["sntf", "tf", "sn"] (only "sn" and "tf" are handled here)
    
    """
    classifier_names = {"sn": "jung-perceiving-verification-20180321-no2",
                        "tf": "jung-judging-verification-20180321-no2"}
    if classifier not in classifier_names:
        return None  # nothing is trained for other classifiers

    url = "https://api.uclassify.com/v1/me/" + classifier_names[classifier] + "/" + func + "/train"
    data = {"texts": [text]}
    header = {"Content-Type": "application/json",
              "Authorization": "Token " + os.environ["UCLASSIFY_WRITE"]}

    try:
        response = requests.post(url, json=data, headers=header)
    except Exception as e:
        # Retry once after a pause, e.g. after a temporary connection problem
        print("Error: {}. Retrying in 3 minutes.".format(e))
        time.sleep(180)
        response = requests.post(url, json=data, headers=header)
    return response

In [261]:
row_cnt = 1
for ix, row in trainingdata.iterrows():
    train_jung_cognitive_functions_en_classes(text=row["text"], func=row["perc_func"], classifier="sn")
    if row_cnt % 100 == 0:
        print("Row {} of {} trained.".format(row_cnt, len(trainingdata)))
    row_cnt += 1

Row 100 of 4200 trained.
Row 200 of 4200 trained.
Row 300 of 4200 trained.
Row 400 of 4200 trained.
Row 500 of 4200 trained.
Row 600 of 4200 trained.
Row 700 of 4200 trained.
Row 800 of 4200 trained.
Row 900 of 4200 trained.
Row 1000 of 4200 trained.
Row 1100 of 4200 trained.
Row 1200 of 4200 trained.
Row 1300 of 4200 trained.
Row 1400 of 4200 trained.
Row 1500 of 4200 trained.
Row 1600 of 4200 trained.
Row 1700 of 4200 trained.
Row 1800 of 4200 trained.
Row 1900 of 4200 trained.
Row 2000 of 4200 trained.
Row 2100 of 4200 trained.
Row 2200 of 4200 trained.
Row 2300 of 4200 trained.
Row 2400 of 4200 trained.
Row 2500 of 4200 trained.
Row 2600 of 4200 trained.
Row 2700 of 4200 trained.
Row 2800 of 4200 trained.
Row 2900 of 4200 trained.
Row 3000 of 4200 trained.
Row 3100 of 4200 trained.
Row 3200 of 4200 trained.
Row 3300 of 4200 trained.
Row 3400 of 4200 trained.
Row 3500 of 4200 trained.
Row 3600 of 4200 trained.
Row 3700 of 4200 trained.
Row 3800 of 4200 trained.
Row 3900 of 4200 trai
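The training loop retries a failed request once after a three-minute pause. A more general version of that pattern, with a bounded number of attempts, could look like the sketch below. The helper `call_with_retries` is hypothetical, and a fake `flaky_request` stands in for the real network call so the example runs without network access.

```python
import time

def call_with_retries(action, attempts=3, wait_seconds=180, sleep=time.sleep):
    """Call `action()` until it succeeds or `attempts` tries are used up."""
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except Exception as error:
            if attempt == attempts:
                raise  # give up and surface the last error
            print("Error: {}. Retrying in {} seconds.".format(error, wait_seconds))
            sleep(wait_seconds)

# Stand-in for a network call that fails twice before succeeding.
calls = {"count": 0}

def flaky_request():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("temporary failure")
    return "ok"

result = call_with_retries(flaky_request, wait_seconds=0, sleep=lambda seconds: None)
print(result, calls["count"])  # ok 3
```

In the notebook the action would be a small function wrapping `requests.post(...)`. Note that `requests.post` only raises for connection-level problems. An HTTP error status is returned as a normal response unless `response.raise_for_status()` is called, so a retry wrapper like this would not catch a rejected training request on its own.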

# Evaluate perceiving classifier on unseen data
Make sure to read in the correct file containing evaluation data *only*.

In [360]:
evaluation = pd.read_csv("perc_evaluationdata_n1800.csv", sep=";")
print(evaluation.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1800 entries, 0 to 1799
Data columns (total 4 columns):
origIx         1800 non-null int64
text           1800 non-null object
perc_func      1800 non-null object
actual_temp    1800 non-null object
dtypes: int64(1), object(3)
memory usage: 56.3+ KB
None


In [361]:
evaluation.head(3)

Unnamed: 0,origIx,text,perc_func,actual_temp
0,954,Dizzy With Enchantments Dizzy With Enchantment...,s,sf
1,15360,You Drool When You Sleep Create Destroy Ask Th...,s,st
2,18179,Lazyrainbow562 Lazyrainbow562 Deviantart Art B...,s,sf


In [362]:
evaluation.tail(3)

Unnamed: 0,origIx,text,perc_func,actual_temp
1797,4703,I think I saw you in my sleep Lee | 17 | Canad...,n,nf
1798,11842,"Wow, Fantastic Baby Archive Ask Submit Wow, Fa...",n,nf
1799,14711,A Social Pariah A Social Pariah Elizabeth - 16,n,nt


In [265]:
def classify_jung_percieving_function_of_text(text):
    """Classify text with the uClassify perceiving (s/n) classifier; returns [(class, p), ...] sorted by p, highest first."""
    header = {"Content-Type": "application/json",
             "Authorization": "Token " + os.environ["UCLASSIFY_READ"]}
    data = {"texts":[text]} # send a one-item list for now, since we don't have a feel for sizes
    try:
        result = requests.post("https://api.uclassify.com/v1/prfekt/jung-perceiving-verification-20180321-no2/classify",
                       json = data,
                       headers = header)
    except Exception as e:
        print("Error connecting with uClassify. Retrying in 3 minutes.")
        time.sleep(180)
        result = requests.post("https://api.uclassify.com/v1/prfekt/jung-perceiving-verification-20180321-no2/classify",
                       json = data,
                       headers = header)
        
    json_result = result.json()
    
    res_dict = {"s":0, "n":0}
    
    for classItem in json_result[0]["classification"]:
        res_dict[classItem["className"]] = classItem["p"]
    
    sorted_dict = sorted(res_dict.items(), key=operator.itemgetter(1), reverse=True)
    return sorted_dict
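The function above reads the response as a list with one entry per submitted text, each holding a `classification` list of `className` and `p` pairs. When only the winning class is needed, `max` can replace the full sort. The sketch below uses a hand-written response in that shape with illustrative probabilities. The helper name `top_class` is made up.

```python
def top_class(json_result):
    """Return (className, p) with the highest probability for the first text."""
    classification = json_result[0]["classification"]
    best = max(classification, key=lambda item: item["p"])
    return best["className"], best["p"]

# Hand-written response in the shape the function above reads;
# the probabilities are illustrative, not an actual API reply.
example_response = [
    {"classification": [
        {"className": "s", "p": 0.471689},
        {"className": "n", "p": 0.528311},
    ]}
]
print(top_class(example_response))  # ('n', 0.528311)
```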

Now, let's classify the text in each row of the evaluation data and store the best classification result in a list.

In [266]:
sn_results = []
row_cnt = 1
for ix, row in evaluation.iterrows():
    # The function returns a sorted list of tuples, max class first e.g. [('n', 0.528311), ('s', 0.471689)]
    res = classify_jung_percieving_function_of_text(row["text"])
    sn_results.append(res[0][0])
    if row_cnt % 100 == 0:
        print("Row {} of {} classified.".format(row_cnt, len(evaluation)))
    row_cnt += 1           

Row 100 of 1800 classified.
Row 200 of 1800 classified.
Row 300 of 1800 classified.
Row 400 of 1800 classified.
Row 500 of 1800 classified.
Row 600 of 1800 classified.
Row 700 of 1800 classified.
Row 800 of 1800 classified.
Row 900 of 1800 classified.
Row 1000 of 1800 classified.
Row 1100 of 1800 classified.
Row 1200 of 1800 classified.
Row 1300 of 1800 classified.
Row 1400 of 1800 classified.
Row 1500 of 1800 classified.
Row 1600 of 1800 classified.
Row 1700 of 1800 classified.
Row 1800 of 1800 classified.


Then we add the classification results as a separate column named "uClassify" to the evaluation DataFrame.

In [363]:
evaluation.head(2)

Unnamed: 0,origIx,text,perc_func,actual_temp
0,954,Dizzy With Enchantments Dizzy With Enchantment...,s,sf
1,15360,You Drool When You Sleep Create Destroy Ask Th...,s,st


In [366]:
classified_evaluation = pd.concat([evaluation, pd.DataFrame(sn_results, index=evaluation.index)],
                                    axis=1)
classified_evaluation.head(3)

Unnamed: 0,origIx,text,perc_func,actual_temp,0
0,954,Dizzy With Enchantments Dizzy With Enchantment...,s,sf,n
1,15360,You Drool When You Sleep Create Destroy Ask Th...,s,st,n
2,18179,Lazyrainbow562 Lazyrainbow562 Deviantart Art B...,s,sf,s


In [368]:
# Fix the missing column name. This behaviour is related to pd.concat and might change in Pandas 0.23.0
classified_evaluation.columns = ["origIx","text","perc_func","actual_temp","uClassify"]
classified_evaluation.head(3)

Unnamed: 0,origIx,text,perc_func,actual_temp,uClassify
0,954,Dizzy With Enchantments Dizzy With Enchantment...,s,sf,n
1,15360,You Drool When You Sleep Create Destroy Ask Th...,s,st,n
2,18179,Lazyrainbow562 Lazyrainbow562 Deviantart Art B...,s,sf,s


In [372]:
# Just to see if anything looks funny
classified_evaluation.tail(3)

Unnamed: 0,origIx,text,perc_func,actual_temp,uClassify
1797,4703,I think I saw you in my sleep Lee | 17 | Canad...,n,nf,s
1798,11842,"Wow, Fantastic Baby Archive Ask Submit Wow, Fa...",n,nf,n
1799,14711,A Social Pariah A Social Pariah Elizabeth - 16,n,nt,n


In [370]:
sn_accuracy = sum(classified_evaluation["perc_func"]==classified_evaluation["uClassify"])/len(classified_evaluation)
print("Accuracy: {}".format(sn_accuracy))

Accuracy: 0.5594444444444444
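To put the accuracy in perspective, it can be compared with the 0.5 expected from guessing on two balanced classes. The sketch below does this with a normal approximation to the binomial distribution. That approximation is an assumption made here and is not something the notebook itself computes.

```python
import math

n_examples = 1800
accuracy = 0.5594444444444444          # reported above
n_correct = round(accuracy * n_examples)

# Under pure guessing on two balanced classes the expected accuracy is 0.5,
# with standard error sqrt(0.5 * 0.5 / n).
chance = 0.5
standard_error = math.sqrt(chance * (1 - chance) / n_examples)
z_score = (accuracy - chance) / standard_error

print("correct: {}".format(n_correct))
print("standard error under chance: {:.4f}".format(standard_error))
print("z-score: {:.2f}".format(z_score))
```

Under that assumption the result sits roughly five standard errors above chance, so the classifier appears to pick up some signal, although the margin over guessing is small.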


In [371]:
sn_cr = classification_report(classified_evaluation["perc_func"], classified_evaluation["uClassify"])
print(sn_cr)

             precision    recall  f1-score   support

          n       0.57      0.51      0.54       900
          s       0.55      0.61      0.58       900

avg / total       0.56      0.56      0.56      1800
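For reference, the per-class figures in the report follow from simple counts: precision is the share of predictions for a class that are right, recall is the share of true members of a class that are found, and F1 is their harmonic mean. A plain-Python sketch on toy labels (not the notebook's data, and with a hypothetical helper `per_class_report`) is given here:

```python
def per_class_report(y_true, y_pred, label):
    """Return (precision, recall, f1) for one label."""
    pairs = list(zip(y_true, y_pred))
    true_pos = sum(1 for t, p in pairs if t == label and p == label)
    predicted = sum(1 for _, p in pairs if p == label)
    actual = sum(1 for t, _ in pairs if t == label)
    precision = true_pos / predicted if predicted else 0.0
    recall = true_pos / actual if actual else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

# Toy labels, not the notebook's data.
y_true = ["s", "s", "s", "s", "n", "n", "n", "n"]
y_pred = ["s", "s", "s", "n", "s", "s", "n", "n"]
for label in ("n", "s"):
    print(label, per_class_report(y_true, y_pred, label))
```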



# Results from 3 separate runs of this notebook

Aha. The experiment was initially run with the wrong classifier URL in line 7 of In [265]. The accuracy figure looks familiar. Apparently we get an accuracy of 0.87 when using that classifier on "unseen" data, which, in relation to the other results, means that the data used to train that classifier overlaps with the evaluation data.

The true result for perceiving classification with 2100 training examples and 900 evaluation examples per class is an accuracy of 0.559.

I still haven't been able to isolate the error in the previous notebook, however. It probably has something to do with the sampling.