# Final Project

## Framing

**Introduction**: describe your dataset, and why you're interested in it

In this dataset, Kinect sensors captured students' body postures, locations, and gestures in a makerspace over a 13-week semester, recording nearly half a million observations from 16 students enrolled in a class.

I am interested in this dataset because I am involved in the next iteration of the Makerspace project, and analyzing it for this final project is a good way to become more familiar with it.

**Research question(s)**: describe the overall research question of your project

Can we generate “profiles” (or personas) for students, based on their behavior in the space?

**Hypotheses**:
    * Describe 2-3 hypotheses that you're planning to test with your dataset
    * Each hypothesis should be based on academic research (cite a paper) and/or background knowledge that you have about the dataset if you've collected it yourself (e.g., if you've conducted interviews)
    * Each hypothesis should be formulated as an affirmation (and not a question)
    * You can also describe alternative hypotheses, if you think that your results could go either way (but again, have a rationale as for why)

    
**Hypothesis 1**
Higher velocity movements are correlated with personality traits such as extraversion and openness. 

**Hypothesis 2**
Leaning towards one's partner when collaborating in the makerspace is correlated with agreeableness, while leaning away is correlated with neuroticism.


**Papers**
Wache, J. (2014). The secret language of our body - Affect and personality recognition using physiological signals. In Proceedings of the 16th International Conference on Multimodal Interaction (pp. 389–393). ICMI.

McCrae, R. R., & John, O. P. (1992). An introduction to the five-factor model and its applications. Journal of Personality, 60(2), 175–215.

Srivastava, R., Feng, J., Roy, S., Sim, T., & Yan, S. (2012). Don’t Ask Me What I’m Like, Just Watch and Listen. In Proceedings of the 20th ACM International Conference on Multimedia (pp. 329–338). ACM.


**Results**:
    * how are you planning to test each hypothesis? What models are you thinking of using?
I plan to use clustering methods to group students by characteristics of their body movement (e.g., low or high velocity, leaning towards or away) and deep learning models to predict their personality traits.

    * what are the best results you can hope for? Is that interesting / relevant for other researchers?
The best result would be a strong correlation between students' body language and their personality traits. This would interest other researchers because personality profiling could open up further areas of research, such as personalized interventions and the quality of social interactions.

    * what are implications of your potential findings for practitioners? 
Personality traits could be inferred from body language data without the traditional use of surveys. This gives practitioners a way to understand their students and cater to their individual learning needs.
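The clustering step planned above could look roughly like the following sketch. It assumes scikit-learn's `KMeans` as the clustering method (the plan does not name one). The per-student feature table, its column names (`mean_velocity`, `mean_lean`), and its values are invented for illustration only.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical per-student summary: one row per student, one column per
# movement feature. Names and values are made up for illustration.
features = pd.DataFrame(
    {
        "mean_velocity": [0.10, 0.12, 0.11, 0.90, 0.95, 0.88],
        "mean_lean": [-0.30, -0.25, -0.28, 0.40, 0.35, 0.42],
    },
    index=["s1", "s2", "s3", "s4", "s5", "s6"],
)

# Standardize so that no single feature dominates the distance computation.
scaled = StandardScaler().fit_transform(features)

# k=2 is an arbitrary choice for this toy example; on real data k would
# have to be selected, for instance by comparing silhouette scores.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)
features["cluster"] = kmeans.fit_predict(scaled)

print(features)
```

The cluster labels themselves carry no meaning; they would still need to be compared against personality measures before being read as "profiles".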

**Threats**
    * Describe issues that might arise during the analyses above
We may not obtain clear clusters that separate the students, or we may be looking at the wrong characteristics of body movement to inform us about personality traits.

    * Come up with backup plans in case you run into these issues
Consider alternative characteristics of the students' body movement, or let the models uncover the relevant characteristics without prior assumptions.

## Data Exploration

Describe your raw data below; provide definition / explanations for the measures you're using

The raw data consists of student information (e.g., person_id, name), skeletal joint data, facial expressions (e.g., isSmiling), and collection information (e.g., timestamp, kinect_id). For this investigation, we drop the facial-expression variables because (1) they are dichotomous and (2) they cover only a limited range of facial expressions.
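Hypothesis 1 refers to movement velocity, which is not a column of its own here. One way it might be derived from the joint coordinates and timestamps is sketched below. The helper `add_head_speed` and the output column `head_speed` are hypothetical names, the head joint is only one possible choice of joint, and the toy values are invented.

```python
import numpy as np
import pandas as pd


def add_head_speed(df):
    """Return a copy of df with a 'head_speed' column.

    Assumes a datetime index plus the columns 'person_id', 'Head_x',
    'Head_y' and 'Head_z'. Speed is in coordinate units per second.
    """
    out = df.rename_axis("timestamp").reset_index()
    out = out.sort_values(["person_id", "timestamp"])
    grouped = out.groupby("person_id")

    # Displacement of the head joint between consecutive frames of one person.
    delta = grouped[["Head_x", "Head_y", "Head_z"]].diff()
    distance = np.sqrt((delta ** 2).sum(axis=1, min_count=3))

    # Elapsed time between those frames, in seconds.
    seconds = grouped["timestamp"].diff().dt.total_seconds()

    # The first frame of each person has no predecessor, so its speed is NaN.
    # Frames sharing a timestamp would divide by zero and need extra handling.
    out["head_speed"] = distance / seconds
    return out


# Toy data standing in for one cleaned CSV file.
toy = pd.DataFrame(
    {
        "person_id": [1, 1, 1, 2, 2],
        "Head_x": [0.0, 3.0, 3.0, 1.0, 1.0],
        "Head_y": [0.0, 4.0, 4.0, 1.0, 1.0],
        "Head_z": [0.0, 0.0, 0.0, 1.0, 3.0],
    },
    index=pd.to_datetime(
        [
            "2020-01-01 00:00:00",
            "2020-01-01 00:00:01",
            "2020-01-01 00:00:03",
            "2020-01-01 00:00:00",
            "2020-01-01 00:00:02",
        ]
    ),
)

result = add_head_speed(toy)
print(result[["person_id", "timestamp", "head_speed"]])
```

Grouping by `person_id` keeps the difference from being taken across two different people whose rows happen to be adjacent.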

## Data Cleaning

Clean your data in this section, and make sure it's ready to be analyzed for next week!

In [1]:
# import files from directory

import os

files = []

for file in os.listdir('./dataset'):
    if file.endswith('.csv'):
        files.append(file)

print(files)

['6.csv', '7.csv', '5.csv', '4.csv', '3.csv', '2.csv', '10.csv', '11.csv', '13.csv', '12.csv', '9.csv']


In [2]:
# create new folder to store cleaned data (no error if it already exists)

os.makedirs('./data_cleaned', exist_ok=True)
os.listdir('./')

['.DS_Store',
 'dataset',
 'data_cleaned',
 'Wee9-Final-Project.ipynb',
 '.ipynb_checkpoints',
 '.git']

In [3]:
# clean data and output as csv to folder

import pandas as pd

for file in os.listdir('./dataset'):
    if file.endswith('.csv'):
        file_input = os.path.join('./dataset',file)
        df = pd.read_csv(file_input,parse_dates=True,index_col='timestamp')
        cleaned = df.drop([df.columns[0],'name_aga_conf','confidence_value','name_aga_freq','frequency','freq_count',
                           'isTalking','isWearingGlasses','isSmiling','leftHandRaised','rightHandRaised',
                           'ip','et_timestamp','skeleton','face_detected'], axis=1)
        cleaned.dropna(inplace=True)
        file_output = os.path.join('./data_cleaned',file)
        cleaned.to_csv(file_output)
        
print(os.listdir('./data_cleaned'))

['6.csv', '7.csv', '5.csv', '4.csv', '3.csv', '2.csv', '10.csv', '11.csv', '13.csv', '12.csv', '9.csv']


## Data Analysis

In [1]:
# import files from directory

import os

files = []

for file in os.listdir('./data_cleaned'):
    if file.endswith('.csv'):
        files.append(file)

print(files)

['6.csv', '7.csv', '5.csv', '4.csv', '3.csv', '2.csv', '10.csv', '11.csv', '13.csv', '12.csv', '9.csv']


In [2]:
# combine all data into single dataframe

import pandas as pd

df = pd.DataFrame()

for file in files:
    file_input = os.path.join('./data_cleaned',file)
    df_temp = pd.read_csv(file_input,parse_dates=True,index_col='timestamp')
    df = pd.concat( [df, df_temp], ignore_index=True)
    print(df.shape)

(133242, 30)
(203950, 30)
(248749, 30)
(295838, 30)
(331968, 30)
(415609, 30)
(508608, 30)
(605386, 30)
(692745, 30)
(759264, 30)
(780442, 30)


In [3]:
# create columns for H1,H2,V1,V2

import numpy as np

df['H1'] = np.sqrt((df['ShoulderLeft_x']-df['ShoulderRight_x'])**2
                   +(df['ShoulderLeft_y']-df['ShoulderRight_y'])**2
                   +(df['ShoulderLeft_z']-df['ShoulderRight_z'])**2)

df['H2'] = np.sqrt((df['ElbowLeft_x']-df['ElbowRight_x'])**2
                   +(df['ElbowLeft_y']-df['ElbowRight_y'])**2
                   +(df['ElbowLeft_z']-df['ElbowRight_z'])**2)

df['ShoulderMid_x'] = (df['ShoulderLeft_x']+df['ShoulderRight_x'])/2
df['ShoulderMid_y'] = (df['ShoulderLeft_y']+df['ShoulderRight_y'])/2
df['ShoulderMid_z'] = (df['ShoulderLeft_z']+df['ShoulderRight_z'])/2

df['V1'] = np.sqrt((df['Head_x']-df['ShoulderMid_x'])**2
                   +(df['Head_y']-df['ShoulderMid_y'])**2
                   +(df['Head_z']-df['ShoulderMid_z'])**2)

df['ElbowMid_x'] = (df['ElbowLeft_x']+df['ElbowRight_x'])/2
df['ElbowMid_y'] = (df['ElbowLeft_y']+df['ElbowRight_y'])/2
df['ElbowMid_z'] = (df['ElbowLeft_z']+df['ElbowRight_z'])/2

df['V2'] = np.sqrt((df['ShoulderMid_x']-df['ElbowMid_x'])**2
                   +(df['ShoulderMid_y']-df['ElbowMid_y'])**2
                   +(df['ShoulderMid_z']-df['ElbowMid_z'])**2)

df.head()

Unnamed: 0,kinect_id,person_id,movementAmount,headAngle,Head_x,Head_y,Head_z,ShoulderLeft_x,ShoulderLeft_y,ShoulderLeft_z,...,H1,H2,ShoulderMid_x,ShoulderMid_y,ShoulderMid_z,V1,ElbowMid_x,ElbowMid_y,ElbowMid_z,V2
0,9615765000.0,72057594038539369,13.153,1.101641,1.157095,3.3208,1.096605,1.181513,3.387554,1.009208,...,0.231078,0.249384,1.121523,3.378288,0.910899,0.197628,1.170847,3.344927,1.056543,0.157347
1,9615765000.0,72057594038539369,0.986,1.101641,1.155413,3.322507,1.096537,1.180122,3.387663,1.004032,...,0.224662,0.235693,1.121065,3.379618,0.908818,0.199198,1.170402,3.348209,1.05758,0.159846
2,9615765000.0,72057594038539369,1.678,1.101641,1.157352,3.321452,1.09615,1.162785,3.339473,0.93173,...,0.191018,0.149868,1.111688,3.352963,0.852174,0.250205,1.122059,3.300787,0.876574,0.058525
3,9615765000.0,72057594038539369,11.852,1.101641,1.158329,3.323097,1.098499,1.17098,3.351069,0.966265,...,0.19791,0.075865,1.118411,3.358849,0.882791,0.222265,1.129098,3.304462,0.873087,0.056271
4,9615765000.0,72057594038539369,4.07,1.101641,1.157029,3.321284,1.094744,1.166033,3.350567,0.964835,...,0.194065,0.078277,1.113369,3.358484,0.883724,0.218677,1.126031,3.302213,0.871812,0.058895
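In the cell above, H1 is the distance between the two shoulders, H2 the distance between the two elbows, V1 the distance from the head to the shoulder midpoint, and V2 the distance from the shoulder midpoint to the elbow midpoint. All four repeat the same Euclidean distance formula, so it could be factored into a small helper. The sketch below uses a hypothetical function name (`joint_distance`) and made-up coordinates. It relies only on the `<Joint>_x`, `<Joint>_y`, `<Joint>_z` column naming seen in the data.

```python
import numpy as np
import pandas as pd


def joint_distance(df, joint_a, joint_b):
    """Row-wise Euclidean distance between two joints.

    Expects columns named '<joint>_x', '<joint>_y' and '<joint>_z'.
    """
    squared = sum(
        (df[f"{joint_a}_{axis}"] - df[f"{joint_b}_{axis}"]) ** 2
        for axis in "xyz"
    )
    return np.sqrt(squared)


# Made-up coordinates for two frames.
toy = pd.DataFrame(
    {
        "ShoulderLeft_x": [0.0, 1.0],
        "ShoulderLeft_y": [0.0, 1.0],
        "ShoulderLeft_z": [0.0, 1.0],
        "ShoulderRight_x": [3.0, 1.0],
        "ShoulderRight_y": [4.0, 1.0],
        "ShoulderRight_z": [0.0, 3.0],
    }
)

toy["H1"] = joint_distance(toy, "ShoulderLeft", "ShoulderRight")
print(toy["H1"].tolist())  # [5.0, 2.0]
```

With the midpoint columns in place, V1 and V2 would follow the same pattern, for example `joint_distance(df, "Head", "ShoulderMid")`.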


In [4]:
# number of students

namelist = set(df['weighted_name'])
print(namelist)

n = len(namelist)
print(n)

{'othereight', 'otherone', 'dan', 'carlie', 'otherfour', 'liz', 'pat', 'zoe', 'sue', 'othertwo', 'billy', 'mia', 'ben', 'meg', 'lucia', 'tonya', 'name', 'kim', 'ken', 'otherfive', 'otherseven', 'othersix', 'bea', 'eva', 'andreas', 'sammie', 'otherthree', 'yanni', 'ann', 'amy', 'noe', 'ryan', 'ron'}
33


In [30]:
# data for model fitting

df_fit = df[['H1','H2','V1','V2','weighted_name']]
df_fit.head()
X = df_fit[['H1','H2','V1','V2']].values
y = df_fit['weighted_name'].values
dummy_y = pd.get_dummies(df_fit['weighted_name'])

In [33]:
# Supervised: KNeighborsClassifier

from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state=42)

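# note: k is set to n (the number of distinct names), which is an arbitrary choice;
# the number of neighbors does not need to match the number of classes
knn = KNeighborsClassifier(n_neighbors=n)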

knn.fit(X_train,y_train)

y_pred = knn.predict(X_test)

print(knn.score(X_train, y_train))
print(knn.score(X_test, y_test))
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

0.3280579305850718
0.279255807596537
[[2881  117   30 ...    2  153   30]
 [  82 1157   46 ...    0  343   44]
 [  36   79  915 ...    2  111   20]
 ...
 [   4    7    4 ...   34   22    2]
 [ 129  253   94 ...    1 2296   91]
 [  64   88   42 ...    0  286  498]]




              precision    recall  f1-score   support

         amy       0.59      0.41      0.48      7097
     andreas       0.21      0.16      0.19      7031
         ann       0.36      0.24      0.29      3781
         bea       0.24      0.06      0.10      1788
         ben       0.22      0.18      0.20      8017
       billy       0.31      0.37      0.34     15216
      carlie       0.28      0.51      0.36     29614
         dan       0.25      0.11      0.15      3971
         eva       0.25      0.17      0.20     10490
         ken       0.23      0.23      0.23      9375
         kim       0.20      0.15      0.17     13888
         liz       0.23      0.10      0.14      2797
       lucia       0.18      0.03      0.06       743
         meg       0.23      0.12      0.15      6239
         mia       0.26      0.15      0.19      8707
        name       0.25      0.44      0.32     32374
         noe       0.25      0.15      0.19      9123
  othereight       0.25    
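The supports in the report are uneven (for example, 743 rows for lucia against 29614 for carlie), so the accuracy scores are easier to interpret next to a majority-class baseline. The following is a minimal sketch with toy labels standing in for `y_test`. `majority_baseline` is a hypothetical helper, not a scikit-learn function.

```python
from collections import Counter


def majority_baseline(labels):
    """Return the most frequent label and the accuracy of always predicting it."""
    counts = Counter(labels)
    label, count = counts.most_common(1)[0]
    return label, count / len(labels)


# Toy labels; in the notebook this would be called on y_test instead.
toy_labels = ["carlie"] * 5 + ["amy"] * 3 + ["ben"] * 2

label, accuracy = majority_baseline(toy_labels)
print(label, accuracy)  # carlie 0.5
```

scikit-learn's `DummyClassifier(strategy="most_frequent")` expresses the same idea as an estimator, which makes it convenient to score alongside the other models.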

In [38]:
# Supervised: Deep Learning

import keras
from keras.layers import Dense
from keras.models import Sequential
from keras.wrappers.scikit_learn import KerasClassifier
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score

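# note: this split is only used to read the number of input columns;
# cross_val_score below runs on the full X and dummy_y
X_train, X_test, y_train, y_test = train_test_split(X, dummy_y, test_size = 0.3, random_state=42)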


n_cols = X_train.shape[1]
input_shape = (n_cols,)


def baseline_model():
    dl = Sequential()
    dl.add(Dense(10, activation='relu', input_shape = input_shape))
    dl.add(Dense(10, activation='relu'))
    dl.add(Dense(n, activation='softmax'))
    dl.compile(optimizer = 'adam', loss = 'categorical_crossentropy',metrics=['accuracy'])
    return dl

estimator = KerasClassifier(build_fn=baseline_model, epochs=1, batch_size=1)

results = cross_val_score(estimator, X, dummy_y, cv=2)
print("Baseline: %.2f%% (%.2f%%)" % (results.mean()*100, results.std()*100))

Epoch 1/1
Epoch 1/1
Baseline: 14.16% (2.45%)
