# Final Project

## Framing

**Introduction**: describe your dataset, and why you're interested in it

In this dataset, Kinect sensors captured students' body postures, location, and gestures in a makerspace over the period of a 13-weeks semester, recording nearly half a million observations from 16 students enrolled in a class. 

I am interested in this dataset because I am currently involved in the next iteration of the Makerspace project and it would be good to gain more familiarity with the dataset by analyzing it in this final project. 

**Research question(s)**: describe the overall research question of your project

Can the activities of students be detected from their location and their body postures in the makerspace (e.g., using a laser cutter, interacting with someone, etc.)

**Hypotheses**:
    * Describe 2-3 hypotheses that you're planning to test with your dataset
    * Each hypoteses should be based on academic research (cite a paper) and/or background knowledge that you have about the dataset if you've collected it yourself (e.g., if you've conducted interviews)
    * Each hypotheses should be formulated as an affirmation (and not a question)
    * You can also describe alternative hypotheses, if you think that your results could go either way (but again, have a rationale as for why)

    
**Hypothesis 1**
Using the x-y coordinates of the head joint, the approximate position of students can be inferred, which then informs if they are using tools or working with others within the makerspace. 

**Hypothesis 2**
Students who have joint line of sight with their partners indicates collaboration within the makerspace. 

**Papers**
Hall, E. T. (1966). The Hidden Dimension. Anchor Books.

Reilly, J.M. et al. Exploring Collaboration Using Motion Sensors and Multi-Modal Learning Analytics. Educational Data Mining Conference 2018.


**Results**:
    * how are you planning to test each hypothesis? What models are you thinking of using?
I am planning to first derive features such as proximity, hand gesturing and line of sight of students to indicate the activities of students within the makerspace. Thereafter, I will train a machine learning algorithm based on these features to predict the activities of students. As a test of the accuracy of the algorithm, I will exclude one week of data during the training stage and test the algorithm later using this test set. For the test set, manual coding using video would be done to denote the activities of the students instead of using the generated features. 


    * what are the best results you can hope for? Is that interesting / relevant for other researchers?
The best results would be a strong correlation between the generated features and student activities. This would be interesting for other researchers as the profiling of student activities could lead to other areas of research such as personalized interventions and quality of social interactions.

    * what are implications of your potential findings for practioners? 
Students activities could be inferred from the generated features without the teachers having to pay continuous attention to each and every student within the makerspace. This presents practioners a way of understanding their students to cater to their individual learning needs.

**Threads**
    * Describe issues that might arise during the analyses above
It might be possible that we may not derive features that can successfully predict the activities of students or we may also be looking at the incorrect features of the body movement to inform us of the activities of students.

    * Come up with backup plans in case you run into theses issues
Think of alternative features of the students' body movement or allow the models to uncover the relevant features without making any prior assumptions.

## Data Exploration

Describe your raw data below; provide definition / explanations for the measures you're using

The raw data consists of student information (eg person_id, name), skeletal joint data, facial expressions (eg isTalking) and collection information (eg timestamp, kinect_id). For the purpose of this investigation, we will drop information mostly related to facial expressions as 1) these variables are dichotomous 2) they contain a limited range of facial expressions.

## Data Cleaning

Clean you data in this section, and make sure it's ready to be analyzed for next week!

In [10]:
# import files from directory

import os

files = []

for file in os.listdir('./test'):
    if file.endswith('.csv'):
        files.append(file)

print(files)

['test.csv']


In [11]:
# create new folder to store cleaned data

os.mkdir('./data_cleaned')
os.listdir('./')

['.DS_Store',
 'test',
 'dataset',
 'data_cleaned',
 'Wee9-Final-Project.ipynb',
 '.ipynb_checkpoints',
 '.git']

In [12]:
# clean data and output as csv to folder

import pandas as pd

instructor_list = ['sammie','othertwo','otherthree','name','yanni','carlie','billy',
                   'othereight','otherfive','otherfour','otherone','otherseven','othersix']

for file in os.listdir('./test'):
    if file.endswith('.csv'):
        file_input = os.path.join('./test',file)
        df = pd.read_csv(file_input)
        cleaned = df.drop([df.columns[0],'name_aga_conf','confidence_value','name_aga_freq','frequency','freq_count',
                            'isWearingGlasses','isSmiling','leftHandRaised','rightHandRaised',
                           'ip','et_timestamp','skeleton','person_id_lifespan','face_detected','timestamp'], axis=1)
        cleaned.dropna(inplace=True)
        
        for name in instructor_list:
            indexNames = cleaned[cleaned['weighted_name'] == name].index
            cleaned.drop(indexNames,inplace=True)
        
        file_output = os.path.join('./data_cleaned',file)
        cleaned.to_csv(file_output)
        
print(os.listdir('./data_cleaned'))

['test.csv']


## Data Analysis (Preparation)

In [27]:
# import files from directory

import os

files = []

for file in os.listdir('./data_cleaned'):
    if file.endswith('.csv'):
        files.append(file)

print(files)

['test.csv']


In [28]:
# create new folder to store data for analysis

os.mkdir('./data_analysis')
os.listdir('./')

['data_analysis',
 '.DS_Store',
 'test',
 'dataset',
 'data_cleaned',
 'Wee9-Final-Project.ipynb',
 '.ipynb_checkpoints',
 '.git']

In [29]:
# combine all data into single dataframe

import pandas as pd

df_raw = pd.DataFrame()

for file in files:
    file_input = os.path.join('./data_cleaned',file)
    df_temp = pd.read_csv(file_input)
    df_raw = pd.concat( [df_raw, df_temp], ignore_index=True)
    print(df_raw.shape)
    

df_raw = df_raw.drop('Unnamed: 0', axis=1)

(44975, 32)


In [30]:
# round timestamp

df_raw['minute'] = pd.to_datetime(df_raw['timeframe']).dt.round('min')

In [31]:
# output df as csv

file_output = os.path.join('./data_analysis','df_raw.csv')
df_raw.to_csv(file_output)

In [32]:
# number of students

namelist = set(df_raw['weighted_name'])
namelist = list(namelist)
print(namelist)

n = len(namelist)
print(n)

['lucia', 'ryan', 'noe', 'ann', 'bea', 'liz', 'amy', 'ron', 'ben', 'eva', 'dan', 'kim', 'pat', 'meg', 'sue', 'zoe', 'ken', 'mia', 'andreas']
19


## Data Analysis (Proximity)

In [33]:
# function to calculate euclidean distance 

import numpy as np

def map_distance(x1,y1,x2,y2):
    dist = np.sqrt((x1-x2)**2+(y1-y2)**2)
    return dist

In [34]:
# create df_pos to determine proximity

collab_dist = 1
tool_dist = 0.2
laser_coords = [1.50,0.75]
printer_coords = [0.50,1.1]
tools = ['laser_cutter', '3D_printer']
activities = namelist + tools 
minute_index = list(set(df_raw['minute']))

df_pos = df_raw.groupby(['minute','weighted_name'])['Head_x','Head_y', 
                                                    'movementAmount','isTalking',
                                                    'headAngle','Head_z', 
                                                    'ShoulderLeft_x', 'ShoulderLeft_y', 'ShoulderLeft_z', 
                                                    'ShoulderRight_x','ShoulderRight_y', 'ShoulderRight_z', 
                                                    'ElbowLeft_x', 'ElbowLeft_y', 'ElbowLeft_z', 
                                                    'ElbowRight_x', 'ElbowRight_y', 'ElbowRight_z',
                                                    'HandLeft_x', 'HandLeft_y', 'HandLeft_z', 
                                                    'HandRight_x', 'HandRight_y', 'HandRight_z', 
                                                    'leanVector_x', 'leanVector_z','activity'].mean()

df_pos = pd.concat([df_pos, pd.DataFrame(columns = activities)],axis=1)
df_pos.fillna(0,inplace=True)

for i in range(len(minute_index)):
    time = minute_index[i]
    df_temp = df_pos.xs(minute_index[i])
    df_temp = df_temp.reset_index()

# Activity : Collaboration
    for j in range(df_temp.shape[0]-1):
        c = df_temp.shape[0]-j-1
        for k in range(c):
            d = k + j + 1
            name1 = df_temp.iloc[j][0]
            name2 = df_temp.iloc[d][0]
            name1_x = df_temp.iloc[j][1]
            name1_y = df_temp.iloc[j][2]
            name2_x = df_temp.iloc[d][1]
            name2_y = df_temp.iloc[d][2]
            if map_distance(name1_x,name1_y,name2_x,name2_y) <= collab_dist:
                df_pos.loc[(time,name1),name2] = 1
                df_pos.loc[(time,name2),name1] = 1

# Activity : Tool Use
            if map_distance(name1_x,name1_y,laser_coords[0],laser_coords[1]) <= tool_dist:
                df_pos.loc[(time,name1),'laser_cutter'] = 1
            if map_distance(name1_x,name1_y,printer_coords[0],printer_coords[1]) <= tool_dist:
                df_pos.loc[(time,name1),'3D_printer'] = 1


In [35]:
# output df_pos as csv

file_output = os.path.join('./data_analysis','df_pos.csv')
df_pos.to_csv(file_output)

## Data Analysis (Building Models)

In [36]:
# select data for model building

file = 'df_pos.csv'
file_input = os.path.join('./data_analysis',file)
df_analysis = pd.read_csv(file_input)
print(df_analysis.shape)
    
X = df_analysis[['Head_x', 'Head_y','movementAmount','isTalking', 
                 'headAngle', 'Head_z', 
                 'ShoulderLeft_x', 'ShoulderLeft_y','ShoulderLeft_z', 
                 'ShoulderRight_x', 'ShoulderRight_y','ShoulderRight_z', 
                 'ElbowLeft_x', 'ElbowLeft_y', 'ElbowLeft_z',
                 'ElbowRight_x', 'ElbowRight_y', 'ElbowRight_z', 
                 'HandLeft_x','HandLeft_y', 'HandLeft_z', 
                 'HandRight_x', 'HandRight_y', 'HandRight_z',
                 'leanVector_x', 'leanVector_z', 
                 'zoe', 'pat', 'meg', 'ken', 'ann','kim', 'noe', 
                 'mia', 'ron', 'andreas', 'eva', 'liz', 'ben', 'dan',
                 'tonya', 'sue', 'ryan', 'lucia', 'amy', 'bea', 
                 'laser_cutter','3D_printer']].values

y = df_analysis['activity'].values
dummy_y = pd.get_dummies(df_analysis['activity'])

(3882, 50)


In [39]:
# Supervised: KNeighborsClassifier

from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state=42)

knn = KNeighborsClassifier(n_neighbors=n)

knn.fit(X_train,y_train)

y_pred = knn.predict(X_test)

print(knn.score(X_train, y_train))
print(knn.score(X_test, y_test))
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

0.16231137283768862
0.041201716738197426
[[ 6  2  6  8  3  4  4  2  4  1  2  2  1  0  3  3  4  1  0  0  1  0]
 [ 7  5  2  6  1  2  0  4  3  2  0  1  0  0  2  4  1  2  1  0  2  1]
 [ 6  4  6  6  2  3  1  1  2  3  0  1  2  0  1  1  0  1  1  0  2  2]
 [ 4  5  8  3  5  4  1  4  1  1  1  3  0  1  1  1  1  2  1  1  2  1]
 [ 4  7  3  5  4  1  3  3  2  3  1  2  1  0  1  2  1  2  1  0  1  1]
 [ 8  5  1  5  2  2  2  3  4  2  2  2  1  0  1  2  1  1  0  1  1  1]
 [ 4  4  4 12  1  2  3  1  1  2  3  2  1  1  4  3  3  3  1  1  0  1]
 [ 4  7  4  9  4  1  5  0  2  1  0  1  3  4  1  1  0  1  0  2  3  0]
 [ 8  4  6  3  0  4  3  3  5  1  1  0  1  1  1  0  0  0  1  3  1  0]
 [11  2  9  5  4  2  5  1  1  5  1  2  1  0  1  2  3  0  0  0  0  1]
 [ 7  3  8  6  3  3  2  4  4  2  1  2  2  2  3  1  1  0  0  0  2  1]
 [ 8  8  6  5  2  3  1  5  1  6  1  1  2  1  1  1  2  0  1  0  1  0]
 [ 7  8  9  3  1  2  3  4  5  1  2  3  0  3  3  5  1  2  0  0  0  2]
 [ 3  7  7  5  6  7  4  6  2  2  3  0  1  1  1  1  2  3  3  0 

In [52]:
# Supervised: Deep Learning

import keras
from keras.layers import Dense
from keras.models import Sequential
from keras.wrappers.scikit_learn import KerasClassifier
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score

X_train, X_test, y_train, y_test = train_test_split(X, dummy_y, test_size = 0.3, random_state=42)


n_cols = X_train.shape[1]
input_shape = (n_cols,)


def baseline_model():
    dl = Sequential()
    dl.add(Dense(10, activation='relu', input_shape = input_shape))
    dl.add(Dense(10, activation='relu'))
    dl.add(Dense(dummy_y.shape[1], activation='softmax'))
    dl.compile(optimizer = 'adam', loss = 'categorical_crossentropy',metrics=['accuracy'])
    return dl

estimator = KerasClassifier(build_fn=baseline_model, epochs=1, batch_size=1)

results = cross_val_score(estimator, X, dummy_y, cv=2)
print("Baseline: %.2f%% (%.2f%%)" % (results.mean()*100, results.std()*100))

Instructions for updating:
Use tf.cast instead.
Epoch 1/1
Epoch 1/1
Baseline: 3.81% (0.36%)


## Data Visualization