<a href="https://colab.research.google.com/github/ashishpal2702/HumanActivityrecognition/blob/main/Logistic_Regression_and_Classification_POC.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Human Activity Recognition

## Approach to Problem Statement

Let’s open the Human Activity Recognition level 0 notebook. In this notebook we build the level 0 architecture of MLOps, with the main focus on the model registry. Basic analysis and data visualisation are skipped in the interest of time. The key takeaways from this lesson are:

* How to manually save the results of modelling experiments and the dataset used for modelling
* How to manually save models in the model registry
* How to use the saved models at a later stage


## Import Libraries:
Let’s start by importing the libraries needed for the model-building process.
1. Import os, pandas, and numpy for data handling and feature engineering.
2. Import the Scikit-learn modules needed for model building and evaluation.

In [66]:
import os

import pandas as pd
import numpy as np 

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score, classification_report
from sklearn.ensemble import ExtraTreesClassifier

## 1. Data Import

Now, let’s import the dataset whose location is stored in the variable filepath. Please change the file path to match the location of the data on your system.

We read the dataset file using the read_parquet() function from pandas. This function reads a file in the Parquet format and we assign the result to data. Parquet is a columnar storage file format that is commonly used for storing and processing big data.

In [67]:
# Read the Parquet file located at filepath
filepath = '/Users/harish/Desktop/Human Acivity Recognition/Data/Train_data.gzip'
data = pd.read_parquet(filepath)

Let’s look at the number of rows and columns of the dataset. As you can see, we have a large dataset of 100,000 rows and 563 columns.

In [68]:
data.shape

(100000, 563)

Let's have a look at a few samples of the dataset.

In [23]:
data.head()

Unnamed: 0,tBodyAcc-mean()-X,tBodyAcc-mean()-Y,tBodyAcc-mean()-Z,tBodyAcc-std()-X,tBodyAcc-std()-Y,tBodyAcc-std()-Z,tBodyAcc-mad()-X,tBodyAcc-mad()-Y,tBodyAcc-mad()-Z,tBodyAcc-max()-X,...,fBodyBodyGyroJerkMag-kurtosis(),"angle(tBodyAccMean,gravity)","angle(tBodyAccJerkMean),gravityMean)","angle(tBodyGyroMean,gravityMean)","angle(tBodyGyroJerkMean,gravityMean)","angle(X,gravityMean)","angle(Y,gravityMean)","angle(Z,gravityMean)",date_time,Activity
0,0.229771,0.006191,-0.063642,0.148432,-0.017294,-0.214122,0.10912,-0.140795,-0.060678,0.136123,...,-0.669427,0.001615,-0.50346,-0.695799,0.61824,-0.487791,0.006897,0.012113,2020-01-01 01:00:00,WALKING_DOWNSTAIRS
1,0.096259,-0.004479,-0.011539,-0.628955,-0.803592,-0.217212,-0.174894,-0.421507,-0.025912,-0.539172,...,-0.286686,-0.079703,0.052752,0.000921,-0.014742,0.155335,-0.236586,-0.277595,2020-01-01 01:01:00,LAYING
2,0.034372,-0.043343,-0.31156,-0.075527,-0.063103,0.025375,-0.032692,-0.083888,0.127356,-0.021718,...,-0.136133,0.00518,0.202458,0.084711,-0.297366,-0.540108,0.063798,0.131656,2020-01-01 01:02:00,WALKING_UPSTAIRS
3,0.024097,0.009133,-0.086416,0.071118,0.006441,-0.122816,-0.03205,0.024451,-0.164227,0.075734,...,-0.206189,0.047049,0.292424,0.064996,0.100568,-0.682927,0.130937,0.021554,2020-01-01 01:03:00,WALKING_DOWNSTAIRS
4,0.225618,-0.013047,-0.103685,-0.189633,-0.67405,-0.602043,-0.82034,-0.008993,-0.988655,-0.698891,...,-0.843761,0.008899,-0.189814,-0.075739,-0.130241,-0.059553,-0.011284,-0.077751,2020-01-01 01:04:00,SITTING


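Let’s extract the main sensor names from the 563 column names by taking the part of each name before the first hyphen. Columns without a hyphen, such as the angle features, 'date_time' and 'Activity', appear unchanged in the result.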

In [24]:
sensors = set()
for col in data.columns:
    sensors.add(col.split("-")[0])
sensors

{'Activity',
 'angle(X,gravityMean)',
 'angle(Y,gravityMean)',
 'angle(Z,gravityMean)',
 'angle(tBodyAccJerkMean),gravityMean)',
 'angle(tBodyAccMean,gravity)',
 'angle(tBodyGyroJerkMean,gravityMean)',
 'angle(tBodyGyroMean,gravityMean)',
 'date_time',
 'fBodyAcc',
 'fBodyAccJerk',
 'fBodyAccMag',
 'fBodyBodyAccJerkMag',
 'fBodyBodyGyroJerkMag',
 'fBodyBodyGyroMag',
 'fBodyGyro',
 'tBodyAcc',
 'tBodyAccJerk',
 'tBodyAccJerkMag',
 'tBodyAccMag',
 'tBodyGyro',
 'tBodyGyroJerk',
 'tBodyGyroJerkMag',
 'tBodyGyroMag',
 'tGravityAcc',
 'tGravityAccMag'}
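
The set above lists only the distinct prefixes. As a minimal sketch, assuming we also want to know how many columns each prefix contributes, the same split can feed a Counter. The column names below are a hypothetical sample in the style of this dataset, and count_by_prefix is an illustrative helper that is not part of the original notebook.

```python
from collections import Counter

# Hypothetical sample of column names in the style of this dataset.
sample_columns = [
    "tBodyAcc-mean()-X",
    "tBodyAcc-std()-X",
    "tGravityAcc-min()-Y",
    "fBodyGyro-mean()-Z",
    "angle(X,gravityMean)",
    "Activity",
]


def count_by_prefix(columns):
    """Count columns per sensor prefix (the text before the first hyphen)."""
    return Counter(col.split("-")[0] for col in columns)


prefix_counts = count_by_prefix(sample_columns)
for prefix, n in prefix_counts.most_common():
    print(f"{prefix}: {n}")
```

On the real data, the same helper could be called as count_by_prefix(data.columns).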

As you can observe, most of the features are related to acceleration, gyroscope readings and angles.
Now, let’s check whether any class imbalance is present in the Activity column.

In [69]:
data['Activity'].value_counts()

LAYING                16762
WALKING               16728
WALKING_UPSTAIRS      16675
STANDING              16645
WALKING_DOWNSTAIRS    16627
SITTING               16563
Name: Activity, dtype: int64

The labels in the Activity column are distributed almost evenly, so there is no class imbalance.
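
To put a number on how balanced the classes are, here is a small sketch that uses the counts printed above. The max/min ratio is just one possible summary and is our own choice here, not something the notebook computes.

```python
import pandas as pd

# Counts copied from the value_counts() output above.
counts = pd.Series({
    "LAYING": 16762,
    "WALKING": 16728,
    "WALKING_UPSTAIRS": 16675,
    "STANDING": 16645,
    "WALKING_DOWNSTAIRS": 16627,
    "SITTING": 16563,
})

shares = counts / counts.sum()                 # fraction of rows per class
imbalance_ratio = counts.max() / counts.min()  # 1.0 would be perfectly balanced

print(shares.round(4))
print("max/min ratio:", round(imbalance_ratio, 3))
```

On the full DataFrame, data['Activity'].value_counts(normalize=True) returns the shares directly.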

Next, let’s perform the label encoding on the 'Activity' column of the DataFrame using the LabelEncoder from scikit-learn. 

In [70]:
le = LabelEncoder()
data['Activity'] = le.fit_transform(data['Activity'])

Let’s check the mapping of labels to the activity types. 

In [72]:
for label_code,count in sorted((data['Activity'].value_counts()).items()):
    label_name = le.inverse_transform([label_code])
    print(f"{(label_code)} - {label_name}")

0 - ['LAYING']
1 - ['SITTING']
2 - ['STANDING']
3 - ['WALKING']
4 - ['WALKING_DOWNSTAIRS']
5 - ['WALKING_UPSTAIRS']


We can see that 0 maps to LAYING, 1 to SITTING, 2 to STANDING, 3 to WALKING, 4 to WALKING_DOWNSTAIRS, and 5 to WALKING_UPSTAIRS.
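
The same mapping can also be read directly from the fitted encoder. This is a minimal sketch, assuming a fresh LabelEncoder fitted on the six activity names; the names encoder and code_to_label are illustrative.

```python
from sklearn.preprocessing import LabelEncoder

activities = ["WALKING", "LAYING", "SITTING", "STANDING",
              "WALKING_UPSTAIRS", "WALKING_DOWNSTAIRS"]

encoder = LabelEncoder()
encoder.fit(activities)

# classes_ holds the original labels in sorted order; the position is the code.
code_to_label = {code: str(label) for code, label in enumerate(encoder.classes_)}
print(code_to_label)
```
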
So, let’s now proceed to data preparation!

## 2. Data preparation

Let’s now prepare the dataset for machine learning by separating the feature variables from the target variable.
Let’s also restrict the features to sensor readings by dropping the 'date_time' column.

In [73]:
X = data.drop(['date_time','Activity'] , axis = 1)
y = data['Activity']
X.shape

(100000, 561)

We still have 561 features in the dataset, which increases computational cost and can lead to overfitting. To address this, let's define a function “get_top_k_features” that extracts the most relevant features. The function takes three arguments, X, y and k, where

* X is the input feature DataFrame,
* y is the target variable, and
* k is the number of features that we want

We use a tree-based ensemble model called “ExtraTreesClassifier” to rank the features, since tree-based models expose a feature_importances_ attribute. Next, sort_values() sorts the features in descending order of importance. Finally, we extract the top k feature names and return them as an array. Because no random_state is set, the ranking can vary slightly between runs.


In [74]:
def get_top_k_features(X, y, k):
    clf = ExtraTreesClassifier(n_estimators=150)
    clf = clf.fit(X, y)
    feature_df = pd.DataFrame(data=(X.columns, clf.feature_importances_)).T.sort_values(by=1, ascending=False)
    cols = feature_df.head(k)[0].values
    return cols
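
As a hedged variant of the function above, the sketch below fixes random_state so that repeated calls return the same ranking, and uses a pandas Series indexed by column name. The helper name top_k_features, the seed parameter, and the synthetic data from make_classification are assumptions for illustration; they stand in for the real sensor data.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier


def top_k_features(X, y, k, seed=0):
    """Return the k column names with the highest ExtraTrees importance."""
    clf = ExtraTreesClassifier(n_estimators=50, random_state=seed)
    clf.fit(X, y)
    importances = pd.Series(clf.feature_importances_, index=X.columns)
    return importances.sort_values(ascending=False).head(k).index.tolist()


# Synthetic stand-in for the sensor data (not the real dataset).
X_arr, y_demo = make_classification(
    n_samples=300, n_features=20, n_informative=5, random_state=0
)
X_demo = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(20)])

print(top_k_features(X_demo, y_demo, k=5))
```
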

Let’s make a dataset with 10 features. Feel free to use a different number of features to improve your results.

In [75]:
top_10_features = get_top_k_features(X, y, k=10)


Don’t worry if this step takes a little time. Since this is a large dataset, building and tuning models on it will take some time. Now, let’s print the top 10 extracted features.

In [31]:
print('top_10_features:', top_10_features)

top_10_features: ['tGravityAcc-energy()-X' 'angle(X,gravityMean)' 'tGravityAcc-min()-X'
 'tGravityAcc-mean()-X' 'tGravityAcc-max()-X' 'tGravityAcc-min()-Y'
 'tGravityAcc-mean()-Y' 'tGravityAcc-max()-Y' 'angle(Y,gravityMean)'
 'tBodyAcc-max()-X']


Let’s subset the original input features X to the extracted top 10 features.


In [76]:
X_10 = X[top_10_features]

Now that we have trimmed our dataset, let’s split the dataset into train and test sets. 

In [33]:
x_train1 , x_test1 , y_train1 , y_test1 = train_test_split(X_10, y, random_state=42, test_size=0.25)
x_train1.head()

Unnamed: 0,tGravityAcc-energy()-X,"angle(X,gravityMean)",tGravityAcc-min()-X,tGravityAcc-mean()-X,tGravityAcc-max()-X,tGravityAcc-min()-Y,tGravityAcc-mean()-Y,tGravityAcc-max()-Y,"angle(Y,gravityMean)",tBodyAcc-max()-X
98980,-0.565443,0.242414,-0.101913,-0.20078,-0.009404,0.017509,0.206466,0.128182,-0.067901,-0.309997
69824,-0.675749,0.018377,-0.010389,-0.071852,-0.22797,0.559071,0.291151,0.431229,-0.424456,-0.417692
9928,0.483607,-0.13369,0.262396,0.412074,0.006954,-0.207874,-0.077072,-0.051626,0.067471,-0.008146
75599,0.15345,-0.198148,0.24578,0.281915,0.504597,-0.006724,-0.1486,-0.192616,0.159822,0.004341
95621,0.101471,-0.235995,0.414753,0.201402,0.396144,-0.210965,-0.202314,-0.218439,0.137819,0.130354


Now, our dataset is ready. Let’s try to build a few models and compare their performance to select the best-performing model. 
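
The split above is purely random. As an optional sketch, train_test_split also accepts a stratify argument that keeps the class proportions identical in both parts; the notebook does not use it, and the data below is a synthetic stand-in with six equally frequent classes.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 600 rows, 6 equally frequent classes.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(600, 4))
y_demo = np.repeat(np.arange(6), 100)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42, stratify=y_demo
)

# With stratify, every class keeps its 25% share in the test set.
print(np.bincount(y_te))
```

With classes as balanced as ours, a random split is usually close to stratified anyway, so this mainly matters for smaller or skewed datasets.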

## 4. Model Training

Here we use logistic regression, a decision tree and a random forest to build models and compare the results. Feel free to try other machine learning models as well.

First, let’s import the models.

In [77]:
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

## 4.1 Model training with top 10 features

To store and organise the results obtained from different model experiments for the top 10 features, let’s create an empty dictionary named model_result_10f.

In [35]:
model_result_10f = {}

Let’s start with a logistic regression model, evaluate its accuracy on both the training and test sets, and then store the results in the dictionary model_result_10f. We will also store a helpful remark for each model to keep track of model experimentation. In this case the remark is “Model with top 10 features without standard scaling. Maximum iterations = 1000.”

In [36]:
lr1 =  LogisticRegression(max_iter=1000)
lr1.fit(x_train1, y_train1)
train_accuracy = round(lr1.score(x_train1, y_train1)*100,2)
test_accuracy = round(lr1.score(x_test1, y_test1)*100,2)
print("Training Accuracy", train_accuracy)
print("Test Accuracy", test_accuracy)
model_result_10f['Logistic_Regression'] = {'Train_accuracy': train_accuracy,'Test_accuracy':test_accuracy,'Remark':'Model with top 10 features without standard scaling. Maximum iterations = 1000.'}

Training Accuracy 72.2
Test Accuracy 71.67


For the logistic regression model:
The training accuracy is 72.2 percent and the test accuracy is 71.67 percent. The logistic regression accuracy is quite low.
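
The fit, score and store steps are repeated for every model in this notebook. A minimal sketch of a helper that wraps them is shown here; fit_and_score is a hypothetical name that does not appear in the original notebook, and the data is synthetic.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split


def fit_and_score(model, X_train, y_train, X_test, y_test, remark):
    """Fit a model and return its accuracies (in percent) plus a remark."""
    model.fit(X_train, y_train)
    return {
        "Train_accuracy": round(model.score(X_train, y_train) * 100, 2),
        "Test_accuracy": round(model.score(X_test, y_test) * 100, 2),
        "Remark": remark,
    }


# Synthetic stand-in data (not the real dataset).
X_demo, y_demo = make_classification(n_samples=400, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42
)

results = {}
results["Logistic_Regression"] = fit_and_score(
    LogisticRegression(max_iter=1000), X_tr, y_tr, X_te, y_te,
    remark="Synthetic demo, max_iter=1000",
)
print(results)
```
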

Let’s do the same exercise for the decision tree classifier and store the results in the dictionary.

In [37]:
dt1 =  DecisionTreeClassifier(random_state=42)
dt1.fit(x_train1, y_train1)
train_accuracy = round(dt1.score(x_train1, y_train1)*100,2)
test_accuracy = round(dt1.score(x_test1, y_test1)*100,2)
print("Training Accuracy", train_accuracy)
print("Test Accuracy", test_accuracy)
model_result_10f['Decision_Tree'] = {'Train_accuracy': train_accuracy,'Test_accuracy':test_accuracy, "Remark":"Model with top 10 features without standard scaling. Random_state 42."}

Training Accuracy 100.0
Test Accuracy 65.63


For the DecisionTreeClassifier model:
The training accuracy is 100.0 percent and the test accuracy is 65.63 percent.
The large gap between the two indicates severe overfitting, so let’s try a random forest next.
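
To illustrate where this kind of gap comes from, here is a sketch on synthetic, deliberately noisy data (not the sensor data). An unrestricted tree memorises the training set, while limiting max_depth is one common way to restrain it; the depth of 5 is an arbitrary choice for the demo.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with label noise (flip_y) so that memorising does not generalise.
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=10, n_informative=4, flip_y=0.2, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42
)

scores = {}
for depth in (None, 5):  # None lets the tree grow until every leaf is pure
    tree = DecisionTreeClassifier(max_depth=depth, random_state=42)
    tree.fit(X_tr, y_tr)
    scores[depth] = (tree.score(X_tr, y_tr), tree.score(X_te, y_te))
    print(f"max_depth={depth}: train={scores[depth][0]:.3f}, test={scores[depth][1]:.3f}")
```
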

Just like the previous two models, here we are training a random forest classifier. 


In [38]:
rfc1 =  RandomForestClassifier(n_estimators=100, random_state=60)
rfc1.fit(x_train1, y_train1)
train_accuracy = round(rfc1.score(x_train1, y_train1)*100,2)
test_accuracy = round(rfc1.score(x_test1, y_test1)*100,2)
print("Training Accuracy", train_accuracy)
print("Test Accuracy", test_accuracy)
model_result_10f['RandomForest'] = {'Train_accuracy': train_accuracy,'Test_accuracy':test_accuracy,"Remark":"Model with top 10 features without standard scaling. 100 Decision trees. Random_state 60." }

Training Accuracy 100.0
Test Accuracy 74.01


For the RandomForestClassifier model:
The training accuracy is 100.0 percent and the test accuracy is 74.01 percent. This is an improvement over the decision tree, but the performance is still lower than we would like. So, let’s try to improve it by adding more features.

## 4.2 Model training with top 12 features

Now, we will increase the number of features to 12 and print them.

In [39]:
top_12_features = get_top_k_features(X, y, k=12)
print("top_12_features:", top_12_features)

top_12_features: ['tGravityAcc-energy()-X' 'tGravityAcc-mean()-X' 'tGravityAcc-max()-X'
 'tGravityAcc-min()-X' 'angle(X,gravityMean)' 'tGravityAcc-min()-Y'
 'tGravityAcc-mean()-Y' 'tGravityAcc-max()-Y' 'angle(Y,gravityMean)'
 'tBodyAcc-max()-X' 'tGravityAcc-energy()-Y' 'fBodyAcc-entropy()-X']


Let's subset the original input features X to the extracted top 12 features.

In [40]:
X_12 = X[top_12_features]

Now that we have our desired features, let’s split the dataset into train and test sets. 

In [41]:
x_train2 , x_test2 , y_train2 , y_test2 = train_test_split(X_12, y, random_state=42, test_size=0.25)
x_train2.head()

Unnamed: 0,tGravityAcc-energy()-X,tGravityAcc-mean()-X,tGravityAcc-max()-X,tGravityAcc-min()-X,"angle(X,gravityMean)",tGravityAcc-min()-Y,tGravityAcc-mean()-Y,tGravityAcc-max()-Y,"angle(Y,gravityMean)",tBodyAcc-max()-X,tGravityAcc-energy()-Y,fBodyAcc-entropy()-X
98980,-0.565443,-0.20078,-0.009404,-0.101913,0.242414,0.017509,0.206466,0.128182,-0.067901,-0.309997,-0.50709,-0.018885
69824,-0.675749,-0.071852,-0.22797,-0.010389,0.018377,0.559071,0.291151,0.431229,-0.424456,-0.417692,0.646694,-0.792774
9928,0.483607,0.412074,0.006954,0.262396,-0.13369,-0.207874,-0.077072,-0.051626,0.067471,-0.008146,-0.256825,-0.166796
75599,0.15345,0.281915,0.504597,0.24578,-0.198148,-0.006724,-0.1486,-0.192616,0.159822,0.004341,-0.198441,0.431915
95621,0.101471,0.201402,0.396144,0.414753,-0.235995,-0.210965,-0.202314,-0.218439,0.137819,0.130354,-0.057358,0.669723


To store and organise the results obtained from different model experiments for the top 12 features, let’s create an empty dictionary named model_result_12f.

In [42]:
model_result_12f= {}

Since the random forest model gave the best results with the top 10 features, we will only try a random forest model with the top 12 features.

In [43]:
rfc2 =  RandomForestClassifier(n_estimators=50)
rfc2.fit(x_train2, y_train2)
train_accuracy = round(rfc2.score(x_train2, y_train2)*100,2)
test_accuracy = round(rfc2.score(x_test2, y_test2)*100,2)
print("Training Accuracy", train_accuracy)
print("Test Accuracy", test_accuracy)
model_result_12f['RandomForest'] = {'Train_accuracy': train_accuracy,'Test_accuracy':test_accuracy,"Remark": "Model with top 12 features without standard scaling and 50 Decision trees"}

Training Accuracy 99.99
Test Accuracy 80.43


For the K12 Random Forest Classifier model:

The training accuracy is 99.99 percent and the test accuracy is 80.43 percent. We can observe a significant jump in test accuracy with the additional features. Feel free to try a larger number of features to improve the model performance further. Let’s now compare all the models trained on the K10 and K12 features.

## 5. Model comparison

Let’s tabulate and print the various model results obtained using top 10 features. 

In [44]:
model_result_10f = pd.DataFrame(model_result_10f).T
model_result_10f

Unnamed: 0,Train_accuracy,Test_accuracy,Remark
Logistic_Regression,72.2,71.67,Model with top 10 features without standard sc...
Decision_Tree,100.0,65.63,Model with top 10 features without standard sc...
RandomForest,100.0,74.01,Model with top 10 features without standard sc...


Similarly let’s print the model results obtained for top 12 features. 

In [45]:
model_result_12f = pd.DataFrame(model_result_12f).T
model_result_12f

Unnamed: 0,Remark,Test_accuracy,Train_accuracy
RandomForest,Model with top 12 features without standard sc...,80.43,99.99


From the tables, it is evident that the RandomForest model with the top 12 features has performed better than the other models: it has the highest test accuracy (80.43). But there is still scope for improvement. Let’s carry out hyperparameter tuning for this model to improve it further.

### Hyperparameter tuning for the K12 random forest model

We will use GridSearchCV to perform hyperparameter tuning. In the code below, GridSearchCV searches for the best combination of n_estimators and max_depth using 3-fold cross-validation.

In [46]:
from sklearn.model_selection import GridSearchCV
param_grid = {
            "n_estimators": [
                50,
                100,
                150,
            ],
            "max_depth": [15, 20]
        }
CV_rfc = GridSearchCV(estimator=rfc2, param_grid=param_grid, cv=3)
CV_rfc.fit(x_train2, y_train2)

Our best hyperparameters are:
* max_depth = 20
* n_estimators = 150

In [47]:
CV_rfc.best_estimator_
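
Besides best_estimator_, a fitted GridSearchCV exposes best_params_ and best_score_. The sketch below shows them on a small synthetic problem with a reduced grid, so the values it prints are unrelated to the results in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data and a deliberately small grid to keep the demo fast.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

param_grid = {"n_estimators": [10, 20], "max_depth": [3, 5]}
search = GridSearchCV(
    estimator=RandomForestClassifier(random_state=42),
    param_grid=param_grid,
    cv=3,
)
search.fit(X_demo, y_demo)

print("Best parameters:", search.best_params_)
print("Mean CV accuracy of best setting:", round(search.best_score_, 3))
```

By default GridSearchCV refits the best setting on the whole training set, which is why the fitted search object can be used for score() and predict() directly.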

Now, let’s check the performance of the random forest model after hyperparameter tuning.

In [48]:
model_result_12f_tuned={}

In [49]:
train_accuracy = round(CV_rfc.score(x_train2, y_train2)*100,2)
test_accuracy = round(CV_rfc.score(x_test2, y_test2)*100,2)
print("Training Accuracy", train_accuracy)
print("Test Accuracy", test_accuracy)
model_result_12f_tuned['Hyperparameter_tuned_RandomForest'] = {'Train_accuracy': train_accuracy,'Test_accuracy':test_accuracy,"Remark": "Model with top 12 features without standard scaling and Hyperparameter tuned"}

Training Accuracy 94.91
Test Accuracy 80.94


In [50]:
model_result_12f_tuned = pd.DataFrame(model_result_12f_tuned).T
model_result_12f_tuned

Unnamed: 0,Remark,Test_accuracy,Train_accuracy
Hyperparameter_tuned_RandomForest,Model with top 12 features without standard sc...,80.94,94.91


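The training accuracy of the model is 94.91 percent and the test accuracy is 80.94 percent. Compared with the untuned model (99.99 and 80.43), the gap between training and test accuracy has narrowed, so the tuned model overfits less, and we observe a small improvement in test accuracy.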

## Saving Model Results 

Let’s store the model results in a new directory named “model_results”.

In [87]:
# exist_ok=True avoids an error if the directory already exists
os.makedirs('model_results/', exist_ok=True)

Let’s save the results for the top 10 features, the top 12 features, and the hyperparameter-tuned top 12 model in our model_results directory.

In [88]:
import joblib
## Save the results of the models trained on the top 10 features.
joblib.dump(model_result_10f, "./model_results/model_result_10f.joblib")
## Save the results of the model trained on the top 12 features.
joblib.dump(model_result_12f, "./model_results/model_result_12f.joblib")
## Save the results of the hyperparameter-tuned model trained on the top 12 features.
joblib.dump(model_result_12f_tuned, "./model_results/model_result_12f_tuned.joblib")

['./model_results/model_result_12f_tuned.joblib']

You can now check that the model results have been saved successfully for future reference.
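
One way to check is to load a file back and compare it with the original. This is a minimal sketch that uses a temporary directory in place of ./model_results/ and a one-row results table built from the K12 numbers reported above.

```python
import os
import tempfile

import joblib
import pandas as pd

# One-row results table using the K12 random forest accuracies from above.
results = pd.DataFrame(
    {"Train_accuracy": [99.99], "Test_accuracy": [80.43]},
    index=["RandomForest"],
)

# A temporary directory stands in for ./model_results/ in this sketch.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model_result_12f.joblib")
    joblib.dump(results, path)
    restored = joblib.load(path)

print(restored)
```
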


## 6. Model registration


Now, let's come to the most exciting part of this notebook: model registration. In MLOps, the model registry is an important component of the level 0 architecture because it makes it easy to retrieve and reuse models in the future.

First, let’s create two directories named “model_registry” and “model_features” for saving the models and the features used to train them, respectively.

In [93]:
## Change the paths to your own folders if needed
os.makedirs('model_registry/', exist_ok=True)
os.makedirs('model_features/', exist_ok=True)

### 6.1 Saving Models 

Let’s save three models: the two best-performing models, which are both random forests, and the logistic regression model for diversity. We will use the joblib library for this. We also save the fitted LabelEncoder so that predictions can be decoded later. Name your models so that your future self and others can understand what each one contains.

In [104]:
joblib.dump(le, "./model_features/encoder_weights.joblib")

['./model_features/encoder_weights.joblib']

In [94]:
## K10 logistic regression model 
joblib.dump(lr1, "./model_registry/K10-Logistic Regression.joblib")
## K12 random forest model 
joblib.dump(rfc2, "./model_registry/K12-random_forest.joblib")
## K12 hyper parameter tuned random forest model 
joblib.dump(CV_rfc, "./model_registry/K12-tuned-random_forest.joblib")

['./model_registry/K12-tuned-random_forest.joblib']

Now, let’s save the feature names used by the models. We define the variables “K10_Feature” and “K12_Feature” to hold the feature columns used for training with the top 10 and top 12 features, and save them to the model_features directory.

### 6.2 Saving feature names 

In [95]:
K10_Feature = np.array(x_train1.columns)
joblib.dump(K10_Feature, "./model_features/K10_train_features.joblib")
K12_Feature = np.array(x_train2.columns)
joblib.dump(K12_Feature, "./model_features/K12_train_features.joblib")

['./model_features/K12_train_features.joblib']

In [105]:
K12_Feature

array(['tGravityAcc-energy()-X', 'tGravityAcc-mean()-X',
       'tGravityAcc-max()-X', 'tGravityAcc-min()-X',
       'angle(X,gravityMean)', 'tGravityAcc-min()-Y',
       'tGravityAcc-mean()-Y', 'tGravityAcc-max()-Y',
       'angle(Y,gravityMean)', 'tBodyAcc-max()-X',
       'tGravityAcc-energy()-Y', 'fBodyAcc-entropy()-X'], dtype=object)

### 6.3 Saving datasets

In [96]:
K10_dataset = np.array(x_train1)
joblib.dump(K10_dataset, "./model_features/K10_train_dataset.joblib")
K12_dataset = np.array(x_train2)
joblib.dump(K12_dataset, "./model_features/K12_train_dataset.joblib")

['./model_features/K12_train_dataset.joblib']

We have now saved the models to the model registry and the feature names and training datasets to model_features. You can use the saved models to make predictions on new data at any time in the future.
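
The notebook stores the model, the feature names and the encoder in separate files. As an alternative sketch, the three can be bundled into one dictionary so they cannot drift apart. The bundle layout, the file name and the tiny synthetic model below are all assumptions made for illustration.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import LabelEncoder

# Tiny synthetic stand-ins for the real model, encoder, and feature list.
feature_names = ["feat_a", "feat_b"]
X_demo = np.array([[0.0, 1.0], [1.0, 0.0], [0.1, 0.9], [0.9, 0.1]])
encoder = LabelEncoder()
y_demo = encoder.fit_transform(["SITTING", "WALKING", "SITTING", "WALKING"])
model = RandomForestClassifier(n_estimators=10, random_state=0).fit(X_demo, y_demo)

# Hypothetical bundle layout: everything needed for prediction in one object.
bundle = {"model": model, "features": feature_names, "encoder": encoder}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo-random_forest-bundle.joblib")
    joblib.dump(bundle, path)
    loaded = joblib.load(path)

codes = loaded["model"].predict(X_demo)
decoded = loaded["encoder"].inverse_transform(codes)
print(decoded)
```
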

## 7. Model Prediction

Let’s load some new data to make predictions. To evaluate performance on the new data we will use the stored models and features.
Don’t forget to update the file path of the new data, as we did earlier.

In [97]:
## load new data set 
new_data = pd.read_csv('/Users/harish/Desktop/Human Acivity Recognition/Data/new_data.csv')
new_data.shape

(610, 562)

The new data contains 610 rows and 562 columns. 

We need to apply the same pre-processing to the new data as we did to the training dataset. Let’s load the feature names to trim the dataset to 12 features, and let’s also load the best-performing model.

In [98]:
## Load the feature names and the trained model
train_features = joblib.load("./model_features/K12_train_features.joblib")

model = joblib.load("./model_registry/K12-tuned-random_forest.joblib")

Let’s trim the data. 

In [99]:
new_data_features = new_data[train_features]

Sometimes we encounter issues in a new dataset that we did not encounter in the training dataset, like missing values. To ensure that the data is in the appropriate format for making predictions let's fill the missing values using the fillna() function.

In [100]:
# Assign the result instead of filling in place to avoid SettingWithCopyWarning
new_data_features = new_data_features.fillna(0)
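
Filling in place on a column subset can trigger pandas' SettingWithCopyWarning, because the subset may be a copy of the original frame. A small sketch of a pattern that avoids it is shown below; the frame and column names are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing sensor value and an extra column.
new_df = pd.DataFrame({
    "feat_a": [0.1, np.nan, 0.3],
    "feat_b": [1.0, 2.0, 3.0],
    "other": ["x", "y", "z"],
})
wanted = ["feat_a", "feat_b"]

# Selecting and filling in one expression returns a new, independent frame.
features = new_df[wanted].fillna(0)

print(features)
```

The original frame keeps its missing value, so the raw data stays available for inspection.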


Now that the new dataset is ready, let's make predictions. The trained model predicts the labels for new_data_features using the predict() function. In the code below, the predicted labels are assigned to the variable y_prediction. They are then transformed back to the original activity names using the inverse_transform() method of the LabelEncoder (le) and stored in y_prediction_label. Finally, the predicted labels are added as a new column named "Prediction_label" to the DataFrame new_data.

In [101]:
y_prediction = model.predict(new_data_features)
y_prediction_label = le.inverse_transform(y_prediction)
new_data['Prediction_label'] = y_prediction_label
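
The prediction steps can be collected into one function that also checks that every training feature is present in the new data. This is a sketch under stated assumptions: predict_activities is a hypothetical helper, and the model, encoder and data are tiny synthetic stand-ins. In a fresh session the encoder would be loaded with joblib from the file saved earlier rather than taken from memory.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.tree import DecisionTreeClassifier


def predict_activities(frame, model, feature_names, encoder):
    """Select the training features, fill gaps, and return decoded labels."""
    missing = [c for c in feature_names if c not in frame.columns]
    if missing:
        raise KeyError(f"Missing feature columns: {missing}")
    features = frame[list(feature_names)].fillna(0)
    return encoder.inverse_transform(model.predict(features))


# Tiny synthetic stand-ins for the saved model, features, and encoder.
train = pd.DataFrame({"feat_a": [0.0, 1.0, 0.1, 0.9],
                      "feat_b": [1.0, 0.0, 0.9, 0.1]})
encoder = LabelEncoder()
y_codes = encoder.fit_transform(["SITTING", "WALKING", "SITTING", "WALKING"])
model = DecisionTreeClassifier(random_state=0).fit(train, y_codes)

new_rows = pd.DataFrame({"feat_a": [0.05, 0.95],
                         "feat_b": [0.95, np.nan],
                         "extra": [1, 2]})
labels = predict_activities(new_rows, model, ["feat_a", "feat_b"], encoder)
print(labels)
```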

Let’s have a look at the first few rows of the data frame. In the visible rows, the first two predictions match the actual activity (STANDING), while the next three STANDING rows are predicted as SITTING.

In [63]:
new_data.head()

Unnamed: 0,tBodyAcc-mean()-X,tBodyAcc-mean()-Y,tBodyAcc-mean()-Z,tBodyAcc-std()-X,tBodyAcc-std()-Y,tBodyAcc-std()-Z,tBodyAcc-mad()-X,tBodyAcc-mad()-Y,tBodyAcc-mad()-Z,tBodyAcc-max()-X,...,fBodyBodyGyroJerkMag-kurtosis(),"angle(tBodyAccMean,gravity)","angle(tBodyAccJerkMean),gravityMean)","angle(tBodyGyroMean,gravityMean)","angle(tBodyGyroJerkMean,gravityMean)","angle(X,gravityMean)","angle(Y,gravityMean)","angle(Z,gravityMean)",Activity,Prediction_label
0,0.288585,-0.020294,-0.132905,-0.995279,-0.983111,-0.913526,-0.995112,-0.983185,-0.923527,-0.934724,...,-0.710304,-0.112754,0.0304,-0.464761,-0.018446,-0.841247,0.179941,-0.058627,STANDING,STANDING
1,0.278419,-0.016411,-0.12352,-0.998245,-0.9753,-0.960322,-0.998807,-0.974914,-0.957686,-0.943068,...,-0.861499,0.053477,-0.007435,-0.732626,0.703511,-0.844788,0.180289,-0.054317,STANDING,STANDING
2,0.279653,-0.019467,-0.113462,-0.99538,-0.967187,-0.978944,-0.99652,-0.963668,-0.977469,-0.938692,...,-0.760104,-0.118559,0.177899,0.100699,0.808529,-0.848933,0.180637,-0.049118,STANDING,SITTING
3,0.279174,-0.026201,-0.123283,-0.996091,-0.983403,-0.990675,-0.997099,-0.98275,-0.989302,-0.938692,...,-0.482845,-0.036788,-0.012892,0.640011,-0.485366,-0.848649,0.181935,-0.047663,STANDING,SITTING
4,0.276629,-0.01657,-0.115362,-0.998139,-0.980817,-0.990482,-0.998321,-0.979672,-0.990441,-0.942469,...,-0.699205,0.12332,0.122542,0.693578,-0.615971,-0.847865,0.185151,-0.043892,STANDING,SITTING


Let's check the performance of the model on the new dataset. 

We assign the “Activity” column of the new data to the “y_test” variable and the “Prediction_label” column of the new data to the “y_pred” variable.

In [102]:
y_test = new_data['Activity'].astype(str)
y_pred = new_data['Prediction_label'].astype(str)

In [103]:
print(classification_report(y_pred, y_test))

                    precision    recall  f1-score   support

            LAYING       1.00      0.99      1.00       112
           SITTING       0.91      0.91      0.91        99
          STANDING       0.91      0.93      0.92       112
           WALKING       0.78      0.82      0.80       120
WALKING_DOWNSTAIRS       0.94      0.99      0.96        69
  WALKING_UPSTAIRS       0.77      0.68      0.72        98
               nan       0.00      0.00      0.00         0

          accuracy                           0.88       610
         macro avg       0.76      0.76      0.76       610
      weighted avg       0.88      0.88      0.88       610





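From the evaluation report we can observe an overall accuracy of 0.88. The model works particularly well for LAYING, SITTING, STANDING and WALKING_DOWNSTAIRS (f1-scores of 0.91 or higher) and less well for WALKING (0.80) and WALKING_UPSTAIRS (0.72). Two caveats apply when reading the table. The call passes y_pred first and y_test second, whereas classification_report expects the true labels first, so the precision and recall columns are swapped relative to the usual reading and the support column counts predictions rather than true labels. The nan row most likely comes from missing Activity values that astype(str) turned into the string 'nan'; it has zero support and pulls the macro average down to 0.76.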
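
As a sketch of a more conventional evaluation, the example below passes the true labels first and drops rows whose true label is missing before scoring. The five rows are hypothetical and only illustrate the call pattern; they do not reproduce the report above.

```python
import pandas as pd
from sklearn.metrics import accuracy_score, classification_report

# Hypothetical labels; one true label is missing, as can happen in new data.
frame = pd.DataFrame({
    "Activity": ["SITTING", "STANDING", None, "STANDING", "SITTING"],
    "Prediction_label": ["SITTING", "SITTING", "STANDING", "STANDING", "SITTING"],
})

# Keep only rows with a known true label before scoring.
scored = frame.dropna(subset=["Activity"])
y_true = scored["Activity"]
y_hat = scored["Prediction_label"]

accuracy = accuracy_score(y_true, y_hat)
print("Accuracy:", accuracy)
print(classification_report(y_true, y_hat))  # true labels first, predictions second
```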

## End