# Multiclass Classification of HAR Data
- <b>HAR stands for Human Activity Recognition</b>.
- Human activities such as sitting, standing, and walking can be measured using cell phone
and smart watch sensors.
- One of these sensors is the <b>accelerometer</b>.
- The given data contains x, y, and z accelerometer readings,
as well as the cell phone model and device (there is more than one device of the same model).
- The target label is the column named <b>gt</b>.
- The target label contains <b>6</b> different human activities.

## The Objective is:
### To build a classifier to predict human activity (gt) from the input features.
    
#### Please follow these steps:
1. Read the <b>MoreReduced_Phones_accel_cleaned.csv</b> file.
2. Show the <b>Schema</b> of the data.
3. Remove the first column and keep the others (x,y,z,User,Model,Device,gt)
4. Display the summary statistics of the data.
5. Check for null in each column.
6. If the data contains nulls, drop the rows that contain null values.
7. Display the count of each class.
8. Train/Test split with 70% train and 30% test.
Use <b>Stratified Sampling</b> to obtain balanced class ratios.
<b>You can use the code below</b>.
9. Perform the required feature engineering.
10. Use any classifier of your choice.
11. Create a <b>Pipeline</b> that contains all feature engineering steps and the classifier.
12. Train the <b>Pipeline Model</b> using training data.
13. Evaluate the <b>Pipeline Model</b> using test data <b>(use f1-score)</b>.
14. Because HAR data is not easy to classify, <b>the required f1-score should be no less
    than 0.4</b>.

In [139]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [2]:
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q http://archive.apache.org/dist/spark/spark-3.1.1/spark-3.1.1-bin-hadoop3.2.tgz
!tar xf spark-3.1.1-bin-hadoop3.2.tgz
!pip install -q findspark

import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.1.1-bin-hadoop3.2"

In [3]:
import findspark
findspark.init()
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[*]").getOrCreate()
spark.conf.set("spark.sql.repl.eagerEval.enabled", True) # Property used to format output tables better
spark
sc = spark.sparkContext
sc

In [4]:
from IPython.display import display, HTML
display(HTML("<style>pre { white-space: pre !important; }</style>"))

## 1-Read the MoreReduced_Phones_accel_cleaned.csv file.

In [140]:
df=spark.read.csv('/content/drive/MyDrive/MoreReduced_Phones_accel_cleaned.csv',header=True,inferSchema=True)
df.show(5)

+------+----------+-------------------+-----------------+----+----------+------------+---+
|   _c0|         x|                  y|                z|User|     Model|      Device| gt|
+------+----------+-------------------+-----------------+----+----------+------------+---+
| 75757|  0.766135|-0.9193620000000001|         9.959755|   a|samsungold|samsungold_1|  1|
|124359|-5.1510773|        -0.75042725|         6.159439|   b|    nexus4|    nexus4_2|  2|
| 18271|-6.0142345|        0.095768064|7.690175999999999|   a|        s3|        s3_1|  4|
| 40271|-6.2247925|         0.05607605|7.952057000000001|   a|    nexus4|    nexus4_1|  4|
|  7355| -3.370994|           1.072589|         7.048442|   b|samsungold|samsungold_2|  3|
+------+----------+-------------------+-----------------+----+----------+------------+---+
only showing top 5 rows



## 2-Show the Schema of the data.

In [141]:
df.printSchema()

root
 |-- _c0: integer (nullable = true)
 |-- x: double (nullable = true)
 |-- y: double (nullable = true)
 |-- z: double (nullable = true)
 |-- User: string (nullable = true)
 |-- Model: string (nullable = true)
 |-- Device: string (nullable = true)
 |-- gt: integer (nullable = true)



## 3-Remove the first column and keep the others (x,y,z,User,Model,Device,gt)




In [142]:
df=df.select("x", "y", "z", "User", "Model", "Device", "gt")

In [143]:
df.show(5)

+----------+-------------------+-----------------+----+----------+------------+---+
|         x|                  y|                z|User|     Model|      Device| gt|
+----------+-------------------+-----------------+----+----------+------------+---+
|  0.766135|-0.9193620000000001|         9.959755|   a|samsungold|samsungold_1|  1|
|-5.1510773|        -0.75042725|         6.159439|   b|    nexus4|    nexus4_2|  2|
|-6.0142345|        0.095768064|7.690175999999999|   a|        s3|        s3_1|  4|
|-6.2247925|         0.05607605|7.952057000000001|   a|    nexus4|    nexus4_1|  4|
| -3.370994|           1.072589|         7.048442|   b|samsungold|samsungold_2|  3|
+----------+-------------------+-----------------+----+----------+------------+---+
only showing top 5 rows



## 4-Display the summary statistics of the data.

In [144]:
df.describe().show()

+-------+-------------------+-------------------+-------------------+-----+----------+------------+------------------+
|summary|                  x|                  y|                  z| User|     Model|      Device|                gt|
+-------+-------------------+-------------------+-------------------+-----+----------+------------+------------------+
|  count|              40944|              40944|              40944|40944|     40944|       40944|             40944|
|   mean|-1.6171014976873441|0.14958031448959583|   8.90737169789521| null|      null|        null|2.5647469714732316|
| stddev| 3.9526104137374483| 1.5492538575847232| 2.2600429846372188| null|      null|        null|1.7549087875346667|
|    min|-27.202834999999997|         -11.015778|-1.9919509999999998|    a|    nexus4|    nexus4_1|                 0|
|    max| 15.475926999999999|           9.346847| 28.749159999999996|    i|samsungold|samsungold_2|                 5|
+-------+-------------------+-------------------+-------------------+-----+----------+------------+------------------+

## 5-Check for null in each column.

In [145]:
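# Import the functions module under an alias so Python's built-in sum is not shadowed.
from pyspark.sql import functions as F

# Count nulls per column: cast the isNull flag to 0/1 and sum it.
df.select([F.sum(F.col(c).isNull().cast("int")).alias(c) for c in df.columns]).show()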

+---+---+---+----+-----+------+---+
|  x|  y|  z|User|Model|Device| gt|
+---+---+---+----+-----+------+---+
|  0|  0|  0|   0|    0|     0|  0|
+---+---+---+----+-----+------+---+



## 6-If the data contains nulls, drop the rows that contain null values.

In [146]:
# The check above found no null values, so no rows need to be dropped.
# df = df.dropna()

## 7-Display the count of each class.

In [147]:
# groupBy(...).count() is a DataFrame method, so no extra import is needed.
df.groupBy("gt").count().show()


+---+-----+
| gt|count|
+---+-----+
|  1| 7258|
|  3| 6500|
|  5| 7957|
|  4| 6725|
|  2| 5784|
|  0| 6720|
+---+-----+
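
The counts above suggest the six classes are fairly balanced. A small plain-Python sketch (no Spark needed) that turns the printed counts into class shares and a majority-class baseline; the dictionary is copied by hand from the table above, and the variable names are illustrative:

```python
# Class counts copied from the groupBy("gt").count() output above.
class_counts = {0: 6720, 1: 7258, 2: 5784, 3: 6500, 4: 6725, 5: 7957}

total = sum(class_counts.values())
shares = {label: n / total for label, n in class_counts.items()}

for label in sorted(shares):
    print(f"gt={label}: {class_counts[label]:>5} rows, {shares[label]:.1%}")

# Accuracy of a model that always predicts the most frequent class.
majority_label = max(class_counts, key=class_counts.get)
majority_baseline = shares[majority_label]
print(f"total={total}, majority class={majority_label}, "
      f"baseline accuracy={majority_baseline:.3f}")
```

The total matches the 40944 rows reported by the summary statistics, and the largest class holds roughly 19% of the rows, so any useful classifier should clearly beat that figure.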



## 8-Train/Test split with 70% train and 30% test. Use Stratified Sampling to obtain balanced class ratios. You can use the code below.

In [148]:
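# Distinct class labels found in the target column.
gt_distinct = df.select('gt').distinct().collect()

# Sample roughly 70% of the rows of every class for training.
train_Df = df.sampleBy('gt',fractions={gt_distinct[i]['gt']: 0.7
                                       for i in range(len(gt_distinct))},seed=1)

# Everything not in the training sample becomes the test set.
# Note: subtract() is a set difference (EXCEPT DISTINCT), so duplicate rows are collapsed.
test_DF = df.subtract(train_Df)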
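
Two details of this split are easy to miss: `sampleBy` keeps each row independently with the given probability, so the 70% is approximate, and `subtract` compares rows by value. The sketch below imitates both behaviours in plain Python on made-up data; `sample_by` and the toy rows are hypothetical helpers for illustration, not Spark APIs:

```python
import random
from collections import Counter

def sample_by(rows, key, fractions, seed):
    """Keep each row independently with the probability given for its stratum."""
    rng = random.Random(seed)
    return [row for row in rows if rng.random() < fractions.get(key(row), 0.0)]

# Hypothetical toy data: (row_value, label) pairs, plus one exact duplicate row.
rows = [(i, i % 3) for i in range(300)]
rows.append(rows[0])

fractions = {label: 0.7 for label in {label for _, label in rows}}
train = sample_by(rows, key=lambda r: r[1], fractions=fractions, seed=1)

# Set difference by row value, similar to how DataFrame.subtract behaves:
# duplicates collapse, and a value present in train is removed everywhere.
test = sorted(set(rows) - set(train))

print("train per class:", Counter(label for _, label in train))
print("test per class: ", Counter(label for _, label in test))
print("rows:", len(rows), "train + test:", len(train) + len(test))
```

If the real data contains identical sensor readings, the test set can therefore end up slightly smaller than 30%. One possible alternative, not used in this notebook, is to add a unique row id before sampling and exclude rows by that id.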

## 9-Perform the required feature engineering.

In [151]:
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler, StringIndexer, OneHotEncoder

In [152]:
num_cols = ["x", "y", "z"]
cat_cols = ["User", "Model", "Device"]

In [153]:
indexers = [
    StringIndexer(inputCol=column, outputCol=f"{column}_index", handleInvalid="keep")
    for column in cat_cols
]

In [154]:
# Combine the numeric readings and the indexed categorical columns into one vector.
assembler = VectorAssembler(
    inputCols=num_cols + [f"{column}_index" for column in cat_cols],
    outputCol="features"
)

In [155]:
label_indexer = StringIndexer(inputCol="gt", outputCol="label")
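
`StringIndexer` assigns indices by descending frequency by default, so the `label` column is not guaranteed to equal the original `gt` code. The following plain-Python sketch mimics that ordering and the `handleInvalid="keep"` option on a made-up column; `fit_string_indexer` and `transform_keep` are hypothetical names, and the tie-breaking rule is an assumption based on Spark 3.x behaviour:

```python
from collections import Counter

def fit_string_indexer(values):
    """Map each distinct value to an index, most frequent first (ties alphabetical)."""
    counts = Counter(values)
    ordered = sorted(counts, key=lambda v: (-counts[v], str(v)))
    return {value: float(i) for i, value in enumerate(ordered)}

def transform_keep(mapping, value):
    """Mimic handleInvalid="keep": all unseen values share one extra index."""
    return mapping.get(value, float(len(mapping)))

# Hypothetical training column, not taken from the dataset.
models = ["nexus4", "s3", "nexus4", "samsungold", "nexus4", "s3"]
mapping = fit_string_indexer(models)

print(mapping)
print(transform_keep(mapping, "s3mini"))  # a value never seen during fitting
```

Because of this, predictions come back in index space. To report them as original activity codes, `IndexToString` from `pyspark.ml.feature` could map them back using the fitted indexer's labels.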

## 10-Use any classifier of your choice.

## Trying Logistic Regression

In [157]:
from pyspark.ml.classification import LogisticRegression
classifier = LogisticRegression(labelCol="label", featuresCol="features")
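
The assembler passes the category indices to the model as plain numbers, which a linear model such as logistic regression reads as an ordered scale. `OneHotEncoder` is imported above but not used; the sketch below shows in plain Python what a one-hot encoding of an index looks like, assuming the drop-last convention that Spark's encoder uses by default. The `one_hot` function is a hypothetical illustration, and whether this encoding would raise the score here has not been tested:

```python
def one_hot(index, num_categories, drop_last=True):
    """Return a dense 0/1 vector for a category index.

    With drop_last=True the final category is encoded as all zeros,
    so the vector has num_categories - 1 entries.
    """
    size = num_categories - 1 if drop_last else num_categories
    vector = [0.0] * size
    if index < size:
        vector[int(index)] = 1.0
    return vector

# Three categories, indexed 0.0 to 2.0 as a StringIndexer would produce.
for idx in (0.0, 1.0, 2.0):
    print(idx, "->", one_hot(idx, 3))
```

Tree-based models split on thresholds and are usually less sensitive to this choice.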

## 11-Create a Pipeline that contains all feature engineering steps and the classifier.
## 12-Train the Pipeline Model using training data.
## 13-Evaluate the Pipeline Model using test data (use f1-score).

In [158]:
# Create the pipeline
pipeline = Pipeline(stages=indexers + [assembler,label_indexer, classifier])

In [159]:
# Train the pipeline model
model = pipeline.fit(train_Df)

In [160]:
# Make predictions on the test data
predictions = model.transform(test_DF)

In [161]:
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
# Evaluate the predictions using the F1 score
evaluator = MulticlassClassificationEvaluator(labelCol="label", predictionCol="prediction", metricName="f1")
f1_score = evaluator.evaluate(predictions)

print("F1 Score:", f1_score)

F1 Score: 0.32399046303143997
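
With `metricName="f1"`, the evaluator reports a weighted F1: the per-class F1 scores averaged with weights equal to each class's share of the true labels. A minimal plain-Python sketch of that calculation on made-up labels (not the notebook's predictions); `weighted_f1` is a hypothetical helper written for illustration:

```python
from collections import Counter

def weighted_f1(y_true, y_pred):
    """Per-class F1 averaged with weights equal to each class's share of true labels."""
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for label, n_true in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        n_pred = sum(1 for p in y_pred if p == label)
        precision = tp / n_pred if n_pred else 0.0
        recall = tp / n_true
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        score += f1 * n_true / total
    return score

# Made-up labels: one class-0 sample is misclassified as class 1.
y_true = [0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 1, 1, 1, 2]
print(round(weighted_f1(y_true, y_pred), 4))
```

Since the classes here are close to balanced, the weighted average should sit near the unweighted mean of the per-class scores.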


## The F1 score is below 0.40, so we will try another model

## Trying Random Forest

In [162]:
from pyspark.ml.classification import RandomForestClassifier
classifier = RandomForestClassifier(labelCol="label", featuresCol="features", seed=42)

In [163]:
# Create the pipeline
pipeline = Pipeline(stages=indexers + [assembler,label_indexer, classifier])

In [164]:
# Train the pipeline model
model = pipeline.fit(train_Df)

In [165]:
# Make predictions on the test data
predictions = model.transform(test_DF)

In [166]:
# Evaluate the predictions using the F1 score
evaluator = MulticlassClassificationEvaluator(labelCol="label", predictionCol="prediction", metricName="f1")
f1_score = evaluator.evaluate(predictions)

print("F1 Score:", f1_score)

F1 Score: 0.4889686067533028


## 14-Because HAR data is not easy to classify, the required f1-score should be no less than 0.4.

## The Random Forest F1 score (about 0.49) is above the required 0.40