# Hackshop---Machine Learning for Application Classification

February 2020

In this hackshop, we apply our machine learning skills to the application classification problem. The important difference from the workshop is that we use a bigger dataset with more features, and the task is a multi-class classification problem.

You are expected to build a machine learning model following the procedure explained in the "Big Data" and "Machine Learning" workshops (e.g. data cleaning, data wrangling, feature selection, model selection, and hyper-parameter tuning). At the end, we will evaluate the result using accuracy and the confusion matrix.

**Your goal: Produce an effective, efficient machine learning model that can classify running Android apps _with the highest accuracy_**.

Have fun!

In [0]:
import pandas as pd
import numpy as np
import os
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn import preprocessing
%matplotlib inline

In [0]:
pip install wget

In [0]:
import wget
url = 'https://www.cs.odu.edu/~yhe/data/sherlock_apps_yhe_test.csv'  
wget.download(url,out='hackshop.csv')

In [0]:
df=pd.read_csv("hackshop.csv")

**Task 1: Inspect the dataset and answer the following questions:**

1. How many columns and rows does it have?
2. What are the names of the columns?
3. What are the data types of those columns?
 
Hint: `df.shape`, `df.info()`
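
As a warm-up, here is a minimal sketch of those inspection calls on a tiny made-up table (not the hackshop data). Only `ApplicationName` and `CPU_USAGE` are column names that appear in this notebook; `priority` and all the values are assumptions for illustration.

```python
import pandas as pd

# Toy stand-in for the real dataset; `priority` and every value are made up.
toy = pd.DataFrame({
    "ApplicationName": ["Chrome", "WhatsApp", "Chrome", "Facebook"],
    "CPU_USAGE": [0.12, 0.05, 0.30, 0.08],
    "priority": [20, 20, 10, 20],
})

n_rows, n_cols = toy.shape          # shape is a (rows, columns) tuple
print(n_rows, n_cols)
print(list(toy.columns))            # column names
print(toy.dtypes)                   # one dtype per column
toy.info()                          # dtypes plus non-null counts in one report
```

The same calls on `df` answer the three questions above.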

**Task 2: Does this dataset have any missing values, and if so, in which columns?**

Hint: `isna()`, `sum()`
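
One possible way to chain the two hinted calls, sketched on a made-up table with a single missing value (the `priority` column and the values are assumptions):

```python
import numpy as np
import pandas as pd

# Toy table with one deliberately missing CPU_USAGE value.
toy = pd.DataFrame({
    "ApplicationName": ["Chrome", "WhatsApp", "Chrome", "Facebook"],
    "CPU_USAGE": [0.12, np.nan, 0.30, 0.08],
    "priority": [20, 20, 10, 20],
})

# isna() marks each missing cell as True; sum() counts them per column.
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Keep only the columns that have at least one missing value.
columns_with_missing = missing_per_column[missing_per_column > 0].index.tolist()
print(columns_with_missing)
```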

**Task 3: What should we do to deal with these missing data?**

Hint: `fillna()`, `del`, `.loc[]`, `dropna()`
* Find the records that contain missing values
* Check the index of those records
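
The sketch below shows two common options on a made-up table: dropping the incomplete rows, or filling the gap with the column median. Which one suits the real dataset depends on how many rows are affected, so treat this as an illustration rather than the recommended answer.

```python
import numpy as np
import pandas as pd

# Toy table; values are made up for illustration.
toy = pd.DataFrame({
    "ApplicationName": ["Chrome", "WhatsApp", "Chrome", "Facebook"],
    "CPU_USAGE": [0.12, np.nan, 0.30, 0.08],
})

# Find the records with at least one missing value and note their index.
rows_with_missing = toy[toy.isna().any(axis=1)]
missing_index = rows_with_missing.index.tolist()
print(missing_index)

# Option A: drop every row that contains a missing value.
dropped = toy.dropna()

# Option B: fill the missing value with the median of that column.
filled = toy.fillna({"CPU_USAGE": toy["CPU_USAGE"].median()})
print(filled)
```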

**Task 4: Check the data type of each feature (numeric or categorical)**

Hint: `describe()`; `.T` transposes the output, so rows become columns and vice versa
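
A small sketch of the hint on made-up data: `describe().T` gives one row per numeric feature, and `select_dtypes()` (not mentioned in the hint, used here as one convenient option) separates numeric columns from the rest.

```python
import pandas as pd

# Toy table; `priority` and the values are assumptions.
toy = pd.DataFrame({
    "ApplicationName": ["Chrome", "WhatsApp", "Chrome", "Facebook"],
    "CPU_USAGE": [0.12, 0.05, 0.30, 0.08],
    "priority": [20, 20, 10, 20],
})

# One row per numeric column, with count/mean/std/min/quartiles/max as columns.
summary = toy.describe().T
print(summary)

# Split the column names by kind.
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
categorical_cols = toy.select_dtypes(exclude="number").columns.tolist()
print(numeric_cols, categorical_cols)
```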

**Task 5: Plot the distribution of the features. Are there any suspicious features?**

Hint: `sns.countplot()`, `sns.distplot()` (deprecated since seaborn 0.11; use `sns.histplot()` in newer versions), `plt.hist()`, `sns.barplot()`, `sns.boxplot()`, `sns.scatterplot()`, `sns.jointplot()`

You can refer to the examples below:
1. `sns.countplot(x='ApplicationName', data=df)` followed by `plt.xticks(rotation=90)`
2. `sns.barplot(x='ApplicationName', y='CPU_USAGE', data=df)` followed by `plt.xticks(rotation=90)`

**Task 6: Check the correlation between different features and plot the heatmap**

Hint: `corr()`, `sns.heatmap()`
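
Besides looking at the heatmap, it can help to list the strongly correlated pairs explicitly. The sketch below does that on made-up numbers; the columns `cpu_copy` and `mem` and the 0.9 threshold are assumptions for illustration. The resulting `corr` matrix is what you would pass to `sns.heatmap()`.

```python
import numpy as np
import pandas as pd

# Toy numeric table: cpu_copy is a rescaled copy of CPU_USAGE on purpose.
toy = pd.DataFrame({
    "CPU_USAGE": [0.10, 0.20, 0.30, 0.40, 0.50],
    "cpu_copy": [1.0, 2.0, 3.0, 4.0, 5.0],
    "mem": [5.0, 1.0, 4.0, 2.0, 3.0],
})

corr = toy.corr()                     # pairwise Pearson correlation
print(corr)

# Keep only the upper triangle so each pair is listed once.
upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(upper).stack()

# Pairs whose absolute correlation exceeds the (arbitrary) threshold.
strong = pairs[pairs.abs() > 0.9]
print(strong)
```

A pair that shows up here is a candidate for dropping one of the two features in the next task.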

**Task 7: By now, you have an overall background understanding of this dataset.**

Consider which features you want to keep and which you would like to remove. After that, separate your label from the features.

Hint: `copy()`, `drop(axis=1)`
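
A minimal sketch of this step, assuming `ApplicationName` is the label (the goal above is to classify the running app) and that a hypothetical redundant column `cpu_copy` is the one you decided to remove:

```python
import pandas as pd

# Toy table; cpu_copy and priority are made-up columns.
toy = pd.DataFrame({
    "ApplicationName": ["Chrome", "WhatsApp", "Chrome", "Facebook"],
    "CPU_USAGE": [0.12, 0.05, 0.30, 0.08],
    "cpu_copy": [1.2, 0.5, 3.0, 0.8],
    "priority": [20, 20, 10, 20],
})

# Copy the label column so later edits do not touch the original frame.
labels = toy["ApplicationName"].copy()

# drop(..., axis=1) removes columns and returns a new frame; toy is unchanged.
features = toy.drop(["ApplicationName", "cpu_copy"], axis=1)
print(features.columns.tolist())
```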


**Task 8: Are there any categorical features? If yes, what is the best way to deal with them?**

Hint: one-hot encoding, `pd.get_dummies()`
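
To illustrate one-hot encoding, the sketch below uses a hypothetical categorical column `state` with made-up process-state codes; the real dataset may or may not contain such a column.

```python
import pandas as pd

# Toy features with one categorical column (hypothetical name and values).
toy = pd.DataFrame({
    "CPU_USAGE": [0.12, 0.05, 0.30, 0.08],
    "state": ["S", "R", "S", "D"],
})

# Each category becomes its own indicator column named <column>_<category>.
encoded = pd.get_dummies(toy, columns=["state"])
print(encoded)
```

Each row has exactly one of the new indicator columns set, so no artificial ordering is imposed on the categories.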

**Task 9: Use feature scaling to bring all the features to a similar order of magnitude**

Hint: `StandardScaler()`
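
Here is a rough sketch of standard scaling on two made-up features whose magnitudes differ by several orders (`vsize` is a hypothetical column name). After scaling, each column has mean 0 and unit variance.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy features on very different scales; names and values are made up.
features = pd.DataFrame({
    "CPU_USAGE": [0.1, 0.2, 0.3, 0.4],
    "vsize": [1e6, 2e6, 3e6, 4e6],
})

scaler = StandardScaler()
# fit_transform returns a NumPy array, so wrap it to keep the column names.
scaled = pd.DataFrame(scaler.fit_transform(features),
                      columns=features.columns,
                      index=features.index)
print(scaled)
```

A stricter variant fits the scaler on the training split only and then applies `scaler.transform()` to the dev split, which avoids leaking dev-set statistics into training.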

**Task 10: Before training a model, we first need to split the data into training and development (dev) sets**

In [0]:
from sklearn.model_selection import train_test_split
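
A possible shape for the split, sketched on random toy data. The four output names match the variables used by the model cells in this notebook; the 80/20 ratio, the `random_state`, and the use of `stratify` are assumptions you are free to change.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data: 20 rows, 2 made-up features, 2 balanced classes.
rng = np.random.default_rng(0)
features = pd.DataFrame(rng.normal(size=(20, 2)),
                        columns=["CPU_USAGE", "priority"])
labels = pd.Series(["Chrome", "WhatsApp"] * 10, name="ApplicationName")

# stratify keeps the class proportions the same in both splits.
train_FM_label, dev_FM_label, train_L_label, dev_L_label = train_test_split(
    features, labels, test_size=0.2, random_state=42, stratify=labels)

print(train_FM_label.shape, dev_FM_label.shape)
```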

**Task 11: Let's use the decision tree algorithm in this experiment. We have already given you a template for the model and the evaluation function. Can you get the best score by tuning the hyper-parameters?**

In [0]:
from sklearn.tree import DecisionTreeClassifier
# train_FM_label / train_L_label: training features and labels from your Task 10 split
model_dtc = DecisionTreeClassifier(criterion='entropy',
                                   max_depth=6,
                                   min_samples_split=8)
model_dtc.fit(train_FM_label, train_L_label)

In [0]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, confusion_matrix
def model_evaluate(model, train_FM, dev_FM, train_L, dev_L):
    """Print accuracy and confusion matrix on the training and development sets."""
    train_L_pred = model.predict(train_FM)
    dev_L_pred = model.predict(dev_FM)
    print("Evaluation of training set by using model:", type(model).__name__)
    print("accuracy_score:", accuracy_score(train_L, train_L_pred))
    # normalize=False returns the number of correct predictions instead of the fraction
    print("No of correct:", accuracy_score(train_L, train_L_pred, normalize=False))
    # With multi-class labels, precision/recall need an explicit `average` argument;
    # the default average='binary' raises an error.
    #print("precision_score:", precision_score(train_L, train_L_pred, average='macro'))
    #print("recall_score:", recall_score(train_L, train_L_pred, average='macro'))
    print("confusion_matrix:", "\n", confusion_matrix(train_L, train_L_pred))
    print("Evaluation of development set")
    print("accuracy_score:", accuracy_score(dev_L, dev_L_pred))
    print("confusion_matrix:", "\n", confusion_matrix(dev_L, dev_L_pred))

In [0]:
model_evaluate(model_dtc,train_FM_label,dev_FM_label,train_L_label,dev_L_label)
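
Accuracy alone can hide classes that the model handles poorly. If you want per-class-averaged precision and recall for a multi-class problem, one option is macro averaging, sketched here on made-up labels and predictions:

```python
from sklearn.metrics import precision_score, recall_score

# Made-up true labels and predictions for three apps.
y_true = ["Chrome", "Chrome", "WhatsApp", "Facebook", "Facebook", "WhatsApp"]
y_pred = ["Chrome", "WhatsApp", "WhatsApp", "Facebook", "Chrome", "WhatsApp"]

# average="macro" computes the metric per class, then takes the plain mean.
macro_precision = precision_score(y_true, y_pred, average="macro")
macro_recall = recall_score(y_true, y_pred, average="macro")
print(macro_precision, macro_recall)
```

Other choices such as `average="weighted"` weight each class by its number of samples instead.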

**Task 12: Let's try logistic regression this time. Can you find a better solution?**

In [0]:
from sklearn.linear_model import LogisticRegression
model_lr = LogisticRegression()
model_lr.fit(train_FM_label,train_L_label)

In [0]:
model_evaluate(model_lr,train_FM_label,dev_FM_label,train_L_label,dev_L_label)

If you got to this point and have produced a result, congratulations: you have a working machine-learning model.
But wait, don't go yet!
You have only completed the first step of the modeling and have not optimized or improved the model yet.

## Challenge Questions:

1. Can we improve the accuracy of the prediction? Hint: tweak the parameters used when creating the `DecisionTreeClassifier` object.

2. Can we come up with a smaller set of features that achieves the same accuracy?
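
One way to approach both questions is sketched below on synthetic data from `make_classification` (not the hackshop dataset). It swaps the manual tuning for `GridSearchCV`, which scores each parameter combination by cross-validation rather than on a single dev set, and then ranks features by the tree's `feature_importances_`. The grid values are arbitrary starting points, not recommended settings.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic 3-class problem with 8 features, 4 of them informative.
X, y = make_classification(n_samples=300, n_features=8, n_informative=4,
                           n_redundant=0, n_classes=3, random_state=0)

# Candidate hyper-parameter values (assumptions; adjust to your data).
param_grid = {
    "criterion": ["gini", "entropy"],
    "max_depth": [4, 6, 8, None],
    "min_samples_split": [2, 8, 16],
}

# Try every combination with 3-fold cross-validation, scored by accuracy.
search = GridSearchCV(DecisionTreeClassifier(random_state=0),
                      param_grid, cv=3, scoring="accuracy")
search.fit(X, y)
print(search.best_params_, search.best_score_)

# Rank features by importance in the best tree (most important first).
importances = search.best_estimator_.feature_importances_
ranked = importances.argsort()[::-1]
print(ranked)
```

Retraining with only the top-ranked features and comparing the dev accuracy is one way to test whether a smaller feature set is enough.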