# DAT210x - Programming with Python for DS

## Module 6 - Lab 6

In [0]:
import pandas as pd
import time
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split


### How to Get The Dataset

Grab the DLA HAR dataset from:

- http://groupware.les.inf.puc-rio.br/har
- http://groupware.les.inf.puc-rio.br/static/har/dataset-har-PUC-Rio-ugulino.zip
- A cached copy of the dataset is included in the course repository.

After extracting it, load the dataset into a dataframe named `X` and do your regular dataframe examination:

In [2]:
!unzip dataset-har-PUC-Rio-ugulino.zip


Archive:  dataset-har-PUC-Rio-ugulino.zip
  inflating: dataset-har-PUC-Rio-ugulino.csv  


Encode the gender column such that `0` is male and `1` is female:

In [29]:
X = pd.read_csv("dataset-har-PUC-Rio-ugulino.csv", sep=";", decimal=",")
print(X.dtypes)
print(X.describe())
X["gender"] = X["gender"].map({"Man": 0, "Woman": 1})

user                   object
gender                  int64
age                     int64
how_tall_in_meters    float64
weight                  int64
body_mass_index       float64
x1                      int64
y1                      int64
z1                      int64
x2                      int64
y2                      int64
z2                      int64
x3                      int64
y3                      int64
z3                      int64
x4                      int64
y4                      int64
z4                     object
class                  object
dtype: object
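
The gender encoding step can be sanity-checked on a toy frame. This is a minimal sketch with made-up rows, not the lab's data; the names `df` and `encoded` are only for illustration. It shows that `Series.map` turns any label missing from the mapping into `NaN`, so it is worth asserting that nothing was left unmapped.

```python
import pandas as pd

# Toy frame standing in for the real dataset (values are made up).
df = pd.DataFrame({"gender": ["Man", "Woman", "Woman", "Man"]})

# Labels that are absent from the mapping become NaN, so check explicitly.
encoded = df["gender"].map({"Man": 0, "Woman": 1})
assert encoded.notna().all(), "unmapped gender label found"

df["gender"] = encoded.astype("int64")
print(df["gender"].tolist())
```

Running the mapping a second time on the already-encoded column would produce only `NaN` values, since `0` and `1` are not keys of the mapping.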


Clean up any columns with commas in them so that they're properly represented as decimals:

In [28]:
X.loc[X['z4'] == '-14420-11-2011 04:50:23.713']

Unnamed: 0,user,gender,age,how_tall_in_meters,weight,body_mass_index,x1,y1,z1,x2,y2,z2,x3,y3,z3,x4,y4,z4,class


In [0]:
X.drop(X.index[122076], axis=0, inplace=True)

Let's take a peek at your data types:

In [44]:
X.dtypes

user                    object
gender                   int64
age                      int64
how_tall_in_meters     float64
weight                   int64
body_mass_index        float64
x1                       int64
y1                       int64
z1                       int64
x2                       int64
y2                       int64
z2                       int64
x3                       int64
y3                       int64
z3                       int64
x4                       int64
y4                       int64
z4                       int64
class                 category
dtype: object

Convert any column that needs to be numeric using `errors='raise'`. This will alert you if something ends up being problematic.

In [0]:
X.z4 = pd.to_numeric(X.z4, errors='raise')
X["class"] = X["class"].astype('category', errors="raise")

If you find any problematic records, drop them before calling the `to_numeric` methods above.
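
One way to locate problematic records is sketched below on a toy column, assuming the only issue is unparseable strings. The frame `df` and the mask `bad` are illustrative names; the corrupted value is the one queried earlier in this lab. Passing `errors="coerce"` yields `NaN` for entries that cannot be parsed, which gives a mask of rows to drop before the strict conversion.

```python
import pandas as pd

# Toy column with one corrupted entry; the other values are made up.
df = pd.DataFrame({"z4": ["-150", "-14420-11-2011 04:50:23.713", "-162"]})

# errors="coerce" turns unparseable entries into NaN instead of raising,
# so isna() marks the offending rows.
bad = pd.to_numeric(df["z4"], errors="coerce").isna()
print(df[bad])

# Drop the bad rows, then convert strictly so any remaining problem raises.
df = df[~bad].copy()
df["z4"] = pd.to_numeric(df["z4"], errors="raise")
print(df.dtypes)
```

Dropping by mask avoids hard-coding a row position, which can shift if the file or the load options change.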

Okay, now encode your `y` value as a Pandas dummies version of your dataset's `class` column:

In [0]:
y = pd.get_dummies(X["class"])
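
To see what the dummies encoding produces, here is a small sketch on a toy label series. The labels and the names `labels` and `y_demo` are placeholders. Each category becomes its own indicator column, and every row has exactly one column set.

```python
import pandas as pd

# Toy activity labels (placeholders for the dataset's class column).
labels = pd.Series(["sitting", "walking", "standing", "sitting"], dtype="category")

# One indicator column per category, in category order.
y_demo = pd.get_dummies(labels)
print(y_demo.columns.tolist())

# Each row belongs to exactly one class, so every row sums to 1.
print((y_demo.sum(axis=1) == 1).all())
```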

In fact, get rid of the `user` and `class` columns:

In [0]:
X.drop(["class", "user"], axis=1, inplace=True)

Let's take a look at your handiwork:

In [48]:
X.describe()

Unnamed: 0,gender,age,how_tall_in_meters,weight,body_mass_index,x1,y1,z1,x2,y2,z2,x3,y3,z3,x4,y4,z4
count,165632.0,165632.0,165632.0,165632.0,165632.0,165632.0,165632.0,165632.0,165632.0,165632.0,165632.0,165632.0,165632.0,165632.0,165632.0,165632.0,165632.0
mean,0.612044,38.264925,1.639712,70.819431,26.188535,-6.649319,88.293591,-93.164449,-87.827956,-52.065911,-175.055647,17.423517,104.517056,-93.881641,-167.641211,-92.625235,-159.650985
std,0.487286,13.183821,0.05282,11.296557,2.995781,11.616273,23.895881,39.409487,169.435606,205.160081,192.817111,52.635546,54.155987,45.38977,38.311336,19.968653,13.22102
min,0.0,28.0,1.58,55.0,22.0,-306.0,-271.0,-603.0,-494.0,-517.0,-617.0,-499.0,-506.0,-613.0,-702.0,-526.0,-537.0
25%,0.0,28.0,1.58,55.0,22.0,-12.0,78.0,-120.0,-35.0,-29.0,-141.0,9.0,95.0,-103.0,-190.0,-103.0,-167.0
50%,1.0,31.0,1.62,75.0,28.4,-6.0,94.0,-98.0,-9.0,27.0,-118.0,22.0,107.0,-90.0,-168.0,-91.0,-160.0
75%,1.0,46.0,1.71,83.0,28.6,0.0,101.0,-64.0,4.0,86.0,-29.0,34.0,120.0,-80.0,-153.0,-80.0,-153.0
max,1.0,75.0,1.71,83.0,28.6,509.0,533.0,411.0,473.0,295.0,122.0,507.0,517.0,410.0,-13.0,86.0,-43.0


You can also easily display which rows have NaNs in them, if any:

In [49]:
X[pd.isnull(X).any(axis=1)]

Unnamed: 0,gender,age,how_tall_in_meters,weight,body_mass_index,x1,y1,z1,x2,y2,z2,x3,y3,z3,x4,y4,z4
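
The empty result above means no row contains a missing value. For comparison, this is a minimal sketch of the same expression on a toy frame that does contain one; `df` and `nan_rows` are illustrative names.

```python
import numpy as np
import pandas as pd

# Toy frame with a single missing value in row 1.
df = pd.DataFrame({"x1": [1.0, np.nan, 3.0], "y1": [4.0, 5.0, 6.0]})

# any(axis=1) flags a row if at least one of its columns is null.
nan_rows = df[df.isnull().any(axis=1)]
print(nan_rows.index.tolist())
```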


Create a random forest classifier named `model` with `n_estimators=30`, `max_depth=10`, `oob_score=True`, and `random_state=0`:

In [0]:
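# Only oob_score is set here; n_estimators, max_depth, and random_state
# are left at the library defaults.
model = RandomForestClassifier(oob_score=True)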
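
For reference, a classifier built with all four settings named in the instruction might look like the sketch below. It runs on synthetic data from `make_classification`, so the printed score says nothing about the HAR dataset; `X_demo`, `y_demo`, and `model_demo` are illustrative names.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data; not the lab's dataset.
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=0)

# All four hyperparameters from the instruction, passed explicitly.
model_demo = RandomForestClassifier(
    n_estimators=30, max_depth=10, oob_score=True, random_state=0
)
model_demo.fit(X_demo, y_demo)

print(len(model_demo.estimators_))
print(round(model_demo.oob_score_, 3))
```

With only a handful of trees, some samples may never be left out of a bootstrap draw, which is when scikit-learn warns that some inputs do not have OOB scores. Using more estimators makes that much less likely.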

Split your data into train and test sets, holding out 30% of the rows for testing and using `random_state=7`. Use variable names: `X_train`, `X_test`, `y_train`, and `y_test`:

In [0]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=7, test_size=.3)
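
The effect of `test_size` is easy to confirm on toy arrays. This sketch assumes 100 made-up rows, with illustrative names throughout, and checks the resulting 70/30 split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 100 toy rows with 2 features each, plus a toy binary label.
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.arange(100) % 2

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=7
)
print(X_tr.shape, X_te.shape)
```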

### Now the Fun Stuff

In [57]:
print("Fitting...")
s = time.time()

# TODO: train your model on your training set

model.fit(X_train, y_train)

print("Fitting completed in: ", time.time() - s)

Fitting...
Fitting completed in:  5.333916664123535


  warn("Some inputs do not have OOB scores. "
  predictions[k].sum(axis=1)[:, np.newaxis])


Display the OOB Score of your data:

In [58]:
score = model.oob_score_
print("OOB Score: ", round(score*100, 3))

OOB Score:  99.308


In [61]:
print("Scoring...")
s = time.time()

# TODO: score your model on your test set

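# Note: oob_score_ is the out-of-bag estimate computed during fit,
# so this repeats the value above rather than scoring X_test / y_test.
score = model.oob_score_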
print("Score: ", round(score*100, 3))
print("Scoring completed in: ", time.time() - s)

Scoring...
Score:  99.308
Scoring completed in:  0.0008289813995361328
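
Scoring against the held-out test set is usually done with the estimator's `score` method. Below is a minimal sketch on synthetic data, so the numbers are unrelated to the HAR results above, and all `_demo` names are illustrative. It contrasts the out-of-bag estimate, computed during `fit` from the training rows, with accuracy on rows the model never saw.

```python
import time

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; not the lab's dataset.
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=7
)

model_demo = RandomForestClassifier(n_estimators=30, oob_score=True, random_state=0)
model_demo.fit(X_tr, y_tr)

s = time.time()
# score() predicts on the held-out rows and returns mean accuracy.
test_score = model_demo.score(X_te, y_te)
print("OOB Score: ", round(model_demo.oob_score_ * 100, 3))
print("Test Score: ", round(test_score * 100, 3))
print("Scoring completed in: ", time.time() - s)
```

When `y` is a dummies frame, as in this lab, scikit-learn treats the problem as multilabel and `score` reports subset accuracy: a row counts as correct only if every indicator column matches.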


At this point, go ahead and answer the lab questions, then return here to experiment more.

Try playing around with the gender column. For example, encode `gender` as `Male:1` and `Female:0`. Also try encoding it as a Pandas dummies variable and see what effect that has. You can also try dropping gender entirely from the dataframe. How does that change the score of the model? This is a key insight into how your feature encoding alters your overall scoring, and why it's important to choose a good encoding.

In [0]:
# .. your code changes above ..
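
The three gender experiments could be set up along these lines. This is a sketch on a toy frame with made-up rows; the variant names `flipped`, `dummied`, and `dropped` are hypothetical. Each variant would then go through the same split, fit, and score steps for comparison.

```python
import pandas as pd

# Toy frame standing in for the raw dataset (values are made up).
df = pd.DataFrame({"gender": ["Man", "Woman", "Woman"], "age": [28, 31, 46]})

# Variant 1: flipped binary encoding (Male:1, Female:0).
flipped = df.assign(gender=df["gender"].map({"Man": 1, "Woman": 0}))

# Variant 2: one indicator column per gender value.
dummied = pd.get_dummies(df, columns=["gender"])

# Variant 3: no gender feature at all.
dropped = df.drop(columns=["gender"])

for name, variant in [("flipped", flipped), ("dummied", dummied), ("dropped", dropped)]:
    print(name, variant.columns.tolist())
```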

After that, try messing with `y`. Right now it's encoded with dummies, but try other encoding methods to see what effects they have.
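
One alternative to dummies is a single column of integer category codes. This sketch uses toy labels, with `labels` and `y_codes` as illustrative names; it is one possible encoding to try, not a recommendation from the lab.

```python
import pandas as pd

# Toy activity labels (placeholders for the dataset's class column).
labels = pd.Series(["sitting", "walking", "standing", "sitting"], dtype="category")

# cat.codes assigns each category its position in the category list.
y_codes = labels.cat.codes
print(dict(enumerate(labels.cat.categories)))
print(y_codes.tolist())
```

With a one-dimensional `y` like this, the classifier treats the task as ordinary multiclass classification rather than as several independent indicator columns.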