# DAT210x - Programming with Python for DS

## Module6 - Lab6

In [1]:
import pandas as pd
import time
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

### How to Get The Dataset

Grab the DLA HAR dataset from:

- http://groupware.les.inf.puc-rio.br/har
- http://groupware.les.inf.puc-rio.br/static/har/dataset-har-PUC-Rio-ugulino.zip
- A cached copy of the dataset is included in the course repository.

After extracting the archive, load the dataset into a dataframe named `X` and do your regular dataframe examination:

In [2]:
X = pd.read_csv('Datasets/dataset-har-PUC-Rio-ugulino.csv', sep = ';', decimal = ',',low_memory=False)
X.head()

Unnamed: 0,user,gender,age,how_tall_in_meters,weight,body_mass_index,x1,y1,z1,x2,y2,z2,x3,y3,z3,x4,y4,z4,class
0,debora,Woman,46,1.62,75,28.6,-3,92,-63,-23,18,-19,5,104,-92,-150,-103,-147,sitting
1,debora,Woman,46,1.62,75,28.6,-3,94,-64,-21,18,-18,-14,104,-90,-149,-104,-145,sitting
2,debora,Woman,46,1.62,75,28.6,-1,97,-61,-12,20,-15,-13,104,-90,-151,-104,-144,sitting
3,debora,Woman,46,1.62,75,28.6,-2,96,-57,-15,21,-16,-13,104,-89,-153,-103,-142,sitting
4,debora,Woman,46,1.62,75,28.6,-1,96,-61,-13,20,-15,-13,104,-89,-153,-104,-143,sitting
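
The `sep` and `decimal` arguments matter here because the file uses `;` between fields and `,` as the decimal mark. A minimal sketch of the effect, using a made-up two-row sample (the rows below are hypothetical, not taken from the real file):

```python
import io
import pandas as pd

# Hypothetical sample in the same layout as the HAR file:
# ';' separates fields and ',' is the decimal mark.
sample = io.StringIO(
    "user;how_tall_in_meters;body_mass_index;x1\n"
    "debora;1,62;28,6;-3\n"
    "debora;1,62;28,6;-1\n"
)

df = pd.read_csv(sample, sep=';', decimal=',')
print(df.dtypes)  # the two comma-decimal columns should come out as float64
```

Without `decimal=','`, those two columns would be read as strings and show up as `object` in `dtypes`.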


Encode the gender column so that `0` is male and `1` is female:

In [3]:
X.gender = X.gender.map({'Man': 0, 'Woman': 1})
#X['gender'] = X['gender'].map({'Man': 0, 'Woman': 1})
X.head()

Unnamed: 0,user,gender,age,how_tall_in_meters,weight,body_mass_index,x1,y1,z1,x2,y2,z2,x3,y3,z3,x4,y4,z4,class
0,debora,1,46,1.62,75,28.6,-3,92,-63,-23,18,-19,5,104,-92,-150,-103,-147,sitting
1,debora,1,46,1.62,75,28.6,-3,94,-64,-21,18,-18,-14,104,-90,-149,-104,-145,sitting
2,debora,1,46,1.62,75,28.6,-1,97,-61,-12,20,-15,-13,104,-90,-151,-104,-144,sitting
3,debora,1,46,1.62,75,28.6,-2,96,-57,-15,21,-16,-13,104,-89,-153,-103,-142,sitting
4,debora,1,46,1.62,75,28.6,-1,96,-61,-13,20,-15,-13,104,-89,-153,-104,-143,sitting
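
One thing to watch with `map`: any value missing from the dictionary silently becomes NaN. A small sketch with a hypothetical series (the lowercase `'woman'` is a deliberate typo for illustration) shows how to check for that:

```python
import pandas as pd

# Hypothetical values; 'woman' (lowercase) is not a key in the mapping.
gender = pd.Series(['Man', 'Woman', 'woman', 'Woman'])

encoded = gender.map({'Man': 0, 'Woman': 1})

# Count the entries that did not match any key in the mapping.
n_unmapped = int(encoded.isnull().sum())
print("Unmapped values:", n_unmapped)
```

If the count is not zero, inspect `gender[encoded.isnull()].unique()` before moving on.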


Clean up any columns with commas in them so that they're properly represented as decimals:

In [4]:
# Done! pd.read_csv(decimal = ',') takes care of it.

# For a single string with thousands separators, e.g. "123,456.908" -> 123456.908:
#float("123,456.908".replace(',',''))
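
If the file had been loaded without `decimal=','`, the comma-decimal columns could still be fixed afterwards. A minimal sketch, assuming a hypothetical string column `raw`:

```python
import pandas as pd

# Hypothetical column that was read as strings with comma decimals.
raw = pd.Series(['1,62', '1,71', '1,58'])

# Swap the decimal comma for a point, then convert to float.
clean = raw.str.replace(',', '.', regex=False).astype(float)
print(clean)
```

This only works when the comma is the decimal mark; if commas are thousands separators, they should be removed instead of replaced.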

Let's take a peek at your data types:

In [5]:
X.dtypes

user                   object
gender                  int64
age                     int64
how_tall_in_meters    float64
weight                  int64
body_mass_index       float64
x1                      int64
y1                      int64
z1                      int64
x2                      int64
y2                      int64
z2                      int64
x3                      int64
y3                      int64
z3                      int64
x4                      int64
y4                      int64
z4                     object
class                  object
dtype: object

Convert any column that needs it to numeric, using `errors='raise'`. This will alert you if something ends up being problematic.

In [6]:
# errors='coerce' turns unparseable entries into NaN instead of raising
X.z4 = pd.to_numeric(X.z4, errors = 'coerce')
X.isnull().sum() # one value was coerced into a NaN; drop that row below
X.dropna(axis = 0, how = 'any', inplace = True)
X.isnull().sum()
#print(X[['class']])

user                  0
gender                0
age                   0
how_tall_in_meters    0
weight                0
body_mass_index       0
x1                    0
y1                    0
z1                    0
x2                    0
y2                    0
z2                    0
x3                    0
y3                    0
z3                    0
x4                    0
y4                    0
z4                    0
class                 0
dtype: int64

If you find any problematic records, drop them before calling the `to_numeric` methods above.
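
The difference between `errors='raise'` and `errors='coerce'` can be sketched on a small hypothetical series (the `'bad-value'` entry is invented for illustration):

```python
import pandas as pd

# Hypothetical column with one entry that is not a number.
z4 = pd.Series(['-147', '-145', 'bad-value'])

# errors='raise' stops at the first unparseable entry.
try:
    pd.to_numeric(z4, errors='raise')
    raised = False
except ValueError as exc:
    raised = True
    print("to_numeric failed:", exc)

# errors='coerce' converts what it can and marks the rest as NaN,
# which makes it easy to locate the problematic rows.
coerced = pd.to_numeric(z4, errors='coerce')
bad_rows = z4[coerced.isnull()]
print(bad_rows)
```

Either route ends in the same place: find the bad records, then drop them.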

Okay, now encode your `y` value as a Pandas dummies version of your dataset's `class` column:

In [7]:
y = X[['class']]
y = pd.get_dummies(y)
y.head()

Unnamed: 0,class_sitting,class_sittingdown,class_standing,class_standingup,class_walking
0,1,0,0,0,0
1,1,0,0,0,0
2,1,0,0,0,0
3,1,0,0,0,0
4,1,0,0,0,0
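
To see what `get_dummies` does to a label column, and how to get back to the original labels, here is a sketch on a hypothetical four-row frame. Passing `dtype=int` is an assumption made here so the output is 0/1 regardless of pandas version:

```python
import pandas as pd

# Hypothetical labels, using class names from the dataset.
labels = pd.DataFrame({'class': ['sitting', 'walking', 'standing', 'sitting']})

# One indicator column per distinct class, prefixed with the column name.
dummies = pd.get_dummies(labels, dtype=int)
print(dummies)

# Each row has exactly one 1, so idxmax recovers the original label.
recovered = dummies.idxmax(axis=1).str.replace('class_', '', regex=False)
print(recovered.tolist())
```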


Now that `y` has been extracted, get rid of the `user` and `class` columns in `X`:

In [8]:
X.drop(labels = ['user', 'class'], axis = 1, inplace = True)

Let's take a look at your handiwork:

In [9]:
X.describe()

Unnamed: 0,gender,age,how_tall_in_meters,weight,body_mass_index,x1,y1,z1,x2,y2,z2,x3,y3,z3,x4,y4,z4
count,165632.0,165632.0,165632.0,165632.0,165632.0,165632.0,165632.0,165632.0,165632.0,165632.0,165632.0,165632.0,165632.0,165632.0,165632.0,165632.0,165632.0
mean,0.612044,38.264925,1.639712,70.819431,26.188535,-6.649319,88.293591,-93.164449,-87.827956,-52.065911,-175.055647,17.423517,104.517056,-93.881641,-167.641211,-92.625235,-159.650985
std,0.487286,13.183821,0.05282,11.296557,2.995781,11.616273,23.895881,39.409487,169.435606,205.160081,192.817111,52.635546,54.155987,45.38977,38.311336,19.968653,13.22102
min,0.0,28.0,1.58,55.0,22.0,-306.0,-271.0,-603.0,-494.0,-517.0,-617.0,-499.0,-506.0,-613.0,-702.0,-526.0,-537.0
25%,0.0,28.0,1.58,55.0,22.0,-12.0,78.0,-120.0,-35.0,-29.0,-141.0,9.0,95.0,-103.0,-190.0,-103.0,-167.0
50%,1.0,31.0,1.62,75.0,28.4,-6.0,94.0,-98.0,-9.0,27.0,-118.0,22.0,107.0,-90.0,-168.0,-91.0,-160.0
75%,1.0,46.0,1.71,83.0,28.6,0.0,101.0,-64.0,4.0,86.0,-29.0,34.0,120.0,-80.0,-153.0,-80.0,-153.0
max,1.0,75.0,1.71,83.0,28.6,509.0,533.0,411.0,473.0,295.0,122.0,507.0,517.0,410.0,-13.0,86.0,-43.0


You can also easily display which rows have NaNs in them, if any:

In [10]:
X[pd.isnull(X).any(axis=1)]

Unnamed: 0,gender,age,how_tall_in_meters,weight,body_mass_index,x1,y1,z1,x2,y2,z2,x3,y3,z3,x4,y4,z4


Create an RForest classifier named `model` and set `n_estimators=30`, the `max_depth` to 10, `oob_score=True`, and `random_state=0`:

In [11]:
#from sklearn.ensemble import RandomForestClassifier
forest = RandomForestClassifier(n_estimators = 30, max_depth = 10, oob_score = True, random_state = 0)

Split your data into `test` / `train` sets. Your `test` size can be 30%, with `random_state` 7. Use variable names: `X_train`, `X_test`, `y_train`, and `y_test`:

In [12]:
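# train_test_split was already imported from sklearn.model_selection above
# (the older sklearn.cross_validation module no longer exists in current scikit-learn)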
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 7)
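
A quick sketch of what `test_size` and `random_state` do, using a hypothetical ten-row array in place of the real data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 10 samples with 2 features each.
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=7
)
print(len(X_tr), len(X_te))

# Re-running with the same random_state reproduces the same split.
_, X_te_again, _, _ = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=7
)
same_split = bool((X_te == X_te_again).all())
```

Fixing `random_state` is what makes scores comparable between runs while you experiment with encodings.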


### Now the Fun Stuff

In [13]:
print("Fitting...")
s = time.time()

# TODO: train your model on your training set

model = forest.fit(X_train, y_train)

print("Fitting completed in: ", time.time() - s)

Fitting...
Fitting completed in:  7.153017520904541


Display the OOB Score of your data:

In [14]:
score = model.oob_score_
print("OOB Score: ", round(score*100, 3))

OOB Score:  98.744
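
The OOB (out-of-bag) score is estimated from the samples each tree did not see in its bootstrap draw, so it needs no separate test set. A minimal sketch on synthetic data (the dataset from `make_classification` is a stand-in, not the HAR data):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data; the sizes here are arbitrary choices.
X_demo, y_demo = make_classification(
    n_samples=500, n_features=8, n_informative=5, random_state=0
)

rf = RandomForestClassifier(
    n_estimators=30, max_depth=10, oob_score=True, random_state=0
)
rf.fit(X_demo, y_demo)

# oob_score_ only exists because oob_score=True was set before fitting.
print("OOB Score: ", round(rf.oob_score_ * 100, 3))
```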


In [19]:
print("Scoring...")
s = time.time()

# TODO: score your model on your test set

score = model.score(X_test, y_test)

print("Score: ", round(score*100, 3))
print("Scoring completed in: ", time.time() - s,"seconds")

Scoring...
Score:  95.687
Scoring completed in:  0.6076157093048096 seconds
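
Because `y` is a dummies frame with one column per class, scikit-learn treats it as a multilabel target, and as far as its documentation describes, `score` then reports subset accuracy: a row counts as correct only if every column matches. A sketch with hypothetical indicator arrays, using `accuracy_score` directly:

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical one-hot targets for 4 samples and 3 classes.
y_true = np.array([[1, 0, 0],
                   [0, 1, 0],
                   [0, 0, 1],
                   [1, 0, 0]])

# In the third row no class is predicted at all, which can happen
# when each indicator column is predicted separately.
y_pred = np.array([[1, 0, 0],
                   [0, 1, 0],
                   [0, 0, 0],
                   [1, 0, 0]])

# A row is only correct when all three columns agree.
subset_acc = accuracy_score(y_true, y_pred)
print(subset_acc)
```

This is worth keeping in mind when comparing scores between a dummies `y` and a single-column `y`.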


At this point, go ahead and answer the lab questions, then return here to experiment more.

Try playing around with the gender column. For example, encode `gender` as `Male:1` and `Female:0`. Also try encoding it as a Pandas dummies variable and see what effect that has. You can also try dropping gender entirely from the dataframe. How does that change the score of the model? This will be a key insight into how your feature encoding alters your overall scoring, and why it's important to choose good encodings.

In [16]:
# .. your code changes above ..
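
The three gender variants can be sketched on a hypothetical frame like this. The `demo` frame and its values are made up; in the lab you would apply the same calls to the raw `gender` column before it is mapped:

```python
import pandas as pd

# Hypothetical frame with the raw (unmapped) gender strings.
demo = pd.DataFrame({'gender': ['Man', 'Woman', 'Woman'], 'age': [31, 46, 28]})

# Variant 1: flip the encoding (Male:1, Female:0).
flipped = demo.assign(gender=demo.gender.map({'Man': 1, 'Woman': 0}))

# Variant 2: one indicator column per gender value.
dummied = pd.get_dummies(demo, columns=['gender'], dtype=int)

# Variant 3: drop the feature entirely.
dropped = demo.drop(columns=['gender'])

print(flipped, dummied, dropped, sep='\n\n')
```

Refit and rescore the model with each variant, keeping `random_state` fixed, to compare them fairly.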

After that, try messing with `y`. Right now it's encoded with dummies, but try other encoding methods to see what effects they have.
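
One alternative encoding for `y` is a single column of integer category codes, which gives the classifier a 1-D multiclass target instead of several indicator columns. A minimal sketch with hypothetical labels:

```python
import pandas as pd

# Hypothetical labels, using class names from the dataset.
classes = pd.Series(['sitting', 'walking', 'standing', 'sitting'])

# Categories are sorted alphabetically, then numbered from 0.
cat = classes.astype('category')
y_codes = cat.cat.codes
print(dict(enumerate(cat.cat.categories)))
print(y_codes.tolist())
```

Keep the `cat.cat.categories` mapping around so predictions can be translated back to class names.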