# CS 1656 – Introduction to Data Science 

## Instructor: Alexandros Labrinidis / Teaching Assistant: Evangelos Karageorgos
### Additional credits: Xiaoting Li, Tahereh Arabghalizi, Evangelos Karageorgos, Zuha Agha, Anatoli Shein, Phuong Pham
## Recitation 11: Regression and Decision Trees

In this recitation, we will learn how to do linear regression and decision-tree classification using the scikit-learn Python package.

In [1]:
import pandas as pd
import numpy as np
from sklearn import linear_model, tree, metrics
import matplotlib.pyplot as plt
%matplotlib inline

## Linear Regression
LinearRegression fits a linear model with coefficients w = (w_1, ..., w_p) to minimize the residual sum of squares between the observed responses in the dataset, and the responses predicted by the linear approximation.

LinearRegression will take in its fit method arrays X, y and will store the coefficients w of the linear model in its coef_ member.
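
To make "minimize the residual sum of squares" concrete, here is a minimal sketch for the single-feature case, written in plain Python rather than scikit-learn. The data is made up for illustration, and the closed-form formulas are the standard ones for one feature. The `slope` here plays the role of `coef_`, and `intercept` the role of scikit-learn's `intercept_`.

```python
# Made-up data for illustration only: y is exactly 2*x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Closed-form least-squares estimates for a single feature:
#   slope = sum((x - mean_x) * (y - mean_y)) / sum((x - mean_x)^2)
#   intercept = mean_y - slope * mean_x
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
intercept = mean_y - slope * mean_x


def rss(w, b):
    """Residual sum of squares of the line y = w*x + b on the toy data."""
    return sum((y - (w * x + b)) ** 2 for x, y in zip(xs, ys))


print(slope, intercept)              # 2.0 1.0
print(rss(slope, intercept))         # 0.0
print(rss(slope + 0.5, intercept))   # 7.5, any other line has a larger RSS
```

With several features the same idea applies, but the solution is computed with linear algebra rather than these two formulas.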

We will now go through an example of linear regression on a bike sharing dataset.

In [2]:
df = pd.read_csv('http://data.cs1656.org/bike_share.csv')
df.head()

Unnamed: 0,instant,season,hr,holiday,weekday,workingday,weathersit,temp,temp_feels,hum,windspeed,cnt
0,1,1,0,0,6,0,1,0.24,0.2879,0.81,0.0,16
1,2,1,1,0,6,0,1,0.22,0.2727,0.8,0.0,40
2,3,1,2,0,6,0,1,0.22,0.2727,0.8,0.0,32
3,4,1,3,0,6,0,1,0.24,0.2879,0.75,0.0,13
4,5,1,4,0,6,0,1,0.24,0.2879,0.75,0.0,1


The attributes of the dataset are as follows:

    - instant : record index
    - season : season (1: spring, 2: summer, 3: fall, 4: winter)
    - hr : hour (0 to 23)
    - holiday : whether the day is a holiday or not (extracted from http://dchr.dc.gov/page/holiday-schedule)
    - weekday : day of the week (0 to 6)
    - workingday : 1 if the day is neither a weekend nor a holiday, otherwise 0
    - weathersit :
        - 1: Clear
        - 2: Misty, Cloudy
        - 3: Light Snow, Light Rain
        - 4: Heavy Rain, Ice Pellets
    - temp : normalized temperature in Celsius. The values are divided by 41 (max)
    - temp_feels : normalized feeling temperature in Celsius. The values are divided by 50 (max)
    - hum : normalized humidity. The values are divided by 100 (max)
    - windspeed : normalized wind speed. The values are divided by 67 (max)
    - cnt : count of total rental bikes, including both casual and registered

Our target variable, `y`, is _cnt_. For this example we will use a single attribute, _temp_, as our input feature `X`. You will use all attributes in one of your tasks.

### Subsample
As our dataset consists of more than 17000 rows, we will randomly subsample our dataset to select 1000 rows.

In [3]:
df_subsample = df.sample(1000)
df_subsample.head()

Unnamed: 0,instant,season,hr,holiday,weekday,workingday,weathersit,temp,temp_feels,hum,windspeed,cnt
1329,1330,1,16,0,1,1,3,0.42,0.4242,1.0,0.1343,42
17098,17099,4,5,0,4,1,2,0.3,0.3182,0.7,0.1045,35
9010,9011,1,7,1,1,0,1,0.1,0.1364,0.54,0.1045,33
6708,6709,4,2,0,3,1,2,0.56,0.5303,0.83,0.2836,2
11057,11058,2,21,0,2,1,1,0.44,0.4394,0.3,0.4478,206


### Train & Test Split
We will split the subsample into a 90% training set and a 10% test set by slicing the first 900 rows for training and using the rest for testing. As `fit` takes numpy arrays as input, we will use the _values_ attribute to convert each DataFrame column into a numpy array, and use double brackets to select the column so that the resulting arrays are two-dimensional.

In [None]:
# First 900 rows (positions 0..899) for training
train = df_subsample.iloc[:900]
train_x = train[['temp']].values
train_y = train[['cnt']].values

# Remaining 100 rows (positions 900..999) for testing
test = df_subsample.iloc[900:]
test_x = test[['temp']].values
test_y = test[['cnt']].values
print(type(train_x), type(train_y), type(test_x), type(test_y))
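
As a quick sanity check on slice boundaries, here is a small sketch using a plain Python list as a stand-in for the 1000 sampled rows (the list and the three feature values are illustrative only). Slices exclude the stop index, so `[:900]` and `[900:]` together cover every row exactly once. The last lines show what "two-dimensional" means for a single feature: one inner list per sample.

```python
rows = list(range(1000))   # stand-in for 1000 sampled rows

train = rows[:900]         # positions 0..899
test = rows[900:]          # positions 900..999

print(len(train), len(test))   # 900 100
print(train[-1], test[0])      # 899 900

# A single feature still needs a 2-D layout: [num_samples, num_features].
feature = [0.24, 0.22, 0.22]            # 1-D: 3 values
feature_2d = [[v] for v in feature]     # 2-D: 3 samples x 1 feature
print(feature_2d)                       # [[0.24], [0.22], [0.22]]
```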

### Fit
To fit our linear regression model, call `fit` on the training data. Note that `fit` expects a numpy array of shape [num_samples, num_features].

In [None]:
# Create linear regression object
regr = linear_model.LinearRegression()

# Train the model using the training sets
regr.fit(train_x, train_y)

Now that we have fit our linear regression model to the training data, the estimated model coefficients are stored in the _coef\__ attribute of our model.

In [None]:
# The coefficients
print('Coefficients: \n', regr.coef_)

### Predict
We will now use our trained linear regression model to make predictions on our test set. The model takes the temperature attribute, _temp_, of our test data and predicts the count of rental bikes, given by _cnt_.

In [None]:
predict_y = regr.predict(test_x)
# Show predicted and actual values side by side for comparison
np.column_stack((predict_y,test_y))

### Mean Squared Error
Looks like some of our model's predictions are not good. Now, let's measure the difference between our predicted and actual values by calculating the mean squared error. 

In [None]:
meansq_error = np.mean((predict_y - test_y) ** 2)
print ("Mean squared error: %.2f" % meansq_error)
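
The same quantity can be computed by hand. The sketch below uses made-up predicted and actual counts (not values from the bike sharing data) to show each step: take the difference, square it, and average. Because the mean squared error is in squared units, its square root (the RMSE) is often easier to read, since it is in the same units as _cnt_.

```python
# Made-up values for illustration only.
predicted = [150.0, 200.0, 90.0, 310.0]
actual = [100.0, 220.0, 90.0, 400.0]

# Squared difference for each test row: 2500, 400, 0, 8100
squared_errors = [(p - a) ** 2 for p, a in zip(predicted, actual)]

mse = sum(squared_errors) / len(squared_errors)
rmse = mse ** 0.5   # back in the units of the target

print("Mean squared error: %.2f" % mse)        # Mean squared error: 2750.00
print("Root mean squared error: %.2f" % rmse)  # Root mean squared error: 52.44
```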

As expected, our mean squared error is high, which means that a single feature, _temp_, does not predict _cnt_ well. Can we improve it? What if we use more training data? Or more features?

### Plot
We can also visualize the difference between our predictions and actual values by plotting.

In [None]:
plt.scatter(test_x, test_y,  color='black', linewidth=1)
plt.plot(test_x, predict_y, color='blue', linewidth=3)
plt.show()

## Decision Trees
Decision Trees (DTs) are a non-parametric supervised learning method used for classification and regression. The goal is to create a model that predicts the value of a target variable by learning simple decision rules inferred from the data features.
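
To illustrate what "learning simple decision rules" can look like, here is a minimal sketch of how one split might be chosen. It is not scikit-learn's implementation. It assumes the Gini impurity criterion (the default for `DecisionTreeClassifier`) and uses made-up rows and labels. Each candidate rule has the form "feature <= threshold", and the rule with the lowest weighted impurity wins.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 class labels."""
    if not labels:
        return 0.0
    p1 = sum(labels) / len(labels)
    return 1.0 - p1 ** 2 - (1.0 - p1) ** 2


def split_impurity(rows, labels, feature, threshold):
    """Weighted Gini impurity after splitting on rows[i][feature] <= threshold."""
    left = [y for r, y in zip(rows, labels) if r[feature] <= threshold]
    right = [y for r, y in zip(rows, labels) if r[feature] > threshold]
    n = len(labels)
    return len(left) / n * gini(left) + len(right) / n * gini(right)


# Made-up toy data: each row is [pclass, sex], with sex encoded as 1 or 2.
rows = [[1, 1], [3, 1], [2, 1], [1, 2], [3, 2], [2, 2]]
labels = [1, 1, 1, 0, 0, 1]

# Candidate rules as (feature index, threshold) pairs.
candidates = [(0, 1), (0, 2), (1, 1)]
best = min(candidates, key=lambda c: split_impurity(rows, labels, c[0], c[1]))
print(best)   # (1, 1): splitting on sex <= 1 gives the purest groups here
```

A full tree repeats this search inside each resulting group until the groups are pure or a stopping rule applies.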

We will go through an example of binary classification using decision trees on the Titanic survival dataset.

In [None]:
dt = pd.read_csv('http://data.cs1656.org/titanic.csv')
dt.head()

The attributes of the dataset are as follows:
    - survival        Survival
                    (0 = No; 1 = Yes)
    - pclass          Passenger Class
                    (1 = 1st; 2 = 2nd; 3 = 3rd)
    - name            Name
    - sex             Sex
    - age             Age
    - sibsp           Number of Siblings/Spouses Aboard
    - parch           Number of Parents/Children Aboard
    - embarked        Port of Embarkation
                    (C = Cherbourg; Q = Queenstown; S = Southampton)

Our target class variable is _Survived_, which records whether the passenger survived. We will use only a subset of attributes that take discrete values to build our decision tree.

To fit a decision tree model, we have to convert categorical values into numerical values. As the only categorical attribute we will use is _Sex_, we only need to convert that column, using the following commands.

In [None]:
dt['Sex'] = dt['Sex'].replace(['female', 'male'], [1, 2])
dt.head()
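
The conversion itself is just a lookup from category to code. As a plain-Python sketch of the same idea, with an illustrative list standing in for the _Sex_ column:

```python
# Same codes as used in this recitation: female -> 1, male -> 2.
sex_codes = {'female': 1, 'male': 2}

sexes = ['male', 'female', 'female', 'male']   # illustrative values only
encoded = [sex_codes[s] for s in sexes]
print(encoded)   # [2, 1, 1, 2]
```

The particular integers are arbitrary labels. With only two categories, any two distinct codes lead to the same split.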

### Train & Test Split
We will split our data into train and test set using the first 800 rows for training and the rest for testing.

In [None]:
dt_train_x = dt.iloc[:800][['Pclass','Sex','SibSp']].values
dt_train_y = dt.iloc[:800][['Survived']].values

# The test set starts at position 800, so no row is skipped
dt_test_x = dt.iloc[800:][['Pclass','Sex','SibSp']].values
dt_test_y = dt.iloc[800:][['Survived']].values

### Fit
We will now fit our decision tree model onto the training set.

In [None]:
clf = tree.DecisionTreeClassifier()
clf = clf.fit(dt_train_x, dt_train_y)

### Predict

In [None]:
dt_predict_y = clf.predict(dt_test_x)
## comparing predicted and actual values
np.column_stack((dt_predict_y,dt_test_y))

### Accuracy
We can measure the accuracy of our prediction by using the following commands.

In [None]:
accuracy = metrics.accuracy_score(dt_test_y,dt_predict_y)
accuracy
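
Accuracy is simply the fraction of test rows where the predicted class matches the actual class. A minimal sketch with made-up labels (not taken from the Titanic data):

```python
# Made-up labels for illustration only.
predicted = [1, 0, 0, 1, 0]
actual = [1, 0, 1, 1, 1]

# Count the positions where prediction and truth agree.
correct = sum(1 for p, a in zip(predicted, actual) if p == a)
accuracy = correct / len(actual)
print(accuracy)   # 0.6, since 3 of 5 predictions match
```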

## Tasks

**Task 1**

Do linear regression over a sample of 1000 rows of bike share counts, _cnt_, using _weekday_ as the input feature. Calculate the mean squared error, using the first 900 rows for training and the rest for testing. Return the mean squared error.


In [None]:
import pandas as pd
import numpy as np
from sklearn import linear_model, tree, metrics
import matplotlib.pyplot as plt
%matplotlib inline

df = pd.read_csv('http://data.cs1656.org/bike_share.csv')
df_subsample = df.sample(1000)
df_subsample.head()

train = df_subsample.iloc[0:900]
train_x = train[['weekday']].values
train_y = train[['cnt']].values
test = df_subsample.iloc[900:]
test_x = test[['weekday']].values
test_y = test[['cnt']].values
print(type(train_x), type(train_y), type(test_x), type(test_y))

# Create linear regression object
regr = linear_model.LinearRegression()
# Train the model using the training sets
regr.fit(train_x, train_y)

# The coefficients
print('Coefficients: \n', regr.coef_)

predict_y = regr.predict(test_x)
# Print predicted and actual values side by side for comparison
print(np.column_stack((predict_y, test_y)))

meansq_error = np.mean((predict_y - test_y)**2)
print("Mean squared error: %.2f" % meansq_error)


**Task 2.1**

Repeat Task 1 using all attributes except _instant_ (a scatter plot is not required in this task). Is the mean squared error higher or lower? Is it better to use all attributes?

In [None]:
import pandas as pd
import numpy as np
from sklearn import linear_model, tree, metrics
import matplotlib.pyplot as plt
%matplotlib inline

df = pd.read_csv('http://data.cs1656.org/bike_share.csv')
df_subsample = df.sample(1000)
df_subsample.head()

train = df_subsample.iloc[0:900]
train_x = train[[
   'season', 'hr', 'holiday', 'weekday', 'workingday', 'weathersit',
   'temp', 'temp_feels', 'hum', 'windspeed'
]].values
train_y = train[['cnt']].values
test = df_subsample.iloc[900:]
test_x = test[[
   'season', 'hr', 'holiday', 'weekday', 'workingday', 'weathersit',
   'temp', 'temp_feels', 'hum', 'windspeed'
]].values
test_y = test[['cnt']].values
print(type(train_x), type(train_y), type(test_x), type(test_y))

# Create linear regression object
regr = linear_model.LinearRegression()
# Train the model using the training sets
regr.fit(train_x, train_y)

# The coefficients
print('Coefficients: \n', regr.coef_)

predict_y = regr.predict(test_x)
# Print predicted and actual values side by side for comparison
print(np.column_stack((predict_y, test_y)))

meansq_error = np.mean((predict_y - test_y)**2)
print("Mean squared error: %.2f" % meansq_error)

**Task 2.2**

Comparing the results of Task 1 and Task 2.1, is it better to use all attributes? Why?

The mean squared error is lower with all attributes. Using all attributes is better here because the model gets more information about each record than _weekday_ alone provides, so its predictions on the test set are closer to the actual counts.

**Task 3**

You will use bank-data.csv as input for this task. Use decision trees to do binary classification of _mortgage_ (yes/no) using the _region_, _sex_, and _married_ attributes as input features. Use the first 500 rows for training and the rest for testing. Measure the accuracy of your classification. Return the accuracy.

In [None]:
import pandas as pd
import numpy as np
from sklearn import linear_model, tree, metrics
import matplotlib.pyplot as plt
%matplotlib inline

dt = pd.read_csv('http://data.cs1656.org/bank-data.csv')
print(dt.head())
print(list(dt.columns))
# Convert the categorical bank-data columns into numerical values
dt['sex'] = dt['sex'].replace(['FEMALE', 'MALE'], [1, 2])
dt['married'] = dt['married'].replace(['YES', 'NO'], [1, 0])
dt['mortgage'] = dt['mortgage'].replace(['YES', 'NO'], [1, 0])
dt['region'] = dt['region'].replace(
   ['INNER_CITY', 'TOWN', 'RURAL', 'SUBURBAN'], [1, 2, 3, 4])

dt_train_x = dt.iloc[:500][['region', 'sex', 'married']].values
dt_train_y = dt.iloc[:500][['mortgage']].values
dt_test_x = dt.iloc[500:][['region', 'sex', 'married']].values
dt_test_y = dt.iloc[500:][['mortgage']].values

clf = tree.DecisionTreeClassifier()
clf = clf.fit(dt_train_x, dt_train_y)

dt_predict_y = clf.predict(dt_test_x)
# Compare predicted and actual values side by side
print(np.column_stack((dt_predict_y, dt_test_y)))
accuracy = metrics.accuracy_score(dt_test_y, dt_predict_y)
print(accuracy)
