# 02. Choosing the right estimator / algorithm for our problem

- Classification - predicting whether a sample is one thing or another

- Regression - predicting a continuous value
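
To make the distinction concrete, here is a minimal sketch using scikit-learn's `type_of_target` helper, which inspects a label array and reports the kind of task it implies. Both label arrays are made up for illustration.

```python
from sklearn.utils.multiclass import type_of_target

# Hypothetical class labels (e.g. 1 = disease, 0 = no disease)
class_labels = [0, 1, 1, 0, 1]
# Hypothetical real-valued labels (e.g. house prices)
price_labels = [24.0, 21.6, 34.7, 33.4, 36.2]

class_kind = type_of_target(class_labels)
price_kind = type_of_target(price_labels)

print(class_kind)  # binary -> a classification problem
print(price_kind)  # continuous -> a regression problem
```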


[Choosing the right estimator map](https://scikit-learn.org/stable/tutorial/machine_learning_map/index.html)

#### Tips:

1. If you have **structured data, use ensemble methods**.
2. If you have **unstructured data, use deep learning or transfer learning**.

# 2.1) Regression Problem

In [2]:
import pandas as pd
import numpy as np

![boston](boston_data_set.png)

In [4]:
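# import the Boston housing dataset
# note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2,
# so this cell needs an older scikit-learn version
from sklearn.datasets import load_boston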

boston = load_boston()

In [5]:
# boston

In [6]:
boston_df = pd.DataFrame(boston['data'], columns=boston['feature_names'])

In [7]:
# add the target array as a new column
boston_df['target'] = pd.Series(boston['target'])

In [8]:
boston_df.head()

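|  | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.00632 | 18.0 | 2.31 | 0.0 | 0.538 | 6.575 | 65.2 | 4.09 | 1.0 | 296.0 | 15.3 | 396.9 | 4.98 | 24.0 |
| 1 | 0.02731 | 0.0 | 7.07 | 0.0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2.0 | 242.0 | 17.8 | 396.9 | 9.14 | 21.6 |
| 2 | 0.02729 | 0.0 | 7.07 | 0.0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2.0 | 242.0 | 17.8 | 392.83 | 4.03 | 34.7 |
| 3 | 0.03237 | 0.0 | 2.18 | 0.0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3.0 | 222.0 | 18.7 | 394.63 | 2.94 | 33.4 |
| 4 | 0.06905 | 0.0 | 2.18 | 0.0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3.0 | 222.0 | 18.7 | 396.9 | 5.33 | 36.2 |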


In [9]:
boston_df.isnull().sum()

CRIM       0
ZN         0
INDUS      0
CHAS       0
NOX        0
RM         0
AGE        0
DIS        0
RAD        0
TAX        0
PTRATIO    0
B          0
LSTAT      0
target     0
dtype: int64

## Ridge Model

In [14]:
from sklearn.linear_model import Ridge

# Features and labels
X = boston_df.drop('target', axis=1)
y = boston_df['target']

In [15]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [16]:
# There is no missing data and every column is already numeric, so we can train the model directly
ridge_model = Ridge(alpha=1.0)
ridge_model.fit(X_train, y_train)

Ridge()

In [17]:
# check the score (R^2 for regressors) on the test data
ridge_model.score(X_test, y_test)

0.6662221670168521
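
For regressors, `score` returns the coefficient of determination, `R^2 = 1 - SS_res / SS_tot`, not an accuracy. The sketch below recomputes it by hand and compares it with `score`. It uses synthetic data from `make_regression` as a stand-in (an assumption, not the Boston data above), so its numbers are unrelated to the result shown.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; NOT the Boston dataset used above
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

demo_model = Ridge(alpha=1.0).fit(X_tr, y_tr)
y_pred = demo_model.predict(X_te)

# R^2 = 1 - (residual sum of squares) / (total sum of squares)
ss_res = np.sum((y_te - y_pred) ** 2)
ss_tot = np.sum((y_te - y_te.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

r2_from_score = demo_model.score(X_te, y_te)
print(r2_manual, r2_from_score)  # the two values should agree
```

A score of 1.0 would mean perfect predictions, and 0.0 is what a model that always predicts the mean of `y_te` would get.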

--------------

Let's refer back to the map and try another algorithm:
[Choosing the right estimator map](https://scikit-learn.org/stable/tutorial/machine_learning_map/index.html)

## Random Forest

In [1]:
from sklearn.ensemble import RandomForestRegressor

In [18]:
# Features and labels
X = boston_df.drop('target', axis=1)
y = boston_df['target']

# split data
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# train model
rf_model = RandomForestRegressor(n_estimators=100)
rf_model.fit(X_train, y_train)

RandomForestRegressor()

In [19]:
rf_model.score(X_test, y_test)

0.8858981731631403

On this test split the random forest model reaches an R^2 of about 0.89, compared with about 0.67 for the ridge model, so it appears to fit this dataset better.
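
A single train/test split can favor one model by chance. One way to make the comparison more robust is k-fold cross-validation. The sketch below uses `cross_val_score` on synthetic data (an assumption, since `load_boston` may not exist in your scikit-learn version), so its numbers say nothing about the results above; only the procedure carries over.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data; NOT the Boston dataset used above
X_cv, y_cv = make_regression(n_samples=200, n_features=8, noise=15.0, random_state=0)

candidates = {
    "ridge": Ridge(alpha=1.0),
    "random_forest": RandomForestRegressor(n_estimators=50, random_state=0),
}

cv_results = {}
for name, model in candidates.items():
    # 5-fold cross-validation; for regressors the default metric is R^2
    scores = cross_val_score(model, X_cv, y_cv, cv=5)
    cv_results[name] = scores
    print(f"{name}: mean R^2 = {scores.mean():.3f} (+/- {scores.std():.3f})")
```

Because `make_regression` produces a linear relationship, ridge will usually come out ahead here; on real data the ranking can differ.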

----------

# 2.2) Classification Problem

Check the path on the map: [Choosing the right estimator map](https://scikit-learn.org/stable/tutorial/machine_learning_map/index.html)

In [6]:
heart_disease = pd.read_csv('../00.datasets/heart-diseases.csv')
heart_disease.head()

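|  | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 63 | 1 | 3 | 145 | 233 | 1 | 0 | 150 | 0 | 2.3 | 0 | 0 | 1 | 1 |
| 1 | 37 | 1 | 2 | 130 | 250 | 0 | 1 | 187 | 0 | 3.5 | 0 | 0 | 2 | 1 |
| 2 | 41 | 0 | 1 | 130 | 204 | 0 | 0 | 172 | 0 | 1.4 | 2 | 0 | 2 | 1 |
| 3 | 56 | 1 | 1 | 120 | 236 | 0 | 1 | 178 | 0 | 0.8 | 2 | 0 | 2 | 1 |
| 4 | 57 | 0 | 0 | 120 | 354 | 0 | 1 | 163 | 1 | 0.6 | 2 | 0 | 2 | 1 |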


In [7]:
len(heart_disease)

303

## Linear SVC

In [8]:
from sklearn.svm import LinearSVC

In [9]:
X = heart_disease.drop('target', axis=1)
y = heart_disease['target']

In [11]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [19]:
# very large max_iter, presumably so the solver can converge on the unscaled features
linear_svc = LinearSVC(max_iter=10000000)
linear_svc.fit(X_train, y_train)

LinearSVC(max_iter=10000000)

In [20]:
# checking accuracy
linear_svc.score(X_test, y_test)

0.8688524590163934
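
The very large `max_iter` suggests the solver struggled to converge on the raw, unscaled features. A common remedy is to standardize the features first. The sketch below wraps `StandardScaler` and `LinearSVC` in a pipeline, using synthetic data from `make_classification` as a stand-in for the heart-disease CSV (an assumption, so the accuracy it prints is not comparable to the one above).

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Synthetic stand-in with the same shape as the heart-disease features (303 x 13)
X_clf, y_clf = make_classification(
    n_samples=303, n_features=13, n_informative=6, random_state=42
)
Xc_tr, Xc_te, yc_tr, yc_te = train_test_split(
    X_clf, y_clf, test_size=0.2, random_state=42
)

# The scaler is fit on the training data only, then applied to the test data
scaled_svc = make_pipeline(StandardScaler(), LinearSVC(max_iter=10000))
scaled_svc.fit(Xc_tr, yc_tr)

svc_accuracy = scaled_svc.score(Xc_te, yc_te)
print(svc_accuracy)
```

Putting the scaler inside the pipeline keeps test-set statistics out of training, which avoids a subtle form of data leakage.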

## Random Forest Classifier

In [21]:
from sklearn.ensemble import RandomForestClassifier

In [27]:
rfc = RandomForestClassifier(n_estimators=100)
rfc.fit(X_train, y_train)

rfc.score(X_test, y_test)

0.8524590163934426
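
The two accuracies are closer than they look. As a rough check, assuming `train_test_split` rounds the test size up (so 20% of 303 rows gives 61 test rows), the scores above correspond to 53 and 52 correct predictions, a difference of a single sample:

```python
import math

n_samples = 303                          # len(heart_disease)
n_test = math.ceil(0.2 * n_samples)      # assumed test-set size for test_size=0.2

# Convert the reported accuracies back into counts of correct predictions
svc_correct = round(0.8688524590163934 * n_test)
rfc_correct = round(0.8524590163934426 * n_test)

print(n_test)                     # 61
print(svc_correct, rfc_correct)   # 53 52
```

With a gap this small, and with no `random_state` set on the random forest, the ranking of the two models could easily flip on another run or another split.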