# Scikit-Learn Classification

- Pandas Documentation: http://pandas.pydata.org/
- Scikit Learn Documentation: http://scikit-learn.org/stable/documentation.html
- Seaborn Documentation: http://seaborn.pydata.org/


In [2]:
import pandas as pd
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plt

## 1. Read data from Files

In [3]:
df = pd.read_csv('../data/geoloc_elev.csv')

## 2. Quick Look at the data

In [4]:
type(df)

pandas.core.frame.DataFrame

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1500 entries, 0 to 1499
Data columns (total 5 columns):
lat       1500 non-null float64
lon       1500 non-null float64
elev      1500 non-null float64
source    1500 non-null object
target    1500 non-null int64
dtypes: float64(3), int64(1), object(1)
memory usage: 58.7+ KB


In [6]:
df.head()

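|   | lat | lon | elev | source | target |
|---|-----|-----|------|--------|--------|
| 0 | 0.106264 | 0.068264 | 0.542477 | S | 1 |
| 1 | 0.099569 | 0.132094 | 0.722289 | C | 1 |
| 2 | -0.775751 | -0.814161 | 0.21476 | S | 0 |
| 3 | -0.159833 | 0.040773 | 0.478576 | S | 1 |
| 4 | -0.096395 | 0.02142 | 0.270322 | C | 1 |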


In [7]:
df.tail()

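|   | lat | lon | elev | source | target |
|---|-----|-----|------|--------|--------|
| 1495 | 1.371969 | -0.051412 | 0.340901 | C | 0 |
| 1496 | 1.163256 | -0.024625 | 0.001898 | S | 0 |
| 1497 | 1.347938 | 0.020778 | 0.608316 | Q | 0 |
| 1498 | 1.26606 | -0.016751 | 1.674323 | Q | 0 |
| 1499 | 1.105951 | -0.076857 | 0.018016 | S | 0 |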


In [8]:
df.describe()

|   | lat | lon | elev | target |
|---|-----|-----|------|--------|
| count | 1500.0 | 1500.0 | 1500.0 | 1500.0 |
| mean | -0.002624 | -0.002507 | 0.789127 | 0.333333 |
| std | 0.690768 | 0.687576 | 0.610569 | 0.471562 |
| min | -1.680394 | -1.672896 | 0.000198 | 0.0 |
| 25% | -0.441686 | -0.43004 | 0.29582 | 0.0 |
| 50% | 0.004606 | -0.003988 | 0.65002 | 0.0 |
| 75% | 0.416209 | 0.4307 | 1.130405 | 1.0 |
| max | 1.668836 | 1.628833 | 3.25398 | 1.0 |


In [9]:
df['source'].value_counts()

C    555
Q    476
S    469
Name: source, dtype: int64

In [10]:
df['target'].value_counts()

0    1000
1     500
Name: target, dtype: int64

## 3. Visual exploration

In [11]:
import seaborn as sns

In [None]:
sns.pairplot(df, hue='target')


<seaborn.axisgrid.PairGrid at 0x11e93d710>

## 4. Define target

In [None]:
y = df['target']
y.head()

## 5. Feature engineering

In [None]:
raw_features = df.drop('target', axis='columns')
raw_features.head()

### 1-hot encoding

In [None]:
X = pd.get_dummies(raw_features)
X.head()
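
To see what `pd.get_dummies` does, here is a minimal sketch on a tiny hand-made frame. The `toy` values are made up for illustration and are not rows from `geoloc_elev.csv`. Numeric columns pass through unchanged, while each distinct value of a text column becomes its own indicator column:

```python
import pandas as pd

# Hypothetical toy data: one numeric column and one categorical column.
toy = pd.DataFrame({
    'elev': [0.5, 0.7, 0.2],
    'source': ['S', 'C', 'S'],
})

# Text columns are expanded into one indicator column per category;
# numeric columns are left untouched.
toy_encoded = pd.get_dummies(toy)
print(toy_encoded.columns.tolist())
```

For this toy frame the resulting columns are `elev`, `source_C` and `source_S`.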

## 6. Train/Test split

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)
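
The target is imbalanced (1000 zeros versus 500 ones), so one optional variation, not used in the rest of this notebook, is to pass `stratify=y` so that both splits keep roughly the same class ratio. A minimal sketch on synthetic stand-in data (the `demo_` names and values are assumptions, not the real dataset):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the same 2:1 class ratio as the target column.
rng = np.random.RandomState(0)
demo_X = pd.DataFrame(rng.normal(size=(1500, 2)), columns=['lat', 'lon'])
demo_y = pd.Series([0] * 1000 + [1] * 500, name='target')

# stratify=demo_y keeps the class proportions the same in both splits.
demo_X_train, demo_X_test, demo_y_train, demo_y_test = train_test_split(
    demo_X, demo_y, test_size=0.3, random_state=0, stratify=demo_y)

# Fraction of positive examples in each split.
print(demo_y_train.mean(), demo_y_test.mean())
```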

## 7. Fit a Decision Tree model

In [None]:
from sklearn.tree import DecisionTreeClassifier

model = DecisionTreeClassifier(max_depth=3, random_state=0)
model.fit(X_train, y_train)

## 8. Accuracy score on benchmark, train and test sets
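
One way to compute the accuracy scores named in this heading is sketched below. The benchmark is assumed to be a classifier that always predicts the most frequent training class. `accuracy_summary` and the `demo_` data are hypothetical names introduced for illustration, not part of the original notebook.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier


def accuracy_summary(model, X_train, X_test, y_train, y_test):
    """Return benchmark, train and test accuracy as a pandas Series."""
    # Assumed benchmark: always predict the most frequent training class.
    majority_class = pd.Series(y_train).mode()[0]
    benchmark_pred = np.full(len(y_test), majority_class)
    return pd.Series({
        'benchmark': accuracy_score(y_test, benchmark_pred),
        'train': accuracy_score(y_train, model.predict(X_train)),
        'test': accuracy_score(y_test, model.predict(X_test)),
    })


# Synthetic stand-in data (not geoloc_elev.csv): class 1 lies inside a circle.
rng = np.random.RandomState(0)
demo_X = pd.DataFrame(rng.uniform(-2, 2, size=(600, 2)), columns=['lat', 'lon'])
demo_y = ((demo_X['lat'] ** 2 + demo_X['lon'] ** 2) < 1.5).astype(int)

demo_X_train, demo_X_test, demo_y_train, demo_y_test = train_test_split(
    demo_X, demo_y, test_size=0.3, random_state=0)

demo_model = DecisionTreeClassifier(max_depth=3, random_state=0)
demo_model.fit(demo_X_train, demo_y_train)

demo_scores = accuracy_summary(
    demo_model, demo_X_train, demo_X_test, demo_y_train, demo_y_test)
print(demo_scores)
```

In the notebook, the same helper could be called as `accuracy_summary(model, X_train, X_test, y_train, y_test)`. A train score far above the test score suggests overfitting, and a test score close to the benchmark suggests the model has learned little.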

## 9. Confusion matrix and classification report

In [None]:
from sklearn.metrics import confusion_matrix, classification_report

y_pred = model.predict(X_test)

In [None]:
cm = confusion_matrix(y_test, y_pred)

pd.DataFrame(cm,
             index=["Miss", "Hit"],
             columns=['pred_Miss', 'pred_Hit'])

In [None]:
print(classification_report(y_test, y_pred))

## 10. Feature Importances

In [None]:
importances = pd.Series(model.feature_importances_, index=X.columns)
importances.plot(kind='barh')

## 11. Display the decision boundary

In [None]:
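# Build a 101 x 101 grid covering the lat/lon plane.
hticks = np.linspace(-2, 2, 101)
vticks = np.linspace(-2, 2, 101)
aa, bb = np.meshgrid(hticks, vticks)

# Pad the remaining feature columns (elev and the one-hot source columns) with zeros.
not_important = np.zeros((aa.size, X.shape[1] - 2))
ab = pd.DataFrame(np.c_[aa.ravel(), bb.ravel(), not_important], columns=X.columns)

# Predict the class at every grid point and reshape back to the grid.
c = model.predict(ab)
cc = c.reshape(aa.shape)

# Overlay the predicted regions on a scatter plot of the data.
ax = df.plot(kind='scatter', c='target', x='lat', y='lon', cmap='bwr')
ax.contourf(aa, bb, cc, cmap='bwr', alpha=0.2)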

## Exercise 


Now that you have a basic pipeline, iterate on the decision tree model and try to improve the score. Some things to try:

1. Change some of the initialization parameters of the decision tree and re-run the code.
    - Does the score change?
    - Does the decision boundary change?
2. Try another model, such as Logistic Regression, Random Forest, SVM, or Naive Bayes, or any other model you like from [here](http://scikit-learn.org/stable/auto_examples/classification/plot_classifier_comparison.html).
3. What's the highest score you can get?
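
As a possible starting point for the exercise, the loop below fits several classifiers on the same split and collects their test accuracy. It is a sketch under assumptions: the data is a synthetic stand-in (not `geoloc_elev.csv`), and the `candidates` dictionary and hyperparameter values are arbitrary choices for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data: class 1 lies inside a circle around the origin.
rng = np.random.RandomState(0)
demo_X = pd.DataFrame(rng.uniform(-2, 2, size=(600, 2)), columns=['lat', 'lon'])
demo_y = ((demo_X['lat'] ** 2 + demo_X['lon'] ** 2) < 1.5).astype(int)

demo_X_train, demo_X_test, demo_y_train, demo_y_test = train_test_split(
    demo_X, demo_y, test_size=0.3, random_state=0)

# Hypothetical set of models to compare; swap in any classifier you like.
candidates = {
    'tree_depth3': DecisionTreeClassifier(max_depth=3, random_state=0),
    'tree_depth6': DecisionTreeClassifier(max_depth=6, random_state=0),
    'logistic': LogisticRegression(),
    'random_forest': RandomForestClassifier(n_estimators=100, random_state=0),
    'svm_rbf': SVC(),
    'naive_bayes': GaussianNB(),
}

# Fit each model on the training split and record its test accuracy.
demo_results = {}
for name, candidate in candidates.items():
    candidate.fit(demo_X_train, demo_y_train)
    demo_results[name] = candidate.score(demo_X_test, demo_y_test)

demo_results = pd.Series(demo_results).sort_values(ascending=False)
print(demo_results)
```

To run this against the notebook's data, replace the `demo_` splits with `X_train`, `X_test`, `y_train` and `y_test`.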