# Machine Learning Seminar

**Supervised Learning**

The dataset contains known labels or results to be predicted, i.e., the objective to be inferred is known. Examples include classifying email as spam or non-spam, or deciding whether a transaction is fraudulent.

The training procedure consists of showing samples to the model, which emits a prediction for each one. If the answer is wrong, the model is corrected by updating its parameters. Training continues until some stopping criterion is met, such as reaching a certain performance level or exhausting the available resources.
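
The loop described above can be sketched with a perceptron, one of the simplest models that learns this way. This is only an illustration on made-up, linearly separable data; the data, the update rule, and the names `w`, `b`, and `max_epochs` are assumptions chosen for the example, not part of the seminar material.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy, linearly separable data: the label is 1 when x0 + x1 > 0, else 0.
# Points too close to the boundary are dropped to leave a clear margin.
X = rng.normal(size=(300, 2))
X = X[np.abs(X[:, 0] + X[:, 1]) > 0.3]
y = (X[:, 0] + X[:, 1] > 0).astype(int)

w = np.zeros(2)    # model parameters
b = 0.0
max_epochs = 1000  # stopping criterion 1: limited resources

for epoch in range(max_epochs):
    errors = 0
    for xi, yi in zip(X, y):
        pred = int(xi @ w + b > 0)   # emit a prediction
        if pred != yi:               # wrong answer: correct the model
            w += (yi - pred) * xi
            b += (yi - pred)
            errors += 1
    if errors == 0:                  # stopping criterion 2: target performance reached
        break

accuracy = np.mean((X @ w + b > 0).astype(int) == y)
print("epochs used:", epoch + 1, "training accuracy:", accuracy)
```

The scikit-learn models used later hide this loop behind `fit`, but the idea of predicting, comparing, and updating is the same.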

Models in this category can be divided into two groups:

*   Classification: The labels of the dataset form a categorical set.
*   Regression: The target is a continuous value.


**Unsupervised Learning**

The dataset does not contain labels to predict. The objective is to identify underlying relations in the data, for example by grouping samples based on similarity.

The main examples are clustering and dimensionality reduction.
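
As a small illustration of dimensionality reduction, the sketch below projects the four Iris features onto two principal components. PCA is used here as one common choice, not necessarily the method intended for the seminar, and the labels are never passed to the model.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris().data            # 150 samples, 4 features; labels are not used

pca = PCA(n_components=2)       # keep the 2 directions of largest variance
X_2d = pca.fit_transform(X)

print(X_2d.shape)
print("variance kept:", pca.explained_variance_ratio_.sum())
```

The reduced array can be plotted directly, which is often the main reason to reduce to two dimensions.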


**Reinforcement Learning**

## Import packages

In [0]:
%matplotlib inline

import io
import time
import itertools
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.svm import SVC
from sklearn.cluster import KMeans
from sklearn.neighbors import NearestNeighbors
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV 
from sklearn.linear_model import LogisticRegression
from sklearn import datasets, metrics, preprocessing

from keras.models import Sequential
from keras.optimizers import rmsprop
from keras.layers import Dense, Activation

## Load data

In [0]:
tr_split = 0.7

# Load data
titanic = pd.read_excel('titanic.xls')

# Calculate train/test size
rows, cols = titanic.shape
tr_size = int(rows*tr_split)
te_size = rows - tr_size

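# Split dataset (positional split: if the file is sorted, e.g. by class,
# shuffle first with titanic.sample(frac=1, random_state=0))
# .copy() avoids SettingWithCopyWarning when columns are modified later
train_df = titanic.iloc[:tr_size].copy()
test_df = titanic.iloc[tr_size:].copy()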

## Data analysis and visualization



*   What is our target?
*   What features are categorical?
*   What features are numerical?





In [0]:
train_df.head(n=3)

Some features might be useful, but if they contain null values or other errors, it might be a good idea to curate them or, in some cases, remove them.

In [0]:
train_df.info()

In [0]:
# train_df.Age.value_counts()

We want to predict who will survive, so let's start by analyzing the target feature (`survived`).

In [0]:
display(train_df.survived.unique())
print('*'*30)
display(train_df.survived.value_counts())

The target class is slightly unbalanced, but that should not be an issue here. If the classes were split 95%/5%, we would need to rebalance them before training. There are several alternatives:

* Downsample the overrepresented class
* Oversample the underrepresented class
* Modify the class weights of the classifier
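
The three alternatives can be sketched on a hypothetical 95%/5% label vector (made up for illustration, not the Titanic data). The weight formula is the heuristic scikit-learn applies when a classifier is given `class_weight='balanced'`.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 95%/5% target vector
y = np.array([0] * 95 + [1] * 5)
classes, counts = np.unique(y, return_counts=True)

maj_idx = np.flatnonzero(y == 0)
min_idx = np.flatnonzero(y == 1)

# Option 1: downsample the overrepresented class to the minority size
down_idx = np.concatenate(
    [rng.choice(maj_idx, size=len(min_idx), replace=False), min_idx])

# Option 2: oversample the underrepresented class (sampling with replacement)
over_idx = np.concatenate(
    [maj_idx, rng.choice(min_idx, size=len(maj_idx), replace=True)])

# Option 3: weights inversely proportional to the class frequency
weights = len(y) / (len(classes) * counts)

print(np.bincount(y[down_idx]))   # class counts after downsampling
print(np.bincount(y[over_idx]))   # class counts after oversampling
print(dict(zip(classes.tolist(), weights.tolist())))
```

The index arrays would be used to select rows from both the features and the target, so the two stay aligned.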

Pandas lets us select, group, or filter data with a syntax similar to SQL. Let's analyze the survival probabilities by class and/or sex.

Note: *Group by* is only meaningful for discrete features; a continuous feature would produce roughly one group per distinct value.
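
A continuous feature can still be used in a group-by once it has been discretized. The sketch below uses a small hypothetical sample with Titanic-like column names; the bin edges and the group labels are arbitrary choices made for the example.

```python
import pandas as pd

# Small hypothetical sample with the same column names as the Titanic data
df = pd.DataFrame({
    'age':      [4, 9, 22, 25, 31, 38, 47, 54, 63, 71],
    'survived': [1, 1, 0, 1, 0, 1, 0, 0, 1, 0],
})

# Discretize the continuous feature into labelled bins: (0, 18], (18, 40], (40, 100]
df['age_group'] = pd.cut(df['age'], bins=[0, 18, 40, 100],
                         labels=['child', 'adult', 'senior'])

# Group by the discretized feature and average the target
rate_by_group = df.groupby('age_group', observed=True)['survived'].mean()
print(rate_by_group)
```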

**What is the average probability of survival based on class?**

In [0]:
train_df[['pclass', 'survived']].groupby(['pclass']).mean()

**What is the average probability of survival based on class AND sex?**

In [0]:
train_df[['pclass', 'survived', 'sex']].groupby(['pclass', 'sex']).mean()

In [0]:
g = sns.FacetGrid(train_df, col='survived')
_ = g.map(plt.hist, 'age', bins=20)

In [0]:
grid = sns.FacetGrid(train_df, col='survived')
_ = grid.map(sns.barplot, 'sex', 'fare', alpha=.5, ci=None)

In [0]:
for i in train_df.survived.unique():
    sns.distplot(train_df['fare'][train_df.survived==i], kde=1,label='{}'.format(i))

_ = plt.legend()

Before feeding the training data to the classifier, it is good practice to study the correlation between features. If two features are very highly correlated, we might be able to remove one of them. Furthermore, if some features are highly correlated with our target class, presumably those will have a higher impact on the final decision.

In [0]:
# Correlation is only defined for numeric columns
corr = train_df.select_dtypes(include='number').corr()
sns.heatmap(corr, cmap="YlGnBu", annot=False)

# For values, put annot=True
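
One possible way to act on a correlation matrix is to drop one feature from every highly correlated pair. The sketch below uses synthetic columns `a`, `b`, and `c` and an assumed cut-off of 0.95; both are illustrative choices, not values taken from the Titanic data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical numeric features: 'b' is almost a copy of 'a', 'c' is independent
a = rng.normal(size=200)
df = pd.DataFrame({
    'a': a,
    'b': a + rng.normal(scale=0.01, size=200),
    'c': rng.normal(size=200),
})

threshold = 0.95  # assumed cut-off
corr = df.corr().abs()

# Keep only the upper triangle so each pair is considered once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]

reduced = df.drop(columns=to_drop)
print("dropped:", to_drop, "remaining:", reduced.columns.tolist())
```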

We are about to split our features and target into different arrays. To decide which features will be used for training, we need to check whether each one has complete information. If null values are present, we need to decide whether to fill those values or remove the entire sample.

In [0]:
train_df.isna().sum()
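
The two options, removing samples or filling values, can be compared on a tiny hypothetical frame. Filling with the column median is just one common choice; the values below are made up.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'age':  [22.0, np.nan, 35.0, 58.0, np.nan],
    'fare': [7.25, 71.28, np.nan, 26.55, 8.05],
})

# Option 1: remove every sample that has a missing value
dropped = df.dropna()

# Option 2: fill missing values, here with the median of each column
filled = df.fillna(df.median())

print("rows kept after dropna:", len(dropped))
print("nulls left after fillna:", int(filled.isna().sum().sum()))
```

When the data is split into train and test sets, the fill values would normally be computed on the training set only and then reused for the test set.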

## Dataset preprocessing

Many algorithms do not accept text features, so let's convert the categories to discrete numerical values.

In [0]:
train_df.loc[:, 'sex'] = train_df.sex.map({'male': 1, 'female': 0})
train_df.loc[:, 'embarked'] = train_df.embarked.map({'S': 0, 'C': 1, 'Q': 2})

test_df.loc[:, 'sex'] = test_df.sex.map({'male': 1, 'female': 0})
test_df.loc[:, 'embarked'] = test_df.embarked.map({'S': 0, 'C': 1, 'Q': 2})
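
Mapping a category with no natural order (such as the port of embarkation) to 0/1/2 suggests an ordering that some models may pick up. A possible alternative, shown here on a hypothetical four-row frame, is one-hot encoding with `pd.get_dummies`.

```python
import pandas as pd

df = pd.DataFrame({'embarked': ['S', 'C', 'Q', 'S']})

# One indicator column per category instead of a single 0/1/2 code
encoded = pd.get_dummies(df, columns=['embarked'], dtype=int)

print(encoded.columns.tolist())
print(encoded)
```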

In [0]:
train_df = train_df.dropna(subset=['embarked', 'fare'])
test_df = test_df.dropna(subset=['embarked', 'fare'])

To facilitate our task, let's only keep those features without null values. 

In [0]:
x_train = train_df[['pclass', 'sex', 'sibsp','parch', 'fare', 'embarked']]
y_train = train_df['survived']

x_test = test_df[['pclass', 'sex', 'sibsp','parch', 'fare', 'embarked']]
y_test = test_df['survived']

In [0]:
#scaler = preprocessing.StandardScaler()
#scaler.fit(x_train[['fare']])

#x_train.loc[:, 'fare'] = scaler.transform(x_train[['fare']]).ravel()
#x_test.loc[:, 'fare'] = scaler.transform(x_test[['fare']]).ravel()

In [0]:
# scaler.fit(x_train)

## Supervised Learning

In [0]:
def plot_confusion_matrix(cm, classes,
                          normalize=False,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues):
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
        print("Normalized confusion matrix")
    else:
        print('Confusion matrix, without normalization')
        
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.grid(None)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)

    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], fmt),
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")

    plt.ylabel('True label')
    plt.xlabel('Predicted label')
    plt.tight_layout()


### Logistic Regression 

In [0]:
lr = LogisticRegression()
lr.fit(x_train, y_train)

y_pred = lr.predict(x_test)

In [0]:
cm = metrics.confusion_matrix(y_test, y_pred)
accuracy = metrics.accuracy_score(y_test, y_pred)

print("Accuracy %.2f" % (accuracy*100))
plot_confusion_matrix(cm, classes= ['class 0', 'class 1'], normalize=True)

### SVM

It is a discriminative classifier that finds a hyperplane separating the samples into two groups. The algorithm outputs the hyperplane with the maximum margin.


$\hat{y}(x) = sgn(\sum_{i=1}^d x_iw_i + b)$

<br>

A linear SVM can only separate the two groups if they are linearly separable.

However, if the data is more complex, a non-linear classifier can be built using the **kernel trick**. In this case, the dot product is replaced by a non-linear **kernel function** $K$. The maximum-margin hyperplane is then fit in the new feature space, which is usually high dimensional. In terms of the $n$ training samples $x^{(j)}$ with labels $y_j \in \{-1, 1\}$ and learned coefficients $\alpha_j$, the prediction becomes:

$\hat{y}(x) = sgn(\sum_{j=1}^n \alpha_j y_j K(x^{(j)}, x) + b)$
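
The effect of the kernel can be seen on data that is not linearly separable. The sketch below uses scikit-learn's `make_circles` toy dataset and an assumed kernel width `gamma=1.0`, both chosen only for illustration. It also rebuilds the kernelized decision value by hand, as a weighted sum of kernel evaluations against the support vectors plus a bias, and compares it with what the library returns.

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.svm import SVC

# Two concentric rings: not linearly separable in the original 2-D space
X, y = make_circles(n_samples=300, factor=0.3, noise=0.05, random_state=0)

gamma = 1.0  # assumed kernel width, fixed so it can be reused below
linear_svm = SVC(kernel='linear').fit(X, y)
rbf_svm = SVC(kernel='rbf', gamma=gamma).fit(X, y)

linear_acc = linear_svm.score(X, y)
rbf_acc = rbf_svm.score(X, y)
print("linear kernel accuracy:", linear_acc)
print("RBF kernel accuracy:", rbf_acc)

# Rebuild the decision value by hand:
# sum over support vectors of (alpha_j * y_j) * K(sv_j, x), plus the bias b.
# dual_coef_ already stores the products alpha_j * y_j.
K = rbf_kernel(rbf_svm.support_vectors_, X, gamma=gamma)  # (n_SV, n_samples)
manual = (rbf_svm.dual_coef_ @ K + rbf_svm.intercept_).ravel()
print("matches decision_function:",
      np.allclose(manual, rbf_svm.decision_function(X)))
```

Only the support vectors enter the sum, which is why the fitted model does not need to keep the full training set.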

In [0]:
svc = SVC()
svc.fit(x_train, y_train)

y_pred = svc.predict(x_test)

In [0]:
cm = metrics.confusion_matrix(y_test, y_pred)
accuracy = metrics.accuracy_score(y_test, y_pred)

print("Accuracy %.2f" % (accuracy*100))
plot_confusion_matrix(cm, classes= ['class 0', 'class 1'], normalize=True)

### Decision trees

A tree is built at training time and used at test time to go from observations of the features (branches) to decisions (leaves). If the values at the leaves are continuous, the model is called a regression tree.
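
To make the branch and leaf structure concrete, the sketch below fits a shallow tree on the Iris dataset (chosen only because it ships with scikit-learn) and prints the learned rules with `export_text`. The depth limit of 2 is an arbitrary choice to keep the output short.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(iris.data, iris.target)

# Each indented line is a test on a feature (a branch);
# lines starting with 'class:' are the decisions (leaves)
rules = export_text(tree, feature_names=list(iris.feature_names))
print(rules)
```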





*   Boosted trees: 
*   Bootstrap aggregated: 



In [0]:
decision_tree = DecisionTreeClassifier()
decision_tree.fit(x_train, y_train)

y_pred = decision_tree.predict(x_test)

In [0]:
cm = metrics.confusion_matrix(y_test, y_pred)
accuracy = metrics.accuracy_score(y_test, y_pred)

print("Accuracy %.2f" % (accuracy*100))
plot_confusion_matrix(cm, classes= ['class 0', 'class 1'], normalize=True)

### Parameter search

In [0]:
dict_classifiers = {
    "Logistic Regression": {
        # liblinear supports both the l1 and l2 penalties
        'classifier': LogisticRegression(solver='liblinear'),
        'params': [{
            'penalty': ['l1', 'l2'],
            'C': [0.001, 0.01, 0.1, 1, 10, 100, 1000]
        }]
    },
    "Linear SVM": {
        'classifier': SVC(),
        'params': [{
            'C': [1, 10],
            'gamma': [0.001, 0.0001],  # ignored by the linear kernel
            'kernel': ['linear']
        }]
    },
    "Decision Tree": {
        'classifier': DecisionTreeClassifier(),
        'params': [{
            'max_depth': [3, None]
        }]
    }
}

In [0]:
for key, classifier in dict_classifiers.items():
    grid = GridSearchCV(classifier['classifier'],
                        classifier['params'],
                        refit=True,          # refit the best model on the full training set
                        cv=2,                # number of cross-validation folds
                        scoring='accuracy',  # scoring metric
                        n_jobs=-1)           # use all available cores
    estimator = grid.fit(x_train, y_train)

    train_score = estimator.score(x_train, y_train)
    test_score = estimator.score(x_test, y_test)

    print(key)
    print(train_score)
    print(test_score)

### Feature Importance

In [0]:
from sklearn.ensemble import RandomForestClassifier

# Fit a random forest and rank the training features by importance
classif = RandomForestClassifier()
classif.fit(x_train, y_train)
df = pd.DataFrame()
df['ft'] = x_train.columns
df['importance'] = classif.feature_importances_
df = df.sort_values(by='importance', ascending=False)
display(df)

In [0]:
# Same ranking, this time from a single decision tree
decision_tree = DecisionTreeClassifier()
decision_tree.fit(x_train, y_train)

df = pd.DataFrame()
df['ft'] = x_train.columns
df['importance'] = decision_tree.feature_importances_
df = df.sort_values(by='importance', ascending=False)
display(df)

In [0]:
ft_importance = pd.DataFrame(data={'Feature':x_train.columns, 'Importance':decision_tree.feature_importances_})
ft_importance = ft_importance.sort_values(by='Importance', ascending = False)

In [0]:
sns.barplot(x="Feature", y="Importance", data=ft_importance)

## Unsupervised Learning

In [0]:
iris = datasets.load_iris()
X = pd.DataFrame(data=iris.data, columns=iris.feature_names)
y = pd.DataFrame(data=iris.target, columns=['species'])

### Visualization

In [0]:
ft_1 = 'sepal length (cm)'
ft_2 = 'sepal width (cm)'

plt.scatter(X[ft_1], X[ft_2], c=y.species.values, cmap="YlGnBu")
plt.xlabel(ft_1)
plt.ylabel(ft_2)
plt.title(ft_1 + ' vs ' + ft_2)

### K-Means

Separates the samples into $k$ groups of equal variance by assigning each sample to the cluster with the nearest mean $\mu_j$. The algorithm minimizes the within-cluster sum of squares:


$\sum_{i=1}^n \min_{\mu_j}(||x_i-\mu_j||^2)$
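
The objective can be checked numerically. The sketch below fits `KMeans` on the Iris features, recomputes the sum of squared distances to the nearest centroid with NumPy, and compares it with the `inertia_` attribute that scikit-learn reports. The values `n_init=10` and `random_state=0` are assumptions made so that the run is reproducible.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

X = load_iris().data
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# Squared distance from every sample to every centroid, shape (n_samples, 3)
diff = X[:, None, :] - kmeans.cluster_centers_[None, :, :]
sq_dist = (diff ** 2).sum(axis=2)

# The objective: for each sample take the closest centroid, then sum
inertia = sq_dist.min(axis=1).sum()
nearest = sq_dist.argmin(axis=1)

print("manual objective:", inertia)
print("kmeans.inertia_: ", kmeans.inertia_)
```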


In [0]:
kmeans = KMeans(n_clusters=3)
kmeans.fit(X)
labels = kmeans.labels_

ft_1 = 'sepal length (cm)'
ft_2 = 'sepal width (cm)'

plt.scatter(X[ft_1], X[ft_2], c=labels.astype(float), cmap="YlGnBu")
plt.xlabel(ft_1)
plt.ylabel(ft_2)
plt.title(ft_1 + ' vs ' + ft_2)

### Nearest Neighbors

In [0]:
nn = NearestNeighbors(n_neighbors=2, algorithm='ball_tree')
nn.fit(X)
distances, indices = nn.kneighbors(X)

In [0]:
indices

## Neural Networks

In [0]:
tr_samples, tr_features = x_train.shape
te_samples, _ = x_test.shape

In [0]:


model = Sequential()
model.add(Dense(128, input_shape=(tr_features,), activation='relu'))
model.add(Dense(64, activation='relu'))
model.add(Dense(1, activation='sigmoid'))

opt = rmsprop(lr=0.0001, decay=1e-6)

model.compile(loss='binary_crossentropy',
              optimizer=opt,
              metrics=['accuracy'])


In [0]:
model.fit(x_train, y_train, epochs=10, batch_size=64)

y_prob = model.predict(x_test)

In [0]:
# Threshold the predicted probabilities and flatten (n, 1) to (n,)
y_pred = np.where(y_prob > 0.5, 1, 0).ravel()

display(metrics.confusion_matrix(y_test, y_pred))
display(metrics.accuracy_score(y_test, y_pred))

## Kaggle Competition

To practice the techniques we have learned today, we will hold a competition on https://www.kaggle.com/.

Please register and join the following challenge **(TODO)**. The aim is to obtain the highest accuracy using any of the models we have learned or new ones that you would like to try. 


We have added a sample submission and the corresponding code as a reference. 

## Extra

# References

*  https://www.kaggle.com/amitkumarjaiswal/beginner-s-tutorial-to-titanic-using-scikit-learn