# Data analysis and machine learning in Python


This notebook contains sample code for working with:
* Pandas dataframes
* EDA with Seaborn
* Building a simple classification model (using a decision tree & random forest in scikit learn)


## Import Libraries

To add functionality to your Python session, we import a series of libraries (most importantly pandas, seaborn and scikit-learn).

In [1]:
# Basic libraries
import pandas as pd
import numpy as np
import warnings
warnings.simplefilter("ignore", UserWarning)
warnings.simplefilter("ignore", RuntimeWarning)
warnings.simplefilter("ignore", DeprecationWarning)

# Sklearn
## Data
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

## Models
from sklearn import tree
from sklearn import ensemble
from sklearn.model_selection import GridSearchCV

## Metrics
from sklearn.metrics import accuracy_score, confusion_matrix


# Plotting
#import graphviz
import seaborn as sns
import matplotlib.pyplot as plt
from IPython.display import display


## PART 1: Pandas

We start by loading the Iris dataset (make sure the file is in your working directory).

In [9]:
df = pd.read_csv("iris.csv", delimiter= ";")

Inspect the loaded dataset (stored as a `DataFrame` object).

In [None]:
df

Extract one column (which is returned as a series) and index that column.

In [None]:
# select one column
X = df["SepalLength"]

# slice the Sepal Lengths of the first 10 flowers
X[0:10]

Apply NumPy-style (implicit, position-based) indexing with `iloc`.

In [None]:
df.iloc[1:4, 0:2]

Apply explicit (label-based) indexing with `loc`.

In [None]:
df.loc[145:148, ['PetalLength', 'PetalWidth']]
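
The two indexers differ mainly at the slice endpoints: `iloc` excludes the end position, while `loc` includes the end label. A minimal sketch on a hypothetical toy frame (the name `toy` and its values are made up for illustration and are not taken from the iris file):

```python
import pandas as pd

# Hypothetical toy frame with the default integer index 0..4
toy = pd.DataFrame({
    "SepalLength": [5.1, 4.9, 4.7, 4.6, 5.0],
    "SepalWidth": [3.5, 3.0, 3.2, 3.1, 3.6],
})

# Position-based slicing: rows at positions 1, 2 and 3 (end position excluded)
by_position = toy.iloc[1:4, 0:2]

# Label-based slicing: rows labelled 1, 2, 3 and 4 (end label included)
by_label = toy.loc[1:4, ["SepalLength", "SepalWidth"]]

print(by_position.shape)  # (3, 2)
print(by_label.shape)     # (4, 2)
```

With the default integer index, positions and labels coincide, which makes the endpoint difference easy to overlook.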

Select `SepalLength` and `PetalLength` for all setosa species 

In [None]:
# make the selection and store the result in s
s = df.loc[df['Species'] == "setosa", ['SepalLength', 'PetalLength']]
# show only the first few rows of s (to avoid filling the screen with data)
s.head()

Add a new column containing an approximation of the sepal area (the area of an ellipse whose axes are the sepal length and the sepal width).

In [None]:
# update the data frame
df['area'] = np.pi * df['SepalWidth'] / 2 * df['SepalLength'] / 2
# show the first few rows
df.head()

Cast the first 4 columns to a numeric NumPy array and the fifth column (the species labels) to an array of strings.

In [24]:
X = df.iloc[:, 0:4].to_numpy()
Y = df.iloc[:, 4].to_numpy()

Compute the mean for all numerical variables

In [None]:
df.mean(numeric_only = True)


Compute the mean for the SepalWidth variable per species.

In [None]:
df.groupby('Species')['SepalWidth'].mean()
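
The same split-apply-combine pattern can return several statistics at once. A small sketch on a hypothetical toy frame (the values are invented for illustration), using `agg` to compute the mean and the group size together:

```python
import pandas as pd

# Hypothetical toy data: two species with two flowers each
toy = pd.DataFrame({
    "Species": ["setosa", "setosa", "virginica", "virginica"],
    "SepalWidth": [3.5, 3.0, 3.3, 2.7],
})

# One row per species, one column per requested statistic
per_species = toy.groupby("Species")["SepalWidth"].agg(["mean", "count"])
print(per_species)
```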

### **EXERCISES** 

Take the following actions / answer the following questions using the IRIS dataset:
1. Extract the SepalWidth of the first 5 flowers in the dataset
2. How many setosa flowers have a SepalLength greater than 5.0 ?
3. Add a new column to the iris dataset called SepalRatio, it should contain the ratio of the SepalLength to the SepalWidth


In [None]:
# your answer goes here

Take the following actions / answer the following questions using the TITANIC dataset:
1. How many passengers survived the shipwreck?
2. What is the average age of all male survivors?
3. Remove all passengers for which the deck is unknown from the dataframe.
4. Is it true that lower-class passengers had a lower probability of surviving?


In [None]:
# your answer goes here
import seaborn as sns
titanic = sns.load_dataset('titanic')
titanic

## PART 2: EDA

We will use the TIPS dataset to illustrate how to use Seaborn.

First create a `relplot`:

In [None]:
import seaborn as sns
tips = sns.load_dataset('tips')
sns.relplot(
    data= tips,
    x=    "total_bill", 
    y=    "tip", 
    col=  "time",
    hue=  "smoker", 
    style="smoker", 
    size= "size"
)

Create a `displot` for the Penguins dataset:

In [None]:
penguins = sns.load_dataset('penguins')
sns.displot(penguins,
            x=    "bill_length_mm",
            y=    "bill_depth_mm", 
            hue=  "species", 
            kind= "kde", 
            bw_adjust = 1.5)


Create a `catplot` for the TIPS and TITANIC datasets


In [None]:
sns.catplot(data=tips, 
                x="day", 
                y="total_bill", 
                hue="smoker", 
                col = "time", 
                kind="box")

In [None]:
g = sns.catplot(data=titanic,
                x = "class", 
                y = "age",
                estimator = np.median,
                errorbar = ('ci', 95),
                hue = "sex",
                kind = "bar")
g.set_ylabels("Age (years)")
# leave room at the top of the figure for the title
g.figure.subplots_adjust(top = 0.9)
g.figure.suptitle("Median age as function of class and sex")

### **EXERCISES**

Question: Consider the titanic dataset

1. It is often stated that lower-class passengers had a reduced chance of surviving. Make a graph that illustrates this using the data.

2. Is there a relationship between the age and the probability of survival? Note that the class might be a confounding variable here.


In [None]:
# your answer goes here

## PART 3: Machine learning

### Loading a toy dataset

To illustrate the concepts of this class, we will use the Breast Cancer Wisconsin dataset, which contains measurements computed from microscopic images of tumors. The goal is to predict whether a tumor is *benign* or *malignant*.

In [2]:
# Load dataset
breast_cancer_data = load_breast_cancer()
predictors = breast_cancer_data['data']
labels = breast_cancer_data['target']

# Print description of the dataset (in case you want some more info)
# print(breast_cancer_data['DESCR'])

We make the usual split into training and test data.

In [3]:
# Parameters
seed = 0

# Train - Test Split
X_train, X_test, y_train, y_test = train_test_split(predictors, 
                                                    labels, 
                                                    random_state=seed)
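
When no size is given, `train_test_split` holds out 25% of the samples for testing. The sketch below checks the resulting shapes and, as an optional variation that the cell above does not use, passes `stratify` so that both parts keep roughly the same class balance. The names `X_tr`, `X_te`, `y_tr` and `y_te` are chosen here only to avoid overwriting the notebook's variables.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
X, y = data["data"], data["target"]

# Default split (75% train / 25% test), stratified on the class labels
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0, stratify=y)

print(X_tr.shape, X_te.shape)    # (426, 30) (143, 30)
# Fraction of samples with label 1 in each part (should be close)
print(y_tr.mean(), y_te.mean())
```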

### 1. Decision Trees

Tree-based methods are implemented in the submodule ``tree`` and classification trees are implemented by the ``DecisionTreeClassifier`` class of that submodule.

#### 1.1 Decision trees with default parameters

Decision tree classifiers come (like most classifiers in sklearn) with a set of default settings for the hyperparameters. The code sample below shows how such a tree can be built and tested. It keeps every default except ``min_samples_split``, which is raised from its default of 2 to 5.

NOTE: the parameter ``random_state`` sets the seed of the random number generator used by the ``DecisionTreeClassifier`` instance. On rare occasions, two potential splits can be equally good, and in that case the tree induction algorithm decides randomly which split to use. Because this is a random process, different runs can lead to different trees. Fixing the seed avoids this problem.
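
A quick way to convince yourself of the effect of a fixed seed is to fit two trees with the same ``random_state`` and compare their structure. This is only a sketch: it inspects the fitted ``tree_`` attribute, and the names `tree_a` and `tree_b` are illustrative.

```python
from sklearn import tree
from sklearn.datasets import load_breast_cancer

X, y = load_breast_cancer(return_X_y=True)

# Two independent fits with the same seed
tree_a = tree.DecisionTreeClassifier(random_state=0).fit(X, y)
tree_b = tree.DecisionTreeClassifier(random_state=0).fit(X, y)

# Same number of nodes and the same split feature at every node
same_structure = (
    tree_a.tree_.node_count == tree_b.tree_.node_count
    and (tree_a.tree_.feature == tree_b.tree_.feature).all()
)
print(same_structure)
```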

In [None]:
# Create decision tree classifier object
decision_tree_classifier = tree.DecisionTreeClassifier(random_state=seed,
                                                       min_samples_split= 5)

# Fit the training data to the classifier
decision_tree_classifier = decision_tree_classifier.fit(X_train, y_train)

# Calculate accuracy of the train and test sets
train_predictions = decision_tree_classifier.predict(X_train)
test_predictions = decision_tree_classifier.predict(X_test)

print("Train set accuracy is:", np.sum(y_train == train_predictions)/len(y_train))
print("Test set accuracy is:", np.sum(y_test == test_predictions)/len(y_test))

#### 1.2 Decision trees with cost complexity pruning

Cost-complexity pruning is a form of model tuning that focuses on finding an optimal value for the cost-complexity parameter $\alpha$. Unlike the hyperparameters we have tuned in the past (such as the regularization parameter $\alpha$ for ridge regression), the candidate values do not have to be chosen by hand: sklearn provides a method (``cost_complexity_pruning_path``) that generates the series of $\alpha$'s that should be searched during a grid search. One can show that including additional values in the grid search is not useful, because any value between two consecutive $\alpha$'s in the series yields the same pruned tree.

In the following code fragment ``path.ccp_alphas`` is an array of relevant $\alpha$'s to try.

In [None]:
# Call built-in method to compute the pruning path during Minimal Cost-Complexity Pruning.
path = decision_tree_classifier.cost_complexity_pruning_path(X_train, y_train)
path.ccp_alphas
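
To see what these values do, the sketch below refits a tree for every $\alpha$ on the path and records the number of nodes, which should shrink as $\alpha$ grows. It is an illustration under a few assumptions: it uses a fresh default tree (`base`) rather than the classifier fitted above, and it clips the values at zero as a precaution against tiny negative values caused by floating-point round-off.

```python
import numpy as np
from sklearn import tree
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Compute the pruning path of an unpruned tree
base = tree.DecisionTreeClassifier(random_state=0)
path = base.cost_complexity_pruning_path(X_tr, y_tr)
alphas = np.clip(path.ccp_alphas, 0.0, None)  # precaution, see lead-in

# Refit one tree per alpha and record its size
node_counts = []
for alpha in alphas:
    pruned = tree.DecisionTreeClassifier(random_state=0, ccp_alpha=alpha)
    pruned.fit(X_tr, y_tr)
    node_counts.append(pruned.tree_.node_count)

for alpha, n_nodes in zip(alphas, node_counts):
    print(f"alpha = {alpha:.4f} -> {n_nodes} nodes")
```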

In the next step, a grid search with 10-fold cross-validation is used to find the optimal value for $\alpha$.

In [None]:
# create GridSearchCV instance
mdl_cv = GridSearchCV(decision_tree_classifier,
                      param_grid = {'ccp_alpha' : path.ccp_alphas},
                      cv = 10)

# perform the grid search
mdl_cv.fit(X_train, y_train)

Look at the best alpha found

In [None]:
mdl_cv.best_params_

Make predictions on the test set and compute accuracy.

In [None]:
# make predictions using the best model and compute accuracy (using a built-in function this time)
predictions = mdl_cv.predict(X_test)
print(accuracy_score(y_test, predictions))

# compute the confusion matrix (rows: true labels, columns: predicted labels)
confusion_matrix(y_test, predictions)
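
Accuracy and the confusion matrix are closely linked: the correct predictions sit on the diagonal, so accuracy equals the sum of the diagonal divided by the total count. A small worked sketch on hypothetical labels (invented for illustration, not model output):

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical true and predicted labels
y_true = np.array([0, 1, 1, 0, 1, 1])
y_pred = np.array([0, 1, 0, 0, 1, 1])

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true, y_pred)
print(cm)
# [[2 0]
#  [1 3]]

# Diagonal = correct predictions, so accuracy = trace / total = 5 / 6
accuracy_from_cm = np.trace(cm) / cm.sum()
print(accuracy_from_cm, accuracy_score(y_true, y_pred))
```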

Visualize the tree

In [None]:
tree.plot_tree(mdl_cv.best_estimator_)

#### 1.3 Regression trees

Regression trees can be built in the same way. They are implemented by the ``DecisionTreeRegressor`` class of the ``tree`` submodule of ``sklearn``.
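
As a minimal sketch of what this looks like in practice, the example below fits a shallow regression tree. The choice of the diabetes toy dataset bundled with sklearn, the depth limit of 3 and the variable names are assumptions made for illustration only.

```python
from sklearn import tree
from sklearn.datasets import load_diabetes
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Toy regression dataset (numeric target)
X, y = load_diabetes(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# A shallow tree: depth 3 allows at most 2**3 = 8 leaves
reg = tree.DecisionTreeRegressor(max_depth=3, random_state=0)
reg.fit(X_tr, y_tr)

# Each prediction is the mean target value of the leaf the sample falls into
pred = reg.predict(X_te)
mse = mean_squared_error(y_te, pred)
print("Test MSE:", mse)
```

Because every leaf predicts a single constant, the number of distinct predicted values can never exceed the number of leaves.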

### 2 Random forests

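Random forests are a simple but powerful extension to classification and regression trees: a forest trains many trees, each on a bootstrap sample of the training data and with only a random subset of the features considered at each split, and then combines the predictions of the individual trees.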

Random forests are implemented by the ``RandomForestClassifier`` and ``RandomForestRegressor`` classes in the ``ensemble`` submodule.
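
To make the ensemble idea concrete, the sketch below looks inside a fitted forest: the individual trees are available in `estimators_`, and averaging their class probabilities by hand reproduces the forest's own `predict_proba`. The forest size of 25 and the variable names are arbitrary choices for illustration.

```python
import numpy as np
from sklearn import ensemble
from sklearn.datasets import load_breast_cancer

X, y = load_breast_cancer(return_X_y=True)

# A small forest so the example runs quickly
forest = ensemble.RandomForestClassifier(n_estimators=25, random_state=0)
forest.fit(X, y)

# Class probabilities of every individual tree: shape (n_trees, n_samples, n_classes)
per_tree = np.stack([t.predict_proba(X) for t in forest.estimators_])

# Averaging over the trees gives the forest's probabilities
averaged = per_tree.mean(axis=0)
print(len(forest.estimators_))
print(np.allclose(averaged, forest.predict_proba(X)))
```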

In [None]:
# Create random forest classifier object (fix the seed so that results are reproducible)
random_forest_classifier = ensemble.RandomForestClassifier(max_features=5,
                                                           min_samples_split=10,
                                                           random_state=seed)

# Fit the training data to the classifier
random_forest_classifier = random_forest_classifier.fit(X_train, y_train)

# Calculate accuracy of the train and test sets
train_predictions = random_forest_classifier.predict(X_train)
test_predictions = random_forest_classifier.predict(X_test)

print("Train set accuracy is:", np.sum(y_train == train_predictions)/len(y_train))
print("Test set accuracy is:", np.sum(y_test == test_predictions)/len(y_test))

### **EXERCISES**

1. Train a classification tree on the iris dataset that allows you to predict the species. Test the performance of the classifier and study the effect of the hyperparameter `min_samples_split` on the test accuracy.

2. Train a random forest classifier on the iris dataset that allows you to predict the species. Compare its accuracy with that of the classification tree you built before.

In [None]:
# your answer goes here
