<!--BOOK_INFORMATION-->
This notebook contains both adapted and unmodified material from: 
[Introduction to Machine Learning with Python](https://learning.oreilly.com/library/view/introduction-to-machine/9781449369880/)
by Sarah Guido, Andreas C. Müller; the content is available [on GitHub](https://github.com/amueller/introduction_to_ml_with_python/blob/master/02-supervised-learning.ipynb).

# 10 Machine Learning Part 2: Supervised Learning
# Review Exercises - Solutions

We are going to use a data set that records the populations of 8 different species of algae in a river, together with environmental parameters such as the chemical content of the water and the season.

We will use the machine learning models we have studied today to try to extract information about the relationship between these environmental parameters and the population of algae. 

The data `sample_data/analysis.csv` contains measurements of river chemical concentrations and algae densities.

200 samples, 18 columns

__Column headings:__

- First 10 columns : 
<br>Season, river size, fluid velocity, 7 chemical concentrations (A-G)

- Last 8 columns : 
<br>Population of 8 different kinds of algae (a-h). 

The data set also contains some missing data:
- empty fields 
- string labels `'XXXXXXX'`.
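
As an alternative to replacing the placeholder after loading, `pandas.read_csv` accepts an `na_values` argument that marks chosen strings as missing while the file is read. The sketch below is a minimal illustration on a made-up three-row sample (the rows and the reduced column list are hypothetical, not taken from `analysis.csv`); it uses the sentinel string `'XXXXXXX'` that appears in the solution code.

```python
import io
import pandas as pd

# Hypothetical three-row sample in the same style as the data set
raw = io.StringIO(
    "winter,small_,medium_,8.0,XXXXXXX\n"
    "spring,small_,medium_,8.35,1.4\n"
    "autumn,large_,low___,,3.3\n"
)

# na_values tells pandas to treat the sentinel string as missing data;
# empty fields are treated as missing by default
sample = pd.read_csv(raw,
                     names=["Season", "river_size", "fluid_velocity", "A", "a"],
                     na_values=["XXXXXXX"])

null_counts = sample.isnull().sum()   # number of missing values per column
print(null_counts)
print(sample.dtypes)                  # A and a are parsed as float64
```

Because the sentinel never reaches the `DataFrame` as text, the numerical columns are parsed as floats directly and no separate type conversion is needed in this toy case.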

# Review Exercise 1 : Data Cleaning

1. Import the data from `sample_data/analysis.csv` 
1. Replace all missing data labelled `'XXXXXXX'` with `np.nan` values.
1. Determine how many null values there are in each column. Choose an appropriate method to deal with the null values in the data set. 
1. The numbers representing the chemical concentrations are stored as `string`s, not as the numerical data they appear to be (you can check this by inspecting the `.dtypes` attribute of your `DataFrame`). Convert all numerical columns to floating point data.

In [4]:
# Review Exercise 1
# Example Solution

import pandas as pd
import numpy as np

algae = pd.read_csv('sample_data/analysis.csv',                             # 1. Import data
                    names = ['Season', 'river_size', 'fluid_velocity', 
                             'A', 'B', 'C', 'D', 'E', 'F', 'G',
                             'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])   


# 2. Replace all occurrences of XXXXXXX with numpy nan
algae = algae.replace({'XXXXXXX': np.nan})

# 3. number of null values in each column, remove null values
print(algae.isnull().sum()) 
algae.dropna(inplace=True)

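# 4. convert the chemical and algae columns to floating point data
# (assigning whole columns replaces them, so the new float dtype is kept)
numeric_cols = algae.columns[3:]
algae[numeric_cols] = algae[numeric_cols].astype(float)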

Season             0
river_size         0
fluid_velocity     0
A                  1
B                  2
C                 10
D                  2
E                  2
F                  2
G                  2
a                 12
b                  0
c                  0
d                  0
e                  0
f                  0
g                  0
h                 17
dtype: int64
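
A quick way to confirm that a string-to-float conversion took effect is to inspect `.dtypes` afterwards. The following is a minimal sketch on a hypothetical two-row frame (the values and the choice of `pd.to_numeric` are illustrative assumptions, not part of the original solution).

```python
import pandas as pd

# Hypothetical frame: numbers stored as strings, as happens when a column
# contained a text placeholder at the time it was read in
df = pd.DataFrame({"Season": ["winter", "spring"],
                   "A": ["8.0", "8.35"],
                   "B": ["9.8", "8.0"]})

numeric_cols = ["A", "B"]

# Convert each selected column; assigning to df[numeric_cols] replaces the
# columns, so they take on the new float dtype
df[numeric_cols] = df[numeric_cols].apply(pd.to_numeric)

print(df.dtypes)   # Season stays object, A and B become float64
```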


# Review Exercise 2 : Linear Regression

We will use linear regression to find the relationship between the chemical content of the water and different species of algae.




Import the data from `sample_data/analysis.csv`
<br>

1. Remove all columns that are __not__ chemical concentrations.   
1. Make 2 data frames : one for algae species a and one for algae species b (remove all other species)
1. Import and instantiate linear regression model `sklearn.linear_model.LinearRegression`
1. For each data frame (algae a, algae b):
    1. split the data into test and training data
    1. train the model on the data
    1. print the `score` for each model  

1. Which algae, a or b, does this model predict better, i.e. which has the higher `score` (the coefficient of determination, $R^2$, for regression models)?
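
For a regression model such as `LinearRegression`, the `score` method returns the coefficient of determination $R^2$ rather than a classification accuracy. The sketch below uses synthetic data (the coefficients and noise level are arbitrary assumptions) to show that computing $R^2 = 1 - SS_{res}/SS_{tot}$ by hand gives the same number as `score`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic regression problem: 3 features, linear relationship plus noise
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.5, size=200)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
model = LinearRegression().fit(X_train, y_train)

# R^2 by hand: 1 - (residual sum of squares / total sum of squares)
y_pred = model.predict(X_test)
ss_res = np.sum((y_test - y_pred) ** 2)
ss_tot = np.sum((y_test - y_test.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(r2_manual, model.score(X_test, y_test))   # the two values agree
```

A score of 1 means perfect prediction, while a score near 0 means the model does no better than always predicting the mean of the targets.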

In [2]:
# Review Exercise 2
# Example Solution

from sklearn.model_selection import train_test_split

from sklearn.linear_model import LinearRegression                           # 3. Import linear regression model

lr = LinearRegression()                                                     # Instantiate model   

# 1. remove all non-chemical columns
algae = algae.loc[:, 'A':]

# 2. data frames for algae a and b
algae_a = algae.loc[:, :'a']
algae_b = algae.loc[:, :'b'].drop(columns=['a'])

data = [algae_a, algae_b]

for d in data:                                                           # 4. For each data frame ...
    X_train, X_test, y_train, y_test = train_test_split(d.loc[:,:'G'],   # A. split into test and train data
                                                        d.iloc[:, -1],
                                                        random_state=42) 
    
    
    lr.fit(X_train, y_train)         # B. fit model to training data 
    
    print(lr.score(X_test, y_test))  # C. print the score for each model
    
algae.head()

0.2783004665712848
0.3290248571923503


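| | A | B | C | D | E | F | G | a | b | c | d | e | f | g | h |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 8.0 | 9.8 | 60.8 | 6.238 | 578.0 | 105.0 | 170.0 | 50.0 | 0.0 | 0.0 | 0.0 | 0.0 | 34.2 | 8.3 | 0.0 |
| 1 | 8.35 | 8.0 | 57.75 | 1.288 | 370.0 | 428.75 | 558.75 | 1.3 | 1.4 | 7.6 | 4.8 | 1.9 | 6.7 | 0.0 | 2.1 |
| 2 | 8.1 | 11.4 | 40.02 | 5.33 | 346.66699 | 125.667 | 187.05701 | 15.6 | 3.3 | 53.6 | 1.9 | 0.0 | 0.0 | 0.0 | 9.7 |
| 3 | 8.07 | 4.8 | 77.364 | 2.302 | 98.182 | 61.182 | 138.7 | 1.4 | 3.1 | 41.0 | 18.9 | 0.0 | 1.4 | 0.0 | 1.4 |
| 4 | 8.06 | 9.0 | 55.35 | 10.416 | 233.7 | 58.222 | 97.58 | 10.5 | 9.2 | 2.9 | 7.5 | 0.0 | 7.5 | 4.1 | 1.0 |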


# Review Exercise 3 : Converting for use in a machine learning model

Import and clean the data by running the code you wrote for Exercise 1 again.

#### Feature Extraction
Make a new column called `'max'` that shows the species of algae with the highest population at each data point. <br>Example : If `a=50, b=0, c=10, d=0, e=0, f=12, g=22, h=0` --> New column value = `a`.
<br>To achieve this you can run the method `.idxmax(axis=1)` on columns `'a':'h'` of the data set.
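
The example row above can be reproduced directly. This is a minimal sketch; the second row is a hypothetical addition to show a different result.

```python
import pandas as pd

# First row: the example from the text. Second row: hypothetical.
populations = pd.DataFrame([
    {"a": 50, "b": 0, "c": 10, "d": 0, "e": 0, "f": 12, "g": 22, "h": 0},
    {"a": 1,  "b": 0, "c": 10, "d": 0, "e": 0, "f": 12, "g": 22, "h": 0},
])

# idxmax(axis=1) returns, for each row, the column label holding the maximum
populations["max"] = populations.loc[:, "a":"h"].idxmax(axis=1)

print(populations["max"].tolist())   # ['a', 'g']
```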

#### Integer data to represent string data
Columns `'Season'`, `'river_size'`, `'fluid_velocity'` and `'max'` contain text string data. 
<br>Most machine learning models require numerical input. 
<br>Convert the string data to integer IDs representing each unique value in the column. 
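
One possible way to obtain integer IDs is `pandas.factorize`, sketched here on a hypothetical list of seasons. Note that integer IDs suggest an ordering between categories that may not exist, which is one reason a one-hot (binarized) encoding is often preferred for input features.

```python
import pandas as pd

# Hypothetical column of string labels
seasons = pd.Series(["winter", "spring", "autumn", "spring", "winter"])

# factorize assigns an integer ID to each unique value, in order of first appearance
codes, uniques = pd.factorize(seasons)

print(list(codes))     # [0, 1, 2, 1, 0]
print(list(uniques))   # ['winter', 'spring', 'autumn']
```

The `uniques` index can be used to map the integer IDs back to the original strings, for example `uniques[codes]`.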

#### Normalization 
Split the data into test and train data. <br>Use column `'max'` as the __targets__ of the data set (the parameter we are trying to predict).
<br>Standardize the feature data using __Z-score normalization__.
<br>$$x' = \frac{x - \bar{x}}{\sigma}$$
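
The same standardization is available as `sklearn.preprocessing.StandardScaler`. The sketch below uses synthetic data (the feature means and spreads are arbitrary assumptions) and fits the scaler on the training set only, so that the test set is transformed with the training mean and standard deviation. One detail worth knowing: `StandardScaler` uses the population standard deviation (`ddof=0`), whereas the pandas `.std()` method defaults to the sample standard deviation (`ddof=1`), so the two approaches differ very slightly.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic features on very different scales
rng = np.random.default_rng(0)
X = rng.normal(loc=[5.0, 100.0], scale=[2.0, 30.0], size=(100, 2))
y = rng.integers(0, 2, size=100)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Fit on the training data only, then apply the SAME transformation to both sets
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Manual z-score using numpy (np.std also defaults to ddof=0)
manual = (X_train - X_train.mean(axis=0)) / X_train.std(axis=0)

print(np.allclose(X_train_scaled, manual))
```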




In [5]:
# Review Exercise 3
# Example Solution
from sklearn import preprocessing     # import the preprocessing module
lb = preprocessing.LabelBinarizer()   # instantiate the label binarizer (one-hot encoder)

# Feature Extraction
algae['max'] = algae.loc[:,'a':'h'].idxmax(axis=1)

# One-hot encode the string feature columns: one 0/1 column per unique value
for i in ['Season',  'river_size', 'fluid_velocity']:
    
    binarized = pd.DataFrame(lb.fit_transform(algae[i]), 
                             columns=lb.classes_, 
                             index=algae.index)
    
    algae = algae.join(binarized, rsuffix='_')
    
    algae.drop(i, axis=1, inplace=True)

In [6]:
# # a list of unique string values
# max_ = list(algae['max'].unique())

# # loop through each item in list and replace string item with a string number in DataFrame
# for n, m in enumerate(max_):
#     print(n,m)
#     algae['max'] = algae['max'].str.replace(m, str(n), regex=True)

# # convert string numerical data to integer
# algae['max'] = algae['max'].astype(int)
# #display(algae.head())

In [7]:
values = algae.drop(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'max'], axis=1)

# Split into training and test data (the features are standardized in the next cell)
X_train, X_test, y_train, y_test = train_test_split(values,   # A. split into test and train data
                                                    algae['max'],
                                                    random_state=42)



values.head()

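| | A | B | C | D | E | F | G | autumn | spring | summer | winter | large_ | medium | small_ | high__ | low___ | medium_ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 8.0 | 9.8 | 60.8 | 6.238 | 578.0 | 105.0 | 170.0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 |
| 1 | 8.35 | 8.0 | 57.75 | 1.288 | 370.0 | 428.75 | 558.75 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 |
| 2 | 8.1 | 11.4 | 40.02 | 5.33 | 346.66699 | 125.667 | 187.05701 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 |
| 3 | 8.07 | 4.8 | 77.364 | 2.302 | 98.182 | 61.182 | 138.7 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 |
| 4 | 8.06 | 9.0 | 55.35 | 10.416 | 233.7 | 58.222 | 97.58 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 |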


In [8]:
mean_on_train = X_train.mean(axis=0)   # mean training set
std_on_train = X_train.std(axis=0)     # standard deviation training set


X_train_scaled = (X_train - mean_on_train) / std_on_train # subtract the mean, divide by the std --> mean=0 and std=1
X_test_scaled = (X_test - mean_on_train) / std_on_train   # SAME transformation on the test set




# Review Exercise 4 : Neural Network
We will use a neural network to predict which algae population will be highest based on a given set of environmental parameters and compare the accuracy to the decision tree. 

1. Import and instantiate the neural network classifier model `sklearn.neural_network.MLPClassifier`

1. Fit the model to the training data and print the score

1. Fit the model to the *scaled* training data and print the score

1. What is the impact of scaling the data on the accuracy of the model?
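
One way to keep scaling and model fitting together is a scikit-learn `Pipeline`, which fits the scaler on the training data and reuses it automatically when scoring. The sketch below is an illustration on synthetic data (the data set, the exaggerated feature scale, and `max_iter=1000` are assumptions for the example, not settings from the exercise).

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic classification problem
X, y = make_classification(n_samples=300, n_features=8, n_informative=4,
                           random_state=0)
X[:, 0] *= 1000   # put one feature on a much larger scale than the others

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Model trained on the raw features
unscaled = MLPClassifier(random_state=0, max_iter=1000).fit(X_train, y_train)

# Pipeline: StandardScaler is fitted on the training data, then the MLP is trained
scaled = make_pipeline(StandardScaler(),
                       MLPClassifier(random_state=0, max_iter=1000)).fit(X_train, y_train)

score_unscaled = unscaled.score(X_test, y_test)
score_scaled = scaled.score(X_test, y_test)
print(f"score = {score_unscaled}")
print(f"score (scaled): {score_scaled}")
```

Neural networks are sensitive to feature scale because all inputs pass through the same weighted sums, so features with large magnitudes can dominate the early stages of training.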



In [9]:
# Review Exercise 4
# Example Solution

from sklearn.neural_network import MLPClassifier # 1. import model
mlp = MLPClassifier(random_state=0)   


mlp.fit(X_train, y_train)                       # 2. fit the model to the training data
print(f"score = {mlp.score(X_test, y_test)}")   


mlp.fit(X_train_scaled, y_train)                # 3. fit model to SCALED training data
print(f"score (scaled): {mlp.score(X_test_scaled, y_test)}")

score = 0.5476190476190477
score (scaled): 0.5714285714285714




# Review Exercise 5 : Decision Tree

We will use a decision tree to predict which algae population is likely to be dominant based on a given set of environmental parameters.

1. Import and instantiate decision tree classifier model `sklearn.tree.DecisionTreeClassifier`

1. fit the model to the training data

1. print the `score`
    
1. Which feature is the most important in determining which algae will be dominant? 
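
A fitted tree exposes one importance value per input feature in its `feature_importances_` attribute, and the values sum to 1. A convenient way to read them is to pair them with the column names in a `pandas.Series` and sort. The sketch below uses the iris data set bundled with scikit-learn as a stand-in (an assumption for illustration; it is not the algae data).

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

# Stand-in data set with named feature columns
iris = load_iris(as_frame=True)
X, y = iris.data, iris.target

# random_state fixes the tie-breaking between equally good splits
tree = DecisionTreeClassifier(random_state=0).fit(X, y)

# Pair every importance with its feature name and sort, most important first
importances = pd.Series(tree.feature_importances_,
                        index=X.columns).sort_values(ascending=False)
print(importances)
```

Indexing by the same columns that were used for training guarantees that every feature appears and that names and values stay aligned.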

In [11]:
# Review Exercise 5
# Example Solution

import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split

from sklearn.tree import DecisionTreeClassifier     # 1. Import and instantiate decision tree classifier

tree = DecisionTreeClassifier()                                              
    
tree.fit(X_train, y_train)                          # 2. fit model to training data 
    
print(f"score = {tree.score(X_test, y_test)}")      # 3. print the score for each model  


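# 4. which feature is the most important?
# Note: zip stops at the shorter sequence, so only the first seven features
# (chemical concentrations A-G) are printed; the tree holds one importance
# value for every column of X_train, including the one-hot encoded columns
for f, t in zip(algae.loc[:,:'G'].columns,
                tree.feature_importances_):
    print(f, '\t\t', round(t, 3)) 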

    
# 4. feature importance as bar chart
n_features = X_train.shape[1] # number of features

plt.barh(np.arange(n_features), 
         tree.feature_importances_, # use the attribute feature_importances_ as above
         align='center')

plt.yticks(np.arange(n_features), 
           values.columns)

plt.xlabel("Feature importance")
plt.ylabel("Feature")
plt.ylim(-1, n_features)

score = 0.40476190476190477
A 		 0.123
B 		 0.109
C 		 0.139
D 		 0.065
E 		 0.064
F 		 0.161
G 		 0.256


(-1, 17)