## Loading the basic libraries

In [75]:
import numpy as np
import pandas as pd
from bokeh.io import output_notebook,show
from bokeh.plotting import figure
from bokeh.models import ColumnDataSource
output_notebook()

## Getting the penguin data

The penguin data comes from the [palmerpenguins](https://github.com/allisonhorst/palmerpenguins) project.  That site explains the meaning of the various measurements
and gives more background on the underlying research.

### Load the data into a pandas dataframe

In [3]:
penguins_df = pd.read_csv('../../data/penguins/penguins_raw.csv')

### Do some cleaning of the data

In [10]:
penguins_df.head()

| | studyName | Sample Number | Species | Region | Island | Stage | Individual ID | Clutch Completion | Date Egg | Culmen Length (mm) | Culmen Depth (mm) | Flipper Length (mm) | Body Mass (g) | Sex | Delta 15 N (o/oo) | Delta 13 C (o/oo) | Comments | type | color |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | PAL0708 | 1 | Adelie Penguin (Pygoscelis adeliae) | Anvers | Torgersen | Adult, 1 Egg Stage | N1A1 | Yes | 2007-11-11 | 39.1 | 18.7 | 181.0 | 3750.0 | MALE | NaN | NaN | Not enough blood for isotopes. | Adelie | blue |
| 1 | PAL0708 | 2 | Adelie Penguin (Pygoscelis adeliae) | Anvers | Torgersen | Adult, 1 Egg Stage | N1A2 | Yes | 2007-11-11 | 39.5 | 17.4 | 186.0 | 3800.0 | FEMALE | 8.94956 | -24.69454 | NaN | Adelie | blue |
| 2 | PAL0708 | 3 | Adelie Penguin (Pygoscelis adeliae) | Anvers | Torgersen | Adult, 1 Egg Stage | N2A1 | Yes | 2007-11-16 | 40.3 | 18.0 | 195.0 | 3250.0 | FEMALE | 8.36821 | -25.33302 | NaN | Adelie | blue |
| 3 | PAL0708 | 4 | Adelie Penguin (Pygoscelis adeliae) | Anvers | Torgersen | Adult, 1 Egg Stage | N2A2 | Yes | 2007-11-16 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | Adult not sampled. | Adelie | blue |
| 4 | PAL0708 | 5 | Adelie Penguin (Pygoscelis adeliae) | Anvers | Torgersen | Adult, 1 Egg Stage | N3A1 | Yes | 2007-11-16 | 36.7 | 19.3 | 193.0 | 3450.0 | FEMALE | 8.76651 | -25.32426 | NaN | Adelie | blue |


#### Simplify some of the fields

Make a new dataframe which is simpler:

- Boil the species down to Adelie, Gentoo, Chinstrap by keeping just the first word of the Species field; call this "Type"
- Assign a color field for later use: Adelie blue, Gentoo green, Chinstrap red
- Only keep the key measurements of Culmen Length and Depth, Body Mass, and Flipper Length as well as the sample number
- Drop any samples with missing data
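
The four steps above can also be sketched as a single pass over the raw frame. The three-row `raw` frame below is made up for illustration and only mimics the column names of the raw file, and `simplify` is a hypothetical helper name, not something defined elsewhere in this notebook.

```python
import numpy as np
import pandas as pd

# Made-up rows that mimic the raw column names (not real measurements).
raw = pd.DataFrame({
    'Sample Number': [1, 2, 3],
    'Species': ['Adelie Penguin (Pygoscelis adeliae)',
                'Gentoo penguin (Pygoscelis papua)',
                'Chinstrap penguin (Pygoscelis antarctica)'],
    'Culmen Length (mm)': [39.0, np.nan, 46.0],
    'Culmen Depth (mm)': [18.0, 14.0, 17.0],
    'Body Mass (g)': [3700.0, 5000.0, 3800.0],
    'Flipper Length (mm)': [181.0, 215.0, 195.0],
})

COLORS = {'Adelie': 'blue', 'Gentoo': 'green', 'Chinstrap': 'red'}

def simplify(df):
    """Hypothetical helper: apply the four simplification steps in one pass."""
    out = pd.DataFrame({
        'Sample': df['Sample Number'],
        # First word of the species name, e.g. 'Adelie'
        'Type': df['Species'].str.split().str[0],
        'C-Length': df['Culmen Length (mm)'],
        'C-Depth': df['Culmen Depth (mm)'],
        'Mass': df['Body Mass (g)'],
        'Flipper': df['Flipper Length (mm)'],
    })
    out['Color'] = out['Type'].map(COLORS)
    # Drop every row that has at least one missing value
    return out.dropna()

toy = simplify(raw)
```

In this toy frame the second row has a missing culmen length, so it is the one removed by `dropna()`.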

In [64]:
simple_df=pd.DataFrame()

In [65]:
simple_df['Sample']=penguins_df['Sample Number']
simple_df['Type']=penguins_df['Species'].apply(lambda x: x.split()[0])

In [66]:
# Map the simplified species name created above to a plotting color
simple_df['Color']=simple_df['Type'].map({'Adelie':'blue','Gentoo':'green','Chinstrap':'red'})

#### Simplify some of the field names

In [67]:
simple_df['C-Length'] = penguins_df['Culmen Length (mm)']
simple_df['C-Depth'] = penguins_df['Culmen Depth (mm)']
simple_df['Mass'] = penguins_df['Body Mass (g)']
simple_df['Flipper'] = penguins_df['Flipper Length (mm)']

In [68]:
simple_df.head()

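| | Sample | Type | Color | C-Length | C-Depth | Mass | Flipper |
|---|---|---|---|---|---|---|---|
| 0 | 1 | Adelie | blue | 39.1 | 18.7 | 3750.0 | 181.0 |
| 1 | 2 | Adelie | blue | 39.5 | 17.4 | 3800.0 | 186.0 |
| 2 | 3 | Adelie | blue | 40.3 | 18.0 | 3250.0 | 195.0 |
| 3 | 4 | Adelie | blue | NaN | NaN | NaN | NaN |
| 4 | 5 | Adelie | blue | 36.7 | 19.3 | 3450.0 | 193.0 |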


In [69]:
simple_df.shape

(344, 7)

In [70]:
simple_df.dropna().shape

(342, 7)

In [71]:
simple_df = simple_df.dropna()

#### Make some scatter plots to explore the data

Look at the plots and evaluate where the clouds of data points might be linearly separable.  The example in the notes
was Culmen Depth vs Body Mass for Adelie and Gentoo penguins.  Are there other candidates?

In [72]:
data = ColumnDataSource(simple_df)

In [89]:
f=figure(title='Culmen Length vs Body Mass')
f.scatter(x='Mass',y='C-Length',color='Color',source=data,legend_group='Type')
f.legend.location='top_left'
show(f)

In [90]:
f=figure(title='Culmen Depth vs Body Mass')
f.scatter(x='Mass',y='C-Depth',color='Color',source=data,legend_group='Type')
show(f)

In [91]:
f=figure(title='Flipper Length vs Body Mass')
f.scatter(x='Mass',y='Flipper',color='Color',source=data,legend_group='Type')
show(f)

## An SVM example

Let's consider the case of Gentoo vs Chinstrap penguins, using Flipper length and body mass.  The associated scatter plot looks like this.

In [108]:
gc_df = simple_df[(simple_df.Type=='Gentoo') | (simple_df.Type=='Chinstrap')]

In [110]:
data2 = ColumnDataSource(gc_df)

In [113]:
f=figure(title='Flipper length vs Body Mass')
f.scatter(x='Mass',y='Flipper',color='Color',legend_group='Type',source=data2)
show(f)

Notice that the red and green points are not *linearly separable*: no straight line puts all of the Chinstrap points on one side and all of the Gentoo points on the other.  So a hard-margin SVM classifier isn't going to be able
to separate the two types of penguins based on their flipper length and body mass.  To look more closely, let's plot the convex hulls of the two sets.

In [114]:
from scipy.spatial import ConvexHull

First extract the data points corresponding to the Gentoo penguins (`pts`) and the Chinstrap penguins (`pts2`).

In [311]:
pts = gc_df[gc_df['Type']=='Gentoo'][['Mass','Flipper']].values
pts2 = gc_df[gc_df['Type']=='Chinstrap'][['Mass','Flipper']].values

The scipy `ConvexHull` class computes the convex hull of a set of points.

In [312]:
hull1=ConvexHull(pts)
hull2 = ConvexHull(pts2)

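The `vertices` attribute of each hull holds the indices of the vertex points in the original array (`pts` or `pts2`), in counterclockwise order.  So to draw the convex hull we connect consecutive vertices with line segments, starting from the last vertex so that the outline closes.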
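
Because the vertex indices already trace the boundary in order, one alternative to drawing segment by segment is to repeat the first vertex at the end and index the point array once. The sketch below uses a made-up unit square with one interior point; `hull_outline` is a hypothetical helper, not part of scipy.

```python
import numpy as np
from scipy.spatial import ConvexHull

def hull_outline(points):
    """Hypothetical helper: x and y coordinates of the closed hull boundary."""
    hull = ConvexHull(points)
    # Repeat the first vertex so the polygon closes on itself
    loop = np.append(hull.vertices, hull.vertices[0])
    return points[loop, 0], points[loop, 1]

# Made-up data: a unit square plus one interior point
square = np.array([[0.0, 0.0], [1.0, 0.0], [1.0, 1.0], [0.0, 1.0], [0.5, 0.5]])
xs, ys = hull_outline(square)
```

The returned arrays could then be passed to a single `f.line(x=xs, y=ys)` call per hull.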

In [313]:
f=figure()
f.scatter(x='Mass',y='Flipper',color='Color',source=data2)
x0=hull1.vertices[-1] # the last element, so the first segment closes the loop
for x in hull1.vertices:
    f.line(x=[pts[x0][0],pts[x][0]],y=[pts[x0][1],pts[x][1]],line_width=2,line_dash='dashed')
    x0=x
x0 = hull2.vertices[-1]
for x in hull2.vertices:
    f.line(x=[pts2[x0][0],pts2[x][0]],y=[pts2[x0][1],pts2[x][1]],line_width=2,line_dash='dashed')
    x0=x
show(f)

Notice that some of the red points are inside the convex hull of the green points, and the two convex hulls overlap.
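
The overlap can also be checked numerically rather than by eye. A minimal sketch, assuming we only need a point-in-hull test: each row of scipy's `hull.equations` is a facet's outward normal followed by an offset, and a point is inside the hull when normal · p + offset is non-positive for every facet. The helper name `inside_hull` and the toy square are made up for illustration.

```python
import numpy as np
from scipy.spatial import ConvexHull

def inside_hull(hull, points, tol=1e-9):
    """Hypothetical helper: True where a point lies inside or on the hull."""
    # Each row of hull.equations is [normal, offset]; interior points satisfy
    # normal . p + offset <= 0 for every facet.
    normals = hull.equations[:, :-1]
    offsets = hull.equations[:, -1]
    return np.all(points @ normals.T + offsets <= tol, axis=1)

# Made-up data: a square hull and three test points
square = ConvexHull(np.array([[0.0, 0.0], [2.0, 0.0], [2.0, 2.0], [0.0, 2.0]]))
tests = np.array([[1.0, 1.0], [3.0, 1.0], [0.5, 1.5]])
mask = inside_hull(square, tests)
```

With the arrays from this section, `inside_hull(hull1, pts2).sum()` would count the Chinstrap points that fall inside the Gentoo hull.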

Nevertheless, one can come up with a linear classifier (support vector machine) that mostly works.

In [316]:
from sklearn.svm import SVC

In [317]:
classes=gc_df['Type'].map({'Gentoo':0,'Chinstrap':1}).values
pts = gc_df[['Mass','Flipper']].values

In [319]:
classifier = SVC(kernel='linear').fit(X=pts,y=classes)

The attribute `classifier.coef_` is a 2-D array `[[A, B]]` of shape (1, 2) and `classifier.intercept_` is a one-element array `[C]`; together
they define the line Ax+By+C=0 that is the "best" separating hyperplane.
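
To plot the line we solve Ax+By+C=0 for y, which gives y = -(Ax+C)/B whenever B is nonzero. The sketch below does this on six made-up points rather than the penguin data; the names `clf` and `boundary_y` are hypothetical and chosen only for this example.

```python
import numpy as np
from sklearn.svm import SVC

# Made-up, clearly separated points (not penguin data)
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0],
              [3.0, 3.0], [4.0, 3.0], [3.0, 4.0]])
labels = np.array([0, 0, 0, 1, 1, 1])

clf = SVC(kernel='linear').fit(X, labels)

A, B = clf.coef_[0]      # coef_ has shape (1, 2)
C = clf.intercept_[0]    # intercept_ has shape (1,)

def boundary_y(x):
    """Solve A*x + B*y + C = 0 for y (requires B != 0)."""
    return -(A * x + C) / B

xs = np.linspace(0.0, 4.0, 5)
ys = boundary_y(xs)
```

Every point `(xs[i], ys[i])` lies on the decision boundary, so `clf.decision_function` evaluates to approximately zero there.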

In [320]:
classifier.coef_

array([[-0.00687571, -0.93496601]])

In [321]:
classifier.intercept_

array([223.87704146])

We can add this line to the picture above.

In [322]:
x=np.linspace(2500,7000,100)

In [323]:
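# Solve Ax+By+C=0 for y, reading A, B, C from the fitted classifier
y=-(classifier.coef_[0,0]*x+classifier.intercept_[0])/classifier.coef_[0,1]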

In [324]:
f.line(x=x,y=y)

In [325]:
show(f)

Some red points are above the line, and some green are below; but this is in some sense the "best" separating line.
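
How many points end up on the wrong side can be counted with a confusion matrix. A minimal sketch on made-up data, where one class-0 point is deliberately placed inside the class-1 cluster so that no straight line can get every point right:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics import confusion_matrix

# Made-up data: the last class-0 point sits inside the class-1 cluster,
# so no straight line can classify every point correctly.
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [5.0, 5.0],
              [4.0, 4.0], [5.0, 4.0], [4.0, 5.0], [6.0, 6.0], [5.0, 6.0]])
labels = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])

clf = SVC(kernel='linear').fit(X, labels)
predicted = clf.predict(X)

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(labels, predicted)
n_errors = int((predicted != labels).sum())
```

With the notebook's arrays, `confusion_matrix(classes, classifier.predict(pts))` would give the corresponding counts for Gentoo (0) and Chinstrap (1).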

This line is found by considering the *reduced convex hulls* of the two sets.  If X is a set of points, then the reduced convex hull with parameter 0<K<1 is the set of convex combinations of points from X in which the weight attached to any single point is at most K.  Capping the weights keeps the hull away from outlying points, so by
shrinking K you can get the two reduced convex hulls to separate and then find the separating line between them, just as in the linearly separable case.
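
Written out as a formula, with the notation $R(X,K)$ introduced here only for illustration and $X=\{x_1,\dots,x_n\}$:

$$
R(X,K) \;=\; \left\{\, \sum_{i=1}^{n} \lambda_i x_i \;:\; \sum_{i=1}^{n} \lambda_i = 1,\quad 0 \le \lambda_i \le K \,\right\}
$$

Setting $K=1$ removes the cap and gives back the ordinary convex hull. The weights must sum to 1, so the set is non-empty only when $K \ge 1/n$, and at $K = 1/n$ every weight is forced to equal $1/n$, leaving just the centroid of $X$. Shrinking $K$ therefore pulls each hull in toward the centroid of its own point set.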

Following the ideas outlined above, look at some of the other combinations of variables to see if they are linearly separable and what the classifying hyperplane looks like.
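
As a starting point for that exploration, one possible approach is to fit a linear SVM to every pair of features and compare training accuracies. The sketch below is an assumption about how one might organize this, not part of the original notes; `pairwise_linear_scores` is a hypothetical helper and the small frame is made up.

```python
from itertools import combinations
import pandas as pd
from sklearn.svm import SVC

def pairwise_linear_scores(df, features, label_col='Type'):
    """Hypothetical helper: training accuracy of a linear SVM for each feature pair."""
    scores = {}
    for a, b in combinations(features, 2):
        X = df[[a, b]].values
        y = df[label_col].values
        scores[(a, b)] = SVC(kernel='linear').fit(X, y).score(X, y)
    return scores

# Made-up frame: 'u' separates the two labels cleanly, 'v' and 'w' do not
toy = pd.DataFrame({
    'Type': ['a', 'a', 'a', 'b', 'b', 'b'],
    'u': [0.0, 1.0, 2.0, 10.0, 11.0, 12.0],
    'v': [0.0, 5.0, 1.0, 0.0, 5.0, 1.0],
    'w': [3.0, 1.0, 2.0, 3.0, 1.0, 2.0],
})
scores = pairwise_linear_scores(toy, ['u', 'v', 'w'])
```

On the penguin data one could call, for example, `pairwise_linear_scores(gc_df, ['C-Length', 'C-Depth', 'Mass', 'Flipper'])`. A training accuracy of 1.0 means the fitted line separates the two types completely; a lower score is only suggestive, since the default soft-margin fit may accept a few errors even when a separating line exists.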