## A look at some of the algorithms in the sklearn library

The scikit-learn library has a large collection of algorithms for machine learning.  Here are a few examples of the things you can do.

### Linear Regression

In [2]:

import numpy as np
from bokeh.plotting import figure
from bokeh.io import  show, output_notebook
from bokeh.models import HoverTool, ColumnDataSource

output_notebook()

Let's create some data for a linear model.  The basic rule is 
$$
y = \sum_{i=1}^{3} a_{i}x_{i} + b + \epsilon
$$
where $\epsilon$ is a standard normal random variable.  Let's use three features with coefficients $a = (-3, 1, 2)$ and intercept $b = 0.7$, and draw $N=100$ samples with each $x_{i}$ uniformly distributed between $-10$ and $10$.  We'll use the `LinearRegression` class to fit the model.

In [3]:
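N = 100  # number of samples
x = np.random.uniform(-10, 10, size=(N, 3))
b = .7
y_true = x @ np.array([-3, 1, 2]) + b      # noise-free targets
y = y_true + np.random.normal(0, 1, N)     # add standard normal noise
# Note: this is ||y - y_true|| / N, i.e. RMSE / sqrt(N), not the mean squared error.
scaled_err = np.linalg.norm(y - y_true) / N
print(scaled_err)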

0.10668516097372419
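The quantity printed above is the Euclidean norm of the noise divided by $N$, which is smaller than the usual mean squared error. As a minimal sketch of how the two relate, the block below regenerates similar data (the seed and the use of `default_rng` are choices made here for reproducibility, not part of the original cell) and compares the scaled norm with the MSE and RMSE from `sklearn.metrics`.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)  # seed is an assumption, used only for reproducibility
N = 100
x = rng.uniform(-10, 10, size=(N, 3))
y_true = x @ np.array([-3, 1, 2]) + 0.7
y = y_true + rng.normal(0, 1, N)

scaled_norm = np.linalg.norm(y - y_true) / N  # same formula as the cell above
mse = mean_squared_error(y_true, y)           # mean of the squared residuals
rmse = np.sqrt(mse)                           # root mean squared error

# scaled_norm equals rmse / sqrt(N), so it shrinks as N grows
print(scaled_norm, mse, rmse)
```

With standard normal noise the MSE should come out near 1, while the scaled norm is near $1/\sqrt{N} = 0.1$.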


In [4]:
from sklearn.linear_model import LinearRegression

In [5]:
L = LinearRegression()

In [6]:
L.fit(x,y)

In [7]:
print(f"coeffs = {L.coef_}, intercept = {L.intercept_}")

coeffs = [-2.97241325  1.01039452  2.0146124 ], intercept = 0.6822679222330252


In [8]:
u = np.array([[2,2,2]])
L.predict(u)

array([0.78745527])
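The prediction is just the fitted affine map applied to `u`: with the coefficients printed above, $2\cdot(-2.9724 + 1.0104 + 2.0146) + 0.6823 \approx 0.7875$. Here is a small sketch that checks this on freshly generated data (the seed and the name `model` are assumptions made for this example).

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)  # seed is an assumption, used only for reproducibility
x = rng.uniform(-10, 10, size=(100, 3))
y = x @ np.array([-3, 1, 2]) + 0.7 + rng.normal(0, 1, 100)

model = LinearRegression().fit(x, y)

u = np.array([[2, 2, 2]])
# predict() computes u @ coef_ + intercept_ for a plain linear regression
by_hand = u @ model.coef_ + model.intercept_
print(model.predict(u), by_hand)
```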

In [9]:
error = np.linalg.norm(L.predict(x)-y)/N

In [10]:
L.score(x,y)

0.9972960535172569
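For a regressor, `score` returns the coefficient of determination $R^2 = 1 - \mathrm{SS}_{res}/\mathrm{SS}_{tot}$, so a value close to 1 means the fit explains almost all of the variance in `y`. The sketch below recomputes it by hand on regenerated data (seed and variable names are assumptions for this example) and compares against `r2_score`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)  # seed is an assumption, used only for reproducibility
x = rng.uniform(-10, 10, size=(100, 3))
y = x @ np.array([-3, 1, 2]) + 0.7 + rng.normal(0, 1, 100)

model = LinearRegression().fit(x, y)
pred = model.predict(x)

residual_ss = np.sum((y - pred) ** 2)    # unexplained variation
total_ss = np.sum((y - y.mean()) ** 2)   # total variation around the mean
r2_by_hand = 1 - residual_ss / total_ss

print(r2_by_hand, model.score(x, y), r2_score(y, pred))
```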

In [11]:
np.linalg.norm(L.predict(x)-y)/N

0.10497287472266592

### Classification

In [12]:
# Three 2-D Gaussian clusters (100 points each) with different means and covariances
x0 = np.random.multivariate_normal([0, 0], [[1, .75],[.75, 1]], 100)
x1 = np.random.multivariate_normal([1, -4], [[1, 0],[0, 1]], 100)
x2 = np.random.multivariate_normal([-2,-2],[[1,-.5],[-.5,1]],100)


In [22]:
F = figure()
F.scatter(x=x0[:,0],y=x0[:,1],color='red')
F.scatter(x=x1[:,0],y=x1[:,1],color='blue')
F.scatter(x=x2[:,0],y=x2[:,1],color='green')
show(F)


In [16]:
from sklearn.neighbors import KNeighborsClassifier
colors = ['red','blue','green']
X = np.vstack([x0,x1,x2])
Y = np.array([0]*100+[1]*100+[2]*100)
training_data = ColumnDataSource(data=dict(x=X[:,0],y=X[:,1],color=['red']*100+['blue']*100+['green']*100))


KN = KNeighborsClassifier(n_neighbors=5)
KN.fit(X,Y)
KN

In [17]:
# Ten fresh points per class drawn from the same three distributions
y0 = np.random.multivariate_normal([0, 0], [[1, .75],[.75, 1]], 10)
y1 = np.random.multivariate_normal([1, -4], [[1, 0],[0, 1]], 10)
y2 = np.random.multivariate_normal([-2,-2],[[1,-.5],[-.5,1]],10)
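One natural use for fresh samples like these is to estimate how well the classifier generalizes. The following is a sketch only: the helper `sample`, the seed, and the names `clf`, `X_test`, and `Y_test` are assumptions introduced here, and the data is regenerated so the block runs on its own.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)  # seed is an assumption, used only for reproducibility
means = [[0, 0], [1, -4], [-2, -2]]
covs = [[[1, .75], [.75, 1]], [[1, 0], [0, 1]], [[1, -.5], [-.5, 1]]]

def sample(n):
    """Draw n points from each of the three Gaussians, with labels 0, 1, 2."""
    X = np.vstack([rng.multivariate_normal(m, c, n) for m, c in zip(means, covs)])
    Y = np.repeat([0, 1, 2], n)
    return X, Y

X_train, Y_train = sample(100)  # training set, as in the cells above
X_test, Y_test = sample(10)     # held-out points from the same distributions

clf = KNeighborsClassifier(n_neighbors=5).fit(X_train, Y_train)
accuracy = clf.score(X_test, Y_test)  # fraction of test points classified correctly
print(accuracy)
```

For a classifier, `score` reports mean accuracy, unlike the $R^2$ it reports for a regressor.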

In [27]:
test = np.array([-1,-2.5]).reshape(1,2)
predicted = KN.predict(test)
neighbors = KN.kneighbors(test)
neighbors[1][0]

array([211, 264, 200, 281,  83])
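The indices returned by `kneighbors` point into the training set: four of the five listed above fall in the range 200 to 299 (class 2, green) and one falls in 0 to 99 (class 0, red), so a majority vote gives class 2. Here is a sketch of that vote on regenerated data (the seed and variable names are assumptions for this example), assuming the default uniform weighting.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)  # seed is an assumption, used only for reproducibility
x0 = rng.multivariate_normal([0, 0], [[1, .75], [.75, 1]], 100)
x1 = rng.multivariate_normal([1, -4], [[1, 0], [0, 1]], 100)
x2 = rng.multivariate_normal([-2, -2], [[1, -.5], [-.5, 1]], 100)
X = np.vstack([x0, x1, x2])
Y = np.repeat([0, 1, 2], 100)

clf = KNeighborsClassifier(n_neighbors=5).fit(X, Y)

test = np.array([[-1, -2.5]])
distances, indices = clf.kneighbors(test)  # each has shape (1, 5), nearest first
neighbor_labels = Y[indices[0]]            # class labels of the five neighbors

# Count votes per class; argmax picks the most common label
votes = np.bincount(neighbor_labels, minlength=3)
vote = votes.argmax()
print(neighbor_labels, votes, vote, clf.predict(test))
```

The vote counts divided by 5 should also match what `predict_proba` reports for the same point.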

In [28]:

datadict = {'x':test[:,0],'y':test[:,1],'color':[colors[i] for i in predicted]}
source = ColumnDataSource(data=datadict)
print(source.to_df())
print(neighbors[1])
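G=figure()
# Test point, colored by its predicted class
G.scatter(x='x',y='y',color='color',source=source,size=10)
# Training data, colored by true class
G.scatter(x='x',y='y',color='color',source=training_data)
# Highlight the five nearest neighbors of the test point
G.scatter(x=X[neighbors[1][0]][:,0],y=X[neighbors[1][0]][:,1],size=10,alpha=.3)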

show(G)

     x    y  color
0 -1.0 -2.5  green
[[211 264 200 281  83]]
