## Relationship among problem complexity, model complexity, and data set size
What are these things?  Problem complexity comes with the data.  A complex problem may push you toward a complicated model, but you cannot always use one.  Why is that?  Simple models have few parameters to adjust; complicated models have many.  The size of your data set determines how complicated a model you can fit.  If you have many rows of data, you can pin down reliable values for many parameters.  If you have only a few rows, you can only fit a simple model.  The code below illustrates this.


In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from math import ceil
from sklearn.datasets import load_iris
from pandas.plotting import scatter_matrix

class KNN_tutorial(object):

	def __init__(self):
		self.data = []
		self.models = None
		self.X_train, self.X_test, self.y_train, self.y_test = None, None, None, None	
		self.predictions = None	
		self.scores = None
		self.k = None
		self.centers = []

	def load_data(self,s = 500,center_scale=0.7,center_points=5,surround_points=20,point_scale=0.3,centers = [[0,1],[1,0]], labels = [1,0]):
		'''
		Input:
			- s: (int) The value which the random seed will be set to
			- center_scale: (float) The standard deviation for the distribution of points around the two specified center points
			- center_points: (int) Number of points around each of the initial center points
			- surround_points: (int) Number of points to surround the points generated around the centers 
			- point_scale: (float) Standard deviation for the points surrounding the points generated around the centers
			- centers: (2D Array) Location of the first center points
			- labels: (Array) Labels for the algorithm to classify
		Output:
			- Randomly generated data created from the given parameters
		'''
		self.data = []
		self.models = None
		self.X_train, self.X_test, self.y_train, self.y_test = None, None, None, None	
		self.predictions = None	
		self.scores = None
		self.k = None
		self.centers = []
		np.random.seed(seed = s)
	
		for center,label in zip(centers,labels):
			generate_points = np.random.normal(loc=center,
							   scale=center_scale,
							   size = (center_points,2))

			generate_points_labels = np.insert(generate_points,2,label,axis = 1)
			self.centers.append(generate_points_labels)

			clusters = []
			for center_point in generate_points:
				normal_clusters = np.random.normal(loc=center_point,
								   scale=point_scale,
								   size = (surround_points,2))

				normal_clusters = np.insert(normal_clusters,2,label,axis = 1)
				clusters.append(normal_clusters)

			clusters = np.array(clusters).reshape(-1,3)
			self.data.append(clusters) 

		self.data = np.array(self.data).reshape(-1,3)
		self.centers = np.array(self.centers).reshape(-1,3)
		np.random.shuffle(self.data)		

	def plot_data(self):
		plt.scatter(self.data[:,0],self.data[:,1],c=self.data[:,2],cmap=plt.cm.Blues)
		plt.show()

	def plot_centers(self):
		plt.scatter(self.centers[:,0],self.centers[:,1],c=self.centers[:,2],cmap=plt.cm.Blues)
		plt.show()	

	def KNN_fit(self,k):
		self.X_train, self.X_test, self.y_train, self.y_test = \
					train_test_split(self.data[:,:2],self.data[:,2])
		self.k = np.array(k)	
		self.models = []
		for k_neighbors in k:
			knn = KNeighborsClassifier(n_neighbors = k_neighbors) 
			knn.fit(self.X_train,self.y_train)
			self.models.append(knn)


	def KNN_predict(self,data):
		self.predictions = []
		for model in self.models:
			predictions = model.predict(data)
			self.predictions.append(predictions)

	def KNN_score(self):
		self.scores = []				
		for model in self.models:
			train_scores = model.score(self.X_train,self.y_train)
			test_scores = model.score(self.X_test,self.y_test)
			self.scores.append([train_scores,test_scores])
		self.scores = np.array(self.scores) 

	def plot_scores(self):
		plt.plot(len(self.data) - self.k, 1 - self.scores[:,0], c='r', label='Training Data')
		plt.plot(len(self.data) - self.k, 1 - self.scores[:,1], c='g', label='Test Data')
		plt.legend()
		plt.xlabel('Complexity')
		plt.ylabel('Error')
		plt.show()
		
	def plot_decision_boundary(self):
		rows = ceil(len(self.k)/2.)
		h = 0.02
		x_min, x_max = self.data[:,0].min() - 1, self.data[:,0].max() + 1
		y_min, y_max = self.data[:,1].min() - 1, self.data[:,1].max() + 1
		xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
				     np.arange(y_min, y_max, h))

		
		for i,model in enumerate(self.models):
			plt.subplot(rows,2,i + 1)
			plt.title('Number of Neighbors: ' + str(self.k[i]))
			plt.scatter(self.data[:,0],self.data[:,1],c=self.data[:,2], cmap=plt.cm.Blues)
			Z = model.predict(np.c_[xx.ravel(), yy.ravel()])
			Z = Z.reshape(xx.shape)
			plt.contour(xx,yy,Z)		
		plt.tight_layout(pad=0.8)
		plt.show()

	def load_wine_data(self):
		pass

	def iris_scatter_matrix(self):
		iris = load_iris()
		column_names = iris.feature_names 
		iris_data = pd.DataFrame(iris.data,columns = column_names)
		iris_target = iris.target
		scatter_matrix(iris_data,c=iris_target) 
		plt.show()

In [9]:
knn = KNN_tutorial()
knn.load_data(centers=[[1,0],[0,1],[1,1]],center_scale=0.4,point_scale=0.3,s=631,center_points=5,surround_points=20,labels=[0,1,2])
knn.KNN_fit([2,5,6,7,51])
knn.KNN_score()

In [11]:
knn.plot_scores()


## Machine Learning with `sklearn`

`sklearn` is a best-in-breed machine learning library for Python that we will use extensively in this class.  It also has one of the best API designs out there (there is even a [paper](http://arxiv.org/pdf/1309.0238.pdf) written about the design) and is very modular and flexible.  That flexibility comes with a bit of a learning curve, but once you can think in the `sklearn` way for one algorithm/model you can apply that general knowledge to any model.

In [2]:
import pandas as pd
import numpy as np

### Getting Data

Typically you will be working with an external dataset, and even if it is clean you will need to manipulate/transform it to create features.  You will usually load the dataset with something like `numpy` or `pandas`.

We will be performing a simple linear regression on a Lending Club [dataset](https://www.lendingclub.com/info/download-data.action) of interest rates for individual loans.  To start we will need to slightly prepare our data with `pandas` to get it ready for our model.

In [5]:
df = pd.read_csv('loanf.csv')

df.head()

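|    | Interest.Rate | FICO.Score | Loan.Length | Monthly.Income | Loan.Amount |
|----|---------------|------------|-------------|----------------|-------------|
| 6  | 15.31         | 670        | 36          | 4891.67        | 6000        |
| 11 | 19.72         | 670        | 36          | 3575.0         | 2000        |
| 12 | 14.27         | 665        | 36          | 4250.0         | 10625       |
| 13 | 21.67         | 670        | 60          | 14166.67       | 28000       |
| 21 | 21.98         | 665        | 36          | 6666.67        | 22000       |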


In [3]:
np.sum(df.isnull())

Interest.Rate     0
FICO.Score        0
Loan.Length       0
Monthly.Income    1
Loan.Amount       0
dtype: int64

In [6]:
df = df.dropna(axis=0)

In [7]:
np.sum(df.isnull())

Interest.Rate     0
FICO.Score        0
Loan.Length       0
Monthly.Income    0
Loan.Amount       0
dtype: int64

#### Getting a feature matrix

Remember from lecture that for any machine learning model we have **Features** (or a feature matrix) and a **Target** (the response or dependent variable, in statistics parlance).  In the `sklearn` API we need to separate these from our initial data matrix.

> NOTE: `sklearn` expects a `numpy` array/matrix as input. If you pass in a `DataFrame`, `sklearn` can often convert/coerce it into a `numpy` array without trouble, but it is a best practice to do this conversion yourself.

In [8]:
features = df.iloc[:, 1:]
features.head()

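|    | FICO.Score | Loan.Length | Monthly.Income | Loan.Amount |
|----|------------|-------------|----------------|-------------|
| 6  | 670        | 36          | 4891.67        | 6000        |
| 11 | 670        | 36          | 3575.0         | 2000        |
| 12 | 665        | 36          | 4250.0         | 10625       |
| 13 | 670        | 60          | 14166.67       | 28000       |
| 21 | 665        | 36          | 6666.67        | 22000       |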


In [9]:
labels = df.iloc[:, 0]
labels.head()

6     15.31
11    19.72
12    14.27
13    21.67
21    21.98
Name: Interest.Rate, dtype: float64

In [11]:
# convert the DataFrame/Series to numpy arrays for sklearn
X = features.values
y = labels.values

In [12]:
print("Features: \n", X)
print("\n\nLabels: \n", y)

Features: 
[[   670.       36.     4891.67   6000.  ]
 [   670.       36.     3575.     2000.  ]
 [   665.       36.     4250.    10625.  ]
 ..., 
 [   810.       36.     9250.    27000.  ]
 [   765.       36.     7083.33  25000.  ]
 [   740.       60.     8903.25  16000.  ]]


Labels: 
[ 15.31  19.72  14.27 ...,   6.62  10.75  14.09]


### The API

`sklearn` has a **very** Object Oriented interface and it is important to be aware of this when building models.  (Almost) every model/transform/object in `sklearn` is an `Estimator` object.  What is an `Estimator`?

In [13]:
class Estimator(object):
  
    def fit(self, X, y=None):
        """Fit model to data X (and y)"""
        self.some_attribute = self.some_fitting_method(X, y)
        return self
            
    def predict(self, X_test):
        """Make prediction based on passed features"""
        pred = self.make_prediction(X_test)
        return pred
    
model = Estimator()

The `Estimator` class defines a `fit()` method as well as a `predict()` method.  For an instance of an `Estimator` stored in a variable `model`:

* `model.fit`: fits the model with the passed in training data.  For supervised models, it also accepts a second argument `y` that corresponds to the labels (`model.fit(X, y)`).  For unsupervised models, there are no labels so you only need to pass in the feature matrix (`model.fit(X)`).
    > Since the interface is very OO, the instance itself stores the results of the `fit` internally.  As such, you must always `fit()` before you `predict()` on the same object.
* `model.predict`: predicts labels for any new datapoints passed in (`model.predict(X_test)`) and returns an array of predicted labels, one per row of the input.
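
To make the `fit()`/`predict()` contract concrete, here is a minimal runnable sketch. `MeanRegressor` is a hypothetical toy class written for this example, not something that ships with `sklearn`: it "learns" only the mean of the training labels and predicts that value for every row.

```python
import numpy as np

class MeanRegressor(object):
    """Toy estimator (hypothetical, not part of sklearn): always predicts the training mean."""

    def fit(self, X, y):
        # "Learning" here is just remembering the mean of the training labels.
        # The fitted state lives on the instance, as described above.
        self.mean_ = np.mean(y)
        return self

    def predict(self, X):
        # One prediction per row of X.
        return np.full(len(X), self.mean_)

# Made-up training data
X_train = np.array([[1.0], [2.0], [3.0]])
y_train = np.array([10.0, 20.0, 30.0])
X_new = np.array([[4.0], [5.0]])

model = MeanRegressor()
model.fit(X_train, y_train)
predictions = model.predict(X_new)
print(predictions)

# Calling predict() on an instance that was never fit fails,
# because the fitted attribute does not exist yet.
unfitted = MeanRegressor()
try:
    unfitted.predict(X_new)
    failed_before_fit = False
except AttributeError:
    failed_before_fit = True
print("predict() before fit() fails:", failed_before_fit)
```

Real `sklearn` estimators behave in a similar spirit: the fitted state is stored on the object, so a fresh instance has nothing to predict with.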

There are 3(ish) kinds of estimator:

* Supervised
* Unsupervised
* Feature Processing

#### Supervised

Supervised estimators in addition to the above methods typically also have:

* `model.predict_proba`: For classifiers that have a notion of probability (or some measure of confidence in a prediction), this method returns those "probabilities".  The label with the highest probability is what is returned by the `model.predict()` method from above.
* `model.score`: For both classification and regression models, this method returns a measure of how well the model does on the data you pass in.  For example, in regression the default is typically $R^2$ and in classification it is accuracy.
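
A short sketch of these two methods, using `LogisticRegression` on the iris data that ships with `sklearn`. The choice of classifier, the `max_iter` value, and the `random_state` are assumptions made for this example only.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, random_state=0)

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)

# One row per sample, one column per class (column order follows clf.classes_).
proba = clf.predict_proba(X_test)
labels = clf.predict(X_test)

# Picking the most probable class by hand should agree with predict().
most_likely = clf.classes_[np.argmax(proba, axis=1)]
print("argmax of predict_proba matches predict:",
      np.array_equal(most_likely, labels))

# For a classifier, score() is the fraction of correct predictions.
accuracy = clf.score(X_test, y_test)
print("accuracy: %.3f" % accuracy)
```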

#### Unsupervised

Some estimators in the library implement what is referred to as the **transformer** interface.  Unsupervised in this case refers to any method that does not need labels, including (but not limited to) clustering, preprocessing (like tf-idf), dimensionality reduction, etc.

The **transformer** interface defines (usually) two additional methods:

* `model.transform`: Given an unsupervised model, transform the input into a new basis (or feature space). This accepts one argument (usually a feature matrix) and returns a matrix of the transformed input. Note: You need to `fit()` the model before you transform with it.
* `model.fit_transform`: For some models you may not need to `fit()` and `transform()` separately.  In these cases it is more convenient to do both at the same time.  And that is precisely what `fit_transform()` does!
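
As a sketch of the transformer interface, the example below uses `StandardScaler`, which rescales each column to zero mean and unit variance. The feature values are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up feature matrix: two columns on very different scales
X = np.array([[670.0, 36.0],
              [665.0, 60.0],
              [810.0, 36.0],
              [740.0, 60.0]])

# Two steps: learn the column means/scales, then apply them.
scaler = StandardScaler()
scaler.fit(X)
X_scaled = scaler.transform(X)

# One step: fit_transform() does both on the same data.
X_scaled_one_step = StandardScaler().fit_transform(X)
print("same result:", np.allclose(X_scaled, X_scaled_one_step))

# A fitted transformer can be reused on new rows; it applies the
# means/scales learned from X rather than re-learning them.
X_new = np.array([[700.0, 36.0]])
X_new_scaled = scaler.transform(X_new)
print(X_new_scaled)
```

Keeping `fit()` and `transform()` separate matters when you want to learn the transformation on training data and then apply the same transformation to test data.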

### Let's see this in action!

We will be trying to predict the loan **interest rate** based on the FICO score, loan length, monthly income, and loan amount:

$$\text{Interest.Rate} = \beta_0 + \beta_1 \cdot \text{FICO.Score} + \beta_2 \cdot \text{Loan.Length} + \beta_3 \cdot \text{Monthly.Income} + \beta_4 \cdot \text{Loan.Amount}$$
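
In `sklearn` terms, $\beta_0$ ends up in the fitted model's `intercept_` attribute and $\beta_1, \dots, \beta_4$ in `coef_`. The sketch below checks that a prediction is just the equation above evaluated row by row. It uses synthetic data: the feature ranges and the "true" coefficients are invented for this example and are not taken from the Lending Club file.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(0)
n_rows = 200

# Made-up columns standing in for FICO score, loan length,
# monthly income and loan amount.
X = np.column_stack([
    rng.uniform(640, 830, size=n_rows),
    rng.choice([36, 60], size=n_rows),
    rng.uniform(1000, 15000, size=n_rows),
    rng.uniform(1000, 35000, size=n_rows),
])

# Invented coefficients used only to generate a target with noise.
true_beta0 = 70.0
true_betas = np.array([-0.09, 0.13, 0.0, 0.0001])
y = true_beta0 + X.dot(true_betas) + rng.normal(scale=2.0, size=n_rows)

model = LinearRegression().fit(X, y)

# beta_0 + beta_1 * x_1 + ... + beta_4 * x_4 for every row
by_hand = model.intercept_ + X.dot(model.coef_)
print("matches model.predict:", np.allclose(by_hand, model.predict(X)))
print("intercept:", model.intercept_)
print("coefficients:", model.coef_)
```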

In [14]:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

In [15]:
X_train, X_test, y_train, y_test = train_test_split(X, y)

In [16]:
print("The training split: \n")
print(len(X_train), len(y_train))
print("\n\nThe testing split: \n")
print(len(X_test), len(y_test))

The training split: 

1874 1874


The testing split: 

625 625


In [17]:
# create an instance of an estimator
clf = LinearRegression()

# fit the estimator (notice I do not save any return value in a variable)
clf.fit(X_train, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

In [18]:
# predict (but only after we have trained!)
predictions = clf.predict(X_test)
print(len(predictions))

625


In [19]:
# The coefficients
print('Coefficients: \n', clf.coef_)
# The mean squared error
print("\n\nMean squared error: %.2f"
      % np.mean((predictions - y_test) ** 2))

# R^2 score (coefficient of determination): 1 is perfect prediction
print('\n\nR^2 score: %.2f' % clf.score(X_test, y_test))

Coefficients: 
[ -8.61337860e-02   1.37349607e-01  -3.62467583e-05   1.45157401e-04]


Mean squared error: 4.28


R^2 score: 0.75
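
The two numbers reported above are the mean of the squared residuals and the value returned by `score()`, which for a regressor is $R^2$. The sketch below computes both by hand and compares them with the helpers in `sklearn.metrics`. It runs on synthetic data (the coefficients and noise level are made up), so the values it prints will not match the loan results.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(1)

# Synthetic regression problem with invented coefficients
X = rng.normal(size=(300, 4))
y = 3.0 + X.dot(np.array([1.0, -2.0, 0.5, 0.0])) + rng.normal(scale=1.0, size=300)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
model = LinearRegression().fit(X_train, y_train)
predictions = model.predict(X_test)

# Mean squared error: average squared residual
mse = np.mean((predictions - y_test) ** 2)

# R^2 = 1 - (residual sum of squares) / (total sum of squares)
ss_res = np.sum((y_test - predictions) ** 2)
ss_tot = np.sum((y_test - np.mean(y_test)) ** 2)
r2 = 1.0 - ss_res / ss_tot

print("MSE matches sklearn.metrics:",
      np.isclose(mse, mean_squared_error(y_test, predictions)))
print("R^2 matches sklearn.metrics:",
      np.isclose(r2, r2_score(y_test, predictions)))
print("R^2 matches model.score:",
      np.isclose(r2, model.score(X_test, y_test)))
```

Note that the mean squared error divides the residual sum of squares by the number of test rows, so the two quantities differ by that factor.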
