# Introduction to Machine Learning with SciKit Learn

**Machine Learning**

Jay Urbain, PhD

References:

- [scikit-learn](http://scikit-learn.org/stable/)

### SciKit Learn
We will be using [scikit-learn](http://scikit-learn.org/stable/) and related Python libraries to develop several machine learning applications.

scikit-learn provides a range of supervised and unsupervised learning algorithms in Python.

The library is built on SciPy (Scientific Python), which must be installed before you can use scikit-learn. We will also be using a number of additional libraries. All of these libraries are included with your Anaconda installation. The libraries are:

- **NumPy**: Base n-dimensional array package
- **SciPy**: Fundamental library for scientific computing
- **Matplotlib**: Comprehensive 2D/3D plotting
- **IPython**: Enhanced interactive console
- **SymPy**: Symbolic mathematics
- **pandas**: Data structures and analysis

*Note: Although the interface is Python, compiled libraries are used for performance: NumPy for array and matrix operations, LAPACK, LibSVM, and careful use of Cython.*
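
As a quick sanity check of an installation, the sketch below tries to import each library listed above and reports its version. The helper name `installed_versions` and the `LIBRARIES` list are illustrative assumptions, not part of any library; note that scikit-learn is imported under the name `sklearn`.

```python
import importlib

# Import names for the libraries listed above
# (scikit-learn's import name is "sklearn").
LIBRARIES = ["numpy", "scipy", "matplotlib", "IPython", "sympy", "pandas", "sklearn"]


def installed_versions(names):
    """Hypothetical helper: map each import name to its version string,
    "unknown" if the module has no __version__, or None if the import fails."""
    versions = {}
    for name in names:
        try:
            module = importlib.import_module(name)
            versions[name] = getattr(module, "__version__", "unknown")
        except ImportError:
            versions[name] = None
    return versions


for name, version in installed_versions(LIBRARIES).items():
    print(f"{name:12s} {version if version is not None else 'not installed'}")
```

If any line reports `not installed`, the library is missing from the active Python environment.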

------------

### Groups of machine learning models provided by scikit-learn include:
- **Clustering**: for grouping unlabeled data, with algorithms such as KMeans.
- **Cross Validation**: for estimating the performance of supervised models on unseen data.
- **Datasets**: for test data sets and for generating data sets with specific properties for investigating model behavior.
- **Dimensionality Reduction**: for reducing the number of attributes in data for summarization, visualization and feature selection.
- **Ensemble methods**: for combining the predictions of multiple supervised models.
- **Feature extraction**: for defining attributes in image and text data.
- **Feature selection**: for identifying meaningful attributes from which to create supervised models.
- **Parameter Tuning**: for getting the most out of supervised models.
- **Manifold Learning**: for summarizing and depicting complex multi-dimensional data.
- **Supervised Models**: a wide range of supervised models including generalized linear models, discriminant analysis, naive Bayes, lazy methods (such as nearest neighbors), neural networks, support vector machines and decision trees.
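
To make a few of these groups concrete, here is a minimal sketch that touches several of them using the bundled Iris data. The choice of estimators (`LogisticRegression`, `KMeans`, `PCA`) and every parameter value are illustrative assumptions rather than recommendations, and the imports assume a scikit-learn version that provides `sklearn.model_selection`.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score

# Datasets: a small bundled dataset (150 samples, 4 features, 3 classes).
X, y = load_iris(return_X_y=True)

# Supervised Models + Cross Validation:
# estimate classifier accuracy with 5-fold cross validation.
clf = LogisticRegression(max_iter=1000)
scores = cross_val_score(clf, X, y, cv=5)
print("mean CV accuracy:", scores.mean())

# Clustering: group the samples into 3 clusters without using the labels.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
cluster_ids = kmeans.fit_predict(X)
print("cluster sizes:", np.bincount(cluster_ids))

# Dimensionality Reduction: project the 4 features down to 2.
X_2d = PCA(n_components=2).fit_transform(X)
print("reduced shape:", X_2d.shape)

# Parameter Tuning: search over the regularization strength C.
param_grid = {"C": [0.1, 1.0, 10.0]}  # assumed candidate values
search = GridSearchCV(LogisticRegression(max_iter=1000), param_grid, cv=5)
search.fit(X, y)
print("best C:", search.best_params_["C"])
```

Each step follows the same pattern: construct an estimator, then call `fit` (or a helper that calls it), which is what makes it easy to swap one model for another.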

### SciKit Learn Example Datasets

The *scikit-learn* library ships with several small datasets. These datasets are useful for getting a handle on a given machine learning algorithm or library feature before using it in your own work. The code snippet below demonstrates how to load five of the pre-packaged datasets.

```python
# Load the packaged datasets
# Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2,
# so this cell requires an older scikit-learn release.

from sklearn.datasets import load_boston  # this is the housing data we've used in class
from sklearn.datasets import load_iris
from sklearn.datasets import load_diabetes
from sklearn.datasets import load_digits
from sklearn.datasets import load_linnerud

import numpy as np

# Boston house prices dataset (506 samples x 13 features, reals, regression)
boston = load_boston()
print(type(boston))
print(boston.data.shape)

# Iris flower dataset (150 samples x 4 features, reals, multi-class classification)
iris = load_iris()
print(iris.data.shape)

# Diabetes dataset (442 samples x 10 features, reals, regression)
diabetes = load_diabetes()
print(diabetes.data.shape)

# Handwritten digit dataset (1797 samples x 64 features, multi-class classification)
digits = load_digits()
print(digits.data.shape)

# Linnerud physiological and exercise dataset
# (20 samples x 3 exercise features, 3 physiological targets, multivariate regression)
linnerud = load_linnerud()
print(linnerud.data.shape)
```



```
<class 'sklearn.datasets.base.Bunch'>
(506, 13)
(150, 4)
(442, 10)
(1797, 64)
(20, 3)
```
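
Each loader returns a `Bunch`, a dictionary-like object whose keys can also be read as attributes. The sketch below, which uses the Iris data only as a convenient example, shows how to get at the feature matrix, the targets and their names.

```python
from sklearn.datasets import load_iris

iris = load_iris()

# A Bunch behaves like a dictionary whose keys are also attributes.
print(sorted(iris.keys()))

# Feature matrix (samples x features) and target vector (one label per sample).
X, y = iris.data, iris.target
print(X.shape, y.shape)

# Human-readable names for the columns of X and the values of y.
print(iris.feature_names)
print(iris.target_names)

# The first sample and the name of its class.
print(X[0], "->", iris.target_names[y[0]])
```

The same attributes (`data`, `target`) are available on the other bundled datasets, although not every dataset provides `target_names`.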


### Additional Online Resources for Machine Learning:

- scikit-learn's own documentation is very good and includes a [tutorial](http://scikit-learn.org/stable/tutorial/index.html).

- Another useful scikit-learn [tutorial](https://github.com/jakevdp/sklearn_tutorial).

- A [free online book](http://robotics.stanford.edu/people/nilsson/MLBOOK.pdf) by Stanford professor Nils J. Nilsson. *Note: pretty out of date.*



