# Introduction to Machine Learning, with Application in Scikit-Learn

*Paul Paczuski [pavopax.com](http://pavopax.com)*

## How can we use a *machine* to *learn* to solve a task (recognize handwritten digits)?

![eight](digit.png)

*The above is a handwritten digit, converted to a computer-readable format. Which digit do you think this is?*

In this machine learning demo, our objective is:
* Given images of handwritten digits, ask the machine to tell us what digits are shown

The steps in the process:
* Read in the raw data
* Learn the pattern (**fit a model**) using a bunch of known digit images and their labels (**training data**)
* Use the learned pattern on new data, to determine what digits are shown (**predict**)

We can perform this task with just a few lines of Python scikit-learn code.

## Read in the Raw Data

In [None]:
# set-up and load data
from sklearn import datasets

digits = datasets.load_digits()

In [None]:
print(digits.DESCR)

In [None]:
digits.data.shape

In [None]:
digits.data[0]

In [None]:
digits.data

In [None]:
digits.target
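Each row of `digits.data` is one image flattened into 64 numbers. The sketch below assumes the `images` attribute that `load_digits` also provides (it is not used elsewhere in this demo) and shows how the flattened row relates to the 8x8 pixel grid:

```python
from sklearn import datasets

digits = datasets.load_digits()

# digits.images holds the same pixels as digits.data, stored as 8x8 arrays
print(digits.images.shape)   # (n_samples, 8, 8)
print(digits.data.shape)     # (n_samples, 64)

# Reshaping one flattened row gives back the 8x8 pixel grid
first_image = digits.data[0].reshape(8, 8)
same_pixels = (first_image == digits.images[0]).all()
print(same_pixels)
```

The classifier below works on the flattened rows, so `digits.data` is what we pass to it.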

## Fit the Model

This is the "learn" part (Machine *Learning*).

We'll use a machine learning algorithm to find the patterns in this data ("fit the model").

We want to learn this relationship:

    Given a matrix of numbers -> which digit is it?

In [None]:
# here is one handwritten digit
digits.data[0]

In [None]:
# Pick an algorithm to do this: a support vector classifier (SVC), used here as a black box
from sklearn import svm

classifier = svm.SVC(gamma=0.001, C=100.)

In [None]:
# Now, FIT the model (learn the pattern of the data)
# We train on every image except the last one, which we hold out for prediction
# (fit() updates the "classifier" object in place)
classifier.fit(digits.data[:-1], digits.target[:-1])
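To make the effect of fitting concrete, here is a small sketch of what the call does to the object. It assumes scikit-learn's `NotFittedError` exception and the `classes_` attribute, neither of which appears elsewhere in this demo:

```python
from sklearn import datasets, svm
from sklearn.exceptions import NotFittedError

digits = datasets.load_digits()
classifier = svm.SVC(gamma=0.001, C=100.)

# Before fitting, the classifier has learned nothing and cannot predict
try:
    classifier.predict(digits.data[-1:])
    could_predict_before_fit = True
except NotFittedError:
    could_predict_before_fit = False

# fit() stores what it learned on the object and returns that same object
returned = classifier.fit(digits.data[:-1], digits.target[:-1])

print(could_predict_before_fit)
print(returned is classifier)
print(classifier.classes_)  # the labels seen during training
```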

## Use the Model for Prediction

Now we can predict new values.

In particular, we can ask our classifier which digit is shown in the last image of the digits dataset, which we held out when training the classifier:

In [None]:
print(digits.data[-1:])

In [None]:
# the algorithm's answer (prediction) is...
classifier.predict(digits.data[-1:])[0]

![eight](digit.png)

What do you think?
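Besides looking at the image, one way to check the answer is to compare the prediction with the label stored for that image. A minimal sketch, repeating the set-up from above so it runs on its own:

```python
from sklearn import datasets, svm

digits = datasets.load_digits()
classifier = svm.SVC(gamma=0.001, C=100.)
classifier.fit(digits.data[:-1], digits.target[:-1])

# The held-out image was never seen during training
predicted = classifier.predict(digits.data[-1:])[0]
actual = digits.target[-1]

print("predicted:", predicted, "actual:", actual)
```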

## Summary 

Here are the key steps in this process:

1. Read in the data
2. **Fit** the model using a learning algorithm
3. Use the model to **predict** on new data
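The three steps can be sketched end to end as follows. This variant goes slightly beyond the demo above: instead of holding out a single image, it assumes `train_test_split` and `accuracy_score` from scikit-learn to hold out a quarter of the data and measure how often the predictions are right. The split size and `random_state` are arbitrary choices for illustration.

```python
from sklearn import datasets, svm
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# 1. Read in the data
digits = datasets.load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.25, random_state=0)

# 2. Fit the model on the training portion only
classifier = svm.SVC(gamma=0.001, C=100.)
classifier.fit(X_train, y_train)

# 3. Predict on data the model has not seen
predictions = classifier.predict(X_test)

# Fraction of held-out images labelled correctly
accuracy = accuracy_score(y_test, predictions)
print(accuracy)
```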

# Two Types of Machine Learning

Below are the main types of machine learning, as well as some example algorithms.

## Supervised Learning
- When the target/outcome/y is known
- There are two classes of models, depending on the type of the target:
    - classification: when outcome is categorical
    - regression: when outcome is continuous
- **Example algorithms**: linear regression, logistic regression, support vector machine, random forest, k-nearest neighbors
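As a rough illustration of the two classes of supervised models, the sketch below fits a k-nearest neighbors classifier to the digits (categorical outcome) and a linear regression to made-up data following y = 2x + 1 (continuous outcome). The choice of `n_neighbors=3` and the synthetic data are assumptions for the example only.

```python
import numpy as np
from sklearn import datasets
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsClassifier

# Classification: the outcome is a category (a digit from 0 to 9)
digits = datasets.load_digits()
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(digits.data[:-1], digits.target[:-1])
predicted_digit = knn.predict(digits.data[-1:])[0]
print(predicted_digit)

# Regression: the outcome is a continuous number
x = np.arange(10, dtype=float).reshape(-1, 1)   # made-up inputs 0..9
y = 2.0 * x.ravel() + 1.0                       # made-up outcome y = 2x + 1
reg = LinearRegression().fit(x, y)
predicted_value = reg.predict([[10.0]])[0]
print(predicted_value)
```

Both models share the same `fit` and `predict` interface; only the type of the target differs.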

## Unsupervised Learning
- When the target/outcome/y is unknown (we want to find structure, such as "clusters", in the data)
- There are two main uses of unsupervised learning:
  - clustering
  - dimension reduction
- **Example algorithms**: k-means clustering, affinity propagation clustering, principal component analysis
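The two uses can be sketched on the same digits data, this time without passing the labels. The parameter values (`n_clusters=10`, `n_init=10`, `n_components=2`, `random_state=0`) are assumptions chosen for illustration, not recommendations.

```python
from sklearn import datasets
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

digits = datasets.load_digits()

# Clustering: group similar images; note that digits.target is never used
kmeans = KMeans(n_clusters=10, n_init=10, random_state=0)
cluster_ids = kmeans.fit_predict(digits.data)
print(cluster_ids.shape)

# Dimension reduction: summarize each 64-number image with 2 numbers
pca = PCA(n_components=2)
reduced = pca.fit_transform(digits.data)
print(reduced.shape)
```

The cluster ids are arbitrary group numbers, so cluster 3 does not necessarily contain the digit 3.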

# Resources

## Essential quick-starts

[Machine learning introduction/tutorial (with scikit-learn code)](http://scikit-learn.org/stable/tutorial/basic/tutorial.html)

[Python - "immediately useful tools"](http://python.net/~goodger/projects/pycon/2007/idiomatic/handout.html)

[10 minutes to Pandas](http://pandas.pydata.org/pandas-docs/stable/10min.html)

[Pandas cheat sheet](http://blog.quandl.com/cheat-sheet-for-data-analysis-in-python)

[ggplot - easy plotting in python](http://ggplot.yhathq.com)

## More Tutorials

[Supervised learning introduction (with scikit-learn code)](http://scikit-learn.org/stable/tutorial/statistical_inference/supervised_learning.html)

[Unsupervised learning introduction (with scikit-learn code)](http://scikit-learn.org/stable/tutorial/statistical_inference/unsupervised_learning.html)

## References

Canonical Textbook on Machine Learning:

* [[free pdf!] The Elements of Statistical Learning: Data Mining, Inference, and Prediction](http://statweb.stanford.edu/~tibs/ElemStatLearn/)

A newer and less technical version:

* [[free pdf!] An Introduction to Statistical Learning with Applications in R](http://www-bcf.usc.edu/~gareth/ISL/)

[scikit-learn tutorials](http://scikit-learn.org/stable/tutorial/index.html)

[scikit-learn user guide](http://scikit-learn.org/stable/user_guide.html)

[Choosing the right estimator](http://scikit-learn.org/stable/tutorial/machine_learning_map/index.html)

[scikit-learn - documentation](http://scikit-learn.org/stable/documentation.html)


# Acknowledgments

The source of the scikit-learn demo:
* http://scikit-learn.org/stable/tutorial/basic/tutorial.html