# scikit-learn Tutorials

## An introduction to machine learning with scikit-learn

In general, a learning problem considers a set of n samples of data and then tries to predict properties of unknown data. If each sample is more than a single number, for instance a multi-dimensional entry (also known as multivariate data), it is said to have several attributes or features.

We can separate learning problems into a few large categories:

- supervised learning, in which the data comes with additional attributes that we want to predict. This problem can be either:

 - classification: samples belong to two or more classes and we want to learn from already labeled data how to predict the class of unlabeled data. An example of a classification problem is handwritten digit recognition, in which the aim is to assign each input vector to one of a finite number of discrete categories. Another way to think of classification is as a discrete (as opposed to continuous) form of supervised learning, where there is a limited number of categories and each of the n samples provided is to be labeled with the correct category or class.

 - regression: if the desired output consists of one or more continuous variables, then the task is called regression. An example of a regression problem would be the prediction of the length of a salmon as a function of its age and weight.

- unsupervised learning, in which the training data consists of a set of input vectors x without any corresponding target values. The goal in such problems may be to discover groups of similar examples within the data (clustering), to determine the distribution of data within the input space (density estimation), or to project the data from a high-dimensional space down to two or three dimensions for the purpose of visualization.
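
The difference between the two categories shows up directly in how an estimator is fitted. The following is a minimal sketch, not a reference implementation: it assumes the bundled digits dataset, an `SVC` classifier with an illustrative `gamma` value, and `KMeans` with ten clusters. None of these choices come from the text above. The supervised estimator receives both the samples and their labels, while the unsupervised one only sees the samples.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Each sample is an 8x8 image flattened into 64 features; y holds the digit labels.
X, y = load_digits(return_X_y=True)

# Supervised learning (classification): learn from labeled data,
# then predict the class of held-out samples.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)
clf = SVC(gamma=0.001)  # gamma is an illustrative choice, not a tuned value
clf.fit(X_train, y_train)
accuracy = clf.score(X_test, y_test)
print(f"classification accuracy on held-out digits: {accuracy:.3f}")

# Unsupervised learning (clustering): no target values are passed to fit.
kmeans = KMeans(n_clusters=10, n_init=10, random_state=0)
cluster_ids = kmeans.fit_predict(X)
print("number of samples assigned to a cluster:", cluster_ids.shape[0])
```

The cluster ids returned by `KMeans` are arbitrary integers and do not correspond to the digit labels; matching them to classes would require the labels, which is exactly what unsupervised learning does not have.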

## A tutorial on statistical-learning for scientific data processing

### Statistical learning: the setting and the estimator object in scikit-learn

http://scikit-learn.org/stable/tutorial/statistical_inference/settings.html

- Datasets
- Estimator objects
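
As a rough illustration of these two topics, the sketch below shows the array shape scikit-learn expects (`n_samples` by `n_features`) and the basic estimator workflow. The toy regression data (`y = 2x + 1`) is an assumption made up for this example.

```python
import numpy as np
from sklearn.datasets import load_digits, load_iris
from sklearn.linear_model import LinearRegression

# Datasets are 2D arrays: first axis is samples, second axis is features.
iris = load_iris()
iris_shape = iris.data.shape
print("iris data shape:", iris_shape)

# Data that is not already 2D must be reshaped; each 8x8 digit image
# becomes a single row of 64 features.
digits = load_digits()
flat_images = digits.images.reshape((digits.images.shape[0], -1))
print("flattened digits shape:", flat_images.shape)

# An estimator is any object that learns from data through fit().
# Hyperparameters are set when the object is constructed.
X = np.arange(10, dtype=float).reshape(-1, 1)
y = 2.0 * X.ravel() + 1.0  # hypothetical noise-free toy target
estimator = LinearRegression(fit_intercept=True)
estimator.fit(X, y)

# Parameters estimated from the data are attributes ending in an underscore.
slope = estimator.coef_[0]
intercept = estimator.intercept_
print("estimated slope and intercept:", slope, intercept)
```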

### Supervised learning: predicting an output variable from high-dimensional observations

http://scikit-learn.org/stable/tutorial/statistical_inference/supervised_learning.html

- Nearest neighbor and the curse of dimensionality
- Linear model: from regression to sparsity
- Support vector machines (SVMs)
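
A small sketch of the first two topics, under assumptions that are not part of these notes: the iris and diabetes datasets bundled with scikit-learn, five neighbors for the classifier, and `alpha=1.0` for the Lasso. It shows a nearest-neighbor classifier and a sparse linear model that sets some coefficients exactly to zero.

```python
import numpy as np
from sklearn.datasets import load_diabetes, load_iris
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Nearest neighbors: classify a sample by the labels of its closest training samples.
X_iris, y_iris = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X_iris, y_iris, test_size=0.3, random_state=0, stratify=y_iris
)
knn = KNeighborsClassifier(n_neighbors=5)  # 5 is an illustrative choice
knn.fit(X_train, y_train)
knn_accuracy = knn.score(X_test, y_test)
print(f"k-nearest-neighbors accuracy: {knn_accuracy:.3f}")

# Sparsity: the Lasso penalty can drive some coefficients to exactly zero,
# which effectively selects a subset of the features.
X_diab, y_diab = load_diabetes(return_X_y=True)
lasso = Lasso(alpha=1.0)  # alpha is an illustrative regularization strength
lasso.fit(X_diab, y_diab)
n_zero = int(np.sum(lasso.coef_ == 0))
print("coefficients set to zero:", n_zero, "of", lasso.coef_.shape[0])
```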

### Model selection: choosing estimators and their parameters

http://scikit-learn.org/stable/tutorial/statistical_inference/model_selection.html

- Score, and cross-validated scores
- Cross-validation generators
- Grid-search and cross-validated estimators
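
The three topics above can be sketched together as follows. This is one possible illustration, assuming the digits dataset, a linear `SVC`, five folds, and a small hypothetical grid of `C` values; none of these settings come from the notes.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)
svc = SVC(kernel="linear", C=1.0)

# Score: every estimator exposes score(); here the last 100 samples are held out.
single_score = svc.fit(X[:-100], y[:-100]).score(X[-100:], y[-100:])
print(f"single held-out score: {single_score:.3f}")

# Cross-validation generator: KFold yields (train indices, test indices) pairs.
k_fold = KFold(n_splits=5)
fold_sizes = [len(test_idx) for _, test_idx in k_fold.split(X)]
print("test fold sizes:", fold_sizes)

# Cross-validated scores: one score per fold.
cv_scores = cross_val_score(svc, X, y, cv=k_fold)
print("cross-validated scores:", np.round(cv_scores, 3))

# Grid search: pick the parameter value with the best cross-validated score.
param_grid = {"C": np.logspace(-3, 0, 4)}  # hypothetical grid
search = GridSearchCV(SVC(kernel="linear"), param_grid=param_grid, cv=3)
search.fit(X[:1000], y[:1000])
best_C = search.best_params_["C"]
print("best C:", best_C, "with score", round(search.best_score_, 3))
```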

### Unsupervised learning: seeking representations of the data

http://scikit-learn.org/stable/tutorial/statistical_inference/unsupervised_learning.html

- Clustering: grouping observations together
- Decompositions: from a signal to components and loadings
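
For the decomposition topic, a minimal sketch with principal component analysis may help. The synthetic data is an assumption made up for this example: three observed features, of which the third is the sum of the first two, so the data really only has two independent directions.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic signal with 3 features but only 2 independent directions.
rng = np.random.RandomState(0)
x1 = rng.normal(size=200)
x2 = rng.normal(size=200)
x3 = x1 + x2  # fully determined by the first two features
X = np.column_stack([x1, x2, x3])

# Fit PCA with all components to see how the variance is distributed.
pca = PCA()
pca.fit(X)
explained = pca.explained_variance_ratio_
print("explained variance ratio:", explained)

# The last component carries essentially no variance, so the data
# can be projected onto the first two components without loss.
pca_2d = PCA(n_components=2)
X_reduced = pca_2d.fit_transform(X)
print("reduced shape:", X_reduced.shape)
```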

### Putting it all together

http://scikit-learn.org/stable/tutorial/statistical_inference/putting_together.html

- Pipelining
- Face recognition with eigenfaces
- Open problem: Stock Market Structure
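
Pipelining chains transformers and a final estimator into one object that is fitted and scored as a whole. The sketch below is an illustrative combination, not necessarily the one used in the linked tutorial: it assumes the digits dataset, a scaling step, a PCA step with 30 components, and a logistic regression classifier.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Each step is a (name, estimator) pair; all but the last must be transformers.
pipe = Pipeline(
    [
        ("scale", StandardScaler()),
        ("pca", PCA(n_components=30)),  # 30 is an illustrative choice
        ("logistic", LogisticRegression(max_iter=1000)),
    ]
)

# fit() runs fit_transform on each transformer in turn, then fits the classifier.
pipe.fit(X_train, y_train)

# score() applies the same transformations to the test data before predicting.
pipeline_accuracy = pipe.score(X_test, y_test)
print(f"pipeline accuracy: {pipeline_accuracy:.3f}")
```

Because the pipeline is itself an estimator, it can be passed to cross-validation or grid search, with step parameters addressed as `stepname__parameter` (for example `pca__n_components`).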


### Finding help

http://scikit-learn.org/stable/tutorial/statistical_inference/finding_help.html

- The project mailing list
- Q&A communities with Machine Learning practitioners

## Working with Text Data

### Tutorial setup

### Loading the 20 newsgroups dataset

### Extracting features from text files

### Training a classifier

### Building a pipeline
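
A text classification pipeline typically chains feature extraction and a classifier. The following is only a sketch: the tiny in-memory corpus and its two labels are made up for illustration and stand in for the 20 newsgroups data, and the choice of `CountVectorizer`, `TfidfTransformer`, and `MultinomialNB` is an assumption about one reasonable combination.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus; a real run would load a proper dataset instead.
train_docs = [
    "the gpu renders the image",
    "opengl draws a 3d image on the gpu",
    "god and faith in the church",
    "the church teaches faith and prayer",
]
train_labels = ["graphics", "graphics", "religion", "religion"]

text_clf = Pipeline(
    [
        ("vect", CountVectorizer()),    # raw text -> token counts
        ("tfidf", TfidfTransformer()),  # counts -> tf-idf weights
        ("clf", MultinomialNB()),       # classifier on the weighted features
    ]
)
text_clf.fit(train_docs, train_labels)

new_docs = ["opengl on the gpu", "faith and god"]
predicted = text_clf.predict(new_docs)
for doc, label in zip(new_docs, predicted):
    print(f"{doc!r} => {label}")
```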

### Evaluation of the performance on the test set

### Parameter tuning using grid search

### Exercise 1: language identification

### Exercise 2: sentiment analysis on movie reviews

### Exercise 3: CLI text classification utility

### Where to from here

