# Scikit-learn overview and the machine learning problem setting

## Agenda

- What is scikit-learn, and what are its benefits and drawbacks?
- What is machine learning and how does it work?

## Scikit-learn

![scikit-learn algorithm map](images/02_sklearn_algorithms.png)

## Benefits and drawbacks of scikit-learn

### Benefits:

- **Consistent interface** to machine learning models
- Rapid **integration**
- Provides many **tuning parameters** but with **sensible defaults**
- Good **documentation**
- Rich set of functionality for **companion tasks**
- **Active community** for development and support
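
As a minimal sketch of what the **consistent interface** means in practice, the snippet below trains two unrelated classifiers through the same `fit`/`predict` calls. The Iris dataset and the two chosen models are only illustrative assumptions; any scikit-learn estimator could be swapped in.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

# Small built-in dataset, used here purely for illustration
X, y = load_iris(return_X_y=True)

# Two very different algorithms, driven through the same two methods
models = [KNeighborsClassifier(), LogisticRegression(max_iter=1000)]

predictions = {}
for model in models:
    model.fit(X, y)  # learn from features X and labels y
    predictions[type(model).__name__] = model.predict(X[:3])  # predict for 3 samples
```

Both models are created with their default settings (apart from `max_iter`), which also illustrates the "sensible defaults" point above.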

### Potential drawbacks:

- Harder to get started with machine learning (**steep learning curve**)
- Less emphasis on **model interpretability** and breadth of the models covered

## Machine learning

We will level the playing field by defining all the concepts we will be using.

One ML definition: "Machine learning is the semi-automated extraction of knowledge from data"

- **Knowledge from data**: Starts with a question that might be answerable using data
- **Automated extraction**: A computer provides the insight
- **Semi-automated**: Requires many smart decisions by a human

### Learning problems

In general, a learning problem considers a set of n **samples** of data and then tries to predict properties of unknown data. 

If each sample is more than a single number, for instance a multi-dimensional entry (also known as multivariate data), it is said to have several attributes, or **features**.
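
To make samples and features concrete, here is a small sketch using the Iris dataset that ships with scikit-learn (the choice of dataset is an assumption made for illustration). Each row of `X` is one sample and each column is one feature.

```python
from sklearn.datasets import load_iris

# X holds the features, y holds one response value per sample
X, y = load_iris(return_X_y=True)

n_samples, n_features = X.shape  # rows are samples, columns are features
print(n_samples, n_features)     # 150 samples with 4 features each
print(y.shape)                   # one label per sample: (150,)
```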

ML concepts:

- Samples (observations, examples, instances, records)
- Features (predictors, attributes, independent variables, inputs, regressors, covariates)
- Response (target, outcome, label, dependent variable)
- Training, validation and testing sets
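
One possible way to obtain training, validation and testing sets is to call `train_test_split` twice, as sketched below. The 60/20/20 proportions and the Iris dataset are assumptions chosen for illustration, not a recommendation.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# First split: keep 60% for training, set 40% aside
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.4, random_state=0, stratify=y
)

# Second split: divide the remaining 40% equally into validation and testing
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, random_state=0, stratify=y_rest
)

print(len(X_train), len(X_val), len(X_test))
```

The training set is used to fit a model, the validation set to compare models or settings, and the testing set for a final check on data the model has never seen.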

**Supervised learning**: Making predictions using labeled data (learning a function that maps an input to an output based on example input-output pairs)

**Regression**

A regression problem is one in which the output variable is a real or continuous value.

- Example: predicting the temperature based on historical values
- There is an outcome value that we are trying to predict

![Weather Forecast](images/01_predicting_temp.png)
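
A minimal regression sketch, assuming made-up temperature readings (the numbers below are synthetic, generated only for this example): a linear model is fitted to 30 days of values and then asked for the next day.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data: a slow warming trend plus random noise (illustrative only)
rng = np.random.default_rng(0)
days = np.arange(30).reshape(-1, 1)  # feature: day index, shape (30, 1)
temps = 15.0 + 0.2 * days.ravel() + rng.normal(0.0, 1.0, size=30)  # response

model = LinearRegression().fit(days, temps)

# The prediction is a continuous value, not a category
next_day = model.predict([[30]])
print(next_day)
```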

**Classification**

A classification problem is one in which the output variable is a category.
    
- Example: Is a given email "spam" or "ham"?
- There is a class outcome we are trying to predict

![Spam filter](images/01_spam_filter.png)
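
A toy spam filter could be sketched as follows. The four emails and their labels are invented for this example, and the combination of `CountVectorizer` with `MultinomialNB` is just one common choice, not the only way to approach the problem.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical, hand-written training emails with their class labels
emails = [
    "win free money now",
    "free prize claim now",
    "meeting at noon tomorrow",
    "lunch with the team tomorrow",
]
labels = ["spam", "spam", "ham", "ham"]

# Turn text into word counts, then fit a Naive Bayes classifier
spam_filter = make_pipeline(CountVectorizer(), MultinomialNB())
spam_filter.fit(emails, labels)

# The output is one of the known categories
prediction = spam_filter.predict(["claim your free money"])
print(prediction[0])
```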

**Unsupervised learning**: Extracting structure from data (self-organized learning: finding previously unknown patterns in a data set without pre-existing labels)

- Example: Segment student population into clusters that exhibit similar behaviors
- There is no "right answer"

![Clustering](images/01_clustering.png)
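
As a rough sketch of clustering, the snippet below groups unlabeled points with k-means. The data is synthetic (generated with `make_blobs`), and the choice of three clusters is an assumption we have to supply ourselves, since there are no labels to tell us the "right" number.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic 2-D points; the generated labels are discarded on purpose
X, _ = make_blobs(n_samples=150, centers=3, random_state=0)

# Only X is passed in: no response variable is involved
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
cluster_ids = kmeans.fit_predict(X)

print(cluster_ids[:10])               # cluster index assigned to the first 10 samples
print(kmeans.cluster_centers_.shape)  # one center per cluster
```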

### Predictive modeling

High-level steps:

1. First, preprocess the data: make sure that the data is ready to enter the ML pipeline

2. Then, train a **machine learning model** using **labeled data**

    - "Labeled data" has been labeled with the outcome
    - "Machine learning model" learns the relationship between the attributes of the data and its outcome

3. Finally, make **predictions** on **new (unseen) data** for which the label is unknown

![Supervised learning](images/01_supervised_learning.png)
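
The three high-level steps can be sketched in a few lines. This is only an illustration under stated assumptions: the Iris dataset stands in for real data, feature scaling stands in for preprocessing, and a held-out split plays the role of new (unseen) data.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Hold back 20% of the samples to act as "new" data
X_labeled, X_new, y_labeled, y_new = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Step 1 (preprocess) and step 2 (train) combined in one pipeline
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_labeled, y_labeled)

# Step 3: predict outcomes for data the model has not seen
predictions = model.predict(X_new)
print(predictions[:5])
```

In this sketch the true labels `y_new` happen to be available, so the predictions could be checked against them; for genuinely new data they would not be.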

