# Introduction to Machine Learning
### Workshop 2 of DASIL's series on "Data Science with Python"
### Created by Martin Pollack

In this Jupyter notebook we give a quick introduction to fitting machine learning models in Python with the `scikit-learn` package.

Next week we will go into much more depth.

In [10]:
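# Uncomment the next two lines to install the packages if needed.
# The PyPI package is named scikit-learn; it is imported as sklearn.
#!pip install scikit-learn
#!pip install pandas
import pandas as pd
from sklearn import datasets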

### Supervised Learning - Regression

Remember that in a regression problem the outcome variable is numeric and continuous. The predictor variables, however, can be either continuous or discrete.

The diabetes dataset bundled with `scikit-learn` is an example of a regression problem. The outcome is a quantitative measure of disease progression one year after baseline, and it ranges from 25 to 346.

In [13]:
diabetes = datasets.load_diabetes(as_frame=True)

In [17]:
# diabetes.target is a pandas Series, so we can use its own min/max methods
print(diabetes.target.min())
print(diabetes.target.max())

25.0
346.0
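
To make the regression setting concrete, here is a minimal sketch of fitting a model to the diabetes data. The choice of `LinearRegression`, the 75/25 train/test split, and `random_state=0` are assumptions made for illustration; any other regression model in `scikit-learn` could be substituted.

```python
from sklearn import datasets
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

diabetes = datasets.load_diabetes(as_frame=True)

# Hold out 25% of the rows so we can check predictions on unseen data.
# The split size and random_state are illustrative choices.
X_train, X_test, y_train, y_test = train_test_split(
    diabetes.data, diabetes.target, test_size=0.25, random_state=0
)

# Fit an ordinary least squares model on the training rows only.
model = LinearRegression()
model.fit(X_train, y_train)

# Predictions are continuous numbers, like the outcome itself.
predictions = model.predict(X_test)
print(predictions[:5])

# score() returns R^2 on the held-out rows (1.0 would be a perfect fit).
r2 = model.score(X_test, y_test)
print(r2)
```

Note that the predictions are not restricted to values seen in the training data, which is what distinguishes regression from classification.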


### Supervised Learning - Classification

Now let's look at a classification problem, where the outcome takes one of two or more discrete values. As before, the predictors can be either continuous or discrete.

The iris dataset in `scikit-learn` is a famous example. Here the outcome is one of three iris species (setosa, versicolor, and virginica), labeled 0, 1, or 2.

In [20]:
iris = datasets.load_iris(as_frame=True)

In [22]:
iris.target.value_counts()

0    50
1    50
2    50
Name: target, dtype: int64
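
As a rough sketch of what fitting a classifier on this data might look like, the example below uses a k-nearest neighbors model. The choice of `KNeighborsClassifier`, `n_neighbors=5`, the stratified 75/25 split, and `random_state=0` are all assumptions for illustration.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = datasets.load_iris(as_frame=True)

# Stratify so each species keeps the same share in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.25, random_state=0, stratify=iris.target
)

# Each flower is assigned the majority class among its 5 nearest training flowers.
classifier = KNeighborsClassifier(n_neighbors=5)
classifier.fit(X_train, y_train)

# Predictions are class labels (0, 1, or 2), not continuous numbers.
predictions = classifier.predict(X_test)
print(predictions[:10])

# For classifiers, score() returns accuracy: the share of correct labels.
accuracy = classifier.score(X_test, y_test)
print(accuracy)
```

Because all three classes have 50 observations each, accuracy is a reasonable first metric here; with unbalanced classes it can be misleading.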

### Unsupervised Learning - Clustering

Lastly, consider an unsupervised learning problem, where we have no outcome variable at all; that is, our data is "unlabeled." Instead of predicting something, we just want to find patterns and structure in the data.

Our data can be unlabeled for two reasons:

• First, maybe our data does not have well-defined groupings. An example might be a company's customers: there are no clear, distinct groups to put people in.

• Second, maybe the labels of our data are missing. Suppose you are a wine vendor and you ordered three types of wine from your supplier. When you receive your shipment, however, you realize that the labels were never put on. You may want to learn how the bottles relate to one another so you can make an educated guess about which bottle is which wine type.

Our example below falls into this second case.

In [25]:
wine = datasets.load_wine(as_frame=True)

In [29]:
wine.data.columns

Index(['alcohol', 'malic_acid', 'ash', 'alcalinity_of_ash', 'magnesium',
       'total_phenols', 'flavanoids', 'nonflavanoid_phenols',
       'proanthocyanins', 'color_intensity', 'hue',
       'od280/od315_of_diluted_wines', 'proline'],
      dtype='object')
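
One way to look for structure in these 13 measurements is to cluster the bottles. The sketch below uses k-means with three clusters, matching the three wine types in the story above. The use of `KMeans`, the `StandardScaler` preprocessing step, `n_init=10`, and `random_state=0` are assumptions for illustration, not the only reasonable choices.

```python
import pandas as pd
from sklearn import datasets
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

wine = datasets.load_wine(as_frame=True)

# The columns are on very different scales (compare proline and hue),
# so we standardize them before computing distances.
X_scaled = StandardScaler().fit_transform(wine.data)

# Ask for three groups; note that no outcome variable is passed to the model.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(X_scaled)

# How many bottles ended up in each cluster?
print(pd.Series(labels).value_counts())
```

The cluster numbers themselves are arbitrary: cluster 0 does not have to correspond to any particular wine type, only to a group of bottles with similar measurements.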