# ACT-W Machine Learning 101

We're going to walk through the full process of building an initial model on a toy problem. Afterward, we'll give you a related problem to solve on your own (with our help if you need it).

The problem we'll be looking at is predicting whether a wine is red or white from a variety of features. Afterward, you can take the same dataset and predict the perceived quality of the wine.

All data has been taken from the [UCI ML repository](http://archive.ics.uci.edu/ml/datasets/Wine+Quality).

In [None]:
%ls

## Loading Data

To load data, we'll use a helpful library called `pandas`, which has many helper methods for loading and manipulating data. You can find more information on using `pandas` in this [pandas tutorial](http://pandas.pydata.org/pandas-docs/stable/10min.html).

`pandas` stores data in `DataFrame` objects -- this is basically a table. Each row is a record (in our case a wine), and each column is a different feature of a wine.

In [None]:
import pandas as pd

reds = pd.read_csv("winequality-red.csv", sep=";")
print(reds.head())
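
If you don't have the wine files handy, here is a minimal sketch of the same loading step on an in-memory sample. The three rows and the column subset are made up for illustration (they are not taken from the real dataset); the point is only to show how `sep=";"` tells `pandas` that columns are separated by semicolons, and a few ways to take a first look at a `DataFrame`.

```python
import io
import pandas as pd

# A tiny made-up sample in the same semicolon-separated layout as the wine files.
# The values are illustrative only, not rows from the real dataset.
sample_csv = io.StringIO(
    '"fixed acidity";"pH";"quality"\n'
    "7.4;3.51;5\n"
    "7.8;3.20;5\n"
    "11.2;3.16;6\n"
)

sample = pd.read_csv(sample_csv, sep=";")

print(sample.shape)       # (number of rows, number of columns)
print(sample.dtypes)      # the type pandas inferred for each column
print(sample.describe())  # summary statistics for each numeric column
```

The same three calls should work on `reds` once the real file is loaded.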

We can also print out the column names for all the features we have:

In [None]:
source_cols = reds.columns
print(source_cols)

## Preparing the data

The dataset is split into two sets: one for reds and one for whites. Let's add another column to distinguish between red and white so we can join them later. We'll let a red wine be `1` and a white wine be `0`.

In [None]:
reds['winetype'] = 1
print(reds.head())

Now we can load in the white wines and do what we did with the red wines:

In [None]:
whites = pd.read_csv("winequality-white.csv", sep = ";")
whites['winetype'] = 0

Now we will merge the two datasets, then separate the `winetype` labels from the features for each wine.

In [None]:
allwines = pd.concat([reds, whites])
print(allwines.head())
print(allwines.tail())
print(allwines.shape)
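
One detail worth knowing about `pd.concat`: by default each table keeps its own row labels, so the combined table can contain repeated index values. The sketch below uses two tiny stand-in tables (`reds_demo` and `whites_demo` are hypothetical names with made-up values) to show the difference `ignore_index=True` makes. Whether you need it depends on what you do with the index later; the steps in this tutorial work either way.

```python
import pandas as pd

# Two tiny stand-in tables (made-up values), each with its own 0-based index.
reds_demo = pd.DataFrame({"pH": [3.51, 3.20], "winetype": 1})
whites_demo = pd.DataFrame({"pH": [3.00, 3.30, 3.26], "winetype": 0})

# Default behaviour: the original row labels are kept, so they repeat.
stacked = pd.concat([reds_demo, whites_demo])
print(stacked.index.tolist())     # [0, 1, 0, 1, 2]

# ignore_index=True renumbers the rows from 0.
renumbered = pd.concat([reds_demo, whites_demo], ignore_index=True)
print(renumbered.index.tolist())  # [0, 1, 2, 3, 4]

# Count how many rows of each wine type ended up in the combined table.
print(renumbered["winetype"].value_counts())
```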

In [None]:
label_columns = ['winetype', 'quality']
feature_columns = [col for col in allwines.columns if col not in label_columns]

winetype_labels = allwines[['winetype']]
winetype_features = allwines[feature_columns]

print("Labels shape: ", winetype_labels.shape)
print("Features shape: ", winetype_features.shape)

## Model Selection

Let's pick a model to classify wines by color.

The scikit-learn guide to [choosing the right estimator](http://scikit-learn.org/stable/tutorial/machine_learning_map/) is a useful starting point.

We'll use a support vector machine (SVM) classifier. To train it, we need training data, and to verify the accuracy of the trained model we need separate test data. So we'll split the data we have into a training set and a test set.

In [None]:
from sklearn.utils import shuffle
from sklearn.model_selection import train_test_split

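# Shuffle features and labels together so each row keeps its own label
winetype_features, winetype_labels = shuffle(winetype_features, winetype_labels)
# Hold out 35% of the rows for testing
train_data, test_data, train_labels, test_labels = train_test_split(winetype_features, winetype_labels, test_size=0.35)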
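
As a variation on the split above, here is a minimal sketch using two optional `train_test_split` parameters: `random_state`, which fixes the random seed so you get the same split every run, and `stratify`, which keeps the class proportions similar in the training and test sets. The data here is made up (20 rows, with far fewer examples of class 1 than class 0), and the names `X_train`, `X_test`, `y_train`, and `y_test` are just illustrative. Note that `train_test_split` already shuffles rows by default.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 20 rows, 2 features, with 15 examples of class 0 and 5 of class 1.
features = np.arange(40).reshape(20, 2)
labels = np.array([0] * 15 + [1] * 5)

X_train, X_test, y_train, y_test = train_test_split(
    features,
    labels,
    test_size=0.35,    # same proportion as the split above
    random_state=0,    # fixed seed so the split is repeatable
    stratify=labels,   # keep the class ratio similar in both parts
)

print(X_train.shape, X_test.shape)
print("class 1 share in train:", y_train.mean())
print("class 1 share in test: ", y_test.mean())
```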

## Model Training

Now let's create the model and train it on the training data.

Note: we use the `ravel` function from `numpy` here to reshape our labels. Because of how we extracted the data, the labels are actually a _column_, but we just need a flat array, and `ravel` takes care of that for us. Training will still work without it, but you'll get a warning from `sklearn`. `numpy` is also a helpful library; you can find out more about it in the [tutorial](https://docs.scipy.org/doc/numpy-dev/user/quickstart.html).
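
To make the reshaping concrete, here is a small sketch with a made-up labels table (`labels_frame` is a hypothetical stand-in for `train_labels`). It shows the shape before and after `ravel`.

```python
import numpy as np
import pandas as pd

# Stand-in for labels selected with double brackets (made-up values).
labels_frame = pd.DataFrame({"winetype": [1, 1, 0, 0]})

as_column = labels_frame.to_numpy()  # 2-D: four rows, one column
as_flat = np.ravel(labels_frame)     # 1-D: the same values, flattened

print(as_column.shape)  # (4, 1)
print(as_flat.shape)    # (4,)
```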

We'll also be using the `sklearn` library. This has tons of machine learning models implemented with a simple interface. You can find more about [sklearn here](http://scikit-learn.org/stable/).

In [None]:
from sklearn.svm import SVC
import numpy as np

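# SVC uses an RBF kernel by default; pass kernel="linear" for a linear SVM
clf = SVC()
# Train on the training set, with the labels flattened to a 1-D array
clf.fit(train_data, np.ravel(train_labels))
# Predict the wine type for each row of the held-out test set
predicted_labels = clf.predict(test_data)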

## Model Evaluation

In [None]:
print("Model score: {0}".format(clf.score(test_data, test_labels)))
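
For a classifier like this one, `score` reports accuracy: the fraction of test rows whose predicted label matches the true label. A single number can hide which kind of mistake the model makes, so here is a minimal sketch, on made-up labels, that computes the same accuracy with `accuracy_score` and breaks the results down with `confusion_matrix`. The label lists below are invented for illustration.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up true and predicted labels (1 = red, 0 = white), for illustration only.
true_labels = [1, 1, 0, 0, 0, 0]
predicted = [1, 0, 0, 0, 0, 1]

accuracy = accuracy_score(true_labels, predicted)
matrix = confusion_matrix(true_labels, predicted)

print("Accuracy:", accuracy)
# Rows are the true classes (0, then 1); columns are the predicted classes.
print(matrix)
```

In the tutorial's own variables, the equivalent calls would take `np.ravel(test_labels)` and `predicted_labels`.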

## Your Turn

Now let's think about training a model that predicts the *quality* of the wine. Go through the same steps!

1. Load Data
2. Prepare Data
3. Train/Test Split
4. Select Model
5. Train Model
6. Evaluate Model

What's the best score you can get?

## Going Further

What do you do after you have an initial model? Well, maybe it's good enough to deploy. Or, maybe there's some room for improvement.

Something many practitioners engage in is error analysis: take a look at what your model doesn't do so well on and try to figure out what the underlying issue is. Maybe the issue is noisy data, maybe it's a faulty assumption of your model, or maybe you could be using better features. It's often hard to do this very effectively unless you know a lot about the underlying model -- so this might be an exercise you return to.
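
A simple way to start on error analysis is to pull out the rows the model got wrong and compare them with the rest. The sketch below does this on a made-up results table; the column names `true_label` and `predicted_label` and all values are hypothetical. With the model from this tutorial, you could build a similar table from `test_data`, `test_labels`, and `predicted_labels`.

```python
import pandas as pd

# Made-up test rows with true labels and a model's predictions.
results = pd.DataFrame({
    "pH": [3.51, 3.20, 3.00, 3.30],
    "alcohol": [9.4, 9.8, 10.1, 12.0],
    "true_label": [1, 1, 0, 0],
    "predicted_label": [1, 0, 0, 1],
})

# Keep only the rows where the prediction disagrees with the true label.
mistakes = results[results["true_label"] != results["predicted_label"]]
print(mistakes)

# Compare feature averages for the mistakes against the whole set.
print(mistakes[["pH", "alcohol"]].mean())
print(results[["pH", "alcohol"]].mean())
```

If the misclassified rows look very different from the rest on some feature, that can be a hint about where the model or the data needs attention.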

If you're looking to go deeper, here are some resources to help you:

* [Kaggle](https://www.kaggle.com/)
* [Udacity ML Nanodegree](https://www.udacity.com/course/machine-learning-engineer-nanodegree--nd009)
* [Coursera ML](https://www.coursera.org/learn/machine-learning)
* [Machine Learning: A Probabilistic Perspective](https://www.amazon.com/Machine-Learning-Probabilistic-Perspective-Computation/dp/0262018020) -- WARNING: math heavy
* [Deep Learning Book](http://www.deeplearningbook.org/)
* [OpenAI Gym](https://gym.openai.com/)