# Beginning Machine Learning with scikit-learn

## Getting our Hands Dirty

In this lesson, we go through several techniques within scikit-learn, many of which we return to explore in more detail in subsequent lessons.  Having a sense of the overall steps and results one sees in a machine learning task provides a good reference to more in-depth exploration later.

Whenever we perform supervised learning, our workflow will resemble the diagram here.  That is, we divide our data into training and testing sets; within each, many "columns" of data are known as *features* and just one is known as the *target*.  The difference between classification and regression is simply whether the target is categorical or continuous.  Some similar models exist for both types of target; others are specific to one or the other.

<img src='img/supervised_workflow.png' width=40% align="left"/>
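As a small illustration of the classification/regression distinction, the sketch below fits a decision tree of each kind on made-up data. The arrays are invented for the example, and `DecisionTreeRegressor` appears here only as an instance of a model family that handles both kinds of target.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

# Made-up data: 6 observations with 2 features each
X = np.array([[1, 0], [2, 1], [3, 0], [4, 1], [5, 0], [6, 1]])

# A categorical target calls for classification...
y_class = np.array(["low", "low", "low", "high", "high", "high"])
clf = DecisionTreeClassifier(random_state=0).fit(X, y_class)

# ...while a continuous target calls for regression
y_reg = np.array([1.5, 2.5, 3.5, 4.5, 5.5, 6.5])
reg = DecisionTreeRegressor(random_state=0).fit(X, y_reg)

pred_class = clf.predict([[2, 0]])   # a label
pred_reg = reg.predict([[2, 0]])     # a number
print(pred_class, pred_reg)
```

Both estimators share the same `fit()`/`predict()` interface; only the type of the target differs.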

## Machines Learning about Humans Learning about Machine Learning

I gave the first tutorial at AnacondaCON 2018, on machine learning with scikit-learn. I spoke there to about 120 attendees.

The attendees of my session were an excellent group of learners and experts. But I decided I wanted to know even more about these people than I could find by looking at their faces and responding to their questions. So I asked them to complete a slightly whimsical form at the end of the 3-hour tutorial. Just who are these people, and what can scikit-learn tell us about which of them benefitted most from the tutorial?

In the interest of open data science, the collection of answers given by attendees is available under a [CC-BY-SA 4.0 license](https://creativecommons.org/licenses/by-sa/4.0/legalcode), and is part of the [GitHub repository for this course](https://github.com/DavidMertz/ML-Live-Beginner). The anonymized data is [available as a CSV file](https://github.com/DavidMertz/ML-Live-Beginner/blob/master/data/Learning%20about%20Humans%20learning%20ML.csv).

The attendees of this course are well described as:

> **"Humans learning about machines learning about humans learning about machine learning."**

It would be great to collect a larger dataset for future revisions of this analysis.  Please complete this [Machine Learning with scikit-learn survey](https://goo.gl/pghpzD).  Updated results will appear on the [GitHub repository](https://github.com/DavidMertz/ML-Live-Beginner) from time to time.

## The Whimsical Dataset

Data never arrives at the workstation of a data scientist quite clean, no matter how much validation is attempted in the collection process. The respondent data is no exception. Using the familiar facilities in Pandas, we can improve the initial data before applying scikit-learn to it. In particular, I failed to validate the field "`Years of post-secondary education (e.g. BA=4; Ph.D.=10)`" as a required integer. Also, the "`Timestamp`" values added by the form interface are gratuitous for these purposes: they are all within a couple of minutes of each other, and neither their order nor their spacing is likely to have any value to our models.

Let's start to look at the data:

In [1]:
import pandas as pd
from os.path import join
import warnings
warnings.simplefilter("ignore")

In [2]:
# Read the data
fname = join('data', "Learning about Humans learning ML.csv")
humans = pd.read_csv(fname)

# Drop unused column
humans.drop('Timestamp', axis=1, inplace=True)

# Add an improved column
humans['Education'] = (humans[
    'Years of post-secondary education (e.g. BA=4; Ph.D.=10)']
                       .str.replace(r'.*=', '', regex=True)
                       .astype(int))

# Then drop the one it is based on
humans.drop('Years of post-secondary education (e.g. BA=4; Ph.D.=10)', 
            axis=1, inplace=True)
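To see what the string cleanup does, here is a minimal sketch on a handful of made-up answers (these are not values taken from the survey file). It passes `regex=True` explicitly, because recent pandas versions treat the pattern as a literal string by default.

```python
import pandas as pd

# Made-up raw answers, imitating a free-text field that should hold integers
raw = pd.Series(["4", "BA=4", "Ph.D.=10", "-10"])

# Remove everything up to and including the last "=", then convert to int
cleaned = raw.str.replace(r".*=", "", regex=True).astype(int)
print(cleaned.tolist())
```

Note that the pattern only strips a leading label; an implausible value such as `-10` converts cleanly and remains in the data.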

## Eyeballing Data

At the start of your work on a dataset, it is always useful to take a look at it to get a "feel" for the data. For this example, the dataset is small enough that it wouldn't be absurd to look at every single data point in it.  However, many of the datasets you will work with will have hundreds of thousands or millions of rows, and item by item examination is impossible.  For these cases, we need to look at representative values and aggregations of features.

If the dataset can be read as a Pandas DataFrame, overview inspection is particularly easy and friendly.
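For a dataset too large to read row by row, a sketch of this kind of overview might look like the following. The DataFrame is synthetic, and its column names are invented for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for a large dataset (all values are made up)
rng = np.random.default_rng(0)
big = pd.DataFrame({
    "age": rng.integers(18, 80, size=100_000),
    "language": rng.choice(["Python", "R", "Julia"], size=100_000),
})

sample = big.sample(5, random_state=0)    # a few representative rows
summary = big.describe()                  # aggregates of the numeric columns
counts = big["language"].value_counts()   # frequency of each category
print(sample, summary, counts, sep="\n\n")
```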

> **"90% of the time spent doing data analysis is doing data cleanup."** –Every Data Scientist

In [3]:
print("\n".join(humans.columns))

Favorite programming language
Favorite Monty Python movie
Years of Python experience
Have used Scikit-learn
Age
In the Terminator franchise, did you root for the humans or the machines?
Which is the better game?
How successful has this tutorial been so far?
Education


In [4]:
humans.head(4)

Unnamed: 0,Favorite programming language,Favorite Monty Python movie,Years of Python experience,Have used Scikit-learn,Age,"In the Terminator franchise, did you root for the humans or the machines?",Which is the better game?,How successful has this tutorial been so far?,Education
0,Python,Monty Python's Life of Brian,20.0,Yep!,53,Skynet is a WINNER!,"Tic-tac-toe (Br. Eng. ""noughts and crosses"")",8,12
1,Python,Monty Python and the Holy Grail,4.0,Yep!,33,Team Humans!,Chess,9,5
2,Python,Monty Python and the Holy Grail,1.0,Yep!,31,Team Humans!,Chess,10,10
3,Python,Monty Python and the Holy Grail,12.0,Yep!,60,Team Humans!,"Tic-tac-toe (Br. Eng. ""noughts and crosses"")",6,10


For convenience, let us give shorter names to all the columns (ones that are valid Python identifiers, so that we can use attribute-style access).  There is nothing functional in this change, but it often makes later code look nicer.

Looking at a few rows of data often can help correct or improve our understanding of the meaning, range, units, common values, etc. of the data we wish to construct models around. In a great many cases, common sense can prevent chasing down dead ends that take hours or days of needless time.

In [None]:
humans.columns = ['Fav_lang', 'Fav_movie', 'Experience', 'Sklearn', 
                  'Age', 'Humans_Machines', 'Fav_Game', 'Success', 'Education']
humans.head(4)

Looking at the metadata and a basic statistical aggregation of the data is generally useful also.  Pandas DataFrames provide a very easy way to look at this:

In [None]:
humans.describe(include=['int', 'float', 'object'])

## Data Cleanup

It would be useful to explore aspects of the (simple) data offline to get practice. In the summary view, a few data quality issues jump out. Such issues are universal in real-world datasets.

I am doubtful that two 3 year-olds were in my audience. More likely, a couple 30-somethings mistyped entering their ages. A 99 year-old is possible, but that also seems more likely to be a placeholder value used by some respondent. While the description of what is meant by the integer "Education" was probably underspecified, it still feels like the -10 years of education is more likely to be a data entry problem than an intended indicator.

However, **the data we have is the data we must analyze**.

In [None]:
humans[humans.Age == 3]
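One way to surface suspicious rows without altering them is a simple range check. The sketch below uses made-up rows that echo the oddities described above; the bounds are hypothetical, and choosing them is a judgement call.

```python
import pandas as pd

# Made-up rows echoing the oddities described above
toy = pd.DataFrame({"Age": [3, 33, 99, 31], "Education": [4, 5, 10, -10]})

# Hypothetical plausibility bounds: ages 16-90, non-negative education
suspect = ~toy["Age"].between(16, 90) | (toy["Education"] < 0)
print(toy[suspect])
```

This only flags rows for inspection; nothing is dropped or corrected.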

### One-hot Encoding

Several features of the data represent a small number of discrete categories.  For many or most algorithms, using one-hot encoding of categorical data is more effective than using raw categories or converting to integers. Basically, all those columns that have a small number of unique values (and specifically values that are not ordinal, even implicitly) are categorical.

One-hot encoding makes less difference for the decision tree and random forest classifiers used in this lesson than it might for other classifiers and regressors, but it rarely hurts. We perform the encoding with `pandas.get_dummies()`, but you could also use `sklearn.preprocessing.OneHotEncoder` (or `LabelBinarizer`, applied one column at a time) to accomplish the same goal.

In [None]:
human_dummies = pd.get_dummies(humans)
list(human_dummies.columns)
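To make the effect of the encoding concrete, here is a minimal sketch on a three-row toy frame (made-up rows, not the survey data). Numeric columns pass through unchanged, while each string column becomes one indicator column per distinct value.

```python
import pandas as pd

toy = pd.DataFrame({
    "Age": [53, 33, 31],
    "Fav_Game": ["Chess", "Tic-tac-toe", "Chess"],
})

encoded = pd.get_dummies(toy)
print(list(encoded.columns))
print(encoded)
```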

## Classification: Choosing Features and a Target

Let us use scikit-learn to model the respondents. In particular, we would like to know whether other features of attendees are a good predictor of how successful they found the tutorial. A very common pattern you will see in machine learning based on starting DataFrames is to drop one column for the X features, and keep that one for the y target.

In my analysis, I felt a binary measure of success was more relevant than a scalar measure initially collected as a 1-10 scale. Moreover, if the target is simplified this way, it becomes appropriate to use a *classification* algorithm as opposed to a *regression* algorithm. It would be a mistake to treat the 1-10 scale as a categorical variable consisting of 10 independent labels: there is something inherently ordinal about these labels, although scikit-learn will happily calculate models as if there is not. On the other hand, responses to this kind of ordinal question are generally non-uniform in distribution, usually clustering near the top values.

This is a place where subject matter judgement is needed by a data scientist.

In [None]:
X = human_dummies.drop("Success", axis=1)
y = human_dummies.Success >= 8

We selected a cutoff of `>= 8` for success scores because it divides the data approximately evenly into "Yes" and "No" categories.

In [None]:
y.value_counts()
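A quick way to compare candidate cutoffs is to compute the share of positive labels each one would produce. The ratings below are made up for illustration and are not the survey responses.

```python
import pandas as pd

# Made-up ratings on the 1-10 scale, clustered toward the top
ratings = pd.Series([10, 9, 9, 8, 8, 8, 7, 7, 6, 5, 7, 6])

# Share of "successful" responses for each candidate cutoff
shares = {cutoff: (ratings >= cutoff).mean() for cutoff in range(6, 11)}
for cutoff, share in shares.items():
    print(f"cutoff >= {cutoff}: {share:.0%} positive")

# The cutoff whose positive share is closest to an even split
best = min(shares, key=lambda c: abs(shares[c] - 0.5))
print("most balanced cutoff:", best)
```

A balanced target is convenient, but the cutoff should still make sense for the question being asked.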

## Conventional Names and Shapes

In almost all machine learning discussions, you will see the names capital-X and lowercase-y for the feature set and the target. The idea is that the capital letter denotes a matrix of independent variables, since in general one expects there to be multiple feature variables. The target consists of just one dependent variable, and hence it is lowercase. The feature set and the target will always have the same number of rows.

Using X and y to distinguish independent and dependent variables is widespread in many areas of mathematics. Moreover, you will often see the features within X named $x_1$, $x_2$, $x_3$, and so on in more academic texts.

In [None]:
y.head()

In [None]:
X.head()
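The shape convention can be checked directly. This sketch rebuilds a tiny frame from three columns of the four rows displayed earlier and confirms that `X` is two-dimensional, `y` is one-dimensional, and both have the same number of rows.

```python
import pandas as pd

toy = pd.DataFrame({
    "Age": [53, 33, 31, 60],
    "Experience": [20.0, 4.0, 1.0, 12.0],
    "Success": [8, 9, 10, 6],
})

X = toy.drop("Success", axis=1)   # 2-D: one row per sample, one column per feature
y = toy["Success"] >= 8           # 1-D: one boolean per sample

print(X.shape, y.shape)
```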

## Train/Test Split

While using [sklearn.model_selection.StratifiedKFold](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedKFold.html) is a more rigorous way of evaluating a model, for quick-and-dirty experimentation, using `train_test_split()` is usually the easiest approach. In either case, the basic principle is that you want to avoid overfitting by training on different data than you use to test your model.

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

print("Training features/target:", X_train.shape, y_train.shape)
print("Testing features/target:", X_test.shape, y_test.shape)

In a later lesson we return to more details about train/test splits.  For now, this creates relative independence of training data from the test set used for evaluation.  A deeper issue remains about whether the analyzed sample is truly representative of *all* the uncollected data of this type in the rest of the world.

In some sense, overfitting is a non-issue for this dataset if we think of it as *complete*—i.e. every response from a one-time event that can never be exactly repeated.  But in that strict sense of the particularity of the data, machine learning is irrelevant since we *have* every possible measurement.

We can visualize the several breakdowns of our individual data items:

<img src='img/train_test_split_matrix.png' width="66%"/>
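For comparison, the more rigorous `StratifiedKFold` evaluation mentioned at the start of this section might be sketched as follows. The data here is synthetic (generated with `make_classification`), and the fold count and tree depth are arbitrary choices for the example.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; sizes chosen arbitrarily
X, y = make_classification(n_samples=120, n_features=8, random_state=0)

# Each fold preserves the class proportions of the full dataset
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
model = DecisionTreeClassifier(max_depth=3, random_state=0)

# One accuracy score per fold; every sample is used for testing exactly once
scores = cross_val_score(model, X, y, cv=cv)
print(scores.mean(), scores.std())
```

The spread of the per-fold scores gives a sense of how much a single train/test split might mislead.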

## Choosing an Algorithm: Decision Trees and Random Forests

An interesting thing happened in trying a few models out. While `RandomForestClassifier` is incredibly powerful, and often produces some of the most accurate predictions of any classifier, for this particular data a single `DecisionTreeClassifier` does better. Viewers might want to think about why this turns out to be true and/or experiment with hyperparameters to find a more definite explanation; other classifiers might perform better still, of course.

I will note that choosing the best `max_depth` for decision tree family algorithms is largely a matter of trial and error. You can search the space in a nice high level API using [sklearn.model_selection.GridSearchCV](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html), but it often suffices to use a basic Python loop like:

```python
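from sklearn.tree import DecisionTreeClassifier

# Try each depth and report accuracy on the test set
for n in range(1, 20):
    tree = DecisionTreeClassifier(max_depth=n)
    tree.fit(X_train, y_train)
    print(n, tree.score(X_test, y_test))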
```

In [None]:
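from sklearn.tree import DecisionTreeClassifier

# Try each depth and report accuracy on the test set
for n in range(1, 20):
    tree = DecisionTreeClassifier(max_depth=n)
    tree.fit(X_train, y_train)
    print(n, tree.score(X_test, y_test))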
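The `GridSearchCV` alternative mentioned above might look like the sketch below. It runs on synthetic data rather than the survey responses, and the 5-fold setting is an arbitrary choice.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; sizes chosen arbitrarily
X, y = make_classification(n_samples=120, n_features=8, random_state=0)

# Evaluate every candidate depth with 5-fold cross-validation
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"max_depth": list(range(1, 20))},
    cv=5,
)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```

Unlike the plain loop, the grid search scores each depth by cross-validation within the data it is given, so a held-out test set can be kept untouched for the final evaluation.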

In [None]:
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(n_estimators=10, random_state=0)
rf.fit(X_train, y_train)
rf.score(X_test, y_test)

In [None]:
from sklearn.tree import DecisionTreeClassifier
tree = DecisionTreeClassifier(max_depth=7, random_state=0)
tree.fit(X_train, y_train)
tree.score(X_test, y_test)

In [None]:
tree.predict(X_test)

In [None]:
tree.predict_proba(X_test)
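The relationship between the two calls can be sketched on synthetic data: `predict_proba()` returns one column per class, and `predict()` returns the class with the highest probability. The dataset and depth below are arbitrary stand-ins, not the survey model.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; sizes chosen arbitrarily
X, y = make_classification(n_samples=120, n_features=8, random_state=0)
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

proba = tree.predict_proba(X[:5])   # one row per sample, one column per class
labels = tree.predict(X[:5])        # the predicted class for each sample

# Recover the labels from the probabilities via the most probable class
from_proba = tree.classes_[np.argmax(proba, axis=1)]
print(proba, labels, from_proba, sep="\n")
```

For a decision tree, each probability is the fraction of training samples of that class in the leaf the sample falls into.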

## Next Lesson

**Classification**: In the current lesson we cleaned up our dataset enough to begin to fit and score a classification model.  In the next lesson we will look more deeply at our initial classifier, and begin to compare it to a variety of other classifiers available in scikit-learn.

<a href="Classification.ipynb"><img src="img/open-notebook.png" align="left"/></a>