# Machine Learning with scikit-learn

## Getting our Hands Dirty

Let's go through several techniques within scikit-learn that we will return to and explore in more detail in subsequent lessons.

Whenever we perform supervised learning, our workflow will resemble the diagram here. We divide the data into training and testing sets. Within each set, most "columns" of data are known as *features*, and one (sometimes more) is known as the *target*. The difference between classification and regression is whether the target is categorical or continuous.

<!--- 
<img src='img/supervised_workflow.png' width=40% align="left"/>
-->
<img src='supervised_workflow.png' width=40% align="left"/>
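
As a small illustration of the feature/target distinction, here is a sketch on made-up data. The column names and values are hypothetical and are not part of the survey dataset used in this lesson. The same features can serve either kind of problem, depending on which column is chosen as the target.

```python
import pandas as pd

# Hypothetical toy data; names and values are for illustration only
toy_students = pd.DataFrame({
    "hours_studied": [1.0, 2.5, 4.0, 5.5],
    "prior_courses": [0, 1, 1, 3],
    "exam_score": [52.0, 61.5, 70.0, 88.0],   # continuous value
    "passed": [False, True, True, True],      # categorical value
})

# Features: the columns we predict from
X_toy = toy_students[["hours_studied", "prior_courses"]]

# A continuous target makes this a regression problem
y_regression = toy_students["exam_score"]

# A categorical target makes this a classification problem
y_classification = toy_students["passed"]

print(X_toy.shape, y_regression.dtype, y_classification.dtype)
```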

## A Simple Dataset

Let's consider the following simple dataset, which is [a CSV file](./data/Learning%20about%20Humans%20learning%20ML.csv).

Data almost never arrives in a *clean* form, so we have to **clean** and **pre-process** it. Using the facilities in Pandas, we can improve the initial data before applying scikit-learn to it. In particular, in this simple dataset, the field "`Years of post-secondary education (e.g. BA=4; Ph.D.=10)`" needs to be an integer. Also, the "`Timestamp`" values added by the form interface are unnecessary.

Let's start to look at the data:

In [None]:
import pandas as pd
from os.path import join
import warnings       
warnings.simplefilter("ignore")  # ignore warning when processing

In [None]:
# Read the data, using join() to concatenate directory and file name
#fname = join('data', "Learning about Humans learning ML.csv")
#humans = pd.read_csv(fname)

humans = pd.read_csv("Learning about Humans Learning ML.csv")

# show features of the dataframe
print('Before processing:\n')
print("\n".join(humans.columns))

# Drop unused column
humans.drop('Timestamp', axis=1, inplace=True)

# Add an improved column
humans['Education'] = (humans[
    'Years of post-secondary education (e.g. BA=4; Ph.D.=10)']
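                       # remove any text up to and including '='; regex=True is
                       # required in recent pandas, where the default is literal
                       .str.replace(r'.*=', '', regex=True)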
                       .astype(int))

# Then drop the one it is based on
humans.drop('Years of post-secondary education (e.g. BA=4; Ph.D.=10)', 
            axis=1, inplace=True)

# show features of the dataframe after we remove and add a feature
print('\n*******\nAfter processing:\n')
print("\n".join(humans.columns))
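
To see what the regular expression in the `Education` cleanup does in isolation, here is a minimal sketch on a few made-up answers. The raw strings are assumptions about what respondents might have typed, not values taken from the CSV.

```python
import pandas as pd

# Hypothetical raw answers: some plain numbers, some copied from the prompt
raw_education = pd.Series(["4", "BA=4", "Ph.D.=10", "6"])

# Remove everything up to and including "=", then convert to integers
clean_education = (raw_education
                   .str.replace(r".*=", "", regex=True)
                   .astype(int))

print(clean_education.tolist())
```

Answers without an `=` are left unchanged by the replacement, so only the prefixed ones are altered before the integer conversion.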



## Eyeballing Data

At the start of your work on a dataset, it is always useful to take a look at it to get a "feel" for the data. For this example, the dataset is small enough that it wouldn't be absurd to look at every single data point in it.  However, many of the datasets you will work with will have hundreds of thousands or millions of rows, and item by item examination is impossible.  For these cases, we need to look at representative values and aggregations of features.

If the dataset can be read as a Pandas DataFrame, overview inspection is particularly easy and friendly.

> **"90% of the time spent doing data analysis is doing data cleanup."** –Every Data Scientist

In [None]:
print("\n".join(humans.columns))  # show features again

In [None]:
humans.head(4)  # show the first four data points

Looking at a few rows of data often can help correct or improve our understanding of the meaning, range, units, common values, etc. of the data we wish to construct models around. In a great many cases, common sense can prevent chasing down dead ends that take hours or days of needless time.

For convenience, let us give shorter names to all the columns.  There is nothing functional in this change, but it often makes later code look nicer.

In [None]:
humans.columns = ['Fav_lang', 'Fav_movie', 'Experience', 'Sklearn', 
                  'Age', 'Humans_Machines', 'Fav_Game', 'Success', 'Education']
humans.head(4)

Looking at the metadata and a basic **statistical aggregation** of the data is also generally useful.  Pandas DataFrames provide a very easy way to look at this:

In [None]:
humans.describe(include=['int', 'float', 'object'])
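
Alongside `describe()`, per-column aggregations are often enough to get a feel for a feature. The sketch below uses a made-up frame (the values are hypothetical) to show `value_counts()` for a categorical column and the minimum and maximum for a numeric one.

```python
import pandas as pd

# Hypothetical toy frame, reusing two of the short column names
toy_survey = pd.DataFrame({
    "Fav_lang": ["Python", "R", "Python", "Julia", "Python"],
    "Age": [31, 27, 3, 45, 99],
})

# How often does each category occur?
lang_counts = toy_survey["Fav_lang"].value_counts()

# What range does the numeric column span?
age_min, age_max = toy_survey["Age"].min(), toy_survey["Age"].max()

print(lang_counts)
print("Age range:", age_min, "to", age_max)
```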

## Data Cleanup

It is useful to explore aspects of the data offline. In the summary view, a few data quality issues jump out. Such issues are nearly universal in real-world datasets.

For example, it is unlikely that two 3-year-olds were in the dataset. More likely, some 30-somethings mistyped when entering their ages. A 99-year-old is possible, but that also seems more likely to be a placeholder value used during data entry. While the description of what is meant by the integer "Education" was probably underspecified, it still feels like -10 years of education is more likely to be a data entry problem than an intended indicator.

However, **the data we have is the data we must analyze**.

In [None]:
humans[humans.Age == 3]  # print out data with Age == 3

#humans[humans.Education == -10.0]
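
One way to keep track of questionable rows without discarding them is to flag them with boolean masks. This is only a sketch on made-up data. The plausibility bounds (ages 16 to 90, non-negative education) are assumptions chosen for illustration, not rules from the survey.

```python
import pandas as pd

# Hypothetical rows that mimic the issues described above
toy_people = pd.DataFrame({
    "Age": [31, 3, 27, 99, 3],
    "Education": [4, 6, -10, 10, 2],
})

# Assumed plausibility rules; adjust with domain knowledge
suspect_age = ~toy_people["Age"].between(16, 90)
suspect_education = toy_people["Education"] < 0

# Flag rows instead of dropping them
toy_people["Suspect"] = suspect_age | suspect_education

print(toy_people[toy_people["Suspect"]])
```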

### One-hot Encoding

Several features of the data represent a small number of discrete categories.  For many or most algorithms, using one-hot encoding of categorical data is more effective than using raw categories or converting to integers. Basically, all those columns that have a small number of unique values (and specifically values that are not ordinal, even implicitly) are categorical.

One-hot encoding makes less difference for the decision tree and random forest classifiers used than it might for other classifiers and regressors, but it rarely hurts. We perform the encoding with `pandas.get_dummies()`, but you could equally use `sklearn.preprocessing.LabelBinarizer` to accomplish the same goal.

In [None]:
# dtype=int gives 0/1 indicators (recent pandas defaults to True/False)
human_dummies = pd.get_dummies(humans, dtype=int)
list(human_dummies.columns)  # see how we EXPAND some fields (e.g., Fav_lang)

In [None]:
# Let's display this processed data, now all categorical variables are indicators (0 or 1)
human_dummies.head(4)
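
To make the expansion concrete, here is a minimal sketch of `pd.get_dummies()` on a tiny made-up frame (the rows are hypothetical). The numeric column passes through unchanged, while the categorical column becomes one indicator column per category.

```python
import pandas as pd

# Hypothetical toy frame with one categorical and one numeric column
toy_langs = pd.DataFrame({
    "Fav_lang": ["Python", "R", "Python"],
    "Age": [30, 25, 41],
})

# dtype=int requests 0/1 indicators rather than booleans
toy_encoded = pd.get_dummies(toy_langs, dtype=int)

print(list(toy_encoded.columns))
print(toy_encoded)
```

Each row has exactly one `1` among the `Fav_lang_*` columns, which is where the name "one-hot" comes from.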

## Classification: Choosing Features and a Target

Let us use scikit-learn to model the dataset. In particular, we would like to know whether other features of attendees are a good predictor of how successful they found this scikit-learn tutorial. A very common pattern you will see in machine learning that starts from DataFrames is to drop one (or more) columns to form the X features, and keep the dropped column as the y target.

In my analysis, I felt a binary measure of success was more relevant than the scalar measure initially collected on a 1-10 scale. Moreover, if the target is simplified this way, it becomes appropriate to use a *classification* algorithm as opposed to a *regression* algorithm. It would be a mistake to treat the 1-10 scale as a categorical consisting of 10 independent labels: there is something inherently ordinal about these labels, although scikit-learn will happily calculate models as if there is not. On the other hand, responses to this ordinal question are generally non-uniform in distribution, usually with a clustering of values near the top of the scale.

This is a place where **domain knowledge** is needed by a data scientist.

In [None]:
X = human_dummies.drop("Success", axis=1)  # Drop the "Success" column; the remaining columns form X
y = human_dummies.Success >= 8   # Boolean target: True when the score is >= 8

#print (X)    # use this to show input X
#print (y)    # use this to show target y

The cutoff we selected, success scores >= 8, divides the data into "Yes" (`True`) and "No" (`False`) categories.

In [None]:
y.value_counts()
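
The thresholding step can be tried on its own with a handful of made-up scores. The values below are hypothetical. The sketch shows how a comparison turns a 1-10 scale into a boolean target and how to check the resulting class balance.

```python
import pandas as pd

# Hypothetical 1-10 success ratings
toy_scores = pd.Series([10, 8, 7, 9, 5, 8], name="Success")

# Comparison produces a boolean Series: True means "successful"
toy_target = toy_scores >= 8

print(toy_target.value_counts())
print("Share of positives:", toy_target.mean())
```

Checking the balance matters because a heavily skewed target makes raw accuracy harder to interpret.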

## Conventional Names and Shapes

In almost all machine learning discussions, you will see the names capital-X and lowercase-y for the feature set and the target. The idea is that X holds the independent variables, and it is capitalized because in general one expects there to be multiple such feature variables (a matrix). The target consists of just one dependent variable (a vector), and hence it is lowercase. The feature set and the target will always have the same number of rows.

Moreover, you will often see the features within X named $x_1$, $x_2$, $x_3$, and so on in many machine learning texts.

In [None]:
y.head()  # see the first 5 target values

#y.head(10)  # see the first 10 target values

In [None]:
X.head()  # see the top 5 data points
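
The shape convention can be checked directly. In this sketch the arrays are made up: `X_demo` is two-dimensional (rows by features) and `y_demo` is one-dimensional, with one entry per row of `X_demo`.

```python
import numpy as np

# Hypothetical feature matrix: 4 samples, 3 features (x_1, x_2, x_3)
X_demo = np.array([
    [31, 4, 1],
    [27, 6, 0],
    [45, 10, 1],
    [22, 2, 0],
])

# Hypothetical target vector: one label per sample
y_demo = np.array([True, False, True, False])

print("X shape:", X_demo.shape)
print("y shape:", y_demo.shape)

# The number of rows must match
assert X_demo.shape[0] == y_demo.shape[0]
```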

## Train/Test Split

While using [sklearn.model_selection.StratifiedKFold](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedKFold.html) is a more rigorous way of evaluating a model, for quick-and-dirty experimentation, using `train_test_split()` is usually the easiest approach. In either case, the basic principle is that you want to avoid overfitting by training on different data than you use to test your model.

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

print("Training features/target:", X_train.shape, y_train.shape)
print("Testing features/target:", X_test.shape, y_test.shape)

Later, we will return to more details about train/test splits.  For now, this creates relative independence of training data from the test set used for evaluation.  A deeper issue remains about whether the analyzed sample is truly representative of *all* the uncollected data of this type in the rest of the world.

Let's visualize several breakdowns of our individual data items:

<!---
<img src='img/train_test_split_matrix.png' width="66%"/>
--->

<img src='train_test_split_matrix.png' width="66%"/>
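
For comparison with a single `train_test_split()`, here is a hedged sketch of the `StratifiedKFold` approach mentioned above. It runs on synthetic data from `make_classification` rather than the survey data, and the choices of 5 folds and `max_depth=3` are arbitrary assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (not the survey dataset)
X_cv, y_cv = make_classification(n_samples=200, n_features=8, random_state=0)

# Each fold keeps roughly the same class proportions as the full data
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Fit and score the model once per fold
fold_scores = cross_val_score(
    DecisionTreeClassifier(max_depth=3, random_state=0), X_cv, y_cv, cv=cv
)

print("Per-fold accuracy:", fold_scores)
print("Mean accuracy:", fold_scores.mean())
```

Averaging over several folds gives a less noisy estimate than one split, at the cost of fitting the model several times.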

## Choosing an Algorithm: Decision Trees and Random Forests

An interesting thing happened in trying a few **machine learning models** out. In this simple exercise, we will try `RandomForestClassifier` and `DecisionTreeClassifier`.

Note that choosing the best `max_depth` for decision tree family algorithms is largely a matter of trial and error. You can search the space in a nice high level API using [sklearn.model_selection.GridSearchCV](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html), but it often suffices to use a basic Python loop like:

```python
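from sklearn.tree import DecisionTreeClassifier

# Fit one tree per candidate depth and report its test accuracy
for n in range(1, 20):
    tree = DecisionTreeClassifier(max_depth=n, random_state=0)
    tree.fit(X_train, y_train)
    print(n, tree.score(X_test, y_test))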
```
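
The `GridSearchCV` route might look like the following sketch. It uses synthetic data from `make_classification` so that it runs on its own. The grid of depths mirrors the loop above, and `cv=5` is an assumption.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (not the survey dataset)
X_grid, y_grid = make_classification(n_samples=200, n_features=8, random_state=0)

# Try every max_depth from 1 to 19, scoring each by 5-fold cross-validation
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"max_depth": list(range(1, 20))},
    cv=5,
)
search.fit(X_grid, y_grid)

print("Best parameters:", search.best_params_)
print("Best cross-validated accuracy:", search.best_score_)
```

One difference from the plain loop is that the grid search scores each candidate by cross-validation on the data it is given, so a held-out test set can stay untouched until the final evaluation.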

In [None]:
# Let's try RandomForest 
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(n_estimators=10, random_state=0)
rf.fit(X_train, y_train)
rf.score(X_test, y_test)

In [None]:
# Let's try DecisionTree
from sklearn.tree import DecisionTreeClassifier
tree = DecisionTreeClassifier(max_depth=7, random_state=0)
tree.fit(X_train, y_train)
tree.score(X_test, y_test)

In [None]:
# Let's use the decision tree to do **prediction**
tree.predict(X_test)
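
For a classifier, `score()` reports accuracy, which is the fraction of `predict()` outputs that match the true labels. The sketch below checks that relationship on synthetic data. The dataset and `max_depth=3` are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (not the survey dataset)
X_syn, y_syn = make_classification(n_samples=200, n_features=8, random_state=0)
Xs_train, Xs_test, ys_train, ys_test = train_test_split(
    X_syn, y_syn, random_state=1
)

clf = DecisionTreeClassifier(max_depth=3, random_state=0)
clf.fit(Xs_train, ys_train)

# One predicted label per test row
predictions = clf.predict(Xs_test)

# Accuracy computed by hand versus the built-in score()
manual_accuracy = np.mean(predictions == ys_test)
print(manual_accuracy, clf.score(Xs_test, ys_test))
```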

## Conclusion

**Classification**: In the current lesson we **cleaned up our dataset** enough to begin to fit and try out a **classification model**.  In the lessons that follow, we will look more deeply at our initial classifier and compare it to a variety of other classifiers available in scikit-learn.

