## Introducing the challenge

- Learn from the expert who won DrivenData's challenge
	- NLP
	- Feature engineering
	- Efficiency-boosting hashing tricks
- Use data to have a social impact

- Budgets for schools are huge, complex, and not standardized
	- Hundreds of hours each year are spent manually labelling
- <u>Goal</u>: Build an ML algorithm that can automate the process.

- Budget data:
	- Line-item: 'Algebra books for 8th grade students'
	- Labels: 'Textbooks', 'Math', 'Middle School'
- This is a supervised learning problem

## Load and preview the data

~~~
import pandas as pd

sample_df = pd.read_csv('sample_data.csv')
~~~

## Encode labels as categories

- ML algorithms work on numbers, not strings
	- Need a numeric representation of these strings
- Strings can be slow compared to numbers
- In pandas, 'category' dtype encodes categorical data numerically
	- Can speed up code

~~~
sample_df.label = sample_df.label.astype('category')
~~~

- Dummy variable encoding
	- Also called a 'binary indicator' representation

~~~
dummies = pd.get_dummies(sample_df[['label']], prefix_sep='_')
~~~
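
As a minimal sketch of both encodings, assuming a toy frame with a single `label` column (the values are invented for illustration and do not come from `sample_data.csv`):

```python
import pandas as pd

# Toy data standing in for sample_df; the values are made up.
toy_df = pd.DataFrame({'label': ['a', 'b', 'a', 'c']})

# 'category' dtype stores each distinct string once and keeps
# a small integer code per row.
toy_df['label'] = toy_df['label'].astype('category')
codes = toy_df['label'].cat.codes.tolist()

# Dummy encoding: one binary indicator column per category.
toy_dummies = pd.get_dummies(toy_df[['label']], prefix_sep='_')
```

Each row of `toy_dummies` has exactly one indicator switched on, in the column named after that row's label.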

## Encode labels as categories

~~~
categorize_label = lambda x: x.astype('category')

sample_df.label = sample_df[['label']].apply(categorize_label, axis=0)
~~~


## How do we measure success?

- Accuracy can be misleading when classes are imbalanced
- Metric used in this problem: log loss
	- It is a loss function
	- Measure of error
	- Want to minimize the error (unlike accuracy)

## Log loss binary classification

- Log loss for **binary** classification
	- Actual value: $y = \{1 = \textrm{ yes }, 0 = \textrm{ no }\}$
	- Prediction (probability that the value is $1$): $p$

$$
\textrm{logloss} = -\displaystyle\frac{1}{N} \displaystyle\sum_{i=1}^{N} \left[ y_i \log p_i + (1-y_i) \log (1-p_i) \right]
$$

## Computing logloss

logloss.py
~~~
import numpy as np

def compute_log_loss(predicted, actual, eps=1e-14):
	predicted = np.clip(predicted, eps, 1-eps)
	loss = -1 * np.mean(actual * np.log(predicted) + (1 - actual) * np.log(1 - predicted))

	return loss
~~~
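
To see how the metric behaves, the function can be applied to a few hand-made prediction vectors. This is only an illustrative sketch: the arrays below are invented, and the function is repeated so the block runs on its own.

```python
import numpy as np

def compute_log_loss(predicted, actual, eps=1e-14):
    # Clip so that log(0) never occurs.
    predicted = np.clip(predicted, eps, 1 - eps)
    return -np.mean(actual * np.log(predicted)
                    + (1 - actual) * np.log(1 - predicted))

actual = np.array([1.0, 0.0, 1.0, 0.0])

# Invented predictions: confident and right, undecided, confident and wrong.
confident_right = compute_log_loss(np.array([0.95, 0.05, 0.95, 0.05]), actual)
unsure = compute_log_loss(np.array([0.5, 0.5, 0.5, 0.5]), actual)
confident_wrong = compute_log_loss(np.array([0.05, 0.95, 0.05, 0.95]), actual)
```

The undecided predictions score $\log 2$, and the confident-but-wrong predictions score far worse than either of the other two.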

## It's time to build a model

- Train basic model on numeric data only
	- Want to go from raw data to predictions quickly
- Multi-class logistic regression
	- Train classifier on each label separately and use those to predict

## Splitting the multi-class dataset

- Recall: train-test split
	- Will not work here
	- may end up with labels in test set that never appear in training set
- Solution: `StratifiedShuffleSplit`
	- Only works with a single target variable
	- We have many target variables
	- `multilabel_train_test_split()`

~~~
data_to_train = df[NUMERIC_COLUMNS].fillna(-1000)

labels_to_use = pd.get_dummies(df[LABELS])

X_train, X_test, y_train, y_test = multilabel_train_test_split(data_to_train, labels_to_use, size=0.2, seed=123)
~~~
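
The source of `multilabel_train_test_split()` is not shown in these notes. The sketch below is a hypothetical stand-in, not the course helper: `split_keeping_labels` is an invented name, and it only illustrates the idea of keeping at least one positive row per label column in the training set before sampling the test rows at random.

```python
import numpy as np
import pandas as pd

def split_keeping_labels(X, Y, size=0.2, seed=None):
    """Hypothetical helper: random split that keeps at least one
    positive row per label column in the training set."""
    rng = np.random.RandomState(seed)
    n_rows = len(Y)

    # Reserve one positive row for every label column that has any.
    reserved = set()
    for col in Y.columns:
        positives = np.flatnonzero(Y[col].to_numpy())
        if len(positives) > 0 and not reserved.intersection(positives.tolist()):
            reserved.add(int(rng.choice(positives)))

    # Draw the test rows only from rows that were not reserved.
    candidates = np.array([i for i in range(n_rows) if i not in reserved], dtype=int)
    n_test = min(int(round(size * n_rows)), len(candidates))
    test_mask = np.zeros(n_rows, dtype=bool)
    if n_test > 0:
        test_mask[rng.choice(candidates, size=n_test, replace=False)] = True

    return X[~test_mask], X[test_mask], Y[~test_mask], Y[test_mask]

# Toy data: the label 'rare' has a single positive row.
toy_X = pd.DataFrame({'amount': np.arange(10, dtype=float)})
toy_Y = pd.DataFrame({
    'common': [1, 1, 1, 1, 1, 0, 0, 0, 0, 0],
    'other':  [0, 0, 0, 0, 0, 1, 1, 1, 1, 0],
    'rare':   [0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
})

X_tr, X_te, Y_tr, Y_te = split_keeping_labels(toy_X, toy_Y, size=0.2, seed=123)
```

A plain random split could send the only `rare` row to the test set; here it is reserved for training first.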

## Training the model

~~~
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

clf = OneVsRestClassifier(LogisticRegression())
clf.fit(X_train, y_train)
~~~

- `OneVsRestClassifier`:
	- treats each column of `y` independently
	- fits a separate classifier for each of the columns
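
A small sketch on synthetic data (invented here, not the budget data) shows the one-classifier-per-column behaviour: with three indicator columns in `y`, three logistic regressions are fitted.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

rng = np.random.RandomState(0)
X_toy = rng.normal(size=(60, 2))

# Three binary indicator columns, each a simple function of the features.
Y_toy = np.column_stack([
    X_toy[:, 0] > 0,
    X_toy[:, 1] > 0,
    X_toy[:, 0] + X_toy[:, 1] > 0,
]).astype(int)

toy_clf = OneVsRestClassifier(LogisticRegression())
toy_clf.fit(X_toy, Y_toy)

n_fitted = len(toy_clf.estimators_)   # one fitted estimator per column of Y_toy
proba = toy_clf.predict_proba(X_toy)  # one probability per row and column
```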

## Predicting on holdout data

~~~
holdout = pd.read_csv('HoldoutData.csv', index_col=0)

holdout = holdout[NUMERIC_COLUMNS].fillna(-1000)

predictions = clf.predict_proba(holdout)
~~~

- if `.predict()` was used instead:
	- output would be $0$ or $1$
	- logloss penalizes being confident and wrong
		- worse performance compared to `.predict_proba()`

## Submitting your predictions as a csv

~~~
prediction_df = pd.DataFrame(columns=pd.get_dummies(df[LABELS], prefix_sep='__').columns, index=holdout.index,data=predictions)

prediction_df.to_csv('predictions.csv')

score = score_submission(pred_path='predictions.csv')
~~~


## A very brief introduction to NLP

- Tokenization
	- Splitting a string into segments
	- Store segments as list
	- Examples:
		- Tokenize on whitespace
		- Tokenize on whitespace and punctuation

- Bag of words representation
	- counts the number of times a particular token appears
	- this approach discards information about the word order
		- 'Red, not blue' is the same as 'blue, not red'
	- can use n-grams
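
The two tokenization choices and the loss of word order can be sketched with the standard library alone. The first example string is invented for illustration, and `bag_of_words` is a hypothetical helper name.

```python
import re
from collections import Counter

text = 'PETRO-VEND FUEL AND FLUIDS'

# Tokenize on whitespace only: the hyphenated word stays in one piece.
whitespace_tokens = text.split()

# Tokenize on whitespace and punctuation: keep runs of letters and digits.
alnum_tokens = re.findall(r'[A-Za-z0-9]+', text)

def bag_of_words(s):
    # Count lower-cased alphanumeric tokens; the order is discarded.
    return Counter(re.findall(r'[a-z0-9]+', s.lower()))

same_bag = bag_of_words('Red, not blue') == bag_of_words('blue, not red')
```

Both phrases produce the same counts, which is why n-grams are needed to recover some information about word order.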

## Representing text numerically

- Bag of words
	- simple way to represent text in ML
	- discards information about grammar and word order
	- computes frequency of occurrence

## Bag of words in Sklearn

- `CountVectorizer()`
	- tokenizes all the strings
	- builds a 'vocabulary'
	- counts the occurrences of each token in the vocabulary

~~~
from sklearn.feature_extraction.text import CountVectorizer

TOKENS_BASIC = '\\S+(?=\\s+)'

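# Assign the result back; in-place fillna on a selected column is unreliable in recent pandas
df['Program_Description'] = df['Program_Description'].fillna('')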

vec_basic = CountVectorizer(token_pattern=TOKENS_BASIC)

vec_basic.fit(df.Program_Description)
~~~
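
A toy run, assuming two invented strings in place of `Program_Description`, shows what the fitted vocabulary looks like. Note that `TOKENS_BASIC` only matches a token that is followed by whitespace, so the strings below end with a space on purpose.

```python
from sklearn.feature_extraction.text import CountVectorizer

TOKENS_BASIC = '\\S+(?=\\s+)'

# Invented documents; the trailing space lets the last token match.
docs = ['algebra books for students ', 'books for teachers ']

toy_vec = CountVectorizer(token_pattern=TOKENS_BASIC)
counts = toy_vec.fit_transform(docs)   # sparse matrix: documents x vocabulary
vocab = sorted(toy_vec.vocabulary_)

# Without trailing whitespace the final token is never matched.
no_space_vec = CountVectorizer(token_pattern=TOKENS_BASIC).fit(['books for teachers'])
last_token_dropped = 'teachers' not in no_space_vec.vocabulary_
```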


## The pipeline workflow

- Repeatable way to go from raw data to trained model
- Pipeline object takes sequential list of steps
	- Output of one step is input to next step
- Each step is a tuple with two elements
	- Name: string
	- Transform: object implementing `.fit()` and `.transform()`
- Flexible: a step can itself be another pipeline!

- Instantiate simple pipeline with one step

~~~
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

pl = Pipeline([('clf', OneVsRestClassifier(LogisticRegression()))])
~~~

- Train and test with sample numeric data

~~~
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(sample_df[['numeric']], pd.get_dummies(sample_df['label']), random_state=2)

pl.fit(X_train, y_train)

accuracy = pl.score(X_test, y_test)
~~~

- Imputing values

~~~
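# Imputer was removed from sklearn.preprocessing in scikit-learn 0.22; SimpleImputer replaces it
from sklearn.impute import SimpleImputer as Imputer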

X_train, X_test, y_train, y_test = train_test_split(sample_df[['numeric','with_missing']], pd.get_dummies(sample_df['label']), random_state=2)

pl = Pipeline([('imp', Imputer()),
		('clf', OneVsRestClassifier(LogisticRegression()))
		])

pl.fit(X_train, y_train)
acc = pl.score(X_test, y_test)
~~~

- Preprocessing text features

~~~
from sklearn.feature_extraction.text import CountVectorizer

X_train, X_test, y_train, y_test = train_test_split(sample_df['text'], pd.get_dummies(sample_df['label']), random_state=2)

pl = Pipeline([('vec', CountVectorizer()),('clf', OneVsRestClassifier(LogisticRegression()))])

pl.fit(X_train, y_train)
acc = pl.score(X_test, y_test)
~~~

- Preprocessing multiple dtypes
	- Want to use <u>all</u> available features in one pipeline
	- Problem
		- Pipeline steps for numeric and text preprocessing can't follow each other
	- Solution
		- `FunctionTransformer()` & `FeatureUnion()`


## FunctionTransformer

- Turns a Python function into an object that scikit-learn pipeline can understand
- Need to write two functions for pipeline preprocessing
	- Take entire DataFrame, return numeric columns
	- Take entire DataFrame, return text columns
- Can then preprocess numeric and text data in separate pipelines

~~~
X_train, X_test, y_train, y_test = train_test_split(sample_df[['numeric','with_missing','text']], pd.get_dummies(sample_df['label']), random_state=2)

from sklearn.preprocessing import FunctionTransformer
from sklearn.pipeline import FeatureUnion

get_text_data = FunctionTransformer(lambda x: x['text'], validate=False)

get_numeric_data = FunctionTransformer(lambda x: x[['numeric','with_missing']], validate=False)
~~~

- Putting it all together

~~~
numeric_pipeline = Pipeline([('selector', get_numeric_data), ('imputer', Imputer())])

text_pipeline = Pipeline([('selector', get_text_data), ('vectorizer', CountVectorizer())])

pipeline = Pipeline([
			('union', FeatureUnion([('numeric', numeric_pipeline), ('text', text_pipeline)])),
			('clf', OneVsRestClassifier(LogisticRegression()))
		    ])
~~~
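
Because `sample_df` is not included in these notes, the following sketch rebuilds the same pipeline on an invented eight-row frame so it can be run end to end. The data, the `toy_` names, and the selector function names are assumptions; `SimpleImputer` is used as the current scikit-learn imputer.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import FunctionTransformer

# Invented toy data standing in for sample_df.
toy_df = pd.DataFrame({
    'numeric': [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0],
    'with_missing': [0.5, np.nan, 1.5, 2.0, np.nan, 3.0, 3.5, 4.0],
    'text': ['math books', 'bus fuel', 'math teacher', 'bus driver',
             'algebra books', 'fuel for bus', 'math tutor', 'bus repair'],
    'label': ['a', 'b', 'a', 'b', 'a', 'b', 'a', 'b'],
})

def select_text(frame):
    return frame['text']

def select_numeric(frame):
    return frame[['numeric', 'with_missing']]

toy_pipeline = Pipeline([
    ('union', FeatureUnion([
        # Numeric branch: pick the columns, then fill missing values.
        ('numeric', Pipeline([
            ('selector', FunctionTransformer(select_numeric, validate=False)),
            ('imputer', SimpleImputer()),
        ])),
        # Text branch: pick the column, then count tokens.
        ('text', Pipeline([
            ('selector', FunctionTransformer(select_text, validate=False)),
            ('vectorizer', CountVectorizer()),
        ])),
    ])),
    ('clf', OneVsRestClassifier(LogisticRegression())),
])

features = toy_df[['numeric', 'with_missing', 'text']]
targets = pd.get_dummies(toy_df['label']).astype(int)

toy_pipeline.fit(features, targets)
toy_proba = toy_pipeline.predict_proba(features)
```

The `FeatureUnion` runs both branches on the same input frame and stacks their outputs side by side before the classifier sees them.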

## Learning from the expert: text preprocessing

- NLP tricks for text data
	- Tokenize on punctuation to avoid hyphens, underscores etc.
	- Include unigrams **and** bi-grams in the model to capture important information involving multiple tokens - e.g., 'middle school'

- N-grams and tokenization

~~~
vec = CountVectorizer(token_pattern=TOKENS_ALPHANUMERIC, ngram_range=(1,2))
~~~
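
`TOKENS_ALPHANUMERIC` is not defined in these notes. The sketch below assumes a pattern analogous to `TOKENS_BASIC` (runs of letters and digits followed by whitespace) and uses an invented one-line document to show what `ngram_range=(1,2)` adds to the vocabulary.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Assumed pattern, not taken from the notes: alphanumeric runs followed by whitespace.
TOKENS_ALPHANUMERIC = '[A-Za-z0-9]+(?=\\s+)'

# Invented document; the trailing space lets the last token match.
docs = ['middle school math ']

unigram_vec = CountVectorizer(token_pattern=TOKENS_ALPHANUMERIC, ngram_range=(1, 1)).fit(docs)
bigram_vec = CountVectorizer(token_pattern=TOKENS_ALPHANUMERIC, ngram_range=(1, 2)).fit(docs)

unigrams = sorted(unigram_vec.vocabulary_)
uni_and_bigrams = sorted(bigram_vec.vocabulary_)
```

With bi-grams included, 'middle school' becomes a feature of its own rather than two unrelated tokens.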

## Learning from the expert: interaction terms

- Example
	- 'English teacher for 2nd grade'
	- '2nd grade - budget for English teacher'
- Interaction terms mathematically describe when tokens appear together
- The math: $\beta_1 x_1 + \beta_2 x_2 + \beta_3 (x_1 \times x_2)$
- With scikit-learn

~~~
from sklearn.preprocessing import PolynomialFeatures

interaction = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)

interaction.fit_transform(x)
~~~
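
For a concrete picture, here is a minimal sketch with an invented two-column array, where each column can be read as "token present or not". The transformer appends the product $x_1 \times x_2$ as a third column.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Invented input: row 0 has only the second token, row 1 has both.
x = np.array([[0, 1],
              [1, 1]])

interaction = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
expanded = interaction.fit_transform(x)   # columns: x1, x2, x1 * x2
```

The interaction column is non-zero only for the row in which both tokens appear together.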

## Learning from the expert: hashing trick

- Adding new features may cause enormous increase in array size
- Hashing is a way of increasing memory efficiency
- Hash function limits possible outputs, fixing array size

~~~
from sklearn.feature_extraction.text import HashingVectorizer

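# alternate_sign=False keeps counts non-negative (non_negative=True was removed in scikit-learn 0.21)
vec = HashingVectorizer(norm=None, alternate_sign=False, token_pattern=TOKENS_ALPHANUMERIC, ngram_range=(1,2))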
~~~
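
The fixed array size can be checked with a small sketch. The documents are invented, and `n_features=16` is chosen only to keep the example tiny; `alternate_sign=False` is the current scikit-learn option for keeping the hashed counts non-negative.

```python
from sklearn.feature_extraction.text import HashingVectorizer

# Invented documents.
docs = ['algebra books for 8th grade students', 'fuel for school buses']

# The hasher is stateless, so no call to fit() is needed.
small_vec = HashingVectorizer(n_features=16, norm=None, alternate_sign=False)

hashed = small_vec.transform(docs)
more = small_vec.transform(docs + ['brand new tokens never seen before'])
```

Adding a document full of unseen tokens adds a row but no columns: every token is hashed into one of the 16 existing slots, at the cost of occasional collisions.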

## The model that won it all

- NLP: range of n-grams, punctuation tokenization
- Stats: interaction terms
- Computation: hashing trick
- Model: `LogisticRegression` !


Full winning notebook available [here](https://github.com/datacamp/course-resources-ml-with-experts-budgets/blob/master/notebooks/1.0-full-model.ipynb).