# Lale User Study - March 2021 - Lale treatment

The goal of this study is to examine different dimensions of the usability of machine learning (ML) pipelines. We will provide a number of sample pipelines and ask you to analyze and manipulate them in four tasks:
1. Understanding the pipeline
1. Refinement
1. Debugging
1. Refinement with Search

Before we start, we load a dataset and print a few rows to see what it looks like.

In [3]:
# Load the forest covertype data (downsampled for faster experiments)
import pandas as pd
from lale.datasets import covtype_df
from sklearn.model_selection import train_test_split
from lale.lib.lale import categorical

train_X = pd.read_pickle("train_x.pickle")
test_X = pd.read_pickle("test_x.pickle")
train_y = pd.read_pickle("train_y.pickle")
test_y = pd.read_pickle("test_y.pickle")

pd.options.display.max_columns = 10
pd.concat([train_y, train_X], axis=1)



| | target | Elevation | Aspect | Slope | Horizontal_Distance_To_Hydrology | ... | Soil_Type35 | Soil_Type37 | Soil_Type38 | Soil_Type39 | Soil_Type40 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 507955 | 3 | 2575.0 | 272.0 | 5.0 | 481.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 381859 | 1 | 3406.0 | 162.0 | 14.0 | 631.0 | ... | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 316836 | 1 | 3020.0 | 304.0 | 9.0 | 108.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 556434 | 2 | 2740.0 | 69.0 | 17.0 | 67.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 275116 | 2 | 2706.0 | 258.0 | 8.0 | 201.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 538389 | 2 | 2860.0 | 324.0 | 13.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 527299 | 2 | 2653.0 | 314.0 | 22.0 | 30.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 169780 | 1 | 3279.0 | 7.0 | 15.0 | 713.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 296001 | 1 | 3149.0 | 156.0 | 9.0 | 90.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |


## Step 1. Understand a Pipeline

Consider the following Lale pipeline:

In [4]:
from lale.lib.sklearn import Normalizer
from lale.lib.sklearn import SelectKBest
from lale.lib.sklearn import KNeighborsClassifier
from lale.lib.lale import Project
from lale.lib.lale import categorical
from lale.lib.lale import ConcatFeatures

In [5]:
prepA = Project(drop_columns=categorical(max_values=2)) >> Normalizer()
prepB = Project(columns=categorical(max_values=2)) >> SelectKBest(k=8)
pipeline1 = (prepA & prepB) >> ConcatFeatures >> KNeighborsClassifier()

In [6]:
from sklearn.metrics import accuracy_score
trained1 = pipeline1.fit(train_X, train_y)
print(f"accuracy {accuracy_score(test_y, trained1.predict(test_X)):.1%}")

accuracy 69.7%
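For readers less familiar with the metric, `accuracy_score` reports the fraction of predictions that exactly match the true labels. The block below is a minimal pure-Python sketch of that computation. The helper name `accuracy` and the toy labels are invented for illustration and are not part of the study data.

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the predicted label equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    matches = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return matches / len(y_true)


# Toy labels (hypothetical, not taken from the covertype data).
toy_true = [1, 2, 2, 3, 1]
toy_pred = [1, 2, 1, 3, 1]

# Four of the five toy predictions match.
print(f"accuracy {accuracy(toy_true, toy_pred):.1%}")  # accuracy 80.0%
```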


Handy documentation links:
- [Lale API](https://lale.readthedocs.io/en/latest/)
- [scikit-learn API](https://scikit-learn.org/stable/modules/classes.html)

In [None]:
# if you'd like to try things out, put your code here

Please answer the following questions.

- Q1a. What is the final classifier in the pipeline?
- Q1b. Where does the input for the final classifier come from?
- Q1c. Which columns are subjected to feature selection?

## Step 2. Refine without Search

Create a `pipeline2` that is similar to `pipeline1` from Step 1, except
that it uses a `StandardScaler` instead of `Normalizer` and a
`DecisionTreeClassifier` with a maximum depth of 3 instead of
`KNeighborsClassifier`.

Handy documentation links:
- [Lale API](https://lale.readthedocs.io/en/latest/)
- [scikit-learn API](https://scikit-learn.org/stable/modules/classes.html)

### Answer for Step 2.

In [5]:
# pipeline1 reproduced below:
prepA = Project(drop_columns=categorical(max_values=2)) >> Normalizer()
prepB = Project(columns=categorical(max_values=2)) >> SelectKBest(k=8)
pipeline1 = (prepA & prepB) >> ConcatFeatures >> KNeighborsClassifier()

# your code here

### Questions about Step 2.

- Q2a. What documentation did you find the most helpful?
- Q2b. Did your new pipeline work the first time? If not, what went wrong?
- Q2c. What gotchas did you encounter, if any?

## Step 3. Error Messages

Look at the error message from the following code:

In [10]:
from lale.lib.sklearn import LinearSVC
pipeline3 = SelectKBest(k=15) >> LinearSVC(penalty="l1", loss="hinge")

ValidationError: Invalid configuration for LinearSVC(penalty='l1', loss='hinge') due to constraint the combination of penalty=`l1` and loss=`hinge` is not supported.
Schema of constraint 1: {
    "description": "The combination of penalty=`l1` and loss=`hinge` is not supported",
    "anyOf": [
        {"type": "object", "properties": {"penalty": {"enum": ["l2"]}}},
        {
            "type": "object",
            "properties": {"loss": {"enum": ["squared_hinge"]}},
        },
    ],
}
Value: {'penalty': 'l1', 'loss': 'hinge', 'dual': True, 'tol': 0.0001, 'C': 1.0, 'multi_class': 'ovr', 'fit_intercept': True, 'intercept_scaling': 1.0, 'class_weight': None, 'verbose': 0, 'random_state': None, 'max_iter': 1000}
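The schema in the message uses the JSON Schema keyword `anyOf`: a configuration passes if at least one of the listed branches accepts it. As a reading aid, the sketch below re-implements only that check in plain Python for the `enum`-only branches shown above. It is a simplification, not Lale's validation code; the names `CONSTRAINT`, `branch_accepts`, and `satisfies_constraint` are invented here.

```python
# The constraint from the error message, copied as a Python dict.
CONSTRAINT = {
    "anyOf": [
        {"type": "object", "properties": {"penalty": {"enum": ["l2"]}}},
        {"type": "object", "properties": {"loss": {"enum": ["squared_hinge"]}}},
    ]
}


def branch_accepts(branch, config):
    """True if every listed property that is present in config is in its enum."""
    for name, rule in branch["properties"].items():
        if name in config and config[name] not in rule["enum"]:
            return False
    return True


def satisfies_constraint(constraint, config):
    """anyOf semantics: at least one branch must accept the configuration."""
    return any(branch_accepts(branch, config) for branch in constraint["anyOf"])


# The relevant part of the "Value" reported in the error message.
failing = {"penalty": "l1", "loss": "hinge"}
print(satisfies_constraint(CONSTRAINT, failing))  # False
```

Neither branch accepts the reported value, which is why the configuration is rejected.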

In [11]:
trained3 = pipeline3.fit(train_X, train_y)
pred_y = trained3.predict(test_X)

NameError: name 'pipeline3' is not defined

Make a small change to the pipeline to avoid that error.

Handy documentation links:
- [Lale docs for LinearSVC](https://lale.readthedocs.io/en/latest/modules/lale.lib.sklearn.linear_svc.html)
- [sklearn docs for LinearSVC](https://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html)

### Answer for Step 3.

In [None]:
# Please fix the reproduced pipeline in the cells below:

from lale.lib.sklearn import LinearSVC
pipeline3 = SelectKBest(k=15) >> LinearSVC(penalty="l1", loss="hinge")

In [None]:
trained3 = pipeline3.fit(train_X, train_y)
pred_y = trained3.predict(test_X)

### Questions about Step 3.

- Q3a. What caused the error?
- Q3b. Which documentation did you find useful for diagnosing the error?
- Q3c. Was the schema in the error message useful?
- Q3d. How do you normally debug machine learning pipelines?

## Step 4. Refine with Search

Experiment with a search space of variants of `pipeline1` (from Step 1):

- normalizers: Normalizer, StandardScaler, or neither
- classifiers: LogisticRegression, ExtraTreesClassifier,
  KNeighborsClassifier, or DecisionTreeClassifier

Find the one with the best predictive performance.
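To get a feel for the size of this search space, the choices listed above can be enumerated by name. This sketch only lists the combinations as strings, using `None` as a stand-in for "no normalizer". It does not build or train any pipelines, and how to express the search in Lale is left to you.

```python
from itertools import product

# Choices named in the task; None stands for "no normalizer".
normalizers = ["Normalizer", "StandardScaler", None]
classifiers = [
    "LogisticRegression",
    "ExtraTreesClassifier",
    "KNeighborsClassifier",
    "DecisionTreeClassifier",
]

# Every normalizer choice paired with every classifier choice.
variants = list(product(normalizers, classifiers))
print(len(variants))  # 12

for normalizer, classifier in variants:
    print(f"{normalizer or 'no normalizer'} + {classifier}")
```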

Handy documentation links:
- [Lale API](https://lale.readthedocs.io/en/latest/)
- [scikit-learn API](https://scikit-learn.org/stable/modules/classes.html)

### Answer for Step 4.

In [9]:
from lale.lib.sklearn import StandardScaler
from lale.lib.sklearn import LogisticRegression, ExtraTreesClassifier, DecisionTreeClassifier

# pipeline1 reproduced below:
prepA = Project(drop_columns=categorical(max_values=2)) >> Normalizer()
prepB = Project(columns=categorical(max_values=2)) >> SelectKBest(k=8)
pipeline1 = (prepA & prepB) >> ConcatFeatures >> KNeighborsClassifier()

# your code here

### Questions about Step 4.

- Q4a. Which pipeline variant led to the highest accuracy?
- Q4b. What was the accuracy of that pipeline variant?
- Q4c. Did your new pipeline work the first time? If not, what went wrong?
- Q4d. What gotchas did you encounter, if any?
- Q4e. How do you normally search across pipeline variants?