# 🏁 Wrap-up quiz 1

This notebook contains the guided project to answer the hands-on questions
corresponding to the module "The predictive modeling pipeline" of the
Associate Practitioner Course. In this test **we do not have access to your
code**. Only its output is evaluated, through the multiple-choice questions
that you answer in the dedicated user interface.

First, run the following cell to initialize JupyterLite. Note that only basic
libraries are available, such as pandas, matplotlib, seaborn and numpy. The
initial import of libraries can take longer than usual: the following cell may
need around 10-20 seconds to run. Please be patient.

In [None]:
%pip install seaborn==0.13.2
import matplotlib
import numpy
import pandas
import seaborn
import sklearn

Load the `ames_housing_no_missing.csv` dataset with the following cell of code.

The target is the "SalePrice" column. As we have not encountered any
regression problem yet, we convert the regression target into a classification
target, where the goal is to predict whether or not the sale price of a house
is greater than $200,000.
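
As a minimal sketch of the conversion described above, the snippet below
applies the same threshold to a few made-up prices. The `toy_prices` values
are assumptions chosen for illustration and do not come from the Ames dataset.

```python
import pandas as pd

# Toy prices, invented for illustration (not taken from the Ames dataset).
toy_prices = pd.Series([150_000, 200_000, 250_000], name="SalePrice")

# The strict comparison yields booleans; astype(int) maps False/True to 0/1.
toy_target = (toy_prices > 200_000).astype(int)
print(toy_target.tolist())  # [0, 0, 1]
```

Because the comparison is strict, a house sold for exactly $200,000 falls in
class 0.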

In [None]:
import pandas as pd
ames_housing = pd.read_csv("../datasets/ames_housing_no_missing.csv")

target_name = "SalePrice"
data, target = ames_housing.drop(columns=target_name), ames_housing[target_name]
target = (target > 200_000).astype(int)

Use the `data.info()` and `data.head()` commands to examine the columns of
the dataframe. The dataset contains:

- a) only numerical features
- b) only categorical features
- c) both numerical and categorical features

_Select a single answer_

In [None]:
# Write your code here.

How many features are available to predict whether or not a house is
expensive?

- a) 79
- b) 80
- c) 81

_Select a single answer_

In [None]:
# Write your code here.

How many features are represented with numbers?

- a) 0
- b) 36
- c) 42
- d) 79

_Select a single answer_

Hint: you can use the method
[`df.select_dtypes`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.select_dtypes.html)
or the function
[`sklearn.compose.make_column_selector`](https://scikit-learn.org/stable/modules/generated/sklearn.compose.make_column_selector.html)
as shown in a previous notebook.
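
The two options mentioned in the hint can be sketched on a small made-up
dataframe. The `toy` frame and its column names are hypothetical; the snippet
only illustrates that both approaches select columns by dtype.

```python
import pandas as pd
from sklearn.compose import make_column_selector

# Hypothetical dataframe with two numerical columns and one string column.
toy = pd.DataFrame(
    {
        "area": [50.0, 80.5, 120.0],
        "rooms": [2, 3, 5],
        "street": ["Pave", "Grvl", "Pave"],
    }
)

# Option 1: filter the dataframe by dtype, then read the column names.
numeric_cols = toy.select_dtypes(include="number").columns.tolist()

# Option 2: build a reusable selector and call it on the dataframe.
selector = make_column_selector(dtype_include="number")

print(numeric_cols)   # ['area', 'rooms']
print(selector(toy))  # ['area', 'rooms']
```

Note that a column stored as numbers is not necessarily a quantitative
feature, so the dtype alone does not settle how a column should be
preprocessed.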

In [None]:
# Write your code here.

Refer to the [dataset description](https://www.openml.org/d/42165) regarding
the meaning of the features.

Among the following features, which ones express a quantitative numerical
value (excluding ordinal categories)?

- a) "LotFrontage"
- b) "LotArea"
- c) "OverallQual"
- d) "OverallCond"
- e) "YearBuilt"

We consider the following numerical columns:

In [None]:
numerical_features = [
  "LotFrontage", "LotArea", "MasVnrArea", "BsmtFinSF1", "BsmtFinSF2",
  "BsmtUnfSF", "TotalBsmtSF", "1stFlrSF", "2ndFlrSF", "LowQualFinSF",
  "GrLivArea", "BedroomAbvGr", "KitchenAbvGr", "TotRmsAbvGrd", "Fireplaces",
  "GarageCars", "GarageArea", "WoodDeckSF", "OpenPorchSF", "EnclosedPorch",
  "3SsnPorch", "ScreenPorch", "PoolArea", "MiscVal",
]

Now create a predictive model that uses these numerical columns as input data.
Your predictive model should be a pipeline composed of a
[`sklearn.preprocessing.StandardScaler`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html)
to scale these numerical data and a
[`sklearn.linear_model.LogisticRegression`](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html).

What is the accuracy score obtained by 10-fold cross-validation (you can set
the parameter `cv=10` when calling `cross_validate`) of this pipeline?

- a) ~0.5
- b) ~0.7
- c) ~0.9

_Select a single answer_
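
For reference, here is a hedged sketch of the scaler-plus-logistic-regression
pattern evaluated with `cross_validate`. It runs on synthetic data from
`make_classification`, which is an assumption made for illustration, so the
printed score says nothing about the Ames dataset.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (not the Ames dataset).
X_toy, y_toy = make_classification(n_samples=200, n_features=5, random_state=0)

# Scale the features, then fit a logistic regression on the scaled values.
model = make_pipeline(StandardScaler(), LogisticRegression())

# For a classifier, the default score reported by cross_validate is accuracy.
cv_results = cross_validate(model, X_toy, y_toy, cv=10)
test_scores = cv_results["test_score"]
print(f"Accuracy: {test_scores.mean():.3f} ± {test_scores.std():.3f}")
```

Wrapping the scaler in the pipeline means it is refit on the training part of
each fold, so no information from the test part leaks into the scaling.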

In [None]:
# Write your code here.

Instead of using only the numerical columns, let us build a pipeline that can
process both the numerical and categorical features together as follows:

- the `numerical_features` (as defined above) should be processed as previously
  done with a `StandardScaler`;
- the remaining columns should be treated as categorical variables using a
  [`sklearn.preprocessing.OneHotEncoder`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html).

To avoid any issue with rare categories that might appear only at prediction
time, you can pass the parameter `handle_unknown="ignore"` to the
`OneHotEncoder`.
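
The preprocessing described above can be sketched with a
`sklearn.compose.ColumnTransformer` on made-up data. The `train` and `new`
frames and their column names are hypothetical, and `sparse_threshold=0` is
used here only to force a dense output that is easy to print.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical training data and one new row with an unseen category.
train = pd.DataFrame(
    {"area": [50.0, 80.0, 110.0], "street": ["Pave", "Grvl", "Pave"]}
)
new = pd.DataFrame({"area": [110.0], "street": ["Dirt"]})

num_cols = ["area"]
# Every column that is not listed as numerical is treated as categorical.
cat_cols = [col for col in train.columns if col not in num_cols]

preprocessor = ColumnTransformer(
    [
        ("num", StandardScaler(), num_cols),
        ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols),
    ],
    sparse_threshold=0,  # always return a dense array
)
preprocessor.fit(train)

# "Dirt" was never seen during fit: its one-hot columns are all zeros
# instead of raising an error.
encoded = preprocessor.transform(new)
print(encoded)
```

Such a preprocessor can then be chained with a classifier through
`make_pipeline`, exactly as a single scaler would be.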

What is the accuracy score obtained by 10-fold cross-validation of the
pipeline using both the numerical and categorical features?

- a) ~0.7
- b) ~0.9
- c) ~1.0

_Select a single answer_

In [None]:
# Write your code here.

One way to compare two models is to compare their mean scores, but small
differences in performance may easily be due to chance (e.g. when using random
resampling during cross-validation) rather than to one model predicting
systematically better than the other.

Another way is to compare cross-validation test scores of both models
fold-to-fold, i.e. counting the number of folds where one model has a better
test score than the other. This provides some extra information: are some
partitions of the data making the classification task particularly easy or
hard for both models?
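
Counting folds as described above takes a single elementwise comparison. The
scores below are hypothetical numbers invented purely for illustration; in
practice they would be the `"test_score"` arrays returned by `cross_validate`
for each model.

```python
import numpy as np

# Hypothetical per-fold test scores, invented purely for illustration.
scores_model_a = np.array([0.80, 0.85, 0.78, 0.90, 0.82])
scores_model_b = np.array([0.83, 0.84, 0.81, 0.92, 0.82])

# Compare the two models fold by fold and count the wins of model B.
n_folds_b_better = int((scores_model_b > scores_model_a).sum())
print(f"Model B is better on {n_folds_b_better} of {len(scores_model_a)} folds")
```

The comparison is only meaningful when both models are evaluated on the same
folds, for instance by passing the same `cv` argument to both calls.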

Let's visualize the second approach:

![Fold-to-fold comparison](../figures/numerical_pipeline_wrap_up_quiz_comparison.png)

Select the true statement.

The number of folds where the model using all features performs better than
the model using only numerical features lies in the range:

- a) [0, 3]: the model using all features is consistently worse
- b) [4, 6]: both models are almost equivalent
- c) [7, 10]: the model using all features is consistently better

_Select a single answer_

In [None]:
# Write your code here.