## 📝 Exercise M1.02

The goal of this exercise is to fit a model similar to the one in the previous notebook, in order to get familiar with manipulating scikit-learn objects and in particular the `.fit/.predict/.score` API.

Let's load the adult census dataset with only numerical variables.

In [55]:
import pandas as pd

adult_census = pd.read_csv("../datasets/adult-census-numeric.csv")
data = adult_census.drop(columns="class")
target = adult_census["class"]

In the previous notebook we used `model = KNeighborsClassifier()`. All
scikit-learn models can be created without arguments. This is convenient
because it means that you don't need to understand the full details of a model
before starting to use it.

One of the `KNeighborsClassifier` parameters is `n_neighbors`. It controls the
number of neighbors we are going to use to make a prediction for a new data
point.

What is the default value of the `n_neighbors` parameter?

**The default value is 5.**

**Hint**: Look at the documentation on the [scikit-learn
website](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html)
or directly access the description inside your notebook by running the
following cell. This opens a pager pointing to the documentation.

In [56]:
from sklearn.neighbors import KNeighborsClassifier

# IPython/Jupyter syntax: a trailing "?" opens the documentation pager.
KNeighborsClassifier?
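
As a complement to reading the documentation, the default can also be checked programmatically. The following is a minimal sketch; the variable names `default_model` and `default_n_neighbors` are illustrative and not part of the exercise.

```python
from sklearn.neighbors import KNeighborsClassifier

# Instantiate without arguments so that every parameter keeps its default.
default_model = KNeighborsClassifier()

# get_params() returns the constructor parameters as a dictionary.
default_n_neighbors = default_model.get_params()["n_neighbors"]
print(default_n_neighbors)
```
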

Create a `KNeighborsClassifier` model with `n_neighbors=50`.

In [57]:
model = KNeighborsClassifier(n_neighbors=50)

Fit this model on the data and target loaded above.

In [58]:
model.fit(data, target)

Use your model to make predictions on the first 10 data points inside the
data. Do they match the actual target values?

In [59]:
target_predicted = model.predict(data[:10])

Compute the accuracy on the training data.

In [60]:
print(
    "Number of correct predictions: "
    f"{(target[:10] == target_predicted[:10]).sum()} / 10"
)

Number of correct predictions: 9 / 10
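
The count above covers only the first 10 rows. Accuracy over the whole training set is the fraction of correct predictions across all rows, which is what `score` returns for a classifier, so on the census data this would presumably be `model.score(data, target)`. The sketch below illustrates the equivalence on synthetic data, since the CSV file is not assumed to be available; `X_toy`, `y_toy` and `toy_model` are hypothetical names and the data is generated, not taken from the census.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the census data (an assumption for illustration).
X_toy, y_toy = make_classification(n_samples=500, n_features=4, random_state=0)

toy_model = KNeighborsClassifier(n_neighbors=50)
toy_model.fit(X_toy, y_toy)

# Accuracy by hand: fraction of predictions equal to the true label.
manual_accuracy = np.mean(toy_model.predict(X_toy) == y_toy)

# score() on a classifier computes the same mean accuracy.
score_accuracy = toy_model.score(X_toy, y_toy)

print(manual_accuracy, score_accuracy)
```
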


Now load the test data from `"../datasets/adult-census-numeric-test.csv"` and
compute the accuracy on the test data.

In [61]:
adult_census_test = pd.read_csv("../datasets/adult-census-numeric-test.csv")

target_name = "class"
target_test = adult_census_test[target_name]
data_test = adult_census_test.drop(columns=[target_name])

accuracy = model.score(data_test, target_test)

print("Accuracy: ", accuracy)

Accuracy:  0.8177909714402702


## 📝 Exercise M1.03

The goal of this exercise is to compare the performance of our classifier in
the previous notebook (roughly 81% accuracy with `LogisticRegression`) to some
simple baseline classifiers. The simplest baseline classifier is one that
always predicts the same class, irrespective of the input data.

- What would be the score of a model that always predicts `' >50K'`?
- What would be the score of a model that always predicts `' <=50K'`?
- Is 81% or 82% accuracy a good score for this problem?
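
One way to reason about the first two questions: a classifier that always predicts a given class is correct exactly on the samples that belong to that class, so its accuracy equals that class's share of the data. The sketch below shows this on a tiny made-up target; `toy_target` is hypothetical, and on the real data the same call would presumably be applied to `adult_census["class"]`.

```python
import pandas as pd

# Hypothetical toy target, not the census data: 3 low earners, 1 high earner.
toy_target = pd.Series([" <=50K", " <=50K", " <=50K", " >50K"])

# Fraction of samples per class. Always predicting a class yields an
# accuracy equal to that class's fraction.
class_fractions = toy_target.value_counts(normalize=True)
print(class_fractions)
```
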

Use a `DummyClassifier` and do a train-test split to evaluate its accuracy on
the test set. This
[link](https://scikit-learn.org/stable/modules/model_evaluation.html#dummy-estimators)
shows a few examples of how to evaluate the generalization performance of
these baseline models.

In [63]:
import pandas as pd

adult_census = pd.read_csv("../datasets/adult-census.csv")

We first split our dataset to have the target separated from the data used to
train our predictive model.

In [64]:
target_name = "class"
target = adult_census[target_name]
data = adult_census.drop(columns=target_name)

We start by selecting only the numerical columns as seen in the previous
notebook.

In [65]:
numerical_columns = ["age", "capital-gain", "capital-loss", "hours-per-week"]

data_numeric = data[numerical_columns]

Split the data and target into a train and test set.

In [None]:
from sklearn.model_selection import train_test_split

data_train, data_test, target_train, target_test = train_test_split(
    data_numeric, target, random_state=42, test_size=0.25
)

Use a `DummyClassifier` such that the resulting classifier always predicts the
class `' >50K'`. What is the accuracy score on the test set? Repeat the
experiment by always predicting the class `' <=50K'`.

Hint: you can set the `strategy` parameter of the `DummyClassifier` to achieve
the desired behavior.

In [None]:
from sklearn.dummy import DummyClassifier

model = DummyClassifier(strategy="constant", constant=" >50K")
model.fit(data_train, target_train)
score = model.score(data_test, target_test)

print("Score for DummyClassifier >50K:", score)
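
The second half of the exercise, always predicting `' <=50K'`, follows the same pattern. The sketch below runs it on made-up data, because the census CSV is not assumed to be available; `X_toy` and `y_toy` are hypothetical and the 150/50 class split is invented for illustration. It also shows `strategy="most_frequent"`, which picks the majority class of the training set without it being named explicitly.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split

# Made-up data standing in for the census columns (an assumption).
rng = np.random.RandomState(0)
X_toy = rng.normal(size=(200, 4))
y_toy = np.array([" <=50K"] * 150 + [" >50K"] * 50)

X_train, X_test, y_train, y_test = train_test_split(
    X_toy, y_toy, random_state=42, test_size=0.25
)

# Always predict the low-income class.
low_model = DummyClassifier(strategy="constant", constant=" <=50K")
low_model.fit(X_train, y_train)
low_score = low_model.score(X_test, y_test)

# Always predict whichever class is most frequent in the training set.
frequent_model = DummyClassifier(strategy="most_frequent")
frequent_model.fit(X_train, y_train)
frequent_score = frequent_model.score(X_test, y_test)

print("Score for DummyClassifier <=50K:", low_score)
print("Score for DummyClassifier most_frequent:", frequent_score)
```

Here both scores coincide because `' <=50K'` is the majority class of the toy training set; either one can serve as the baseline that a real model has to beat.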