# Working with numerical data and intuitions on linear models

In [None]:
import pandas as pd
adult_census = pd.read_csv("../datasets/adult-census.csv")
# Drop "education-num": it is a numeric encoding of the "education" column and carries the same information
adult_census = adult_census.drop(columns="education-num")
adult_census.head()

In [2]:
# Separate data and target
data, target = adult_census.drop(columns="class"), adult_census["class"]

In [3]:
data.shape

(48842, 12)

In [4]:
target.shape

(48842,)

In [7]:
# Inspect the dtype of each column to find out which features are numerical
# (int64 columns are numerical, object columns hold strings)
data.dtypes

age                int64
workclass         object
education         object
marital-status    object
occupation        object
relationship      object
race              object
sex               object
capital-gain       int64
capital-loss       int64
hours-per-week     int64
native-country    object
dtype: object

In [8]:
numerical_columns = ["age", "capital-gain", "capital-loss", "hours-per-week"]

In [9]:
data_numeric = data[numerical_columns]

In [10]:
data_numeric.head()

   age  capital-gain  capital-loss  hours-per-week
0   25             0             0              40
1   38             0             0              50
2   28             0             0              40
3   44          7688             0              40
4   18             0             0              30
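
Instead of typing the numerical column names by hand, they could also be selected from the dtypes. The sketch below uses a small made-up frame (`toy`, hypothetical values) in place of `adult_census`, and assumes that "numerical" simply means "has a numeric dtype":

```python
import pandas as pd

# Toy frame standing in for the census data (values are made up for illustration)
toy = pd.DataFrame(
    {
        "age": [25, 38, 28],
        "workclass": [" Private", " Private", " Local-gov"],
        "hours-per-week": [40, 50, 40],
    }
)

# Keep only the columns whose dtype is numeric (int64, float64, ...)
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
print(numeric_cols)  # ['age', 'hours-per-week']

toy_numeric = toy[numeric_cols]
```

A numeric dtype does not guarantee that a column is meaningfully numerical (integer codes for categories are also stored as int64), so it is still worth inspecting the selected columns.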


When you start doing machine learning, you will typically spend most of your time (a figure of around 90% is often quoted) preparing data before you can train a model. The steps above are therefore something you will be doing often.

### Intermezzo: Deal with missing values
- Many ML models cannot deal with missing values (NaN). You need to replace them with something numerical, such as -1 or, usually better, the mean() value of the entire column. This process is called **"imputation"** (Google for "how to impute using pandas")
- Categorical data: a placeholder such as '?' is fine; it simply acts as an extra category named "?" (= unknown)
- Optionally introduce a new indicator feature, for instance 'age_missing', that is 1 (True) where the original value was missing and 0 (False) otherwise

We don't have any missing data in this data set.
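
Since this data set has nothing to impute, here is a minimal sketch of the ideas above on a made-up column. The frame `toy` and its values are hypothetical; only the `age_missing` name comes from the notes:

```python
import numpy as np
import pandas as pd

# Toy column with two missing values (made up for illustration)
toy = pd.DataFrame({"age": [25.0, np.nan, 35.0, np.nan]})

# Count missing values per column before doing anything else
print(toy.isna().sum())

# Record which rows were missing *before* filling them in
toy["age_missing"] = toy["age"].isna().astype(int)

# Replace NaN with the mean of the observed values (25 and 35 -> 30)
toy["age"] = toy["age"].fillna(toy["age"].mean())
print(toy)
```

In a real workflow the mean should be computed on the training set only and then reused for the test set, so that no information from the test data leaks into training.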

## Train-test split the dataset
We split the data so that we can evaluate the model on data it has never seen during training. This shows whether the model generalizes or has merely memorized the training set.

In [13]:
from sklearn.model_selection import train_test_split

In [15]:
# train_test_split() shuffles the data randomly before splitting
# it returns 4 data sets: train/test data and train/test target
data_train, data_test, target_train, target_test = train_test_split(
    data_numeric, 
    target, 
    test_size=0.25, 
    random_state=42) 


In [19]:
# To verify that the data is split in 75% and 25%
data_train.shape, data_test.shape

((36631, 4), (12211, 4))
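
The test set has 12211 rows rather than 12210 because 25% of 48842 is 12210.5 and the test size is rounded up. A small sketch on made-up arrays (`X` and `y` are hypothetical, not the census data) shows the same rounding and checks that data and target rows stay paired after shuffling:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 10 toy samples; row i is [2*i, 2*i + 1] and its target is i
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# 25% of 10 is 2.5, which is rounded up to 3 test samples
print(X_train.shape, X_test.shape)  # (7, 2) (3, 2)

# Shuffling keeps each data row aligned with its own target
print((X_train[:, 0] // 2 == y_train).all())  # True
```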

## Train a linear model


In [20]:
# need a logistic regression model, because the target is categorical!
from sklearn.linear_model import LogisticRegression

In [21]:
model = LogisticRegression()

In [22]:
model.fit(data_train, target_train)

In [23]:
accuracy = model.score(data_test, target_test)

In [24]:
accuracy

0.8070592089099992
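
For a classifier, `score` returns the accuracy: the fraction of samples whose predicted class equals the true class. The sketch below checks this by hand on a tiny made-up data set (`X`, `y` and the labels "low"/"high" are hypothetical):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy 1-D data with string labels (made up for illustration)
X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array(["low", "low", "low", "high", "high", "high"])

clf = LogisticRegression().fit(X, y)

# Accuracy computed by hand: share of predictions that match the labels
manual_accuracy = (clf.predict(X) == y).mean()

# Accuracy as reported by the estimator
reported_accuracy = clf.score(X, y)

print(manual_accuracy, reported_accuracy)
```

Here the model is scored on its own training data only to keep the example short; in the notebook the score is computed on the held-out test set.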

Is this a good result? Not really: with k-nearest-neighbors yesterday we reached 81-82%.
To put the number in context, we will now compare it with simple baselines.

### Exercise: Compare with simple baselines [Sven]
#### 1. Compare with simple baseline
The goal of this exercise is to compare the performance of our classifier in the previous notebook (roughly 81% accuracy with LogisticRegression) to some simple baseline classifiers. The simplest baseline classifier is one that always predicts the same class, irrespective of the input data.

What would be the score of a model that always predicts ' >50K'?

What would be the score of a model that always predicts ' <=50K'?

Is 81% or 82% accuracy a good score for this problem?

Use a DummyClassifier such that the resulting classifier will always predict the class ' >50K'. What is the accuracy score on the test set? Repeat the experiment by always predicting the class ' <=50K'.

Hint: you can set the strategy parameter of the DummyClassifier to achieve the desired behavior.

You can import DummyClassifier like this:
```python
from sklearn.dummy import DummyClassifier
```




In [25]:
from sklearn.dummy import DummyClassifier

In [3]:
# To read the documentation of a class or function, type 'ClassName?'
# Example: DummyClassifier?

In [27]:
# Use a DummyClassifier such that the resulting classifier will always predict the class ' >50K'. 
# What is the accuracy score on the test set? 

In [35]:
dummy_clf = DummyClassifier(strategy="constant", constant=' >50K')
dummy_clf.fit(data_train, target_train)
dummy_clf.score(data_test, target_test)

0.23396937187781508

In [36]:
# Repeat the experiment by always predicting the class ' <=50K'.
dummy_clf = DummyClassifier(strategy="constant", constant=' <=50K')
dummy_clf.fit(data_train, target_train)
dummy_clf.score(data_test, target_test)

0.7660306281221849
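
The two scores add up to 1 because a constant classifier is correct exactly on the samples that belong to the class it predicts, so its accuracy equals the share of that class. A minimal sketch on a made-up target (7 of 10 samples are ' <=50K'; the data is hypothetical):

```python
import numpy as np
import pandas as pd
from sklearn.dummy import DummyClassifier

# Toy target: 70% ' <=50K', 30% ' >50K' (made up for illustration)
y = pd.Series([" <=50K"] * 7 + [" >50K"] * 3)
X = np.zeros((len(y), 1))  # DummyClassifier ignores the features

# Share of each class in the target
freq = y.value_counts(normalize=True)

scores = {}
for label in [" >50K", " <=50K"]:
    clf = DummyClassifier(strategy="constant", constant=label).fit(X, y)
    scores[label] = clf.score(X, y)
    print(repr(label), scores[label], freq[label])
```

On the census data the same quantity can be read from `target_test.value_counts(normalize=True)` without fitting anything.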

#### 2. (optional) Try out other baselines
What other baselines can you think of? How well do they perform?

In [37]:
dummy_clf = DummyClassifier(strategy="most_frequent")
dummy_clf.fit(data_train, target_train)
dummy_clf.score(data_test, target_test)

0.7660306281221849

In [47]:
dummy_clf = DummyClassifier(strategy="stratified", random_state=1234)
dummy_clf.fit(data_train, target_train)
dummy_clf.score(data_test, target_test)

0.6404061911391369

In [48]:
# With 2 classes, this resembles "flipping a coin"
dummy_clf = DummyClassifier(strategy="uniform", random_state=1234)
dummy_clf.fit(data_train, target_train)
dummy_clf.score(data_test, target_test)

0.4968471050691999
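
The scores of the two random strategies can be estimated from the class share alone. The sketch below is a back-of-the-envelope calculation; it assumes the training and test sets have the same share `p` of ' <=50K', taken here from the `most_frequent` score above:

```python
# Share of ' <=50K' in the test set (the most_frequent score above)
p = 0.7660306281221849

# "stratified": predicts ' <=50K' with probability p, ' >50K' with 1 - p.
# It is right when the random guess and the true class coincide.
expected_stratified = p**2 + (1 - p) ** 2

# "uniform": predicts each class with probability 0.5, whatever p is
expected_uniform = 0.5 * p + 0.5 * (1 - p)

print(expected_stratified, expected_uniform)
```

This gives roughly 0.64 and 0.5, close to the measured 0.640 and 0.497; the small differences come from the randomness of the predictions.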

Other, real-life baselines could be:
- How well would an average human score? (e.g. recognizing images of cats and dogs: human accuracy is close to 100%, so the bar is very high)
- How well would an expert human score? (e.g. a doctor)

Depending on the use case, the human baseline can be very high (cats vs. dogs) or considerably lower. This sets different requirements for the prediction model that you're developing.