# Exercise 8

In this exercise, a binary classification model is built to predict whether a person's annual income exceeds a given threshold (the `income` labels are `<=50K` and `>50K`). The data comes from a US Census dataset. It contains numeric and categorical features as well as missing values, so several preprocessing steps are required.

In [54]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split

from sklearn.base import TransformerMixin
from sklearn.base import BaseEstimator
from pandas.api.types import is_numeric_dtype

from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder

from sklearn.linear_model import LogisticRegression

from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score
from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.metrics import precision_recall_curve


#### Loading the dataset

In [2]:
colnames = ['age','workclass','fnlwgt','education','education_num','marital-status','occupation','relationship','race','sex','capital-gain','capital-loss','hours-per-week','native-country','income']
df = pd.read_csv('adult.csv',sep=',', header=None, names=colnames, na_values=' ?')

In [3]:
df.shape

(32561, 15)

In [25]:
df.dtypes

age                int64
workclass         object
education         object
education_num      int64
marital-status    object
occupation        object
relationship      object
race              object
sex               object
capital-gain       int64
capital-loss       int64
hours-per-week     int64
native-country    object
income            object
dtype: object

In [4]:
df.head()

,age,workclass,fnlwgt,education,education_num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K


Remove the column `fnlwgt`. It is a census sampling weight and carries no information about the individual person.

In [5]:
df = df.drop('fnlwgt', axis=1)

In [6]:
df

,age,workclass,education,education_num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income
0,39,State-gov,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
32556,27,Private,Assoc-acdm,12,Married-civ-spouse,Tech-support,Wife,White,Female,0,0,38,United-States,<=50K
32557,40,Private,HS-grad,9,Married-civ-spouse,Machine-op-inspct,Husband,White,Male,0,0,40,United-States,>50K
32558,58,Private,HS-grad,9,Widowed,Adm-clerical,Unmarried,White,Female,0,0,40,United-States,<=50K
32559,22,Private,HS-grad,9,Never-married,Adm-clerical,Own-child,White,Male,0,0,20,United-States,<=50K


In [7]:
df['workclass'].value_counts()

 Private             22696
 Self-emp-not-inc     2541
 Local-gov            2093
 State-gov            1298
 Self-emp-inc         1116
 Federal-gov           960
 Without-pay            14
 Never-worked            7
Name: workclass, dtype: int64

#### a) Label encoding of the target variable

Split off the target variable `income` and apply label encoding to it. Use the class `LabelEncoder` from the module `sklearn.preprocessing`.

In [8]:
y = df['income']
X = df.drop('income', axis=1)

In [9]:
le = LabelEncoder()
le.fit([" >50K"," <=50K"])
y_le = le.transform(y)
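To see which integer each income class receives, here is a minimal sketch on a few hand-written labels (the list `labels` and the name `le_demo` are made up for illustration; the leading space mirrors how the strings appear in `adult.csv`). `LabelEncoder` stores its classes in sorted order, so `' <=50K'` becomes 0 and `' >50K'` becomes 1.

```python
from sklearn.preprocessing import LabelEncoder

# Hand-written example labels; the leading space matches the raw CSV values
labels = [" <=50K", " >50K", " <=50K"]

le_demo = LabelEncoder()
encoded = le_demo.fit_transform(labels)

# classes_ is sorted, so the position of a class is its integer code
mapping = {cls: int(code) for code, cls in enumerate(le_demo.classes_)}
print(mapping)
print(encoded.tolist())

# inverse_transform turns the integer codes back into the original strings
print(le_demo.inverse_transform([1, 0]).tolist())
```

With this encoding, the positive class (label 1) is the higher income group, which matters later when reading precision and recall.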

#### b) Train-test split

Perform a train-test split with `test_size=0.2`.

In [10]:
X_train, X_test, y_train, y_test = train_test_split(X, y_le, test_size=0.2, random_state=43)

#### c) Writing a custom transformer (imputer)

Implement your own imputer called `MyImputer` for categorical features, which replaces missing values with the most frequent value of the respective column. Derive it from the classes `sklearn.base.TransformerMixin` and `sklearn.base.BaseEstimator`. It should return a DataFrame, whereas the `SimpleImputer` known from the lecture produces a NumPy array. You may assume that `MyImputer` is only applied to DataFrames.

Make sure that `fit()` accepts both `X` and `y` (here `X_train` and `y_train`), because a pipeline passes both to every step when it is fitted. The solution below names the class `CategoricImputer`.

In [40]:
# Imputer for the categorical (non-numeric) columns of a DataFrame
class CategoricImputer(TransformerMixin, BaseEstimator):
    def fit(self, X, y=None):
        # Learn the most frequent value of every non-numeric column.
        # The dict is reset on each call so refitting keeps no stale columns.
        self.impute_values_ = {}
        for col in X.columns:
            if not is_numeric_dtype(X[col].dtype):
                self.impute_values_[col] = X[col].value_counts().idxmax()
        return self

    def transform(self, X):
        X_transformed = X.copy()

        for col, value in self.impute_values_.items():
            if col in X_transformed.columns:
                # Fill missing cells with the most frequent value seen during fit
                X_transformed[col] = X_transformed[col].fillna(value)
        # Return a DataFrame without missing values in the categorical columns
        return X_transformed
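Why `fit()` needs the `y` argument becomes visible once the imputer sits inside a pipeline. The following is a minimal sketch on toy data, not the notebook's class: `ModeImputer`, `X_toy` and `y_toy` are hypothetical names, and it uses `Series.mode()` instead of `value_counts().idxmax()` to find the most frequent value.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder


class ModeImputer(TransformerMixin, BaseEstimator):
    """Hypothetical stand-in for the imputer above (illustration only)."""

    def fit(self, X, y=None):
        # y is accepted but ignored; Pipeline.fit passes it to every step
        self.modes_ = {col: X[col].mode().iloc[0] for col in X.columns}
        return self

    def transform(self, X):
        # fillna with a dict fills each column with its own value
        return X.fillna(self.modes_)


# Toy data with one missing category
X_toy = pd.DataFrame(
    {"workclass": ["Private", "Private", np.nan, "State-gov", "Private", "State-gov"]}
)
y_toy = np.array([0, 0, 1, 1, 0, 1])

# Standalone use: the missing cell is replaced by the most frequent value
imputed = ModeImputer().fit(X_toy).transform(X_toy)
print(imputed["workclass"].tolist())

# Inside a pipeline, fit(X, y) is called on the imputer as well
pipe = Pipeline(
    [("imp", ModeImputer()), ("ohe", OneHotEncoder()), ("clf", LogisticRegression())]
)
pipe.fit(X_toy, y_toy)
print(pipe.predict(X_toy))
```

If `fit` were defined as `fit(self, X)` only, the `pipe.fit(X_toy, y_toy)` call would fail with a `TypeError`, because the pipeline hands the target to each transformer.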

#### d) ColumnTransformer and pipelines

With a <a href="https://scikit-learn.org/stable/modules/generated/sklearn.compose.ColumnTransformer.html">ColumnTransformer</a>, specific transformers can be applied to subsets of the dataset's columns (e.g. the numeric and the categorical ones).

Create a ColumnTransformer that ...
* ...transforms the categorical columns first with the `MyImputer` from subtask c) and then with a <a href="https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html">OneHotEncoder</a>. To do so, build a pipeline from both and pass it to the ColumnTransformer, which should then apply it to the categorical columns.
* ...transforms the numeric columns first with the <a href="https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html">SimpleImputer</a> available in scikit-learn and then with the <a href="https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html">StandardScaler</a>. Combine these two steps in a pipeline as well and pass it to the ColumnTransformer.

In [41]:
pipe_preprocessing_num = Pipeline([('imp_num', SimpleImputer(strategy='mean')), ('scaler', StandardScaler())])
pipe_preprocessing_cat = Pipeline([('imp_cat', CategoricImputer()), ('ohe', OneHotEncoder())])

categorical_features = ['workclass', 'education', 'marital-status', 'occupation',
                        'relationship', 'race', 'sex', 'native-country']
numeric_features = ['age', 'education_num', 'capital-gain', 'capital-loss',
                    'hours-per-week']


ct = ColumnTransformer(
    [
        ("preprocessor_cat", pipe_preprocessing_cat, categorical_features),
        ("preprocessor_num", pipe_preprocessing_num, numeric_features)
    ]
)
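One thing to keep in mind with `OneHotEncoder()` at its defaults: a category that appears only in the test data raises an error at transform time. The sketch below shows this on two tiny hand-made frames (`train` and `test` are invented for illustration) and one possible remedy, `handle_unknown="ignore"`. Whether this is needed for the split used here depends on which rare categories end up in the test set.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hand-made frames; "Jamaica" occurs only in the test data
train = pd.DataFrame({"native-country": ["United-States", "Cuba", "United-States"]})
test = pd.DataFrame({"native-country": ["Cuba", "Jamaica"]})

# Default behaviour: unknown categories raise a ValueError in transform
strict = OneHotEncoder().fit(train)
try:
    strict.transform(test)
    strict_failed = False
except ValueError:
    strict_failed = True
print("default encoder failed:", strict_failed)

# With handle_unknown="ignore", an unseen category becomes an all-zero row
lenient = OneHotEncoder(handle_unknown="ignore").fit(train)
encoded = lenient.transform(test).toarray()
print(encoded)
```

In the notebook's setup this would amount to writing `('ohe', OneHotEncoder(handle_unknown='ignore'))` in the categorical pipeline.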

The ColumnTransformer returns a sparse matrix here, because the one-hot encoded columns consist mostly of zeros. That suits machine learning well: estimators such as logistic regression or gradient boosting can be applied to it directly. It is hard for humans to read, however.

In [42]:
matrix_ct = ct.fit_transform(df)
matrix_ct

<32561x104 sparse matrix of type '<class 'numpy.float64'>'
	with 423293 stored elements in Compressed Sparse Row format>

#### e) Pipeline of transformer and estimator

In [43]:

preprocessor = ColumnTransformer(
    [("preprocessor_cat", pipe_preprocessing_cat, categorical_features),
     ("preprocessor_num", pipe_preprocessing_num, numeric_features)]
)

Now create a pipeline consisting of the `ColumnTransformer` and a `LogisticRegression` estimator. Fit the pipeline on the training set and compute the accuracy on the training set. Then apply the model/pipeline to the test set and compute the accuracy on the test set.

In [44]:
type(y_train)

numpy.ndarray

In [45]:
pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                           ('classifier', LogisticRegression())])
pipeline.fit(X_train,y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
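The message above says that the solver stopped at its iteration limit before converging. One way to address it, as the warning itself suggests, is a higher `max_iter`. The sketch below shows the parameter on synthetic data (`X_syn` and `y_syn` come from `make_classification` and are not the census data; the value 1000 is an arbitrary choice).

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data, not the census dataset
X_syn, y_syn = make_classification(n_samples=500, n_features=20, random_state=0)

# The default limit is 100 iterations; 1000 leaves more room to converge
clf = LogisticRegression(max_iter=1000)
clf.fit(X_syn, y_syn)

# n_iter_ reports how many iterations the solver actually used
print("iterations used:", clf.n_iter_[0])
```

For the pipeline in this notebook, the same setting could be applied through the step name, for example `pipeline.set_params(classifier__max_iter=1000)` before calling `fit`. The accuracy may then differ slightly from the value reported below.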


In [50]:
accuracy = pipeline.score(X_test, y_test)
print("Accuracy: ", accuracy)

Accuracy:  0.8536772608628896
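The task also asks for the accuracy on the training set, which is useful to compare against the test accuracy as a rough check for overfitting. Here is a minimal sketch of that comparison on synthetic data (all names and the data are placeholders, not the notebook's objects).

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data, split the same way as in subtask b)
X_syn, y_syn = make_classification(n_samples=1000, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=43
)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# Accuracy on the data the model was fitted on and on held-out data
acc_train = accuracy_score(y_tr, model.predict(X_tr))
acc_test = accuracy_score(y_te, model.predict(X_te))

print("Train accuracy:", acc_train)
print("Test accuracy: ", acc_test)
```

For classifiers, `score(X, y)` returns the same mean accuracy, so in the notebook `pipeline.score(X_train, y_train)` would give the training value.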


In [57]:
y_pred_proba = pipeline.predict_proba(X_test)[:, 1]
y_pred_proba

array([0.00922723, 0.04112971, 0.04663329, ..., 0.00335263, 0.17687902,
       0.19299808])

#### Done! From here on, we can keep working on the logistic regression model

We've gone through all the steps needed to preprocess the data, and the result is a single pipeline that goes from the raw features to a fitted model. To improve the quality of the model, we can now look for the best decision threshold on the predicted probabilities, plot ROC and precision-recall curves, and compute metrics such as ROC-AUC, accuracy, precision and recall.
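How the threshold changes these metrics can be seen on a small example. The labels and probabilities below are invented for illustration, and `metrics_at` is a hypothetical helper.

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    precision_score,
    recall_score,
)

# Invented true labels and predicted probabilities for the positive class
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1])
proba = np.array([0.1, 0.3, 0.6, 0.2, 0.4, 0.7, 0.8, 0.9])


def metrics_at(threshold):
    """Turn probabilities into 0/1 predictions and score them."""
    y_hat = (proba >= threshold).astype(int)
    return {
        "confusion": confusion_matrix(y_true, y_hat).tolist(),  # [[TN, FP], [FN, TP]]
        "accuracy": accuracy_score(y_true, y_hat),
        "precision": precision_score(y_true, y_hat),
        "recall": recall_score(y_true, y_hat),
    }


default = metrics_at(0.5)
lower = metrics_at(0.35)
print(default)
print(lower)
```

Lowering the threshold from 0.5 to 0.35 catches the one positive that was missed, so recall rises from 0.75 to 1.0, while precision moves from 0.75 to 0.8 in this particular example.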

In [58]:
# Score a range of decision thresholds on the predicted probabilities.
# Note: roc_auc_score on hard 0/1 predictions is not the usual ROC-AUC;
# it equals the balanced accuracy (mean of TPR and TNR) at that threshold.
thresholds = np.linspace(0, 1, num=100)  # 100 evenly spaced thresholds
roc_scores = []
for threshold in thresholds:
    y_pred = (y_pred_proba >= threshold).astype(int)
    roc_score = roc_auc_score(y_test, y_pred)
    roc_scores.append(roc_score)

# Find the threshold with the highest score

best_threshold = thresholds[np.argmax(roc_scores)]
best_roc_score = np.max(roc_scores)

print("Best Threshold:", best_threshold)
print("Highest ROC Score:", best_roc_score)

Best Threshold: 0.21212121212121213
Highest ROC Score: 0.8254542972837708
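The loop above passes thresholded 0/1 predictions to `roc_auc_score`. The threshold-independent ROC-AUC is obtained by passing the probabilities themselves, and `roc_curve` returns the candidate thresholds directly. The following sketch uses invented labels and probabilities and picks the threshold that maximizes TPR minus FPR (Youden's J), which is one common criterion among several.

```python
import numpy as np
from sklearn.metrics import balanced_accuracy_score, roc_auc_score, roc_curve

# Invented true labels and predicted probabilities for the positive class
y_true = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1])
proba = np.array([0.1, 0.3, 0.6, 0.2, 0.15, 0.55, 0.7, 0.8, 0.9])

# Threshold-independent ROC-AUC: computed from the probabilities
auc = roc_auc_score(y_true, proba)

# On hard predictions, roc_auc_score reduces to balanced accuracy
y_hat = (proba >= 0.5).astype(int)
hard = roc_auc_score(y_true, y_hat)
balanced = balanced_accuracy_score(y_true, y_hat)

# roc_curve yields one (FPR, TPR) point per candidate threshold
fpr, tpr, thr = roc_curve(y_true, proba)
best = thr[np.argmax(tpr - fpr)]  # Youden's J statistic

print("ROC-AUC from probabilities:", auc)
print("Score on hard predictions: ", hard, "balanced accuracy:", balanced)
print("Threshold maximizing TPR - FPR:", best)
```

Since a threshold tuned on the test set tends to look better than it would on new data, it may be safer to choose it on a separate validation split or via cross-validation on the training data.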
