## Machine Learning

### What Is Machine Learning?

In machine learning, we want to detect relationships within a dataset.

Once these relationships are learned, we can use them for various tasks, e.g. classification, regression, image recognition, or generating new content (text, images, audio, ...).

---

Example:

Weather forecasting

Use ML to predict the temperature from several parameters.

Parameters: date, time, cloud cover, precipitation

Output: temperature

Guess: 21.06.2024, 13:30, 10% cloud cover, 0% precipitation -> temperature? 25°, 27°; reality: 30°

With machine learning, historical data (a dataset) can now be processed to produce a model.

Model: a program that receives the parameters as inputs and returns an output.

This model can then be executed to give us a prediction.

With 4 parameters, such a program can still be written by hand; with 20 parameters, that is no longer practical.
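To illustrate what "a program that receives the parameters as inputs and returns an output" can look like, here is a hand-written sketch. The function name `predict_temperature` and every coefficient in it are invented for illustration and are not derived from any data; a trained model would learn such values from historical measurements instead.

```python
import math

def predict_temperature(day_of_year: int, hour: float,
                        cloud_cover: float, precipitation: float) -> float:
    """Hand-written 'model'. All coefficients are made-up guesses, not learned values."""
    seasonal = 10.0 * math.cos(2 * math.pi * (day_of_year - 200) / 365)  # warmer in summer
    daily = 5.0 * math.cos(2 * math.pi * (hour - 15) / 24)               # warmest mid-afternoon
    return 12.0 + seasonal + daily - 6.0 * cloud_cover - 3.0 * precipitation

# 21.06.2024 is day 173 of the year; 13:30 is hour 13.5
estimate = predict_temperature(day_of_year=173, hour=13.5,
                               cloud_cover=0.10, precipitation=0.0)
cloudy = predict_temperature(day_of_year=173, hour=13.5,
                             cloud_cover=0.90, precipitation=0.0)
print(f"Estimated temperature: {estimate:.1f} °C")
```

Even with only four inputs, choosing these numbers by hand is guesswork, which is the motivation for letting an algorithm fit them.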

### Dataset

The Income.csv dataset contains personal data and is meant to indicate which people earn more or less than $50,000 per year.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [2]:
data = pd.read_csv("Data/Income.csv")

In [3]:
data

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,gender,capital-gain,capital-loss,hours-per-week,native-country,class
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
32556,27,Private,257302,Assoc-acdm,12,Married-civ-spouse,Tech-support,Wife,White,Female,0,0,38,United-States,<=50K
32557,40,Private,154374,HS-grad,9,Married-civ-spouse,Machine-op-inspct,Husband,White,Male,0,0,40,United-States,>50K
32558,58,Private,151910,HS-grad,9,Widowed,Adm-clerical,Unmarried,White,Female,0,0,40,United-States,<=50K
32559,22,Private,201490,HS-grad,9,Never-married,Adm-clerical,Own-child,White,Male,0,0,20,United-States,<=50K


### Problems with the Dataset

1. All data must be numeric -> encoding
2. The data should be scaled -> bring all features onto a comparable value range
3. Balance out imbalances -> for the target column, the training data should contain equally many rows of both classes

### Making the Data Numeric

To convert text to numbers, we can use the LabelEncoder.

In [4]:
from sklearn.preprocessing import LabelEncoder

In [6]:
def encodeColumn(colName):
    enc = LabelEncoder()  # Create a LabelEncoder

    spalte = data[colName]  # Take the column to encode
    encodedColumn = enc.fit_transform(spalte)  # Perform the encoding
    data[colName] = encodedColumn  # Write the encoded column back into the DataFrame
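
As a small illustration of what the LabelEncoder does with a text column, the following sketch encodes a handful of values. The sample list is hypothetical and only loosely modeled on the `workclass` column; the point is that each distinct string is assigned an integer according to its alphabetical position.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample values, loosely modeled on the "workclass" column
values = ["Private", "State-gov", "Private", "Self-emp-not-inc"]

enc = LabelEncoder()
codes = enc.fit_transform(values)          # learn the distinct values and encode them
restored = enc.inverse_transform(codes)    # map the integers back to the original strings

print(list(enc.classes_))   # distinct values in sorted order; the position is the code
print(codes.tolist())
print(restored.tolist())
```

Note that the resulting integers imply an order that the categories do not really have, which some models may pick up on.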

In [8]:
encodeColumn("workclass")

In [9]:
data

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,gender,capital-gain,capital-loss,hours-per-week,native-country,class
0,39,7,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,6,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,4,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,4,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,4,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
32556,27,4,257302,Assoc-acdm,12,Married-civ-spouse,Tech-support,Wife,White,Female,0,0,38,United-States,<=50K
32557,40,4,154374,HS-grad,9,Married-civ-spouse,Machine-op-inspct,Husband,White,Male,0,0,40,United-States,>50K
32558,58,4,151910,HS-grad,9,Widowed,Adm-clerical,Unmarried,White,Female,0,0,40,United-States,<=50K
32559,22,4,201490,HS-grad,9,Never-married,Adm-clerical,Own-child,White,Male,0,0,20,United-States,<=50K


In [10]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32561 entries, 0 to 32560
Data columns (total 15 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   age             32561 non-null  int64 
 1   workclass       32561 non-null  int32 
 2   fnlwgt          32561 non-null  int64 
 3   education       32561 non-null  object
 4   education-num   32561 non-null  int64 
 5   marital-status  32561 non-null  object
 6   occupation      32561 non-null  object
 7   relationship    32561 non-null  object
 8   race            32561 non-null  object
 9   gender          32561 non-null  object
 10  capital-gain    32561 non-null  int64 
 11  capital-loss    32561 non-null  int64 
 12  hours-per-week  32561 non-null  int64 
 13  native-country  32561 non-null  object
 14  class           32561 non-null  object
dtypes: int32(1), int64(6), object(8)
memory usage: 3.6+ MB


In [12]:
data.select_dtypes(include=object)  # Select only the object columns

Unnamed: 0,education,marital-status,occupation,relationship,race,gender,native-country,class
0,Bachelors,Never-married,Adm-clerical,Not-in-family,White,Male,United-States,<=50K
1,Bachelors,Married-civ-spouse,Exec-managerial,Husband,White,Male,United-States,<=50K
2,HS-grad,Divorced,Handlers-cleaners,Not-in-family,White,Male,United-States,<=50K
3,11th,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,United-States,<=50K
4,Bachelors,Married-civ-spouse,Prof-specialty,Wife,Black,Female,Cuba,<=50K
...,...,...,...,...,...,...,...,...
32556,Assoc-acdm,Married-civ-spouse,Tech-support,Wife,White,Female,United-States,<=50K
32557,HS-grad,Married-civ-spouse,Machine-op-inspct,Husband,White,Male,United-States,>50K
32558,HS-grad,Widowed,Adm-clerical,Unmarried,White,Female,United-States,<=50K
32559,HS-grad,Never-married,Adm-clerical,Own-child,White,Male,United-States,<=50K


In [14]:
for spalte in data.select_dtypes(include=object):
    if spalte != "class":
        encodeColumn(spalte)

In [15]:
data

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,gender,capital-gain,capital-loss,hours-per-week,native-country,class
0,39,7,77516,9,13,4,1,1,4,1,2174,0,40,39,<=50K
1,50,6,83311,9,13,2,4,0,4,1,0,0,13,39,<=50K
2,38,4,215646,11,9,0,6,1,4,1,0,0,40,39,<=50K
3,53,4,234721,1,7,2,6,0,2,1,0,0,40,39,<=50K
4,28,4,338409,9,13,2,10,5,2,0,0,0,40,5,<=50K
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
32556,27,4,257302,7,12,2,13,5,4,0,0,0,38,39,<=50K
32557,40,4,154374,11,9,2,7,0,4,1,0,0,40,39,>50K
32558,58,4,151910,11,9,6,1,4,4,0,0,0,40,39,<=50K
32559,22,4,201490,11,9,4,1,3,4,1,0,0,20,39,<=50K


In [20]:
data["class"] == "<=50K"  # Find all rows with <=50K

(data["class"] == "<=50K").astype(int)  # Convert the boolean mask to integers (0 or 1)

data["class"] = (data["class"] == "<=50K").astype(int)  # <=50K becomes 1, >50K becomes 0

In [21]:
data

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,gender,capital-gain,capital-loss,hours-per-week,native-country,class
0,39,7,77516,9,13,4,1,1,4,1,2174,0,40,39,1
1,50,6,83311,9,13,2,4,0,4,1,0,0,13,39,1
2,38,4,215646,11,9,0,6,1,4,1,0,0,40,39,1
3,53,4,234721,1,7,2,6,0,2,1,0,0,40,39,1
4,28,4,338409,9,13,2,10,5,2,0,0,0,40,5,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
32556,27,4,257302,7,12,2,13,5,4,0,0,0,38,39,1
32557,40,4,154374,11,9,2,7,0,4,1,0,0,40,39,0
32558,58,4,151910,11,9,6,1,4,4,0,0,0,40,39,1
32559,22,4,201490,11,9,4,1,3,4,1,0,0,20,39,1


### Splitting the Data

The data must be split into a training set and a test set.

The training set is then processed further (scaled and balanced).

The test set is only scaled, not balanced, and is never used for training.

In [23]:
data.sample()  # One random row

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,gender,capital-gain,capital-loss,hours-per-week,native-country,class
22461,21,4,265434,15,10,4,10,3,4,0,0,0,30,39,1


In [26]:
random = data.sample(frac=1)  # All rows in random order

In [31]:
trainingAmount = int(len(random) * 0.8)
testAmount = len(random) - trainingAmount

In [32]:
trainingAmount

26048

In [33]:
testAmount

6513

In [65]:
# Two results: training set (first 80%) and test set (remaining 20%)
x = (random.iloc[:trainingAmount], random.iloc[trainingAmount:])


In [274]:
training = x[0]  # Take the training set from x
test = x[1]  # Take the test set from x

test_left = test[test.columns[:-1]]
test_right = test[test.columns[-1]]
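
The manual shuffle and split above can also be done in one step. The following is a sketch using scikit-learn's `train_test_split` as an alternative, not the approach used in this notebook; the toy DataFrame and the variable names are hypothetical stand-ins for the encoded income data.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy frame standing in for the encoded income data
toy = pd.DataFrame({
    "age": [25, 32, 47, 51, 38, 29, 60, 44, 36, 55],
    "hours-per-week": [40, 45, 50, 38, 40, 20, 30, 60, 40, 35],
    "class": [1, 1, 0, 0, 1, 1, 0, 0, 1, 0],
})

features = toy.drop(columns="class")
target = toy["class"]

# test_size=0.2 keeps 20% of the rows for testing;
# stratify=target preserves the class ratio in both parts
X_train, X_test, y_train, y_test = train_test_split(
    features, target, test_size=0.2, random_state=42, stratify=target
)

print(X_train.shape, X_test.shape)
```

Setting `random_state` makes the split reproducible, which the unseeded `data.sample(frac=1)` above is not.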

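### Scaling the Data

Scaling brings all features onto a comparable value range: the StandardScaler transforms each column to mean 0 and standard deviation 1, so that columns with large values (e.g. fnlwgt) do not dominate the others.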

In [275]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

# IMPORTANT: The last column (class) must not be scaled, since it has to stay 0 or 1
scaledTraining = scaler.fit_transform(training[training.columns[0:-1]])

# Append the last column again
training = np.hstack((scaledTraining, training[training.columns[-1]].values.reshape(-1, 1)))

# training[training.columns[-1]].values.reshape(-1, 1) -> Take the last column, take its underlying NumPy array, and convert it from a 1D array to a 2D array

In [276]:
training = pd.DataFrame(training)

In [277]:
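# Note: this fits a second scaler on the test set; reusing the training scaler via scaler.transform(...) is the more common practice
scaler = StandardScaler()
scaledTest = scaler.fit_transform(test[test.columns[0:-1]])
test = np.hstack((scaledTest, test[test.columns[-1]].values.reshape(-1, 1)))
test = pd.DataFrame(test)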
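
The following sketch shows what the StandardScaler computes, using hypothetical toy values. It also shows a common variant in which the scaler is fitted on the training data only and the test data is transformed with the same mean and standard deviation; this differs from the cells above, which fit a separate scaler on the test set.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy values: one feature column for training, one for testing
train_values = np.array([[10.0], [20.0], [30.0], [40.0]])
test_values = np.array([[25.0], [50.0]])

scaler = StandardScaler()
scaled_train = scaler.fit_transform(train_values)  # learns mean and std from the training data
scaled_test = scaler.transform(test_values)        # reuses the training mean and std

print(scaler.mean_, scaler.scale_)
print(scaled_train.ravel())
print(scaled_test.ravel())
```

After scaling, the training column has mean 0 and standard deviation 1; the test values are expressed relative to the training distribution.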

### Balancing the Classes

If the classes in the dataset are imbalanced, the learning process can develop a tendency towards one class or the other (bias).

In [278]:
len(training.groupby(14).get_group(0))

6260

In [279]:
len(training.groupby(14).get_group(1))

19788

In [280]:
from imblearn.over_sampling import RandomOverSampler  # Adds rows (random duplicates of the minority class) to balance the classes

In [281]:
overSampler = RandomOverSampler()

left, right = overSampler.fit_resample(training, training[training.columns[-1]])  # Two parameters: the training set, and the column by which to balance

training = left
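
To make the idea of random oversampling concrete, here is a minimal sketch in plain pandas. It is not how imblearn implements `RandomOverSampler`; it only mirrors the principle of drawing minority-class rows with replacement until both classes are equally large. The toy data and column names are hypothetical.

```python
import pandas as pd

# Hypothetical imbalanced toy data: six rows of class 1, two rows of class 0
toy = pd.DataFrame({
    "feature": [0.1, 0.4, 0.35, 0.8, 0.9, 0.5, 0.2, 0.7],
    "label":   [1,   1,   1,    1,   1,   1,   0,   0],
})

majority_size = toy["label"].value_counts().max()

parts = []
for label, group in toy.groupby("label"):
    missing = majority_size - len(group)
    # Draw additional rows with replacement until the class reaches the majority size
    extra = group.sample(n=missing, replace=True, random_state=0)
    parts.append(pd.concat([group, extra]))

balanced = pd.concat(parts, ignore_index=True)
print(balanced["label"].value_counts())
```

Because the added rows are copies of existing ones, oversampling should only be applied to the training set, never to the test set.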

In [282]:
left[left[14] == 0]

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14
3,-0.046781,0.089669,0.545993,0.183895,-0.415888,-0.399822,1.514550,-0.899648,0.392608,0.704136,-0.146405,-0.218364,-0.035183,0.291799,0.0
4,0.832186,0.089669,1.244919,-0.333031,1.131635,-0.399822,-0.613412,-0.899648,0.392608,0.704136,-0.146405,4.469570,-0.035183,0.291799,0.0
5,0.465950,0.089669,-0.337704,0.442357,1.518515,-0.399822,-0.613412,-0.899648,-1.973032,0.704136,-0.146405,-0.218364,-0.035183,0.291799,0.0
7,1.711153,0.089669,-0.341675,-0.333031,1.131635,-0.399822,-0.613412,-0.899648,0.392608,0.704136,-0.146405,-0.218364,0.772897,0.291799,0.0
10,0.099714,0.089669,0.353557,0.442357,1.518515,-0.399822,-0.613412,-0.899648,0.392608,0.704136,1.874571,-0.218364,0.772897,0.291799,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
39571,1.271670,0.089669,0.289615,1.217745,-0.029007,-0.399822,1.514550,-0.899648,0.392608,0.704136,-0.146405,-0.218364,0.368857,0.291799,0.0
39572,-0.339770,1.464816,1.241910,0.959283,1.905396,1.592760,0.805229,1.583165,0.392608,0.704136,-0.146405,-0.218364,0.611281,0.291799,0.0
39573,-0.632759,0.089669,2.253530,0.442357,1.518515,0.928566,0.805229,-0.278945,0.392608,0.704136,1.268709,-0.218364,-0.035183,0.291799,0.0
39574,1.564659,-2.660627,-0.551537,-1.625344,-2.737171,-0.399822,-1.559173,-0.899648,0.392608,0.704136,-0.146405,-0.218364,-0.843262,0.291799,0.0


In [283]:
left[left[14] == 1]

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14
0,1.711153,0.089669,-0.558536,-1.366881,-2.350291,-1.728210,1.514550,1.583165,0.392608,-1.420180,-0.146405,-0.218364,-0.196799,-4.177447,1.0
1,-0.559511,0.089669,0.142638,-0.333031,1.131635,0.928566,-0.613412,-0.278945,0.392608,0.704136,-0.146405,-0.218364,0.772897,0.291799,1.0
2,-1.365231,0.089669,1.922611,0.183895,-0.415888,0.928566,1.278109,0.341758,0.392608,0.704136,-0.146405,-0.218364,1.580976,0.291799,1.0
6,0.612445,0.089669,-1.550025,1.217745,-0.029007,-1.728210,-1.322732,-0.278945,0.392608,-1.420180,-0.146405,-0.218364,-0.439223,0.291799,1.0
8,0.465950,2.152390,0.172110,1.217745,-0.029007,0.928566,-1.322732,1.583165,0.392608,-1.420180,-0.146405,-0.218364,-0.035183,0.291799,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
26040,1.344917,-2.660627,0.088952,-2.659195,-1.576529,1.592760,-1.559173,-0.278945,0.392608,0.704136,-0.146405,-0.218364,-0.843262,0.291799,1.0
26041,0.978681,-1.973053,-1.580630,1.217745,-0.029007,-1.728210,-1.322732,-0.278945,0.392608,0.704136,-0.146405,-0.218364,-0.035183,0.291799,1.0
26042,-1.145489,-1.285479,0.008917,-0.849956,0.744754,0.928566,1.041669,-0.278945,0.392608,0.704136,-0.146405,-0.218364,0.934513,0.291799,1.0
26043,-0.779253,0.089669,0.005889,-0.333031,1.131635,0.928566,1.278109,-0.278945,0.392608,0.704136,-0.146405,-0.218364,-0.035183,0.291799,1.0


In [284]:
training

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14
0,1.711153,0.089669,-0.558536,-1.366881,-2.350291,-1.728210,1.514550,1.583165,0.392608,-1.420180,-0.146405,-0.218364,-0.196799,-4.177447,1.0
1,-0.559511,0.089669,0.142638,-0.333031,1.131635,0.928566,-0.613412,-0.278945,0.392608,0.704136,-0.146405,-0.218364,0.772897,0.291799,1.0
2,-1.365231,0.089669,1.922611,0.183895,-0.415888,0.928566,1.278109,0.341758,0.392608,0.704136,-0.146405,-0.218364,1.580976,0.291799,1.0
3,-0.046781,0.089669,0.545993,0.183895,-0.415888,-0.399822,1.514550,-0.899648,0.392608,0.704136,-0.146405,-0.218364,-0.035183,0.291799,0.0
4,0.832186,0.089669,1.244919,-0.333031,1.131635,-0.399822,-0.613412,-0.899648,0.392608,0.704136,-0.146405,4.469570,-0.035183,0.291799,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
39571,1.271670,0.089669,0.289615,1.217745,-0.029007,-0.399822,1.514550,-0.899648,0.392608,0.704136,-0.146405,-0.218364,0.368857,0.291799,0.0
39572,-0.339770,1.464816,1.241910,0.959283,1.905396,1.592760,0.805229,1.583165,0.392608,0.704136,-0.146405,-0.218364,0.611281,0.291799,0.0
39573,-0.632759,0.089669,2.253530,0.442357,1.518515,0.928566,0.805229,-0.278945,0.392608,0.704136,1.268709,-0.218364,-0.035183,0.291799,0.0
39574,1.564659,-2.660627,-0.551537,-1.625344,-2.737171,-0.399822,-1.559173,-0.899648,0.392608,0.704136,-0.146405,-0.218364,-0.843262,0.291799,0.0


In [286]:
train_left = training[training.columns[:-1]]
train_right = training[training.columns[-1]]
test_left = test[test.columns[:-1]]
test_right = test[test.columns[-1]]

### Different ML Models

kNN - k-nearest neighbors
- A collection of already classified data points
- A new data point is placed among them
- Then its k nearest neighbors are checked: how many belong to one class and how many to the other
- The class that occurs more often among the neighbors is assigned to the new data point
- IMPORTANT: With two classes, k should be odd so that the vote cannot end in a tie
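
The steps listed above can be sketched in a few lines of NumPy. This is a simplified illustration with hypothetical toy points and a hypothetical helper `knn_predict`, not the implementation behind `KNeighborsClassifier`.

```python
import numpy as np

def knn_predict(points, labels, new_point, k=3):
    """Classify new_point by majority vote among its k nearest neighbors."""
    distances = np.linalg.norm(points - new_point, axis=1)  # Euclidean distance to every point
    nearest = np.argsort(distances)[:k]                     # indices of the k closest points
    votes = np.bincount(labels[nearest])                    # count the votes per class
    return int(np.argmax(votes))                            # class with the most votes

# Hypothetical 2D toy data: class 0 near the origin, class 1 near (5, 5)
points = np.array([[0.0, 0.0], [1.0, 0.5], [0.5, 1.0],
                   [5.0, 5.0], [4.5, 5.5], [5.5, 4.5]])
labels = np.array([0, 0, 0, 1, 1, 1])

near_origin = knn_predict(points, labels, np.array([0.8, 0.8]))
near_five = knn_predict(points, labels, np.array([4.8, 5.1]))
print(near_origin, near_five)
```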

In [287]:
from sklearn.neighbors import KNeighborsClassifier

In [288]:
knn = KNeighborsClassifier(7)  # Number of neighbors = 7

In [289]:
knn_model = knn.fit(train_left, train_right)

In [290]:
prediction = knn_model.predict(test_left)

In [291]:
(prediction == test_right).value_counts()

14
True     5068
False    1445
Name: count, dtype: int64

In [292]:
# Define a function to evaluate a model's predictions
def evaluate(prediction):
    correct = (prediction == test_right).sum()  # Count matches directly instead of relying on the order of value_counts()
    wrong = len(prediction) - correct
    print(f"Correct predictions: {correct / len(prediction) * 100}%")
    print(f"Incorrect predictions: {wrong / len(prediction) * 100}%")

In [293]:
def compare(prediction):
    outputs = [">50K", "<=50K"]
    actual = np.array([outputs[x] for x in test_right.astype(int)]).reshape(-1, 1)
    prediction = np.array([outputs[x] for x in prediction.astype(int)]).reshape(-1, 1)
    
    gesamt = np.hstack((test_left, actual, prediction))
    df = pd.DataFrame(gesamt)
    df.rename(columns={14: "Actual", 15: "Prediction"}, inplace=True)
    return df

In [294]:
evaluate(prediction)

Correct predictions: 77.81360356210656%
Incorrect predictions: 22.186396437893443%
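
Besides the overall share of correct predictions, it can help to see which class is being misclassified. The sketch below uses scikit-learn's `accuracy_score` and `confusion_matrix` on hypothetical labels and predictions; the arrays are made up and unrelated to the results above.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical labels and predictions (1 = "<=50K", 0 = ">50K")
actual = np.array([1, 1, 0, 1, 0, 0, 1, 1])
predicted = np.array([1, 0, 0, 1, 1, 0, 1, 1])

accuracy = accuracy_score(actual, predicted)   # share of matching entries
matrix = confusion_matrix(actual, predicted)   # rows: actual class, columns: predicted class

print(f"Accuracy: {accuracy * 100:.1f}%")
print(matrix)
```

The diagonal of the matrix holds the correct predictions per class; the off-diagonal entries show which class was confused with which.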


In [295]:
compare(prediction)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,Actual,Prediction
0,-0.7619072121359421,0.09157050605646816,0.6938955723637195,-0.3450678848159036,1.1477731298199816,0.8943503316894619,0.8316051420274628,-0.27321538091964637,0.39790769219551975,-1.4309893302419348,-0.14398643099098302,-0.20973069191908605,-0.036428740229222316,0.2906455882770199,<=50K,<=50K
1,-0.6147299644192042,0.09157050605646816,-1.397287665103948,1.2034579308519742,-0.04100710784514573,-0.4319027899620558,-0.5884283183242606,-0.9024416100319719,-1.9222111970340412,0.6988172300564477,-0.14398643099098302,-0.20973069191908605,4.785604577601512,0.2906455882770199,<=50K,>50K
2,-1.3506162030028936,-2.644203374428752,0.1318236572005176,0.171107387073389,-0.43726718706685486,0.8943503316894619,-1.5351172918920761,0.9852370773050046,0.39790769219551975,0.6988172300564477,-0.14398643099098302,-0.20973069191908605,-0.036428740229222316,0.2906455882770199,<=50K,<=50K
3,0.12115627416448509,0.09157050605646816,0.28600188045050373,-0.8612431567051962,0.7515130505982726,0.8943503316894619,-1.2984450485001222,-0.27321538091964637,0.39790769219551975,-1.4309893302419348,-0.14398643099098302,-0.20973069191908605,-0.036428740229222316,0.2906455882770199,<=50K,<=50K
4,0.6362766411730676,0.09157050605646816,0.23222482830890165,-0.6031555207605499,0.3552529713765634,-0.4319027899620558,-0.11508383154035276,-0.9024416100319719,0.39790769219551975,0.6988172300564477,-0.14398643099098302,-0.20973069191908605,-0.036428740229222316,0.2906455882770199,<=50K,>50K
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6508,-1.2770275791445247,0.09157050605646816,-0.1401550148675124,-0.8612431567051962,0.7515130505982726,0.8943503316894619,-1.2984450485001222,0.9852370773050046,0.39790769219551975,0.6988172300564477,-0.14398643099098302,-0.20973069191908605,-0.4450756315708099,0.2906455882770199,<=50K,<=50K
6509,0.48909939345632975,0.09157050605646816,0.18681763693419948,-0.3450678848159036,1.1477731298199816,-0.4319027899620558,-0.5884283183242606,-0.9024416100319719,-3.0822706416488215,0.6988172300564477,-0.14398643099098302,-0.20973069191908605,-0.036428740229222316,-2.2757518038708,>50K,>50K
6510,-0.9090844598526799,0.09157050605646816,1.3088341663553644,1.2034579308519742,-0.04100710784514573,-0.4319027899620558,0.12158841185160113,-0.9024416100319719,0.39790769219551975,0.6988172300564477,-0.14398643099098302,-0.20973069191908605,-0.036428740229222316,0.2906455882770199,<=50K,>50K
6511,1.1513970081816502,0.09157050605646816,-0.6659977879764305,0.171107387073389,-0.43726718706685486,-0.4319027899620558,-0.8251005617162145,-0.9024416100319719,-1.9222111970340412,0.6988172300564477,-0.14398643099098302,-0.20973069191908605,-0.036428740229222316,0.2906455882770199,<=50K,<=50K


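Naive Bayes
- Uses probabilities to classify points
- For each class, the probability that the point belongs to it is computed from the individual features (which are assumed to be independent of each other); the class with the highest probability is chosen, which with two classes means the one above 50%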
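
The relationship between class probabilities and the final prediction can be inspected with `predict_proba`. The following sketch uses hypothetical one-dimensional toy data; it only demonstrates that the predicted class is the one with the highest probability.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Hypothetical toy data: one feature, two well-separated classes
X = np.array([[1.0], [1.2], [0.8], [3.0], [3.2], [2.8]])
y = np.array([0, 0, 0, 1, 1, 1])

nb = GaussianNB().fit(X, y)

new_points = np.array([[1.1], [2.9]])
probabilities = nb.predict_proba(new_points)  # one column per class, each row sums to 1
predicted = nb.predict(new_points)            # class with the highest probability per row

print(probabilities.round(3))
print(predicted)
```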

In [296]:
from sklearn.naive_bayes import GaussianNB

In [297]:
nb = GaussianNB()

In [298]:
nb_model = nb.fit(train_left, train_right)

In [299]:
prediction = nb_model.predict(test_left)

In [300]:
evaluate(prediction)

Correct predictions: 82.00522032857363%
Incorrect predictions: 17.99477967142638%


In [301]:
compare(prediction)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,Actual,Prediction
0,-0.7619072121359421,0.09157050605646816,0.6938955723637195,-0.3450678848159036,1.1477731298199816,0.8943503316894619,0.8316051420274628,-0.27321538091964637,0.39790769219551975,-1.4309893302419348,-0.14398643099098302,-0.20973069191908605,-0.036428740229222316,0.2906455882770199,<=50K,<=50K
1,-0.6147299644192042,0.09157050605646816,-1.397287665103948,1.2034579308519742,-0.04100710784514573,-0.4319027899620558,-0.5884283183242606,-0.9024416100319719,-1.9222111970340412,0.6988172300564477,-0.14398643099098302,-0.20973069191908605,4.785604577601512,0.2906455882770199,<=50K,<=50K
2,-1.3506162030028936,-2.644203374428752,0.1318236572005176,0.171107387073389,-0.43726718706685486,0.8943503316894619,-1.5351172918920761,0.9852370773050046,0.39790769219551975,0.6988172300564477,-0.14398643099098302,-0.20973069191908605,-0.036428740229222316,0.2906455882770199,<=50K,<=50K
3,0.12115627416448509,0.09157050605646816,0.28600188045050373,-0.8612431567051962,0.7515130505982726,0.8943503316894619,-1.2984450485001222,-0.27321538091964637,0.39790769219551975,-1.4309893302419348,-0.14398643099098302,-0.20973069191908605,-0.036428740229222316,0.2906455882770199,<=50K,<=50K
4,0.6362766411730676,0.09157050605646816,0.23222482830890165,-0.6031555207605499,0.3552529713765634,-0.4319027899620558,-0.11508383154035276,-0.9024416100319719,0.39790769219551975,0.6988172300564477,-0.14398643099098302,-0.20973069191908605,-0.036428740229222316,0.2906455882770199,<=50K,<=50K
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6508,-1.2770275791445247,0.09157050605646816,-0.1401550148675124,-0.8612431567051962,0.7515130505982726,0.8943503316894619,-1.2984450485001222,0.9852370773050046,0.39790769219551975,0.6988172300564477,-0.14398643099098302,-0.20973069191908605,-0.4450756315708099,0.2906455882770199,<=50K,<=50K
6509,0.48909939345632975,0.09157050605646816,0.18681763693419948,-0.3450678848159036,1.1477731298199816,-0.4319027899620558,-0.5884283183242606,-0.9024416100319719,-3.0822706416488215,0.6988172300564477,-0.14398643099098302,-0.20973069191908605,-0.036428740229222316,-2.2757518038708,>50K,<=50K
6510,-0.9090844598526799,0.09157050605646816,1.3088341663553644,1.2034579308519742,-0.04100710784514573,-0.4319027899620558,0.12158841185160113,-0.9024416100319719,0.39790769219551975,0.6988172300564477,-0.14398643099098302,-0.20973069191908605,-0.036428740229222316,0.2906455882770199,<=50K,<=50K
6511,1.1513970081816502,0.09157050605646816,-0.6659977879764305,0.171107387073389,-0.43726718706685486,-0.4319027899620558,-0.8251005617162145,-0.9024416100319719,-1.9222111970340412,0.6988172300564477,-0.14398643099098302,-0.20973069191908605,-0.036428740229222316,0.2906455882770199,<=50K,<=50K


Logistic Regression
- The features of a data point are combined into a weighted sum, which the sigmoid function maps to a probability between 0 and 1
- If this probability is above the threshold (0.5 by default), the point gets one class, otherwise the other
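
The decision step of logistic regression can be written out by hand. In this sketch the weights and the bias are hypothetical values chosen for illustration; a fitted `LogisticRegression` would learn them from the training data.

```python
import numpy as np

def sigmoid(z):
    """Map any real number to the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical weights and bias; a trained model would learn these from data
weights = np.array([1.5, -2.0])
bias = 0.25

points = np.array([[2.0, 0.5], [0.0, 1.0]])
scores = points @ weights + bias             # weighted sum per point
probabilities = sigmoid(scores)              # probability of class 1
classes = (probabilities > 0.5).astype(int)  # apply the 0.5 threshold

print(probabilities.round(3), classes)
```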

In [302]:
from sklearn.linear_model import LogisticRegression

In [303]:
lr = LogisticRegression()

In [304]:
lr_model = lr.fit(train_left, train_right)

In [305]:
prediction = lr_model.predict(test_left)

In [306]:
evaluate(prediction)

Correct predictions: 77.41440196530017%
Incorrect predictions: 22.58559803469983%


In [307]:
compare(prediction)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,Actual,Prediction
0,-0.7619072121359421,0.09157050605646816,0.6938955723637195,-0.3450678848159036,1.1477731298199816,0.8943503316894619,0.8316051420274628,-0.27321538091964637,0.39790769219551975,-1.4309893302419348,-0.14398643099098302,-0.20973069191908605,-0.036428740229222316,0.2906455882770199,<=50K,<=50K
1,-0.6147299644192042,0.09157050605646816,-1.397287665103948,1.2034579308519742,-0.04100710784514573,-0.4319027899620558,-0.5884283183242606,-0.9024416100319719,-1.9222111970340412,0.6988172300564477,-0.14398643099098302,-0.20973069191908605,4.785604577601512,0.2906455882770199,<=50K,>50K
2,-1.3506162030028936,-2.644203374428752,0.1318236572005176,0.171107387073389,-0.43726718706685486,0.8943503316894619,-1.5351172918920761,0.9852370773050046,0.39790769219551975,0.6988172300564477,-0.14398643099098302,-0.20973069191908605,-0.036428740229222316,0.2906455882770199,<=50K,<=50K
3,0.12115627416448509,0.09157050605646816,0.28600188045050373,-0.8612431567051962,0.7515130505982726,0.8943503316894619,-1.2984450485001222,-0.27321538091964637,0.39790769219551975,-1.4309893302419348,-0.14398643099098302,-0.20973069191908605,-0.036428740229222316,0.2906455882770199,<=50K,<=50K
4,0.6362766411730676,0.09157050605646816,0.23222482830890165,-0.6031555207605499,0.3552529713765634,-0.4319027899620558,-0.11508383154035276,-0.9024416100319719,0.39790769219551975,0.6988172300564477,-0.14398643099098302,-0.20973069191908605,-0.036428740229222316,0.2906455882770199,<=50K,>50K
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6508,-1.2770275791445247,0.09157050605646816,-0.1401550148675124,-0.8612431567051962,0.7515130505982726,0.8943503316894619,-1.2984450485001222,0.9852370773050046,0.39790769219551975,0.6988172300564477,-0.14398643099098302,-0.20973069191908605,-0.4450756315708099,0.2906455882770199,<=50K,<=50K
6509,0.48909939345632975,0.09157050605646816,0.18681763693419948,-0.3450678848159036,1.1477731298199816,-0.4319027899620558,-0.5884283183242606,-0.9024416100319719,-3.0822706416488215,0.6988172300564477,-0.14398643099098302,-0.20973069191908605,-0.036428740229222316,-2.2757518038708,>50K,>50K
6510,-0.9090844598526799,0.09157050605646816,1.3088341663553644,1.2034579308519742,-0.04100710784514573,-0.4319027899620558,0.12158841185160113,-0.9024416100319719,0.39790769219551975,0.6988172300564477,-0.14398643099098302,-0.20973069191908605,-0.036428740229222316,0.2906455882770199,<=50K,<=50K
6511,1.1513970081816502,0.09157050605646816,-0.6659977879764305,0.171107387073389,-0.43726718706685486,-0.4319027899620558,-0.8251005617162145,-0.9024416100319719,-1.9222111970340412,0.6988172300564477,-0.14398643099098302,-0.20973069191908605,-0.036428740229222316,0.2906455882770199,<=50K,<=50K


Support Vector Machines
- Places a separating line (hyperplane) through the feature space
- A margin extends from this hyperplane in both directions
- The hyperplane is positioned so that this margin is as wide as possible; the points closest to it, which bound the margin, are the support vectors
- A new point is assigned a class depending on which side of the hyperplane it lies
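
The following sketch shows the hyperplane idea on hypothetical, linearly separable toy data. It uses `kernel="linear"` so that the boundary is a flat hyperplane; the notebook itself uses `SVC()` with its default kernel, which can produce curved boundaries.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical, linearly separable toy data
X = np.array([[0.0, 0.0], [1.0, 1.0], [0.0, 1.0],
              [4.0, 4.0], [5.0, 5.0], [4.0, 5.0]])
y = np.array([0, 0, 0, 1, 1, 1])

svc = SVC(kernel="linear").fit(X, y)

new_points = np.array([[0.5, 0.5], [4.5, 4.5]])
side = svc.decision_function(new_points)  # the sign tells on which side of the hyperplane a point lies
predicted = svc.predict(new_points)

print(svc.support_vectors_)               # the training points that bound the margin
print(side.round(2), predicted)
```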

In [308]:
from sklearn.svm import SVC

In [309]:
svc = SVC()

In [310]:
svc_model = svc.fit(train_left, train_right)

In [311]:
prediction = svc_model.predict(test_left)

In [312]:
evaluate(prediction)

Correct predictions: 79.53324120988792%
Incorrect predictions: 20.46675879011208%


In [313]:
compare(prediction)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,Actual,Prediction
0,-0.7619072121359421,0.09157050605646816,0.6938955723637195,-0.3450678848159036,1.1477731298199816,0.8943503316894619,0.8316051420274628,-0.27321538091964637,0.39790769219551975,-1.4309893302419348,-0.14398643099098302,-0.20973069191908605,-0.036428740229222316,0.2906455882770199,<=50K,<=50K
1,-0.6147299644192042,0.09157050605646816,-1.397287665103948,1.2034579308519742,-0.04100710784514573,-0.4319027899620558,-0.5884283183242606,-0.9024416100319719,-1.9222111970340412,0.6988172300564477,-0.14398643099098302,-0.20973069191908605,4.785604577601512,0.2906455882770199,<=50K,>50K
2,-1.3506162030028936,-2.644203374428752,0.1318236572005176,0.171107387073389,-0.43726718706685486,0.8943503316894619,-1.5351172918920761,0.9852370773050046,0.39790769219551975,0.6988172300564477,-0.14398643099098302,-0.20973069191908605,-0.036428740229222316,0.2906455882770199,<=50K,<=50K
3,0.12115627416448509,0.09157050605646816,0.28600188045050373,-0.8612431567051962,0.7515130505982726,0.8943503316894619,-1.2984450485001222,-0.27321538091964637,0.39790769219551975,-1.4309893302419348,-0.14398643099098302,-0.20973069191908605,-0.036428740229222316,0.2906455882770199,<=50K,<=50K
4,0.6362766411730676,0.09157050605646816,0.23222482830890165,-0.6031555207605499,0.3552529713765634,-0.4319027899620558,-0.11508383154035276,-0.9024416100319719,0.39790769219551975,0.6988172300564477,-0.14398643099098302,-0.20973069191908605,-0.036428740229222316,0.2906455882770199,<=50K,>50K
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6508,-1.2770275791445247,0.09157050605646816,-0.1401550148675124,-0.8612431567051962,0.7515130505982726,0.8943503316894619,-1.2984450485001222,0.9852370773050046,0.39790769219551975,0.6988172300564477,-0.14398643099098302,-0.20973069191908605,-0.4450756315708099,0.2906455882770199,<=50K,<=50K
6509,0.48909939345632975,0.09157050605646816,0.18681763693419948,-0.3450678848159036,1.1477731298199816,-0.4319027899620558,-0.5884283183242606,-0.9024416100319719,-3.0822706416488215,0.6988172300564477,-0.14398643099098302,-0.20973069191908605,-0.036428740229222316,-2.2757518038708,>50K,>50K
6510,-0.9090844598526799,0.09157050605646816,1.3088341663553644,1.2034579308519742,-0.04100710784514573,-0.4319027899620558,0.12158841185160113,-0.9024416100319719,0.39790769219551975,0.6988172300564477,-0.14398643099098302,-0.20973069191908605,-0.036428740229222316,0.2906455882770199,<=50K,>50K
6511,1.1513970081816502,0.09157050605646816,-0.6659977879764305,0.171107387073389,-0.43726718706685486,-0.4319027899620558,-0.8251005617162145,-0.9024416100319719,-1.9222111970340412,0.6988172300564477,-0.14398643099098302,-0.20973069191908605,-0.036428740229222316,0.2906455882770199,<=50K,>50K


### Neural Network

We can now build our own model.

This model consists of nodes that are connected to each other in layers.

Each node (neuron) consists of:
- The inputs, each of which is multiplied by a weight
- The summation step, which adds up all weighted inputs (plus a bias)
- The activation function, a mathematical function that turns this sum into exactly one output value

These neurons are arranged in layers and connected to each other.

Each layer is computed completely, and then its values are passed on to the next layer.

Once the entire network has been run, the final result is compared with the expected values, which yields a metric called loss; the weights are then adjusted based on this loss.

The loss should be as small as possible: the lower it is, the closer the model's predictions are to the expected values.
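
To make the description of neurons, layers, and loss more tangible, here is a forward pass through a tiny network in plain NumPy. All inputs, weights, and biases are hypothetical numbers picked for illustration; Keras performs the same kind of computation internally and additionally adjusts the weights during training.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical inputs, weights, and biases; training would adjust the weights and biases
inputs = np.array([0.5, -1.0, 2.0])

hidden_weights = np.array([[0.2, -0.4],
                           [0.7, 0.1],
                           [-0.3, 0.5]])   # 3 inputs -> 2 hidden neurons
hidden_bias = np.array([0.1, -0.2])
output_weights = np.array([0.6, -0.8])     # 2 hidden neurons -> 1 output neuron
output_bias = 0.05

hidden = relu(inputs @ hidden_weights + hidden_bias)     # multiply, sum, activate
output = sigmoid(hidden @ output_weights + output_bias)  # probability between 0 and 1

expected = 1.0
# Binary cross-entropy: small when the output is close to the expected label
loss = -(expected * np.log(output) + (1 - expected) * np.log(1 - output))

print(f"output={output:.3f}, loss={loss:.3f}")
```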

In [314]:
import tensorflow as tf

In [315]:
# Build the model with the Sequential API (Keras)
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(14,)),  # The first layer, the input layer (14 features)
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid")  # The last layer, the output layer
])

In [316]:
model.compile(loss="binary_crossentropy", metrics=["accuracy"])

In [None]:
model.fit(train_left, train_right,
          verbose=1,
          epochs=50)  # verbose: show progress output, epochs: number of passes over the training data

Epoch 1/50
1237/1237 ━━━━━━━━━━━━━━━━━━━━ 6s 3ms/step - accuracy: 0.7521 - loss: 0.4944
Epoch 2/50
 144/1237 ━━━━━━━━━━━━━━━━━━━━ 3s 4ms/step - accuracy: 0.8172 - loss: 0.4045

In [None]:
prediction = (model.predict(test_left).reshape(-1) > 0.5).astype(int)  # Turn the sigmoid outputs (probabilities) into class labels 0 or 1

In [None]:
evaluate(prediction)

In [None]:
compare(prediction)

In [None]:
model.save("Income.keras")

In [None]:
model = tf.keras.models.load_model("Income.keras")

In [None]:
prediction = (model.predict(test_left).reshape(-1) > 0.5).astype(int)  # Turn the sigmoid outputs (probabilities) into class labels 0 or 1