## Machine Learning

Was ist ML?

In einem gegebenen Datenset Zusammenhänge erkennen; später bei neuen Daten diese Zusammenhänge erkennen, und eine "Vorhersage" treffen

---

Beispiel: Wetter

Inputs: Monat, Tageszeit, Wolken %, Niederschlag %

Output: Temperatur

Beispiel: 4, 10:00, 75, 0 -> 20°C, 18°C

Bei 4 Parametern kann dies per Hand gemacht werden, bei 10 Parametern ist dies etwas schwieriger

---

Jetzt kann ein ML Prozess verwendet werden, welches alle Input Parameter berücksichtigt, und am Ende ein Model erzeugt

### Income.csv

Enthält Personendaten, welche nach ihrem Gehalt klassifiziert werden sollen (über/unter 50.000$/Jahr)

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [2]:
data = pd.read_csv("Data/Income.csv")

In [3]:
data

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,gender,capital-gain,capital-loss,hours-per-week,native-country,class
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
32556,27,Private,257302,Assoc-acdm,12,Married-civ-spouse,Tech-support,Wife,White,Female,0,0,38,United-States,<=50K
32557,40,Private,154374,HS-grad,9,Married-civ-spouse,Machine-op-inspct,Husband,White,Male,0,0,40,United-States,>50K
32558,58,Private,151910,HS-grad,9,Widowed,Adm-clerical,Unmarried,White,Female,0,0,40,United-States,<=50K
32559,22,Private,201490,HS-grad,9,Never-married,Adm-clerical,Own-child,White,Male,0,0,20,United-States,<=50K


### Probleme mit dem Datenset

Zuerst muss das Datenset für ML angepasst werden:
- Alle Daten müssen numerisch sein
- Outlier reduzieren
- Unebenheiten ausgleichen (zwischen den Klassen)

#### class binär machen

In [4]:
data["class"].value_counts()

class
<=50K    24720
>50K      7841
Name: count, dtype: int64

In [5]:
data["class"] == ">50K"

0        False
1        False
2        False
3        False
4        False
         ...  
32556    False
32557     True
32558    False
32559    False
32560     True
Name: class, Length: 32561, dtype: bool

In [6]:
(data["class"] == ">50K").astype(int)

0        0
1        0
2        0
3        0
4        0
        ..
32556    0
32557    1
32558    0
32559    0
32560    1
Name: class, Length: 32561, dtype: int64

In [7]:
data["class"] = (data["class"] == ">50K").astype(int)

In [8]:
data

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,gender,capital-gain,capital-loss,hours-per-week,native-country,class
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,0
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,0
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,0
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,0
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
32556,27,Private,257302,Assoc-acdm,12,Married-civ-spouse,Tech-support,Wife,White,Female,0,0,38,United-States,0
32557,40,Private,154374,HS-grad,9,Married-civ-spouse,Machine-op-inspct,Husband,White,Male,0,0,40,United-States,1
32558,58,Private,151910,HS-grad,9,Widowed,Adm-clerical,Unmarried,White,Female,0,0,40,United-States,0
32559,22,Private,201490,HS-grad,9,Never-married,Adm-clerical,Own-child,White,Male,0,0,20,United-States,0


#### Alle Werte numerisch machen

Technik: LabelEncoding

Weist jedem Text pro Spalte eine Zahl zu; wandelt alle Texte zu der entsprechenden Zahl um

In [9]:
from sklearn.preprocessing import LabelEncoder

enc = LabelEncoder()
# enc.fit_transform(data["workclass"])  # Spalte konvertieren
data["workclass"] = enc.fit_transform(data["workclass"])

In [10]:
data

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,gender,capital-gain,capital-loss,hours-per-week,native-country,class
0,39,7,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,0
1,50,6,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,0
2,38,4,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,0
3,53,4,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,0
4,28,4,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
32556,27,4,257302,Assoc-acdm,12,Married-civ-spouse,Tech-support,Wife,White,Female,0,0,38,United-States,0
32557,40,4,154374,HS-grad,9,Married-civ-spouse,Machine-op-inspct,Husband,White,Male,0,0,40,United-States,1
32558,58,4,151910,HS-grad,9,Widowed,Adm-clerical,Unmarried,White,Female,0,0,40,United-States,0
32559,22,4,201490,HS-grad,9,Never-married,Adm-clerical,Own-child,White,Male,0,0,20,United-States,0


In [11]:
pd.Series(enc.fit_transform(data["workclass"])).value_counts()

4    22696
6     2541
2     2093
0     1836
7     1298
5     1116
1      960
8       14
3        7
Name: count, dtype: int64

In [12]:
data["workclass"].value_counts()

workclass
4    22696
6     2541
2     2093
0     1836
7     1298
5     1116
1      960
8       14
3        7
Name: count, dtype: int64

In [13]:
enc = LabelEncoder()
for col in data.select_dtypes(object):
    data[col] = enc.fit_transform(data[col])

In [14]:
data

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,gender,capital-gain,capital-loss,hours-per-week,native-country,class
0,39,7,77516,9,13,4,1,1,4,1,2174,0,40,39,0
1,50,6,83311,9,13,2,4,0,4,1,0,0,13,39,0
2,38,4,215646,11,9,0,6,1,4,1,0,0,40,39,0
3,53,4,234721,1,7,2,6,0,2,1,0,0,40,39,0
4,28,4,338409,9,13,2,10,5,2,0,0,0,40,5,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
32556,27,4,257302,7,12,2,13,5,4,0,0,0,38,39,0
32557,40,4,154374,11,9,2,7,0,4,1,0,0,40,39,1
32558,58,4,151910,11,9,6,1,4,4,0,0,0,40,39,0
32559,22,4,201490,11,9,4,1,3,4,1,0,0,20,39,0


#### Aufteilen des Datensets

Um das Model testen zu können, benötigen wir ein Testset

Das Testset ist ein Teil des Gesamtsets, welcher nicht mehr verändert wird (keine Skalierung, Ausgleichung)

Trainingsset: 80%, Testset: 20%

In [15]:
random = data.sample(frac=1)  # Datenset zufällig ordnen
pct80 = int(len(data) * 0.8)
training = random.iloc[0:pct80]
test = random.iloc[pct80:]

training_left = training.iloc[:, :-1]  # Spalte 0 bis zur letzten Spalte
training_right = training["class"]

test_left = test.iloc[:, :-1]  # Spalte 0 bis zur letzten Spalte
test_right = test["class"]

#### Werte skalieren

Jetzt werden die Outlier reduziert

Statt 6-stelligen Werten haben wir nurnoch 1-stellige Werte

In [16]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
# scaler.fit_transform(training_left)
training_left = pd.DataFrame(scaler.fit_transform(training_left))

# scaler.fit_transform(test_left)
test_left = pd.DataFrame(scaler.fit_transform(test_left))

#### Unebenheiten beheben

Die Klassen sind momentan noch uneben (viel mehr <=50K als >50K)

Dies hat den Nachteil, das sich beim Model ein Bias entwickeln könnte

In [17]:
training_right.value_counts()

class
0    19725
1     6323
Name: count, dtype: int64

In [18]:
from imblearn.over_sampling import RandomOverSampler

ros = RandomOverSampler()

left, right = ros.fit_resample(training_left, training_right)

training_left = pd.DataFrame(left)
training_right = pd.DataFrame(right)

### Verschiedene vorgegebene Modelle

#### k-nearest neighbors (kNN)

- Neuer Datenpunkt hat keine Klasse
- Sucht nach N Nachbarn
- Die Klasse, die am häufigsten zw. den Nachbarn vetreten ist, wird die neue Klasse des Datenpunkts 

In [19]:
from sklearn.neighbors import KNeighborsClassifier

In [20]:
knn = KNeighborsClassifier()

In [21]:
knn_model = knn.fit(training_left, training_right.values.reshape(-1))

In [22]:
knn_model.predict(test_left)

array([1, 1, 1, ..., 0, 0, 1])

In [23]:
pd.Series(knn_model.predict(test_left)).value_counts()

0    4194
1    2319
Name: count, dtype: int64

In [24]:
prediction = knn_model.predict(test_left)

In [25]:
prediction.reshape(-1, 1)

array([[1],
       [1],
       [1],
       ...,
       [0],
       [0],
       [1]])

In [26]:
test_right.values.reshape(-1, 1)

array([[1],
       [1],
       [1],
       ...,
       [0],
       [0],
       [0]])

In [27]:
prediction.reshape(-1, 1) == test_right.values.reshape(-1, 1)

array([[ True],
       [ True],
       [ True],
       ...,
       [ True],
       [ True],
       [False]])

In [28]:
x = pd.DataFrame( np.hstack([test_left, test_right.values.reshape(-1, 1), prediction.reshape(-1, 1), prediction.reshape(-1, 1) == test_right.values.reshape(-1, 1)]) )

In [29]:
x.iloc[:, -1].value_counts()

16
1.0    5038
0.0    1475
Name: count, dtype: int64

In [30]:
def evaluate(prediction):
    p = prediction.reshape(-1, 1)
    a = test_right.values.reshape(-1, 1)
    v = pd.Series((p == a).reshape(-1))
    vc = v.value_counts()
    print(f"Richtige Vorhersagen: {vc.iloc[0] / len(v) * 100}%, Falsche Vorhersagen: {vc.iloc[1] / len(v) * 100}%")

In [31]:
evaluate(prediction)

Richtige Vorhersagen: 77.35298633502227%, Falsche Vorhersagen: 22.647013664977734%


#### Naive Bayes

Verwendet Wahrscheinlichkeiten um die Datenpunkte zu klassifizieren

Am Ende wird mit 50% verglichen

In [32]:
from sklearn.naive_bayes import GaussianNB

In [33]:
nb = GaussianNB()

In [34]:
nb_model = nb.fit(training_left, training_right.values.reshape(-1))

In [35]:
prediction = nb_model.predict(test_left)

In [36]:
evaluate(prediction)

Richtige Vorhersagen: 81.97451251343468%, Falsche Vorhersagen: 18.025487486565332%


#### Logistische Regression

Verwendet wie Naive Bayes Wahrscheinlichkeiten und vergleicht mit einer Threshold (konfigurierbar)

In [37]:
from sklearn.linear_model import LogisticRegression

In [38]:
lr = LogisticRegression()

In [39]:
lr_model = lr.fit(training_left, training_right.values.reshape(-1))

In [40]:
prediction = lr_model.predict(test_left)

In [41]:
evaluate(prediction)

Richtige Vorhersagen: 76.26285889758944%, Falsche Vorhersagen: 23.737141102410565%


#### Support Vector Machines

Legt eine Hyperplane in den Raum

Breitet danach ein Margin in Schritten aus

Alle Punkte die vom Margin berührt werden, werden auf die entsprechende Seite klassifiziert

In [42]:
from sklearn.svm import SVC

In [43]:
svc = SVC()

In [44]:
svc_model = svc.fit(training_left, training_right.values.reshape(-1))

In [45]:
prediction = svc_model.predict(test_left)

In [46]:
evaluate(prediction)

Richtige Vorhersagen: 79.13403961308153%, Falsche Vorhersagen: 20.86596038691847%


### Neurales Netzwerk

Eigenes Model, welches aus Schichten (Layer) und Knoten (Neuron) besteht

Wird primär über die __Keras__ API aufgebaut

Für das neurale Netzwerk wird das Tensorflow Paket benötigt

---



In [47]:
import tensorflow as tf

In [88]:
model = tf.keras.Sequential([
    tf.keras.layers.Input((14,)),
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid")
])

In [89]:
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
              loss="binary_crossentropy")

In [90]:
model.fit(training_left,
          training_right.values.reshape(-1),
          verbose=1,
          epochs=10,
          batch_size=8)

Epoch 1/10
[1m1233/1233[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m2s[0m 1ms/step - loss: 0.4769
Epoch 2/10
[1m1233/1233[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 873us/step - loss: 0.3921
Epoch 3/10
[1m1233/1233[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 864us/step - loss: 0.3873
Epoch 4/10
[1m1233/1233[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 986us/step - loss: 0.3841
Epoch 5/10
[1m1233/1233[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 880us/step - loss: 0.3800
Epoch 6/10
[1m1233/1233[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 867us/step - loss: 0.3751
Epoch 7/10
[1m1233/1233[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 880us/step - loss: 0.3731
Epoch 8/10
[1m1233/1233[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 866us/step - loss: 0.3679
Epoch 9/10
[1m1233/1233[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 900us/step - loss: 0.3674
Epoch 10/10
[1m1233/1233[0m [32m━━━━━━━━━━━━━━━━━━━━[

<keras.src.callbacks.history.History at 0x215581c2c60>

In [91]:
model.predict(test_left)

[1m204/204[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 704us/step


array([[0.86214817],
       [0.579123  ],
       [0.8813972 ],
       ...,
       [0.4438342 ],
       [0.03492748],
       [0.7424117 ]], dtype=float32)

In [92]:
prediction = model.predict(test_left)

[1m204/204[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 783us/step


In [93]:
predictionFinal = (prediction > 0.5).astype(int).reshape(-1)

In [94]:
evaluate(predictionFinal)

Richtige Vorhersagen: 79.0879778903731%, Falsche Vorhersagen: 20.9120221096269%
