## Machine Learning

Teaching the computer which relationships exist in a given dataset

e.g. the weather

Example: February, 100% precipitation, 5 m/s wind -> temperature?: 6°C

The algorithm is fed historical weather data and detects real relationships in that data

Once the data has been prepared, it is used in a learning process

This learning process is repeated X times, and the end result is a model

This model can then receive new data and make a prediction
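
The loop of "repeat a learning step X times, then predict" can be sketched with plain NumPy. This is only an illustration under assumptions: the weather numbers are made up, and a simple linear model trained by gradient descent stands in for whatever algorithm is actually used.

```python
import numpy as np

# Hypothetical, made-up weather records (not real measurements):
# columns = month (1-12), precipitation (0-1), wind speed in m/s
X = np.array([
    [1, 0.9, 6.0],
    [2, 1.0, 5.0],
    [4, 0.4, 3.0],
    [7, 0.1, 2.0],
    [8, 0.2, 2.5],
    [11, 0.8, 4.0],
], dtype=float)
y = np.array([2.0, 6.0, 12.0, 25.0, 24.0, 7.0])  # temperature in °C (made up)

# Standardize the features so that gradient descent behaves well
mean, std = X.mean(axis=0), X.std(axis=0)
Xs = (X - mean) / std

weights = np.zeros(X.shape[1])
bias = 0.0
learning_rate = 0.1
losses = []

# The learning process: the same update step is repeated many times
for step in range(500):
    pred = Xs @ weights + bias
    error = pred - y
    losses.append(float(np.mean(error ** 2)))  # mean squared error
    weights -= learning_rate * (2 / len(y)) * (Xs.T @ error)
    bias -= learning_rate * 2 * error.mean()

# The resulting "model" is just the learned weights and bias
new_day = (np.array([2, 1.0, 5.0]) - mean) / std
predicted_temperature = float(new_day @ weights + bias)
print(predicted_temperature)
```

The recorded `losses` shrink over the iterations, which is the sense in which the model "learns" from the historical data.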

### Dataset

The Income.csv dataset contains personal data and indicates which people earn more or less than $50,000 per year

In [1]:
import pandas as pd
import numpy as np

In [2]:
data = pd.read_csv("Data/Income.csv")

In [3]:
data

,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,gender,capital-gain,capital-loss,hours-per-week,native-country,class
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
32556,27,Private,257302,Assoc-acdm,12,Married-civ-spouse,Tech-support,Wife,White,Female,0,0,38,United-States,<=50K
32557,40,Private,154374,HS-grad,9,Married-civ-spouse,Machine-op-inspct,Husband,White,Male,0,0,40,United-States,>50K
32558,58,Private,151910,HS-grad,9,Widowed,Adm-clerical,Unmarried,White,Female,0,0,40,United-States,<=50K
32559,22,Private,201490,HS-grad,9,Never-married,Adm-clerical,Own-child,White,Male,0,0,20,United-States,<=50K


### Problems in the dataset

- Non-numeric data
- Uneven data (hours-per-week: mostly 40, sometimes different)
- Imbalanced class distribution (more <=50K than >50K)

In [4]:
len(data[data["class"] == ">50K"])

7841

In [5]:
len(data[data["class"] == "<=50K"])

24720
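
The two counts can also be obtained in one step. A minimal sketch, assuming a small made-up label column in place of `data["class"]`:

```python
import pandas as pd

# Hypothetical mini label column standing in for data["class"]
labels = pd.Series(["<=50K", "<=50K", ">50K", "<=50K"], name="class")

counts = labels.value_counts()                # absolute count per class
shares = labels.value_counts(normalize=True)  # relative frequency per class (sums to 1)

print(counts)
print(shares)
```

The relative frequencies make the imbalance easier to read than two separate `len(...)` calls.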

#### Non-numeric data

Every term within a column is replaced by a number

Within a column, each number is assigned to exactly one term

In [6]:
from sklearn.preprocessing import LabelEncoder

In [7]:
enc = LabelEncoder()

In [8]:
enc.fit_transform(data["workclass"])  # Encode a single column

array([7, 6, 4, ..., 4, 4, 5])
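
To see which number belongs to which term, the fitted encoder can be inspected. A small sketch with made-up sample values (not taken from Income.csv):

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample values for illustration only
values = ["Private", "State-gov", "Private", "Self-emp-not-inc"]

enc = LabelEncoder()
codes = enc.fit_transform(values)

# classes_ holds the sorted unique terms; a term's code is its position in this array
mapping = {label: code for code, label in enumerate(enc.classes_)}
print(mapping)

# inverse_transform turns the numbers back into the original terms
restored = list(enc.inverse_transform(codes))
print(restored)
```

Note that the integer codes imply an order between terms that has no real meaning; one-hot encoding is a common alternative when that matters for the chosen algorithm.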

In [9]:
def encodeColumn(col: str):
    # Replace the terms in column `col` of the global `data` with integer codes (in place)
    enc = LabelEncoder()
    encodedCol = enc.fit_transform(data[col])
    data[col] = encodedCol

In [10]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32561 entries, 0 to 32560
Data columns (total 15 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   age             32561 non-null  int64 
 1   workclass       32561 non-null  object
 2   fnlwgt          32561 non-null  int64 
 3   education       32561 non-null  object
 4   education-num   32561 non-null  int64 
 5   marital-status  32561 non-null  object
 6   occupation      32561 non-null  object
 7   relationship    32561 non-null  object
 8   race            32561 non-null  object
 9   gender          32561 non-null  object
 10  capital-gain    32561 non-null  int64 
 11  capital-loss    32561 non-null  int64 
 12  hours-per-week  32561 non-null  int64 
 13  native-country  32561 non-null  object
 14  class           32561 non-null  object
dtypes: int64(6), object(9)
memory usage: 3.7+ MB


In [11]:
columns = ["workclass", "education", "marital-status", "occupation", "relationship", "race", "gender", "native-country"]

In [12]:
for c in columns:
    encodeColumn(c)

In [13]:
data

,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,gender,capital-gain,capital-loss,hours-per-week,native-country,class
0,39,7,77516,9,13,4,1,1,4,1,2174,0,40,39,<=50K
1,50,6,83311,9,13,2,4,0,4,1,0,0,13,39,<=50K
2,38,4,215646,11,9,0,6,1,4,1,0,0,40,39,<=50K
3,53,4,234721,1,7,2,6,0,2,1,0,0,40,39,<=50K
4,28,4,338409,9,13,2,10,5,2,0,0,0,40,5,<=50K
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
32556,27,4,257302,7,12,2,13,5,4,0,0,0,38,39,<=50K
32557,40,4,154374,11,9,2,7,0,4,1,0,0,40,39,>50K
32558,58,4,151910,11,9,6,1,4,4,0,0,0,40,39,<=50K
32559,22,4,201490,11,9,4,1,3,4,1,0,0,20,39,<=50K


In [14]:
encodeColumn("class")

#### Splitting the data

In [15]:
data.sample()  # Picks one random row from the table

,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,gender,capital-gain,capital-loss,hours-per-week,native-country,class
22446,45,4,88061,1,7,3,7,4,1,0,0,0,40,35,0


In [16]:
data.sample(frac=1)  # frac=1 returns all rows in random order (a shuffle)

,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,gender,capital-gain,capital-loss,hours-per-week,native-country,class
16342,53,4,346754,15,10,2,4,5,4,0,0,0,40,39,1
31230,39,4,314007,0,6,2,3,0,4,1,0,2051,40,39,0
15648,26,6,93806,15,10,2,12,0,4,1,0,0,55,39,0
26785,42,4,341638,1,7,0,3,1,4,1,0,0,40,39,0
20834,47,4,115613,11,9,2,3,0,4,1,0,0,50,39,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
26204,22,4,37894,11,9,5,8,2,4,1,0,0,35,39,0
15657,28,4,219267,11,9,4,8,4,4,0,0,0,28,39,0
28633,21,4,174907,7,12,4,12,3,4,0,0,0,32,39,0
27399,34,4,149507,11,9,2,3,0,4,1,0,0,55,39,1


In [17]:
trainingLen = int(len(data) * 0.8)

In [18]:
testLen = len(data) - trainingLen

In [19]:
print(trainingLen)
print(testLen)
print(trainingLen + testLen)

26048
6513
32561


In [20]:
# Shuffle all rows, then cut at trainingLen; the empty third part that np.split returns is ignored
split = np.split(data.sample(frac=1), (trainingLen, len(data)))



In [21]:
training = split[0]
test = split[1]
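
A commonly used alternative to the manual shuffle and split is scikit-learn's `train_test_split`. The following is a sketch on made-up toy arrays, not the method used in this notebook; the seed and the `stratify` option are illustrative choices.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 rows, 2 feature columns, binary labels
X = np.arange(20).reshape(10, 2)
y = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,    # 80 % training, 20 % test
    shuffle=True,     # shuffle before splitting
    stratify=y,       # keep the class ratio similar in both parts
    random_state=42,  # arbitrary seed for reproducibility
)
print(X_train.shape, X_test.shape)
```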

#### Fixing uneven data

In [22]:
from sklearn.preprocessing import StandardScaler

In [23]:
scaler = StandardScaler()

# IMPORTANT: exclude the last column (class); it should stay 0 and 1
scaled = scaler.fit_transform(training.iloc[:, :-1])

# Re-attach the unscaled class column as the last column
newTraining = np.hstack((scaled, training["class"].values.reshape(-1, 1)))

# The new DataFrame has integer column names 0..14 (14 = class)
training = pd.DataFrame(newTraining)

In [24]:
training

,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14
0,-0.337820,0.091508,-0.292989,1.211947,-0.028741,-0.406190,-0.131729,-0.902756,0.391954,0.706394,-0.146971,-0.215359,-0.035806,0.291621,0.0
1,0.101012,0.091508,-1.491743,0.438329,1.522877,-0.406190,0.814507,-0.902756,0.391954,0.706394,-0.146971,-0.215359,1.584416,0.291621,0.0
2,-0.922929,0.091508,-0.997048,-0.335290,1.134972,-0.406190,-0.841407,-0.902756,0.391954,0.706394,-0.146971,-0.215359,-0.035806,0.291621,1.0
3,-0.922929,0.091508,-0.395692,-0.335290,1.134972,0.919014,1.287625,0.956921,0.391954,-1.415640,-0.146971,-0.215359,0.369249,-4.682261,0.0
4,-0.191543,0.091508,-1.044337,1.211947,-0.028741,-0.406190,1.287625,-0.902756,0.391954,0.706394,-0.146971,-0.215359,-0.035806,0.291621,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
26043,-0.849790,-1.283113,0.704317,1.211947,-0.028741,0.919014,-0.604848,1.576813,0.391954,0.706394,-0.146971,-0.215359,-0.035806,0.291621,0.0
26044,-0.703513,0.091508,0.035368,0.180456,-0.416646,0.919014,0.104830,-0.282864,0.391954,-1.415640,-0.146971,-0.215359,-0.035806,0.291621,0.0
26045,-0.264681,1.466128,0.254247,-2.140401,-0.804550,0.919014,0.341389,-0.282864,-1.962664,-1.415640,-0.146971,-0.215359,-0.035806,0.164085,0.0
26046,0.247289,0.778818,0.835860,-0.077417,2.298686,-0.406190,0.814507,-0.902756,0.391954,0.706394,-0.146971,4.705395,1.584416,0.291621,1.0


In [25]:
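scaler = StandardScaler()

# IMPORTANT: exclude the last column (class); it should stay 0 and 1
# Note: this fits a second scaler on the test set; reusing the scaler fitted
# on the training set via transform() is the more common practice
scaled = scaler.fit_transform(test.iloc[:, :-1])

newTest = np.hstack((scaled, test["class"].values.reshape(-1, 1)))

test = pd.DataFrame(newTest)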
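
The cells above fit one `StandardScaler` on the training set and a separate one on the test set. A common alternative is to fit the scaler once on the training data and reuse its mean and standard deviation for the test data, so that both parts are scaled identically. A minimal sketch with made-up numbers:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical feature matrices (class column already removed)
train_features = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
test_features = np.array([[2.0, 40.0]])

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train_features)  # learns mean/std from training data
test_scaled = scaler.transform(test_features)        # applies the same mean/std, no refit

print(train_scaled)
print(test_scaled)
```

With this approach a test value is expressed relative to the training distribution, which is also what the model sees at prediction time.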

#### Balancing the classes

In [26]:
len(training[training[14] == 0])

19774

In [27]:
len(training[training[14] == 1])

6274

In [28]:
from imblearn.over_sampling import RandomOverSampler

In [29]:
overSampler = RandomOverSampler()

# Two arguments: the dataset without class, and class only
left, right = overSampler.fit_resample(training.loc[:, :13], training.loc[:, 14])

# Re-attach the resampled class column as the last column
trainingNew = np.hstack((left, right.values.reshape(-1, 1)))

training = pd.DataFrame(trainingNew)

In [30]:
training

,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14
0,-0.337820,0.091508,-0.292989,1.211947,-0.028741,-0.406190,-0.131729,-0.902756,0.391954,0.706394,-0.146971,-0.215359,-0.035806,0.291621,0.0
1,0.101012,0.091508,-1.491743,0.438329,1.522877,-0.406190,0.814507,-0.902756,0.391954,0.706394,-0.146971,-0.215359,1.584416,0.291621,0.0
2,-0.922929,0.091508,-0.997048,-0.335290,1.134972,-0.406190,-0.841407,-0.902756,0.391954,0.706394,-0.146971,-0.215359,-0.035806,0.291621,1.0
3,-0.922929,0.091508,-0.395692,-0.335290,1.134972,0.919014,1.287625,0.956921,0.391954,-1.415640,-0.146971,-0.215359,0.369249,-4.682261,0.0
4,-0.191543,0.091508,-1.044337,1.211947,-0.028741,-0.406190,1.287625,-0.902756,0.391954,0.706394,-0.146971,-0.215359,-0.035806,0.291621,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
39543,-0.996068,0.091508,-0.859017,-0.593163,0.359163,-0.406190,0.814507,2.196705,0.391954,-1.415640,0.869444,-0.215359,-0.035806,0.291621,1.0
39544,-0.118404,-1.970423,1.295255,0.438329,1.522877,-0.406190,0.814507,-0.902756,0.391954,0.706394,-0.146971,-0.215359,-0.035806,-4.682261,1.0
39545,0.027873,0.778818,-0.634314,-0.851036,0.747068,-0.406190,-0.604848,-0.902756,0.391954,0.706394,1.839322,-0.215359,1.179360,0.291621,1.0
39546,0.612983,0.091508,0.846003,1.211947,-0.028741,-0.406190,0.814507,-0.902756,0.391954,0.706394,0.869444,-0.215359,-0.035806,0.291621,1.0


In [31]:
len(training[training[14] == 0])

19774

In [32]:
len(training[training[14] == 1])

19774
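
What random oversampling does can be sketched by hand with pandas: rows of the smaller class are drawn again, with replacement, until both classes are the same size. This is an illustration on a made-up table and not necessarily how `RandomOverSampler` is implemented internally.

```python
import pandas as pd

# Hypothetical toy table: four rows of class 0, two rows of class 1
toy = pd.DataFrame({
    "feature": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6],
    "label":   [0,   0,   0,   0,   1,   1],
})

majority_size = int(toy["label"].value_counts().max())

parts = []
for label, group in toy.groupby("label"):
    parts.append(group)
    missing = majority_size - len(group)
    if missing > 0:
        # Draw extra rows from the same class, duplicates allowed
        parts.append(group.sample(n=missing, replace=True, random_state=0))

balanced = pd.concat(parts, ignore_index=True)
print(balanced["label"].value_counts())
```

Only the training set is balanced this way; the test set keeps its original class distribution.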

#### Two variables for left/right (features/labels)

In [33]:
# L = feature columns 0..13, R = class column 14
trainingL = training.loc[:, :13]
trainingR = training.loc[:, 14]

testL = test.loc[:, :13]
testR = test.loc[:, 14]

### Different algorithms

#### kNN

k-nearest neighbors

Looks at the k closest points and classifies the new point according to the classes of those previously classified neighbors

k should be odd so that a vote between the two classes cannot end in a tie
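
The idea can be sketched in a few lines of NumPy. This is a simplified illustration with made-up points and Euclidean distance, not scikit-learn's implementation; `knn_predict` is a hypothetical helper name.

```python
import numpy as np

def knn_predict(train_X, train_y, point, k=3):
    """Classify `point` by majority vote among its k nearest training points."""
    distances = np.linalg.norm(train_X - point, axis=1)  # Euclidean distance to every row
    nearest = np.argsort(distances)[:k]                  # indices of the k closest rows
    votes = np.bincount(train_y[nearest])                # count labels among the neighbors
    return int(np.argmax(votes))

# Hypothetical toy data: two small clusters with labels 0 and 1
train_X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                    [1.0, 1.0], [0.9, 1.1], [1.1, 0.9]])
train_y = np.array([0, 0, 0, 1, 1, 1])

near_first = knn_predict(train_X, train_y, np.array([0.15, 0.1]))
near_second = knn_predict(train_X, train_y, np.array([0.95, 1.0]))
print(near_first, near_second)
```

Because the vote is based on distances, kNN depends on the features being on comparable scales, which is one reason for the scaling step earlier.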

In [34]:
from sklearn.neighbors import KNeighborsClassifier

In [35]:
knn = KNeighborsClassifier(7)

In [36]:
knn_model = knn.fit(trainingL, trainingR)

In [37]:
prediction = knn_model.predict(testL)  # Result: array of predictions (0 or 1)

In [38]:
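def eval(pred):
    # Note: this name shadows Python's built-in eval()
    # Stack features, true labels and predictions side by side
    a = np.hstack((testL, testR.values.reshape(-1, 1), pred.reshape(-1, 1)))
    d = pd.DataFrame(a)
    # True where the prediction matches the true label
    d["Comp"] = (testR - pred == 0).astype(bool).values.reshape(-1, 1)

    t = len(d[d["Comp"] == True])  # number of correct predictions
    f = len(d) - t                 # number of incorrect predictions
    print(f"Korrekte Vorhersagen: {t}")    # "correct predictions"
    print(f"Inkorrekte Vorhersagen: {f}")  # "incorrect predictions"
    print(f"%: {t / len(d) * 100}")        # accuracy in percent
    
    # return d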

In [39]:
eval(prediction)

Korrekte Vorhersagen: 5071
Inkorrekte Vorhersagen: 1442
%: 77.85966528481498
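
The same percentage can be computed with scikit-learn's metrics, which additionally show which class the errors fall into. A short sketch on made-up labels (the arrays are hypothetical, not the notebook's test set):

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical true labels and predictions
y_true = np.array([0, 0, 1, 1, 0, 1])
y_pred = np.array([0, 1, 1, 0, 0, 1])

accuracy = accuracy_score(y_true, y_pred)  # share of correct predictions
matrix = confusion_matrix(y_true, y_pred)  # rows = true class, columns = predicted class

print(accuracy)
print(matrix)
```

On an imbalanced test set, the confusion matrix is often more informative than accuracy alone, because a model that mostly predicts the majority class can still reach a high accuracy.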


#### Naive Bayes

Estimates, per column, how probable the observed value is under each class, treating the columns as independent of one another

At the end, the per-column probabilities are combined for each class, and the class with the highest combined probability is the prediction
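
A hand-written sketch of the Gaussian variant makes the "combine per-column probabilities" step concrete. It assumes made-up data and normally distributed columns, and it omits the variance smoothing that `GaussianNB` applies; `gaussian_log_pdf` and `nb_predict` are hypothetical helper names.

```python
import numpy as np

def gaussian_log_pdf(x, mean, var):
    """Log density of a normal distribution, evaluated per column."""
    return -0.5 * np.log(2 * np.pi * var) - (x - mean) ** 2 / (2 * var)

# Hypothetical toy data: two features, two classes
X = np.array([[1.0, 2.0], [1.2, 1.8], [0.8, 2.2],
              [3.0, 0.5], [3.2, 0.7], [2.8, 0.3]])
y = np.array([0, 0, 0, 1, 1, 1])

def nb_predict(point):
    scores = []
    for c in np.unique(y):
        rows = X[y == c]
        prior = len(rows) / len(X)                      # how common the class is
        mean, var = rows.mean(axis=0), rows.var(axis=0)
        # Sum of logs = product of the per-column probabilities times the prior
        scores.append(np.log(prior) + gaussian_log_pdf(point, mean, var).sum())
    return int(np.argmax(scores))                       # index equals the class label here

pred_a = nb_predict(np.array([1.1, 2.0]))
pred_b = nb_predict(np.array([3.0, 0.4]))
print(pred_a, pred_b)
```

Working with logarithms avoids multiplying many small probabilities, which would otherwise underflow for datasets with many columns.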

In [40]:
from sklearn.naive_bayes import GaussianNB

In [41]:
nb = GaussianNB()
nb_model = nb.fit(trainingL, trainingR)

In [42]:
prediction = nb_model.predict(testL)

In [43]:
eval(prediction)

Korrekte Vorhersagen: 5281
Inkorrekte Vorhersagen: 1232
%: 81.08398587440504


#### Logistic regression

In [44]:
from sklearn.linear_model import LogisticRegression

In [45]:
lr = LogisticRegression()
lr_model = lr.fit(trainingL, trainingR)

In [46]:
prediction = lr_model.predict(testL)

In [47]:
eval(prediction)

Korrekte Vorhersagen: 4994
Inkorrekte Vorhersagen: 1519
%: 76.67741440196531


#### SVM

Support vector machines

In [48]:
from sklearn.svm import SVC

In [49]:
svm = SVC()

In [50]:
svm_model = svm.fit(trainingL, trainingR)

In [51]:
prediction = svm_model.predict(testL)

In [52]:
eval(prediction)

Korrekte Vorhersagen: 5152
Inkorrekte Vorhersagen: 1361
%: 79.10333179794257


### Neural network

How does a neural network work?

The model consists of layers, and each layer consists of neurons

Each neuron works as follows:
- Multiply the inputs by weights and sum them up
- Pass this result through the activation function
- Pass the final result on to the next layer

Activation function: any mathematical function (commonly ReLU, sigmoid, softmax, ...)

Once all the data has passed through the network, the process starts again from the beginning (the next epoch), using the result of the previous pass to adjust the weights
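
The forward pass described above can be sketched with NumPy. The weights here are random placeholders rather than trained values, the layer sizes are chosen only for illustration, and `dense` is a hypothetical helper name.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def dense(inputs, weights, bias, activation):
    """One layer: multiply and sum (matrix product), add bias, apply the activation."""
    return activation(inputs @ weights + bias)

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 14))                      # 4 hypothetical rows with 14 features

w1, b1 = rng.normal(size=(14, 32)), np.zeros(32)  # hidden layer with 32 neurons
w2, b2 = rng.normal(size=(32, 1)), np.zeros(1)    # output layer with 1 neuron

hidden = dense(x, w1, b1, relu)                   # result is passed to the next layer
output = dense(hidden, w2, b2, sigmoid)           # values between 0 and 1
print(output.shape)
```

Training then consists of comparing `output` with the true labels and adjusting the weights and biases, which a framework such as Keras handles automatically.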

In [53]:
import tensorflow as tf

In [54]:
# Build a model using Keras
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(14, )),
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid")
])

model.compile(loss="binary_crossentropy")

In [56]:
history = model.fit(trainingL, trainingR, verbose=1, epochs=50)

Epoch 1/50
1236/1236 ━━━━━━━━━━━━━━━━━━━━ 11s 5ms/step - loss: 0.4870
Epoch 2/50
1236/1236 ━━━━━━━━━━━━━━━━━━━━ 6s 4ms/step - loss: 0.3919
Epoch 3/50
1236/1236 ━━━━━━━━━━━━━━━━━━━━ 5s 4ms/step - loss: 0.3803
Epoch 4/50
1236/1236 ━━━━━━━━━━━━━━━━━━━━ 4s 3ms/step - loss: 0.3785
Epoch 5/50
1236/1236 ━━━━━━━━━━━━━━━━━━━━ 5s 4ms/step - loss: 0.3728
Epoch 6/50
1236/1236 ━━━━━━━━━━━━━━━━━━━━ 5s 4ms/step - loss: 0.3743
Epoch 7/50
1236/1236 ━━━━━━━━━━━━━━━━━━━━ 5s 4ms/step - loss: 0.3713
Epoch 8/50
1236/1236 ━━━━━━━━━━━━━━━━━━━━ 5s 4ms/step - loss: 0.3711
Epoch 9/50
1236/1236 ━━━━━━━━━━━━━━━━━━━━ 4s 3ms/step - loss: 0.3688
Epoch 10/50
1236/1236 ━━━━━━━━━━━━━━━━━━━━ ...

<keras.src.callbacks.history.History at 0x1cd3d1dd100>

In [57]:
model.predict(testL)

204/204 ━━━━━━━━━━━━━━━━━━━━ 0s 884us/step


array([[0.03153429],
       [0.59712535],
       [0.9224721 ],
       ...,
       [0.43884477],
       [0.69889534],
       [0.8897023 ]], dtype=float32)

In [61]:
prediction = (model.predict(testL) >= 0.5).astype(int)

204/204 ━━━━━━━━━━━━━━━━━━━━ 0s 734us/step


In [64]:
eval(prediction.reshape(-1))

Korrekte Vorhersagen: 5136
Inkorrekte Vorhersagen: 1377
%: 78.85766927683096


In [66]:
model.save("Models/Income.keras")

In [67]:
model = tf.keras.models.load_model("Models/Income.keras")