# Introduction to data science

## ML Process

### Feature selection

- Correlations, collinearity (VIF, variance inflation factor)
- Pearson (linear relationships only) / Spearman (rank-based, also captures monotonic non-linear relationships) correlations
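
To make the difference between the two coefficients concrete, here is a minimal, dependency-free sketch. The helper names `pearson`, `ranks` and `spearman` are made up for this illustration; in practice a library such as pandas or SciPy would normally be used.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation: strength of the linear relationship."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def ranks(values):
    """Return 1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of equal values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman correlation = Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


x = [1, 2, 3, 4, 5]
y = [v ** 3 for v in x]  # monotonic, but not linear

pearson_r = pearson(x, y)
spearman_r = spearman(x, y)
print(f"Pearson:  {pearson_r:.3f}")
print(f"Spearman: {spearman_r:.3f}")
```

Because `y` grows monotonically with `x`, the ranks agree perfectly and Spearman reports a perfect correlation, while Pearson stays below 1 because the relationship is not a straight line.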

**Different approaches**:
- Filter: select the best features before modeling
- Wrapper: select features by repeatedly fitting the model -> forward selection / backward elimination
- Embedded: some algorithms already select the best features during modeling

**Goal**:
- optimize performance of the model
- Use the features that correlate highly with the target and eliminate collinear ones

**Filter**:
- numeric features: drop those with low variance, unless they correlate highly with the target
- non-numeric: information gain, Gini impurity
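
The two criteria for non-numeric features can be sketched as follows. The function names and the tiny `weather` / `weekday` data set are invented for illustration only.

```python
from collections import Counter
from math import log2


def gini(labels):
    """Gini impurity: 0 for a pure group, higher for mixed groups."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def entropy(labels):
    """Shannon entropy in bits."""
    n = len(labels)
    return -sum((count / n) * log2(count / n) for count in Counter(labels).values())


def information_gain(feature, labels):
    """Entropy of the labels minus the weighted entropy after splitting by the feature."""
    n = len(labels)
    groups = {}
    for value, label in zip(feature, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(group) / n * entropy(group) for group in groups.values())
    return entropy(labels) - remainder


labels = ["yes", "yes", "no", "no"]
weather = ["sun", "sun", "rain", "rain"]  # separates the labels perfectly
weekday = ["mon", "tue", "mon", "tue"]    # says nothing about the labels

print(gini(labels))
print(information_gain(weather, labels))
print(information_gain(weekday, labels))
```

A filter would keep `weather` (high information gain) and drop `weekday` (no gain).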

Standard normal distribution: mean 0, standard deviation 1.<br>
The absolute variance depends on the scale of the variable. => Compare features after scaling (e.g. min-max scaler) or use the coefficient of variation: standard deviation / mean.<br>
A high variance by itself carries no meaning.
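
A small sketch of why the raw variance is hard to compare across features. It assumes the usual definition of the coefficient of variation (standard deviation divided by mean); the height values are made up.

```python
from statistics import mean, pstdev, pvariance


def coefficient_of_variation(values):
    """Standard deviation relative to the mean (unit-free)."""
    return pstdev(values) / mean(values)


heights_cm = [160, 170, 180, 190]
heights_m = [h / 100 for h in heights_cm]  # same data, different unit

# The variance changes by a factor of 10,000 just because of the unit ...
print(pvariance(heights_cm), pvariance(heights_m))

# ... while the coefficient of variation stays the same.
cv_cm = coefficient_of_variation(heights_cm)
cv_m = coefficient_of_variation(heights_m)
print(cv_cm, cv_m)
```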

![image.png](attachment:image.png)

For unbalanced data, undersample the majority class so that its size roughly matches the minority class.
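
Random undersampling could look roughly like this. The function name `undersample`, the fixed `seed` and the 95/5 split are assumptions made for the sketch.

```python
import random
from collections import Counter


def undersample(rows, labels, seed=0):
    """Randomly shrink every class to the size of the smallest class."""
    rng = random.Random(seed)
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)
    target = min(len(group) for group in by_class.values())

    out_rows, out_labels = [], []
    for label, group in by_class.items():
        for row in rng.sample(group, target):
            out_rows.append(row)
            out_labels.append(label)
    return out_rows, out_labels


rows = list(range(100))
labels = ["ok"] * 95 + ["fraud"] * 5  # strongly unbalanced

balanced_rows, balanced_labels = undersample(rows, labels)
print(Counter(balanced_labels))
```

All rows of the small class are kept; only the large class loses rows.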

### Data cleaning

**Missing values (MV)**:

- Deletion
- Average Imputation (reduces variance)
- Transformation-based / rule based Imputation
- Regression Imputation / Machine Learning Imputation
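
The note that average imputation reduces variance can be checked with a few lines. Missing values are represented as `None` here, which is an assumption of this sketch.

```python
from statistics import mean, pvariance


def impute_mean(values):
    """Replace every missing value (None) with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]


raw = [2.0, 4.0, None, 6.0, None, 8.0]
observed = [v for v in raw if v is not None]
imputed = impute_mean(raw)

print(imputed)
print(pvariance(observed), pvariance(imputed))
```

The imputed values sit exactly on the mean, so they add no spread and the variance of the completed column is lower than that of the observed values alone.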

**Noisy data**:

> e.g. we expect only positive values but there are negative ones.

- Use algorithms that work better with noisy data
- Smooth values
- Noise filter

**Outliers**:

> Always judge outliers in context. They normally make up < 1% of the data.

- Model the outliers separately in their own model
- Concentrate only on the outliers (e.g. fraud detection)
- Smooth values

Univariate outliers are detected feature by feature.<br>
Multivariate outliers only show up as outliers when looking at all features together (e.g. via clustering).
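
One common way to flag univariate outliers is the interquartile-range rule. The notes do not prescribe a method, so the 1.5 x IQR fences below are an assumption, and the values are made up.

```python
from statistics import quantiles


def iqr_outliers(values, factor=1.5):
    """Return the values outside [Q1 - factor * IQR, Q3 + factor * IQR]."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - factor * iqr, q3 + factor * iqr
    return [v for v in values if v < lower or v > upper]


order_values = [10, 12, 11, 13, 12, 11, 95]
outliers = iqr_outliers(order_values)
print(outliers)
```

This only looks at one feature at a time; a multivariate outlier can pass this check on every single feature.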

**Transformation**:

- Calculate new feature based on existing features
- Normalization (Min-Max) => values in [0, 1]. $\frac{x - \min(x)}{\max(x) - \min(x)}$
- Standardization (Z-Score) => mean = 0 and standard deviation = 1. $\frac{x - \text{mean}}{\text{standard deviation}}$
- Robust scaler (uses the median and interquartile range, less sensitive to outliers)
- Proportion transformation (divide each row's value by the sum over all rows)
- Binning (group numeric values into categories so the feature can be used where a categorical one is needed)
- One-hot encoding (technically the same as dummy variables): create a separate column for each category.
  - Dummy variables can be weighted, e.g. for special months: use higher values instead of 0/1.
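
The two most common scalings from the list can be sketched as follows. The function names and the `ages` values are illustrative; libraries such as scikit-learn provide ready-made scalers.

```python
from statistics import mean, pstdev


def min_max_scale(values):
    """Normalization: map the values linearly onto [0, 1]."""
    lo, hi = min(values), max(values)
    # Note: a constant feature (hi == lo) would cause a division by zero.
    return [(v - lo) / (hi - lo) for v in values]


def z_score(values):
    """Standardization: shift to mean 0 and rescale to standard deviation 1."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


ages = [20, 30, 40, 60]
normalized = min_max_scale(ages)
standardized = z_score(ages)

print(normalized)
print(standardized)
```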

![image.png](attachment:image.png)

> Careful: distinguish nominal from ordinal values

Ordinal values can be mapped to numbers (e.g. sizes: S = 2, M = 3, L = 4, ...), BUT the differences between adjacent sizes are not necessarily equal.
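
A short sketch contrasting the two encodings: one-hot columns for a nominal feature and a numeric mapping for an ordinal one. The `colors` data and the helper `one_hot` are invented for this example; the size mapping is the one from the note above.

```python
def one_hot(values):
    """Create one 0/1 column per category (columns in alphabetical order)."""
    categories = sorted(set(values))
    rows = [[1 if value == category else 0 for category in categories]
            for value in values]
    return categories, rows


# Nominal feature: no natural order, so every category gets its own column.
colors = ["red", "green", "red", "blue"]
categories, encoded_colors = one_hot(colors)
print(categories)
print(encoded_colors)

# Ordinal feature: the order is meaningful, so a numeric mapping is possible.
# The equal spacing between the numbers is an assumption, not a fact.
size_order = {"S": 2, "M": 3, "L": 4}
sizes = ["M", "S", "L"]
encoded_sizes = [size_order[size] for size in sizes]
print(encoded_sizes)
```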

**Integrate data**:

- Combine multiple datasets
- Aggregate

### Data modeling

- Build the model and tune its hyperparameters (with grid search or manually)
- Supervised learning (classification) metrics
- Unsupervised learning (clustering) metrics
  - density of cluster, separation
- Ensemble modeling
  - Bagging (train multiple models in parallel and combine their results, e.g. by majority vote) `Random Forest`
  - Boosting (train models one after another; each model corrects the errors of the previous one) `XGBoost`
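
Grid search, mentioned in the first bullet, simply tries every combination of hyperparameter values and keeps the best one. In this sketch `toy_score` is a hypothetical stand-in for "train the model and measure its validation score", and the parameter names `depth` and `learning_rate` are examples only.

```python
from itertools import product


def grid_search(param_grid, score_fn):
    """Try every parameter combination and return the best one with its score."""
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for combo in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, combo))
        score = score_fn(**params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_score(depth, learning_rate):
    # Hypothetical score function; a real one would fit and validate a model.
    return 1.0 - 0.01 * (depth - 4) ** 2 - (learning_rate - 0.1) ** 2


param_grid = {"depth": [2, 4, 8], "learning_rate": [0.01, 0.1, 1.0]}
best_params, best_score = grid_search(param_grid, toy_score)
print(best_params, best_score)
```

The number of combinations grows multiplicatively with every added hyperparameter, which is why grids are usually kept coarse.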


**Principal component analysis (PCA)**

The attributes are transformed into principal components (same count as the original attributes), ordered by the variance they explain.<br>
Select the components that are most important (explain the most variance).<br>

Use the component weights (loadings) to check which attributes are relevant. PCA can output these weights.
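
For exactly two attributes, PCA can be worked out by hand from the 2x2 covariance matrix. The sketch below is limited to that special case and uses made-up data; real projects would use a library implementation.

```python
from math import sqrt
from statistics import mean


def pca_2d(xs, ys):
    """PCA for two features: weights of component 1 and explained variance ratios."""
    n = len(xs)
    mx, my = mean(xs), mean(ys)
    a = sum((x - mx) ** 2 for x in xs) / (n - 1)                    # var(x)
    c = sum((y - my) ** 2 for y in ys) / (n - 1)                    # var(y)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)  # cov(x, y)

    # Eigenvalues of the covariance matrix [[a, b], [b, c]] in closed form.
    half_trace = (a + c) / 2
    root = sqrt(((a - c) / 2) ** 2 + b ** 2)
    eigenvalues = [half_trace + root, half_trace - root]

    # Eigenvector of the largest eigenvalue = weights of the first component.
    if b != 0:
        vx, vy = b, eigenvalues[0] - a
    else:
        vx, vy = (1.0, 0.0) if a >= c else (0.0, 1.0)
    norm = sqrt(vx ** 2 + vy ** 2)
    weights = (vx / norm, vy / norm)

    # Share of the total variance explained by each component.
    explained = [ev / (a + c) for ev in eigenvalues]
    return weights, explained


xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]  # roughly y = 2x

weights, explained = pca_2d(xs, ys)
print(weights)
print(explained)
```

Because the two attributes are almost perfectly correlated, the first component explains nearly all of the variance, and its weights show that `ys` contributes more to it than `xs`.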

<br><br>
![image.png](attachment:image.png)

<br>
Cluster Analysis (for preprocessing)<br>
Alternatives: Exploratory Factor Analysis (EFA)


### Evaluation

- Evaluate results from business perspective

### Deployment

- DevOps, MLOps
- Environment

Monitoring of:
- Data drift => Re-Learning
- Concept drift (same data, different environment)

**AutoML**: RapidMiner, similar for Microsoft, Google etc.

## Classification methods

### Introduction

- Find known points -> lines -> objects by characteristic attributes.
- Train with pictures / texts / ...

> Background subtraction

### K-Nearest Neighbors (K-NN)

- Predict the weather of a small village from the biggest cities nearby
- If it walks like a duck and quacks like a duck, then it's probably a duck

K = number of neighbors. With an even K the vote can end in a tie; break it using the distances to the neighbors.
- K too small: sensitive to noise
- K too large: neighbors from other classes dominate, which leads to wrong classifications

KNN is a distance-based algorithm. Therefore we need to scale the data first, especially if the features have very different value ranges.

KNN is a **lazy learner** (not an **eager learner**):
- No model is built; the distances are recalculated for every object that is classified.
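
A minimal K-NN classifier in plain Python, assuming the features are already on comparable scales. The tie-breaking rule (smaller summed distance wins) is one possible reading of the note on even K, and the data points are made up.

```python
from collections import Counter
from math import dist


def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Lazy learner: all distances are computed at prediction time.
    ranked = sorted(
        (dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )[:k]

    votes = Counter(label for _, label in ranked)
    total_distance = Counter()
    for distance, label in ranked:
        total_distance[label] += distance

    # Most votes wins; on a tie the label with the smaller summed distance wins.
    return min(votes, key=lambda label: (-votes[label], total_distance[label]))


train_points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.5),
                (6.0, 6.0), (6.5, 7.0), (7.0, 6.5)]
train_labels = ["duck", "duck", "duck", "goose", "goose", "goose"]

print(knn_predict(train_points, train_labels, (2.0, 2.0), k=3))
print(knn_predict(train_points, train_labels, (6.0, 7.0), k=3))
print(knn_predict(train_points, train_labels, (4.0, 4.0), k=2))  # tie case
```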


### Naive Bayes

- Based on the Bayes' theorem
- Probabilistic classifiers
- Strong independence assumptions between features (all features are independent, **no strong correlations > 0.5**)

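Only discrete values (no continuous numbers).<br>
Probabilities of 0 are bad, because a single 0 factor makes the whole product 0. => Laplace correction => add 1 to every count so that no probability is 0.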
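
The effect of the Laplace correction can be seen in a small categorical Naive Bayes sketch. The function names and the weather data are invented for illustration; the smoothing used here is the usual add-one variant.

```python
from collections import Counter, defaultdict


def train_naive_bayes(rows, labels):
    """Count classes and, per feature and class, how often each value occurs."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)  # key: (feature index, class label)
    feature_values = defaultdict(set)    # distinct values seen per feature
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            value_counts[(i, label)][value] += 1
            feature_values[i].add(value)
    return class_counts, value_counts, feature_values


def predict(model, row):
    """Return the most probable class and the unnormalized score per class."""
    class_counts, value_counts, feature_values = model
    total = sum(class_counts.values())
    scores = {}
    for label, count in class_counts.items():
        score = count / total  # prior P(class)
        for i, value in enumerate(row):
            # Laplace correction: add 1 to every count so no factor becomes 0.
            numerator = value_counts[(i, label)][value] + 1
            denominator = count + len(feature_values[i])
            score *= numerator / denominator
        scores[label] = score
    return max(scores, key=scores.get), scores


rows = [("sunny", "hot"), ("sunny", "mild"), ("rainy", "mild"),
        ("rainy", "cool"), ("overcast", "hot")]
labels = ["no", "no", "yes", "yes", "yes"]

model = train_naive_bayes(rows, labels)
# "overcast" never occurs together with "no" in the training data.
prediction, scores = predict(model, ("overcast", "mild"))
print(prediction, scores)
```

Without the correction the score for `no` would be exactly 0, regardless of how well the other features fit.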

### Decision Trees

- Easy to interpret the decision
- Each attribute is evaluated on its own
- Works for all datatypes
- Output depth of the tree

> Careful with overfitting

**Pruning**:
- Prefer the shortest tree
- Pre-pruning: limit the tree (e.g. its depth) before it is built
- Post-pruning: cut branches back after the tree was built

### Support Vector Machines (SVM)

- Classification & Regression
- Line / plane / hyperplane that separates the different classes
- The points closest to this "line" (on the margin) are the support vectors
- Choose the line with the largest margin between the points / classes, so the classes are well separated from each other.

**Non-linear boundaries**:
- Use a *kernel* to transform the data into a higher-dimensional space
- After the transformation the classes can be separated linearly again
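
The idea behind the transformation can be shown with an explicit feature map. Note that this is a simplification: a real SVM kernel computes the required inner products implicitly and never builds the higher-dimensional points. The mapping `lift` and the two rings of points are assumptions made for this sketch.

```python
def lift(point):
    """Map a 2D point (x1, x2) to 3D by adding z = x1^2 + x2^2."""
    x1, x2 = point
    return (x1, x2, x1 ** 2 + x2 ** 2)


# Two classes on concentric circles: no straight line separates them in 2D.
inner = [(0.5, 0.0), (0.0, -0.5), (-0.3, 0.4)]  # radius 0.5
outer = [(2.0, 0.0), (0.0, -2.0), (-1.2, 1.6)]  # radius 2.0

# In the lifted space the flat plane z = 1 separates the classes.
threshold = 1.0
inner_below = [lift(p)[2] < threshold for p in inner]
outer_above = [lift(p)[2] > threshold for p in outer]

print(inner_below)
print(outer_above)
```

Projected back into the original 2D space, the separating plane corresponds to a circle around the origin.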

Different kernel options:
- Linear, Polynomial, Radial (RBF, for circular boundaries)

More advanced algorithms: XGBoost, Random forest