# Exercise 2

## Contents:

- Organizational Matters
- General Questions + Exercise 1
- Automatic Data Type Recognition with Pandas
- Machine Learning Lingua
- Network architectures & Loss functions (so far)
- Teaser: Exercise 2

## Organizational Matters
- Additional Exercise Poll (please vote!)
- Exercise 2 extended until next Wednesday (9.11.)
- Questions about grading

## General Questions + Exercise 1

## Data Loading

Task: load a dataset (https://www.kaggle.com/code/divan0/multiple-linear-regression/data) into a NumPy array. Can we do this with the code below, which uses NumPy's `loadtxt` function?

In [1]:
from csv import reader
import numpy as np
import pandas as pd

In [2]:
def load_data(path_to_file, delimiter):
    dataset = np.loadtxt(path_to_file, delimiter=delimiter, skiprows=1)
    return dataset

In [3]:
try: 
    house_data = load_data('kc_house_data.csv', ',')
except ValueError as e:
    print('Error: {}'.format(e))

Error: could not convert string to float: '"7129300520"'


No: `loadtxt` tries to convert every field to a float, but our dataset contains quoted values and non-numeric columns (e.g. `date`)!

### Solution? Read in as strings

In [4]:
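def load_data(path_to_file, delimiter):
    # Option 1: let NumPy read every field as a string (skips the header row)
    # dataset = np.loadtxt(path_to_file, delimiter=delimiter, skiprows=1, dtype='str')

    # Option 2: read the rows with the csv module (keeps the header row)
    dataset = []

    with open(path_to_file, 'r', newline='') as f:
        csv_reader = reader(f, delimiter=delimiter)
        for row in csv_reader:
            dataset.append(row)
    return dataset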

### What does the data look like now?

In [5]:
print(load_data('kc_house_data.csv', ',')[:5])

[['id', 'date', 'price', 'bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot', 'floors', 'waterfront', 'view', 'condition', 'grade', 'sqft_above', 'sqft_basement', 'yr_built', 'yr_renovated', 'zipcode', 'lat', 'long', 'sqft_living15', 'sqft_lot15'], ['7129300520', '20141013T000000', '221900', '3', '1', '1180', '5650', '1', '0', '0', '3', '7', '1180', '0', '1955', '0', '98178', '47.5112', '-122.257', '1340', '5650'], ['6414100192', '20141209T000000', '538000', '3', '2.25', '2570', '7242', '2', '0', '0', '3', '7', '2170', '400', '1951', '1991', '98125', '47.721', '-122.319', '1690', '7639'], ['5631500400', '20150225T000000', '180000', '2', '1', '770', '10000', '1', '0', '0', '3', '6', '770', '0', '1933', '0', '98028', '47.7379', '-122.233', '2720', '8062'], ['2487200875', '20141209T000000', '604000', '4', '3', '1960', '5000', '1', '0', '0', '5', '7', '1050', '910', '1965', '0', '98136', '47.5208', '-122.393', '1360', '5000']]


Better, but still no automatic data type conversion is performed: every value is a string, and the header row is part of the data. Any conversion would have to be made explicit.
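To illustrate what "explicit" conversion could look like, here is a minimal sketch. The in-memory `sample` is a hypothetical four-column excerpt standing in for `kc_house_data.csv`, and the `converters` list is our own choice of types, not something the `csv` module infers.

```python
import csv
import io

# Hypothetical excerpt standing in for kc_house_data.csv (only four columns)
sample = io.StringIO(
    '"id","date","price","bedrooms"\n'
    '"7129300520","20141013T000000",221900,3\n'
    '"6414100192","20141209T000000",538000,3\n'
)

# One converter per column, chosen by hand
converters = [int, str, float, int]

rows = list(csv.reader(sample))
header, records = rows[0], rows[1:]

# Apply the matching converter to every field of every data row
typed = [
    [convert(value) for convert, value in zip(converters, row)]
    for row in records
]

print(header)
print(typed[0])
```

This works, but the column types have to be maintained by hand for all 21 columns, which is the bookkeeping a library can take over.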

### Introducing Pandas
<img src="https://st4.depositphotos.com/21607914/24198/i/450/depositphotos_241982382-stock-photo-four-baby-giant-pandas-play.jpg" alt="Drawing" style="width: 70%;"/>

1. Easy read-in of CSV files

In [6]:
house_data = pd.read_csv('kc_house_data.csv', delimiter=',')
house_data.head()

| | id | date | price | bedrooms | bathrooms | sqft_living | sqft_lot | floors | waterfront | view | ... | grade | sqft_above | sqft_basement | yr_built | yr_renovated | zipcode | lat | long | sqft_living15 | sqft_lot15 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7129300520 | 20141013T000000 | 221900.0 | 3 | 1.0 | 1180 | 5650 | 1.0 | 0 | 0 | ... | 7 | 1180 | 0 | 1955 | 0 | 98178 | 47.5112 | -122.257 | 1340 | 5650 |
| 1 | 6414100192 | 20141209T000000 | 538000.0 | 3 | 2.25 | 2570 | 7242 | 2.0 | 0 | 0 | ... | 7 | 2170 | 400 | 1951 | 1991 | 98125 | 47.721 | -122.319 | 1690 | 7639 |
| 2 | 5631500400 | 20150225T000000 | 180000.0 | 2 | 1.0 | 770 | 10000 | 1.0 | 0 | 0 | ... | 6 | 770 | 0 | 1933 | 0 | 98028 | 47.7379 | -122.233 | 2720 | 8062 |
| 3 | 2487200875 | 20141209T000000 | 604000.0 | 4 | 3.0 | 1960 | 5000 | 1.0 | 0 | 0 | ... | 7 | 1050 | 910 | 1965 | 0 | 98136 | 47.5208 | -122.393 | 1360 | 5000 |
| 4 | 1954400510 | 20150218T000000 | 510000.0 | 3 | 2.0 | 1680 | 8080 | 1.0 | 0 | 0 | ... | 8 | 1680 | 0 | 1987 | 0 | 98074 | 47.6168 | -122.045 | 1800 | 7503 |


... with automatic data type recognition.

In [7]:
house_data.dtypes

id                 int64
date              object
price            float64
bedrooms           int64
bathrooms        float64
sqft_living        int64
sqft_lot           int64
floors           float64
waterfront         int64
view               int64
condition          int64
grade              int64
sqft_above         int64
sqft_basement      int64
yr_built           int64
yr_renovated       int64
zipcode            int64
lat              float64
long             float64
sqft_living15      int64
sqft_lot15         int64
dtype: object
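Note that `date` is recognized only as `object` (strings). As a hedged sketch of one way to turn it into real timestamps, `pd.to_datetime` can be given an explicit format. The mini-frame below is hypothetical and reuses two date strings from the output above; the format string is our reading of that pattern.

```python
import pandas as pd

# Hypothetical mini-frame with two date strings taken from the output above
df = pd.DataFrame({'date': ['20141013T000000', '20141209T000000']})

# Parse "YYYYMMDDTHHMMSS" strings into datetime values
df['date'] = pd.to_datetime(df['date'], format='%Y%m%dT%H%M%S')

print(df.dtypes)
print(df['date'].dt.year.tolist(), df['date'].dt.month.tolist())
```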

1. Dropping columns

In [8]:
house_data.drop(['id','date'], axis = 1)

| | price | bedrooms | bathrooms | sqft_living | sqft_lot | floors | waterfront | view | condition | grade | sqft_above | sqft_basement | yr_built | yr_renovated | zipcode | lat | long | sqft_living15 | sqft_lot15 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 221900.0 | 3 | 1.00 | 1180 | 5650 | 1.0 | 0 | 0 | 3 | 7 | 1180 | 0 | 1955 | 0 | 98178 | 47.5112 | -122.257 | 1340 | 5650 |
| 1 | 538000.0 | 3 | 2.25 | 2570 | 7242 | 2.0 | 0 | 0 | 3 | 7 | 2170 | 400 | 1951 | 1991 | 98125 | 47.7210 | -122.319 | 1690 | 7639 |
| 2 | 180000.0 | 2 | 1.00 | 770 | 10000 | 1.0 | 0 | 0 | 3 | 6 | 770 | 0 | 1933 | 0 | 98028 | 47.7379 | -122.233 | 2720 | 8062 |
| 3 | 604000.0 | 4 | 3.00 | 1960 | 5000 | 1.0 | 0 | 0 | 5 | 7 | 1050 | 910 | 1965 | 0 | 98136 | 47.5208 | -122.393 | 1360 | 5000 |
| 4 | 510000.0 | 3 | 2.00 | 1680 | 8080 | 1.0 | 0 | 0 | 3 | 8 | 1680 | 0 | 1987 | 0 | 98074 | 47.6168 | -122.045 | 1800 | 7503 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 21608 | 360000.0 | 3 | 2.50 | 1530 | 1131 | 3.0 | 0 | 0 | 3 | 8 | 1530 | 0 | 2009 | 0 | 98103 | 47.6993 | -122.346 | 1530 | 1509 |
| 21609 | 400000.0 | 4 | 2.50 | 2310 | 5813 | 2.0 | 0 | 0 | 3 | 8 | 2310 | 0 | 2014 | 0 | 98146 | 47.5107 | -122.362 | 1830 | 7200 |
| 21610 | 402101.0 | 2 | 0.75 | 1020 | 1350 | 2.0 | 0 | 0 | 3 | 7 | 1020 | 0 | 2009 | 0 | 98144 | 47.5944 | -122.299 | 1020 | 2007 |
| 21611 | 400000.0 | 3 | 2.50 | 1600 | 2388 | 2.0 | 0 | 0 | 3 | 8 | 1600 | 0 | 2004 | 0 | 98027 | 47.5345 | -122.069 | 1410 | 1287 |
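One detail worth keeping in mind: `drop` returns a new DataFrame and leaves the original untouched unless the result is assigned. A small sketch on a hypothetical three-column frame (the name `features` is ours):

```python
import pandas as pd

# Hypothetical stand-in for house_data
df = pd.DataFrame({
    'id': [7129300520, 6414100192],
    'date': ['20141013T000000', '20141209T000000'],
    'price': [221900.0, 538000.0],
})

# The result of drop is discarded here, so df keeps all its columns
df.drop(['id', 'date'], axis=1)
print(list(df.columns))

# Assigning the result keeps the reduced frame
features = df.drop(columns=['id', 'date'])
print(list(features.columns))
```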


2. Slicing the dataset 

In [9]:
house_data.iloc[:, 2:5]

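| | price | bedrooms | bathrooms |
|---|---|---|---|
| 0 | 221900.0 | 3 | 1.00 |
| 1 | 538000.0 | 3 | 2.25 |
| 2 | 180000.0 | 2 | 1.00 |
| 3 | 604000.0 | 4 | 3.00 |
| 4 | 510000.0 | 3 | 2.00 |
| ... | ... | ... | ... |
| 21608 | 360000.0 | 3 | 2.50 |
| 21609 | 400000.0 | 4 | 2.50 |
| 21610 | 402101.0 | 2 | 0.75 |
| 21611 | 400000.0 | 3 | 2.50 |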
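`iloc` slices by position with an exclusive end, which is why `2:5` yields columns 2, 3 and 4. The label-based counterpart `loc` includes its end label. The sketch below compares both on a hypothetical six-column excerpt of the dataset:

```python
import pandas as pd

# Hypothetical excerpt with the first six columns of the dataset
df = pd.DataFrame({
    'id': [7129300520, 6414100192],
    'date': ['20141013T000000', '20141209T000000'],
    'price': [221900.0, 538000.0],
    'bedrooms': [3, 3],
    'bathrooms': [1.0, 2.25],
    'sqft_living': [1180, 2570],
})

by_position = df.iloc[:, 2:5]               # positions 2, 3, 4 (end exclusive)
by_label = df.loc[:, 'price':'bathrooms']   # labels price..bathrooms (end inclusive)

print(list(by_position.columns))
print(by_position.equals(by_label))
```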


3. Changing data types

In [10]:
house_data['lat'].head()

0    47.5112
1    47.7210
2    47.7379
3    47.5208
4    47.6168
Name: lat, dtype: float64

In [11]:
house_data['lat'].astype(int).head()

0    47
1    47
2    47
3    47
4    47
Name: lat, dtype: int64
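As the output shows, `astype(int)` simply cuts off the fractional part. If rounding to the nearest integer is wanted instead, one option is to call `round()` first. A small sketch with hypothetical coordinate values:

```python
import pandas as pd

# Two latitudes and one longitude in the style of the dataset
coords = pd.Series([47.5112, 47.7210, -122.257])

truncated = coords.astype(int)          # drops the fractional part (towards zero)
rounded = coords.round().astype(int)    # rounds to the nearest integer first

print(truncated.tolist())  # [47, 47, -122]
print(rounded.tolist())    # [48, 48, -122]
```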

4. Convert to numpy

In [12]:
house_data[['sqft_lot','sqft_above','sqft_living']].to_numpy()

array([[ 5650,  1180,  1180],
       [ 7242,  2170,  2570],
       [10000,   770,   770],
       ...,
       [ 1350,  1020,  1020],
       [ 2388,  1600,  1600],
       [ 1076,  1020,  1020]])
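A NumPy array has a single dtype, so the result of `to_numpy()` depends on which columns are selected. The sketch below uses a hypothetical three-column frame to show the common dtype pandas picks in each case:

```python
import pandas as pd

# Hypothetical stand-in with an integer, a float and a string column
df = pd.DataFrame({
    'sqft_lot': [5650, 7242],
    'price': [221900.0, 538000.0],
    'date': ['20141013T000000', '20141209T000000'],
})

ints_only = df[['sqft_lot']].to_numpy()              # integer array
mixed_numeric = df[['sqft_lot', 'price']].to_numpy() # integers are upcast to float
with_strings = df[['sqft_lot', 'date']].to_numpy()   # falls back to Python objects

print(ints_only.dtype, mixed_numeric.dtype, with_strings.dtype)
```

For numerical work it is therefore usually safer to select only numeric columns before converting.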

## Machine Learning Lingua

### Overarching concept: Function Approximation!

- Input features ($X$) are mapped into the target/output space ($Y$) by a function $G$
    - **Features** = attributes ($x_i \in X$) of a selection of wines
    - **Targets** = wine qualities ($y_i \in Y$) of a selection of wines (**ground truth**)

- **Basic idea:** learn from examples!
    - Examples = Training Data
    - **Supervised Training** = we know y, i.e. the target/ output
    - (**Unsupervised Training** = we do not know our target (e.g. clustering)) 

- $G$ can (probably) never be found exactly (e.g. because $X$ is not finite and constantly changing)
    - But maybe we can make a good guess :-)

- **Ultimate Goal:** find an approximation for $G$, i.e. $\mathcal{N}(x;\theta)$
    - $\mathcal{N}$ = network architecture, e.g. linear regression, SVM, neural network, ...


- $\theta$ = weights of our architecture 
    - what we need to optimize
    - **Goal of training:** find suitable weights such that $\mathcal{N}(x_i;\theta) \approx y_i$, i.e. the function returns (approximately) the correct output for all training examples
    - Evaluate the quality of a weight selection using a **loss function**
    - **Inference**: computing the output of your network for some given input data

- **Problem:** no intuition about generalization
    - **Solution:** Split the data ($X$ together with the targets) into train, validation and test datasets
    - **Train data:** used for adjusting $\theta$ using an optimization algorithm (e.g. Gradient Descent)
    - **Validation data:** gives intuition about generalization; not used during training (unseen data); use to tweak your training process (different architecture, hyperparameters, ...)
    - **Test data:** used for final evaluation; should only be touched once the training is completed!

- If we found weights $\theta$ where $\mathcal{N}(x_i;\theta) \approx y_i$ on **unseen** data, i.e. $\mathcal{N}(x;\theta) \approx G(x)$, we speak of **generalization**!
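The train/validation/test split described above can be sketched with plain NumPy. The data, the 70/15/15 ratio and all variable names are assumptions made for illustration; the slides do not prescribe them.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Hypothetical data: 100 examples with 3 features each
X = rng.normal(size=(100, 3))
y = rng.normal(size=100)

# Shuffle the example indices once, then cut them into three disjoint parts
indices = rng.permutation(len(X))
n_train = len(X) * 70 // 100
n_val = len(X) * 15 // 100

train_idx = indices[:n_train]
val_idx = indices[n_train:n_train + n_val]
test_idx = indices[n_train + n_val:]

X_train, y_train = X[train_idx], y[train_idx]   # used to adjust theta
X_val, y_val = X[val_idx], y[val_idx]           # used to tune the training process
X_test, y_test = X[test_idx], y[test_idx]       # touched only for the final evaluation

print(len(X_train), len(X_val), len(X_test))  # 70 15 15
```

Shuffling before splitting matters when the file is sorted in some way, for example by date.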

### Over- and Underfitting

<img src="https://media.geeksforgeeks.org/wp-content/cdn-uploads/20190523171258/overfitting_2.png" alt="Drawing" style="width: 70%;"/>

## Network architectures & Loss functions (so far)

- **Linear Regression network** (maps $X$ to a continuous value $y$, e.g. the price of a house)
    - $\mathcal{N}: y_j = \sum_{i=1}^{M}{\theta_i * x_{ji}} + \theta_0$ for $j = 1, \dots, N$
    - $\theta$ coefficients (network weights), $N$ training examples, $M$ features
    - $\theta_i$ = coefficient of input feature $i$
    - $x_{ji} \in X$ = input feature $i$ of the example (e.g. wine) at index $j$

- **Loss function:** Mean squared error (mean of the squared **residuals**)
    - $E(\theta) = \frac{1}{N}\sum_{j=1}^{N}||\text{prediction of our network}_j - \text{ground truth value}_j||^2 = \frac{1}{N}\sum_{j=1}^{N}||(\sum_{i=1}^{M}{\theta_i * x_{ji}} + \theta_0) - y_j||^2$
    - Optimal solution $\hat{\theta}$: minimizes our loss function, i.e. $\hat{\theta} = \underset{\theta}{\arg \min}\, E(\theta)$
    - **Special case here:** $\nabla E(\theta) = 0$ ensures that $\theta$ is a global minimizer (because $E$ is convex in $\theta$)
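As a hedged illustration of the special case, the sketch below fits a linear regression on synthetic data with NumPy's least-squares solver and checks that the gradient of the mean squared error vanishes at the solution. The data, `true_theta` and the helper names `mse` and `mse_gradient` are made up for this example.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Hypothetical data: N = 50 examples, M = 2 features, generated from known weights
N, M = 50, 2
X = rng.normal(size=(N, M))
true_theta = np.array([1.5, -2.0, 0.5])   # [theta_0, theta_1, theta_2]
y = true_theta[0] + X @ true_theta[1:] + 0.01 * rng.normal(size=N)

# Prepend a column of ones so that theta_0 acts as the bias term
X_b = np.hstack([np.ones((N, 1)), X])

def mse(theta):
    """Mean squared error E(theta) over all N examples."""
    residuals = X_b @ theta - y
    return np.mean(residuals ** 2)

def mse_gradient(theta):
    """Gradient of E(theta) with respect to theta."""
    return 2.0 / N * X_b.T @ (X_b @ theta - y)

# Least-squares solution, i.e. the theta at which the gradient is zero
theta_hat, *_ = np.linalg.lstsq(X_b, y, rcond=None)

print(theta_hat)
print(mse(theta_hat), np.linalg.norm(mse_gradient(theta_hat)))
```

Gradient descent would approach the same $\hat{\theta}$ iteratively; the closed-form solve is only available because the model is linear in $\theta$.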

- **Linear Classification network** (maps $X$ to a discrete value $y$, e.g. a category of wine)
    - Output of our network is now a class $c$ chosen from $C$ classes
    - $y$ is now represented as a one-hot/unit vector $e_i$ (probability vector with 100% probability of being class $i$)
    - $\theta$ now $\in \mathbb{R}^{\text{no. classes } \times (\text{no. features} + 1)}$, i.e. $\mathbb{R}^{C \times (M + 1)}$

    - $\mathcal{N}: (y_j)_c = \sum_{i=1}^{M}{\theta_{ci} * x_{ji}} + \theta_{c0}$ for $j = 1, \dots, N$ and $c = 1, \dots, C$
    - Output is now a vector of size $1 \times C$

- **Loss Function:** Cross-Entropy Loss
    - Combination of: softmax + log loss
    - Softmax: 
        - Converts the network output to a probability vector, with the largest component receiving the largest probability
        - Input and output are both vectors of size $1 \times C$
        - Softmax for class $t$: $(sm(x))_t = \frac{e^{x_t}}{\sum_{c=1}^{C}{e^{x_c}}}$
    - log-loss:
        - $\mathcal{L}(z,y) = -\sum_{c=1}^{C}{y_c \log(z_c)}$
        - if $y$ is the unit/one-hot vector $e_t$: $\mathcal{L}(z,y) = -\log(z_t)$
    - Putting both together:
        - $E(\theta) = \sum_{i=1}^{N} \mathcal{L}(sm(x_i),y_i) = -\sum_{i=1}^{N} \log\left({\frac{e^{x_{it}}}{\sum_{c=1}^{C}{e^{x_{ic}}}}}\right)$, where $x_i$ is the network output for example $i$ and $y_i = e_t$ (with $t$ the true class of example $i$)
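The combination of softmax and log-loss can be sketched in a few lines of NumPy. The function names, the example `outputs` and the use of class indices instead of one-hot vectors are assumptions for this illustration; subtracting the row maximum is a common numerical-stability trick that does not change the result.

```python
import numpy as np

def softmax(x):
    """Row-wise softmax for an array of shape (N, C)."""
    # Subtracting the row maximum leaves the result unchanged but avoids overflow
    shifted = x - np.max(x, axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=-1, keepdims=True)

def cross_entropy(outputs, targets):
    """Summed cross-entropy; targets holds the true class index t per example."""
    probs = softmax(outputs)
    # With one-hot targets only the probability of the true class contributes
    true_class_probs = probs[np.arange(len(targets)), targets]
    return -np.sum(np.log(true_class_probs))

# Hypothetical raw network outputs for N = 2 examples and C = 3 classes
outputs = np.array([[2.0, 1.0, 0.1],
                    [0.0, 0.0, 0.0]])
targets = np.array([0, 2])

print(softmax(outputs).sum(axis=1))   # each row sums to 1
print(cross_entropy(outputs, targets))
```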

- **Issue:** Extrapolating!
    - Simplistic model; can only represent linear relationships
    - How can we represent more complex, non-linear relations?

<img src="https://i.kym-cdn.com/photos/images/newsfeed/000/531/557/a88.jpg" alt="Drawing" style="width: 50%;"/>

- **Fully connected networks:**
    - Nesting multiple linear and activation layers 
        - **Linear layer:**
            - $l(x;\theta) = \sum_{i=1}^M \theta_i x_i + \theta_{0}$
        - **Activation layer:**
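            - Rectified Linear Unit (ReLU): $(\text{ReLU}(z))_j = \max(z_j, 0)$, where $z = l(x;\theta)$ is the output of the preceding linear layer
            - Introduces non-linearity; without it, nested linear layers collapse into a single linear layer, so the network is no more expressive than a linear regression network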
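Why the activation layer is needed can be checked numerically: two linear layers without ReLU in between are equivalent to one merged linear layer. The layer sizes, random weights and helper names below are assumptions for this sketch.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

def linear(x, W, b):
    """Linear layer: weighted sum of the inputs plus bias."""
    return x @ W + b

def relu(z):
    """Element-wise max(z, 0)."""
    return np.maximum(z, 0)

# Hypothetical sizes: 3 input features, 4 hidden units, 1 output
W1, b1 = rng.normal(size=(3, 4)), rng.normal(size=4)
W2, b2 = rng.normal(size=(4, 1)), rng.normal(size=1)
x = rng.normal(size=(5, 3))

with_relu = linear(relu(linear(x, W1, b1)), W2, b2)
without_relu = linear(linear(x, W1, b1), W2, b2)

# Without the activation, both layers merge into a single linear layer
W_merged = W1 @ W2
b_merged = b1 @ W2 + b2
collapsed = linear(x, W_merged, b_merged)

print(np.allclose(without_relu, collapsed))  # True
print(np.allclose(with_relu, collapsed))
```

With ReLU in between, the output generally differs from any single linear layer, which is what makes the nested network more expressive.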

## Teaser: Exercise 2