<a href="https://colab.research.google.com/github/luimui/Angular-GettingStarted/blob/master/02-linear-models.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 2. Linear models

In this exercise, we will implement the linear models that appear in the lecture slides ourselves.
We will start with the Iris data that you already know from the lecture.   

*Exercise*: Load the dataset "iris.csv" into a dataframe using the `pd.read_csv()` method from the pandas package!   

In [19]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from google.colab import drive
drive.mount('/content/drive')

# load the iris dataset into a dataframe
df = pd.read_csv('/content/drive/MyDrive/KI2WS202324/iris.csv')
df

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


| | Unnamed: 0 | SepalLength | SepalWidth | PetalLength | PetalWidth | Name |
|---|---|---|---|---|---|---|
| 0 | 0 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
| 1 | 1 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
| 2 | 2 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
| 3 | 3 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
| 4 | 4 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa |
| ... | ... | ... | ... | ... | ... | ... |
| 145 | 145 | 6.7 | 3.0 | 5.2 | 2.3 | Iris-virginica |
| 146 | 146 | 6.3 | 2.5 | 5.0 | 1.9 | Iris-virginica |
| 147 | 147 | 6.5 | 3.0 | 5.2 | 2.0 | Iris-virginica |
| 148 | 148 | 6.2 | 3.4 | 5.4 | 2.3 | Iris-virginica |
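
An `Unnamed: 0` column typically appears when a CSV file was written together with its row index. The following is a minimal sketch, assuming that is what happened here, of how `index_col=0` in `pd.read_csv()` reads that column back as the index. The inline CSV text is a hypothetical stand-in for "iris.csv", built from the first three rows shown above.

```python
import io
import pandas as pd

# Hypothetical stand-in for iris.csv: the first, unnamed column holds the saved row index
csv_text = """,SepalLength,SepalWidth,PetalLength,PetalWidth,Name
0,5.1,3.5,1.4,0.2,Iris-setosa
1,4.9,3.0,1.4,0.2,Iris-setosa
2,4.7,3.2,1.3,0.2,Iris-setosa
"""

# Default behaviour: the unnamed column becomes a regular column called "Unnamed: 0"
df_default = pd.read_csv(io.StringIO(csv_text))

# With index_col=0 the first column is used as the row index instead
df_indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(df_default.columns.tolist())
print(df_indexed.columns.tolist())
```

Whether to drop, rename, or re-use the column as the index is a matter of taste; the notebook renames it further below.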


Now let's investigate how our data actually looks.   

*Exercise*: Do some exploratory data analysis (EDA). For example, look at which columns `df` has, which data types the individual columns have, and check for missing values. What else can you say about the dataset?

In [20]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 6 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Unnamed: 0   150 non-null    int64  
 1   SepalLength  150 non-null    float64
 2   SepalWidth   150 non-null    float64
 3   PetalLength  150 non-null    float64
 4   PetalWidth   150 non-null    float64
 5   Name         150 non-null    object 
dtypes: float64(4), int64(1), object(1)
memory usage: 7.2+ KB


In [22]:
df = df.rename(columns={"Unnamed: 0":"Index"})
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 6 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Index        150 non-null    int64  
 1   SepalLength  150 non-null    float64
 2   SepalWidth   150 non-null    float64
 3   PetalLength  150 non-null    float64
 4   PetalWidth   150 non-null    float64
 5   Name         150 non-null    object 
dtypes: float64(4), int64(1), object(1)
memory usage: 7.2+ KB


In [26]:
# convert the species column from generic object to a categorical dtype
df.Name = df.Name.astype('category')
df.info()
# check whether any value in the dataframe is missing
df.isnull().values.any()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 6 columns):
 #   Column       Non-Null Count  Dtype   
---  ------       --------------  -----   
 0   Index        150 non-null    int64   
 1   SepalLength  150 non-null    float64 
 2   SepalWidth   150 non-null    float64 
 3   PetalLength  150 non-null    float64 
 4   PetalWidth   150 non-null    float64 
 5   Name         150 non-null    category
dtypes: category(1), float64(4), int64(1)
memory usage: 6.3 KB


False
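
Beyond `info()` and the missing-value check, a few more pandas calls are useful for EDA. The following is a minimal sketch on a small hand-made frame called `demo` (a hypothetical stand-in for `df`, using four rows from the table above), so the printed numbers are illustrative only.

```python
import pandas as pd

# Small illustrative frame with the same column layout as the Iris data
demo = pd.DataFrame({
    "SepalLength": [5.1, 4.9, 6.7, 6.3],
    "SepalWidth": [3.5, 3.0, 3.0, 2.5],
    "PetalLength": [1.4, 1.4, 5.2, 5.0],
    "PetalWidth": [0.2, 0.2, 2.3, 1.9],
    "Name": ["Iris-setosa", "Iris-setosa", "Iris-virginica", "Iris-virginica"],
})

# Number of missing values per column
missing_per_column = demo.isnull().sum()
# Summary statistics (count, mean, std, quartiles) of the numeric columns
summary = demo.describe()
# Number of rows per class
class_counts = demo["Name"].value_counts()
# Mean of every feature within each class
class_means = demo.groupby("Name").mean()

print(missing_per_column, summary, class_counts, class_means, sep="\n\n")
```

Applied to the full `df`, the same calls show whether the classes are balanced and which features differ most between species.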

## 2.1. Simple linear regression

First, let's look at the simple linear model that represents the relationship between PetalLength and SepalWidth. We will use the Python package statsmodels (`https://www.statsmodels.org`) to fit linear models. The specification of linear models works very similarly to the R examples in the lecture.

To better understand the `ols` formulas, it is worth taking a look at the patsy documentation: https://patsy.readthedocs.io/en/latest/formulas.html

*Exercise*: Write down the general equation for regression in this markdown cell. Afterwards, modify the formula to represent the relationship between PetalLength and SepalWidth and plug it into `smf.ols()`.   

*Solution*:   
* Regression equation = Todo   
* For our example: Todo
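
One possible way to write these down, assuming the $\beta_i$ notation used later in this notebook, an error term $\varepsilon$ (a symbol introduced here), and SepalWidth as the dependent variable:

$$
y = \beta_0 + \beta_1 x + \varepsilon
$$

$$
\text{SepalWidth} = \beta_0 + \beta_1 \cdot \text{PetalLength} + \varepsilon
$$

In patsy formula syntax the second equation corresponds to the string `'SepalWidth ~ PetalLength'`; the intercept $\beta_0$ is added automatically.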

In [32]:
import numpy as np
import statsmodels.api as sm
import statsmodels.formula.api as smf

# initialize the linear regression model; smf.ols() expects a patsy formula
# string plus a DataFrame, passing two Series instead raises a PatsyError
model = smf.ols('SepalWidth ~ PetalLength', data=df)
# fit/train the model
results = model.fit()
# print the results of the model
print(results.summary())

*Exercise*: Now fit the model and interpret the results! What do the coefficients, $R^2$, p-values mean? (see slide 17) Write everything down in this markdown cell!
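
To practise reading the output, it can help to fit a model where the true coefficients are known. The following is a minimal sketch on synthetic data (the frame `demo` and the relationship $y = 2 + 0.5x$ plus noise are assumptions made up for illustration, not the Iris data) that pulls the coefficients, $R^2$, and p-values out of the fitted results object.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data with a known relationship y = 2 + 0.5 * x + noise (illustrative only)
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=200)
demo = pd.DataFrame({"x": x, "y": 2.0 + 0.5 * x + rng.normal(scale=0.5, size=200)})

results_demo = smf.ols("y ~ x", data=demo).fit()

# Estimated intercept and slope; they should land close to 2 and 0.5
print(results_demo.params)
# Share of the variance of y that the model explains
print(results_demo.rsquared)
# p-values of the t-tests with null hypothesis "this coefficient is zero"
print(results_demo.pvalues)
```

The slope states by how much `y` changes on average when `x` increases by one unit; a small p-value means that a slope of zero is hard to reconcile with the data.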

## 2.2. Regularization

Now import the data from 'reg_data.csv' as a DataFrame and check which column types it contains!

In [None]:
df2 = 'Implement me!'
df2.info()


*Exercise*: Fit a linear model with y as the dependent variable and x1 and x2 as independent variables. Interpret the results again. What is the problem?

In [None]:
# initialize and train the model in one step -> save results
results = smf.ols('Implement me!', data=df2).fit()
# get the parameters of the trained model
results.summary()

Just a markdown cell to take notes.

*Exercise*: Have a deeper look into your data and tell me what problem we have with our data.

In [None]:
'Implement me!'
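
A common culprit in a setup like this is collinearity between the predictors; whether that applies to 'reg_data.csv' is for you to check. The following is a minimal sketch, using synthetic data in which `x2` is deliberately almost a copy of `x1` (an assumption for illustration), of how a correlation matrix makes such a problem visible.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for reg_data.csv: x2 is almost a copy of x1 (illustrative assumption)
rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
x2 = x1 + rng.normal(scale=0.01, size=100)
demo = pd.DataFrame({"x1": x1, "x2": x2, "y": 1.0 + x1 + x2 + rng.normal(size=100)})

# Pairwise Pearson correlations; a value near +1 or -1 between two predictors
# means that their individual effects can hardly be told apart
corr = demo.corr()
print(corr)
```

A scatter plot of `x1` against `x2` would show the same thing graphically.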

*Exercise*: Use ridge regression/regularization to fit your linear model.

In [None]:
# initialize the model
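model = smf.ols('Implement me!', data=df2)
# train the model and save the results
# note: L1_wt=0 would be a pure ridge (L2) penalty, L1_wt=1 a pure lasso (L1) penalty;
# L1_wt=0.1 is an elastic net that is mostly ridge
results = model.fit_regularized(L1_wt=0.1, alpha=1.9)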
# get the parameters of the trained model
results.params
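
The effect of the penalty is easiest to see next to an unpenalized fit. The following is a minimal sketch on synthetic, strongly collinear data (the frame `demo` and the value `alpha=0.1` are assumptions chosen for illustration) that compares ordinary least squares with a pure ridge fit obtained via `L1_wt=0.0`.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic collinear data: x2 is almost identical to x1 (illustrative assumption)
rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
x2 = x1 + rng.normal(scale=0.01, size=100)
demo = pd.DataFrame({"x1": x1, "x2": x2, "y": 1.0 + x1 + x2 + rng.normal(size=100)})

model_demo = smf.ols("y ~ x1 + x2", data=demo)

# Unpenalized least-squares coefficients (intercept, x1, x2)
ols_params = np.asarray(model_demo.fit().params)
# L1_wt=0.0 requests a pure ridge (L2) penalty; alpha sets its strength
ridge_params = np.asarray(model_demo.fit_regularized(alpha=0.1, L1_wt=0.0).params)

print("OLS:  ", ols_params)
print("ridge:", ridge_params)
```

The ridge coefficients are shrunk towards zero compared with the least-squares ones, which stabilizes the estimates for `x1` and `x2` at the price of some bias.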

## 2.3. Logistic regression

In machine learning, we are often less interested in the parameters of a model than in how well it predicts. For such purposes, logistic regression from the `sklearn` package is better suited: tasks that are quite tedious in statsmodels, such as encoding the class labels or handling more than two classes, are taken care of directly. Have a look at the following documentation: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html

*Exercise*: Divide the Iris dataset that you already imported in `df` into two sets that contain the features $X$ and the target values $y$, respectively. Then separate each of these into two subsets: a training set that contains 80% of the data and a test set that contains the remaining 20%. Use the `train_test_split()` function of `sklearn` for this purpose.

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# split the iris data from 'df' into features (X) and target (y) value
X = "Implement me!"
y = "Implement me!"
# split X and y into train and test sets
X_train, X_test, y_train, y_test = "Implement me!"
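
The following is a minimal sketch of such a split, assuming the copy of the Iris data that ships with scikit-learn as a stand-in for `df` (its column names differ from those in "iris.csv"). The names `X_demo`, `y_demo`, `X_tr`, `X_te`, `y_tr`, `y_te` and the choices `random_state=0` and `stratify=y_demo` are illustrative, not prescribed by the exercise.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Stand-in for `df`: the Iris data bundled with scikit-learn
iris = load_iris(as_frame=True)
X_demo = iris.data      # DataFrame with the four feature columns
y_demo = iris.target    # Series with the class labels encoded as 0, 1, 2

# test_size=0.2 keeps 20% of the rows for testing;
# stratify keeps the class proportions equal in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)
print(X_tr.shape, X_te.shape)
```

Fixing `random_state` makes the split reproducible, which is convenient when comparing models.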

*Exercise*: Train a logit model with the training data and generate predictions for the test data. Use the `LogisticRegression` class and its respective methods for this purpose. Afterwards look at the results by using the methods provided by `sklearn.metrics`. How many test examples have been classified correctly by your trained model?

In [None]:
from sklearn.metrics import classification_report, confusion_matrix

# Train the classifier
clf = "Implement me!"

# compute predictions with your trained model and save these predictions in yhat
yhat = "Implement me!"

# compute and print some classification metrics
print("Implement me!")
print("Implement me!")
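
The following is a minimal sketch of the whole train, predict, and evaluate cycle, again assuming the Iris data bundled with scikit-learn as a stand-in for `df`. The `_demo` names and `max_iter=1000` are illustrative choices, not part of the exercise.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

X_demo, y_demo = load_iris(return_X_y=True, as_frame=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)

# max_iter is raised so that the solver has enough iterations on unscaled features
clf_demo = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
yhat_demo = clf_demo.predict(X_te)

# Rows are true classes, columns are predicted classes;
# the diagonal counts the correctly classified test examples
cm = confusion_matrix(y_te, yhat_demo)
print(cm)
print(classification_report(y_te, yhat_demo))
print("correctly classified:", cm.trace(), "of", cm.sum())
```

The trace of the confusion matrix divided by its sum is the accuracy that `classification_report` also lists.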

*Additional exercise*: There is a way to also get the coefficients of a `LogisticRegression` model in sklearn. Try to figure out how to get these parameters. How many $\beta_i$  do we have in total?

In [None]:
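# one parameter name per feature weight, plus one for the intercept
parameter = ['theta_' + str(i) for i in range(X_train.shape[1] + 1)]
columns = ['intercept:x_0=1'] + list(X.columns.values)
# collect [intercept, weight_1, ..., weight_n] for every class
sk_theta = []
for i in range(len(clf.intercept_)):
    theta = [clf.intercept_[i]]
    theta.extend(clf.coef_[i])
    sk_theta.append(theta)
# one column of parameters per class (the Iris data has three classes)
parameter_df = pd.DataFrame({'Parameter': parameter, 'Columns': columns, 'c1': sk_theta[0], 'c2': sk_theta[1], 'c3': sk_theta[2]})
parameter_df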
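
To count the parameters, it is enough to look at the shapes of `coef_` and `intercept_`. The following is a minimal sketch, assuming the Iris data bundled with scikit-learn and a model fitted on all rows (the names `clf_demo` and `n_params` are illustrative).

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X_demo, y_demo = load_iris(return_X_y=True)
clf_demo = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)

# coef_ holds one row of feature weights per class,
# intercept_ holds one intercept per class
print(clf_demo.coef_.shape)       # (n_classes, n_features)
print(clf_demo.intercept_.shape)  # (n_classes,)

# total number of fitted parameters
n_params = clf_demo.coef_.size + clf_demo.intercept_.size
print(n_params)
```

With three classes and four features this gives three weight vectors of length four plus three intercepts.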