# Exercise 0 - Ordinary Least Squares (25 Points)

The first exercise is about linear models.
The given data set contains prices and other attributes of approximately 54,000 diamonds. You should fit a linear model to predict the price of a diamond, given its attributes.

This exercise is meant to get you started with the tool stack. We use the following Python packages:
- pandas (https://pandas.pydata.org/)
- numpy (http://www.numpy.org/)
- matplotlib (https://www.matplotlib.org) and seaborn (https://seaborn.pydata.org)
- sklearn (http://scikit-learn.org/)

If you are unfamiliar with them, follow the documentation links. In the (unlikely) event of a persistent problem, do not hesitate to contact the course instructors.

### Submission

- Deadline of submission:
        x.y.z
- Mail your solution notebook or a link to your gitlab repository to:
        paul.kahlmeyer@uni-jena.de

### Diamonds Dataset 

- price: price in US dollars (326.0 - 18823.0)
- carat: weight of the diamond (0.2 - 5.01)
- cut: quality of the cut (Fair, Good, Very Good, Premium, Ideal)
- color: diamond colour, from J (worst) to D (best)
- clarity: a measurement of how clear the diamond is (I1 (worst), SI2, SI1, VS2, VS1, VVS2, VVS1, IF (best))
- x: length in mm (0--10.74)
- y: width in mm (0--58.9)
- z: depth in mm (0--31.8)
- depth: total depth percentage = z / mean(x, y) = 2 * z / (x + y) (43--79)
- table: width of top of diamond relative to widest point (43--95)


### Task 1 (1 Point)
Import the data from the file and examine it.

Determine the following:

* The number of data points. (*Hint:* check out the dataframe `.shape` attribute.)
* The column names. (*Hint:* check out the dataframe `.columns` attribute.)
* The data types for each column. (*Hint:* check out the dataframe `.dtypes` attribute.)
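
As a minimal sketch of what these three attributes return, the snippet below uses a small hand-made dataframe. The name `toy` and its values are made up for illustration and are not the diamonds file.

```python
import pandas as pd

# Hypothetical toy data for illustration only (not the diamonds file)
toy = pd.DataFrame({
    "carat": [0.23, 0.21, 0.31],
    "cut": ["Ideal", "Premium", "Good"],
    "price": [326, 326, 335],
})

n_rows, n_cols = toy.shape        # tuple (number of rows, number of columns)
column_names = list(toy.columns)  # column labels in order
column_types = toy.dtypes         # Series mapping column name -> dtype

print(n_rows, n_cols)
print(column_names)
print(column_types)
```

Note that `.shape`, `.columns` and `.dtypes` are attributes, not methods, so they are accessed without parentheses.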

In [None]:
import pandas as pd
import numpy as np

# TODO: load data

# TODO: determine number of datapoints

# TODO: determine column names

# TODO: determine datatypes of columns

### Task 2 (2 Points)

Since there are discrete variables and we do not yet know how to include them in our regression model, remove them. Additionally, verify that there are no missing values in the dataset.

*Hint:* There are multiple ways to [check](https://towardsdatascience.com/how-to-check-for-missing-values-in-pandas-d2749e45a345) for missing values.
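
One possible approach, sketched on a hypothetical toy dataframe (the name `toy` and its values are assumptions, and one missing value is planted on purpose), is to keep only the numeric columns with `select_dtypes` and to count missing entries with `isna`:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one deliberately missing value
toy = pd.DataFrame({
    "carat": [0.23, np.nan, 0.31],
    "cut": ["Ideal", "Premium", "Good"],
    "price": [326.0, 326.0, 335.0],
})

# Keep only numeric columns; this drops the string-valued column "cut"
numeric = toy.select_dtypes(include="number")

# Number of missing values per column
missing_per_column = numeric.isna().sum()

# Single boolean: is there any missing value anywhere?
has_missing = bool(numeric.isna().any().any())

print(missing_per_column)
print(has_missing)
```

Dropping the columns by name with `DataFrame.drop(columns=[...])` would work equally well; selecting by dtype merely avoids listing them by hand.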

In [None]:
# TODO: remove discrete variables

# TODO: check for missing values

Visualizing the correlations in your data often helps to build intuition and to get a feeling for the structure of the dataset.

Here we want to use the [Pearson correlation coefficient](https://en.wikipedia.org/wiki/Pearson_correlation_coefficient) as a measure for correlation between two variables.

Let $x$ and $y$ be two variables of our dataset (e.g. `carat` and `price`). The empirical Pearson correlation coefficient between $x$ and $y$ is defined as 

\begin{align}
r_{xy} = \cfrac{\sum_{i=1}^n(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^n(x_i-\bar{x})^2}\sqrt{\sum_{i=1}^n(y_i-\bar{y})^2}}\,,
\end{align}
where $\bar{x}$ and $\bar{y}$ are the respective empirical means.
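
To see how the formula behaves, here is a small worked sketch on made-up vectors. The helper name `pearson_r` is hypothetical and its signature deliberately differs from the function requested in Task 3. A result of $1$ or $-1$ indicates a perfect linear relation, and `np.corrcoef` can serve as a cross-check.

```python
import numpy as np

def pearson_r(x, y):
    """Empirical Pearson correlation, a direct translation of the formula."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()  # deviations from the empirical mean of x
    dy = y - y.mean()  # deviations from the empirical mean of y
    return np.sum(dx * dy) / (np.sqrt(np.sum(dx**2)) * np.sqrt(np.sum(dy**2)))

x = [1.0, 2.0, 3.0, 4.0]
y_mixed = [1.0, 3.0, 2.0, 5.0]

r_up = pearson_r(x, [2.0, 4.0, 6.0, 8.0])    # perfectly increasing linear relation
r_down = pearson_r(x, [8.0, 6.0, 4.0, 2.0])  # perfectly decreasing linear relation
r_mixed = pearson_r(x, y_mixed)              # somewhere in between

# Cross-check against NumPy's built-in correlation matrix
r_numpy = np.corrcoef(x, y_mixed)[0, 1]

print(r_up, r_down, r_mixed, r_numpy)
```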

### Task 3 (5 Points)

Implement a function `pearson_corr` that takes two vectors $[x_i]_{i=1,\dots,n}$ and $[y_i]_{i=1,\dots,n}$ as well as their means $\bar{x}$ and $\bar{y}$, and computes $r_{xy}$. Use this function to calculate the pairwise correlation matrix for our dataset. Visualize this correlation matrix and label its rows and columns.

In [None]:
import matplotlib.pyplot as plt

def pearson_corr(x, y, x_bar, y_bar):
    # TODO: calculate correlation coefficient
    pass

# TODO: calculate pairwise correlation matrix

# TODO: visualize matrix and label the rows/columns

### Task 4 (1 Point)
Make a scatter plot of `carat` vs `price` using Matplotlib. Label the axes and give the plot a title.

In [None]:
# TODO: display data in scatter plot + label axes + set title

### Task 5 (5 Points)
Fit a linear model using maximum likelihood estimation (cf. the lecture). Here we want to predict the `price` of a diamond from the variable `carat` by implementing the OLS method yourself.

- Build the design matrix $\mathbb{X}$ and the vector of the dependent variable $Y$.
- Estimate the parameter vector $\theta$.
- Make a scatter plot of `carat` vs `price` and include the regression line.
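
The first two steps can be sketched on noise-free toy data. This assumes the usual setting in which the maximum likelihood estimate under i.i.d. Gaussian noise coincides with the least squares solution of the normal equations $\mathbb{X}^T\mathbb{X}\,\theta = \mathbb{X}^T Y$; the lecture's notation may differ. The data and variable names below are made up for illustration.

```python
import numpy as np

# Hypothetical toy data generated from y = 1 + 2 * x (no noise)
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 1.0 + 2.0 * x

# Design matrix: a column of ones for the intercept, then the feature
X = np.column_stack([np.ones_like(x), x])

# Solve the normal equations (X^T X) theta = X^T y
theta = np.linalg.solve(X.T @ X, X.T @ y)

# Points on the regression line, e.g. for plotting on top of a scatter plot
y_hat = X @ theta

print(theta)  # intercept first, then slope
```

Solving the linear system with `np.linalg.solve` avoids forming the inverse of $\mathbb{X}^T\mathbb{X}$ explicitly, which is generally the numerically safer choice.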

In [None]:
# TODO: build X, y

# TODO: estimate theta

# TODO: plot data + regression line

### Task 6 (2 Points)

You can find an implementation of this method in the Python package scikit-learn. Use it and compare its result with your own estimate.
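
As a rough sketch of the scikit-learn interface, again on made-up noise-free data: `LinearRegression` fits the intercept itself by default, so the feature array needs no column of ones, but it must be two-dimensional.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical toy data generated from y = 1 + 2 * x (no noise)
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 1.0 + 2.0 * x

# scikit-learn expects features of shape (n_samples, n_features)
model = LinearRegression()  # fit_intercept=True by default
model.fit(x.reshape(-1, 1), y)

# Collect intercept and slope in the same order as a hand-built theta
theta_sklearn = np.array([model.intercept_, model.coef_[0]])

print(theta_sklearn)
```

When comparing against a hand-computed estimate, `np.allclose` is a convenient way to allow for floating-point differences.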

In [None]:
# TODO: use scikit learn to estimate theta

# TODO: compare results

### Task 7 (5 Points)

Build a model to predict the `price` from the variables `carat`, `depth`, `table`, `x`, `y`, `z`.

- Build the design matrix.
- Estimate the parameter vector $\theta$.
- Compare your results with the result that the `LinearRegression` class from scikit-learn gives you.
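
With several features the design matrix simply gains more columns. The sketch below uses a hypothetical dataframe with made-up columns `a` and `b`, and estimates $\theta$ with `np.linalg.lstsq`. That is a swapped-in alternative to solving the normal equations directly, and it tends to be more robust when columns are strongly correlated.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: target = 0.5 + 2 * a - 1 * b (no noise)
toy = pd.DataFrame({
    "a": [0.0, 1.0, 2.0, 3.0, 4.0],
    "b": [1.0, 0.0, 2.0, 1.0, 3.0],
})
toy["target"] = 0.5 + 2.0 * toy["a"] - 1.0 * toy["b"]

features = ["a", "b"]

# Design matrix: intercept column followed by one column per feature
X = np.column_stack([np.ones(len(toy)), toy[features].to_numpy()])
y = toy["target"].to_numpy()

# Least squares solution; rank reports the numerical rank of X
theta, residuals, rank, singular_values = np.linalg.lstsq(X, y, rcond=None)

print(theta)  # intercept, coefficient of a, coefficient of b
print(rank)
```

A rank smaller than the number of columns would indicate linearly dependent features, in which case $\theta$ is not uniquely determined.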

In [None]:
# TODO: build X, y

# TODO: estimate theta

# TODO: estimate theta using scikit-learn + compare

### Question 8 (4 Points)

The [coefficient of determination](https://en.wikipedia.org/wiki/Coefficient_of_determination) (a.k.a. $R^2$) is the proportion of the variation in the dependent variable $Y$ that is explained by the independent variables $\mathbb{X}$. It is commonly used to measure the goodness-of-fit of a linear model.

- Calculate the $R^2$ for your model.
- Is $R^2$ a good measure for the goodness-of-fit?
- What are its advantages?
- What are its limits?
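
For reference, a common definition is $R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}$, where $\hat{y}_i$ are the model predictions. The sketch below evaluates it on made-up numbers; the helper name `r_squared` is hypothetical, and scikit-learn offers `sklearn.metrics.r2_score` for the same purpose.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """R^2 = 1 - (residual sum of squares) / (total sum of squares)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation
    return 1.0 - ss_res / ss_tot

y_true = np.array([1.0, 2.0, 3.0, 4.0])

r2_perfect = r_squared(y_true, y_true)                            # exact predictions
r2_mean = r_squared(y_true, np.full(4, y_true.mean()))            # always predict the mean
r2_partial = r_squared(y_true, np.array([1.5, 1.5, 3.5, 3.5]))    # coarse predictions

print(r2_perfect, r2_mean, r2_partial)
```

Predicting the mean for every point gives $R^2 = 0$, which is why $R^2$ is often read as the improvement over that trivial baseline.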

In [None]:
# TODO: calculate R^2

# TODO: strengths and weaknesses of R^2

$R^2$ is good for:
- assessing the quality of fit of a linear regressor;
- comparing different linear regressors.

$R^2$ does not indicate whether:
- the independent variables are a cause of the changes in the dependent variable;
- omitted-variable bias exists;
- the correct regression was used;
- the most appropriate set of independent variables has been chosen;
- there is collinearity present in the data on the explanatory variables;
- the model might be improved by using transformed versions of the existing set of independent variables;
- there are enough data points to make a solid conclusion.