# Exercise 0 - Ordinary Least Squares

The first exercise is about linear models.
The given data set contains prices and other attributes of approximately 54,000 diamonds. You should fit a linear model to predict the price of a diamond, given its attributes.

This exercise is meant to get you started with the tool stack. We use the following Python packages:
- pandas (https://pandas.pydata.org/)
- numpy (http://www.numpy.org/)
- matplotlib (https://www.matplotlib.org) and seaborn (https://seaborn.pydata.org)
- sklearn (http://scikit-learn.org/)

If you are unfamiliar with them, follow the documentation links. In the (unlikely) event of a persistent problem, do not hesitate to contact the course instructors.

### Diamonds Dataset 

- price: price in US dollars (\$326.0--\$18823.0)
- carat: weight of the diamond (0.2--5.01)
- cut: quality of the cut (Fair, Good, Very Good, Premium, Ideal)
- color: diamond colour, from J (worst) to D (best)
- clarity: a measurement of how clear the diamond is (I1 (worst), SI2, SI1, VS2, VS1, VVS2, VVS1, IF (best))
- x: length in mm (0--10.74)
- y: width in mm (0--58.9)
- z: depth in mm (0--31.8)
- depth: total depth percentage = z / mean(x, y) = 2 * z / (x + y) (43--79)
- table: width of top of diamond relative to widest point (43--95)
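
The `depth` definition above can be checked by hand. The following is a minimal sketch, assuming the ratio is reported in percent (the listed range 43--79 suggests a factor of 100); the function name `depth_percentage` and the example stone are hypothetical. Note that the `x`, `y`, and `z` ranges start at 0, so rows with a zero dimension would give an undefined or meaningless ratio.

```python
import numpy as np

def depth_percentage(x, y, z):
    """Total depth percentage: 100 * z / mean(x, y) = 100 * 2 * z / (x + y)."""
    x, y, z = (np.asarray(v, dtype=float) for v in (x, y, z))
    return 100.0 * 2.0 * z / (x + y)

# Hypothetical stone: 4.0 mm long, 4.0 mm wide, 2.4 mm deep
example_depth = depth_percentage(4.0, 4.0, 2.4)
print(example_depth)
```

The same function accepts whole columns, e.g. `depth_percentage(data.x, data.y, data.z)`, which allows a comparison against the stored `depth` column.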


### Connectivity INFO

In [None]:
%connect_info

### Question 1
Import the data from the file and examine it.

Determine the following:

* The number of data points. (*Hint:* check out the dataframe `.shape` attribute.)
* The column names. (*Hint:* check out the dataframe `.columns` attribute.)
* The data types for each column. (*Hint:* check out the dataframe `.dtypes` attribute.)

In [None]:
import pandas as pd
import numpy as np


# load data
data = pd.read_csv("diamonds.csv")

# display the first few rows
data.head()

In [None]:
# Number of rows and columns
print(data.shape)

# Column names
print(data.columns.tolist())

# Data types
print(data.dtypes)

### Question 2

The dataset contains categorical variables (`cut`, `color`, `clarity`). Since we have not yet covered how to include them in a regression model, remove them. Additionally, verify that the dataset contains no missing values.
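
As an alternative to naming the columns by hand, the numeric columns can be selected by data type. This is only a sketch on a tiny made-up frame (`toy` and its values are hypothetical, not rows of the diamonds data); it also shows a per-column count of missing values.

```python
import pandas as pd

# Tiny hypothetical frame that mimics the structure of the diamonds data
toy = pd.DataFrame({
    "carat": [0.5, 1.0, 1.5],
    "cut": ["Ideal", "Premium", "Good"],
    "price": [1000, 5000, 9000],
})

# Keep only numeric columns instead of listing the categorical ones
numeric = toy.select_dtypes(include="number")

# Count missing values per column; all zeros means the data is complete
missing_per_column = numeric.isna().sum()
print(numeric.columns.tolist())
print(missing_per_column)
```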

In [None]:
# drop columns
data = data.drop(columns=["cut", "color", "clarity"])

# check if there are missing values
print(data.isnull().values.any())
data.head()

### Question 3

Visualizing your data often helps to build intuition about the relationships between the variables.

Compute the pairwise correlation matrix and visualize it.
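
For reference, the pandas method `DataFrame.corr()` computes the Pearson correlation coefficient by default. For two columns $a$ and $b$ with $n$ entries and means $\bar{a}$ and $\bar{b}$, it is

$$
r_{ab} = \frac{\sum_{i=1}^{n} (a_i - \bar{a})(b_i - \bar{b})}{\sqrt{\sum_{i=1}^{n} (a_i - \bar{a})^2}\,\sqrt{\sum_{i=1}^{n} (b_i - \bar{b})^2}}, \qquad -1 \le r_{ab} \le 1.
$$

It measures linear association only, so a value near 0 does not rule out a nonlinear relationship.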

In [None]:
corr = data.corr()
corr

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
plt.figure(figsize=(20, 10))
sns.heatmap(corr, vmax=1, annot=True, square=True)
plt.title("Variable Correlation Heatmap")

### Question 4
Make a scatter plot of `carat` vs `price` using Matplotlib. Label the axes and give the plot a title.

In [None]:
# A simple scatter plot with Matplotlib
ax = plt.axes()

ax.scatter(data.carat, data.price)

# Label the axes
ax.set(xlabel='carat',
       ylabel='price',
       title='Diamond price vs. carat');

### Question 5
Fit a linear model using maximum likelihood estimation (cf. the lecture). Here we want to predict the `price` of a diamond from the variable `carat`, implementing the OLS method yourself.

- Build the design matrix $\mathbb{X}$ and the vector of the dependent variable $Y$.
- Estimate the parameter vector $\theta$
- Make a scatter plot of `carat` vs `price` and include the regression line
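
The estimate can be written in closed form. The following is a sketch of the standard derivation, assuming i.i.d. Gaussian noise (the lecture may use different notation) and the convention of the code cells in this notebook, where features are stored in rows: $\mathbb{X} \in \mathbb{R}^{2 \times n}$ has a first row of ones and a second row of `carat` values, and $Y \in \mathbb{R}^{n}$.

$$
Y_i = \theta^\top x_i + \varepsilon_i, \qquad \varepsilon_i \sim \mathcal{N}(0, \sigma^2)
$$

$$
\log L(\theta) = -\frac{n}{2} \log(2 \pi \sigma^2) - \frac{1}{2 \sigma^2} \sum_{i=1}^{n} \left(Y_i - \theta^\top x_i\right)^2
$$

Maximizing $\log L$ over $\theta$ is therefore the same as minimizing $\lVert Y - \mathbb{X}^\top \theta \rVert^2$. Setting the gradient to zero gives the normal equations and, if $\mathbb{X} \mathbb{X}^\top$ is invertible, the estimate:

$$
\mathbb{X} \mathbb{X}^\top \hat{\theta} = \mathbb{X} Y \quad \Longrightarrow \quad \hat{\theta} = \left(\mathbb{X} \mathbb{X}^\top\right)^{-1} \mathbb{X} Y
$$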

In [None]:
Y = data.price.values
X = data.carat.values
print("Y = ", Y)
print("X = ", X)

In [None]:
X.shape

In [None]:
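# stack a row of ones (intercept) on top of the carat row; X has shape (2, n)
# np.vstack replaces np.row_stack, which is deprecated in NumPy 2.0
X = np.vstack((np.ones(X.shape[0]), X))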

In [None]:
X.shape

In [None]:
# estimate the parameters
theta = np.linalg.solve((X.dot(X.T)), X.dot(Y))
print(theta)

In [None]:
# predictions
Y_pred = theta.dot(X)

In [None]:
# plot data and model
plt.scatter(data.carat, data.price)
plt.plot(X[1], Y_pred, color="red")  # X[1] is the carat row

### Question 6

You can find an implementation of this method in the Python library scikit-learn. Use it and compare your result.
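
One way to make the comparison explicit is to put both parameter vectors side by side and check them with `np.allclose`. The sketch below uses synthetic data (an assumption for illustration, not the diamonds set); the names `theta_manual` and `theta_sklearn` are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data (an assumption for illustration, not the diamonds set)
rng = np.random.default_rng(0)
x = rng.uniform(0.2, 5.0, size=200)
y = -2000.0 + 7500.0 * x + rng.normal(0.0, 500.0, size=200)

# Closed-form OLS with features stored in rows, as in Question 5
X = np.vstack((np.ones_like(x), x))
theta_manual = np.linalg.solve(X @ X.T, X @ y)

# scikit-learn expects samples in rows and fits the intercept itself
model = LinearRegression().fit(x.reshape(-1, 1), y)
theta_sklearn = np.concatenate(([model.intercept_], model.coef_))

print(theta_manual)
print(theta_sklearn)
print(np.allclose(theta_manual, theta_sklearn))
```

The fitted values are exposed as `intercept_` and `coef_`, which is more targeted than inspecting every attribute with `vars`.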

In [None]:
from sklearn.linear_model import LinearRegression

In [None]:
LR = LinearRegression()
LR = LR.fit(data[["carat"]], data[["price"]])

In [None]:
vars(LR)

In [None]:
theta

### Question 7

Build a model to predict the `price` from the variables `carat`, `depth`, `table`, `x`, `y`, `z`.

- Build the design matrix
- Estimate the parameter vector $\theta$
- Compare your results with the result that the `LinearRegression` module from scikit-learn gives you.
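
With several features, some of them may be close to linearly dependent (here, `depth` is defined from `x`, `y`, and `z`), which can make the normal equations numerically delicate. A possible alternative is `np.linalg.lstsq`, which solves the least-squares problem without explicitly inverting a matrix. The sketch below uses synthetic data and stores samples in rows, i.e. the transpose of the convention used in Question 5; all names and values are hypothetical.

```python
import numpy as np

# Synthetic stand-in for six features (assumption, not the diamonds data)
rng = np.random.default_rng(1)
n_samples, n_features = 500, 6
features = rng.normal(size=(n_samples, n_features))
true_theta = np.array([3.0, 1.0, -2.0, 0.5, 0.0, 4.0, -1.0])  # intercept first

# Design matrix with samples in rows and a leading column of ones
A = np.column_stack((np.ones(n_samples), features))
target = A @ true_theta + rng.normal(scale=0.1, size=n_samples)

# Least-squares solve without forming or inverting A.T @ A explicitly
theta_lstsq, residuals, rank, singular_values = np.linalg.lstsq(
    A, target, rcond=None
)

# The normal equations give the same answer when A has full column rank
theta_normal = np.linalg.solve(A.T @ A, A.T @ target)
print(rank)
print(np.allclose(theta_lstsq, theta_normal))
```

The returned `rank` and `singular_values` give a quick indication of whether the design matrix is rank deficient or badly conditioned.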

In [None]:
X = data[["carat", "depth", "table", "x", "y", "z"]].values
Y = data["price"].values

In [None]:
# features in rows plus a leading row of ones for the intercept; shape (7, n)
X = np.vstack((np.ones(X.shape[0]), X.T))
X

In [None]:
# solve the normal equations directly instead of forming the explicit inverse
theta = np.linalg.solve(X.dot(X.T), X.dot(Y))

In [None]:
theta

In [None]:
Y_pred = theta.dot(X)
print(Y_pred)
print(Y)

In [None]:
LR = LinearRegression()
LR = LR.fit(data[["carat", "depth", "table", "x", "y", "z"]], data[["price"]])

In [None]:
vars(LR)

In [None]:
theta

In [None]:
data[["price"]]

### Question 8

The [coefficient of determination](https://en.wikipedia.org/wiki/Coefficient_of_determination) (a.k.a. $R^2$) is the proportion of the variation in the dependent variable $Y$ that is explained by the independent variables $\mathbb{X}$. It is commonly used to measure the goodness of fit of a linear model.

- Calculate the $R^2$ for your model.
- Is $R^2$ a good measure for the goodness-of-fit?
- What are its advantages?
- What are its limits?
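
For the calculation, the usual definition is $R^2 = 1 - SS_{res} / SS_{tot}$, with $SS_{res} = \sum_i (Y_i - \hat{Y}_i)^2$ and $SS_{tot} = \sum_i (Y_i - \bar{Y})^2$. The following is a minimal sketch of that formula, checked against `sklearn.metrics.r2_score`; the helper name `r_squared` and the example arrays are hypothetical.

```python
import numpy as np
from sklearn.metrics import r2_score

def r_squared(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
    return 1.0 - ss_res / ss_tot

# Hypothetical targets and predictions, chosen only to exercise the function
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.1, 1.9, 3.2, 3.8])

r2_manual = r_squared(y_true, y_pred)
r2_sklearn = r2_score(y_true, y_pred)
print(r2_manual, r2_sklearn)
```

For the models in this notebook, the same function can be called as `r_squared(Y, Y_pred)`.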