In [1]:
%matplotlib inline

# Linear Regression task - Diamond price training

### Imports

In [2]:
print(__doc__)

import csv
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn import datasets, linear_model
from sklearn.metrics import mean_squared_error

Automatically created module for IPython interactive environment


### CSV reading

In [3]:
# columns 1, 5, 6, 8, 9, 10 have numerical variables
# column 7 contains the target: diamond price
diamonds_data = np.genfromtxt('diamonds.csv', delimiter=",", skip_header=1,
                       usecols=(1, 5, 6, 8, 9, 10, 7))

print(diamonds_data.shape)
print(diamonds_data[:, np.newaxis, 6]) # target

(53940, 7)
[[  326.]
 [  326.]
 [  327.]
 ..., 
 [ 2757.]
 [ 2757.]
 [ 2757.]]


### Feature and target selection

In [4]:
diamonds_features_df = pd.DataFrame(diamonds_data[:, 0:6])
diamonds_target_df = pd.DataFrame(diamonds_data[:, np.newaxis, 6])

#print(diamonds_features_df)
#print(diamonds_target_df)

## Linear Regression implementation

### Hypothesis function

The hypothesis function, which predicts the price from the features, has the general form

$$
h_\theta\left(x\right) = \theta_0 x_0 + \theta_1 x_1 +
    \theta_2 x_2 + \cdots + \theta_n x_n 
$$

This can also be interpreted as the dot product between the parameter array
and the feature array

$$
x =
\begin{bmatrix}
x_0 && x_1 && x_2 && \cdots && x_n
\end{bmatrix},~~
\theta =
\begin{bmatrix}
\theta_0 && \theta_1 && \theta_2&& \cdots && \theta_n
\end{bmatrix},
\\
h_\theta\left(x\right) = \theta \cdot x
$$

We will use this dot product to evaluate the hypothesis $h_\theta$ throughout the code,
by calling

```python
numpy.dot(parameters, features)
```
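
As a small illustration, the sketch below evaluates this dot product on made-up numbers. The values and the names `theta`, `x` and `X` are assumptions for this example only and are not taken from the diamonds data. For a single example, `np.dot(theta, x)` returns a scalar; for a matrix holding one example per row, `np.dot(X, theta)` returns one prediction per row.

```python
import numpy as np

# Hypothetical parameters and one feature vector (toy values)
theta = np.array([1.0, 2.0, 3.0])
x = np.array([1.0, 0.5, 2.0])

# Hypothesis for a single example: theta . x
h_single = np.dot(theta, x)

# Hypothesis for several examples at once, one example per row
X = np.array([[1.0, 0.5, 2.0],
              [1.0, 1.0, 1.0]])
h_all = np.dot(X, theta)

print(h_single, h_all)
```

The matrix form is what makes it possible to avoid a Python loop over the examples.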

### Gradient Descent

The main formula for the iteration steps in the Gradient Descent algorithm is

$$
\theta_j = \theta_j - \alpha \frac{1}{m}
    \sum_{i=1}^{m}{\left(h_\theta\left(x^{\left(i\right)}\right) -
    y^{\left(i\right)}\right) x_j^{\left(i\right)}}
$$

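for all $j = 0, 1, \ldots, n$, where $j$ is the index of the parameter and
$i$ is the index of the data example. All parameters are updated simultaneously,
so every $h_\theta$ on the right-hand side uses the parameter values from the
previous iteration. Written out: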

$$
\theta_0 = \theta_0 - \alpha \frac{1}{m}
    \sum_{i=1}^{m}{\left(h_\theta\left(x^{\left(i\right)}\right) -
    y^{\left(i\right)}\right) x_0^{\left(i\right)}}
\\
\theta_1 = \theta_1 - \alpha \frac{1}{m}
    \sum_{i=1}^{m}{\left(h_\theta\left(x^{\left(i\right)}\right) -
    y^{\left(i\right)}\right) x_1^{\left(i\right)}}
\\
\theta_2 = \theta_2 - \alpha \frac{1}{m}
    \sum_{i=1}^{m}{\left(h_\theta\left(x^{\left(i\right)}\right) -
    y^{\left(i\right)}\right) x_2^{\left(i\right)}}
\\
\vdots
$$
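
The update rule can be transcribed almost literally into code. The following is a minimal sketch, assuming a feature matrix `X` with one example per row and a first column of ones for $x_0$; the function name `gradient_step_loops` and the toy data are hypothetical. It is written with explicit loops to mirror the formulas, which is slow for large datasets.

```python
import numpy as np

def gradient_step_loops(theta, X, y, alpha):
    """One gradient descent step written as explicit loops (illustrative, slow)."""
    m, n = X.shape
    new_theta = theta.copy()
    for j in range(n):
        total = 0.0
        for i in range(m):
            # The prediction always uses the OLD theta, so that all
            # parameters are updated simultaneously.
            total += (np.dot(theta, X[i]) - y[i]) * X[i, j]
        new_theta[j] = theta[j] - alpha * total / m
    return new_theta

# Toy data (assumed): x_0 = 1 for every example, targets follow y = 2 * x_1
X = np.array([[1.0, 1.0],
              [1.0, 2.0]])
y = np.array([2.0, 4.0])

theta = gradient_step_loops(np.zeros(2), X, y, alpha=0.1)
print(theta)
```

Note that the inner expression `np.dot(theta, X[i]) - y[i]` is recomputed for every `j`, even though it does not depend on `j`.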

We define $k^{\left(i\right)}$ as the difference between the prediction and the
target for example $i$:
$$
k^{\left(i\right)} = h_\theta\left(x^{\left(i\right)}\right) - y^{\left(i\right)}
$$
Note that $k^{\left(i\right)}$ does not depend on $j$, so the same values appear
in the update of every parameter. For this reason, we will calculate all the
$k^{\left(i\right)}$ just once per iteration and reuse them for all parameters.

$$
\theta_j = \theta_j - \alpha \frac{1}{m}
    \sum_{i=1}^{m}{k^{\left(i\right)} ~ x_j^{\left(i\right)}}
$$
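
With the differences computed once, the whole update reduces to two matrix products. This is a sketch under the same assumptions as the formulas above (one example per row in `X`, a column of ones for $x_0$); the name `gradient_step`, the learning rate and the toy data are hypothetical and are not part of the diamonds notebook.

```python
import numpy as np

def gradient_step(theta, X, y, alpha):
    """One gradient descent step that computes the differences k only once."""
    m = X.shape[0]
    k = X @ theta - y          # k^(i) for every example, shape (m,)
    gradient = X.T @ k / m     # (1/m) * sum_i k^(i) * x_j^(i), for every j
    return theta - alpha * gradient

# Toy data (assumed): x_0 = 1 for every example, targets follow y = 2 * x_1
X = np.array([[1.0, 1.0],
              [1.0, 2.0]])
y = np.array([2.0, 4.0])

theta = np.zeros(2)
for _ in range(5000):
    theta = gradient_step(theta, X, y, alpha=0.1)

print(theta)  # approaches theta_0 = 0, theta_1 = 2 on this toy data
```

On real data such as the diamond features, the columns have very different scales, so a learning rate that works here will not necessarily work there without feature scaling.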


In [5]:
# this is our model, what we want to have as a final result
parameters = np.array([0.0, 0.0, 0.0, 0.0, 0.0, 0.0])

# this is the number of data examples that we have. for this dataset: 53940
data_size = diamonds_target_df.shape[0]

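# differences k^(i) = h(x^(i)) - y^(i), computed for all examples at once
# (vectorized: one matrix-vector product instead of a Python loop over 53940 rows)
k_diff = diamonds_features_df.values.dot(parameters) - diamonds_target_df.iloc[:, 0].values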

#print(k_diff)