## Assignment 1

### Part 1

Write your own linear regression algorithm.

#### Solution
`LinearRegression` fits a linear model with coefficients $\vec{w}$.
We are given a set of pairs $(\vec{x}_i, y_i)$, where $\vec{x}_i$ is the feature vector of the $i$-th observation.

The model finds an approximate dependence $f(x)$, which can be written as
$$f(x) = \langle \vec{x}', \vec{w} \rangle, \text{ where } \vec{x}' = (\vec{x}, 1), \ \vec{w} = (w_1,\dots, w_n, w_0).$$


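The least-squares solution is $\vec{w} = (X^TX)^{-1}X^T\vec{y}$, where $X$ is the matrix whose rows are the augmented vectors $\vec{x}_i'$ (assuming $X^TX$ is invertible).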
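
As a quick check of the closed-form solution, the sketch below solves the normal equations $X^TX\vec{w} = X^T\vec{y}$ directly and compares the result with `np.linalg.lstsq`. The synthetic data (true weights 3, -2 and intercept 0.5) are an assumption made purely for illustration and are not part of the assignment.

```python
import numpy as np

# Synthetic data for illustration only: y = 3*x1 - 2*x2 + 0.5 + small noise
rng = np.random.default_rng(0)
x = rng.normal(size=(50, 2))
y = 3.0 * x[:, 0] - 2.0 * x[:, 1] + 0.5 + rng.normal(scale=0.1, size=50)

# Augment with a column of ones so the last weight is the intercept w_0
X = np.c_[x, np.ones(x.shape[0])]

# Normal equations: solve (X^T X) w = X^T y without forming an explicit inverse
w_normal = np.linalg.solve(X.T @ X, X.T @ y)

# Least-squares solver, which also handles rank-deficient X
w_lstsq = np.linalg.lstsq(X, y, rcond=None)[0]

print(w_normal)
print(w_lstsq)
```

Both approaches should agree up to floating-point error when $X^TX$ is well conditioned; `lstsq` is typically the safer choice when features are nearly collinear.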

In [1]:
import numpy as np

In [2]:
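class LinearRegression:
    def __init__(self):
        # Fitted weights (w_1, ..., w_n, w_0); set by fit()
        self.coef_ = None

    def fit(self, x: np.ndarray, y: np.ndarray):
        # Append a column of ones so the last weight acts as the intercept w_0
        X = np.c_[x, np.ones(x.shape[0])]
        # Least-squares solution of X w = y
        self.coef_ = np.linalg.lstsq(X, y, rcond=None)[0]
        return self

    def predict(self, x: np.ndarray):
        X = np.c_[x, np.ones(x.shape[0])]
        return np.dot(X, self.coef_)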

### Part 2

Write a function r2(y_true, y_pred) to calculate the coefficient of determination $R^2$:
$$R^2 = 1 - \frac{D_\epsilon}{D_y},$$  
where $D_\epsilon$ is the variance of the error and $D_y$ is the variance of the random variable $y$.

In [3]:
def r2(y_true, y_pred):
    err = y_true - y_pred
    return 1 - err.var() / y_true.var()
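
A few hand-checkable cases may help confirm how this definition behaves. The sketch below restates the variance-based formula as `r2_variance` so that the block is self-contained, and compares it with the common sum-of-squares form $1 - SS_{res}/SS_{tot}$. The helper names and the toy data are illustrative assumptions.

```python
import numpy as np

def r2_variance(y_true, y_pred):
    # Same formula as r2 above: 1 - Var(error) / Var(y)
    err = y_true - y_pred
    return 1 - err.var() / y_true.var()

def r2_sum_of_squares(y_true, y_pred):
    # Common alternative form: 1 - SS_res / SS_tot
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1 - ss_res / ss_tot

y_true = np.array([1.0, 2.0, 3.0, 4.0])

# Perfect predictions give R^2 = 1
perfect = r2_variance(y_true, y_true)

# Always predicting the mean of y gives R^2 = 0
baseline = r2_variance(y_true, np.full(4, y_true.mean()))

# A constant offset has zero error variance, so the two forms disagree
shifted = y_true + 1.0
shifted_var = r2_variance(y_true, shifted)       # 1.0
shifted_ss = r2_sum_of_squares(y_true, shifted)  # approximately 0.2
```

The two forms coincide whenever the errors have zero mean, which holds for a least-squares fit with an intercept evaluated on its own training data, as in this notebook.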

### Part 3

Among the 5 samples offered to you, find the best and worst in terms of linear regression modeling.


#### Solution

Samples can be compared by their $R^2$ score: the best sample has the highest $R^2$ and is well described by a linear model, while the worst sample has the lowest $R^2$.
A low $R^2$ indicates that the features are poorly chosen, that the relationship $y = f(x)$ is not linear, or that the data are very noisy.
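
To illustrate how a non-linear relationship lowers $R^2$, the sketch below fits a straight line to two synthetic datasets: one exactly linear and one quadratic. The helper `fit_line_r2` and the data are hypothetical and unrelated to the five samples in the assignment.

```python
import numpy as np

def fit_line_r2(x, y):
    # Fit y ~ w_1 * x + w_0 by least squares and return the variance-based R^2
    X = np.c_[x, np.ones(x.shape[0])]
    w = np.linalg.lstsq(X, y, rcond=None)[0]
    err = y - X @ w
    return 1 - err.var() / y.var()

x = np.linspace(-1.0, 1.0, 101)

# Exactly linear data: the line explains all of the variance
r2_linear = fit_line_r2(x, 2.0 * x + 1.0)

# Symmetric parabola: the best line is flat, so it explains essentially none
r2_quadratic = fit_line_r2(x, x ** 2)

print(r2_linear, r2_quadratic)
```

In the quadratic case $y$ is fully determined by $x$, yet $R^2$ is close to zero, so a low score says the linear model fits poorly, not that $x$ carries no information about $y$.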

Here we calculate the $R^2$ score for each sample and then compare them:

In [4]:
r2_score = []

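for sample in range(1, 6):
    filename = f'data/part_3/{sample}.npy'
    sample_data = np.load(filename)  # load once; column 0 is x, column 1 is y
    x, y = sample_data[:, 0], sample_data[:, 1]

    # Reshape x into a single-feature column for both fit and predict
    model = LinearRegression().fit(x.reshape(-1, 1), y)
    y_predict = model.predict(x.reshape(-1, 1))

    r2_score.append(r2(y, y_predict))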

In [5]:
print(f'The best is sample no. {r2_score.index(max(r2_score)) + 1} with r2 value {round(max(r2_score), 3)}')
print(f'The worst is sample no. {r2_score.index(min(r2_score)) + 1} with r2 value {round(min(r2_score), 3)}')

The best is sample no. 3 with r2 value 0.944
The worst is sample no. 4 with r2 value 0.186


## Assignment 2

As a training set, use the data from the file 'candy-data.csv'. Do not include the following candies: Charleston Chew, Dum Dums. 
Train the model. Find the predicted value of winpercent for Charleston Chew and Dum Dums.
Enter the predicted value of winpercent for a candy with the following parameters: [0, 0, 0, 1, 0, 1, 1, 0, 1, 0.885, 0.649]

In [6]:
import pandas as pd

In [7]:
data = pd.read_csv('data/candy-data.csv')
training_data = data[(data['competitorname'] != 'Charleston Chew') & (data['competitorname'] != 'Dum Dums')]

In [8]:
X = training_data.drop(['winpercent', 'competitorname', 'Y'], axis=1).values
y = training_data['winpercent'].values

In [9]:
linear_model = LinearRegression()
linear_model.fit(X, y)

<__main__.LinearRegression at 0x7f212e854040>

In [10]:
charles_chew = data[data['competitorname'] == 'Charleston Chew'].drop(['competitorname', 'winpercent', 'Y'], axis=1).values
charles_chew_winpercent = linear_model.predict(charles_chew)

In [11]:
dum_dums = data[data['competitorname'] == 'Dum Dums'].drop(['competitorname', 'winpercent', 'Y'], axis=1).values
dum_dums_winpercent = linear_model.predict(dum_dums)

In [12]:
candy_data = np.array([[0, 0, 0, 1, 0, 1, 1, 0, 1, 0.885, 0.649]])
candy_winpercent = linear_model.predict(candy_data)

In [13]:
print('Win Percentage prediction for Charleston Chew candies:', round(charles_chew_winpercent[0], 3), 
     '\nWin Percentage prediction for Dum Dums candies:', round(dum_dums_winpercent[0], 3),
     '\nWin Percentage prediction for candies with given parameters:', round(candy_winpercent[0], 3))

Win Percentage prediction for Charleston Chew candies: 70.361 
Win Percentage prediction for Dum Dums candies: 50.075 
Win Percentage prediction for candies with given parameters: 45.74
