## Machine Learning: Programming Exercise 1
### 1 Write a function to generate an m+1 dimensional data set, of size n, consisting of m continuous independent variables (X) and one dependent variable (Y) defined as
`yi = xiβ + e`
where,

* e is a Gaussuan distribution with mean 0 and standard deviation (σ), representing the unexplained
variation in Y
* β is a random vector of dimensionality m + 1, representing the coefficients of the linear relationship
between X and Y, and
* ∀i ∈ [1, n], xi0 = 1


The function should take the following parameters:
* σ: The spread of noise in the output variable
* n: The size of the data set
* m: The number of indepedent variables


Output from the function should be:
* X: An n × m numpy array of independent variable values (with a 1 in the first column)
* Y : The n × 1 numpy array of output values
* β: The random coefficients used to generatre Y from X

In [1]:
import numpy as np
import matplotlib.pyplot as plt

In [2]:
def generate_data(sigma, n, m, seed=None):
    if seed:
        np.random.seed(seed)
    # Generate X with n rows and m columns
    # 1 extra coloum for multplying with B0 Bias
    X = np.random.uniform(low=0, high=1, size=(n , m + 1))
    # The First column of X should be 1
    X[:, 0] = 1
    #  A random column_vector with m columns, one for each dependent variable
    # 1 Extra column for Bias
    Beta_vector = np.random.random(size=(m+1, 1))
    # e
    e = np.random.normal(loc=0, scale=sigma, size=(n, 1))
    # Get Y_vector according to the given formula
    y = np.dot(X, Beta_vector) + e
    return X, y, Beta_vector

### 2 Write a function that learns the parameters of a linear regression line given inputs
* X: An n × m numpy array of independent variable values
* Y : The n × 1 numpy array of output values
* k: the number of iteractions (epochs)
* τ : the threshold on change in Cost function value from the previous to current iteration
* λ: the learning rate for Gradient Descent

The function should implement the Gradient Descent algorithm as discussed in class that initialises β with
random values and then updates these values in each iteraction by moving in the the direction defined by
the partial derivative of the cost function with respect to each of the coefficients. The function should use
only one loop that ends after a number of iterations (k) or a threshold on the change in cost function value
`(τ )`.

The output should be an m + 1 dimensional vector of coefficients and the final cost function value.

In [3]:
def learning(X, Y, k, T, L, seed= None):
    if seed:
        np.random.seed(seed)
    def get_gradient(X, Y, B):
        n = Y.shape[0]
        y_pred = np.dot(X, B)
        cost_dot = np.dot(X.T, (y_pred - Y))
        gradient = (2/n) * cost_dot
        return gradient
        
    def get_cost(B):
        """
        RMSE as cost_function
        """
        y_pred = np.dot(X, B)
        cost = np.sqrt(np.sum((Y - y_pred)**2)/Y.shape[0])
        return cost

    # Start Model Parameters at random
    B = np.random.random(size=(X.shape[1], 1))
    prev_cost = get_cost(B)
    # k loops
    for _ in range(k):
        # Perform Gradient Descent
        gradient = get_gradient(X, Y, B)
        B = B + (L * gradient)
        curr_cost = get_cost(B)
        if curr_cost - prev_cost < T:
            return B, curr_cost
        prev_cost = curr_cost
    return B, curr_cost        

### 3 Create a report investigating how different values of n and σ impact the ability for your linear regression function to learn the coefficients, `β`, used to generate the output vector `Y `.

In [4]:
beta_tuple_list = list()
# For Finding Eficiency of Model in learning Beta

In [5]:
def get_difference_in_beta(B_true, B_pred):
    return np.sum(np.square(B_true - B_pred))/len(B_true)

In [6]:
sigma = 5
n = 100
m = 5 # Dependent Variables

In [7]:
X, Y, B_true_1 = generate_data(sigma, n, m, 42)
# 1 extra column in X for Bias Calculation 
Y.shape, X.shape

((100, 1), (100, 6))

In [8]:
# Iterations
k = 40
# Learning Rate
L = 1.0
# Threshold
T = np.inf

In [9]:
B_predicted_1, cost_1 = learning(X=X, Y=Y, k=k, T=T, L=L, seed=42)
beta_tuple_list.append((B_true_1, B_predicted_1))

In [10]:
cost_1

10.901492010427813

In [11]:
sigma = 4
n = 100

In [12]:
X, Y, B_true_2 = generate_data(sigma, n, m, seed=42)
# 1 extra column in X for Bias Calculation 
Y.shape, X.shape

B_predicted_2, cost_2 = learning(X=X, Y=Y, k=k, T=T, L=L, seed=42)
beta_tuple_list.append((B_true_2, B_predicted_2))
cost_2

9.76229043784497

In [13]:
sigma = 3
n = 100
X, Y, B_true_3 = generate_data(sigma, n, m, seed=42)

B_predicted_3, cost_3 = learning(X=X, Y=Y, k=k, T=T, L=L, seed=42)
beta_tuple_list.append((B_true_3, B_predicted_3))
cost_3

8.657513675179832

In [14]:
sigma = 2
n = 100
X, Y, B_true_4 = generate_data(sigma, n, m)

# Let's use Seed so that random factors don't affect us when seeing the effects of n and sigma
B_predicted_4, cost_4 = learning(X=X, Y=Y, k=k, T=T, L=L, seed=42)
beta_tuple_list.append((B_true_4, B_predicted_4))
cost_4

3.8882302289415316

In [15]:
sigma = 1
n = 100
X, Y, B_true_5 = generate_data(sigma, n, m)

# Let's use Seed so that random factors don't affect us when seeing the effects of n and sigma
B_predicted_5, cost_5 = learning(X=X, Y=Y, k=k, T=T, L=L, seed=42)
beta_tuple_list.append((B_true_5, B_predicted_5))
cost_5

2.769655729369683

**I am taking Number of Iterations at 10, because otherwise my function gives an overflow warning during calculations resulting in cost becoming infinity**

In [16]:
cost_1, cost_2, cost_3, cost_4, cost_5

(10.901492010427813,
 9.76229043784497,
 8.657513675179832,
 3.8882302289415316,
 2.769655729369683)

In [17]:
# Lower Value Means the Coefficient learned by Model are closer to real ones
[get_difference_in_beta(true, pred) for true, pred in beta_tuple_list]

[6.228754163505287,
 5.35600656321289,
 4.549684171901164,
 0.8646129568549409,
 0.6118753406343145]

As per the above observation, we can say that as sigma decreases, the cost_value decreases.
And Model's ability to learn beta vector increases.

Report on effect of changing value of n

In [19]:
beta_tuple_list_for_n = []

In [20]:
n = 1000
X, Y, B_true_1000 = generate_data(sigma, n, m)

# Let's use Seed so that random factors don't affect us when seeing the effects of n and sigma
B_predicted_1000, cost_1000 = learning(X=X, Y=Y, k=k, T=T, L=L, seed=42)
beta_tuple_list_for_n.append((B_true_1000, B_predicted_1000))
cost_1000

1.2686115477976918

Increase the the numbers of rows of dataset has given us a notable decrease in cost function. given the same sigma 
that is sigma = 1.

In [21]:
n = 2000
X, Y, B_true_2000 = generate_data(sigma, n, m)

# Let's use Seed so that random factors don't affect us when seeing the effects of n and sigma
B_predicted_2000, cost_2000 = learning(X=X, Y=Y, k=k, T=T, L=L, seed=42)
beta_tuple_list_for_n.append((B_true_2000, B_predicted_2000))
cost_2000

3.173817410634527

After doubling the rows on previous basis, the error has increased significantly.

In [22]:
n = 10000
X, Y, B_true_10_4 = generate_data(sigma, n, m)

# Let's use Seed so that random factors don't affect us when seeing the effects of n and sigma
B_predicted_10_4, cost_10_4 = learning(X=X, Y=Y, k=k, T=T, L=L, seed=42)
beta_tuple_list_for_n.append((B_true_10_4, B_predicted_10_4, ))
cost_10_4

1.3057592693545543

The Cost value has decreased when we increased the n rows exponentially but not comparable to n = 1000

In [23]:
n = 100000
X, Y, B_true_10_5 = generate_data(sigma, n, m)

# Let's use Seed so that random factors don't affect us when seeing the effects of n and sigma
B_predicted_10_5, cost_10_5 = learning(X=X, Y=Y, k=k, T=T, L=L, seed=42)
beta_tuple_list_for_n.append((B_true_10_5, B_predicted_10_5,))
cost_10_5

1.8594298456638836

In [24]:
n = 1000000
X, Y, B_true_10_6 = generate_data(sigma, n, m)

# Let's use Seed so that random factors don't affect us when seeing the effects of n and sigma
B_predicted_10_6, cost_10_6 = learning(X=X, Y=Y, k=k, T=T, L=L, seed=42)
beta_tuple_list_for_n.append((B_true_10_6, B_predicted_10_6,))
cost_10_6

2.136143321091605

In [31]:
(cost_1000,0), (cost_2000,1), (cost_10_4,2), (cost_10_5,3), (cost_10_6,4),

((1.2686115477976918, 0),
 (3.173817410634527, 1),
 (1.3057592693545543, 2),
 (1.8594298456638836, 3),
 (2.136143321091605, 4))

In the above observation, we can see that after increase the data rows/samples to 1000,
the cost decreased significantly.
However, as we exponentially increase the n rows by 10 times each iteration, the cost increases.

In [32]:
[(get_difference_in_beta(true, pred), n)  for (n, (true, pred)) in enumerate(beta_tuple_list_for_n)]

[(0.4451295745106673, 0),
 (0.8479972271901514, 1),
 (0.26487284485965173, 2),
 (0.40125308473266347, 3),
 (0.48500617333434065, 4)]

BUT! , as we can see in element 2, 3; the difference in Actual and Predicted Beta Values is the smaller.
even when the cost associated with them is not the lowest cost!

So It shows here that model generalises well with these n rows. and the model was overfitted when trained on data 
with lower n rows.

Results:
* The Model's ability to learn Beta_vector is:
     * inversly propotional to the the sigma (standard deviation of e)
     * directly propotional to the 'n' number of samples in the generated data.
     * In case the model is having more cost value when increasing the n rows, it may that model was overfitted before.
     * We cannot fairly conclude that lower cost value of a cost function translates to model learning the coefficients correclty.

**Thanks for reading**