# COL744 : Machine Learning (Assignment 1)

## Question 1

### Part (a) : Sampling 1 million datapoints
In this part I sample $10^6$ datapoints with two features $x_1$ and $x_2$, then compute $y^{(i)}$ by adding noise $\epsilon_i$ to $\theta^Tx^{(i)}$ with $\theta$ = [3,1,2], where $x^{(i)} = [1, x_1^{(i)}, x_2^{(i)}]$ includes the intercept term. The variables come from the following distributions (the second parameter of $\mathcal{N}$ is the variance).

* $y^{(i)} = \sum_{j=0}^{2} \theta_j x_j^{(i)} + \epsilon_i$, with $x_0^{(i)} = 1$

* $x_1 \sim \mathcal{N}(3,4)$

* $x_2 \sim \mathcal{N}(-1,4)$

* $\epsilon_i \sim \mathcal{N}(0,2)$
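The distributions above are specified by their variance, while NumPy's normal sampler takes the standard deviation as `scale`, so the variances 4 and 2 correspond to `scale=2` and `scale=sqrt(2)`. A minimal vectorized sketch of this sampling step follows; the helper name `sample_data`, the seeded `np.random.default_rng` generator, and the smaller sample size are assumptions made for illustration and are not part of the notebook.

```python
import numpy as np

def sample_data(theta, m, seed=0):
    """Draw m points with x1 ~ N(3, 4), x2 ~ N(-1, 4) and noise ~ N(0, 2)."""
    rng = np.random.default_rng(seed)  # fixed seed is an assumption, for reproducibility
    x1 = rng.normal(loc=3.0, scale=2.0, size=m)            # variance 4 -> std 2
    x2 = rng.normal(loc=-1.0, scale=2.0, size=m)           # variance 4 -> std 2
    eps = rng.normal(loc=0.0, scale=np.sqrt(2.0), size=m)  # variance 2 -> std sqrt(2)
    X = np.column_stack((np.ones(m), x1, x2))  # intercept column first, shape (m, 3)
    y = X @ theta + eps                        # y = theta^T x + epsilon
    return X, y

# Small demo run (10,000 points instead of 10^6)
X_demo, y_demo = sample_data(np.array([3.0, 1.0, 2.0]), m=10_000)
```

Drawing each array with a single `size=m` call avoids a Python-level loop over one million samples.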

In [1]:
%matplotlib notebook
from tqdm import tqdm
import numpy as np
import pandas as pd
import pickle
import os
import math

import matplotlib as mp
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D 
from matplotlib.animation import FuncAnimation


In [2]:
theta = np.array([3,1,2])
sampleSize=10**6

In [3]:
if os.path.isfile('./data/xQ2.pkl') and os.path.isfile('./data/yQ2.pkl'):
    # Reuse previously sampled data so that runs are comparable
    with open('./data/xQ2.pkl', 'rb') as f:
        X = pickle.load(f)
    with open('./data/yQ2.pkl', 'rb') as f:
        Y = pickle.load(f)
else:
    # N(mu, sigma^2) is given by its variance; np.random.normal takes the std as `scale`
    dataX1 = np.random.normal(loc=3, scale=2, size=sampleSize)
    dataX2 = np.random.normal(loc=-1, scale=2, size=sampleSize)
    error = np.random.normal(loc=0, scale=math.sqrt(2), size=sampleSize)

    X_ = np.vstack((dataX1,dataX2)) # X_.shape = (2,sampleSize)
    X = np.vstack((np.ones(X_.shape[1]), X_)).T # X.shape = (sampleSize,3)
    Y = X.dot(theta) + error # y = theta^T x + epsilon
    
    os.makedirs('./data', exist_ok=True)
    
    with open('./data/xQ2.pkl', 'wb') as f:
        pickle.dump(X,f)
    with open('./data/yQ2.pkl', 'wb') as f:
        pickle.dump(Y,f)

### Part (b) : Implementing SGD

* In this part I have implemented SGD with learning rate $\eta$ = 0.001 and tested it with different batch sizes r = 1, 100, 10000 and 1000000.

* To check convergence, I average $J(\theta)$ over every 1000 iterations and stop the algorithm when the difference between two consecutive averages is at most 1e-4. A parameter max_iter = $10^6$ additionally caps the total number of iterations.
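The stopping rule described above can be sketched as a standalone helper. This is an illustrative version only: the name `has_converged` and the choice to pass the full loss history are assumptions, and the notebook's own loop keeps a running sum instead of slicing a list.

```python
def has_converged(losses, window=1000, tolerance=1e-4):
    """Compare the mean loss of the last two full windows of iterations."""
    if len(losses) < 2 * window:
        return False  # not enough history for two windows yet
    previous_avg = sum(losses[-2 * window:-window]) / window
    current_avg = sum(losses[-window:]) / window
    return abs(current_avg - previous_avg) <= tolerance

# A flat loss curve passes the test, a still-decreasing one does not
flat = [1.0] * 2000
falling = [1.0 / (k + 1) for k in range(2000)]
print(has_converged(flat), has_converged(falling))
```

Calling the helper once every `window` iterations gives roughly the same behaviour as comparing consecutive block averages. Averaging over a window matters most for small batch sizes, where the per-iteration loss is too noisy to compare directly.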

In [4]:
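def grad(theta, X, Y):
    '''Return the gradient of J(theta) wrt theta and the loss J(theta) on the batch (X, Y)'''
    err = (X.dot(theta)) - Y # residuals, shape (batch_size,)
    loss_val = ((err**2).sum())/(2*X.shape[0]) # J = (1/2r) * sum of squared residuals
    grad_val = (1/X.shape[0])*((X.T).dot(err)) # dJ/dtheta = (1/r) * X^T (X theta - Y)
    return (grad_val, loss_val)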

def SGD(X,Y,lr=0.001,r=1, max_iter=10**6, tolerance=1e-4):
    '''Mini-batch SGD with batch size r; returns the per-iteration theta and loss lists'''
    # Shuffle once so that consecutive slices form random batches
    indices = np.random.permutation(X.shape[0])
    X = X.take(indices, axis=0)
    Y = Y.take(indices, axis=0)
    batchNo = 0
    epoch = 0
    currentSum = 0 # running sum of the loss over the current 1000 iterations
    previousAvg = np.inf
    
    theta = np.zeros(X.shape[1])
    totalBatchForOneEpoch = X.shape[0]//r # assumes r divides the sample size
    
    loss_lst = []
    theta_lst=[]
    
    for i in tqdm(range(max_iter)):
        if batchNo == totalBatchForOneEpoch:
            # All batches seen: start the next pass over the data
            batchNo=0
            epoch+=1
        if i%1000 == 0:
            # Converged when two consecutive 1000-iteration loss averages are close
            if abs((currentSum/1000)-previousAvg) <= tolerance:
                print('converged in %d iterations'%(i))
                break
            else:
                previousAvg = currentSum/1000
                currentSum = 0
        X_curr = X[(batchNo*r):((batchNo+1)*r),:]
        Y_curr = Y[(batchNo*r):((batchNo+1)*r)]
        (grad_val, loss_val) = grad(theta, X_curr, Y_curr)
        currentSum += loss_val
        theta_next= theta - lr * np.array(grad_val)
        theta_lst.append(theta) # theta used for this iteration's loss
        loss_lst.append(loss_val)
        theta=theta_next
        batchNo+=1
    
    return (theta_lst, loss_lst)
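
For reference, the quantities computed by `grad` and the update applied in `SGD` can be written out as follows, where $B$ denotes the current mini-batch of size $r$ (the symbol $B$ is introduced here only for illustration):

$$J_B(\theta) = \frac{1}{2r} \sum_{i \in B} \left( \theta^T x^{(i)} - y^{(i)} \right)^2$$

$$\nabla_\theta J_B(\theta) = \frac{1}{r} \sum_{i \in B} \left( \theta^T x^{(i)} - y^{(i)} \right) x^{(i)}$$

$$\theta \leftarrow \theta - \eta \, \nabla_\theta J_B(\theta)$$

With $r = 10^6$ the batch is the whole dataset, so each update is an ordinary full-batch gradient descent step.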

In [5]:
(theta_lst1, loss_lst1) = SGD(X,Y,r=1, max_iter=10**6)

 16%|███████████▎                                                          | 161644/1000000 [00:02<00:13, 61857.27it/s]

converged in 166000 iterations



In [6]:
(theta_lst100, loss_lst100) = SGD(X,Y,r=100)

100%|█████████████████████████████████████████████████████████████████████| 1000000/1000000 [00:17<00:00, 55738.08it/s]


In [7]:
(theta_lst10000, loss_lst10000) = SGD(X,Y,r=10000)

  2%|█▎                                                                      | 17633/1000000 [00:02<02:32, 6456.02it/s]

converged in 18000 iterations



In [8]:
(theta_lst1000000, loss_lst1000000) = SGD(X,Y,r=10**6)

  2%|█▎                                                                      | 17998/1000000 [06:39<5:48:20, 46.99it/s]

converged in 18000 iterations



In [48]:
print('for r=1 --> Theta = {}'.format(theta_lst1[-1]))
print('for r=100 --> Theta = {}'.format(theta_lst100[-1]))
print('for r=10000 --> Theta = {}'.format(theta_lst10000[-1]))
print('for r=1000000 --> Theta = {}'.format(theta_lst1000000[-1]))

for r=1 --> Theta = [3.0124899  1.00283899 1.96698663]


NameError: name 'theta_lst100' is not defined

In [50]:
Data = pd.read_csv('./ass1_data/data/q2/q2test.csv').to_numpy()

X = Data[:,:2]
Y = Data[:,2]

In [51]:
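def findError(X,Y,theta):
    '''Half mean squared error J(theta) of the predictions on (X, Y)'''
    pred = X.dot(theta[1:3]) + theta[0] # X has no intercept column here
    error = Y - pred
    loss = (error**2).sum()/(2*X.shape[0]) # same form as the training loss, not an RMSE
    return loss

findError(X[:,:2],Y,theta1)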

1.0439121940120524
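
The value returned by `findError` is the sum of squared errors divided by $2m$, which has the same form as $J(\theta)$ and is not a root mean squared error. A small sketch of how the two metrics relate, using made-up toy numbers rather than the q2test data (the function names `half_mse` and `rmse` are illustrative):

```python
import numpy as np

def half_mse(X, Y, theta):
    """Half mean squared error, the same form as J(theta)."""
    residual = Y - (X.dot(theta[1:]) + theta[0])
    return (residual ** 2).sum() / (2 * X.shape[0])

def rmse(X, Y, theta):
    """Root mean squared error, equal to sqrt(2 * half_mse)."""
    return np.sqrt(2.0 * half_mse(X, Y, theta))

# Tiny hand-checkable example (toy values, not the assignment's test set)
X_toy = np.array([[1.0, 0.0], [0.0, 1.0]])
Y_toy = np.array([5.0, 6.0])
theta_toy = np.array([3.0, 1.0, 2.0])  # predictions are 4.0 and 5.0

print(half_mse(X_toy, Y_toy, theta_toy), rmse(X_toy, Y_toy, theta_toy))
```

Both residuals in the toy example equal 1, so the half mean squared error is 0.5 and the RMSE is 1.0.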

In [42]:
theta_lst1

[array([0., 0., 0.]),
 array([ 0.00288294,  0.00500052, -0.00252275]),
 array([ 0.00914933,  0.03150129, -0.00766111]),
 array([0.00582948, 0.03220206, 0.00055866]),
 array([ 0.01095946,  0.05552871, -0.00652506]),
 array([0.02201379, 0.11092611, 0.00422535]),
 array([0.0221894 , 0.11128641, 0.00391652]),
 array([0.02130435, 0.10929577, 0.00698538]),
 array([0.03106439, 0.16027811, 0.01333939]),
 array([0.04560415, 0.22892524, 0.05744626]),
 array([0.05113319, 0.25584766, 0.04635103]),
 array([0.05402068, 0.27138766, 0.03972941]),
 array([0.05647512, 0.27674522, 0.03649357]),
 array([0.06835115, 0.35711972, 0.05223889]),
 array([0.07279458, 0.37056851, 0.05130027]),
 array([0.07852834, 0.3781528 , 0.06262627]),
 array([0.0802043 , 0.3808827 , 0.06021031]),
 array([0.08731199, 0.42508218, 0.05128535]),
 array([0.09095267, 0.42098422, 0.05588529]),
 array([0.09367747, 0.42450017, 0.0545679 ]),
 array([0.08912595, 0.42122537, 0.07499976]),
 array([0.09641868, 0.45157932, 0.07789085]),
 ...

In [49]:
theta1 = theta_lst1[-1]
theta100 = theta_lst100[-1]
theta10000 = theta_lst10000[-1]
theta1000000 = theta_lst1000000[-1]

NameError: name 'theta_lst100' is not defined


In [15]:
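def plotTheta3D(theta_lst):
    '''Animate the path of (theta_0, theta_1, theta_2) over the SGD iterations'''
    data = np.array(theta_lst)
    fig = plt.figure(figsize=(8,6))
    # fig.gca(projection='3d') no longer works in recent matplotlib versions
    ax = fig.add_subplot(projection='3d')

    ax.set_xlabel('$\\theta_0$', color='r')
    ax.set_ylabel('$\\theta_1$', color='r')
    ax.set_zlabel('$\\theta_2$', color='r')
    ax.set_zlim(0,2.5)
    ax.set_xlim(0, 3.5)
    ax.set_ylim(0, 1.5)
    # Start with an empty 3D line; animate() extends it one point per frame
    graph, = ax.plot([], [], [], 'x',markersize=1, c='black', label = '$<\\theta_0, \\theta_1, \\theta_2>$')
    def animate(i):
        graph.set_data(data[:i+1,0], data[:i+1,1])
        graph.set_3d_properties(data[:i+1,2])
        return graph,

    anim = FuncAnimation(fig, animate, frames=len(data), interval=1, repeat=False)
    ax.legend(loc=4)
    ax.set_title('3D plot representing movement of theta')
    plt.show()
    return anim # keep a reference so the animation is not garbage collected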


In [7]:
plotTheta3D(theta_lst1)

[Animated 3D plot: movement of theta for r = 1; axes $\theta_0$, $\theta_1$, $\theta_2$]

In [31]:
plotTheta3D(theta_lst100)

[Animated 3D plot: movement of theta for r = 100; axes $\theta_0$, $\theta_1$, $\theta_2$]

In [32]:
plotTheta3D(theta_lst10000)

[Animated 3D plot: movement of theta for r = 10000; axes $\theta_0$, $\theta_1$, $\theta_2$]

In [33]:
plotTheta3D(theta_lst1000000)

[Animated 3D plot: movement of theta for r = 1000000; axes $\theta_0$, $\theta_1$, $\theta_2$]