# Gradient Descent - Boston Dataset

Boston dataset is one of the datasets available in sklearn.
You are given a Training dataset csv file with X train and Y train data. As studied in lecture, your task is to come up with Gradient Descent algorithm and thus predictions for the test dataset given.

Task is to:
1. Code Gradient Descent for N features and come with predictions.
2. Try and test with various combinations of learning rates and number of iterations.
3. Try using Feature Scaling, and see if it helps you in getting better results. 


Instructions:
1. Use Gradient Descent as a training algorithm and submit results predicted.
2. Files are in csv format, you can use genfromtxt function in numpy to load data from csv file. Similarly you can use savetxt function to save data into a file.
3. Submit a csv file with only predictions for X test data. File name should not have spaces. File should not have any headers and should only have one column i.e. predictions. Also predictions shouldn't be in exponential form. 
4. Your score is based on coefficient of determination.


In [2]:
## Load data
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
# test_data = np.loadtxt("test_boston_x_test.csv", delimiter=",")
# train_data = np.loadtxt("training_boston_x_y_train.csv", delimiter=",")

test_data =pd.read_csv("test_boston_x_test.csv")
train_data = pd.read_csv("training_boston_x_y_train.csv", delimiter=",")


In [3]:
## Shape of data: no. of row and columns
print("Shape of train dataset:",train_data.shape )
print("Shape of test dataset:",test_data.shape)

Shape of train dataset: (379, 14)
Shape of test dataset: (126, 13)


In [4]:
## description of only numerical columns training data
## its a function of dataframe
## .T means transpose: row to columns vice versa
train_data.describe().T


Unnamed: 0,count,mean,std,min,25%,50%,75%,max
# CRIM,379.0,0.019628,1.06749,-0.417713,-0.408171,-0.383729,0.055208,9.941735
ZN,379.0,0.002455,1.000813,-0.487722,-0.487722,-0.487722,0.156071,3.804234
INDUS,379.0,0.03617,1.017497,-1.516987,-0.867691,-0.180458,1.015999,2.422565
CHAS,379.0,0.028955,1.048995,-0.272599,-0.272599,-0.272599,-0.272599,3.668398
NOX,379.0,0.028775,0.999656,-1.465882,-0.878475,-0.144217,0.628913,2.732346
RM,379.0,0.032202,1.001174,-3.880249,-0.57148,-0.103479,0.529069,3.555044
AGE,379.0,0.038395,0.985209,-2.335437,-0.768994,0.338718,0.911243,1.117494
DIS,379.0,-0.001288,1.027803,-1.267069,-0.829872,-0.329213,0.674172,3.960518
RAD,379.0,0.043307,1.016265,-0.982843,-0.637962,-0.523001,1.661245,1.661245
TAX,379.0,0.043786,1.019974,-1.31399,-0.755697,-0.440915,1.530926,1.798194


In [5]:
## Info show the data type and not null values
train_data.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 379 entries, 0 to 378
Data columns (total 14 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   # CRIM    379 non-null    float64
 1    ZN       379 non-null    float64
 2    INDUS    379 non-null    float64
 3    CHAS     379 non-null    float64
 4    NOX      379 non-null    float64
 5    RM       379 non-null    float64
 6    AGE      379 non-null    float64
 7    DIS      379 non-null    float64
 8    RAD      379 non-null    float64
 9    TAX      379 non-null    float64
 10   PTRATIO  379 non-null    float64
 11   B        379 non-null    float64
 12   LSTAT    379 non-null    float64
 13   Y        379 non-null    float64
dtypes: float64(14)
memory usage: 41.6 KB


In [6]:
train_data.head()


Unnamed: 0,# CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,B,LSTAT,Y
0,-0.40785,-0.487722,-1.266023,-0.272599,-0.576134,1.239974,0.840122,-0.520264,-0.752922,-1.278354,-0.303094,0.410571,-1.09799,37.9
1,-0.407374,-0.487722,0.247057,-0.272599,-1.016689,0.001946,-0.838337,0.336351,-0.523001,-0.060801,0.113032,0.291169,-0.520474,21.4
2,0.125179,-0.487722,1.015999,-0.272599,1.36749,-0.439699,0.687212,-0.577309,1.661245,1.530926,0.806576,-3.795795,0.891076,12.7
3,0.028304,-0.487722,1.015999,-0.272599,1.859875,-0.047918,0.801005,-0.712836,1.661245,1.530926,0.806576,-0.06605,0.215438,19.9
4,-0.412408,-0.487722,-0.969827,-0.272599,-0.913029,-0.384137,-0.834781,0.300508,-0.752922,-0.957633,0.02056,0.431074,0.029007,22.5


In [7]:
train_data.tail().T


Unnamed: 0,374,375,376,377,378
# CRIM,-0.204929,0.231398,-0.408311,-0.41062,0.342909
ZN,-0.487722,-0.487722,-0.487722,-0.487722,-0.487722
INDUS,1.231945,1.015999,0.247057,-1.152214,1.015999
CHAS,3.668398,-0.272599,-0.272599,-0.272599,3.668398
NOX,0.434551,1.36749,-1.016689,-0.818007,0.659147
RM,2.161728,0.215644,-0.206055,0.068904,1.041946
AGE,1.053485,0.687212,-0.809889,-1.826921,1.028593
DIS,-0.83396,-0.703186,0.140451,0.674814,-1.232462
RAD,-0.523001,1.661245,-0.523001,-0.637962,1.661245
TAX,-0.031105,1.530926,-0.060801,0.129256,1.530926


In [8]:
## checking null values 
train_data.isnull().sum()

# CRIM      0
 ZN         0
 INDUS      0
 CHAS       0
 NOX        0
 RM         0
 AGE        0
 DIS        0
 RAD        0
 TAX        0
 PTRATIO    0
 B          0
 LSTAT      0
 Y          0
dtype: int64

Seems like there are no null values, it means there are no missing values.

In [9]:
train_data.isna().sum()

# CRIM      0
 ZN         0
 INDUS      0
 CHAS       0
 NOX        0
 RM         0
 AGE        0
 DIS        0
 RAD        0
 TAX        0
 PTRATIO    0
 B          0
 LSTAT      0
 Y          0
dtype: int64

There are no NA values in data

**Now, using gradient descent, we will find the best values of m and c**

In [13]:
# This function finds the new gradient at each step
def step_gradient(points, learning_rate, m , c):
    m_slope = 0
    c_slope = 0
    M = len(points)
    for i in range(M):
        x = points[i, 0]
        y = points[i, 1]
        m_slope += (-2/M)* (y - m * x - c)*x
        c_slope += (-2/M)* (y - m * x - c)
    new_m = m - learning_rate * m_slope
    new_c = c - learning_rate * c_slope
    # return new_m, new_c
    print(new_m, new_c)

In [14]:
# The Gradient Descent Function
def gd(points, learning_rate, num_iterations):
    m = 0       # Intial random value taken as 0
    c = 0       # Intial random value taken as 0
    for i in range(num_iterations):
        m, c = step_gradient(points, learning_rate, m , c)
        print(i, " Cost: ", cost(points, m, c))
    return m, c

In [15]:
# This function finds the new cost after each optimisation.
def cost(points, m, c):
    total_cost = 0
    M = len(points)
    for i in range(M):
        x = points[i, 0]
        y = points[i, 1]
        total_cost += (1/M)*((y - m*x - c)**2)
    return total_cost

In [18]:
def run():
    learning_rate = 0.0001
    num_iterations = 100
    m, c = gd(train_data, learning_rate, num_iterations)
    print("Final m :", m)
    print("Final c :", c)
    return m,c

In [17]:
m, c = run()


NameError: name 'training_data' is not defined