
Boston Housing Dataset
Predicting Median value of owner-occupied homes
The aim of this assignment is to learn the application of machine learning algorithms to data sets. This involves learning what data means, how to handle data, training, cross validation, prediction, testing your model, etc.
This dataset contains information collected by the U.S. Census Service concerning housing in the area of Boston, Mass. It was obtained from the StatLib archive and has been used extensively throughout the literature to benchmark algorithms. The data was originally published by Harrison, D. and Rubinfeld, D.L., 'Hedonic prices and the demand for clean air', J. Environ. Economics & Management, vol.5, 81-102, 1978.
The dataset is small in size with only 506 cases. It can be used to predict the median value of a home, which is done here. There are 14 attributes in each case of the dataset. They are:
CRIM - per capita crime rate by town
ZN - proportion of residential land zoned for lots over 25,000 sq.ft.
INDUS - proportion of non-retail business acres per town.
CHAS - Charles River dummy variable (1 if tract bounds river; 0 otherwise)
NOX - nitric oxides concentration (parts per 10 million)
RM - average number of rooms per dwelling
AGE - proportion of owner-occupied units built prior to 1940
DIS - weighted distances to five Boston employment centres
RAD - index of accessibility to radial highways
TAX - full-value property-tax rate per $10,000
PTRATIO - pupil-teacher ratio by town
B - 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
LSTAT - % lower status of the population
MEDV - Median value of owner-occupied homes in $1000's

To Predict: MEDV

Aim
To implement linear regression with regularization via gradient descent.
To implement gradient descent with an Lp norm, for 3 different values of p in (1,2].
To contrast the performance of linear regression with the Lp norm against the L2 norm for these 3 values.
To verify that gradient descent for L2 gives the same result as the matrix-inversion-based solution.
All the code is written in a single Python file. The Python program accepts as input the path of the data directory where the train and test csv files reside. Note that the data directory will contain two files: train.csv, used to train your model, and test.csv, for which the output predictions are to be made. The output predictions get written to a file named output.csv. The output.csv file should have two comma-separated columns [ID,Output].
Working of Code
The NumPy library is required, so the code begins by importing it
Load phi and phi_test from the train and test datasets using NumPy's loadtxt function
Load y from the train dataset using the loadtxt function
Concatenate a column of 1s to the right of phi and phi_test
Apply min-max scaling to each column of phi and phi_test
Apply log scaling to y
Define a function to calculate the change in the error function based on phi, w and the p norm
Make a dictionary containing filenames as keys and p as values
For each item in this dictionary:
Set w to all 0s
Set an appropriate value for lambda and the step size
Calculate the new value of w
Repeat the update until the difference between consecutive ws is less than a threshold
Load the id values from the test data file
Calculate y for the test data using phi_test and applying the inverse log
Save the ids and y to the filename given by the dictionary
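The steps above can be sketched roughly as follows. This is a minimal illustration on synthetic data, not the original program: the objective ||phi w - y||^2 + lambda * sum(|w_j|^p), the helper names lp_gradient and fit_lp, the output file names, and the values of lam, step and tol are all assumptions chosen for the example.

```python
import numpy as np

def lp_gradient(phi, y, w, lam, p):
    # Gradient of the assumed objective ||phi @ w - y||^2 + lam * sum(|w|^p).
    residual = phi @ w - y
    penalty_grad = lam * p * np.sign(w) * np.abs(w) ** (p - 1)
    return 2 * phi.T @ residual + penalty_grad

def fit_lp(phi, y, p, lam=0.1, step=1e-3, tol=1e-8, max_iter=100_000):
    # Start from w = 0 and iterate until consecutive weights barely change.
    w = np.zeros(phi.shape[1])
    for _ in range(max_iter):
        w_new = w - step * lp_gradient(phi, y, w, lam, p)
        if np.linalg.norm(w_new - w) < tol:
            return w_new
        w = w_new
    return w

# Synthetic stand-in for the scaled design matrix (features already in [0, 1]).
rng = np.random.default_rng(0)
features = rng.uniform(0.0, 1.0, size=(50, 3))
phi = np.hstack([features, np.ones((50, 1))])  # column of 1s on the right
true_w = np.array([1.5, -2.0, 0.5, 3.0])
y = phi @ true_w + rng.normal(0.0, 0.05, size=50)

# Hypothetical output file names mapped to three values of p in (1, 2].
runs = {"output_p1.csv": 1.75, "output_p2.csv": 1.5, "output_p3.csv": 1.3}
weights = {name: fit_lp(phi, y, p) for name, p in runs.items()}
for name, w in weights.items():
    print(name, runs[name], np.round(w, 3))
```

The step size has to be small relative to the largest eigenvalue of phi.T @ phi, otherwise the updates diverge; the value used here suits this toy matrix only.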
 
Feature Engineering
The columns of phi are not in the same range because their units are different, i.e. phi is ill-conditioned
So, min-max scaling is applied to each column to bring it into the range 0-1
The same scaling is required on the columns of phi_test
Log scaling was used on y. This choice was determined by trial and error
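A minimal sketch of this scaling, assuming the column minimum and range are learned from phi and then reused unchanged for phi_test (the text does not say which statistics were used for the test matrix). The helper names and the toy numbers are made up for illustration.

```python
import numpy as np

def min_max_fit(phi):
    # Column-wise minimum and range, learned from the training matrix only.
    col_min = phi.min(axis=0)
    col_range = phi.max(axis=0) - col_min
    col_range[col_range == 0] = 1.0  # constant columns: avoid dividing by zero
    return col_min, col_range

def min_max_apply(phi, col_min, col_range):
    # Map each column so that the training minimum is 0 and the maximum is 1.
    return (phi - col_min) / col_range

# Toy stand-ins for phi, phi_test and y (values are illustrative only).
phi = np.array([[0.1, 300.0], [12.2, 666.0], [1.8, 403.0]])
phi_test = np.array([[0.3, 287.0], [6.0, 500.0]])
y = np.array([24.0, 10.2, 19.9])

col_min, col_range = min_max_fit(phi)
phi_scaled = min_max_apply(phi, col_min, col_range)
phi_test_scaled = min_max_apply(phi_test, col_min, col_range)  # same scaling

y_log = np.log(y)       # the model is trained on log(y)
y_back = np.exp(y_log)  # inverse transform, applied to predictions
print(phi_scaled.min(axis=0), phi_scaled.max(axis=0))
```

With this choice, test values that lie outside the training range land slightly outside 0-1, which is usually harmless for linear regression.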
Comparison of performance
(p1=1.75, p2=1.5, p3=1.3)
As p decreases, the error in y decreases
As p decreases, the norm of w increases, but this can be countered by increasing lambda
As p decreases, the number of iterations required decreases
Tuning of Hyperparameter
If p is fixed and lambda is increased, the error decreases up to a certain lambda and then starts rising
So, lambda was tuned by trial and error.
Starting from 0, lambda was increased in small steps until a minimum error was reached.
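The sweep described above might look like the sketch below. It is an illustration under assumptions: it uses the closed-form L2 solution instead of gradient descent to keep the loop short, measures error as RMSE on a held-out half of synthetic data, and the step of 0.05 and upper limit of 5.0 are arbitrary.

```python
import numpy as np

def ridge_fit(phi, y, lam):
    # Closed-form L2 solution, used here only to keep the sweep short.
    n_features = phi.shape[1]
    return np.linalg.solve(phi.T @ phi + lam * np.eye(n_features), phi.T @ y)

def rmse(phi, y, w):
    return float(np.sqrt(np.mean((phi @ w - y) ** 2)))

# Synthetic data split into a training half and a validation half.
rng = np.random.default_rng(1)
phi = np.hstack([rng.uniform(size=(60, 8)), np.ones((60, 1))])
y = phi @ rng.normal(size=9) + rng.normal(0.0, 0.5, size=60)
phi_train, y_train = phi[:30], y[:30]
phi_val, y_val = phi[30:], y[30:]

# Start at lambda = 0 and increase in small steps until the error rises.
best_lam = 0.0
best_err = rmse(phi_val, y_val, ridge_fit(phi_train, y_train, best_lam))
for lam in np.arange(0.05, 5.0, 0.05):
    err = rmse(phi_val, y_val, ridge_fit(phi_train, y_train, lam))
    if err >= best_err:
        break  # error started rising: keep the previous lambda
    best_lam, best_err = float(lam), err
print(best_lam, best_err)
```

Stopping at the first rise assumes the error curve has a single minimum, which matches the behaviour described above but is not guaranteed in general.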
Comparison of L2 gradient descent and closed form
The error from L2 gradient descent was 4.43268 and that from the closed-form solution was 4.52624.
The errors are comparable, so L2 gradient descent performs close to the closed-form solution.
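As a rough check of this comparison, the sketch below runs both methods on the same objective, ||phi w - y||^2 + lambda ||w||^2, using synthetic data. The data, lambda, step size and iteration count are assumptions for illustration. When both methods minimise exactly the same objective and gradient descent is run to convergence, the weights should agree closely; a small gap like the one reported above may come from the stopping threshold or from a different lambda.

```python
import numpy as np

rng = np.random.default_rng(2)
phi = np.hstack([rng.uniform(size=(40, 3)), np.ones((40, 1))])
y = phi @ np.array([2.0, -1.0, 0.5, 1.0]) + rng.normal(0.0, 0.1, size=40)
lam = 0.1

# Closed form: minimiser of ||phi w - y||^2 + lam * ||w||^2.
w_closed = np.linalg.solve(phi.T @ phi + lam * np.eye(4), phi.T @ y)

# Gradient descent on the same objective, starting from w = 0.
w_gd = np.zeros(4)
step = 1e-3
for _ in range(50_000):
    grad = 2 * phi.T @ (phi @ w_gd - y) + 2 * lam * w_gd
    w_gd = w_gd - step * grad

# Largest absolute difference between the two weight vectors.
print(np.max(np.abs(w_gd - w_closed)))
```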



BOSTON HOUSING DATASET

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from sklearn import linear_model
import seaborn as sns

In [None]:
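# NOTE: test.csv has no MEDV column (see df.info() below); the correlation and
# training cells further down use df['MEDV'], so they need the training file.
df = pd.read_csv("/content/test.csv")
df.head()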

   ID      CRIM    ZN  INDUS  CHAS    NOX     RM   AGE     DIS  RAD    TAX  PTRATIO       B  LSTAT
0   0   0.10612  30.0   4.93     0  0.428  6.095  65.1  6.3361    6  300.0     16.6  394.62   12.4
1   1   0.34109   0.0   7.38     0  0.493  6.415  40.1  4.7211    5  287.0     19.6   396.9   6.12
2   2   12.2472   0.0   18.1     0  0.584  5.837  59.7  1.9976   24  666.0     20.2   24.65  15.69
3   3   0.22489  12.5   7.87     0  0.524  6.377  94.3  6.3467    5  311.0     15.2  392.52  20.45
4   4   1.80028   0.0  19.58     0  0.605  5.877  79.2  2.4259    5  403.0     14.7  227.61  12.14


In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 105 entries, 0 to 104
Data columns (total 14 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   ID       105 non-null    int64  
 1   CRIM     105 non-null    float64
 2   ZN       105 non-null    float64
 3   INDUS    105 non-null    float64
 4   CHAS     105 non-null    int64  
 5   NOX      105 non-null    float64
 6   RM       105 non-null    float64
 7   AGE      105 non-null    float64
 8   DIS      105 non-null    float64
 9   RAD      105 non-null    int64  
 10  TAX      105 non-null    float64
 11  PTRATIO  105 non-null    float64
 12  B        105 non-null    float64
 13  LSTAT    105 non-null    float64
dtypes: float64(11), int64(3)
memory usage: 11.6 KB


In [None]:
df.isnull().sum()

ID         0
CRIM       0
ZN         0
INDUS      0
CHAS       0
NOX        0
RM         0
AGE        0
DIS        0
RAD        0
TAX        0
PTRATIO    0
B          0
LSTAT      0
dtype: int64

In [None]:
df.shape

(105, 14)

In [None]:
# Deleting duplicates
df = df.drop_duplicates()

In [None]:
df.shape

(105, 14)

In [None]:
b = []
for i in df.keys():
  b.append(i)
print(b)

['ID', 'CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT']


In [None]:
df = pd.get_dummies(df, columns = ['CHAS'])

In [None]:
df.head()


In [None]:
#outliers
x = df.describe().T
x

In [None]:
def outlierpresence(df):
  # Report, per numeric column, whether any value lies outside the 1.5 * IQR fences.
  numeric = df.select_dtypes(include='number')  # skips boolean dummy columns
  Q1 = numeric.quantile(0.25)
  Q3 = numeric.quantile(0.75)
  IQR = Q3 - Q1
  is_outlier = (numeric < (Q1 - 1.5 * IQR)) | (numeric > (Q3 + 1.5 * IQR))
  for i in numeric.keys():
    if is_outlier[i].any():  # at least one outlier in this column
      print('Outliers', '\033[1m' + 'present' + '\033[0m', 'in the data of', '\033[1m' + i + '\033[0m')
    else:
      print('Outliers', '\033[1m' + 'not present' + '\033[0m', 'in the data of', '\033[1m' + i + '\033[0m')
    print('-------------------------------')
outlierpresence(df)

In [None]:
def loweruppwhisker(df):
  for i in df.keys():
    Q1 = df[i].quantile(0.25)
    Q3 = df[i].quantile(0.75)
    IQR = Q3 - Q1
    whisker_width = 1.5
    lower_whisker = Q1 -(whisker_width*IQR)
    upper_whisker = Q3 + (whisker_width*IQR)
    print('\033[1m' + i + '\033[0m')
    print('-------------------------')
    print("Lower whisker: ",lower_whisker)
    print("Upper whisker: ", upper_whisker)
    print("%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%")
loweruppwhisker(df)

In [None]:
for k, v in df.items():
  # Percentage of values on or beyond the 1.5 * IQR fences, per column
  q1 = v.quantile(0.25)
  q3 = v.quantile(0.75)
  iqr = q3 - q1
  v_col = v[(v <= q1 - 1.5 * iqr) | (v >= q3 + 1.5 * iqr)]
  perc = len(v_col) * 100.0 / len(df)
  print("Column %s outliers = %.2f%%" % (k, perc))

In [None]:
plt.figure(figsize = (16, 12))
sns.heatmap(df.corr(), annot = True, fmt = '.2%')
# plt.savefig('../images/features_correlation.png')

In [None]:
sns.set_theme()

In [None]:
# From the correlation plot we can see greater POSITIVE correlation in this order
plt.title('Price of home vs average rooms per dwelling')
sns.scatterplot(data=df, x=df['MEDV'], y=df['RM'])
plt.show()

In [None]:
df[['MEDV','RM']].corr()

MEDV and RM show a good linear relationship and correlation.

The same check is repeated for the other main factors.

In [None]:
plt.title('Price of home vs residential land zoned for lots over 25,000 sq.ft')
sns.scatterplot(data=df, x=df['MEDV'], y=df['ZN'])
plt.show()

In [None]:
df[['MEDV','ZN']].corr()

In [None]:
plt.title('Price of home VS proportion of blacks by town')
sns.scatterplot(data=df, x=df['MEDV'], y=df['B'])
plt.show()
df[['MEDV','B']].corr()

In [None]:
plt.title('Price of home VS weighted distances to five Boston employment centres')
sns.scatterplot(data=df, x=df['MEDV'], y=df['DIS'])
plt.show()
df[['MEDV','DIS']].corr()

In [None]:
plt.title('Price of home VS per capita crime rate by town')
sns.scatterplot(data=df, x=df['MEDV'], y=df['CRIM'])
plt.show()
df[['MEDV','CRIM']].corr()

In [None]:
plt.title('Price of home VS proportion of non-retail business acres per town.')
sns.scatterplot(data=df, x=df['MEDV'], y=df['INDUS'])
plt.show()
df[['MEDV','INDUS']].corr()

In [None]:
plt.title('Price of home VS nitric oxides concentration')
sns.scatterplot(data=df, x=df['MEDV'], y=df['NOX'])
plt.show()
df[['MEDV','NOX']].corr()

In [None]:
plt.title('Price of home VS proportion of owner-occupied units built prior to 1940')
sns.scatterplot(data=df, x=df['MEDV'], y=df['AGE'])
plt.show()
df[['MEDV','AGE']].corr()

In [None]:
plt.title('Price of home VS index of accessibility to radial highways')
sns.scatterplot(data=df, x=df['MEDV'], y=df['RAD'])
plt.show()
df[['MEDV','RAD']].corr()

In [None]:
plt.title(r'Price of home VS full-value property-tax rate per \$10,000')  # raw string keeps \$ as a literal dollar sign
sns.scatterplot(data=df, x=df['MEDV'], y=df['TAX'])
plt.show()
df[['MEDV','TAX']].corr()

In [None]:
plt.title('Price of home VS pupil-teacher ratio by town')
sns.scatterplot(data=df, x=df['MEDV'], y=df['PTRATIO'])
plt.show()
df[['MEDV','PTRATIO']].corr()

In [None]:
#SPLITTING
b = []
for i in df.keys():
  b.append(i)
print(b)

In [None]:
b.remove('ID')
b.remove('MEDV')
print(b)

In [None]:
X = df[b].values  # array of features
y = df['MEDV'].values

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

In [None]:
from sklearn.preprocessing import StandardScaler  # standard scaling
scaler = StandardScaler()  # initialise the scaler
scaler.fit(X_train)  # learn the mean and standard deviation from the training data only
X_train_scaled = scaler.transform(X_train)  # transform the training data with those statistics
X_test_scaled = scaler.transform(X_test)  # transform the test data with the same statistics

In [None]:
#MODEL TRAINING
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train_scaled, y_train)

In [None]:
coeff_df = pd.DataFrame(regressor.coef_, index=b, columns=['Coefficient'])  # one row per feature name
y_pred = regressor.predict(X_test_scaled)
coeff_df

In [None]:
print(y_pred)

In [None]:
regressor.intercept_ # c

In [None]:
# Use a new name so the original dataframe df is not overwritten
results = pd.DataFrame({'Actual': y_test, 'Predicted': y_pred})
results

In [None]:
from sklearn import metrics
print('R2- SCORE:', metrics.r2_score(y_test,y_pred))
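
For reference, the R2 score reported above is 1 - SS_res / SS_tot, where SS_res is the sum of squared residuals and SS_tot is the sum of squared deviations of the true values from their mean. A minimal NumPy sketch of that formula follows; the function name r2_score_manual and the toy values are made up for illustration.

```python
import numpy as np

def r2_score_manual(y_true, y_pred):
    # R2 = 1 - (sum of squared residuals) / (total sum of squares)
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

y_true = [3.0, 5.0, 7.0, 9.0]
perfect = r2_score_manual(y_true, y_true)        # predictions equal the targets
mean_only = r2_score_manual(y_true, [6.0] * 4)   # always predicting the mean
print(perfect, mean_only)
```

A perfect fit gives 1, and a model that always predicts the mean of the targets gives 0, so the score measures how much better the regression is than that baseline.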

Feature selection

In [None]:
from sklearn.feature_selection import RFE
estimator = LinearRegression()
selector = RFE(estimator, n_features_to_select=3, step=1)
selector = selector.fit(X_train_scaled, y_train)
sorted(list(zip(selector.ranking_,b)))

In [None]:
df1 = pd.read_csv("/content/test.csv")
df1 = pd.get_dummies(df1, columns = ['CHAS'])
df1.head()

In [None]:
c = []
for i in df1.keys():
  c.append(i)
print(c)

In [None]:
c.remove('ID')

In [None]:
#Now take the feature value as the given data

X = df1[c].values
print(X)

In [None]:
X_test_scale = scaler.transform(X) #scale the data of features X

In [None]:
y_testpred = regressor.predict(X_test_scale)
print(y_testpred)

In [None]:
# The predicted value is added to the new dataframe

df1['MEDV'] = y_testpred
df1.head()
# Save the new dataframe as output.csv

df1.to_csv('output.csv', index=False)
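
The assignment text asks for output.csv to contain only two comma-separated columns, [ID,Output]. A minimal sketch of writing that layout is shown below; the ids and predictions are hypothetical stand-ins for the test IDs and model predictions, and the file is written to an in-memory buffer so the example does not touch the disk.

```python
import io
import pandas as pd

# Hypothetical stand-ins for the test IDs and the model predictions.
ids = [0, 1, 2]
predictions = [24.1, 19.8, 13.5]

# Keep only the two required columns, in the required order.
submission = pd.DataFrame({"ID": ids, "Output": predictions})

buffer = io.StringIO()  # in the notebook this would be the path 'output.csv'
submission.to_csv(buffer, index=False)
print(buffer.getvalue())
```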
