## PROBLEM STATEMENT:

##### Given a dataset of air samples from different locations, each described by 5 features of the air, predict the air quality index and compare the predictions with the "target" column in the given "Test" file.

## Solution:

### Importing Libraries and Dataset

In [1]:
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

In [2]:
trainData = pd.read_csv('Train.csv')
testData = pd.read_csv('Test.csv')

### Previewing the Data

In [3]:
trainData.head()  # .head() gives first 5 rows

Unnamed: 0,feature_1,feature_2,feature_3,feature_4,feature_5,target
0,0.293416,-0.945599,-0.421105,0.406816,0.525662,-82.154667
1,-0.836084,-0.189228,-0.776403,-1.053831,0.597997,-48.89796
2,0.236425,0.132836,-0.147723,0.699854,-0.187364,77.270371
3,0.175312,0.143194,-0.581111,-0.122107,-1.292168,-2.988581
4,-1.693011,0.542712,-2.798729,-0.686723,1.244077,-37.596722


In [4]:
testData.head()

Unnamed: 0,feature_1,feature_2,feature_3,feature_4,feature_5,target
0,1.015254,2.076209,-0.266435,-2.418088,-0.980503,114.583689
1,-0.375021,0.953185,0.626719,0.704406,-0.355489,118.012815
2,-1.024452,0.962991,-0.407942,-1.861274,0.455201,-20.739852
3,-2.489841,0.544802,0.601219,-0.607021,-1.314286,-43.936899
4,-0.384675,-0.833624,1.358552,-0.547932,0.411925,-95.914898


#### Separating X-Train, Y-Train and X-Test, Y-Test

In [5]:
dfXTrain = trainData.iloc[:, :5]  # first 5 columns are the features (the end index is exclusive)
dfXTrain.head()

Unnamed: 0,feature_1,feature_2,feature_3,feature_4,feature_5
0,0.293416,-0.945599,-0.421105,0.406816,0.525662
1,-0.836084,-0.189228,-0.776403,-1.053831,0.597997
2,0.236425,0.132836,-0.147723,0.699854,-0.187364
3,0.175312,0.143194,-0.581111,-0.122107,-1.292168
4,-1.693011,0.542712,-2.798729,-0.686723,1.244077


In [6]:
dfYTrain = trainData.iloc[:, 5]  # column at position 5 is the target
dfYTrain.head()

0   -82.154667
1   -48.897960
2    77.270371
3    -2.988581
4   -37.596722
Name: target, dtype: float64

In [7]:
dfXTest = testData.iloc[:, :5]  # first 5 columns are the features (the end index is exclusive)
dfXTest.head()

Unnamed: 0,feature_1,feature_2,feature_3,feature_4,feature_5
0,1.015254,2.076209,-0.266435,-2.418088,-0.980503
1,-0.375021,0.953185,0.626719,0.704406,-0.355489
2,-1.024452,0.962991,-0.407942,-1.861274,0.455201
3,-2.489841,0.544802,0.601219,-0.607021,-1.314286
4,-0.384675,-0.833624,1.358552,-0.547932,0.411925


In [8]:
dfYTest = testData.iloc[:, 5]  # column at position 5 is the target
dfYTest.head()

0    114.583689
1    118.012815
2    -20.739852
3    -43.936899
4    -95.914898
Name: target, dtype: float64
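
Positional slicing with `iloc` relies on the columns staying in the same order. As a hedged alternative, the sketch below selects the features and the target by column name. The small `sample` frame is rebuilt by hand from the first three training rows previewed above and is only for illustration; `feature_cols`, `X` and `y` are names introduced here, not part of the original notebook.

```python
import pandas as pd

# First three rows of the training preview, retyped by hand for illustration.
sample = pd.DataFrame({
    "feature_1": [0.293416, -0.836084, 0.236425],
    "feature_2": [-0.945599, -0.189228, 0.132836],
    "feature_3": [-0.421105, -0.776403, -0.147723],
    "feature_4": [0.406816, -1.053831, 0.699854],
    "feature_5": [0.525662, 0.597997, -0.187364],
    "target": [-82.154667, -48.897960, 77.270371],
})

# Every column except "target" is treated as a feature.
feature_cols = [c for c in sample.columns if c != "target"]
X = sample[feature_cols]
y = sample["target"]

print(X.shape, y.shape)
```

Selecting by name keeps working if a column is added or reordered in the CSV, whereas the positional version would silently pick up the wrong columns.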

#### Preparing the data for training: converting DataFrames to NumPy arrays

In [9]:
xTrain = dfXTrain.values 
yTrain = dfYTrain.values
print(xTrain.shape,yTrain.shape)

(1600, 5) (1600,)


In [10]:
xTrain[0]

array([ 0.29341624, -0.94559871, -0.42110515,  0.40681602,  0.52566183])

In [11]:
xTest = dfXTest.values
yTest = dfYTest.values
print(xTest.shape,yTest.shape)

(400, 5) (400,)


In [12]:
xTest[0]

array([ 1.01525387,  2.07620944, -0.26643482, -2.4180882 , -0.98050279])

### Training

In [13]:
model = LinearRegression()

In [14]:
model.fit(xTrain,yTrain)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
         normalize=False)
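
To give a feel for what an ordinary least squares fit computes, here is a minimal sketch that solves the same kind of problem with `numpy.linalg.lstsq`. It runs on synthetic data generated inside the block, not on `Train.csv`, and the values in `true_bias` and `true_coef` are arbitrary choices made for this illustration. It is not meant as a description of scikit-learn's internals.

```python
import numpy as np

rng = np.random.default_rng(0)

# Arbitrary "ground truth" parameters, assumed only for this illustration.
true_bias = 5.0
true_coef = np.array([30.0, 90.0, 8.0, 45.0, 2.5])

# Synthetic data: 200 samples, 5 features, small Gaussian noise on the target.
X = rng.normal(size=(200, 5))
y = true_bias + X @ true_coef + rng.normal(scale=0.1, size=200)

# Append a column of ones so the intercept is estimated together with the weights.
X_aug = np.column_stack([X, np.ones(len(X))])
solution, *_ = np.linalg.lstsq(X_aug, y, rcond=None)
coef_hat, bias_hat = solution[:-1], solution[-1]

print(coef_hat)
print(bias_hat)
```

With so little noise, the recovered `coef_hat` and `bias_hat` should land close to the values used to generate the data.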

### Predicting

In [15]:
output = model.predict(xTest)

#### Inspecting the parameters LEARNED by the model (intercept and coefficients)

In [16]:
bias = model.intercept_
coeff = model.coef_
print(bias)
print(coeff)

4.990966735574957
[29.68187118 92.66247759  8.28062089 44.76773522  2.51916121]
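
A linear model predicts with `bias + coeff · x`. The sketch below applies that formula by hand to the first test row (`xTest[0]`), using the intercept and coefficients printed above. The names `x0` and `prediction` are introduced here for illustration.

```python
import numpy as np

# Learned parameters, copied from the output above.
bias = 4.990966735574957
coeff = np.array([29.68187118, 92.66247759, 8.28062089, 44.76773522, 2.51916121])

# First test sample, copied from the xTest[0] output.
x0 = np.array([1.01525387, 2.07620944, -0.26643482, -2.4180882, -0.98050279])

# Prediction = intercept + dot product of the weights and the features.
prediction = bias + coeff @ x0
print(prediction)
```

The result should come out close to the true target of that row (114.583689), which is what `model.predict` does for every row of `xTest`.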


#### Comparing the Predicted and True values (R² score)

In [17]:
model.score(xTest, yTest)  # R² (coefficient of determination) on the test set

0.999926185789197
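
For a regressor, `score` reports the coefficient of determination, R² = 1 − SS_res / SS_tot. As a rough sketch of that formula, the block below recomputes it by hand on only the five test rows previewed earlier, using the printed intercept and coefficients. Because it covers five rows rather than all 400, the value will not match the output above exactly. The names `X5`, `y5`, `r2` and `rmse` are introduced here for illustration.

```python
import numpy as np

# Learned parameters, copied from the printed output.
bias = 4.990966735574957
coeff = np.array([29.68187118, 92.66247759, 8.28062089, 44.76773522, 2.51916121])

# The five test rows and targets shown by testData.head().
X5 = np.array([
    [1.015254, 2.076209, -0.266435, -2.418088, -0.980503],
    [-0.375021, 0.953185, 0.626719, 0.704406, -0.355489],
    [-1.024452, 0.962991, -0.407942, -1.861274, 0.455201],
    [-2.489841, 0.544802, 0.601219, -0.607021, -1.314286],
    [-0.384675, -0.833624, 1.358552, -0.547932, 0.411925],
])
y5 = np.array([114.583689, 118.012815, -20.739852, -43.936899, -95.914898])

pred = bias + X5 @ coeff

ss_res = np.sum((y5 - pred) ** 2)        # residual sum of squares
ss_tot = np.sum((y5 - y5.mean()) ** 2)   # total sum of squares around the mean
r2 = 1.0 - ss_res / ss_tot

# Root mean squared error, in the same units as the target.
rmse = np.sqrt(np.mean((y5 - pred) ** 2))

print(r2, rmse)
```

An error measure such as RMSE is a useful companion to R² because it states the typical prediction error in the units of the target.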

### Hence, our model predicts the Air Quality with an R² score of about 0.9999 on the test set, i.e. it explains nearly all of the variance in the target