# Data Mining

## Linear Regression

### After completing materials of this notebook, you should be able to:

* Explain what linear regression is, how it is used and the benefits of using it.
* Recognize the necessary format for data in order to perform predictive linear regression.
* Explain the basic algebraic formula for calculating linear regression.
* Develop a linear regression data mining model using a training data set.
* Interpret the model’s coefficients and apply them to a scoring data set in order to deploy the model.

#### ORGANIZATIONAL UNDERSTANDING
    we are trying to predict heating oil usage for new customers

#### Data Understanding
* __Insulation__: This is a density rating, ranging from one to ten, indicating the thickness of each home’s insulation. A home with a density rating of one is poorly insulated, while a home with a density of ten has excellent insulation.
* __Temperature__: This is the average outdoor ambient temperature at each home for the most recent year, measure in degree Fahrenheit.
* __Heating_Oil__: This is the total number of units of heating oil purchased by the owner of each home in the most recent year.
* __Num_Occupants__: This is the total number of occupants living in each home.
* __Avg_Age__: This is the average age of those occupants.
* __Home_Size__: This is a rating, on a scale of one to eight, of the home’s overall size. The higher the number, the larger the home.

#### Data Preparation
    using linear regression as a predictive model, it is extremely important to remember that the ranges for all attributes in the scoring data must be within the ranges for the corresponding attributes in the training data

In [49]:
import pandas as pd
training_data = pd.read_csv('linear_regression_data.csv')
scoring_data = pd.read_csv('deployment_data.csv')

In [30]:
# Check for Missing data
training_data.isnull().values.any()
print(f'Is there any null value in training_data?? {training_data.isnull().values.any()}')
training_data[training_data.isnull().any(axis = 1)]

Is there any null value in training_data?? False


Unnamed: 0,Insulation,Temperature,Heating_Oil,Num_Occupants,Avg_Age,Home_Size


In [31]:
# Check for Missing data
scoring_data.isnull().values.any()
print(f'Is there any null value in scoring_data?? {scoring_data.isnull().values.any()}')
scoring_data[scoring_data.isnull().any(axis = 1)]

Is there any null value in scoring_data?? False


Unnamed: 0,Insulation,Temperature,Num_Occupants,Avg_Age,Home_Size,Predicted_Heatin_Oil


In [32]:
training_data.describe()

Unnamed: 0,Insulation,Temperature,Heating_Oil,Num_Occupants,Avg_Age,Home_Size
count,1218.0,1218.0,1218.0,1218.0,1218.0,1218.0
mean,3.785714,65.078818,197.394089,3.1133,42.706404,4.649425
std,2.768094,16.932425,56.248267,1.690605,15.051137,2.321226
min,0.0,38.0,114.0,1.0,15.1,1.0
25%,1.0,49.0,148.25,2.0,29.7,3.0
50%,4.0,60.0,185.0,3.0,42.9,5.0
75%,6.0,81.0,253.0,4.0,55.6,7.0
max,8.0,90.0,301.0,10.0,72.2,8.0


In [33]:
scoring_data.describe()

Unnamed: 0,Insulation,Temperature,Num_Occupants,Avg_Age,Home_Size,Predicted_Heatin_Oil
count,42650.0,42650.0,42650.0,42650.0,42650.0,42650.0
mean,4.010996,63.962087,5.489285,44.040131,4.495193,198.285437
std,2.575511,15.313351,2.874612,16.736901,2.290911,37.057353
min,0.0,38.0,1.0,15.0,1.0,96.666505
25%,2.0,51.0,3.0,29.5,3.0,169.616597
50%,4.0,64.0,5.0,44.1,4.0,198.386502
75%,6.0,77.0,8.0,58.6,6.0,226.893676
max,8.0,90.0,10.0,73.0,8.0,300.891633


In [35]:
print(scoring_data.columns)

Index(['Insulation', 'Temperature', 'Num_Occupants', 'Avg_Age', 'Home_Size',
       'Predicted_Heatin_Oil'],
      dtype='object')


In [36]:
training_data.columns

Index(['Insulation', 'Temperature', 'Heating_Oil', 'Num_Occupants', 'Avg_Age',
       'Home_Size'],
      dtype='object')

In [17]:
# we want to predict heating-oil so...
x = training_data.drop(['Heating_Oil'],axis=1)
y = training_data.Heating_Oil

In [37]:
x.head()

Unnamed: 0,Insulation,Temperature,Num_Occupants,Avg_Age,Home_Size
0,4,74,4,23.8,4
1,0,43,4,56.7,4
2,7,81,2,28.0,6
3,1,50,4,45.1,3
4,8,80,5,20.8,2


#### Modeling 
class sklearn.linear_model.LinearRegression(fit_intercept=True, normalize=False, copy_X=True, n_jobs=1)

__Parameters__:

__fit_intercept__ : boolean, optional, default True

    whether to calculate the intercept for this model. If set to False, no intercept will be used in calculations (e.g. data is expected to be already centered).

__normalize__ : boolean, optional, default False

    This parameter is ignored when fit_intercept is set to False. If True, the regressors X will be normalized before regression by subtracting the mean and dividing by the l2-norm. If you wish to standardize, please use sklearn.preprocessing.StandardScaler before calling fit on an estimator with normalize=False.

__copy_X__ : boolean, optional, default True

    If True, X will be copied; else, it may be overwritten.

__n_jobs__ : int, optional, default 1

    The number of jobs to use for the computation. If -1 all CPUs are used. This will only provide speedup for n_targets > 1 and sufficient large problems.

__Attributes__:	

__coef___ : array, shape (n_features, ) or (n_targets, n_features)

    Estimated coefficients for the linear regression problem. If multiple targets are passed during the fit (y 2D), this is a 2D array of shape (n_targets, n_features), while if only one target is passed, this is a 1D array of length n_features.

__intercept___ : array

    Independent term in the linear model.


In [38]:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

In [40]:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3)

In [43]:
lr = LinearRegression()
lr.fit(x_train, y_train)
y_train_predicted = lr.predict(x_train)
y_test_predicted = lr.predict(x_test)

In [41]:
y_test_predicted[:10]

array([189.51712098, 151.71171326, 170.03182213, 148.17719489,
       245.69163745, 252.04296693, 194.80430361, 127.42658043,
       126.92259429, 174.27547566])

### Evaluation

In [44]:
from sklearn.metrics import mean_squared_error
train_mse = mean_squared_error(y_true = y_train, y_pred = y_train_predicted)
test_mse = mean_squared_error(y_true = y_test, y_pred = y_test_predicted)
print('train_mse:', train_mse)
print('test_mse:', test_mse)

train_mse: 564.689979963641
test_mse: 588.3367577896204


In [45]:
for col , coef in zip(X_train.columns, lr.coef_):
    print(f'{col :15s}|{coef :^15.4f}|')

Insulation     |    -3.2884    |
Temperature    |    -0.8544    |
Num_Occupants  |    -0.2390    |
Avg_Age        |    1.9871     |
Home_Size      |    3.2789     |


* Insulation has negetive coefficient: So if a house has a thick Insulation the amount of heating oil will decrease
* Num_Occupants: The effect of Num_Occupants is very low, we can ignore it
* Home_Size: As much as the home size increases the oil which is required for house
* Avg_Age: Old people spend more time on shover and mostly want to keep the house warmer than young people...

#### Deployment

In [48]:
y_predicted

array([247.26731977, 216.40815181, 222.73635144, ..., 150.58341369,
       250.07287595, 220.48505967])

In [50]:
y_predicted = lr.predict(scoring_data)
scoring_data['Predicted_Heating_Oil'] = y_predicted
scoring_data

Unnamed: 0,Insulation,Temperature,Num_Occupants,Avg_Age,Home_Size,Predicted_Heating_Oil
0,5,69,10,70.1,7,251.195384
1,5,80,1,66.7,1,217.518543
2,6,89,9,67.8,7,226.488073
3,3,81,9,52.4,6,209.307842
4,6,58,8,22.9,7,163.991065
...,...,...,...,...,...,...
42645,3,63,9,59.1,8,244.558154
42646,3,84,1,43.6,5,187.891206
42647,8,67,1,27.3,4,150.304390
42648,2,58,1,65.7,3,250.750871


How much oil we will need, how much oil we will need for each house in average???

In [29]:
print(f' sum  of heating oils: {scoring_data.Predicted_Heatin_Oil.sum() :^15.4f} \n '
      f'mean of heating oils: {scoring_data.Predicted_Heatin_Oil.mean():^15.4f}')

 sum  of heating oils:  8456873.9019   
 mean of heating oils:    198.2854    


In [None]:
scoring_data.to_csv('results.csv', index_label=False)

*:)*