## Linear Regression
Linear regression models the relationship between a response variable and one or more explanatory variables, and estimates the value of the response using a line of best fit.
* In this notebook, we look at linear regression with multiple variables (multiple linear regression, often loosely called multivariate regression).

* Previously, the price depended only on area. Now we make the problem more complex by adding bedrooms and age because, in real life, a home's price depends on multiple factors.
* Before we tackle any machine learning problem, the first thing to do is carefully analyze the dataset (the training data).
* The second thing is to choose a machine learning algorithm that suits the dataset. If the relationship between the features and the target is roughly linear, we can use linear regression.
* So, let's start...
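
One quick way to judge whether a linear model is plausible is to look at how strongly each feature correlates with the target. The sketch below is only an illustration: it types in the five rows of the home price dataset used later in this notebook (the name `homes` is made up for this example) and prints the Pearson correlation of each column with price.

```python
import numpy as np
import pandas as pd

# The five rows shown later in this notebook, typed in by hand for illustration.
homes = pd.DataFrame({
    "area": [2600, 3000, 3200, 3600, 4000],
    "bedrooms": [3.0, 4.0, np.nan, 3.0, 5.0],
    "age": [20, 15, 18, 30, 8],
    "price": [550000, 565000, 610000, 595000, 760000],
})

# Pearson correlation of every column with price; rows with NaN are skipped pairwise.
corr_with_price = homes.corr()["price"]
print(corr_with_price)
```

A correlation close to +1 or -1 suggests a roughly linear relationship. With only five rows this is a rough hint, not a proof.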


In [85]:
# Required modules...
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn import linear_model

In [88]:
# Next, read the data into a DataFrame:
df = pd.read_csv("homeprices.csv")
df

Unnamed: 0,area,bedrooms,age,price
0,2600,3.0,20,550000
1,3000,4.0,15,565000
2,3200,,18,610000
3,3600,3.0,30,595000
4,4000,5.0,8,760000
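
If `homeprices.csv` is not at hand, the same table can be rebuilt from the rows shown above. This is a minimal sketch that assumes the file holds exactly these five rows. The names `csv_text` and `df_inline` are made up for this example.

```python
import io
import pandas as pd

# Same rows as the output above; the empty field becomes NaN when parsed.
csv_text = """area,bedrooms,age,price
2600,3,20,550000
3000,4,15,565000
3200,,18,610000
3600,3,30,595000
4000,5,8,760000
"""

df_inline = pd.read_csv(io.StringIO(csv_text))
print(df_inline)

# Count the missing values per column before doing anything else.
print(df_inline.isna().sum())
```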


In [89]:
# As we can see, there is a missing value in the bedrooms column.
# A suitable way to handle it is to fill it with the median of the bedrooms column:
df.bedrooms.median()

3.5

In [90]:
# To keep only whole numbers, we import the math module and round the median down:
import math
bedrooms_median = math.floor(df.bedrooms.median())
bedrooms_median

3

In [22]:
# Now we fill the missing (NaN) value using the fillna() function:
df.bedrooms.fillna(bedrooms_median)

0    3.0
1    4.0
2    3.0
3    3.0
4    5.0
Name: bedrooms, dtype: float64

In [91]:
# This returns a complete Series, but df itself is unchanged. We need to assign the result back to the original column:
df.bedrooms = df.bedrooms.fillna(bedrooms_median)
df

Unnamed: 0,area,bedrooms,age,price
0,2600,3.0,20,550000
1,3000,4.0,15,565000
2,3200,3.0,18,610000
3,3600,3.0,30,595000
4,4000,5.0,8,760000


The missing value is now filled, so the data preprocessing step is over.
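
The preprocessing steps above can be condensed into a few lines, with a final check that no missing values remain. This is a self-contained sketch that types the five rows in by hand. The names `homes` and `missing_per_column` are made up for this example.

```python
import math
import numpy as np
import pandas as pd

homes = pd.DataFrame({
    "area": [2600, 3000, 3200, 3600, 4000],
    "bedrooms": [3.0, 4.0, np.nan, 3.0, 5.0],
    "age": [20, 15, 18, 30, 8],
    "price": [550000, 565000, 610000, 595000, 760000],
})

# The median of the known bedroom counts is 3.5; floor() rounds it down to 3.
bedrooms_median = math.floor(homes["bedrooms"].median())

# fillna() returns a new Series, so assign it back to the column.
homes["bedrooms"] = homes["bedrooms"].fillna(bedrooms_median)

# Every count should now be zero.
missing_per_column = homes.isna().sum()
print(missing_per_column)
```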

In [92]:
# Now comes the model training phase. First we create an object of the LinearRegression class, then we call the fit() method to train the model.
# fit() takes the independent variables (area, bedrooms, age) as its first argument and the dependent variable (price) as its second.
reg = linear_model.LinearRegression()
reg.fit(df[['area', 'bedrooms', 'age']], df.price)

LinearRegression()

* So our model is trained (a linear model with multiple variables).
* It is now ready for prediction.
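
scikit-learn's `LinearRegression` fits an ordinary least squares model. As a rough illustration of what `fit()` works out, the sketch below solves the same least squares problem directly with `numpy.linalg.lstsq` on the five preprocessed rows. This is a swapped-in technique for illustration, not necessarily the code path scikit-learn runs, and the variable names are made up for this example.

```python
import numpy as np

# Features (area, bedrooms, age) after the missing bedrooms value was filled with 3.
X = np.array([
    [2600, 3, 20],
    [3000, 4, 15],
    [3200, 3, 18],
    [3600, 3, 30],
    [4000, 5, 8],
], dtype=float)
y = np.array([550000, 565000, 610000, 595000, 760000], dtype=float)

# Append a column of ones so the last solved weight acts as the intercept.
X_with_bias = np.column_stack([X, np.ones(len(X))])

# lstsq returns (solution, residuals, rank, singular values); only the solution is needed.
weights, *_ = np.linalg.lstsq(X_with_bias, y, rcond=None)
coefficients, intercept = weights[:3], weights[3]
print(coefficients, intercept)
```

The solved weights can be compared with `reg.coef_` and `reg.intercept_` in the cells that follow.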

In [93]:
# Before predicting, we can check the coefficients and the intercept.
# To check the coefficients:
reg.coef_

array([   137.25, -26025.  ,  -6825.  ])

In [94]:
# To check intercept:
reg.intercept_

383724.9999999998
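
With the coefficients and intercept reported above, the fitted model can be written out as an equation (the intercept is rounded here for readability):

$$
\text{price} = 137.25 \cdot \text{area} - 26025 \cdot \text{bedrooms} - 6825 \cdot \text{age} + 383725
$$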

In [95]:
# Now let's predict the price of a home with an area of 3000, 3 bedrooms, and an age of 40 years:
reg.predict([[3000, 3, 40]])



array([444400.])

So the predicted price is 444400. It is relatively low because the home is old.

In [96]:
# Let's see how it works internally:
price = 137.25*3000 + -26025*3 + -6825*40 + 383724.9999999998
price

444399.9999999998

* So it's giving us the same result.
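
The same hand calculation can be written as a dot product, which scales to any number of features. This sketch copies the coefficient and intercept values printed earlier instead of refitting the model. The variable names are made up for this example.

```python
import numpy as np

coef = np.array([137.25, -26025.0, -6825.0])  # reg.coef_ as printed above
intercept = 383724.9999999998                 # reg.intercept_ as printed above
home = np.array([3000, 3, 40])                # area, bedrooms, age

# Prediction = sum of (coefficient * feature value) + intercept.
price = coef @ home + intercept
print(price)
```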

In [97]:
# Similarly, let's predict the price of another home:
reg.predict([[2500, 4, 5]])



array([588625.])

* So we see how each factor (area, bedrooms, age) affects the predicted price...

### Exercise
In the exercise folder (same level as this notebook on GitHub) there is **hiring.csv**. This file contains hiring statistics for a firm, such as a candidate's experience, written test score, and personal interview score. Based on these 3 factors, HR decides the salary. Given this data, you need to build a machine learning model for the HR department that can help them decide salaries for future candidates. Use it to predict salaries for the following candidates:
* 2 years of experience, test score 9, interview score 6
* 12 years of experience, test score 10, interview score 10

In [38]:
# First, let's read the CSV file:
dfe = pd.read_csv("Exercise/hiring.csv")
dfe

Unnamed: 0,experience,test_score(out of 10),interview_score(out of 10),salary($)
0,,8.0,9,50000
1,,8.0,6,45000
2,five,6.0,7,60000
3,two,10.0,10,65000
4,seven,9.0,6,70000
5,three,7.0,10,62000
6,ten,,7,72000
7,eleven,7.0,8,80000


In [41]:
# The first step is preprocessing, since we have missing data.
# For the 'experience' column, we assume a missing value means zero experience, so we fill it with "zero":
dfe.experience = dfe.experience.fillna("zero")
dfe.experience

0      zero
1      zero
2      five
3       two
4     seven
5     three
6       ten
7    eleven
Name: experience, dtype: object

In [42]:
# So now let's refresh our DataFrame:
dfe

Unnamed: 0,experience,test_score(out of 10),interview_score(out of 10),salary($)
0,zero,8.0,9,50000
1,zero,8.0,6,45000
2,five,6.0,7,60000
3,two,10.0,10,65000
4,seven,9.0,6,70000
5,three,7.0,10,62000
6,ten,,7,72000
7,eleven,7.0,8,80000


In [70]:
# Before handling test_score(out of 10), we rename the score and salary columns to simpler names:
dfe.rename(columns={
    "test_score(out of 10)": "test_score",
    "interview_score(out of 10)": "interview_score",
    "salary($)": "salary"
}, inplace=True)
dfe

Unnamed: 0,experience,test_score,interview_score,salary
0,zero,8.0,9,50000
1,zero,8.0,6,45000
2,five,6.0,7,60000
3,two,10.0,10,65000
4,seven,9.0,6,70000
5,three,7.0,10,62000
6,ten,,7,72000
7,eleven,7.0,8,80000


In [71]:
# Now we compute the median of the test_score column to handle its missing value:
test_score_median = dfe.test_score.median()
test_score_median

8.0

In [72]:
# So we fill the missing cell with 8.0:
dfe.test_score = dfe.test_score.fillna(test_score_median)
dfe

Unnamed: 0,experience,test_score,interview_score,salary
0,zero,8.0,9,50000
1,zero,8.0,6,45000
2,five,6.0,7,60000
3,two,10.0,10,65000
4,seven,9.0,6,70000
5,three,7.0,10,62000
6,ten,8.0,7,72000
7,eleven,7.0,8,80000


In [73]:
# Next, we convert the experience values from words to numbers, because ML models only understand numbers. For that we import the 'word2number' package:
from word2number import w2n
dfe.experience = dfe.experience.astype(str)
dfe.experience = dfe.experience.apply(w2n.word_to_num)

In [74]:
dfe

Unnamed: 0,experience,test_score,interview_score,salary
0,0,8.0,9,50000
1,0,8.0,6,45000
2,5,6.0,7,60000
3,2,10.0,10,65000
4,7,9.0,6,70000
5,3,7.0,10,62000
6,10,8.0,7,72000
7,11,7.0,8,80000


* **Yes!** The preprocessing step is over.

In [76]:
# Now we can train our model:
rege = linear_model.LinearRegression()
rege.fit(dfe[['experience', 'test_score', 'interview_score']], dfe.salary)

LinearRegression()

In [77]:
# The model is trained. Now we can check the coefficients and the intercept.
# First, the coefficients:
rege.coef_

array([2812.95487627, 1845.70596798, 2205.24017467])

In [79]:
# Intercept:
rege.intercept_

17737.26346433771

In [80]:
# Now let's perform prediction:
rege.predict([[2, 9, 6]])



array([53205.96797671])

In [81]:
# 2nd prediction:
rege.predict([[12, 10, 10]])



array([92002.18340611])
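
As with the home price model, both predictions can be checked by hand. This sketch copies the coefficient and intercept values printed above and applies them to the two candidates in one matrix product. The variable names are made up for this example.

```python
import numpy as np

coef = np.array([2812.95487627, 1845.70596798, 2205.24017467])  # rege.coef_ as printed
intercept = 17737.26346433771                                   # rege.intercept_ as printed

# One row per candidate: experience, test_score, interview_score.
candidates = np.array([
    [2, 9, 6],
    [12, 10, 10],
])

# Each salary = dot(candidate row, coef) + intercept.
salaries = candidates @ coef + intercept
print(salaries)
```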

That was all about linear regression with multiple variables...