#### Predict your scores better :)
As you saw in the previous simple linear regression task, previous-year grades (G2) correlate strongly with third-year grades (G3). But G2 does not directly cause G3; many factors determine G3. Let's add a few more variables that may help to predict G3.

### Multiple linear regression
Multiple linear regression is simply linear regression extended to problems where the dependent (output) variable is determined by more than one independent variable.<br>

##### $\hat{y}(w, x) = w_0 + w_1 x_1 + \dots + w_p x_p$
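
As a minimal sketch of the formula above, a prediction is the intercept plus the dot product of the weights and the feature values. The weights and the sample below are made-up numbers for illustration, not values learned from the dataset.

```python
import numpy as np

# Hypothetical weights for a model with p = 3 features (illustrative values only)
w_0 = 2.0                        # intercept
w = np.array([0.5, -1.0, 3.0])   # w_1 ... w_p

# One sample with feature values x_1 ... x_p
x = np.array([4.0, 2.0, 1.0])

# y_hat = w_0 + w_1*x_1 + ... + w_p*x_p
y_hat = w_0 + np.dot(w, x)
print(y_hat)
```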

#### Dataset
The dataset is available at __"data/multiple_linear_data.csv"__ in the respective challenge's repo.<br><br>

This is the __modified version__ of the dataset *'Student Performance'* provided by UCI Machine Learning repository.<br>
Original dataset: https://archive.ics.uci.edu/ml/datasets/student+performance

#### Features (X)
1. age - student's age (numeric: from 15 to 22)
2. address - student's home address type (binary: 'U' - urban or 'R' - rural)
3. famsize - family size (binary: 'LE3' - less or equal to 3 or 'GT3' - greater than 3)
4. reason - reason to choose this school (nominal: close to 'home', school 'reputation', 'course' preference or 'other')
5. studytime - weekly study time (numeric: 1 - <2 hours, 2 - 2 to 5 hours, 3 - 5 to 10 hours, or 4 - >10 hours)
6. failures - number of past class failures (numeric: n if 1<=n<3, else 4)
7. schoolsup - extra educational support (binary: yes or no)
8. famsup - family educational support (binary: yes or no)
9. paid - extra paid classes within the course subject (Math or Portuguese) (binary: yes or no)
10. activities - extra-curricular activities (binary: yes or no)
11. higher - wants to take higher education (binary: yes or no)
12. internet - Internet access at home (binary: yes or no)
13. romantic - with a romantic relationship (binary: yes or no)
14. freetime - free time after school (numeric: from 1 - very low to 5 - very high)
15. goout - going out with friends (numeric: from 1 - very low to 5 - very high)
16. health - current health status (numeric: from 1 - very bad to 5 - very good)
17. absences - number of school absences (numeric: from 0 to 93)
18. G1 - first year math grades (numeric: from 0 to 100)
19. G2 - second year math grades (numeric: from 0 to 100)

#### Output target (Y) 
20. G3 - final year math grades (numeric: from 0 to 100, output target)

#### Objective
To learn multiple linear regression and practice handling categorical features

#### Tasks
- Load the data and print the first 5 rows
- Transform categorical features into numerical features. Use one-hot encoding, label encoding, or any other suitable preprocessing technique.
- Define the X matrix (independent features) and the y vector (target feature)
- Train a linear regression model (sklearn.linear_model.LinearRegression class)
- Print the 'Mean Squared Error' (MSE) obtained on the same dataset, i.e. the same X and y (sklearn.metrics.mean_squared_error function)
- Predict on a numpy array defined by you
```python
>>> new_data = np.array([1,0,1,.....,30,20]).reshape(1,-1)
>>> print("Predicted grade:",model.predict(new_data))
```

#### Further fun (will not be evaluated)
- Train Lasso and Ridge regression models as well (sklearn.linear_model.Lasso and sklearn.linear_model.Ridge). Read about them in the scikit-learn user guide.
- *Step-up challenge*: Bring the MSE (mean squared error) below 3.25 using linear models
- Implement multiple linear regression from scratch
- Plot loss curve (Loss vs number of iterations)
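
One possible starting point for the from-scratch item is batch gradient descent on the MSE loss. The sketch below is only an illustration under stated assumptions: the data is synthetic (not the challenge dataset), and the learning rate and iteration count are arbitrary choices.

```python
import numpy as np

# Synthetic data (an assumption for illustration, not the challenge dataset)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
true_w = np.array([1.5, -2.0, 0.5])
y = 4.0 + X @ true_w + rng.normal(scale=0.1, size=200)

# Prepend a column of ones so the intercept w_0 is learned with the other weights
Xb = np.hstack([np.ones((X.shape[0], 1)), X])
w = np.zeros(Xb.shape[1])

learning_rate = 0.1   # arbitrary choice for this sketch
losses = []
for _ in range(500):
    residuals = Xb @ w - y
    losses.append(np.mean(residuals ** 2))       # MSE at the current weights
    gradient = 2 * Xb.T @ residuals / len(y)     # gradient of the MSE with respect to w
    w -= learning_rate * gradient

print("Learned weights:", w)
print("Final loss:", losses[-1])
```

Plotting `losses` against the iteration index (for example with matplotlib) would give the loss curve mentioned in the last item.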

#### Helpful links
- Scikit-learn documentation for linear regression: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html
- Read till where you feel comfortable: https://jakevdp.github.io/PythonDataScienceHandbook/05.06-linear-regression.html
- Use slack for doubts: https://join.slack.com/t/deepconnectai/shared_invite/zt-givlfnf6-~cn3SQ43k0BGDrG9_YOn4g

In [29]:
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# IF sklearn.compose.ColumnTransformer is used for feature transformation, then below import will help to infer features
# from helper.utils import get_column_names_from_ColumnTransformer

In [59]:
# NOTE: DO NOT CHANGE THE VARIABLE NAME(S) IN THIS CELL
# Load the data
data = pd.read_csv('multiple_linear_data.csv')
data

Unnamed: 0,age,address,famsize,reason,studytime,failures,schoolsup,famsup,paid,activities,higher,internet,romantic,freetime,goout,health,absences,G1,G2,G3
0,18,U,GT3,course,2,0,yes,no,no,no,yes,no,no,3,4,3,6,25,30,30
1,17,U,GT3,course,2,0,no,yes,no,no,yes,yes,no,3,3,3,4,25,25,30
2,15,U,LE3,other,2,3,yes,no,yes,no,yes,yes,no,3,2,3,10,35,40,50
3,15,U,GT3,home,3,0,no,yes,yes,yes,yes,yes,yes,2,2,5,2,75,70,75
4,16,U,GT3,home,2,0,no,yes,yes,no,yes,no,no,3,2,5,4,30,50,50
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
390,20,U,LE3,course,2,2,no,yes,yes,no,yes,no,no,5,4,4,11,45,45,45
391,17,U,LE3,course,1,0,no,no,no,no,yes,yes,no,4,5,2,3,70,80,80
392,21,R,GT3,course,1,3,no,no,no,no,yes,no,no,5,3,3,3,50,40,35
393,18,R,LE3,course,1,0,no,no,no,no,yes,yes,no,4,1,5,0,55,60,50


In [31]:
# Keep a reference to the original DataFrame; its columns help interpret the features after transformation
orig_cols = data
print(orig_cols)

     age address famsize  reason  studytime  ...  health absences  G1  G2  G3
0     18       U     GT3  course          2  ...       3        6  25  30  30
1     17       U     GT3  course          2  ...       3        4  25  25  30
2     15       U     LE3   other          2  ...       3       10  35  40  50
3     15       U     GT3    home          3  ...       5        2  75  70  75
4     16       U     GT3    home          2  ...       5        4  30  50  50
..   ...     ...     ...     ...        ...  ...     ...      ...  ..  ..  ..
390   20       U     LE3  course          2  ...       4       11  45  45  45
391   17       U     LE3  course          1  ...       2        3  70  80  80
392   21       R     GT3  course          1  ...       3        3  50  40  35
393   18       R     LE3  course          1  ...       5        0  55  60  50
394   19       U     LE3  course          1  ...       5        5  40  45  45

[395 rows x 20 columns]


In [32]:
# Handle categorical values

categorical = data.select_dtypes(exclude=[np.number])

print(categorical.columns)

Index(['address', 'famsize', 'reason', 'schoolsup', 'famsup', 'paid',
       'activities', 'higher', 'internet', 'romantic'],
      dtype='object')


In [33]:
for i in categorical:
  print(i,data[i].unique())

address ['U' 'R']
famsize ['GT3' 'LE3']
reason ['course' 'other' 'home' 'reputation']
schoolsup ['yes' 'no']
famsup ['no' 'yes']
paid ['no' 'yes']
activities ['no' 'yes']
higher ['yes' 'no']
internet ['no' 'yes']
romantic ['no' 'yes']


In [34]:
data = pd.get_dummies(data, columns=['address', 'famsize', 'reason'], prefix=['address', 'famsize', 'reason'])
# One-hot encoding splits each listed column into one indicator column per category;
# prefix sets the name prefix of the new columns, e.g. address_R and address_U
data

Unnamed: 0,age,studytime,failures,schoolsup,famsup,paid,activities,higher,internet,romantic,freetime,goout,health,absences,G1,G2,G3,address_R,address_U,famsize_GT3,famsize_LE3,reason_course,reason_home,reason_other,reason_reputation
0,18,2,0,yes,no,no,no,yes,no,no,3,4,3,6,25,30,30,0,1,1,0,1,0,0,0
1,17,2,0,no,yes,no,no,yes,yes,no,3,3,3,4,25,25,30,0,1,1,0,1,0,0,0
2,15,2,3,yes,no,yes,no,yes,yes,no,3,2,3,10,35,40,50,0,1,0,1,0,0,1,0
3,15,3,0,no,yes,yes,yes,yes,yes,yes,2,2,5,2,75,70,75,0,1,1,0,0,1,0,0
4,16,2,0,no,yes,yes,no,yes,no,no,3,2,5,4,30,50,50,0,1,1,0,0,1,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
390,20,2,2,no,yes,yes,no,yes,no,no,5,4,4,11,45,45,45,0,1,0,1,1,0,0,0
391,17,1,0,no,no,no,no,yes,yes,no,4,5,2,3,70,80,80,0,1,0,1,1,0,0,0
392,21,1,3,no,no,no,no,yes,no,no,5,3,3,3,50,40,35,1,0,1,0,1,0,0,0
393,18,1,0,no,no,no,no,yes,yes,no,4,1,5,0,55,60,50,1,0,0,1,1,0,0,0
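
With full one-hot encoding, the indicator columns of one feature always sum to 1 (for example `address_R + address_U`), so one of them is redundant when the model also fits an intercept. One common option, shown here as a sketch on a made-up toy frame rather than the challenge data, is `drop_first=True`, which treats the first category as the baseline.

```python
import pandas as pd

# Toy frame with one binary and one four-level nominal column (made-up rows)
toy = pd.DataFrame({
    "address": ["U", "R", "U", "U"],
    "reason": ["course", "home", "other", "reputation"],
})

# Full one-hot encoding: one indicator column per category
full = pd.get_dummies(toy, columns=["address", "reason"], dtype=int)

# drop_first=True drops the first category of each feature, which becomes the baseline
reduced = pd.get_dummies(toy, columns=["address", "reason"], drop_first=True, dtype=int)

print(list(full.columns))
print(list(reduced.columns))
```

Either encoding can be used with `LinearRegression`; the reduced form mainly makes the coefficients easier to interpret, since each one is measured relative to the dropped category.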


In [35]:
categorical = categorical.drop(['address','famsize','reason'],axis=1)
categorical.columns

Index(['schoolsup', 'famsup', 'paid', 'activities', 'higher', 'internet',
       'romantic'],
      dtype='object')

In [36]:
# Label-encode the yes/no columns (no -> 0, yes -> 1)

from sklearn.preprocessing import LabelEncoder
label_encoder = LabelEncoder()

data[categorical.columns] = data[categorical.columns].apply(label_encoder.fit_transform)
data.head()

Unnamed: 0,age,studytime,failures,schoolsup,famsup,paid,activities,higher,internet,romantic,freetime,goout,health,absences,G1,G2,G3,address_R,address_U,famsize_GT3,famsize_LE3,reason_course,reason_home,reason_other,reason_reputation
0,18,2,0,1,0,0,0,1,0,0,3,4,3,6,25,30,30,0,1,1,0,1,0,0,0
1,17,2,0,0,1,0,0,1,1,0,3,3,3,4,25,25,30,0,1,1,0,1,0,0,0
2,15,2,3,1,0,1,0,1,1,0,3,2,3,10,35,40,50,0,1,0,1,0,0,1,0
3,15,3,0,0,1,1,1,1,1,1,2,2,5,2,75,70,75,0,1,1,0,0,1,0,0
4,16,2,0,0,1,1,0,1,0,0,3,2,5,4,30,50,50,0,1,1,0,0,1,0,0
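
To see why `yes` becomes 1 and `no` becomes 0, note that `LabelEncoder` assigns integer codes in sorted order of the class labels. The sketch below uses a made-up series; the explicit `map` at the end is an alternative that makes the coding visible in the code itself.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

answers = pd.Series(["yes", "no", "no", "yes"])  # made-up values

encoder = LabelEncoder()
encoded = encoder.fit_transform(answers)

# Classes are stored sorted, so index 0 is 'no' and index 1 is 'yes'
print(list(encoder.classes_))
print(list(encoded))

# An explicit mapping gives the same codes
mapped = answers.map({"no": 0, "yes": 1})
print(list(mapped))
```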


In [57]:
# Define your X and y
X = data.drop(['G3'],axis=1) # data
y = data.G3 # target variable

In [40]:
# Initialize the model
model = LinearRegression()

In [41]:
# Fit the model. Wait! We will complete this step for you ;)
model.fit(X, y)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

In [42]:
for i in range(len(X.columns)):
  print(X.columns[i]," : ", model.coef_[i])

age  :  -0.6675332646220739
studytime  :  -0.6943669320835469
failures  :  -1.018061412277915
schoolsup  :  1.9565425585994025
famsup  :  0.3680424363560679
paid  :  0.5390181854013278
activities  :  -1.8724388369868763
higher  :  1.1014048067709505
internet  :  -0.7863160948110591
romantic  :  -1.7004516417698874
freetime  :  0.517708183745894
goout  :  0.36097048547154426
health  :  0.4647795627473134
absences  :  0.22488449647052605
G1  :  0.18890245073867784
G2  :  0.9521395362761486
address_R  :  0.17302645310683856
address_U  :  -0.17302645310683826
famsize_GT3  :  -0.2111932055599734
famsize_LE3  :  0.2111932055599734
reason_course  :  -0.25533410351879354
reason_home  :  -1.3346098752466962
reason_other  :  1.1866852540996669
reason_reputation  :  0.4032587246658238


In [43]:
model.intercept_

-1.2304628372237758

In [44]:
# Print mean squared error
predictions = model.predict(X)
mse = mean_squared_error(y, predictions)  # signature is (y_true, y_pred)

print(f"\nMSE: {mse}")


MSE: 85.27642516663595
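
For reference, the MSE is the average of the squared differences between targets and predictions, and its square root (RMSE) is expressed in the same units as the grades. A small sketch with made-up numbers, checking a manual computation against `mean_squared_error`:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up targets and predictions for illustration
y_true = np.array([30.0, 50.0, 75.0, 50.0])
y_pred = np.array([28.0, 55.0, 70.0, 50.0])

manual_mse = np.mean((y_true - y_pred) ** 2)       # mean of squared residuals
library_mse = mean_squared_error(y_true, y_pred)   # same quantity from scikit-learn
rmse = np.sqrt(manual_mse)                         # back in grade units

print(manual_mse, library_mse, rmse)
```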


In [49]:
# Predict on your own data
new_data = np.array([20, 10, 2, 1, 1, 0, 1, 0, 0, 3, 2, 5, 4, 30, 50, 50, 0, 1, 1, 0, 0, 1, 0, 0])
print(new_data.shape)

(24,)


In [50]:
new_data = new_data.reshape((1,-1))
new_data.shape

(1, 24)

In [51]:
print("Predicted grade:",model.predict(new_data))

Predicted grade: [38.56906271]
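
A bare numpy array only works if its values are in exactly the same order as the training columns. One way to guard against ordering mistakes, sketched here on a made-up toy frame (the column names mirror a few dataset features, and the values are invented), is to build the sample as a one-row DataFrame and reorder it by the training columns.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Small made-up training frame for illustration only
X_toy = pd.DataFrame({
    "studytime": [1, 2, 3, 4, 2],
    "G1": [30, 45, 60, 80, 50],
    "G2": [35, 40, 70, 80, 55],
})
y_toy = pd.Series([35, 45, 70, 85, 55])

toy_model = LinearRegression().fit(X_toy, y_toy)

# Build the new sample from a dict, then reorder to the training column order
sample = pd.DataFrame([{"G2": 55, "G1": 50, "studytime": 2}])[X_toy.columns]
prediction = toy_model.predict(sample)
print(prediction)
```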


In [52]:
data.shape

(395, 25)

In [53]:
new_data = data[385:].drop(['G3'],axis=1)
new_data.head()

Unnamed: 0,age,studytime,failures,schoolsup,famsup,paid,activities,higher,internet,romantic,freetime,goout,health,absences,G1,G2,address_R,address_U,famsize_GT3,famsize_LE3,reason_course,reason_home,reason_other,reason_reputation
385,18,3,0,0,0,1,0,1,0,0,3,3,4,2,50,45,1,0,1,0,0,0,1,0
386,18,1,0,0,1,1,1,1,1,1,4,3,5,7,30,25,1,0,1,0,0,0,0,1
387,19,3,1,0,0,0,1,1,1,0,4,2,5,0,35,25,1,0,1,0,1,0,0,0
388,18,2,0,0,1,1,0,1,1,0,3,4,1,0,35,45,0,1,0,1,1,0,0,0
389,18,2,1,0,0,0,1,1,0,0,1,1,5,0,30,25,0,1,1,0,1,0,0,0


In [56]:
new_predictions = model.predict(new_data)

In [61]:
print("Predicted Value \t Actual Value")
print("-------------------------------")
for i,j in zip(new_predictions,data[385:].G3):
  print("{} \t {}".format(i,j))


Predicted Value 	 Actual Value
-------------------------------
44.69610401809026 	 50
20.596317704161066 	 30
16.66613764785002 	 0
39.28983748221501 	 40
15.609693742750409 	 0
43.497473247522386 	 45
81.69925832405849 	 80
35.081839928371856 	 35
58.77725373599114 	 50
41.45898484573693 	 45
