## 1. Introduction to Linear Regression

__Linear regression__ is a _basic_ and _commonly_ used type of __predictive analysis__. Regression addresses two questions:
- Does a set of __predictor variables__ do a good job of predicting an __outcome__ (dependent) variable?
- Which variables in particular are __significant predictors__ of the outcome variable, and in what way do they __impact__ it?

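These regression estimates are used to explain the __relationship between one dependent variable and one or more independent variables__. The simplest form of the regression equation, with one dependent variable $y$ and one independent variable $x$, is:<br/>
$y = \beta_0 + \beta_1x$

Here $\beta_0$ is the intercept (the predicted value of $y$ when $x = 0$) and $\beta_1$ is the slope (the change in $y$ for a one-unit increase in $x$).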

![image.png](https://miro.medium.com/max/1000/0*1RXk8ZlOUvlo9UGF)
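
As a minimal sketch of how $\beta_0$ and $\beta_1$ can be estimated, the snippet below fits a line to a small made-up dataset using the closed-form least-squares solution for a single predictor. The data and variable names are illustrative assumptions and are not part of the movie dataset used later.

```python
import numpy as np

# Hypothetical toy data generated from y = 2 + 3x (no noise), for illustration only
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 + 3.0 * x

# Closed-form ordinary least squares for one predictor:
#   beta_1 = sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
#   beta_0 = mean(y) - beta_1 * mean(x)
x_mean, y_mean = x.mean(), y.mean()
beta_1 = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
beta_0 = y_mean - beta_1 * x_mean

print(f"beta_0 = {beta_0:.2f}, beta_1 = {beta_1:.2f}")
```

Because the toy data lie exactly on a line, the fit recovers the intercept 2 and the slope 3. With real, noisy data the estimates would only approximate the underlying relationship.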


## 2. Data Loading and Description

#### Importing Packages

In [1]:
import pandas as pd
import numpy as np 
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

#### Importing the Dataset

In [15]:
data = pd.read_csv('../../data/movie_metadata.csv')
data.head(2)

Unnamed: 0,color,director_name,num_critic_for_reviews,duration,director_facebook_likes,actor_3_facebook_likes,actor_2_name,actor_1_facebook_likes,gross,genres,...,num_user_for_reviews,language,country,content_rating,budget,title_year,actor_2_facebook_likes,imdb_score,aspect_ratio,movie_facebook_likes
0,Color,James Cameron,723.0,178.0,0.0,855.0,Joel David Moore,1000.0,760505847.0,Action|Adventure|Fantasy|Sci-Fi,...,3054.0,English,USA,PG-13,237000000.0,2009.0,936.0,7.9,1.78,33000
1,Color,Gore Verbinski,302.0,169.0,563.0,1000.0,Orlando Bloom,40000.0,309404152.0,Action|Adventure|Fantasy,...,1238.0,English,USA,PG-13,300000000.0,2007.0,5000.0,7.1,2.35,0


#### Finding Missing Values

In [4]:
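def handle_missing_values(data):
    """Summarize missing values per column (count and rounded percentage)."""
    # Count nulls per column, most-missing columns first
    total = data.isnull().sum().sort_values(ascending=False)
    # Share of rows missing in each column, rounded to a whole percent
    percentage = round(total / data.shape[0] * 100)
    return pd.concat([total, percentage], axis=1, keys=['total', 'percentage'])

handle_missing_values(data)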

Unnamed: 0,total,percentage
gross,884,18.0
budget,492,10.0
aspect_ratio,329,7.0
content_rating,303,6.0
plot_keywords,153,3.0
title_year,108,2.0
director_name,104,2.0
director_facebook_likes,104,2.0
num_critic_for_reviews,50,1.0
actor_3_name,23,0.0


## 3. Splitting X and y into train and test datasets.

In [17]:
# Replace every missing value with 0 so that no rows are lost
data.fillna(0, inplace=True)

# Selecting features
features = ['actor_3_facebook_likes', 'actor_1_facebook_likes', 'gross',
            'num_voted_users', 'cast_total_facebook_likes', 'facenumber_in_poster',
            'num_user_for_reviews', 'budget', 'title_year',
            'actor_2_facebook_likes', 'aspect_ratio',
            'movie_facebook_likes']
target = ['imdb_score']
# Missing values were filled above, so X and y keep every row and stay aligned
X = data[features]
y = data[target]
# Hold out 20% of the rows for testing; random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=1)
print('Train cases as below')
print('X_train shape: ', X_train.shape)
print('y_train shape: ', y_train.shape)
print('\nTest cases as below')
print('X_test shape: ', X_test.shape)
print('y_test shape: ', y_test.shape)

Train cases as below
X_train shape:  (4034, 12)
y_train shape:  (4034, 1)

Test cases as below
X_test shape:  (1009, 12)
y_test shape:  (1009, 1)
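
If missing values are dropped rather than filled, it is safer to drop incomplete rows once from the combined frame before separating features and target. Dropping them from `X` and `y` independently can remove different rows and leave the two misaligned. The sketch below illustrates this with a small hypothetical DataFrame; the column names `a`, `b`, and `score` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; column names are illustrative, not from movie_metadata.csv
df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, 4.0],
    "b": [10.0, 20.0, 30.0, 40.0],
    "score": [5.0, 6.0, np.nan, 8.0],
})
features = ["a", "b"]
target = ["score"]

# Dropping independently removes different rows from X and y
X_bad = df[features].dropna()   # loses row 1 (missing "a")
y_bad = df[target].dropna()     # loses row 2 (missing "score")
print(X_bad.index.tolist(), y_bad.index.tolist())

# Dropping once on the combined columns keeps X and y aligned
clean = df.dropna(subset=features + target)
X, y = clean[features], clean[target]
print(X.index.tolist(), y.index.tolist())
```

In the first case the two frames have the same length but refer to different rows, which no shape check would catch. In the second, `X` and `y` share the same index.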


## 4. Linear regression in scikit-learn

Applying a machine learning algorithm to a dataset with scikit-learn takes four steps here:
1. Load the algorithm
2. Instantiate the model and fit it to the training dataset
3. Predict on the test set
4. Calculate the root mean squared error (RMSE)

The code block below shows how these steps are carried out:<br/>

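```
import numpy as np
from sklearn.linear_model import LinearRegression    # 1. load the algorithm
from sklearn.metrics import mean_squared_error

linreg = LinearRegression()                           # 2. instantiate the model...
linreg.fit(X_train, y_train)                          #    ...and fit it to the training data
y_pred_test = linreg.predict(X_test)                  # 3. predict on the test set
RMSE_test = np.sqrt(mean_squared_error(y_test, y_pred_test))  # 4. compute the RMSE
```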

In [19]:
model = LinearRegression()
model.fit(X_train, y_train)
prediction = model.predict(X_test)
np.sqrt(mean_squared_error(y_test, prediction))

0.985036028868051

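- __RMSE__ is often preferred over MSE for reporting, because RMSE is _interpretable_ in the units of "y".
    - It is easier to put in context because it has the same units as the response variable. Here, an RMSE of about 0.99 means predictions are typically off by roughly one IMDb score point.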
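
To make the point about units concrete, here is a small sketch that computes MSE and RMSE by hand with NumPy. The numbers are made up for illustration and are not model output from the movie dataset.

```python
import numpy as np

# Hypothetical true and predicted scores, for illustration only
y_true = np.array([7.0, 6.0, 8.0, 5.0])
y_pred = np.array([6.0, 6.0, 9.0, 7.0])

errors = y_true - y_pred        # per-sample errors: 1, 0, -1, -2
mse = np.mean(errors ** 2)      # mean of squared errors, in squared units of y
rmse = np.sqrt(mse)             # square root brings it back to the units of y

print(f"MSE = {mse:.3f}, RMSE = {rmse:.3f}")
```

The MSE of 1.5 is in squared score units, while the RMSE of about 1.22 can be read directly as a typical error size in score points.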