<h1>Multiple Linear Regression</h1>

Multiple linear regression is a statistical technique that models the relationship between one dependent variable and two or more independent variables. It extends simple linear regression by using several predictors to estimate the target variable, which makes it suitable for outcomes that depend on more than one factor.

Key Aspects of Multiple Linear Regression

	1.	Equation: The model can be represented as:
        \[
        Y = b_0 + b_1X_1 + b_2X_2 + \dots + b_nX_n + \epsilon
        \]

	•	\( Y \) is the dependent variable (target).
	•	\( X_1, X_2, \dots, X_n \) are independent variables (predictors).
	•	\( b_0 \) is the intercept, and \( b_1, b_2, \dots, b_n \) are the coefficients for each predictor.
	•	\( \epsilon \) is the error term.
    
	2.	Objective: To find the coefficients of the best-fitting hyperplane (a line when there is a single predictor), that is, the coefficients that minimize the sum of the squared differences (errors) between the observed and predicted values of \( Y \).
    
	3.	Assumptions:
	•	Linearity: The relationship between the dependent and independent variables is linear.
	•	Independence: Observations should be independent of each other.
	•	Homoscedasticity: Constant variance of the error terms.
	•	Normality of Errors: The residuals (errors) are normally distributed.
    
	4.	Interpretation:
	•	Coefficients: Each coefficient \( b_i \) represents the expected change in \( Y \) for a one-unit change in the respective \( X_i \), assuming the other variables are held constant.
	•	Significance: The p-value for each coefficient indicates whether that predictor has a statistically significant association with \( Y \).
	•	Goodness of Fit: Metrics like \( R^2 \) and Adjusted \( R^2 \) assess the model’s explanatory power.
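
The equation, the least-squares objective, and the goodness-of-fit metrics above can be sketched in a few lines of NumPy. This is a minimal illustration on synthetic data, not the dataset used later in this notebook: the sample size, the "true" coefficients, and the noise level are all assumptions chosen for the example.

```python
import numpy as np

# Synthetic data (assumed values, for illustration only).
rng = np.random.default_rng(0)
n, p = 200, 3                        # n observations, p predictors
X_demo = rng.normal(size=(n, p))
true_b0 = 4.0                        # assumed intercept b_0
true_b = np.array([2.0, -1.0, 0.5])  # assumed coefficients b_1..b_3
y_demo = true_b0 + X_demo @ true_b + rng.normal(scale=0.1, size=n)

# Prepend a column of ones so the intercept is estimated with the coefficients.
X_design = np.column_stack([np.ones(n), X_demo])

# Least squares: choose beta to minimize the sum of squared errors.
beta, *_ = np.linalg.lstsq(X_design, y_demo, rcond=None)

# Goodness of fit.
y_hat = X_design @ beta
ss_res = np.sum((y_demo - y_hat) ** 2)          # residual sum of squares
ss_tot = np.sum((y_demo - y_demo.mean()) ** 2)  # total sum of squares
r2 = 1 - ss_res / ss_tot
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)   # penalizes extra predictors

print("estimated [b_0, b_1, b_2, b_3]:", beta)
print("R^2:", r2, "Adjusted R^2:", adj_r2)
```

The estimated coefficients should land close to the assumed ones, and Adjusted \( R^2 \) is never larger than \( R^2 \) because it discounts the fit by the number of predictors.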

Applications

Multiple linear regression is used widely in fields like finance, healthcare, and marketing to predict outcomes based on several factors. Examples include predicting house prices based on features like size, location, and number of rooms or forecasting sales based on advertising spend, season, and economic indicators.

<h2>Import Libraries</h2>

In [1]:
import pandas as pd 
import numpy as np
import matplotlib.pyplot as plt

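<h2>Importing the Dataset</h2>

In [12]:
# Every column except the last is a feature; the last column (Profit) is the target.
dataset = pd.read_csv("50_Startups.csv")
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, -1].values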

In [13]:
dataset.head(10)

,R&D Spend,Administration,Marketing Spend,State,Profit
0,165349.2,136897.8,471784.1,New York,192261.83
1,162597.7,151377.59,443898.53,California,191792.06
2,153441.51,101145.55,407934.54,Florida,191050.39
3,144372.41,118671.85,383199.62,New York,182901.99
4,142107.34,91391.77,366168.42,Florida,166187.94
5,131876.9,99814.71,362861.36,New York,156991.12
6,134615.46,147198.87,127716.82,California,156122.51
7,130298.13,145530.06,323876.68,Florida,155752.6
8,120542.52,148718.95,311613.29,New York,152211.77
9,123334.88,108679.17,304981.62,California,149759.96


<h2>Encoding Categorical Columns</h2>

In [14]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# One-hot encode the State column (index 3); the numeric columns pass through unchanged.
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [3])], remainder='passthrough')
X = np.array(ct.fit_transform(X))
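
To make the result of the encoding step concrete, here is a NumPy-only sketch of what one-hot encoding does to the first three rows shown in the table above. It is a stand-in for illustration, not scikit-learn's implementation; it mimics two behaviors of the `ColumnTransformer` call: categories are ordered alphabetically, and the encoded columns are placed before the passed-through numeric columns.

```python
import numpy as np

# First three feature rows from dataset.head(10): R&D Spend, Administration, Marketing Spend, State.
rows = np.array([
    [165349.20, 136897.80, 471784.10, "New York"],
    [162597.70, 151377.59, 443898.53, "California"],
    [153441.51, 101145.55, 407934.54, "Florida"],
], dtype=object)

# Map each state to an integer code; np.unique returns the categories sorted.
states = rows[:, 3].astype(str)
categories, codes = np.unique(states, return_inverse=True)

# One row of the identity matrix per observation gives the 0/1 indicator columns.
one_hot = np.eye(len(categories))[codes]

# Indicator columns first, then the untouched numeric columns.
encoded = np.column_stack([one_hot, rows[:, :3].astype(float)])

print(categories)
print(encoded)
```

Each state becomes three indicator columns (California, Florida, New York), so the four original feature columns turn into six.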

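<h2>Splitting the Dataset into Training and Test Sets</h2>

In [15]:
from sklearn.model_selection import train_test_split
# Hold out 20% of the rows for testing; random_state makes the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)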

<h2>Training the Model on Training Set</h2>

In [18]:
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train,y_train)
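
After fitting, a `LinearRegression` model exposes the estimated intercept as `intercept_` and the coefficients as `coef_`. The sketch below shows this on a tiny, noise-free dataset whose values are made up for the example (they are not taken from `50_Startups.csv`), so the recovered numbers can be checked by eye.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up data generated from y = 1 + 2*x1 + 3*x2 (assumed for illustration).
X_demo = np.array([[1, 2], [2, 1], [3, 4], [4, 3], [5, 6]], dtype=float)
y_demo = 1 + 2 * X_demo[:, 0] + 3 * X_demo[:, 1]

model = LinearRegression()
model.fit(X_demo, y_demo)

print("intercept (b_0):", model.intercept_)
print("coefficients (b_1, b_2):", model.coef_)
```

The same attributes are available on `regressor` in this notebook. Because the encoder places the one-hot State columns first, the first three entries of `regressor.coef_` belong to the state indicators and the remaining three to the numeric features.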

<h2>Predicting Test Set Results</h2>

In [20]:
y_pred = regressor.predict(X_test)
np.set_printoptions(precision=2)
# Print predicted profit (left column) next to actual profit (right column).
print(np.concatenate((y_pred.reshape(-1, 1), y_test.reshape(-1, 1)), axis=1))

[[126362.88 134307.35]
 [ 84608.45  81005.76]
 [ 99677.49  99937.59]
 [ 46357.46  64926.08]
 [128750.48 125370.37]
 [ 50912.42  35673.41]
 [109741.35 105733.54]
 [100643.24 107404.34]
 [ 97599.28  97427.84]
 [113097.43 122776.86]]
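
Reading ten pairs by eye only goes so far, so a summary metric helps. The sketch below copies the predicted and actual values printed above into arrays and computes the mean absolute error and \( R^2 \) with NumPy. The variable names are chosen for this example; in the notebook itself, `sklearn.metrics.r2_score(y_test, y_pred)` and `sklearn.metrics.mean_absolute_error(y_test, y_pred)` compute the same two quantities.

```python
import numpy as np

# Values copied from the printed output above (left column: predicted, right column: actual).
predicted = np.array([126362.88, 84608.45, 99677.49, 46357.46, 128750.48,
                      50912.42, 109741.35, 100643.24, 97599.28, 113097.43])
actual = np.array([134307.35, 81005.76, 99937.59, 64926.08, 125370.37,
                   35673.41, 105733.54, 107404.34, 97427.84, 122776.86])

# Mean absolute error: average size of the prediction error, in profit units.
mae = np.mean(np.abs(actual - predicted))

# R^2: share of the variance in actual profit explained by the predictions.
ss_res = np.sum((actual - predicted) ** 2)
ss_tot = np.sum((actual - actual.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print(f"MAE: {mae:.2f}")
print(f"R^2: {r2:.4f}")
```

With only ten test rows these figures are a rough indication rather than a stable estimate; a different `random_state` in the split would give somewhat different numbers.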
