# **Simple Linear Regression**

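## Assumptions
1. The relationship between the dependent and independent variables is linear
2. Both variables are numeric
3. The residuals (errors) are approximately normally distributed
4. The residuals are homoscedastic (constant variance across the range of the predictor)
5. In multiple linear regression, the independent variables are not strongly correlated with each other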
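
The homoscedasticity assumption can be probed on the residuals of a fitted line. The sketch below is only a rough illustration on synthetic data: the helper names `fit_line` and `spread_ratio` are made up for this example, and the split-half comparison is a crude stand-in for a formal test. A ratio near 1 is consistent with constant variance; a ratio far from 1 suggests the spread changes with the predictor.

```python
# Crude homoscedasticity check: compare the residual spread in the lower
# and upper halves of the predictor range. All data here are synthetic.

def fit_line(x, y):
    """Ordinary least squares with one predictor: returns (intercept, slope)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def spread_ratio(x, y):
    """Mean squared residual in the upper half of x divided by the lower half."""
    intercept, slope = fit_line(x, y)
    pairs = sorted(zip(x, y))  # order observations by predictor value
    residuals = [yi - (intercept + slope * xi) for xi, yi in pairs]
    half = len(residuals) // 2
    lower = sum(r ** 2 for r in residuals[:half]) / half
    upper = sum(r ** 2 for r in residuals[half:]) / (len(residuals) - half)
    return upper / lower

x = list(range(20))
signs = [1 if i % 2 == 0 else -1 for i in x]  # deterministic alternating "noise"
y_constant = [1 + 2 * xi + 0.5 * s for xi, s in zip(x, signs)]      # same spread everywhere
y_growing = [1 + 2 * xi + 0.1 * xi * s for xi, s in zip(x, signs)]  # spread grows with x

print("constant noise:", round(spread_ratio(x, y_constant), 2))
print("growing noise: ", round(spread_ratio(x, y_growing), 2))
```

In practice a plot of residuals against fitted values is the usual first check; the ratio above only summarizes the same idea in one number.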

## **Linear Regression**

In [None]:
# Importing the necessary libraries
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error

# Loading dataset in Python
df = sns.load_dataset('iris')
# Checking the dataset
df.info()
# Checking top 5 rows
df.head()


# Splitting the dataset into X and y (features and labels)
X = df[['sepal_length']]
y = df['sepal_width']
# Splitting the dataset into training and testing data
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.2, random_state=0)
# Training the model on the training set 
model = LinearRegression() # Calling the model
model.fit(X_train, y_train) # fitting or training the model (Fitting a regression line)
# Predicting the values of testing dataset
y_pred = model.predict(X_test)


# Print the slope and intercept of the regression line
print('Intercept:', model.intercept_)
print('Slope:', model.coef_)

# Plotting the fitted regression line (red) over all observations (green) and the test points (blue)
plt.plot(X_test, y_pred, color='red')
plt.scatter(X, y, color='green')
plt.scatter(X_test, y_test, color='blue')
plt.show()

# Printing the metrics for evaluating the model
print("R2 Score: ", r2_score(y_test, y_pred))
print("Mean Absolute Error: ", mean_absolute_error(y_test, y_pred))
print("Mean Squared Error: ", mean_squared_error(y_test, y_pred))

## **Multiple Linear Regression**


In [None]:
# Importing the necessary libraries
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error

# Loading dataset in Python
df = sns.load_dataset('iris')
# Checking the dataset
df.info()
# Checking top 5 rows
df.head()


# Splitting the dataset into X and y (features and labels)
X = df[['sepal_length', 'sepal_width']]
y = df['petal_width']
# Splitting the dataset into training and testing data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
# Training the model on the training set
model = LinearRegression()  # Calling the model
model.fit(X_train, y_train)  # Fitting or training the model (fitting a regression plane)
# Predicting the values of testing dataset
y_pred = model.predict(X_test)


# Print the intercept and the coefficients of the fitted model (one coefficient per feature)
print('Intercept:', model.intercept_)
print('Coefficients:', model.coef_)


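# Predict petal_width for a single new observation
# (a DataFrame with the training column names avoids scikit-learn's feature-name warning)
model.predict(pd.DataFrame({'sepal_length': [5.1], 'sepal_width': [3.5]}))

# Create a scatter plot of the actual vs predicted values
plt.scatter(y, model.predict(X))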
plt.xlabel('Actual Values')
plt.ylabel('Predicted Values')
plt.title('Actual vs Predicted Values')
plt.show()

# Printing the metrics for evaluating the model
print("R2 Score: ", r2_score(y_test, y_pred))
print("Mean Absolute Error: ", mean_absolute_error(y_test, y_pred))
print("Mean Squared Error: ", mean_squared_error(y_test, y_pred))

### **Another Example**

In [None]:
# Import the necessary libraries
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error

# Load the data
data = sns.load_dataset('tips') # replace 'tips' with your dataset name

# Plot tip against total_bill; lmplot fits a separate regression line for each smoker group
sns.lmplot(x='total_bill', y='tip', hue='smoker', data=data, height=6, aspect=1.5)
# Add labels to the plot
plt.xlabel('Total Bill')
plt.ylabel('Tip')
plt.title('Multilinear Regression Plot')
plt.show()


# Define the predictor and response variables
# 'smoker' is categorical ("Yes"/"No") and LinearRegression needs numeric input,
# so it is left out of this first model
X = data[['total_bill', 'size']]
y = data['tip']
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Create a linear regression model object
model = LinearRegression()
# Fit the model to the training data
model.fit(X_train, y_train)
# Predict the response variable for the test data
y_pred = model.predict(X_test)


# Evaluate the performance of the model
print('R2 Score:', r2_score(y_test, y_pred))
print('Mean Absolute Error:', mean_absolute_error(y_test, y_pred))
print('Mean Squared Error:', mean_squared_error(y_test, y_pred))

# Create a multilinear regression plot
sns.lmplot(x='total_bill', y='tip', hue='smoker', data=data, height=6, aspect=1.5, markers=['o', 'x'])
# Add labels to the plot
plt.xlabel('Total Bill')
plt.ylabel('Tip')
plt.title('Multilinear Regression Plot')
plt.show()

'''
**Encoding categorical variables**
In the tips dataset, the smoker variable is categorical, with two possible values: "Yes" and "No". Linear
regression models require numeric input, so this variable must be encoded as a number before fitting the
model.
'''
# Load the data
data = sns.load_dataset('tips')
# Encode the smoker variable as a binary indicator (1 = 'Yes', 0 = 'No')
data['smoker_bin'] = data['smoker'].apply(lambda x: 1 if x == 'Yes' else 0)
data.head()


# Define the predictor and response variables
X = data[['total_bill', 'size', 'smoker_bin']]
y = data['tip']
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Create a linear regression model object
model = LinearRegression()
# Fit the model to the training data
model.fit(X_train, y_train)
# Predict the response variable for the test data
y_pred = model.predict(X_test)


# Evaluate the performance of the model
print('R2 Score:', r2_score(y_test, y_pred))
print('Mean Absolute Error:', mean_absolute_error(y_test, y_pred))
print('Mean Squared Error:', mean_squared_error(y_test, y_pred))
# Plot tip against total_bill with one regression line per smoker_bin value
sns.lmplot(x='total_bill', y='tip', hue='smoker_bin', data=data, height=6, aspect=1.5, markers=['o', 'x'])
# Add labels to the plot
plt.xlabel('Total Bill')
plt.ylabel('Tip')
plt.title('Multilinear Regression Plot')
plt.show()

## **Another Example with Encoding**

Label encoding is a technique to convert categorical data into numerical values. 

Inverse label encoding is a technique to convert numerical values back into their corresponding categorical values.
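
To make the round trip concrete, here is a minimal hand-rolled sketch of both directions. It is not scikit-learn's implementation; it only mirrors the convention that `LabelEncoder` assigns integer codes to the sorted unique labels, and the sample `days` list is made up for illustration.

```python
# Hand-rolled label encoding and its inverse on a small made-up sample.
days = ['Sun', 'Sat', 'Thur', 'Fri', 'Sat']

# Sorted unique labels; the position in this list is the integer code
classes = sorted(set(days))                       # ['Fri', 'Sat', 'Sun', 'Thur']
to_code = {label: code for code, label in enumerate(classes)}

encoded = [to_code[d] for d in days]              # label -> integer
decoded = [classes[c] for c in encoded]           # integer -> label

print(classes)
print(encoded)   # [2, 1, 3, 0, 1]
print(decoded)   # ['Sun', 'Sat', 'Thur', 'Fri', 'Sat']
```

Note that the codes follow alphabetical order, not weekday order. A linear model treats them as a numeric scale (Fri < Sat < Sun < Thur), which is worth keeping in mind when a label-encoded column is used as a predictor.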

In [None]:
# Load the data 
data = sns.load_dataset('tips')
# Import LabelEncoder class from sklearn
from sklearn.preprocessing import LabelEncoder
# Create a LabelEncoder object
le = LabelEncoder()
# Encode the day column
data['day_encoded'] = le.fit_transform(data['day'])
# Print the unique values of the encoded day column
print(data['day_encoded'].unique())

# Fit a multiple linear regression (the model treats day_encoded as an ordinary numeric column)
# Define the predictor and response variables
X = data[['total_bill', 'size', 'day_encoded']]
y = data['tip']
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Create a linear regression model object
model = LinearRegression()
# Fit the model to the training data
model.fit(X_train, y_train)
# Predict the response variable for the test data
y_pred = model.predict(X_test)


# Evaluate the performance of the model
print('R2 Score:', r2_score(y_test, y_pred))
print('Mean Absolute Error:', mean_absolute_error(y_test, y_pred))
print('Mean Squared Error:', mean_squared_error(y_test, y_pred))
# Create a multilinear regression plot
sns.lmplot(x='total_bill', y='tip', hue='day_encoded', data=data, height=6, aspect=1.5)
# Add labels to the plot
plt.xlabel('Total Bill')
plt.ylabel('Tip')
plt.title('Multilinear Regression Plot')
plt.show()


# **Inverse Label Encoding**

# Load the data
data = sns.load_dataset('tips')
# Import LabelEncoder class from sklearn
from sklearn.preprocessing import LabelEncoder
# Create a LabelEncoder object
le = LabelEncoder()
# Encode the day column
data['day_encoded'] = le.fit_transform(data['day'])
# Map the encoded values back to the original day names
data['day_name'] = le.inverse_transform(data['day_encoded'])
# Print the first few rows of the data with the encoded and decoded day columns
print(data[['day','day_encoded','day_name']].head())

# Create a multilinear regression plot
sns.lmplot(x='total_bill', y='tip', hue='day_name', data=data, height=6, aspect=1.5)
# Add labels to the plot
plt.xlabel('Total Bill')
plt.ylabel('Tip')
plt.title('Multilinear Regression Plot')
plt.show()


# **Label Correspondences**

# Find the unique values of the encoded day column
unique_encoded_days = data['day_encoded'].unique()

# Map the unique encoded values back to their original day names
day_names = le.inverse_transform(unique_encoded_days)

# Print the unique encoded days and their corresponding day names
for encoded_day, day_name in zip(unique_encoded_days, day_names):
    print(f'Encoded Day: {encoded_day}, Original Day: {day_name}')

# Load libraries
import numpy as np
import pandas as pd
import plotly.express as px
# Load dataset
data = sns.load_dataset('tips')
# Encode the smoker variable as a binary indicator (1 = 'Yes', 0 = 'No')
data['smoker_bin'] = data['smoker'].apply(lambda x: 1 if x == 'Yes' else 0)
# Define the predictor and response variables
X = data[['total_bill', 'size', 'smoker_bin']]
y = data['tip']
# Create a LinearRegression model object
model = LinearRegression()
# Fit the Model to the Data
model.fit(X,y)

# Make predictions over a grid of total_bill and size values
x_range = np.linspace(X['total_bill'].min(), X['total_bill'].max(), 50)
y_range = np.linspace(X['size'].min(), X['size'].max(), 50)
x_grid, y_grid = np.meshgrid(x_range, y_range)
grid = pd.DataFrame({
    'total_bill': x_grid.ravel(),
    'size': y_grid.ravel(),
    'smoker_bin': 0,  # the surface is drawn for non-smokers
})
z_grid = model.predict(grid).reshape(x_grid.shape)
# Make a 3D plot
fig = px.scatter_3d(data, x='total_bill', y='size', z='tip', color='smoker_bin')
fig.update_traces(mode='markers')
fig.add_surface(x=x_range, y=y_range, z=z_grid, colorscale='Blues', opacity=0.5, showscale=False)
fig.show()


# **Interpretation of results**
Simple linear regression is a statistical method used to model the relationship between two variables, where one variable is the predictor or independent variable and the other variable is the response or dependent variable.

• The R2 score, also known as the coefficient of determination, measures how well the linear regression model fits the data. A
value of 1 indicates a perfect fit, and a value of 0 indicates that the model does no better than always predicting the mean of the
response; on test data the score can even be negative. In this case, the R2 score is 0.0328, which means that only 3.28% of the
variability in the response variable is explained by the predictor variable through the linear regression model.

• The mean absolute error (MAE) is a measure of the average absolute difference between the predicted and actual values of the
response variable. It is expressed in the same units as the response variable. In this case, the MAE is 0.3577, which means that on
average, the linear regression model is off by 0.3577 units in predicting the response variable.

• The mean squared error (MSE) is another measure of the difference between the predicted and actual values of the response
variable. It is expressed in squared units of the response variable. In this case, the MSE is 0.1966, which means that the average
squared difference between the predicted and actual values of the response variable is 0.1966 squared units. Because each error is
squared, the MSE penalizes large errors more heavily than the MAE, so it is preferred when large errors are especially costly.
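
The three metrics above reduce to a few lines of arithmetic. The sketch below writes them out in plain Python on made-up numbers so the formulas are visible; the function names `mae`, `mse` and `r2` are hypothetical helpers for this illustration, and in the notebook itself the `sklearn.metrics` functions do the same job.

```python
# Worked formulas for MAE, MSE and R2 on a tiny made-up example.

def mae(y_true, y_pred):
    """Mean absolute error: average of |actual - predicted|."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def mse(y_true, y_pred):
    """Mean squared error: average of (actual - predicted)^2."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

y_true = [3.0, 2.5, 4.0, 3.5]
y_pred = [2.5, 3.0, 4.0, 3.0]   # reasonable predictions
y_bad = [4.0, 4.0, 2.5, 2.5]    # predictions worse than guessing the mean

print(mae(y_true, y_pred))             # 0.375
print(mse(y_true, y_pred))             # 0.1875
print(round(r2(y_true, y_pred), 4))    # 0.4
print(round(r2(y_true, y_bad), 4))     # -4.2
```

The last line shows that R2 computed this way is not bounded below by 0: predictions that are worse than the mean of the response give a negative score.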

### **What happens if R2 is high?**
A high R2 value indicates that the linear regression model fits the data well and can explain a large proportion of the variability in the
response variable. Specifically, an R2 value close to 1 indicates that the model can explain most of the variation in the response variable,
while an R2 value close to 0 indicates that the model does not explain much of the variation in the response variable.


In other words, a high R2 value means that the model's predictions track the response variable closely given the predictor
variable. This can be useful in many applications, such as predicting sales based on advertising spend, or predicting crop
yields based on weather conditions. However, a high R2 value does not necessarily mean that the model is the best one for the
data; it should be judged on held-out data, and other factors such as model complexity and the validity of the assumptions
should also be considered.
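
As a small illustration of that caveat, the sketch below fits a straight line to data that are exactly quadratic (synthetic values, chosen only for this example). The R2 is high, yet the residuals follow a clear U-shaped pattern, which signals that the linearity assumption is violated.

```python
# A straight line fitted to curved (quadratic) data: high R2, patterned residuals.
x = list(range(11))
y = [xi ** 2 for xi in x]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# Closed-form least-squares slope and intercept for one predictor
sxx = sum((xi - mean_x) ** 2 for xi in x)
sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
slope = sxy / sxx
intercept = mean_y - slope * mean_x

residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
ss_res = sum(r ** 2 for r in residuals)
ss_tot = sum((yi - mean_y) ** 2 for yi in y)
r2 = 1 - ss_res / ss_tot

print("R2:", round(r2, 4))
print("Residuals:", [round(r, 1) for r in residuals])
```

The residuals are positive at both ends and negative in the middle, so a residual plot would reveal the problem even though the R2 alone looks reassuring.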

### **What happens when MAE and MSE are high?**

• A high Mean Absolute Error (MAE) or Mean Squared Error (MSE) value, judged relative to the scale of the response variable,
indicates that the linear regression model is not fitting the data well.

• In the case of MAE, a high value means that the model's predicted values are, on average, far from the actual values. In other
words, the model's predictions have a high degree of error.

• Similarly, in the case of MSE, a high value means that the average squared difference between the model's predicted values and
the actual values is large. Because the errors are squared, a few large misses are enough to push the MSE up.

• In summary, a high MAE or MSE value suggests that the linear regression model may not be the best fit for the data, and there
may be other models or techniques that can provide better predictions.
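
Whether an MAE or MSE counts as "high" depends on the units of the response. The sketch below, using made-up values and a hypothetical `metrics` helper, expresses the same data once in dollars and once in cents: the MAE grows by a factor of 100 and the MSE by a factor of 10,000, while R2 is unchanged.

```python
# MAE and MSE depend on the units of the response; R2 does not.

def metrics(y_true, y_pred):
    """Return (MAE, MSE, R2) for two equal-length lists of numbers."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e ** 2 for e in errors) / n
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1 - sum(e ** 2 for e in errors) / ss_tot
    return mae, mse, r2

actual_dollars = [1.0, 2.0, 3.0, 4.0]
predicted_dollars = [1.5, 2.0, 2.5, 4.5]

# Same data expressed in cents
actual_cents = [100 * v for v in actual_dollars]
predicted_cents = [100 * v for v in predicted_dollars]

mae_d, mse_d, r2_d = metrics(actual_dollars, predicted_dollars)
mae_c, mse_c, r2_c = metrics(actual_cents, predicted_cents)

print("dollars:", mae_d, mse_d, round(r2_d, 4))
print("cents:  ", mae_c, mse_c, round(r2_c, 4))
```

For this reason it helps to compare MAE and the square root of MSE against the typical size of the response, or against a simple baseline such as always predicting the mean.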