#### Author : Sanjoy Biswas
#### Topic : Linear Regression Tutorial With Project Solving
#### Email : sanjoy.eee32@gmail.com

Linear regression is one of the easiest and most popular Machine Learning algorithms. It is a statistical method used for predictive analysis: it predicts continuous numeric variables such as sales, salary, age, or product price.

Linear regression models a linear relationship between a dependent variable (y) and one or more independent variables (x), hence the name. In other words, it estimates how the value of the dependent variable changes as the values of the independent variables change.
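
To make the idea concrete, here is a minimal sketch of linear regression with a single independent variable, fitted by ordinary least squares in plain Python. The helper name `fit_line` and the toy data are assumptions made up for illustration; they are not part of the project below, which uses scikit-learn instead.

```python
from statistics import mean

def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line y = slope * x + intercept."""
    x_bar, y_bar = mean(xs), mean(ys)
    # Slope is the covariance of x and y divided by the variance of x
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    # The fitted line always passes through the point (x_bar, y_bar)
    intercept = y_bar - slope * x_bar
    return slope, intercept

# Toy data lying exactly on y = 2x + 1 (made up for illustration)
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]
slope, intercept = fit_line(xs, ys)
print(slope, intercept)
```

With more than one independent variable the same principle applies, but the coefficients are usually found with matrix methods, which is what a library such as scikit-learn handles internally.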

#### Import Libraries

In [69]:
import numpy as np
import pandas as pd
from sklearn import linear_model
from word2number import w2n
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

#### Import Dataset

In [162]:
df = pd.read_csv('/AirIndia_Domestic.csv')
df.head()

Unnamed: 0,Month,DEPARTURES\n,HOURS\n,KILOMETER\n(TH),PASSENGERS CARRIED\n,PASSENGER KMS. PERFORMED\n(TH),AVAILABLE SEAT KILOMETRE\n(TH),PAX. LOAD FACTOR#\n(IN %),FY
0,APR,8331,13998.0,7670.0,852743,862694.0,1219041.0,70.8,FY14
1,MAY,8648,14547.0,8039.0,933573,948498.0,1277627.0,74.2,FY14
2,JUNE,8279,14232.0,7367.0,809681,830542.0,1170875.0,70.9,FY14
3,JULY,8562,14860.0,7725.0,803943,822897.0,1227813.0,67.0,FY14
4,AUG,8547,14666.0,7756.0,908224,911328.0,1232691.0,73.9,FY14


In [45]:
# This library is not needed
# pip install word2number

#### Preprocessing Datasets

In [158]:
# df.total_funding= df.total_funding.fillna('Zero')

AttributeError: 'DataFrame' object has no attribute 'total_funding'

In [163]:
df

Unnamed: 0,Month,DEPARTURES\n,HOURS\n,KILOMETER\n(TH),PASSENGERS CARRIED\n,PASSENGER KMS. PERFORMED\n(TH),AVAILABLE SEAT KILOMETRE\n(TH),PAX. LOAD FACTOR#\n(IN %),FY
0,APR,8331,13998.0,7670.0,852743,862694.0,1219041.0,70.8,FY14
1,MAY,8648,14547.0,8039.0,933573,948498.0,1277627.0,74.2,FY14
2,JUNE,8279,14232.0,7367.0,809681,830542.0,1170875.0,70.9,FY14
3,JULY,8562,14860.0,7725.0,803943,822897.0,1227813.0,67.0,FY14
4,AUG,8547,14666.0,7756.0,908224,911328.0,1232691.0,73.9,FY14
...,...,...,...,...,...,...,...,...,...
113,OCT,7973,15282.3,8852.6,1037506,1083365.8,1310613.7,82.7,FY23
114,NOV,7761,14921.9,8586.3,1062524,1099937.6,1283540.4,85.7,FY23
115,DEC,8239,15839.7,9084.0,1170659,1214285.0,1359169.0,89.3,FY23
116,JAN,8295,16310.5,9174.5,1154581,1199173.8,1370587.0,87.5,FY23


In [110]:
# Leftover from an earlier dataset: no longer needed after converting the AGE column to numeric,
# and the Air India data has no employees_count column
df.total_funding = df.employees_count.apply(w2n.word_to_num)

ValueError: Type of input is not string! Please enter a valid number word (eg. 'two million twenty three thousand and forty nine')

In [85]:
df

Unnamed: 0.1,Unnamed: 0,year,Life_Expectancy
0,0,2024,70.62
1,1,2023,70.42
2,2,2022,70.19
3,3,2021,69.96
4,4,2020,69.73
...,...,...,...
70,70,1954,37.57
71,71,1953,36.98
72,72,1952,36.39
73,73,1951,35.80


In [164]:
import math
# Mean of the HOURS column, rounded down (this is a mean, not a median)
mean_hours = math.floor(df['HOURS\n'].mean())
mean_hours

15340

In [147]:
# Mean of the HOURS column, used in the next cell to fill missing values
dff = df['HOURS\n'].mean()
dff

np.float64(989239.8319327731)

In [165]:
df['HOURS\n'] = df['HOURS\n'].fillna(dff)

In [166]:
df

Unnamed: 0,Month,DEPARTURES\n,HOURS\n,KILOMETER\n(TH),PASSENGERS CARRIED\n,PASSENGER KMS. PERFORMED\n(TH),AVAILABLE SEAT KILOMETRE\n(TH),PAX. LOAD FACTOR#\n(IN %),FY
0,APR,8331,13998.0,7670.0,852743,862694.0,1219041.0,70.8,FY14
1,MAY,8648,14547.0,8039.0,933573,948498.0,1277627.0,74.2,FY14
2,JUNE,8279,14232.0,7367.0,809681,830542.0,1170875.0,70.9,FY14
3,JULY,8562,14860.0,7725.0,803943,822897.0,1227813.0,67.0,FY14
4,AUG,8547,14666.0,7756.0,908224,911328.0,1232691.0,73.9,FY14
...,...,...,...,...,...,...,...,...,...
113,OCT,7973,15282.3,8852.6,1037506,1083365.8,1310613.7,82.7,FY23
114,NOV,7761,14921.9,8586.3,1062524,1099937.6,1283540.4,85.7,FY23
115,DEC,8239,15839.7,9084.0,1170659,1214285.0,1359169.0,89.3,FY23
116,JAN,8295,16310.5,9174.5,1154581,1199173.8,1370587.0,87.5,FY23


In [167]:
# Show the column names
df.columns

Index(['Month', 'DEPARTURES\n', 'HOURS\n', 'KILOMETER\n(TH)',
       'PASSENGERS CARRIED\n', 'PASSENGER KMS. PERFORMED\n(TH)',
       'AVAILABLE SEAT KILOMETRE\n(TH)', ' PAX. LOAD FACTOR#\n(IN %)', 'FY'],
      dtype='object')
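
The column names above carry embedded newlines and, in one case, a leading space, which makes them awkward to type. One possible cleanup, sketched here on a plain list copied from the output above, collapses all runs of whitespace into single spaces. The helper name `clean_name` is an assumption for illustration; the notebook itself keeps the original names.

```python
# Column names as they appear in the df.columns output above
raw_columns = ['Month', 'DEPARTURES\n', 'HOURS\n', 'KILOMETER\n(TH)',
               'PASSENGERS CARRIED\n', 'PASSENGER KMS. PERFORMED\n(TH)',
               'AVAILABLE SEAT KILOMETRE\n(TH)', ' PAX. LOAD FACTOR#\n(IN %)', 'FY']

def clean_name(name):
    """Collapse newlines and repeated spaces into single spaces and trim both ends."""
    return " ".join(name.split())

clean_columns = [clean_name(c) for c in raw_columns]
print(clean_columns)
```

On a DataFrame, assigning `df.columns = [clean_name(c) for c in df.columns]` should apply the same renaming, but every later cell would then have to use the cleaned names (for example `'HOURS'` instead of `'HOURS\n'`).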

#### Feature Selection

In [177]:
predictors = ['DEPARTURES\n', 'HOURS\n','KILOMETER\n(TH)','PASSENGER KMS. PERFORMED\n(TH)']
x = df[predictors]
y = df['PASSENGERS CARRIED\n']

#### Split Train and Test Datasets

In [178]:
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.2,random_state=42)

In [179]:
x_train.shape,x_test.shape

((94, 4), (24, 4))

In [180]:
y_train.shape,y_test.shape

((94,), (24,))
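
The shapes above follow from the split ratio: with 118 rows and `test_size=0.2`, the test share is 23.6 rows, which is rounded up to 24, leaving 94 for training. The sketch below reproduces only that arithmetic with a shuffled index list. The helper `split_indices` is hypothetical, and its shuffle is not the one scikit-learn uses, so the selected rows will differ even with the same seed.

```python
import math
import random

def split_indices(n_rows, test_size=0.2, seed=42):
    """Shuffle row indices and split them into (train, test) lists."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)      # reproducible shuffle for a fixed seed
    n_test = math.ceil(test_size * n_rows)    # test share is rounded up
    return indices[n_test:], indices[:n_test]

train_idx, test_idx = split_indices(118)
print(len(train_idx), len(test_idx))
```

Fixing the seed, as `random_state=42` does in the cell above, keeps the split identical between runs so that scores stay comparable.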

#### Apply Linear Regression

In [181]:
reg = LinearRegression()

In [182]:
model = reg.fit(x_train,y_train)

In [183]:
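# Predict for one arbitrary row; the values follow the order of `predictors`
model.predict(pd.DataFrame([[5, 6, 7, 8]], columns=predictors))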



array([-82217.90713623])

#### Model Score (R²)

In [184]:
model.score(x_train,y_train)

0.9882980711421426

In [185]:
model.score(x_test,y_test)

0.9923839881404911
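
For a `LinearRegression` model, `score` returns the coefficient of determination R², not a classification accuracy. As a rough sketch of what that number means, the function below computes R² by hand on made-up values; the name `r2_score_manual` and the data are assumptions for illustration only.

```python
def r2_score_manual(y_true, y_pred):
    """R^2 = 1 - (residual sum of squares) / (total sum of squares)."""
    y_bar = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # error left after the fit
    ss_tot = sum((t - y_bar) ** 2 for t in y_true)              # spread around the mean
    return 1 - ss_res / ss_tot

# Made-up targets and predictions
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.5, 5.5, 7.0, 9.5]
r2 = r2_score_manual(y_true, y_pred)
print(r2)
```

A value of 1 means the predictions match the targets exactly, while 0 means the model does no better than always predicting the mean of the targets.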