This is an example of multiple linear regression: we have several independent variables and try to predict height from the other variables.
We will also fit an OLS model to inspect the p-values and perform backward elimination.

In [7]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

data = pd.read_csv("data.csv")

le = preprocessing.LabelEncoder()
ohe = preprocessing.OneHotEncoder()

We start by converting the countries to one-hot encodings and adding them to the dataframe.

In [8]:
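# take the country column as a 2D array and label-encode it in place
country = data.iloc[:,0:1].values
country[:,0] = le.fit_transform(country[:,0]) # pass a 1D array to avoid the column_or_1d warning

# one-hot encode the labels; the columns follow the sorted labels (fr, tr, us)
country = ohe.fit_transform(country).toarray()
country = pd.DataFrame(data=country, index=range(len(data)), columns = ['fr','tr','us'])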

data = data.drop(['country'],axis=1)
data = pd.concat([country,data],axis=1)

print(data)

     fr   tr   us  height  weight  age gender
0   0.0  1.0  0.0     130      30   10      m
1   0.0  1.0  0.0     125      36   11      m
2   0.0  1.0  0.0     135      34   10      f
3   0.0  1.0  0.0     133      30    9      f
4   0.0  1.0  0.0     129      38   12      m
5   0.0  1.0  0.0     180      90   30      m
6   0.0  1.0  0.0     190      80   25      m
7   0.0  1.0  0.0     175      90   35      m
8   0.0  1.0  0.0     177      60   22      f
9   0.0  0.0  1.0     185     105   33      m
10  0.0  0.0  1.0     165      55   27      f
11  0.0  0.0  1.0     155      50   44      f
12  0.0  0.0  1.0     160      58   39      f
13  0.0  0.0  1.0     162      59   41      f
14  0.0  0.0  1.0     167      62   55      f
15  1.0  0.0  0.0     174      70   47      m
16  1.0  0.0  0.0     193      90   23      m
17  1.0  0.0  0.0     187      80   27      m
18  1.0  0.0  0.0     183      88   28      m
19  1.0  0.0  0.0     159      40   29      f
20  1.0  0.0  0.0     164      66 

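As a possible alternative to the two-step encoding above, pandas can build the one-hot columns directly. This is only a minimal sketch: the three-row frame `df` is hypothetical and merely mirrors the `country` and `height` columns, and the resulting column names carry a `country_` prefix instead of the bare `fr`, `tr`, `us` used in this notebook.

```python
import pandas as pd

# Hypothetical three-row frame mirroring the 'country' column used above.
df = pd.DataFrame({"country": ["tr", "us", "fr"], "height": [130, 185, 174]})

# get_dummies creates one 0/1 column per category, named after the category,
# so no hard-coded column names or row counts are needed.
encoded = pd.get_dummies(df, columns=["country"], dtype=float)
print(encoded)
```

The untouched columns come first and the dummy columns follow in sorted category order, so no manual `drop` and `concat` step is required.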


Next, we convert the gender values to 0s and 1s using the label encoder.

In [9]:

gender = data.iloc[:,6:7].values
gender[:,0] = le.fit_transform(gender[:,0]) # convert to 1 if "m", 0 if "f"; 1D input avoids the column_or_1d warning

gender = pd.DataFrame(data=gender, index=range(len(data)), columns = ['gender'])

# drop the old column and insert the encoded one
data = data.drop(['gender'],axis=1)
data = pd.concat([data,gender],axis=1)
print(data)

     fr   tr   us  height  weight  age gender
0   0.0  1.0  0.0     130      30   10      1
1   0.0  1.0  0.0     125      36   11      1
2   0.0  1.0  0.0     135      34   10      0
3   0.0  1.0  0.0     133      30    9      0
4   0.0  1.0  0.0     129      38   12      1
5   0.0  1.0  0.0     180      90   30      1
6   0.0  1.0  0.0     190      80   25      1
7   0.0  1.0  0.0     175      90   35      1
8   0.0  1.0  0.0     177      60   22      0
9   0.0  0.0  1.0     185     105   33      1
10  0.0  0.0  1.0     165      55   27      0
11  0.0  0.0  1.0     155      50   44      0
12  0.0  0.0  1.0     160      58   39      0
13  0.0  0.0  1.0     162      59   41      0
14  0.0  0.0  1.0     167      62   55      0
15  1.0  0.0  0.0     174      70   47      1
16  1.0  0.0  0.0     193      90   23      1
17  1.0  0.0  0.0     187      80   27      1
18  1.0  0.0  0.0     183      88   28      1
19  1.0  0.0  0.0     159      40   29      0
20  1.0  0.0  0.0     164      66 

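For a two-valued column like this one, an explicit mapping is a simple alternative that makes the 0/1 assignment visible in the code. The sketch below uses the gender values of the first five rows of the table above; the names `genders` and `gender_codes` are chosen for illustration only.

```python
import pandas as pd

# Gender values of rows 0-4 from the table above.
genders = pd.Series(["m", "m", "f", "f", "m"], name="gender")

# Explicit mapping: the same assignment the label encoder produces ("f" -> 0, "m" -> 1).
gender_codes = genders.map({"f": 0, "m": 1})
print(gender_codes.tolist())
```

Unlike the label encoder, the mapping does not depend on the sort order of the labels, and any value missing from the dictionary would show up as NaN rather than silently receive a code.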


Now we predict the height from the other variables.

In [10]:
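# separate the target (height) from the predictors
height = data[['height']]
data = data.drop(['height'],axis=1)

# hold out a third of the rows for testing
x_train, x_test, y_train, y_test = train_test_split(data,height,test_size=0.33, random_state=42)

linreg = LinearRegression()
linreg.fit(x_train,y_train)

# predict the height for the held-out rows
y_pred = linreg.predict(x_test)
print(y_pred)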

[[126.23083027]
 [161.90564073]
 [163.32006753]
 [131.67355   ]
 [174.21037819]
 [181.29693297]
 [179.09453874]
 [154.02906858]]
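The predictions on their own are hard to judge, so it may help to compare them with the held-out heights. The sketch below is an assumption-laden illustration rather than a rerun of the cell above: it uses only rows 0-19 of the printed table (rows 20-21 are cut off in the output), it leaves out the country columns for brevity, and the choice of mean absolute error and R² as metrics is ours.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

# Rows 0-19 of the table printed earlier (rows 20-21 are cut off there).
height = np.array([130, 125, 135, 133, 129, 180, 190, 175, 177, 185,
                   165, 155, 160, 162, 167, 174, 193, 187, 183, 159])
weight = np.array([30, 36, 34, 30, 38, 90, 80, 90, 60, 105,
                   55, 50, 58, 59, 62, 70, 90, 80, 88, 40])
age = np.array([10, 11, 10, 9, 12, 30, 25, 35, 22, 33,
                27, 44, 39, 41, 55, 47, 23, 27, 28, 29])
gender = np.array([1, 1, 0, 0, 1, 1, 1, 1, 0, 1,
                   0, 0, 0, 0, 0, 1, 1, 1, 1, 0])

# Reduced feature set (assumption): weight, age and gender only.
X = np.column_stack([weight, age, gender])

x_train, x_test, y_train, y_test = train_test_split(
    X, height, test_size=0.33, random_state=42)

model = LinearRegression().fit(x_train, y_train)
y_pred = model.predict(x_test)

# Average absolute error in the units of height, and the share of variance explained.
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print("MAE:", mae)
print("R^2:", r2)
```

With so few rows the scores depend heavily on which rows land in the test split, so they are best read as a rough sanity check.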


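This is where we create the OLS model and perform backward elimination. In the summary, x1 to x6 correspond to fr, tr, us, weight, age and gender. The p-value of x5 (age) is very high at 0.717, so it is the first candidate for elimination. The p-value of x6 (gender) is 0.052, only just above the usual 0.05 threshold, so it should be re-examined after refitting without x5 rather than dropped at the same time.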

In [11]:
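#backward elimination
import statsmodels.api as sm

# predictors in column order: fr, tr, us, weight, age, gender (x1..x6 in the summary)
x_list = data.iloc[:,[0,1,2,3,4,5]]
x_list = np.array(x_list,dtype=float)

# no constant is added; the three country dummies sum to 1 and act as the intercept
model = sm.OLS(height,x_list).fit()

model.summary()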


==============================================================================
Dep. Variable:                 height   R-squared:                       0.885
Model:                            OLS   Adj. R-squared:                  0.849
Method:                 Least Squares   F-statistic:                     24.69
Date:                Mon, 20 Jun 2022   Prob (F-statistic):           5.41e-07
Time:                        20:22:27   Log-Likelihood:                 -73.95
No. Observations:                  22   AIC:                             159.9
Df Residuals:                      16   BIC:                             166.4
Df Model:                           5
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
x1           114.0688      8.145     14.005      0.000      96.802     131.335
x2           108.3030      5.736     18.880      0.000      96.143     120.463
x3           104.4714      9.195     11.361      0.000      84.978     123.964
x4             0.9211      0.119      7.737      0.000       0.669       1.174
x5             0.0814      0.221      0.369      0.717      -0.386       0.549
x6           -10.5980      5.052     -2.098      0.052     -21.308       0.112
==============================================================================
Omnibus:                        1.031   Durbin-Watson:                   2.759
Prob(Omnibus):                  0.597   Jarque-Bera (JB):                0.624
Skew:                           0.407   Prob(JB):                        0.732
Kurtosis:                       2.863   Cond. No.                        524.0
==============================================================================
