# Comparing Linear Regression in R and Python

Scikit-learn versus lm

In [1]:
# Suppress warnings to keep the notebook output clean
import warnings
warnings.filterwarnings('ignore')

## Python: Scikit-Learn

The goal is to predict `grade3` from the other features using linear regression.

In [2]:
import pandas as pd
import numpy as np

from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import LabelEncoder

In [3]:
raw_data = pd.read_csv('~/Documents/datasets/generated/cont/student-grades.csv')
raw_data.drop('ID', axis=1, inplace=True)
raw_data.head()

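|   | IQ | Age | Sex | study_hrs | SAT | grade1 | grade2 | grade3 |
|---|---|---|---|---|---|---|---|---|
| 0 | 108 | 15 | Female | 3.454358 | 1516 | 89.2 | 84.9 | 89.3 |
| 1 | 101 | 16 | Male | 3.902377 | 1553 | 90.1 | 90.8 | 84.4 |
| 2 | 96 | 19 | Female | 2.022188 | 1480 | 88.5 | 86.6 | 84.8 |
| 3 | 96 | 14 | Female | 2.97989 | 1625 | 85.5 | 89.0 | 81.5 |
| 4 | 104 | 19 | Female | 4.833261 | 1426 | 88.8 | 86.8 | 84.8 |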


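Label-encode the binary `Sex` feature so that it can be used as a numeric predictor. `LabelEncoder` assigns codes in sorted label order, so `Female` becomes 0 and `Male` becomes 1.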

In [4]:
# Fit the label encoder on the binary Sex feature and replace the strings with integer codes
enc = LabelEncoder()
raw_data.Sex = enc.fit_transform(raw_data.Sex)
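
To make the encoding concrete, here is a minimal sketch of the mapping on the five `Sex` values shown in the `head()` output above. It uses `np.unique` as a stand-in for `LabelEncoder` (an assumption for illustration, on the understanding that `LabelEncoder` assigns integer codes to the sorted unique labels); the variable names are hypothetical.

```python
import numpy as np

# The Sex values of the first five rows, as shown in the head() output
sex = np.array(['Female', 'Male', 'Female', 'Female', 'Female'])

# Sorted unique labels and, for each row, the index of its label in that sorted list
classes, codes = np.unique(sex, return_inverse=True)

print(classes.tolist())  # ['Female', 'Male']
print(codes.tolist())    # [0, 1, 0, 0, 0]
```

This matches the `describe()` output of the encoded column, where `Sex` ranges from 0 to 1.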

In [5]:
raw_data.describe()

|   | IQ | Age | Sex | study_hrs | SAT | grade1 | grade2 | grade3 |
|---|---|---|---|---|---|---|---|---|
| count | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 |
| mean | 100.0113 | 16.5075 | 0.5162 | 2.988733 | 1500.4392 | 87.98698 | 87.97241 | 87.88972 |
| std | 9.960778 | 1.707816 | 0.499762 | 0.995197 | 99.834457 | 4.021202 | 4.035037 | 3.971543 |
| min | 66.0 | 14.0 | 0.0 | -0.847397 | 1116.0 | 73.4 | 73.3 | 71.9 |
| 25% | 93.0 | 15.0 | 0.0 | 2.32383 | 1434.0 | 85.3 | 85.3 | 85.2 |
| 50% | 100.0 | 16.0 | 1.0 | 2.98155 | 1500.0 | 88.0 | 87.9 | 87.9 |
| 75% | 107.0 | 18.0 | 1.0 | 3.666447 | 1568.0 | 90.7 | 90.7 | 90.6 |
| max | 137.0 | 19.0 | 1.0 | 6.540091 | 1871.0 | 100.0 | 100.0 | 100.0 |


Split the data into the feature matrix `X` (every column except `grade3`) and the target `y` (`grade3`).

In [6]:
X = raw_data.iloc[:, :-1].values   # all columns except grade3
y = raw_data.iloc[:, -1:].values   # grade3, kept 2-D with shape (n, 1)
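
The target is sliced as a range rather than with a single index, which keeps `y` two-dimensional. With a 2-D target, scikit-learn's `LinearRegression` stores `coef_` with shape `(n_targets, n_features)`, which is why the fitted coefficients are later read as `lr.coef_[0]`. A small sketch of the slicing behaviour, using a hypothetical toy array in place of `raw_data.values`:

```python
import numpy as np

# Hypothetical 3x4 array standing in for raw_data.values (last column is the target)
data = np.arange(12, dtype=float).reshape(3, 4)

X = data[:, :-1]      # every column except the last
y_2d = data[:, -1:]   # range slice: keeps a column axis
y_1d = data[:, -1]    # single index: drops the column axis

print(X.shape, y_2d.shape, y_1d.shape)  # (3, 3) (3, 1) (3,)
```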

Fit the linear regression model and inspect the fitted slope coefficients (the intercept is stored separately in `lr.intercept_`).

In [14]:
lr = LinearRegression()
lr.fit(X, y)
tuple(lr.coef_[0])

(0.00011432789788459506,
 0.014821301820488762,
 0.06736104761973905,
 -0.01127881987272566,
 7.01345762108771e-05,
 -0.010351427550158032,
 0.008329558753014587)
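
Both `LinearRegression` and R's `lm` fit ordinary least squares, which is why their coefficients can be compared directly. As a rough sketch of that computation, assuming synthetic data and hypothetical names (this is not the notebook's dataset, and not either library's actual implementation), the same kind of estimate can be obtained with `np.linalg.lstsq`:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: 200 rows, 3 features, known intercept and slopes, no noise
X = rng.normal(size=(200, 3))
true_intercept = 2.0
true_slopes = np.array([0.5, -1.0, 3.0])
y = true_intercept + X @ true_slopes

# Prepend a column of ones so the intercept is estimated alongside the slopes
design = np.column_stack([np.ones(len(X)), X])

# Least-squares solution of design @ beta ~ y
beta, *_ = np.linalg.lstsq(design, y, rcond=None)

intercept, slopes = beta[0], beta[1:]
print(np.allclose(intercept, true_intercept), np.allclose(slopes, true_slopes))
```

Because the synthetic target has no noise, the recovered intercept and slopes should match the true values up to floating-point error.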

## R: Linear Models

In [8]:
%load_ext rpy2.ipython

In [9]:
%%R -i raw_data

summary(raw_data)

       IQ           Age             Sex           study_hrs      
 Min.   : 66   Min.   :14.00   Min.   :0.0000   Min.   :-0.8474  
 1st Qu.: 93   1st Qu.:15.00   1st Qu.:0.0000   1st Qu.: 2.3238  
 Median :100   Median :16.00   Median :1.0000   Median : 2.9815  
 Mean   :100   Mean   :16.51   Mean   :0.5162   Mean   : 2.9887  
 3rd Qu.:107   3rd Qu.:18.00   3rd Qu.:1.0000   3rd Qu.: 3.6664  
 Max.   :137   Max.   :19.00   Max.   :1.0000   Max.   : 6.5401  
      SAT           grade1           grade2           grade3      
 Min.   :1116   Min.   : 73.40   Min.   : 73.30   Min.   : 71.90  
 1st Qu.:1434   1st Qu.: 85.30   1st Qu.: 85.30   1st Qu.: 85.20  
 Median :1500   Median : 88.00   Median : 87.90   Median : 87.90  
 Mean   :1500   Mean   : 87.99   Mean   : 87.97   Mean   : 87.89  
 3rd Qu.:1568   3rd Qu.: 90.70   3rd Qu.: 90.70   3rd Qu.: 90.60  
 Max.   :1871   Max.   :100.00   Max.   :100.00   Max.   :100.00  


In [10]:
%%R -o R_coeff

linR <- lm(grade3 ~ .,
           data = raw_data)
R_coeff <- summary(linR)$coefficients
R_coeff

                 Estimate   Std. Error     t value  Pr(>|t|)
(Intercept)  8.770535e+01 1.4765762560 59.39777736 0.0000000
IQ           1.143279e-04 0.0039905474  0.02864968 0.9771446
Age          1.482130e-02 0.0232650243  0.63706367 0.5240980
Sex          6.736105e-02 0.0795147159  0.84715196 0.3969307
study_hrs   -1.127882e-02 0.0399338958 -0.28243725 0.7776141
SAT          7.013458e-05 0.0003979806  0.17622610 0.8601199
grade1      -1.035143e-02 0.0098834736 -1.04734711 0.2949648
grade2       8.329559e-03 0.0098477248  0.84583585 0.3976646
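
The columns of this table are related: each `t value` is the estimate divided by its standard error. A quick check using the numbers printed above (copied by hand, so the comparison is only as precise as the displayed digits; the variable names are hypothetical):

```python
import numpy as np

# Values copied from the R coefficient table above (intercept first)
estimate = np.array([8.770535e+01, 1.143279e-04, 1.482130e-02, 6.736105e-02,
                     -1.127882e-02, 7.013458e-05, -1.035143e-02, 8.329559e-03])
std_error = np.array([1.4765762560, 0.0039905474, 0.0232650243, 0.0795147159,
                      0.0399338958, 0.0003979806, 0.0098834736, 0.0098477248])
t_printed = np.array([59.39777736, 0.02864968, 0.63706367, 0.84715196,
                      -0.28243725, 0.17622610, -1.04734711, 0.84583585])

# t value = estimate / standard error
t_computed = estimate / std_error

# Agreement is limited by the rounding of the printed estimates
print(np.allclose(t_computed, t_printed, rtol=1e-5))
```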


In [11]:
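# Slopes only (index 0 is the intercept), shaped like lr.coef_; assumes rpy2 handed back a flat, column-major vector
R_coeff = np.array([R_coeff[1:8]])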

In [12]:
difference = np.subtract(lr.coef_, R_coeff)
print(np.mean(difference))

2.4708193080821442e-17


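The mean difference between the scikit-learn and R coefficients is about 2.5e-17, on the order of floating-point rounding error, and the coefficients printed above agree to every displayed digit: scikit-learn's `LinearRegression` and R's `lm` produce effectively the same fit.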
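
One caveat about the check above: a mean of signed differences can be close to zero even when individual coefficients differ, because positive and negative differences cancel. A slightly stricter comparison looks at the largest absolute difference or uses `np.allclose`. The sketch below uses the coefficient values printed earlier in this notebook, copied by hand (so the R values carry only the seven significant digits that R displayed); the variable names are hypothetical.

```python
import numpy as np

# scikit-learn slopes, copied from the lr.coef_ output
sk_coef = np.array([0.00011432789788459506, 0.014821301820488762,
                    0.06736104761973905, -0.01127881987272566,
                    7.01345762108771e-05, -0.010351427550158032,
                    0.008329558753014587])

# R slopes, copied from the Estimate column (intercept omitted, rounded by R's printing)
r_coef = np.array([1.143279e-04, 1.482130e-02, 6.736105e-02, -1.127882e-02,
                   7.013458e-05, -1.035143e-02, 8.329559e-03])

# Signed differences can cancel in a mean: this toy pair averages to exactly 0
toy_diff = np.array([0.5, -0.5])
print(toy_diff.mean(), np.abs(toy_diff).max())  # 0.0 0.5

# Largest absolute gap between the two sets of slopes
max_abs_diff = np.max(np.abs(sk_coef - r_coef))
print(max_abs_diff < 1e-8)

# Element-wise comparison with NumPy's default tolerances
print(np.allclose(sk_coef, r_coef))
```

Inside the notebook itself, the same idea would apply to `lr.coef_` and `R_coeff` directly, which avoids the rounding introduced by copying printed values.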