Feature Engineering: Feature Scaling, Data Cleaning, Train-Test Split
Problem Statement: Build a model that predicts employee salaries based on their years of experience
Step 1: Data Gathering

In [1]:
import pandas as pd
path = r"https://raw.githubusercontent.com/sindhura-nk/Datasets/refs/heads/main/Salary_dataset.csv"
df = pd.read_csv(path)
df.head()

,Unnamed: 0,YearsExperience,Salary
0,0,1.2,39344.0
1,1,1.4,46206.0
2,2,1.6,37732.0
3,3,2.1,43526.0
4,4,2.3,39892.0
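
The `Unnamed: 0` column in the output above looks like a row index that was written into the CSV when the file was saved. The following is a minimal sketch, assuming that is the case, of two ways to keep it out of the data. The inline CSV text reuses the first rows shown above so the example runs without network access; the names `csv_text`, `df_indexed`, `df_raw` and `df_dropped` are illustrative and not part of the notebook.

```python
import io

import pandas as pd

# First rows of the dataset as displayed above, inlined so the sketch runs offline.
csv_text = """,YearsExperience,Salary
0,1.2,39344.0
1,1.4,46206.0
2,1.6,37732.0
"""

# Option 1: tell pandas that the first CSV column is the row index.
df_indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)

# Option 2: load the file as-is, then drop the leftover column by name.
df_raw = pd.read_csv(io.StringIO(csv_text))
df_dropped = df_raw.drop(columns=["Unnamed: 0"])

print(df_indexed.columns.tolist())
print(df_dropped.columns.tolist())
```

Either option leaves only `YearsExperience` and `Salary` as columns. The notebook itself keeps `Unnamed: 0` in `df` and simply never selects it as a feature.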


In [3]:
df.shape

(30, 3)

In [4]:
df.info()

<class 'pandas.DataFrame'>
RangeIndex: 30 entries, 0 to 29
Data columns (total 3 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   Unnamed: 0       30 non-null     int64  
 1   YearsExperience  30 non-null     float64
 2   Salary           30 non-null     float64
dtypes: float64(2), int64(1)
memory usage: 852.0 bytes


In [5]:
# check for duplicated rows
df.duplicated().sum()

np.int64(0)

In [6]:
# no duplicates were found above, so this call leaves df unchanged
df = df.drop_duplicates()
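
One caveat with the duplicate check above: `df` still contains the `Unnamed: 0` column, and if that column holds a unique value per row, no two full rows can ever compare equal. The sketch below uses a small hypothetical frame (`toy`, made up for illustration) to show how the `subset` parameter restricts the comparison to the columns that matter.

```python
import pandas as pd

# Hypothetical toy frame: rows 1 and 2 repeat the same experience/salary pair,
# but the index-like column makes every full row unique.
toy = pd.DataFrame({
    "Unnamed: 0": [0, 1, 2],
    "YearsExperience": [1.2, 1.4, 1.4],
    "Salary": [39344.0, 46206.0, 46206.0],
})

# Comparing whole rows finds nothing because "Unnamed: 0" always differs.
full_row_dupes = int(toy.duplicated().sum())

# Comparing only the meaningful columns finds the repeated pair.
cols = ["YearsExperience", "Salary"]
subset_dupes = int(toy.duplicated(subset=cols).sum())

# Keep the first occurrence of each experience/salary pair.
deduped = toy.drop_duplicates(subset=cols)

print(full_row_dupes, subset_dupes, len(deduped))
```

Whether repeated experience/salary pairs should count as duplicates is a judgment call: two employees can legitimately share both values.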

In [7]:
X = df[['YearsExperience']]
Y = df[['Salary']]
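
The double brackets in the cell above matter: indexing with a list of column names returns a two-dimensional DataFrame, while a single name returns a one-dimensional Series. A small sketch on a made-up two-row frame (`toy` is illustrative only):

```python
import pandas as pd

toy = pd.DataFrame({
    "YearsExperience": [1.2, 1.4],
    "Salary": [39344.0, 46206.0],
})

X_frame = toy[["YearsExperience"]]  # list of names -> DataFrame, shape (2, 1)
Y_frame = toy[["Salary"]]           # same pattern as the cell above, shape (2, 1)
y_series = toy["Salary"]            # single name -> Series, shape (2,)

print(X_frame.shape, Y_frame.shape, y_series.shape)
```

scikit-learn estimators expect the feature matrix to be two-dimensional, so `X` needs the double brackets. The target may be either shape; keeping `Y` as a DataFrame is why the predictions later in the notebook come back as a column of one-element rows.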

In [8]:
X.head()

,YearsExperience
0,1.2
1,1.4
2,1.6
3,2.1
4,2.3


In [9]:
Y.head()

,Salary
0,39344.0
1,46206.0
2,37732.0
3,43526.0
4,39892.0


Step 4: Feature Engineering
Data Cleaning
Feature Scaling - Data Pre-processing

In [10]:
from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
num_pipe = make_pipeline(
    # data cleaning
    SimpleImputer(strategy='mean'),
    # feature scaling
    StandardScaler()
).set_output(transform='pandas')

In [12]:
print(num_pipe)

Pipeline(steps=[('simpleimputer', SimpleImputer()),
                ('standardscaler', StandardScaler())])
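
To make the two pipeline steps concrete, here is a plain-Python sketch of what mean imputation and standardization compute. The list `raw` is hypothetical and contains one missing value for illustration; according to `df.info()` the real column has no missing values, so the imputer changes nothing in this notebook. `StandardScaler` divides by the population standard deviation (the one that divides by n), which `statistics.pstdev` reproduces.

```python
from statistics import fmean, pstdev

# Hypothetical column with one missing value (None); the real column has none.
raw = [1.2, 1.4, None, 2.1, 2.3]

# Mean imputation: fill each gap with the mean of the observed values.
observed = [x for x in raw if x is not None]
fill_value = fmean(observed)
imputed = [fill_value if x is None else x for x in raw]

# Standardization: subtract the mean, then divide by the population
# standard deviation, giving a column with mean 0 and standard deviation 1.
mu = fmean(imputed)
sigma = pstdev(imputed)
scaled = [(x - mu) / sigma for x in imputed]

print([round(z, 4) for z in scaled])
```

The imputed entry lands exactly on the mean, so its scaled value is 0.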



In [13]:
X_pre = num_pipe.fit_transform(X)
X_pre.head()

,YearsExperience
0,-1.510053
1,-1.438373
2,-1.366693
3,-1.187494
4,-1.115814


Split the data into training and testing sets

In [14]:
from sklearn.model_selection import train_test_split
# Note: X_pre was scaled using all 30 rows, so the scaler statistics include the test rows.
xtrain,xtest,ytrain,ytest = train_test_split(X_pre,Y,train_size=0.8,test_size=0.2,random_state=21)

In [15]:
xtrain.head()

,YearsExperience
19,0.2461
7,-0.757416
27,1.536336
11,-0.470697
18,0.210261


In [16]:
ytrain.head()

,Salary
19,93941.0
7,54446.0
27,112636.0
11,55795.0
18,81364.0


In [17]:
xtest.head()

,YearsExperience
5,-0.864935
23,1.034577
22,0.927058
28,1.787215
1,-1.438373


In [18]:
ytest.head()

,Salary
5,56643.0
23,113813.0
22,101303.0
28,122392.0
1,46206.0



Model Building

In [20]:
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(xtrain,ytrain)

LinearRegression()
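
`LinearRegression` fits an ordinary least squares line. With a single feature the solution has a simple closed form, sketched below in plain Python. The toy data are the five (YearsExperience, Salary) rows shown at the top of the notebook, used unscaled, so the resulting numbers are not the fitted model's coefficients.

```python
from statistics import fmean

# Toy data: the five rows displayed by df.head() earlier in the notebook.
x = [1.2, 1.4, 1.6, 2.1, 2.3]
y = [39344.0, 46206.0, 37732.0, 43526.0, 39892.0]

x_bar, y_bar = fmean(x), fmean(y)

# slope = sum of cross-deviations / sum of squared x-deviations
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
sxx = sum((xi - x_bar) ** 2 for xi in x)
slope = sxy / sxx

# The intercept forces the line through the point of means (x_bar, y_bar).
intercept = y_bar - slope * x_bar

predictions = [intercept + slope * xi for xi in x]
print(f"slope={slope:.2f}, intercept={intercept:.2f}")
```

On the fitted model, the corresponding quantities are available as `model.coef_` and `model.intercept_`.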


In [21]:
model.score(xtrain,ytrain)

0.9557540289043547
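
For regressors, `model.score` returns the coefficient of determination R², defined as one minus the ratio of the residual sum of squares to the total sum of squares. A plain-Python sketch follows; the helper name `r2_manual` is hypothetical, and the sample values are the five training rows from `ytrain.head()` paired with the first five training predictions, so the result is not the full training score.

```python
def r2_manual(y_true, y_pred):
    """Return R^2 = 1 - SS_res / SS_tot for two equal-length sequences."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Five visible training rows and their predictions (a subset, for illustration).
actual = [93941.0, 54446.0, 112636.0, 55795.0, 81364.0]
predicted = [81318.59136997, 54931.14114298, 115245.31309039,
             62470.4126364, 80376.18243329]

print(f"R^2 on these five rows: {r2_manual(actual, predicted):.4f}")
```

A perfect prediction gives 1.0, and always predicting the mean of the actual values gives 0.0.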

In [26]:
# predictions on the training data (used for the training metrics below)
ypred = model.predict(xtrain)

In [27]:
ypred

array([[ 81318.59136997],
       [ 54931.14114298],
       [115245.31309039],
       [ 62470.4126364 ],
       [ 80376.18243329],
       [114302.90415371],
       [ 35140.55347273],
       [ 88857.8628634 ],
       [ 53046.32326962],
       [ 63412.82157308],
       [123726.9935205 ],
       [ 62470.4126364 ],
       [ 67182.45731979],
       [ 54931.14114298],
       [ 61528.00369972],
       [ 74721.72881322],
       [ 38910.18921944],
       [ 43622.23390283],
       [109590.85947032],
       [ 72836.91093986],
       [ 45507.05177619],
       [106763.63266029],
       [ 70952.09306651],
       [ 59643.18582637]])

In [25]:
ypred_test=model.predict(xtest)
ypred_test

array([[ 52103.91433294],
       [102051.58797689],
       [ 99224.36116686],
       [121842.17564714],
       [ 37025.37134609],
       [ 91685.08967343]])

In [30]:
from sklearn.metrics import mean_squared_error,mean_absolute_error,r2_score

mse_tr = mean_squared_error(ytrain,ypred)
rmse_tr = mse_tr**(1/2)
mae_tr = mean_absolute_error(ytrain,ypred)
r2_tr = r2_score(ytrain,ypred)
print("Training results")
print(f"MSE for training data : {mse_tr:.2f}")
print(f"RMSE for training data : {rmse_tr:.2f}")
print(f"MAE for training data : {mae_tr:.2f}")
print(f"r2 score for training data : {r2_tr*100:.2f}%")

print("======================================")
print("Testing results")
mse = mean_squared_error(ytest,ypred_test)
rmse = mse**(1/2)
mae = mean_absolute_error(ytest,ypred_test)
r2 = r2_score(ytest,ypred_test)

print(f"MSE for testing data : {mse:.2f}")
print(f"RMSE for testing data : {rmse:.2f}")
print(f"MAE for testing data : {mae:.2f}")
print(f"r2 score for testing data : {r2*100:.2f}%")

Training results
MSE for training data : 28631788.29
RMSE for training data : 5350.87
MAE for training data : 4392.34
r2 score for training data : 95.58%
======================================
Testing results
MSE for testing data : 48542473.24
RMSE for testing data : 6967.24
MAE for testing data : 5783.08
r2 score for testing data : 93.99%
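
The error metrics above all derive from the residuals (actual minus predicted). The sketch below computes them by hand for the five test rows visible in `ytest.head()`, paired with the first five test predictions. The sixth test row's actual salary is not shown in the notebook, so these numbers will not match the reported test metrics exactly.

```python
from math import sqrt

# Five visible test rows and their predictions (the sixth actual value is not shown).
actual = [56643.0, 113813.0, 101303.0, 122392.0, 46206.0]
predicted = [52103.91433294, 102051.58797689, 99224.36116686,
             121842.17564714, 37025.37134609]

errors = [a - p for a, p in zip(actual, predicted)]
n = len(errors)

mse = sum(e ** 2 for e in errors) / n   # mean squared error
rmse = sqrt(mse)                        # same units as Salary
mae = sum(abs(e) for e in errors) / n   # mean absolute error

print(f"MSE : {mse:.2f}")
print(f"RMSE: {rmse:.2f}")
print(f"MAE : {mae:.2f}")
```

RMSE is never smaller than MAE, and a large gap between the two indicates that a few large residuals dominate the squared error.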
