Predicting employee salary from years of experience using linear regression.
# Import the relevant libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
import seaborn as snb
snb.set()
data=pd.read_csv("Salary_Data.csv")
data
| | YearsExperience | Salary |
---|---|---|
0 | 1.1 | 39343.0 |
1 | 1.3 | 46205.0 |
2 | 1.5 | 37731.0 |
3 | 2.0 | 43525.0 |
4 | 2.2 | 39891.0 |
5 | 2.9 | 56642.0 |
6 | 3.0 | 60150.0 |
7 | 3.2 | 54445.0 |
8 | 3.2 | 64445.0 |
9 | 3.7 | 57189.0 |
10 | 3.9 | 63218.0 |
11 | 4.0 | 55794.0 |
12 | 4.0 | 56957.0 |
13 | 4.1 | 57081.0 |
14 | 4.5 | 61111.0 |
15 | 4.9 | 67938.0 |
16 | 5.1 | 66029.0 |
17 | 5.3 | 83088.0 |
18 | 5.9 | 81363.0 |
19 | 6.0 | 93940.0 |
20 | 6.8 | 91738.0 |
21 | 7.1 | 98273.0 |
22 | 7.9 | 101302.0 |
23 | 8.2 | 113812.0 |
24 | 8.7 | 109431.0 |
25 | 9.0 | 105582.0 |
26 | 9.5 | 116969.0 |
27 | 9.6 | 112635.0 |
28 | 10.3 | 122391.0 |
29 | 10.5 | 121872.0 |
data.describe()
| | YearsExperience | Salary |
---|---|---|
count | 30.000000 | 30.000000 |
mean | 5.313333 | 76003.000000 |
std | 2.837888 | 27414.429785 |
min | 1.100000 | 37731.000000 |
25% | 3.200000 | 56720.750000 |
50% | 4.700000 | 65237.000000 |
75% | 7.700000 | 100544.750000 |
max | 10.500000 | 122391.000000 |
X=data['YearsExperience']
y=data['Salary']
X
0 1.1
1 1.3
2 1.5
3 2.0
4 2.2
5 2.9
6 3.0
7 3.2
8 3.2
9 3.7
10 3.9
11 4.0
12 4.0
13 4.1
14 4.5
15 4.9
16 5.1
17 5.3
18 5.9
19 6.0
20 6.8
21 7.1
22 7.9
23 8.2
24 8.7
25 9.0
26 9.5
27 9.6
28 10.3
29 10.5
Name: YearsExperience, dtype: float64
y
0 39343.0
1 46205.0
2 37731.0
3 43525.0
4 39891.0
5 56642.0
6 60150.0
7 54445.0
8 64445.0
9 57189.0
10 63218.0
11 55794.0
12 56957.0
13 57081.0
14 61111.0
15 67938.0
16 66029.0
17 83088.0
18 81363.0
19 93940.0
20 91738.0
21 98273.0
22 101302.0
23 113812.0
24 109431.0
25 105582.0
26 116969.0
27 112635.0
28 122391.0
29 121872.0
Name: Salary, dtype: float64
plt.scatter(X,y,color="blue")
plt.xlabel("YEARS OF EXPERIENCE", fontsize=20,color="blue")
plt.ylabel("SALARY",fontsize=20,color="blue")
plt.show()
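The scatter plot suggests a strong linear relationship. One way to put a number on it is the Pearson correlation coefficient. The block below is a minimal dependency-free sketch: the helper name `pearson_r` is hypothetical, and the two lists are copied from the table printed earlier rather than read from `Salary_Data.csv`. For a one-predictor regression with an intercept, the square of this coefficient should agree with the R-squared reported by the OLS summary.

```python
import math

# Values copied from the data table shown above (30 rows of Salary_Data.csv).
years = [1.1, 1.3, 1.5, 2.0, 2.2, 2.9, 3.0, 3.2, 3.2, 3.7,
         3.9, 4.0, 4.0, 4.1, 4.5, 4.9, 5.1, 5.3, 5.9, 6.0,
         6.8, 7.1, 7.9, 8.2, 8.7, 9.0, 9.5, 9.6, 10.3, 10.5]
salary = [39343.0, 46205.0, 37731.0, 43525.0, 39891.0, 56642.0,
          60150.0, 54445.0, 64445.0, 57189.0, 63218.0, 55794.0,
          56957.0, 57081.0, 61111.0, 67938.0, 66029.0, 83088.0,
          81363.0, 93940.0, 91738.0, 98273.0, 101302.0, 113812.0,
          109431.0, 105582.0, 116969.0, 112635.0, 122391.0, 121872.0]


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of cross-deviations and squared deviations from the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


r = pearson_r(years, salary)
print(f"r = {r:.3f}, r^2 = {r ** 2:.3f}")
```

A value of `r` close to 1 supports fitting a straight line before looking at more flexible models.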
x=sm.add_constant(X)
results=sm.OLS(y,x).fit()
results.summary()
Dep. Variable: | Salary | R-squared: | 0.957 |
---|---|---|---|
Model: | OLS | Adj. R-squared: | 0.955 |
Method: | Least Squares | F-statistic: | 622.5 |
Date: | Thu, 12 Nov 2020 | Prob (F-statistic): | 1.14e-20 |
Time: | 17:21:23 | Log-Likelihood: | -301.44 |
No. Observations: | 30 | AIC: | 606.9 |
Df Residuals: | 28 | BIC: | 609.7 |
Df Model: | 1 | | |
Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
---|---|---|---|---|---|---|
const | 2.579e+04 | 2273.053 | 11.347 | 0.000 | 2.11e+04 | 3.04e+04 |
YearsExperience | 9449.9623 | 378.755 | 24.950 | 0.000 | 8674.119 | 1.02e+04 |

Omnibus: | 2.140 | Durbin-Watson: | 1.648 |
---|---|---|---|
Prob(Omnibus): | 0.343 | Jarque-Bera (JB): | 1.569 |
Skew: | 0.363 | Prob(JB): | 0.456 |
Kurtosis: | 2.147 | Cond. No. | 13.2 |

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
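The `const` and `YearsExperience` coefficients in the summary can be cross-checked by hand. For a single predictor, ordinary least squares has a closed form: the slope is the sum of cross-deviations divided by the sum of squared deviations of x, and the intercept is mean(y) minus slope times mean(x). The sketch below applies those formulas in plain Python. The helper name `fit_line` is hypothetical, and the data are copied from the table printed earlier rather than loaded from the CSV.

```python
# Values copied from the data table shown above (30 rows of Salary_Data.csv).
years = [1.1, 1.3, 1.5, 2.0, 2.2, 2.9, 3.0, 3.2, 3.2, 3.7,
         3.9, 4.0, 4.0, 4.1, 4.5, 4.9, 5.1, 5.3, 5.9, 6.0,
         6.8, 7.1, 7.9, 8.2, 8.7, 9.0, 9.5, 9.6, 10.3, 10.5]
salary = [39343.0, 46205.0, 37731.0, 43525.0, 39891.0, 56642.0,
          60150.0, 54445.0, 64445.0, 57189.0, 63218.0, 55794.0,
          56957.0, 57081.0, 61111.0, 67938.0, 66029.0, 83088.0,
          81363.0, 93940.0, 91738.0, 98273.0, 101302.0, 113812.0,
          109431.0, 105582.0, 116969.0, 112635.0, 122391.0, 121872.0]


def fit_line(xs, ys):
    """Closed-form simple linear regression; returns (intercept, slope)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # slope = sum of cross-deviations / sum of squared x-deviations
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    # The fitted line always passes through the point (mean_x, mean_y).
    intercept = mean_y - slope * mean_x
    return intercept, slope


intercept, slope = fit_line(years, salary)
print(f"intercept = {intercept:.1f}, slope = {slope:.4f}")
```

Read this way, the slope says each additional year of experience is associated with roughly 9,450 more in salary, and the intercept is the fitted salary at zero years of experience, which lies outside the observed range of 1.1 to 10.5 years.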
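plt.scatter(X,y,color="red")
# Build the fitted line from the estimated coefficients instead of hard-coding rounded values
y_pred=results.params["const"]+X*results.params["YearsExperience"]
plt.plot(X,y_pred,color="black")
plt.xlabel("YEARS OF EXPERIENCE",fontsize=20)
plt.ylabel("SALARY",fontsize=20)
plt.show()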
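The fitted line can also be used to estimate a salary for an experience level that is not in the table. The sketch below is a minimal illustration, not part of the original notebook: `predict_salary` is a hypothetical helper, and the two constants are the coefficients as printed in the OLS summary, where the intercept appears only to four significant figures (2.579e+04).

```python
# Coefficients as printed in the OLS summary above. The intercept is
# rounded there (2.579e+04), so predictions carry that rounding too.
INTERCEPT = 2.579e+04
SLOPE = 9449.9623


def predict_salary(years_experience):
    """Point estimate from the fitted line: intercept + slope * years."""
    return INTERCEPT + SLOPE * years_experience


# 2.0 and 5.0 lie inside the observed range (1.1 to 10.5 years);
# 12.0 is an extrapolation and should be treated with more caution.
for yrs in (2.0, 5.0, 12.0):
    print(f"{yrs:>4} years -> {predict_salary(yrs):,.2f}")
```

These are point estimates only. With the fitted `results` object available, `results.get_prediction(...)` in statsmodels is the usual route to intervals around such estimates.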