# Salary Prediction using Linear Regression

This notebook demonstrates a simple machine learning workflow in Python: it fits a linear regression model that predicts employee salaries from experience, education, age, and company rating, using a synthetic dataset.

## 1. Import Libraries and Set Random Seed

We import the required Python libraries and set a random seed to ensure reproducible results.

In [1]:
# Project: Salary Prediction using Linear Regression

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

np.random.seed(42)
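
As a minimal sketch of what the seed does (the names `first_draw`, `second_draw`, and `draws_match` are illustrative and not part of the notebook), re-seeding NumPy's global generator with the same value replays the same sequence of draws:

```python
import numpy as np

# Seed the global generator, then draw five integers in [0, 21).
np.random.seed(42)
first_draw = np.random.randint(0, 21, 5)

# Re-seeding with the same value restarts the same sequence.
np.random.seed(42)
second_draw = np.random.randint(0, 21, 5)

draws_match = np.array_equal(first_draw, second_draw)
print(draws_match)
```

`np.random.seed` only controls NumPy's global generator; the train-test split in Section 6 passes its own `random_state=42`.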


## 2. Generate Synthetic Dataset

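We generate a synthetic dataset of 100 employees with four integer attributes: experience (0 to 20 years), education level (1 to 5), age (21 to 60), and company rating (1 to 5).
Salary is computed as a fixed linear combination of these attributes, plus Gaussian noise with a standard deviation of 180 to simulate real-world variation.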

In [2]:
# Number of samples
samples = 100

# Feature generation
experience = np.random.randint(0, 21, samples)
education = np.random.randint(1, 6, samples)
age = np.random.randint(21, 61, samples)
rating = np.random.randint(1, 6, samples)

# Target variable (Salary) with noise
salary = (
    experience * 550 +
    education * 320 -
    age * 1 +
    rating * 100 +
    np.random.normal(0, 180, samples)
)


## 3. Create and Save Dataset

The generated data is stored in a pandas DataFrame and saved as a CSV file for reuse.

In [3]:
df = pd.DataFrame({
    "Experience": experience,
    "Education": education,
    "Age": age,
    "CompanyRating": rating,
    "Salary": salary
})

df.to_csv("salary_prediction.csv", index=False)
df.head()


| | Experience | Education | Age | CompanyRating | Salary |
|---|---|---|---|---|---|
| 0 | 6 | 4 | 54 | 5 | 5093.913328 |
| 1 | 19 | 1 | 26 | 5 | 11541.243178 |
| 2 | 14 | 4 | 42 | 3 | 8999.170282 |
| 3 | 10 | 2 | 31 | 4 | 6840.868844 |
| 4 | 7 | 1 | 36 | 1 | 4102.466547 |
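
As a quick sanity check, the first row of the output above can be plugged back into the generating formula from Section 2. The sketch below copies that row's values by hand (the names `noise_free` and `implied_noise` are illustrative) and recovers the noise term that was added to it:

```python
# Values copied from the first row of the output above.
experience, education, age, rating = 6, 4, 54, 5
observed_salary = 5093.913328

# Same linear formula as in Section 2, without the noise term.
noise_free = experience * 550 + education * 320 - age * 1 + rating * 100

# Whatever is left over is the Gaussian noise drawn for this row.
implied_noise = observed_salary - noise_free

print(noise_free)               # 5026
print(round(implied_noise, 2))  # 67.91
```

An implied noise of about 68 is well within one standard deviation (180) of zero, which is consistent with how the data was generated.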


## 4. Data Exploration

We perform basic exploratory data analysis to understand the structure, data types, and statistical summary of the dataset.

In [4]:
df = pd.read_csv("salary_prediction.csv")

df.info()
df.describe()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 5 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Experience     100 non-null    int64  
 1   Education      100 non-null    int64  
 2   Age            100 non-null    int64  
 3   CompanyRating  100 non-null    int64  
 4   Salary         100 non-null    float64
dtypes: float64(1), int64(4)
memory usage: 4.0 KB


| | Experience | Education | Age | CompanyRating | Salary |
|---|---|---|---|---|---|
| count | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 |
| mean | 10.09 | 2.93 | 41.73 | 2.97 | 6764.501065 |
| std | 6.078767 | 1.437274 | 11.6036 | 1.466494 | 3392.55047 |
| min | 0.0 | 1.0 | 21.0 | 1.0 | 859.895592 |
| 25% | 6.0 | 1.75 | 31.75 | 2.0 | 4083.311345 |
| 50% | 10.0 | 3.0 | 42.0 | 3.0 | 6605.476774 |
| 75% | 15.0 | 4.0 | 53.0 | 4.0 | 9385.951168 |
| max | 20.0 | 5.0 | 60.0 | 5.0 | 12957.491441 |
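
Because the salary formula is linear, the mean salary should equal the formula applied to the feature means, plus the mean of the noise. The following sketch checks this using the `mean` row copied by hand from the summary above (the variable names are illustrative):

```python
# Feature means and mean salary copied from the describe() output above.
mean_experience, mean_education, mean_age, mean_rating = 10.09, 2.93, 41.73, 2.97
observed_mean_salary = 6764.501065

# Apply the generating formula from Section 2 to the feature means.
expected_mean = (
    mean_experience * 550
    + mean_education * 320
    - mean_age * 1
    + mean_rating * 100
)

# The gap is the average of the 100 noise draws.
mean_noise = observed_mean_salary - expected_mean

print(round(expected_mean, 2))  # 6742.37
print(round(mean_noise, 2))     # 22.13
```

The average of 100 draws from a zero-mean distribution with standard deviation 180 has a standard error of 180 / √100 = 18, so a gap of roughly 22 is unremarkable.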


## 5. Feature and Target Selection

We separate the independent variables (features) and the dependent variable (salary).

In [5]:
X = df[["Experience", "Education", "Age", "CompanyRating"]]
y = df["Salary"]

X.shape, y.shape


((100, 4), (100,))

## 6. Train-Test Split

The dataset is split 80/20 into training and testing sets so that the model can be evaluated on data it did not see during training.

In [6]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
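
To illustrate what the split produces, here is a small self-contained sketch on placeholder arrays of the same shape as `X` and `y` (the `_demo` names and the `np.arange` data are stand-ins, not the notebook's data). With 100 rows and `test_size=0.2`, 80 rows go to training and 20 to testing, with no row in both:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 100 rows, 4 feature columns, and a target that
# doubles as a row identifier so overlap is easy to check.
X_demo = np.arange(400).reshape(100, 4)
y_demo = np.arange(100)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

print(X_tr.shape, X_te.shape)  # (80, 4) (20, 4)

# No row identifier appears in both the training and the test set.
overlap = set(y_tr.tolist()) & set(y_te.tolist())
print(len(overlap))  # 0
```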


## 7. Model Training

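A `LinearRegression` model is fit on the training set (80 rows). The cell returns the learned coefficients, in the same column order as `X`, followed by the intercept.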

In [7]:
model = LinearRegression()
model.fit(X_train, y_train)
model.coef_, model.intercept_


(array([554.1364692 , 286.28964197,  -2.70228851,  76.31117972]),
 np.float64(232.74557101400933))
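
Since the data is synthetic, the learned coefficients can be compared with the values used to generate it in Section 2 (550, 320, -1, and 100, with no intercept). The sketch below copies the learned values from the output above; the names `true_coefs`, `learned_coefs`, and `diff` are illustrative:

```python
import numpy as np

feature_names = ["Experience", "Education", "Age", "CompanyRating"]

# Coefficients used in the generating formula (Section 2).
true_coefs = np.array([550.0, 320.0, -1.0, 100.0])

# Coefficients copied from the model.coef_ output above.
learned_coefs = np.array([554.1364692, 286.28964197, -2.70228851, 76.31117972])

diff = learned_coefs - true_coefs

for name, true_value, learned_value, delta in zip(
    feature_names, true_coefs, learned_coefs, diff
):
    print(f"{name:>14}: true={true_value:8.2f}  learned={learned_value:8.2f}  diff={delta:+7.2f}")
```

The experience coefficient is recovered most closely. The education and rating estimates are further off, and the intercept is about 233 rather than 0, which is plausibly an effect of the noise and of having only 80 training rows.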

## 8. Model Evaluation

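We evaluate the model on the held-out test set using Mean Absolute Error (MAE), Mean Squared Error (MSE), and the R² score. Each metric is rounded to two decimal places, so an R² very close to 1 prints as 1.0.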

In [8]:
y_pred = model.predict(X_test)

mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print("MAE:", round(mae, 2))
print("MSE:", round(mse, 2))
print("R² Score:", round(r2, 2))


MAE: 121.06
MSE: 24668.01
R² Score: 1.0
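
MSE is in squared salary units, so its square root (RMSE) is easier to compare with the MAE and with the noise level. The sketch below derives RMSE from the MSE printed above, and also gives a rough estimate of the unrounded R². That estimate is an approximation only: it substitutes the salary standard deviation of the full dataset (from Section 4) for the test-set variance that `r2_score` actually uses.

```python
import math

# Values copied from the outputs above.
mse = 24668.01
salary_std_all_rows = 3392.55047  # from df.describe(), all 100 rows

# RMSE is in the same units as salary.
rmse = math.sqrt(mse)
print(round(rmse, 2))  # 157.06

# Rough R² estimate: 1 - MSE / variance. Uses the full-dataset variance
# as a stand-in for the test-set variance, so treat it as approximate.
approx_r2 = 1 - mse / salary_std_all_rows**2
print(round(approx_r2, 4))
```

An RMSE of about 157 is close to the noise standard deviation of 180 used in Section 2, so the remaining error is roughly what the injected noise alone would produce.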


## 9. Prediction Comparison

A comparison between actual and predicted salary values is displayed.

In [9]:
comparison = pd.DataFrame({
    "Actual Salary": y_test.values,
    "Predicted Salary": y_pred
})

comparison.head()


| | Actual Salary | Predicted Salary |
|---|---|---|
| 0 | 939.057486 | 801.214481 |
| 1 | 9184.886782 | 9386.512923 |
| 2 | 4755.722099 | 4636.325105 |
| 3 | 11951.815837 | 11885.525847 |
| 4 | 2077.653846 | 2131.896075 |
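
The comparison is easier to read with the error shown explicitly. As a sketch, the five rows above are copied into a small DataFrame and extended with a residual and an absolute percentage error; the name `comparison_head` and the two added column names are illustrative additions, not part of the notebook:

```python
import pandas as pd

# The five rows shown above, copied by hand.
comparison_head = pd.DataFrame({
    "Actual Salary": [939.057486, 9184.886782, 4755.722099, 11951.815837, 2077.653846],
    "Predicted Salary": [801.214481, 9386.512923, 4636.325105, 11885.525847, 2131.896075],
})

# Residual: positive when the model under-predicts.
comparison_head["Residual"] = (
    comparison_head["Actual Salary"] - comparison_head["Predicted Salary"]
)

# Absolute error as a percentage of the actual salary.
comparison_head["Abs % Error"] = (
    comparison_head["Residual"].abs() / comparison_head["Actual Salary"] * 100
)

print(comparison_head.round(2))
```

The absolute errors are of similar size across rows, so the percentage error is largest for the smallest salary (row 0).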


## Conclusion

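This project demonstrates an end-to-end machine learning workflow: synthetic data generation, data exploration, train-test splitting, model training, and evaluation.
Linear Regression fits this dataset closely (test MAE of 121.06 and an R² that rounds to 1.0) because the salaries were generated as a linear function of the features plus Gaussian noise, so these results reflect how the data was constructed rather than performance on real salary data.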