<a href="https://colab.research.google.com/github/michalinagers/linearRegressionPython/blob/main/SalaryData_Regression.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Week 8**

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


# **Program 8.1**

# **Aim: Simple Linear Regression on Salary_Data.csv dataset.**

In this task, you will predict "Salary" from "Years of Experience" using linear regression. You will work with the dataset "Salary_Data.csv", which is available on BB (Blackboard); download it and use it for the analysis.

 **Note: Before you start working on this file, you must:**

1. Save a copy of this Google Colab file to your Google Drive before making any edits.
2. Download the Salary_Data.csv dataset file from BB to your Google Drive.
3. Mount your Google Drive in the Google Colab session.



About the "**Salary_Data.csv**" Dataset

The dataset contains the following columns:

- **Age:** Represents the employee's age (in years).

- **Gender**: Categorical variable indicating the employee’s gender (e.g., Male, Female, Other).

- **Education Level**: Describes the highest qualification achieved (e.g., High School, Bachelor’s, Master’s, PhD).


- **Job Title**: Indicates the employee’s role or position (e.g., Data Analyst, Software Engineer, Manager).

- **Years of Experience**: Shows how many years the employee has worked in their field. Generally, more experience leads to higher pay.

- **Salary**: The target variable representing the employee's annual income. It can be predicted from factors such as age, education, job title, and experience.



In this exercise, you will focus on how "Years of Experience" affects the "Salary" paid to an employee. By implementing linear regression, you will build a predictive model that estimates salary from this single feature, "Years of Experience".
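With a single feature, linear regression fits a straight line of the form Salary ≈ intercept + slope × Years of Experience. The sketch below illustrates that idea with the closed-form least-squares formulas. The numbers are made-up toy values, not rows from Salary_Data.csv, and the variable names are only for illustration.

```python
import numpy as np

# Hypothetical toy data (NOT taken from Salary_Data.csv)
years = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
salary = np.array([40000.0, 45000.0, 50000.0, 55000.0, 60000.0])

# Closed-form least-squares estimates for one feature
slope = np.sum((years - years.mean()) * (salary - salary.mean())) / np.sum(
    (years - years.mean()) ** 2
)
intercept = salary.mean() - slope * years.mean()

# Use the fitted line to estimate the salary for 6 years of experience
predicted_at_6 = intercept + slope * 6.0

print(f"slope = {slope:.1f}, intercept = {intercept:.1f}")
print(f"predicted salary at 6 years = {predicted_at_6:.1f}")
```

The rest of the notebook uses scikit-learn's `LinearRegression`, which estimates the same kind of intercept and slope from the real data.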

**1. Import Required Libraries**

In [None]:

# Importing essential libraries for data manipulation, visualization, and linear regression
import pandas as pd  # For handling dataframes
import matplotlib.pyplot as plt  # For plotting graphs
import seaborn as sns  # For enhanced visualizations
from sklearn.model_selection import train_test_split  # To split data into training and testing sets
from sklearn.linear_model import LinearRegression  # For performing linear regression
from sklearn.metrics import mean_squared_error, r2_score  # To evaluate model performance

# Enable inline plotting in Jupyter Notebook
%matplotlib inline

**2. Load the Dataset**

In [None]:
#Write your code here
Salary = pd.read_csv('/content/drive/MyDrive/Salary_Data.csv')

**3. Display the first few rows of the dataset.**

In [None]:
#Write your code here
Salary.head()

| | Age | Gender | Education Level | Job Title | Years of Experience | Salary |
|---|---|---|---|---|---|---|
| 0 | 32.0 | Male | Bachelor's | Software Engineer | 5.0 | 90000.0 |
| 1 | 28.0 | Female | Master's | Data Analyst | 3.0 | 65000.0 |
| 2 | 45.0 | Male | PhD | Senior Manager | 15.0 | 150000.0 |
| 3 | 36.0 | Female | Bachelor's | Sales Associate | 7.0 | 60000.0 |
| 4 | 52.0 | Male | Master's | Director | 20.0 | 200000.0 |


**4. Understand the information of the data**

In [None]:
Salary.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6704 entries, 0 to 6703
Data columns (total 6 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   Age                  6702 non-null   float64
 1   Gender               6702 non-null   object 
 2   Education Level      6701 non-null   object 
 3   Job Title            6702 non-null   object 
 4   Years of Experience  6701 non-null   float64
 5   Salary               6699 non-null   float64
dtypes: float64(3), object(3)
memory usage: 314.4+ KB


**5. Remove columns that the Linear Regression model cannot use.**

Hint: Remove the 'Gender', 'Job Title', and 'Education Level' columns because they contain only text, which the linear regression model cannot use directly.
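As an illustration of the hint above, here is a minimal sketch of two ways to keep only the numeric columns. The frame `toy` is a hypothetical two-row stand-in that borrows the column names of Salary_Data.csv; it is not the real dataset.

```python
import pandas as pd

# Hypothetical two-row stand-in using the column names of Salary_Data.csv
toy = pd.DataFrame({
    "Age": [32.0, 28.0],
    "Gender": ["Male", "Female"],
    "Education Level": ["Bachelor's", "Master's"],
    "Job Title": ["Software Engineer", "Data Analyst"],
    "Years of Experience": [5.0, 3.0],
    "Salary": [90000.0, 65000.0],
})

# Option 1: drop the text columns by name
numeric_by_drop = toy.drop(columns=["Gender", "Education Level", "Job Title"])

# Option 2: keep every numeric column, whatever it is called
numeric_by_dtype = toy.select_dtypes(include="number")

print(list(numeric_by_drop.columns))
print(list(numeric_by_dtype.columns))
```

Both options leave 'Age', 'Years of Experience', and 'Salary'. Selecting the columns by name, as in the cell below, is an equally valid third option.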


In [None]:
# Keep only the numeric columns; the CSV names the column 'Years of Experience'
experience = Salary[['Years of Experience', 'Age', 'Salary']]
experience.head()

**6. Check for any missing values**

In [None]:
experience.isnull().sum()


**7. Remove rows containing any null values**

In [None]:
experience = experience.dropna()  # Remove rows that contain null values
print(experience.isnull().sum())  # Check whether any null values remain

**8. Summarise dataset statistical information**

In [None]:
experience.describe()

# **Exploratory Data Analysis**

**9. Create Pair Plots**

Create pair plots to explore the types of relationship across the entire dataset.

In [None]:
sns.pairplot(Salary)

**10. Find Correlation of all numerical columns.**

* **`1.0`**: perfect positive correlation; that is, when one attribute rises, the other attribute rises.
* **`-1.0`**: perfect negative correlation; that is, when one attribute rises, the other attribute falls.
* **`0.0`**: no correlation; the two columns are not linearly related.
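The three reference values above can be reproduced on small hand-made arrays. This is only a sketch: the arrays and the helper `pearson` are hypothetical and are not part of the exercise data.

```python
import numpy as np

# Hypothetical toy values chosen to illustrate the three reference cases
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
rises_with_x = 2 * x + 1                          # rises whenever x rises
falls_with_x = -3 * x + 10                        # falls whenever x rises
v_shaped = np.array([2.0, 1.0, 0.0, 1.0, 2.0])    # symmetric "V" around x = 3


def pearson(a, b):
    """Pearson correlation coefficient between two 1-D arrays."""
    return np.corrcoef(a, b)[0, 1]


corr_pos = pearson(x, rises_with_x)
corr_neg = pearson(x, falls_with_x)
corr_zero = pearson(x, v_shaped)

print(corr_pos, corr_neg, corr_zero)
```

The V-shaped case shows why a correlation near zero only rules out a linear relationship: the values clearly depend on `x`, just not along a straight line.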

In [None]:
experience.corr() #which column is suitable for your prediction

**11. Display the correlation of all columns through Heatmap.**

Explore the values in this map to identify the feature that is most closely related to "Salary".

In [None]:
sns.heatmap(experience.corr(), annot=True)

# **Training a Linear Regression Model**

**12. Load the feature and target columns.**

In [None]:
x = experience[['Years of Experience']]
y = experience['Salary']


**13. Find the shape of the features and label.**

In [None]:
print(x.shape)
print(y.shape)

In [None]:
#Write your code here

**14. Split the dataset into training and testing sets.**

The data has already been separated into `x`, which contains the feature to train on, and `y`, which contains the target variable (here the 'Salary' column). Now split both into a training set and a testing set.
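To see what the split produces, here is a small sketch on a hypothetical 10-row frame (toy values, not the real dataset). With `test_size=0.2`, two of the ten rows go to the test set and eight to the training set.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 rows, one feature column and one target column
toy = pd.DataFrame({
    "Years of Experience": [float(i) for i in range(1, 11)],
    "Salary": [30000.0 + 5000.0 * i for i in range(1, 11)],
})
x_toy = toy[["Years of Experience"]]  # 2-D feature frame (double brackets)
y_toy = toy["Salary"]                 # 1-D target series

# random_state fixes the shuffle so the split is reproducible
x_tr, x_te, y_tr, y_te = train_test_split(
    x_toy, y_toy, test_size=0.2, random_state=42
)

print(x_tr.shape, x_te.shape, y_tr.shape, y_te.shape)
```

The feature frames stay two-dimensional, which is the shape scikit-learn estimators expect for `fit` and `predict`.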


In [None]:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)

**15. Create the Model.**

In [None]:
model = LinearRegression()

**16. Train the Data Model**

In [None]:
model.fit(x_train, y_train)

# **Model Evaluation**

**17. Find the intercept.**

In [None]:
print(model.intercept_)

**18. Find the coefficient.**

In [None]:
coeff_df = pd.DataFrame(model.coef_, x.columns, columns=['Coefficient'])
coeff_df
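The intercept and coefficient together define the fitted line: a prediction is the intercept plus the coefficient times the feature value. The sketch below checks this on a hypothetical toy model; `X_toy`, `y_toy`, and `toy_model` are illustrative names and values, not the notebook's data or model.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical toy data lying exactly on the line salary = 30000 + 5000 * years
X_toy = np.array([[1.0], [2.0], [3.0], [4.0]])
y_toy = np.array([35000.0, 40000.0, 45000.0, 50000.0])

toy_model = LinearRegression().fit(X_toy, y_toy)

# Rebuild a prediction by hand from the fitted parameters
manual = toy_model.intercept_ + toy_model.coef_[0] * 6.0
from_predict = toy_model.predict(np.array([[6.0]]))[0]

print(toy_model.intercept_, toy_model.coef_[0])
print(manual, from_predict)
```

In the salary model, the coefficient can be read the same way: it is the estimated change in salary for one additional year of experience.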

# **Make Predictions from the trained Model**

**19. Make salary predictions on test data.**

In [None]:
predictions = model.predict(x_test)
print(predictions)

**20. Plot the data and the regression line.**

- `y_test`: The actual salaries in the test set.
- `predictions`: The salaries predicted by the model for the test set.

In [None]:
import matplotlib.pyplot as plt
import numpy as np

plt.figure(figsize=(8, 6))

# Scatter plot: Actual vs Predicted values
plt.scatter(y_test, predictions, edgecolor='black', alpha=0.7, color='pink', label='Predicted Points')

# Regression line (best fit) through predicted vs actual values
z = np.polyfit(y_test, predictions, 1)  # Linear fit (degree=1)
p = np.poly1d(z)
plt.plot(y_test, p(y_test), color='red', linewidth=2, label='Regression Line')

# Perfect prediction line (y=x)
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], linestyle='--', color='green', linewidth=2, label='Perfect Prediction')

# Labels and Title
plt.xlabel('Actual Salary (Y Test)', fontsize=12, weight='bold')
plt.ylabel('Predicted Salary (Y Pred)', fontsize=12, weight='bold')
plt.title('Actual vs Predicted Salary with Regression Line', fontsize=14, weight='bold')

plt.legend()
plt.grid(True, linestyle='--', alpha=0.4)
plt.show()


# **Regression Evaluation Metrics**

**21. Calculate evaluation metrics.**

Example:
- **Mean Absolute Error (MAE):** The average absolute difference between the predicted and actual salaries. [MAE](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.mean_absolute_error.html#sklearn.metrics.mean_absolute_error)
- **Mean Squared Error (MSE):** The average squared difference between the predictions and the actual values; larger errors are penalized more because of the squaring. [MSE](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.mean_squared_error.html#sklearn.metrics.mean_squared_error)
- **Root Mean Squared Error (RMSE):** The square root of the MSE. It measures the typical magnitude of the prediction errors in the same units as the target variable; lower values indicate better predictive accuracy. [RMSE](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.root_mean_squared_error.html#sklearn.metrics.root_mean_squared_error)
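The definitions above can be checked by hand on a few values. In this sketch the `actual` and `predicted` arrays are hypothetical toy numbers; the manual results are compared with the scikit-learn functions of the same name.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Hypothetical toy salaries (not model output from this notebook)
actual = np.array([50000.0, 60000.0, 70000.0])
predicted = np.array([52000.0, 57000.0, 71000.0])

errors = actual - predicted          # -2000, 3000, -1000

mae = np.mean(np.abs(errors))        # average absolute error
mse = np.mean(errors ** 2)           # average squared error
rmse = np.sqrt(mse)                  # back in the units of the target

print("MAE ", mae, mean_absolute_error(actual, predicted))
print("MSE ", mse, mean_squared_error(actual, predicted))
print("RMSE", rmse)
```

Because the largest error (3000) is squared, it dominates the MSE, which is why the RMSE comes out slightly above the MAE here.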

In [None]:
from sklearn import metrics
import numpy as np
print('MAE', metrics.mean_absolute_error(y_test, predictions))
print('MSE', metrics.mean_squared_error(y_test, predictions))
print('RMSE', np.sqrt(metrics.mean_squared_error(y_test, predictions)))
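Step 1 also imports `r2_score`, which the cell above does not use. As a hedged sketch on hypothetical toy values, R² can be computed by hand as one minus the ratio of the residual sum of squares to the total sum of squares, and compared with scikit-learn's result.

```python
import numpy as np
from sklearn.metrics import r2_score

# Hypothetical toy salaries (not model output from this notebook)
actual = np.array([50000.0, 60000.0, 70000.0])
predicted = np.array([52000.0, 57000.0, 71000.0])

ss_res = np.sum((actual - predicted) ** 2)       # unexplained variation
ss_tot = np.sum((actual - actual.mean()) ** 2)   # total variation around the mean
r2_manual = 1.0 - ss_res / ss_tot

r2_sklearn = r2_score(actual, predicted)

print(r2_manual, r2_sklearn)
```

On the real data, the equivalent call would be `r2_score(y_test, predictions)`; a value closer to 1 means the model explains more of the variation in salary.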

**22. Take input from user to make predictions.**

Take 'Years of Experience' as input to make a prediction.


In [None]:
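years_experience = float(input("Enter years of experience: "))
# Build a DataFrame with the training column name so scikit-learn
# does not warn about missing feature names
input_features = pd.DataFrame([[years_experience]], columns=['Years of Experience'])
predicted_salary = model.predict(input_features)
print(f"Predicted Salary for {years_experience} years of experience: ${predicted_salary[0]:.2f}")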