# Linear Regression

Dataset link: https://www.kaggle.com/karthickveerakumar/salary-data-simple-linear-regression  <br>
This dataset shows how salary varies with years of experience.

In [None]:
#importing all necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
from sklearn import linear_model
%matplotlib inline

In [None]:
# Read the CSV file and give the two columns descriptive names
data = pd.read_csv('LR2.csv')
data.columns = ['YearsExperience', 'Salary']
data.head()

In [None]:
# Summary statistics (count, mean, std, min, quartiles, max) for each column
data.describe()

In [None]:
#Plotting a scatter plot of the dataset
data.plot(x='YearsExperience', y='Salary', style='x', color='blue')
plt.title('Years of Experience vs Salary')
plt.xlabel('Years of Experience')
plt.ylabel('Salary')
plt.show()

In [None]:
# sns.regplot draws a scatter plot of the data together with a fitted linear regression line
sns.regplot(x = "YearsExperience", y = "Salary", data = data, ci = None)

In [None]:
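X = data.YearsExperience
Y = data.Salary

# Means of the predictor and the response
Xmean = np.mean(X)
Ymean = np.mean(Y)

# Least-squares estimates:
#   m = sum((x - Xmean) * (y - Ymean)) / sum((x - Xmean)**2)
#   c = Ymean - m * Xmean
a = 0  # numerator: sum of cross-deviations
b = 0  # denominator: sum of squared deviations of X
for i in range(len(X)):
    a += (X[i] - Xmean)*(Y[i] - Ymean)
    b += (X[i] - Xmean)**2
m = a / b
c = Ymean - m*Xmean
print(f'slope m = {m:.3f}')
print(f'y intercept c = {c:.3f}')

# SciPy's built-in linregress gives the same slope and intercept
print(stats.linregress(X, Y))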

In [None]:
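Y_predicted = m*X + c
plt.scatter(X, Y)
# Draw the fitted line between the smallest and largest X values,
# evaluating the model at those points so the line is correct for any slope sign
x_line = np.array([X.min(), X.max()])
plt.plot(x_line, m*x_line + c, color='orange')
plt.title('Least Squares regression line through the scatter plot')
plt.xlabel('Years of Experience')
plt.ylabel('Salary')
plt.show()
print(Y_predicted)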

**Coefficient of determination** <br>
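The coefficient of determination (R²) measures the proportion of the variance in the dependent variable (Salary) that is explained by the independent variable (YearsExperience) through the fitted model. It ranges from 0 (the model explains none of the variability) to 1 (the model explains all of it).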
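
As a minimal sketch of what this number means, R² can also be computed directly as 1 − SS_res / SS_tot, where SS_res is the sum of squared residuals and SS_tot is the total sum of squares around the mean. The small arrays below are made-up illustration values, not the Kaggle salary data. For a straight-line fit with one predictor, the result should match the squared correlation.

```python
import numpy as np

# Hypothetical toy data for illustration only (not the salary dataset)
years = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
salary = np.array([40.0, 45.0, 52.0, 58.0, 61.0])  # made-up values, in thousands

# Least-squares slope and intercept (same formulas as the loop-based fit)
dx = years - years.mean()
dy = salary - salary.mean()
slope = np.sum(dx * dy) / np.sum(dx ** 2)
intercept = salary.mean() - slope * years.mean()
predicted = slope * years + intercept

# R^2 = 1 - (unexplained variation) / (total variation)
ss_res = np.sum((salary - predicted) ** 2)
ss_tot = np.sum((salary - salary.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot

# For simple linear regression this equals the squared Pearson correlation
r_squared_from_corr = np.corrcoef(years, salary)[0, 1] ** 2

print(f"R^2 from residuals:   {r_squared:.4f}")
print(f"R^2 from correlation: {r_squared_from_corr:.4f}")
```

The equality of the two values holds for a least-squares line with a single predictor and an intercept; it is not a general property of other models.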

In [None]:
# Coefficient of determination: for simple linear regression,
# R^2 equals the squared Pearson correlation between x and y
x = data.YearsExperience
y = data.Salary
correlation_matrix = np.corrcoef(x, y)
correlation_xy = correlation_matrix[0,1]
rsquared = correlation_xy**2
print(f'coefficient of determination = {rsquared:.4f}')

**Correlation coefficient** <br>
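The Pearson correlation coefficient (r) measures the strength and direction of the linear relationship between two variables. It ranges from -1 (perfect negative linear relationship) through 0 (no linear relationship) to +1 (perfect positive linear relationship).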
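
To make the definition concrete, here is a small sketch that computes Pearson's r by hand as the sum of cross-deviations divided by the square root of the product of the squared-deviation sums. The helper name `pearson_r` and the toy arrays are hypothetical and only serve as an illustration; the result is compared against `np.corrcoef`.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation computed from deviations around the means."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()
    dy = y - y.mean()
    return np.sum(dx * dy) / np.sqrt(np.sum(dx ** 2) * np.sum(dy ** 2))

# Hypothetical toy data for illustration only (not the salary dataset)
years = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
salary = np.array([40.0, 45.0, 52.0, 58.0, 61.0])

r = pearson_r(years, salary)
r_numpy = np.corrcoef(years, salary)[0, 1]

# A perfectly decreasing straight line should give r = -1
r_negative = pearson_r(years, -2.0 * years + 10.0)

print(f"r (by hand) = {r:.4f}")
print(f"r (numpy)   = {r_numpy:.4f}")
print(f"r for a decreasing line = {r_negative:.4f}")
```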

In [None]:
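# Pearson correlation coefficient and the p-value of the test for no correlation
r, p_value = stats.pearsonr(data['YearsExperience'], data['Salary'])
print(f'Coefficient of correlation r = {r:.4f} (p-value = {p_value:.3g})')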

# Observations:

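--Variable x (years of experience) and variable y (salary) have a positive linear relationship: salary tends to rise as experience increases. <br>
--The fitted line has a positive slope, so each additional year of experience is associated with a roughly constant increase in salary. Unless the intercept is zero, this is a linear relationship rather than strict direct proportionality. <br>
--The 95% figure should be read as the approximate share of the variance in salary explained by the fitted line (the coefficient of determination), not as a prediction accuracy. The correlation coefficient denotes a strong positive relationship, so the line is a reasonable basis for estimating salary from experience within the observed range.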
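
To judge how well such a line predicts unseen salaries, one option is to hold out part of the data, fit on the rest, and measure the error on the held-out points. The sketch below assumes synthetic data generated with an arbitrary slope, intercept, and noise level (all made up, not taken from the salary dataset) and uses `np.polyfit` for the fit. It illustrates the idea and is not the method used in this notebook.

```python
import numpy as np

# Synthetic data: the coefficients and noise level are arbitrary assumptions
rng = np.random.default_rng(0)
years = np.linspace(1, 10, 30)
salary = 25000 + 9500 * years + rng.normal(0, 5000, size=years.size)

# Shuffle the indices and hold out 6 of the 30 points for testing
idx = rng.permutation(years.size)
test_idx, train_idx = idx[:6], idx[6:]

# Fit a degree-1 polynomial (a straight line) on the training points only
slope, intercept = np.polyfit(years[train_idx], salary[train_idx], 1)

# Predict the held-out salaries and measure the typical error
predicted = slope * years[test_idx] + intercept
rmse = np.sqrt(np.mean((salary[test_idx] - predicted) ** 2))

print(f"slope = {slope:.1f}, intercept = {intercept:.1f}")
print(f"RMSE on held-out points = {rmse:.1f}")
```

An error measure such as RMSE is expressed in salary units, which makes it easier to interpret as prediction quality than R² or the correlation coefficient.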