# **Data Science and Business Analytics (GRIP May'21)**
## **Task 1 : Prediction using supervised ML**
### **Author : Jeet Sahoo**
#### Objective: Predict a student's percentage score from the number of hours studied using linear regression, and predict the score for a student who studies 9.25 hours per day

## **Linear Regression with Python Scikit Learn**
In this task we will see how the Python scikit-learn (sklearn) machine learning library can be used to implement regression. We will start with simple linear regression involving two variables.
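
As a rough illustration of what simple linear regression computes, the slope and intercept of the least-squares line can be obtained in closed form. The sketch below uses made-up data that is not the task's dataset, and the names `hours_demo` and `scores_demo` are assumptions introduced only for this example.

```python
import numpy as np

# Hypothetical data lying exactly on the line y = 2x + 1 (not the real dataset)
hours_demo = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
scores_demo = 2.0 * hours_demo + 1.0

# Least-squares estimates for the line y = slope * x + intercept
x_mean = hours_demo.mean()
y_mean = scores_demo.mean()
slope = np.sum((hours_demo - x_mean) * (scores_demo - y_mean)) / np.sum((hours_demo - x_mean) ** 2)
intercept = y_mean - slope * x_mean

print("slope:", slope)
print("intercept:", intercept)
```

For a single feature, scikit-learn's `LinearRegression` fits the same least-squares line, so its `coef_` and `intercept_` play the roles of `slope` and `intercept` here.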

#### Importing Required Libraries

In [None]:
# Importing the required libraries
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn import metrics  # Provides mean_absolute_error, mean_squared_error, r2_score
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

#### Exploring and Understanding Data

In [None]:
# Reading data from remote link
url = r"http://bit.ly/w-data"
sample_data = pd.read_csv(url)
print("Data import successful")

sample_data.head(20) #To see first 20 rows of data

In [None]:
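# Understanding the data (a cell only displays its last expression, so print each result)
print(sample_data.shape)  # Number of rows and columns
sample_data.info()  # Column types and non-null counts
print(sample_data.describe())  # Summary statistics
print(sample_data.corr())  # Correlation between Hours and Scores
print(sample_data.isnull().sum())  # Missing values per column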

#### Visualizing Data

In [None]:
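# Scatter plot of hours studied against percentage score
# font1 and font2 are reused for the title and axis labels of the regression plot
font1 = {'family':'Calibri','color':'blue','size':20}
font2 = {'family':'serif','color':'darkred','size':15}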
sample_data.plot(x='Hours', y='Scores', style='o')  
plt.title('Hours vs Percentage')  
plt.xlabel('Hours Studied')  
plt.ylabel('Percentage Score')  
plt.show()

#### PreProcessing Data

In [None]:
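x = sample_data.iloc[:, :-1].values  # Feature: every column except the last (Hours), as a 2D array
y = sample_data.iloc[:, 1].values  # Target: the second column (Scores), as a 1D array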
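
scikit-learn estimators expect the features as a 2D array of shape `(n_samples, n_features)` and the target as a 1D array, which is why the two slices above are written differently. A minimal sketch of the resulting shapes, assuming a small hypothetical frame `demo` with the same column names as the dataset:

```python
import pandas as pd

# Hypothetical stand-in for the dataset, using the same column names
demo = pd.DataFrame({"Hours": [1.5, 3.0, 4.5], "Scores": [20, 35, 50]})

x_demo = demo.iloc[:, :-1].values  # Every column except the last: stays 2D
y_demo = demo.iloc[:, 1].values    # A single column selected by position: 1D

print(x_demo.shape)
print(y_demo.shape)
```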

#### Splitting Data and Training Algorithm

In [None]:
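# Hold out 20% of the data for testing; random_state makes the split reproducible
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)
print('Splitting complete.')

# Fit a linear regression model on the training data
regressor = LinearRegression()
regressor.fit(x_train, y_train)  # x_train is already 2D, so no reshape is needed
print("Training complete.")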

#### Visualizing the best fit Line of Regression

In [None]:
# Computing the regression line from the fitted coefficients
print('Intercept value is:', regressor.intercept_)
print('Linear coefficient is:', regressor.coef_)
line = regressor.coef_ * x + regressor.intercept_

# Plotting the full dataset together with the fitted regression line
plt.scatter(x, y)
plt.title('Linear Regression vs trained model', fontdict=font1)
plt.xlabel('Hours studied', fontdict=font2)
plt.ylabel('Score obtained', fontdict=font2)
plt.plot(x, line, color='red')
plt.show()

In [None]:
print("Training Score: ",regressor.score(x_train,y_train)*100)

#### Predicting Data

In [None]:
# Testing data
print(x_test) # In Hours
# Model Prediction 
y_pred = regressor.predict(x_test) # Predicting the scores
print(y_pred)

#### Comparison between actual result and predicted result

In [None]:
# Comparing Actual vs Predicted
df = pd.DataFrame({'Actual': y_test, 'Predicted': y_pred, 'Difference/Error': y_test - y_pred})
df

In [None]:
# Estimating training and test scores (score() returns the R^2 coefficient of determination)
print("Training Score:",regressor.score(x_train,y_train))
print("Test Score:",regressor.score(x_test,y_test))

In [None]:
# Comparing the actual and predicted values through their estimated densities
# (sns.distplot is deprecated; sns.kdeplot draws the same curve as hist=False)
sns.kdeplot(y_test, color="purple", label="Actual")
sns.kdeplot(y_pred, color="green", label="Predicted")
plt.legend()
plt.show()

#### Application as per Requirements

In [None]:
# Testing the model with our own data
hours = 9.25
test = np.array([[hours]])  # 2D array: one sample, one feature
pred = regressor.predict(test)
print("Predicted score for a student who studies {} hours/day = {}".format(hours, pred[0]))

#### Evaluating the Model

In [None]:
print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, y_pred))  # MAE
print('Mean Squared Error:', metrics.mean_squared_error(y_test, y_pred))  # MSE
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))  # RMSE
print('R-2:', metrics.r2_score(y_test, y_pred))  # R^2 score
print("Slope of regression line:", regressor.coef_)
print("Y-intercept of regression line:", regressor.intercept_)
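
To make the metrics above concrete, here is a minimal sketch that computes them by hand with NumPy. The `actual` and `predicted` arrays are made-up values for illustration, not results from this model.

```python
import numpy as np

# Made-up actual and predicted values, for illustration only
actual = np.array([20.0, 30.0, 40.0, 50.0])
predicted = np.array([22.0, 28.0, 41.0, 47.0])

errors = actual - predicted

mae = np.mean(np.abs(errors))   # Mean absolute error
mse = np.mean(errors ** 2)      # Mean squared error
rmse = np.sqrt(mse)             # Root mean squared error

# R^2 = 1 - (residual sum of squares) / (total sum of squares)
ss_res = np.sum(errors ** 2)
ss_tot = np.sum((actual - actual.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot

print("MAE:", mae)
print("MSE:", mse)
print("RMSE:", rmse)
print("R^2:", r2)
```

MAE and RMSE are in the same units as the scores, while R^2 is unitless and is also the quantity returned by `regressor.score`.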

### Conclusion

I successfully carried out the prediction task using supervised ML and evaluated the model's performance on several metrics.

#### Thank You