# The Sparks Foundation - Data Science and Business Analytics Internship


# Task 1 - Prediction Using Supervised Machine Learning

## By Joel Ayappa

Aim - To predict a student's percentage score from the number of hours studied, using linear regression, a supervised machine learning algorithm.

Dataset: http://bit.ly/w-data

## **Importing the Dataset**

Here we load the dataset from its URL using the pandas library and then inspect the data.

In [None]:
# Importing all libraries required in this notebook

import pandas as pd
import numpy as np  
import matplotlib.pyplot as plt  
import seaborn as sns
%matplotlib inline

In [None]:
# Reading data from remote link

url = "https://raw.githubusercontent.com/AdiPersonalWorks/Random/master/student_scores%20-%20student_scores.csv"
s_data = pd.read_csv(url)
print("Data imported successfully")

s_data.head(10)

In [None]:
# To check the columns of the dataset
s_data.columns

In [None]:
# To find the number of columns and rows
s_data.shape

In [None]:
# To get information about the dataset
s_data.info()

In [None]:
# To view some basic statistical details 
s_data.describe()

In [None]:
# Check if our dataset contains null or missing values
s_data.isnull().sum()

## **Visualizing the Dataset**

Here we plot the dataset to look for a relationship between the two variables.

In [None]:
# Plotting scores against hours studied
sns.set_style('darkgrid')
plt.rcParams["figure.figsize"]= 9,6
plt.rc('font', size=12)
s_data.plot(x='Hours', y='Scores', style='o')  
plt.title('Hours vs Percentage')  
plt.xlabel('Hours Studied')  
plt.ylabel('Percentage Score')  
plt.show()

**From the above graph, we can observe a positive linear relationship between "hours studied" and "percentage score". We can therefore use a linear regression model to predict further values.**
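
The visual impression of linearity can be backed by a number. A minimal sketch, assuming a small made-up sample (the `hours` and `scores` lists below are hypothetical, not the actual dataset), computes the Pearson correlation coefficient by hand. A value close to 1 indicates a strong positive linear relationship.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and sums of squared deviations
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical sample, for illustration only
hours = [1, 2, 3, 4, 5]
scores = [12, 22, 31, 41, 52]
r = pearson_r(hours, scores)
print(round(r, 3))
```

On the notebook's DataFrame, `s_data['Hours'].corr(s_data['Scores'])` computes the same quantity.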

## **Preparing the Data**

Here we separate the data into the independent variable (input), stored as attributes ('Hours'), and the dependent variable (output), stored as labels ('Scores').

In [None]:
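# X: every column except the last ('Hours'), kept two-dimensional
X = s_data.iloc[:, :-1].values
# y: the second column ('Scores'), one-dimensional
y = s_data.iloc[:, 1].values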

In [None]:
X

In [None]:
y
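
scikit-learn estimators expect the feature matrix `X` to be two-dimensional (rows are samples, columns are features) and the target `y` to be one-dimensional. The slices above produce these shapes: a range slice such as `:-1` keeps the column axis, while a single index such as `1` drops it. A minimal sketch, with a NumPy array standing in for the DataFrame (the three rows are made up for illustration):

```python
import numpy as np

# Hypothetical stand-in for the table: column 0 = Hours, column 1 = Scores
table = np.array([[1.0, 10.0],
                  [2.0, 20.0],
                  [3.0, 30.0]])

X = table[:, :-1]  # range slice keeps two dimensions
y = table[:, 1]    # single index returns one dimension

print(X.shape)  # (3, 1)
print(y.shape)  # (3,)
```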

## **Training the Model**

Here we split the dataset into two parts, training data and testing data, and then train our model on the training data.

In [None]:
# Splitting into testing data and training data
from sklearn.model_selection import train_test_split  
X_train, X_test, y_train, y_test = train_test_split(X, y, 
                            test_size=0.2, random_state=0) 
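
With `test_size=0.2`, one fifth of the rows are held out for testing, and `random_state=0` fixes the shuffle so the split is reproducible. The idea can be sketched in plain Python, assuming a hypothetical 25-sample dataset. This illustrates the concept only and does not reproduce scikit-learn's exact shuffling, so the selected rows will differ from those chosen by `train_test_split`.

```python
import math
import random

def simple_split(n_samples, test_size, seed):
    """Return (train_indices, test_indices) after a seeded shuffle."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)       # same seed -> same order
    n_test = math.ceil(test_size * n_samples)  # size of the held-out part
    return indices[n_test:], indices[:n_test]

# Hypothetical dataset size, for illustration only
train_idx, test_idx = simple_split(n_samples=25, test_size=0.2, seed=0)
print(len(train_idx), len(test_idx))  # 20 5
```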

In [None]:
from sklearn.linear_model import LinearRegression  
regressor = LinearRegression()  
regressor.fit(X_train, y_train) 

print("Training complete.")
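
`LinearRegression` fits the line by ordinary least squares. For a single feature, the slope and intercept have a closed form: the slope is the covariance of `x` and `y` divided by the variance of `x`, and the intercept makes the line pass through the point of means. A minimal sketch in plain Python, using a hypothetical five-point sample rather than the notebook's data:

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for one feature."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x  # line passes through the means
    return slope, intercept

# Hypothetical sample, for illustration only
hours = [1, 2, 3, 4, 5]
scores = [12, 22, 31, 41, 52]
slope, intercept = fit_line(hours, scores)
print(round(slope, 2), round(intercept, 2))  # 9.9 1.9
```

After fitting, the corresponding values in the notebook are available as `regressor.coef_[0]` and `regressor.intercept_`.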

## **Visualizing the Model**

Here we visualize our trained model by drawing the fitted regression line over the data.

In [None]:
# Computing the regression line: y = slope * x + intercept
line = regressor.coef_*X+regressor.intercept_

# Plotting the full dataset with the fitted line
sns.set_style('darkgrid')
plt.rcParams["figure.figsize"]= 9,6
plt.rc('font', size=12)
plt.scatter(X, y)
plt.plot(X, line, color="black")
plt.xlabel('Hours Studied')
plt.ylabel('Percentage Score')
plt.show()

## **Making Predictions**

Here we make predictions on the test data, and on a value of our own, using the trained model.

In [None]:
# Testing data - In Hours
print(X_test) 

# Predicting the scores
y_pred = regressor.predict(X_test) 

In [None]:
# Comparing Actual vs Predicted
df = pd.DataFrame({'Actual': y_test, 'Predicted': y_pred})  
df 

In [None]:
# R² score of the model on the training data and on the test data
print("Training score:", regressor.score(X_train, y_train))
print("Testing score:", regressor.score(X_test, y_test))

In [None]:
# Testing the model with our own data
hours = 9.25
own_pred = regressor.predict([[hours]])
print("No of Hours = {}".format(hours))
print("Predicted Score = {}".format(own_pred[0]))
print("The predicted score of a person who studies for",hours,"hours is",own_pred[0])

**Conclusion - The predicted score for a person who studies for 9.25 hours is approximately 93.69.**

## **Evaluating the Model**

Here we evaluate our trained model on the test data by computing the R² score (coefficient of determination) and the mean absolute error.

In [None]:
# Finding the R² score (coefficient of determination)
from sklearn import metrics
metrics.r2_score(y_test,y_pred)

In [None]:
# Calculating mean absolute error
print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, y_pred))
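
Both metrics are simple to compute by hand, which makes clear what they measure. The R² score is one minus the ratio of the residual sum of squares to the total sum of squares around the mean, and the mean absolute error is the average size of the prediction errors, in the same units as the scores. A minimal sketch with hypothetical actual and predicted values (not the notebook's results):

```python
def r2_score_manual(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

def mean_absolute_error_manual(y_true, y_pred):
    """Average absolute difference between actual and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical values, for illustration only
actual = [20, 30, 40, 50]
predicted = [22, 28, 41, 49]
print(round(r2_score_manual(actual, predicted), 2))   # 0.98
print(mean_absolute_error_manual(actual, predicted))  # 1.5
```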

**A high R² score and a small mean absolute error indicate that the model fits this dataset well.**