ADMISSION PREDICTION
======
## Introduction

This is a simple dataset about US university admission acceptance rates. I wrote some simple code on it to improve my skills in Linear Regression.

## Data Set

The Admission_Predict dataset contains 400 rows of students' scores. The output is each student's chance of admission (a value between 0 and 1) based on those scores. There are 6 score columns and one column for the university rating.

## Objective

On this dataset I will use 3 methods to predict the output (the chance of admission):

**Normal Equation**

**Sklearn Tool**

**Gradient Descent**

In [158]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [159]:
df=pd.read_csv(r'C:\Users\Asus\Downloads\Admission_Predict.csv')

In [160]:
df.head()

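| | Serial No. | GRE Score | TOEFL Score | University Rating | SOP | LOR | CGPA | Research | Chance of Admit |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 337 | 118 | 4 | 4.5 | 4.5 | 9.65 | 1 | 0.92 |
| 1 | 2 | 324 | 107 | 4 | 4.0 | 4.5 | 8.87 | 1 | 0.76 |
| 2 | 3 | 316 | 104 | 3 | 3.0 | 3.5 | 8.0 | 1 | 0.72 |
| 3 | 4 | 322 | 110 | 3 | 3.5 | 2.5 | 8.67 | 1 | 0.8 |
| 4 | 5 | 314 | 103 | 2 | 2.0 | 3.0 | 8.21 | 0 | 0.65 |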


In [161]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 400 entries, 0 to 399
Data columns (total 9 columns):
Serial No.           400 non-null int64
GRE Score            400 non-null int64
TOEFL Score          400 non-null int64
University Rating    400 non-null int64
SOP                  400 non-null float64
LOR                  400 non-null float64
CGPA                 400 non-null float64
Research             400 non-null int64
Chance of Admit      400 non-null float64
dtypes: float64(4), int64(5)
memory usage: 28.2 KB


In [162]:
df=df.drop(['Serial No.'],axis=1)

In [163]:
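#Note: 'LOR ' and 'Chance of Admit ' have a trailing space in this CSV's header
df_X=df[['GRE Score','TOEFL Score','University Rating','SOP','LOR ','CGPA','Research']]
X=np.array(df_X)
#Target as a column vector of shape (400, 1)
y=np.array([df['Chance of Admit ']]).T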
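
The trailing spaces in `'LOR '` and `'Chance of Admit '` are easy to mistype. One option is to strip whitespace from the column names right after loading. The sketch below uses a small hypothetical frame named `demo` as a stand-in for `df`; if the same step were applied to `df`, the column selections in this notebook would need the names without the trailing space.

```python
import pandas as pd

# Hypothetical two-row frame that mimics the trailing spaces in the real header.
demo = pd.DataFrame({"LOR ": [4.5, 3.5], "Chance of Admit ": [0.92, 0.72]})

# Strip leading and trailing whitespace from every column name.
demo.columns = demo.columns.str.strip()

print(list(demo.columns))
```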

In [164]:
print(X.shape)
print(y.shape)

(400, 7)
(400, 1)


# METHOD 1: NORMAL EQUATION
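
For reference, the closed form computed in this section comes from setting the gradient of the squared-error cost to zero. In the notation assumed here, $X$ is the design matrix with the column of ones (`Xtrain`), $y$ is the target vector (`ytrain`), $m$ is the number of training rows, and $^{+}$ is the pseudo-inverse (matching `np.linalg.pinv`), so that `A` corresponds to $X^\top X$ and `b` to $X^\top y$.

```latex
\begin{aligned}
J(w) &= \frac{1}{2m}\,\lVert Xw - y\rVert_2^2 \\
\nabla_w J(w) &= \frac{1}{m}\,X^\top (Xw - y) = 0 \\
\Rightarrow\; X^\top X\, w &= X^\top y \\
w &= (X^\top X)^{+}\, X^\top y
\end{aligned}
```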

In [165]:
#create a column vector of ones (it gives the model an intercept term)
one=np.ones([X.shape[0],1])
#prepend the vector "one" to matrix X as its first column:
Xone=np.concatenate([one,X],axis=1)

In [166]:
#check shape again
Xone.shape

(400, 8)

In [167]:
#Split the test and the train set:
from sklearn.model_selection import train_test_split
Xtrain,Xtest,ytrain,ytest=train_test_split(Xone,y, test_size=0.3, random_state=42)

In [168]:
#Train model using Normal Equation:
A=np.dot(Xtrain.T,Xtrain)
b=np.dot(Xtrain.T,ytrain)
w=np.dot(np.linalg.pinv(A),b)
print(w)

[[-1.28417806e+00]
 [ 1.83981105e-03]
 [ 3.17072240e-03]
 [ 4.86625520e-03]
 [ 9.94694263e-04]
 [ 1.36946012e-02]
 [ 1.17818232e-01]
 [ 1.84391289e-02]]
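
As a sanity check on the normal equation, the same weights can be obtained from a least-squares solver that works on the design matrix directly, without forming `X.T @ X`. The sketch below uses synthetic data (`X_demo`, `y_demo`, and `true_w` are made-up stand-ins, not the admission data) and compares the two solutions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for Xtrain: a column of ones plus three random features.
X_demo = np.hstack([np.ones((50, 1)), rng.normal(size=(50, 3))])
true_w = np.array([[0.5], [1.0], [-2.0], [0.25]])
y_demo = X_demo @ true_w + 0.01 * rng.normal(size=(50, 1))

# Normal equation with a pseudo-inverse, as in the cell above.
w_pinv = np.linalg.pinv(X_demo.T @ X_demo) @ (X_demo.T @ y_demo)

# Least-squares solver applied to the design matrix directly.
w_lstsq, *_ = np.linalg.lstsq(X_demo, y_demo, rcond=None)

# Both routes should agree up to floating-point error.
print(np.allclose(w_pinv, w_lstsq))
```

Working on `X` directly tends to be numerically safer when features are strongly correlated, because forming `X.T @ X` squares the condition number.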


In [169]:
#Test the accuracy of the model using MSE
ypred=np.dot(Xtest,w)
from sklearn.metrics import mean_squared_error
mean_squared_error(ytest,ypred)

0.004652821846448185
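
MSE is in squared units of the target, so it is often easier to read after taking the square root (RMSE), which is on the same 0-to-1 scale as the chance of admission. A minimal sketch with made-up numbers (the helper `mse` and the arrays are illustrative, not part of the notebook):

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean of the squared residuals."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2))

# Tiny made-up example, not taken from the admission data.
y_true_demo = np.array([0.9, 0.7, 0.6, 0.8])
y_pred_demo = np.array([0.8, 0.7, 0.7, 0.8])

mse_demo = mse(y_true_demo, y_pred_demo)
rmse_demo = mse_demo ** 0.5  # back on the scale of the target

print(mse_demo, rmse_demo)
```

By the same calculation, the test MSE of about 0.00465 above corresponds to an RMSE of about 0.068.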

# METHOD 2: SKLEARN TOOL

In [13]:
#import and train model
from sklearn.linear_model import LinearRegression
lr=LinearRegression()
lr.fit(Xtrain,ytrain)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

In [14]:
ypred1=lr.predict(Xtest)

In [15]:
#Test the accuracy by MSE, it's the same as Normal Equation.
mean_squared_error(ytest,ypred1)

0.0046528218464812745
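
Note that `Xtrain` still contains the column of ones from Method 1, while `LinearRegression` fits its own intercept by default (`fit_intercept=True`). The extra column is redundant rather than harmful. The sketch below illustrates this on synthetic data (all names and values are made up): the coefficient on the constant column comes out at essentially zero and the predictions match a fit without it.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)

# Synthetic features and target, unrelated to the admission data.
features = rng.normal(size=(40, 2))
target = 0.3 + features @ np.array([1.5, -0.5]) + 0.01 * rng.normal(size=40)

# Same features with a redundant column of ones prepended.
with_ones = np.hstack([np.ones((40, 1)), features])

model_a = LinearRegression().fit(with_ones, target)   # ones column + intercept
model_b = LinearRegression().fit(features, target)    # intercept only

print(model_a.coef_[0])  # weight on the constant column
print(np.allclose(model_a.predict(with_ones), model_b.predict(features)))
```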

# METHOD 3: GRADIENT DESCENT
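
The update used in the function below is a stochastic gradient step on the squared error of a single row. In the notation assumed here, $x_i$ and $y_i$ are the features and target of row $i$, $w$ is the weight vector, and $\alpha$ is the learning rate (`alpha`):

```latex
\begin{aligned}
J_i(w) &= \tfrac{1}{2}\,\bigl(y_i - w^\top x_i\bigr)^2 \\
\nabla_w J_i(w) &= -\bigl(y_i - w^\top x_i\bigr)\,x_i \\
w &\leftarrow w - \alpha\,\nabla_w J_i(w) = w + \alpha\,\bigl(y_i - w^\top x_i\bigr)\,x_i
\end{aligned}
```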

In [154]:
def np_solve_via_gradient_descent(x, y,w_init,niter=1000000, alpha=0.0000005,tol=1e-4):
    m, n = np.shape(x) #m is the number of rows (280 in the train set) and n is the number of columns (8, including the column of ones)
    w=[w_init] #history of weights, starting from the initial weights (zeros here)
    count=0
    check=20 #test for convergence every 20 updates
    while count<niter: #each pass of this loop is one shuffled sweep over the data
        mix=np.random.permutation(m) 
        for i in mix: #visit the rows in random order (stochastic gradient descent)
            xi=x[i,:].reshape(n,1) #feature values of row i as a column vector
            yi=y[i] #target value of row i
            zi=np.dot(w[-1].T,xi) #prediction for row i
            w_new=w[-1]+alpha*(yi-zi)*xi #new weights = old weights - learning rate * gradient of the squared error
            count+=1
            #stopping condition: the weights barely moved over the last `check` updates
            if count%check==0:
                if np.linalg.norm(w_new-w[-check])<tol:
                    return w
            w.append(w_new) #store the updated weights
    return w
m, n = np.shape(Xtrain)
w_init = np.zeros([n, 1])
w = np_solve_via_gradient_descent(Xtrain, ytrain, w_init)
print(w[-1])

[[6.17512705e-06]
 [1.99919356e-03]
 [6.83867294e-04]
 [2.34782513e-05]
 [2.47863684e-05]
 [2.44841119e-05]
 [5.60352862e-05]
 [4.88936800e-06]]


In [155]:
ypred2=np.dot(Xtest,w[-1])

In [156]:
mean_squared_error(ytest,ypred2)

0.016925081136135375
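
The gradient descent weights above are far from the normal-equation weights, which suggests the run stopped before converging. A common remedy is to standardize the features first, which allows a much larger learning rate. The sketch below is not the notebook's stochastic method: it uses full-batch gradient descent on synthetic data (all names, scales, and hyperparameters are assumptions chosen for illustration) and checks the result against a least-squares reference.

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic features on very different scales (a large-valued score and a 0/1 flag).
raw = np.column_stack([
    rng.normal(315, 10, size=200),
    rng.integers(0, 2, size=200),
]).astype(float)
target = 0.7 + 0.005 * (raw[:, 0] - 315) + 0.03 * raw[:, 1] + 0.01 * rng.normal(size=200)

# Standardize each feature, then prepend the column of ones.
mean, std = raw.mean(axis=0), raw.std(axis=0)
X_std = np.hstack([np.ones((200, 1)), (raw - mean) / std])

# Full-batch gradient descent on the mean squared error.
w_gd = np.zeros(3)
alpha = 0.1
for _ in range(2000):
    grad = X_std.T @ (X_std @ w_gd - target) / len(target)
    w_gd -= alpha * grad

# Reference least-squares solution on the same standardized design matrix.
w_ref, *_ = np.linalg.lstsq(X_std, target, rcond=None)

print(np.allclose(w_gd, w_ref, atol=1e-6))
```

When standardizing real data, the mean and standard deviation should be computed on the training set only and then reused to transform the test set.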

CONCLUSION
=====
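This is a very simple dataset that does not have any null values. 

I prefer **Gradient Descent** to the **Sklearn Tool** and the **Normal Equation** as a learning exercise, but the results above do not show that the latter two overfit. All three mean squared error (MSE) values are computed on the held-out test set, so their lower MSE (about 0.0047, versus about 0.017 for Gradient Descent) means better predictions on unseen data. An MSE of 0.017 corresponds to a root-mean-square error of about 0.13 on the 0-to-1 admission scale, not 1.7%. The gap is most likely because Gradient Descent ran on unscaled features with a very small learning rate and stopped before its weights got close to the least-squares solution.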