# Graduate Admissions Data Set
Dataset and context found here:

- https://www.kaggle.com/ashwathbalaji/graduate-admissions-amazing-insights
- https://towardsdatascience.com/the-complete-guide-to-linear-regression-in-python-3d3f8f06bf8
- https://www.kaggle.com/mohansacharya/graduate-admissions

## Context
This dataset is created for prediction of Graduate Admissions from an Indian perspective.

## Content
The dataset contains several parameters which are considered important during the application for Masters Programs. The parameters included are:

1. GRE Scores (out of 340)
2. TOEFL Scores (out of 120)
3. University Rating (out of 5)
4. Statement of Purpose and Letter of Recommendation Strength (out of 5)
5. Undergraduate GPA (out of 10)
6. Research Experience (either 0 or 1)
7. Chance of Admit (ranging from 0 to 1)
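
As a quick sanity check on the ranges listed above, a sketch like the following flags values that fall outside the documented bounds. The column names are assumed to match the CSV headers used later in this notebook (including the trailing spaces in `'LOR '` and `'Chance of Admit '`). `EXPECTED_RANGES` and `find_out_of_range` are hypothetical helpers, the lower bounds of 0 are an assumption, and the sample rows are made up for illustration.

```python
# Documented upper bounds for each column; lower bounds of 0 are an assumption.
EXPECTED_RANGES = {
    "GRE Score": (0, 340),
    "TOEFL Score": (0, 120),
    "University Rating": (0, 5),
    "SOP": (0, 5),
    "LOR ": (0, 5),
    "CGPA": (0, 10),
    "Research": (0, 1),
    "Chance of Admit ": (0, 1),
}


def find_out_of_range(rows, ranges=EXPECTED_RANGES):
    """Return (row_index, column, value) for every value outside its range."""
    problems = []
    for i, row in enumerate(rows):
        for column, (low, high) in ranges.items():
            value = row.get(column)
            # Columns missing from a row are skipped rather than reported.
            if value is not None and not (low <= value <= high):
                problems.append((i, column, value))
    return problems


# Made-up rows for illustration only, not taken from the dataset.
sample_rows = [
    {"GRE Score": 320, "TOEFL Score": 110, "CGPA": 8.9, "Research": 1},
    {"GRE Score": 345, "TOEFL Score": 105, "CGPA": 10.4, "Research": 0},
]
issues = find_out_of_range(sample_rows)
print(issues)
```

With a pandas DataFrame, `data.to_dict("records")` yields rows in the shape this helper expects.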

## Acknowledgements
This dataset is inspired by the UCLA Graduate Dataset. The test scores and GPA are in the older format. The dataset is owned by Mohan S Acharya.

## Inspiration
This dataset was built with the purpose of helping students in shortlisting universities with their profiles. The predicted output gives them a fair idea about their chances for a particular university.

## Citation
Please cite the following if you are interested in using the dataset: Mohan S Acharya, Asfia Armaan, Aneeta S Antony: A Comparison of Regression Models for Prediction of Graduate Admissions, IEEE International Conference on Computational Intelligence in Data Science 2019

I would like to thank all of you for contributing to this dataset through discussions and questions. I am in awe of the number of kernels built on this dataset. Some results and visualisations are fantastic and make me a proud owner of the dataset. Keep 'em coming! Thank you.

In [None]:
# Install the pip packages this notebook imports into the current Jupyter kernel.
# This makes sure the Python environment has the correct packages installed.
import sys
!{sys.executable} -m pip install pandas numpy matplotlib scikit-learn statsmodels seaborn
!{sys.executable} -m pip install plotly


In [None]:
# load packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
import statsmodels.api as sm

import seaborn as sns
import warnings
import statistics as st
warnings.filterwarnings('ignore')
%matplotlib inline

In [None]:
# read csv dataset into an object
data = pd.read_csv("data/graduate-admissions/Admission_Predict.csv")
data.head()

In [None]:
print('Number of Rows & Columns: ' , data.shape)

In [None]:
data.describe().T[1:7]

In [None]:
# declare variables to represent the column names
chanceOfAdmitColumn = 'Chance of Admit '
greScoreColumn = 'GRE Score'

# create and show the scatter plot of gre score vs. chance of admission

plt.figure(figsize=(16, 8))
plt.scatter(
    data[greScoreColumn],
    data[chanceOfAdmitColumn],
    c='black'
)
plt.xlabel("GRE Score")
plt.ylabel("Chance of Admission")
plt.show()

In [None]:
# get the linear regression best fit line

X = data[greScoreColumn].values.reshape(-1,1)
Y = data[chanceOfAdmitColumn].values.reshape(-1,1)
reg = LinearRegression()
reg.fit(X, Y)
print("The linear model is: Y = {:.5} + {:.5}X".format(reg.intercept_[0], reg.coef_[0][0]))

In [None]:

predictions = reg.predict(X)

# create the scatter plot again
plt.figure(figsize=(16, 8))
plt.scatter(
    data[greScoreColumn],
    data[chanceOfAdmitColumn],
    c='black'
)

# add the best fit line on top of the scatter plot

plt.plot(
    data[greScoreColumn],
    predictions,
    c='blue',
    linewidth=2
)
plt.xlabel("GRE Score")
plt.ylabel("Chance of Admission")
plt.show()

In [None]:
# print summary about the linear regression relationship. 

X = data[greScoreColumn]
y = data[chanceOfAdmitColumn]
X2 = sm.add_constant(X)
est = sm.OLS(y, X2)
est2 = est.fit()
print(est2.summary())

In [None]:
#Label 1 if x>0.80 and 0 if x<=0.80

# we are adding a new column to the data set
admitThreshholdColumn = 'ChanceAdmit'
data[admitThreshholdColumn] = data[chanceOfAdmitColumn].map(lambda x : 1 if x>0.80 else 0)

plt.figure(figsize=(16, 8))
plt.scatter(
    data[greScoreColumn],
    data[chanceOfAdmitColumn],
    c=data[admitThreshholdColumn]
)
plt.xlabel("GRE Score")
plt.ylabel("Chance of Admission")
plt.show()

In [None]:
# multiple linear regression

Xs = data.drop(['Serial No.','Chance of Admit ', 'ChanceAdmit'], axis=1)
y = data[chanceOfAdmitColumn].values.reshape(-1,1)
reg = LinearRegression()
reg.fit(Xs, y)
print("The linear model is: Y = {:.5} + {:.5}*GRE Score + {:.5}*TOEFL Score + {:.5}*University Rating + {:.5}*SOP + {:.5}*LOR + {:.5}*CGPA + {:.5}*Research".format(reg.intercept_[0], reg.coef_[0][0], reg.coef_[0][1], reg.coef_[0][2], reg.coef_[0][3], reg.coef_[0][4], reg.coef_[0][5], reg.coef_[0][6]))

### Determine if there is a strong correlation with the variables


In [None]:
X = np.column_stack((data['GRE Score'], data['TOEFL Score'],  data['University Rating'], data['SOP'], data['LOR '], data['CGPA'], data['Research']))
y = data['Chance of Admit ']
X2 = sm.add_constant(X)
est = sm.OLS(y, X2)
est2 = est.fit()
print(est2.summary())

### Results of the summary

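The R-squared value is 0.803, so the model explains about 80% of the variance in chance of admission, which indicates a strong fit.

The F-statistic is 228.9, far above 1, which indicates that the predictors taken together are significantly related to chance of admission.

We can look at the p-values to judge which variables are significant in the model. The labels x1 through x7 follow the column order passed to `np.column_stack`, so x3 is University Rating, x4 is SOP, and x6 is CGPA. The only coefficient that stands out as large appears to be the one for x6 (CGPA).

x3 and x4 have high p-values. This indicates that, with the other variables in the model, they are not significant predictors of our Y value, chance of admission.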
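
The R-squared and F-statistic reported above are linked: for an OLS model with an intercept, F = (R² / k) / ((1 − R²) / (n − k − 1)), where k is the number of predictors and n the number of rows. The sketch below recomputes F from the rounded R-squared as a consistency check. It assumes n = 400 rows (the notebook does not show the output of `data.shape`, so treat that as an assumption) and k = 7 predictors. `f_statistic_from_r2` is a hypothetical helper, not part of statsmodels.

```python
def f_statistic_from_r2(r_squared, n_obs, n_predictors):
    """Overall F-statistic of an OLS model with an intercept, from its R-squared."""
    explained = r_squared / n_predictors
    unexplained = (1.0 - r_squared) / (n_obs - n_predictors - 1)
    return explained / unexplained


# R-squared comes from the summary above; n_obs=400 is an assumption.
f_value = f_statistic_from_r2(r_squared=0.803, n_obs=400, n_predictors=7)
print(round(f_value, 1))
```

If the row-count assumption holds, the result should land close to the reported 228.9. Any small gap comes from starting with an R-squared rounded to three decimals.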

# Additional Questions that can be answered via analysis
- Does GRE score influence the chance of getting admitted? ...Yes!
- How does the University Rating improve the chance of getting admitted?
- Does CGPA influence my University Rating?
- Does TOEFL score influence the chance of getting admitted?
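
One way to start on these questions is to compute pairwise Pearson correlations, keeping in mind that correlation alone does not show that one variable influences another. The sketch below implements the coefficient with the standard library. `pearson` is a hypothetical helper and the sample values are made up for illustration.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of cross-deviations and sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Made-up values for illustration only, not taken from the dataset.
toefl = [100, 104, 108, 112, 116]
chance = [0.55, 0.62, 0.70, 0.81, 0.90]
r = pearson(toefl, chance)
print(round(r, 3))
```

On the notebook's DataFrame, `data.corr()` from pandas produces the full correlation matrix in one call.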