## **Admission Prediction With PySpark ML**

The objective of this project is to predict the chance (a value between 0 and 1) that a student who applied to a university graduate program will be admitted.

We will be using PySpark for this project.

### **1. Import Libraries & Run a SparkSession**


In [77]:
#import the required libraries
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression, LinearRegressionModel
from pyspark.ml.evaluation import RegressionEvaluator

In [2]:
#create a sparksession
spark = SparkSession.builder.appName('admission-prediction').getOrCreate()

### **2. Load and Explore dataset**

In [5]:
#create a spark dataframe
df = spark.read.csv('admission_dataset/Admission_Predict_Ver1.1.csv', header=True, inferSchema=True)

The dataset consists of:

|Field|Description|
|----|----|
|GRE Score|The score obtained in the GRE test.|
|TOEFL Score|The score obtained in the standardized English test.|
|University Rating|The rating of the university, from 0-5.|
|SOP|The score obtained for the statement of purpose, from 0-5.|
|LOR|The score obtained for the letter of recommendation, from 0-5.|
|CGPA|Cumulative GPA, from 0-10.|
|Research|Whether the applicant has done research activities, 0 or 1.|
|Chance of Admit|The chance of being admitted, from 0-1.|

In [6]:
#display dataframe
df.show(5)

+---------+---------+-----------+-----------------+---+---+----+--------+---------------+
|Serial No|GRE Score|TOEFL Score|University Rating|SOP|LOR|CGPA|Research|Chance of Admit|
+---------+---------+-----------+-----------------+---+---+----+--------+---------------+
|        1|      337|        118|                4|4.5|4.5|9.65|       1|           0.92|
|        2|      324|        107|                4|4.0|4.5|8.87|       1|           0.76|
|        3|      316|        104|                3|3.0|3.5| 8.0|       1|           0.72|
|        4|      322|        110|                3|3.5|2.5|8.67|       1|            0.8|
|        5|      314|        103|                2|2.0|3.0|8.21|       0|           0.65|
+---------+---------+-----------+-----------------+---+---+----+--------+---------------+
only showing top 5 rows



In [11]:
#get the number of rows & columns
print(f'The number of rows: {df.count()}')
print(f'The number of columns: {len(df.columns)}')

The number of rows: 500
The number of columns: 9


In [12]:
#print schema
df.printSchema()

root
 |-- Serial No: integer (nullable = true)
 |-- GRE Score: integer (nullable = true)
 |-- TOEFL Score: integer (nullable = true)
 |-- University Rating: integer (nullable = true)
 |-- SOP: double (nullable = true)
 |-- LOR: double (nullable = true)
 |-- CGPA: double (nullable = true)
 |-- Research: integer (nullable = true)
 |-- Chance of Admit: double (nullable = true)



In [19]:
#get the summary statistics
df.describe().show()

+-------+-----------------+------------------+-----------------+-----------------+------------------+------------------+------------------+------------------+-------------------+
|summary|        Serial No|         GRE Score|      TOEFL Score|University Rating|               SOP|               LOR|              CGPA|          Research|    Chance of Admit|
+-------+-----------------+------------------+-----------------+-----------------+------------------+------------------+------------------+------------------+-------------------+
|  count|              500|               500|              500|              500|               500|               500|               500|               500|                500|
|   mean|            250.5|           316.472|          107.192|            3.114|             3.374|             3.484| 8.576440000000003|              0.56| 0.7217399999999996|
| stddev|144.4818327679989|11.295148372354712|6.081867659564538|1.143511800759815|0.9910036207566072|0.92

### **3. Data Cleaning**

In [20]:
#drop the unnecessary column
df = df.drop('Serial No')

In [21]:
#display the dataframe
df.show(5)

+---------+-----------+-----------------+---+---+----+--------+---------------+
|GRE Score|TOEFL Score|University Rating|SOP|LOR|CGPA|Research|Chance of Admit|
+---------+-----------+-----------------+---+---+----+--------+---------------+
|      337|        118|                4|4.5|4.5|9.65|       1|           0.92|
|      324|        107|                4|4.0|4.5|8.87|       1|           0.76|
|      316|        104|                3|3.0|3.5| 8.0|       1|           0.72|
|      322|        110|                3|3.5|2.5|8.67|       1|            0.8|
|      314|        103|                2|2.0|3.0|8.21|       0|           0.65|
+---------+-----------+-----------------+---+---+----+--------+---------------+
only showing top 5 rows



In [24]:
#check for null values
for column_name in df.columns:
    print(column_name + ":", df.filter(df[column_name].isNull()).count())

GRE Score: 0
TOEFL Score: 0
University Rating: 0
SOP: 0
LOR: 0
CGPA: 0
Research: 0
Chance of Admit: 0


### **4. Correlation Analysis & Feature Selection**

In [31]:
# correlation analysis
for col in df.columns:
    print('The correlation between {} and the Chance of Admit is: {}'.format(col, df.stat.corr('Chance of Admit', col)))

The correlation between GRE Score and the Chance of Admit is: 0.8103506354632598
The correlation between TOEFL Score and the Chance of Admit is: 0.7922276143050823
The correlation between University Rating and the Chance of Admit is: 0.6901323687886892
The correlation between SOP and the Chance of Admit is: 0.6841365241316723
The correlation between LOR and the Chance of Admit is: 0.6453645135280112
The correlation between CGPA and the Chance of Admit is: 0.882412574904574
The correlation between Research and the Chance of Admit is: 0.5458710294711379
The correlation between Chance of Admit and the Chance of Admit is: 1.0
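
`df.stat.corr` computes the Pearson correlation coefficient. As a rough illustration of what that number measures, the sketch below recomputes it in plain Python on only the five rows displayed by `df.show(5)`. The helper name `pearson` is ours, and with so few rows the result will not match the full-dataset values printed above.

```python
import math

def pearson(xs, ys):
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# CGPA and Chance of Admit for the five rows shown by df.show(5)
cgpa = [9.65, 8.87, 8.0, 8.67, 8.21]
chance = [0.92, 0.76, 0.72, 0.80, 0.65]

r_cgpa = pearson(cgpa, chance)
print(f'Pearson correlation on the 5 sample rows: {r_cgpa}')
```

A value close to 1 means the two columns rise together almost linearly, which is why the three most correlated columns are natural candidates for a linear model.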


In [34]:
# feature selection
list_features =  ['GRE Score','TOEFL Score','CGPA']
assembler = VectorAssembler(inputCols=list_features, outputCol='features')
output_data = assembler.transform(df)

In [35]:
#display dataframe
output_data.show(5)

+---------+-----------+-----------------+---+---+----+--------+---------------+------------------+
|GRE Score|TOEFL Score|University Rating|SOP|LOR|CGPA|Research|Chance of Admit|          features|
+---------+-----------+-----------------+---+---+----+--------+---------------+------------------+
|      337|        118|                4|4.5|4.5|9.65|       1|           0.92|[337.0,118.0,9.65]|
|      324|        107|                4|4.0|4.5|8.87|       1|           0.76|[324.0,107.0,8.87]|
|      316|        104|                3|3.0|3.5| 8.0|       1|           0.72| [316.0,104.0,8.0]|
|      322|        110|                3|3.5|2.5|8.67|       1|            0.8|[322.0,110.0,8.67]|
|      314|        103|                2|2.0|3.0|8.21|       0|           0.65|[314.0,103.0,8.21]|
+---------+-----------+-----------------+---+---+----+--------+---------------+------------------+
only showing top 5 rows



### **5. Build the Linear Regression Model**

In [37]:
#select the features vector and the label to create the final data
final_data  = output_data.select('features', 'Chance of Admit')

In [38]:
#print schema of final data
final_data.printSchema()

root
 |-- features: vector (nullable = true)
 |-- Chance of Admit: double (nullable = true)



In [40]:
#split the dataset into training and testing set
train_data, test_data = final_data.randomSplit([0.75,0.25],seed=0)

In [42]:
#build & train the model
lr = LinearRegression(featuresCol='features', labelCol='Chance of Admit')
model = lr.fit(train_data)

In [50]:
#get coefficients & intercept
print(f'Coefficients: {model.coefficients}')
print(f'Intercept: {model.intercept}')

Coefficients: [0.002278442144001862,0.003703672661705527,0.14285516808888252]
Intercept: -1.6212241463868469
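
To make the coefficients concrete, the sketch below applies the linear formula by hand: prediction = intercept + sum of (coefficient × feature). It uses the coefficients and intercept printed above, with the features in the order of `list_features` (GRE Score, TOEFL Score, CGPA). The function name `predict_chance` is ours, not part of PySpark.

```python
# Values copied from the model output above
coefficients = [0.002278442144001862, 0.003703672661705527, 0.14285516808888252]
intercept = -1.6212241463868469

def predict_chance(gre, toefl, cgpa):
    """Apply the fitted linear formula to one applicant."""
    features = [gre, toefl, cgpa]
    return intercept + sum(w * x for w, x in zip(coefficients, features))

# Features of the first row of the test predictions table in section 6
pred = predict_chance(290.0, 100.0, 7.56)
print(f'Manual prediction: {pred}')
```

The result is about 0.49, in line with the `prediction` column for `[290.0,100.0,7.56]` in section 6. Read this way, each additional CGPA point raises the predicted chance by roughly 0.14, while a single GRE or TOEFL point moves it by well under 0.01.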


In [63]:
#get summary of the model
summary = model.summary

In [65]:
#print the rmse & r2 score
print(f'RMSE: {summary.rootMeanSquaredError}')
print(f'R2: {summary.r2}')

RMSE: 0.06120160758156184
R2: 0.8170367380343689
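
For reference, RMSE is the square root of the mean squared error, and R2 is one minus the ratio of the residual sum of squares to the total sum of squares. The plain-Python sketch below shows both definitions. The function names are ours and the sample predictions are hypothetical, so the numbers it prints are unrelated to the scores above.

```python
import math

def rmse(labels, predictions):
    """Root mean squared error: typical size of a prediction error."""
    squared_errors = [(y - p) ** 2 for y, p in zip(labels, predictions)]
    return math.sqrt(sum(squared_errors) / len(labels))

def r2_score(labels, predictions):
    """R2: share of the label variance explained by the predictions."""
    mean_label = sum(labels) / len(labels)
    ss_res = sum((y - p) ** 2 for y, p in zip(labels, predictions))
    ss_tot = sum((y - mean_label) ** 2 for y in labels)
    return 1.0 - ss_res / ss_tot

# Labels from the five rows shown by df.show(5); predictions are hypothetical
sample_labels = [0.92, 0.76, 0.72, 0.80, 0.65]
sample_predictions = [0.95, 0.80, 0.70, 0.78, 0.66]

print(f'RMSE: {rmse(sample_labels, sample_predictions)}')
print(f'R2: {r2_score(sample_labels, sample_predictions)}')
```

Because `Chance of Admit` lies between 0 and 1, an RMSE of about 0.06 means the training predictions are typically off by around six percentage points.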


### **6. Evaluate & Save the Model**

In [67]:
#transform on the test data
predictions = model.transform(test_data)

In [69]:
#display the predictions
predictions.show()

+------------------+---------------+-------------------+
|          features|Chance of Admit|         prediction|
+------------------+---------------+-------------------+
|[290.0,100.0,7.56]|           0.47| 0.4898764122961974|
|[297.0,101.0,7.67]|           0.57| 0.5252432484556935|
| [298.0,92.0,7.88]|           0.51| 0.5241882219430107|
|[298.0,101.0,7.69]|           0.53| 0.5303787939614728|
|[298.0,101.0,7.86]|           0.54| 0.5546641725365826|
|[298.0,105.0,8.54]|           0.69| 0.6666203774838451|
| [299.0,97.0,7.66]|           0.38| 0.5135568904159862|
|[299.0,100.0,7.88]|           0.68| 0.5560960453806567|
|[299.0,100.0,7.89]|           0.59| 0.5575245970615454|
|[299.0,102.0,8.62]|           0.56| 0.6692162150898411|
|  [300.0,97.0,8.1]|           0.65| 0.5786916065190959|
|[300.0,102.0,7.87]|           0.56| 0.5643532811671812|
|[300.0,102.0,8.17]|           0.63| 0.6072098315938459|
| [301.0,96.0,7.56]|           0.54|  0.500124585233396|
|[301.0,104.0,7.89]|           

In [71]:
#evaluate the model
evaluator = RegressionEvaluator(predictionCol='prediction', labelCol='Chance of Admit' , metricName='r2')
print('r2 on the test data: ', evaluator.evaluate(predictions))

r2 on the test data:  0.7588464671010928


### **7. Conclusions**

The model performs well on the training data (R2 of about 0.82) and somewhat worse on the test data (R2 of about 0.76). Overall, these scores indicate that a linear regression on GRE score, TOEFL score, and CGPA predicts a student's chance of being admitted to university reasonably well.