In [26]:
import numpy as np
import pandas as pd
import os
import matplotlib.pyplot as plt

df = pd.read_csv('admissions.csv')


In [27]:
df.head(n=5)

Unnamed: 0,admit,gre,gpa,prestige
0,0,380.0,3.61,3.0
1,1,660.0,3.67,3.0
2,1,800.0,4.0,1.0
3,1,640.0,3.19,4.0
4,0,520.0,2.93,4.0


Data Dictionary

Variable | Description | Type of Variable
---| ---| ---
admit    | 0 = rejected, 1 = accepted | Nominal
gre      | GRE score from 0.00 to 800.00  | Ratio
gpa      | Undergraduate GPA from 0.00 to 4.00  | Ratio
prestige | Rank of undergraduate universities with 1.0 as the highest and 4.0 as the lowest  | Ordinal

In [28]:
df.describe()

Unnamed: 0,admit,gre,gpa,prestige
count,400.0,398.0,398.0,399.0
mean,0.3175,588.040201,3.39093,2.486216
std,0.466087,115.628513,0.38063,0.945333
min,0.0,220.0,2.26,1.0
25%,0.0,520.0,3.13,2.0
50%,0.0,580.0,3.395,2.0
75%,1.0,660.0,3.67,3.0
max,1.0,800.0,4.0,4.0


In [29]:
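# Keep only rows with a valid value in every column. Comparisons against NaN
# evaluate to False, so rows with missing values are dropped as well.
df_filter = df[(df['admit'] >= 0) & (df['gre'] > 0) & (df['gpa'] > 0) & (df['prestige'] > 0)]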

In [30]:
df_filter.describe()

Unnamed: 0,admit,gre,gpa,prestige
count,397.0,397.0,397.0,397.0
mean,0.31738,587.858942,3.392242,2.488665
std,0.466044,115.717787,0.380208,0.947083
min,0.0,220.0,2.26,1.0
25%,0.0,520.0,3.13,2.0
50%,0.0,580.0,3.4,2.0
75%,1.0,660.0,3.67,3.0
max,1.0,800.0,4.0,4.0
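The counts above fall from 400/398/398/399 to 397 because the comparison filter silently discards rows with missing values. Since every column minimum is already valid (no zero or negative scores), the filter should behave like `dropna()` on this dataset. The sketch below illustrates that on a small made-up frame; `toy` and its values are hypothetical stand-ins, not rows from `admissions.csv`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for admissions.csv (values are made up).
toy = pd.DataFrame({
    'admit':    [0, 1, 1, 0],
    'gre':      [380.0, np.nan, 800.0, 520.0],
    'gpa':      [3.61, 3.67, np.nan, 2.93],
    'prestige': [3.0, 3.0, 1.0, 4.0],
})

# Count missing values per column before filtering.
missing_per_column = toy.isnull().sum()

# Comparison-based filter, as in the cell above: NaN > 0 evaluates to False.
by_comparison = toy[(toy['admit'] >= 0) & (toy['gre'] > 0)
                    & (toy['gpa'] > 0) & (toy['prestige'] > 0)]

# Explicit alternative: drop any row that has a missing value.
by_dropna = toy.dropna()

print(missing_per_column)
print(by_comparison.equals(by_dropna))
```

Using `dropna()` states the intent (remove incomplete rows) more directly, while the comparison filter would also remove non-positive scores if any existed.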


In [31]:
df_filter_admit = df_filter[(df_filter['admit'] == 1)]

In [32]:
df_filter_reject = df_filter[(df_filter['admit'] == 0)]

In [33]:
print('Accepted Applicants')
print(df_filter_admit.describe())
print('-------------------------------------------------')
print('Rejected Applicants')
print(df_filter_reject.describe())


Accepted Applicants
       admit         gre         gpa    prestige
count  126.0  126.000000  126.000000  126.000000
mean     1.0  618.571429    3.489206    2.150794
std      0.0  109.257233    0.371655    0.921455
min      1.0  300.000000    2.420000    1.000000
25%      1.0  540.000000    3.220000    1.000000
50%      1.0  620.000000    3.545000    2.000000
75%      1.0  680.000000    3.757500    3.000000
max      1.0  800.000000    4.000000    4.000000
-------------------------------------------------
Rejected Applicants
       admit         gre         gpa    prestige
count  271.0  271.000000  271.000000  271.000000
mean     0.0  573.579336    3.347159    2.645756
std      0.0  116.052798    0.376355    0.918922
min      0.0  220.000000    2.260000    1.000000
25%      0.0  500.000000    3.080000    2.000000
50%      0.0  580.000000    3.340000    3.000000
75%      0.0  660.000000    3.610000    3.000000
max      0.0  800.000000    4.000000    4.000000
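The two tables above compare accepted and rejected applicants by splitting the frame in two. A possible alternative, sketched here on a small stand-in frame, is to group on `admit` and summarize both groups in one table. The frame `toy` is hypothetical: its first five rows mirror the `df.head()` output and the sixth is made up for balance.

```python
import pandas as pd

# Stand-in data: first five rows mirror df.head(); the sixth row is made up.
toy = pd.DataFrame({
    'admit':    [0, 1, 1, 1, 0, 0],
    'gre':      [380.0, 660.0, 800.0, 640.0, 520.0, 580.0],
    'gpa':      [3.61, 3.67, 4.00, 3.19, 2.93, 3.40],
    'prestige': [3.0, 3.0, 1.0, 4.0, 4.0, 2.0],
})

# One row per admit value, one column per covariate mean.
group_means = toy.groupby('admit')[['gre', 'gpa', 'prestige']].mean()
print(group_means)
```

On the real data the same call would be `df_filter.groupby('admit')[['gre', 'gpa', 'prestige']].mean()`, which puts the accepted and rejected means side by side.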


In [34]:
print(df_filter_admit.corr())
print(df_filter_reject.corr())

          admit       gre       gpa  prestige
admit       NaN       NaN       NaN       NaN
gre         NaN  1.000000  0.232765 -0.080485
gpa         NaN  0.232765  1.000000 -0.039360
prestige    NaN -0.080485 -0.039360  1.000000
          admit       gre       gpa  prestige
admit       NaN       NaN       NaN       NaN
gre         NaN  1.000000  0.418175 -0.086004
gpa         NaN  0.418175  1.000000 -0.010311
prestige    NaN -0.086004 -0.010311  1.000000
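The `admit` row and column are NaN in both tables because `admit` is constant within each subset, and a correlation is undefined for a variable with zero variance. To see how each covariate relates to admission, the correlation has to be computed on a frame that contains both outcomes. The sketch below shows this on a hypothetical stand-in frame (`toy`, with partly made-up values), not on the real data.

```python
import pandas as pd

# Stand-in data: first five rows mirror df.head(); the sixth row is made up.
toy = pd.DataFrame({
    'admit':    [0, 1, 1, 1, 0, 0],
    'gre':      [380.0, 660.0, 800.0, 640.0, 520.0, 580.0],
    'gpa':      [3.61, 3.67, 4.00, 3.19, 2.93, 3.40],
    'prestige': [3.0, 3.0, 1.0, 4.0, 4.0, 2.0],
})

# Within a single outcome group, admit is constant, so its correlations are NaN.
within_admitted = toy[toy['admit'] == 1].corr()['admit']

# On the combined frame, admit varies, so each covariate gets a correlation.
with_outcome = toy.corr()['admit'].drop('admit')

print(within_admitted)
print(with_outcome)
```

On the notebook's data, `df_filter.corr()['admit']` would give the corresponding figures, which is one way to check the hypothesis about which covariate correlates most strongly with acceptance.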


In [35]:
df_filter_admit['gre'].plot(kind='box')
plt.show()  # render the figure explicitly; without an inline backend (e.g. %matplotlib inline) the plot is created but not displayed

In [36]:
df.skew()

admit       0.787051
gre        -0.150127
gpa        -0.211765
prestige    0.093663
dtype: float64

In [37]:
df.kurt()

admit      -1.387513
gre        -0.330065
gpa        -0.574623
prestige   -0.894759
dtype: float64
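As a reference point for reading these numbers: pandas reports excess kurtosis (Fisher's definition), so a normal distribution scores near 0 on both `skew()` and `kurt()`. The sketch below compares a synthetic normal sample with a synthetic right-skewed one; both samples are generated for illustration only and have nothing to do with the admissions data.

```python
import numpy as np
import pandas as pd

# Synthetic samples for illustration only (seeded for repeatability).
rng = np.random.default_rng(0)
normal_sample = pd.Series(rng.normal(loc=0.0, scale=1.0, size=10_000))
skewed_sample = pd.Series(rng.exponential(scale=1.0, size=10_000))

# A normal sample should have skewness and excess kurtosis close to 0.
print(normal_sample.skew(), normal_sample.kurt())

# A right-skewed sample should have clearly positive values for both.
print(skewed_sample.skew(), skewed_sample.kurt())
```

Against that baseline, the `gre` and `gpa` values above are mildly negative on both measures. The `admit` values are harder to interpret this way because `admit` is binary.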

1.) Create a data dictionary

Variable | Description | Type of Variable
---| ---| ---
admit    | 0 = rejected, 1 = accepted | Nominal
gre      | GRE score from 0.00 to 800.00  | Ratio
gpa      | Undergraduate GPA from 0.00 to 4.00  | Ratio
prestige | Rank of undergraduate universities with 1.0 as the highest and 4.0 as the lowest  | Ordinal

2.) What is the outcome?

The outcome is whether an applicant is admitted to (admit = 1) or rejected from (admit = 0) the graduate school.

3.) What are the predictors/covariates?

The covariates are GPA, GRE score, and the prestige of the previous undergraduate institution.

4.) What timeframe is this data relevant for?

The timeframe is not specified, which undermines the validity of any analysis drawn from this dataset. We would need to know both when the data was collected and over what period, since admission methodologies often change. The analysis is relevant to future data only in an analogous environment.

5.) What is the hypothesis?

The hypothesis is that GPA, GRE score, undergraduate prestige, or some combination of these covariates has a statistically significant relationship with the outcome, strong enough to infer whether a future candidate will be accepted into the graduate program.

My hypothesis is that, of the three covariates, GPA correlates most strongly with acceptance.

6.) Using the above information, write a well-informed problem statement

In a hyper-competitive employment period, a strong educational background is increasingly important. The cascading effect is an equally competitive admissions process for elite graduate programs. Understanding how quantitative variables such as GPA, GRE score, and undergraduate university rank affect graduate school admissions would help employers identify potential high-performing job applicants.


1.) What are the goals of exploratory analysis?

The goals of EDA are to gain additional insights that may not have been addressed by the original study or hypothesis. EDA also helps to drive future research in a given field.

2a.) What are the assumptions regarding the distribution of the data?

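The working assumption is that the continuous variables (GRE and GPA) are approximately normally distributed. If that holds, statistics such as the mean and standard deviation summarize the data well, and we would expect similar results from another random sample drawn from the same population. Normality by itself does not show that the sample is representative of the population; that depends on how the sample was collected.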

2b.) How will you determine the distribution of your data?

Measure the standard deviation, skewness, and kurtosis of each variable, as computed with df.describe(), df.skew(), and df.kurt() above.

3a.) How might outliers impact your analysis?

Too many outliers would negatively impact our analysis: the data might no longer be normally distributed, or the results might fail to reach the statistical significance threshold.

3b.) How will you test for outliers?

To test for outliers I would use two methods, both relating to the distribution. The first is to create a pandas describe table and, for each variable, look at the max and min values. At a glance this shows how large the range is with respect to the distribution of the data. I would also count how many data points fall more than 2 standard deviations from the mean. The second is to plot the distribution of the same data. A visual representation helps show whether the data is normal or skewed and, if skewed, in which direction.
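The 2-standard-deviation check can be sketched as follows, with the common 1.5 × IQR rule shown next to it for comparison. The IQR rule is an addition for illustration, not part of the plan above, and the `gre` values here are a small hypothetical sample rather than the real column.

```python
import pandas as pd

# Hypothetical GRE scores for illustration (not the real column).
gre = pd.Series([380.0, 660.0, 800.0, 640.0, 520.0,
                 580.0, 600.0, 620.0, 560.0, 220.0])

# Rule 1: flag points more than 2 standard deviations from the mean.
z_scores = (gre - gre.mean()) / gre.std()
outside_2sd = gre[z_scores.abs() > 2]

# Rule 2 (assumed alternative): flag points beyond 1.5 * IQR from the quartiles.
q1, q3 = gre.quantile(0.25), gre.quantile(0.75)
iqr = q3 - q1
outside_iqr = gre[(gre < q1 - 1.5 * iqr) | (gre > q3 + 1.5 * iqr)]

print(outside_2sd)
print(outside_iqr)
```

The two rules can disagree, since the mean and standard deviation are themselves pulled by extreme values while the quartiles are not, so it may be worth checking both before deciding what to treat as an outlier.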

4.) What is your exploratory analysis plan?

Using a trusted source, retrieve admissions data from a graduate program similar to UCLA. Ensure the data set has the same number of data points and was collected over the same time period, which helps to reduce potential biases. In addition to the hypothesis proposed above, I would also look at all possible covariate combinations (e.g. the effect of GPA & GRE, GPA & GRE & Prestige, etc.) instead of focusing on a single variable. The future analysis would check whether the previous hypothesis was confirmed, as well as whether the relationship between the covariates and the outcome differs by year or by institution.
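Enumerating the covariate combinations is mechanical. A minimal sketch using the standard library is shown below; how each combination is then evaluated (for example, which model is fitted) is left open here, since the plan does not specify it.

```python
from itertools import combinations

covariates = ['gre', 'gpa', 'prestige']

# Every non-empty subset of the covariates, from single variables to all three.
subsets = [
    combo
    for size in range(1, len(covariates) + 1)
    for combo in combinations(covariates, size)
]

for combo in subsets:
    print(' + '.join(combo))
```

With three covariates this yields seven candidate sets, so each can be examined individually without much cost.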
