# CONTEXT

### Students in the USA regularly seek loans to fund their post-secondary education, whether certifications or college degrees. Students across income levels pursue federal loans, and low-income students also pursue grants such as Pell, to meet their goals. When deciding which institution is the better choice in terms of affordability, education, job-market opportunities or loan repayment after completion, the IPEDS numeric data can help. IPEDS is the Integrated Postsecondary Education Data System, a system of interrelated surveys conducted annually by the U.S. Department of Education’s National Center for Education Statistics (NCES). IPEDS gathers information from every college, university, and technical and vocational institution that participates in the federal student financial aid programs. All the IPEDS data is based on a 5.05% interest rate. We will use this data from a loan perspective.

# OBJECTIVE

### Predict whether a loan application will be "Approved" or "Rejected". The focus is mainly on low-income students (those earning up to 48,000 dollars per annum), who face difficulties paying tuition fees and other related expenses, to check whether they will be able to repay their loans.

# PREPROCESSING

### Preprocessing involves data exploration, data cleansing, and feature selection, including feature engineering to create new features.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# import warnings filter
from warnings import simplefilter
# ignore all future warnings
simplefilter(action='ignore', category=FutureWarning) 
simplefilter(action='ignore', category=UserWarning) 
pd.options.mode.chained_assignment = None 
# Displaying Full Data Frame
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
pd.set_option('display.width', None)
pd.set_option('display.max_colwidth', None)  # None = no truncation (the older -1 is rejected by recent pandas)

# decimals limit
pd.options.display.precision = 3

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Any results you write to the current directory are saved as output.

## DATA EXPLORATION

### IPEDS numeric data will be explored

In [None]:
cohort_data=pd.read_csv('/kaggle/input/post-secondary-education-data-ipeds/Most-Recent-Cohorts-All-Data-Elements.csv', low_memory=False)
print (cohort_data.shape)
cohort_data.head(1)

## Column/Variable Definitions and some relevant information

In [None]:
# Note: ## shows the statement is continued...
# For Institution Names, 'INSTNM'
# Highest award (HIGHDEG) identifies the highest award level conferred at the institution.
# Level of institution (ICLEVEL) conveys the highest level of award offered at the institution: 4-year, 2-year,\
## or less-than-2-year. This designation differs from the highest degree element in that it is based on \
## an institution’s reported offerings, rather than on degree or certificate completions.
# Public/Private Nonprofit/ Private For-Profit: Variable is CONTROL. 1 For Public, 2 For Private Non-Profit,\
## 3 For Private Profit
# CURROPER (0=Not operating , 1=operating), See for institution
# Institutions are identified as distance education-only (DISTANCEONLY) if all their programs are available \
## only via distance education.0 For Non-Distance while 1 For Distance Only
# Average Cost of Attendance, Tuition and Fees "COSTT4_A,COSTT4_P,TUITIONFEE_IN,TUITIONFEE_OUT,TUITIONFEE_PROG"
# 'FAMINC','MD_FAMINC' :- We choose only one variable ('MD_FAMINC':- Median family income in real 2015 dollars) \
## because medians are resistant to outliers. That is why we drop the average-value variable ('FAMINC').
# 'FAMINC_IND':- Average family income for independent students in real 2015 dollars
# 'DEBT_MDN':- Cumulative Median Debt. This is the median loan debt accumulated at the institution by all \
## student borrowers of federal loans who separate (i.e., either graduate or withdraw) in a given fiscal year
# Currently, institutions report (via the IPEDS Graduation Rates component) on the completion rates for full-time, \
## first-time students who complete within 100 or 150 percent of the expected time to completion (C[100 or 150]_4 \
## for four-year institutions and C[100 or 150]_L4 for less-than-four-year institutions).
# 'GT_25K_P6':- Share of students earning over $25,000/year (threshold earnings) 6 years after entry
# 'GT_28K_P6':- Share of students earning over $28,000/year (threshold earnings) 6 years after entry
# 'CDR2' & 'CDR3' :- Cohort default rates are produced annually as an institutional accountability metric; \
## institutions with high default rates may lose access to federal financial aid
# 'RPY_1YR_RT':- Fraction of repayment cohort who are not in default, and with loan balances that have declined \
## one year since entering repayment, excluding enrolled and military deferment from calculation. (Rolling averages)
# 'COMPLETERPY1YR':- One-year repayment rate for completers (renamed from 'COMPL_RPY_1YR_RT')
# 'LOINCOMERPY1YR':- One-year repayment rate by family income ($0-30,000) (renamed from 'LO_INC_RPY_1YR_RT')
# 'MDINCOMERPY1YR':- One-year repayment rate by family income ($30,000-75,000) (renamed from 'MD_INC_RPY_1YR_RT')
# The $75,000+ family income band is not selected below


## DATA CLEANSING

### In data cleansing, we drop all non-operating institutions, replace null or garbage values with suitable numbers or categories, rename columns for readability, and so on.

In [None]:
# Taking only operating institutes (.copy() avoids SettingWithCopyWarning on later assignments)
cohort_data=cohort_data.loc[cohort_data['CURROPER']==1].copy()
print (cohort_data.shape)
# Reset the index
cohort_data=cohort_data.reset_index()
# check the last row after index reset
cohort_data.tail(1)
# Replacing all "PrivacySuppressed" strings with a placeholder value; we use 0
cohort_data=cohort_data.replace('PrivacySuppressed',0)
# filling nan values with 0
cohort_data=cohort_data.fillna(0)
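Because the cell above maps both `PrivacySuppressed` and missing values to 0, a suppressed rate is later indistinguishable from a true 0% when thresholds are applied. The following is a minimal sketch, on invented toy values rather than the IPEDS file, of how the two cases could be counted before zero-filling. The frame `toy` and its contents are hypothetical; only `pd.to_numeric` with `errors='coerce'` is standard pandas.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column standing in for one IPEDS rate column (values invented).
toy = pd.DataFrame({
    'INSTNM': ['A', 'B', 'C', 'D'],
    'GT_25K_P6': ['0.72', 'PrivacySuppressed', np.nan, '0.40'],
})

# Coerce anything non-numeric (including 'PrivacySuppressed') to NaN.
as_nan = pd.to_numeric(toy['GT_25K_P6'], errors='coerce')

# Number of rows whose true value is unknown.
n_unknown = int(as_nan.isna().sum())

# Zero-filling afterwards gives the same numbers as the notebook's approach.
zero_filled = as_nan.fillna(0)

print('unknown values:', n_unknown)
print(zero_filled.tolist())
```

Counting the unknown rows first makes it possible to report how many institutions fall below a threshold only because their value was suppressed or missing.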

In [None]:
print (cohort_data[['INSTNM','HIGHDEG','ICLEVEL','CONTROL','DISTANCEONLY',]].isnull().sum())
# Most institutes provide non-online programs, so any null value is filled with '0' (non-online status)
cohort_data['DISTANCEONLY']=cohort_data['DISTANCEONLY'].fillna(0.0)
print ('Null values remaining in "DISTANCEONLY":', (cohort_data ['DISTANCEONLY'].isnull().sum()))

# Checking how many institutes offer online-only ('1') and non-online ('0') programs
print (cohort_data.loc[cohort_data['DISTANCEONLY']==1.0].shape)
print(cohort_data.loc[cohort_data['DISTANCEONLY']==0.0].shape)
# 'CONTROL' feature has 3 values showing institute status:- 1 For Public, 2 For Private non-profit and 3 For Private for-profit
cohort_data['CONTROL'].unique()

In [None]:
# renaming the columns
cohort_data=cohort_data.rename(columns={'LO_INC_RPY_1YR_RT':'LOINCOMERPY1YR', 'COMPL_RPY_1YR_RT' : 'COMPLETERPY1YR',\
                         'MD_INC_RPY_1YR_RT':'MDINCOMERPY1YR','COMPL_RPY_3YR_RT':'COMPLETERPY3YR',\
                         'LO_INC_RPY_3YR_RT':'LOINCOMERPY3YR','MD_INC_RPY_3YR_RT':'MDINCOMERPY3YR',\
                         'COMPL_RPY_5YR_RT':'COMPLETERPY5YR','LO_INC_RPY_5YR_RT':'LOINCOMERPY5YR',\
                         'MD_INC_RPY_5YR_RT':'MDINCOMERPY5YR','COMPL_RPY_7YR_RT':'COMPLETERPY7YR',\
                          'LO_INC_RPY_7YR_RT':'LOINCOMERPY7YR','MD_INC_RPY_7YR_RT':'MDINCOMERPY7YR'} )

## FEATURES SELECTION

### There are six categories: School, Student, Aid/Debt, Completion, Earnings and Repayment. We will select the features that can be related to our output. If those are enough to build the model, we will skip feature engineering.

In [None]:
# Picking the features
data=cohort_data[['INSTNM','HIGHDEG','ICLEVEL','CONTROL','DISTANCEONLY','COSTT4_A','COSTT4_P',\
                'TUITIONFEE_IN','TUITIONFEE_OUT',\
                'MD_FAMINC','DEBT_MDN','GT_25K_P6','GT_28K_P6',\
               'CDR3','COMPLETERPY1YR','LOINCOMERPY1YR', 'MDINCOMERPY1YR',\
                'COMPLETERPY3YR','LOINCOMERPY3YR','MDINCOMERPY3YR',\
                'COMPLETERPY5YR','LOINCOMERPY5YR', 'MDINCOMERPY5YR',\
                'COMPLETERPY7YR','LOINCOMERPY7YR', 'MDINCOMERPY7YR',\
                 'C100_4','C100_L4','C150_4','C150_L4','C200_4','C200_L4','C150_4_PELL','C150_L4_PELL']]
data.head(1)

In [None]:
# Work on an explicit copy to avoid SettingWithCopyWarning on later assignments
data = data.copy()

In [None]:
data.dtypes

In [None]:
# converting objects into numericals
float_columns = ['LOINCOMERPY1YR', 'MDINCOMERPY1YR', 'COMPLETERPY1YR',
                 'LOINCOMERPY3YR', 'MDINCOMERPY3YR', 'COMPLETERPY3YR',
                 'LOINCOMERPY5YR', 'MDINCOMERPY5YR', 'COMPLETERPY5YR',
                 'LOINCOMERPY7YR', 'MDINCOMERPY7YR', 'COMPLETERPY7YR',
                 'GT_25K_P6', 'GT_28K_P6', 'MD_FAMINC', 'DEBT_MDN']
data[float_columns] = data[float_columns].astype(float)

## Feature Engineering

### After analyzing the selected features, each one gives some relevant information, but not the whole picture. Take the earnings category: it gives, for each institution, the percentage of students earning over 25K and over 28K per annum. To make this useful from a loan perspective, it has to be used in a specific way: set conditions, extract the relevant information, and put it in a new column or feature. So it is now necessary to create new features from some of the selected features. We will create new features per category (where necessary) to make things easier.

## 1). Earnings Over 25K or 28K Per Year After Graduation

### Earnings data is not available specifically for students who completed their studies within a given number of years and started earning straight away. From the 6-, 7-, 8-, 9- and 10-year earnings data, we have chosen the 6-year data because it most closely resembles graduates who completed within four, six or eight years, since they will earn salaries similar to fresh graduates. The eight-year completion time deserves attention: debt grows over time and carries the highest risk because of the chance of default. Students who complete in six years also owe more debt because of the two extra years, but they carry medium risk compared with eight-year completers.

In [None]:
# Count institutions at or above each share threshold, for both earnings levels
earnings_thresholds = [0.9, 0.8, 0.7, 0.65, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1]
for column, amount in [('GT_25K_P6', '$25,000'), ('GT_28K_P6', '$28,000')]:
    print (f'Share of students earning over {amount}/year (threshold earnings) 6 years after entry')
    for threshold in earnings_thresholds:
        # .shape is (number of institutions, number of columns)
        print (f'{threshold:.0%}: institutions on the left, columns on the right',
               data.loc[data[column] >= threshold].shape)

### From the counts above, as the percentage increases, fewer institutions meet the earnings threshold.
### Let us look at four percentages, 50%, 60%, 65% and 70%, for each threshold level
### Over 25K: 2716, 1778, 1320 and 898 institutions can be considered
### Over 28K: 2716, 1778, 1038 and 898 institutions can be considered

### Earnings are the main factor in repaying debt and are also linked to the repayment horizon (1, 3, 5 or 7 years). Hence we take a slightly higher percentage, 65% (which is not very high either), to see the proportion of students at each institution.
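The threshold comparison described above can be sketched as a small helper that counts how many institutions sit at or above each cut-off. This is a minimal sketch on invented values; `count_at_or_above` and `shares` are hypothetical names used only for illustration.

```python
import pandas as pd

def count_at_or_above(series, thresholds):
    """Return {threshold: number of values >= threshold}."""
    return {t: int((series >= t).sum()) for t in thresholds}

# Hypothetical earnings shares for six institutions (invented values).
shares = pd.Series([0.95, 0.70, 0.66, 0.65, 0.50, 0.10])

counts = count_at_or_above(shares, [0.5, 0.6, 0.65, 0.7])
for threshold, n in counts.items():
    print(f'{threshold:.0%}: {n} institutions')
```

Because each count is cumulative, the numbers can only stay equal or fall as the threshold rises, which gives a quick sanity check on any list of counts.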


In [None]:
# Taking institutions whose selected earnings share is 65% or above
earnings_percent_25K =  data.loc[data['GT_25K_P6'] >=0.65]
earnings_percent_25K = earnings_percent_25K[['INSTNM','GT_25K_P6']]
earnings_percent_28K =  data.loc[data['GT_28K_P6'] >=0.65]
earnings_percent_28K = earnings_percent_28K[['INSTNM','GT_28K_P6']]
print (earnings_percent_25K.shape)
print (earnings_percent_28K.shape)

## ALL THE SELECTED INSTITUTIONS FOR BOTH EARNINGS CATEGORIES AT THE 65% LEVEL

In [None]:
# "EARNINGS OVER 25K"
# 1325 Institutions
earnings_percent_25K.head()

In [None]:
# "EARNINGS OVER 28K"
# 1038 Institutions
earnings_percent_28K.head()


## Distribution of Earnings (Over 25K & 28K)

In [None]:
import matplotlib.pyplot as plt
earnings_percent_25K.plot(kind='hist', figsize=(15, 6))

plt.xlabel('Earnings over 25k') # add to x-label to the plot
plt.ylabel('Number of INSTITUTIONS AGAINST RESPECTIVE PERCENTAGES') # add y-label to the plot
# add title to the plot
plt.title('Earnings over 25k VS NUMBER OF INSTITUTIONS AGAINST RESPECTIVE PERCENTAGES')

plt.show()


earnings_percent_28K.plot(kind='hist', figsize=(15, 6))

plt.xlabel('Earnings over 28k') # add to x-label to the plot
plt.ylabel('Number of INSTITUTIONS AGAINST RESPECTIVE PERCENTAGES') # add y-label to the plot
# add title to the plot
plt.title('Earnings over 28k VS NUMBER OF INSTITUTIONS AGAINST RESPECTIVE PERCENTAGES')

plt.show()

## TOP 50 institutions with the highest share of students earning over 25K and 28K per year

In [None]:
# Top 50 Institutions
data_earnings25k=earnings_percent_25K.groupby(['INSTNM']).mean()
data_25k = data_earnings25k.sort_values(['GT_25K_P6'], ascending=False, axis=0)
data_earnings28k=earnings_percent_28K.groupby(['INSTNM']).mean()
data_28k = data_earnings28k.sort_values(['GT_28K_P6'], ascending=False, axis=0)

# Choosing the Top 50 institutions
data_25k=  data_25k.reset_index()
data_25k=data_25k.head(50)
data_25k_bar=data_25k[['INSTNM', 'GT_25K_P6']]
data_25k_bar=data_25k_bar.set_index('INSTNM')
data_28k=  data_28k.reset_index()
data_28k=data_28k.head(50)
data_28k_bar=data_28k[['INSTNM', 'GT_28K_P6']]
data_28k_bar=data_28k_bar.set_index('INSTNM')
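The same group-then-rank step can be written more compactly with `Series.nlargest`. The sketch below uses invented values and assumes, as the `groupby` above suggests, that an institution name may appear on more than one row; `toy` and `top2` are hypothetical names.

```python
import pandas as pd

# Hypothetical rows; institution 'A' appears twice (values invented).
toy = pd.DataFrame({
    'INSTNM': ['A', 'B', 'A', 'C'],
    'GT_25K_P6': [0.75, 0.70, 0.25, 0.95],
})

# Average the share per institution name, then keep the two highest.
top2 = toy.groupby('INSTNM')['GT_25K_P6'].mean().nlargest(2)
print(top2)
```

Replacing `2` with `50` would give a ranking comparable to the Top 50 tables, already indexed by institution name for a bar plot.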

## Showing Top 50 Institutions with earnings percentages


In [None]:
data_25k.head(50)

In [None]:
data_28k.head(50)

In [None]:
# plot data For over 25k & 28k Earnings

data_25k_bar.plot(kind='bar', figsize=(15, 6))

plt.xlabel('INSTITUTIONS') # add to x-label to the plot
plt.ylabel('Earnings over 25k') # add y-label to the plot
# add title to the plot
plt.title('Top 50 Institutions VS Earnings over 25k')

plt.show()

data_28k_bar.plot(kind='bar', figsize=(15, 6))

plt.xlabel('INSTITUTIONS') # add to x-label to the plot
plt.ylabel('Earnings over 28k') # add y-label to the plot
# add title to the plot
plt.title('Top 50 Institutions VS Earnings over 28k') 

plt.show()

In [None]:
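def earnings_threshold (twentyfive_threshold,twentyeight_threshold):
    # Compare each share separately: `(a or b) >= 0.65` would only test the first non-zero value
    if (twentyfive_threshold >= 0.65) or (twentyeight_threshold >= 0.65):
        match=1
    else:
        match=0
    return match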

In [None]:
data['earnings'] = data.apply(lambda x: earnings_threshold (x['GT_25K_P6'],x['GT_28K_P6']),axis=1)
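When a flag depends on two columns, each column needs its own comparison: in Python, `a or b` evaluates to `a` whenever `a` is non-zero, so `(a or b) >= 0.65` never looks at `b` in that case. The following is a minimal sketch with invented values that contrasts the two forms and shows a vectorized equivalent; the helper names are hypothetical.

```python
import pandas as pd

# Invented shares, chosen so the two forms disagree on the second row.
toy = pd.DataFrame({
    'GT_25K_P6': [0.70, 0.50, 0.0, 0.30],
    'GT_28K_P6': [0.60, 0.66, 0.70, 0.20],
})

def flag_with_or(a, b):
    # `a or b` yields `a` when it is non-zero, so `b` is ignored then.
    return int((a or b) >= 0.65)

def flag_either(a, b):
    # Each share is compared with the threshold on its own.
    return int(a >= 0.65 or b >= 0.65)

pairs = list(zip(toy['GT_25K_P6'], toy['GT_28K_P6']))
or_flags = [flag_with_or(a, b) for a, b in pairs]
either_flags = [flag_either(a, b) for a, b in pairs]

# Vectorized form of the "either share meets the threshold" rule.
vectorized = ((toy['GT_25K_P6'] >= 0.65) | (toy['GT_28K_P6'] >= 0.65)).astype(int).tolist()

print(or_flags, either_flags, vectorized)
```

The vectorized form avoids a row-wise `apply` and gives the same flags as the explicit comparison.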

In [None]:
data.head()

## 2). REPAYMENT

## Default rate for each institution

### Institutions with high default rates may lose access to federal financial aid. We will select the three-year cohort default rate (CDR3) to see each institution's default rate, and check it against different percentages from high to low.

In [None]:
print ('Default rate for institutions according to percentages')
cdr_thresholds = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1]
for threshold in cdr_thresholds:
    # .shape is (number of institutions, number of columns)
    print (f'{threshold:.0%}: institutions on the left, columns on the right',
           data.loc[data['CDR3'] >= threshold].shape)

### Only 3 and 11 institutions have a default rate at the 50% and 40% levels respectively. So we would discard only 14 institutions if we treat a default rate below 40% as low. Let us take 40% as the threshold for the default rate.

In [None]:
# Institutions whose three-year cohort default rate is 40% or above
default_rate_40= data.loc[data['CDR3'] >=0.4]

# Average CDR3 per institution name (only CDR3, so non-numeric columns cannot break the mean),
# sorted from highest to lowest
default_rate_40=default_rate_40.groupby('INSTNM')[['CDR3']].mean()
default_rate_40 = default_rate_40.sort_values('CDR3', ascending=False)

default_rate_40.head(11)


In [None]:
# plot data for default rates of 40% and above

default_rate_40.plot(kind='bar', figsize=(15, 6))

plt.xlabel('INSTITUTIONS') # add to x-label to the plot
plt.ylabel('Default_rate') # add y-label to the plot
# add title to the plot
plt.title('Institutions VS Default_rate') 

plt.show()

## LOW-MIDDLE INCOME

### Besides the default rate, the repayment-rate data is split by family income: low income (0-30K), middle income (30K-75K) and high income (75K+). However, the low-income category extends up to 48K according to IPEDS. So we take the mean of the low- and middle-income rates to approximate that gap.

In [None]:
def lo_mi_income_mean (YRlo_income,YRmi_income):
    mean_income= ((YRlo_income + YRmi_income)/2)
    return mean_income

In [None]:
# The helper works on whole columns, so no row-wise apply is needed
for yr in (1, 3, 5, 7):
    data[f'LOMIAVGRPY{yr}YR'] = lo_mi_income_mean (data[f'LOINCOMERPY{yr}YR'], data[f'MDINCOMERPY{yr}YR'])
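The low-middle feature is an unweighted mean, so it treats the two income bands as if they held the same number of borrowers. The following is a minimal sketch, on invented numbers, of how far that can sit from a mean weighted by borrower counts. The count values `n_low` and `n_mid` are hypothetical and are not columns selected in this notebook.

```python
# Invented one-year repayment rates for a single institution.
rate_low = 0.25   # family income $0-30,000
rate_mid = 0.75   # family income $30,000-75,000

# Hypothetical borrower counts in each band (not taken from the data).
n_low = 300
n_mid = 100

# Unweighted mean, as used for the LOMIAVGRPY features.
simple_mean = (rate_low + rate_mid) / 2

# Mean weighted by the number of borrowers in each band.
weighted_mean = (rate_low * n_low + rate_mid * n_mid) / (n_low + n_mid)

print(simple_mean, weighted_mean)
```

When the low-income band is much larger, the unweighted mean overstates the combined repayment rate; with equal band sizes the two values coincide.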

## 1-, 3-, 5- and 7-year repayment rates for completers, and for low-income and low-middle-income students

In [None]:
# Count institutions at or above each repayment-rate threshold (1- and 3-year rates)
rpy_thresholds = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1]
rpy_groups = [('COMPLETERPY', 'for different percentages'),
              ('LOINCOMERPY', 'for low income students according to different percentages'),
              ('LOMIAVGRPY', 'for low-middle income students according to different percentages')]
for years, label in [(1, '1 year'), (3, '3 years')]:
    for prefix, description in rpy_groups:
        print (f'{label} repayment completion rate {description}')
        for threshold in rpy_thresholds:
            # .shape is (number of institutions, number of columns)
            print (f'{threshold:.0%}: Number of institutions are showing on left side, & columns on the right side',
                   data.loc[data[f'{prefix}{years}YR'] >= threshold].shape)

print ('5 years repayment completion rate for different percentages')
print ('90%: Number of institutions are showing on left side, & columns on the right side',\
       data.loc[data['COMPLETERPY5YR'] >=0.9].shape)
print ('80%: Number of institutions are showing on left side, & columns on the right side',\
       data.loc[data['COMPLETERPY5YR'] >=0.8].shape)
print ('70%: Number of institutions are showing on left side, & columns on the right side',\
       data.loc[data['COMPLETERPY5YR'] >=0.7].shape)
print ('60%: Number of institutions are showing on left side, & columns on the right side',\
       data.loc[data['COMPLETERPY5YR'] >=0.6].shape)
print ('50%: Number of institutions are showing on left side, & columns on the right side',\
       data.loc[data['COMPLETERPY5YR'] >=0.5].shape)
print ('40%: Number of institutions are showing on left side, & columns on the right side',\
       data.loc[data['COMPLETERPY5YR'] >=0.4].shape)
print ('30%: Number of institutions are showing on left side, & columns on the right side',\
       data.loc[data['COMPLETERPY5YR'] >=0.3].shape)
print ('20%: Number of institutions are showing on left side, & columns on the right side',\
       data.loc[data['COMPLETERPY5YR'] >=0.2].shape)
print ('10%: Number of institutions are showing on left side, & columns on the right side',\
       data.loc[data['COMPLETERPY5YR'] >=0.1].shape)

print ('5 years repayment completion rate for low income students according to different percentages')
print ('90%: Number of institutions are showing on left side, & columns on the right side',\
       data.loc[data['LOINCOMERPY5YR'] >=0.9].shape)
print ('80%: Number of institutions are showing on left side, & columns on the right side',\
       data.loc[data['LOINCOMERPY5YR'] >=0.8].shape)
print ('70%: Number of institutions are showing on left side, & columns on the right side',\
       data.loc[data['LOINCOMERPY5YR'] >=0.7].shape)
print ('60%: Number of institutions are showing on left side, & columns on the right side',\
       data.loc[data['LOINCOMERPY5YR'] >=0.6].shape)
print ('50%: Number of institutions are showing on left side, & columns on the right side',\
       data.loc[data['LOINCOMERPY5YR'] >=0.5].shape)
print ('40%: Number of institutions are showing on left side, & columns on the right side',\
       data.loc[data['LOINCOMERPY5YR'] >=0.4].shape)
print ('30%: Number of institutions are showing on left side, & columns on the right side',\
       data.loc[data['LOINCOMERPY5YR'] >=0.3].shape)
print ('20%: Number of institutions are showing on left side, & columns on the right side',\
       data.loc[data['LOINCOMERPY5YR'] >=0.2].shape)
print ('10%: Number of institutions are showing on left side, & columns on the right side',\
       data.loc[data['LOINCOMERPY5YR'] >=0.1].shape)

print ('5 years repayment completion rate for low-middle income students according to different percentages')
print ('90%: Number of institutions are showing on left side, & columns on the right side',\
       data.loc[data['LOMIAVGRPY5YR'] >=0.9].shape)
print ('80%: Number of institutions are showing on left side, & columns on the right side',\
       data.loc[data['LOMIAVGRPY5YR'] >=0.8].shape)
print ('70%: Number of institutions are showing on left side, & columns on the right side',\
       data.loc[data['LOMIAVGRPY5YR'] >=0.7].shape)
print ('60%: Number of institutions are showing on left side, & columns on the right side',\
       data.loc[data['LOMIAVGRPY5YR'] >=0.6].shape)
print ('50%: Number of institutions are showing on left side, & columns on the right side',\
       data.loc[data['LOMIAVGRPY5YR'] >=0.5].shape)
print ('40%: Number of institutions are showing on left side, & columns on the right side',\
       data.loc[data['LOMIAVGRPY5YR'] >=0.4].shape)
print ('30%: Number of institutions are showing on left side, & columns on the right side',\
       data.loc[data['LOMIAVGRPY5YR'] >=0.3].shape)
print ('20%: Number of institutions are showing on left side, & columns on the right side',\
       data.loc[data['LOMIAVGRPY5YR'] >=0.2].shape)
print ('10%: Number of institutions are showing on left side, & columns on the right side',\
       data.loc[data['LOMIAVGRPY5YR'] >=0.1].shape)

print ('7 years repayment completion rate for different percentages')
print ('90%: Number of institutions are showing on left side, & columns on the right side',\
       data.loc[data['COMPLETERPY7YR'] >=0.9].shape)
print ('80%: Number of institutions are showing on left side, & columns on the right side',\
       data.loc[data['COMPLETERPY7YR'] >=0.8].shape)
print ('70%: Number of institutions are showing on left side, & columns on the right side',\
       data.loc[data['COMPLETERPY7YR'] >=0.7].shape)
print ('60%: Number of institutions are showing on left side, & columns on the right side',\
       data.loc[data['COMPLETERPY7YR'] >=0.6].shape)
print ('50%: Number of institutions are showing on left side, & columns on the right side',\
       data.loc[data['COMPLETERPY7YR'] >=0.5].shape)
print ('40%: Number of institutions are showing on left side, & columns on the right side',\
       data.loc[data['COMPLETERPY7YR'] >=0.4].shape)
print ('30%: Number of institutions are showing on left side, & columns on the right side',\
       data.loc[data['COMPLETERPY7YR'] >=0.3].shape)
print ('20%: Number of institutions are showing on left side, & columns on the right side',\
       data.loc[data['COMPLETERPY7YR'] >=0.2].shape)
print ('10%: Number of institutions are showing on left side, & columns on the right side',\
       data.loc[data['COMPLETERPY7YR'] >=0.1].shape)

print ('7 years repayment completion rate for low income students according to different percentages')
print ('90%: Number of institutions are showing on left side, & columns on the right side',\
       data.loc[data['LOINCOMERPY7YR'] >=0.9].shape)
print ('80%: Number of institutions are showing on left side, & columns on the right side',\
       data.loc[data['LOINCOMERPY7YR'] >=0.8].shape)
print ('70%: Number of institutions are showing on left side, & columns on the right side',\
       data.loc[data['LOINCOMERPY7YR'] >=0.7].shape)
print ('60%: Number of institutions are showing on left side, & columns on the right side',\
       data.loc[data['LOINCOMERPY7YR'] >=0.6].shape)
print ('50%: Number of institutions are showing on left side, & columns on the right side',\
       data.loc[data['LOINCOMERPY7YR'] >=0.5].shape)
print ('40%: Number of institutions are showing on left side, & columns on the right side',\
       data.loc[data['LOINCOMERPY7YR'] >=0.4].shape)
print ('30%: Number of institutions are showing on left side, & columns on the right side',\
       data.loc[data['LOINCOMERPY7YR'] >=0.3].shape)
print ('20%: Number of institutions are showing on left side, & columns on the right side',\
       data.loc[data['LOINCOMERPY7YR'] >=0.2].shape)
print ('10%: Number of institutions are showing on left side, & columns on the right side',\
       data.loc[data['LOINCOMERPY7YR'] >=0.1].shape)

print ('7 years repayment completion rate for low-middle income students according to different percentages')
print ('90%: Number of institutions are showing on left side, & columns on the right side',\
       data.loc[data['LOMIAVGRPY7YR'] >=0.9].shape)
print ('80%: Number of institutions are showing on left side, & columns on the right side',\
       data.loc[data['LOMIAVGRPY7YR'] >=0.8].shape)
print ('70%: Number of institutions are showing on left side, & columns on the right side',\
       data.loc[data['LOMIAVGRPY7YR'] >=0.7].shape)
print ('60%: Number of institutions are showing on left side, & columns on the right side',\
       data.loc[data['LOMIAVGRPY7YR'] >=0.6].shape)
print ('50%: Number of institutions are showing on left side, & columns on the right side',\
       data.loc[data['LOMIAVGRPY7YR'] >=0.5].shape)
print ('40%: Number of institutions are showing on left side, & columns on the right side',\
       data.loc[data['LOMIAVGRPY7YR'] >=0.4].shape)
print ('30%: Number of institutions are showing on left side, & columns on the right side',\
       data.loc[data['LOMIAVGRPY7YR'] >=0.3].shape)
print ('20%: Number of institutions are showing on left side, & columns on the right side',\
       data.loc[data['LOMIAVGRPY7YR'] >=0.2].shape)
print ('10%: Number of institutions are showing on left side, & columns on the right side',\
       data.loc[data['LOMIAVGRPY7YR'] >=0.1].shape)

### The statistics above show two trends: the repayment rate generally improves as the repayment window lengthens, and the number of qualifying institutions falls as the percentage threshold rises. We analyze the repayment completion rate for three groups: the overall cohort, low income students, and low-middle income students (the average of low and middle income students). We focus on these income groups because higher income students (above 48K) are less likely to have trouble repaying. The result for each category is given below.

### For 1 year, 2455, 842 and 1066 institutions can be considered.
### For 3 years, 2923, 1242 and 1437 institutions can be considered.
### For 5 years, 3152, 1555 and 1748 institutions can be considered.
### For 7 years, 3149, 1859 and 2009 institutions can be considered.

### The one-year repayment rate is comparatively weak, so we combine the 3, 5 and 7 year repayment completion rates for low income and low-middle income students into a new feature that represents all of this information. We use 50% as the threshold to see the respective results.
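
### The threshold counts printed above can also be gathered into a single table. The following is a minimal sketch, not the notebook's own method: the helper name `count_institutions` and the toy values are assumptions made for illustration, and only the column names come from the data set.

```python
import pandas as pd

def count_institutions(df, columns, thresholds):
    """Count, per column, the rows whose value is at or above each threshold."""
    counts = {
        column: [int((df[column] >= t).sum()) for t in thresholds]
        for column in columns
    }
    index = [f"{round(t * 100)}%" for t in thresholds]
    return pd.DataFrame(counts, index=index)

# Toy values invented for illustration only (not IPEDS figures).
toy = pd.DataFrame({
    "LOINCOMERPY3YR": [0.2, 0.55, 0.7, None],
    "LOINCOMERPY5YR": [0.3, 0.6, 0.8, 0.5],
})
table = count_institutions(toy, ["LOINCOMERPY3YR", "LOINCOMERPY5YR"], [0.5, 0.7])
print(table)
```

### Missing values never satisfy `>=`, so institutions without a reported rate are simply not counted, which matches the behavior of the `data.loc[...]` filters.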


In [None]:
# Select the institutions whose repayment completion rate (overall, low income and
# low-middle income students) is 50% or higher
repayment_completion_1YR = data.loc[data['COMPLETERPY1YR'] >=0.5]
repayment_completion_1YR = repayment_completion_1YR[['INSTNM','COMPLETERPY1YR']]
repayment_loincome_1YR = data.loc[data['LOINCOMERPY1YR'] >=0.5]
repayment_loincome_1YR = repayment_loincome_1YR[['INSTNM','LOINCOMERPY1YR']]
repayment_lomiavg_1YR = data.loc[data['LOMIAVGRPY1YR'] >=0.5]
repayment_lomiavg_1YR = repayment_lomiavg_1YR[['INSTNM','LOMIAVGRPY1YR']]
repayment_completion_3YR = data.loc[data['COMPLETERPY3YR'] >=0.5]
repayment_completion_3YR = repayment_completion_3YR[['INSTNM','COMPLETERPY3YR']]
repayment_loincome_3YR = data.loc[data['LOINCOMERPY3YR'] >=0.5]
repayment_loincome_3YR = repayment_loincome_3YR[['INSTNM','LOINCOMERPY3YR']]
repayment_lomiavg_3YR = data.loc[data['LOMIAVGRPY3YR'] >=0.5]
repayment_lomiavg_3YR = repayment_lomiavg_3YR[['INSTNM','LOMIAVGRPY3YR']]
repayment_completion_5YR = data.loc[data['COMPLETERPY5YR'] >=0.5]
repayment_completion_5YR = repayment_completion_5YR[['INSTNM','COMPLETERPY5YR']]
repayment_loincome_5YR = data.loc[data['LOINCOMERPY5YR'] >=0.5]
repayment_loincome_5YR = repayment_loincome_5YR[['INSTNM','LOINCOMERPY5YR']]
repayment_lomiavg_5YR = data.loc[data['LOMIAVGRPY5YR'] >=0.5]
repayment_lomiavg_5YR = repayment_lomiavg_5YR[['INSTNM','LOMIAVGRPY5YR']]
repayment_completion_7YR = data.loc[data['COMPLETERPY7YR'] >=0.5]
repayment_completion_7YR = repayment_completion_7YR[['INSTNM','COMPLETERPY7YR']]
repayment_loincome_7YR = data.loc[data['LOINCOMERPY7YR'] >=0.5]
repayment_loincome_7YR = repayment_loincome_7YR[['INSTNM','LOINCOMERPY7YR']]
repayment_lomiavg_7YR = data.loc[data['LOMIAVGRPY7YR'] >=0.5] 
repayment_lomiavg_7YR = repayment_lomiavg_7YR[['INSTNM','LOMIAVGRPY7YR']]

## Institutions that meet the 50% threshold for the 1, 3, 5 & 7 year repayment completion rate, overall and for low income and low-middle income students
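
### The twelve filtered frames follow one pattern: keep the rows at or above the threshold and retain `INSTNM` plus the rate column. A minimal sketch of that pattern as a loop is given here; the names `GROUPS`, `YEARS` and `filter_by_threshold` are hypothetical, and the toy frame is invented for illustration.

```python
import pandas as pd

# Column prefixes as they appear in the data set; the dictionary keys are hypothetical labels.
GROUPS = {"completion": "COMPLETERPY", "loincome": "LOINCOMERPY", "lomiavg": "LOMIAVGRPY"}
YEARS = (1, 3, 5, 7)

def filter_by_threshold(df, threshold=0.5):
    """Return {(group, year): frame} with institutions at or above the threshold."""
    frames = {}
    for group, prefix in GROUPS.items():
        for year in YEARS:
            column = f"{prefix}{year}YR"
            if column not in df.columns:
                continue  # skip rate columns that are absent from this frame
            frames[(group, year)] = df.loc[df[column] >= threshold, ["INSTNM", column]]
    return frames

# Toy frame invented for illustration only.
toy = pd.DataFrame({
    "INSTNM": ["A", "B", "C"],
    "COMPLETERPY3YR": [0.4, 0.5, 0.9],
    "LOINCOMERPY3YR": [0.1, 0.6, 0.3],
})
frames = filter_by_threshold(toy)
print(frames[("loincome", 3)])
```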

In [None]:
# 2455 Institutions
repayment_completion_1YR.head()

In [None]:
# 2923 Institutions
repayment_completion_3YR.head()

In [None]:
# 3152 Institutions
repayment_completion_5YR.head()

In [None]:
# 3149 Institutions
repayment_completion_7YR.head()

In [None]:
# 842 Institutions
repayment_loincome_1YR.head()

In [None]:
# 1242 Institutions
repayment_loincome_3YR.head()

In [None]:
# 1555 Institutions
repayment_loincome_5YR.head()

In [None]:
# 1859 Institutions
repayment_loincome_7YR.head()

In [None]:
# 1066 Institutions
repayment_lomiavg_1YR.head()

In [None]:
# 1437 Institutions
repayment_lomiavg_3YR.head()

In [None]:
# 1748 Institutions
repayment_lomiavg_5YR.head()

In [None]:
# 2009 Institutions
repayment_lomiavg_7YR.head()

## Distribution of the Repayment Data
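
### The histograms in this section bin each repayment rate and count the institutions per bin. The same counts can be inspected numerically; the sketch below uses `numpy.histogram` with five fixed bins between 50% and 100%, which is an assumption (the pandas histogram chooses its own bins from the data). The helper name `rate_distribution` and the toy values are hypothetical.

```python
import numpy as np
import pandas as pd

def rate_distribution(frame, column, bins=5, value_range=(0.5, 1.0)):
    """Count institutions per equal-width bin of a repayment rate column."""
    values = frame[column].dropna().to_numpy()
    counts, edges = np.histogram(values, bins=bins, range=value_range)
    labels = [f"{lo:.1f}-{hi:.1f}" for lo, hi in zip(edges[:-1], edges[1:])]
    return pd.Series(counts, index=labels, name=column)

# Toy values invented for illustration only.
toy = pd.DataFrame({
    "INSTNM": ["A", "B", "C", "D", "E"],
    "COMPLETERPY3YR": [0.52, 0.58, 0.65, 0.81, 0.97],
})
dist = rate_distribution(toy, "COMPLETERPY3YR")
print(dist)
```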

In [None]:
import matplotlib.pyplot as plt

# One Year Completion, Low & Low-Middle Income
repayment_completion_1YR.plot(kind='hist', figsize=(15, 6))

plt.xlabel('ONE YEAR REPAYMENT COMPLETION RATE') # add to x-label to the plot
plt.ylabel('NUMBER OF INSTITUTIONS AGAINST RESPECTIVE PERCENTAGES') # add y-label to the plot
# add title to the plot
plt.title('ONE YEAR REPAYMENT VS NUMBER OF INSTITUTIONS AGAINST RESPECTIVE PERCENTAGES')

plt.show()


repayment_loincome_1YR.plot(kind='hist', figsize=(15, 6))

plt.xlabel('ONE YEAR REPAYMENT LOW-INCOME RATE') # add to x-label to the plot
plt.ylabel('NUMBER OF INSTITUTIONS AGAINST RESPECTIVE PERCENTAGES') # add y-label to the plot
# add title to the plot
plt.title('ONE YEAR LOW INCOME REPAYMENT VS NUMBER OF INSTITUTIONS AGAINST RESPECTIVE PERCENTAGES')

plt.show()


repayment_lomiavg_1YR.plot(kind='hist', figsize=(15, 6))

plt.xlabel('ONE YEAR LOW-MIDDLE INCOME REPAYMENT RATE') # add to x-label to the plot
plt.ylabel('NUMBER OF INSTITUTIONS AGAINST RESPECTIVE PERCENTAGES') # add y-label to the plot
# add title to the plot
plt.title('ONE YEAR LOW-MIDDLE INCOME REPAYMENT VS NUMBER OF INSTITUTIONS AGAINST RESPECTIVE PERCENTAGES')

plt.show()

# 3 Years Completion,Low & Low-Middle Income
repayment_completion_3YR.plot(kind='hist', figsize=(15, 6))

plt.xlabel('3 YEARS REPAYMENT COMPLETION RATE') # add to x-label to the plot
plt.ylabel('NUMBER OF INSTITUTIONS AGAINST RESPECTIVE PERCENTAGES') # add y-label to the plot
# add title to the plot
plt.title('3 YEARS REPAYMENT VS NUMBER OF INSTITUTIONS AGAINST RESPECTIVE PERCENTAGES')

plt.show()


repayment_loincome_3YR.plot(kind='hist', figsize=(15, 6))

plt.xlabel('3 YEARS REPAYMENT LOW-INCOME RATE') # add to x-label to the plot
plt.ylabel('NUMBER OF INSTITUTIONS AGAINST RESPECTIVE PERCENTAGES') # add y-label to the plot
# add title to the plot
plt.title('3 YEARS LOW INCOME REPAYMENT VS NUMBER OF INSTITUTIONS AGAINST RESPECTIVE PERCENTAGES')

plt.show()


repayment_lomiavg_3YR.plot(kind='hist', figsize=(15, 6))

plt.xlabel('3 YEARS LOW-MIDDLE INCOME REPAYMENT RATE') # add to x-label to the plot
plt.ylabel('NUMBER OF INSTITUTIONS AGAINST RESPECTIVE PERCENTAGES') # add y-label to the plot
# add title to the plot
plt.title('3 YEARS LOW-MIDDLE INCOME REPAYMENT VS NUMBER OF INSTITUTIONS AGAINST RESPECTIVE PERCENTAGES')

plt.show()

# 5 Years Completion,Low & Low-Middle Income
repayment_completion_5YR.plot(kind='hist', figsize=(15, 6))

plt.xlabel('5 YEARS REPAYMENT COMPLETION RATE') # add to x-label to the plot
plt.ylabel('NUMBER OF INSTITUTIONS AGAINST RESPECTIVE PERCENTAGES') # add y-label to the plot
# add title to the plot
plt.title('5 YEARS REPAYMENT VS NUMBER OF INSTITUTIONS AGAINST RESPECTIVE PERCENTAGES')

plt.show()

repayment_loincome_5YR.plot(kind='hist', figsize=(15, 6))

plt.xlabel('5 YEARS REPAYMENT LOW-INCOME RATE') # add to x-label to the plot
plt.ylabel('NUMBER OF INSTITUTIONS AGAINST RESPECTIVE PERCENTAGES') # add y-label to the plot
# add title to the plot
plt.title('5 YEARS LOW INCOME REPAYMENT VS NUMBER OF INSTITUTIONS AGAINST RESPECTIVE PERCENTAGES')

plt.show()

repayment_lomiavg_5YR.plot(kind='hist', figsize=(15, 6))

plt.xlabel('5 YEARS LOW-MIDDLE INCOME REPAYMENT RATE') # add to x-label to the plot
plt.ylabel('NUMBER OF INSTITUTIONS AGAINST RESPECTIVE PERCENTAGES') # add y-label to the plot
# add title to the plot
plt.title('5 YEARS LOW-MIDDLE INCOME REPAYMENT VS NUMBER OF INSTITUTIONS AGAINST RESPECTIVE PERCENTAGES')

plt.show()

# 7 Years Completion,Low & Low-Middle Income
repayment_completion_7YR.plot(kind='hist', figsize=(15, 6))

plt.xlabel('7 YEARS REPAYMENT COMPLETION RATE') # add to x-label to the plot
plt.ylabel('NUMBER OF INSTITUTIONS AGAINST RESPECTIVE PERCENTAGES') # add y-label to the plot
# add title to the plot
plt.title('7 YEARS REPAYMENT VS NUMBER OF INSTITUTIONS AGAINST RESPECTIVE PERCENTAGES')

plt.show()

repayment_loincome_7YR.plot(kind='hist', figsize=(15, 6))

plt.xlabel('7 YEARS REPAYMENT LOW-INCOME RATE') # add to x-label to the plot
plt.ylabel('NUMBER OF INSTITUTIONS AGAINST RESPECTIVE PERCENTAGES') # add y-label to the plot
# add title to the plot
plt.title('7 YEARS LOW INCOME REPAYMENT VS NUMBER OF INSTITUTIONS AGAINST RESPECTIVE PERCENTAGES')

plt.show()


repayment_lomiavg_7YR.plot(kind='hist', figsize=(15, 6))

plt.xlabel('7 YEARS LOW-MIDDLE INCOME REPAYMENT RATE') # add to x-label to the plot
plt.ylabel('NUMBER OF INSTITUTIONS AGAINST RESPECTIVE PERCENTAGES') # add y-label to the plot
# add title to the plot
plt.title('7 YEARS LOW-MIDDLE INCOME REPAYMENT VS NUMBER OF INSTITUTIONS AGAINST RESPECTIVE PERCENTAGES')

plt.show()

## REPAYMENT RATE FOR TOP 50 INSTITUTIONS 
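
### Every chart in this section is prepared the same way: average the rate per institution, sort in descending order, and keep the first 50 rows. A minimal sketch of that step as one reusable function is shown here; the name `top_institutions` and the toy values are assumptions made for illustration.

```python
import pandas as pd

def top_institutions(frame, column, n=50):
    """Average `column` per institution and return the n highest, indexed by INSTNM."""
    return (
        frame.groupby("INSTNM")[column]
        .mean()
        .sort_values(ascending=False)
        .head(n)
        .to_frame()
    )

# Toy values invented for illustration only.
toy = pd.DataFrame({
    "INSTNM": ["A", "A", "B", "C"],
    "LOINCOMERPY3YR": [0.6, 0.8, 0.9, 0.55],
})
top2 = top_institutions(toy, "LOINCOMERPY3YR", n=2)
print(top2)
```

### The result is already indexed by `INSTNM`, so it can be passed straight to `.plot(kind='barh')`.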

In [None]:
# ONE YEAR REPAYMENT RATE FOR COMPLETION, LOW INCOME & LOW-MIDDLE INCOME STUDENTS
# Top 50 Institutions
repayment_completion_1YR = repayment_completion_1YR.groupby(['INSTNM']).mean()
repayment_completion_1YR = repayment_completion_1YR.sort_values(['COMPLETERPY1YR'], ascending=False, axis=0)

# Choosing Top 50 institutions
repayment_completion_1YR =  repayment_completion_1YR.reset_index()
repayment_completion_1YR=repayment_completion_1YR.head(50)
repayment_completion_1YR=repayment_completion_1YR[['INSTNM', 'COMPLETERPY1YR']]
repayment_completion_1YR=repayment_completion_1YR.set_index('INSTNM')

# plot data For One Year Repayment Rate

repayment_completion_1YR.plot(kind='bar', figsize=(15, 6))

plt.xlabel('INSTITUTIONS') # add to x-label to the plot
plt.ylabel('ONE YEAR REPAYMENT RATE') # add y-label to the plot
# add title to the plot
plt.title('Top 50 Institutions VS ONE YEAR REPAYMENT RATE')

plt.show()

# Top 50 Institutions
repayment_loincome_1YR = repayment_loincome_1YR.groupby(['INSTNM']).mean()
repayment_loincome_1YR = repayment_loincome_1YR.sort_values(['LOINCOMERPY1YR'], ascending=False, axis=0)

# Choosing Top 50 institutions
repayment_loincome_1YR =  repayment_loincome_1YR.reset_index()
repayment_loincome_1YR=repayment_loincome_1YR.head(50)
repayment_loincome_1YR=repayment_loincome_1YR[['INSTNM', 'LOINCOMERPY1YR']]
repayment_loincome_1YR=repayment_loincome_1YR.set_index('INSTNM')

# plot data For 1 year repayment rate for low income students

repayment_loincome_1YR.plot(kind='barh', figsize=(15, 20))

plt.xlabel('ONE YEAR REPAYMENT RATE OF LOW INCOME STUDENTS') # add x-label to the plot
plt.ylabel('INSTITUTIONS') # add y-label to the plot
# add title to the plot
plt.title('Top 50 Institutions VS ONE YEAR REPAYMENT RATE OF LOW INCOME STUDENTS')

plt.show()

# Top 50 Institutions
repayment_lomiavg_1YR=  repayment_lomiavg_1YR.groupby(['INSTNM']).mean()
repayment_lomiavg_1YR = repayment_lomiavg_1YR.sort_values(['LOMIAVGRPY1YR'], ascending=False, axis=0)


# Choosing Top 50 institutions
repayment_lomiavg_1YR = repayment_lomiavg_1YR.reset_index()
repayment_lomiavg_1YR=repayment_lomiavg_1YR.head(50)
repayment_lomiavg_1YR=repayment_lomiavg_1YR[['INSTNM', 'LOMIAVGRPY1YR']]
repayment_lomiavg_1YR=repayment_lomiavg_1YR.set_index('INSTNM')

# plot data For 1 year repayment rate for low-middle income students

repayment_lomiavg_1YR.plot(kind='barh', figsize=(15, 20))

plt.xlabel('ONE YEAR REPAYMENT RATE OF LOW-MIDDLE INCOME STUDENTS') # add x-label to the plot
plt.ylabel('INSTITUTIONS') # add y-label to the plot
# add title to the plot
plt.title('Top 50 Institutions VS ONE YEAR REPAYMENT RATE OF LOW-MIDDLE INCOME STUDENTS')

plt.show()

# 3 YEARS REPAYMENT RATE FOR COMPLETION, LOW INCOME & LOW-MIDDLE INCOME STUDENTS

# Top 50 Institutions
repayment_completion_3YR = repayment_completion_3YR.groupby(['INSTNM']).mean()
repayment_completion_3YR = repayment_completion_3YR.sort_values(['COMPLETERPY3YR'], ascending=False, axis=0)

# Choosing Top 50 institutions
repayment_completion_3YR =  repayment_completion_3YR.reset_index()
repayment_completion_3YR=repayment_completion_3YR.head(50)
repayment_completion_3YR=repayment_completion_3YR[['INSTNM', 'COMPLETERPY3YR']]
repayment_completion_3YR=repayment_completion_3YR.set_index('INSTNM')

# plot data For Three Years Repayment Rate

repayment_completion_3YR.plot(kind='bar', figsize=(15, 6))

plt.xlabel('INSTITUTIONS') # add to x-label to the plot
plt.ylabel('THREE YEAR REPAYMENT RATE') # add y-label to the plot
# add title to the plot
plt.title('Top 50 Institutions VS THREE YEAR REPAYMENT RATE')

plt.show()

# Top 50 Institutions
repayment_loincome_3YR = repayment_loincome_3YR.groupby(['INSTNM']).mean()
repayment_loincome_3YR = repayment_loincome_3YR.sort_values(['LOINCOMERPY3YR'], ascending=False, axis=0)

# Choosing Top 50 institutions
repayment_loincome_3YR =  repayment_loincome_3YR.reset_index()
repayment_loincome_3YR=repayment_loincome_3YR.head(50)
repayment_loincome_3YR=repayment_loincome_3YR[['INSTNM', 'LOINCOMERPY3YR']]
repayment_loincome_3YR=repayment_loincome_3YR.set_index('INSTNM')

# plot data For 3 years repayment rate for low income students

repayment_loincome_3YR.plot(kind='barh', figsize=(15, 20))

plt.xlabel('THREE YEARS REPAYMENT RATE OF LOW INCOME STUDENTS') # add x-label to the plot
plt.ylabel('INSTITUTIONS') # add y-label to the plot
# add title to the plot
plt.title('Top 50 Institutions VS THREE YEARS REPAYMENT RATE OF LOW INCOME STUDENTS')

plt.show()

# Top 50 Institutions
repayment_lomiavg_3YR=  repayment_lomiavg_3YR.groupby(['INSTNM']).mean()
repayment_lomiavg_3YR = repayment_lomiavg_3YR.sort_values(['LOMIAVGRPY3YR'], ascending=False, axis=0)

# Choosing Top 50 institutions
repayment_lomiavg_3YR = repayment_lomiavg_3YR.reset_index()
repayment_lomiavg_3YR=repayment_lomiavg_3YR.head(50)
repayment_lomiavg_3YR=repayment_lomiavg_3YR[['INSTNM', 'LOMIAVGRPY3YR']]
repayment_lomiavg_3YR=repayment_lomiavg_3YR.set_index('INSTNM')

# plot data For 3 years repayment rate for low-middle income students

repayment_lomiavg_3YR.plot(kind='barh', figsize=(15, 20))

plt.xlabel('THREE YEARS REPAYMENT RATE OF LOW-MIDDLE INCOME STUDENTS') # add x-label to the plot
plt.ylabel('INSTITUTIONS') # add y-label to the plot
# add title to the plot
plt.title('Top 50 Institutions VS THREE YEAR REPAYMENT RATE OF LOW-MIDDLE INCOME STUDENTS')

plt.show()

# 5 YEARS REPAYMENT RATE FOR COMPLETION, LOW INCOME & LOW-MIDDLE INCOME STUDENTS
# Top 50 Institutions
repayment_completion_5YR = repayment_completion_5YR.groupby(['INSTNM']).mean()
repayment_completion_5YR = repayment_completion_5YR.sort_values(['COMPLETERPY5YR'], ascending=False, axis=0)

# Choosing Top 50 institutions
repayment_completion_5YR =  repayment_completion_5YR.reset_index()
repayment_completion_5YR=repayment_completion_5YR.head(50)
repayment_completion_5YR=repayment_completion_5YR[['INSTNM', 'COMPLETERPY5YR']]
repayment_completion_5YR=repayment_completion_5YR.set_index('INSTNM')

# plot data For Five Years Repayment Rate

repayment_completion_5YR.plot(kind='bar', figsize=(15, 6))

plt.xlabel('INSTITUTIONS') # add to x-label to the plot
plt.ylabel('FIVE YEARS REPAYMENT RATE') # add y-label to the plot
# add title to the plot
plt.title('Top 50 Institutions VS FIVE YEARS REPAYMENT RATE')

plt.show()

# Top 50 Institutions
repayment_loincome_5YR = repayment_loincome_5YR.groupby(['INSTNM']).mean()
repayment_loincome_5YR = repayment_loincome_5YR.sort_values(['LOINCOMERPY5YR'], ascending=False, axis=0)

# Choosing Top 50 institutions
repayment_loincome_5YR =  repayment_loincome_5YR.reset_index()
repayment_loincome_5YR=repayment_loincome_5YR.head(50)
repayment_loincome_5YR=repayment_loincome_5YR[['INSTNM', 'LOINCOMERPY5YR']]
repayment_loincome_5YR=repayment_loincome_5YR.set_index('INSTNM')

# plot data For Five years repayment rate for low income students

repayment_loincome_5YR.plot(kind='barh', figsize=(15, 20))

plt.xlabel('FIVE YEARS REPAYMENT RATE OF LOW INCOME STUDENTS') # add x-label to the plot
plt.ylabel('INSTITUTIONS') # add y-label to the plot
# add title to the plot
plt.title('Top 50 Institutions VS FIVE YEARS REPAYMENT RATE OF LOW INCOME STUDENTS')

plt.show()

# Top 50 Institutions
repayment_lomiavg_5YR=  repayment_lomiavg_5YR.groupby(['INSTNM']).mean()
repayment_lomiavg_5YR = repayment_lomiavg_5YR.sort_values(['LOMIAVGRPY5YR'], ascending=False, axis=0)


# Choosing Top 50 institutions
repayment_lomiavg_5YR = repayment_lomiavg_5YR.reset_index()
repayment_lomiavg_5YR=repayment_lomiavg_5YR.head(50)
repayment_lomiavg_5YR=repayment_lomiavg_5YR[['INSTNM', 'LOMIAVGRPY5YR']]
repayment_lomiavg_5YR=repayment_lomiavg_5YR.set_index('INSTNM')

# plot data For 5 years repayment rate for low-middle income students

repayment_lomiavg_5YR.plot(kind='barh', figsize=(15, 20))

plt.xlabel('FIVE YEARS REPAYMENT RATE OF LOW-MIDDLE INCOME STUDENTS') # add x-label to the plot
plt.ylabel('INSTITUTIONS') # add y-label to the plot
# add title to the plot
plt.title('Top 50 Institutions VS FIVE YEARS REPAYMENT RATE OF LOW-MIDDLE INCOME STUDENTS')

plt.show()

# 7 YEARS REPAYMENT RATE FOR COMPLETION, LOW INCOME & LOW-MIDDLE INCOME STUDENTS

# Top 50 Institutions
repayment_completion_7YR = repayment_completion_7YR.groupby(['INSTNM']).mean()
repayment_completion_7YR = repayment_completion_7YR.sort_values(['COMPLETERPY7YR'], ascending=False, axis=0)

# Choosing Top 50 institutions
repayment_completion_7YR =  repayment_completion_7YR.reset_index()
repayment_completion_7YR=repayment_completion_7YR.head(50)
repayment_completion_7YR=repayment_completion_7YR[['INSTNM', 'COMPLETERPY7YR']]
repayment_completion_7YR=repayment_completion_7YR.set_index('INSTNM')

# plot data For Seven Years Repayment Rate

repayment_completion_7YR.plot(kind='bar', figsize=(15, 6))

plt.xlabel('INSTITUTIONS') # add to x-label to the plot
plt.ylabel('SEVEN YEARS REPAYMENT RATE') # add y-label to the plot
# add title to the plot
plt.title('Top 50 Institutions VS SEVEN YEARS REPAYMENT RATE')

plt.show()

# Top 50 Institutions
repayment_loincome_7YR = repayment_loincome_7YR.groupby(['INSTNM']).mean()
repayment_loincome_7YR = repayment_loincome_7YR.sort_values(['LOINCOMERPY7YR'], ascending=False, axis=0)

# Choosing Top 50 institutions
repayment_loincome_7YR =  repayment_loincome_7YR.reset_index()
repayment_loincome_7YR = repayment_loincome_7YR.head(50)
repayment_loincome_7YR = repayment_loincome_7YR[['INSTNM', 'LOINCOMERPY7YR']]
repayment_loincome_7YR = repayment_loincome_7YR.set_index('INSTNM')

# plot data For seven years repayment rate for low income students

repayment_loincome_7YR.plot(kind='barh', figsize=(15, 20))

plt.xlabel('SEVEN YEARS REPAYMENT RATE OF LOW INCOME STUDENTS') # add x-label to the plot
plt.ylabel('INSTITUTIONS') # add y-label to the plot
# add title to the plot
plt.title('Top 50 Institutions VS SEVEN YEARS REPAYMENT RATE OF LOW INCOME STUDENTS')

plt.show()

# Top 50 Institutions
repayment_lomiavg_7YR=  repayment_lomiavg_7YR.groupby(['INSTNM']).mean()
repayment_lomiavg_7YR = repayment_lomiavg_7YR.sort_values(['LOMIAVGRPY7YR'], ascending=False, axis=0)


# Choosing Top 50 institutions
repayment_lomiavg_7YR = repayment_lomiavg_7YR.reset_index()
repayment_lomiavg_7YR=repayment_lomiavg_7YR.head(50)
repayment_lomiavg_7YR=repayment_lomiavg_7YR[['INSTNM', 'LOMIAVGRPY7YR']]
repayment_lomiavg_7YR=repayment_lomiavg_7YR.set_index('INSTNM')

# plot data For seven years repayment rate for low-middle income students

repayment_lomiavg_7YR.plot(kind='barh', figsize=(15, 20))

plt.xlabel('SEVEN YEARS REPAYMENT RATE OF LOW-MIDDLE INCOME STUDENTS') # add x-label to the plot
plt.ylabel('INSTITUTIONS') # add y-label to the plot
# add title to the plot
plt.title('Top 50 Institutions VS SEVEN YEARS REPAYMENT RATE OF LOW-MIDDLE INCOME STUDENTS')

plt.show()

In [None]:
def repayment_completion_rate(default_rate, threeyear_loincome_repayment, threeyear_lomiincome_repayment,
                              fiveyear_loincome_repayment, fiveyear_lomiincome_repayment,
                              sevenyear_loincome_repayment, sevenyear_lomiincome_repayment):
    # A default rate of 40% or more rules the institution out.
    if default_rate >= 0.4:
        return 0

    # Otherwise flag the institution (1) when any low income or low-middle income
    # repayment rate at 3, 5 or 7 years is above 50%.
    rates = (threeyear_loincome_repayment, threeyear_lomiincome_repayment,
             fiveyear_loincome_repayment, fiveyear_lomiincome_repayment,
             sevenyear_loincome_repayment, sevenyear_lomiincome_repayment)
    if any(rate > 0.5 for rate in rates):
        return 1

    # No rate is above 50%. Missing values (NaN) compare as False, so they also
    # end up here instead of leaving the result undefined.
    return 0

In [None]:
data['repayment'] = data.apply (lambda x:repayment_completion_rate (x['CDR3'],x['LOINCOMERPY3YR'],x['LOMIAVGRPY3YR'],\
                                                                   x['LOINCOMERPY5YR'],x['LOMIAVGRPY5YR'],\
                                                                   x['LOINCOMERPY7YR'],x['LOMIAVGRPY7YR'])\
                                                               ,axis=1)
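
### The row-wise `apply` above can be slow on a large frame. A vectorized equivalent is sketched below under the same rules (default rate of 40% or more gives 0, any rate above 50% gives 1, otherwise 0). The names `RATE_COLUMNS` and `repayment_flag` are hypothetical, and the toy values are invented for illustration.

```python
import numpy as np
import pandas as pd

# Rate columns used by the repayment feature (names taken from the data set).
RATE_COLUMNS = [
    "LOINCOMERPY3YR", "LOMIAVGRPY3YR",
    "LOINCOMERPY5YR", "LOMIAVGRPY5YR",
    "LOINCOMERPY7YR", "LOMIAVGRPY7YR",
]

def repayment_flag(df, default_column="CDR3", rate_columns=RATE_COLUMNS):
    """Return 1 where the default rate is below 40% and any rate exceeds 50%, else 0."""
    any_rate_above_half = (df[rate_columns] > 0.5).any(axis=1)
    not_high_default = ~(df[default_column] >= 0.4)  # a missing default rate is not treated as high
    return (not_high_default & any_rate_above_half).astype(int)

# Toy values invented for illustration only.
toy = pd.DataFrame({
    "CDR3": [0.10, 0.45, 0.10, np.nan],
    "LOINCOMERPY3YR": [0.60, 0.90, 0.40, 0.30],
    "LOMIAVGRPY3YR": [0.40, 0.90, 0.40, 0.30],
    "LOINCOMERPY5YR": [0.40, 0.90, 0.40, np.nan],
    "LOMIAVGRPY5YR": [0.40, 0.90, 0.40, 0.30],
    "LOINCOMERPY7YR": [0.40, 0.90, 0.40, 0.70],
    "LOMIAVGRPY7YR": [0.40, 0.90, 0.40, 0.30],
})
flags = repayment_flag(toy)
print(flags.tolist())
```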

## 3). 200%, 100%, 150% & 150% PELL COMPLETION RATE

### College completion is associated with other positive outcomes, like finding a job and successfully repaying student loans, and is an important metric for evaluating the experiences of students at the institution. However, both existing and new methods of measuring completion have limitations.

### For institutions primarily following an academic year calendar system, the IPEDS completion rates are limited to full-time, first-time students beginning in the fall semester. For institutions primarily following a non-academic year calendar system (program or continuous enrollment), the IPEDS completion rates cover all full-time, first-time students. The exclusion of part-time students, transfer students, and students who do not start during the fall from IPEDS completion rates makes the rates less relevant for those populations of students.

### IPEDS data have several important limitations for measuring institutional performance. Perhaps the most significant is that many outcomes are recorded for a limited subset of students. Most importantly, graduation rates are only reported for cohorts of full-time, first-time students, so graduation rate information is not available for students who may have previous higher education experience or for part-time students. Another limitation is that outcomes are not recorded for students who transfer from the institution. Thus, information on graduation rate outcomes is limited.

### Cohort graduation rates are for full-time, first-time students. This is the official measure of graduation rates mandated by the Higher Education Act, measuring the fraction of full-time, first-time students who complete their program of study within 100, 150, or 200 percent of the ‘normal’ completion time—e.g., the 150 percent completion rate measures the fraction of the cohort that graduates within six years for students pursuing a four-year degree or three years for students pursuing a two-year degree.
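
### The completion windows follow directly from the definition: the normal program length multiplied by the percentage. A small worked example, using a hypothetical helper name `completion_window_years`:

```python
def completion_window_years(normal_years, percent_of_normal_time):
    """Years allowed to graduate at a given percent of the normal completion time."""
    return normal_years * percent_of_normal_time / 100

# Four-year and two-year programs at 100%, 150% and 200% of normal time.
for normal_years in (4, 2):
    for percent in (100, 150, 200):
        window = completion_window_years(normal_years, percent)
        print(f"{normal_years}-year program at {percent}%: {window} years")
```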


In [None]:
print ('200% Completion rate for 4 years degree according to different percentages')
print ('90%: Number of institutions are showing on left side, & columns on the right side',\
       data.loc[data['C200_4'] >=0.9].shape)
print ('80%: Number of institutions are showing on left side, & columns on the right side',\
       data.loc[data['C200_4'] >=0.8].shape)
print ('70%: Number of institutions are showing on left side, & columns on the right side',\
       data.loc[data['C200_4'] >=0.7].shape)
print ('60%: Number of institutions are showing on left side, & columns on the right side',\
       data.loc[data['C200_4'] >=0.6].shape)
print ('50%: Number of institutions are showing on left side, & columns on the right side',\
       data.loc[data['C200_4'] >=0.5].shape)
print ('40%: Number of institutions are showing on left side, & columns on the right side',\
       data.loc[data['C200_4'] >=0.4].shape)
print ('30%: Number of institutions are showing on left side, & columns on the right side',\
       data.loc[data['C200_4'] >=0.3].shape)
print ('20%: Number of institutions are showing on left side, & columns on the right side',\
       data.loc[data['C200_4'] >=0.2].shape)
print ('10%: Number of institutions are showing on left side, & columns on the right side',\
       data.loc[data['C200_4'] >=0.1].shape)

# Remaining completion-rate columns, paired with their report titles
completion_columns = [
    ('C200_L4', '200% Completion rate for less than 4 years degree'),
    ('C150_4_PELL', '150% PELL Completion rate for 4 years degree'),
    ('C150_L4_PELL', '150% PELL Completion rate for less than 4 years degree'),
    ('C150_4', '150% Completion rate for 4 years degree'),
    ('C150_L4', '150% Completion rate for less than 4 years degree'),
    ('C100_4', '100% Completion rate for 4 years degree'),
    ('C100_L4', '100% Completion rate for less than 4 years degree'),
]

for column, title in completion_columns:
    print(title, 'according to different percentages')
    # Thresholds from 90% down to 10%; .shape is (institutions, columns)
    for pct in range(90, 0, -10):
        print(f'{pct}%: Number of institutions (left) and columns (right)',
              data.loc[data[column] >= pct / 100].shape)

### The statistics above show that completion rates are on the low side, especially at the 100% level, for both four-year and less-than-four-year degrees. The 150% completion rate (including PELL students) lies between the 100% and 200% completion rates. We take 50% as our benchmark; the number of institutions meeting it is given below for each category, first for four-year degrees and then for less-than-four-year degrees.

### For the 200% completion rate, 1050 (four-year) & 2007 (less-than-four-year) institutions can be considered.
### For the 150% completion rate for PELL students, 914 & 1898 institutions can be considered.
### For the 150% completion rate, 1142 & 2048 institutions can be considered.
### For the 100% completion rate, 564 & 1060 institutions can be considered.

### There are therefore two options for using the completion rate to understand loan approval. If we take only the 100% completion rate, the share of qualifying institutions is very low, which would produce mostly negative results (loans rejected). If we instead take a combination of all the rates, we get a better picture. This does not guarantee that every student completes on time, but it covers more institutions and students, and the target does not depend on this feature alone, so overall we get better results. We use the medium threshold (50%) for this purpose.
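
### A minimal sketch of the two options on a toy table. The frame `toy` and its values are made up for illustration; only the column names and the 50% benchmark come from the notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: three institutions, two of the eight completion-rate columns
toy = pd.DataFrame({
    'INSTNM': ['A', 'B', 'C'],
    'C100_4': [0.62, 0.31, np.nan],
    'C150_4': [0.80, 0.55, 0.20],
})
rate_columns = ['C100_4', 'C150_4']

# Option 1: rely on the 100% completion rate alone
only_100 = (toy['C100_4'] >= 0.5).astype(int)

# Option 2: flag an institution if any of its rates reaches the benchmark
# (a comparison with NaN evaluates to False, so missing rates never qualify)
combined = (toy[rate_columns] >= 0.5).any(axis=1).astype(int)

print(only_100.tolist())   # [1, 0, 0]
print(combined.tolist())   # [1, 1, 0]
```

### Institution B fails on the 100% rate alone but qualifies once the 150% rate is taken into account, which is the effect the combined option is meant to capture.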


In [None]:
c100_4= data.loc[data['C100_4'] >=0.5]
c100_4 = c100_4[['INSTNM','C100_4']]
c100_L4= data.loc[data['C100_L4'] >=0.5]
c100_L4 = c100_L4[['INSTNM','C100_L4']]
c150_4_pell= data.loc[data['C150_4_PELL'] >=0.5]
c150_4_pell = c150_4_pell[['INSTNM','C150_4_PELL']]
c150_L4_pell= data.loc[data['C150_L4_PELL'] >=0.5]
c150_L4_pell = c150_L4_pell[['INSTNM','C150_L4_PELL']]
c150_4= data.loc[data['C150_4'] >=0.5]
c150_4 = c150_4[['INSTNM','C150_4']]
c150_L4= data.loc[data['C150_L4'] >=0.5]
c150_L4 = c150_L4[['INSTNM','C150_L4']]
c200_4= data.loc[data['C200_4'] >=0.5]
c200_4 = c200_4[['INSTNM','C200_4']]
c200_L4= data.loc[data['C200_L4'] >=0.5]
c200_L4 = c200_L4[['INSTNM','C200_L4']]

## Institutions meeting the 50% threshold for 100%, 150%, 150% PELL & 200% COMPLETION RATE

In [None]:
# 564 Institutions
c100_4.head()

In [None]:
# 1060 Institutions
c100_L4.head()

In [None]:
# 1142 Institutions
c150_4.head()

In [None]:
# 2048 Institutions
c150_L4.head()

In [None]:
# 914 Institutions
c150_4_pell.head()

In [None]:
# 1898 Institutions
c150_L4_pell.head()

In [None]:
# 1050 Institutions
c200_4.head()

In [None]:
# 2007 Institutions
c200_L4.head()

## DISTRIBUTION OF COMPLETION RATE AT 100%, 150% & 200%

In [None]:
import matplotlib.pyplot as plt

# Thresholded frames (completion rate >= 50%) paired with their plot labels
histograms = [
    (c100_4, '100% COMPLETION RATE FOR 4 YEARS DEGREE'),
    (c100_L4, '100% COMPLETION RATE FOR LESS THAN 4 YEARS DEGREE'),
    (c150_4, '150% COMPLETION RATE FOR 4 YEARS DEGREE'),
    (c150_L4, '150% COMPLETION RATE FOR LESS THAN 4 YEARS DEGREE'),
    (c150_4_pell, '150% PELL COMPLETION RATE FOR 4 YEARS DEGREE'),
    (c150_L4_pell, '150% PELL COMPLETION RATE FOR LESS THAN 4 YEARS DEGREE'),
    (c200_4, '200% COMPLETION RATE FOR 4 YEARS DEGREE'),
    (c200_L4, '200% COMPLETION RATE FOR LESS THAN 4 YEARS DEGREE'),
]

for frame, label in histograms:
    frame.plot(kind='hist', figsize=(15, 6))
    plt.xlabel(label)  # x-axis: completion rate
    plt.ylabel('NUMBER OF INSTITUTIONS AGAINST RESPECTIVE PERCENTAGES')  # y-axis: count
    plt.title(label + ' VS NUMBER OF INSTITUTIONS AGAINST RESPECTIVE PERCENTAGES')
    plt.show()

## TOP 50 INSTITUTIONS FOR COMPLETION RATES

In [None]:
import matplotlib.pyplot as plt

def plot_top50(frame, column, label):
    """Bar chart of the 50 institutions with the highest mean value of `column`."""
    # Average rows that share an institution name, then keep the 50 highest rates
    top50 = (frame.groupby('INSTNM')[[column]].mean()
                  .sort_values(column, ascending=False)
                  .head(50))
    top50.plot(kind='bar', figsize=(15, 6))
    plt.xlabel('INSTITUTIONS')
    plt.ylabel(label)
    plt.title('Top 50 Institutions VS ' + label)
    plt.show()

# COMPLETION RATE 100%, 200%, 150% & 150% PELL FOR 4 YEARS DEGREE
plot_top50(c100_4, 'C100_4', '100% COMPLETION RATE 4 YEARS DEGREE')
plot_top50(c200_4, 'C200_4', '200% COMPLETION RATE 4 YEARS')
plot_top50(c150_4, 'C150_4', '150% COMPLETION RATE 4 YEARS')
plot_top50(c150_4_pell, 'C150_4_PELL', '150% PELL COMPLETION RATE 4 YEARS')

# COMPLETION RATE 100%, 200%, 150% & 150% PELL FOR LESS THAN 4 YEARS DEGREE
plot_top50(c100_L4, 'C100_L4', '100% COMPLETION RATE LESS THAN 4 YEARS DEGREE')
plot_top50(c200_L4, 'C200_L4', '200% COMPLETION RATE LESS THAN 4 YEARS DEGREE')
plot_top50(c150_L4, 'C150_L4', '150% COMPLETION RATE LESS THAN 4 YEARS DEGREE')
plot_top50(c150_L4_pell, 'C150_L4_PELL', '150% PELL COMPLETION RATE LESS THAN 4 YEARS DEGREE')

In [None]:
def completion_rate (c100_4,c100_l4,c150_4,c150_l4,c150_4_pell,c150_l4_pell,c200_4,c200_l4):
    if c100_4 >=0.5 or c100_l4 >= 0.5 or c150_4 >= 0.5 or c150_l4 >= 0.5 or c150_4_pell >= 0.5 or \
    c150_l4_pell >= 0.5  or c200_4 >= 0.5 or c200_l4 >= 0.5:
        score=1
    else:
        score=0
    return score
    

In [None]:
data['completionrate'] = data.apply (lambda x:completion_rate  (x['C100_4'],x['C100_L4'],x['C150_4'],\
                                                                         x['C150_L4'],x['C150_4_PELL'],
                                                                         x['C150_L4_PELL'],
                                                                         x['C200_4'],x['C200_L4']),axis=1)

In [None]:
data.head(1)

## SELECTED FEATURES & TARGET VARIABLE

### After feature engineering, we can create the target variable. It is composed of the earnings, repayment and completion rate features. The target variable is binary (1 or 0), and its equation is
### y = x1.x2.x3 (the "." dot sign denotes multiplication between the variables, so y is 1 only when all three features are 1)
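
### A small illustration of the equation, assuming each feature is already coded as 0 or 1. The rows in `toy` are hypothetical; only the column names come from the notebook.

```python
import pandas as pd

# Hypothetical rows covering a few combinations of the three binary features
toy = pd.DataFrame({
    'earnings':       [1, 1, 0, 1],
    'repayment':      [1, 0, 1, 1],
    'completionrate': [1, 1, 1, 0],
})

# y = x1 * x2 * x3: the product is 1 only when all three features are 1
toy['Loan_Approved'] = toy['earnings'] * toy['repayment'] * toy['completionrate']

print(toy['Loan_Approved'].tolist())  # [1, 0, 0, 0]
```

### For binary inputs the product behaves exactly like a logical AND, so a single 0 in any feature rejects the loan.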

In [None]:
def predict (earn,repay,completion):
    if earn==1 and repay==1 and completion==1:
        match=1
    else:
        match=0
    return match
    
    

In [None]:
data['Loan_Approved'] = data.apply (lambda x:predict  (x['earnings'],x['repayment'],x['completionrate']),axis=1)

In [None]:
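# Take a copy so that later column edits do not raise SettingWithCopyWarning or touch `data`
Loan_data=data[['INSTNM','earnings','repayment','completionrate','Loan_Approved']].copy()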
Loan_data.head()

In [None]:
# Second copy that keeps the original institution names for the output file
loan=data[['INSTNM','earnings','repayment','completionrate','Loan_Approved']].copy()
loan.head()

In [None]:
Loan_data['INSTNM'] = Loan_data['INSTNM'].astype('category')
Loan_data['INSTNM'] = Loan_data['INSTNM'].cat.codes

In [None]:
# Scale the encoded 'INSTNM' values to the range [0, 1]
Loan_data['INSTNM'] = Loan_data['INSTNM']/Loan_data['INSTNM'].max()
Loan_data.head()
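
### A short sketch of what the encoding and scaling steps above do, using made-up institution names. Category codes follow the sorted order of the names, so the resulting numbers carry no meaning beyond identifying the institution.

```python
import pandas as pd

# Hypothetical institution names, one of them repeated
names = pd.Series(['Beta College', 'Alpha University',
                   'Gamma Institute', 'Alpha University'])

# Categories are sorted alphabetically before integer codes are assigned
codes = names.astype('category').cat.codes
# Dividing by the largest code maps the values onto [0, 1]
scaled = codes / codes.max()

print(codes.tolist())   # [1, 0, 2, 0]
print(scaled.tolist())  # [0.5, 0.0, 1.0, 0.0]
```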

# MODEL DESIGNING

### In model designing, we split the data into a training set (80%) and a test set (20%). We build the model with deep learning in Keras, based on artificial neural networks.

### Artificial Neural Networks (ANNs) were inspired by information processing and distributed communication nodes in biological systems. ANNs have various differences from biological brains. Specifically, neural networks tend to be static and symbolic, while the biological brain of most living organisms is dynamic (plastic) and analog.

### The original goal of the neural network approach was to solve problems in the same way that a human brain would. Over time, attention focused on matching specific mental abilities, leading to deviations from biology such as backpropagation, or passing information in the reverse direction and adjusting the network to reflect that information.

### A deep neural network (DNN) is an artificial neural network (ANN) with multiple layers between the input and output layers. The DNN finds the correct mathematical manipulation to turn the input into the output, whether it be a linear relationship or a non-linear relationship. The network moves through the layers calculating the probability of each output. 


In [None]:
# split into input and output variables
X = Loan_data[['INSTNM','earnings','repayment','completionrate']].values
Y = Loan_data['Loan_Approved'].values

In [None]:
import keras
from keras.callbacks import EarlyStopping
from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import Adam, SGD  # optimizer classes use capitalized names
from sklearn.model_selection import train_test_split

In [None]:
# seed for reproducing same results
seed = 20
np.random.seed(seed)

# split the data into training (80%) and testing (20%)
(X_train, X_test, Y_train, Y_test) = train_test_split(X, Y, test_size=0.20, random_state=seed)

### We will use a deep neural network architecture with 1 input layer, 2 hidden layers and a single output layer.
### The input data, which has 4 features, is sent to the first hidden layer, which has 6 neurons with randomly initialized weights. This is a useful starting point when we have no clue about how many neurons to specify at the first attempt. From here, we can easily proceed by trial and error, enlarging the network architecture until it produces good results. The second hidden layer also has 6 neurons, and the final output layer has 1 neuron that outputs whether the loan is approved or not.
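
### To make the architecture concrete, here is a rough NumPy sketch of one forward pass through a 4-6-6-1 network. It is an illustration only, not the Keras model itself: the weight range, zero biases, seed and input samples are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(20)  # seed chosen arbitrarily for this sketch

def relu(z):
    return np.maximum(z, 0.0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Layer sizes from the text: 4 inputs -> 6 -> 6 -> 1
sizes = [4, 6, 6, 1]
# Small uniform random weights and zero biases (assumed for illustration)
weights = [rng.uniform(-0.05, 0.05, size=(n_in, n_out))
           for n_in, n_out in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(n_out) for n_out in sizes[1:]]

def forward(x):
    """Propagate a batch of inputs through the network and return probabilities."""
    h1 = relu(x @ weights[0] + biases[0])
    h2 = relu(h1 @ weights[1] + biases[1])
    return sigmoid(h2 @ weights[2] + biases[2])

x = rng.random((3, 4))  # three hypothetical samples with 4 features each
probs = forward(x)
n_params = sum(w.size + b.size for w, b in zip(weights, biases))

print(probs.shape)  # (3, 1)
print(n_params)     # 79
```

### The parameter count follows from the layer sizes: (4x6 + 6) + (6x6 + 6) + (6x1 + 1) = 79 trainable values.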


In [None]:
# create the model
model = Sequential()
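# `kernel_initializer` is the Keras 2+ name of the old `init` argument
model.add(Dense(6, input_dim=4, kernel_initializer='random_uniform', activation='relu'))
#model.add(keras.layers.Dropout(0.5))
model.add(Dense(6, kernel_initializer='random_uniform', activation='relu'))
#model.add(keras.layers.Dropout(0.5))
model.add(Dense(1, kernel_initializer='random_uniform', activation='sigmoid'))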

# compile the model
model.compile(loss='binary_crossentropy', optimizer='sgd', metrics=['acc'])

# Early Stopping
es = EarlyStopping(monitor='val_loss', mode='min', verbose=1)

# fit the model
history=model.fit(X_train, Y_train, validation_data=(X_test, Y_test), epochs=50, batch_size=256,callbacks =[es])

# MODEL EVALUATION

### In model evaluation, four quantities are examined: the training loss, validation loss, training accuracy and validation accuracy. Comparing the two losses gives three conditions that describe how the model is behaving.

### 1). Underfitting

### loss > validation loss

### This is the only case where the training loss is higher than the validation loss, and only slightly. If the training loss is far higher than the validation loss, the code and data should be rechecked.

### 2). Overfitting

### loss << validation loss

### This means that the model fits the training data very well but not the validation data. In other words, it is not generalizing correctly to unseen data.

### 3). Perfect fitting

### loss == validation loss

### If both values end up roughly the same and are converging (plot the loss over time), then chances are very high that the model is trained correctly.
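
### The three conditions can be turned into a rough check on the final-epoch losses. This is a simplified sketch: the function name `diagnose_fit` and the `tolerance` cut-off for "roughly the same" are hypothetical and would need tuning per problem.

```python
def diagnose_fit(train_loss, val_loss, tolerance=0.05):
    """Label a pair of losses using the three conditions described above.

    `tolerance` is an assumed cut-off for "roughly the same".
    """
    gap = val_loss - train_loss
    if abs(gap) <= tolerance:
        return 'perfect fitting'
    if gap > 0:
        return 'overfitting'   # training loss well below validation loss
    return 'underfitting'      # training loss above validation loss

print(diagnose_fit(0.30, 0.31))  # perfect fitting
print(diagnose_fit(0.10, 0.60))  # overfitting
print(diagnose_fit(0.70, 0.50))  # underfitting
```

### A single pair of numbers is only a hint; plotting both curves over the epochs remains the more reliable way to judge convergence.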


In [None]:
import matplotlib.pyplot as plt
# list all data in history
print(history.history.keys())
# summarize history for accuracy
plt.plot(history.history['acc'])
plt.plot(history.history['val_acc'])
plt.title('model accuracy')
plt.ylabel('accuracy')
plt.xlabel('epoch')
plt.legend(['train', 'test'], loc='upper left')
plt.show()
# summarize history for loss
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('model loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train', 'test'], loc='upper left')
plt.show()

### As the graph above shows, the training and validation losses are almost the same, which corresponds to the third condition. We can therefore say that the model is working well, with an accuracy of 0.87 (87%).
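
### For reference, accuracy for a binary classifier is the fraction of predictions that match the labels once the sigmoid output is thresholded. The sketch below assumes a 0.5 threshold and uses made-up probabilities and labels, not the model's real outputs.

```python
import numpy as np

# Hypothetical sigmoid outputs and true labels for eight samples
probabilities = np.array([0.91, 0.12, 0.65, 0.40, 0.08, 0.77, 0.55, 0.30])
labels = np.array([1, 0, 1, 1, 0, 1, 0, 0])

# A probability of 0.5 or more counts as "approved"
predictions = (probabilities >= 0.5).astype(int)
# Accuracy is the share of predictions that equal the true label
accuracy = (predictions == labels).mean()

print(predictions.tolist())  # [1, 0, 1, 0, 0, 1, 1, 0]
print(accuracy)              # 0.75
```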

## Saving The File

In [None]:
output = pd.DataFrame({'Institution': loan.INSTNM,'Earnings':loan.earnings,'Repayment':loan.repayment, \
                       'Completion Rate':loan.completionrate,'Approved': Loan_data.Loan_Approved})
output.to_csv('submission.csv', index=False)