# <u><b>Final Project: Student Loan Dataset</b></u>
## Milestone 2
## By Paulina Mostek
## <u>Objective:</u> Explore the National Student Loan Data System dataset using Python, and perform data cleaning using pandas.
### Datasets used: https://catalog.data.gov/dataset/national-student-loan-data-system-722b0
* 1617FedSchoolCodeList.xlsx (Federal School Code List) - This may be needed to cross-reference school data
* FL_Dashboard_AY2009_2010_Q1.xls (Q1 of 2009-2010 School Year)
### We will be exploring the original Excel files and converted CSV files in order to determine which file type should be used for the final project.

# <u>Data Exploration: Why this data?</u>
#### In this project, we will analyze student loan data across several factors, such as school type, state, and type of loan. 
#### Clients: The U.S. Department of Education or higher education policy analysts may benefit from analysis of this data. 
#### The analysis will....
* identify loan distribution patterns by institution type (public vs private) and state.
* help target policy changes to address student debt concentration.
* inform funding allocation for community/technical colleges vs universities.
* assist in developing guidelines for financial aid reform or loan caps.
* highlight trends that will identify priorities for educational investment.
#### Additionally, if the findings are published, <u>students</u> may be a potential audience for this analysis. The analysis could inform students’ decisions about which states and universities to attend. Students may be more inclined to apply to universities with a higher loan disbursement rate.


# <u>Data Exploration: Questions</u>
### We will be asking the following questions of the data and plotting the results:
1. Do community colleges or technical colleges originate more loans overall?
2. Which states have the highest total loan disbursements?
3. Do private or public schools disburse more in federal student loans?
4. Do institutions with “college” in the name receive more loans than those with “university”?
5. Are unsubsidized loans more common at private colleges than public ones?
### For relevancy, we will only explore data concerning schools in the US. Foreign schools will be excluded from our analysis.
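
As an illustration of how question 4 might be approached once the data is cleaned, here is a minimal sketch. It assumes the header row has already been applied so that columns named "School" and "# of Loans Originated" exist; the rows are toy data and the helper `name_group` is hypothetical.

```python
import pandas as pd

# Toy rows for illustration only; real data would come from the cleaned sheet.
schools = pd.DataFrame({
    "School": ["OHIO UNIVERSITY", "CONNECTICUT COLLEGE", "CHARTER COLLEGE", "WYOTECH"],
    "# of Loans Originated": [209, 154, 192, 1099],
})

def name_group(name: str) -> str:
    """Label a school by the keyword found in its name (hypothetical helper)."""
    upper = name.upper()
    if "UNIVERSITY" in upper:
        return "university"  # names containing both words are counted here
    if "COLLEGE" in upper:
        return "college"
    return "other"

# Tag each school, then total the loans per group.
schools["Name Group"] = schools["School"].map(name_group)
loans_by_group = schools.groupby("Name Group")["# of Loans Originated"].sum()
print(loans_by_group)
```

How to treat names containing both words is a design choice; this sketch counts them as "university".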

# <u>Data Cleaning</u>
### After observing the data in Excel, potential issues have been identified.
### Possible quality check issues (for data cleaning):
* Additional school type "PROPRIETARY": these rows may need to be removed if they contain empty values, or set aside when comparing private vs public schools.
* Additional school types "FOREIGN PUBLIC" and "FOREIGN PRIVATE" appear throughout the data and need to be removed, as we are only investigating US schools.
* Missing values for school code and financial information: cells contain "-" or "0" instead of real figures.
* Other columns may need to be investigated for missing values.
* Several rows in the "Award Year Summary" table are missing data, particularly for parent and grad loans.
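
The checks above could be sketched roughly as follows. This is a toy example, not the final cleaning code: the school names and counts are made up, and the column names "School Type" and "Recipients" assume the header row has been applied.

```python
import pandas as pd

# Made-up rows that mimic the issues listed above.
raw = pd.DataFrame({
    "School": ["SCHOOL A", "SCHOOL B", "SCHOOL C", "SCHOOL D"],
    "School Type": ["PUBLIC", "FOREIGN PUBLIC", "PRIVATE", "FOREIGN PRIVATE"],
    "Recipients": ["236", "12", "-", "40"],
})

# Keep only US schools: drop any school type that starts with "FOREIGN".
is_foreign = raw["School Type"].str.startswith("FOREIGN", na=False)
us_only = raw.loc[~is_foreign].copy()

# Convert counts to numbers; anything non-numeric such as "-" becomes NaN.
us_only["Recipients"] = pd.to_numeric(us_only["Recipients"], errors="coerce")
print(us_only)
```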


# <u>Import libraries</u>
## The following libraries will be imported for our code:
* csv: Python's built-in module for reading and writing CSV files.
* pandas: For loading data into dataframes.
* numpy: For handling arrays.
* matplotlib and seaborn: For plotting data observations.
* statistics: For exploratory analysis and checking basic information of the table.
* xlrd: This library must be installed (pip install xlrd) so that pandas can read XLS files. pandas uses it as the reader engine, so the explicit import is optional.

In [1]:
import csv
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import statistics as stats
import seaborn as sns
import xlrd

# <u>Initial Exploratory Analysis</u>
### We will import the data into a pandas dataframe using the appropriate Python library for the dataset file type.
### We will explore basic properties of each dataset such as...
* describe()
* columns
* shape
* dtypes
* head(), tail(), sample()
* info()

### Load in each dataset into a pandas dataframe
* The original dataset is an Excel file containing multiple sheets.
* We will be loading the first two sheets into a dictionary.
* We will also be using a CSV version of the same file.
* We will compare the CSV and Excel files to decide which file to use for data exploration.

### Step 1: Load both original XLS files and CSV Files
* Use a dictionary to load in the first two sheets: "Quarterly Activity" and "Award Year Summary"
* Access each sheet via key (sheet name)
* Load in the CSV files, which are saved separately.

In [2]:
#XLS APPROACH
#Use a dictionary to load in multiple sheets
#Load two specific sheets by name
sheets = pd.read_excel("FL_Dashboard_AY2009_2010_Q1.xls", sheet_name=["Quarterly Activity", "Award Year Summary"])

#Access each sheet (excel)
qa = sheets["Quarterly Activity"]
ays = sheets["Award Year Summary"]

#CSV APPROACH
#Load the CSV versions
qa_csv = pd.read_csv("FL_Dashboard_AY2009_2010_Q1 - Quarterly Activity.csv")
ays_csv = pd.read_csv("FL_Dashboard_AY2009_2010_Q1 - Award Year Summary.csv")

### Step 2: Compare XLS and CSV Files 
* Check the shape of both: they are identical
* Compare column names of both: they are identical in column name, number, and object type
* Compare full dataframes: look for cell-level differences and missing values

#### Shape

In [4]:
#Quarterly Activity
print("Excel QA shape:", qa.shape)
print("CSV QA shape:", qa_csv.shape)

#Award Year Summary
print("Excel AYS shape:", ays.shape)
print("CSV AYS shape:", ays_csv.shape)

Excel QA shape: (3825, 25)
CSV QA shape: (3825, 25)
Excel AYS shape: (3825, 25)
CSV AYS shape: (3825, 25)


#### Columns

In [10]:
#Quarterly Activity
print("QA column names:")
print("Excel:",qa.columns)
print("CSV:",qa_csv.columns)

#Award Year Summary
print("AYS column names:")
print("Excel:",ays.columns)
print("CSV:",ays_csv.columns)


QA column names:
Excel: Index(['2009-2010 Award Year FFEL Volume by School', 'Unnamed: 1',
       'Unnamed: 2', 'Unnamed: 3', 'Unnamed: 4', 'Unnamed: 5', 'Unnamed: 6',
       'Unnamed: 7', 'Unnamed: 8', 'Unnamed: 9', 'Unnamed: 10', 'Unnamed: 11',
       'Unnamed: 12', 'Unnamed: 13', 'Unnamed: 14', 'Unnamed: 15',
       'Unnamed: 16', 'Unnamed: 17', 'Unnamed: 18', 'Unnamed: 19',
       'Unnamed: 20', 'Unnamed: 21', 'Unnamed: 22', 'Unnamed: 23',
       'Unnamed: 24'],
      dtype='object')
CSV: Index(['2009-2010 Award Year FFEL Volume by School', 'Unnamed: 1',
       'Unnamed: 2', 'Unnamed: 3', 'Unnamed: 4', 'Unnamed: 5', 'Unnamed: 6',
       'Unnamed: 7', 'Unnamed: 8', 'Unnamed: 9', 'Unnamed: 10', 'Unnamed: 11',
       'Unnamed: 12', 'Unnamed: 13', 'Unnamed: 14', 'Unnamed: 15',
       'Unnamed: 16', 'Unnamed: 17', 'Unnamed: 18', 'Unnamed: 19',
       'Unnamed: 20', 'Unnamed: 21', 'Unnamed: 22', 'Unnamed: 23',
       'Unnamed: 24'],
      dtype='object')
AYS column names:
Excel: Index(['

#### Compare full dataframes

In [11]:
#Quarterly Activity
comparison_qa = qa.compare(qa_csv, keep_shape=True, keep_equal=False)
print("Differences in QA:")
print(comparison_qa)


Differences in QA:
     2009-2010 Award Year FFEL Volume by School       Unnamed: 1        \
                                           self other       self other   
0                                           NaN   NaN        NaN   NaN   
1                                           NaN   NaN        NaN   NaN   
2                                           NaN   NaN        NaN   NaN   
3                                           NaN   NaN        NaN   NaN   
4                                           NaN   NaN        NaN   NaN   
...                                         ...   ...        ...   ...   
3820                                        NaN   NaN        NaN   NaN   
3821                                        NaN   NaN        NaN   NaN   
3822                                        NaN   NaN        NaN   NaN   
3823                                        NaN   NaN        NaN   NaN   
3824                                        NaN   NaN        NaN   NaN   

     Unnamed: 2   

In [12]:
#Award Year Summary
comparison_ays = ays.compare(ays_csv, keep_shape=True, keep_equal=False)
print("Differences in AYS:")
print(comparison_ays)

Differences in AYS:
     2009-2010 Award Year FFEL Volume by School       Unnamed: 1        \
                                           self other       self other   
0                                           NaN   NaN        NaN   NaN   
1                                           NaN   NaN        NaN   NaN   
2                                           NaN   NaN        NaN   NaN   
3                                           NaN   NaN        NaN   NaN   
4                                           NaN   NaN        NaN   NaN   
...                                         ...   ...        ...   ...   
3820                                        NaN   NaN        NaN   NaN   
3821                                        NaN   NaN        NaN   NaN   
3822                                        NaN   NaN        NaN   NaN   
3823                                        NaN   NaN        NaN   NaN   
3824                                        NaN   NaN        NaN   NaN   

     Unnamed: 2  

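#### Observation: NaN values in the comparison
* With `keep_shape=True`, `compare` shows NaN wherever the two dataframes agree, so NaN here does not by itself indicate missing data. Only the first columns are visible in the truncated output, so we will check each file (CSV and XLS) directly for missing values.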
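
To make the behavior of `DataFrame.compare` concrete, here is a small self-contained sketch on toy frames (the frames `left` and `right` are illustrative, not part of the dataset).

```python
import pandas as pd

# Two toy frames that differ in exactly one cell.
left = pd.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})
right = pd.DataFrame({"a": [1, 2, 3], "b": ["x", "Y", "z"]})

# keep_shape=True keeps every row and column; equal cells are shown as NaN.
diff_full = left.compare(right, keep_shape=True)

# The default keeps only the rows and columns that actually differ.
diff_only = left.compare(right)

# equals() gives a single yes/no answer for the whole frame.
identical = left.equals(right)

print(diff_full)
print(diff_only)
print("Identical:", identical)
```

Dropping `keep_shape=True` usually makes real differences easier to spot in a frame with thousands of rows.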

### Step 3: Identify missing values
* View columns with missing values in both dataframes
* View rows with missing values in both dataframes
* Count missing values in each dataframe (CSV and XLS) to determine which file type to use

In [13]:
#View columns
#Quarterly Activity
#For Excel version
print("Missing values in Excel (qa):")
print(qa.isna().sum())

#For CSV version
print("Missing values in CSV (qa_csv):")
print(qa_csv.isna().sum())

Missing values in Excel (qa):
2009-2010 Award Year FFEL Volume by School     1
Unnamed: 1                                     4
Unnamed: 2                                     4
Unnamed: 3                                    22
Unnamed: 4                                     4
Unnamed: 5                                     3
Unnamed: 6                                     4
Unnamed: 7                                     4
Unnamed: 8                                     4
Unnamed: 9                                     4
Unnamed: 10                                    3
Unnamed: 11                                    4
Unnamed: 12                                    4
Unnamed: 13                                    4
Unnamed: 14                                    4
Unnamed: 15                                    3
Unnamed: 16                                    4
Unnamed: 17                                    4
Unnamed: 18                                    4
Unnamed: 19                            

In [14]:
#Award Year Summary
#For Excel version
print("Missing values in Excel (ays):")
print(ays.isna().sum())

#For CSV version
print("Missing values in CSV (ays_csv):")
print(ays_csv.isna().sum())

Missing values in Excel (ays):
2009-2010 Award Year FFEL Volume by School     1
Unnamed: 1                                     4
Unnamed: 2                                     4
Unnamed: 3                                    22
Unnamed: 4                                     4
Unnamed: 5                                     3
Unnamed: 6                                     4
Unnamed: 7                                     4
Unnamed: 8                                     4
Unnamed: 9                                     4
Unnamed: 10                                    3
Unnamed: 11                                    4
Unnamed: 12                                    4
Unnamed: 13                                    4
Unnamed: 14                                    4
Unnamed: 15                                    3
Unnamed: 16                                    4
Unnamed: 17                                    4
Unnamed: 18                                    4
Unnamed: 19                           

#### Identify missing rows

In [15]:
#Identify missing rows

#Quarterly Activity
print("Quarterly Activity")
print("Rows with missing values in Excel:")
print(qa[qa.isna().any(axis=1)])

print("Rows with missing values in CSV:")
print(qa_csv[qa_csv.isna().any(axis=1)])


Quarterly Activity
Rows with missing values in Excel:
             2009-2010 Award Year FFEL Volume by School  \
0     Award Year Quarterly Activity  (07/01/2009-09/...   
1                                    Data Run: 4/5/2012   
2                                                         
3                                                   NaN   
649                                            00671300   
654                                            00684200   
696                                            00866600   
734                                            01018800   
764                                            01097700   
820                                            01280200   
821                                            01281100   
859                                            02207400   
868                                            02233300   
870                                            02244400   
871                                            02246000   
87

In [16]:
#Award Year Summary
print("Award Year Summary")
print("Rows with missing values in Excel:")
print(ays[ays.isna().any(axis=1)])

print("Rows with missing values in CSV:")
print(ays_csv[ays_csv.isna().any(axis=1)])


Award Year Summary
Rows with missing values in Excel:
             2009-2010 Award Year FFEL Volume by School  \
0     Award Year Cumulative Activity through Quarter...   
1                                    Data Run: 4/5/2012   
2                                                         
3                                                   NaN   
649                                            00671300   
654                                            00684200   
696                                            00866600   
734                                            01018800   
764                                            01097700   
820                                            01280200   
821                                            01281100   
859                                            02207400   
868                                            02233300   
870                                            02244400   
871                                            02246000   
87

#### Count missing values
* Both CSV and XLS files contain the same number of missing values
* Both the Quarterly Activity and Award Year Summary dataframes contain the same number of missing values
* `isna()` only counts true NaN cells, not placeholders such as "-", so we must inspect the data more carefully to determine which file type to use

In [18]:
print("Quarterly Activity")
print("Total missing in Excel:", qa.isna().sum().sum())
print("Total missing in CSV:", qa_csv.isna().sum().sum())

print("\nAward Year Summary")
print("Total missing in Excel:", ays.isna().sum().sum())
print("Total missing in CSV:", ays_csv.isna().sum().sum())

Quarterly Activity
Total missing in Excel: 111
Total missing in CSV: 111

Award Year Summary
Total missing in Excel: 111
Total missing in CSV: 111
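
Because the NaN counts match, one way to look more carefully is to count placeholder strings that `isna()` does not flag. The sketch below uses toy frames; the helper `count_placeholders` and the list `PLACEHOLDERS` are assumptions for illustration.

```python
import pandas as pd

# Dash-style placeholders assumed to stand in for missing values.
PLACEHOLDERS = ["-", "$ -", "$-"]

def count_placeholders(df: pd.DataFrame) -> int:
    """Count cells whose stripped text is one of the placeholder strings."""
    as_text = df.astype(str).apply(lambda col: col.str.strip())
    return int(as_text.isin(PLACEHOLDERS).sum().sum())

# Toy frames imitating how the same cells could look in each file type.
csv_like = pd.DataFrame({
    "Recipients": ["192", "-", " - "],
    "Amount": ["$ 516,838.00", "$ -", "$ -"],
})
excel_like = pd.DataFrame({
    "Recipients": [192, 0, 0],
    "Amount": [516838, 0, 0],
})

print("CSV-like placeholders:", count_placeholders(csv_like))
print("Excel-like placeholders:", count_placeholders(excel_like))
```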


# <u>Initial Exploratory Analysis: Quarterly Activity (Q1 Dataset)</u>
#### We will perform a basic structural comparison on the original Excel file (XLS) and converted CSV file.
* .shape: Check for identical shape of the data
* .columns: View the column names of the dataframes for discrepancies
  * Check for: Same name, no extra whitespace, same order
* .dtypes: Check for identical data types of the dataframes (objects)
* .info(): A summary
* .describe(): Check for same descriptive statistics
* .head(): See the first few rows of the data
* .tail(): See the last few rows of the data
* .sample(): See random rows from the data that may not appear in a typical viewing of the data, due to the large number of rows.
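
The structural checks listed above can also be gathered into one programmatic report. This is a minimal sketch on toy frames; the function `structural_report` is a hypothetical helper, not part of pandas.

```python
import pandas as pd

def structural_report(a: pd.DataFrame, b: pd.DataFrame) -> dict:
    """Compare two dataframes on shape, column names, dtypes, and values."""
    return {
        "same_shape": a.shape == b.shape,
        "same_columns": list(a.columns) == list(b.columns),
        "same_dtypes": a.dtypes.astype(str).tolist() == b.dtypes.astype(str).tolist(),
        "same_values": a.equals(b),
    }

# Toy frames: same structure, one cell formatted differently.
excel_like = pd.DataFrame({"x": ["1", "2"], "y": ["0", "5"]})
csv_like = pd.DataFrame({"x": ["1", "2"], "y": ["$ -", "5"]})

report = structural_report(excel_like, csv_like)
print(report)
```

A report like this shows at a glance that two files can match on structure while still differing in cell contents.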

### Shape

In [21]:
print("SHAPE:")
print("Excel: ", qa.shape, "\nCSV: ", qa_csv.shape)

SHAPE:
Excel:  (3825, 25) 
CSV:  (3825, 25)


### <u>Observation:</u> Identical shape

### Columns

In [24]:
print("\nCOLUMNS:")
print("Excel: ", qa.columns, "\nCSV: ", qa_csv.columns)


COLUMNS:
Excel:  Index(['2009-2010 Award Year FFEL Volume by School', 'Unnamed: 1',
       'Unnamed: 2', 'Unnamed: 3', 'Unnamed: 4', 'Unnamed: 5', 'Unnamed: 6',
       'Unnamed: 7', 'Unnamed: 8', 'Unnamed: 9', 'Unnamed: 10', 'Unnamed: 11',
       'Unnamed: 12', 'Unnamed: 13', 'Unnamed: 14', 'Unnamed: 15',
       'Unnamed: 16', 'Unnamed: 17', 'Unnamed: 18', 'Unnamed: 19',
       'Unnamed: 20', 'Unnamed: 21', 'Unnamed: 22', 'Unnamed: 23',
       'Unnamed: 24'],
      dtype='object') 
CSV:  Index(['2009-2010 Award Year FFEL Volume by School', 'Unnamed: 1',
       'Unnamed: 2', 'Unnamed: 3', 'Unnamed: 4', 'Unnamed: 5', 'Unnamed: 6',
       'Unnamed: 7', 'Unnamed: 8', 'Unnamed: 9', 'Unnamed: 10', 'Unnamed: 11',
       'Unnamed: 12', 'Unnamed: 13', 'Unnamed: 14', 'Unnamed: 15',
       'Unnamed: 16', 'Unnamed: 17', 'Unnamed: 18', 'Unnamed: 19',
       'Unnamed: 20', 'Unnamed: 21', 'Unnamed: 22', 'Unnamed: 23',
       'Unnamed: 24'],
      dtype='object')


### <u>Observation:</u> Columns are identical in name, whitespace, and number

### DTypes

In [25]:
print("\nDTYPES:")
print("Excel: ", qa.dtypes, "\nCSV: ", qa_csv.dtypes)


DTYPES:
Excel:  2009-2010 Award Year FFEL Volume by School    object
Unnamed: 1                                    object
Unnamed: 2                                    object
Unnamed: 3                                    object
Unnamed: 4                                    object
Unnamed: 5                                    object
Unnamed: 6                                    object
Unnamed: 7                                    object
Unnamed: 8                                    object
Unnamed: 9                                    object
Unnamed: 10                                   object
Unnamed: 11                                   object
Unnamed: 12                                   object
Unnamed: 13                                   object
Unnamed: 14                                   object
Unnamed: 15                                   object
Unnamed: 16                                   object
Unnamed: 17                                   object
Unnamed: 18                  

### <u>Observation:</u> Data types are identical across all 25 columns (the title column plus "Unnamed: 1" through "Unnamed: 24"), and every column has type object

### Info

#### Excel

In [22]:
qa.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3825 entries, 0 to 3824
Data columns (total 25 columns):
 #   Column                                      Non-Null Count  Dtype 
---  ------                                      --------------  ----- 
 0   2009-2010 Award Year FFEL Volume by School  3824 non-null   object
 1   Unnamed: 1                                  3821 non-null   object
 2   Unnamed: 2                                  3821 non-null   object
 3   Unnamed: 3                                  3803 non-null   object
 4   Unnamed: 4                                  3821 non-null   object
 5   Unnamed: 5                                  3822 non-null   object
 6   Unnamed: 6                                  3821 non-null   object
 7   Unnamed: 7                                  3821 non-null   object
 8   Unnamed: 8                                  3821 non-null   object
 9   Unnamed: 9                                  3821 non-null   object
 10  Unnamed: 10             

#### CSV

In [23]:
qa_csv.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3825 entries, 0 to 3824
Data columns (total 25 columns):
 #   Column                                      Non-Null Count  Dtype 
---  ------                                      --------------  ----- 
 0   2009-2010 Award Year FFEL Volume by School  3824 non-null   object
 1   Unnamed: 1                                  3821 non-null   object
 2   Unnamed: 2                                  3821 non-null   object
 3   Unnamed: 3                                  3803 non-null   object
 4   Unnamed: 4                                  3821 non-null   object
 5   Unnamed: 5                                  3822 non-null   object
 6   Unnamed: 6                                  3821 non-null   object
 7   Unnamed: 7                                  3821 non-null   object
 8   Unnamed: 8                                  3821 non-null   object
 9   Unnamed: 9                                  3821 non-null   object
 10  Unnamed: 10             

### <u>Observation:</u> Both file types are identical in dtype (object), memory usage, number of rows, number of columns, and non-null counts

### Describe

In [28]:
#Excel
qa.describe()

Unnamed: 0,2009-2010 Award Year FFEL Volume by School,Unnamed: 1,Unnamed: 2,Unnamed: 3,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,Unnamed: 9,...,Unnamed: 15,Unnamed: 16,Unnamed: 17,Unnamed: 18,Unnamed: 19,Unnamed: 20,Unnamed: 21,Unnamed: 22,Unnamed: 23,Unnamed: 24
count,3824,3821,3821,3803,3821,3822,3821,3821,3821,3821,...,3822,3821,3821,3821,3821,3822,3821,3821,3821,3821
unique,3824,3553,56,3794,8,1395,1399,3411,1417,3442,...,436,457,2252,463,2281,283,292,992,294,989
top,Award Year Quarterly Activity (07/01/2009-09/...,ITT TECHNICAL INSTITUTE,FC,841150000,PRIVATE,1,1,0,1,0,...,0,0,0,0,0,0,0,0,0,0
freq,1,29,381,2,1360,160,158,114,143,114,...,1352,1352,1352,1352,1352,2784,2784,2784,2784,2784


In [29]:
#CSV
qa_csv.describe()

Unnamed: 0,2009-2010 Award Year FFEL Volume by School,Unnamed: 1,Unnamed: 2,Unnamed: 3,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,Unnamed: 9,...,Unnamed: 15,Unnamed: 16,Unnamed: 17,Unnamed: 18,Unnamed: 19,Unnamed: 20,Unnamed: 21,Unnamed: 22,Unnamed: 23,Unnamed: 24
count,3824,3821,3821,3803,3821,3822,3821,3821,3821,3821,...,3822,3821,3821,3821,3821,3822,3821,3821,3821,3821
unique,3824,3553,56,3794,8,1395,1399,3411,1417,3442,...,436,457,2252,463,2281,283,292,992,294,989
top,Award Year Quarterly Activity (07/01/2009-09/...,ITT TECHNICAL INSTITUTE,FC,841150000,PRIVATE,1,1,$ -,1,$ -,...,-,-,$ -,0,$ -,-,-,$ -,0,$ -
freq,1,29,381,2,1360,160,158,114,143,114,...,1352,1352,1352,1352,1352,2784,2784,2784,2784,2784


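### <u>Discrepancy Discovered:</u> 
* Column names are placeholders in both dataframes: pandas read the title row as the header, so "School" appears as "Unnamed: 1".
* The `top` row lists the most frequent value in each column separately. ITT TECHNICAL INSTITUTE is the most frequent school name (29 rows), and the most frequent value in the dollar columns is 0 in the XLS file but "$ -" in the CSV file.
### Suggestion: Use the XLS sheet (qa), which was loaded from the dictionary of sheets.
1. Rename columns manually by matching each column position to the column name on the XLS sheet.
2. Drop or exclude rows where '0' stands in for missing data, so there are no empty values where there should be data.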

### View the ITT rows

In [32]:
#From Excel DataFrame
itt_excel = qa[qa["Unnamed: 1"] == "ITT TECHNICAL INSTITUTE"]

#From CSV DataFrame
itt_csv = qa_csv[qa_csv["Unnamed: 1"] == "ITT TECHNICAL INSTITUTE"]

### Print both sets of rows

In [34]:
print("Excel version of ITT row:")
print(itt_excel)

Excel version of ITT row:
     2009-2010 Award Year FFEL Volume by School               Unnamed: 1  \
111                                    02065200  ITT TECHNICAL INSTITUTE   
127                                    02361100  ITT TECHNICAL INSTITUTE   
366                                    02120900  ITT TECHNICAL INSTITUTE   
386                                    02291500  ITT TECHNICAL INSTITUTE   
387                                    02291600  ITT TECHNICAL INSTITUTE   
392                                    02321800  ITT TECHNICAL INSTITUTE   
393                                    02321900  ITT TECHNICAL INSTITUTE   
439                                    03070400  ITT TECHNICAL INSTITUTE   
443                                    03087400  ITT TECHNICAL INSTITUTE   
550                                    02321700  ITT TECHNICAL INSTITUTE   
1087                                   02286500  ITT TECHNICAL INSTITUTE   
1112                                   03087600  ITT TECHNICAL

In [35]:
print("\nCSV version of ITT row:")
print(itt_csv)


CSV version of ITT row:
     2009-2010 Award Year FFEL Volume by School               Unnamed: 1  \
111                                    02065200  ITT TECHNICAL INSTITUTE   
127                                    02361100  ITT TECHNICAL INSTITUTE   
366                                    02120900  ITT TECHNICAL INSTITUTE   
386                                    02291500  ITT TECHNICAL INSTITUTE   
387                                    02291600  ITT TECHNICAL INSTITUTE   
392                                    02321800  ITT TECHNICAL INSTITUTE   
393                                    02321900  ITT TECHNICAL INSTITUTE   
439                                    03070400  ITT TECHNICAL INSTITUTE   
443                                    03087400  ITT TECHNICAL INSTITUTE   
550                                    02321700  ITT TECHNICAL INSTITUTE   
1087                                   02286500  ITT TECHNICAL INSTITUTE   
1112                                   03087600  ITT TECHNICAL 

### <u>Observation:</u> The school name appears several times in the file, each time with a different OPE ID. 
* Suggestion: For each school name, add all values together and merge into a single row.
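
One way to carry out this suggestion is a `groupby` followed by `sum`. The sketch below is illustrative only: the counts are made up, and the column names assume the header row has been applied.

```python
import pandas as pd

# Made-up campus rows; the same school name appears more than once.
campuses = pd.DataFrame({
    "School": ["ITT TECHNICAL INSTITUTE", "ITT TECHNICAL INSTITUTE", "OHIO UNIVERSITY"],
    "State": ["AL", "AZ", "OH"],
    "Recipients": [100, 50, 209],
    "# of Loans Originated": [110, 55, 209],
})

# Sum only the numeric columns so text columns like State are not concatenated.
numeric_cols = ["Recipients", "# of Loans Originated"]
by_school = campuses.groupby("School", as_index=False)[numeric_cols].sum()
print(by_school)
```

Grouping by name alone merges campuses in different states, so state-level questions would need the per-campus rows or a grouping on both "School" and "State".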

### Heads
#### Excel

In [4]:
qa.head()

Unnamed: 0,2009-2010 Award Year FFEL Volume by School,Unnamed: 1,Unnamed: 2,Unnamed: 3,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,Unnamed: 9,...,Unnamed: 15,Unnamed: 16,Unnamed: 17,Unnamed: 18,Unnamed: 19,Unnamed: 20,Unnamed: 21,Unnamed: 22,Unnamed: 23,Unnamed: 24
0,Award Year Quarterly Activity (07/01/2009-09/...,,,,,,,,,,...,,,,,,,,,,
1,Data Run: 4/5/2012,,,,,,,,,,...,,,,,,,,,,
2,,,,,,,,,,,...,,,,,,,,,,
3,,,,,,FFEL SUBSIDIZED,,,,,...,FFEL PARENT PLUS,,,,,FFEL GRAD PLUS,,,,
4,OPE ID,School,State,Zip Code,School Type,Recipients,# of Loans Originated,$ of Loans Originated,# of Disbursements,$ of Disbursements,...,Recipients,# of Loans Originated,$ of Loans Originated,# of Disbursements,$ of Disbursements,Recipients,# of Loans Originated,$ of Loans Originated,# of Disbursements,$ of Disbursements


#### CSV

In [37]:
qa_csv.head()

Unnamed: 0,2009-2010 Award Year FFEL Volume by School,Unnamed: 1,Unnamed: 2,Unnamed: 3,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,Unnamed: 9,...,Unnamed: 15,Unnamed: 16,Unnamed: 17,Unnamed: 18,Unnamed: 19,Unnamed: 20,Unnamed: 21,Unnamed: 22,Unnamed: 23,Unnamed: 24
0,Award Year Quarterly Activity (07/01/2009-09/...,,,,,,,,,,...,,,,,,,,,,
1,Data Run: 4/5/2012,,,,,,,,,,...,,,,,,,,,,
2,,,,,,,,,,,...,,,,,,,,,,
3,,,,,,FFEL SUBSIDIZED,,,,,...,FFEL PARENT PLUS,,,,,FFEL GRAD PLUS,,,,
4,OPE ID,School,State,Zip Code,School Type,Recipients,# of Loans Originated,$ of Loans Originated,# of Disbursements,$ of Disbursements,...,Recipients,# of Loans Originated,$ of Loans Originated,# of Disbursements,$ of Disbursements,Recipients,# of Loans Originated,$ of Loans Originated,# of Disbursements,$ of Disbursements


### <u>Observation:</u> Both dataframes look identical, but the first rows contain missing values.
* The first few rows of each dataframe hold title and metadata text rather than school data, so most of their cells are NaN.
* The loan categories (such as SUBSIDIZED and UNSUBSIDIZED) sit in row 3, where every other cell is missing.
* The columns, which should have names like "Zip Code" and "School Type", are instead labeled "Unnamed: 1", "Unnamed: 2", and so on.
* The proper column names are in the row with index 4 (the fifth row, since the index starts at 0).
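
A possible way to promote the row at index 4 to column names, while keeping the loan category from row 3 as a prefix, is sketched below. The frame is a cut-down toy version of the sheet, and `HEADER_ROW` is an assumed constant.

```python
import pandas as pd

# Toy version of the sheet layout: title rows, a category row, then headers.
raw = pd.DataFrame([
    ["Award Year Quarterly Activity", None, None, None],
    ["Data Run: 4/5/2012", None, None, None],
    [None, None, None, None],
    [None, None, "FFEL SUBSIDIZED", None],
    ["OPE ID", "School", "Recipients", "# of Loans Originated"],
    ["02576900", "CHARTER COLLEGE", 192, 192],
    ["02270400", "SOUTHEASTERN BIBLE COLLEGE", 123, 124],
])

HEADER_ROW = 4  # index of the row that holds the real column names

# The category label appears once per group, so fill it rightward.
categories = raw.iloc[HEADER_ROW - 1].ffill()
labels = raw.iloc[HEADER_ROW]

# Prefix each label with its category where one exists.
new_columns = [
    label if pd.isna(cat) else f"{cat}: {label}"
    for cat, label in zip(categories, labels)
]

# Keep only the data rows and apply the new names.
clean = raw.iloc[HEADER_ROW + 1:].reset_index(drop=True)
clean.columns = new_columns
print(clean)
```

Prefixing with the category keeps repeated labels such as "Recipients" unique across loan types. The `header` argument of `pd.read_excel` could instead select the header row at load time, though the row offset would need to be verified against the file.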

### Sample random rows
* Sample 5 random rows from each dataset
* Sample rows by index for direct comparison

#### Sample 5 random rows

In [39]:
print("\n5 RANDOM SAMPLES:")
print("Excel: ")
qa.sample(5)


5 RANDOM SAMPLES:
Excel: 


Unnamed: 0,2009-2010 Award Year FFEL Volume by School,Unnamed: 1,Unnamed: 2,Unnamed: 3,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,Unnamed: 9,...,Unnamed: 15,Unnamed: 16,Unnamed: 17,Unnamed: 18,Unnamed: 19,Unnamed: 20,Unnamed: 21,Unnamed: 22,Unnamed: 23,Unnamed: 24
2178,296400,SOUTHEASTERN COMMUNITY COLLEGE,NC,284725422,PUBLIC,236,237,675452,237,439470,...,0,0,0,0,0,0,0,0,0,0
563,137900,CONNECTICUT COLLEGE,CT,63204196,PRIVATE,154,154,735182,154,367341,...,33,34,767830,34,383919,0,0,0,0,0
1862,227800,LANSING COMMUNITY COLLEGE,MI,489331293,PUBLIC,1,1,1522,1,1750,...,0,0,0,0,0,0,0,0,0,0
1735,748800,KAPLAN CAREER INSTITUTE,MA,21291644,PROPRIETARY,74,74,240350,75,104458,...,3,3,9410,3,4821,0,0,0,0,0
2659,310000,OHIO UNIVERSITY,OH,457012979,PUBLIC,209,209,1768415,209,591493,...,0,0,0,0,0,43,47,472208,48,164999


In [41]:
print("\n5 RANDOM SAMPLES:")
print("CSV: ")
qa_csv.sample(5)


5 RANDOM SAMPLES:
CSV: 


Unnamed: 0,2009-2010 Award Year FFEL Volume by School,Unnamed: 1,Unnamed: 2,Unnamed: 3,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,Unnamed: 9,...,Unnamed: 15,Unnamed: 16,Unnamed: 17,Unnamed: 18,Unnamed: 19,Unnamed: 20,Unnamed: 21,Unnamed: 22,Unnamed: 23,Unnamed: 24
416,2584200,NEWBRIDGE COLLEGE,CA,927058605,PROPRIETARY,115,115,"$ 419,821.00",122,"$ 218,028.00",...,5,5,"$ 21,645.00",5,"$ 12,113.00",-,-,$ -,0,$ -
2993,531000,PITTSBURGH INSTITUTE OF AERONAUTICS,PA,151222674,PRIVATE,63,63,"$ 233,820.00",63,"$ 91,557.00",...,9,10,"$ 126,786.00",10,"$ 43,431.00",-,-,$ -,0,$ -
1811,3296300,BALTIMORE SCHOOL OF MASSAGE,MD,210902261,PROPRIETARY,125,125,"$ 366,676.00",127,"$ 137,109.00",...,6,6,"$ 40,609.00",6,"$ 13,543.00",-,-,$ -,0,$ -
1681,212600,BERKLEE COLLEGE OF MUSIC,MA,22153693,PRIVATE,-,-,$ -,0,$ -,...,9,9,"$ 186,935.00",9,"$ 105,918.00",-,-,$ -,0,$ -
452,3131300,FIVE BRANCHES UNIVERSITY,CA,950624669,PRIVATE,138,138,"$ 1,120,675.00",139,"$ 625,007.00",...,-,-,$ -,0,$ -,58,59,"$ 1,217,611.00",68,"$ 580,365.00"


### Sample rows by index
#### Excel
* Define the row numbers as a list
* Use `.loc` to view the specified rows

In [43]:
indices = [10, 42, 77]
print("Excel sample:")
qa.loc[indices]

Excel sample:


Unnamed: 0,2009-2010 Award Year FFEL Volume by School,Unnamed: 1,Unnamed: 2,Unnamed: 3,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,Unnamed: 9,...,Unnamed: 15,Unnamed: 16,Unnamed: 17,Unnamed: 18,Unnamed: 19,Unnamed: 20,Unnamed: 21,Unnamed: 22,Unnamed: 23,Unnamed: 24
10,2576900,CHARTER COLLEGE,AK,995084103,PROPRIETARY,192,192,516838,193,255514,...,4,4,20866,5,12783,0,0,0,0,0
42,2270400,SOUTHEASTERN BIBLE COLLEGE,AL,352442083,PRIVATE,123,124,495343,127,259886,...,3,3,34500,3,17250,0,0,0,0,0
77,2052200,BLACK RIVER TECHNICAL COLLEGE,AR,724550468,PUBLIC,648,662,1965082,669,998305,...,0,0,0,0,0,0,0,0,0,0


#### CSV

In [44]:
print("CSV sample:")
qa_csv.loc[indices]

CSV sample:


Unnamed: 0,2009-2010 Award Year FFEL Volume by School,Unnamed: 1,Unnamed: 2,Unnamed: 3,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,Unnamed: 9,...,Unnamed: 15,Unnamed: 16,Unnamed: 17,Unnamed: 18,Unnamed: 19,Unnamed: 20,Unnamed: 21,Unnamed: 22,Unnamed: 23,Unnamed: 24
10,2576900,CHARTER COLLEGE,AK,995084103,PROPRIETARY,192,192,"$ 516,838.00",193,"$ 255,514.00",...,4,4,"$ 20,866.00",5,"$ 12,783.00",-,-,$ -,0,$ -
42,2270400,SOUTHEASTERN BIBLE COLLEGE,AL,352442083,PRIVATE,123,124,"$ 495,343.00",127,"$ 259,886.00",...,3,3,"$ 34,500.00",3,"$ 17,250.00",-,-,$ -,0,$ -
77,2052200,BLACK RIVER TECHNICAL COLLEGE,AR,724550468,PUBLIC,648,662,"$ 1,965,082.00",669,"$ 998,305.00",...,-,-,$ -,0,$ -,-,-,$ -,0,$ -


### <u>Observation:</u> The Excel file contains....
* Integers, not decimals
* No dollar signs
* 0 in place of missing values, instead of "-"
### The CSV file contains...
* Decimals, with dollar signs
* "-", "$-", and 0 in place of missing values
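
For comparison, this is roughly the extra parsing the CSV values would need before any calculation. The helper `parse_currency` is hypothetical, and treating a dash as missing (NaN) rather than zero is an assumption.

```python
import pandas as pd

def parse_currency(values: pd.Series) -> pd.Series:
    """Strip "$" and thousands separators, then convert to numbers."""
    cleaned = (
        values.astype(str)
        .str.replace("$", "", regex=False)
        .str.replace(",", "", regex=False)
        .str.strip()
    )
    # A lone "-" cannot be parsed, so it becomes NaN.
    return pd.to_numeric(cleaned, errors="coerce")

# Sample strings in the formats seen in the CSV output above.
amounts = pd.Series(["$ 516,838.00", "$ -", "-", "0", "$ 1,965,082.00"])
parsed = parse_currency(amounts)
print(parsed)
```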

### <u> Suggestion:</u> Use original XLS file.
#### Reasoning:
* Cleaner and more consistent structure
* No extra formatting artifacts like "$" or "-", which make values difficult to index or drop and impossible to calculate with until they are cleaned
* Easier to process
* Already loaded as a dictionary of sheets (the sheet names are the keys; qa and ays are the variables holding each sheet)
* Missing values appear to be consistently marked as 0 (easier to filter or exclude from averages)

### For the reasons above, we will be using the XLS sheets (qa and ays) exclusively from now on and disregarding the CSV files.
### Tail: View the tail end of the structure (the last few rows)

In [3]:
qa.tail()

Unnamed: 0,2009-2010 Award Year FFEL Volume by School,Unnamed: 1,Unnamed: 2,Unnamed: 3,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,Unnamed: 9,...,Unnamed: 15,Unnamed: 16,Unnamed: 17,Unnamed: 18,Unnamed: 19,Unnamed: 20,Unnamed: 21,Unnamed: 22,Unnamed: 23,Unnamed: 24
3820,393200,UNIVERSITY OF WYOMING,WY,820713663,PUBLIC,2699,2735,12341711,2738,6819888,...,195,207,1830536,209,950682,58,65,740877,66,416236
3821,393300,WESTERN WYOMING COMMUNITY COLLEGE,WY,829010428,PUBLIC,214,214,731883,215,359742,...,2,2,13300,2,6650,0,0,0,0,0
3822,728900,CENTRAL WYOMING COLLEGE,WY,825012215,PUBLIC,149,154,414959,154,212783,...,7,7,34978,7,17492,0,0,0,0,0
3823,915700,WYOTECH,WY,820729519,PROPRIETARY,1049,1099,2554580,1220,1498483,...,478,490,7567391,522,4033026,0,0,0,0,0
3824,925900,LARAMIE COUNTY COMMUNITY COLLEGE,WY,820073299,PUBLIC,281,291,885794,291,447794,...,8,8,41636,8,20818,0,0,0,0,0


### <u>Observation:</u> The XLS file will be used with caution.
* Columns must be renamed from placeholders such as "Unnamed: 1" to the matching names in the original XLS file
* Values are whole numbers with no decimals, but the columns load as object dtype, so they should be converted to numeric before operations such as division.
* Missing values are recorded as 0 and may be dropped or excluded from analysis
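The cautions above can be sketched as follows. This is a minimal example on toy rows taken from the sampled indices (Parent PLUS recipients and dollars originated); the column names `plus_recipients` and `plus_dollars` are assumptions, since the real columns are still unnamed.

```python
import pandas as pd

# Toy rows from the sampled indices; values start as strings to mimic object dtype.
toy = pd.DataFrame({
    "School": ["CHARTER COLLEGE", "SOUTHEASTERN BIBLE COLLEGE", "BLACK RIVER TECHNICAL COLLEGE"],
    "plus_recipients": ["4", "3", "0"],
    "plus_dollars": ["20866", "34500", "0"],
})

# Convert the object columns to numeric before doing any arithmetic.
num_cols = ["plus_recipients", "plus_dollars"]
toy[num_cols] = toy[num_cols].apply(pd.to_numeric, errors="coerce")

# Treat 0 recipients as "no activity" so the division yields NaN instead of an error or inf.
recipients = toy["plus_recipients"].where(toy["plus_recipients"] != 0)
toy["avg_plus_per_recipient"] = toy["plus_dollars"] / recipients
print(toy)
```

Masking the zeros before dividing keeps schools with no activity out of any later average, rather than counting them as 0.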

# <u>Initial Exploratory Analysis: Award Year Summary (Q1 Dataset)</u>
### We will perform a basic structural comparison on the XLS file.
* .shape: Determine the shape of the data
* .columns: View the column names
* .dtypes: Determine the datatypes of the data
* .info(): A summary
* .describe(): Check the descriptive statistics
* .head(): See the first few rows of the data
* .tail(): See the last few rows of the data
* .sample(): See random rows that may not appear in a typical viewing of the data, due to the large number of rows.
1. Sample 5 random rows
2. Sample a specific index of rows
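The two sampling approaches can be sketched as below. The dataframe `demo` is a stand-in, and `random_state=0` is an assumed choice that makes the random sample repeatable; the labels 10, 42, and 77 are the ones used for the sampled index in this notebook.

```python
import pandas as pd

# Stand-in dataframe; the real sheets have 3825 rows.
demo = pd.DataFrame({"value": range(100)})

# 1. Five random rows. Fixing random_state returns the same rows on every run.
random_rows = demo.sample(5, random_state=0)

# 2. Specific rows selected by index label.
indices = [10, 42, 77]
chosen_rows = demo.loc[indices]

print(random_rows)
print(chosen_rows)
```

Without `random_state`, each run of `.sample(5)` would return different rows, which makes written observations harder to reproduce.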

### Shape
* Observation: The shape is identical to that of the Quarterly Activity sheet

In [47]:
ays.shape

(3825, 25)

### Columns
* Observation: The column names are identical to those of the Quarterly Activity sheet. They are placeholders and do not match the column names in the original Excel file, such as "School" and "State".

In [50]:
ays.columns

Index(['2009-2010 Award Year FFEL Volume by School', 'Unnamed: 1',
       'Unnamed: 2', 'Unnamed: 3', 'Unnamed: 4', 'Unnamed: 5', 'Unnamed: 6',
       'Unnamed: 7', 'Unnamed: 8', 'Unnamed: 9', 'Unnamed: 10', 'Unnamed: 11',
       'Unnamed: 12', 'Unnamed: 13', 'Unnamed: 14', 'Unnamed: 15',
       'Unnamed: 16', 'Unnamed: 17', 'Unnamed: 18', 'Unnamed: 19',
       'Unnamed: 20', 'Unnamed: 21', 'Unnamed: 22', 'Unnamed: 23',
       'Unnamed: 24'],
      dtype='object')

### Data types
* Observation: Same number of columns (25) and same data type (object) as the QA sheet.

In [52]:
ays.dtypes

2009-2010 Award Year FFEL Volume by School    object
Unnamed: 1                                    object
Unnamed: 2                                    object
Unnamed: 3                                    object
Unnamed: 4                                    object
Unnamed: 5                                    object
Unnamed: 6                                    object
Unnamed: 7                                    object
Unnamed: 8                                    object
Unnamed: 9                                    object
Unnamed: 10                                   object
Unnamed: 11                                   object
Unnamed: 12                                   object
Unnamed: 13                                   object
Unnamed: 14                                   object
Unnamed: 15                                   object
Unnamed: 16                                   object
Unnamed: 17                                   object
Unnamed: 18                                   

### Info
### Observations: Info appears identical to the QA sheet in...
* Column names and count
* Dtype
* Number of non-null values
* Memory usage
* Number of entries

In [53]:
ays.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3825 entries, 0 to 3824
Data columns (total 25 columns):
 #   Column                                      Non-Null Count  Dtype 
---  ------                                      --------------  ----- 
 0   2009-2010 Award Year FFEL Volume by School  3824 non-null   object
 1   Unnamed: 1                                  3821 non-null   object
 2   Unnamed: 2                                  3821 non-null   object
 3   Unnamed: 3                                  3803 non-null   object
 4   Unnamed: 4                                  3821 non-null   object
 5   Unnamed: 5                                  3822 non-null   object
 6   Unnamed: 6                                  3821 non-null   object
 7   Unnamed: 7                                  3821 non-null   object
 8   Unnamed: 8                                  3821 non-null   object
 9   Unnamed: 9                                  3821 non-null   object
 10  Unnamed: 10             

### Descriptive Statistics

In [54]:
ays.describe()

Unnamed: 0,2009-2010 Award Year FFEL Volume by School,Unnamed: 1,Unnamed: 2,Unnamed: 3,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,Unnamed: 9,...,Unnamed: 15,Unnamed: 16,Unnamed: 17,Unnamed: 18,Unnamed: 19,Unnamed: 20,Unnamed: 21,Unnamed: 22,Unnamed: 23,Unnamed: 24
count,3824,3821,3821,3803,3821,3822,3821,3821,3821,3821,...,3822,3821,3821,3821,3821,3822,3821,3821,3821,3821
unique,3824,3553,56,3794,8,1395,1399,3411,1417,3442,...,436,457,2252,463,2281,283,292,992,294,989
top,Award Year Cumulative Activity through Quarter...,ITT TECHNICAL INSTITUTE,FC,841150000,PRIVATE,1,1,0,1,0,...,0,0,0,0,0,0,0,0,0,0
freq,1,29,381,2,1360,160,158,114,143,114,...,1352,1352,1352,1352,1352,2784,2784,2784,2784,2784


### Observations:
* ITT TECHNICAL INSTITUTE is the most frequent school name, appearing 29 times.
* FC is the most frequent value in the State column (381 rows). It likely means "Foreign Country", most plausibly that the school is located outside the US, since the State column describes the school.
* Where data is missing for FC rows, the likely reason is that it is harder to collect data for foreign schools.

### Head

In [55]:
ays.head()

Unnamed: 0,2009-2010 Award Year FFEL Volume by School,Unnamed: 1,Unnamed: 2,Unnamed: 3,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,Unnamed: 9,...,Unnamed: 15,Unnamed: 16,Unnamed: 17,Unnamed: 18,Unnamed: 19,Unnamed: 20,Unnamed: 21,Unnamed: 22,Unnamed: 23,Unnamed: 24
0,Award Year Cumulative Activity through Quarter...,,,,,,,,,,...,,,,,,,,,,
1,Data Run: 4/5/2012,,,,,,,,,,...,,,,,,,,,,
2,,,,,,,,,,,...,,,,,,,,,,
3,,,,,,FFEL SUBSIDIZED,,,,,...,FFEL PARENT PLUS,,,,,FFEL GRAD PLUS,,,,
4,OPE ID,School,State,Zip Code,School Type,Recipients,# of Loans Originated,$ of Loans Originated,# of Disbursements,$ of Disbursements,...,Recipients,# of Loans Originated,$ of Loans Originated,# of Disbursements,$ of Disbursements,Recipients,# of Loans Originated,$ of Loans Originated,# of Disbursements,$ of Disbursements


### Tail

In [56]:
ays.tail()

Unnamed: 0,2009-2010 Award Year FFEL Volume by School,Unnamed: 1,Unnamed: 2,Unnamed: 3,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,Unnamed: 9,...,Unnamed: 15,Unnamed: 16,Unnamed: 17,Unnamed: 18,Unnamed: 19,Unnamed: 20,Unnamed: 21,Unnamed: 22,Unnamed: 23,Unnamed: 24
3820,393200,UNIVERSITY OF WYOMING,WY,820713663,PUBLIC,2699,2735,12341711,2738,6819888,...,195,207,1830536,209,950682,58,65,740877,66,416236
3821,393300,WESTERN WYOMING COMMUNITY COLLEGE,WY,829010428,PUBLIC,214,214,731883,215,359742,...,2,2,13300,2,6650,0,0,0,0,0
3822,728900,CENTRAL WYOMING COLLEGE,WY,825012215,PUBLIC,149,154,414959,154,212783,...,7,7,34978,7,17492,0,0,0,0,0
3823,915700,WYOTECH,WY,820729519,PROPRIETARY,1049,1099,2554580,1220,1498483,...,478,490,7567391,522,4033026,0,0,0,0,0
3824,925900,LARAMIE COUNTY COMMUNITY COLLEGE,WY,820073299,PUBLIC,281,291,885794,291,447794,...,8,8,41636,8,20818,0,0,0,0,0


### 5 Random Samples

### Sampled index
* We will use the same indices we defined when viewing a sampled index of the Quarterly Activity sheet

In [49]:
ays.loc[indices]

Unnamed: 0,2009-2010 Award Year FFEL Volume by School,Unnamed: 1,Unnamed: 2,Unnamed: 3,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,Unnamed: 9,...,Unnamed: 15,Unnamed: 16,Unnamed: 17,Unnamed: 18,Unnamed: 19,Unnamed: 20,Unnamed: 21,Unnamed: 22,Unnamed: 23,Unnamed: 24
10,2576900,CHARTER COLLEGE,AK,995084103,PROPRIETARY,192,192,516838,193,255514,...,4,4,20866,5,12783,0,0,0,0,0
42,2270400,SOUTHEASTERN BIBLE COLLEGE,AL,352442083,PRIVATE,123,124,495343,127,259886,...,3,3,34500,3,17250,0,0,0,0,0
77,2052200,BLACK RIVER TECHNICAL COLLEGE,AR,724550468,PUBLIC,648,662,1965082,669,998305,...,0,0,0,0,0,0,0,0,0,0


### View the ITT Rows

In [57]:
ays_itt = ays[ays["Unnamed: 1"] == "ITT TECHNICAL INSTITUTE"]

In [58]:
ays_itt

Unnamed: 0,2009-2010 Award Year FFEL Volume by School,Unnamed: 1,Unnamed: 2,Unnamed: 3,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,Unnamed: 9,...,Unnamed: 15,Unnamed: 16,Unnamed: 17,Unnamed: 18,Unnamed: 19,Unnamed: 20,Unnamed: 21,Unnamed: 22,Unnamed: 23,Unnamed: 24
111,2065200,ITT TECHNICAL INSTITUTE,AZ,852829272,PROPRIETARY,425,432,1741803,432,620393,...,43,43,440430,43,153009,0,0,0,0,0
127,2361100,ITT TECHNICAL INSTITUTE,AZ,857045829,PROPRIETARY,338,343,1332796,346,480667,...,48,48,459525,48,154957,0,0,0,0,0
366,2120900,ITT TECHNICAL INSTITUTE,CA,956706047,PROPRIETARY,555,556,2025625,559,713346,...,58,59,595125,59,211452,0,0,0,0,0
386,2291500,ITT TECHNICAL INSTITUTE,CA,917732933,PROPRIETARY,259,271,1023487,272,379908,...,36,36,332734,36,119384,0,0,0,0,0
387,2291600,ITT TECHNICAL INSTITUTE,CA,921232662,PROPRIETARY,556,558,2017752,560,736649,...,56,60,529257,60,184102,0,0,0,0,0
392,2321800,ITT TECHNICAL INSTITUTE,CA,913423664,PROPRIETARY,511,514,1932110,516,683620,...,72,73,652541,73,223920,0,0,0,0,0
393,2321900,ITT TECHNICAL INSTITUTE,CA,928015454,PROPRIETARY,385,386,1417241,388,504777,...,33,33,301020,33,105490,0,0,0,0,0
439,3070400,ITT TECHNICAL INSTITUTE,CA,924083519,PROPRIETARY,563,565,2082138,565,729700,...,53,54,475077,54,165234,0,0,0,0,0
443,3087400,ITT TECHNICAL INSTITUTE,CA,905021356,PROPRIETARY,331,334,1351752,335,492466,...,46,46,426908,46,145586,0,0,0,0,0
550,2321700,ITT TECHNICAL INSTITUTE,CO,802295338,PROPRIETARY,391,391,1452092,395,519607,...,39,39,378738,39,127695,0,0,0,0,0


### <u>Observation:</u> ITT appears multiple times in the AYS data, under several different states.
* Suggestion: Sum the numeric columns (recipients, loan disbursement amounts, etc.) per school and merge the rows.
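One way to sketch this suggestion is a `groupby` followed by `sum`. The rows below reuse three ITT rows and the WYOTECH row from the outputs above (FFEL Subsidized recipients and dollars originated); the column names are assumptions, since the real columns have not been renamed yet.

```python
import pandas as pd

# Three ITT rows and one other school from the AYS output; column names are assumed.
rows = pd.DataFrame({
    "School": ["ITT TECHNICAL INSTITUTE"] * 3 + ["WYOTECH"],
    "State": ["AZ", "AZ", "CA", "WY"],
    "sub_recipients": [425, 338, 555, 1049],
    "sub_dollars_originated": [1741803, 1332796, 2025625, 2554580],
})

value_cols = ["sub_recipients", "sub_dollars_originated"]

# One row per school: all campuses merged together.
by_school = rows.groupby("School", as_index=False)[value_cols].sum()

# One row per school and state: keeps the State column for state-level questions.
by_school_state = rows.groupby(["School", "State"], as_index=False)[value_cols].sum()

print(by_school)
print(by_school_state)
```

Merging by school name alone discards the state, so the second grouping may be the safer choice for the question about state-level disbursements.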

# <u>Initial Exploratory Analysis: Federal School List</u>
## Step 1: Load and view raw dataframe

In [3]:
fsl = pd.read_excel("1617fedschoolcodelist.xls")


In [4]:
fsl

Unnamed: 0,SchoolCode,SchoolName,Address,City,StateCode,ZipCode,Province,Country,PostalCode
0,B04724,WIDENER UNIV SCHOOL OF LAW - DE,4601 CONCORD PIKE/PO BOX 7474,WILMINGTON,DE,19803,,,
1,B06171,CENTER FOR ADVANCED STUDIES OF PUER,BOX S-4467,SAN JUAN,PR,902,,,
2,B06511,PENTECOSTAL THEOLOGICAL SEMINARY,PO BOX 3330,CLEVELAND,TN,37320,,,
3,B07022,THE CHICAGO SCHOOL OF PROF PSYCHOLOGY,325 NORTH WELLS STREET,CHICAGO,IL,60610,,,
4,B07624,NATIONAL COLLEGE OF NATURAL MEDICINE,049 SW PORTER,PORTLAND,OR,97201,,,
...,...,...,...,...,...,...,...,...,...
6975,042517,HOPE COLLEGE OF ARTS AND SCIENCES,1200 SOUTH WEST 3RD STREET,POMPANO BEACH,FL,33069,,,
6976,E40419,BEAUTY INSTITUTE SCHWARZKOPF PROFESSIONA,1411 RAILROAD AVENUE,BELLINGHAM,WA,98225,,,
6977,042205,BUTTE COUNTY REGIONAL OCCUPATIONAL PROGR,2491 CARMICHAEL DRIVE,CHICO,CA,95928,,,
6978,G42404,UNIVERSIDAD ANA G. MENDEZ - CAMPUS VIRTU,1552 AVENUE PONCE DE LEON,SAN JUAN,PR,926,,,


### Observations: This file....
* Contains school codes that mix letters and numbers. These school codes do not appear in any other data sheet.
* Contains schools from Puerto Rico, which is listed with its own state code (PR)
* Contains schools from other countries, which have values in the "Province", "Country", and "PostalCode" columns.
### Suggestions:
* Delete or disregard any rows with school codes that are not in the QA or AYS sheets (school codes containing letters, like B04...)
* Delete rows for schools in PR, as we are only looking at US states, and PR is a territory.
* Delete any row with a non-missing value for "Country", since these are foreign schools. We will only investigate schools in US states.
* The "Province", "Country", and "PostalCode" columns can then be deleted entirely, since only foreign schools have values in them.
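The suggestions above could be applied roughly as follows. This is a sketch on a toy frame: the first three rows come from the output above, while the fourth (foreign) row, including its code, state code, and address fields, is invented purely for illustration.

```python
import pandas as pd

toy_fsl = pd.DataFrame({
    "SchoolCode": ["B04724", "B06171", "042517", "G99999"],
    "SchoolName": [
        "WIDENER UNIV SCHOOL OF LAW - DE",
        "CENTER FOR ADVANCED STUDIES OF PUER",
        "HOPE COLLEGE OF ARTS AND SCIENCES",
        "EXAMPLE FOREIGN SCHOOL",  # hypothetical row, not from the real file
    ],
    "StateCode": ["DE", "PR", "FL", "FC"],
    "Province": [None, None, None, "ON"],
    "Country": [None, None, None, "CANADA"],
    "PostalCode": [None, None, None, "A1A 1A1"],
})

is_foreign = toy_fsl["Country"].notna()                       # any Country value marks a foreign school
is_pr = toy_fsl["StateCode"] == "PR"                          # Puerto Rico is a territory
has_letter = toy_fsl["SchoolCode"].str.contains(r"[A-Za-z]")  # codes like B04724

us_states = (
    toy_fsl[~is_foreign & ~is_pr & ~has_letter]
    .drop(columns=["Province", "Country", "PostalCode"])
    .reset_index(drop=True)
)
print(us_states)
```

Building each condition as its own boolean mask makes it easy to count how many rows each rule removes before committing to the filter.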

### Columns

### Observation: Column names match the original Excel sheet, but with whitespace removed and title case applied.

In [5]:
fsl.columns

Index(['SchoolCode', 'SchoolName', 'Address', 'City', 'StateCode', 'ZipCode',
       'Province', 'Country', 'PostalCode'],
      dtype='object')

### Inspect ITT Rows

In [6]:
fsl_itt = fsl[fsl["SchoolName"] == "ITT TECHNICAL INSTITUTE"]


In [7]:
fsl_itt

Unnamed: 0,SchoolCode,SchoolName,Address,City,StateCode,ZipCode,Province,Country,PostalCode
148,E00786,ITT TECHNICAL INSTITUTE,4520 SOUTH UNIVERSITY,LITTLE ROCK,AR,72204,,,
171,E00882,ITT TECHNICAL INSTITUTE,168 GIBSON ROAD,HENDERSON,NV,89014,,,
187,E00922,ITT TECHNICAL INSTITUTE,13 AIRLINE DRIVE,ALBANY,NY,12205,,,
224,E01017,ITT TECHNICAL INSTITUTE,760 MOORE ROAD,KING OF PRUSSIA,PA,19406,,,
508,E01966,ITT TECHNICAL INSTITUTE,17390 DUGDALE DRIVE,SOUTH BEND,IN,46635,,,
509,E01967,ITT TECHNICAL INSTITUTE,8488 GEORGIA STREET,MERRILLVILLE,IN,46410,,,
3245,004553,ITT TECHNICAL INSTITUTE,12302 WEST EXPLORER DRIVE,BOISE,ID,83713,,,
3616,007327,ITT TECHNICAL INSTITUTE,10999 STAHL ROAD,NEWBURGH,IN,47630,,,
3617,007329,ITT TECHNICAL INSTITUTE,9511 ANGOLA COURT,INDIANAPOLIS,IN,46268,,,
3662,007557,ITT TECHNICAL INSTITUTE,3640 CORPORATE TRAIL DRIVE,EARTH CITY,MO,63045,,,


### Observation: ITT is also listed several times in this file.
* Each entry has a unique school code, address, and zip code. This suggests that a school does not have a single address or school code; instead, each of its campuses has its own identifying information.
* School Code may therefore identify an individual campus rather than an individual school.
* Example: 5610 and 5611 do not refer to different schools, but to different campuses of the same school, ITT.

### Basic structural analysis:
* .describe(): Descriptive statistics
* .columns: Column names
* .shape: Shape of data
* .dtypes: Data types (such as object, int)
* .head(), .tail(): First and last few rows of data
* .sample(): Random sample of specified size
* .info(): Summary of data

In [16]:
fsl.describe()

Unnamed: 0,ZipCode
count,6980.0
mean,45657.029943
std,30989.290136
min,0.0
25%,18930.25
50%,44056.0
75%,73415.75
max,99801.0


In [17]:
fsl.columns

Index(['SchoolCode', 'SchoolName', 'Address', 'City', 'StateCode', 'ZipCode',
       'Province', 'Country', 'PostalCode'],
      dtype='object')

In [18]:
fsl.shape

(6980, 9)

In [19]:
fsl.dtypes

SchoolCode    object
SchoolName    object
Address       object
City          object
StateCode     object
ZipCode        int64
Province      object
Country       object
PostalCode    object
dtype: object

#### Note: ZipCode is stored as an int, so leading zeros are dropped (for example, 2138 for Cambridge, MA and 902 for San Juan, PR)
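If the ZIP codes are needed later, a minimal sketch for restoring them is to convert the column to strings and pad to five digits. The three rows below are taken from the `head()` output; padding to exactly five digits is an assumption that holds only for 5-digit US ZIP codes.

```python
import pandas as pd

# Rows from the head() output, with ZipCode as it loads (int64).
zips = pd.DataFrame({
    "City": ["WILMINGTON", "SAN JUAN", "CAMBRIDGE"],
    "ZipCode": [19803, 902, 2138],
})

# Convert to text and left-pad with zeros to five characters.
zips["ZipCode"] = zips["ZipCode"].astype(str).str.zfill(5)
print(zips)
```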

In [21]:
fsl.head(10)

Unnamed: 0,SchoolCode,SchoolName,Address,City,StateCode,ZipCode,Province,Country,PostalCode
0,B04724,WIDENER UNIV SCHOOL OF LAW - DE,4601 CONCORD PIKE/PO BOX 7474,WILMINGTON,DE,19803,,,
1,B06171,CENTER FOR ADVANCED STUDIES OF PUER,BOX S-4467,SAN JUAN,PR,902,,,
2,B06511,PENTECOSTAL THEOLOGICAL SEMINARY,PO BOX 3330,CLEVELAND,TN,37320,,,
3,B07022,THE CHICAGO SCHOOL OF PROF PSYCHOLOGY,325 NORTH WELLS STREET,CHICAGO,IL,60610,,,
4,B07624,NATIONAL COLLEGE OF NATURAL MEDICINE,049 SW PORTER,PORTLAND,OR,97201,,,
5,B07625,OREGON COL OF ORIENTAL MEDICINE,10525 SE CHERRY BLOSSOM DR,PORTLAND,OR,97216,,,
6,B08041,ALFRED ADLER GRADUATE SCHOOL,1001 WEST HIGHWAY 7 SUITE 344,HOPKINS,MN,55305,,,
7,B08083,UNIV OF THE DIST OF COLU -SCHOOL OF LAW,4200 CONNECTICUT AVENUE NW,WASHINGTON,DC,20008,,,
8,B42154,GRACE SCHOOL OF THEOLOGY,3705 COLLEGE PARK DR,CONROE,TX,77384,,,
9,E00014,AMRCN REPERTORY THTR INST ADV THTR,64 BRATTLE STREET,CAMBRIDGE,MA,2138,,,


In [31]:
fsl.tail(10)

Unnamed: 0,SchoolCode,SchoolName,Address,City,StateCode,ZipCode,Province,Country,PostalCode
6970,042515,BETH MEDRASH OF ASBURY PARK,1500 VERMONT AVENUE,LAKEWOOD,NJ,8701,,,
6971,042540,ULTIMATE TOUCH BARBER COLLEGE,3732 SAUK TRAIL ROAD,RICHTON PARK,IL,60471,,,
6972,042519,FERRARA'S BEAUTY SCHOOL,108-22 QUEENS BOULEVARD,FOREST HILLS,NY,11375,,,
6973,042335,MED-LIFE INSTITUTE,4103 EAST TAMIAMI TRAIL,NAPLES,FL,34112,,,
6974,042503,HAIR ACADEMY SCHOOL OF BARBERING & BEAUT,1013 SOUTH COLLEGE AVENUE,NEWARK,DE,19713,,,
6975,042517,HOPE COLLEGE OF ARTS AND SCIENCES,1200 SOUTH WEST 3RD STREET,POMPANO BEACH,FL,33069,,,
6976,E40419,BEAUTY INSTITUTE SCHWARZKOPF PROFESSIONA,1411 RAILROAD AVENUE,BELLINGHAM,WA,98225,,,
6977,042205,BUTTE COUNTY REGIONAL OCCUPATIONAL PROGR,2491 CARMICHAEL DRIVE,CHICO,CA,95928,,,
6978,G42404,UNIVERSIDAD ANA G. MENDEZ - CAMPUS VIRTU,1552 AVENUE PONCE DE LEON,SAN JUAN,PR,926,,,
6979,042467,LEARNING BRIDGE CAREER INSTITUTE,1340 WEST TUNNELL BOULEVARD,HOUMA,LA,70360,,,


In [32]:
fsl.sample(5)

Unnamed: 0,SchoolCode,SchoolName,Address,City,StateCode,ZipCode,Province,Country,PostalCode
5375,25875,UNIVERSIDAD METROPOLITANA,PO BOX 21150,RÍO PIEDRAS,PR,928,,,
3498,6574,"ST LUKE'S HOSPITAL OF BETHLEHEM, PA",801 OSTRUM ST,BETHLEHEM,PA,18015,,,
2828,3447,SPARTANBURG METHODIST COLLEGE,1000 POWELL MILL ROAD,SPARTANBURG,SC,29301,,,
4913,15873,COMMONWEALTH OF PR DEPT OF EDUC - ITPR R,POST OFFICE BOX 7284,PONCE,PR,732,,,
4351,13003,NEILSON BEAUTY COLLEGE,416 W. JEFFERSON,DALLAS,TX,75208,,,


In [35]:
fsl.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6980 entries, 0 to 6979
Data columns (total 9 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   SchoolCode  6980 non-null   object
 1   SchoolName  6980 non-null   object
 2   Address     6980 non-null   object
 3   City        6980 non-null   object
 4   StateCode   6980 non-null   object
 5   ZipCode     6980 non-null   int64 
 6   Province    111 non-null    object
 7   Country     401 non-null    object
 8   PostalCode  227 non-null    object
dtypes: int64(1), object(8)
memory usage: 490.9+ KB


# Manually import sheets
* Import each sheet from the original Excel file as its own pandas dataframe
* Print the first 6 rows of each dataframe
* Reason: Each sheet was originally loaded into a dictionary of dataframes. Loading each sheet as its own dataframe will make data cleaning easier.

In [28]:
qa_raw = pd.read_excel("FL_Dashboard_AY2009_2010_Q1.xls", sheet_name="Quarterly Activity")
ays_raw = pd.read_excel("FL_Dashboard_AY2009_2010_Q1.xls", sheet_name="Award Year Summary")

### Quarterly Activity

In [29]:
qa_raw.head(6)

Unnamed: 0,2009-2010 Award Year FFEL Volume by School,Unnamed: 1,Unnamed: 2,Unnamed: 3,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,Unnamed: 9,...,Unnamed: 15,Unnamed: 16,Unnamed: 17,Unnamed: 18,Unnamed: 19,Unnamed: 20,Unnamed: 21,Unnamed: 22,Unnamed: 23,Unnamed: 24
0,Award Year Quarterly Activity (07/01/2009-09/...,,,,,,,,,,...,,,,,,,,,,
1,Data Run: 4/5/2012,,,,,,,,,,...,,,,,,,,,,
2,,,,,,,,,,,...,,,,,,,,,,
3,,,,,,FFEL SUBSIDIZED,,,,,...,FFEL PARENT PLUS,,,,,FFEL GRAD PLUS,,,,
4,OPE ID,School,State,Zip Code,School Type,Recipients,# of Loans Originated,$ of Loans Originated,# of Disbursements,$ of Disbursements,...,Recipients,# of Loans Originated,$ of Loans Originated,# of Disbursements,$ of Disbursements,Recipients,# of Loans Originated,$ of Loans Originated,# of Disbursements,$ of Disbursements
5,00106100,ALASKA PACIFIC UNIVERSITY,AK,995084672,PRIVATE,291,291,1546994,292,830513,...,31,33,386770,35,192181,5,5,69730,5,34865


### Award Year Summary

In [30]:
ays_raw.head(6)

Unnamed: 0,2009-2010 Award Year FFEL Volume by School,Unnamed: 1,Unnamed: 2,Unnamed: 3,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,Unnamed: 9,...,Unnamed: 15,Unnamed: 16,Unnamed: 17,Unnamed: 18,Unnamed: 19,Unnamed: 20,Unnamed: 21,Unnamed: 22,Unnamed: 23,Unnamed: 24
0,Award Year Cumulative Activity through Quarter...,,,,,,,,,,...,,,,,,,,,,
1,Data Run: 4/5/2012,,,,,,,,,,...,,,,,,,,,,
2,,,,,,,,,,,...,,,,,,,,,,
3,,,,,,FFEL SUBSIDIZED,,,,,...,FFEL PARENT PLUS,,,,,FFEL GRAD PLUS,,,,
4,OPE ID,School,State,Zip Code,School Type,Recipients,# of Loans Originated,$ of Loans Originated,# of Disbursements,$ of Disbursements,...,Recipients,# of Loans Originated,$ of Loans Originated,# of Disbursements,$ of Disbursements,Recipients,# of Loans Originated,$ of Loans Originated,# of Disbursements,$ of Disbursements
5,00106100,ALASKA PACIFIC UNIVERSITY,AK,995084672,PRIVATE,291,291,1546994,292,830513,...,31,33,386770,35,192181,5,5,69730,5,34865


# <u>Overall Observations:</u>
### Our original questions...
1. Do community colleges or technical colleges originate more loans overall?
2. Which states have the highest total loan disbursements?
3. Do private or public schools disburse more in federal student loans?
4. Do institutions with “college” in the name receive more loans than those with “university”?
5. Are unsubsidized loans more common at private colleges than public ones?


# Quarterly Activity and Award Year Summary
* Missing values, as well as 0s, will need to be handled during data cleaning
* The first few rows, which hold titles and blanks rather than data, will need to be dropped
* Columns will need to be renamed to the original column names (School, State, etc.)
* The loan type headers (FFEL SUBSIDIZED, FFEL PARENT PLUS) will need to be reintegrated into the column names. They currently sit in a row among the empty rows.
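A possible approach to the last three points is sketched below on a shortened toy version of the raw sheet. The helper `flatten_headers` and the " - " separator are assumptions for illustration. In the real sheets the loan type labels are in row 3 and the column names in row 4, as the `head()` output shows; in this toy frame they are rows 1 and 2.

```python
import pandas as pd

# Shortened stand-in for the raw sheet: a title row, a loan type row, a name row, then data.
raw = pd.DataFrame([
    ["Award Year Quarterly Activity", None, None, None, None, None],
    [None, None, "FFEL SUBSIDIZED", None, "FFEL PARENT PLUS", None],
    ["OPE ID", "School", "Recipients", "$ of Disbursements", "Recipients", "$ of Disbursements"],
    ["2576900", "CHARTER COLLEGE", 192, 255514, 4, 12783],
    ["2270400", "SOUTHEASTERN BIBLE COLLEGE", 123, 259886, 3, 17250],
])

def flatten_headers(raw: pd.DataFrame, group_row: int, name_row: int) -> pd.DataFrame:
    """Combine the loan type row and the column name row into one header."""
    groups = raw.iloc[group_row].ffill()  # carry each loan type across its columns
    names = raw.iloc[name_row]
    columns = [
        name if pd.isna(group) else f"{group} - {name}"
        for group, name in zip(groups, names)
    ]
    data = raw.iloc[name_row + 1:].reset_index(drop=True)  # drop the title and header rows
    data.columns = columns
    return data

tidy = flatten_headers(raw, group_row=1, name_row=2)
print(tidy.columns.tolist())
```

Prefixing each name with its loan type keeps repeated names such as "Recipients" unique, which matters once columns are selected by name.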

# Federal School List (School Codes)
* Each entry has a unique school code, address, and zip code. This suggests that a school does not have a single address or school code; instead, each of its campuses has its own identifying information.
* School Code may therefore identify an individual campus rather than an individual school.
* Example: 5610 and 5611 do not refer to different schools, but to different campuses of the same school, ITT.
* Schools may need to be grouped by name rather than by state, as one school can have several school codes across several states.