# Scraping and Cleaning High-Level College Data

In this project, we'll be parsing college data from [CollegeData.com](https://www.collegedata.com/en/).  The site provides a wealth of information, based on college reporting, to help prospective U.S.-based college students with their decision-making (covering areas such as financials, student population and statistics, and satisfaction rates).  Overall, it's a reasonably full-featured platform that lets users dive into the data, track deadlines, and find strong school matches for themselves, among other things.

We'll be pulling and parsing HTML data from CollegeData's search feature, focusing on compiling the high-level data provided initially as search results by the site (versus diving into individual college pages for more granular data - certainly a candidate for a subsequent project).  Once we parse out the information we're interested in, we will organize and clean it using pandas.

## Pulling the HTML Code & Parsing The Data

We've already saved a copy of the full HTML file (`collegedata.html`), which was pulled using Python's `requests` package.  Working from a saved copy is expedient, avoids future issues if the HTML structure of CollegeData's search results page changes, and sidesteps the fact that the search results gate some of the data when a request verification token (normally obtained by logging in to a free account on the website) is absent.

For posterity: the HTML was retrieved by sending a `GET` request to `https://www.collegedata.com/en/explore-colleges/college-search/SearchByPreference/?SearchByPreference.SearchType=1&SearchByPreference.CollegeName=*` with a request verification token included within the request headers.  Note that this URL provides nothing other than `*` as search criteria, allowing us to pull results that include all available schools in CollegeData's database.
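
For reference, a request of this shape could be assembled as in the sketch below.  It uses only the standard library and builds the request object without sending it.  The header name `RequestVerificationToken` and the placeholder token value are assumptions for illustration; the original retrieval used `requests`, and the exact header the site expects is not documented here.

```python
from urllib.parse import urlencode
from urllib.request import Request

BASE_URL = (
    "https://www.collegedata.com/en/explore-colleges/"
    "college-search/SearchByPreference/"
)

# Query parameters taken from the URL above; "*" matches every school.
params = {
    "SearchByPreference.SearchType": 1,
    "SearchByPreference.CollegeName": "*",
}

# safe="*" keeps the wildcard literal instead of encoding it as %2A.
url = BASE_URL + "?" + urlencode(params, safe="*")

# Hypothetical header name and placeholder token (assumptions, not site documentation).
request = Request(url, headers={"RequestVerificationToken": "<your-token>"})

print(request.get_method())  # the HTTP method that would be used
print(request.full_url)      # the fully assembled URL
```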

Note: This data is available for free; accessing it does not require a paid account.

### Opening the HTML & Initializing the Parser

We'll begin by opening our HTML file and using the `BeautifulSoup` package to create an HTML parser object, which we will use to extract the data we're interested in.  We'll also import the `re` package to use in combination with BeautifulSoup in selecting target elements.

In [1]:
from bs4 import BeautifulSoup
import re

with open("collegedata.html") as html_file:
    html_data = BeautifulSoup(html_file, "html.parser")

On CollegeData's search results page, results are organized into a table with 3 separate tabs of data (Vitals, Financial, and Students) and a paginated list of schools making up the table rows (see snapshot below).  The "College Chances" and locker icon columns are relevant to the site's services, but not to us.

![Search Results Snapshot](img/collegedataresults.png)

The results page indicates that, by running a search that captures all schools in CollegeData's database, we get data on 1966 schools in total.

![Search Results: Total](img/searchresulttotal.png)

Upon brief inspection of the HTML code, we notice that each table row element that represents a school's data contains the `data-school-id` attribute along with an ID value.  Let's parse the entire tree for every element containing this attribute.

In [2]:
data_by_ids = html_data.find_all(attrs={"data-school-id": True})
len(data_by_ids)

5898

There are 5898 items contained in this collection, which suggests that each school is represented in the code 3 times.  This makes sense, given that there are 3 tabs of data in the table, but let's confirm by looking at the number of _unique_ ID values (we should expect 1966).

In [3]:
ids = set([int(item["data-school-id"]) for item in data_by_ids])
len(ids)

1966
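
The same check can be made stricter by confirming that every ID appears exactly 3 times, rather than only comparing totals.  A minimal sketch, using made-up ID values in place of the IDs pulled from `data_by_ids`:

```python
from collections import Counter

# Stand-in for the IDs pulled from data_by_ids (made-up values for illustration).
sample_ids = [949, 883, 886, 949, 883, 886, 949, 883, 886]

# Count how many times each school ID occurs across all tabs.
occurrences = Counter(sample_ids)

# Every school should appear once per tab, i.e. exactly 3 times.
unexpected = {school_id: n for school_id, n in occurrences.items() if n != 3}

print(len(occurrences))  # number of unique IDs
print(unexpected)        # empty dict when every ID appears 3 times
```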

Upon further inspection of the HTML code, we notice that each tab appears to be represented as an entirely distinct table, each with an element ID that uses the `colleges_table_` prefix.  Let's check how many of these there are.

In [4]:
data_tables = html_data.find_all(id=re.compile("colleges_table_"))

for table in data_tables:
    print(table['id'])

colleges_table_vitals
colleges_table_financial
colleges_table_students


Confirmed: there are 3 separate tables, each with an ID beginning with `colleges_table_` and ending with the tab label.  Given this structure, we'll start out by extracting the data we want into 3 separate collections, and then combine them.

### Extracting the Table Rows

Let's create 3 lists (stored in a dictionary keyed by tab name) to represent the row elements from each table, and populate them with the table row elements (again, each representing a school) by using the same `data-school-id` attribute we used above.

In [5]:
table_rows = {
    "vitals": [],
    "financial": [],
    "students": []
}

# iterate through each table element, searching it for all elements with data-school-id attributes and saving to the appropriate list
for table in data_tables:
    table_name = table["id"].replace("colleges_table_", "")
    row_elements = table.find_all(attrs={"data-school-id": True})
    table_rows[table_name] = row_elements
    
len(table_rows["vitals"]) # check the length of one of our collections, to ensure it's 1966, as expected

1966
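
To make the grouping step easier to follow, here is the same pattern applied to a small hand-written HTML fragment.  The markup is a simplified assumption about the page structure (only the table IDs and the `data-school-id` attribute are taken from the text above), so treat it as a sketch rather than a copy of the real page.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical, heavily simplified stand-in for the saved results page.
sample_html = """
<table id="colleges_table_vitals">
  <tr data-school-id="949"><td>West Texas A&amp;M University</td></tr>
  <tr data-school-id="883"><td>Rose-Hulman Institute of Technology</td></tr>
</table>
<table id="colleges_table_financial">
  <tr data-school-id="949"><td>West Texas A&amp;M University</td></tr>
  <tr data-school-id="883"><td>Rose-Hulman Institute of Technology</td></tr>
</table>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# Map each tab name to the rows found inside that tab's table only.
rows_by_table = {}
for table in soup.find_all(id=re.compile("^colleges_table_")):
    tab_name = table["id"].replace("colleges_table_", "", 1)
    rows_by_table[tab_name] = table.find_all(attrs={"data-school-id": True})

for tab_name, rows in rows_by_table.items():
    print(tab_name, len(rows))
```

Searching within each table element, rather than the whole document, is what keeps the rows of one tab from mixing with the rows of another.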

### Extracting Values From Table Cells

Now that we have our 3 lists of table rows, let's iterate through them and extract the individual values for each of the columns of relevant data.  We'll create a helper function to extract a cell value, which we can use for every column other than the college name and the CollegeData-assigned school ID value (the latter of which we'll use as a merge key and then drop).

In [6]:
# helper function: extracts text from element passed to it
def extract_value(cell):
    return cell.find_all(string=True)[2].strip()

# extract values for the rows in all 3 tables
table_row_values = {}
for table in table_rows.keys():
    table_items = []
    for row in table_rows[table]:
        row_data = []
        row_cells = row.find_all("td")
        row_data.append(row_cells[0].find(href=re.compile("/college")).string)
        row_data.append(row_cells[1].find(attrs={"data-schoolidvalue":True})["data-schoolidvalue"]) # we'll need these assigned school_id values for merging data later
        for i in range(2, len(row_cells)):
            row_data.append(extract_value(row_cells[i]))
        table_items.append(row_data)
    table_row_values[table] = table_items
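
The hard-coded index in `extract_value` depends on where the visible value sits among a cell's text nodes.  The sketch below shows this on a hypothetical cell layout (a label element followed by the value); the real markup may differ, so the index should be verified against the saved HTML.

```python
from bs4 import BeautifulSoup

# Hypothetical cell: whitespace, a label element, then the value we want.
cell_html = "<td>\n<span>City</span>\n Canyon\n</td>"
cell = BeautifulSoup(cell_html, "html.parser").td

# find_all(string=True) returns every text node, including whitespace-only ones.
text_nodes = cell.find_all(string=True)
print([str(node) for node in text_nodes])

# With this layout, the value is the third text node (index 2).
value = text_nodes[2].strip()
print(value)
```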

Let's check that the values for each table have been correctly extracted by examining the first row from each.

In [7]:
print(table_row_values["vitals"][0])
print(table_row_values["financial"][0])
print(table_row_values["students"][0])

['West Texas A&M University', '949', 'Canyon', 'TX', '7,394', 'Pub', 'Mod', '64%', '25.2%']
['West Texas A&M University', '949', '', '$21,426', '$23,012', '62%', '55%', '$23,670']
['West Texas A&M University', '949', 'Coed', '0.5%', '4.9%', '1.9%', '29%', '1.8%']


We've correctly extracted all of the values we're interested in.

## Converting to DataFrames

Now let's move on to compiling our data into pandas dataframes.  To keep it simple, we'll start by using the same column names as the CollegeData search results tabs (but converted to snake_case, to conform to Python conventions).

In [8]:
import pandas as pd

table_columns = {
    "vitals": ["name", "school_id", "city", "state", "size", "type", "entrance_difficulty", "freshman_satisfaction", "grad_rate"],
    "financial": ["name", "school_id", "your_net_price", "resident_coa", "nonresident_coa", "need_met", "merit_aid", "student_debt"],
    "students": ["name", "school_id", "gender_mix", "american_indian", "african_american", "asian_pacific_islander", "hispanic", "intl"]
}

vitals_data = pd.DataFrame(table_row_values["vitals"], columns = table_columns["vitals"])
financial_data = pd.DataFrame(table_row_values["financial"], columns = table_columns["financial"])
students_data = pd.DataFrame(table_row_values["students"], columns = table_columns["students"])

In [9]:
vitals_data.head()

Unnamed: 0,name,school_id,city,state,size,type,entrance_difficulty,freshman_satisfaction,grad_rate
0,West Texas A&M University,949,Canyon,TX,7394,Pub,Mod,64%,25.2%
1,Rose-Hulman Institute of Technology,883,Terre Haute,IN,2168,Priv,Very,91.2%,66.5%
2,Dominican College,886,Orangeburg,NY,1425,Priv,Non,74%,28.7%
3,University of Puget Sound,859,Tacoma,WA,2364,Priv,Mod,80.7%,65.8%
4,Champlain College,1270,Burlington,VT,2129,Priv,Mod,78%,53.9%


In [10]:
financial_data.head()

Unnamed: 0,name,school_id,your_net_price,resident_coa,nonresident_coa,need_met,merit_aid,student_debt
0,West Texas A&M University,949,,"$21,426","$23,012",62%,55%,"$23,670"
1,Rose-Hulman Institute of Technology,883,,"$70,401","$70,401",Not reported,Not Reported,"$45,345"
2,Dominican College,886,,"$46,600","$46,600",71%,17%,"$32,527"
3,University of Puget Sound,859,,"$68,146","$68,146",78%,44%,"$32,999"
4,Champlain College,1270,,"$61,012","$61,012",71%,22%,"$35,383"


In [11]:
students_data.head()

Unnamed: 0,name,school_id,gender_mix,american_indian,african_american,asian_pacific_islander,hispanic,intl
0,West Texas A&M University,949,Coed,0.5%,4.9%,1.9%,29%,1.8%
1,Rose-Hulman Institute of Technology,883,,0.2%,3.2%,6.1%,5.3%,14.6%
2,Dominican College,886,Coed,0%,15.2%,6.8%,30.8%,1.4%
3,University of Puget Sound,859,Coed,0.1%,1.8%,7%,8.8%,0.4%
4,Champlain College,1270,Coed,0.2%,2.7%,3%,6.7%,0.8%


### Merging the Data

Now that we've created our 3 separate dataframes, we can merge them into one.  Because all schools are represented in each of the 3 dataframes, we can confidently use an inner join.  We'll merge on both the `school_id` and `name` values: both exist in all 3 dataframes, and using them together as the key avoids duplicated columns in the result.

In [12]:
merged_data = vitals_data.merge(financial_data, how="inner", on=["name", "school_id"])
merged_data = merged_data.merge(students_data, how="inner", on=["name", "school_id"])

merged_data.head()

Unnamed: 0,name,school_id,city,state,size,type,entrance_difficulty,freshman_satisfaction,grad_rate,your_net_price,...,nonresident_coa,need_met,merit_aid,student_debt,gender_mix,american_indian,african_american,asian_pacific_islander,hispanic,intl
0,West Texas A&M University,949,Canyon,TX,7394,Pub,Mod,64%,25.2%,,...,"$23,012",62%,55%,"$23,670",Coed,0.5%,4.9%,1.9%,29%,1.8%
1,Rose-Hulman Institute of Technology,883,Terre Haute,IN,2168,Priv,Very,91.2%,66.5%,,...,"$70,401",Not reported,Not Reported,"$45,345",,0.2%,3.2%,6.1%,5.3%,14.6%
2,Dominican College,886,Orangeburg,NY,1425,Priv,Non,74%,28.7%,,...,"$46,600",71%,17%,"$32,527",Coed,0%,15.2%,6.8%,30.8%,1.4%
3,University of Puget Sound,859,Tacoma,WA,2364,Priv,Mod,80.7%,65.8%,,...,"$68,146",78%,44%,"$32,999",Coed,0.1%,1.8%,7%,8.8%,0.4%
4,Champlain College,1270,Burlington,VT,2129,Priv,Mod,78%,53.9%,,...,"$61,012",71%,22%,"$35,383",Coed,0.2%,2.7%,3%,6.7%,0.8%
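
An inner join silently drops any school that is missing from one of the frames, so it can be worth verifying the assumption that the frames cover the same schools.  One way to do that, sketched here on toy data, is to pass `validate="one_to_one"` (which raises an error if a key is duplicated) and to run an outer join with `indicator=True` to see whether any row fails to match.  The frames below are made-up stand-ins for `vitals_data` and `financial_data`.

```python
import pandas as pd

# Toy stand-ins for two of the three dataframes.
vitals = pd.DataFrame({
    "name": ["School A", "School B"],
    "school_id": ["1", "2"],
    "city": ["Canyon", "Tacoma"],
})
financial = pd.DataFrame({
    "name": ["School A", "School B"],
    "school_id": ["1", "2"],
    "student_debt": ["$23,670", "$32,999"],
})

# Outer join with an indicator column flags rows present in only one frame.
check = vitals.merge(
    financial,
    how="outer",
    on=["name", "school_id"],
    validate="one_to_one",
    indicator=True,
)

unmatched = check[check["_merge"] != "both"]
print(len(unmatched))  # 0 means an inner join would lose no rows
```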


In [13]:
merged_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1966 entries, 0 to 1965
Data columns (total 21 columns):
name                      1966 non-null object
school_id                 1966 non-null object
city                      1966 non-null object
state                     1966 non-null object
size                      1966 non-null object
type                      1966 non-null object
entrance_difficulty       1966 non-null object
freshman_satisfaction     1966 non-null object
grad_rate                 1966 non-null object
your_net_price            1966 non-null object
resident_coa              1966 non-null object
nonresident_coa           1966 non-null object
need_met                  1966 non-null object
merit_aid                 1966 non-null object
student_debt              1966 non-null object
gender_mix                1966 non-null object
american_indian           1966 non-null object
african_american          1966 non-null object
asian_pacific_islander    1966 non-null object


We've successfully merged our data into one dataframe.

It appears that there are no missing values anywhere, and that every column contains string values - we'll come back to this a little later.

## Cleaning the Data

### Removing Irrelevant Columns

Two columns now stand out as unnecessary: `school_id` (ID values assigned by CollegeData) and `your_net_price` (a financial calculation column only used as part of CollegeData's services).  Let's drop them.

In [14]:
merged_data = merged_data.drop(["your_net_price", "school_id"], axis = 1)

### Adjusting Column Names

Let's modify a few column names for clarity.

In [15]:
merged_data.rename(columns={
    "size": "n_students",
    "entrance_difficulty": "acceptance_difficulty",
    "resident_coa": "resident_cost",
    "nonresident_coa": "nonresident_cost",
    "student_debt": "avg_grad_debt",
    "american_indian": "pct_american_indian",
    "african_american": "pct_african_american",
    "asian_pacific_islander": "pct_asian_pacific_islander",
    "hispanic": "pct_hispanic",
    "intl": "pct_international"
}, inplace=True)

merged_data.columns

Index(['name', 'city', 'state', 'n_students', 'type', 'acceptance_difficulty',
       'freshman_satisfaction', 'grad_rate', 'resident_cost',
       'nonresident_cost', 'need_met', 'merit_aid', 'avg_grad_debt',
       'gender_mix', 'pct_american_indian', 'pct_african_american',
       'pct_asian_pacific_islander', 'pct_hispanic', 'pct_international'],
      dtype='object')

### Handling Missing Values

As was observed earlier, on first glance, it appears that our dataset contains no missing values.  However, given that our data originated from parsing HTML, it is all in string form.  Thus, missing values could still exist, such as in the form of an empty string.  From browsing through the original search results on the CollegeData.com website, we know that some schools were missing data, and the missing values were represented in a few different ways.

To start, let's look into how many empty strings exist for each of our columns.

In [16]:
for col in merged_data.columns:
    print("Empty strings in " + col + ":", len(merged_data[merged_data[col] == ""]))

Empty strings in name: 0
Empty strings in city: 0
Empty strings in state: 0
Empty strings in n_students: 0
Empty strings in type: 7
Empty strings in acceptance_difficulty: 180
Empty strings in freshman_satisfaction: 0
Empty strings in grad_rate: 0
Empty strings in resident_cost: 0
Empty strings in nonresident_cost: 0
Empty strings in need_met: 0
Empty strings in merit_aid: 0
Empty strings in avg_grad_debt: 0
Empty strings in gender_mix: 77
Empty strings in pct_american_indian: 0
Empty strings in pct_african_american: 0
Empty strings in pct_asian_pacific_islander: 0
Empty strings in pct_hispanic: 0
Empty strings in pct_international: 0
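
The same tally can be produced without an explicit loop.  A small sketch on toy data (the frame below is a made-up stand-in for `merged_data`):

```python
import pandas as pd

# Toy stand-in: string columns where "" marks a missing value.
df = pd.DataFrame({
    "name": ["School A", "School B", "School C"],
    "type": ["Pub", "", "Priv"],
    "gender_mix": ["", "", "Coed"],
})

# isnull() does not treat empty strings as missing...
print(df.isnull().sum().sum())

# ...so compare against "" directly and sum the matches per column.
empty_counts = df.eq("").sum()
print(empty_counts)
```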


We see that the columns `type` (7), `acceptance_difficulty` (180), and `gender_mix` (77) each contain missing (empty string) values.  To fix that, we'll replace the empty strings with `Not Reported`, to stay consistent with other existing entries in the data.

In [17]:
merged_data.loc[merged_data["type"] == "", "type"] = "Not Reported"
merged_data["type"].value_counts()

Priv            1193
Pub              594
Proft            172
Not Reported       7
Name: type, dtype: int64

In [18]:
merged_data.loc[merged_data["acceptance_difficulty"] == "", "acceptance_difficulty"] = "Not Reported"
merged_data["acceptance_difficulty"].value_counts()

Mod             1078
Min              317
Not Reported     180
Non              175
Very             157
Most              59
Name: acceptance_difficulty, dtype: int64

In [19]:
merged_data.loc[merged_data["gender_mix"] == "", "gender_mix"] = "Not Reported"
merged_data["gender_mix"].value_counts()

Coed            1821
Not Reported      77
Men               56
Women             12
Name: gender_mix, dtype: int64
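
Since the three columns receive the same treatment, the replacement can also be written once.  A minimal sketch on toy data, using `DataFrame.replace`:

```python
import pandas as pd

# Toy stand-in for the three categorical columns.
df = pd.DataFrame({
    "type": ["Pub", "", "Priv"],
    "acceptance_difficulty": ["Mod", "Very", ""],
    "gender_mix": ["", "Coed", "Women"],
})

categorical_cols = ["type", "acceptance_difficulty", "gender_mix"]

# Replace exact empty-string cells only; other values are left untouched.
df[categorical_cols] = df[categorical_cols].replace("", "Not Reported")

print(df["type"].tolist())
```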

Several columns contain values that should be numeric (totals and percentages): `n_students`, `freshman_satisfaction`, `grad_rate`, `resident_cost`, `nonresident_cost`, `need_met`, `merit_aid`, `avg_grad_debt`, and the demographic percentage columns.  These columns also appear to have missing values.

It appears that either "Not Reported" (with varying capitalization) or a hyphen ("-") is used wherever a value in one of these columns is missing.  Let's take a look at how many there are.

In [20]:
numeric_cols = ["n_students", "freshman_satisfaction", "grad_rate", "resident_cost", "nonresident_cost", "need_met", "merit_aid", "avg_grad_debt", "pct_american_indian", "pct_african_american", "pct_asian_pacific_islander", "pct_hispanic", "pct_international"]

for col in numeric_cols:
    merged_data[col] = merged_data[col].str.lower()
    print(col + " - Not Reported: " + str(len(merged_data[(merged_data[col] == "not reported")])))
    print(col + " - Hyphen:" + str(len(merged_data[(merged_data[col] == "-")])))

n_students - Not Reported: 82
n_students - Hyphen:0
freshman_satisfaction - Not Reported: 343
freshman_satisfaction - Hyphen:0
grad_rate - Not Reported: 534
grad_rate - Hyphen:0
resident_cost - Not Reported: 0
resident_cost - Hyphen:556
nonresident_cost - Not Reported: 0
nonresident_cost - Hyphen:553
need_met - Not Reported: 529
need_met - Hyphen:0
merit_aid - Not Reported: 550
merit_aid - Hyphen:0
avg_grad_debt - Not Reported: 0
avg_grad_debt - Hyphen:572
pct_american_indian - Not Reported: 349
pct_american_indian - Hyphen:0
pct_african_american - Not Reported: 328
pct_african_american - Hyphen:0
pct_asian_pacific_islander - Not Reported: 358
pct_asian_pacific_islander - Hyphen:0
pct_hispanic - Not Reported: 317
pct_hispanic - Hyphen:0
pct_international - Not Reported: 355
pct_international - Hyphen:0


We can see that although `n_students` only has 82 entries with missing values, the rest of the numeric columns all have at least 300 missing (and, in some cases, over 500), attributable to the schools not reporting those data points.  For now, since we're simply compiling a dataset and not performing a subsequent analysis, we'll use a replacement value (in this case, `NaN`) to indicate an unknown while still allowing for these columns to be converted to numeric types.

If using this dataset for an analysis, one would then want to decide what to do about missing values, depending on the goal of the analysis (perhaps dropping the 82 entries missing an `n_students` value, while taking a different action for columns where missing values make up a much greater percentage).  We'll also follow up by filtering down to, and exporting, an additional dataset containing no missing values.

In [21]:
import numpy as np

merged_data[numeric_cols] = merged_data[numeric_cols].replace("not reported", np.nan).replace("-", np.nan)

### Adjusting Data Types

Now we can convert our numeric columns to actual numeric values (they are currently strings, since they were parsed from HTML).  First, let's remove the non-numeric characters.

In [22]:
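# raw string avoids invalid escape warnings; regex=True is required in pandas 2.0+, where the default is a literal match
merged_data[numeric_cols] = merged_data[numeric_cols].apply(lambda col: col.str.replace(r"[$,%]", "", regex=True), axis = 0)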

This does leave us with a few entries in the `avg_grad_debt` column that now contain empty strings (because, initially, they contained unreported values, and were simply represented as `$`).

In [23]:
len(merged_data[merged_data["avg_grad_debt"] == ""])

5

Let's fix that, and replace these values with `NaN` as well.

In [24]:
merged_data.loc[merged_data["avg_grad_debt"] == "", "avg_grad_debt"] = np.nan

Now we can adjust data types for these columns.  For simplicity's sake, we'll use the `float` data type so that we can successfully convert columns containing `NaN` values.  We'll also wrap up data type conversion by converting the values that represent percentages to decimal form.

In [25]:
merged_data[numeric_cols] = merged_data[numeric_cols].astype("float")

percentage_cols = ["freshman_satisfaction", "grad_rate", "need_met", "merit_aid", "pct_american_indian", "pct_african_american", "pct_asian_pacific_islander", "pct_hispanic", "pct_international"]
merged_data[percentage_cols] = merged_data[percentage_cols] / 100 # divide values by 100 to convert to decimal form

merged_data[numeric_cols].head()

Unnamed: 0,n_students,freshman_satisfaction,grad_rate,resident_cost,nonresident_cost,need_met,merit_aid,avg_grad_debt,pct_american_indian,pct_african_american,pct_asian_pacific_islander,pct_hispanic,pct_international
0,7394.0,0.64,0.252,21426.0,23012.0,0.62,0.55,23670.0,0.005,0.049,0.019,0.29,0.018
1,2168.0,0.912,0.665,70401.0,70401.0,,,45345.0,0.002,0.032,0.061,0.053,0.146
2,1425.0,0.74,0.287,46600.0,46600.0,0.71,0.17,32527.0,0.0,0.152,0.068,0.308,0.014
3,2364.0,0.807,0.658,68146.0,68146.0,0.78,0.44,32999.0,0.001,0.018,0.07,0.088,0.004
4,2129.0,0.78,0.539,61012.0,61012.0,0.71,0.22,35383.0,0.002,0.027,0.03,0.067,0.008
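
As an alternative to replacing each missing-value marker by hand, `pd.to_numeric` with `errors="coerce"` turns anything that cannot be parsed as a number into `NaN`.  This is a different technique from the steps above, sketched on made-up values; the helper name `to_float` is hypothetical.

```python
import pandas as pd

def to_float(series):
    """Strip $, commas, and % signs, then coerce unparseable entries to NaN."""
    cleaned = series.str.replace(r"[$,%]", "", regex=True)
    return pd.to_numeric(cleaned, errors="coerce")

# Made-up values covering the missing-value markers seen in the data.
raw = pd.Series(["$21,426", "64%", "not reported", "-", "$", "7,394"])
converted = to_float(raw)

print(converted.tolist())
```

Coercion also hides any unexpected strings, so it is worth comparing the resulting `NaN` count against the number of missing values expected.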


## Exporting the Full Dataset

Finally, let's sort our dataframe alphabetically by the `name` column and export it to a CSV file.

In [26]:
data_full_final = merged_data.sort_values(by="name").reset_index(drop=True)
data_full_final.head(10)

Unnamed: 0,name,city,state,n_students,type,acceptance_difficulty,freshman_satisfaction,grad_rate,resident_cost,nonresident_cost,need_met,merit_aid,avg_grad_debt,gender_mix,pct_american_indian,pct_african_american,pct_asian_pacific_islander,pct_hispanic,pct_international
0,Abilene Christian University,Abilene,TX,3670.0,Priv,Mod,0.765,0.481,51957.0,51957.0,0.71,1.0,,Coed,0.003,0.091,0.013,0.178,0.039
1,Academy of Art University,San Francisco,CA,7051.0,Proft,Non,0.78,0.055,52532.0,52532.0,0.32,0.11,35862.0,Coed,0.007,0.081,0.092,0.155,0.278
2,Adams State University,Alamosa,CO,1991.0,Pub,Mod,0.538,0.24,22916.0,34340.0,0.7,0.41,22822.0,Coed,0.013,0.081,0.012,0.348,0.006
3,Adelphi University,Garden City,NY,5391.0,Priv,Mod,0.81,0.603,58710.0,58710.0,0.44,0.86,34980.0,Coed,0.002,0.09,0.115,0.181,0.04
4,Adrian College,Adrian,MI,1647.0,Priv,Mod,0.7,0.422,,,0.77,0.22,27741.0,Coed,0.004,0.099,0.006,0.023,0.001
5,Adventist University of Health Sciences,Orlando,FL,1341.0,Priv,Min,0.66,,28872.0,28872.0,0.31,0.14,38990.0,Coed,0.002,0.188,0.058,0.324,0.016
6,Agnes Scott College,Decatur,GA,996.0,Priv,Mod,0.789,0.638,57705.0,57705.0,0.84,0.37,30850.0,Women,0.002,0.335,0.083,0.137,0.07
7,Alabama A&M University,Huntsville,AL,4940.0,Pub,Min,0.75,0.372,24036.0,32646.0,0.74,0.07,38819.0,Coed,0.001,0.97,0.002,0.003,0.009
8,Alabama State University,Montgomery,AL,3903.0,Pub,Min,0.59,0.101,22260.0,30588.0,0.81,0.33,3376.0,Coed,0.001,0.945,0.005,0.01,0.01
9,Alaska Bible College,Palmer,AK,50.0,Priv,Min,0.2,,15800.0,15800.0,,,,Coed,0.013,0.007,,0.013,


In [27]:
data_full_final.to_csv("college_data_full.csv")
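
Note that `to_csv` writes the dataframe index as an unnamed first column by default, which pandas labels `Unnamed: 0` when the file is read back.  The sketch below demonstrates this on a toy frame using an in-memory buffer; passing `index=False` is one way to avoid the extra column if it is not wanted.

```python
import io
import pandas as pd

df = pd.DataFrame({"name": ["School A", "School B"], "n_students": [50.0, 996.0]})

# Default behaviour: the index is written as a first column with no header.
with_index = pd.read_csv(io.StringIO(df.to_csv()))
print(list(with_index.columns))

# index=False omits the index, so the round trip preserves the columns.
without_index = pd.read_csv(io.StringIO(df.to_csv(index=False)))
print(list(without_index.columns))
```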

As a reminder, we've exported the full 1,966-school dataset with missing values included in order to make all of the data accessible, should someone want to use it.  Depending on their use case, users can choose how to handle missing values (dropping columns, replacing with the mean, etc).

Let's now take this one step further and produce a second, filtered dataset with no missing values, as well.

## Generating A Second Dataset: No Missing Values (Preparing for Machine Learning)

We're going to approach this as though we're interested in creating a starting point dataset for training a predictive model.  There are a number of ways we can approach this - this is just one.

Looking through the data, it seems likely that many of the schools with at least one unreported value also didn't report at least one other value.  Let's take a look at the breakdown of how many rows contain various totals of missing values.

In [28]:
def count_missing(row):
    total_missing = row.isnull().sum() # calculate total # of NaN values in a row first
    
    # then iterate through string columns other than name, state, or city and add any "Not Reported" values
    for col in ["gender_mix", "acceptance_difficulty", "type"]:
        if row[col] == "Not Reported":
            total_missing += 1
    
    return total_missing

missing_values = data_full_final.apply(count_missing, axis = 1)
missing_values.value_counts()

0     967
1     217
2     139
3      96
14     68
12     68
5      67
7      58
4      56
13     54
6      46
8      45
9      34
10     26
11     25
dtype: int64
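
Applying a Python function row by row works, but the same count can be computed with column-wise operations.  A sketch on toy data, assuming the same convention as above (`NaN` in numeric columns, `Not Reported` in the categorical ones):

```python
import numpy as np
import pandas as pd

# Toy stand-in for data_full_final.
df = pd.DataFrame({
    "name": ["School A", "School B", "School C"],
    "type": ["Pub", "Not Reported", "Priv"],
    "gender_mix": ["Coed", "Not Reported", "Coed"],
    "grad_rate": [0.25, np.nan, np.nan],
})

categorical_cols = ["type", "gender_mix"]

# NaN count per row plus "Not Reported" count per row in the categorical columns.
missing_per_row = (
    df.isnull().sum(axis=1)
    + df[categorical_cols].eq("Not Reported").sum(axis=1)
)

print(missing_per_row.tolist())
```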

We can see that if we simply dropped every row with any missing value, we would be left with only around half of the original dataset.  However, if we keep rows below a certain threshold of missing values, we may be able to retain a substantial number of rows while only needing to handle a small percentage of missing values further.  Schools with several unreported values aren't providing a lot of value to the dataset, so we can feel reasonably comfortable eliminating them in this case.
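
To see how the choice of threshold affects the number of rows kept, the counts from the output above can be accumulated directly.  This sketch hard-codes those counts:

```python
# Rows per number of missing values, copied from the value_counts() output above.
rows_by_missing = {
    0: 967, 1: 217, 2: 139, 3: 96, 4: 56, 5: 67, 6: 46, 7: 58,
    8: 45, 9: 34, 10: 26, 11: 25, 12: 68, 13: 54, 14: 68,
}

total_rows = sum(rows_by_missing.values())

def rows_kept(cutoff):
    """Rows remaining if we keep only rows with fewer than `cutoff` missing values."""
    return sum(n for k, n in rows_by_missing.items() if k < cutoff)

# Show rows retained, and the share of the dataset, for a few candidate cutoffs.
for cutoff in (1, 2, 3, 4):
    kept = rows_kept(cutoff)
    print(cutoff, kept, round(kept / total_rows, 3))
```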

In [29]:
data_filtered = data_full_final[missing_values < 3]
data_filtered.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1323 entries, 0 to 1965
Data columns (total 19 columns):
name                          1323 non-null object
city                          1323 non-null object
state                         1323 non-null object
n_students                    1323 non-null float64
type                          1323 non-null object
acceptance_difficulty         1323 non-null object
freshman_satisfaction         1303 non-null float64
grad_rate                     1235 non-null float64
resident_cost                 1258 non-null float64
nonresident_cost              1260 non-null float64
need_met                      1260 non-null float64
merit_aid                     1273 non-null float64
avg_grad_debt                 1236 non-null float64
gender_mix                    1323 non-null object
pct_american_indian           1319 non-null float64
pct_african_american          1323 non-null float64
pct_asian_pacific_islander    1314 non-null float64
pct_hispanic   

We can see that if we take the approach of eliminating rows with 3 or more missing values, not only would we still be left with 1323 rows (67% of the original dataset; a reasonable amount), but we would only have to handle at most 88 missing values (6.7%) in any given column within the remaining data.  The 3 categorical columns (`type`, `acceptance_difficulty`, and `gender_mix`) also show no nulls, but note that this is expected regardless: their missing values are stored as the string `Not Reported` rather than `NaN`, so any that remain would not appear in this output.

Again, we're approaching this as though we're preparing the data for machine learning, so we still want to take care of the remaining missing values.

Approaches to handling the remaining missing values could include:

- Filling in missing values in the finance-related columns with column means subsetted by school type (as there may be some correlation between money-related values \[the columns where most remaining missing values are, such as `resident_cost`\] and school type \[private, public, etc\])
- Examining what column(s) `grad_rate` correlates with most highly and filling in with estimated values accordingly
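
The first option could look roughly like the sketch below, which fills each missing `resident_cost` with the mean for that school's `type`.  The data is made up, and this is one possible approach rather than the method used in this project.

```python
import numpy as np
import pandas as pd

# Toy stand-in for data_filtered.
df = pd.DataFrame({
    "type": ["Pub", "Pub", "Pub", "Priv", "Priv", "Priv"],
    "resident_cost": [20000.0, 24000.0, np.nan, 60000.0, np.nan, 50000.0],
})

# Mean cost within each school type, aligned back to the original rows.
group_means = df.groupby("type")["resident_cost"].transform("mean")

# Only the missing entries are replaced; reported values are unchanged.
df["resident_cost"] = df["resident_cost"].fillna(group_means)

print(df["resident_cost"].tolist())
```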


For now, we'll keep things simple, as a starting point, by filling in the remaining missing values with their column means.  We'll also reset the dataframe indices and export to a CSV.

In [30]:
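# numeric_only=True restricts the means to numeric columns (required in pandas 2.0+, where string columns raise an error)
college_data_ml_prepped = data_filtered.fillna(data_filtered.mean(numeric_only=True)).reset_index(drop=True)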
college_data_ml_prepped.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1323 entries, 0 to 1322
Data columns (total 19 columns):
name                          1323 non-null object
city                          1323 non-null object
state                         1323 non-null object
n_students                    1323 non-null float64
type                          1323 non-null object
acceptance_difficulty         1323 non-null object
freshman_satisfaction         1323 non-null float64
grad_rate                     1323 non-null float64
resident_cost                 1323 non-null float64
nonresident_cost              1323 non-null float64
need_met                      1323 non-null float64
merit_aid                     1323 non-null float64
avg_grad_debt                 1323 non-null float64
gender_mix                    1323 non-null object
pct_american_indian           1323 non-null float64
pct_african_american          1323 non-null float64
pct_asian_pacific_islander    1323 non-null float64
pct_hispanic   

In [31]:
college_data_ml_prepped.to_csv("college_data_ml.csv")

## Potential Next Steps

We've scraped, cleaned, and exported our dataset(s).  Future steps could include:
- Parsing each individual school's CollegeData profile page for more granular data (e.g. exact acceptance percentages, cost-of-attendance sub-values, etc) and compiling for a more comprehensive dataset
- Performing an analysis examining common threads among subsets of schools or correlations between columns (e.g. cost & graduation rate)
- Exploring alternative approaches for handling missing values (as mentioned above)