# In-class activity: Making your own college rankings

Cathy O'Neil's chapter, "Arms Race," from _Weapons of Math Destruction_ concludes with a proposed solution for the data manipulation that followed the rise of the U.S. News and World Report college rankings: an [open data portal from the U.S. Department of Education](https://collegescorecard.ed.gov/).

You've had a chance to look at that portal -- now, you can explore some of the [data](https://collegescorecard.ed.gov/data) yourself, and build your own college ranking model.

#### Special Note: Organization of this notebook
This time, I've included a clean version up top, where you can work on the problems on your own, and the solutions down at the bottom, if you want to peek ahead to some potential ways to do each step.

I _highly encourage_ you to try it on your own first without peeking!

## Part 1: Load in the data and look at the data dictionary

As we saw, the full institution-level dataset from the U.S. Department of Education has more than 6,000 different features in it! I've made a condensed version of it that has only ~115 different features/variables. This data file, `"college_scorecard_2022.csv"`, is available to download from Canvas.

I've also provided a [condensed version of the *data dictionary*](https://docs.google.com/spreadsheets/d/1LjTVjLoXnD4OTKhT7Mw7lWuCvvKqhevZ4rywVfeLioQ/edit?usp=sharing) that explains what features/variables are and what kinds of values they contain.

### Step 1. Load the data file into this notebook and examine it.
* What do you notice?
* How many rows and how many columns does it have?
* Which columns are numeric and which are not?
* For the numeric values, what ranges do they have?

### Step 2. Peruse the simplified [data dictionary](https://docs.google.com/spreadsheets/d/1LjTVjLoXnD4OTKhT7Mw7lWuCvvKqhevZ4rywVfeLioQ/edit?usp=sharing) (which is in a Google Sheet).
* How is it organized?
* What information is relevant to you as you are making sense of this dataset?

In [3]:
# YOUR CODE HERE

# Import necessary packages

# Read in data file

### Step 3. What are some basic ways you can filter and sort this data to look at some colleges you may be interested in?

For example, Whitman is a small liberal arts college, which means it has a "Carnegie Classification--basic" (CCBASIC) value of 21. You might filter this data set to look at liberal arts colleges.

Or, Historically Black Colleges and Universities (HBCUs) are indicated in the HBCU column with either a 0 or a 1. You might filter this data set to look at HBCUs.

What are some other interesting ways you might filter this dataset?

As you are doing this, make sure to consult the data dictionary to see what the column names mean.

Pick a subset of colleges you want to focus on (e.g., HBCUs, liberal arts colleges, colleges in the Pacific Northwest, public universities, etc.) and make a new data frame with those colleges.

In [2]:
# Note: Run this cell only after importing pandas as pd and loading your data into df
# By default, pandas shows at most 20 columns at a time in Jupyter
# You can override this by using the following line:
pd.options.display.max_columns = None

# You can also get a list of all of the columns with the following line:
print(df.columns.tolist())


In [None]:
# YOUR CODE HERE

# Select a subset of the colleges (however you like) and make a new dataframe

## Part 2: Making your own college ranking

What features/columns are important to _you_ in selecting a college? (Or, since you have already gone through this process, you might think about a sibling or a friend who is thinking about what college they might want to attend.)

### Step 4. Pick a subset of the columns (at least 5, but no more than 10) that you want to include in your custom ranking

Using the data dictionary, highlight the features / columns that you want to include in your custom ranking. (Remember that column D in the data dictionary, VARIABLE NAME, is the one that has the column name in your data frame.)

Make a new data frame that just has the institution name and these 5-10 columns that you are interested in.




In [None]:
# YOUR CODE HERE

# Select a subset of columns and make a new data frame

# Hint: Remember that you can select many columns at once using a list


### Step 5: Normalizing the data values

Look at the columns that you have picked. Is there a way to add them all together to come up with a single score?

To do this, we have to think about:
* What types of data do they contain?
* Are some numeric? For the numeric values, what range do they have (max and min)?
* Are some categorical rather than quantitative?

Figure out a way to turn each column into a number on a shared scale (e.g., 0 to 1, or -1 to 1). This process is called **normalization**.

As you do this, think about what values you are imbuing into the data. Are you assuming that some values are preferable to others?

For example, how might we normalize the values for "Instructional expenditures per full-time equivalent student" (INEXPFTE) for liberal arts colleges?

First, we would look at the range of values in the dataframe.

In [None]:
# A generalized way to do this is what is called min-max normalization:

# SCALED VALUE = (OLD VALUE - MIN) / (MAX - MIN)

# Or, to put it in terms of dataframes:

# df[scaled_column] = (df[column] - df[column].min()) / (df[column].max() - df[column].min())

# You can also write a for loop to iterate over all the columns in your data frame to do this!
# But, depending on the columns you chose, this might not work for all of them.

# More on min-max normalization here: https://towardsdatascience.com/everything-you-need-to-know-about-min-max-normalization-in-python-b79592732b79
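Here is a minimal sketch of that formula as a reusable function, tried out on a few made-up spending values. The function name `min_max_scale` and the numbers are hypothetical, not part of the dataset. It also guards against one case where the plain formula breaks: a column where every value is the same, which would mean dividing by zero.

```python
import pandas as pd

def min_max_scale(column: pd.Series) -> pd.Series:
    """Rescale a numeric column to the 0-1 range; missing values stay missing."""
    low, high = column.min(), column.max()
    if high == low:
        # Every value is identical, so (MAX - MIN) is 0 and the formula
        # would divide by zero. Return 0 for every row instead.
        return column * 0.0
    return (column - low) / (high - low)

# Made-up values, for illustration only
spending = pd.Series([2000, 5000, 10000])
scaled = min_max_scale(spending)
print(scaled.tolist())  # [0.0, 0.375, 1.0]
```

The middle value works out to (5000 - 2000) / (10000 - 2000) = 0.375. Note that this only makes sense for numeric columns; a categorical column needs a different approach.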

## Part 3: Adding the columns together & weighting them

Now that you have normalized your columns, you can add them all together.

### Step 6: Create a new column that sums the other columns. 

This new column is the overall score that determines your ranking.

Which colleges have the highest overall scores?

In [None]:
# YOUR CODE HERE

# Create a new column that adds the normalized values

In [None]:
# What is the result?

# Display the ranking by using .sort_values()

# You can also use .rank() to add in a ranking
# See: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.rank.html

### Step 7: Create *weights* for each column

Right now, each column has an equal weight. That is, all of the 5 to 10 features you picked contribute the same amount to the overall score. But that might not be what you want.

Think about how you might weight each column with a value between 0 and 1. If something has a weight of 1, it is really important. If it has a weight of 0, it is not important at all (and in fact won't be included in the model). Or, if you have 5 columns, you might rank them and give each one a weight between 1 and 5.

Are there any columns that should have a negative weight (i.e., it should detract from the overall score)?

_For folks taking machine learning, this is how many ML algorithms work!_ The difference is that you use a mathematical model to determine the weights, instead of a human's preferences.

Come up with weights for each of the columns you picked. Now, calculate a new total using these weights.

How has the ranking changed?

In [None]:
# YOUR CODE HERE

# You can do this by writing out a formula

# You might also make a dictionary of weights and use that to calculate the weighted values

In [None]:
# What are the new rankings?

_______
### Professor Wirfs-Brock's sample solution

Only peek at this if you need to!

In [None]:
# Import necessary packages
import pandas as pd
import numpy as np

# Load in the .csv as a data frame
df = pd.read_csv("college_scorecard_2022.csv")

# Examine your data frame
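One possible way to answer the Step 1 questions is sketched below. So that it runs on its own, it uses a tiny made-up data frame called `toy` in place of the real file; the same calls should work on `df`.

```python
import pandas as pd

# Hypothetical toy data standing in for college_scorecard_2022.csv;
# the values are made up for illustration only.
toy = pd.DataFrame({
    "INSTNM": ["College A", "College B", "College C"],
    "CCBASIC": [21, 15, 21],
    "ADM_RATE": [0.45, 0.80, None],
})

# How many rows and columns?
n_rows, n_cols = toy.shape

# Which columns are numeric and which are not?
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
other_cols = toy.select_dtypes(exclude="number").columns.tolist()

print(n_rows, n_cols)
print(numeric_cols, other_cols)
print(toy.describe())    # count, mean, min, max, etc. for the numeric columns
print(toy.isna().sum())  # number of missing values in each column
```

Checking for missing values early is worthwhile, because they affect the normalization and scoring steps later on.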

In [None]:
# For example, say we are focusing on a ranking for liberal arts colleges
# We might make a liberal arts data frame

liberal_arts = df[df["CCBASIC"] == 21]

In [None]:
# If we want to focus on HBCUs:

HBCUs = df[df["HBCU"] == 1]
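You can also combine more than one condition. The sketch below picks out liberal arts colleges in the Pacific Northwest from a made-up data frame. It assumes a state-abbreviation column named `STABBR`; check the data dictionary to confirm which column holds the state in your file.

```python
import pandas as pd

# Hypothetical toy data; STABBR is an assumed column name for the state
toy = pd.DataFrame({
    "INSTNM": ["College A", "College B", "College C", "College D"],
    "STABBR": ["WA", "OR", "CA", "WA"],
    "CCBASIC": [21, 21, 21, 15],
})

# Build each condition separately, then join them with & (and) or | (or)
is_liberal_arts = toy["CCBASIC"] == 21
in_pnw = toy["STABBR"].isin(["WA", "OR", "ID"])

# .copy() gives us an independent data frame we can safely add columns to
pnw_liberal_arts = toy[is_liberal_arts & in_pnw].copy()
print(pnw_liberal_arts)
```

Naming each condition first keeps the filter readable and avoids operator-precedence surprises when using `&` and `|`.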

In [None]:
# Picking a subset of columns:

# For example, let's focus on liberal arts colleges using the following criteria:

# Admission rate - ADM_RATE
# Percentage of degrees awarded in Computer And Information Sciences And Support Services. - PCIP11
# Percentage of degrees awarded in English Language And Literature/Letters. - PCIP23
# Average cost of attendance (academic year institutions) - COSTT4_A
# Median earnings of students working and not enrolled 10 years after entry - MD_EARN_WNE_P10
# Instructional expenditures per full-time equivalent student - INEXPFTE

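# .copy() makes an independent data frame, so adding columns later does not trigger a SettingWithCopyWarning
liberal_arts_ranking = liberal_arts[["INSTNM","ADM_RATE","PCIP11","PCIP23","COSTT4_A","MD_EARN_WNE_P10","INEXPFTE"]].copy()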

In [None]:
# Normalizing the values

# We might first use .describe() to look at the range of values:
liberal_arts_ranking["INEXPFTE"].describe()

In [None]:
# We can scale the max, 48373, to 1, and the min, 1709, to 0.

# The first way we might do this is just by calculating a new column using the min-max-scaling-method:
# SCALED VALUE = (OLD VALUE - MIN) / (MAX - MIN)

liberal_arts_ranking["INEXPFTE_s"] = (liberal_arts_ranking["INEXPFTE"] - 1709)/(48373-1709)

In [None]:
# Let's look at the new distribution,
# notice how the max is 1 and the min is 0
liberal_arts_ranking["INEXPFTE_s"].describe()

In [None]:
# And here's how we might normalize ALL the columns at once:

# here's where we make a new dataframe with just the columns we are interested in
liberal_arts_ranking = liberal_arts[["INSTNM","ADM_RATE","PCIP11","PCIP23","COSTT4_A","MD_EARN_WNE_P10","INEXPFTE" ]]

# make a copy of the data frame
liberal_arts_normalized = liberal_arts_ranking.copy()

# list of columns we want to normalize
to_normalize = ["ADM_RATE","PCIP11","PCIP23","COSTT4_A","MD_EARN_WNE_P10","INEXPFTE"]

# apply normalization techniques
for column in to_normalize:
    liberal_arts_normalized[column] = (liberal_arts_normalized[column] - liberal_arts_normalized[column].min()) / (liberal_arts_normalized[column].max() - liberal_arts_normalized[column].min())

# view normalized data
# Note how all the maxes are 1 and all the mins are 0
liberal_arts_normalized.describe()


In [None]:
# Calculating the total score:

# we can use the list of columns we care about, to_normalize, to do a sum
liberal_arts_normalized["score"] = liberal_arts_normalized[to_normalize].sum(axis=1)

liberal_arts_normalized.sort_values("score", ascending=False).head(50)
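One thing to watch for: `.sum(axis=1)` skips missing values by default, so a college with a missing value is scored as if that column contributed 0. The sketch below, on made-up numbers, shows the default behavior next to a stricter alternative that uses the `min_count` argument to leave the score missing unless every column is present. Which one you prefer is a modeling choice, not a rule.

```python
import numpy as np
import pandas as pd

# Hypothetical normalized values; College B is missing INEXPFTE
toy = pd.DataFrame({
    "INSTNM": ["College A", "College B"],
    "ADM_RATE": [0.5, 1.0],
    "INEXPFTE": [1.0, np.nan],
})
cols = ["ADM_RATE", "INEXPFTE"]

# Default: NaN is skipped, so College B's score is just its one known value
toy["score"] = toy[cols].sum(axis=1)

# Stricter: require a value in every column, otherwise the score is NaN
toy["score_strict"] = toy[cols].sum(axis=1, min_count=len(cols))

print(toy)
```

With the strict version, College B drops out of the ranking instead of being quietly penalized for missing data.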

In [None]:
# And we can make a new column with the rank as well using the .rank method
# https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.rank.html

liberal_arts_normalized["rank"] = liberal_arts_normalized["score"].rank(ascending=False)

liberal_arts_normalized
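By default, `.rank()` gives tied scores the average of the positions they span, which can produce ranks like 2.5. If you would rather have tied colleges share the better position, the `method="min"` option does that. A small sketch with made-up scores:

```python
import pandas as pd

# Made-up scores; A and C are tied
scores = pd.Series([2.5, 3.1, 2.5, 1.0], index=["A", "B", "C", "D"])

# Default (method="average"): tied entries share the average of their positions
default_rank = scores.rank(ascending=False)

# method="min": tied entries share the best (lowest) position
min_rank = scores.rank(ascending=False, method="min")

print(default_rank.tolist())  # [2.5, 1.0, 2.5, 4.0]
print(min_rank.tolist())      # [2.0, 1.0, 2.0, 4.0]
```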

In [None]:
# And where is Whitman?
liberal_arts_normalized[liberal_arts_normalized["INSTNM"] == "Whitman College"]

In [None]:
# Adding in weights

# How might we do the weighting? You could write out a long formula -- that is fine!
# But we can also expedite it with a dictionary.
# Here, our keys are the column names, and our values are the weights.
# For example:

weights = {"ADM_RATE":7,"PCIP11":1,"PCIP23":2,"COSTT4_A":3,"MD_EARN_WNE_P10":4,"INEXPFTE":6}

# Now we can make a new column using the weights
liberal_arts_normalized['weighted_score'] = sum([liberal_arts_normalized[key] * weights[key] for key in weights.keys()])

# view weighted data
liberal_arts_normalized.sort_values("weighted_score", ascending=False).head(50)
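The weights above are all positive, which means a higher cost raises a college's score. If you think cost should count against a college, one option is a negative weight. The sketch below uses made-up normalized values and hypothetical weights, and it shows an alternative to the list comprehension: turning the weights into a pandas Series and multiplying column by column.

```python
import pandas as pd

# Hypothetical normalized values (already on a 0-1 scale)
normalized = pd.DataFrame({
    "INSTNM": ["College A", "College B"],
    "COSTT4_A": [1.0, 0.0],            # 1.0 = most expensive in the group
    "MD_EARN_WNE_P10": [1.0, 0.75],
})

# Hypothetical weights: cost detracts from the score, earnings add to it
weights = pd.Series({"COSTT4_A": -1.0, "MD_EARN_WNE_P10": 2.0})

# Multiply each column by its weight (matched by column name), then sum across columns
weighted = normalized[weights.index].mul(weights, axis=1)
normalized["weighted_score"] = weighted.sum(axis=1)

print(normalized.sort_values("weighted_score", ascending=False))
```

Here College A scores -1.0 + 2.0 = 1.0 and College B scores 0.0 + 1.5 = 1.5, so the cheaper college comes out ahead even though its earnings value is lower.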

In [None]:
# Now where is Whitman?