<h1 style="text-align: center">
<div style="color: #DD3403; font-size: 60%">Data Science DISCOVERY MicroProject</div>
<span style="">MicroProject: World University Rankings</span>
<div style="font-size: 60%;"><a href="https://discovery.cs.illinois.edu/microproject/world-university-rankings/">https://discovery.cs.illinois.edu/microproject/world-university-rankings/</a></div>
</h1>

<hr style="color: #DD3403;">

## Data Source: The Times Higher Education

There are hundreds of organizations that rank universities, including US News and World Report, QS World University Rankings, Times Higher Education (THE), and many others.

Times Higher Education (THE) provides a clean, well-documented CSV of its rankings, which are based on "performance data on universities for students and their families, academics, university leaders, governments and industry".  The 2020 dataset includes almost 1,400 universities across 92 countries and 13 performance indicators that measure an institution’s performance across teaching, research, knowledge transfer and international outlook.  Additional details on this dataset are available on their website: https://www.timeshighereducation.com/content/world-university-rankings

In this MicroProject, you will explore basic DataFrame operations on the Times Higher Education university rankings.

<hr style="color: #DD3403;">

## Part 1: Importing the World University Rank Dataset

To use the `pandas` library, you must first **import** it into your notebook. Import `pandas` as `pd` in the following cell:

Use the pandas `read_csv` function to read `World_University_Rank_2020.csv` and create a DataFrame called `df`.
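
As a minimal sketch of how `read_csv` behaves, the example below reads a tiny CSV from an in-memory string instead of a file. The column names and rows are made up for illustration and are not taken from the rankings dataset.

```python
import io
import pandas as pd

# Hypothetical three-row CSV standing in for a file on disk.
csv_text = """Name,Country,Students
Alpha University,Canada,12000
Beta College,United States,34000
Gamma Institute,United States,8000
"""

# read_csv accepts a file path or any file-like object.
# The first line of the CSV becomes the column names.
toy = pd.read_csv(io.StringIO(csv_text))

print(toy.shape)           # (number of rows, number of columns)
print(list(toy.columns))   # column names taken from the header line
```

With a real file, you would pass the file name as a string in place of the `io.StringIO` object.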

In [None]:
df = ...
df

In [None]:
## == CHECKPOINT TESTS ==
# - This read-only cell contains a "checkpoint" for this section of the MicroProject and verifies you are on the right track.
# - If this cell results in a celebration message, you PASSED all test cases!
# - If this cell results in any errors, check your previous cells, make changes, and RE-RUN your code and then this cell.

tada = "\N{PARTY POPPER}"

assert("df" in vars()), "Make sure to name the dataframe df"
assert(df["University"].iloc[0] == "University of Oxford")
assert(df["University"].iloc[1392] == "Pontifical Catholic University of Minas Gerais")
assert("University" in df)
assert("Score" not in df)
print(f"{tada} All Tests Passed! {tada}") 

<hr style="color: #DD3403;">

## Puzzle 1: Focusing on the United States

In this dataset, each row represents one university.  In the dataset, find the variable that encodes where the university is located.

Create one new DataFrame, `df_US`, that only contains universities located in the United States:
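
One common way to select rows is a boolean mask. The sketch below uses a made-up DataFrame with hypothetical column names, so you will need to adapt it to the column you found in the real dataset.

```python
import pandas as pd

# Toy data for illustration only.
toy = pd.DataFrame({
    "Name": ["Alpha University", "Beta College", "Gamma Institute", "Delta University"],
    "Country": ["Canada", "United States", "Japan", "United States"],
})

# Comparing a column to a value gives one True/False entry per row.
mask = toy["Country"] == "United States"

# Indexing the DataFrame with the mask keeps only the rows marked True.
toy_us = toy[mask]
print(toy_us)
```

Note that the rows kept in `toy_us` retain their original index labels from `toy`.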

In [None]:
df_US = ...
df_US

In [None]:
## == TEST CASES FOR PUZZLE 1 ==
# - This read-only cell contains a "checkpoint" for this section of the MicroProject and verifies you are on the right track.
# - If this cell results in a celebration message, you PASSED all test cases!
# - If this cell results in any errors, check your previous cells, make changes, and RE-RUN your code and then this cell.

tada = "\N{PARTY POPPER}"

assert(df.iloc[47]["University"] == "University of Illinois at Urbana-Champaign")
assert(df.iloc[47]["Number_students"] == 44916)

assert('df_US' in vars()), "Make sure to name the dataframe df_US."
assert(len(df_US) == 172), "It looks like you did not subset df_US to only universities located in the United States."
assert(df_US["University"].iloc[0] == "California Institute of Technology")
assert(df_US["Number_students"].iloc[171] == 14791)

print(f"{tada} All Tests Passed! {tada}") 

## Exploring Indexes

By default, pandas creates an **index** that starts at `0` and labels each row with an increasing number.  The top university in the original dataset, Oxford, has an index of `0`.  You can see this by running the cell below, which displays the data for the university at index (or `loc`) `0`:

In [None]:
# Find the row with the index `0` in the DataFrame `df`:
df.loc[0]

Your new DataFrame `df_US` still has the original index values since it is a subset of `df`.  **You will get a `KeyError`** when you try to find the index `0`, since you removed that row when you selected only the universities in the United States:

In [None]:
# Find the row with the index `0` in the DataFrame `df_US`:
# (It does not exist, since it's not in the United States, so you should get an error!)
df_US.loc[0]
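
The difference between looking up a row by label (`loc`) and by position (`iloc`) can be sketched on a toy DataFrame. The data and column names below are made up for illustration.

```python
import pandas as pd

# Toy data for illustration only.
toy = pd.DataFrame({
    "Name": ["Alpha University", "Beta College", "Gamma Institute", "Delta University"],
    "Country": ["Canada", "United States", "Japan", "United States"],
})
toy_us = toy[toy["Country"] == "United States"]

# The subset keeps the labels of the rows it came from.
print(toy_us.index.tolist())

# iloc looks up by position, so position 0 always exists in a non-empty DataFrame.
first_by_position = toy_us.iloc[0]

# loc looks up by label, and the label 0 was filtered out of the subset.
try:
    toy_us.loc[0]
    label_zero_found = True
except KeyError:
    label_zero_found = False

print(first_by_position["Name"], label_zero_found)
```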

## Puzzle 1.2: Re-indexing the Universities in the United States

Use `df_US.reset_index()` to have pandas **regenerate** the index for `df_US`.  pandas will:
- Replace all of the existing index values in `df_US` with a new index,
- Start this new index at `0` and number each row one-by-one, as if you had imported it as a new dataset.
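
As a hedged sketch on made-up data, the example below shows what `reset_index` returns. It also shows the optional `drop=True` argument, which this MicroProject does not ask for but which is useful to know about.

```python
import pandas as pd

# Toy data for illustration only.
toy = pd.DataFrame({
    "Name": ["Alpha University", "Beta College", "Gamma Institute", "Delta University"],
    "Country": ["Canada", "United States", "Japan", "United States"],
})
toy_us = toy[toy["Country"] == "United States"]

# reset_index returns a NEW DataFrame; the old labels are kept in a column named "index".
with_old = toy_us.reset_index()

# With drop=True, the old labels are discarded instead of being kept as a column.
without_old = toy_us.reset_index(drop=True)

print(with_old)
print(without_old)
```

Because `reset_index` returns a new DataFrame rather than changing the original, the result must be assigned to a variable to keep it.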


In [None]:
df_US = ...
df_US

In [None]:
## == TEST CASES FOR PUZZLE 1.2 ==
# - This read-only cell contains a "checkpoint" for this section of the MicroProject and verifies you are on the right track.
# - If this cell results in a celebration message, you PASSED all test cases!
# - If this cell results in any errors, check your previous cells, make changes, and RE-RUN your code and then this cell.

tada = "\N{PARTY POPPER}"
assert(df_US["University"].loc[0] == "California Institute of Technology")
print(f"{tada} All Tests Passed! {tada}") 

<hr style="color: #DD3403;">

## Puzzle 2: Large Universities in the United States

Examine the dataset to find the variable that stores the number of students that attend each university, then create a subset of the DataFrame called `df_US_large` that contains all universities in the United States that have more than 30,000 students:
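
Filtering on a numeric column works the same way as filtering on text. The sketch below uses made-up data with a hypothetical `Students` column, so adapt the column name to the real dataset.

```python
import pandas as pd

# Toy data for illustration only.
toy = pd.DataFrame({
    "Name": ["Alpha University", "Beta College", "Gamma Institute", "Delta University"],
    "Students": [12000, 34000, 8000, 30000],
})

# A strict "greater than" comparison excludes rows equal to the threshold,
# so the row with exactly 30000 students is not kept.
toy_large = toy[toy["Students"] > 30000]
print(toy_large)
```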

In [None]:
df_US_large = ...
df_US_large

In [None]:
## == TEST CASES FOR PUZZLE 2 ==
# - This read-only cell contains a "checkpoint" for this section of the MicroProject and verifies you are on the right track.
# - If this cell results in a celebration message, you PASSED all test cases!
# - If this cell results in any errors, check your previous cells, make changes, and RE-RUN your code and then this cell.

tada = "\N{PARTY POPPER}"

assert('df_US_large' in vars()), "Make sure to name the dataframe df_US_large."
assert(len(df_US_large) == 45), "It looks like you did not subset df_US_large to only universities in the United States with more than 30,000 students."
assert(df_US_large["University"].iloc[7] == "University of Illinois at Urbana-Champaign")
print(f"{tada} All Tests Passed! {tada}")

## Puzzle 2.2: Using the index to store a US_Large_Ranking

It is sometimes useful to store the current "rank" of values in a DataFrame.  The first entry should be ranked as #1, the second entry as #2, and so on.

To do this, we can use the `reset_index` technique to ask pandas to re-index the rows and then use the index value, plus one, as the rank.  This works because pandas gives the first row index `0`, the second row index `1`, the next row index `2`, and so on.  Since adding one to the index gives the rank, this is a quick technique to rank all of the rows.

Complete the following two steps:
- Reset the index of `df_US_large`,
- Create a new column in `df_US_large` called `"US_large_rank"` that has the value of `df_US_large.index + 1`.
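
The two steps above can be sketched on a made-up DataFrame. The names here are hypothetical, and `drop=True` is an optional choice made only to keep the toy output small.

```python
import pandas as pd

# Toy data with non-consecutive index labels, as left behind by an earlier filter.
toy = pd.DataFrame(
    {"Name": ["Beta College", "Delta University", "Zeta College"]},
    index=[1, 3, 7],
)

# Step 1: regenerate the index so the rows are labeled 0, 1, 2.
toy = toy.reset_index(drop=True)

# Step 2: adding one to every index value gives a rank that starts at 1.
toy["rank"] = toy.index + 1
print(toy)
```

The rank reflects the current row order, so this technique assumes the rows are already sorted the way you want them ranked.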

In [None]:
# Step 1: Reset the index of df_US_large (if needed, see Puzzle 1.2 to refresh how you can do this):
df_US_large = ...
df_US_large


In [None]:
# Step 2: Create a new column in df_US_large:
df_US_large["US_large_rank"] = ...
df_US_large

In [None]:
## == TEST CASES FOR PUZZLE 2.2 ==
# - This read-only cell contains a "checkpoint" for this section of the MicroProject and verifies you are on the right track.
# - If this cell results in a celebration message, you PASSED all test cases!
# - If this cell results in any errors, check your previous cells, make changes, and RE-RUN your code and then this cell.

tada = "\N{PARTY POPPER}"

assert("US_large_rank" in df_US_large)
assert(df_US_large.loc[0].iloc[0] == df_US_large.iloc[0].iloc[0])
assert(df_US_large.loc[0]["US_large_rank"] == 1)
print(f"{tada} All Tests Passed! {tada}")

<hr style="color: #DD3403;">

## Puzzle 3: Creating Random Subsets

Instead of focusing on just a subset of a DataFrame, researchers often need to look at a random sample of a DataFrame.

Returning to the original dataset of nearly 1,400 universities, create one new DataFrame, `df_random_15`, that gives us a random sample of 15 rows in the dataset.
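
As a minimal sketch on made-up data, the example below draws a random sample with the pandas `sample` method. The `random_state` argument is optional and is used here only so the toy example gives the same rows each time it runs.

```python
import pandas as pd

# Toy data for illustration only: 100 rows with a single column.
toy = pd.DataFrame({"value": range(100)})

# n is the number of rows to draw; by default rows are drawn without replacement.
picked = toy.sample(n=5, random_state=0)

# Using the same random_state again reproduces the same sample.
again = toy.sample(n=5, random_state=0)

print(picked)
```

Without `random_state`, each run returns a different set of rows.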

In [None]:
df_random_15 = ...
df_random_15

In [None]:
## == TEST CASES FOR PUZZLE 3 ==
# - This read-only cell contains a "checkpoint" for this section of the MicroProject and verifies you are on the right track.
# - If this cell results in a celebration message, you PASSED all test cases!
# - If this cell results in any errors, check your previous cells, make changes, and RE-RUN your code and then this cell.

tada = "\N{PARTY POPPER}"

assert('df_random_15' in vars()), "Make sure to name the dataframe df_random_15."
assert(len(df_random_15) == 15), "It looks like you did not sample exactly 15 rows."
assert(len(df_random_15[df_random_15.Country != "United States"]) > 0), "Make sure to sample from the full dataset stored in `df`"
print(f"{tada} All Tests Passed! {tada}") 

<hr style="color: #DD3403;">

## Submission

You're almost done!  All you need to do is to commit your lab to GitHub and run the GitHub Actions Grader:

1.  ⚠️ **Make certain to save your work.** ⚠️ To do this, go to **File => Save All**

2.  After you have saved, exit this notebook and follow the instructions to commit and grade this MicroProject!