# Lab-P12: Web Requests, Caching, DataFrames and Scraping
Version: 4/22, 1:00AM

## Segment 1: Web Requests and File Downloads

Import the `time`, `requests`, `os`, `json` and `pandas` modules, and import the `BeautifulSoup` class from the `bs4` module.

For `pandas`, import it as `pd` - as was done in lecture. You can refer to the [lecture material](https://github.com/tylerharter/caraza-harter-com/blob/master/tyler/meena/cs220/s22/materials/readings/pandas-intro.ipynb) for help.

In [None]:
#Write import statements here

### Task 1.1 Fetch `rankings.json` from an internet URL

Use the `requests` library to fetch the file at this URL: `https://raw.githubusercontent.com/msyamkumar/cs220-s22-projects/main/p12/rankings.json`. Make sure to call the appropriate method to raise an `HTTPError` if the status code is not 200.

Then create a variable called `file_text` that stores the text of the response.
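
As a hedged, offline illustration of the status check (not the solution to this task), the sketch below builds `requests.Response` objects by hand so that it runs without a network connection. In real code the response would come from a call such as `requests.get(url)`. The helper name `text_or_error` is hypothetical.

```python
import requests

def text_or_error(response):
    """Return the body text, or a short message if the status signals an error."""
    try:
        # raise_for_status() raises HTTPError for 4xx and 5xx status codes
        response.raise_for_status()
    except requests.HTTPError as err:
        return "request failed: " + str(err.response.status_code)
    return response.text

# Hand-built responses (an assumption made only to avoid network access)
ok = requests.Response()
ok.status_code = 200
ok.encoding = "utf-8"
ok._content = b"hello"   # private attribute, set here only for the demo

missing = requests.Response()
missing.status_code = 404

print(text_or_error(ok))
print(text_or_error(missing))
```

Checking the status before reading `.text` avoids silently saving an error page as if it were data.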

In [None]:
# Write your code here

In [None]:
assert file_text[:30] == '[\n    {\n        "World Rank": '

### Task 1.2 Save `rankings.json` as a file

Open a file called `rankings.json` in write mode and write `file_text` to it. Make sure to close the file afterwards, unless you used a `with` block.

This lecture may be useful: [Files and Directories](https://github.com/tylerharter/caraza-harter-com/blob/master/tyler/meena/cs220/s22/materials/meena_lec_notes/lec-26/files_and_directories.ipynb)
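
The sketch below shows the general `with` pattern on a throwaway file; the file name `example.txt` and the text written are made up for illustration. Passing `encoding="utf-8"` is an optional habit assumed here, not something this task requires.

```python
import os
import tempfile

# Write to a throwaway directory so the example leaves no files behind
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "example.txt")   # hypothetical file name
    with open(path, "w", encoding="utf-8") as f:
        f.write("some text")
    # Leaving the with block closed the file automatically
    file_was_closed = f.closed
    with open(path, encoding="utf-8") as f:
        round_trip = f.read()

print(file_was_closed, round_trip)
```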

In [None]:
# Your code here

In [None]:
assert(os.path.exists("rankings.json"))

Check your `lab12` directory in Finder (Mac) / Explorer (Windows). It should have a file `rankings.json`.

### Task 1.3 Implement the `download` function

Now, implement a function `download` to download data from the internet and save it to a file. 

This function takes two arguments, `filename` and `url`. The contents at the address given by `url` should be saved into the file whose path is given by `filename`. Remember that you can reuse the code you wrote above.

In [None]:
def download(filename, url):
    # make the request
    # get the text
    # open the file
    # write to the file
    # close the file
    return (str(filename) + " created!")

### Task 1.4 Implement caching in the `download` function 

Now go back and modify `download` to implement caching. This means that before downloading the file from the internet, the function should check if the file already exists. *Hint:* We've used the `os` function you need for this in an assert test above and in the test below.

If the file already exists, the function should return the message `"<filename> already exists!"` where `filename` is the argument. It should not make a request.
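
The caching idea can be sketched in general terms without any web request. In the example below, `save_once` and `expensive_step` are hypothetical names, and the return strings are placeholders rather than the messages this task asks for.

```python
import os
import tempfile

def save_once(filename, produce_text):
    """Write produce_text() to filename unless the file is already there."""
    if os.path.exists(filename):
        return "skipped"            # cache hit: the expensive step never runs
    with open(filename, "w", encoding="utf-8") as f:
        f.write(produce_text())
    return "written"

calls = []                          # records each time the expensive step runs

def expensive_step():
    calls.append(1)
    return "payload"

with tempfile.TemporaryDirectory() as folder:
    target = os.path.join(folder, "cache.txt")
    first = save_once(target, expensive_step)
    second = save_once(target, expensive_step)

print(first, second, len(calls))
```

The second call returns early, so the expensive step runs only once even though `save_once` is called twice.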

### Task 1.5 Test the `download` function

Run the cell below to test your function. Think about why the test code is written in this way. Ask a TA if you're not sure.

In [None]:
rankings_url = "https://raw.githubusercontent.com/msyamkumar/cs220-s22-projects/main/p12/rankings.json"

if os.path.exists("rankings.json"):
    os.remove("rankings.json") ## delete the existing file

assert download("rankings.json", rankings_url) == "rankings.json created!"
assert(os.path.exists("rankings.json"))
assert(os.path.getsize("rankings.json") > 1600000 and os.path.getsize("rankings.json") < 2500000)
assert (download("rankings.json",rankings_url) == "rankings.json already exists!" )

You will have to use this `download` function to download files during p12. This will ensure that you do not download the files each time you 'Restart & Run All'.


## Segment 2:  Creating DataFrames

For this project, we will be analyzing statistics about world university rankings adapted from
[here](https://cwur.org/). The `rankings.json` file was created by scraping content from pages on the linked website. 

We are going to use `pandas` throughout the lab and project to analyze this dataset.

### Task 2.1 Load data from `rankings.json` into a dataframe

In lecture, you reviewed different ways to create pandas DataFrames. For this task, create a DataFrame `rankings` by reading the JSON data saved in `rankings.json`. 

We covered the `read_csv` method of pandas in lecture to read CSV files into a DataFrame. Now, we are going to use a similar method `read_json` to read a JSON file into a dataframe. Try this below, and seek help from a TA if you face any trouble.

Remember to cast the return value explicitly to a DataFrame object; you must do this throughout the lab and project. The DataFrame returned by `read_json` sometimes has type issues on Windows laptops, hence the need for explicit type conversion.
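
A minimal sketch of the read-then-cast pattern on a tiny made-up JSON file (the file name `toy.json` and its two rows are assumptions, not the lab data):

```python
import json
import os
import tempfile

import pandas as pd

records = [                                   # made-up rows, not the lab data
    {"Name": "Alpha", "Score": 90.5},
    {"Name": "Beta", "Score": 88.0},
]

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "toy.json")
    with open(path, "w", encoding="utf-8") as f:
        json.dump(records, f)
    # read_json parses the file; the outer DataFrame(...) is the explicit cast
    toy = pd.DataFrame(pd.read_json(path))

print(toy)
```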

In [None]:
# Use the read_json method of pandas to create a DataFrame by reading from a file
# Cast the return value of read_json to a DataFrame explicitly
rankings = pd.DataFrame(???)

rankings.head()

In [None]:
assert(type(rankings) == pd.DataFrame)
assert(rankings.iloc[0]["Institution"] == 'Harvard University')
assert(rankings.iloc[1]["Score"]== 96.7)

### Task 2.2 Find the unique universities in the dataset

As the dataset contains rankings for three different years, the same university may have featured multiple times. Find the names of the unique universities that are represented in the dataset.

First, extract just the names of the institutions as a pandas Series. Then, reduce it to the unique names and store the result as a Series called `institutions` (the test below expects a `pd.Series`, not a list). Think about which data structure you have used before to extract unique values from a list. A Series can easily be converted into that data structure, and that data structure can be converted back into a Series.
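
One possible way to do this round trip, shown on made-up data and assuming the data structure in question is a `set`:

```python
import pandas as pd

names = pd.Series(["Alpha", "Beta", "Alpha", "Gamma", "Beta"])  # made-up data

# Series -> set drops the repeats; the result is then turned back into a Series.
# A set has no guaranteed order, so the values are sorted here for display.
unique_names = pd.Series(sorted(set(names)))

print(unique_names)
```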

In [None]:
# Create a pandas `Series` of just the institution names in the dataset. 
institutions = ???

In [None]:
assert(type(institutions) == pd.Series)
assert(len(institutions) == 2156)

### Task 2.3 Use `value_counts` to count instances in a dataframe

Now, let's find the country that is the 5th most represented in the dataframe, and the number of times it appears. Recall that `value_counts` lets us count the number of occurrences of each unique value in a pandas Series.

#### Task 2.3a Obtain the counts for all countries

First, use the `value_counts` function to return a pandas series called `country_counts`. This series contains each country in the dataset and the number of times it occurs.

In [None]:
country_counts = ???

In [None]:
assert(type(country_counts) == pd.Series)
assert(country_counts["USA"] == 1062)
assert(len(country_counts) == 103)

#### Task 2.3b Find the 5th most represented country

Use the `.index` attribute of the `Series` `country_counts` to fetch the name of the 5th most represented country. Use `loc` or `iloc` to fetch the count of this country. Make sure to use the pandas series defined in Task 2.3a.

**Hint**: The pandas `Series.index` attribute works differently from the `.index` method you are familiar with for Python lists. Indexing `Series.index` with the integer position of an element returns that element's label, which you can then pass to `.loc`.
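
A small sketch of how positions and labels relate in a `value_counts` result, using made-up letters rather than the lab data:

```python
import pandas as pd

letters = pd.Series(["a", "a", "a", "b", "b", "c"])   # made-up data
counts = letters.value_counts()    # sorted from most to least frequent

second_label = counts.index[1]             # label at position 1
second_count = counts.iloc[1]              # count at position 1
same_count = counts.loc[second_label]      # same count, looked up by label

print(second_label, second_count, same_count)
```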

In [None]:
country = country_counts.index[???]
count = ???

In [None]:
assert(country == "France")
assert(count == 256)

### Task 2.4 `loc` vs `iloc`

In this lab and project, you must only use `iloc`. Using `loc` will be considered hardcoding. This is because `iloc` selects rows and columns by integer position, while `loc` selects them by pandas index label.

Intuition: Recall that row indices can be meaningful labels, such as strings. Consider a scenario where rows are added to the beginning of the DataFrame: if you use `.loc`, your answer will become incorrect when the data changes, whereas `.iloc[0]` will always give you the first row.
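
The scenario can be sketched with a made-up two-row DataFrame; the label `99` given to the new row is an arbitrary choice for illustration.

```python
import pandas as pd

df = pd.DataFrame({"Name": ["Alpha", "Beta"]})            # labels 0 and 1
new_first = pd.DataFrame({"Name": ["Zeta"]}, index=[99])  # hypothetical new row

# Put the new row first; every row keeps its original label
combined = pd.concat([new_first, df])

by_position = combined.iloc[0]["Name"]   # first row by position
by_label = combined.loc[0]["Name"]       # row whose label is 0

print(by_position, by_label)
```

After the new row is added, position 0 and label 0 refer to different rows, so only `.iloc[0]` still means "the first row".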

This distinction may not be as intuitive for the current `rankings` dataframe. As an example, use both `loc` and `iloc` to fetch the first row in `rankings`.

In [None]:
first_row_iloc = ???
print(first_row_iloc)
first_row_loc = ???
print(first_row_loc)

The results are exactly the same! This happens because the integer positions match the pandas index labels in the `rankings` dataframe. However, this will not always hold true, as we see in the next task.

### Task 2.5 Use boolean indexing to filter data

Now, use boolean indexing to extract data from the dataframe. Recall boolean indexing from [lecture](https://github.com/tylerharter/caraza-harter-com/blob/master/tyler/meena/cs220/s22/materials/meena_lec_notes/lec-28/lec_28_pandas2.ipynb)

Create a dataframe `rankings_arg_bra` that consists only of rankings of universities from Argentina and Brazil, then extract the first row of this new dataframe. As you'll see, using `loc` will not work the same way it did before: the `.loc[0]` line in the second cell below should now raise a KeyError.

**Hint**: When implementing boolean indexing in pandas, the `or` operator is represented by `|` and the `and` operator is represented by `&`.
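
A short sketch of both operators on made-up data (the column names and values below are assumptions for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alpha", "Beta", "Gamma", "Delta"],
    "Region": ["North", "South", "East", "South"],
    "Score": [70, 85, 90, 60],
})  # made-up data

# Each comparison needs its own parentheses because | and & bind more tightly than ==
north_or_east = df[(df["Region"] == "North") | (df["Region"] == "East")]
strong_south = df[(df["Region"] == "South") & (df["Score"] > 80)]

print(list(north_or_east["Name"]), list(strong_south["Name"]))
print(list(strong_south.index))   # filtered rows keep their original labels
```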

In [None]:
rankings_arg_bra = ???
rankings_arg_bra

In [None]:
first_row_iloc = rankings_arg_bra.iloc[0]
print(first_row_iloc)
first_row_loc = rankings_arg_bra.loc[0]
print(first_row_loc)

Oops! We see that using `.loc` now causes a KeyError.

`.loc[0]` tries to find the row with the *labeled* index 0. Run the cell below and notice how `rankings_arg_bra` starts at the labeled index 127. There is no 0. Hence the KeyError.

In [None]:
rankings_arg_bra.head()

### Task 2.6 Sort the dataframe

The dataframe in Task 2.5 is sorted by World Rank, with the result that universities from Argentina and Brazil are interleaved throughout the data. Re-sort the data to sort by country so that all universities from Argentina appear first followed by universities from Brazil. Within each country, the universities should be sorted by their National Rank. 

Use the `sort_values` function of `pandas`. Remember - by default, `pandas` returns a new sorted DataFrame and does not modify the existing one.

Recall that `sort_values` takes an argument for the parameter `by` as the column name, based on which you want to do the sorting. If you want to use one column for primary sorting and another for secondary sorting, you can specify a list of column names.
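
For instance, a two-level sort on made-up data might look like the sketch below; the column names are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["P", "Q", "R", "S"],
    "Region": ["South", "North", "South", "North"],
    "Local Rank": [2, 2, 1, 1],
})  # made-up data

# Primary sort by Region, then by Local Rank within each region
ordered = df.sort_values(by=["Region", "Local Rank"])

print(list(ordered["Name"]))
print(list(df["Name"]))      # the original frame is unchanged
```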

In [None]:
sorted_rankings_arg_bra = ???

sorted_rankings_arg_bra.head()

In [None]:
assert(sorted_rankings_arg_bra.iloc[0]["Institution"] == "University of Buenos Aires")
assert(sorted_rankings_arg_bra.iloc[-1]["World Rank"] == 1997)

### Task 2.7 Create a new, simplified dataframe to track changes in rankings

As we have seen, universities that have featured in rankings of multiple years are featured repeatedly. To simplify comparisons, we want to feature each university once and remove all other metrics. 

This time, instead of simply ranking universities, we want to find the absolute change in universities' rankings between the years 2019-2020 and 2020-2021. We are only interested in the absolute change and not whether the rank improved or declined.

First, let's attempt to measure the change for one particular university.

**Hint**: The `abs` function can be used to find the absolute value.

#### Task 2.7a Find the absolute difference in World Rank for "University of Madras" between 2019-2020 and 2020-2021

Store the difference in a variable `absolute_diff_madras`

In [None]:
# First find the ranking of "University of Madras" in the year "2019-2020"
# Then find the ranking of "University of Madras" in the year "2020-2021"
# Remember to use .iloc[0] to extract the value
absolute_diff_madras = ???

In [None]:
assert(absolute_diff_madras == 108)

#### Task 2.7b Create a Series with the absolute difference in ranks for "University of Madras" between 2019-2020 and 2020-2021

First, create a dictionary with the keys as "Institution" and "Absolute Change". The values should be the relevant values for "University of Madras". Then, convert this dictionary to a Series called `madras_series`.
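
As a generic illustration with made-up keys and values, a dictionary converts to a Series like this:

```python
import pandas as pd

info = {"Name": "Example University", "Change": 5}   # made-up values
info_series = pd.Series(info)    # dictionary keys become the Series labels

print(info_series["Name"], info_series["Change"])
```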

In [None]:
assert(madras_series["Institution"] == "University of Madras")
assert(madras_series["Absolute Change"] == 108)

#### Task 2.7c Create the `change_in_rankings` dataframe

Now, create a dataframe `change_in_rankings` with just 2 columns, "Institution" and "Absolute Change" where each university is only featured once. For this task, we are interested in universities in all countries. If the institution is not present in the rankings of either year, we will ignore it.

The institutions should be sorted in increasing order of their absolute change. For institutions with the same absolute change, sort them alphabetically by their names.
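
The general pattern of building a DataFrame from a list of small dictionaries can be sketched on made-up data as follows; the names and numbers are assumptions, and the loop body in the actual task will be more involved.

```python
import pandas as pd

rows = []                                   # one small dictionary per item
for name, change in [("Beta", 3), ("Alpha", 3), ("Gamma", 1)]:  # made-up data
    rows.append({"Name": name, "Change": change})

table = pd.DataFrame(rows)                  # dictionary keys become columns
table = table.sort_values(by=["Change", "Name"])   # ties broken by name

print(table)
```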

Note: this cell may take a few seconds to run.

In [None]:
# Suggested Approach:
# Initialize an empty list
# For each institution,
    # create a new dataframe that has rankings for only this institution
    # (Hint: Use boolean indexing for the "Institution" column)
    
    # Create a list of years by casting the "Year" column of this dataframe to a list
    # check if "2019-2020" or "2020-2021" are *not* in this list
        #If so, skip this institution
        
    # Extract the World Rank for each year from the new dataframe 
    # Remember to use .iloc[0] to extract the actual value
    # Find their absolute difference
    
    
    # Make a mini dictionary where the keys are "Institution" and "Absolute Change"
    # and the values are the corresponding values you just found for this institution
    
    # Append this dictionary to the empty list initialized in the first step
# Finally, convert the list of dicts to a pandas dataframe called change_in_rankings
# Sort this dataframe using .sort_values() similar to Task 2.6

Test your code below.

In [None]:
assert(change_in_rankings.iloc[100]["Institution"] == "Vrije Universiteit Brussel")
assert(change_in_rankings.iloc[-1]["Absolute Change"] == 1081)
assert(change_in_rankings.shape[1] == 2)

## Segment 3: Lint

The p12 autograder introduces lint checks to detect bad coding style. 
"Lint" refers to bad code that is not necessarily buggy (though "bad" coding style often leads to bugs).  A linter helps warn you about common issues. If you are interested in finding out about the origins of this term, check out the [Wikipedia page](https://en.wikipedia.org/wiki/Lint_(software)).

For project p12, we're adding a linter as part of `test.py`. It will notify you of code that is bad style, deducting 1% per issue (for a max of a 10% penalty).  

### Task 3.1 Install the pylint module

For the linter to run properly, install the `pylint` module by running this command in your terminal.

`pip install pylint`

Verify that the installation worked by simply running the `pylint` command in your terminal. You should see text explaining the various `pylint` options available. If you see a `command not found` error, ask a TA!

### Task 3.2 Run the pylint module

In a new notebook (e.g., named `lint_nb.ipynb`), paste the following code and save the notebook.

In [None]:
def abs(list):
    # Objective: return a new list, which contains absolute values of 
    #            items from the original list
    list = list[:] # copy it
    for i in range(len(list)):
        if list[i] < 0:
            list[i] = -list[i]
    return list

abs([-1, -3, 5, -4, 8])

Now open your terminal (Windows: PowerShell, Mac: Terminal), navigate to the directory you are working in (the folder that contains `lint_nb.ipynb` and `lint.py`), and run the linter:

`python lint.py -v lint_nb.ipynb`

The command above assumes your code is in a notebook called `lint_nb.ipynb`. If you want to test some other code you've written in a different notebook, simply substitute `lint_nb.ipynb` with the name of your notebook (e.g. `main.ipynb`)

Consider why the linter is complaining, then write a better version of the function to make the linter happy. Recall that any word with green syntax highlighting in Jupyter Notebook is a Python keyword or a built-in name. You should not use such words as variable names or function names.
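
The sketch below uses a different, made-up function to show the kind of naming problem involved. Both versions return the same result, but the first reuses built-in names; stock `pylint` reports this as `redefined-builtin`, and it is an assumption here that the course's `lint.py` surfaces it the same way.

```python
def total_bad(list):          # 'list' hides the built-in type inside this function
    sum = 0                   # 'sum' hides the built-in function
    for item in list:
        sum += item
    return sum

def total_good(values):       # descriptive names that shadow nothing
    result = 0
    for value in values:
        result += value
    return result

print(total_bad([1, 2, 3]), total_good([1, 2, 3]))
```

Shadowing is easy to miss because the code still runs; the trouble starts later, when something in the same scope needs the original built-in.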

You can find extensive documentation for the file lint.py [here](https://github.com/msyamkumar/cs220-s22-projects/tree/main/linter). If you find the linter confusing, please read the full documentation there!

## Segment 4: BeautifulSoup

As mentioned in Segment 2, the `rankings.json` file is created by parsing HTML content on the Web, specifically the web pages listed below.
* https://raw.githubusercontent.com/msyamkumar/cs220-s22-projects/main/p12/2019-2020.html
* https://raw.githubusercontent.com/msyamkumar/cs220-s22-projects/main/p12/2020-2021.html
* https://raw.githubusercontent.com/msyamkumar/cs220-s22-projects/main/p12/2021-2022.html

Now, let's write a function to do this ourselves. We will use the `BeautifulSoup` library to scrape web pages and extract information.

### Task 4.1 Download the HTML files
Use the `download` function you previously created to download the contents of each of the URLs above and save them into files. Name the files `2019-2020.html`, `2020-2021.html` and `2021-2022.html` based on the respective URL.

In [None]:
# Your code here

### Task 4.2 Read `2019-2020.html` content into a variable

**Note:** If you get a `UnicodeDecodeError`, make sure all your calls to `open()` have the keyword argument `encoding="utf-8"`. Delete the downloaded files and run the cell above again.

In [None]:
# Your code here

### Task 4.3 Initialize BeautifulSoup object instance

Use the variable defined in Task 4.2. 

In [None]:
# Your code here

### Task 4.4 Find the table element

The webpage has a table containing all the data we're trying to extract. Write the code to find this element and store it in a variable. Use the BeautifulSoup object instance defined in Task 4.3.

In [None]:
# Write your code here

### Task 4.5 Find all th tags, to parse the table header

Use the variable defined in Task 4.4. Save your answer to a variable named `header` in order to pass the asserts.

**Hint**: The header should be a list of strings, obtained by calling the `get_text()` method on each `th` element in the table. A list comprehension may be useful here.
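
A minimal sketch of the idea on a tiny made-up HTML string (not one of the lab files), assuming Python's built-in `html.parser` is used as the parser:

```python
from bs4 import BeautifulSoup

html = """
<table>
  <tr><th>Name</th><th>Score</th></tr>
  <tr><td>Alpha</td><td>90.5</td></tr>
</table>
"""  # made-up page, not one of the lab files

soup = BeautifulSoup(html, "html.parser")
table = soup.find("table")
column_names = [th.get_text() for th in table.find_all("th")]

print(column_names)
```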

In [None]:
# Write your code here

In [None]:
assert(len(header) == 9)
assert(type(header) == list)
assert(header[0] == "World Rank")
assert(header[-1] == "Score")

Great work! The next tasks are optional. You may choose to skip them and start the project! You can revisit this section when you are solving the relevant portion of P12.

### Task 4.6 (Optional) Build row dictionary for one row

Scrape the second row (the first one is the header!), convert data to the appropriate types, and populate the data into a row dictionary. The keys of the dictionary are the columns in the dataframe. Avoid hardcoding these keys - instead, use the variable obtained in the previous task.

**Hint**: Rows can be found by locating the `tr` elements in the table.

- "World Rank", "National Rank", "Quality of Education Rank", "Alumni Employment Rank", "Quality of Faculty Rank", "Research Performance Rank": `int` conversion
- "Score"  : `float` conversion

You can compare your parsing output to the contents of the `rankings.json` file to confirm your result.
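
One way to pair header names with cell text and convert types is sketched below in plain Python. The header, the cell values, and the split into `int_columns` and `float_columns` are all made up for illustration; in the task, the cell text would come from the `td` elements of a row.

```python
column_names = ["Name", "Local Rank", "Score"]      # hypothetical header
cells = ["Alpha", "3", "90.5"]                      # text scraped from one row

int_columns = ["Local Rank"]     # assumed split of columns by type
float_columns = ["Score"]

row = {}
for key, text in zip(column_names, cells):
    if key in int_columns:
        row[key] = int(text)
    elif key in float_columns:
        row[key] = float(text)
    else:
        row[key] = text          # everything else stays a string

print(row)
```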


In [None]:
# Write your code here

### Task 4.7 (Optional) Build list of all row dictionaries

Scrape all rows, convert the data to the appropriate types, populate each row's data into a row dictionary, and append the row dictionaries to a list.

This is a natural extension of Task 4.6. You can use a loop to extract all rows and populate the list.

**Important**:
* Some fields in the dataset have missing values, represented simply as `-`.
* The "Year" value isn't present in the dataset. Think of a different way to populate this field.
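
For the missing values, a small conversion helper is one option. The sketch below uses the hypothetical name `to_int_or_none` and assumes a missing value should become `None`; check `rankings.json` to see how missing values are actually stored before relying on that choice.

```python
def to_int_or_none(text):
    """Convert cell text to int, treating a lone '-' as a missing value."""
    text = text.strip()
    if text == "-":
        return None       # assumption: missing values are stored as None
    return int(text)

print(to_int_or_none("42"), to_int_or_none("-"), to_int_or_none(" 7 "))
```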

In [None]:
# Write your code here

### Task 4.8 (Optional) Write the parse_html function

Combine your code from Tasks 4.2 to 4.7 into a function. The function should take a `filename` as input and return a list of dictionaries, each dictionary representing a row in the dataset.

In [None]:
def parse_html(filename):
    '''This function parses an HTML file and returns a list of dictionaries containing the tabular data'''
    #TODO: Write your code here
    pass
    

Finally, test your code below.

In [None]:
assert(parse_html("2019-2020.html")[-1]["Institution"] == 'Government College University Faisalabad')
assert(parse_html("2020-2021.html")[15]["Score"] == 89.0)
assert(parse_html("2021-2022.html")[100]["Country"] == 'United Kingdom')
assert(parse_html("2021-2022.html")[25]["World Rank"] == 26)
assert(parse_html("2020-2021.html")[-5]["National Rank"] == 15)
assert(parse_html("2019-2020.html")[50]["Quality of Faculty Rank"] == 78)
assert(parse_html("2021-2022.html")[87]["Alumni Employment Rank"] == 464)
assert(parse_html("2020-2021.html")[40]["Research Performance Rank"] == 398)
assert(parse_html("2019-2020.html")[0]["Year"] == "2019-2020")

### Congratulations, you are now ready to start p12!