# Pandas II

In this notebook, we will build on what we've learned so far with pandas. We will also introduce the **requests** library, which allows us to fetch data from a webpage.

## Reading Data from a Webpage

We can also use pandas to grab tabular data from webpages. To do this, we first import the **requests** library so we can make an HTTP request to fetch a webpage.

Before we make the web request, navigate to the webpage by clicking on the link below so that you can see what the webpage looks like.

https://worldpopulationreview.com/country-rankings/coffee-producing-countries

You should see a page that includes a table of countries with their coffee production.

Before we make the HTTP request, we need to set the HTTP headers to include a User-Agent, which tells the server what kind of client the request is coming from. Without it, some websites may block the request.



In [None]:
# Import the requests library for making http requests
import requests as r
# Create a header that says the request is coming from a browser-like agent.
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36',
}

In [None]:
# Make an http request to get the webpage with the following url
url = "https://worldpopulationreview.com/country-rankings/coffee-producing-countries"
page = r.get(url, headers = headers)

In [None]:
# Check that the request was successful. If so, the status code should be 200.
page.status_code

In [None]:
# Print the first few hundred characters of the webpage
# The actual content of the page, which is HTML, is stored in the 'content' attribute of the page object
print(page.content[0:200])

Now we will use the pandas read_html() function to extract the tables from the page's HTML.

In [None]:
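# Import pandas, plus StringIO to wrap the HTML text as a file-like object
from io import StringIO
import pandas as pd
# read_html() will find all the tables in the webpage and put them in a list.
tables = pd.read_html(StringIO(page.text))
# In this case, there is only one table, so the length of the list should be 1.
len(tables)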

Each table in the HTML page is converted into a dataframe in the tables list. As there is only one table, let's store it in a separate variable called cd (for coffee data).

In [None]:
# Store the one table as a dataframe in the variable cd
cd = tables[0]
# Check the type of this variable. It should say 'pandas.core.frame.DataFrame'
type(cd)

In [None]:
# Let's look at the table
cd

## Data Wrangling

We are now going to do some data 'wrangling' on the dataframe. Basically, we are going to fix some issues with the data as it is presented, to make it easier to analyze.

You may notice there are some issues with these data:

1. There is a column full of 'NaN', which we should delete.
2. Some of the column titles contain special characters which we don't need, so we can rename the columns.
3. The coffee production column has a mixture of numbers and letters: 3.2M means 3.2 million (tonnes) and 794.8K means 794.8 thousand (tonnes). We need to replace these with consistent numbers.

Firstly, we can change the column names.




In [None]:
# Rename the columns
cd.columns = ['Not used', 'Country', '2022 Production (tonnes)', '2022 Yield (kg/hectare)']
cd

In [None]:
# Drop the first column ('Not used'), which contains only NaN values
cd = cd.drop(columns=['Not used'])
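
As an alternative to naming the column, pandas can drop every column that contains only NaN. The sketch below uses a small made-up dataframe (the `demo` and `trimmed` names and the placeholder country values are assumptions for illustration, not the scraped table) to show how `dropna()` behaves with `how="all"`.

```python
import numpy as np
import pandas as pd

# A small made-up table standing in for the scraped one (illustrative values only).
demo = pd.DataFrame({
    "Not used": [np.nan, np.nan],
    "Country": ["A", "B"],
    "Production": ["3.2M", "794.8K"],
})

# Drop every column whose values are all NaN; the other columns are kept.
trimmed = demo.dropna(axis="columns", how="all")
print(list(trimmed.columns))
```

This approach avoids depending on the column's name, but it would also remove any other column that happens to be entirely empty, so it is worth checking the result.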

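In [None]:
# Check the shape (rows, columns) of the dataframe
cd.shape

In [None]:
# Show up to 79 rows so the whole table is displayed
pd.set_option('display.max_rows', 79)

In [None]:
# Display the full table
cd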

In [None]:
# Check the type of each value in the production column
for val in cd['2022 Production (tonnes)']:
  print(type(val))
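
Because the production values are strings such as '3.2M' and '794.8K', they need to be converted to numbers before they can be analyzed. One possible way to do this is sketched below in plain Python. The helper name `parse_quantity` and the `SUFFIX_MULTIPLIERS` table are hypothetical, and the sketch assumes that K and M are the only suffixes that appear in the column.

```python
# Multipliers for the suffixes described above (assumed: K = thousand, M = million).
SUFFIX_MULTIPLIERS = {"K": 1_000, "M": 1_000_000}


def parse_quantity(text):
    """Convert a string such as '3.2M' or '794.8K' to a float."""
    # Remove surrounding whitespace and any thousands separators.
    cleaned = str(text).strip().replace(",", "")
    suffix = cleaned[-1:].upper()
    if suffix in SUFFIX_MULTIPLIERS:
        # Scale the numeric part by the multiplier for its suffix.
        return float(cleaned[:-1]) * SUFFIX_MULTIPLIERS[suffix]
    # No recognized suffix: treat the whole string as a plain number.
    return float(cleaned)


examples = ["3.2M", "794.8K", "512"]
converted = [parse_quantity(value) for value in examples]
print(converted)
```

A function like this could then be applied to the dataframe column with `cd['2022 Production (tonnes)'].map(parse_quantity)`. Values that are missing or use a different format would raise a ValueError, so they would need to be handled separately.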
