
# Pandas

Pandas is a Python library that provides a useful data structure called a 'dataframe', along with associated functions that help you manipulate and analyze data.
Dataframes are well suited to storing and manipulating data in tabular form, with rows and columns.

Let's start by importing the pandas library.
We will then create a dictionary containing information about a few people and use pandas to convert it to a dataframe.

In [2]:
import pandas as pd

my_dict = { "Name": ["Bob", "Charlie", "Elise"], "Age": [32, 47, 25], "Height (ft)": [5.9, 6.0, 5.2]}

df = pd.DataFrame(my_dict)

df.head()

|   | Name | Age | Height (ft) |
|---|---|---|---|
| 0 | Bob | 32 | 5.9 |
| 1 | Charlie | 47 | 6.0 |
| 2 | Elise | 25 | 5.2 |


## DataFrame Operations

Now that we have the data stored in a dataframe, we can perform a number of operations.

You can reference a column by using the column header.



In [3]:
# Reference the column 'Age'
print(df['Age'])

0    32
1    47
2    25
Name: Age, dtype: int64
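
The same bracket syntax extends to more than one column and to filtering rows. The sketch below rebuilds the example dataframe so that it runs on its own. The variable names `name_and_age` and `over_30` are illustrative, not part of the original notebook.

```python
import pandas as pd

# Rebuild the example dataframe so this cell is self-contained
df = pd.DataFrame({
    "Name": ["Bob", "Charlie", "Elise"],
    "Age": [32, 47, 25],
    "Height (ft)": [5.9, 6.0, 5.2],
})

# A list of headers selects several columns and returns a dataframe
name_and_age = df[["Name", "Age"]]

# A boolean condition on a column keeps only the rows where it is True
over_30 = df[df["Age"] > 30]

print(name_and_age)
print(over_30)
```

A single header returns a Series (one column), while a list of headers returns a dataframe, even if the list holds only one name.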


You can reference a particular table value with the **iloc** indexer by giving the integer positions of the row and column, both counted from 0.

In [4]:
# Print the entry in first row (row 0) and second column (column 1) of the data frame
print(df.iloc[0, 1])

32


In [5]:
# You can also use iloc to get a whole row
print(df.iloc[1])

Name           Charlie
Age                 47
Height (ft)        6.0
Name: 1, dtype: object
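
As a small sketch of how position-based indexing generalizes, `iloc` also accepts slices, and the related `loc` indexer selects by label instead of by position. The dataframe is rebuilt here so the cell runs on its own, and the variable names are illustrative.

```python
import pandas as pd

# Rebuild the example dataframe so this cell is self-contained
df = pd.DataFrame({
    "Name": ["Bob", "Charlie", "Elise"],
    "Age": [32, 47, 25],
    "Height (ft)": [5.9, 6.0, 5.2],
})

# Slice by position: rows 0 and 1 (the end of the slice is excluded)
first_two = df.iloc[0:2]

# Every row, second column (position 1), returned as a Series
ages_by_position = df.iloc[:, 1]

# loc uses labels: row label 1 and column label "Age"
charlie_age = df.loc[1, "Age"]

print(first_two)
print(ages_by_position)
print(charlie_age)
```

In this dataframe the row labels happen to equal the row positions, so `loc` and `iloc` agree on rows. That stops being true once rows are filtered or reordered.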


In [6]:
# The describe method gives some summary information about the data
df.describe()

|   | Age | Height (ft) |
|---|---|---|
| count | 3.0 | 3.0 |
| mean | 34.666667 | 5.7 |
| std | 11.23981 | 0.43589 |
| min | 25.0 | 5.2 |
| 25% | 28.5 | 5.55 |
| 50% | 32.0 | 5.9 |
| 75% | 39.5 | 5.95 |
| max | 47.0 | 6.0 |
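
Each statistic reported by `describe()` can also be computed on its own for a single column. The following is a minimal sketch that rebuilds the example dataframe and recomputes a few of the values. The variable names are illustrative.

```python
import pandas as pd

# Rebuild the example dataframe so this cell is self-contained
df = pd.DataFrame({
    "Name": ["Bob", "Charlie", "Elise"],
    "Age": [32, 47, 25],
    "Height (ft)": [5.9, 6.0, 5.2],
})

# Mean of the Age column
mean_age = df["Age"].mean()

# Sample standard deviation (divides by n - 1), the same convention describe() uses
std_age = df["Age"].std()

# The median corresponds to the 50% row of describe()
median_height = df["Height (ft)"].median()

print(mean_age, std_age, median_height)
```

Note that `describe()` skips the `Name` column here because, by default, it summarizes only the numeric columns of a mixed dataframe.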


In [8]:
# The shape property tells you how many rows and columns are in the data
df.shape

(3, 3)
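
Because `shape` is an ordinary tuple of (rows, columns), it can be unpacked into two variables. A short sketch, with the dataframe rebuilt so the cell runs on its own and with illustrative variable names:

```python
import pandas as pd

# Rebuild the example dataframe so this cell is self-contained
df = pd.DataFrame({
    "Name": ["Bob", "Charlie", "Elise"],
    "Age": [32, 47, 25],
    "Height (ft)": [5.9, 6.0, 5.2],
})

# Unpack the (rows, columns) tuple
n_rows, n_cols = df.shape

# len() of a dataframe is its number of rows
row_count = len(df)

# The column headers are available through the columns attribute
column_names = list(df.columns)

print(n_rows, n_cols, row_count, column_names)
```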

## Reading Data from a Webpage

We can also use pandas to grab tabular data from webpages. To do this, we first have to import the **requests** library so we can make an HTTP request to fetch a webpage.

Before we make the web request, navigate to the webpage by clicking on the link below so that you can see what the webpage looks like.

https://worldpopulationreview.com/country-rankings/coffee-producing-countries

You should see a page that includes a table of countries with their coffee production.

Before we make the HTTP request, we need to set a User-Agent header that identifies the client making the request. Without it, some websites block the request.



In [11]:
# Import the requests library for making http requests
import requests as r
# Create a header that says the request is coming from a browser-like agent.
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36',
}

In [12]:
# Make an http request to get the webpage with the following url
url = "https://worldpopulationreview.com/country-rankings/coffee-producing-countries"
page = r.get(url, headers=headers)

In [13]:
# Check that the request was successful. If so, the status code should be 200.
page.status_code

200
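
Instead of inspecting the status code by eye, the check can be made part of the code. The sketch below wraps the request in a helper. The function name `fetch_html` and the 10-second timeout are assumptions made for illustration, not part of the original notebook.

```python
import requests


def fetch_html(url, headers=None, timeout=10):
    """Fetch a page and return its body as text (hypothetical helper)."""
    # The timeout stops the call from waiting forever on an unresponsive server
    response = requests.get(url, headers=headers, timeout=timeout)
    # raise_for_status() raises requests.HTTPError for 4xx and 5xx responses
    response.raise_for_status()
    return response.text
```

With the `url` and `headers` variables defined earlier, a call such as `html = fetch_html(url, headers=headers)` would either return the page text or raise an exception describing the failure.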

In [24]:
# Print the first few hundred characters of the webpage
# The actual content of the page, which is HTML, is stored in the 'content' attribute of the page object
print(page.content[0:200])

b'<!DOCTYPE html><html lang="en"> <head><meta charset="utf-8"><meta name="viewport" content="width=device-width, initial-scale=1"><title>Coffee Producing Countries 2025</title><meta name="description" c'


Now we will use the pandas read_html() function to extract the table from the page.

In [29]:
# read_html() will find all the tables in the webpage and put them in a list.
tables = pd.read_html(page.content)
# In this case, there is only one table, so the length of the list should be 1.
len(tables)

Each table in the HTML page is converted into a dataframe in the tables list. As there is only one table, let's store it in a separate variable called cd (for coffee data).
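
To see the one-dataframe-per-table behavior without a network connection, here is a minimal sketch that parses a small made-up HTML snippet containing two tables. The HTML and its values are invented for illustration. The string is wrapped in `StringIO` because recent pandas releases deprecate passing literal HTML strings directly to `read_html()`. As in the rest of this section, an HTML parser such as lxml must be installed.

```python
from io import StringIO

import pandas as pd

# Made-up HTML with two tables (illustrative data only)
html = """
<table>
  <thead><tr><th>Country</th><th>Cups</th></tr></thead>
  <tbody>
    <tr><td>A</td><td>1</td></tr>
    <tr><td>B</td><td>2</td></tr>
  </tbody>
</table>
<table>
  <thead><tr><th>City</th></tr></thead>
  <tbody><tr><td>X</td></tr></tbody>
</table>
"""

# read_html() returns a list with one dataframe per <table> element
example_tables = pd.read_html(StringIO(html))

print(len(example_tables))
print(example_tables[0])
```

The list is ordered as the tables appear in the page, so `example_tables[0]` is the first table and `example_tables[1]` the second.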

In [32]:
# Store the one table as a dataframe in the variable cd
cd = tables[0]
# Check the type of this variable. It should say 'pandas.core.frame.DataFrame'
type(cd)

In [33]:
# Let's look at the table
cd

|   | Unnamed: 0 | Country | Coffee Production 2022 (t)↓ | Coffee Yield 2022 |
|---|---|---|---|---|
| 0 |  | Brazil | 3.2M | 1694.3 |
| 1 |  | Vietnam | 2M | 2979.0 |
| 2 |  | Indonesia | 794.8K | 618.1 |
| 3 |  | Colombia | 665K | 789.4 |
| 4 |  | Ethiopia | 496.2K | 668.9 |
| ... | ... | ... | ... | ... |
| 74 |  | Sao Tome and Principe | 8 | 58.5 |
| 75 |  | Suriname | 6 | 20.4 |
| 76 |  | New Caledonia | 2 | 137.5 |
| 77 |  | Cook Islands | 0 | 587.3 |
