# 02 Pandas Reading Excel (data.europe.eu)

Content:

* Loading a (large) Excel file, 75 MB
* Basic inspection


----

Note:

* Requires additional libraries to load Excel files, e.g. [xlrd](https://github.com/python-excel/xlrd) or [openpyxl](https://openpyxl.readthedocs.io/en/stable/).

In [1]:
!tree -sh data/E-PRTR_database_v13

[4.0K]  data/E-PRTR_database_v13
├── [ 75M]  Pollutant releases.xlsx
├── [7.6M]  Pollutant transfers.xlsx
└── [ 94M]  Waste transfers.xlsx

0 directories, 3 files


In [2]:
import pandas as pd

Reading 75 MB of Excel data can take a while (a few minutes), even with Pandas.

In [None]:
df = pd.read_excel("data/E-PRTR_database_v13/Pollutant releases.xlsx")
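
Because the load is slow, it can pay off to cache the parsed frame on disk and reuse it in later sessions. The following is a minimal sketch, assuming a pickle file is an acceptable cache format; `load_cached`, `slow_loader` and the cache path are hypothetical names, and the toy loader stands in for the actual `pd.read_excel` call.

```python
import os
import tempfile

import pandas as pd


def load_cached(loader, cache_path):
    """Return the cached DataFrame if it exists, else call loader() and cache the result."""
    if os.path.exists(cache_path):
        return pd.read_pickle(cache_path)
    frame = loader()
    frame.to_pickle(cache_path)
    return frame


calls = []  # records how often the slow loader actually runs


def slow_loader():
    # Toy stand-in for the slow pd.read_excel(...) call (not the real data).
    calls.append(1)
    return pd.DataFrame({"CountryName": ["Germany", "France"], "City": ["Bonn", "Paris"]})


cache_path = os.path.join(tempfile.mkdtemp(), "releases.pkl")
first = load_cached(slow_loader, cache_path)   # runs the loader, writes the cache
second = load_cached(slow_loader, cache_path)  # served from the pickle file
print(len(calls))
```

With the real file, the loader would be something like `lambda: pd.read_excel("data/E-PRTR_database_v13/Pollutant releases.xlsx")`.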

## Basic inspection

Key takeaways:

* Pandas has inferred the types of various columns (`dtypes: bool(1), float64(3), int64(2), object(16)`)

A few inspection functions:

* `df.info()`
* `df.describe()`
* `df["columnname"].unique()`

Real-world data is often incomplete, with missing values. That is why `df.info()` is useful for a first glance at the data quality: it lists the non-null count for every column.

In [None]:
df.info()
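
To quantify the gaps that `df.info()` hints at, the missing values can be counted per column. This is a small sketch on a toy frame, not the E-PRTR data; the `Quantity` column is a hypothetical name used only for illustration.

```python
import pandas as pd

# Toy frame with some missing values (hypothetical data).
toy = pd.DataFrame({
    "CountryName": ["Germany", "France", None],
    "City": ["Bonn", None, None],
    "Quantity": [1.5, 2.0, 3.0],
})

# Number of missing values per column.
missing_per_column = toy.isna().sum()

# Fraction of missing values per column.
missing_share = toy.isna().mean()

print(missing_per_column)
print(missing_share)
```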

Then, to get an impression of the content, look at the column names, which may or may not be self-explanatory.

In [None]:
df.columns

With the various selection techniques, it is quite easy to just peek into the data and get more impressions of values and shape.

In [None]:
df.CountryName[:10]

Unique values are often of interest as well.

In [None]:
df.CountryName.unique()

In [None]:
df.City.unique()

In [None]:
len(df.City.unique())
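
Counting unique values can also be done with `nunique()`. One difference to keep in mind: `unique()` includes a missing value in its result, while `nunique()` drops missing values by default. A short sketch on a toy series (hypothetical data):

```python
import pandas as pd

# Toy series with one duplicate and one missing value.
city = pd.Series(["Bonn", "Berlin", "Bonn", None], name="City")

n_with_missing = len(city.unique())       # the missing value counts as its own entry
n_without_missing = city.nunique()        # missing values are dropped by default
n_explicit = city.nunique(dropna=False)   # opt in to counting the missing value

print(n_with_missing, n_without_missing, n_explicit)
```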

Most of the time, unique values *per column* are what is relevant. If the unique values should be determined
across multiple columns, a small workaround is needed: first concatenate the columns and then run `unique` on the result.

In [None]:
pd.concat([df['City'], df['CountryName']]).unique()

In [None]:
len(pd.concat([df['City'], df['CountryName']]).unique())

The result is almost, but not exactly, equal to the sum of the per-column unique counts.

In [None]:
len(df.City.unique()) + len(df.CountryName.unique())

This means that there is an overlap between city and country names. The interoperability of Pandas, NumPy and Python makes it quite simple to find the overlap with the built-in `set` data type.

In [None]:
set(df.City.unique()) & set(df.CountryName.unique())
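
The relation between the three counts can be checked on a small example: the combined count is the sum of the per-column counts minus the size of the overlap. A minimal sketch with toy data (no missing values, which would complicate the arithmetic):

```python
import pandas as pd

# Toy frame in which one name appears both as a city and as a country.
toy = pd.DataFrame({
    "City": ["Bonn", "Luxembourg", "Paris", "Bonn"],
    "CountryName": ["Germany", "Luxembourg", "France", "Germany"],
})

n_city = toy["City"].nunique()
n_country = toy["CountryName"].nunique()
n_combined = pd.concat([toy["City"], toy["CountryName"]]).nunique()

# Names that occur in both columns.
overlap = set(toy["City"]) & set(toy["CountryName"])

print(n_city, n_country, n_combined, overlap)
```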

## Grouping

* Grouping is one of the most common operations to perform on data.
* The `df.groupby` function is lazily evaluated ("call-by-need"): it returns a `DataFrameGroupBy` object.
* Aggregations, like `size`, are realised as chained calls.

One useful idiom is the `df.groupby("columnname").size().sort_values()` expression, which ranks values by frequency.


In [None]:
df.groupby("CountryName")

In [None]:
df.groupby("CountryName").size().sort_values(ascending=False)
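
For a single column, `value_counts()` is a common shorthand for the same frequency ranking. The sketch below compares both on toy data (hypothetical values, chosen without ties so the order is unambiguous):

```python
import pandas as pd

toy = pd.DataFrame({
    "CountryName": ["Germany", "France", "Germany", "Spain", "Germany", "France"],
})

# Group, count rows per group, then sort by count.
by_group = toy.groupby("CountryName").size().sort_values(ascending=False)

# value_counts() counts and sorts in descending order in one step.
by_counts = toy["CountryName"].value_counts()

print(by_group)
print(by_counts)
```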

## How many pollutants are listed?

* And which one is the most frequent?

In [None]:
len(df["PollutantName"].unique())

In [None]:
df.groupby("PollutantName").size().sort_values(ascending=False)

## Which pollutant groups are listed?

In [None]:
df.PollutantGroupName.unique()

## The medium of release for the pollutant has fewer distinct values.

* Air
* Water
* Soil

In [None]:
df.ReleaseMediumName.unique()

In [None]:
df.groupby(df.ReleaseMediumName).size().sort_values(ascending=False)

## Grouping by more than one attribute.

Grouping by more than one attribute is supported by passing a list of columns to `df.groupby`.

Are there pollutants that are released through more than one medium?

In [None]:
nm = df.groupby([df.PollutantName, df.ReleaseMediumName]).size()

The result is a series with a hierarchical index.

> The MultiIndex object is the hierarchical analogue of the standard Index object which typically stores the axis labels in pandas objects. You can think of MultiIndex as an array of tuples where each tuple is unique.

From: https://pandas.pydata.org/pandas-docs/stable/advanced.html

In [None]:
nm.head(20)

In [None]:
type(nm), type(nm.index)

In [None]:
nm["Aldrin"]

In [None]:
nm["Aldrin", "Air"]
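
One possible way to answer the question of which pollutants are released through more than one medium is to group the resulting series once more, this time by its first index level, and count the entries per pollutant. A sketch on toy data (the rows are made up, `nm_toy` is a stand-in for `nm`):

```python
import pandas as pd

toy = pd.DataFrame({
    "PollutantName": ["Aldrin", "Aldrin", "Anthracene", "Anthracene", "Anthracene", "Methane"],
    "ReleaseMediumName": ["Air", "Water", "Water", "Water", "Air", "Air"],
})

# Same construction as nm: row counts per (pollutant, medium) pair.
nm_toy = toy.groupby(["PollutantName", "ReleaseMediumName"]).size()

# Number of distinct media per pollutant: one index entry per (pollutant, medium) pair.
media_per_pollutant = nm_toy.groupby(level="PollutantName").size()

# Pollutants that occur with more than one medium.
multi_medium = media_per_pollutant[media_per_pollutant > 1]
print(multi_medium)
```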

To sort an index on various levels, we can use `sort_index`.

In [None]:
nm.sort_index(level=0, ascending=True).head()

In [None]:
nm.sort_index(level=1, ascending=True)

Here we sort by the second level (using `level=1`). The result is still a series; the display just looks more DataFrame-like.

In [None]:
by_medium = nm.sort_index(level=1, ascending=True)

In [None]:
by_medium.head()
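
If an actual DataFrame is preferred, with one row per pollutant and one column per medium, the inner index level can be pivoted into columns with `unstack`. A sketch on toy data (made-up rows, `nm_toy` is a stand-in for `nm`):

```python
import pandas as pd

toy = pd.DataFrame({
    "PollutantName": ["Aldrin", "Aldrin", "Anthracene", "Anthracene", "Anthracene", "Methane"],
    "ReleaseMediumName": ["Air", "Water", "Water", "Water", "Air", "Air"],
})
nm_toy = toy.groupby(["PollutantName", "ReleaseMediumName"]).size()

# Move the medium level into the columns; combinations that do not occur become 0.
table = nm_toy.unstack(level="ReleaseMediumName", fill_value=0)
print(table)
```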

Access by chaining.

In [None]:
by_medium["Aldrin"]["Water"]

In [None]:
nm

The idea that the index entries behave like tuples can be observed here as well.

In [None]:
nm[("Anthracene",)]

In [None]:
nm[("Anthracene", "Water")]

## Masking on hierarchical index.

To filter all levels, we can use boolean indexing as usual.

In [None]:
nm[nm > 100]

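To restrict the result to a single value of one index level, we can additionally select on that level, e.g. `ReleaseMediumName`. This results in a series that contains all pollutants released through water that have more than 1000 entries (the values of `nm` are group sizes, not measured quantities).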

In [None]:
nm[nm > 1000][:, "Water"]

Another way would be to use `get_level_values`.

> Return an Index of values for requested level, equal to the length of the index.

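This matters because a boolean mask must have the same length as the series it is applied to; `get_level_values` returns one value per index entry, so the comparison yields a mask of the right length.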

In [None]:
nm[(nm.index.get_level_values("ReleaseMediumName") == "Water") & (nm > 1000)]
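
A further option is the cross-section method `xs`, which selects one value of a named level. Unlike the mask, it drops the selected level from the result by default. The sketch below contrasts both on a toy series (the counts are made up):

```python
import pandas as pd

idx = pd.MultiIndex.from_tuples(
    [("Aldrin", "Air"), ("Aldrin", "Water"), ("Anthracene", "Water"), ("Methane", "Air")],
    names=["PollutantName", "ReleaseMediumName"],
)
nm_toy = pd.Series([5, 1200, 3000, 40], index=idx)

# Mask approach: one boolean per index entry, both index levels are kept.
mask = (nm_toy.index.get_level_values("ReleaseMediumName") == "Water") & (nm_toy > 1000)
masked = nm_toy[mask]

# Cross-section approach: select the level value first, then filter by count.
water = nm_toy.xs("Water", level="ReleaseMediumName")
cross = water[water > 1000]

print(masked)
print(cross)
```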

## Looking at a specific location

How many German cities are contained in the dataset?

In [None]:
df[df.CountryName == "Germany"].City.unique().size

How many measurements per city in Germany?

In [None]:
df[df.CountryName == "Germany"].groupby("City").size().sort_values(ascending=False)

Is a certain city contained?

In [None]:
"Bonn" in df[df.CountryName == "Germany"].City.values

How many entries for given city?

In [None]:
len(df[df.City == "Bonn"])
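
When counting entries, note that `size` on a DataFrame returns the number of cells (rows times columns), whereas `len` returns the number of rows. A short sketch on toy data (hypothetical facility names):

```python
import pandas as pd

toy = pd.DataFrame({
    "City": ["Bonn", "Bonn", "Berlin"],
    "FacilityName": ["A", "B", "C"],
})
bonn = toy[toy.City == "Bonn"]

n_cells = bonn.size   # rows * columns
n_rows = len(bonn)    # number of rows, i.e. entries

print(n_cells, n_rows, bonn.shape)
```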

Selecting the `FacilityName` and `StreetName` columns for entries in a given city.

In [None]:
df[df.City == "Bonn"][["FacilityName", "StreetName"]].head(20)

Facilities with most entries for a given city.

In [None]:
df[df.City == "Bonn"].groupby("FacilityName").size().sort_values(ascending=False)

Sort by pollutant groups.

In [None]:
df[df.City == "Bonn"].groupby(["FacilityName", "PollutantGroupName"]).size().sort_index(level=1)

Sort by values.

In [None]:
df[df.City == "Bonn"].groupby(["FacilityName", "PollutantGroupName"]).size().sort_values(ascending=False)