<a target="_blank" href="https://mybinder.org/v2/gh/joshmaglione/CS3101-Notes/HEAD?labpath=Notes%2Fnotebooks%2F09_panda_bears.ipynb">
  <img src="https://mybinder.org/badge_logo.svg" alt="Binder"/>
</a> 
<a target="_blank" href="https://colab.research.google.com/github/joshmaglione/CS3101-Notes/blob/main/Notes/notebooks/09_panda_bears.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a> <a target="_blank" href="https://github.com/joshmaglione/CS3101-Notes/blob/main/Notes/notebooks/09_panda_bears.ipynb">View on GitHub</a>

# Pandas: working with data

![](../imgs/Pandas.png)

The package `pandas` is *ubiquitous* in data science and machine learning.

It's built on top of `NumPy` and is simple to get started with.

It has lots of uses when it comes to data analysis.

We won't explore the depths, but we will get a feeling for how deep and wide it gets.

In [None]:
import pandas as pd

The main object is the `pandas` DataFrame.

You can think of this like a table or a spreadsheet.

(This would make `pandas` like Excel.)
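To make the table picture concrete, here is a minimal sketch that builds a DataFrame by hand. The column names and values are made up for illustration and have nothing to do with the UN data.

```python
import pandas as pd

# Hypothetical toy data: each dictionary key becomes a column,
# and each list holds that column's values, one per row.
table = pd.DataFrame({
    "Item": ["apple", "pear", "plum"],
    "Quantity": [3, 5, 2],
})

print(table)                   # displayed like a small spreadsheet
print(table.shape)             # (number of rows, number of columns)
print(table["Quantity"].sum()) # a single column behaves like a list of values
```

Selecting one column, as in `table["Quantity"]`, gives a `pandas` Series, which is the one-dimensional cousin of the DataFrame.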

## Understanding Ireland's renewable energy

The UN has gathered data on a number of countries. 

One of its data sets records the [share of renewable energy as a percentage](https://unstats.un.org/sdgs/dataportal/countryprofiles/IRL#goal-7).

The data set is `UN_renewable.csv`.

We could open this using what we learned in `08_data.ipynb`, but we can also take advantage of the infrastructure in `pandas`.

In [None]:
UN_data = pd.read_csv('../data/UN_renewable.csv')

Pandas has *many* methods for reading in data files -- including `xls` and `xlsx`.
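If you don't have a file at hand, `pd.read_csv` also accepts a file-like object, so we can try it on a string. This is a small sketch with made-up contents; the column names `Year` and `Percentage` are just placeholders.

```python
import io
import pandas as pd

# Hypothetical CSV contents; note the empty entry in the second data row.
csv_text = "Year,Percentage\n2000,1.5\n2001,\n2002,2.5\n"

# io.StringIO wraps the string so that it behaves like an open file.
toy = pd.read_csv(io.StringIO(csv_text))

print(toy)
print(toy.dtypes)  # pandas guesses a type for each column
```

For spreadsheets there is `pd.read_excel`, which typically needs an extra package such as `openpyxl` to be installed.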

Let's look at our data.

Instead of printing the whole DataFrame, we use `.head()` for the first five rows.

In [None]:
UN_data.head()

You'll see `NaN`, which stands for "Not a Number". In `pandas`, it marks an entry with no data.
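To see how much data is missing, one option is to count the `NaN` entries per column with `.isna()`. The sketch below uses a made-up DataFrame that borrows the column names "TimePeriod" and "Value"; the numbers are not from the UN file.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with two missing values in the "Value" column.
toy = pd.DataFrame({
    "TimePeriod": [2000, 2001, 2002, 2003],
    "Value": [1.5, np.nan, 2.5, np.nan],
})

# .isna() gives True wherever an entry is missing;
# summing counts the True entries in each column.
missing_per_column = toy.isna().sum()
print(missing_per_column)
```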

Two columns look relevant: "TimePeriod" and "Value".

Let's focus on just those.

In [None]:
df = UN_data[["TimePeriod", "Value"]]
df

We can clean up the data with `.dropna()`, which drops every row that contains a missing value.

In [None]:
df = df.dropna()
df
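By default `.dropna()` throws away a row if *any* of its entries is missing. If that is too aggressive, the `subset` parameter restricts the check to the columns we care about. Here is a sketch on made-up data; the "Note" column is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: "Note" is mostly empty, "Percentage" has one gap.
toy = pd.DataFrame({
    "Year": [2000, 2001, 2002],
    "Percentage": [1.5, np.nan, 2.5],
    "Note": [np.nan, np.nan, "estimate"],
})

# Drop rows with a missing value in any column.
all_clean = toy.dropna()

# Drop rows only when "Percentage" is missing; gaps in "Note" are tolerated.
value_clean = toy.dropna(subset=["Percentage"])

print(len(all_clean), len(value_clean))
```

In our notebook we already selected the two columns of interest before calling `.dropna()`, which has a similar effect.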

Let's change the names of the columns to be "Year" and "Percentage". 

In [None]:
df = df.rename(columns={"TimePeriod": "Year", "Value": "Percentage"})
df.head()

Now we can use `pandas` to plot the data.

In [None]:
_ = df.plot(kind="scatter", x="Year", y="Percentage", title="Renewable Energy Consumption by Year")

It's bizarre that the Year data is a float. 

This is simply because we didn't tell pandas to interpret this value as time data.

Here's the code all at once now.

In [None]:
UN_data = pd.read_csv(
	'../data/UN_renewable.csv', 
	parse_dates=[7]					# parse the 8th (!) column as a date
)
UN_data.head()
_ = UN_data.plot(
	kind="scatter", 
	x="TimePeriod", 
	y="Value", 
	title="Renewable Energy Consumption by Year", 
	xlabel="Year",
	ylabel="Percentage"
)
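The year column can also be fixed by hand after loading. One common way a year ends up as a float is a missing value in the column, since `NaN` is itself a float. Whether that is what happens in the UN file is an assumption; the sketch below uses made-up data to show the effect and two possible conversions.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with a gap in both columns.
raw = pd.DataFrame({
    "Year": [2000, np.nan, 2002],
    "Percentage": [1.5, np.nan, 2.5],
})
print(raw["Year"].dtype)  # float64, because of the NaN

# Once the gaps are gone, the years can safely become integers.
clean = raw.dropna().copy()
clean["Year"] = clean["Year"].astype(int)

# Alternatively, turn each year into a proper date (1 January of that year).
clean["Date"] = pd.to_datetime(clean["Year"].astype(str), format="%Y")
print(clean)
```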

There is so much one can do with `pandas`. If you need to do more data streamlining or analysis, check it out.

We can quickly get some simple statistical information.

In [None]:
UN_data["Value"].describe()
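To see where the numbers in `.describe()` come from, here is a sketch on a tiny made-up Series, small enough to check by hand. Note that missing values are skipped rather than counted.

```python
import numpy as np
import pandas as pd

# Hypothetical values; the last entry is missing.
values = pd.Series([1.0, 2.0, 3.0, 4.0, np.nan], name="Value")

summary = values.describe()
print(summary)

# Individual statistics can be picked out by their label.
print(summary["count"])  # number of non-missing entries
print(summary["mean"])   # average of the non-missing entries
print(summary["50%"])    # the median
```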

You might say that after 2004, the State felt differently about the percentage of renewable energy.

(I am making this up; I don't know whether or not it is true.)

In [None]:
df = UN_data.query("TimePeriod >= 2005")
df["Value"].describe()
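The `.query()` method is a shorthand for filtering rows. The same selection can be written with a boolean mask, which is the style you will often see in other people's code. A sketch on made-up data, borrowing the column names from above:

```python
import pandas as pd

# Hypothetical toy data with plain integer years.
toy = pd.DataFrame({
    "TimePeriod": [2003, 2004, 2005, 2006],
    "Value": [1.0, 1.5, 2.0, 3.0],
})

# Filter with a query string...
by_query = toy.query("TimePeriod >= 2005")

# ...or with a boolean mask: a column of True/False values picks the rows.
by_mask = toy[toy["TimePeriod"] >= 2005]

print(by_query.equals(by_mask))
```

The toy column holds plain integers. A column of dates would call for a date-like comparison value instead, such as a quoted date string.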

## Exercises

1. Download a data set from the [UN's Sustainable Development Goals](https://unstats.un.org/sdgs/dataportal/countryprofiles/IRL) and load it into Python with `pandas`.
2. Plot the relevant data (there is usually a lot of extra data).
3. Determine some basic statistics regarding the data set.

---

There is at least one glaring aspect we have omitted. Let's do one more data-based notebook.

- `10_data_aficionado.ipynb` 