# Week 7: Working with DataFrames

Last time we discussed the three main data structures that `pandas` provides.

Now we will discuss how to manipulate the primary one: the DataFrame.

Let's load in UN data about Ireland from two source files:
- `data/01_below_poverty.csv`
- `data/07_renewable_energy.csv`

(This was obtained from the [United Nations' SDG Country Profile Page](https://unstats.un.org/sdgs/dataportal/countryprofiles/IRL))

This data has lots of information that we don't need, so our goal is to produce one DataFrame with the information we want. 

In [None]:
import pandas as pd

## Below Poverty Data Set

We'll work through some of the basics with the below poverty data set.

In [None]:
df1 = pd.read_csv("data/01_below_poverty.csv")
df1.head()

Most of the columns look irrelevant. Let's keep only the following columns:
- "TimePeriod"
- "Value"
- "Time_Detail"
- "[Age]"
- "[Location]"
- "[Sex]"

In [None]:
df1 = df1[[
	"TimePeriod",
	"Value",
	"Time_Detail",
	"[Age]",
	"[Location]",
	"[Sex]"
]]
df1.head()

It's weird that `"TimePeriod"` is a float. Let's change this to an int.

In [None]:
# Left commented out: running this now raises an error.
# df1.astype({"TimePeriod" : "i"})

That cast fails for now because an integer column cannot hold missing values, so we need to clean our data before we can manipulate it. Let's remove the rows where `"TimePeriod"` is empty.

In [None]:
df1.dropna(subset="TimePeriod", inplace=True)
df1 = df1.astype({"TimePeriod" : "i"})
df1
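
To see why the order of these two steps matters, here is a minimal sketch on a made-up DataFrame. The name `toy` and its values are invented for illustration and are not taken from the UN file.

```python
import numpy as np
import pandas as pd

# Made-up data: one row has a missing TimePeriod.
toy = pd.DataFrame({
    "TimePeriod": [2005.0, np.nan, 2007.0],
    "Value": [1.5, 2.5, 3.5],
})

# Casting while a NaN is present fails, since NaN has no integer representation.
try:
    toy.astype({"TimePeriod": "i"})
    cast_failed = False
except (ValueError, TypeError):
    cast_failed = True

# Drop the incomplete rows first, then cast.
clean = toy.dropna(subset=["TimePeriod"]).astype({"TimePeriod": "i"})

print(cast_failed)
print(clean.dtypes)
```

This version returns a new DataFrame instead of using `inplace=True`, which leaves `toy` untouched for comparison.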

We are not going to use the entirety of this data set. Let's just take the years 2005 to 2020.

In [None]:
df1.query("2005 <= TimePeriod <= 2020", inplace=True)
df1
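
As a rough illustration, `query` with a chained comparison selects the same rows as an explicit Boolean mask. The data below is made up, and `by_query` and `by_mask` are names chosen just for this sketch.

```python
import pandas as pd

# Made-up data spanning a few years inside and outside the range.
toy = pd.DataFrame({
    "TimePeriod": [2003, 2005, 2012, 2020, 2021],
    "Value": [10.0, 11.0, 12.0, 13.0, 14.0],
})

# query evaluates the string expression against the column names.
by_query = toy.query("2005 <= TimePeriod <= 2020")

# The same filter written as an explicit Boolean mask.
by_mask = toy[(toy["TimePeriod"] >= 2005) & (toy["TimePeriod"] <= 2020)]

print(by_query.equals(by_mask))
print(by_query)
```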

The `"Time_Detail"` column looks irrelevant. Let's look at all the values.

In [None]:
df1["Time_Detail"].value_counts()

In [None]:
df1.drop(["Time_Detail"], axis=1, inplace=True)
df1
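
Here is a small sketch of what these two cells do, using invented values (the real `"Time_Detail"` entries may look different). `value_counts` returns a Series mapping each distinct value to the number of times it occurs, and `drop` removes the column.

```python
import pandas as pd

# Made-up data, not the UN file.
toy = pd.DataFrame({
    "TimePeriod": [2005, 2005, 2006],
    "Time_Detail": ["2005", "2005", "2006"],
})

# Count how often each distinct value occurs in the column.
counts = toy["Time_Detail"].value_counts()
print(counts)

# Non-inplace alternative to drop(..., axis=1, inplace=True).
trimmed = toy.drop(columns=["Time_Detail"])
print(trimmed.columns.tolist())
```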

The last three columns are potentially irrelevant. Let's see what the possibilities are for
- "[Age]"
- "[Location]"
- "[Sex]"

We want the most inclusive options.

Let's look at the rows for one particular year. 

In [None]:
df1.query("2011 == TimePeriod")

The data set has lots of granularity. Let's just take the coarse, general information.

In [None]:
df1 = df1[
	(df1["[Age]"] == "ALLAGE") & 
	(df1["[Location]"] == "ALLAREA") &
	(df1["[Sex]"] == "BOTHSEX")
]
df1
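
The filter above combines Boolean masks. Here is a minimal sketch on made-up rows: the codes `"15-64"` and `"FEMALE"` are invented for illustration, while `"ALLAGE"` and `"BOTHSEX"` come from the cell above.

```python
import pandas as pd

# Made-up rows mixing coarse and fine-grained categories.
toy = pd.DataFrame({
    "[Age]": ["ALLAGE", "15-64", "ALLAGE", "ALLAGE"],
    "[Sex]": ["BOTHSEX", "BOTHSEX", "FEMALE", "BOTHSEX"],
    "Value": [5.0, 6.0, 7.0, 8.0],
})

# Each comparison yields a Boolean Series; & combines them element-wise.
# The parentheses are required because & binds more tightly than ==.
mask = (toy["[Age]"] == "ALLAGE") & (toy["[Sex]"] == "BOTHSEX")

# Indexing with the mask keeps only the rows where it is True.
coarse = toy[mask]

print(mask.tolist())
print(coarse)
```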

Now that the last three columns are constant, we will drop them.

In [None]:
df1.drop(["[Age]", "[Location]", "[Sex]"], axis=1, inplace=True)
df1

The current index is also irrelevant. Let's make `"TimePeriod"` our index.

In [None]:
df1.set_index("TimePeriod", inplace=True)
df1

Lastly, let's change `"Value"` to `"Below Poverty (%)"`.

In [None]:
df1.rename(columns={"Value" : "Below Poverty (%)"}, inplace=True)
df1

## Speedrun: Renewable Energy Data Set

We will basically do the same steps as above, but all at once. See if you can follow along line by line.

In [None]:
df2 = pd.read_csv("data/07_renewable_energy.csv")
df2 = df2[["TimePeriod", "Value"]]
df2.dropna(inplace=True)
df2 = df2.astype({"TimePeriod" : "i"})
df2.set_index("TimePeriod", inplace=True)
df2 = df2.loc[2005:2020]		# Integer labels, since the index now holds ints
df2.rename(columns={"Value" : "Renewable Energy Share (%)"}, inplace=True)
df2
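
The same kind of pipeline can also be written as a single method chain, which avoids `inplace=True` altogether. This is only a sketch on made-up data: `raw` stands in for the CSV contents, and the `"Source"` column is invented.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for the contents of the CSV file.
raw = pd.DataFrame({
    "TimePeriod": [2004.0, 2005.0, np.nan, 2020.0, 2021.0],
    "Value": [1.0, 2.0, 3.0, 4.0, 5.0],
    "Source": ["a", "b", "c", "d", "e"],
})

tidy = (
    raw[["TimePeriod", "Value"]]            # keep the two relevant columns
    .dropna()                               # remove rows with missing entries
    .astype({"TimePeriod": "i"})            # years as integers
    .query("2005 <= TimePeriod <= 2020")    # restrict the range of years
    .set_index("TimePeriod")                # use the year as the index
    .rename(columns={"Value": "Renewable Energy Share (%)"})
)
print(tidy)
```

Each method returns a new DataFrame, so the steps read from top to bottom in the order they are applied.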

## From two to one

Because our two DataFrames have the same index, we can concatenate them in `pandas`.

In [None]:
df = pd.concat([df1, df2], axis=1)		# Merging our two DataFrames
df.index.names = ["Year"]				# Renaming the index
df
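
With `axis=1`, `pd.concat` lines rows up by index label rather than by position. Here is a small sketch with made-up frames whose indexes only partly overlap, to show what happens when the indexes are not the same. The names `left`, `right`, and `both` are chosen just for this example.

```python
import pandas as pd

# Two made-up DataFrames; 2005 is missing from the second one.
left = pd.DataFrame({"A": [1.0, 2.0, 3.0]}, index=[2005, 2006, 2007])
right = pd.DataFrame({"B": [10.0, 20.0]}, index=[2006, 2007])

# Columns are placed side by side; rows are matched on the index labels.
both = pd.concat([left, right], axis=1)
both.index.names = ["Year"]

# The label 2005 has no partner in `right`, so its "B" entry is NaN.
print(both)
```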

#### Detour: visualization

We'll discuss this more later, but we can now plot the DataFrame on a set of axes with `Matplotlib`.

In [None]:
import matplotlib.pyplot as plt 
plt.grid()
plt.scatter(
	df.index, 
	df["Below Poverty (%)"], 
	marker='o', 
	label="Below poverty"
)
plt.scatter(
	df.index, 
	df["Renewable Energy Share (%)"], 
	marker="^", 
	label="Renewable energy share"
)
plt.xlabel("Year")
plt.ylabel("Percent")
plt.legend()
plt.show()

## Ufuncs

All of the `NumPy` ufuncs (universal functions) can be applied to DataFrames, provided the data is numerical and suitable for the function.
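
As a brief sketch, here are two ufuncs applied to a small DataFrame. The column names are the ones used above, but the numbers are made up. The result is again a DataFrame with the same index and columns.

```python
import numpy as np
import pandas as pd

# Made-up values, not the UN data.
toy = pd.DataFrame(
    {
        "Below Poverty (%)": [4.0, 9.0],
        "Renewable Energy Share (%)": [16.0, 25.0],
    },
    index=pd.Index([2005, 2006], name="Year"),
)

# A unary ufunc acts element-wise and keeps the labels.
roots = np.sqrt(toy)

# A binary ufunc with a scalar: convert percentages to fractions.
fractions = np.divide(toy, 100)

print(roots)
print(fractions)
```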