# Data Transformations with Pandas in Python - Describing Data

Welcome to the notebook about describing data. In this notebook we will look at two types of description:
1. **Unique values** - showing the unique values, and counting the occurrences of each unique value
2. **Missing values** - calculating the number and percentage of missing values

Good luck!

We will use (fake, randomly generated) data from the file `infoData.csv` to explore some methods for quickly obtaining information. Run the cell below to load and check the data.

In [None]:
import pandas as pd
infoData = pd.read_csv('infoData.csv')
infoData

## 1. Showing and counting unique values

We can get an overview of the unique values in a _column_ by calling the method `.unique()` on that column. For example, to check which station names are available:

In [None]:
infoData.station.unique()

**Exercise**: Show an overview of the unique values in the column `infoData.Month`

In [None]:
# Write your code for getting the unique values of the column infoData.Month here

We can also get an overview of the number of values for each unique value. This is possible with the method `.value_counts()`. For example, to check the number of times each of the unique station names occurs in the column `infoData.station`:

In [None]:
infoData.station.value_counts()
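
To see how `.unique()` and `.value_counts()` relate to each other, here is a minimal sketch on a small made-up dataframe. The name `toy` and its values are assumptions for illustration only; they are not taken from `infoData.csv`.

```python
import pandas as pd

# Hypothetical toy data, not taken from infoData.csv
toy = pd.DataFrame({'station': ['A', 'B', 'A', 'C', 'A', 'B']})

# .unique() returns each distinct value once, in order of first appearance
unique_stations = toy.station.unique()
print(unique_stations)

# .value_counts() returns how often each distinct value occurs,
# sorted from most frequent to least frequent
station_counts = toy.station.value_counts()
print(station_counts)
```

Note that the counts always add up to the number of non-missing values in the column.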

**Exercise**: Show an overview of the number of values per unique value in the column `infoData.Month`

In [None]:
# Write your code for getting the number of values per unique month number here.

The method `.value_counts()` can also be used to get an overview of the number of values for unique combinations of _multiple columns_; for example, an overview of the number of occurrences of all unique station/year combinations. For this, you have to use the method on the full dataframe, and specify the columns that you want to use as `subset`. See the example below.

In [None]:
# Create a variable with the names of the columns for which you want to count the unique combinations
columns_to_use = ['station', 'Year']

# Use the method .value_counts() on the dataframe, and specify the columns through the argument subset
infoData.value_counts(subset=columns_to_use) 
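
The result of `.value_counts()` on a dataframe is indexed by the combinations themselves. The sketch below illustrates this on made-up data (the dataframe `toy` and its values are hypothetical, not taken from `infoData.csv`), and shows one possible way to look up the count for a single combination.

```python
import pandas as pd

# Hypothetical toy data, not taken from infoData.csv
toy = pd.DataFrame({
    'station': ['A', 'A', 'A', 'B', 'B'],
    'Year':    [2020, 2020, 2021, 2020, 2020],
})

# Count how often each unique (station, Year) combination occurs
combo_counts = toy.value_counts(subset=['station', 'Year'])
print(combo_counts)

# Look up the count of one combination by passing it as a tuple to .loc
count_a_2020 = combo_counts.loc[('A', 2020)]
print(count_a_2020)
```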

**Exercise**: Show some information on unique values and their counts in the dataframe `emiData`.

Steps:
- Load the data by running the code below
- Write code for the following information requests:
    1. An overview of the unique values in the column `emiData.NAME`
    2. An overview of the unique values in the column `emiData.Element`
    3. The number of occurrences of each unique value in the column `emiData.Element`
    4. The number of occurrences of each unique value in the column `emiData.YEAR`
    5. The number of occurrences of the _combinations_ of unique values in columns `'Name'` and `'Element'`
    6. The number of occurrences of the _combinations_ of unique values in columns `'Element'` and `'YEAR'`

In [None]:
emiData = pd.read_excel('emiData.xlsx')
emiData

In [None]:
# Write your code here for the different information requests.

## 2. Getting insight into the number of missing values

In our dataframe `infoData`, some values are missing. These are shown as `NaN`. See for example the first two rows of the dataframe, in which we have a missing value in the column `infoData.rainfall`.

In [None]:
# .iloc[row_slice, column_slice] can be used to select parts of a DataFrame based on the relative position. 
# In this case: the first two rows (:2), and all columns (:)
infoData.iloc[:2, :] 

To count the number of missing values in a dataframe, you need to know two things:
- You can get **booleans** that indicate whether a value is missing with the methods `.isna()` and `.notna()`. The method `.isna()` gives `True` if a value is missing, and `False` if not; `.notna()` gives the opposite.
- Booleans can be used mathematically: `True` counts as 1, `False` counts as 0.

Combining these two facts lets us count the number of missing values: first request the booleans, then sum them. With `.isna()`, every missing value gives `True` and every other value gives `False`, so the sum equals the number of missing values (only the `True` values add to it). Summing is easy with the built-in method `.sum()`. See the example below.

In [None]:
# Getting Booleans for missing (True) or not (False):
bool_missing = infoData.isna()
bool_missing

In [None]:
# Summing the booleans with the built-in Pandas method .sum()
bool_missing.sum()
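
As a small illustration of why summing booleans works, the sketch below first adds booleans in plain Python and then counts missing values in a made-up column. The name `toy_rain` and its values are assumptions for illustration, not data from `infoData.csv`.

```python
import numpy as np
import pandas as pd

# In plain Python, True counts as 1 and False counts as 0
print(True + True + False)  # 2

# Hypothetical toy column with two missing values.
# Note that 0.0 is a real value, not a missing one.
toy_rain = pd.Series([1.2, np.nan, 0.0, np.nan, 3.4])

is_missing = toy_rain.isna()         # True where a value is missing
n_missing = is_missing.sum()         # number of True values
n_present = toy_rain.notna().sum()   # number of values that are not missing

print(n_missing, n_present)
```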

In [None]:
# Or, all steps combined, and specifically for the columns of our interest (rainfall and temperature)
infoData.get(['rainfall', 'temperature']).isna().sum()

In the same way we can get the percentage of missing data. We saw that in the column `rainfall`, 15 values are missing. The total length of the dataframe is 72. So the percentage of missing rainfall data is `15/72 * 100`: the sum divided by the length (in other words, the **average** of the booleans), multiplied by 100.

This means we can get the percentage of missing values by using the built-in method `.mean()` on the booleans created by `.isna()`, and multiplying the result by 100.

In [None]:
# All steps combined, to get the percentage of missing data for both columns:
infoData.get(['rainfall', 'temperature']).isna().mean() * 100
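
The equivalence between "sum divided by length" and `.mean()` can be checked on a made-up column with the same proportions as in the text (72 values, 15 of them missing). This is only a sketch: the series `toy_rain` is hypothetical and is not the real `rainfall` column.

```python
import numpy as np
import pandas as pd

# Hypothetical column: 72 values, of which 15 are missing
toy_rain = pd.Series([np.nan] * 15 + [1.0] * 57)

# Three ways to arrive at the same percentage
by_hand = 15 / 72 * 100
by_sum = toy_rain.isna().sum() / len(toy_rain) * 100
by_mean = toy_rain.isna().mean() * 100

print(round(by_hand, 2), round(by_sum, 2), round(by_mean, 2))
```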

**Exercise**: Display the percentage of **not missing data** (in other words: how much data is available). Hint: use `.notna()` instead of `.isna()`.

In [None]:
# Write your code for displaying the percentage of available data here.

More advanced overviews of missing data can be obtained by combining `.isna()` with methods like `.groupby()`.

For example, if you want the number of missing values _per year_, you want the _sum per year_ of the booleans. This can be done by grouping them per year (with `.groupby()`), before taking the sum. See the example below.

In [None]:
infoData.get(['rainfall', 'temperature']).isna().groupby(by=infoData.Year).sum()

The above code includes 4 steps in one line:
- We select only columns `rainfall` and `temperature` with `.get()`
- We turn those into booleans for missing or not with `.isna()`
- We group those booleans by the values in the column `infoData.Year` with the method `.groupby()` (one group for each unique value in the column `infoData.Year`)
- We take the sum for each of those groups
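
The four steps above can also be written out one at a time, which makes it easier to inspect each intermediate result. The sketch below does this on a small made-up dataframe; the name `toy` and all of its values are assumptions for illustration, not data from `infoData.csv`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not taken from infoData.csv
toy = pd.DataFrame({
    'Year':        [2020, 2020, 2020, 2021, 2021],
    'rainfall':    [np.nan, 1.0, np.nan, 2.0, np.nan],
    'temperature': [15.0, np.nan, 16.0, 17.0, 18.0],
})

selected = toy.get(['rainfall', 'temperature'])  # step 1: select the columns
missing = selected.isna()                        # step 2: booleans for missing or not
grouped = missing.groupby(by=toy.Year)           # step 3: one group per unique year
missing_per_year = grouped.sum()                 # step 4: sum the booleans per group

print(missing_per_year)
```

The result has one row per year and one column per selected variable.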

**Exercise**: Show some information on missing data in the dataframe `makTemp`.

Steps:
- Load the data by running the code below
- Write code for the following information requests:
    1. The number of missing values in the column `makTemp.temperature`
    2. The percentage of missing values in the column `makTemp.temperature`
    3. The percentage of missing values **per YEAR** in the column `makTemp.temperature`
    4. The percentage of missing values **per Month** in the column `makTemp.temperature`

In [None]:
makTemp = pd.read_csv('makTemp.csv')
makTemp

In [None]:
# Write your code here for the different missing data information requests.