# Session 12: Reading files into dataframes. Operations on data. Aggregating and grouping.

## Reading files into DataFrames:

`pandas` is a very versatile module for loading data from different file formats into DataFrames.

We have several functions from `pandas` to read files into DataFrames:
* `pd.read_csv` converts CSV files into a `pd.DataFrame`
* `pd.read_json` converts JSON files into a `pd.DataFrame`
* `pd.read_html` reads the tables found in an HTML page into a list of `pd.DataFrame` objects
* `pd.read_clipboard` converts the data in your clipboard into a `pd.DataFrame`
* and many more... https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html

In general, `pandas` reads files correctly with its default settings, but sometimes we need to pass extra arguments to `read_csv()`:
* separator: `sep` can be semicolon (;), comma (,), tab (\t), etc
* encoding: `encoding` can be `utf-8`, `latin1`, ...
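
As a minimal sketch of why these two arguments matter, the snippet below parses made-up data from memory instead of a file on disk. The sample content and the variable names (`raw`, `wrong`, `right`, `decoded`) are assumptions for illustration only.

```python
import io
import pandas as pd

# Hypothetical semicolon-separated content standing in for a file on disk
raw = "district;dogs;cats\nCHAMARTÍN;11098;3922\nBARAJAS;5086;1515\n"

# With the default sep="," each line is read as one single column
wrong = pd.read_csv(io.StringIO(raw))

# Passing sep=";" splits each line into the three expected columns
right = pd.read_csv(io.StringIO(raw), sep=";")

# encoding matters when reading bytes: the text is encoded as latin1 on purpose
raw_bytes = raw.encode("latin1")
decoded = pd.read_csv(io.BytesIO(raw_bytes), sep=";", encoding="latin1")

print(wrong.shape)
print(right.shape)
print(decoded["district"].tolist())
```

If the separator is wrong, the symptom is usually a DataFrame with a single wide column; if the encoding is wrong, accented characters such as the `Í` in `CHAMARTÍN` come out garbled or raise a decoding error.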

In [1]:
import pandas as pd
pd.read_clipboard()

Unnamed: 0,marketplace


In [1]:
# let's read animals.csv

import pandas as pd

animals = pd.read_csv("../files/animals.csv", sep=",")

animals.head()

FileNotFoundError: [Errno 2] No such file or directory: '../files/animals.csv'

### pandas: `head`, `tail`, `sample`

* `df.head(n)` will display the first n rows of a dataframe. By default, n=5.
* `df.tail(n)` will display the last n rows of a dataframe. By default, n=5.
* `df.sample(n)` will display a random sample of n rows of a dataframe. By default, n=1.

In [3]:
animals.head()

Unnamed: 0,year,district,dogs,cats
0,2019,ARGANZUELA,10556,5074
1,2019,BARAJAS,5086,1515
2,2019,CARABANCHEL,20258,6387
3,2019,CENTRO,16010,9248
4,2019,CHAMARTÍN,11098,3922


In [4]:
animals.tail(3)

Unnamed: 0,year,district,dogs,cats
123,2014,VICÁLVARO,4584,505
124,2014,VILLA DE VALLECAS,7107,940
125,2014,VILLAVERDE,10467,851


In [5]:
animals.sample(5)

Unnamed: 0,year,district,dogs,cats
31,2018,MONCLOA-ARAVACA,12462,3453
111,2014,CIUDAD LINEAL,17604,3042
59,2017,USERA,12623,1847
124,2014,VILLA DE VALLECAS,7107,940
10,2019,MONCLOA-ARAVACA,12367,3931


## Operations with the data in the columns

With pandas we can not only store tabular data, but also perform operations on it.

### Using `pandas` methods and attributes 

Since our columns are nothing but `pd.Series` objects, we can use all the attributes and methods that apply to them:
* https://pandas.pydata.org/docs/reference/api/pandas.Series.html

Just a sample of what we can do:
* Attributes:
    * `.index`, `.shape`, `.size`, `.values`, `.T`
* Methods:
    * `.abs()`, `.min()`, `.max()`, `.count()`, `.value_counts()`
    * `.sum()`, `.cumsum()`, `.mean()`, `.std()`
    * `.isna()`, `.isnull()`, `.idxmin()`, `.idxmax()`
    * `.unique()`, `.nunique()`, `.drop_duplicates()`

In [6]:
list(animals.index)[:5]

[0, 1, 2, 3, 4]

In [7]:
animals.shape

(126, 4)

In [8]:
animals.size

504

In [9]:
animals.values

array([[2019, 'ARGANZUELA', 10556, 5074],
       [2019, 'BARAJAS', 5086, 1515],
       [2019, 'CARABANCHEL', 20258, 6387],
       [2019, 'CENTRO', 16010, 9248],
       [2019, 'CHAMARTÍN', 11098, 3922],
       [2019, 'CHAMBERÍ', 13359, 4692],
       [2019, 'CIUDAD LINEAL', 17286, 8183],
       [2019, 'FUENCARRAL-EL PARDO', 17375, 6121],
       [2019, 'HORTALEZA', 15836, 8556],
       [2019, 'LATINA', 19049, 10564],
       [2019, 'MONCLOA-ARAVACA', 12367, 3931],
       [2019, 'MORATALAZ', 6724, 2502],
       [2019, 'PUENTE DE VALLECAS', 23437, 6208],
       [2019, 'RETIRO', 7786, 3105],
       [2019, 'SALAMANCA', 13471, 5033],
       [2019, 'SAN BLAS', 14228, 5064],
       [2019, 'TETUÁN', 12470, 5535],
       [2019, 'USERA', 12393, 2898],
       [2019, 'VICÁLVARO', 5244, 1505],
       [2019, 'VILLA DE VALLECAS', 9923, 2946],
       [2019, 'VILLAVERDE', 12917, 2694],
       [2018, 'ARGANZUELA', 10622, 4458],
       [2018, 'BARAJAS', 5203, 1300],
       [2018, 'CARABANCHEL', 20265, 5524],

In [10]:
animals.T

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,116,117,118,119,120,121,122,123,124,125
year,2019,2019,2019,2019,2019,2019,2019,2019,2019,2019,...,2014,2014,2014,2014,2014,2014,2014,2014,2014,2014
district,ARGANZUELA,BARAJAS,CARABANCHEL,CENTRO,CHAMARTÍN,CHAMBERÍ,CIUDAD LINEAL,FUENCARRAL-EL PARDO,HORTALEZA,LATINA,...,MORATALAZ,PUENTE DE VALLECAS,RETIRO,SALAMANCA,SAN BLAS,TETUÁN,USERA,VICÁLVARO,VILLA DE VALLECAS,VILLAVERDE
dogs,10556,5086,20258,16010,11098,13359,17286,17375,15836,19049,...,6706,22072,8774,12942,12786,12301,11310,4584,7107,10467
cats,5074,1515,6387,9248,3922,4692,8183,6121,8556,10564,...,1153,2065,1344,1793,2043,2178,978,505,940,851


#### abs

In [11]:
animals["dogs"].abs()

0      10556
1       5086
2      20258
3      16010
4      11098
       ...  
121    12301
122    11310
123     4584
124     7107
125    10467
Name: dogs, Length: 126, dtype: int64

#### max, min, count, value_counts

In [12]:
# max value in series
animals["dogs"].max()

23860

In [13]:
# min value in series
animals["dogs"].min()

4584

In [14]:
# number of elements in series
animals["district"].count()

126

In [15]:
animals["district"].size

126

In [16]:
animals["district"].shape

(126,)

In [17]:
# counts items per category
animals["year"].value_counts()

2019    21
2018    21
2017    21
2016    21
2015    21
2014    21
Name: year, dtype: int64

In [18]:
# if we pass normalize=True to `value_counts` we will get the proportions instead of the totals
animals["year"].value_counts(normalize=True)

2019    0.166667
2018    0.166667
2017    0.166667
2016    0.166667
2015    0.166667
2014    0.166667
Name: year, dtype: float64

#### sum, cumsum, mean, std

In [19]:
# sum of all elements
animals["cats"].sum()

419173

In [20]:
# cumulative sum
# item1, item1+item2, item1+item2+item3, ...
animals["cats"].cumsum()

0        5074
1        6589
2       12976
3       22224
4       26146
        ...  
121    415899
122    416877
123    417382
124    418322
125    419173
Name: cats, Length: 126, dtype: int64

In [21]:
# mean value of series
animals["cats"].mean()

3326.7698412698414

In [22]:
# standard deviation of series
animals["cats"].std()

2062.750967665068

#### isna/isnull, idxmin/idxmax

In [23]:
import numpy as np

In [24]:
# missing values in pandas are indicated as NaN (None is treated as missing too)
# with isna we can check which elements are missing

s = pd.Series([1, None, "a", True, np.nan]).isna()



In [25]:
# isna: returns array with same shape with True/False to mask NaN
animals["dogs"].isna()

0      False
1      False
2      False
3      False
4      False
       ...  
121    False
122    False
123    False
124    False
125    False
Name: dogs, Length: 126, dtype: bool

In [35]:
# with dropna we can drop rows with NaN
pd.Series([1, None, "a", True, np.nan]).dropna()

0       1
2       a
3    True
dtype: object
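
To count how many values are missing, one option is to sum the boolean mask returned by `isna()`. This is a small sketch on the same toy Series used above; the names `mask` and `n_missing` are just illustrative.

```python
import numpy as np
import pandas as pd

s = pd.Series([1, None, "a", True, np.nan])

mask = s.isna()          # boolean Series: True where the value is missing
n_missing = mask.sum()   # True counts as 1, so the sum is the number of missing values

print(mask.tolist())
print(n_missing)
```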

In [27]:
# idxmax() returns the row label (index) of the highest value in series
animals["dogs"].idxmax()

54

In [28]:
animals["dogs"][animals["dogs"].idxmax()] == animals["dogs"].max()

True

In [29]:
# idxmin() returns the row label (index) of the lowest value in series
animals["dogs"].idxmin()

123

#### unique, nunique, drop_duplicates

In [30]:
# returns an array with the unique values, like doing set(series)
animals["district"].unique()

array(['ARGANZUELA', 'BARAJAS', 'CARABANCHEL', 'CENTRO', 'CHAMARTÍN',
       'CHAMBERÍ', 'CIUDAD LINEAL', 'FUENCARRAL-EL PARDO', 'HORTALEZA',
       'LATINA', 'MONCLOA-ARAVACA', 'MORATALAZ', 'PUENTE DE VALLECAS',
       'RETIRO', 'SALAMANCA', 'SAN BLAS', 'TETUÁN', 'USERA', 'VICÁLVARO',
       'VILLA DE VALLECAS', 'VILLAVERDE', 'FUENCARRAL EL PARDO'],
      dtype=object)

In [31]:
# nunique returns how many unique elements there are in the series, like doing len(set(series))
animals["district"].nunique()

22

In [32]:
# drop_duplicates returns a series with only the unique values and the index at which they are
animals["district"].drop_duplicates()

0               ARGANZUELA
1                  BARAJAS
2              CARABANCHEL
3                   CENTRO
4                CHAMARTÍN
5                 CHAMBERÍ
6            CIUDAD LINEAL
7      FUENCARRAL-EL PARDO
8                HORTALEZA
9                   LATINA
10         MONCLOA-ARAVACA
11               MORATALAZ
12      PUENTE DE VALLECAS
13                  RETIRO
14               SALAMANCA
15                SAN BLAS
16                  TETUÁN
17                   USERA
18               VICÁLVARO
19       VILLA DE VALLECAS
20              VILLAVERDE
112    FUENCARRAL EL PARDO
Name: district, dtype: object

### Create new columns out of existing columns

* We can combine two or more columns with arithmetic operators
* We can build a column from a logical condition on other columns using `np.where`
    * ```Python
    np.where(condition_on_column, result_if_true, result_if_false)
    ```


In [33]:
# sum two columns

animals["total_animals"] = animals["cats"] + animals["dogs"]

animals.head()

Unnamed: 0,year,district,dogs,cats,total_animals
0,2019,ARGANZUELA,10556,5074,15630
1,2019,BARAJAS,5086,1515,6601
2,2019,CARABANCHEL,20258,6387,26645
3,2019,CENTRO,16010,9248,25258
4,2019,CHAMARTÍN,11098,3922,15020


### np.where

```Python
np.where(
    condition_to_check,
    value_if_condition_is_true,
    value_if_condition_is_false
)
```

In [34]:
# np.where applies this logic to every element at once
# (pseudocode, not meant to be run as-is):
#
# if condition:
#     value_if_condition_is_true
# else:
#     value_if_condition_is_false

In [36]:
# create a new column based on a logical condition on an existing column: `np.where`

import numpy as np

mean_animals = animals["total_animals"].mean()
print(mean_animals)

animals["total_animals_cat"] = np.where(
    animals["total_animals"] > mean_animals, #if animals above mean
    "above_mean", # save "above_mean"
    "below_mean" # save "below_mean"
)

animals.sample(5)

16388.97619047619


Unnamed: 0,year,district,dogs,cats,total_animals,total_animals_cat
98,2015,SALAMANCA,13159,1860,15019,below_mean
102,2015,VICÁLVARO,4702,545,5247,below_mean
26,2018,CHAMBERÍ,13615,4087,17702,above_mean
99,2015,SAN BLAS,13067,2188,15255,below_mean
77,2016,SALAMANCA,12709,3424,16133,below_mean


In [37]:
# concatenate strings (after converting `year` to str) and keep the last 3 characters
animals["concat_string"] = animals["year"].astype(str) + animals["district"]

animals["concat_string"] = animals["concat_string"].str[-3:]

animals.dtypes

year                  int64
district             object
dogs                  int64
cats                  int64
total_animals         int64
total_animals_cat    object
concat_string        object
dtype: object

In [None]:
# create a new column called "cats_per_dog" that contains the ratio cats/dogs
animals["cats_per_dog"] = animals["cats"] / animals["dogs"]

animals.head()

In [None]:
# create a new column called "cum_sum_animals" that contains 
# the cumulative sum of the total animals 
animals["cum_sum_animals"] = animals["total_animals"].cumsum()

animals.head()

### Sorting columns using `.sort_values()`

We can sort the rows of a DataFrame by the values of one or more columns:

```Python
df.sort_values(by=[columns_to_order_with], ascending=True)
```

In [None]:
animals.sort_values(by=["cats", "dogs"], ascending=[True, False])

In [None]:
animals.sort_values(by=["cats", "dogs"], ascending=[False, True])
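
When `ascending` is a list, each entry applies to the column in the same position of `by`, and later columns only break ties in earlier ones. Here is a small sketch on a made-up DataFrame (`toy` and its values are assumptions, not part of the `animals` dataset):

```python
import pandas as pd

toy = pd.DataFrame({
    "district": ["A", "B", "C", "D"],
    "cats": [5, 3, 5, 3],
    "dogs": [10, 40, 30, 20],
})

# cats ascending; rows with equal cats are ordered by dogs descending
ordered = toy.sort_values(by=["cats", "dogs"], ascending=[True, False])

print(ordered["district"].tolist())
```

Note that `sort_values` returns a new DataFrame and leaves `toy` unchanged unless the result is assigned back.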

## Practice

### Exercise 1:
What percentage of all the dogs in the city in 2018 were in the "LATINA" district?

In [None]:
dogs_latina_2018 = animals[
    (animals["district"]=="LATINA")&
    (animals["year"]==2018)
]["dogs"].values[0]

dogs_2018 = animals[
    (animals["year"]==2018)
]["dogs"].sum()

ratio = round(dogs_latina_2018 * 100 / dogs_2018, 1)

f"{ratio} % of the dogs in Madrid in 2018 are in Latina"

### Exercise 2:
How many districts had an "above_mean" rating in 2016?

In [None]:
animals[
    (animals["year"]==2016)&
    (animals["total_animals_cat"]=="above_mean")
]["district"].nunique()

### Exercise 3:
Has the "Hortaleza" district increased or decreased its dog population in the analyzed period? By how much?

In [None]:
dogs_hortaleza_2019 = animals[
    (animals["district"]=="HORTALEZA")
][["year", "dogs"]].sort_values(by="year", ascending=False)["dogs"].values[0]

dogs_hortaleza_2014 = animals[
    (animals["district"]=="HORTALEZA")
][["year", "dogs"]].sort_values(by="year", ascending=False)["dogs"].values[-1]

# to calculate the evolution we subtract the number of dogs in Hortaleza in 2014 from the number in 2019
evolution = dogs_hortaleza_2019 - dogs_hortaleza_2014

# results
result = "increased" if evolution > 0 else "decreased"

print(f"The number of dogs in Hortaleza has {result} by {abs(evolution)} dogs from 2014 to 2019")

## Groupby and aggregations

### `groupby`

Just like in SQL, we can use `groupby` to apply an operation to each group of rows in our DataFrames.

```Python
df.groupby([columns_to_group]).function_to_apply_to_each_group
```
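
As a minimal sketch of this pattern on made-up data (the `toy` DataFrame below is an assumption and only borrows the column names used later in this session):

```python
import pandas as pd

toy = pd.DataFrame({
    "month": [1, 1, 2, 2, 2],
    "spot_price": [50.0, 70.0, 30.0, 40.0, 50.0],
})

# one row per month, holding the mean spot_price of that month's rows
mean_price = toy.groupby(["month"])[["spot_price"]].mean()

print(mean_price)
```

The grouping column becomes the index of the result, so a single group can be looked up with `mean_price.loc[1]`.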

In [None]:
# read energy data
from datetime import datetime

energy = pd.read_csv("../files/energy.csv")

energy.head()

In [None]:
energy[["power_demand"]].head()

In [None]:
# mean spot_price per month 
energy.groupby(["month"])[["spot_price"]].mean()

In [None]:
# max power_demand per hour
energy.groupby(["hour"])[["power_demand"]].max()

In [None]:
# day of week with lowest average consumption of fossil fuels 

energy["fossil_fuel_consumption"] = energy["gas"] + energy["coal"]

energy.groupby(["weekday"])[["fossil_fuel_consumption"]].mean().idxmin()  # Sunday

### Inside a `groupby` object

Iterating over a `groupby` object yields one tuple per `category` in the `column`(s) we're grouping by:
* The first element of the tuple is the `category` itself
* The second element is the data associated with that category:
    * ```Python
    df[df[col_groupby]==category]
    ```
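
A short sketch of that iteration on a made-up DataFrame (the data and the names `toy` and `seen` are assumptions). It checks that each group matches the rows we would get by filtering on the category by hand:

```python
import pandas as pd

toy = pd.DataFrame({
    "month": [1, 1, 2],
    "power_demand": [100, 120, 90],
})

seen = []
for category, group in toy.groupby("month"):
    # group is the sub-DataFrame whose "month" equals category
    same = group.equals(toy[toy["month"] == category])
    seen.append((category, len(group), same))

print(seen)
```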

In [None]:
# what's inside a groupby object?
groupby_object = energy.groupby("month")

In [None]:
# the data of the first group contains a single month value
list(groupby_object)[0][1]["month"].unique()

In [None]:
# category
list(groupby_object)[0][0]

In [None]:
# data associated with category
list(groupby_object)[0][1]

Now that we understand the `groupby` object, we can dig a bit deeper into the syntax.

If we want to group by several columns, we can pass a list of columns to `groupby` and perform the operation we need.

If we don't want the grouping columns to become the index of the resulting DataFrame, we can pass `as_index=False` to `groupby`.
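
The difference is easy to see side by side. This is a sketch on made-up data (`toy`, `indexed` and `flat` are illustrative names):

```python
import pandas as pd

toy = pd.DataFrame({
    "month": [1, 1, 2, 2],
    "weekday": [0, 1, 0, 0],
    "power_demand": [10.0, 20.0, 30.0, 50.0],
})

# default: month and weekday become the levels of a MultiIndex
indexed = toy.groupby(["month", "weekday"])[["power_demand"]].mean()

# as_index=False: month and weekday stay as regular columns
flat = toy.groupby(["month", "weekday"], as_index=False)[["power_demand"]].mean()

print(list(indexed.index.names))
print(flat.columns.tolist())
```

The flat form is often more convenient when the result will be filtered, merged or plotted by the grouping columns.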

In [None]:
# groupby with several columns

# mean  power_demand and spot_price per month and weekday
df = energy.groupby(["month", "weekday"])[["power_demand", "spot_price"]].mean()

df.columns = [f"mean_{col}" for col in df.columns]

df

In [None]:
# with `as_index=False` the grouping columns stay as regular columns instead of becoming the index
energy.groupby(["month", "weekday"], as_index=False)[["power_demand", "spot_price"]].mean()

### `groupby` and `agg`

If we want to perform different operations after `groupby` we can mix `groupby` and `agg`.

In [None]:
# groupby on several columns and perform mean and sum on coal and wind

energy.groupby(["month", "weekday"]).agg({
    "coal": ["sum", "mean"],
    "wind": ["sum", "mean"]
})

In [None]:
# We can handle a multiindex like the one resulting from a groupby with several columns 
# and several operations in the following way:

df = energy.groupby(["month", "weekday"]).agg({
    "coal": ["sum", "mean"],
    "wind": ["sum", "mean"]
})

# mean coal generation on Tuesdays in January
df.loc[(1, 1), ("coal", "mean")]
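
The same kind of lookup can be tried on a small made-up DataFrame, where the result is easy to check by hand. The data and the names `toy`, `agg`, `jan_tue_mean` and `jan` are assumptions for illustration:

```python
import pandas as pd

toy = pd.DataFrame({
    "month": [1, 1, 1, 2],
    "weekday": [0, 1, 1, 1],
    "coal": [10.0, 20.0, 40.0, 5.0],
})

agg = toy.groupby(["month", "weekday"]).agg({"coal": ["sum", "mean"]})

# the row key is (month, weekday); the column key is (column, operation)
jan_tue_mean = agg.loc[(1, 1), ("coal", "mean")]

# passing only the first row level returns every weekday of that month
jan = agg.loc[1]

print(jan_tue_mean)
print(jan.index.tolist())
```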

When we have a DataFrame with several index levels, we can reshape it with `unstack()` and `stack()`:

### `stack` and `unstack`

These methods allow us to "move" labels from rows to columns and vice versa:
* `unstack` moves row labels to column labels
* `stack` moves column labels to row labels

By default, both methods operate on the last (innermost) level, that is, `level=-1`.
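
A compact sketch of the round trip on made-up data (the `toy` DataFrame and variable names are assumptions). Depending on the installed pandas version, `stack()` may emit a `FutureWarning` about its new implementation:

```python
import pandas as pd

toy = pd.DataFrame({
    "month": [1, 1, 2, 2],
    "weekday": [0, 1, 0, 1],
    "coal": [10.0, 20.0, 30.0, 40.0],
})

by_month_weekday = toy.groupby(["month", "weekday"])[["coal"]].sum()

# unstack with the default level=-1: weekday moves from the rows to the columns
wide = by_month_weekday.unstack()

# stack with the default level=-1: weekday moves back from the columns to the rows
long_again = wide.stack()

print(wide)
print(long_again)
```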

In [None]:
# create DF with 2 indices
energy.groupby(["month", "weekday"]).agg({
    "coal": ["sum", "mean"],
    "wind": ["sum", "mean"]
})

In [None]:
# move `weekday` from rows to columns: unstack weekday
energy.groupby(["month", "weekday"]).agg({
    "coal": ["sum", "mean"],
    "wind": ["sum", "mean"]
}).unstack(level="weekday")

In [None]:
# move ("coal", "wind") from column labels to rows: stack level 0
energy.groupby(["month", "weekday"]).agg({
    "coal": ["sum", "mean"],
    "wind": ["sum", "mean"]
}).stack(level=0)

## Practice

### Exercise 1: `energy` dataset
What was the maximum solar power generation in August?

In [None]:
# let's see the maximum solar generation in each month
energy.groupby(["month"]).agg({"solar": ["max"]})

In [None]:
# only getting the value for August (month 8)
energy.groupby(["month"]).agg({"solar": ["max"]}).loc[8]

### Exercise 2: `energy` dataset
What's the average production of each of the following technologies at hour 5?

```Python
tech = ["nuclear", "solar", "hydro"]
```

In [None]:
tech = ["nuclear", "solar", "hydro"]

energy.groupby("hour")[tech].mean().loc[5]

### Exercise 3:
Create a new column called `stop_wind` with value 1 if `spot_price` is below 20, and 0 otherwise.

In [None]:
energy["stop_wind"] = np.where(
    energy["spot_price"] < 20, 
    1,
    0
)

energy.head(1)

### Exercise 4:
Create a new column called `weekend` with value 0 if `weekday` is 0, 1, 2, 3 or 4, and 1 otherwise.

In [None]:
energy["weekend"] = np.where(
    energy["weekday"] > 4,
    1,
    0
)

energy.head(1)