# Pivoting / Reshaping - Intro

A very common problem in data analysis is that data arrive in the wrong **shape**. Data are often provided in the so-called "**wide**" format, but for analysis we typically need the "**long**" format.

The World Bank dataset you analyzed in previous exercises is already in the **long** format, e.g.:

| Country | Year | Population  |
|---------|------|-------------|
| A       | 2001 | 1100000     |
| A       | 2002 | 1200000     |
| A       | 2003 | 1300000     |
| B       | 2002 | 5200000     |
| B       | 2003 | 5300000     |
| B       | 2004 | 5400000     |
| C       | 2001 | 7100000     |

But what if the data are delivered in the **wide** format, e.g.:

| Country | Population2001 | Population2002 | Population2003 | Population2004 |
|---------|----------------|----------------|----------------|----------------|
| A                      | 1100000        | 1200000        | 1300000        | NA               |
| B                      | NA             | 5200000        | 5300000        | 5400000          |
| C                      | 7100000        | NA             | NA             | NA               |

This chapter shows you how to convert data between the wide and the long format, in both directions.
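As a small preview, the two toy tables above can be reproduced in a few lines. This is only a sketch: the names `toy_long` and `toy_wide` are made up for illustration, and it uses `DataFrame.pivot()`, a close relative of the `pivot_table()` method covered later in this chapter.

```python
import pandas as pd

# The toy data from the long table above
toy_long = pd.DataFrame(
    {
        "Country": ["A", "A", "A", "B", "B", "B", "C"],
        "Year": [2001, 2002, 2003, 2002, 2003, 2004, 2001],
        "Population": [1100000, 1200000, 1300000, 5200000, 5300000, 5400000, 7100000],
    }
)

# One row per country, one column per year.
# Country/year combinations that do not exist in the long table become NaN.
toy_wide = toy_long.pivot(index="Country", columns="Year", values="Population")
print(toy_wide)
```

The missing values in the wide table correspond to rows that are simply absent from the long table.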

# Preparations

In [1]:
import pandas as pd

pd.set_option("display.max_columns", 500)

# Load the data (in long format)

In [2]:
df = pd.read_excel("../../data/raw/wdi_reduced.xlsx", sheet_name="wdi")
df.head()

Unnamed: 0,countrycode,countryname,region,year,NY_GDP_MKTP_CD,NY_GDP_MKTP_KD_ZG,SP_POP_TOTL
0,ABW,Aruba,Latin America & Caribbean,2007,2623726000.0,-3.654626,101220.0
1,ABW,Aruba,Latin America & Caribbean,2011,2584464000.0,,102053.0
2,ABW,Aruba,Latin America & Caribbean,1992,,,68235.0
3,ABW,Aruba,Latin America & Caribbean,1989,,,61032.0
4,ABW,Aruba,Latin America & Caribbean,1975,,,60657.0


In [3]:
# limit to fewer columns and rows
df = df.loc[
    (df["region"] == "North America") & (df["year"] > 2010) & (df["year"] < 2014),
    ["countrycode", "year", "NY_GDP_MKTP_CD", "SP_POP_TOTL"],
]
df

Unnamed: 0,countrycode,year,NY_GDP_MKTP_CD,SP_POP_TOTL
1418,BMU,2011,5550771000.0,64564.0
1428,BMU,2012,5537537000.0,64798.0
1432,BMU,2013,5573710000.0,65001.0
1895,CAN,2012,1824289000000.0,34750545.0
1896,CAN,2013,1842628000000.0,35155451.0
1912,CAN,2011,1788648000000.0,34342780.0
11799,USA,2012,16155260000000.0,313998379.0
11803,USA,2013,16691520000000.0,316204908.0
11809,USA,2011,15517930000000.0,311663358.0


# Reshape from long to wide using the `pivot_table()` method

`pivot_table()` expects in `index` all columns that together identify one row of the resulting table. To `columns` you pass the name of the column whose values define the additional columns to be created (here, we would like one column per year). Since we do not pass `values`, all remaining numeric columns are spread across the years.

In [4]:
df_wide = df.pivot_table(index=["countrycode"], columns=["year"])
df_wide.head(3)

Unnamed: 0_level_0,NY_GDP_MKTP_CD,NY_GDP_MKTP_CD,NY_GDP_MKTP_CD,SP_POP_TOTL,SP_POP_TOTL,SP_POP_TOTL
year,2011,2012,2013,2011,2012,2013
countrycode,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2
BMU,5550771000.0,5537537000.0,5573710000.0,64564.0,64798.0,65001.0
CAN,1788648000000.0,1824289000000.0,1842628000000.0,34342780.0,34750545.0,35155451.0
USA,15517930000000.0,16155260000000.0,16691520000000.0,311663358.0,313998379.0,316204908.0
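One detail worth keeping in mind: `pivot_table()` aggregates when several rows share the same `index`/`columns` combination, and the default aggregation is the mean. In the data above every country/year pair occurs only once, so nothing is actually averaged. The sketch below uses invented numbers (the frame `dup` is hypothetical) to make the behaviour visible.

```python
import pandas as pd

# Invented data: CAN appears twice for 2011
dup = pd.DataFrame(
    {
        "countrycode": ["CAN", "CAN", "USA"],
        "year": [2011, 2011, 2011],
        "SP_POP_TOTL": [10.0, 20.0, 30.0],
    }
)

# Default aggregation is the mean of the duplicated rows
wide_mean = dup.pivot_table(index="countrycode", columns="year", values="SP_POP_TOTL")

# A different aggregation can be requested via aggfunc
wide_sum = dup.pivot_table(
    index="countrycode", columns="year", values="SP_POP_TOTL", aggfunc="sum"
)

print(wide_mean)
print(wide_sum)
```

If duplicates are not expected, it may be safer to check for them before reshaping, because the averaging happens silently.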


You can use the following function to flatten the hierarchical columns and turn the index back into 'normal' columns:

In [5]:
def simplify_df(df, flatten_columns=True, reset_index=True):
    df = df.copy()

    def join_if_more_than_one(x):
        if len(x) > 1:
            return "_".join(x)
        else:
            return x[0]

    if flatten_columns and (isinstance(df.columns, pd.MultiIndex)):
        df.columns = [
            join_if_more_than_one(list(filter(None, l)))
            for l in [list(str(i) for i in t) for t in df.columns.to_flat_index()]
        ]
    if reset_index:
        df = df.reset_index()
    return df


df_wide_simple = simplify_df(df_wide)
df_wide_simple

Unnamed: 0,countrycode,NY_GDP_MKTP_CD_2011,NY_GDP_MKTP_CD_2012,NY_GDP_MKTP_CD_2013,SP_POP_TOTL_2011,SP_POP_TOTL_2012,SP_POP_TOTL_2013
0,BMU,5550771000.0,5537537000.0,5573710000.0,64564.0,64798.0,65001.0
1,CAN,1788648000000.0,1824289000000.0,1842628000000.0,34342780.0,34750545.0,35155451.0
2,USA,15517930000000.0,16155260000000.0,16691520000000.0,311663358.0,313998379.0,316204908.0
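For the simple case of exactly two column levels, the same flattening can also be written inline. The following is a minimal sketch on a tiny stand-in frame (`wide` and its values are made up), not a replacement for the more general `simplify_df()` above.

```python
import pandas as pd

# Small stand-in for df_wide: two column levels (variable, year)
wide = pd.DataFrame(
    [[1.0, 2.0, 3.0, 4.0]],
    index=pd.Index(["BMU"], name="countrycode"),
    columns=pd.MultiIndex.from_product(
        [["NY_GDP_MKTP_CD", "SP_POP_TOTL"], [2011, 2012]], names=[None, "year"]
    ),
)

flat = wide.copy()
# Iterating over a MultiIndex yields one tuple per column
flat.columns = [f"{variable}_{year}" for variable, year in flat.columns]
# Turn the row index (countrycode) back into an ordinary column
flat = flat.reset_index()
print(flat.columns.tolist())
```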


# Reshaping from wide to long

This is actually the more common problem: turning a wide table into a long one. We can use the previously created `df_wide_simple` as an example.

We can use the function `pd.wide_to_long()` or the method `melt()` of the `pd.DataFrame`. Here we use both, first `pd.wide_to_long()`.

In [6]:
df_long = pd.wide_to_long(
    df_wide_simple,
    sep="_",
    stubnames=["NY_GDP_MKTP_CD", "SP_POP_TOTL"],
    i=["countrycode"],
    j="year",
)
df_long

Unnamed: 0_level_0,Unnamed: 1_level_0,NY_GDP_MKTP_CD,SP_POP_TOTL
countrycode,year,Unnamed: 2_level_1,Unnamed: 3_level_1
BMU,2011,5550771000.0,64564.0
CAN,2011,1788648000000.0,34342780.0
USA,2011,15517930000000.0,311663358.0
BMU,2012,5537537000.0,64798.0
CAN,2012,1824289000000.0,34750545.0
USA,2012,16155260000000.0,313998379.0
BMU,2013,5573710000.0,65001.0
CAN,2013,1842628000000.0,35155451.0
USA,2013,16691520000000.0,316204908.0
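By default, `pd.wide_to_long()` only recognises numeric suffixes such as the years above. When the part after the separator is text, the `suffix` argument takes a regular expression describing it. The sketch below uses a made-up frame (`scores`, with the columns `score_pre` and `score_post`) purely for illustration.

```python
import pandas as pd

# Hypothetical wide data with non-numeric suffixes
scores = pd.DataFrame({"id": [1, 2], "score_pre": [3, 4], "score_post": [5, 6]})

scores_long = pd.wide_to_long(
    scores,
    stubnames="score",  # common prefix of the wide columns
    i="id",             # column(s) identifying a row in the wide table
    j="phase",          # name of the new column holding the suffix
    sep="_",
    suffix=r"\w+",      # allow word characters instead of digits only
).reset_index()
print(scores_long)
```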


In [7]:
df_long = simplify_df(df_long)
df_long

Unnamed: 0,countrycode,year,NY_GDP_MKTP_CD,SP_POP_TOTL
0,BMU,2011,5550771000.0,64564.0
1,CAN,2011,1788648000000.0,34342780.0
2,USA,2011,15517930000000.0,311663358.0
3,BMU,2012,5537537000.0,64798.0
4,CAN,2012,1824289000000.0,34750545.0
5,USA,2012,16155260000000.0,313998379.0
6,BMU,2013,5573710000.0,65001.0
7,CAN,2013,1842628000000.0,35155451.0
8,USA,2013,16691520000000.0,316204908.0


Note that `melt()` makes the table as long as possible: it generates one row not only per year but also per variable that changes over time. We can then reshape these "very long" data back into the "middle ground" that we started out with.

The argument `id_vars` determines which columns are repeated for each new row. `var_name` is the name of the newly generated column that holds the former column names, which here encode both the variable and the year.

In [8]:
df_very_long = df_wide_simple.melt(id_vars=["countrycode"], var_name="variable_year")
df_very_long

Unnamed: 0,countrycode,variable_year,value
0,BMU,NY_GDP_MKTP_CD_2011,5550771000.0
1,CAN,NY_GDP_MKTP_CD_2011,1788648000000.0
2,USA,NY_GDP_MKTP_CD_2011,15517930000000.0
3,BMU,NY_GDP_MKTP_CD_2012,5537537000.0
4,CAN,NY_GDP_MKTP_CD_2012,1824289000000.0
5,USA,NY_GDP_MKTP_CD_2012,16155260000000.0
6,BMU,NY_GDP_MKTP_CD_2013,5573710000.0
7,CAN,NY_GDP_MKTP_CD_2013,1842628000000.0
8,USA,NY_GDP_MKTP_CD_2013,16691520000000.0
9,BMU,SP_POP_TOTL_2011,64564.0


In [9]:
df_very_long[["variable", "year"]] = df_very_long["variable_year"].str.rsplit(
    r"_", n=1, expand=True
)
df_very_long = df_very_long.drop(columns="variable_year")
df_very_long

Unnamed: 0,countrycode,value,variable,year
0,BMU,5550771000.0,NY_GDP_MKTP_CD,2011
1,CAN,1788648000000.0,NY_GDP_MKTP_CD,2011
2,USA,15517930000000.0,NY_GDP_MKTP_CD,2011
3,BMU,5537537000.0,NY_GDP_MKTP_CD,2012
4,CAN,1824289000000.0,NY_GDP_MKTP_CD,2012
5,USA,16155260000000.0,NY_GDP_MKTP_CD,2012
6,BMU,5573710000.0,NY_GDP_MKTP_CD,2013
7,CAN,1842628000000.0,NY_GDP_MKTP_CD,2013
8,USA,16691520000000.0,NY_GDP_MKTP_CD,2013
9,BMU,64564.0,SP_POP_TOTL,2011


# Reshape the "very long" data into something "not too wide"

In [10]:
df_not_too_wide = df_very_long.pivot_table(
    index=["countrycode", "year"], columns=["variable"]
).reset_index()
df_not_too_wide.head()

Unnamed: 0_level_0,countrycode,year,value,value
variable,Unnamed: 1_level_1,Unnamed: 2_level_1,NY_GDP_MKTP_CD,SP_POP_TOTL
0,BMU,2011,5550771000.0,64564.0
1,BMU,2012,5537537000.0,64798.0
2,BMU,2013,5573710000.0,65001.0
3,CAN,2011,1788648000000.0,34342780.0
4,CAN,2012,1824289000000.0,34750545.0


In [11]:
# the index was already reset in the previous cell, so only flatten the columns
df_not_too_wide = simplify_df(df_not_too_wide, reset_index=False)
df_not_too_wide

Unnamed: 0,countrycode,year,value_NY_GDP_MKTP_CD,value_SP_POP_TOTL
0,BMU,2011,5550771000.0,64564.0
1,BMU,2012,5537537000.0,64798.0
2,BMU,2013,5573710000.0,65001.0
3,CAN,2011,1788648000000.0,34342780.0
4,CAN,2012,1824289000000.0,34750545.0
5,CAN,2013,1842628000000.0,35155451.0
6,USA,2011,15517930000000.0,311663358.0
7,USA,2012,16155260000000.0,313998379.0
8,USA,2013,16691520000000.0,316204908.0
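The `value_` prefix in the column names appears because `pivot_table()` was called without `values`, which produces two column levels. Naming the value column explicitly yields a single level, so no flattening is needed. The sketch below uses a four-row excerpt of the "very long" data (the names `very_long` and `tidy` are made up):

```python
import pandas as pd

# Excerpt of the "very long" data shown above
very_long = pd.DataFrame(
    {
        "countrycode": ["BMU", "BMU", "CAN", "CAN"],
        "year": ["2011", "2011", "2011", "2011"],
        "variable": ["NY_GDP_MKTP_CD", "SP_POP_TOTL"] * 2,
        "value": [5550771000.0, 64564.0, 1788648000000.0, 34342780.0],
    }
)

# Passing values="value" as a single name keeps the columns one level deep
tidy = very_long.pivot_table(
    index=["countrycode", "year"], columns="variable", values="value"
).reset_index()
tidy.columns.name = None  # drop the leftover "variable" label
print(tidy)
```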


# Exercise

1. Load the first sheet of the Excel file "wdi_reduced.xlsx" into a pandas DataFrame (see [here](https://pandas.pydata.org/docs/reference/api/pandas.read_excel.html) for help with `pandas.read_excel()`)
2. Select only the columns *countrycode*, *year*, and *NY_GDP_MKTP_CD*.
3. Select only the years after 2010.
4. Reshape to wide format. (One row per *countrycode*)
5. Generate a new column "large_value_in_2011" indicating whether a country's *NY_GDP_MKTP_CD* in 2011 was above the 2011 median.
6. Reshape to long format. (One row per *countrycode* and *year*).