# 2.2 Wrangling

In this notebook, we practice with operations on tidy data frames.

In [30]:
# imports

import pandas as pd
import os

## Import the datasets

Let us load a few sample datasets into memory. They contain different metadata about newspaper titles: one holds metadata at the level of articles, while a second and a third hold metadata at the higher level of titles. This is a typical case of data about the same objects scattered across different dataframes (dfs).

In [52]:
root_folder = "../data/lwmnewspapers/"
df_articles = pd.read_csv(os.path.join(root_folder,"LwM-HMD-articles.csv"))
df_MPD_links = pd.read_csv(os.path.join(root_folder,"MPD_links.csv"))
df_wikiid_latlong = pd.read_csv(os.path.join(root_folder,"wikiid_lat_long.csv"))

Let's create a few smaller datasets to play with transformations, via selection.

In [53]:
df_articles_93 = df_articles[df_articles["year"] == 1893]
df_articles_87 = df_articles[df_articles["year"] == 1887]
df_articles_87.head()

,NLP,issue,art_num,title,collection,full_date,year,month,day,location,word_count,ocrquality,decade
1,3040,326,art0056,The Birkenhead News and Wirral General Adverti...,British Library Living with Machines Project,1887-03-26,1887,3,26,"Birkenhead, Merseyside, England",509,0.9844,1880
18,3035,319,art0060,The Herald of Wales.,British Library Living with Machines Project,1887-03-19,1887,3,19,"Swansea, West Glamorgan, Wales",444,0.875,1880
42,3051,1103,art0009,The Warwickshire Herald.,British Library Living with Machines Project,1887-11-03,1887,11,3,"Birmingham, West Midlands, England",19,0.8353,1880
90,2982,524,art0013,The Telegram.,British Library Living with Machines Project,1887-05-24,1887,5,24,"Weymouth, Dorset, England",240,0.6163,1880


Note that the index of the original df is preserved unless you reset it when subsetting or merging dataframes. Passing `drop=True` discards the old index; without it, the old index is kept as a new column named `index`:

In [54]:
df_articles_93 = df_articles[df_articles["year"] == 1893].reset_index(drop=True)
df_articles_87 = df_articles[df_articles["year"] == 1887].reset_index(drop=True)
df_articles_87.head()

,NLP,issue,art_num,title,collection,full_date,year,month,day,location,word_count,ocrquality,decade
0,3040,326,art0056,The Birkenhead News and Wirral General Adverti...,British Library Living with Machines Project,1887-03-26,1887,3,26,"Birkenhead, Merseyside, England",509,0.9844,1880
1,3035,319,art0060,The Herald of Wales.,British Library Living with Machines Project,1887-03-19,1887,3,19,"Swansea, West Glamorgan, Wales",444,0.875,1880
2,3051,1103,art0009,The Warwickshire Herald.,British Library Living with Machines Project,1887-11-03,1887,11,3,"Birmingham, West Midlands, England",19,0.8353,1880
3,2982,524,art0013,The Telegram.,British Library Living with Machines Project,1887-05-24,1887,5,24,"Weymouth, Dorset, England",240,0.6163,1880
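To see the index behaviour in isolation, here is a minimal sketch on a made-up frame. `toy_articles` and its values are hypothetical stand-ins for `df_articles`, not part of the dataset.

```python
import pandas as pd

# Hypothetical toy frame standing in for df_articles
toy_articles = pd.DataFrame({
    "NLP": [3040, 3035, 3051, 2982],
    "year": [1887, 1893, 1887, 1893],
})

# Boolean selection keeps the original row labels (here 0 and 2)
subset = toy_articles[toy_articles["year"] == 1887]
print(subset.index.tolist())

# drop=True discards the old labels and renumbers from 0
renumbered = subset.reset_index(drop=True)
print(renumbered.index.tolist())

# Without drop=True, the old labels are kept in a new column named "index"
kept = subset.reset_index()
print(kept.columns.tolist())
```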


## Joins

![figure with joins](https://www.datasciencemadesimple.com/wp-content/uploads/2017/09/join-or-merge-in-python-pandas-1.png?ezimgfmt=ng:webp/ngcb1)

- **Inner Join** or Natural join: To keep only the rows that match in both data frames, specify the argument `how='inner'`.
- **Outer Join** or Full outer join: To keep all rows from both data frames, specify `how='outer'`.
- **Left Join** or Left outer join: To include all the rows of your left data frame (x) and only those from the right one (y) that match, specify `how='left'`.
- **Right Join** or Right outer join: To include all the rows of your right data frame (y) and only those from the left one (x) that match, specify `how='right'`.
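The four options above can be compared on two tiny frames. This is only a sketch: `x`, `y` and their contents are invented for illustration, with `NLP` as the shared key column.

```python
import pandas as pd

# Hypothetical toy frames: x and y share the key column "NLP"
x = pd.DataFrame({"NLP": [1, 2, 3], "title": ["A", "B", "C"]})
y = pd.DataFrame({"NLP": [2, 3, 4], "place": ["Leeds", "Bath", "York"]})

# Collect the keys that survive each type of join
keys = {}
for how in ["inner", "outer", "left", "right"]:
    merged = x.merge(y, how=how, on="NLP")
    keys[how] = sorted(merged["NLP"].tolist())
    print(how, keys[how])
```

Rows without a partner in the other frame (for example `NLP` 1 in the left join) get `NaN` in the columns coming from that frame.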

The two dfs `df_articles` and `df_MPD_links` have two columns in common: `NLP` and `year`. To add to `df_articles` every column of `df_MPD_links` that it does not already have, matching on both `NLP` and `year`, you simply do a left merge:

In [None]:
df_articles.merge(df_MPD_links,how='left')

By default, `merge` uses all common columns as keys to match the left df with the right. If we specify only one column, the result is different:

In [None]:
df_articles.merge(df_MPD_links,how='left',on='NLP')

This is because in `df_MPD_links` each `NLP`-`year` combination has its own set of values. If we leave `year` out of the merge keys, pandas treats `year` in each df as a separate variable and returns every pairing of rows that share an `NLP` value, whatever their years. Note that in that case the `year` column is automatically renamed `year_x` (left df) and `year_y` (right df).
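The difference is easier to see on a small example. The sketch below uses invented frames (`left`, `right`, and the `mpd_id` column are hypothetical) in which one `NLP` value appears in two different years.

```python
import pandas as pd

# Hypothetical frames: the same NLP id occurs in two years on both sides
left = pd.DataFrame({"NLP": [1, 1], "year": [1887, 1893], "word_count": [509, 444]})
right = pd.DataFrame({"NLP": [1, 1], "year": [1887, 1893], "mpd_id": ["a", "b"]})

# Default keys: all common columns (NLP and year), so each row finds one partner
both = left.merge(right, how="left")
print(both.shape)

# Key on NLP only: every left row pairs with every right row sharing that NLP
nlp_only = left.merge(right, how="left", on="NLP")
print(nlp_only.shape)
print(nlp_only.columns.tolist())
```

The second merge has more rows than `left`, and `year` shows up twice with the `_x` and `_y` suffixes.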

Inner merge will get you the intersection between the two dfs based on the common columns:

In [None]:
# inner merge: keep only rows whose common columns match in both dfs
pd.merge(left=df_articles, right=df_MPD_links, how="inner")
# also equivalent:
# df_articles.merge(df_MPD_links, how="inner")

**⏰ ✏️ Exercise**:

* Using the most suitable type of merging, create a df starting from `df_articles` and adding any column not present there from `df_MPD_links` and `df_wikiid_latlong`. Make sure that the final dataframe has the same number of rows as `df_articles`!
* Which is the most represented `location` in the dataframe?

In [38]:
### Write your solution here

## Pivoting

This is bonus content!

For more (including stacking with multi-indexes and unpivoting or melting), see https://pandas.pydata.org/pandas-docs/stable/user_guide/reshaping.html

In [82]:
# toy data in long format: one row per place-year pair
data = {'place': ["London","London","Berlin","Berlin","Rome","Rome"],
       'year': [1800, 1900, 1800, 1900, 1800, 1900],
       'values': [10,20,30,40,50,60],
       'values2': [45,43,2,0,33,4]}
toy_df = pd.DataFrame(data)

In [86]:
toy_df

,place,year,values,values2
0,London,1800,10,45
1,London,1900,20,43
2,Berlin,1800,30,2
3,Berlin,1900,40,0
4,Rome,1800,50,33
5,Rome,1900,60,4


In [94]:
pivoted = toy_df.pivot(index='year', columns='place', values='values')

In [95]:
pivoted

place,Berlin,London,Rome
year,,,
1800,30,10,50
1900,40,20,60
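Melting, mentioned above as the inverse of pivoting, can be sketched on the same toy data. The block rebuilds the frame with only the `values` column so that it runs on its own; `long_again` is a name chosen here for illustration.

```python
import pandas as pd

# Same toy data as above, restricted to the "values" column
toy_df = pd.DataFrame({
    "place": ["London", "London", "Berlin", "Berlin", "Rome", "Rome"],
    "year": [1800, 1900, 1800, 1900, 1800, 1900],
    "values": [10, 20, 30, 40, 50, 60],
})
pivoted = toy_df.pivot(index="year", columns="place", values="values")

# reset_index turns "year" back into a column;
# melt then stacks the place columns into place/values pairs
long_again = pivoted.reset_index().melt(
    id_vars="year", var_name="place", value_name="values"
)
print(long_again.shape)
```

The result holds the same place-year-value triples as `toy_df`, although the rows come back in a different order.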
