# Toponym extraction
## Explore results
### Toponyms in news on Brexit in Dutch newspapers

Explore the results from this [case study](https://lcvriend.github.io/toponym_extraction/) in this notebook.  
Load the datasets and see the examples below.  
Run cells with `ctrl-enter`.

In [1]:
%cd ..
import pandas as pd
import altair as alt
from src.doc_analysis import most_common, load_counts
from src.config import LEXISNEXIS

/media/vanboefer/DATA/projects/lc/lexisnexis_place_extraction


### Load datasets
Load all datasets as `DataFrames` here.
* `lexis` refers to the metadata from the LexisNexis dataset.
* `toponym` refers to the dataset with the results from the toponym recognition.
* `lemmata` refers to the dataset with the results from the lemma recognition.
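
The `index_col=[0,1]` and `header=[0,1]` arguments in the load cell tell pandas to build a two-level row index and two-level column index from the first two columns and first two rows of each file. A minimal sketch of that behaviour on an in-memory CSV follows. The frame `toy`, its level names (`category`, `label`, `source`, `indicator`) and its values are made up for illustration, and the real result files may be laid out differently.

```python
import io
import pandas as pd

# Hypothetical miniature of a results table; names and values are made up.
toy = pd.DataFrame(
    [[6, 2, 5, 2], [10, 4, 8, 3]],
    index=pd.MultiIndex.from_tuples(
        [("countries", "Frankrijk"), ("countries", "Nederland")],
        names=["category", "label"],
    ),
    columns=pd.MultiIndex.from_product(
        [["volkskrant", "trouw"], ["frequency", "articles"]],
        names=["source", "indicator"],
    ),
)

# Write the frame to an in-memory CSV and read it back with the same
# arguments as the load cell, so both axes come back as two-level indexes.
buffer = io.StringIO()
toy.to_csv(buffer)
buffer.seek(0)
loaded = pd.read_csv(buffer, index_col=[0, 1], header=[0, 1])

print(loaded.index.nlevels, loaded.columns.nlevels)
```

Without these two arguments the header rows and label columns would be read as ordinary data, and the `xs` selections used further down would not work.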

In [2]:
lexis = pd.read_csv("data/lexisnexis_dataset.csv").astype({'publication_date': 'datetime64[ns]'})
toponym = pd.read_csv("results/toponym_results.gz", index_col=[0,1], header=[0,1])
lemmata = pd.read_csv("results/lemmata_results.gz", index_col=[0,1], header=[0,1])

### General

In [3]:
# number of articles per source
lexis.pivot_table(
    index='source',
    aggfunc='count',
    values='id').rename(columns={'id': 'articles'})

source,articles
Leeuwarder Courant,276
Telegraaf,488
Trouw,485
Volkskrant,581


In [4]:
# average number of tokens per article
data = lexis.pivot_table(
    index='source',
    aggfunc='mean',
    values='n_tokens')
alt.Chart(data.reset_index()).mark_bar().encode(
    x=alt.X('n_tokens', title='average number of tokens per article'),
    y='source',
    color=alt.Color('source', legend=None)
)

### Most common lemmata and toponyms
Select by:
* **Indicator**: 'frequency' or 'articles'
* **Category**: 'countries', 'places', 'places_uk', 'places_nl', 'places_fr'
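
How the two selectors act on a table with two column levels can be sketched on a small stand-in frame. This assumes the result tables are indexed by (category, label) on the rows and by (source, indicator) on the columns, which the `xs(..., axis=1, level=1)` calls in this notebook suggest but do not confirm. The frame `toy` and its values are hypothetical, and `most_common` itself is not reproduced here.

```python
import pandas as pd

# Hypothetical stand-in for the toponym results: rows indexed by
# (category, label), columns by (source, indicator). Values are made up.
index = pd.MultiIndex.from_tuples(
    [
        ("countries", "Frankrijk"),
        ("countries", "Nederland"),
        ("places_uk", "Londen"),
    ],
    names=["category", "label"],
)
columns = pd.MultiIndex.from_product(
    [["volkskrant", "trouw"], ["frequency", "articles"]],
    names=["source", "indicator"],
)
toy = pd.DataFrame(
    [[6, 2, 5, 2], [10, 4, 8, 3], [20, 9, 15, 7]],
    index=index,
    columns=columns,
)

# Indicator: keep one value of the second column level.
# The indicator level is dropped, leaving one column per source.
frequency = toy.xs("frequency", axis=1, level=1)

# Category: keep one value of the first row level.
# The category level is dropped, leaving one row per label.
countries = frequency.loc["countries"]
print(countries)
```

In the cells below, the indicator is chosen with `xs` in the same way, and the category is passed to `most_common` as its second argument.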

In [5]:
# ten most frequent lemmata
most_common(lemmata.xs('frequency', axis=1, level=1), 'lemma')

,volkskrant,volkskrant,trouw,trouw,telegraaf,telegraaf,leeuwarder_courant,leeuwarder_courant
ranking,label,count,label,count,label,count,label,count
0,gaan,1494,eu,1361,brexit,928,brexit,531
1,eu,1355,gaan,1165,gaan,859,gaan,486
2,brexit,1295,brexit,1140,eu,823,jaar,464
3,jaar,1261,jaar,1133,jaar,755,komen,434
4,komen,1237,komen,1049,komen,695,eu,434
5,groot,1229,groot,1039,brits,671,groot,424
6,europees,1160,land,1025,groot,573,may,394
7,maken,953,europees,852,europees,488,brits,365
8,zeggen,926,brits,822,goed,472,europees,340
9,land,860,maken,704,land,466,partij,314


In [6]:
# ten country toponyms occurring in the most articles
most_common(toponym.xs('articles', axis=1, level=1), 'countries')

,volkskrant,volkskrant,trouw,trouw,telegraaf,telegraaf,leeuwarder_courant,leeuwarder_courant
ranking,label,count,label,count,label,count,label,count
0,Verenigd Koninkrijk,446,Verenigd Koninkrijk,455,Verenigd Koninkrijk,283,Verenigd Koninkrijk,234
1,Nederland,218,Nederland,185,Nederland,161,Nederland,96
2,Verenigde Staten,159,Verenigde Staten,183,Verenigde Staten,91,Verenigde Staten,57
3,Duitsland,121,Duitsland,88,Duitsland,54,Duitsland,31
4,Frankrijk,114,Frankrijk,88,Frankrijk,53,Frankrijk,29
5,Polen,51,Ierland,51,China,31,Ierland,21
6,Rusland,46,Rusland,46,Ierland,30,China,12
7,Italië,44,China,45,België,23,Spanje,12
8,China,42,Polen,39,Italië,20,Japan,10
9,Ierland,41,Italië,38,Polen,19,Italië,10


In [7]:
# total frequency of UK toponyms, summed over the sources
# reshape the ranking table to long format: one row per (ranking, source)
data = (
    most_common(
        toponym.xs('frequency', axis=1, level=1), 'places_uk')
    .stack(level=0)
    .rename_axis(['ranking', 'source'])
    .reset_index()
    .rename(columns={'count': 'frequency', 'label': 'toponym'})
)
# plot on a log scale, with toponyms ordered by their total frequency
alt.Chart(data).mark_line().encode(
    x=alt.X('toponym', sort='y'),
    y=alt.Y(
        'sum(frequency)',
        title='total frequency',
        scale=alt.Scale(type='log')
    )
)