### The Culture of International Relations

#### About this project
Cultural treaties are the bilateral and multilateral agreements among states that promote and regulate cooperation and exchange in the fields of life generally called cultural or intellectual. Although it was only invented in the early twentieth century, this treaty type came to be the fourth most common bilateral treaty in the period 1900-1980 (Poast et al., 2010). In this project, we seek to use several (mostly European) states’ cultural treaties as a historical source with which to explore the emergence of a global concept of culture in the twentieth century. Specifically, the project will investigate the hypothesis that the culture concept, in contrast to earlier ideas of civilization, played a key role in the consolidation of the post-World War II international order.

The central questions that interest me here can be divided into two groups: 
- First, what is the story of the cultural treaty, as a specific tool of international relations, in the twentieth century? What was the historical curve of cultural treaty-making? For example, in which political or ideological constellations do we find (the most) use of cultural treaties? Among which countries, in which historical periods? What networks of relations were thereby created, reinforced, or challenged? 
- Second, what is the "culture" addressed in these treaties? That is, what do the two signatories seem to mean by "culture" in these documents, and what does that tell us about the role that concept played in the international system? How can quantitative work on this dataset advance research questions about the history of concepts?

In this notebook, we deal with these treaties in three ways:
1) quantitative analysis of "metadata" about all bilateral cultural treaties signed between 1919 and 1972, as found in the World Treaty Index or WTI (Poast et al., 2010).
    For more on how exactly we define a "cultural treaty" here, and on other principles of selection, see... [add this, using text now in "WTI quality assurance"].
2) network analysis of the system of international relationships created by these treaties (using data from WTI, as above).
3) Text analysis of the complete texts of selected treaties. 

After some set-up sections, the discussion of the material begins at "Part 1," below.

### Brief Instructions on Jupyter Notebooks
Please see [this tutorial](https://www.youtube.com/watch?v=h9S4kN4l5Is) for an introduction to what Jupyter notebooks are and how to use them. There are lots of other Jupyter tutorials on YouTube (and elsewhere) as well. In short, a notebook is a document with embedded executable code presented in a simple and easy-to-use web interface. The most important things to note are:
- Click on the menu Help -> User Interface Tour for an overview of the Jupyter Notebook App user interface.
- The **code cells** contain the script code (Python in this case, but other languages are also supported) and are the sections marked by **In [x]** in the left margin. A cell is marked as **In []** if it hasn't been executed, and as **In [n]** when it has been executed (n is an integer). A cell marked as **In [\*]** is either executing or waiting to be executed (i.e. other cells are executing).
- The **current cell** is highlighted with a blue (or green if in "edit" mode) border. You make a cell current by clicking on it.
- Code cells aren't executed automatically. Instead you execute the current cell by either pressing **shift+enter** or the **play** button in the toolbar. The output (or result) of a cell's execution is presented directly below the cell prefixed by **Out[n]**.
- The next cell will automatically be selected (made current) after a cell has been executed. Repeatedly pressing **shift+enter** or the play button hence executes the cells in sequence.
- You can run the entire notebook in a single step by clicking on the menu Cell -> Run All. Note that this can take some time to finish. You can see how cells are executed in sequence via the indicator in the margin (i.e. "In [\*]" changes to "In [n]" where n is an integer).
- The cells can be edited if they are double-clicked, in which case the cell border turns green. Use the ESC key to escape edit mode (or click on any other cell).

To restart the kernel (i.e. the computational engine assigned to your session), click on the menu Kernel -> Restart. 


In [None]:
%%html
<style>
.jupyter-widgets, .widget-label, .widget-dropdown > select { font-size: 8pt; }
</style>

### <span style='color:blue'>**Mandatory Prepare Step**</span>: Setup Notebook and Load and Process Treaty Master Index
The following code cell must be executed once for each user session. The step loads utility Python code stored in separate files and imports dependencies on external libraries. The code also loads the WTI master index (and some related data files), and prepares the data for subsequent use.

The treaty data is processed as follows:
- All the treaty data are loaded.
- The year in which each treaty was signed is extracted as a separate field.
- New fields are added for the specified signed-period divisions.
- Fields 'group1' and 'group2' are ignored (many missing values). Instead, groups are fetched via party code from the encoding found in the "groups" table.
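
As a rough illustration of the year-extraction and period-division steps, the sketch below derives a signing year and a period label for a few treaties. The field names `signed`, `signed_year` and `signed_period`, and the fallback value `'other'`, follow the names used in the code further down. The sample rows, the period boundaries and the helper `signed_period()` are invented for this example and are not the project's actual divisions or loading code.

```python
from bisect import bisect_right
from datetime import date

# Placeholder boundaries (first year of each period); NOT the project's real divisions.
PERIOD_STARTS = [1919, 1945, 1956]
PERIOD_LABELS = ['1919 to 1944', '1945 to 1955', '1956 to 1972']
LAST_YEAR = 1972


def signed_period(signed_year):
    """Return the label of the period containing signed_year, or 'other'."""
    if signed_year < PERIOD_STARTS[0] or signed_year > LAST_YEAR:
        return 'other'
    return PERIOD_LABELS[bisect_right(PERIOD_STARTS, signed_year) - 1]


# Invented sample rows keyed by a made-up treaty id.
treaties = {
    'A1': {'signed': date(1925, 3, 14)},
    'B2': {'signed': date(1948, 11, 2)},
    'C3': {'signed': date(1975, 6, 30)},
}

for treaty in treaties.values():
    # Extract the signing year as a separate field.
    treaty['signed_year'] = treaty['signed'].year
    # Add a field for the period division the year falls into.
    treaty['signed_period'] = signed_period(treaty['signed_year'])

for treaty_id, treaty in treaties.items():
    print(treaty_id, treaty['signed_year'], treaty['signed_period'])
```

Treaties that fall outside every period are labelled `'other'`, which is the value the toplist code below filters out before counting.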

In [1]:
# Setup
import os
import re
import logging
import datetime
import wordcloud
import warnings
import pandas as pd
import numpy as np
import bokeh.plotting as bp
import bokeh.palettes
import bokeh.models as bm
import bokeh.io
import ipywidgets as widgets
import matplotlib.pyplot as plt
import matplotlib

from math import sqrt
from bokeh.io import push_notebook

from IPython.display import display, HTML
from IPython.core.interactiveshell import InteractiveShell

import sys

# Make the parent folder importable so that the "common" package can be found
if '..' not in sys.path:
    sys.path.append('..')

from common.file_utility import FileUtility
from common.widgets_utility import BaseWidgetUtility

import configuration_elements as config

logging.basicConfig(format="%(asctime)s : %(levelname)s : %(message)s", level=logging.ERROR)
logger = logging.getLogger()
logger.setLevel(logging.ERROR)

warnings.filterwarnings('ignore')
bp.output_notebook()

%config IPCompleter.greedy=True

from common.treaty_state import load_treaty_state

# Load and process treaties master index
state = load_treaty_state('../data')


### Task: Headnote word toplist and word-pair co-occurrence toplist
This report displays a headnote toplist of either single-word occurrences or word-pair co-occurrences, depending on whether the "Cooccurrence" option is checked. The result is grouped by the selected division's periods or by year.

Word co-occurrence is defined as the number of times a pair of words occurs together in the same headnote. The length of the headnote is ignored in the computation (all pairs have equal weight). Multiple occurrences of a word in a headnote are taken into account, i.e. "cultural exchange cultural" is counted as two co-occurrences, and "cultural exchange exchange cultural" as four.

Stopwords are always removed from the co-occurrence computation, whereas they are removed from the single-word occurrence toplist only if "Remove stopwords" is checked. The removal is based on NLTK's list of English stopwords (run ```nltk.corpus.stopwords.words('english')``` to display all stopwords).
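
The counting rule described above can be sketched in a few lines of plain Python. This is only an illustration of the definition as stated in the text, not the notebook's own implementation. The helper name `headnote_pairs()` and the four-word stopword set are assumptions made for the example (the notebook itself relies on NLTK's English list).

```python
from collections import Counter
from itertools import combinations

# Tiny stand-in stopword list; the notebook uses NLTK's English stopwords instead.
STOPWORDS = {'of', 'the', 'and', 'on'}


def headnote_pairs(headnote):
    """Count unordered word pairs in one headnote, weighting repeated words."""
    words = [w for w in headnote.lower().split() if w not in STOPWORDS]
    word_counts = Counter(words)
    pairs = Counter()
    # Sorting puts the alphabetically smaller word first in every pair.
    for a, b in combinations(sorted(word_counts), 2):
        # A word occurring m times pairs m * n times with a word occurring n times.
        pairs[(a, b)] = word_counts[a] * word_counts[b]
    return pairs


print(headnote_pairs('cultural exchange cultural'))
print(headnote_pairs('cultural exchange exchange cultural'))
print(headnote_pairs('exchange of cultural and scientific information'))
```

The first two calls reproduce the counts of two and four given in the text. The third shows that a headnote with four distinct non-stopwords yields six pairs, each counted once.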

The toplist can be filtered so that only treaties involving one, or any, of the five parties of interest are included, and words can be excluded based on character length. Each resulting group can also be restricted both by a maximum number of words or pairs to display per group and by a minimum (co-)occurrence count.
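
The per-group restrictions can be illustrated with a small sketch that assumes the observations are already available as (group, word) tuples. The function `toplist()` and the sample rows are hypothetical. The parameter names `n_top`, `n_min_count` and `min_word_size` mirror the widget options used below, but the function is a simplified stand-in for the notebook code.

```python
from collections import Counter, defaultdict


def toplist(rows, n_top=2, n_min_count=1, min_word_size=2):
    """Return {group: [(word, count), ...]} holding the top words per group."""
    counts = defaultdict(Counter)
    for group, word in rows:
        # Exclude words that are shorter than the minimum character length.
        if len(word) >= min_word_size:
            counts[group][word] += 1
    return {
        # Take the n_top most frequent words, then drop those below the minimum count.
        group: [(w, c) for w, c in counter.most_common(n_top) if c >= n_min_count]
        for group, counter in counts.items()
    }


# Invented (period, word) observations, one tuple per word occurrence.
rows = [
    ('1919 to 1944', 'cultural'), ('1919 to 1944', 'cultural'),
    ('1919 to 1944', 'exchange'), ('1919 to 1944', 'a'),
    ('1945 to 1955', 'scientific'), ('1945 to 1955', 'scientific'),
    ('1945 to 1955', 'cultural'),
]

print(toplist(rows, n_top=2, n_min_count=2))
```

With these settings each period keeps only the words that occur at least twice, so a group can end up with fewer than `n_top` entries.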

In [11]:
# Code
import os

import sys

# Make the parent folder importable so that the "common" package can be found
if '..' not in sys.path:
    sys.path.append('..')

import nltk
import qgrid
import ipywidgets as widgets
import pandas as pd
import logging

from nltk.stem import WordNetLemmatizer
from common.widgets_utility import BaseWidgetUtility
from IPython.display import display, HTML

import configuration_elements as config
from common.treaty_state import load_treaty_state

logging.basicConfig(format="%(asctime)s : %(levelname)s : %(message)s", level=logging.ERROR)
logger = logging.getLogger()
logger.setLevel(logging.ERROR)

# Load and process treaties master index
state = load_treaty_state('../data')


class HeadnoteTokenServiceOLD():

    def __init__(self, tokenizer, stopwords=None, lemmatizer=None, min_word_size=2):
        
        self.transforms = [
            tokenizer,
            lambda ws: ( x for x in ws if len(x) >= min_word_size ),
            lambda ws: ( x for x in ws if any(ch.isalpha() for ch in x)) 
        ]
        
        if stopwords is not None:
            self.transforms += [ lambda ws: ( x for x in ws if x not in stopwords ) ]
            
        if lemmatizer is not None:
            self.transforms += [ lambda ws: ( lemmatizer(x) for x in ws ) ]

    def _apply_transforms(self, ws):
        for f in self.transforms:
            ws = f(ws)
        return list(ws)
    
    def parse_headnotes(self, treaties):
        
        headnotes = treaties['headnote']
        
        texts = [ x.lower() for x in list(headnotes) ]
        # Tokenize and filter each lower-cased headnote
        tokens = list(map(self._apply_transforms, texts))
        df = pd.DataFrame({'headnote': headnotes, 'tokens': tokens })
        
        return df
    
    def compute_stacked(self, treaties):
        
        df = self.parse_headnotes(treaties)
        
        df_stacked = pd.DataFrame(df.tokens.tolist(), index=df.index).stack()\
            .reset_index().rename(columns={'level_1': 'sequence_id', 0: 'token'})
            
        return df_stacked
    
    def compute_co_occurrence(self, treaties, pos_tags, only_cultural_treaties=False):

        # Filter out tags based on treaties of interest
        df_pos_tags = pos_tags.merge(treaties, how='inner', left_on='treaty_id', right_index=True)
        
        if only_cultural_treaties:
            df_pos_tags = df_pos_tags[(df_pos_tags.is_cultural.str.contains('yes',na=False))]

        # Self join of words within same treaty
        df_co_occurrence = pd.merge(df_pos_tags, df_pos_tags, how='inner', left_on='treaty_id', right_on='treaty_id')
        # Only consider a specific pair once
        df_co_occurrence = df_co_occurrence[(df_co_occurrence.wid_x < df_co_occurrence.wid_y)]
        # Reduce number of returned columns
        df_co_occurrence = df_co_occurrence[['treaty_id', 'year_x', 'is_cultural_x', 'lemma_x', 'lemma_y' ]]
        # Rename columns
        df_co_occurrence.columns = ['treaty_id', 'year', 'is_cultural', 'lemma_x', 'lemma_y' ]

        # Sort token pair so smallest always comes first
        lemma_x = df_co_occurrence[['lemma_x', 'lemma_y']].min(axis=1)
        lemma_y = df_co_occurrence[['lemma_x', 'lemma_y']].max(axis=1)
        df_co_occurrence['lemma_x'] = lemma_x
        df_co_occurrence['lemma_y'] = lemma_y

        return df_co_occurrence

class HeadnoteTokenCorpus():

    def __init__(self, treaties, tokenize=None, stopwords=None, lemmatize=None, min_size=2):
        
        tokenize = tokenize or nltk.tokenize.word_tokenize
        lemmatize = lemmatize or WordNetLemmatizer().lemmatize
        stopwords = stopwords or nltk.corpus.stopwords.words('english')
        
        self.transforms = [
            tokenize,
            lambda ws: ( x for x in ws if len(x) >= min_size ),
            lambda ws: ( x for x in ws if any(ch.isalpha() for ch in x)),
            lambda ws: list(set(ws)) 
        ]
        
        #if stopwords is not None:
        #    self.transforms += [ lambda ws: ( x for x in ws if x not in stopwords ) ]
            
        #if lemmatizer is not None:
        #    self.transforms += [ lambda ws: ( lemmatizer(x) for x in ws ) ]
        
        treaty_tokens = self._compute_stacked(treaties)
        vocabulary = treaty_tokens.token.unique()
        lemmas = list(map(lemmatize, vocabulary))
        # Only keep words whose lemma differs from the word itself
        lemma_map = { w: l for (w, l) in zip(vocabulary, lemmas) if w != l }
        stopwords_map = { s : True for s in stopwords }
        treaty_tokens['lemma'] = treaty_tokens.token.apply(lambda x: lemma_map.get(x, x))
        treaty_tokens['is_stopword'] = treaty_tokens.token.apply(lambda x: stopwords_map.get(x, False))

        self.treaty_tokens = treaty_tokens.set_index(['treaty_id', 'sequence_id'])
        
    def _apply_transforms(self, ws):
        for f in self.transforms:
            ws = f(ws)
        return list(ws)
    
    def _parse_headnotes(self, treaties):
        
        headnotes = treaties['headnote']
        
        texts = [ x.lower() for x in list(headnotes) ]
        tokens = list(map(self._apply_transforms, texts))
        df = pd.DataFrame({'headnote': headnotes, 'tokens': tokens })
        
        return df
    
    def _compute_stacked(self, treaties):
        
        df = self._parse_headnotes(treaties)
        
        df_stacked = pd.DataFrame(df.tokens.tolist(), index=df.index).stack()\
            .reset_index().rename(columns={'level_1': 'sequence_id', 0: 'token'})
            
        return df_stacked
    
def compute_co_occurrance(treaties):
    
    treaty_tokens = state.treaty_headnote_corpus.treaty_tokens
    i1 = treaties.index
    # i2 = treaty_tokens.reset_index().set_index('treaty_id').index
    i2 = treaty_tokens.index.get_level_values(0)
    treaty_tokens = treaty_tokens[i2.isin(i1)]
    
    treaty_tokens = treaty_tokens.loc[treaty_tokens.is_stopword==False]
    treaty_tokens = treaty_tokens.reset_index().drop(['is_stopword', 'sequence_id'], axis=1).set_index('treaty_id')

    co_occurrance = treaty_tokens.merge(treaty_tokens, how='inner', left_index=True, right_index=True)
    co_occurrance = co_occurrance.loc[(co_occurrance['token_x'] < co_occurrance['token_y'])]
    #co_occurrance['token'] = co_occurrance.apply(lambda row: row[groupby_pair[0]] + ' - ' + row[groupby_pair[1]], axis=1)
    co_occurrance['token'] = co_occurrance.apply(lambda row: ' - '.join([row['token_x'].upper(), row['token_y'].upper()]), axis=1)
    co_occurrance['lemma'] = co_occurrance.apply(lambda row: ' - '.join([row['lemma_x'].upper(), row['lemma_y'].upper()]), axis=1)
    co_occurrance = co_occurrance.assign(is_stopword=False, sequence_id=0)[['sequence_id', 'token', 'lemma', 'is_stopword']]
    
    return co_occurrance

def create_bigram_transformer(documents):
    import gensim.models.phrases
    bigram = gensim.models.phrases.Phrases(map(nltk.tokenize.word_tokenize, documents))
    return lambda ws: bigram[ws]

def remove_snake_case(snake_str):
    return ' '.join(x.title() for x in snake_str.split('_'))

def get_top_partiesssss(data, period, party_name, n_top=5):
    xd = data.groupby([period, party_name]).size().rename('TopCount').reset_index()
    top_list = xd.groupby([period]).apply(lambda x: x.nlargest(n_top, 'TopCount'))\
        .reset_index(level=0, drop=True)\
        .set_index([period, party_name])
    return top_list

result=None
def display_headnote_toplist(
    period=None,
    parties=None,
    extra_groupbys=None,
    only_is_cultural=True,
    use_lemma=False,
    compute_co_occurance=False,
    remove_stopwords=True,
    min_word_size=2,
    n_min_count=1,
    output_format='table',
    n_top=50
    # plot_style=tw.plot_style
):
    global ihnw, result
    
    try:
        hnw.progress.value = 1    
        treaties = state.treaties.loc[state.treaties.signed_period != 'other']

        if state.treaty_headnote_corpus is None:
            print('Preparing headnote corpus for first time use')
            state.treaty_headnote_corpus = HeadnoteTokenCorpus(treaties=treaties)

        if only_is_cultural:
            # Build the mask from the already filtered frame so that the indexes align
            treaties = treaties.loc[treaties.is_cultural]

        if parties is not None:
            ids = state.stacked_treaties.loc[(state.stacked_treaties.party.isin(parties))].index
            # isin() avoids KeyErrors for ids filtered out above, and duplicate rows
            treaties = treaties.loc[treaties.index.isin(ids)]

        hnw.progress.value += 1

        if compute_co_occurance:

            treaty_tokens = compute_co_occurrance(treaties)

        else:

            treaty_tokens = state.treaty_headnote_corpus.treaty_tokens

            if remove_stopwords is True:
                treaty_tokens = treaty_tokens.loc[treaty_tokens.is_stopword==False]

            treaty_tokens = treaty_tokens.reset_index().set_index('treaty_id')

        hnw.progress.value += 1

        treaty_tokens = treaty_tokens\
            .merge(treaties, how='inner', left_index=True, right_index=True)\
            .drop(['source', 'signed', 'headnote', 'is_cultural', # 'is_cultural_yesno', 'sequence',
                   'topic1', 'topic2', 'title'], axis=1)

        hnw.progress.value += 1

        token_or_lemma = 'token' if not use_lemma else 'lemma'

        groupbys  = []
        groupbys += [ period ] if period is not None else []
        groupbys += (extra_groupbys or [])
        groupbys += [ token_or_lemma ]

        result = treaty_tokens.groupby(groupbys).size().reset_index().rename(columns={0: 'Count'})

        hnw.progress.value += 1

        # Keep only the n_top most frequent words within each group
        group_keys = groupbys[:-1]
        result = result.sort_values('Count', ascending=False)
        if group_keys:
            result = result.groupby(group_keys).head(n_top)
        else:
            result = result.head(n_top)

        if min_word_size > 0:
            result = result.loc[result[token_or_lemma].str.len() >= min_word_size]

        if n_min_count > 1:
            result = result.loc[result.Count >= n_min_count]

        hnw.progress.value += 1

        result = result.sort_values(groupbys[:-1] + ['Count'], ascending=len(groupbys[:-1])*[True] + [False])

        hnw.progress.value += 1

        if output_format in ('table', 'qgrid'):
            result.columns = [ remove_snake_case(x) for x in result.columns ]
            if output_format == 'table':
                display(HTML(result.to_html()))
            else:
                qgrid_widget = qgrid.show_grid(result, show_toolbar=True)
                # A bare expression is not rendered inside a function; display it explicitly
                display(qgrid_widget)
        elif output_format == 'unstack':
            result = result.set_index(groupbys).unstack(level=0).fillna(0).astype('int32')
            result.columns = [ x[1] for x in result.columns ]
            display(HTML(result.to_html()))
        elif output_format == 'unstack_plot':
            result = result.set_index(list(reversed(groupbys))).unstack(level=0).fillna(0).astype('int32')
            result.columns = [ x[1] for x in result.columns ]
            result.plot(kind='bar', figsize=(16,8))

    except Exception as ex:
        logger.error(ex)
        
    hnw.progress.value += 1
    hnw.progress.value = 0

hnw = BaseWidgetUtility(
    period=widgets.Dropdown(
        options={
            '': None,
            'Year': 'signed_year',
            'Default division': 'signed_period',
            'Alt. division': 'signed_period_alt'
        },
        value='signed_period',
        description='Period:',
        icon='',
        layout=widgets.Layout(width='200px', left='0')
    ),
    parties=widgets.Dropdown(
        options=config.default_party_options,
        value=None,
        description='Parties:', icon='', layout=widgets.Layout(width='200px', left='0')
    ),
    use_lemma=widgets.ToggleButton(
        description='Use lemma', value=False,
        tooltip='Use WordNet lemma', icon='', layout=widgets.Layout(width='140px', left='0')
    ),
    remove_stopwords=widgets.ToggleButton(
        description='Remove stopwords', value=True,
        tooltip='Do not include stopwords', icon='', layout=widgets.Layout(width='140px', left='0')
    ),
    extra_groupbys=widgets.Dropdown(
        options={
            '': None,
            'Topic': [ 'Topic' ],
        },
        value=None,
        description='Groupbys:', icon='', layout=widgets.Layout(width='140px', left='0')
    ),
    min_word_size=widgets.BoundedIntText(
        value=2, min=0, max=5, step=1,
        description='Min word:', layout=widgets.Layout(width='140px')
    ),
    only_is_cultural=widgets.ToggleButton(
        description='Only Cultural', value=True,
        tooltip='Display only "is_cultural" treaties', icon='', layout=widgets.Layout(width='140px', left='0')
    ),
    compute_co_occurance=widgets.ToggleButton(
        description='Cooccurrence', value=True,
        tooltip='Compute Cooccurrence', icon='', layout=widgets.Layout(width='140px', left='0')
    ),
    output_format=widgets.Dropdown(
        description='Output', value='table',
        options={
            'Table': 'table',
            'Qgrid': 'qgrid',
            'Unstack': 'unstack',
            'Unstack plot': 'unstack_plot'
        },
        icon='',
        layout=widgets.Layout(width='140px', left='0')
    ),
    plot_style=widgets.Dropdown(
        options=config.matplotlib_plot_styles,
        value='seaborn-pastel',
        description='Style:', icon='', layout=widgets.Layout(width='140px', left='0')
    ),
    n_top=widgets.IntSlider(
        value=25, min=2, max=100, step=10,
        description='Top/grp #:', # continuous_update=False,
    ),
    n_min_count=widgets.IntSlider(
        value=2, min=1, max=10, step=1,
        tooltip='Filter out words with count less than specified value',
        description='Min count:', # continuous_update=False,
    ),
    progress=widgets.IntProgress(
        min=0, max=10, step=1, value=0, layout=widgets.Layout(width='99%')
    )
)

ihnw = widgets.interactive(
    display_headnote_toplist,
    period=hnw.period,
    parties=hnw.parties,
    extra_groupbys=hnw.extra_groupbys,
    only_is_cultural=hnw.only_is_cultural,
    n_min_count=hnw.n_min_count,
    n_top=hnw.n_top,
    min_word_size=hnw.min_word_size,
    use_lemma=hnw.use_lemma,
    compute_co_occurance=hnw.compute_co_occurance,
    remove_stopwords=hnw.remove_stopwords,
    output_format=hnw.output_format,
    # plot_style=tw.plot_style
)

boxes = widgets.HBox(
    [
        widgets.VBox([ hnw.period, hnw.parties, hnw.min_word_size ]),
        widgets.VBox([ hnw.extra_groupbys, hnw.n_top, hnw.n_min_count]),
        widgets.VBox([ hnw.only_is_cultural, hnw.use_lemma, hnw.remove_stopwords, hnw.compute_co_occurance]),
        widgets.VBox([ hnw.output_format, hnw.progress ])
    ]
)
display(widgets.VBox([boxes, ihnw.children[-1]]))
ihnw.update()

