### The Culture of International Relations

### DH Nordic 2018 task list

<pre>
<b>DONE GRAPH 1: A graph with the top five countries in terms of how many new cultural treaties they signed, per period.</b>
<b>DONE GRAPH 2: France’s (cultural=yes) treaty totals by period.</b>
NEW GRAPH 3: All (cult=yes) treaties, 1919-1944, in Cytoscape.  
NEW GRAPH 4: All (cult=yes) treaties, 1945-1955, in Cytoscape.  
<b>DONE GRAPH 5: 7CULT, 7SCI, and 7EDUC over time (by year)</b>
<b>DONE GRAPH 6: 7CULT, 7SCI, and 7EDUC+4EDUC (integrated into one variable) over time (by year)</b>
<b>DONE GRAPH 7: 7CULT, 7SCI, and 7EDUC+4EDUC over time (by period)</b>
DRAFT GRAPH 8: Most frequent (meaningful) co-occurrences in treaty headings.
DRAFT GRAPH 9: Plot selected co-occurrences over time?
</pre>

### Brief Instructions on Jupyter Notebooks
Please see [add link] for an introduction to what Jupyter notebooks are and how to use them. In short, a notebook is a document with embedded executable code, presented in a simple and easy-to-use web interface. The most important things to note are:
- Click on the menu Help -> User Interface Tour for an overview of the Jupyter Notebook App user interface.
- The **code cells** contain the script code (Python in this case, but other languages are also supported) and are the sections marked by **In [x]** in the left margin. A cell is marked as **In []** if it hasn't been executed, and as **In [n]** when it has been executed (n is an integer). A cell marked as **In [\*]** is either executing or waiting to be executed (i.e. other cells are executing).
- The **current cell** is highlighted with a blue (or green if in "edit" mode) border. You make a cell current by clicking on it.
- Code cells aren't executed automatically. Instead you execute the current cell by either pressing **shift+enter** or the **play** button in the toolbar. The output (or result) of a cell's execution is presented directly below the cell prefixed by **Out[n]**.
- The next cell will automatically be selected (made current) after a cell has been executed. Repeatedly pressing **shift+enter** or the play button hence executes the cells in sequence.
- You can run the entire notebook in a single step by clicking on the menu Cell -> Run All. Note that this can take some time to finish. You can see how cells are executed in sequence via the indicator in the margin (i.e. "In [\*]" changes to "In [n]" where n is an integer).
- The cells can be edited if they are double-clicked, in which case the cell border turns green. Use the ESC key to escape edit mode (or click on any other cell).

To restart the kernel (i.e. the computational engine assigned to your session), click on the menu Kernel -> Restart. 


In [None]:
# YouTube video
from IPython.lib.display import YouTubeVideo
YouTubeVideo("h9S4kN4l5Is", width=400, height=300)

### How to update data from Google Drive
The statistics computed on this page depend on a recent version of the WTI treaties master list. This file is stored on Google Drive, and the script "./google_drive.py" can be used to download and update the data. Please note that the load script below reads CSV files with specific names, so a manual download of the master list must be followed by saving each sheet as a CSV file. The script ./google_drive.py does this automatically.
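
As a quick check after a manual download, the sketch below lists which CSV files the load step would still be missing. It assumes the naming convention `<workbook name>_<sheet name>.csv`, which is inferred from the file names read in the load cell (for example `Treaties_Master_List_Treaties.csv`). The helper names `expected_csv_names` and `missing_csv_files` are hypothetical and are not part of `google_drive.py`.

```python
import os

def expected_csv_names(destination, sheets):
    """Return the CSV paths the load step expects for one downloaded file.

    Assumed convention: '<workbook stem>_<sheet>.csv' next to the workbook;
    a file without sheets (already a CSV) is used as-is.
    """
    if not sheets:
        return [destination]
    folder, filename = os.path.split(destination)
    stem, _extension = os.path.splitext(filename)
    return [os.path.join(folder, '{}_{}.csv'.format(stem, sheet)) for sheet in sheets]

def missing_csv_files(files):
    """List the expected CSV paths that do not exist on disk yet."""
    missing = []
    for entry in files.values():
        for path in expected_csv_names(entry['destination'], entry['sheets']):
            if not os.path.exists(path):
                missing.append(path)
    return missing

# Same destinations and sheets as in the update cell (file IDs omitted).
files = {
    'WTI Master Index': {'destination': './data/Treaties_Master_List.xlsx', 'sheets': ['Treaties']},
    'Curated Parties': {'destination': './data/parties_curated.xlsx', 'sheets': ['parties', 'group', 'continent']},
    'Country & Continent': {'destination': './data/country_continent.csv', 'sheets': []},
}

for path in missing_csv_files(files):
    print('Missing:', path)
```

If sheets are saved by hand, keep in mind that the load cell reads the files with a tab separator (`sep='\t'`), so they should be exported as tab-separated text.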


In [2]:
# Code: Update WTI master data from Google Drive
%run ./google_drive
%run ./widgets_utility

import logging

logging.getLogger().setLevel(logging.INFO)
logger.setLevel(logging.INFO)

files_to_download = {
    'WTI Master Index': {
        'file_id': '1V8KPeghLQ2iOMWkbPqff480zDSLa5YDX',
        'destination': './data/Treaties_Master_List.xlsx',
        'sheets': [ 'Treaties' ]
    },
    'Curated Parties': {
        'file_id': '1k4dOPuqR7oi4K8SazoGN6R40jOBWOdWp',
        'destination': './data/parties_curated.xlsx',
        'sheets': ['parties', 'group', 'continent']
    },
    'Country & Continent': {
        'file_id': '19lEmVPu7hNmr1MaMpU0VvKL7muu-OKg9',
        'destination': './data/country_continent.csv',
        'sheets': [ ]
    }
}

def update_file(file, confirm):
    global upw
    if file is None:
        return
    if confirm is False:
        print('Please confirm update by checking the CONFIRM button!')
        return
    upw.confirm.value = False
    print('Updating Google file with ID: {}'.format(file['file_id']))
    process_file(file, overwrite=confirm)
    
upw = BaseWidgetUtility(
    file=widgets.Dropdown(
        options=files_to_download,
        value=None,
        description='File:',
    ),
    confirm=widgets.ToggleButton(
        description='Confirm',
        button_style='',
        icon='check',
        value=False
    ),)
iupw = widgets.interactive(update_file, file=upw.file, confirm=upw.confirm)
display(widgets.VBox([widgets.HBox([upw.file, upw.confirm]), iupw.children[-1]]))
# iupw.update()

In [None]:
%%html
<style>
.jupyter-widgets {
    font-size: 9pt;
}
.widget-label {
    font-size: 8pt;
}
.widget-dropdown > select {
    font-size: 8pt;
}
</style>

### <span style='color:blue'>**Mandatory Prepare Step**</span>: Setup Notebook
The following code cell must be executed once for each user session.

The step loads utility Python code stored in separate files and imports the external libraries that the notebook depends on. The following external libraries are used:
- [NLTK](https://www.nltk.org/): *Bird, Steven, Edward Loper and Ewan Klein (2009), Natural Language Processing with Python. O’Reilly Media Inc.*
- [gensim](https://radimrehurek.com/gensim/index.html): [Google scholar](https://scholar.google.com/citations?view_op=view_citation&hl=en&user=9vG_kV0AAAAJ&citation_for_view=9vG_kV0AAAAJ:NaGl4SEjCO4C)
- pandas
- networkx
- bokeh

In [11]:
# Setup
%run ./file_utility
%run ./network_utility
%run ./widgets_utility

import os
import re
import glob
import logging
import fnmatch
import datetime
import wordcloud
import warnings
import pandas as pd
import numpy as np
import networkx as nx
import bokeh.plotting as bp
import bokeh.palettes
import bokeh.models as bm
import ipywidgets as widgets
import matplotlib.pyplot as plt
import matplotlib
import zipfile
import nltk.tokenize
import nltk.corpus
import gensim.models

from pivottablejs import pivot_ui
from math import sqrt
from bokeh.io import push_notebook
from gensim.corpora.textcorpus import TextCorpus

from IPython.display import display, HTML #, clear_output, IFrame
from IPython.core.interactiveshell import InteractiveShell

logging.basicConfig(format="%(asctime)s : %(levelname)s : %(message)s", level=logging.ERROR)
logger = logging.getLogger()
logger.setLevel(logging.ERROR)

TOOLS = "pan,wheel_zoom,box_zoom,reset,hover,previewsave"

InteractiveShell.ast_node_interactivity = "all"
warnings.filterwarnings('ignore')
bp.output_notebook()

%run utility
#%autosave 120
%config IPCompleter.greedy=True

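# Note: the 'display.height' option no longer exists in current pandas versions
# and raises an OptionError there; 'display.max_rows' below sets the row limit.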
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)

matplotlib_plot_styles =[
    'ggplot',
    'bmh',
    'seaborn-notebook',
    'seaborn-whitegrid',
    '_classic_test',
    'seaborn',
    'fivethirtyeight',
    'seaborn-white',
    'seaborn-dark',
    'seaborn-talk',
    'seaborn-colorblind',
    'seaborn-ticks',
    'seaborn-poster',
    'seaborn-pastel',
    'fast',
    'seaborn-darkgrid',
    'seaborn-bright',
    'Solarize_Light2',
    'seaborn-dark-palette',
    'grayscale',
    'seaborn-muted',
    'dark_background',
    'seaborn-deep',
    'seaborn-paper',
    'classic'
]

output_formats = {
    'Plot vertical bar': 'plot_bar',
    'Plot horizontal bar': 'plot_barh',
    'Plot vertical bar, stacked': 'plot_bar_stacked',
    'Plot horizontal bar, stacked': 'plot_barh_stacked',
    'Plot line': 'plot_line',
    'Plot stacked line': 'plot_line_stacked',
    # 'Chart ': 'chart',
    'Table': 'table',
    'Pivot': 'pivot'
}
toggle_style = dict(icon='', layout=widgets.Layout(width='100px', left='0'))
drop_style = dict(layout=widgets.Layout(width='260px'))

### <span style='color:blue'>**Mandatory Prepare Step**</span>: Load and Process Treaty Master Index
The following code cell must be executed once for each user session. The code loads the WTI master index (and some related data files), and prepares the data for subsequent use.

The treaty data is processed as follows:
- All the treaty data is loaded.
- The year in which each treaty was signed is extracted as a separate field.
- New fields are added for the specified signed-period divisions.
- Fields 'group1' and 'group2' are ignored (many missing values). Instead, groups are fetched via the party code from the encoding found in the "groups" table.
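
The period fields can be illustrated with a small standalone sketch. It repeats the two period divisions and the logic of `get_period` from the load cell, so that the effect on a few sample years (chosen here only for illustration) is visible without loading any data.

```python
# The two period divisions used in the notebook (default and alternative).
period_divisions = [
    [(1919, 1939), (1940, 1944), (1945, 1955), (1956, 1966), (1967, 1972)],
    [(1919, 1944), (1945, 1955), (1956, 1966), (1967, 1972)],
]

def get_period(division, year):
    """Return the 'start to end' label of the period containing year, else 'other'."""
    for start, end in division:
        if start <= year <= end:
            return '{} to {}'.format(start, end)
    return 'other'

# Sample years, picked only to show each case.
for year in (1938, 1942, 1950, 1980):
    default_label = get_period(period_divisions[0], year)
    alt_label = get_period(period_divisions[1], year)
    print(year, '|', default_label, '|', alt_label)
```

A year outside every period, or a missing year (NaN), falls through all comparisons and is labelled `other`. Such treaties are later excluded by the `signed_period != 'other'` filters.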

In [21]:
# Load and process treaties master index

period_divisions = [
    [ (1919, 1939), (1940, 1944), (1945, 1955), (1956, 1966), (1967, 1972) ],
    [ (1919, 1944), (1945, 1955), (1956, 1966), (1967, 1972) ]
]

parties_of_interest = ['FRANCE', 'GERMU', 'ITALY', 'GERMAN', 'UK', 'GERME', 'GERMW', 'INDIA' ]

class TreatyState:
    
    def __init__(self, data_folder='./data'):
        self.data_folder = data_folder
        self.treaties_skip_columns = [
            'extra_entry', 'dbflag', 'dummy1', 'english', 'french', 'ispartyof4', 'other',
            'regis', 'regisant', 'vol', 'page', 'force', 'group1', 'group2'
        ]
        self.treaties_columns = [
            'sequence',
            'treaty_id',
            'is_cultural_yesno',
            'english',
            'french',
            'other',
            'source',
            'vol',
            'page',
            'signed',
            'force',
            'regis',
            'regisant',
            'party1',
            'group1',
            'party2',
            'group2',
            'laterality',
            'headnote',
            'topic',
            'topic1',
            'topic2',
            'title',
            'extra_entry',
            'dbflag',
            'ispartyof4',
            'dummy1'
        ]
        self.csv_files = [
            ('Treaties_Master_List_Treaties.csv', 'treaties', None),
            ('country_continent.csv', 'country_continent', None),
            ('parties_curated_parties.csv', 'parties', None),
            ('parties_curated_continent.csv', 'continent', None),
            ('parties_curated_group.csv', 'group', None)
        ]
        self.data = self.read_data(data_folder)
        self.treaty_headnote_corpus = None
        
    def read_data(self, data_folder):
        data = {}
        for (filename, key, dtype) in self.csv_files:
            path = os.path.join(self.data_folder, filename)
            data[key] = pd.read_csv(path, sep='\t', low_memory=False)
            print('Imported: {}'.format(filename))
            
        return data
    
    def process_treaties(self):

        def get_period(division, year):
            match = [ p for p in division if p[0] <= year <= p[1]]
            return '{} to {}'.format(match[0][0], match[0][1]) if len(match) > 0 else 'other'
    
        treaties = self.data['treaties']
        treaties.columns = self.treaties_columns
        
        treaties['vol'] = treaties.vol.fillna(0).astype('int', errors='ignore')
        treaties['page'] = treaties.page.fillna(0).astype('int', errors='ignore')
        treaties['signed'] = pd.to_datetime(treaties.signed, errors='coerce')
        treaties['is_cultural_yesno'] = treaties.is_cultural_yesno.astype(str)
        treaties['signed_year'] = treaties.signed.apply(lambda x: x.year)
        treaties['signed_period'] = treaties.signed.apply(lambda x: get_period(period_divisions[0], x.year))
        treaties['signed_period_alt'] = treaties.signed.apply(lambda x: get_period(period_divisions[1], x.year))
        treaties['force'] = pd.to_datetime(treaties.force, errors='coerce')
        treaties['sequence'] = treaties.sequence.astype('int', errors='ignore')
        treaties['group1'] = treaties.group1.fillna(0).astype('int', errors='ignore')
        treaties['group2'] = treaties.group2.fillna(0).astype('int', errors='ignore')
        treaties['is_cultural'] = treaties.is_cultural_yesno.apply(lambda x: x.lower() == 'yes')
        treaties['headnote'] = treaties.headnote.fillna('').astype(str).str.upper()

        # Drop columns not used
        treaties.drop(self.treaties_skip_columns, axis=1, inplace=True)
        treaties = treaties.set_index(['treaty_id'])
        return treaties

    def get_stacked_treaties(self):
        '''
        Returns a bi-directional (duplicated) and processed version of the treaties master list.
        Each treaty has two records where party1 and party2 are reversed:
            Record #1: party=party1, party_other=party2, reversed=False
            Record #2: party=party2, party_other=party1, reversed=True
        Fields are also added for the party's and party_other's country code (2 chars), continent and WTI group.
        The two rows are identical for all other fields.
        '''
        df1 = self.treaties\
                .rename(columns={
                    'party1': 'party',
                    'party2': 'party_other',
                    'group1': 'party_group_no',
                    'group2': 'party_other_group_no'
                })\
                .assign(reversed=False)

        df2 = self.treaties\
                .rename(columns={
                    'party2': 'party',
                    'party1': 'party_other',
                    'group2': 'party_group_no',
                    'group1': 'party_other_group_no'
                })\
                .assign(reversed=True)
        
        # pd.concat replaces DataFrame.append, which was removed in pandas 2.0
        treaties = pd.concat([df1, df2]) #.set_index(['treaty_id'])
        
        # Add fields for party's country, continent and WTI group
        parties = self.parties[['country_code', 'continent_code', 'group_name']]
        
        parties.columns = ['party_country', 'party_continent', 'party_group']
        treaties = treaties.merge(parties, how='left', left_on='party', right_index=True)
        
        # Add fields for party_other's country, continent and WTI group
        parties.columns = ['party_other_country', 'party_other_continent', 'party_other_group']
        treaties = treaties.merge(parties,how='left', left_on='party_other', right_index=True)
        
        # Drop columns
        treaties = treaties.drop(['sequence', 'is_cultural_yesno'], axis=1)
        return treaties
    
    def get_continents(self):
        
        df = self.data['continent'].drop(['Unnamed: 0'], axis=1).set_index('country_code2')
            
        return df
    
    def get_groups(self):
        
        df = self.data['group']\
            .drop(['Unnamed: 0'], axis=1)\
            .rename(columns={'GroupNo': 'group_no','GroupName': 'group_name'})

        df['group_no'] = df.group_no.astype(np.int32)
        df['group_name'] = df.group_name.astype(str)
        
        df = df.set_index('group_no')

        return df
        
    def get_parties(self):
        
        parties = self.data['parties']\
            .drop(['Unnamed: 0'], axis=1)\
            .dropna(subset=['PartyID'])\
            .rename(columns={
                'PartyID': 'party',
                'PartyName': 'party_name',
                'ShortName': 'short_name',
                'GroupNo': 'group_no',
                'reversename': 'reverse_name'
            })\
            .dropna(subset=['party'])\
            .set_index('party')
            
        parties['group_no'] = parties.group_no.astype(np.int32)
        parties['party_name'] = parties.party_name.apply(lambda x: re.sub(r'\(.*\)', '', x))
        parties['short_name'] = parties.short_name.apply(lambda x: re.sub(r'\(.*\)', '', x))

        parties.loc[(parties.group_no==8), ['country', 'country_code', 'country_code3']] = ''

        parties = pd.merge(parties, self.groups, how='left', left_on='group_no', right_index=True)
        parties = pd.merge(parties, self.continents, how='left', left_on='country_code', right_index=True)
        
        return parties
    
    def get_party_name(self, party, party_name_column):
        try:
            if party in self.parties.index:
                return self.parties.loc[party, party_name_column]
            return party
        except Exception:
            print('Warning: {} not in curated parties list'.format(party))
            return party
        
    def process(self):
        
        self.groups = self.get_groups()
        self.continents = self.get_continents()
        self.parties = self.get_parties()
        
        self.treaties = self.process_treaties()
        self.stacked_treaties = self.get_stacked_treaties()
        
        self.cultural_treaties = self.treaties[self.treaties.is_cultural]
        self.cultural_treaties_of_interest = self.cultural_treaties[(self.cultural_treaties.signed_period != 'other')]
        
        print('Number of treaties loaded: {}'.format(len(self.treaties)))
        print('Number of cultural treaties: {} (total), {} within periods'.format(
            len(self.cultural_treaties),
            len(self.cultural_treaties_of_interest)
        ))
        
        self.tagged_headnotes = None
        return self

    def get_headnotes(self):
        return self.treaties.headnote.fillna('').astype(str)
    
    def get_tagged_headnotes(self, tags=None):
        if self.tagged_headnotes is None:
            filename = os.path.join(self.data_folder, 'tagged_headnotes.csv')
            self.tagged_headnotes = pd.read_csv(filename, sep='\t').drop('Unnamed: 0', axis=1)
        if tags is None:
            return self.tagged_headnotes
        return self.tagged_headnotes.loc[(self.tagged_headnotes.pos.isin(tags))]

try:
    state = TreatyState().process()
    # Only report success if loading and processing did not raise
    print("Data loaded!")
except Exception as ex:
    logger.error(ex)


Imported: Treaties_Master_List_Treaties.csv
Imported: country_continent.csv
Imported: parties_curated_parties.csv
Imported: parties_curated_continent.csv
Imported: parties_curated_group.csv
Number of treaties loaded: 61365
Number of cultural treaties: 2231 (total), 1266 within periods
Data loaded!
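
The bi-directional ("stacked") layout described in `get_stacked_treaties` is easiest to see on a toy table. The sketch below uses two invented treaties (not taken from the WTI data) and `pd.concat` to produce one row per treaty and party, so that per-party counts see each treaty from both sides.

```python
import pandas as pd

# Two invented treaties, indexed by treaty_id as in the processed master list.
toy = pd.DataFrame(
    {
        'party1': ['FRANCE', 'UK'],
        'party2': ['ITALY', 'INDIA'],
        'signed_year': [1925, 1950],
    },
    index=pd.Index(['T1', 'T2'], name='treaty_id'),
)

# Record 1: party = party1, party_other = party2.
forward = toy.rename(columns={'party1': 'party', 'party2': 'party_other'}).assign(reversed=False)
# Record 2: the same treaty seen from the other side.
backward = toy.rename(columns={'party2': 'party', 'party1': 'party_other'}).assign(reversed=True)

# Columns are aligned by name, so the differing column order does not matter.
stacked = pd.concat([forward, backward])

# Each treaty now contributes one row to each of its two parties.
per_party = stacked.groupby('party').size()
print(stacked)
print(per_party)
```

Because every treaty appears twice, totals computed on the stacked table count treaty participations, not distinct treaties.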


### Step Sanity Checks: Per field (pair) value counts

In [None]:
# Code

treaty_fields = [
    '', 'is_cultural_yesno', 'source', 'party1', 'party2', 'laterality',
    'headnote', 'topic', 'topic1', 'topic2', 'title', 'signed_year', 'signed_period', 'signed_period_alt', 'is_cultural'
]    

def display_variable_stats(field1, field2, crosstab):
    
    columns = [ x for x in set([field1, field2 ]) if x != '' ]
    
    if len(columns) > 0:
        df = state.treaties.groupby(columns).size().reset_index()\
            .rename(columns={0: 'Count'})\
            .sort_values(['Count'], ascending=False)
            
        if crosstab is True:
            if len(columns) == 2:
                display(pd.crosstab(df[field1], df[field2]))
            else:
                print('Both fields are needed for crosstab')
        else:
            display(df)
        # df.set_index(columns).plot.bar(figsize=(16,8))

sw = BaseWidgetUtility(
    field1=wf.create_select_widget('Field 1:', treaty_fields, default=''),
    field2=wf.create_select_widget('Field 2:', treaty_fields, default=''),
    crosstab=widgets.ToggleButton(
        description='Crosstab',
        button_style='',
        icon='check'
    ),
)

isw = widgets.interactive(display_variable_stats, field1=sw.field1, field2=sw.field2, crosstab=sw.crosstab)

display(widgets.VBox([widgets.HBox([sw.field1, sw.field2, sw.crosstab]), isw.children[-1]]))

isw.update()


### Chart: Treaty Quantities by Selected Parties 
```
DONE GRAPH 1: A graph with the top five countries in terms of how many new cultural treaties they signed, per period.
DONE GRAPH 2: France’s (cultural=yes) treaty totals by period.
```
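The "top parties per period" selection behind GRAPH 1 can be sketched on toy data as follows. This is a simplified variant that sorts and takes the first rows per period, not the notebook's own `get_top_parties`. The function name `top_parties_per_period` and the sample rows are invented for illustration.

```python
import pandas as pd

def top_parties_per_period(treaties, period_col, party_col, n_top=5):
    """Return the n_top parties with the most treaties in each period."""
    counts = (
        treaties.groupby([period_col, party_col])
        .size()
        .rename('count')
        .reset_index()
    )
    # Largest counts first within each period, then keep n_top rows per period.
    ranked = counts.sort_values([period_col, 'count'], ascending=[True, False])
    return ranked.groupby(period_col).head(n_top).reset_index(drop=True)

# Invented rows: one row per treaty and party, as in the stacked table.
toy = pd.DataFrame({
    'signed_period': ['1919 to 1939'] * 6 + ['1945 to 1955'] * 4,
    'party': ['FRANCE', 'FRANCE', 'FRANCE', 'ITALY', 'ITALY', 'UK',
              'UK', 'UK', 'UK', 'INDIA'],
})

top = top_parties_per_period(toy, 'signed_period', 'party', n_top=2)
print(top)
```

Ties at the cut-off are resolved by row order in this sketch, so a party with the same count as the last selected one may be left out.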

In [None]:
# Code
%matplotlib inline
import matplotlib.pyplot as plt
#colors = hsv(np.linspace(0, 1.0, 16))
colors = bokeh.palettes.Category20[20] #plt.get_cmap('jet')(np.linspace(0, 1.0, 16))
def plot_treaties_per_period(data, output_format, plot_style, figsize=(12,6), xlabel='', ylabel=''):

    matplotlib.style.use(plot_style)
    stacked = 'stacked' in output_format
    kind = output_format.split('_')[1]
    ax = data.plot(kind=kind, stacked=stacked, figsize=figsize, color=colors)
    ax.set_ylabel(ylabel)
    ax.set_xlabel(xlabel)

    box = ax.get_position()
    ax.set_position([box.x0, box.y0, box.width * 0.8, box.height])

    # Put a legend to the right of the current axis
    legend = ax.legend(loc='center left', bbox_to_anchor=(1, 0.5))
    legend.get_frame().set_linewidth(0.0)

    for tick in ax.get_xticklabels():
        tick.set_rotation(45)


def bokeh_plot_signed_treaties_per_period(df, pivot_column, key='mean', n_topics=3, year=None, n_words=100):
    
    def generate_category_colors(n_items, palette=bokeh.palettes.Category20[20]):
        ''' Repeat palette to get n_items colors '''
        colors = (((n_items // len(palette)) + 1) * palette)[:n_items]
        return colors
    
    categories = list(df.columns[1:])
    colors = generate_category_colors(n_topics)
    source = ColumnDataSource(df)
    
    p = bp.figure(plot_width=900, plot_height=800, title=state.basename, tools=TOOLS, toolbar_location="right")
    
    p.xaxis[0].axis_label = key.title() + ' weight'
    p.yaxis[0].axis_label = pivot_column.title()
    
    #legend = [ value(x) for x in categories ]
    #p.hbar_stack(categories, y=pivot_column, source=source, color=colors, height=0.5, legend=legend)
        
    bottoms, tops = [], []
    for i, category in enumerate(categories):
        tops = tops + [category]
        cr = p.hbar(y=pivot_column,
                    left=expr(bm.expressions.Stack(fields=bottoms)),
                    right=expr(bm.expressions.Stack(fields=tops)),
                    color=colors[i],
                    height=0.5,
                    source=source,
                    legend='Topic ' + str(category))
        topic_id = int(category)
        tooltip = 'ID {}: {}'.format(topic_id, state.get_topics_tokens_as_text(n_words=200, cache=True).iloc[topic_id])
        p.add_tools(bm.HoverTool(tooltips=tooltip, renderers=[cr]))
        bottoms = bottoms + [category]
            
    return p

def get_top_parties(data, period, party_name, n_top=5):
    xd = data.groupby([period, party_name]).size().rename('TopCount').reset_index()
    top_list = xd.groupby([period]).apply(lambda x: x.nlargest(n_top, 'TopCount'))\
        .reset_index(level=0, drop=True)\
        .set_index([period, party_name])
    return top_list

def display_treaties_per_period(
    period,
    party_name,
    parties_selection,
    only_is_cultural=False,
    normalize_values=False,
    output_format='chart',
    plot_style='classic',
    top_n_parties=5
):
    try:
        data = state.stacked_treaties.copy()

        # if only_within_period_of_interest:
        data = data.loc[(data.signed_period!='other')]

        if only_is_cultural:
            data = data.loc[(data.is_cultural==True)]

        if isinstance(parties_selection, list):
            data = data.loc[(data.party.isin(parties_selection))]

        data = data.merge(state.parties, how='left', left_on='party', right_index=True)

        n_top_list = get_top_parties(data, period, party_name, n_top=top_n_parties)

        data = data.groupby([period, party_name])\
                .size()\
                .reset_index()\
                .rename(columns={ period: 'Period', party_name: 'Party', 0: 'Count' })

        if isinstance(parties_selection, str) and parties_selection.startswith('only_top_'):
            join = 'inner' #if parties_selection == 'only_top_parties' else 'left'
            data = data.merge(n_top_list, how=join, left_on=['Period', 'Party'], right_index=True)

        pivot = pd.pivot_table(data, index=['Period'], values=["Count"], columns=['Party'], fill_value=0)
        pivot.columns = [ x[-1] for x in pivot.columns ]

        if normalize_values is True:
            pivot = pivot.div(0.01 * pivot.sum(1), axis=0)

        if output_format.startswith('plot'):

            label = 'Number of treaties' if not normalize_values else 'Share%'

            ylabel = label if 'barh' not in output_format else ''
            xlabel = label if 'barh' in output_format else ''

            height = 10 if 'barh' in output_format and period == 'signed_year' else 6
            plot_treaties_per_period(pivot, output_format, plot_style, figsize=(16, height), xlabel=xlabel, ylabel=ylabel)

        elif output_format == 'table':
            display(data)
            # display(HTML(data.to_html()))
        else:
            display(pivot)
            
    except Exception as ex:
        logger.error(ex)

tw = BaseWidgetUtility(
    period=widgets.Dropdown(
        options={
            'Year': 'signed_year',
            'Default division': 'signed_period',
            'Alt. division': 'signed_period_alt'
        },
        value='signed_period',
        description='Period:',
        layout=widgets.Layout(width='200px')
    ),
    party_name=widgets.Dropdown(
        options={
            'WTI Code': 'party',
            'WTI Name': 'party_name',
            'WTI Short': 'short_name',
            'Country': 'party_country'
        },
        value='party',
        description='Name:',
        layout=widgets.Layout(width='200px')
    ),
    parties_selection=widgets.Dropdown(
        options={
            'Only top parties': 'only_top_parties',
            'Only the five parties': parties_of_interest,
            'France': ['FRANCE'],
            'UK': ['UK'],
            'Germany': [ 'GERMU', 'GERMAN', 'GERME', 'GERMW' ],
            'India': [ 'INDIA' ]
            # 'Only top + others': 'only_top_and_others',
            # 'Only five + others': 'only_five_and_others',
            # 'All (long list)': 'all_parties'
        },
        value='only_top_parties',
        description='Parties:',
        layout=widgets.Layout(width='200px')
    ),

    only_is_cultural=widgets.ToggleButton(
        description='Only Cultural', value=True, **toggle_style
    ),
    normalize_values=widgets.ToggleButton(
        description='Share%', **toggle_style
    ),
    output_format=widgets.Dropdown(
        description='Output', options=output_formats, layout=widgets.Layout(width='200px')
    ),
    plot_style=widgets.Dropdown(
        options=matplotlib_plot_styles, value='seaborn-pastel',
        description='Style:', layout=widgets.Layout(width='200px')
    ),
    top_n_parties=widgets.IntSlider(
        value=3, min=1, max=10, step=1,
        description='Top #:',
        continuous_update=True,
        layout=widgets.Layout(width='220px')
    )
)

itw = widgets.interactive(
    display_treaties_per_period,
    period=tw.period,
    party_name=tw.party_name,
    parties_selection=tw.parties_selection,
    only_is_cultural=tw.only_is_cultural,
    normalize_values=tw.normalize_values,
    output_format=tw.output_format,
    plot_style=tw.plot_style,
    top_n_parties=tw.top_n_parties
)

first_column_box = widgets.VBox([tw.period, tw.party_name,])
second_column_box = widgets.VBox([ tw.parties_selection, tw.top_n_parties ])
third_column_box = widgets.VBox([ tw.only_is_cultural, tw.normalize_values])
fourth_column_box = widgets.VBox([ tw.output_format, tw.plot_style ])
boxes = widgets.HBox([first_column_box, second_column_box, third_column_box, fourth_column_box ])
display(widgets.VBox([boxes, itw.children[-1]]))
itw.update()



###  Chart: Treaty Quantities by Selected Topics
TODO: Currently only topic1 is used.
```
DONE GRAPH 5: 7CULT, 7SCI, and 7EDUC over time (by year)
DONE GRAPH 6: 7CULT, 7SCI, and 7EDUC+4EDUC (integrated into one variable) over time (by year)
DONE GRAPH 7: 7CULT, 7SCI, and 7EDUC+4EDUC over time (by period)
```
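The integration of 7EDUC and 4EDUC into one variable (GRAPH 6 and 7) amounts to recoding topics through a lookup table before counting. The sketch below shows the idea in plain Python on invented topic values. The helper `count_categories` and the topic code `1ALLY` are hypothetical and only serve the example.

```python
from collections import Counter

# Same mapping as the '7EDUC+4EDUC' entry in category_maps.
category_map = {
    '7CULT': '7CULT',
    '7SCIEN': '7SCIEN',
    '7EDUC': '7EDUC+4EDUC',
    '4EDUC': '7EDUC+4EDUC',
}

def count_categories(topics, category_map, include_other=False):
    """Count topics per mapped category; unmapped topics are dropped or become 'OTHER'."""
    counts = Counter()
    for topic in topics:
        category = category_map.get(topic)
        if category is None:
            if not include_other:
                continue
            category = 'OTHER'
        counts[category] += 1
    return counts

# Invented topic1 values for a single period.
topics = ['7CULT', '7EDUC', '4EDUC', '7SCIEN', '4EDUC', '1ALLY']

merged = count_categories(topics, category_map)
with_other = count_categories(topics, category_map, include_other=True)

# Share in percent within the period, the same idea as the normalize option.
total = sum(merged.values())
shares = {category: 100.0 * count / total for category, count in merged.items()}
print(merged, with_other, shares)
```

In the notebook the same recoding is done per treaty with `category_map.get(x, 'OTHER')`, and the shares are computed per period on the pivot table.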

In [None]:
# Code
%matplotlib inline
category_maps = {
    '7CULT, 7SCIEN, and 7EDUC': {
        '7CULT': '7CULT',
        '7SCIEN': '7SCIEN',
        '7EDUC': '7EDUC'
    },
    '7CULT, 7SCI, and 7EDUC+4EDUC': {
        '7CULT': '7CULT',
        '7SCIEN': '7SCIEN',
        '7EDUC': '7EDUC+4EDUC',
        '4EDUC': '7EDUC+4EDUC'
    }
}

def plot_display_quantity_of_topics(pivot, kind, stacked, xlabel='', ylabel='', plot_style='classic', figsize=(12,10)):
   
    matplotlib.style.use(plot_style)
    ax = pivot.plot(kind=kind, stacked=stacked, figsize=figsize)

    ax.set_ylabel(ylabel)
    ax.set_xlabel(xlabel)
    # legend = ax.legend(loc='upper center', bbox_to_anchor=(0.5, 1.1), ncol=4)
    legend = ax.legend(loc='center left', bbox_to_anchor=(1, 0.5))
    legend.get_frame().set_linewidth(0.0)
    
    for tick in ax.get_xticklabels():
        tick.set_rotation(45)    
    
def display_quantity_of_topics(
    period,
    category_map_name,
    recode_7cult=False,
    normalize_values=False,
    include_other=False,
    output_format='chart',
    plot_style='classic'
):
    global X
    try:
        data = state.treaties.copy()

        category_map = category_maps[category_map_name]

        if not include_other:
            data = data.loc[(data.topic1.isin(category_map.keys()))]

        data = data.loc[(data.signed_period!='other')]

        if recode_7cult:
            data.loc[(data.is_cultural==True), 'topic1'] = '7CULT'

        data['category'] = data.topic1.apply(lambda x: category_map.get(x, 'OTHER'))

        data = data\
                .groupby([period, 'category'])\
                .size()\
                .reset_index()\
                .rename(columns={ period: 'Period', 'category': 'Category', 0: 'Count' })

        pivot = pd.pivot_table(data, index=['Period'], values=["Count"], columns=['Category'], fill_value=0)
        pivot.columns = [ x[-1] for x in pivot.columns ]

        if normalize_values is True:
            pivot = pivot.div(0.01 * pivot.sum(1), axis=0)

        if output_format.startswith('plot'):

            label = 'Number of treaties' if not normalize_values else 'Share%'

            ylabel = label if 'barh' not in output_format else ''
            xlabel = label if 'barh' in output_format else ''

            stacked = 'stacked' in output_format
            kind = output_format.split('_')[1]
            height = 10 if 'barh' in output_format and period == 'signed_year' else 6

            plot_display_quantity_of_topics(
                pivot, kind=kind, stacked=stacked, xlabel=xlabel, ylabel=ylabel, plot_style=plot_style, figsize=(16,height)
            )

        elif output_format == 'chart':
            print('bokeh plot not implemented')
            data.plot.line(figsize=(12,8))
        elif output_format == 'table':
            #display(data)
            display(HTML(data.to_html()))
        else:
            display(pivot)
    except Exception as ex:
        logger.error(ex)
        
tw = BaseWidgetUtility(
    period=widgets.Dropdown(
        options={
            'Year': 'signed_year',
            'Default division': 'signed_period',
            'Alt. division': 'signed_period_alt'
        },
        value='signed_period',
        description='Period:', **drop_style
    ),
    category_map_name=widgets.Dropdown(
        options=list(category_maps.keys()),
        description='Category:', **drop_style
    ),
    recode_7cult=widgets.ToggleButton(
        description='Recode 7CULT',
        tooltip='Treat all treaties with cultural=yes as 7CULT',
        value=False, **toggle_style
    ),
    normalize_values=widgets.ToggleButton(
        description='Normalize%',
        tooltip='Display shares per category instead of count', **toggle_style
    ),
    include_other=widgets.ToggleButton(
        description='+Other', value=False,  **toggle_style
    ),
    output_format=widgets.Dropdown(
        description='Output',
        options=output_formats, **drop_style
    ),
    plot_style=widgets.Dropdown(
        options=matplotlib_plot_styles,
        value='seaborn-pastel',
        description='Style:', **drop_style
    ),
)

itw = widgets.interactive(
    display_quantity_of_topics,
    period=tw.period,
    category_map_name=tw.category_map_name,
    recode_7cult=tw.recode_7cult,
    normalize_values=tw.normalize_values,
    include_other=tw.include_other,
    output_format=tw.output_format,
    plot_style=tw.plot_style
)

boxes = widgets.HBox(
    [
        widgets.VBox([ tw.period, tw.category_map_name]),
        widgets.VBox([ tw.recode_7cult, tw.normalize_values]),
        widgets.VBox([ tw.include_other]),
        widgets.VBox([ tw.output_format, tw.plot_style ])
    ]
)
display(widgets.VBox([boxes, itw.children[-1]]))
itw.update()



### Task: Headnote Word and Co-occurrence Toplist

In [None]:
# Code
from nltk.stem import WordNetLemmatizer

toggle_style = dict(icon='', layout=widgets.Layout(width='140px', left='0'))

class HeadnoteTokenServiceOLD():

    def __init__(self, tokenizer, stopwords=None, lemmatizer=None, min_word_size=2):
        
        self.transforms = [
            tokenizer,
            lambda ws: ( x for x in ws if len(x) >= min_word_size ),
            lambda ws: ( x for x in ws if any(ch.isalpha() for ch in x)) 
        ]
        
        if stopwords is not None:
            self.transforms += [ lambda ws: ( x for x in ws if x not in stopwords ) ]
            
        if lemmatizer is not None:
            self.transforms += [ lambda ws: ( lemmatizer(x) for x in ws ) ]

    def _apply_transforms(self, ws):
        for f in self.transforms:
            ws = f(ws)
        return list(ws)
    
    def parse_headnotes(self, treaties):
        
        headnotes = treaties['headnote']
        
        texts = [ x.lower() for x in list(headnotes) ]
        tokens = list(map(self._apply_transforms, texts))
        df = pd.DataFrame({'headnote': headnotes, 'tokens': tokens })
        
        return df
    
    def compute_stacked(self, treaties):
        
        df = self.parse_headnotes(treaties)
        
        df_stacked = pd.DataFrame(df.tokens.tolist(), index=df.index).stack()\
            .reset_index().rename(columns={'level_1': 'sequence_id', 0: 'token'})
            
        return df_stacked
    
    def compute_co_occurrence(self, treaties, pos_tags, only_cultural_treaties=False):

        # Filter out tags based on treaties of interest (keeps all merged columns)
        df_pos_tags = pos_tags.merge(treaties, how='inner', left_on='treaty_id', right_index=True)
        
        if only_cultural_treaties:
            df_pos_tags = df_pos_tags[(df_pos_tags.is_cultural.str.contains('yes',na=False))]

        # Self join of words within same treaty
        df_co_occurrence = pd.merge(df_pos_tags, df_pos_tags, how='inner', left_on='treaty_id', right_on='treaty_id')
        # Only consider a specific pair once
        df_co_occurrence = df_co_occurrence[(df_co_occurrence.wid_x < df_co_occurrence.wid_y)]
        # Reduce number of returned columns
        df_co_occurrence = df_co_occurrence[['treaty_id', 'year_x', 'is_cultural_x', 'lemma_x', 'lemma_y' ]]
        # Rename columns
        df_co_occurrence.columns = ['treaty_id', 'year', 'is_cultural', 'lemma_x', 'lemma_y' ]

        # Sort token pair so smallest always comes first
        lemma_x = df_co_occurrence[['lemma_x', 'lemma_y']].min(axis=1)
        lemma_y = df_co_occurrence[['lemma_x', 'lemma_y']].max(axis=1)
        df_co_occurrence['lemma_x'] = lemma_x
        df_co_occurrence['lemma_y'] = lemma_y

        return df_co_occurrence

class HeadnoteTokenCorpus():

    def __init__(self, treaties, tokenize=None, stopwords=None, lemmatize=None, min_size=2):
        
        tokenize = tokenize or nltk.tokenize.word_tokenize
        lemmatize = lemmatize or WordNetLemmatizer().lemmatize
        stopwords = stopwords or nltk.corpus.stopwords.words('english')
        
        self.transforms = [
            tokenize,
            lambda ws: ( x for x in ws if len(x) >= min_size ),
            lambda ws: ( x for x in ws if any(ch.isalpha() for ch in x)),
            lambda ws: list(set(ws)) 
        ]
        
        #if stopwords is not None:
        #    self.transforms += [ lambda ws: ( x for x in ws if x not in stopwords ) ]
            
        #if lemmatizer is not None:
        #    self.transforms += [ lambda ws: ( lemmatizer(x) for x in ws ) ]
        
        treaty_tokens = self._compute_stacked(treaties)
        vocabulary = treaty_tokens.token.unique()
        lemmas = list(map(lemmatize, vocabulary))
        lemma_map = { w: l for (w, l) in zip(*(vocabulary, lemmas)) if w != l }
        stopwords_map = { s : True for s in stopwords }
        treaty_tokens['lemma'] = treaty_tokens.token.apply(lambda x: lemma_map.get(x, x))
        treaty_tokens['is_stopword'] = treaty_tokens.token.apply(lambda x: stopwords_map.get(x, False))

        self.treaty_tokens = treaty_tokens.set_index(['treaty_id', 'sequence_id'])
        
    def _apply_transforms(self, ws):
        for f in self.transforms:
            ws = f(ws)
        return list(ws)
    
    def _parse_headnotes(self, treaties):
        
        headnotes = treaties['headnote']
        
        texts = [ x.lower() for x in list(headnotes) ]
        tokens = list(map(self._apply_transforms, texts))
        df = pd.DataFrame({'headnote': headnotes, 'tokens': tokens })
        
        return df
    
    def _compute_stacked(self, treaties):
        
        df = self._parse_headnotes(treaties)
        
        df_stacked = pd.DataFrame(df.tokens.tolist(), index=df.index).stack()\
            .reset_index().rename(columns={'level_1': 'sequence_id', 0: 'token'})
            
        return df_stacked
    
def compute_co_occurrance(treaties):
    
    treaty_tokens = state.treaty_headnote_corpus.treaty_tokens
    i1 = treaties.index
    # i2 = treaty_tokens.reset_index().set_index('treaty_id').index
    i2 = treaty_tokens.index.get_level_values(0)
    treaty_tokens = treaty_tokens[i2.isin(i1)]
    
    treaty_tokens = treaty_tokens.loc[treaty_tokens.is_stopword==False]
    treaty_tokens = treaty_tokens.reset_index().drop(['is_stopword', 'sequence_id'], axis=1).set_index('treaty_id')

    co_occurrance = treaty_tokens.merge(treaty_tokens, how='inner', left_index=True, right_index=True)
    # Copy the filtered frame so the column assignments below do not write to a view
    co_occurrance = co_occurrance.loc[(co_occurrance['token_x'] < co_occurrance['token_y'])].copy()
    #co_occurrance['token'] = co_occurrance.apply(lambda row: row[groupby_pair[0]] + ' - ' + row[groupby_pair[1]], axis=1)
    co_occurrance['token'] = co_occurrance.apply(lambda row: ' - '.join([row['token_x'].upper(), row['token_y'].upper()]), axis=1)
    co_occurrance['lemma'] = co_occurrance.apply(lambda row: ' - '.join([row['lemma_x'].upper(), row['lemma_y'].upper()]), axis=1)
    co_occurrance = co_occurrance.assign(is_stopword=False, sequence_id=0)[['sequence_id', 'token', 'lemma', 'is_stopword']]
    
    return co_occurrance

def create_bigram_transformer(documents):
    import gensim.models.phrases
    bigram = gensim.models.phrases.Phrases(map(nltk.tokenize.word_tokenize, documents))
    return lambda ws: bigram[ws]

def remove_snake_case(snake_str):
    return ' '.join(x.title() for x in snake_str.split('_'))

def get_top_partiesssss(data, period, party_name, n_top=5):
    xd = data.groupby([period, party_name]).size().rename('TopCount').reset_index()
    top_list = xd.groupby([period]).apply(lambda x: x.nlargest(n_top, 'TopCount'))\
        .reset_index(level=0, drop=True)\
        .set_index([period, party_name])
    return top_list

result=None
def display_headnote_toplist(
    period=None,
    parties=None,
    extra_groupbys=None,
    only_is_cultural=True,
    use_lemma=False,
    compute_co_occurance=False,
    remove_stopwords=True,
    min_word_size=2,
    n_min_count=1,
    output_format='table',
    n_top=50
    # plot_style=tw.plot_style
):
    global ihnw, result
    
    try:
        hnw.progress.value = 1    
        treaties = state.treaties.loc[state.treaties.signed_period != 'other']

        if state.treaty_headnote_corpus is None:
            print('Preparing headnote corpus for first time use')
            state.treaty_headnote_corpus = HeadnoteTokenCorpus(treaties=treaties)

        if only_is_cultural:
            treaties = treaties.loc[(treaties.is_cultural)]

        if parties is not None:
            ids = state.stacked_treaties.loc[(state.stacked_treaties.party.isin(parties))].index
            # ids may contain duplicates and treaties that were filtered out above
            treaties = treaties.loc[treaties.index.isin(ids)]

        hnw.progress.value += 1

        if compute_co_occurance:

            treaty_tokens = compute_co_occurrance(treaties)

        else:

            treaty_tokens = state.treaty_headnote_corpus.treaty_tokens

            if remove_stopwords is True:
                treaty_tokens = treaty_tokens.loc[treaty_tokens.is_stopword==False]

            treaty_tokens = treaty_tokens.reset_index().set_index('treaty_id')

        hnw.progress.value += 1

        treaty_tokens = treaty_tokens\
            .merge(treaties, how='inner', left_index=True, right_index=True)\
            .drop(['sequence', 'is_cultural_yesno', 'source', 'signed', 'headnote', 'is_cultural',
                   'topic1', 'topic2', 'title'], axis=1)

        hnw.progress.value += 1

        token_or_lemma = 'token' if not use_lemma else 'lemma'

        groupbys  = []
        groupbys += [ period ] if not period is None else []
        groupbys += (extra_groupbys or [])
        groupbys += [ token_or_lemma ]

        result = treaty_tokens.groupby(groupbys).size().reset_index().rename(columns={0: 'Count'})

        hnw.progress.value += 1

        # Keep only the n_top most frequent words within each group (all groupbys except the word itself)
        group_keys = groupbys[:-1]
        if len(group_keys) > 0:
            result = result.groupby(group_keys, group_keys=False).apply(lambda x: x.nlargest(n_top, 'Count'))
        else:
            result = result.nlargest(n_top, 'Count')

        if min_word_size > 0:
            result = result.loc[result[token_or_lemma].str.len() >= min_word_size]

        if n_min_count > 1:
            result = result.loc[result.Count >= n_min_count]

        hnw.progress.value += 1

        result = result.sort_values(groupbys[:-1] + ['Count'], ascending=len(groupbys[:-1])*[True] + [False])

        hnw.progress.value += 1

        if output_format == 'table':
            result.columns = [ remove_snake_case(x) for x in result.columns ]
            display(result)
            # display(HTML(result.to_html()))
        elif output_format == 'unstack':
            result = result.set_index(groupbys).unstack(level=0).fillna(0).astype('int32')
            result.columns = [ x[1] for x in result.columns ]
            display(result)
        elif output_format == 'unstack_plot':
            result = result.set_index(list(reversed(groupbys))).unstack(level=0).fillna(0).astype('int32')
            result.columns = [ x[1] for x in result.columns ]
            result.plot(kind='bar', figsize=(16,8))

    except Exception as ex:
        logger.error(ex)
        
    hnw.progress.value += 1
    hnw.progress.value = 0

hnw = BaseWidgetUtility(
    period=widgets.Dropdown(
        options={
            '': None,
            'Year': 'signed_year',
            'Default division': 'signed_period',
            'Alt. division': 'signed_period_alt'
        },
        value='signed_period',
        description='Period:', **drop_style
    ),
    parties=widgets.Dropdown(
        options={
            '(all)': None,
            'PartyOf5': parties_of_interest,
            'France': [ 'FRANCE' ],
            'Italy': [ 'ITALY' ],
            'UK': [ 'UK' ],
            'India': [ 'INDIA' ],
            'Germany': [ 'GERMU', 'GERMAN', 'GERME', 'GERMW' ]
        },
        value=None,
        description='Parties:', **drop_style
    ),
    use_lemma=widgets.ToggleButton(
        description='Use lemma', value=False,
        tooltip='Use WordNet lemma', **toggle_style
    ),
    remove_stopwords=widgets.ToggleButton(
        description='Remove stopwords', value=True,
        tooltip='Do not include stopwords', **toggle_style
    ),
    extra_groupbys=widgets.Dropdown(
        options={
            '': None,
            'Topic': [ 'Topic' ],
        },
        value=None,
        description='Groupbys:', **drop_style
    ),
    min_word_size=widgets.BoundedIntText(
        value=2, min=0, max=5, step=1,
        description='Min word:', layout=widgets.Layout(width='140px')
    ),
    only_is_cultural=widgets.ToggleButton(
        description='Only Cultural', value=True,
        tooltip='Display only "is_cultural" treaties', **toggle_style
    ),
    compute_co_occurance=widgets.ToggleButton(
        description='Cooccurrence', value=True,
        tooltip='Compute Cooccurrence', **toggle_style
    ),
    output_format=widgets.Dropdown(
        description='Output', value='table',
        options={
            'Table': 'table',
            'Unstack': 'unstack',
            'Unstack plot': 'unstack_plot'
        }, **drop_style
    ),
    plot_style=widgets.Dropdown(
        options=matplotlib_plot_styles,
        value='seaborn-pastel',
        description='Style:', **drop_style
    ),
    n_top=widgets.IntSlider(
        value=25, min=2, max=100, step=10,
        description='Top/grp #:', # continuous_update=False,
    ),
    n_min_count=widgets.IntSlider(
        value=2, min=1, max=10, step=1,
        tooltip='Filter out words with count less than specified value',
        description='Min count:', # continuous_update=False,
    ),
    progress=wf.create_int_progress_widget(min=0, max=10, step=1, value=0, layout=widgets.Layout(width='100%')),
)

ihnw = widgets.interactive(
    display_headnote_toplist,
    period=hnw.period,
    parties=hnw.parties,
    extra_groupbys=hnw.extra_groupbys,
    only_is_cultural=hnw.only_is_cultural,
    n_min_count=hnw.n_min_count,
    n_top=hnw.n_top,
    min_word_size=hnw.min_word_size,
    use_lemma=hnw.use_lemma,
    compute_co_occurance=hnw.compute_co_occurance,
    remove_stopwords=hnw.remove_stopwords,
    output_format=hnw.output_format,
    # plot_style=tw.plot_style
)

boxes = widgets.HBox(
    [
        widgets.VBox([ hnw.period, hnw.parties, hnw.min_word_size ]),
        widgets.VBox([ hnw.extra_groupbys, hnw.n_top, hnw.n_min_count]),
        widgets.VBox([ hnw.only_is_cultural, hnw.use_lemma, hnw.remove_stopwords, hnw.compute_co_occurance]),
        widgets.VBox([ hnw.output_format, hnw.progress ])
    ]
)
display(widgets.VBox([boxes, ihnw.children[-1]]))
ihnw.update()


###  <span style='color:blue'>**Mandatory Step**</span>: Prepare Treaty Text Corpora

This code cell is a mandatory step for the subsequent text corpus statistics.

It reads the treaty texts for each language from the selected compressed archive (ZIP file) and stores them in the efficient Matrix Market (MM) corpus format. A corpus is only written if it has not been stored before, or if "Force Update" is selected. Note that an update MUST be forced whenever the treaty archive is updated; otherwise the text in the new archive is ignored.
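
The store-or-skip rule can be sketched on its own, without gensim. This is a minimal illustration, not the cell's implementation: the helper names `corpus_files` and `needs_update` are hypothetical, while the three file names follow the pattern used by `TreatyCorpusSaveLoad` in the code cell.

```python
import os
import tempfile

def corpus_files(source_folder, lang):
    """Return the three cache files that together make up one stored corpus."""
    return [
        os.path.join(source_folder, 'corpus_{}.mm'.format(lang)),
        os.path.join(source_folder, 'corpus_{}.dict.gz'.format(lang)),
        os.path.join(source_folder, 'corpus_{}_documents.csv'.format(lang)),
    ]

def needs_update(source_folder, lang, force=False):
    """A corpus is rebuilt when forced, or when any of its cache files is missing."""
    return force or not all(os.path.isfile(f) for f in corpus_files(source_folder, lang))

with tempfile.TemporaryDirectory() as folder:
    print(needs_update(folder, 'en'))              # nothing stored yet
    for filename in corpus_files(folder, 'en'):
        open(filename, 'w').close()                # simulate a stored corpus
    print(needs_update(folder, 'en'))              # cache files are present
    print(needs_update(folder, 'en', force=True))  # forced update
```

Since the check only looks at whether the files exist, a newer ZIP archive is not detected automatically, which is why the update has to be forced by hand.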

In [None]:
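# Code
# Imports used by this cell (harmless if already imported earlier in the notebook)
import fnmatch
import glob
import os
import zipfile

import gensim
from gensim.corpora.textcorpus import TextCorpus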

# Return a new list sorted by key f (the previous version sorted a throwaway copy)
sort_chained = lambda x, f: sorted(x, key=f)
    
def ls_sorted(path):
    return sort_chained(list(filter(os.path.isfile, glob.glob(path))), os.path.getmtime)
       
class CompressedFileReader(object):

    def __init__(self, archive_pattern, filename_pattern='*.txt'):
        self.archive_pattern = archive_pattern
        self.filename_pattern = filename_pattern

    def __iter__(self):

        for zip_path in glob.glob(self.archive_pattern):
            with zipfile.ZipFile(zip_path) as zip_file:
                filenames = [ name for name in zip_file.namelist() if fnmatch.fnmatch(name, self.filename_pattern) ]
                for filename in filenames:
                    try:
                        # ZipFile.open only accepts mode 'r' or 'w' in Python 3
                        with zip_file.open(filename, 'r') as text_file:
                            content = text_file.read()
                            content = gensim.utils.to_unicode(content, 'utf8', errors='ignore')
                            content = content.replace('-\r\n', '').replace('-\n', '')
                            yield os.path.basename(filename), content
                    except Exception:
                        print('Failed to read: {}'.format(filename))
                        raise
                        
class TreatyCorpus(TextCorpus):

    def __init__(self, content_iterator, dictionary=None, metadata=False, character_filters=None,
                 tokenizer=None, token_filters=None, bigram_transform=False
    ):
        self.content_iterator = content_iterator
        
        token_filters = [
           (lambda tokens: [ x.lower() for x in tokens ]),
           (lambda tokens: [ x for x in tokens if any(map(lambda x: x.isalpha(), x)) ])
        ] + (token_filters or [])
        
        #if bigram_transform is True:
        #    train_corpus = TreatyCorpus(content_iterator, token_filters=[ x.lower() for x in tokens ])
        #    phrases = gensim.models.phrases.Phrases(train_corpus)
        #    bigram = gensim.models.phrases.Phraser(phrases)
        #    token_filters.append(
        #        lambda tokens: bigram[tokens]
        #    )           
        
        super(TreatyCorpus, self).__init__(
            input=True,
            dictionary=dictionary,
            metadata=metadata,
            character_filters=character_filters,
            tokenizer=tokenizer,
            token_filters=token_filters
        )
        
    def getstream(self):
        """Generate documents from the underlying plain text collection (of one or more files).
        Yields
        ------
        str
            Document read from plain-text file.
        Notes
        -----
        After generator end - initialize self.length attribute.
        """
        filenames = []
        num_texts = 0
        for filename, content in self.content_iterator:
            yield content
            filenames.append(filename)
            num_texts += 1
        self.length = num_texts
        self.filenames = filenames
        self.document_names = self._compile_document_names()
                 
    def get_texts(self):
        '''
        This is mandatory method from gensim.corpora.TextCorpus. Returns stream of documents.
        '''
        for document in self.getstream():
            yield self.preprocess_text(document)
            
    def preprocess_text(self, text):
            """Apply `self.character_filters`, `self.tokenizer`, `self.token_filters` to a single text document.
            Parameters
            ---------
            text : str
                Document read from plain-text file.
            Return
            ------
            list of str
                List of tokens extracted from `text`.
            """
            for character_filter in self.character_filters:
                text = character_filter(text)

            tokens = self.tokenizer(text)
            for token_filter in self.token_filters:
                tokens = token_filter(tokens)

            return tokens
        
    def _compile_document_names(self):
        
        document_names = pd.DataFrame(dict(
            document_name=self.filenames,
            treaty_id=[ x.split('_')[0] for x in self.filenames ]
        )).reset_index().rename(columns={'index': 'document_id'})
        
        document_names = document_names.set_index('document_id')   
        dupes = document_names.groupby('treaty_id').size().loc[lambda x: x > 1]
        
        if len(dupes) > 0:
            logger.critical('Warning! Duplicate treaties found in corpus: {}'.format(' '.join(list(dupes.index))))
            
        return document_names

class MmCorpusStatisticsService():
    
    def __init__(self, corpus, dictionary, language):
        self.corpus = corpus
        self.dictionary = dictionary
        self.stopwords = nltk.corpus.stopwords.words(language[1])
        _ = dictionary[0]
        
    def get_total_token_frequencies(self):
        dictionary = self.dictionary
        frequencies = np.zeros(len(dictionary.id2token))
        # Sum the per-document counts of each token id over the whole corpus
        for document in self.corpus:
            for i, f in document:
                frequencies[i] += f
        return frequencies

    def get_document_token_frequencies(self):
        '''
        Returns a DataFrame with per document token frequencies i.e. "melts" doc-term matrix
        '''
        data = ((document_id, x[0], x[1]) for document_id, values in enumerate(self.corpus) for x in values )
        # One row per (document, token) pair
        df = pd.DataFrame(list(data), columns=['document_id', 'token_id', 'count'])
        df = df.merge(self.corpus.document_names, left_on='document_id', right_index=True)

        return df

    def compute_word_frequencies(self, remove_stopwords):
        id2token = self.dictionary.id2token
        term_frequencies = np.zeros(len(id2token))
        for document in self.corpus:
            for i, f in document:
                term_frequencies[i] += f
        stopwords = set(self.stopwords).intersection(set(id2token.values()))
        # Look up every column by token id so that the columns stay aligned
        token_ids = sorted(id2token.keys())
        df = pd.DataFrame({
            'token_id': token_ids,
            'token': [ id2token[i] for i in token_ids ],
            'frequency': term_frequencies[token_ids].astype(np.int64),
            'dfs': [ self.dictionary.dfs.get(i, 0) for i in token_ids ]
        })
        df['is_stopword'] = df.token.apply(lambda x: x in stopwords)
        if remove_stopwords is True:
            df = df.loc[(df.is_stopword==False)]
        df = df[['token_id', 'token', 'frequency', 'dfs', 'is_stopword']].sort_values('frequency', ascending=False)
        return df.set_index('token_id')

    def compute_document_stats(self):
        id2token = self.dictionary.id2token
        stopwords = set(self.stopwords).intersection(set(id2token.values()))
        df = pd.DataFrame({
            # corpus.index holds MmCorpus byte offsets, so take the ids from document_names instead
            'document_id': self.corpus.document_names.index,
            'document_name': self.corpus.document_names.document_name,
            'treaty_id': self.corpus.document_names.treaty_id,
            'size': [ sum(f for (_, f) in document) for document in self.corpus ],
            'stopwords': [ sum(f for (i, f) in document if id2token[i] in stopwords) for document in self.corpus ],
        }).set_index('document_name')
        df[['size', 'stopwords']] = df[['size', 'stopwords']].astype('int')
        return df

    def compute_word_stats(self):
        df = self.compute_document_stats()[['size', 'stopwords']]
        df_agg = df.agg(['count', 'mean', 'std', 'min', 'median', 'max', 'sum']).reset_index()
        legend_map = {
            'count': 'Documents',
            'mean': 'Mean words',
            'std': 'Std',
            'min': 'Min',
            'median': 'Median',
            'max': 'Max',
            'sum': 'Sum words'
        }
        df_agg['index'] = df_agg['index'].apply(lambda x: legend_map[x]).astype('str')
        df_agg = df_agg.set_index('index')
        df_agg[df_agg.columns] = df_agg[df_agg.columns].astype('int')
        return df_agg.reset_index()
    
#@staticmethod

class ExtMmCorpus(gensim.corpora.MmCorpus):
    """Extension of MmCorpus that allow TF normalization based on document length.
    """

    @staticmethod
    def norm_tf_by_D(doc):
        D = sum([x[1] for x in doc])
        # Return a list (not a one-shot map iterator) so the document can be read more than once
        return doc if D == 0 else [ (i, tf / D) for (i, tf) in doc ]

    def __init__(self, fname):
        gensim.corpora.MmCorpus.__init__(self, fname)
        
    def __iter__(self):
        for doc in gensim.corpora.MmCorpus.__iter__(self):
            yield self.norm_tf_by_D(doc)

    def __getitem__(self, docno):
        return self.norm_tf_by_D(gensim.corpora.MmCorpus.__getitem__(self, docno))

class TreatyCorpusSaveLoad():

    def __init__(self, source_folder, lang):
        
        self.mm_filename = os.path.join(source_folder, 'corpus_{}.mm'.format(lang))
        self.dict_filename = os.path.join(source_folder, 'corpus_{}.dict.gz'.format(lang))
        self.document_index = os.path.join(source_folder, 'corpus_{}_documents.csv'.format(lang))
        
    def store_as_mm_corpus(self, treaty_corpus):
        
        gensim.corpora.MmCorpus.serialize(self.mm_filename, treaty_corpus, id2word=treaty_corpus.dictionary.id2token)
        treaty_corpus.dictionary.save(self.dict_filename)
        treaty_corpus.document_names.to_csv(self.document_index, sep='\t')

    def load_mm_corpus(self, normalize_by_D=False):
    
        corpus_type = ExtMmCorpus if normalize_by_D else gensim.corpora.MmCorpus
        corpus = corpus_type(self.mm_filename)
        corpus.dictionary = gensim.corpora.Dictionary.load(self.dict_filename)
        corpus.document_names = pd.read_csv(self.document_index, sep='\t').set_index('document_id')  

        return corpus
    
    def exists(self):
        return os.path.isfile(self.mm_filename) and \
            os.path.isfile(self.dict_filename) and \
            os.path.isfile(self.document_index)

def store_mm_corpora(source_path, force, languages):
    
    try:
        print('Current archive:{}'.format(source_path))
        tokenizer = nltk.tokenize.word_tokenize
        source_folder = os.path.split(source_path)[0]
        for language in languages.split(','):
            loader = TreatyCorpusSaveLoad(source_folder, language)
            if not loader.exists() or force:
                print('Processing: {}'.format(language))
                stream = CompressedFileReader(source_path, filename_pattern='*_{}*.txt'.format(language))
                treaty_corpus = TreatyCorpus(stream, tokenizer=tokenizer)        
                loader.store_as_mm_corpus(treaty_corpus)
        print('Corpus is up-to-date!')
    except Exception as ex:
        logger.error(ex)
        
current_archives = (ls_sorted('./data/*.zip') or [])

cuw = BaseWidgetUtility(
    source_path=widgets.Dropdown(
        options=current_archives,
        value=current_archives[-1] if len(current_archives) else None,
        description='Corpus:' #, **drop_style
    ),
    force_corpus_update=widgets.ToggleButton(
        description='Force Update',
        tooltip='Force refresh saved corpus cache (a performance feature). Use when ZIP-archive has been updated.',
        value=False #, **toggle_style
    )
)

icuw = widgets.interactive(
    store_mm_corpora,
    source_path=cuw.source_path,
    force=cuw.force_corpus_update,
    languages='en,it,fr,de'
)

display(widgets.VBox([widgets.HBox([cuw.source_path, cuw.force_corpus_update]), icuw.children[-1]]))

icuw.update()


### Task: Basic Corpus Statistics

In [None]:
# Code 

corpus = None
def display_token_toplist(source_folder, language, statistics='', remove_stopwords=False):
    global tlw, corpus
    try:
        
        tlw.progress.value = 1

        corpus = TreatyCorpusSaveLoad(source_folder=source_folder, lang=language[0]).load_mm_corpus()

        tlw.progress.value = 2
        service = MmCorpusStatisticsService(corpus, dictionary=corpus.dictionary, language=language)

        print("Corpus consists of {} documents, {} words in total and a vocabulary size of {} tokens."\
                  .format(len(corpus), corpus.dictionary.num_pos, len(corpus.dictionary)))

        tlw.progress.value = 3
        if statistics == 'word_freqs':
            display(service.compute_word_frequencies(remove_stopwords))
        elif statistics == 'documents':
            display(service.compute_document_stats())
        elif statistics == 'word_count':
            display(service.compute_word_stats())
        else:
            print('Unknown: ' + statistics)
            
    except Exception as ex:
        logger.error(ex)
        
    tlw.progress.value = 5
    tlw.progress.value = 0
    
tlw = BaseWidgetUtility(
    language=widgets.Dropdown(
        options={
            'English': ('en', 'english'),
            'French': ('fr', 'french'),
            'German': ('de', 'german'),
            'Italian': ('it', 'italian')
        },
        value=('en', 'english'),
        description='Language:', **drop_style
    ),
    statistics=widgets.Dropdown(
        options={
            'Word freqs': 'word_freqs',
            'Documents': 'documents',
            'Word count': 'word_count'
        },
        value='word_count',
        description='Statistics:', **drop_style
    ),    
    remove_stopwords=widgets.ToggleButton(
        description='Remove stopwords', value=True,
        tooltip='Do not include stopwords in token toplist', **toggle_style
    ),    
    progress=wf.create_int_progress_widget(min=0, max=5, step=1, value=0) #, layout=widgets.Layout(width='100%')),
)

itlw = widgets.interactive(
    display_token_toplist,
    source_folder='./data',
    language=tlw.language,
    statistics=tlw.statistics,
    remove_stopwords=tlw.remove_stopwords
)

boxes = widgets.HBox(
    [
        tlw.language, tlw.statistics, tlw.remove_stopwords, tlw.progress
    ]
)
display(widgets.VBox([boxes, itlw.children[-1]]))
itlw.update()


### <span style='color: red'>WORK IN PROGRESS</span> Task: Treaty Keyword Extraction (using TF-IDF weighting)
- [ML Wiki.org](http://mlwiki.org/index.php/TF-IDF)
- [Wikipedia](https://en.wikipedia.org/wiki/Tf%E2%80%93idf)
- Spärck Jones, K. (1972). "A Statistical Interpretation of Term Specificity and Its Application in Retrieval".
- Manning, C.D.; Raghavan, P.; Schutze, H. (2008). "Scoring, term weighting, and the vector space model". ([PDF](http://nlp.stanford.edu/IR-book/pdf/06vect.pdf))
- https://markroxor.github.io/blog/tfidf-pivoted_norm/
$\frac{\text{tf-idf}}{\sqrt{\mathrm{rowSums}(\text{tf-idf}^2)}}$
- https://nlp.stanford.edu/IR-book/html/htmledition/pivoted-normalized-document-length-1.html

Neural Network Methods in Natural Language Processing, Yoav Goldberg:
![image.png](attachment:image.png)
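
The row normalization in the formula above can be illustrated with a small pure-Python sketch. It is not the gensim model used in the code cell below: the function name `tfidf_l2` is hypothetical, and the weighting (raw term count times $\log_2(N/df)$) is an assumption made for the example.

```python
import math
from collections import Counter

def tfidf_l2(documents):
    """Return one {token: weight} dict per document, each scaled to unit length."""
    n_docs = len(documents)
    # Document frequency: the number of documents that contain each token
    df = Counter(token for doc in documents for token in set(doc))
    result = []
    for doc in documents:
        tf = Counter(doc)
        # Assumed weighting: raw term count times log2(N / df)
        weights = { t: f * math.log2(n_docs / df[t]) for t, f in tf.items() }
        # Divide by the square root of the row sum of squared weights
        norm = math.sqrt(sum(w * w for w in weights.values()))
        result.append({ t: (w / norm if norm > 0 else 0.0) for t, w in weights.items() })
    return result

docs = [
    ['cultural', 'exchange', 'treaty'],
    ['trade', 'treaty'],
    ['cultural', 'film', 'film'],
]
for weights in tfidf_l2(docs):
    print(sorted(weights.items(), key=lambda kv: -kv[1]))
```

After normalization the squared weights of each document sum to one, so scores are comparable between short and long treaties. A token that occurs in every document gets weight zero.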

In [None]:
# Code
from scipy.sparse import csr_matrix

    
def get_top_tfidf_words(data, n_top=5):
    top_list = data.groupby(['treaty_id'])\
        .apply(lambda x: x.nlargest(n_top, 'score'))\
        .reset_index(level=0, drop=True)
    return top_list

def compute_tfidf_scores(corpus, dictionary, smartirs='ntc'):
    #model = gensim.models.logentropy_model.LogEntropyModel(corpus, normalize=True)
    model = gensim.models.tfidfmodel.TfidfModel(corpus, dictionary=dictionary, normalize=True) #, smartirs=smartirs)
    rows, cols, scores = [], [], []
    for r, document in enumerate(corpus):
        vector = model[document]
        if len(vector) == 0:
            # Skip empty documents; zip(*[]) cannot be unpacked into two values
            continue
        c, v = zip(*vector)
        rows += (len(c) * [ int(r) ])
        cols += c
        scores += v
        
    return csr_matrix((scores, (rows, cols)))
    
if True: #'tfidf_cache' not in globals():
    tfidf_cache = {
    }
    
def display_tfidf_scores(source_folder, language, period, n_top=5, threshold=0.001):
    
    global state, tfw, tfidf_cache
    
    try:
        treaties = state.treaties

        tfw.progress.value = 0
        tfw.progress.value += 1
        if language[0] not in tfidf_cache.keys():
            corpus = TreatyCorpusSaveLoad(source_folder=source_folder, lang=language[0])\
                .load_mm_corpus(normalize_by_D=True)
            document_names = corpus.document_names
            dictionary = corpus.dictionary
            _ = dictionary[0]

            tfw.progress.value += 1
            A = compute_tfidf_scores(corpus, dictionary)

            tfw.progress.value += 1
            scores = pd.DataFrame(
                [ (i, j, dictionary.id2token[j], A[i, j]) for i, j in zip(*A.nonzero())],
                columns=['document_id', 'token_id', 'token', 'score']
            )
            tfw.progress.value += 1
            scores = scores.merge(document_names, how='inner', left_on='document_id', right_index=True)\
                .drop(['document_id', 'token_id', 'document_name'], axis=1)

            scores = scores[['treaty_id', 'token', 'score']]\
                .sort_values(['treaty_id', 'score'], ascending=[True, False])

            tfidf_cache[language[0]] = scores

        scores = tfidf_cache[language[0]]
        if threshold > 0:
            scores = scores.loc[scores.score >= threshold]

        tfw.progress.value += 1

        #scores = get_top_tfidf_words(scores, n_top=5)
        #scores = scores.groupby(['treaty_id']).sum() 

        scores = scores.groupby(['treaty_id'])\
            .apply(lambda x: x.nlargest(n_top, 'score'))\
            .reset_index(level=0, drop=True)\
            .set_index('treaty_id')

        if period is not None:
            periods = state.treaties[period]
            scores = scores.merge(periods.to_frame(), left_index=True, right_index=True, how='inner')\
                .groupby([period, 'token']).score.agg(['mean'])\
                .reset_index().rename(columns={'mean': 'score'}) #.sort_values('token')

        #['token'].apply(' '.join)

        display(scores)
    except Exception as ex:
        logger.error(ex)
        
    tfw.progress.value = 0

#if 'tfidf_scores' not in globals():
#    tfidf_scores = compute_document_tfidf(corpus, corpus.dictionary, state.treaties)
#    tfidf_scores = tfidf_scores.sort_values(['treaty_id', 'score'], ascending=[True, False])

tfw = BaseWidgetUtility(
    language=widgets.Dropdown(
        options={
            'English': ('en', 'english'),
            'French': ('fr', 'french'),
            'German': ('de', 'german'),
            'Italian': ('it', 'italian')
        },
        value=('en', 'english'),
        description='Language:', **drop_style
    ),
    remove_stopwords=widgets.ToggleButton(
        description='Remove stopwords', value=True,
        tooltip='Do not include stopwords in token toplist', **toggle_style
    ),    
    n_top=widgets.IntSlider(
        value=5, min=1, max=25, step=1,
        description='Top #:',
        continuous_update=False
    ),
    threshold=widgets.FloatSlider(
        value=0.001, min=0.0, max=0.5, step=0.001,
        description='Threshold:',
        tooltip='Words with a TF-IDF score below this value are filtered out',
        continuous_update=False,
        readout_format='.3f',
    ), 
    period=widgets.Dropdown(
        options={
            '': None,
            'Year': 'signed_year',
            'Default division': 'signed_period',
            'Alt. division': 'signed_period_alt'
        },
        value='signed_period',
        description='Period:', **drop_style
    ),
    output=widgets.Dropdown(
        options={
            '': None,
            'Year': 'signed_year',
            'Default division': 'signed_period',
            'Alt. division': 'signed_period_alt'
        },
        value='signed_period',
        description='Output:', **drop_style
    ),
    progress=widgets.IntProgress(min=0, max=5, step=1, value=0) #, layout=widgets.Layout(width='100%')),
)

itfw = widgets.interactive(
    display_tfidf_scores,
    source_folder='./data',
    language=tfw.language,
    n_top=tfw.n_top,
    threshold=tfw.threshold,
    period=tfw.period
)

boxes = widgets.HBox(
    [
        widgets.VBox([tfw.language, tfw.period]),
        widgets.VBox([tfw.n_top, tfw.threshold]),
        widgets.VBox([tfw.progress, tfw.output])
    ]
)

display(widgets.VBox([boxes, itfw.children[-1]]))
itfw.update()
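
The core of `display_tfidf_scores` is a "top n tokens per treaty, above a threshold" selection. A minimal sketch of that step on its own is shown below. The toy `scores` frame and the helper name `top_tokens` are illustrative assumptions, not part of the notebook, and the sketch uses `sort_values` followed by `groupby(...).head(n)` instead of the `apply`/`nlargest` chain above.

```python
import pandas as pd

# Toy stand-in for the cached TF-IDF frame (columns as in the cell above)
scores = pd.DataFrame({
    'treaty_id': ['A', 'A', 'A', 'B', 'B'],
    'token': ['culture', 'exchange', 'treaty', 'science', 'treaty'],
    'score': [0.5, 0.3, 0.1, 0.7, 0.2],
})

def top_tokens(scores, n_top=2, threshold=0.0):
    """Return the n_top highest-scoring tokens per treaty (hypothetical helper)."""
    # Drop tokens that fall below the threshold
    kept = scores.loc[scores.score >= threshold]
    # Sort by treaty and descending score, then keep the first n_top rows per treaty
    return (
        kept.sort_values(['treaty_id', 'score'], ascending=[True, False])
            .groupby('treaty_id')
            .head(n_top)
            .reset_index(drop=True)
    )

top = top_tokens(scores, n_top=2, threshold=0.15)
print(top)
```

With `threshold=0.15` the low-scoring `treaty` row of treaty `A` is removed before the per-treaty selection, so each treaty contributes at most two rows.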


### <span style='color: red'>WORK IN PROGRESS</span> Task: Network Visualization of Signed Treaties

In [59]:
# Visualize treaties
import bokeh.palettes as pals

periods_division = [
    (1919, 1939), (1940, 1944), (1919, 1944), (1945, 1955), (1956, 1966), (1967, 1972)
]
%run ./network_utility
%run ./plot_utility
def display_party_network(
    parties,
    period,
    only_is_cultural=True,
    layout_algorithm='',
    C=1.0,
    K=0.10,
    p1=0.10,
    output='network_bokeh',
    party_name='party',
    node_size_range=[40,60],
    refresh=False,
    palette_name=None
):
    global state, zn
    
    figsize=(900, 900)
    # Fall back to a default palette when none is selected (avoids a KeyError on None)
    if palette_name is None:
        palette = pals.RdYlBu[11]
    else:
        palette_id = max(pals.all_palettes[palette_name].keys())
        palette = pals.all_palettes[palette_name][palette_id]
    
    zn.refresh.value = False
    zn.progress.value = 1
    
    data = state.stacked_treaties.copy()
    
    data = data.loc[(data.signed_period!='other')]

    if only_is_cultural:
        data = data.loc[(data.is_cultural==True)]
        
    if isinstance(parties, list):
        data = data.loc[(data.party.isin(parties))]
    else:
        data = data.loc[(data.reversed==False)]
        
    # Restrict to treaties signed within the selected (start_year, end_year) period
    data = data.loc[(data.signed_year.between(period[0], period[1]))]
    data = data.sort_values('signed')
    zn.progress.value = 2
    data = data.groupby(['party', 'party_other']).size().reset_index().rename(columns={0: 'weight'})
    data = data[[ 'party', 'party_other', 'weight']]

    if party_name != 'party':
        for column in ['party', 'party_other']:
            data[column] = data[column].apply(lambda x: state.get_party_name(x, party_name))

    edges_data = [ tuple(x) for x in data.values ]

    #network = NetworkUtility.create_network_from_xyw_list(edges_data)
    
    G = nx.Graph(K=K)
    G.add_weighted_edges_from(edges_data)

    zn.progress.value = 3
        
    if output == 'network_graphviz':
        import graphviz, pydotplus
        def apply_styles(graph, styles):
            graph.graph_attr.update(
                ('graph' in styles and styles['graph']) or {}
            )
            graph.node_attr.update(
                ('nodes' in styles and styles['nodes']) or {}
            )
            graph.edge_attr.update(
                ('edges' in styles and styles['edges']) or {}
            )
            return graph
        styles = {
            'graph': {
                'label': 'Graph',
                'fontsize': '16',
                'fontcolor': 'white',
                'bgcolor': '#333333',
                'rankdir': 'BT',
            },
            'nodes': {
                'fontname': 'Helvetica',
                'shape': 'hexagon',
                'fontcolor': 'white',
                'color': 'white',
                'style': 'filled',
                'fillcolor': '#006699',
            },
            'edges': {
                'style': 'dashed',
                'color': 'white',
                'arrowhead': 'open',
                'fontname': 'Courier',
                'fontsize': '12',
                'fontcolor': 'white',
            }
        }
        P=nx.nx_pydot.to_pydot(G)
        P.format = 'svg'
        #if root is not None :
        #    P.set("root",make_str(root))
        D=P.create_dot(prog='circo')
        if not D:
            return
        Q=pydotplus.graph_from_dot_data(D)
        #Q = apply_styles(Q, styles)
        from IPython.display import Image
        I = Image(Q.create_png())
        display(I)
        
    elif output == 'network_bokeh':
        args = PlotNetworkUtility.layout_args(layout_algorithm, network=G, scale=1.0, k=K)
        if layout_algorithm.startswith('graphtool'):
            global layout_gt, G_gt
            import graph_tool.draw as gt_draw
            import graph_tool.all as gt

            G_gt = GraphToolUtility.nx2gt(G)
            G_gt.set_directed(False)
            weights = G_gt.edge_properties['weight']
            N = len(G)
            if layout_algorithm.endswith('sfdp'):
                layout_gt = gt_draw.sfdp_layout(
                    G_gt, eweight=weights, K=K, C=C, p=p1
                )
            elif layout_algorithm.endswith('arf'):
                layout_gt = gt_draw.arf_layout(G_gt, weight=weights, d=K, a=C)
            elif layout_algorithm.endswith('fruchterman_reingold'):
                layout_gt = gt_draw.fruchterman_reingold_layout(G_gt, weight=weights, a=(2.0*N*K), r=2.0*C)
                
            if False:  # graph-tool plot
                v_text = G_gt.vertex_properties['id']
                v_degrees_p = G_gt.degree_property_map('out')
                v_degrees_p.a = np.sqrt(v_degrees_p.a)+2
                v_size_p = gt.prop_to_size(v_degrees_p, node_size_range[0], node_size_range[1])
                e_size_p = gt.prop_to_size(weights, 1.0, 4.0)
                #state = gt.minimize_blockmodel_dl(G_gt)
                #state.draw(
                #c = gt.all.closeness(G_gt)

                v_blocks = gt.minimize_blockmodel_dl(G_gt).get_blocks()
                print(list(v_blocks))
                plot_color = G_gt.new_vertex_property('vector<double>')
                G_gt.vertex_properties['plot_color'] = plot_color
                for v_i, v in enumerate(G_gt.vertices()):
                    scolor = palette[v_blocks[v_i]]
                    plot_color[v] = tuple(int(scolor[i:i+2], 16) for i in (1, 3, 5)) + (1,)

                gt_draw.graph_draw(
                    G_gt,
                    #vorder=c,
                    pos=layout_gt,
                    output_size=(1000, 1000),
                    vertex_text=v_text,
                    vertex_color=[1,1,1,0],
                    vertex_fill_color=plot_color,
                    vertex_size=v_degrees_p,
                    edge_pen_width=e_size_p
                )
                return
            
            layout = { G_gt.vertex_properties['id'][i]: layout_gt[i] for i in G_gt.vertices() }

        elif layout_algorithm.startswith('graphviz'):
            def norm_layout(layout):
                max_xy = max([ max(x,y) for x,y in layout.values()])
                layout = { n: (layout[n][0]/max_xy, layout[n][1]/max_xy) for n in layout.keys() }
                return layout
            
            G.graph['K'] = K
            G.graph['overlap'] = False
            engine = layout_algorithm.split('_')[1]
            args={} # "-Goverlap=scalexy -Gepsilon=5 -GK{}".format(k).replace(',','.')
            layout = nx.nx_pydot.pydot_layout(G, prog=engine, args=args)
            layout = norm_layout(layout)
            
        else:
            layout = (get_layout_function(layout_algorithm))(G, **args)
            
        zn.progress.value = 4
        p = PlotNetworkUtility.plot_network(
            network=G,
            layout=layout,
            scale=1.0,
            text_opts=dict(
                x_offset=0, #y_offset=5,
                level='overlay',
                text_align='center',
                text_baseline='middle',
                render_mode='canvas',
                text_font="Tahoma",
                text_font_size="9pt",
                text_color='black'
            ),
            node_opts= dict(
                color=None, #'green',
                level='overlay',
                alpha=1.0
            ),
            line_opts=dict(
                color='green',
                alpha=0.5
            ),
            figsize=figsize,
            node_size_source=list(G.degree()),
            node_size_range=node_size_range,
            palette=palette,  # Spectral[9]
            x_axis_type=None,
            y_axis_type=None,
            background_fill_color='white'
        )
        zn.progress.value = 6
        bp.show(p)
        
    elif output == 'table':
        display(data)
    else:
        display(pivot_ui(data))
        
    zn.progress.value = 0

zn = BaseWidgetUtility(
    period=widgets.Dropdown(
        options={
            '{} to {}'.format(x[0], x[1]): x for x in list(set(period_divisions[0] + period_divisions[1]))
        },
        value=period_divisions[0][0],
        description='Period:', layout=widgets.Layout(width='200px')
    ),
    parties=widgets.Dropdown(
        description='Parties:',
        options={
            '(all)': None,
            'PartyOf5': parties_of_interest,
            'France': [ 'FRANCE' ],
            'Italy': [ 'ITALY' ],
            'UK': [ 'UK' ],
            'India': [ 'INDIA' ],
            'Germany': [ 'GERMU', 'GERMAN', 'GERME', 'GERMW' ]
        },
        value=None,
        layout=widgets.Layout(width='220px')
    ),
    party_name=widgets.Dropdown(
        description='Name:',
        options={
            'WTI Code': 'party',
            'WTI Name': 'party_name',
            'WTI Short': 'short_name',
            'CC': 'country_code',
            'Country': 'party_country'
        },
        value='short_name',
        layout=widgets.Layout(width='220px')
    ),
    palette=widgets.Dropdown(
        description='Color:',
        options={
            palette_name: palette_name
                    for palette_name in pals.all_palettes.keys()
                        if any([ len(x) > 7 for x in pals.all_palettes[palette_name].values()])
        },
        #value='short_name',
        layout=widgets.Layout(width='220px')
    ),
    C=widgets.IntSlider(
        description='C', min=0, max=100, step=1, value=1,
        continuous_update=False, orientation='vertical', layout=widgets.Layout(width='30px', height='160px')
    ),
    K=widgets.FloatSlider(
        description='K', min=0.01, max=1.0, step=0.01, value=0.10,
        continuous_update=False, orientation='vertical', layout=widgets.Layout(width='30px', height='160px')
    ),
    p=widgets.FloatSlider(
        description='p', min=0.01, max=2.0, step=0.01, value=1.10,
        continuous_update=False, orientation='vertical', layout=widgets.Layout(width='30px', height='160px')
    ),
    node_size_range=widgets.IntRangeSlider(
        description='Node size',
        value=[20, 40], min=5, max=100, step=1,
        continuous_update=False, orientation='vertical', layout=widgets.Layout(width='70px', height='160px')
    ),
    only_is_cultural=widgets.ToggleButton(
        description='Only Cultural', value=True,
        tooltip='Display only "is_cultural" treaties', layout=widgets.Layout(width='100px')
    ),
    output=widgets.Dropdown(
        description='Output:',
        options={
            'Bokeh': 'network_bokeh',
            'Graphviz': 'network_graphviz',
            'List': 'table'
        },
        value='network_bokeh',
        layout=widgets.Layout(width='220px')
    ),
    layout=widgets.Dropdown(
        description='Layout',
        options={
            x[0]: x[1] for x in list(zip(layout_function_name.values(), layout_function_name.keys())) +
                    [('Graphviz ({})'.format(x), 'graphviz_{}'.format(x))
                     for x in  ['neato', 'dot', 'circo', 'fdp', 'sfdp']
                    ] + # 'wc', 'gvcolor', 'ccomps', 'sccmap', 'twopi', 'gvpr', 'nop', 'tred' , 'acyclic'
                    [
                        ('Graph-Tool ({})'.format(x), 'graphtool_{}'.format(x))
                             for x in  ['sfdp', 'arf', 'fruchterman_reingold']
                    ]
        }, 
        layout=widgets.Layout(width='220px')
    ),
    progress=wf.create_int_progress_widget(min=0, max=4, step=1, value=0, layout=widgets.Layout(width="99%")),
    refresh=widgets.ToggleButton(
        description='Refresh', value=False,
        tooltip='Update plot', layout=widgets.Layout(width='100px')
    ),
) 
#search_text = widgets.Text(description = 'Search') 
#search_result = widgets.Select(description = 'Select table')

#def search_action(sender):
#    phrase = search_text.value
#    df = search(phrase) # A function that returns the results in a pandas df
#    titles = df['title'].tolist()
#    with search_result.hold_trait_notifications():
#        search_result.options = titles
        
wn = widgets.interactive(
    display_party_network,
    parties=zn.parties,
    period=zn.period,
    only_is_cultural=zn.only_is_cultural,
    layout_algorithm=zn.layout,
    C=zn.C,
    K=zn.K,
    p1=zn.p,
    output=zn.output,
    party_name=zn.party_name,
    node_size_range=zn.node_size_range,
    refresh=zn.refresh,
    palette_name=zn.palette
)
boxes = widgets.HBox([
    widgets.VBox([widgets.HBox([zn.parties, zn.only_is_cultural]), zn.period, zn.progress]),
    widgets.VBox([zn.layout, zn.party_name, zn.output, zn.palette]),
    widgets.HBox([zn.K, zn.C, zn.p, zn.node_size_range]),
    widgets.VBox([zn.refresh]),
])

display(widgets.VBox([boxes, wn.children[-1]]))

wn.update()
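
The data preparation inside `display_party_network` reduces to: filter the stacked treaties by signing year, count treaties per party pair, and load the counts as edge weights. A minimal, self-contained sketch of that step follows. The toy `stacked` frame and the helper name `build_party_graph` are assumptions made for illustration; the real `state.stacked_treaties` has more columns.

```python
import pandas as pd
import networkx as nx

# Toy stand-in for state.stacked_treaties (one row per treaty and party pair)
stacked = pd.DataFrame({
    'party': ['FRANCE', 'FRANCE', 'ITALY', 'FRANCE'],
    'party_other': ['ITALY', 'ITALY', 'UK', 'UK'],
    'signed_year': [1921, 1935, 1950, 1938],
})

def build_party_graph(stacked, period):
    """Build an undirected graph whose edge weights count treaties per party pair."""
    # Keep treaties signed within the inclusive (start_year, end_year) period
    data = stacked.loc[stacked.signed_year.between(period[0], period[1])]
    # One row per party pair, with the number of treaties as weight
    edges = data.groupby(['party', 'party_other']).size().reset_index(name='weight')
    G = nx.Graph()
    G.add_weighted_edges_from(edges.itertuples(index=False, name=None))
    return G

G = build_party_graph(stacked, (1919, 1939))
print(sorted(G.edges(data='weight')))
```

The resulting graph can be passed to any NetworkX layout function (for example `nx.spring_layout(G, k=0.1)`) before plotting.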

### <span style='color:red'>IGNORE EVERYTHING BELOW</span>

### Chart: Headnote Co-Occurrences

In [None]:
# Code
import nltk
from nltk.stem import WordNetLemmatizer

class CoOccurrance():

    def __init__(self, tokenizer, stopwords=None, lemmatizer=None, min_word_size=2):
        
        self.transforms = [
            tokenizer,
            lambda ws: ( x for x in ws if len(x) >= min_word_size ),
            lambda ws: ( x for x in ws if any(ch.isalpha() for ch in x)) 
        ]
        
        if stopwords is not None:
            self.transforms += [ lambda ws: ( x for x in ws if x not in stopwords ) ]
            
        if lemmatizer is not None:
            self.transforms += [ lambda ws: ( lemmatizer(x) for x in ws ) ]

    def _apply_transforms(self, ws):
        for f in self.transforms:
            ws = f(ws)
        return list(ws)
    
    def compute(self, headnotes):
        
        texts = [ x.lower() for x in list(headnotes) ]
        tokens = list(map(self._apply_transforms, texts))
        df = pd.DataFrame({'headnote': headnotes, 'tokens': tokens })
        
        df_stacked = pd.DataFrame(df.tokens.tolist(), index=df.index).stack()\
            .reset_index().rename(columns={'level_1': 'sequence_id', 0: 'token'})
            
        return df_stacked
    
    def compute_co_occurrence(self, treaties, pos_tags, only_cultural_treaties=False):

        # Keep only the tags that belong to the treaties of interest
        df_pos_tags = pos_tags.merge(treaties, how='inner', left_on='treaty_id', right_index=True)
        
        if only_cultural_treaties:
            df_pos_tags = df_pos_tags[(df_pos_tags.is_cultural.str.contains('yes',na=False))]

        # Self join of words within same treaty
        df_co_occurrence = pd.merge(df_pos_tags, df_pos_tags, how='inner', left_on='treaty_id', right_on='treaty_id')
        # Only consider a specific pair once
        df_co_occurrence = df_co_occurrence[(df_co_occurrence.wid_x < df_co_occurrence.wid_y)]
        # Reduce number of returned columns
        df_co_occurrence = df_co_occurrence[['treaty_id', 'year_x', 'is_cultural_x', 'lemma_x', 'lemma_y' ]]
        # Rename columns
        df_co_occurrence.columns = ['treaty_id', 'year', 'is_cultural', 'lemma_x', 'lemma_y' ]

        # Sort token pair so smallest always comes first
        lemma_x = df_co_occurrence[['lemma_x', 'lemma_y']].min(axis=1)
        lemma_y = df_co_occurrence[['lemma_x', 'lemma_y']].max(axis=1)
        df_co_occurrence['lemma_x'] = lemma_x
        df_co_occurrence['lemma_y'] = lemma_y

        return df_co_occurrence
    
def create_bigram_transformer(documents):
    import gensim.models.phrases
    bigram = gensim.models.phrases.Phrases(map(nltk.tokenize.word_tokenize, documents))
    return lambda ws: bigram[ws]

treaties = state.treaties.loc[(state.treaties.is_cultural)]
headnotes = treaties['headnote']
stopwords = nltk.corpus.stopwords.words('english')
tokenizer = nltk.tokenize.word_tokenize
lemmatizer = WordNetLemmatizer().lemmatize
df = CoOccurrance(tokenizer=tokenizer, stopwords=stopwords, lemmatizer=lemmatizer, min_word_size=2).compute(headnotes)
df.head()
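
The self-join in `compute_co_occurrence` pairs every token with every other token of the same treaty and orders each pair so that the smaller token comes first. The same idea can be sketched with the standard library alone. The function name `count_co_occurrences` and the sample token lists are illustrative assumptions, and unlike the self-join this sketch counts each distinct pair at most once per headnote.

```python
from collections import Counter
from itertools import combinations

def count_co_occurrences(token_lists):
    """Count how many headnotes each unordered token pair appears in."""
    counts = Counter()
    for tokens in token_lists:
        # Sorting the unique tokens guarantees that (a, b) always has a < b
        unique = sorted(set(tokens))
        counts.update(combinations(unique, 2))
    return counts

# Hypothetical, already tokenized headnotes
headnotes_tokens = [
    ['cultural', 'exchange', 'agreement'],
    ['cultural', 'agreement'],
    ['scientific', 'exchange'],
]

pairs = count_co_occurrences(headnotes_tokens)
for pair, n in pairs.most_common(3):
    print(pair, n)
```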

In [None]:
state.parties.loc['ADF', 'party_name']

In [None]:
def get_document_token_frequencies(corpus, treaties):
    '''
    Returns a DataFrame with per-document token frequencies, i.e. "melts" the doc-term matrix.
    '''
    data = ((document_id, x[0], x[1]) for document_id, values in enumerate(corpus) for x in values )
    df = pd.DataFrame(list(data), columns=['document_id', 'token_id', 'count'])
    df = df.merge(corpus.document_names, left_on='document_id', right_index=True)[['treaty_id', 'token_id', 'count']]
    df = df.merge(treaties[['signed_year', 'signed_period', 'signed_period_alt', 'is_cultural']], 
                  left_on='treaty_id', right_index=True)
    _ = corpus.dictionary[0]
    id2token = corpus.dictionary.id2token
    df['token'] = df.token_id.apply(lambda x: id2token[x])
    print(len(df))
    df = df.loc[(df.token.str.len() > 2)]
    print(len(df))
    return df

df = get_document_token_frequencies(corpus, state.treaties)
print(df.head())

In [None]:
#pip install -U spacy
#python -m spacy download en
#python -m spacy download de
#python -m spacy download fr
#python -m spacy download it

import spacy
import nltk

from spacy.tokens import Doc

class SpacyNltkTokenizer(object):
    def __init__(self, vocab):
        self.vocab = vocab

    def __call__(self, text):
        words = nltk.tokenize.word_tokenize(text)
        # All tokens 'own' a subsequent space character in this tokenizer
        spaces = [True] * len(words)
        return Doc(self.vocab, words=words, spaces=spaces)
    
nlp = spacy.load('en')
nlp.tokenizer = SpacyNltkTokenizer(nlp.vocab)


In [None]:
#
nlp.pipeline

text = 'Donald Trump lives in New York.'
doc = nlp(text)
for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)
    
for ent in doc.ents:
    if ent.label_ == "PERSON":
        ent.merge(ent.root.tag_, ent.text, ent.label_)
        
for ent in doc.ents:
    if len(ent.orth_.split()) > 1:
        start = text.index(ent.orth_)
        end = start + len(ent.orth_)
        print(ent.orth_ + ' start: ' + str(start) + ' ' + 'end: ' + str(end) + ' ' + 'entity: ' + ent.label_)
        doc.merge(start, end, '', '', ent.label_)
        for token in doc:
            print(token.orth_)

In [None]:
# doc = nlp(u'') #u'Apple is looking at buying U.K. startup for $1 billion')
#special_case = [{ORTH: u'gim', LEMMA: u'give', POS: u'VERB'}, {ORTH: u'me'}]
#nlp.tokenizer.add_special_case(u'gimme', special_case)

#nlp.tokenizer = nltk.tokenize.word_tokenize
# nlp.tokenizer = my_tokenizer_factory(nlp.vocab)

treaty_headnotes = pd.DataFrame(state.treaties.head(10)['headnote'])
texts = ( (index, [ (t.text, t.lemma_, t.tag_, t.is_alpha, t.is_stop) for t in nlp(row[0]) ]) 
         for index, row in treaty_headnotes.iterrows() )

result = [x for x in texts]
# print(result)

#tagged_documents = (index, row[0], [ x for x in nlp(row[0])]) for index, row in treaty_headnotes.iterrows() )
#print(next(tagged_documents))
#tokens = ( (id,) + doc for (id, doc) in (x[0],x[1]) for x in treaty_headnotes.iteritems())
#print(next(tokens))

#    for treaty, doc in tagged_documents:
#    tokens = [ ]
#    for token in doc:
#        print(token.text, token.lemma_, token.pos_, token.tag_, token.dep_, token.shape_, token.is_alpha, token.is_stop)
         

In [None]:
from nltk.corpus import wordnet
print(next(wordnet.words()))

In [None]:
# Code
from numpy import exp
from scipy.special import factorial
import matplotlib.pyplot as plt
%matplotlib inline

poisson_pmf = lambda y, mu: mu**y / factorial(y) * exp(-mu)
y_values = range(0, 25)

fig, ax = plt.subplots(figsize=(12, 8))

for mu in [1, 5, 10]:
    distribution = []
    for y_i in y_values:
        distribution.append(poisson_pmf(y_i, mu))
    ax.plot(y_values, distribution, label=(r'$\mu$=' + str(mu)),
            alpha=0.5, marker='o', markersize=8)

ax.grid()
ax.set_xlabel('$y$', fontsize=14)
ax.set_ylabel(r'$f(y \mid \mu)$', fontsize=14)
ax.axis(xmin=0, ymin=0)
ax.legend(fontsize=14)

plt.show()
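
As a quick sanity check of the Poisson formula plotted above, the probabilities should sum to (approximately) one and the mean should equal `mu`. The sketch below uses only the standard library; the cut-off of 60 terms is an arbitrary assumption that is large enough for `mu = 5`.

```python
from math import exp, factorial

def poisson_pmf(y, mu):
    """Poisson probability mass function: mu**y / y! * exp(-mu)."""
    return mu**y / factorial(y) * exp(-mu)

mu = 5
support = range(0, 60)  # truncated support; the tail beyond 60 is negligible for mu = 5

total = sum(poisson_pmf(y, mu) for y in support)
mean = sum(y * poisson_pmf(y, mu) for y in support)

print(round(total, 6), round(mean, 6))
```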

In [None]:
# Code
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import numpy as np
import gensim
from gensim import corpora, models, similarities
tweets=[
['human', 'interface', 'computer', 'human'],
 ['survey', 'user', 'computer', 'system', 'response', 'time'],
 ['eps', 'user', 'interface', 'system'],
 ['system', 'human', 'system', 'eps'],
 ['user', 'response', 'time'],
 ['trees'],
 ['graph', 'trees'],
 ['graph', 'minors', 'trees'],
 ['graph', 'minors', 'survey']] 

# create dictionary (index of each element)
dictionary = corpora.Dictionary(tweets)
_ = dictionary[0]
raw_corpus = [dictionary.doc2bow(t) for t in tweets]
tfidf = models.TfidfModel(raw_corpus, smartirs='ntc') # step 1 -- initialize a model
document = tfidf[raw_corpus[0]]

x = [ { dictionary.id2token[id]: score for (id, score) in document } for document in tfidf[raw_corpus]]
y = [ sum([x[1] for x in tfidf[d]]) for d in raw_corpus]
z = [ [x[1] for x in tfidf[d]] for d in raw_corpus]
y = [ gensim.matutils.unitvec(np.array([x[1] for x in tfidf[d]]), norm='l1') for d in raw_corpus]

w = np.array([0.40824829046386296, 0.8164965809277259, 0.40824829046386296])
gensim.matutils.unitvec(w, norm='l1')
print(y)

#https://stackoverflow.com/questions/42269313/interpreting-the-sum-of-tf-idf-scores-of-words-across-documents
#One way to summarize TF-IDF over a whole corpus is to take, for each term, the highest score it reaches in any document.
corpus_tfidf = tfidf[raw_corpus]

toplist = {}
for doc in corpus_tfidf:
    for token_id, score in doc:
        if token_id not in toplist:
            toplist[token_id] = 0

        if score > toplist[token_id]:
            toplist[token_id] = score

for i, item in enumerate(sorted(toplist.items(), key=lambda x: x[1], reverse=True), 1):
    print("%2s: %-13s %s" % (i, dictionary[item[0]], item[1]))
    if i == 6: break
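
The "highest score per term" summary used in the loop above can be illustrated without gensim. The sketch below is a simplified assumption: it computes an unnormalized TF-IDF with a base-2 logarithm, so the numbers will not match gensim's `TfidfModel` output, and the helper names are hypothetical.

```python
import math
from collections import Counter

def tfidf_per_document(docs):
    """Unnormalized TF-IDF: term count times log2(N / document frequency)."""
    n_docs = len(docs)
    # Document frequency: number of documents each token occurs in
    df = Counter(token for doc in docs for token in set(doc))
    weighted = []
    for doc in docs:
        tf = Counter(doc)
        weighted.append({t: c * math.log2(n_docs / df[t]) for t, c in tf.items()})
    return weighted

def max_score_per_token(weighted_docs):
    """For each token, keep the highest score it reaches in any document."""
    top = {}
    for doc in weighted_docs:
        for token, score in doc.items():
            top[token] = max(score, top.get(token, 0.0))
    return top

docs = [['graph', 'trees'], ['graph', 'minors', 'trees'], ['trees']]
top = max_score_per_token(tfidf_per_document(docs))

for token, score in sorted(top.items(), key=lambda x: x[1], reverse=True):
    print("%-8s %.3f" % (token, score))
```

A token that occurs in every document (here `trees`) gets a score of zero, which is why it ends up at the bottom of the list.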
            

In [None]:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
 
'''
 
This script just shows the basic workflow to compute a TF-IDF similarity matrix with Gensim
 
 
OUTPUT :
 
clemsos@miner $ python gensim_workflow.py 
 
 
How to use Gensim to compute TF-IDF similarity step by step
----------
Let's start with a raw corpus :<type 'list'>
 
STEP 1 : Index and vectorize
----------
We create a dictionary, an index of all unique values: <class 'gensim.corpora.dictionary.Dictionary'>
Then convert tokenized documents to vectors: <type 'list'>
Save the vectorized corpus as a .mm file
 
STEP 2 : Transform and compute similarity between corpuses
----------
We load our dictionary : <class 'gensim.corpora.dictionary.Dictionary'>
We load our vector corpus : <class 'gensim.corpora.mmcorpus.MmCorpus'> 
We initialize our TF-IDF transformation tool : <class 'gensim.models.tfidfmodel.TfidfModel'>
We convert our vectors corpus to TF-IDF space : <class 'gensim.interfaces.TransformedCorpus'>
 
STEP 3 : Create similarity matrix of all files
----------
We compute similarities from the TF-IDF corpus : <class 'gensim.similarities.docsim.MatrixSimilarity'>
We get a similarity matrix for all documents in the corpus <type 'numpy.ndarray'>
 
Done in 0.011s
 
'''
from gensim import corpora, models, similarities
from time import time
 
t0=time()
 
# keywords have been extracted and stopwords removed.
 
tweets=[['human', 'interface', 'computer'],
 ['survey', 'user', 'computer', 'system', 'response', 'time'],
 ['eps', 'user', 'interface', 'system'],
 ['system', 'human', 'system', 'eps'],
 ['user', 'response', 'time'],
 ['trees'],
 ['graph', 'trees'],
 ['graph', 'minors', 'trees'],
 ['graph', 'minors', 'survey']] 
 
print("How to use Gensim to compute TF-IDF similarity step by step")
print('-' * 10)
print("Let's start with a raw corpus :%s" % type(tweets))
print()
# STEP 1 : Compile corpus and dictionary
print("STEP 1 : Index and vectorize")
print('-' * 10)

# create dictionary (index of each element)
dictionary = corpora.Dictionary(tweets)
dictionary.save('/tmp/tweets.dict') # store the dictionary, for future reference
print("We create a dictionary, an index of all unique values: %s" % type(dictionary))

# compile corpus (vectors with the number of times each element appears)
raw_corpus = [dictionary.doc2bow(t) for t in tweets]
print("Then convert tokenized documents to vectors: %s" % type(raw_corpus))
corpora.MmCorpus.serialize('/tmp/tweets.mm', raw_corpus) # store to disk
print("Save the vectorized corpus as a .mm file")
print()

# STEP 2 : similarity between corpuses
print("STEP 2 : Transform and compute similarity between corpuses")
print('-' * 10)
dictionary = corpora.Dictionary.load('/tmp/tweets.dict')
print("We load our dictionary : %s" % type(dictionary))

corpus = corpora.MmCorpus('/tmp/tweets.mm')
print("We load our vector corpus : %s " % type(corpus))

# Transform Text with TF-IDF
tfidf = models.TfidfModel(corpus) # step 1 -- initialize a model
print("We initialize our TF-IDF transformation tool : %s" % type(tfidf))

# corpus tf-idf
corpus_tfidf = tfidf[corpus]
print("We convert our vectors corpus to TF-IDF space : %s" % type(corpus_tfidf))
print()

# STEP 3 : Create similarity matrix of all files
print("STEP 3 : Create similarity matrix of all files")
print('-' * 10)
index = similarities.MatrixSimilarity(tfidf[corpus])
print("We compute similarities from the TF-IDF corpus : %s" % type(index))
index.save('/tmp/deerwester.index')
index = similarities.MatrixSimilarity.load('/tmp/deerwester.index')

sims = index[corpus_tfidf]
print("We get a similarity matrix for all documents in the corpus %s" % type(sims))
print()
print("Done in %.3fs" % (time() - t0))
 
# print sims
# print list(enumerate(sims))
# sims = sorted(enumerate(sims), key=lambda item: item[1])
# print sims # print sorted (document number, similarity score) 2-tuples