## UN Topic Modeling

1. Run all the cells below. 
 * Compare results.
 * Do the topics make sense?
2. Have different members of your group try different vectorizer parameters. What differences do you observe?
2. Using the same vectorizer parameters, have different members of your group try different number of topics. What differences do you observe? What is the most sensical number of topics?
3. Use a model developed on one sample from the UN to predict document topics for a different sample. Any differences?
5. Bonus: Build a pipeline that includes a count vectorizer, topic models, and logistic regression to find the optimal number of topics for predicting speech era.

In [1]:
%matplotlib inline

import pandas as pd
%pip install pdtext --upgrade

Collecting pdtext
  Downloading https://files.pythonhosted.org/packages/7a/dc/91f5a9ac7fac68257f07322202fcd856d628eca06555efdfef6e1e7662e0/pdtext-0.2.0-py2.py3-none-any.whl
Installing collected packages: pdtext
  Found existing installation: pdtext 0.1.2
    Uninstalling pdtext-0.1.2:
      Successfully uninstalled pdtext-0.1.2
Successfully installed pdtext-0.2.0
Note: you may need to restart the kernel to use updated packages.


In [2]:
un_df = pd.read_json('files/un-general-debates.json')
print(len(un_df))

7507


In [3]:
un_df.head()

Unnamed: 0,country_code,speech_text,speech_year
0,MDV,﻿It is indeed a pleasure for me and the member...,1989
1,FIN,"﻿\nMay I begin by congratulating you. Sir, on ...",1989
2,NER,"﻿\nMr. President, it is a particular pleasure ...",1989
3,URY,﻿\nDuring the debate at the fortieth session o...,1989
4,ZWE,﻿I should like at the outset to express my del...,1989


The three cells below, which produce the `un_df_sample` dataframe split the speeches into paragraphs and then create a new dataset where each paragraph is a case. The short texts allow for faster, more meaninful analysis. However, 1,386,887 texts is a lot, so only work with a sample of them. I suggest 10,000 cases.

In [4]:
# split text into paragraphs
un_df['speech_split'] = un_df['speech_text'].str.split('\n')

from pdtext.tools import tokens_to_rows
# create a new dataframe where each paragraph is a row
long_df = tokens_to_rows(un_df['speech_split'])

# merge with original dataset to get year and country
long_df = long_df.set_index('index')
long_df = long_df.merge(un_df[['speech_year', 'country_code']], left_index=True, right_index=True).reset_index()

# eliminate cases where text is blank
filter = long_df["token"] != ""
paragraph_df = long_df[filter].copy().reset_index()


In [5]:
len(paragraph_df)

1386887

In [6]:
un_df_sample = paragraph_df.sample(10000)

In [7]:
un_df_sample.head(10)

Unnamed: 0,level_0,index,order,token,speech_year,country_code
481440,499061,2769,124,"and defend human rights, our Government",2009,TGO
530161,548329,3093,42,184.\tPointing to the example of the European ...,1976,IRL
1069186,1103525,5965,18,"52.\tIn accelerating the arms race, especially...",1981,UKR
721304,744863,4116,13,"to preside at this session, I should like also...",1982,LSO
728499,752253,4138,93,humanity is still very far from having inscrib...,1998,CRI
184590,193971,1392,100,objectives contained in these documents. Women...,1996,ZMB
314358,327151,1834,45,continue to be one of our best hopes for resol...,1997,THA
353721,369624,2140,2,The present session of the United Nations Gene...,1988,MNG
611757,633715,3656,153,geography of terrorism envelopes all humanity. We,2005,IRQ
1256686,1295985,6977,121,of the source countries and the destination co...,2006,WSM


In [8]:
un_df_sample['post_soviet'] = un_df_sample['speech_year'] > 1991

In [9]:
# Import the two libraries

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

In [16]:
# set up a vectorizer with 1000 features

vectorizer = CountVectorizer(lowercase    = True,
                             ngram_range  = (1,2),
                             min_df       = .01,
                             stop_words   = 'english',
                             max_df       = .90,
                             max_features = 500)

vectorizer.fit(un_df_sample['token'])

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
                dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
                lowercase=True, max_df=0.9, max_features=500, min_df=0.01,
                ngram_range=(1, 2), preprocessor=None, stop_words='english',
                strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
                tokenizer=None, vocabulary=None)

In [17]:
# produce a tf matrix from the vectorizer

un_word_counts = vectorizer.transform(un_df_sample['token'])

In [18]:
# run th LDA model
lda_model = LatentDirichletAllocation(n_components = 10,
                                      max_iter     = 100,
                                      n_jobs       = -1,
                                      verbose      = 1)

lda_model.fit(un_word_counts) 

iteration: 1 of max_iter: 100
iteration: 2 of max_iter: 100
iteration: 3 of max_iter: 100
iteration: 4 of max_iter: 100
iteration: 5 of max_iter: 100
iteration: 6 of max_iter: 100
iteration: 7 of max_iter: 100
iteration: 8 of max_iter: 100
iteration: 9 of max_iter: 100
iteration: 10 of max_iter: 100
iteration: 11 of max_iter: 100
iteration: 12 of max_iter: 100
iteration: 13 of max_iter: 100
iteration: 14 of max_iter: 100
iteration: 15 of max_iter: 100
iteration: 16 of max_iter: 100
iteration: 17 of max_iter: 100
iteration: 18 of max_iter: 100
iteration: 19 of max_iter: 100
iteration: 20 of max_iter: 100
iteration: 21 of max_iter: 100
iteration: 22 of max_iter: 100
iteration: 23 of max_iter: 100
iteration: 24 of max_iter: 100
iteration: 25 of max_iter: 100
iteration: 26 of max_iter: 100
iteration: 27 of max_iter: 100
iteration: 28 of max_iter: 100
iteration: 29 of max_iter: 100
iteration: 30 of max_iter: 100
iteration: 31 of max_iter: 100
iteration: 32 of max_iter: 100
iteration: 33 of 

LatentDirichletAllocation(batch_size=128, doc_topic_prior=None,
                          evaluate_every=-1, learning_decay=0.7,
                          learning_method='batch', learning_offset=10.0,
                          max_doc_update_iter=100, max_iter=100,
                          mean_change_tol=0.001, n_components=10, n_jobs=-1,
                          perp_tol=0.1, random_state=None,
                          topic_word_prior=None, total_samples=1000000.0,
                          verbose=1)

In [21]:
lda_model.components_

array([[1.00022588e-01, 1.00004299e-01, 1.00004164e-01, 1.00006508e-01,
        3.86390425e+01, 1.00008305e-01, 1.00013143e-01, 1.35976825e+01,
        1.00009560e-01, 1.00003300e-01, 1.00009067e-01, 1.00012060e-01,
        1.00019214e-01, 1.00004306e-01, 1.00008504e-01, 1.00009718e-01,
        1.00006478e-01, 1.00014271e-01, 1.00007824e-01, 1.00014046e-01,
        1.00010015e-01, 1.00004568e-01, 1.00008881e-01, 1.00010103e-01,
        1.00006855e-01, 1.00013382e-01, 1.00007232e-01, 1.00010593e-01,
        1.00017348e-01, 1.00005763e-01, 1.00010438e-01, 1.00009738e-01,
        1.00021162e-01, 3.16019725e+01, 1.00015273e-01, 8.44099920e+02,
        1.00015853e-01, 1.00017667e-01, 1.00014463e-01, 1.00003375e-01,
        1.00009982e-01, 1.00015047e-01, 1.00010648e-01, 3.97775205e+02,
        1.00019006e-01, 1.00006577e-01, 1.00015995e-01, 1.00008315e-01,
        1.00012352e-01, 1.00021492e-01, 1.00016300e-01, 1.00015672e-01,
        1.00017088e-01, 1.00004653e-01, 1.00008541e-01, 1.000159

In [22]:
from pdtext.tm import topic_words, topic_pred

topic_words(lda_model, vectorizer, ntokens = 20)



Unnamed: 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20
Topic 1,nations,united,united nations,people,states,charter,member,years,continue,action,problems,make,role,year,delegation,peoples,new,international,work,recent
Topic 2,efforts,government,support,nuclear,president,weapons,delegation,just,recent,republic,non,people,states,continue,make,progress,policy,right,country,conference
Topic 3,world,states,state,year,national,today,process,way,respect,war,make,recent,action,people,peoples,years,political,problems,non,progress
Topic 4,international,order,peoples,conference,situation,region,east,cooperation,end,member,states,new,non,make,just,people,recent,action,war,year
Topic 5,development,africa,south,african,work,years,policy,continue,action,problems,trade,need,recent,peoples,make,states,national,countries,year,people
Topic 6,assembly,general,human,session,rights,general assembly,right,non,peoples,people,political,work,respect,year,action,delegation,president,progress,years,problems
Topic 7,security,council,security council,secretary,global,general,secretary general,action,war,continue,make,problems,year,east,role,process,need,states,efforts,delegation
Topic 8,countries,economic,community,international,political,social,developing,international community,future,need,resources,progress,problems,trade,development,global,make,recent,peoples,non
Topic 9,time,like,great,relations,republic,principles,role,charter,states,delegation,people,international,respect,new,organization,continue,years,peoples,year,policy
Topic 10,peace,country,organization,new,important,hope,security,problems,trade,international,just,progress,people,work,peoples,east,war,years,make,delegation


In [23]:
un_topics = topic_pred(lda_model, 
                       un_word_counts, 
                       vectorizer)

In [24]:
un_topics.groupby(un_df_sample['post_soviet']).mean().T

post_soviet,False,True
nations_united_united nations,0.122089,0.0625
efforts_government_support,0.094391,0.0625
world_states_state,0.119481,0.0625
international_order_peoples,0.103398,0.0625
development_africa_south,0.086565,0.1875
assembly_general_human,0.083406,0.0625
security_council_security council,0.091137,0.0625
countries_economic_community,0.132011,0.062503
time_like_great,0.07689,0.0625
peace_country_organization,0.09063,0.312497


In [25]:
def make_eras(year):
    if year < 1980:
        return 1975
    elif year < 1990:
        return 1985
    elif year < 2000:
        return 1995
    elif year < 2010:
        return 2005
    return 2015
    
un_df_sample['era'] = un_df_sample['speech_year'].apply(make_eras)

In [26]:
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV


In [35]:
pipeline = Pipeline([
                     ('vectorizer' , CountVectorizer(max_df      = .90,
                                                     stop_words   = 'english',
                                                     )),
                     ('lda' , LatentDirichletAllocation()),
                     ('classifier' , LogisticRegression())
                      ])

parameters = {'lda__n_components' : [10, 20, 40],
              'vectorizer__max_df' : [.5, .8, .9 ], 
              'vectorizer__min_df' : [.01, .02, .05], 
             }

In [36]:
grid_search = GridSearchCV(pipeline,
                           parameters,
                           n_jobs = -1,
                           cv = 5,
                           verbose = 2)

In [37]:
grid_search.fit(un_df_sample['token'],
                un_df_sample['era'])

Fitting 5 folds for each of 27 candidates, totalling 135 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  25 tasks      | elapsed:   31.2s
[Parallel(n_jobs=-1)]: Done 135 out of 135 | elapsed:  2.4min finished


GridSearchCV(cv=5, error_score='raise-deprecating',
             estimator=Pipeline(memory=None,
                                steps=[('vectorizer',
                                        CountVectorizer(analyzer='word',
                                                        binary=False,
                                                        decode_error='strict',
                                                        dtype=<class 'numpy.int64'>,
                                                        encoding='utf-8',
                                                        input='content',
                                                        lowercase=True,
                                                        max_df=0.9,
                                                        max_features=None,
                                                        min_df=1,
                                                        ngram_range=(1, 1),
                                      

In [38]:
grid_search.best_estimator_

Pipeline(memory=None,
         steps=[('vectorizer',
                 CountVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.int64'>, encoding='utf-8',
                                 input='content', lowercase=True, max_df=0.5,
                                 max_features=None, min_df=0.02,
                                 ngram_range=(1, 1), preprocessor=None,
                                 stop_words='english', strip_accents=None,
                                 token_pattern='(?u)\\b\\w\\w+\\b',
                                 tokenizer=None,...
                                           perp_tol=0.1, random_state=None,
                                           topic_word_prior=None,
                                           total_samples=1000000.0,
                                           verbose=0)),
                ('classifier',
                 LogisticRegression(C=1.

In [39]:
grid_search.best_params_

{'lda__n_components': 40,
 'vectorizer__max_df': 0.5,
 'vectorizer__min_df': 0.02}