## UN Security Council Speeches `9 points`

Source: https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/KGVSYH

Description from [Data Is Plural](https://www.data-is-plural.com/archive/2019-07-17-edition/):

> Two decades of UN Security Council debates. A group of researchers have collected, parsed, and added metadata to all UN Security Council debates from 1995 through 2017. The dataset includes more than 65,000 speeches (with information about each speaker), extracted from nearly 5,000 meeting transcripts.

**Topics:**

* Reading in many files
* Extracting content from strings (regex, maybe)
* K-means clustering

## Opening the dataset `2 points`

You're interested in the `.tar` file. It extracts just like a `.zip` file and creates a folder containing many files.
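If you would rather extract the archive from Python than from a file manager, the standard-library `tarfile` module can do it. This is a minimal sketch: the helper name `extract_archive` and the archive name `speeches.tar` are assumptions for illustration, so substitute whatever the downloaded file is actually called.

```python
import tarfile
from pathlib import Path

def extract_archive(archive_path, dest_dir="."):
    """Extract a tar archive into dest_dir and return the .txt files found there."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    # The default read mode detects compression (plain, gzip, bz2, xz) automatically
    with tarfile.open(archive_path) as archive:
        archive.extractall(dest)
    return sorted(dest.rglob("*.txt"))

# Hypothetical usage; "speeches.tar" is an assumed filename:
# paths = extract_archive("speeches.tar")
# len(paths)
```

Only extract archives from sources you trust, since a tar file can contain paths that point outside the destination folder.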

In [18]:
import pandas as pd
import numpy as np
import glob

### How many speeches does this dataset have?

In [4]:
filenames = glob.glob('*/*.txt')
len(filenames)

82165

### Put the speeches into a dataframe

In [30]:
def extract_text(path):
    # Read one speech and flatten it onto a single line
    with open(path, "r") as myfile:
        text = myfile.read().replace('\n', ' ').strip()
    return text

contents = [extract_text(filename) for filename in filenames]

In [50]:
df1 = pd.DataFrame({
    'filename': filenames,
    'speech_content': contents
})

df1.filename = df1.filename.replace('speeches/UNSC_','', regex=True)
# Escape the dot and anchor at the end so only a literal ".txt" extension is removed
df1.filename = df1.filename.replace(r'\.txt$', '', regex=True)

df1.head()

Unnamed: 0,filename,speech_content
0,2013_SPV.6977_spch017,"Mr. Quinlan (Australia): Thank you, Mr. Presid..."
1,2019_SPV.8456_spch004,Mr. Faki Mahamat (spoke in French): I am pleas...
2,2014_SPV.7281_spch008,"Ms. Power (United States of America): I, too, ..."
3,2015_SPV.7508_spch007,The President: I thank Ms. Nakamitsu for her b...
4,2003_SPV.4865_spch018,Mr. Pujalte (Mexico) (spoke in Spanish): On be...


### How many speeches are from each year? `1 point`

You'll want to create a new column.

In [60]:
df1['year'] = df1['filename'].str[:4].astype(int)
df1.year.value_counts()

2019    6168
2018    6160
2017    5411
2016    5005
2015    4790
2014    4769
2020    4308
2011    3198
2013    3104
2009    3088
2012    3057
2010    3036
2002    3026
2003    3023
2008    2997
2004    2819
2001    2655
2000    2571
2006    2549
2007    2071
2005    1879
1999    1613
1996    1394
1995    1374
1998    1187
1997     913
Name: year, dtype: int64
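The year above comes from slicing the first four characters by position. A regex can pull out the same value by pattern instead, along with the other parts of the name. This is only a sketch: the pattern is inferred from filenames such as `2013_SPV.6977_spch017` shown earlier, and `FILENAME_RE` and `parse_filename` are hypothetical names.

```python
import re

# Pattern assumed from the filenames shown earlier, e.g. "2013_SPV.6977_spch017"
FILENAME_RE = re.compile(
    r"^(?P<year>\d{4})_(?P<meeting>SPV\.[^_]+)_spch(?P<speech>\d+)$"
)

def parse_filename(name):
    """Split a cleaned filename into year, meeting basename and speech number."""
    match = FILENAME_RE.match(name)
    if match is None:
        return None  # the name does not follow the assumed pattern
    return {
        "year": int(match["year"]),
        "basename": f"{match['year']}_{match['meeting']}",
        "speech_number": int(match["speech"]),
    }

print(parse_filename("2013_SPV.6977_spch017"))
# {'year': 2013, 'basename': '2013_SPV.6977', 'speech_number': 17}
```

In pandas, `df1['filename'].str.extract(FILENAME_RE.pattern)` should return one column per named group. A pattern-based basename would also cope with meeting identifiers of a different length, if the dataset contains any.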

## Speech topics `2 points`

### Join with `meta.tsv` to see the topic of each speech

You'll need to massage the filename a lot.

In [64]:
df2 = pd.read_csv('meta.tsv', sep='\t')
df2.basename = df2.basename.replace('UNSC_','', regex=True)
df2.head()

Unnamed: 0,basename,date,num_speeches,topic,pressrelease,outcome,year,month,day
0,1995_SPV.3486,6 January 1995,1,Bosnia and Herzegovina,,http://www.un.org/en/ga/search/view_doc.asp?sy...,1995,1,6
1,1995_SPV.3487,12 January 1995,40,Federal Republic of Yugoslavia (Serbia and Mon...,,http://www.un.org/en/ga/search/view_doc.asp?sy...,1995,1,12
2,1995_SPV.3488,12 January 1995,12,Georgia,,http://www.un.org/en/ga/search/view_doc.asp?sy...,1995,1,12
3,1995_SPV.3489,13 January 1995,16,Liberia,,http://www.un.org/en/ga/search/view_doc.asp?sy...,1995,1,13
4,1995_SPV.3490,13 January 1995,1,Western Sahara,,http://www.un.org/en/ga/search/view_doc.asp?sy...,1995,1,13


In [65]:
df1['basename'] = df1['filename'].str[:13]
df1.head()

Unnamed: 0,filename,speech_content,year,basename
0,2013_SPV.6977_spch017,"Mr. Quinlan (Australia): Thank you, Mr. Presid...",2013,2013_SPV.6977
1,2019_SPV.8456_spch004,Mr. Faki Mahamat (spoke in French): I am pleas...,2019,2019_SPV.8456
2,2014_SPV.7281_spch008,"Ms. Power (United States of America): I, too, ...",2014,2014_SPV.7281
3,2015_SPV.7508_spch007,The President: I thank Ms. Nakamitsu for her b...,2015,2015_SPV.7508
4,2003_SPV.4865_spch018,Mr. Pujalte (Mexico) (spoke in Spanish): On be...,2003,2003_SPV.4865


In [72]:
df3 = pd.merge(df1,df2, on = 'basename')
df3 = df3.drop(columns=['year_x'])
df3.head()

Unnamed: 0,filename,speech_content,basename,date,num_speeches,topic,pressrelease,outcome,year_y,month,day
0,2013_SPV.6977_spch017,"Mr. Quinlan (Australia): Thank you, Mr. Presid...",2013_SPV.6977,12 June 2013,36,International Tribunal - Yugoslavia & Rwanda,http://www.un.org/press/en/2013/sc11031.doc.htm,,2013,6,12
1,2013_SPV.6977_spch003,The President: I thank Judge Meron for his bri...,2013_SPV.6977,12 June 2013,36,International Tribunal - Yugoslavia & Rwanda,http://www.un.org/press/en/2013/sc11031.doc.htm,,2013,6,12
2,2013_SPV.6977_spch002,Judge Meron: It is an honour for me to appear ...,2013_SPV.6977,12 June 2013,36,International Tribunal - Yugoslavia & Rwanda,http://www.un.org/press/en/2013/sc11031.doc.htm,,2013,6,12
3,2013_SPV.6977_spch016,Mr. Briens (France) (spoke in French): I would...,2013_SPV.6977,12 June 2013,36,International Tribunal - Yugoslavia & Rwanda,http://www.un.org/press/en/2013/sc11031.doc.htm,,2013,6,12
4,2013_SPV.6977_spch014,Mr. Masood Khan (Pakistan): I thank Judge Mero...,2013_SPV.6977,12 June 2013,36,International Tribunal - Yugoslavia & Rwanda,http://www.un.org/press/en/2013/sc11031.doc.htm,,2013,6,12
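An inner merge silently drops any speech whose `basename` has no match in `meta.tsv`. One way to check for that, sketched here on toy data, is a left merge with pandas' `indicator` and `validate` options. The two small frames are stand-ins for `df1` and `df2`, and the `1995_SPV.9999` row is invented to show an unmatched speech.

```python
import pandas as pd

# Toy stand-ins for df1 (speeches) and df2 (meeting metadata)
speeches = pd.DataFrame({
    "filename": ["1995_SPV.3486_spch001", "1995_SPV.3488_spch001",
                 "1995_SPV.9999_spch001"],  # last row is invented
    "basename": ["1995_SPV.3486", "1995_SPV.3488", "1995_SPV.9999"],
})
meta = pd.DataFrame({
    "basename": ["1995_SPV.3486", "1995_SPV.3488"],
    "topic": ["Bosnia and Herzegovina", "Georgia"],
})

# how="left" keeps every speech; validate raises if a basename repeats in meta;
# indicator adds a "_merge" column saying where each row came from
merged = pd.merge(speeches, meta, on="basename", how="left",
                  validate="many_to_one", indicator=True)
unmatched = merged[merged["_merge"] == "left_only"]
print(len(merged), len(unmatched))  # 3 1
```

If `unmatched` is empty on the real data, the inner merge used above loses nothing.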


### What are the most common speech topics?

In [75]:
df3.topic.value_counts(ascending = False).head(10)

Maintenance of international peace and security                         4907
Women and peace and security                                            3854
Middle East situation, including the Palestinian question               3778
The situation in the Middle East                                        3632
The situation in the Middle East, including the Palestinian question    3440
Children and armed conflict                                             2415
Protection of civilians in armed conflict                               2061
Afghanistan                                                             1734
Reports of the Secretary-General on the Sudan and South Sudan           1468
Iraq-Kuwait                                                             1378
Name: topic, dtype: int64

### Do you find these classifications useful? Why or why not?

Not really. They are broad and unspecific, and many topics recur under slightly different names: the Middle East alone appears three times in the top ten.
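To see how much the near-duplicate names matter, the variants can be collapsed with a few keyword rules before counting. This is a rough sketch, not part of the assignment: `TOPIC_RULES` and `normalize_topic` are hypothetical, and whether these three topics really belong together is a judgment call. The counts are copied from the `value_counts` output above.

```python
import re
from collections import Counter

# Hypothetical keyword rules: each regex maps matching topics to one broader label
TOPIC_RULES = [
    (re.compile(r"middle east", re.IGNORECASE), "Middle East"),
]

def normalize_topic(topic):
    """Return the broader label for a topic, or the topic itself if no rule matches."""
    for pattern, label in TOPIC_RULES:
        if pattern.search(topic):
            return label
    return topic

# Counts copied from the value_counts output above
counts = {
    "Middle East situation, including the Palestinian question": 3778,
    "The situation in the Middle East": 3632,
    "The situation in the Middle East, including the Palestinian question": 3440,
    "Afghanistan": 1734,
}

combined = Counter()
for topic, n in counts.items():
    combined[normalize_topic(topic)] += n

print(combined.most_common(2))
# [('Middle East', 10850), ('Afghanistan', 1734)]
```

On the full dataframe the same function could be applied with `df3.topic.map(normalize_topic)`.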

## Automatic organization `4 points`

Using k-means clustering, try to organize the speeches into 5 to 10 groups. Play with hyperparameters like `max_df` and the stopword list to try to improve on the existing speech `topic` column.
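It may help to see what `max_df` actually filters. When given as a float, it tells the vectorizer to ignore terms whose document frequency (the share of documents containing the term) is strictly higher than the threshold. The sketch below reproduces just that filtering step in plain Python on an invented four-document corpus. It splits on whitespace, which is simpler than scikit-learn's default tokenizer, and the function names are hypothetical.

```python
from collections import Counter

def document_frequencies(docs):
    """Count, for each term, how many documents contain it at least once."""
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc.lower().split()))
    return doc_freq

def filter_by_max_df(docs, max_df):
    """Keep terms that appear in at most a max_df share of the documents."""
    n_docs = len(docs)
    doc_freq = document_frequencies(docs)
    return sorted(t for t, count in doc_freq.items() if count / n_docs <= max_df)

# Invented toy corpus
docs = [
    "the council discussed afghanistan",
    "the council discussed sudan",
    "the council discussed haiti",
    "the council heard sudan",
]

print(filter_by_max_df(docs, max_df=0.5))
# ['afghanistan', 'haiti', 'heard', 'sudan']
```

Here "the", "council" and "discussed" appear in more than half of the documents and are dropped, which is the effect you want for words that show up in nearly every speech.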

In [161]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction import text

add_words = ['security','council','dated','peace']
# Convert to a list: recent scikit-learn versions reject a frozenset for stop_words
my_stop_words = list(text.ENGLISH_STOP_WORDS.union(add_words))


vectorizer = TfidfVectorizer(ngram_range=(1, 5), stop_words = my_stop_words, max_df = 0.05)

matrix = vectorizer.fit_transform(df3.topic)
words_df = pd.DataFrame(matrix.toarray(),
                        columns=vectorizer.get_feature_names_out(),
                        index=df3.index)

words_df.head()

Unnamed: 0,11,11 september,11 september 2001,11 september 2001 acts,11 september 2001 acts international,1160,1160 1998,1160 1998 1199,1160 1998 1199 1998,1160 1998 1199 1998 1203,...,yugoslavia rwanda,yugoslavia serbia,yugoslavia serbia montenegro,yugoslavia serbia montenegro sanctions,yugoslavia termination,yugoslavia termination sanctions,zaire,zimbabwe,zone,zone treaty
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.388017,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.388017,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.388017,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.388017,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.388017,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [159]:
from sklearn.cluster import KMeans

number_of_clusters = 10
km = KMeans(n_clusters=number_of_clusters)

km.fit(matrix)

print("Top terms per cluster:")
order_centroids = km.cluster_centers_.argsort()[:, ::-1]
terms = vectorizer.get_feature_names_out()
for i in range(number_of_clusters):
    # Take the 8 highest-weighted terms of each cluster centroid
    top_terms = [terms[ind] for ind in order_centroids[i, :8]]
    print("Cluster {}: {}".format(i, ' '.join(top_terms)))

Top terms per cluster:
Cluster 0: afghanistan sudan iraq haiti timor republic situation afghanistan angola
Cluster 1: maintenance maintenance international zone treaty discussion work month august democratic poeple republic korea democratic republic democratic republic congo destruction
Cluster 2: situation middle east situation middle middle east including palestinian east including palestinian question situation middle east including middle east including palestinian question middle east including situation middle east including palestinian
Cluster 3: women zone treaty disarmament discussion work month august discussion work month april discussion work month discussion work discussion
Cluster 4: east situation middle east situation east situation including middle east situation including east situation including palestinian question east situation including palestinian middle east situation including palestinian situation including palestinian question
Cluster 5: africa west africa w