# United Nations Voting Patterns
#### [CMSC320 Final Tutorial]
Authors: Lauren Brown, Angel Lin


**TODO**
* get rid of requests we don't use
* annotate the code to explain what we're doing and why [tutorial!]
* have to explain what broad categories we wanted to analyze and why
    * clean up our doc about what categories we want to use and add it to the github, explain why we are only using certain [un] categories for now (most bang for our buck)
    
* say that we want to do a frequency analysis of which countries abstain/are not present the most so that we can figure out if there are patterns to certain countries at all
* also say that we want to include abstentions/absences in our error analysis but didn't get to it
* add background readings

## **Introduction**

### **United Nations Overview**

The United Nations (UN) is an international organization founded after World War II. The purpose of the UN is to maintain international peace and security, provide humanitarian assistance to those in need, protect human rights, and uphold international laws. It pursues these goals through several bodies, including:
* the General Assembly, the main policy-making body which includes one seat for each recognized member-state. Many committees and sub-committees help to form and inform policies, within their specific jurisdictions, that may be considered by the GA; 
* several councils (the Security Council, the Economic and Social Council, and the Trusteeship Council), which are responsible for international peace and security; economic, social, and environmental challenges; and trust territories, respectively. These bodies each have their own membership processes, but membership and decisions are often biased towards the preferences of the P5: China, France, Russia, the United Kingdom, and the United States;
* the International Court of Justice (ICJ) which settles international legal disputes between countries and gives advice to the UN when requested; and 
* the Secretariat, which consists of staff from all over the world and carries out the day-to-day operations of the UN.

#### **Background Readings**
For more information regarding:
* the founding and history of the United Nations, [click here](https://www.history.com/topics/world-war-ii/united-nations).
* the structure of the United Nations, [click here](https://guides.lib.fsu.edu/c.php?g=946756&p=6852483).

### **Motivation**

This tutorial looks at how countries in the United Nations vote with respect to one another. We ask: as countries rise and fall, how do UN voting patterns shift, if at all? How do voting blocs form and dissipate over time? If we can discover which countries sit at the center of voting blocs, we can determine which countries are best able to influence the outcomes of UN resolutions. 

In this project, we use an unsupervised machine learning algorithm, k-means clustering, to determine how similarly countries vote compared to each other and to visualize how voting blocs have formed and changed over time. We hypothesize that as countries rise and fall from power, clusters will form and dissolve around these countries. The null hypothesis, then, is that there is no relationship between any voting blocs that may form and the rise and fall of countries.
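
To make the clustering idea concrete, here is a minimal sketch of k-means (Lloyd's algorithm) on a toy vote table. The numeric encoding of the vote codes (Y = yes, N = no, A = abstain, NP = not present), the country names, and the fixed starting centroids are all assumptions made for illustration; this is not the tutorial's actual analysis pipeline.

```python
import numpy as np

# Assumed encoding (hypothetical): yes = 1, no = -1, abstain / not present = 0.
VOTE_VALUES = {"Y": 1.0, "N": -1.0, "A": 0.0, "NP": 0.0}


def encode_votes(rows):
    """Turn lists of vote codes into a 2-D float array, one row per country."""
    return np.array([[VOTE_VALUES[v] for v in row] for row in rows])


def kmeans(points, centroids, n_iter=10):
    """Plain Lloyd's algorithm with caller-supplied starting centroids."""
    centroids = centroids.copy()
    for _ in range(n_iter):
        # distance from every point to every centroid, shape (n_points, k)
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for k in range(len(centroids)):
            members = points[labels == k]
            if len(members):  # an empty cluster keeps its previous centroid
                centroids[k] = members.mean(axis=0)
    return labels, centroids


# Hypothetical countries voting on four hypothetical resolutions.
countries = ["A-land", "B-land", "C-land", "D-land"]
votes = encode_votes([
    ["Y", "Y", "N", "Y"],
    ["Y", "Y", "N", "A"],
    ["N", "N", "Y", "N"],
    ["N", "NP", "Y", "N"],
])

# Start from the first and third rows so the result is deterministic.
labels, centers = kmeans(votes, centroids=votes[[0, 2]])
for country, label in zip(countries, labels):
    print(country, "-> cluster", label)
```

Countries that vote alike end up with the same label. Treating abstentions and absences as 0 is only one possible choice, and a different encoding could change the clusters.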

Specifically, the countries we are interested in for this dataset are the United States, the Soviet Union, and China. These three countries have been major world players since the founding of the United Nations, and while we will look for the centers of voting blocs regardless of which countries they turn out to be, these three will be highlighted throughout our data exploration and analysis. 




In the General Assembly, resolutions may be voting or non-voting resolutions. This project concentrates on voting data in the General Assembly. For any given voting resolution, a country may vote Yea, vote Nay, abstain, or not be present and thus not vote. While the meanings of Yea and Nay are fairly straightforward, the reasons behind abstentions and absences are less so. Countries usually abstain for political reasons; for example, a country may abstain from a resolution rebuking another nation because, while it recognizes a wrongdoing, that nation is an ally. It may also abstain from a resolution that would otherwise force it to choose between an ally and a strong trading partner. A country may not be present for a vote for a variety of reasons, from the representative being unwell to intentionally protesting against the vote. There is no real way to generalize the reasons for these two responses, which will come into play later in our dataset.

Finally, countries rise and fall over time, and join or leave the United Nations accordingly. This will show up in our dataset later as missing data for resolutions that were voted on when a country did not exist.
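
As a rough illustration of how these four responses and the missing data might be summarized, the sketch below counts abstentions (A) and absences (NP) per country and divides by the number of resolutions the country was eligible to vote on. The country names, resolution IDs, and votes are made up; only the vote codes and the use of missing values for countries that did not exist follow the dataset described in this tutorial.

```python
import pandas as pd

# Hypothetical vote table: one row per country, one column per resolution.
# None marks a resolution voted on when the country did not exist.
votes = pd.DataFrame(
    {
        "country": ["A-land", "B-land", "C-land"],
        "RES/1": ["Y", None, "A"],
        "RES/2": ["A", "Y", None],
        "RES/3": ["NP", "N", "Y"],
    }
).set_index("country")

# Abstentions and absences both count as "did not cast a yes/no vote".
non_votes = votes.isin(["A", "NP"]).sum(axis=1)

# Only resolutions where the country has any entry count toward the denominator.
eligible = votes.notna().sum(axis=1)

non_vote_rate = non_votes / eligible
print(non_vote_rate.sort_values(ascending=False))
```

Dividing by eligible resolutions rather than by all resolutions keeps newer or dissolved countries from looking artificially decisive.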


### **Required Libraries and Tools**

This project uses the Python 3 language and several packages to collect, explore, visualize, and analyze the data. To reproduce the code, please ensure you have the following packages available. The third-party packages can be installed with the command <code>$ pip3 install [package] </code> in your terminal or command prompt; <code>time</code> and <code>datetime</code> ship with Python's standard library and need no installation. 

* <code>requests</code>: allows us to retrieve website data using Python [(docs)](https://docs.python-requests.org/en/latest/)
* <code>BeautifulSoup</code>: allows us to parse and scrape HTML after retrieving website data; installed as <code>beautifulsoup4</code> and imported from <code>bs4</code> [(docs)](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)
* <code>numpy</code>: supports a wide range of numerical operations and is a dependency of pandas [(docs)](https://numpy.org/)
* <code>pandas</code>: used for data manipulation and analysis [(docs)](https://pandas.pydata.org/)
* <code>time</code>: allows us to create delays in code operations [(docs)](https://docs.python.org/3/library/time.html)
* <code>datetime</code>: used for parsing date strings into objects [(docs)](https://docs.python.org/3/library/datetime.html)
* <code>matplotlib</code>: includes an extraordinary number of data visualization functions [(docs)](https://matplotlib.org/)
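
Before running the notebook, it can help to check which of these modules are importable. The helper below is a small sketch using only the standard library; the function name <code>missing_modules</code> is our own invention, and the list uses import names (so <code>bs4</code> rather than BeautifulSoup).

```python
import importlib.util

# Import names used in this tutorial (BeautifulSoup is imported from `bs4`).
REQUIRED_MODULES = ["requests", "bs4", "numpy", "pandas", "time", "datetime", "matplotlib"]


def missing_modules(names):
    """Return the module names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


print("missing:", missing_modules(REQUIRED_MODULES))
```

An empty list means every import in the next cell should succeed.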


In [2]:
import requests
from bs4 import BeautifulSoup
import numpy as np
import pandas as pd
import time
from datetime import datetime
import matplotlib.pyplot as plt

### **Data Collection** 


* collected the UN's categories of resolutions
* decided which categories to focus on, based on which were related to one another: development and human rights
* sorted the categories manually
* scraped the resolution links, and then the voting data, using <code>requests</code>, <code>BeautifulSoup</code>, etc.
* used a rate limiter (a delay between requests) because our permission to scrape the site was unconfirmed
* saved the collected data to file in case anything was lost during the long scraping session caused by the rate limiter (approx. 3 hours)
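
The last two points can be sketched in a few lines: pause between requests, and write progress to disk even if the run fails partway. This is a simplified stand-in rather than the code used below; the function names, the <code>fetch</code> callback, and the single-column checkpoint file are assumptions made for illustration.

```python
import csv
import time


def load_column(path, column):
    """Read one column of a checkpoint CSV; a missing file means nothing was saved yet."""
    try:
        with open(path, newline="") as f:
            return [row[column] for row in csv.DictReader(f)]
    except FileNotFoundError:
        return []


def save_column(path, column, values):
    """Write a list of values as a one-column CSV with a header row."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow([column])
        writer.writerows([v] for v in values)


def crawl(urls, fetch, checkpoint_path, delay_seconds=5, sleep=time.sleep):
    """Fetch each unvisited URL, pausing between requests and saving progress on exit."""
    visited = load_column(checkpoint_path, "visited")
    results = {}
    try:
        for url in urls:
            if url in visited:
                continue  # already scraped in an earlier session
            results[url] = fetch(url)  # may raise, e.g. if the server blocks us
            visited.append(url)
            sleep(delay_seconds)  # rate limit: wait before the next request
    finally:
        # runs on success and on failure, so progress is never lost
        save_column(checkpoint_path, "visited", visited)
    return results
```

Calling <code>crawl</code> again with the same checkpoint file skips everything already visited, which is what makes a multi-hour scrape resumable.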

In [None]:
category_links_df = pd.read_csv('category-links.csv')

visited_categories = pd.read_csv('visited_cats.csv')['visited categories'].tolist()
resolution_urls = pd.read_csv('res_urls.csv')['res_urls'].tolist()
total_resolutions = 0

category_links_df.head(5)

In [9]:
def get_res_links_from_page(page_link):
    #collect every resolution link in a category, following its pagination
    base_url = 'https://digitallibrary.un.org'
    links = []
    
    response = requests.get(page_link)
    if response.status_code != 200:
        raise Exception("Something went wrong loading category page: " + page_link + ", error code: " + str(response.status_code))

    html = response.text
    soup = BeautifulSoup(html, 'html.parser')
    resolution_cnt = int(soup.find("strong", {"class": None}).text)

    #results are listed 50 per page, so round the page count up
    if resolution_cnt > 50:
        if resolution_cnt % 50 == 0:
            page_count = resolution_cnt // 50
        else:
            page_count = resolution_cnt // 50 + 1
    else:
        page_count = 1
    
    for page_num in range(page_count):

        for div in soup.find_all("div", {"class": "moreinfo"}):
            res_link_suffix = div.find("a")["href"]
            links.append(base_url + res_link_suffix)
        
        time.sleep(5)

        #load the next page if you're not already on the last page (0 indexing)
        if page_num + 1 != page_count:
            next_page_link_suffix = soup.find("span", {"class": "rec-navigation"}).find_all("a")[-1]["href"]
            next_page_link = base_url + next_page_link_suffix

            response = requests.get(next_page_link)
            if response.status_code != 200:
                raise Exception("Something went wrong loading next page, error code: " + str(response.status_code))

            html = response.text
            soup = BeautifulSoup(html, 'html.parser')
            
        
    #sanity check: we should have collected exactly as many links as the site reported
    if resolution_cnt != len(links):
        raise Exception("resolution count does not match")
        
    return links
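
The page count computed in <code>get_res_links_from_page</code> is a ceiling division by the page size of 50, with a minimum of one page. The standalone sketch below shows the same arithmetic in one line; the names <code>page_count</code> and <code>RESULTS_PER_PAGE</code> are ours, not part of the scraper.

```python
import math

RESULTS_PER_PAGE = 50  # page size implied by the scraping code


def page_count(resolution_cnt, per_page=RESULTS_PER_PAGE):
    """Number of result pages, with at least one page even for an empty category."""
    return max(1, math.ceil(resolution_cnt / per_page))


for n in (0, 1, 50, 51, 100, 101):
    print(n, "resolutions ->", page_count(n), "page(s)")
```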

In [10]:
#Make resolution_urls unique
def unique(list1):
    # initialize an empty list
    unique_list = []
     
    # traverse for all elements
    for x in list1:
        # check if exists in unique_list or not
        if x not in unique_list:
            unique_list.append(x)
    return unique_list
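
The loop above checks a list for membership on every element, which gets slow as the number of URLs grows. One possible alternative, shown here as a sketch, relies on the fact that dictionaries keep insertion order (guaranteed since Python 3.7); it assumes the items are hashable, which holds for URL strings. The name <code>unique_ordered</code> is hypothetical.

```python
def unique_ordered(items):
    """Drop duplicates while keeping the first occurrence of each item, in order."""
    return list(dict.fromkeys(items))


urls = ["/record/1", "/record/2", "/record/1", "/record/3", "/record/2"]
print(unique_ordered(urls))
```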

In [11]:
#visit each category we haven't scraped yet and collect its resolution links
for cat, link in zip(category_links_df["un category"], category_links_df["link"]):
    if cat not in visited_categories:
        try:
            cat_res_links = get_res_links_from_page(link) #grab the links, may throw an exception
            resolution_urls = resolution_urls + cat_res_links #append the new links list to the bigger old links list
            visited_categories.append(cat)
        except Exception as e:
            #report the actual error; being blocked by the UN site is only one possible cause
            print("could not scrape category " + str(cat) + ": " + str(e))
        finally:
            time.sleep(5) #delay to hopefully prevent the un from detecting and blocking us

unq_res_urls = unique(resolution_urls)
res_urls_df = pd.DataFrame(unq_res_urls, columns = ['res_urls'])
res_urls_df.to_csv('res_urls.csv', index=False)
visited_cats = pd.DataFrame(visited_categories, columns = ['visited categories'])
visited_cats.to_csv('visited_cats.csv', index=False)

In [25]:
# in the all voting data df, the first column ('country') is the country and each following column is a resolution id. 
# each cell is that country's vote (Y if yes, N if no, A if abstain, NP if not present, or NaN if the country didn't exist to vote at the time)
all_voting_data = pd.read_csv('all_voting_data.csv')

# one row per resolution; columns are the resolution id, the resolution name, and the year it was voted on
all_res_data = pd.read_csv('all_res_data.csv')

resolution_urls = pd.read_csv('res_urls.csv')['res_urls'].tolist()
visited_res_urls = pd.read_csv('visited_res_urls.csv')['visited_res_urls'].tolist()

### **Data Processing**  

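* stored the data in 2 data frames
* <code>all_res_data</code> has one row per resolution, holding its resolution ID, name, and the year it was voted on
* <code>all_voting_data</code> has one row per country and one column per resolution ID; each cell holds that country's vote (Y, N, A, NP, or NaN if the country did not exist at the time)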

In [43]:
def process_resolution(res_url):
    res_page = requests.get(res_url)
    if res_page.status_code != 200:
        raise Exception("Something went wrong loading resolution, error code: " + str(res_page.status_code))
    
    html = res_page.text
    soup = BeautifulSoup(html, 'html.parser')

    metadata = soup.find("div", {"id" : "details-collapse"})

    #checking to make sure the vote was recorded
    # we only care about recorded votes since they allow us to track how countries change their views over time
    row_content_meta = metadata.find_all("span", {"class" : "value col-xs-12 col-sm-9 col-md-10"})
    is_recorded_vote = all('NON-RECORDED' not in row.get_text() for row in row_content_meta)

    if is_recorded_vote:

        rows = metadata.find_all("div", {"class" : "metadata-row"})

        title = ""
        res_id = ""
        date = ""
        vote_table = ""

        for row in rows:
            row_title = row.find("span", {"class" : "title col-xs-12 col-sm-3 col-md-2"}).text
            row_value = row.find("span", {"class" : "value col-xs-12 col-sm-9 col-md-10"})

            #strip newline chars from the string
            row_title = row_title.strip()

            #get the information we want from the html
            if row_title == 'Title':
                title = row_value.text
            elif row_title == 'Resolution':
                res_id = row_value.text
            elif row_title == 'Vote date':
                date = row_value.text
                dt = datetime.strptime(date, "%Y-%m-%d")
            elif row_title == 'Vote':
                vote_table = row_value

        # some resolutions don't have the full date but the year exists elsewhere on the page, so extract that in those instances
        if date == "":
            year = soup.find("div", {"class" : "one-row-metadata value"}).text.strip()
            dt = datetime.strptime(year, "%Y")
            
        #get resolution metadata minus voting data, append it as a row to the df of all resolution metadata     
        res_data = pd.DataFrame({'Resolution ID': [res_id],
                        'Resolution Name' : [title],
                        'Year' : [dt.year]})  

        #get vote information into a dataframe 
        vts = str(vote_table)

        for i in range(len(vts)):
            if vts[i] in ['Y', 'N', 'A']:
                vts = vts[:i]+'<br/> ' + vts[i:]
                break
            elif vts[i:i+2] == '> ': #if the first country in the list was absent and didn't vote, string should look like this
                vts = vts[:i]+'<br/> ' + vts[i:]
                break
        
        vts = vts.replace("<br>", "<br/>")

        #vts is a string but to parse it using beautifulsoup we want it as soup
        vote_table = BeautifulSoup(vts, 'html.parser') 
        #print(str(vote_table))
        
        res_voting_data = pd.DataFrame(columns = ['country', res_id])
        
        for br in vote_table.find_all('br'):
            next_s = br.next_sibling
            
            if next_s[0:2].strip() in ['Y', 'N', 'A']:
                vote = next_s[0:2].strip()
                country = next_s[3:].strip()
            else:
                vote = 'NP'
                country = next_s.strip()

            res_voting_data.loc[len(res_voting_data.index)] = [country, vote] 

        return (res_data, res_voting_data)
    return (None, None) #the vote was not recorded, so there is no data to add
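
The slicing used to split each line of the vote table can also be written as a small parser. The sketch below is an alternative under an assumed input format: each line is either a single vote letter (Y, N, or A), whitespace, and the country name, or just the country name when the country was not present. The real page markup may differ, and the names <code>VOTE_LINE</code> and <code>parse_vote_line</code> are hypothetical.

```python
import re

# A vote letter, at least one space, then the country name (assumed line format).
VOTE_LINE = re.compile(r"^\s*([YNA])\s+(\S.*)$")


def parse_vote_line(line):
    """Return (country, vote); lines without a vote letter are treated as 'NP'."""
    match = VOTE_LINE.match(line)
    if match:
        return match.group(2).strip(), match.group(1)
    return line.strip(), "NP"


for raw in [" Y AFGHANISTAN", "  ALBANIA", " A UNITED STATES", " N ZIMBABWE "]:
    print(parse_vote_line(raw))
```

Requiring whitespace after the vote letter keeps an absent country whose name starts with Y, N, or A (such as ALBANIA) from being read as a vote.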

In [None]:
#Get vote data from each resolution

#resolution df (x= resolution index, y = name, year)
#country_votes df (row = country, col= resolution index) row,col = vote
try:
    for res_url in resolution_urls:
        # print(res_url)
        try:
            if res_url not in visited_res_urls:
                res_data, res_voting_data = process_resolution(res_url)
                #print(res_data)
                #print(res_voting_data)

                #once you have the data for the resolution, add it to the larger dfs of all the data (voting and otherwise) 
                #for all resolutions
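                #process_resolution only returns dataframes for recorded votes, so skip anything else
                if isinstance(res_data, pd.DataFrame) and isinstance(res_voting_data, pd.DataFrame):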
                    
                    #add as a row to the end of the df
                    all_res_data = pd.concat([all_res_data, res_data], ignore_index = True, axis = 0)
                    
                    #add as a column -> outer merge to make sure countries join correctly
                    all_voting_data = all_voting_data.merge(res_voting_data, how='outer', on='country') 

                visited_res_urls.append(res_url)
                time.sleep(5) 
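        except Exception as e:
            #report which resolution failed and why, then stop so progress gets saved below
            print("error processing " + str(res_url) + ": " + str(e))
            break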

finally:
    # save everything
    all_res_data.to_csv('all_res_data.csv', index=False)
    all_voting_data.to_csv('all_voting_data.csv', index=False)
    
    visited_res_url_df = pd.DataFrame(visited_res_urls, columns = ['visited_res_urls'])
    visited_res_url_df.to_csv('visited_res_urls.csv', index=False)
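
The outer merge in the loop above is what lets countries appear and disappear over time without losing rows. A toy example, with made-up countries and resolution IDs, shows the effect: a country missing from either side is kept, and its vote for the other resolution becomes NaN.

```python
import pandas as pd

# Votes collected so far (hypothetical data).
all_votes = pd.DataFrame({"country": ["A-land", "B-land"], "RES/1": ["Y", "N"]})

# Votes for a newly scraped resolution; C-land is new and A-land is gone.
new_votes = pd.DataFrame({"country": ["B-land", "C-land"], "RES/2": ["A", "Y"]})

# how='outer' keeps the union of countries from both frames.
merged = all_votes.merge(new_votes, how="outer", on="country").set_index("country")
print(merged)
```

With an inner merge, only B-land would survive, so every new resolution would shrink the table to the countries present for all votes.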

### **Data Visualization/Representation**  


### **Exploratory Data Analysis**  


### **Hypothesis Testing**  


### **Communication of insights attained**  
