# Case Study: Exploratory Data Analysis in R

Once you've started learning tools for data manipulation and visualization like dplyr and ggplot2, this course gives you a chance to use them in action on a real dataset. You'll explore the historical voting of the United Nations General Assembly, including analyzing differences in voting between countries, across time, and among international issues. In the process you'll gain more practice with the dplyr and ggplot2 packages, learn about the broom package for tidying model output, and experience the kind of start-to-finish exploratory analysis common in data science.

## Data cleaning and summarizing with dplyr

The best way to learn data wrangling skills is to apply them to a specific case study. Here you'll learn how to clean and filter the United Nations voting dataset using the dplyr package, and how to summarize it into smaller, interpretable units.

### Filtering rows
The vote column in the dataset has a number that represents that country's vote:

1 = Yes

2 = Abstain

3 = No

8 = Not present

9 = Not a member

One step of data cleaning is removing observations (rows) that you're not interested in. In this case, you want to remove "Not present" and "Not a member".

In [5]:
# read data
votes <- readRDS("votes.rds")

# inspect data
str(votes)
head(votes)

# Load the dplyr package
library(dplyr)

# Filter for votes that are "yes", "abstain", or "no"
votes %>%
    filter(vote <= 3)

Classes 'tbl_df', 'tbl' and 'data.frame':	508929 obs. of  4 variables:
 $ rcid   : num  46 46 46 46 46 46 46 46 46 46 ...
  ..- attr(*, "comment")= chr ""
 $ session: num  2 2 2 2 2 2 2 2 2 2 ...
  ..- attr(*, "comment")= chr ""
 $ vote   : num  1 1 9 1 1 1 9 9 9 9 ...
  ..- attr(*, "comment")= chr ""
 $ ccode  : int  2 20 31 40 41 42 51 52 53 54 ...
  ..- attr(*, "comment")= chr ""
 - attr(*, "var.type")= int  2 2 2 2
 - attr(*, "Rsafe2raw")= list()
 - attr(*, "var.labels")= chr  "" "" "" ""
 - attr(*, "val.table")= list()
 - attr(*, "missval.table")= list()
 - attr(*, "val.list")= logi NA
 - attr(*, "missval.list")= logi NA
 - attr(*, "orig.names")= chr  "" "" "" ""


rcid,session,vote,ccode
46,2,1,2
46,2,1,20
46,2,9,31
46,2,1,40
46,2,1,41
46,2,1,42


"package 'dplyr' was built under R version 3.6.3"
Attaching package: 'dplyr'

The following objects are masked from 'package:stats':

    filter, lag

The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union



rcid,session,vote,ccode
46,2,1,2
46,2,1,20
46,2,1,40
46,2,1,41
46,2,1,42
46,2,1,70
46,2,1,90
46,2,1,91
46,2,1,92
46,2,1,93
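
The `vote <= 3` condition works because the three codes of interest happen to be the smallest ones. A minimal sketch of a more explicit alternative, assuming a small made-up data frame (`toy_votes` is hypothetical, not the real dataset), lists the wanted codes with `%in%`:

```r
library(dplyr)

# Hypothetical toy data using the same vote codes as the real dataset
toy_votes <- data.frame(
  rcid  = c(46, 46, 46, 46, 46),
  vote  = c(1, 9, 2, 8, 3),
  ccode = c(2, 31, 20, 40, 41)
)

# Keep only "yes" (1), "abstain" (2), and "no" (3)
kept <- toy_votes %>%
  filter(vote %in% c(1, 2, 3))

# The numeric comparison used in the exercise selects the same rows
kept_numeric <- toy_votes %>%
  filter(vote <= 3)

print(kept)
```

Both versions return the same rows here; the `%in%` form simply states the intent more directly and would keep working if a new code below 3 were ever introduced.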


### Adding a year column
The next step of data cleaning is manipulating your variables (columns) to make them more informative.

In this case, you have a session column that is hard to interpret intuitively. But since the UN started voting in 1946, and holds one session per year, you can get the year of a UN resolution by adding 1945 to the session number.

In [6]:
# Add another %>% step to add a year column
votes %>%
  filter(vote <= 3) %>%
  mutate(year = session + 1945)

rcid,session,vote,ccode,year
46,2,1,2,1947
46,2,1,20,1947
46,2,1,40,1947
46,2,1,41,1947
46,2,1,42,1947
46,2,1,70,1947
46,2,1,90,1947
46,2,1,91,1947
46,2,1,92,1947
46,2,1,93,1947
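
As a quick check of the session-to-year arithmetic, the sketch below applies the same `mutate()` step to a few hand-picked session numbers (the `sessions` data frame is made up for illustration):

```r
library(dplyr)

# Hypothetical session numbers, including the first session
sessions <- data.frame(session = c(1, 2, 34))

# Session 1 took place in 1946, so year = session + 1945
with_year <- sessions %>%
  mutate(year = session + 1945)

print(with_year)
```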


### Adding a country column
The country codes in the ccode column are what's called Correlates of War codes. This isn't ideal for an analysis, since you'd like to work with recognizable country names.

You can use the countrycode package to translate. For example:

library(countrycode)

# Translate the country code 2
countrycode(2, "cown", "country.name")
[1] "United States"

# Translate multiple country codes
countrycode(c(2, 20, 40), "cown", "country.name")
[1] "United States" "Canada"        "Cuba"

In [8]:
# Load the countrycode package
# install.packages("countrycode")
library(countrycode)

# Convert country code 100
countrycode(100, "cown", "country.name")

# Add a country column within the mutate: votes_processed
votes_processed <- votes %>%
  filter(vote <= 3) %>%
  mutate(year = session + 1945) %>%
  mutate(country = countrycode(ccode, "cown", "country.name"))


Warning message:
Problem with `mutate()` input `country`.
i Some values were not matched unambiguously: 260

The warning means that countrycode() could not unambiguously match code 260 to a country name, so rows with that code get NA in the country column.

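To see how many rows an unmatched country code affects, one option is to count the rows whose country is `NA`. The sketch below uses a made-up stand-in for `votes_processed` (`toy_processed` and the `n_rows` column name are hypothetical):

```r
library(dplyr)

# Hypothetical stand-in for votes_processed after the countrycode() step
toy_processed <- data.frame(
  ccode   = c(2, 20, 260, 260, 40),
  country = c("United States", "Canada", NA, NA, "Cuba"),
  stringsAsFactors = FALSE
)

# Count the rows affected by each unmatched code
unmatched <- toy_processed %>%
  filter(is.na(country)) %>%
  count(ccode, name = "n_rows")

print(unmatched)
```

If you know what an unmatched code stands for, `countrycode()` also accepts a `custom_match` argument, a named vector whose names are source codes and whose values are the names to use for them.
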
### Summarizing the full dataset
In this analysis, you're going to focus on "% of votes that are yes" as a metric for the "agreeableness" of countries.

You'll start by finding this summary for the entire dataset: the fraction of all votes in the dataset that were "yes". Note that within your call to summarize(), you can use n() to find the total number of votes and mean(vote == 1) to find the fraction of "yes" votes.

In [10]:
# Print the first rows of votes_processed
head(votes_processed)

# Find total and fraction of "yes" votes
votes_processed %>%
  summarize(total = n(), percent_yes = mean(vote == 1))

rcid,session,vote,ccode,year,country
46,2,1,2,1947,United States
46,2,1,20,1947,Canada
46,2,1,40,1947,Cuba
46,2,1,41,1947,Haiti
46,2,1,42,1947,Dominican Republic
46,2,1,70,1947,Mexico


total,percent_yes
353547,0.7999248
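
The reason `mean(vote == 1)` gives a fraction is that `vote == 1` produces a logical vector, and R treats `TRUE` as 1 and `FALSE` as 0 when averaging. A minimal sketch on five made-up votes (`toy_votes` is hypothetical):

```r
library(dplyr)

# Hypothetical votes: three "yes", one "abstain", one "no"
toy_votes <- data.frame(vote = c(1, 1, 2, 3, 1))

# The comparison yields TRUE for each "yes" vote
is_yes <- toy_votes$vote == 1
print(is_yes)

# Averaging the logical vector gives the share of TRUE values
overall <- toy_votes %>%
  summarize(total = n(), percent_yes = mean(vote == 1))

print(overall)
```

With three "yes" votes out of five, `percent_yes` comes out as 3/5.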


### Summarizing by year
The summarize() function is especially useful because it can be used within groups.

For example, you might like to know how much the average "agreeableness" of countries changed from year to year. To examine this, you can use group_by() to perform your summary not for the entire dataset, but within each year.

In [12]:
# Summarize by year
votes_processed %>%
  group_by(year) %>%
  summarize(total = n(),
            percent_yes = mean(vote == 1))

`summarise()` ungrouping output (override with `.groups` argument)


year,total,percent_yes
1947,2039,0.5693968
1949,3469,0.4375901
1951,1434,0.5850767
1953,1537,0.6317502
1955,2169,0.6947902
1957,2708,0.6085672
1959,4326,0.5880721
1961,7482,0.5729751
1963,3308,0.7294438
1965,4382,0.7078959
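
The message about ungrouping output is informational. As a sketch, assuming a dplyr version that supports the `.groups` argument named in that message, you can state the grouping of the result explicitly, which also silences the message. The data frame `toy_votes` is made up for illustration:

```r
library(dplyr)

# Hypothetical votes spread over two years
toy_votes <- data.frame(
  year = c(1947, 1947, 1947, 1949, 1949),
  vote = c(1, 1, 3, 2, 1)
)

# .groups = "drop" returns an ungrouped result without the message
by_year_toy <- toy_votes %>%
  group_by(year) %>%
  summarize(total = n(),
            percent_yes = mean(vote == 1),
            .groups = "drop")

print(by_year_toy)
```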


### Summarizing by country
In the last exercise, you performed a summary of the votes within each year. You could instead summarize() within each country, which would let you compare voting patterns between countries.

In [14]:
# Summarize by country: by_country
by_country <- votes_processed %>%
  group_by(country) %>%
  summarize(total = n(),
            percent_yes = mean(vote == 1))

`summarise()` ungrouping output (override with `.groups` argument)


### Sorting by percentage of "yes" votes
Now that you've summarized the dataset by country, you can start examining it and answering interesting questions.

For example, you might be especially interested in the countries that voted "yes" least often, or the ones that voted "yes" most often.

In [15]:
# Sort in ascending order of percent_yes
by_country %>%
  arrange(percent_yes)

# Now sort in descending order

by_country %>%
  arrange(desc(percent_yes))

country,total,percent_yes
Zanzibar,2,0.0000000
United States,2568,0.2694704
Palau,369,0.3387534
Israel,2380,0.3407563
NA,1075,0.3972093
United Kingdom,2558,0.4167318
France,2527,0.4265928
Micronesia (Federated States of),724,0.4419890
Marshall Islands,757,0.4914135
Belgium,2568,0.4922118


country,total,percent_yes
São Tomé & Príncipe,1091,0.9761687
Seychelles,881,0.9750284
Djibouti,1598,0.9612015
Guinea-Bissau,1538,0.9603381
Timor-Leste,326,0.9570552
Mauritius,1831,0.9497542
Zimbabwe,1361,0.9493020
Comoros,1133,0.9470432
United Arab Emirates,1934,0.9467425
Mozambique,1701,0.9465021
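
If you only need the extremes rather than the full sorted table, a possible alternative to `arrange()` is `slice_min()` and `slice_max()`, assuming a dplyr version that provides them (1.0.0 or later). The summary table `toy_by_country` below is made up for illustration:

```r
library(dplyr)

# Hypothetical per-country summary
toy_by_country <- data.frame(
  country     = c("A", "B", "C", "D"),
  total       = c(200, 2, 150, 300),
  percent_yes = c(0.90, 0.00, 0.45, 0.70),
  stringsAsFactors = FALSE
)

# The two countries with the lowest share of "yes" votes
lowest_two <- toy_by_country %>%
  slice_min(percent_yes, n = 2)

# The two countries with the highest share of "yes" votes
highest_two <- toy_by_country %>%
  slice_max(percent_yes, n = 2)

print(lowest_two)
print(highest_two)
```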


### Filtering summarized output
In the last exercise, you may have noticed that the country that voted "yes" least frequently, Zanzibar, had only 2 votes in the entire dataset. You certainly can't draw any substantial conclusions from that little data!

Typically in a progressive analysis, when you find that a few of your observations have very little data while others have plenty, you set some threshold to filter them out.

In [17]:
# Filter out countries with fewer than 100 votes
by_country %>%
  arrange(percent_yes) %>%
  filter(total >= 100)

country,total,percent_yes
United States,2568,0.2694704
Palau,369,0.3387534
Israel,2380,0.3407563
NA,1075,0.3972093
United Kingdom,2558,0.4167318
France,2527,0.4265928
Micronesia (Federated States of),724,0.4419890
Marshall Islands,757,0.4914135
Belgium,2568,0.4922118
Canada,2576,0.5081522
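
A threshold like this has an edge case: whether a country with exactly 100 votes is kept depends on whether you compare with `>` or `>=`. The sketch below shows the difference on a made-up summary table (`toy_by_country` and `min_votes` are hypothetical names):

```r
library(dplyr)

# Hypothetical per-country summary, including one country right at the threshold
toy_by_country <- data.frame(
  country     = c("A", "B", "C", "D"),
  total       = c(2, 100, 101, 2500),
  percent_yes = c(0.00, 0.30, 0.55, 0.80),
  stringsAsFactors = FALSE
)

min_votes <- 100  # illustrative threshold

# ">=" drops only countries with fewer than min_votes votes
reliable <- toy_by_country %>%
  filter(total >= min_votes) %>%
  arrange(percent_yes)

# ">" would also drop the country with exactly min_votes votes
strict <- toy_by_country %>%
  filter(total > min_votes) %>%
  arrange(percent_yes)

print(reliable)
print(strict)
```

Keeping the threshold in a named variable makes it easy to rerun the summary with a different cutoff and see how sensitive the ranking is to it.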
