Skip to content

Latest commit

 

History

History
157 lines (116 loc) · 3.91 KB

Risky-Business.md

File metadata and controls

157 lines (116 loc) · 3.91 KB

Risky Business

Jack Carter 2/21/2022

Summary

This project uses web scraping, regex and a labeling mechanism to uncover the relative distribution of different violence risks in the emerging world. The results include 822 risk headlines from 140 news sources. Just as Tom Cruise discovered in Risky Business, those seeking to venture into the unknown could find things get out of hand quickly if they’re not careful.

 

Results

The results show a strong risk of military conflict in Europe due to the Ukraine crisis and a high risk of internal conflict, such as organized crime, social unrest and terrorism, in regions with weak states like Africa and Latin America.

 

Disclaimer

The data above show only the relative distribution of headlines for each region, not differences in the total number of headlines. This means we can only hope to better understand variations in the types of risks in such regions, not how dangerous they are overall.

 

Method

1) Assemble Links Database:

A database containing over 250 online newspapers in the emerging world is assembled.

Africa Asia Europe Latin America MENA
34 77 56 45 55

 

2) Scrape Headlines:

The links in the database is scraped several times a week from 17 January 2022 to present.

—EXAMPLE CODE SNIPET—

# reads in the link html text. 
read_link <- function(link) {
  elements <- c("h1", "h2", "h3", "h4", "h5")
  headlines <- read_html(link)
  html <- lapply(elements, html_elements, x = headlines)
  text <- unlist(lapply(html, html_text))
  on.exit(closeAllConnections())
  return(text)
}

 

3) Regex:

A series of regex strings, including 1) keywords, 2) actors and actions, 3) and hyperbolic situations, are combined to calculate the scores of four different kinds of violence.

—EXAMPLE CODE SNIPET—

# strong capture words.
strong_capture_words <-  paste0(
  "(",
  capture <-  "captur(e|es|ed|ing)\\b|",
  hijack <-  "hijac(k|ks|ked|king)\\b|",
  hold_prisoner <-  "h(old|olds|eld|olding) (hostag(e|es)|to ransom)\\b|",
  take_prisoner <-  "t(ake|akes|ook|aken|aking) (prisoner|hostage)\\b|",
  abduct <-  "abduc(t|ts|ted|ting)\\b|",
  kidnap <-  "kidna(p|ps|ped|ping)\\b",
  ")"
)

 

4) Labeling:

The headlines are assigned one of the four violence type labels, including military conflict, organized crime, social unrest and terrorism.

—EXAMPLE CODE SNIPET—

# determines the main category under which the headlines fall. 
assign_category <- function(scores) {
  data <- tibble()
  for(i in 1:nrow(scores)) {
    data[i,1] <- if(sum(scores[i,1:5])==0) "None"
    else {
      categories[max(which(scores[i,1:5]==max(scores[i,1:5])))]
    } 
  }
  return(data)                
}

 

5) Check & Re-assign:

The top 100 scoring headlines are checked each time the data is scraped, with only those deemed genuine violence risks kept and those mislabeled re-assigned.

—EXAMPLE CODE SNIPET—

# creates a list of mislabeled headlines. 
mislabeled <- tibble(
  number = c(2),
  category = c("organized_crime")
)

# adjusts mislabeled headlines. 
mislabeled_adjust <- function(scores) {
  scores <- scores
  for(i in 1:nrow(mislabeled))
    scores[pull(mislabeled[i,1]),3] <- mislabeled$category[i]
  return(scores)
}
mislabeled_adjusted <- mislabeled_adjust(scores)

 

Sources