# hcds-a2-bias

## Purpose 

This notebook details the steps needed to construct and analyze a dataset which details the articles on politicians of various countries together with the modelled quality of those articles as predicted by a machine learning system called __[ORES](https://www.mediawiki.org/wiki/ORES)__. It is divided into three sections:

1. Data acquisition - collecting data from various sources 
2. Data processing - preparing the data in (1) for analysis 
3. Data analysis - visualizing the dataset created in (2) as a series of bar charts 

## 1. Data acquisition

Three types of data were required for the analysis: 

1. Population data - A list of countries together with their mid-2015 populations was obtained from the __[Population Reference Bureau website](http://www.prb.org/DataFinder/Topic/Rankings.aspx?ind=14)__ in the form of a csv file (__[Population Mid-2015.csv](http://www.prb.org/RawData.axd?ind=14&fmt=14&tf=76&loc=34235%2c249%2c250%2c251%2c252%2c253%2c254%2c34227%2c255%2c257%2c258%2c259%2c260%2c261%2c262%2c263%2c264%2c265%2c266%2c267%2c268%2c269%2c270%2c271%2c272%2c274%2c275%2c276%2c277%2c278%2c279%2c280%2c281%2c282%2c283%2c284%2c285%2c286%2c287%2c288%2c289%2c290%2c291%2c292%2c294%2c295%2c296%2c297%2c298%2c299%2c300%2c301%2c302%2c304%2c305%2c306%2c307%2c308%2c311%2c312%2c315%2c316%2c317%2c318%2c319%2c320%2c321%2c322%2c324%2c325%2c326%2c327%2c328%2c34234%2c329%2c330%2c331%2c332%2c333%2c334%2c336%2c337%2c338%2c339%2c340%2c342%2c343%2c344%2c345%2c346%2c347%2c348%2c349%2c350%2c351%2c352%2c353%2c354%2c358%2c359%2c360%2c361%2c362%2c363%2c364%2c365%2c366%2c367%2c368%2c369%2c370%2c371%2c372%2c373%2c374%2c375%2c377%2c378%2c379%2c380%2c381%2c382%2c383%2c384%2c385%2c386%2c387%2c388%2c389%2c390%2c392%2c393%2c394%2c395%2c396%2c397%2c398%2c399%2c400%2c401%2c402%2c404%2c405%2c406%2c407%2c408%2c409%2c410%2c411%2c415%2c416%2c417%2c418%2c419%2c420%2c421%2c422%2c423%2c424%2c425%2c427%2c428%2c429%2c430%2c431%2c432%2c433%2c434%2c435%2c437%2c438%2c439%2c440%2c441%2c442%2c443%2c444%2c445%2c446%2c448%2c449%2c450%2c451%2c452%2c453%2c454%2c455%2c456%2c457%2c458%2c459%2c460%2c461%2c462%2c464%2c465%2c466%2c467%2c468%2c469%2c470%2c471%2c472%2c473%2c474%2c475%2c476%2c477%2c478%2c479%2c480)__).

2. Article data - A dataset containing a list of English language Wikipedia articles on politicians mapped to the country of the politician, together with the revision id of the most recent edit to the article. This list was obtained from an existing project on __[Figshare](https://figshare.com/articles/Untitled_Item/5513449)__ in the form of a csv file (page_data.csv).  

3. Article quality data - For each article in (2), a prediction of the quality of the article was obtained by calling the __[ORES API](https://ores.wikimedia.org/v3/#!/scoring/get_v3_scores_context_revid_model)__ and passing it three parameters:  i) context (the name of the Wikipedia project, in this case 'enwiki' for English language Wikipedia), ii) the revision id to score as detailed in (2) and iii) the scoring model, in this case wp10.   

For example: https://ores.wikimedia.org/v3/scores/enwiki/235107991/wp10

The API returns a JSON object in which the "prediction" key holds one of six quality values (listed from best to worst):
 - FA     Featured article
 - GA     Good article
 - B      B-class article
 - C      C-class article 
 - Start  Start-class article 
 - Stub   Stub-class article 
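
The request URL and the lookup of the "prediction" value can be sketched as follows. This is a minimal illustration rather than the code used later in this notebook: `build_ores_url` and `sample_response` are hypothetical names, and the JSON string is a hand-written stand-in that is assumed to follow the nesting of the v3 response (project, "scores", revision id, model, "score"). It is not real API output.

```r
library(jsonlite)

# Hypothetical helper: assemble the scoring URL from the three parameters.
# The revision id is passed as a string so that it is never printed in
# scientific notation.
build_ores_url <- function(context, rev_id, model) {
  paste0("https://ores.wikimedia.org/v3/scores/", context, "/", rev_id, "/", model)
}

ores_url <- build_ores_url("enwiki", "235107991", "wp10")

# Hand-written stand-in for a response body (assumed nesting, not real output).
sample_response <- '{"enwiki": {"scores": {"235107991": {"wp10": {"score": {"prediction": "Stub"}}}}}}'
parsed <- fromJSON(sample_response)
prediction <- parsed[["enwiki"]][["scores"]][["235107991"]][["wp10"]][["score"]][["prediction"]]

# Ordered factor so that categories can be compared by quality (worst to best).
quality_levels <- c("Stub", "Start", "C", "B", "GA", "FA")
prediction_ordered <- factor(prediction, levels = quality_levels, ordered = TRUE)
is_below_ga <- prediction_ordered < "GA"
```

Storing the categories as an ordered factor is optional, but it makes comparisons such as "at least GA" explicit instead of relying on string matching.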

Since steps (1) and (2) can be replicated simply by downloading the relevant csv files, this notebook provides only the code needed to produce (3).  

### Article quality data

#### a) Setup 
Load the required packages. This assumes that the packages 'httr', 'jsonlite', 'data.table', 'plyr', 'dplyr', 'tidyr', 'stringr' and 'ggplot2', which are available from __[CRAN](https://cran.r-project.org/)__, have been installed.

In [None]:
library(httr)
library(jsonlite)
library(data.table)
library(plyr)
library(dplyr)
library(tidyr)
library(stringr)
library(ggplot2)

#### b) Set the working directory 
Set the working directory to match the directory which contains the csv files:  page_data.csv and Population Mid-2015.csv as detailed in 1. Data acquisition. 

In [None]:
wd <- "/Users/MB/Desktop/DATA_512/Week 4/Assignment/"       # specify working directory here 
setwd(wd)
getwd()

#### c) Read the page_data and Population Mid-2015 csv files into R

In [29]:
page_data <-read.csv(file="page_data.csv", header=TRUE, sep=",")
population_data <-read.csv(file="Population Mid-2015.csv", header=TRUE, sep=",")
head(page_data)

page,country,rev_id
Template:ZambiaProvincialMinisters,Zambia,235107991
Bir I of Kanem,Chad,355319463
Template:Zimbabwe-politician-stub,Zimbabwe,391862046
Template:Uganda-politician-stub,Uganda,391862070
Template:Namibia-politician-stub,Namibia,391862409
Template:Nigeria-politician-stub,Nigeria,391862819


#### d) Get quality article assessment using ORES API 
Since the ORES API can only accept one revision id at a time, it is necessary to loop through each of the rev_ids in the third column of the page_data table created above.  A function, return_category, was created which takes a single rev_id as input and returns an object, edit_category, which contains the rev_id together with the quality_category.  Using lapply, this function was applied over all values in the rev_id column of the page_data table above. 

Note that, due to the large number of API requests, this step can take up to 5 hours to complete.

In [None]:
ores_api <- "https://ores.wikimedia.org/v3/scores/enwiki/"
edit_id <- as.character(page_data[,3])
model <- "/wp10"

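return_category <- function(last_edit) {
  # Build the request URL, e.g. https://ores.wikimedia.org/v3/scores/enwiki/235107991/wp10
  ores_url <- paste0(ores_api, last_edit, model)
  quality_estimate <- GET(ores_url)
  quality_estimate.json <- jsonlite::fromJSON(toJSON(content(quality_estimate)))
  # Flatten the nested response; this relies on the prediction being its second value
  quality_category <- unlist(quality_estimate.json)[[2]]
  # Pair the revision id with its predicted quality category
  edit_category <- cbind(last_edit, quality_category)
  return(edit_category)
}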

edit_id_quality_list <- lapply(edit_id, return_category) 
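
Because a single failed request inside lapply would abort a run that can take hours, a more defensive variant may be worth considering. The following is a minimal sketch under stated assumptions, not the method used in this notebook: `extract_prediction` and `safe_return_category` are hypothetical names, and the nesting of the response (project, "scores", revision id, model, "score") is an assumption about the v3 response.

```r
# Hypothetical helper: look up the prediction in a parsed response (a nested list).
# Returns NA instead of failing when any level of the assumed path is missing.
extract_prediction <- function(parsed, rev_id, context = "enwiki", model = "wp10") {
  score <- parsed[[context]][["scores"]][[as.character(rev_id)]][[model]][["score"]]
  prediction <- score[["prediction"]]
  if (is.null(prediction)) NA_character_ else as.character(prediction)
}

# Hypothetical wrapper: one request per revision id; any HTTP or parsing
# error yields NA for that revision rather than stopping the whole loop.
safe_return_category <- function(rev_id) {
  ores_url <- paste0("https://ores.wikimedia.org/v3/scores/enwiki/", rev_id, "/wp10")
  quality_category <- tryCatch({
    response <- httr::GET(ores_url)
    httr::stop_for_status(response)
    body <- httr::content(response, as = "text", encoding = "UTF-8")
    extract_prediction(jsonlite::fromJSON(body, simplifyVector = FALSE), rev_id)
  }, error = function(e) NA_character_)
  data.frame(last_edit = rev_id, quality_category = quality_category,
             stringsAsFactors = FALSE)
}

# Offline check of the lookup on a hand-built list (no request is made here).
parsed_ok <- list(enwiki = list(scores = list(
  `235107991` = list(wp10 = list(score = list(prediction = "Stub"))))))
found   <- extract_prediction(parsed_ok, "235107991")
missing <- extract_prediction(list(), "235107991")
```

With this variant, revisions that could not be scored show up as NA in quality_category and can be counted or retried afterwards.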

Then, to work with this object, it is necessary to convert it to a data.frame using the ldply function in the plyr package.

In [None]:
edit_quality <- ldply(edit_id_quality_list, data.frame)

The net result is a data.frame consisting of two columns: last_edit (the revision_id of the last edit) and quality_category (the predicted quality category of the article).
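
To see what this conversion does, the sketch below applies ldply to a small hand-made list shaped like the output of return_category. The `toy_` names and values are illustrative only.

```r
library(plyr)

# Each element mimics one cbind(last_edit, quality_category) result:
# a 1 x 2 character matrix with column names.
toy_list <- list(
  cbind(last_edit = "111", quality_category = "Stub"),
  cbind(last_edit = "222", quality_category = "GA")
)

# ldply applies data.frame to each element and row-binds the results.
toy_quality <- ldply(toy_list, data.frame)
str(toy_quality)
```

Both columns come back as text (or as factors in older versions of R), which is why a numeric copy of the revision id is needed before merging.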

## 2. Data processing 

During this step, a single csv file was created with the following columns: 
 - country 
 - article name
 - revision_id 
 - article_quality 
 - population 
 
In other words, for each article name in the original page_data.csv file, the article quality (as identified in Step 1) and the population of the country were appended.  Articles from countries which did not have a corresponding population value in the Population Mid-2015.csv dataset were removed, as were articles which did not have a corresponding quality value. 

#### a) Prepare data.frames for merging
To enable merging, new columns were created in the edit_quality data.frame from Step 1 d) and the page_data data.frame from Step 1 c). 

In [None]:
edit_quality$last_edit_num <- as.numeric(as.character((edit_quality$last_edit)))

page_data$last_edit_num <- as.numeric(as.character(page_data$rev_id))
page_data$country <- as.character(page_data$country)
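
The as.numeric(as.character(...)) pattern matters when a column is stored as a factor, which can happen for text columns depending on the R version and how the data.frame was built. A short illustration with two revision ids from page_data (the variable names are illustrative):

```r
rev_factor <- factor(c("391862046", "235107991"))

# Converting a factor directly returns its internal level codes ...
codes <- as.numeric(rev_factor)

# ... whereas going through character returns the actual revision ids.
values <- as.numeric(as.character(rev_factor))

print(codes)
print(values)
```

For a column that is already numeric the conversion is harmless, so applying it to both data.frames keeps the join keys in the same type.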

#### b) Merge page_data and edit_quality on last_edit_num 
For each article in page_data, the article quality was appended (where this existed).  Any duplicates caused by the merge were removed. 

In [None]:
page_data_quality_dupes <- inner_join(page_data, edit_quality, by= c("last_edit_num" = "last_edit_num"))

page_data_quality <- distinct(page_data_quality_dupes)
rm(page_data_quality_dupes)
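
The effect of the inner join followed by distinct can be seen on a toy example. The data below are made up for illustration: one revision appears twice in the score table and one page has no score at all.

```r
library(dplyr)

toy_pages <- data.frame(page = c("A", "B", "C"),
                        last_edit_num = c(1, 2, 3),
                        stringsAsFactors = FALSE)

# Revision 1 was scored twice; revision 3 has no score.
toy_scores <- data.frame(last_edit_num = c(1, 1, 2),
                         quality_category = c("Stub", "Stub", "GA"),
                         stringsAsFactors = FALSE)

# inner_join keeps only matching keys, so page C is dropped and page A is duplicated.
toy_joined <- inner_join(toy_pages, toy_scores, by = "last_edit_num")

# distinct removes the fully identical rows introduced by the duplicate score.
toy_deduped <- distinct(toy_joined)
```

Comparing nrow(toy_joined) with nrow(toy_deduped) is a quick way to see how many duplicates the merge introduced.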

#### c) Remove redundant columns in dataset created in b)

In [None]:
keep_columns <- c("country", "page", "last_edit_num", "quality_category")
page_data_quality <- subset(page_data_quality, select = keep_columns)
page_data_quality$country <- as.character(page_data_quality$country)

#### d) Prepare population dataset for merging
To enable merging, a new dataset was created (country_population) which contained only two columns:  country and population.

In [None]:
keep_columns2 <- c("Location", "Data")
country_population <- subset(population_data, select=keep_columns2)

names(country_population)[1] <- "country"
country_population$country <- as.character(country_population$country)
country_population$Data <- as.character(country_population$Data)

country_population$population <- as.numeric(str_replace_all(country_population$Data, fixed(","), ""))
country_population <- subset(country_population, select=c("country", "population"))
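
The comma removal is needed because as.numeric cannot parse numbers that contain thousands separators. A small sketch, assuming the Data column holds strings such as "15,473,900" (the two values are the Zambia and Namibia populations shown later in this notebook, written with separators for illustration):

```r
library(stringr)

raw_population <- c("15,473,900", "2,482,100")

# Direct conversion fails: every value becomes NA (with a warning).
direct <- suppressWarnings(as.numeric(raw_population))

# Removing the commas first gives usable numbers.
cleaned <- as.numeric(str_replace_all(raw_population, fixed(","), ""))

# A simple guard against values that still failed to convert.
stopifnot(!anyNA(cleaned))
```

Running the same anyNA check on country_population$population would flag any entries in the file that are not plain comma-separated numbers.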

#### e) Merge country_population and page_data_quality on country 
For each article in page_data_quality, the country population was appended (where this existed in the country_population dataset). 

In [None]:
page_quality_population <- inner_join(page_data_quality, country_population, by=c("country" = "country"))
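
Since the inner join silently drops articles whose country name has no exact match in the population data, it can be useful to list what was dropped. The sketch below uses dplyr's anti_join on toy data. "Country X" is a made-up name standing in for a country that is spelled differently or absent in the population file.

```r
library(dplyr)

toy_articles <- data.frame(country = c("Chad", "Zambia", "Country X"),
                           page = c("P1", "P2", "P3"),
                           stringsAsFactors = FALSE)
toy_population <- data.frame(country = c("Chad", "Zambia"),
                             population = c(13707000, 15473900),
                             stringsAsFactors = FALSE)

# Rows of toy_articles with no matching country in toy_population.
toy_unmatched <- anti_join(toy_articles, toy_population, by = "country")

# The inner join keeps the remaining rows.
toy_matched <- inner_join(toy_articles, toy_population, by = "country")
```

Applied to page_data_quality and country_population, the same anti_join would show which country names need reconciling before articles are discarded.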

#### f) Rename columns in page_quality_population

In [None]:
names(page_quality_population)[2] <- "article_name"
names(page_quality_population)[3] <- "revision_id"
names(page_quality_population)[4] <- "article_quality"

#### g) Export f) to csv without row names 
A csv file (page_quality_population.csv) was created with the columns described above in 2. Data processing.

In [None]:
write.csv(page_quality_population, file="page_quality_population.csv", row.names=FALSE)
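
A quick round trip can confirm that the exported file has exactly the intended columns. The sketch below writes a one-row toy data.frame (values taken from the first row of page_data) to a temporary file rather than to the working directory. `expected_columns` is an illustrative name.

```r
expected_columns <- c("country", "article_name", "revision_id",
                      "article_quality", "population")

toy_output <- data.frame(country = "Zambia",
                         article_name = "Template:ZambiaProvincialMinisters",
                         revision_id = 235107991,
                         article_quality = "Stub",
                         population = 15473900,
                         stringsAsFactors = FALSE)

tmp_file <- tempfile(fileext = ".csv")
write.csv(toy_output, file = tmp_file, row.names = FALSE)

# Read the file back and compare the header with the expected columns.
read_back <- read.csv(tmp_file, stringsAsFactors = FALSE)
columns_ok <- identical(names(read_back), expected_columns)
unlink(tmp_file)
```

Without row.names=FALSE, the file would gain an extra unnamed first column of row numbers, and the check would fail.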

## 3. Data analysis 

During this step, four visualizations (bar charts) were created which showed: 

1. The top 10 countries in terms of the number of politician articles as a proportion of country population.  
2. The bottom 10 countries in terms of the number of politician articles as a proportion of country population.  
3. The top 10 countries in terms of the proportion of articles which are high quality. 
4. The bottom 10 countries in terms of the proportion of articles which are high quality. 

For the purpose of this analysis, high quality articles were defined as those articles which were either:  FA (Featured article) or GA (Good article). 

#### a) Read page_quality_population.csv created in 2 g) into R and convert to a data.table 

In [15]:
page_quality_population <-read.csv(file="page_quality_population.csv", header=TRUE, sep=",")
head(page_quality_population) 

country,article_name,revision_id,article_quality,population
Zambia,Template:ZambiaProvincialMinisters,235107991,Stub,15473900
Chad,Bir I of Kanem,355319463,Stub,13707000
Zimbabwe,Template:Zimbabwe-politician-stub,391862046,Stub,17354000
Uganda,Template:Uganda-politician-stub,391862070,Stub,40141000
Namibia,Template:Namibia-politician-stub,391862409,Stub,2482100
Nigeria,Template:Nigeria-politician-stub,391862819,Stub,181839400


#### b) For each country in a), calculate number of articles and population

In [21]:
page_quality_population <- data.table(page_quality_population)
country_articles <- summarise(group_by(page_quality_population, country), count_article = n(), population = mean(population))
head(country_articles)

country,count_article,population
Zambia,26,15473900
Chad,100,13707000
Zimbabwe,167,17354000
Uganda,188,40141000
Namibia,165,2482100
Nigeria,684,181839400


#### c) For each country in a), calculate number of articles which are high quality (i.e. where the article quality is FA or GA)

In [23]:
country_high_quality <- (page_quality_population[article_quality %in% c("FA", "GA"), .N, by=country])
names(country_high_quality)[2] <- "count_high_quality" 

#### d) Merge country_articles and country_high_quality 
For each country in b), the count of high quality articles as identified in c) was appended.  Where the country did not have any high quality articles the NA value created by the merge was replaced with a zero. 

In [25]:
country_article_quality <- full_join(country_articles, country_high_quality, by=c("country" = "country"))
country_article_quality[("count_high_quality")][is.na(country_article_quality["count_high_quality"])] <- 0 
head(country_article_quality)

country,count_article,population,count_high_quality
Zambia,26,15473900,0
Chad,100,13707000,2
Zimbabwe,167,17354000,2
Uganda,188,40141000,1
Namibia,165,2482100,1
Nigeria,684,181839400,4
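
The reason for the NA replacement is that countries with no FA or GA articles do not appear in country_high_quality at all, so the full join leaves their count missing. A toy version using the Zambia and Chad rows above:

```r
library(dplyr)

toy_articles <- data.frame(country = c("Zambia", "Chad"),
                           count_article = c(26, 100),
                           stringsAsFactors = FALSE)

# Zambia has no high quality articles, so it has no row here.
toy_high_quality <- data.frame(country = "Chad",
                               count_high_quality = 2,
                               stringsAsFactors = FALSE)

toy_merged <- full_join(toy_articles, toy_high_quality, by = "country")

# Missing counts mean "no high quality articles", so they become zero.
toy_merged$count_high_quality[is.na(toy_merged$count_high_quality)] <- 0
```

Leaving the NA values in place would propagate NA into the proportion calculated in the next step.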


#### e) Calculate articles per 1000 population, proportion of articles which are high quality 

In [26]:
country_article_quality$articles_per_population <- ((country_article_quality$count_article)/(country_article_quality$population)*1000)
country_article_quality$proportion_high_quality <- (country_article_quality$count_high_quality)/(country_article_quality$count_article)
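
As a worked example of the two measures, the sketch below repeats the calculation for the Zambia and Nigeria rows shown in d). The `toy` data.frame is only for illustration.

```r
toy <- data.frame(country = c("Zambia", "Nigeria"),
                  count_article = c(26, 684),
                  population = c(15473900, 181839400),
                  count_high_quality = c(0, 4),
                  stringsAsFactors = FALSE)

# Articles per 1000 inhabitants: count / population * 1000.
toy$articles_per_population <- toy$count_article / toy$population * 1000

# Share of a country's articles rated FA or GA: high quality count / article count.
toy$proportion_high_quality <- toy$count_high_quality / toy$count_article

print(toy)
```

For Nigeria this gives roughly 0.0038 articles per 1000 inhabitants and a high quality proportion of 4/684, a little under 0.6%.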

#### f) Identify top/bottom 10 countries in terms of the number of FA/GA articles as a proportion of all articles 

In [27]:
top10_quality <- subset(arrange(top_n(country_article_quality, 10, proportion_high_quality), desc(proportion_high_quality)), select=c("country", "proportion_high_quality"))
bottom10_quality <- subset(arrange(top_n(country_article_quality, 10, -proportion_high_quality), proportion_high_quality), select=c("country", "proportion_high_quality"))

head(top10_quality)
head(bottom10_quality)

country,proportion_high_quality
"Korea, North",0.23076923
Romania,0.12931034
Saudi Arabia,0.12605042
Central African Republic,0.11764706
Qatar,0.09803922
Guinea-Bissau,0.0952381


country,proportion_high_quality
Zambia,0
Solomon Islands,0
Nepal,0
Tunisia,0
Switzerland,0
Belgium,0
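
Every row printed for the bottom group has a proportion of zero, which points to ties. top_n keeps all rows that tie with the cut-off value, so it can return more than the requested number of rows. Whether that happens here cannot be told from the six rows printed by head. The toy example below (made-up countries A to E) shows the behaviour, and one possible way to force an exact row count.

```r
library(dplyr)

toy <- data.frame(country = c("A", "B", "C", "D", "E"),
                  proportion_high_quality = c(0, 0, 0, 0.1, 0.2),
                  stringsAsFactors = FALSE)

# Asking for the bottom 2 returns 3 rows, because three countries tie at zero.
toy_with_ties <- top_n(toy, 2, -proportion_high_quality)

# Sorting and then taking the first 2 rows returns exactly 2,
# at the cost of an arbitrary choice among the tied countries.
toy_exactly_two <- head(arrange(toy, proportion_high_quality), 2)
```

When many countries tie at zero, a "bottom 10" chart is arbitrary either way, so reporting the number of tied countries alongside the chart may be more informative.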


#### g) Identify top/bottom 10 countries in terms of the number of articles as a proportion of country population

In [28]:
top10_articles <- subset(arrange(top_n(country_article_quality, 10, articles_per_population), desc(articles_per_population)), select=c("country", "articles_per_population"))
bottom10_articles <- subset(arrange(top_n(country_article_quality, 10, -articles_per_population), articles_per_population), select=c("country", "articles_per_population"))

head(top10_articles)
head(bottom10_articles)

country,articles_per_population
Nauru,4.8802947
Tuvalu,4.6610169
San Marino,2.4848485
Monaco,1.0501995
Liechtenstein,0.7718925
Marshall Islands,0.6727273


country,articles_per_population
India,0.0007541297
China,0.0008294944
Indonesia,0.0008406911
Uzbekistan,0.0009267902
Ethiopia,0.0010698129
"Korea, North",0.0015610615


#### h) Create bar charts representing top/bottom 10 countries as identified in f) and g)

In [None]:
top10_proportion_plot <- ggplot(top10_articles, aes(x=reorder(country, articles_per_population),  y=articles_per_population)) + geom_bar(stat='identity', fill="purple") + coord_flip() + scale_y_continuous(expand=c(0,0), limits=c(0, 5)) + labs(title = "Top 10 countries based on articles as a proportion of population", x="Country", y="Politician articles per 1000 population") + theme(plot.title = element_text(size=12, hjust=1))

bottom10_proportion_plot <- ggplot(bottom10_articles, aes(x=reorder(country, articles_per_population),  y=articles_per_population)) + geom_bar(stat='identity', fill="purple") + coord_flip() + scale_y_continuous(expand=c(0,0), limits=c(0, 0.005)) + labs(title = "Bottom 10 countries based on articles as a proportion of population", x="Country", y="Politician articles per 1000 population") + theme(plot.title = element_text(size=12, hjust=1))

top10_quality_plot <- ggplot(top10_quality, aes(x=reorder(country, proportion_high_quality),  y=proportion_high_quality)) + geom_bar(stat='identity', fill="purple") + coord_flip() + scale_y_continuous(labels = scales::percent, expand=c(0,0), limits=c(0, 0.3)) + labs(title = "Top 10 countries with highest proportion of high quality articles", x="Country", y="% politician articles which are high quality") + theme(plot.title = element_text(size=12, hjust=1))

bottom10_quality_plot <- ggplot(bottom10_quality, aes(x=reorder(country, proportion_high_quality),  y=proportion_high_quality)) + geom_bar(stat='identity', fill="purple") + coord_flip() + scale_y_continuous(labels=scales::percent, expand=c(0,0), limits=c(0, 0.3)) + labs(title = "Bottom 10 countries with lowest proportion of high quality articles", x="Country", y="% politician articles which are high quality") + theme(plot.title = element_text(size=12, hjust=1))
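
The four charts share the same structure and differ only in data, value column, title, axis label and axis limit. One way to reduce the repetition is a small helper. `plot_ranked` below is a hypothetical function, shown as a sketch on three rows from the top-10 table in g). It omits the percent labels used for the quality charts.

```r
library(ggplot2)

# Hypothetical helper: horizontal bar chart of one numeric column by country.
# `data` must contain a `country` column and the column named in `value_col`.
plot_ranked <- function(data, value_col, title, y_label, y_max) {
  ggplot(data, aes(x = reorder(country, .data[[value_col]]),
                   y = .data[[value_col]])) +
    geom_bar(stat = "identity", fill = "purple") +
    coord_flip() +
    scale_y_continuous(expand = c(0, 0), limits = c(0, y_max)) +
    labs(title = title, x = "Country", y = y_label) +
    theme(plot.title = element_text(size = 12, hjust = 1))
}

toy_top <- data.frame(country = c("Nauru", "Tuvalu", "San Marino"),
                      articles_per_population = c(4.8802947, 4.6610169, 2.4848485),
                      stringsAsFactors = FALSE)

toy_plot <- plot_ranked(toy_top, "articles_per_population",
                        "Top countries based on articles as a proportion of population",
                        "Politician articles per 1000 population", 5)
```

Keeping the styling in one place means a change such as a different fill colour only has to be made once.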

#### i) Export charts to png 

In [None]:
png(filename="top10_proportion_plot.png")
plot(top10_proportion_plot)
dev.off() 

png(filename="bottom10_proportion_plot.png")
plot(bottom10_proportion_plot)
dev.off() 

png(filename="top10_quality_plot.png")
plot(top10_quality_plot)
dev.off() 

png(filename="bottom10_quality_plot.png")
plot(bottom10_quality_plot)
dev.off() 
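
An alternative to repeating png, plot and dev.off for each chart is ggplot2's ggsave inside a loop over a named list. The sketch below uses a toy plot and a temporary directory. The width, height and dpi values are arbitrary choices for illustration, not settings used in this notebook.

```r
library(ggplot2)

toy_plot <- ggplot(data.frame(country = c("Nauru", "Tuvalu"),
                              articles_per_population = c(4.8802947, 4.6610169)),
                   aes(x = country, y = articles_per_population)) +
  geom_bar(stat = "identity", fill = "purple")

# Named list: each name becomes the file name of the exported chart.
plots <- list(toy_plot = toy_plot)
out_dir <- tempdir()

for (plot_name in names(plots)) {
  ggsave(filename = file.path(out_dir, paste0(plot_name, ".png")),
         plot = plots[[plot_name]], width = 6, height = 4, dpi = 150)
}

saved_file <- file.path(out_dir, "toy_plot.png")
```

For the four charts in this notebook, the list would hold top10_proportion_plot, bottom10_proportion_plot, top10_quality_plot and bottom10_quality_plot under those names.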

## END 