# Data validation

## Setup

In [1]:
library(tidyverse)

── Attaching packages ─────────────────────────────────────────────────────────────── tidyverse 1.3.0 ──

✓ ggplot2 3.3.2     ✓ purrr   0.3.4
✓ tibble  3.0.3     ✓ dplyr   1.0.0
✓ tidyr   1.1.0     ✓ stringr 1.4.0
✓ readr   1.3.1     ✓ forcats 0.5.0

── Conflicts ────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
x dplyr::filter() masks stats::filter()
x dplyr::lag()    masks stats::lag()



In [2]:
na_summary <- arrow::read_feather("data/na_summary_preprocessed.feather")

head(na_summary)

rowtype,boardname,ned,directorname,rolename,rolestatus,gender,nationality,boardid,clientcompanyid,⋯,nationalitymix,numberdirectors,stdevtimebrd,stdevtimeinco,stdevtotnolstdbrd,stdevtotcurrnolstdbrd,stdevnoquals,stdevage,networksize,companyid
<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<dbl>,<???>,⋯,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
Board Member,EQUITY ONE INC (De-listed 03/2017),Yes,David Fischel,Independent Director,David Fischel joined this role on 04 Jan 2011,M,British,10925,,⋯,0.4,9,7.1,7.1,3.1,1.3,1.5,8.5,6092,10925
Board Member,EQUITY ONE INC (De-listed 03/2017),Yes,David Fischel,Independent Director,David Fischel joined this role on 04 Jan 2011,M,British,10925,,⋯,0.4,9,5.8,5.8,3.2,1.2,1.2,7.7,6092,10925
Board Member,EQUITY ONE INC (De-listed 03/2017),Yes,David Fischel,Independent Director,David Fischel joined this role on 04 Jan 2011,M,British,10925,,⋯,0.4,10,5.3,5.3,3.0,1.8,1.1,6.9,6092,10925
Board Member,EQUITY ONE INC (De-listed 03/2017),Yes,David Fischel,Independent Director,David Fischel joined this role on 04 Jan 2011,M,British,10925,,⋯,0.4,9,7.1,7.1,3.1,1.4,1.5,8.5,6092,10925
Board Member,EQUITY ONE INC (De-listed 03/2017),Yes,David Fischel,Independent Director,David Fischel joined this role on 04 Jan 2011,M,British,10925,,⋯,0.5,10,5.6,5.6,3.1,1.5,1.1,7.3,6092,10925
Board Member,NATIONAL MEDICAL HEALTH CARD SYSTEMS INC (De-listed 04/2008),Yes,David Shaw,Director - SD,David Shaw joined this role on 08 Dec 2004,M,American,21616,,⋯,0.0,10,2.2,2.1,1.4,0.7,0.8,9.4,4321,21616


## Check for duplicates

### Annual report inconsistencies

In [25]:
na_summary %>%
    select(isin, annualreportdate, companyid) %>%
    group_by(isin, annualreportdate) %>%
    summarize(n = n_distinct(companyid)) %>%
    filter(n > 1)

`summarise()` regrouping output by 'isin' (override with `.groups` argument)



isin,annualreportdate,n
<chr>,<date>,<int>
US30224P2002,2013-12-01,2
US30224P2002,2014-12-01,2
US30224P2002,2015-12-01,2
US30224P2002,2016-12-01,2
US30224P2002,2017-12-01,2
US30224P2002,2018-12-01,2
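
The same check can be tried on a small example. The sketch below uses a hypothetical toy tibble (`toy`, with made-up ISINs `"AAA"` and `"BBB"`), not the real data, and passes `.groups = "drop"` so that `summarize()` returns an ungrouped result without the regrouping message.

```r
library(dplyr)

# Hypothetical toy data: ISIN "AAA" maps to two company IDs on the same date.
toy <- tibble(
  isin             = c("AAA", "AAA", "AAA", "BBB", "BBB"),
  annualreportdate = as.Date(c("2013-12-01", "2013-12-01", "2013-12-01",
                               "2013-12-01", "2014-12-01")),
  companyid        = c(1, 2, 2, 3, 3)
)

# Count distinct company IDs per (isin, annualreportdate) key and keep
# only the keys that map to more than one company ID.
conflicts <- toy %>%
  group_by(isin, annualreportdate) %>%
  summarize(n = n_distinct(companyid), .groups = "drop") %>%
  filter(n > 1)

conflicts
```

Only the `"AAA"` key is returned, because it is the only one with two distinct company IDs.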


In [27]:
na_summary %>%
    filter(isin == "US30224P2002" & annualreportdate == "2013-12-01") %>%
    select(isin, annualreportdate, companyid, everything()) %>%
    head()

isin,annualreportdate,companyid,rowtype,boardname,ned,directorname,rolename,rolestatus,gender,⋯,genderratio,nationalitymix,numberdirectors,stdevtimebrd,stdevtimeinco,stdevtotnolstdbrd,stdevtotcurrnolstdbrd,stdevnoquals,stdevage,networksize
<chr>,<date>,<dbl>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,⋯,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
US30224P2002,2013-12-01,2009179,Board Member,ESH Hospitality Inc (ESH Hospitality LLC prior to 11/2013),Yes,Richard Wallman,Independent Director,Richard Wallman joined this role in Nov 2013,M,⋯,1,0,5,0,0,4.3,1.5,0.4,11.8,4173
US30224P2002,2013-12-01,2009180,Board Member,EXTENDED STAY AMERICA INC,Yes,Richard Wallman,Independent Director,Richard Wallman joined this role on 13 Nov 2013,M,⋯,1,0,7,0,0,3.1,1.2,0.6,9.0,4173
US30224P2002,2013-12-01,2009180,Board Member,EXTENDED STAY AMERICA INC,Yes,Doug Geoga,Independent Chairman,Doug Geoga joined this role on 13 Nov 2013,M,⋯,1,0,7,0,0,3.1,1.2,0.6,9.0,852
US30224P2002,2013-12-01,2009179,Board Member,ESH Hospitality Inc (ESH Hospitality LLC prior to 11/2013),Yes,Doug Geoga,Independent Chairman,Doug Geoga joined this role on 12 Nov 2013,M,⋯,1,0,5,0,0,4.3,1.5,0.4,11.8,852
US30224P2002,2013-12-01,2009180,Board Member,EXTENDED STAY AMERICA INC,No,Jim Donald,CEO,Jim Donald joined this role on 13 Nov 2013,M,⋯,1,0,7,0,0,3.1,1.2,0.6,9.0,768
US30224P2002,2013-12-01,2009180,Board Member,EXTENDED STAY AMERICA INC,Yes,Anuj Agarwal,Independent Director,Anuj Agarwal joined this role on 13 Nov 2013,M,⋯,1,0,7,0,0,3.1,1.2,0.6,9.0,2435


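A single ISIN, US30224P2002, is shared by two company IDs (2009179, ESH Hospitality Inc, and 2009180, EXTENDED STAY AMERICA INC), so each of its annual report dates maps to two companies. Let's drop this ISIN.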

### Drop offending observation

In [28]:
na_summary <- filter(na_summary, isin != "US30224P2002")
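
One thing to keep in mind with this kind of filter: `filter()` drops rows where the condition evaluates to `NA`, so rows with a missing `isin` would be removed as well. Whether `na_summary` contains missing ISINs is not shown here. The sketch below illustrates the behaviour on a hypothetical toy tibble (the ISIN `"XX0000000001"` and the company IDs 1 and 2 are made up).

```r
library(dplyr)

# Hypothetical toy data with one missing ISIN.
toy <- tibble(
  isin      = c("US30224P2002", "XX0000000001", NA),
  companyid = c(2009179, 1, 2)
)

# `isin != "..."` is NA for the missing ISIN, so filter() drops that row too.
dropped_na <- filter(toy, isin != "US30224P2002")

# Handling NA explicitly keeps the row with the missing ISIN.
kept_na <- filter(toy, is.na(isin) | isin != "US30224P2002")

nrow(dropped_na)
nrow(kept_na)
```

If rows with a missing ISIN should survive the drop, the second form is the safer choice.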

### Check other variables

In [32]:
na_summary %>%
    select(isin, annualreportdate) %>%
    n_distinct()

na_summary %>%
    select(isin, annualreportdate, companyid, numberdirectors, nationalitymix, genderratio) %>%
    group_by(isin, annualreportdate) %>%
    n_distinct()

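Looks good: the two counts match, so each ISIN and report date pair carries exactly one combination of company ID and board-level variables.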
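
Rather than comparing the two counts by eye, the comparison can be turned into an assertion. This is a minimal sketch on a hypothetical toy tibble; the names `n_keys` and `n_combinations` are illustrative and do not appear in the notebook.

```r
library(dplyr)

# Hypothetical toy data: board-level values are constant within each key.
toy <- tibble(
  isin             = c("AAA", "AAA", "BBB"),
  annualreportdate = as.Date(c("2013-12-01", "2013-12-01", "2014-12-01")),
  companyid        = c(1, 1, 2),
  numberdirectors  = c(9, 9, 10)
)

# Number of distinct (isin, annualreportdate) keys.
n_keys <- toy %>%
  distinct(isin, annualreportdate) %>%
  nrow()

# Number of distinct combinations of key plus board-level variables.
n_combinations <- toy %>%
  distinct(isin, annualreportdate, companyid, numberdirectors) %>%
  nrow()

# Stops with an error if any key carries more than one combination of values.
stopifnot(n_keys == n_combinations)
```

With an assertion in place, rerunning the notebook on updated data fails loudly if the property no longer holds.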

### Multiple annual reports in a year

If a company has multiple annual reports in any year, the number of distinct company and year pairs will be smaller than the number of distinct company and report date pairs once we extract only the year from the annual report date column.

In [33]:
na_summary %>%
    select(companyid, annualreportdate) %>%
    n_distinct()

In [34]:
na_summary %>%
    mutate(year = lubridate::year(annualreportdate)) %>%
    select(companyid, year) %>%
    n_distinct()
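
To see why the two counts differ, consider a hypothetical toy tibble in which company 1 files two reports in 2010. The sketch below is illustrative only; `n_dates` and `n_years` are made-up names.

```r
library(dplyr)

# Hypothetical toy data: company 1 has two report dates in 2010.
toy <- tibble(
  companyid        = c(1, 1, 2, 2),
  annualreportdate = as.Date(c("2010-03-01", "2010-12-01",
                               "2010-12-01", "2011-12-01"))
)

# Distinct (companyid, annualreportdate) pairs.
n_dates <- toy %>%
  distinct(companyid, annualreportdate) %>%
  nrow()

# Distinct (companyid, year) pairs after collapsing dates to years.
n_years <- toy %>%
  mutate(year = lubridate::year(annualreportdate)) %>%
  distinct(companyid, year) %>%
  nrow()

# A positive difference means at least one company-year has several reports.
n_dates - n_years
```

Here the two dates for company 1 collapse into a single company-year, so the second count is smaller by one.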

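Let's drop these, too. The first cell below counts the company and report date pairs that remain once company-years with more than one report date are filtered out (it does not assign the result back to `na_summary`); the second lists the offending company-years.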

In [46]:
na_summary %>%
    mutate(year = lubridate::year(annualreportdate)) %>%
    group_by(companyid, year) %>%
    mutate(n = n_distinct(annualreportdate)) %>%
    filter(n == 1) %>%
    ungroup() %>%
    select(companyid, annualreportdate) %>%
    n_distinct()
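
To make the drop persistent, the filtered rows would need to be assigned to an object. A minimal sketch on a hypothetical toy tibble follows; `toy_clean` is an illustrative name, and the grouped `filter()` is one possible way to express the same condition as the `mutate()` plus `filter(n == 1)` steps.

```r
library(dplyr)

# Hypothetical toy data: company 1 has two report dates in 2010.
toy <- tibble(
  companyid        = c(1, 1, 2, 2),
  annualreportdate = as.Date(c("2010-03-01", "2010-12-01",
                               "2010-12-01", "2011-12-01"))
)

# Keep only company-years with a single report date, then drop the helper column.
toy_clean <- toy %>%
  mutate(year = lubridate::year(annualreportdate)) %>%
  group_by(companyid, year) %>%
  filter(n_distinct(annualreportdate) == 1) %>%
  ungroup() %>%
  select(-year)

toy_clean
```

Both rows of company 1 are removed, since its 2010 company-year has two report dates, while company 2 is kept in full.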

In [38]:
na_summary %>%
    mutate(year = lubridate::year(annualreportdate)) %>%
    group_by(companyid, year) %>%
    summarize(n = n_distinct(annualreportdate)) %>%
    filter(n > 1)

`summarise()` regrouping output by 'companyid' (override with `.groups` argument)



companyid,year,n
<dbl>,<dbl>,<int>
401,2010,2
569,2012,2
692,2006,2
725,2008,2
929,2014,2
1477,2015,2
1532,2010,2
1558,2016,2
1874,2005,2
2148,2005,2
