# Examining socio-demographic factors that influence European voters

## Leave or remain in the European Union? 

<img src='https://www.irishtimes.com/polopoly_fs/1.3657023.1539082433!/image/image.jpg' style='height:400px'/>

Photo: The Irish Times — [BREXIT: THE FACTS](https://www.irishtimes.com/news/world/brexit/brexit-the-facts)

<div class="list-group" id="list-tab" role="tablist">
  <h1 class="list-group-item list-group-item-action active" data-toggle="list"  role="tab" aria-controls="home">&nbsp;Table of Contents:</h1>
  <a class="list-group-item list-group-item-action" data-toggle="list" href="#1" role="tab" aria-controls="profile">1. Introduction<span class="badge badge-primary badge-pill">1</span></a>
  <a class="list-group-item list-group-item-action" data-toggle="list" href="#2" role="tab" aria-controls="messages">2. Data Cleaning & Wrangling<span class="badge badge-primary badge-pill">2</span></a>
   <a class="list-group-item list-group-item-action" data-toggle="list" href="#3" role="tab" aria-controls="messages">3. Survey Weights<span class="badge badge-primary badge-pill">3</span></a>
   <a class="list-group-item list-group-item-action" data-toggle="list" href="#4" role="tab" aria-controls="messages">4. Data Exploration<span class="badge badge-primary badge-pill">4</span></a>
   <a class="list-group-item list-group-item-action" data-toggle="list" href="#5" role="tab" aria-controls="messages">5. Previous Iterations<span class="badge badge-primary badge-pill">5</span></a>
</div>

<a id="1"></a> <br>
<font size="+3" color="black"><b>1 - Introduction</b></font><br><a id="1"></a>
<br> 

This report investigates a dataset provided by the [European Social Survey](https://www.europeansocialsurvey.org/) (ESS), a biennial cross-national survey of the attitudes and behaviour of European citizens.

The study focuses on which factors can influence whether a person would vote for their country to leave or remain a member of the European Union. After analyzing each variable individually, I will build a model that, once trained, can predict the probability that a person would vote for their country to leave the European Union.



## Data Dictionary

- __CNTRY__ Country
- __EDUYRS__ Years of full-time education completed
- __EISCED__ Highest level of education, ES - ISCED
- __NETUSOFT__ Internet use, how often
- __UEMP3M__ Ever unemployed and seeking work for a period more than three months
- __MBTRU__ Member of trade union or similar organisation
- __VTEURMMB__ Would vote for your country to remain member of European Union or leave
- __GNDR__ Gender
- __YRBRN__  Year of birth
- __AGEA__ Age of respondent. Calculation based on year of birth and year of interview

__Questions__

**EDUYRS** About how many years of education have you completed, whether full-time or part-time? Please report these in full-time equivalents and include compulsory years of schooling.
<br>**EISCED** Generated variable: Highest level of education, ES - ISCED 9 (What is the highest level of education you have successfully completed?)
<br>**NETUSOFT** People can use the internet on different devices such as computers, tablets and smartphones. How often do you use the internet on these or any other devices, whether for work or personal use?
<br>**UEMP3M** Have you ever been unemployed and seeking work for a period of more than three months?
<br>**MBTRU** Are you or have you ever been a member of a trade union or similar organisation? IF YES, is that currently or previously?
<br>**VTEURMMB** Imagine there were a referendum in [country] tomorrow about membership of the European Union. Would you vote for [country] to remain a member of the European Union or to leave the European Union?
<br>**YRBRN** And in what year were you born?

### International Standard Classification of Education (ISCED)

ISCED is the reference international classification for organising education programmes and related qualifications by levels and fields. ISCED 2011 (levels of education) has been implemented in all EU data collections since 2014.

__Levels__

- ISCED 0: Early childhood education (‘less than primary’ for educational attainment)
- ISCED 1: Primary education
- ISCED 2: Lower secondary education
- ISCED 3: Upper secondary education
- ISCED 4: Post-secondary non-tertiary education
- ISCED 5: Short-cycle tertiary education
- ISCED 6: Bachelor’s or equivalent level
- ISCED 7: Master’s or equivalent level
- ISCED 8: Doctoral or equivalent level

More info about ISCED can be found [here](https://ec.europa.eu/eurostat/statistics-explained/index.php/International_Standard_Classification_of_Education_(ISCED)).

### Notebook settings

In [1]:
# Change the default plots size 
options(repr.plot.width=15, repr.plot.height=10)
options(warn=-1)

### Libraries

In [71]:
library(essurvey)
library(dplyr)
library(ggplot2)
library(ggthemes)
library(gghighlight)
library(foreign)
library(survey)
library(srvyr)

## Data Extraction

__Selecting the variables which will be used for the data analysis__

In [3]:
survey_rawdata <- read.spss("ESS9e03_1.sav", use.value.labels=TRUE, max.value.labels=Inf, to.data.frame=TRUE)

In [4]:
head(survey_rawdata)

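| | name | essround | edition | proddate | idno | cntry | nwspol | netusoft | netustm | ppltrst | ⋯ | inwemm | inwtm | dweight | pspwght | pweight | anweight | domain | prob | stratum | psu |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| | &lt;chr&gt; | &lt;dbl&gt; | &lt;chr&gt; | &lt;chr&gt; | &lt;dbl&gt; | &lt;fct&gt; | &lt;fct&gt; | &lt;fct&gt; | &lt;fct&gt; | &lt;fct&gt; | ⋯ | &lt;fct&gt; | &lt;dbl&gt; | &lt;dbl&gt; | &lt;dbl&gt; | &lt;dbl&gt; | &lt;dbl&gt; | &lt;dbl&gt; | &lt;dbl&gt; | &lt;dbl&gt; | &lt;dbl&gt; |
| 1 | ESS9e03_1 | 9 | 3.1 | 17.02.2021 | 27 | Austria | 60 | Every day | 180.0 | 2 | ⋯ | 7 | 46 | 0.5811743 | 0.2181114 | 0.3020914 | 0.06588958 | 2 | 0.0011755485 | 59 | 1688 |
| 2 | ESS9e03_1 | 9 | 3.1 | 17.02.2021 | 137 | Austria | 10 | Every day | 20.0 | 7 | ⋯ | 6 | 73 | 1.0627724 | 0.4134733 | 0.3020914 | 0.12490674 | 2 | 0.0006428456 | 79 | 88 |
| 3 | ESS9e03_1 | 9 | 3.1 | 17.02.2021 | 194 | Austria | 60 | Most days | 180.0 | 5 | ⋯ | 48 | 92 | 1.3765086 | 2.2702928 | 0.3020914 | 0.685836 | 2 | 0.0004963272 | 11 | 938 |
| 4 | ESS9e03_1 | 9 | 3.1 | 17.02.2021 | 208 | Austria | 45 | Every day | 120.0 | 3 | ⋯ | 49 | 134 | 0.9933993 | 0.3864834 | 0.3020914 | 0.11675334 | 2 | 0.0006877382 | 74 | 1998 |
| 5 | ESS9e03_1 | 9 | 3.1 | 17.02.2021 | 220 | Austria | 30 | Never | | 5 | ⋯ | 39 | 40 | 0.3773534 | 1.0321022 | 0.3020914 | 0.31178924 | 2 | 0.0018105009 | 99 | 601 |
| 6 | ESS9e03_1 | 9 | 3.1 | 17.02.2021 | 254 | Austria | 45 | Only occasionally | | 8 | ⋯ | 12 | 52 | 1.4793528 | 0.5755447 | 0.3020914 | 0.17386711 | 2 | 0.0004618227 | 77 | 68 |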


In [37]:
european_survey <- survey_rawdata[,c("cntry", "eduyrs", "eisced", "netusoft", "uemp3m", "mbtru", "vteurmmb", "yrbrn", "agea", "gndr", "anweight", "psu", "stratum")]

In [6]:
head(european_survey)

| | cntry | eduyrs | eisced | netusoft | uemp3m | mbtru | vteurmmb | yrbrn | agea | gndr | anweight | psu | stratum |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| | &lt;fct&gt; | &lt;fct&gt; | &lt;fct&gt; | &lt;fct&gt; | &lt;fct&gt; | &lt;fct&gt; | &lt;fct&gt; | &lt;fct&gt; | &lt;fct&gt; | &lt;fct&gt; | &lt;dbl&gt; | &lt;dbl&gt; | &lt;dbl&gt; |
| 1 | Austria | 12 | ES-ISCED IIIb, lower tier upper secondary | Every day | No | No | Remain member of the European Union | 1975 | 43 | Male | 0.06588958 | 1688 | 59 |
| 2 | Austria | 12 | ES-ISCED IIIb, lower tier upper secondary | Every day | No | Yes, previously | Remain member of the European Union | 1951 | 67 | Male | 0.12490674 | 88 | 79 |
| 3 | Austria | 12 | ES-ISCED II, lower secondary | Most days | No | No | Leave the European Union | 1978 | 40 | Female | 0.685836 | 938 | 11 |
| 4 | Austria | 11 | ES-ISCED IIIb, lower tier upper secondary | Every day | No | No | Remain member of the European Union | 1955 | 63 | Male | 0.11675334 | 1998 | 74 |
| 5 | Austria | 8 | ES-ISCED II, lower secondary | Never | No | No | Remain member of the European Union | 1947 | 71 | Female | 0.31178924 | 601 | 99 |
| 6 | Austria | 13 | ES-ISCED IIIb, lower tier upper secondary | Only occasionally | No | Yes, previously | Remain member of the European Union | 1954 | 64 | Male | 0.17386711 | 68 | 77 |


In [7]:
nrow(european_survey)

<a id="2"></a> <br>
<font size="+3" color="black"><b>2 - Data Cleaning & Wrangling</b></font><br><a id="2"></a>
<br> 

In [38]:
# Checking for NA's in the dataset
sapply(european_survey, function(x) sum(is.na(x)))

In [39]:
# For the purpose of this analysis, keep only Leave and Remain votes;
# blank, spoiled, abstaining and ineligible responses are set to NA
european_survey$vteurmmb <- as.character(european_survey$vteurmmb)
european_survey$vteurmmb[european_survey$vteurmmb == "Remain member of the European Union"] <- "Remain"
european_survey$vteurmmb[european_survey$vteurmmb == "Leave the European Union"] <- "Leave"
european_survey$vteurmmb[european_survey$vteurmmb == "Would submit a blank ballot paper"] <- NA
european_survey$vteurmmb[european_survey$vteurmmb == "Would spoil the ballot paper"] <- NA
european_survey$vteurmmb[european_survey$vteurmmb == "Would not vote"] <- NA
european_survey$vteurmmb[european_survey$vteurmmb == "Not eligible to vote"] <- NA
european_survey$vteurmmb <- as.factor(european_survey$vteurmmb)

__Aggregation Levels for variable EISCED__

| Level            | ISCED      |
| :-----------:    |:----------:|
| Low education    | Levels 0-2 |
| Medium education | Levels 3-4 |
| High education   | Levels 5-8 |

In [47]:
# Cleaning responses that are not able to fit into ISCED
european_survey$eisced <- as.character(european_survey$eisced)
european_survey$eisced[european_survey$eisced == "Not possible to harmonise into ES-ISCED"] <- NA
european_survey$eisced[european_survey$eisced == "Other"] <- NA

In [72]:
# Creating a new feature Education by aggregating the ISCED levels
# Low, Medium and High Education
european_survey <- european_survey %>% 
  mutate(Education = case_when(
      eisced == "ES-ISCED I , less than lower secondary" ~ "Low Education",
      eisced == "ES-ISCED II, lower secondary" ~ "Low Education",
      eisced == "ES-ISCED IIIb, lower tier upper secondary" ~ "Medium Education",
      eisced == "ES-ISCED IIIa, upper tier upper secondary" ~ "Medium Education",
      eisced == "ES-ISCED IV, advanced vocational, sub-degree" ~ "Medium Education",
      eisced == "ES-ISCED V1, lower tertiary education, BA level" ~ "High Education",
      eisced == "ES-ISCED V2, higher tertiary education, >= MA level" ~ "High Education",
      TRUE ~ eisced))
european_survey$Education <- as.factor(european_survey$Education)
european_survey$eisced <- as.factor(european_survey$eisced)

In [74]:
# Cleaning NA values
df_european_survey <- european_survey[complete.cases(european_survey), ]
sapply(df_european_survey, function(x) sum(is.na(x)))

In [76]:
# Rebuild the factor from its character values so that unused levels are dropped, leaving Yes and No
df_european_survey$uemp3m <- as.character(df_european_survey$uemp3m)
df_european_survey$uemp3m <- as.factor(df_european_survey$uemp3m)

In [77]:
# For the purpose of this analysis, treat "Yes, currently" and "Yes, previously" as a simple Yes:
# the respondent has at some point been a member of a trade union or similar organisation
df_european_survey$mbtru <- as.character(df_european_survey$mbtru)
df_european_survey$mbtru[df_european_survey$mbtru == "Yes, currently"] <- "Yes"
df_european_survey$mbtru[df_european_survey$mbtru == "Yes, previously"] <- "Yes"
df_european_survey$mbtru <- as.factor(df_european_survey$mbtru)


In [61]:
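# Transforming the variable Years of Education to numeric
# eduyrs is a factor, so convert through character to get the actual values rather than the level codes
df_european_survey$eduyrs <- as.numeric(as.character(df_european_survey$eduyrs))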
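
One thing to keep in mind for conversions like this one: when a column is stored as a factor, calling `as.numeric()` directly on it returns the internal level codes, not the numbers shown in the labels. The following is a minimal sketch with a made-up vector (the values are illustrative only, not ESS data) that shows the difference.

```r
# Hypothetical toy factor, standing in for a column such as eduyrs
years <- factor(c("12", "8", "13", "12"))

# Direct conversion returns the internal level codes
# (the levels sort as text: "12", "13", "8")
codes <- as.numeric(years)

# Converting to character first recovers the values the labels represent
values <- as.numeric(as.character(years))

print(codes)   # 1 3 2 1
print(values)  # 12 8 13 12
```

Any label that is not a number becomes `NA` in the second conversion, so it is worth re-checking for missing values afterwards.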

In [68]:
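# Creating a new feature based on age bands (<20, 20-39, 40-65, >65)
# agea is a factor, so convert through character to get the actual ages rather than the level codes
df_european_survey$agea <- as.numeric(as.character(df_european_survey$agea))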
df_european_survey <- df_european_survey %>% 
  mutate(Age_Band = case_when(
    agea < 20 ~ "<20",
    agea >= 20 & agea < 40 ~ "20-39",
    agea >= 40 & agea <= 65 ~ "40-65",
    agea > 65 ~ ">65"))
df_european_survey$Age_Band <- as.factor(df_european_survey$Age_Band)
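
The same banding can also be sketched with base R's `cut()`, which keeps the band boundaries in one place and returns a factor whose levels are already in the intended order. The ages below are invented for illustration, and the breaks assume ages are recorded in whole years, so that they mirror the `case_when()` rules above.

```r
# Hypothetical ages in whole years, chosen to sit on the band boundaries
ages <- c(18, 20, 39, 40, 65, 66)

# Intervals are closed on the right by default:
# (-Inf, 19], (19, 39], (39, 65], (65, Inf)
age_band <- cut(
  ages,
  breaks = c(-Inf, 19, 39, 65, Inf),
  labels = c("<20", "20-39", "40-65", ">65")
)

print(age_band)
```

Because the levels follow the order of `labels`, a plot built on this factor would not need its levels reordered first.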

In [75]:
head(df_european_survey)

| | cntry | eduyrs | eisced | netusoft | uemp3m | mbtru | vteurmmb | yrbrn | agea | gndr | anweight | psu | stratum | Education |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| | &lt;fct&gt; | &lt;fct&gt; | &lt;fct&gt; | &lt;fct&gt; | &lt;fct&gt; | &lt;fct&gt; | &lt;fct&gt; | &lt;fct&gt; | &lt;fct&gt; | &lt;fct&gt; | &lt;dbl&gt; | &lt;dbl&gt; | &lt;dbl&gt; | &lt;fct&gt; |
| 1 | Austria | 12 | ES-ISCED IIIb, lower tier upper secondary | Every day | No | No | Remain | 1975 | 43 | Male | 0.06588958 | 1688 | 59 | Medium Education |
| 2 | Austria | 12 | ES-ISCED IIIb, lower tier upper secondary | Every day | No | Yes, previously | Remain | 1951 | 67 | Male | 0.12490674 | 88 | 79 | Medium Education |
| 3 | Austria | 12 | ES-ISCED II, lower secondary | Most days | No | No | Leave | 1978 | 40 | Female | 0.685836 | 938 | 11 | Low Education |
| 4 | Austria | 11 | ES-ISCED IIIb, lower tier upper secondary | Every day | No | No | Remain | 1955 | 63 | Male | 0.11675334 | 1998 | 74 | Medium Education |
| 5 | Austria | 8 | ES-ISCED II, lower secondary | Never | No | No | Remain | 1947 | 71 | Female | 0.31178924 | 601 | 99 | Low Education |
| 6 | Austria | 13 | ES-ISCED IIIb, lower tier upper secondary | Only occasionally | No | Yes, previously | Remain | 1954 | 64 | Male | 0.17386711 | 68 | 77 | Medium Education |


<a id="3"></a> <br>
<font size="+3" color="black"><b>3 - Survey Weights</b></font><br><a id="3"></a>
<br> 



In ESS, the clustering variable is called ‘psu’, stratification is indicated by ‘stratum’, and weighting by ‘anweight’. Together these three variables describe the survey design.

Details about how ESS weights the data can be found [here](https://www.europeansocialsurvey.org/docs/methodology/ESS_weighting_data_1_1.pdf).
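
To illustrate why the weights matter, here is a minimal base R sketch comparing an unweighted and a weighted share of Leave votes. The five respondents and their weights are made up for illustration; only the column name `anweight` comes from ESS. It produces point estimates only: standard errors that respect the design also need the ‘psu’ and ‘stratum’ information.

```r
# Hypothetical toy sample: one vote and one analysis weight per respondent
toy <- data.frame(
  vote     = c("Leave", "Remain", "Remain", "Leave", "Remain"),
  anweight = c(2.0, 0.5, 0.5, 1.0, 1.0)
)

# 1 for a Leave vote, 0 otherwise
is_leave <- as.numeric(toy$vote == "Leave")

# Unweighted share: every respondent counts equally
unweighted_leave <- mean(is_leave)

# Weighted share: each respondent counts in proportion to anweight
weighted_leave <- weighted.mean(is_leave, w = toy$anweight)

print(c(unweighted = unweighted_leave, weighted = weighted_leave))
```

In this toy sample the Leave voters carry larger weights, so the weighted share (0.6) is higher than the unweighted one (0.4).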

In [None]:
weighted_df_ess <- european_survey %>% as_survey_design(ids=psu, strata=stratum, weights=anweight)

In [None]:
# Making a frequency table using a survey object
#weighted_df_ess %>% group_by(cntry) %>% summarize(proportion=survey_mean(), total=survey_total())

In [None]:
weighted_df_ess

In [78]:
round(prop.table(table(european_survey$Education)),2)
round(prop.table(table(df_european_survey$netusoft)),2)
round(prop.table(table(df_european_survey$vteurmmb)),2)
round(prop.table(table(df_european_survey$uemp3m)),2)
round(prop.table(table(df_european_survey$mbtru)),2)
round(prop.table(table(df_european_survey$Age_Band)),2)


  High Education    Low Education Medium Education 
            0.24             0.25             0.51 


             Never  Only occasionally A few times a week          Most days 
              0.17               0.06               0.06               0.08 
         Every day 
              0.64 


 Leave Remain 
  0.13   0.87 


  No  Yes 
0.71 0.29 


 No Yes 
0.6 0.4 

<a id="4"></a> <br>
<font size="+3" color="black"><b>4 - Data Exploration</b></font><br><a id="4"></a>
<br> 

In [None]:
# Top 10 country and gender groups with the highest share of Leave votes
# (share = proportion of respondents within each country and gender)
df_european_survey %>% group_by(cntry,gndr,vteurmmb) %>% 
    summarise(total = n()) %>% 
    mutate(freq = round(total / sum(total),2)) %>% 
    filter(vteurmmb == "Leave") %>% 
    ungroup() %>% 
    arrange(desc(freq), desc(total)) %>% 
    head(10)

# Same ranking for the share of Remain votes
df_european_survey %>% group_by(cntry,gndr,vteurmmb) %>% 
    summarise(total = n()) %>% 
    mutate(freq = round(total / sum(total),2)) %>% 
    filter(vteurmmb == "Remain") %>% 
    ungroup() %>% 
    arrange(desc(freq), desc(total)) %>% 
    head(10)
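
The grouped shares above can be cross-checked with a base R contingency table. This sketch uses a small invented data frame (the column names `cntry` and `vteurmmb` follow the survey extract, but the rows are not ESS data) and `prop.table()` with `margin = 1`, so that each country's row sums to 1.

```r
# Hypothetical toy data: two countries, six respondents
toy <- data.frame(
  cntry    = c("A", "A", "A", "A", "B", "B"),
  vteurmmb = c("Leave", "Remain", "Remain", "Remain", "Leave", "Remain")
)

# Counts per country (rows) and vote (columns)
counts <- table(toy$cntry, toy$vteurmmb)

# Row-wise proportions: the share of each vote within a country
shares <- round(prop.table(counts, margin = 1), 2)

# Rank countries by their share of Leave votes, highest first
leave_rank <- sort(shares[, "Leave"], decreasing = TRUE)

print(shares)
print(leave_rank)
```

Like the grouped summaries, this is an unweighted share of respondents, not a population estimate.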

In [None]:
# Top 10 countries with the highest share of respondents who have ever been
# unemployed and seeking work for more than three months
df_european_survey %>% group_by(cntry,uemp3m) %>% 
    summarise(total = n()) %>% 
    mutate(freq = round(total / sum(total),2)) %>% 
    filter(uemp3m == "Yes") %>% 
    arrange(desc(freq), desc(total)) %>% 
    head(10)

# Top 10 countries with the highest share of respondents who have not
df_european_survey %>% group_by(cntry,uemp3m) %>% 
    summarise(total = n()) %>% 
    mutate(freq = round(total / sum(total),2)) %>% 
    filter(uemp3m == "No") %>% 
    arrange(desc(freq), desc(total)) %>% 
    head(10)

In [None]:
# Top 5 countries with the highest Union membership (current or previous)
df_european_survey %>% group_by(cntry,mbtru) %>% 
    summarise(total = n()) %>% 
    mutate(freq = round(total / sum(total),2)) %>% 
    filter(mbtru == "Yes") %>% 
    arrange(desc(freq), desc(total)) %>% 
    head(5)

In [None]:
# Top 5 countries with the lowest Union membership (highest share of "No")
df_european_survey %>% group_by(cntry,mbtru) %>% 
    summarise(total = n()) %>% 
    mutate(freq = round(total / sum(total),2)) %>% 
    filter(mbtru == "No") %>% 
    arrange(desc(freq), desc(total)) %>% 
    head(5)

In [None]:
# Top 10 countries by number of respondents who would vote Leave
df_european_survey %>% filter(vteurmmb == "Leave") %>% group_by(cntry) %>% summarise(n = n()) %>%
  arrange(desc(n)) %>% head(10) %>%   # sort before taking the first 10 rows
  ggplot(aes(x=reorder(cntry, n), y=n)) + 
  geom_bar(stat="identity", width = .7) +
  labs(x = "Country",
       y = "Respondents who would vote Leave",
       title = "Top 10 Countries with the highest votes for Leave") +
  theme_minimal() + coord_flip()

In [None]:
# Share of Leave and Remain votes within each age band
# position="fill" scales every bar to 1 so the bars show proportions, as the title says
df_european_survey %>%
  mutate(Age_Band = factor(Age_Band, levels=c("<20", "20-39", "40-65", ">65"))) %>%
  ggplot(aes(x=Age_Band)) + 
  geom_bar(aes(fill=vteurmmb), stat="count", position="fill", width = .6) +
  scale_fill_brewer(palette='Set2') +
  labs( x = "Age Band",
        y = "Share of respondents",
        fill = "Vote",
        title = "Proportion of votes based on Age band") + theme_minimal() +
  theme(plot.title=element_text(vjust=.5,family='', face='bold', colour='#636363', size=20))

In [None]:
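# Share of Leave and Remain votes within each gender
# position="fill" scales every bar to 1 so the bars show proportions, as the title says
df_european_survey %>%
  ggplot(aes(x=gndr)) + 
  geom_bar(aes(fill=vteurmmb), stat="count", position="fill", width = .4) +
  scale_fill_brewer(palette='Set2') +
  labs( x = "Gender",
        y = "Share of respondents",
        fill = "Vote",
        title = "Proportion of votes based on Gender") + theme_minimal() +
  theme(plot.title=element_text(vjust=.5,family='', face='bold', colour='#636363', size=20))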

<a id="5"></a> <br>
<font size="+3" color="black"><b>5 - Previous Iterations</b></font><br><a id="5"></a>
<br> 

__Alternative for Proportion of votes__: I tried a stacked bar graph, but because there is little data for respondents under 20 it did not present well, so I switched Age and Vote.

In [None]:
# Stacked counts with Vote on the x-axis and Age Band as the fill
df_european_survey %>%
  ggplot(aes(x=vteurmmb)) + 
  geom_bar(aes(fill=Age_Band), stat="count", width = .6) +
  scale_fill_brewer(palette='Set2') +
  labs( x = "Vote",
        y = "Number of respondents",
        fill = "Age Band",
        title = "Number of votes based on Age band") + theme_minimal() +
  theme(plot.title=element_text(vjust=.5,family='', face='bold', colour='#636363', size=20))