## Language knowledge and place of birth as predictors of independence support in Catalonia

#### Introduction

Catalonia, a Spanish region of 7.5 million inhabitants, has been in the spotlight in recent months because a significant part of its population is pushing for political independence from Spain. Catalonia has been under Spanish administration for centuries, but its political allegiance has often been under a [shadow of doubt](https://en.wikipedia.org/wiki/Nueva_Planta_decrees). Even though massive immigration from other Spanish regions during the 20th century seemed to herald convergence towards a common Spanish political culture, the independence issue has reemerged. In this post we look for sociological variables that are good predictors of independence support and, where possible, for causal relationships behind them. We base our discussion on the results of recent regional elections published by the [Catalan Government](http://governacio.gencat.cat/ca/pgov_ambits_d_actuacio/pgov_eleccions/pgov_dades_electorals/) and on sociological data from [IDESCAT](https://www.idescat.cat/), the Catalan Institute of Statistics.

$$P(A\mid B) = \frac{P(B\mid A)P(A)}{P(B)}$$ 

#### Data preprocessing

Electoral data at the municipal level can be downloaded from the Catalan Government site in CSV format. We focused on the regional elections held on 27 September 2015, where the pro- and anti-independence blocs, except for one party, were clearly identifiable. A more recent election took place on 21 December 2017, with a higher turnout but very similar percentages in terms of independence support. Sociological data such as place of birth, level of studies or income level, on the other hand, had to be scraped from IDESCAT (the scraping script is available on this GitHub repo). The data are from the same year as the election. Once both datasets were available, they were merged using unique municipality codes. The result was saved in a .csv file that can be easily imported into a pandas dataframe.
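
The merge on municipality codes can be sketched as follows. This is only an illustration under stated assumptions: the two toy frames, the column name `municipality_code`, the codes and all the values are made up and do not come from the actual files. The point is the key-based join and the check for rows that have no partner in the other source.

```python
import pandas as pd

# Hypothetical toy frames standing in for the electoral and IDESCAT data.
# Codes are kept as strings so that leading zeros are not lost.
electoral = pd.DataFrame({
    "municipality_code": ["00001", "00002", "00003"],
    "Independ_pct": [47.2, 60.1, 85.3],   # made-up values
})
sociological = pd.DataFrame({
    "municipality_code": ["00001", "00002", "00004"],
    "pct_spa": [17.5, 12.0, 14.8],        # made-up values
})

# Outer join on the shared code; indicator=True adds a `_merge` column
# that flags rows present in only one of the two sources.
merged = electoral.merge(sociological, on="municipality_code",
                         how="outer", indicator=True)

unmatched = merged[merged["_merge"] != "both"]
matched = merged[merged["_merge"] == "both"].drop(columns="_merge")

print(matched)
print(len(unmatched), "rows without a partner in the other source")
```

Inspecting `unmatched` before discarding it is a cheap way to catch codes that were mistyped or formatted differently in the two sources.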

#### Exploring Data

Catalonia currently has 947 municipalities, the vast majority of which are small villages of a few hundred inhabitants. We had to drop data for Talarn municipality, because some of it was clearly wrong.

In [1]:
import pandas as pd
import numpy as np
DF_ALL_MUNIC_DATA = pd.read_csv('data/catalan_elections/DF_ALL_MUNIC_DATA.csv')

The imported dataframe contains all electoral data together with all scraped sociological variables at the municipal level. We will focus on four main types of sociological data (the independent variables):

   - Place of birth
   - Income level
   - Educational level
   - Language knowledge
    
To do this, we select the appropriate columns from the dataframe:

In [9]:
# select relevant columns
attributes = ['Independ_pct', 'pct_cat', 'pct_spa', 'pct_foreign', 'RFDB_idx', 'pct_Univ',
              'pct_1erGrau', 'pct_cat_speakers', 'Tot']

   1. __'pct_cat'__ is, for each municipality, the percentage of people born in Catalonia, and __'pct_spa'__ is the percentage of people born in another Spanish region. __'pct_foreign'__ is the percentage of people born outside the Spanish State (most of whom cannot vote).
   2. __'RFDB_idx'__ is the average income level of the municipality.
   3. __'pct_Univ'__ is the percentage of people over 16 in the municipality with a university degree. __'pct_1erGrau'__ is the percentage of people over 16 who attended primary school for a few years but never completed basic studies.
   4. __'pct_cat_speakers'__ is the percentage of the population of the municipality able to communicate in Catalan.
    
Unemployment data (measured at the end of the year, in winter) was discarded because of the high seasonality it showed in many villages, which makes it a highly unreliable indicator.

Our dependent variable will be __'Independ_pct'__, the percentage of the population in each municipality that voted for unequivocally pro-independence parties in the elections of 27 September 2015. There were two such parties: JuntsxSi (a coalition of the two main Catalan nationalist parties) and CUP, a far-left party that scored higher than ever in this election.

pandas has a handy method 'describe' that summarizes the basic statistics of each column of the dataframe:

In [10]:
# describe data
DF_ALL_MUNIC_DATA[attributes].describe()

| | Independ_pct | pct_cat | pct_spa | pct_foreign | RFDB_idx | pct_Univ | pct_1erGrau | pct_cat_speakers | Tot |
|---|---|---|---|---|---|---|---|---|---|
| count | 946.0 | 946.0 | 946.0 | 946.0 | 216.0 | 152.0 | 462.0 | 936.0 | 946.0 |
| mean | 69.464757 | 79.048666 | 10.2575 | 10.693833 | 93.427315 | 19.616972 | 14.743326 | 85.518522 | 7951.276 |
| std | 16.4483 | 9.74593 | 5.886323 | 6.624858 | 13.261371 | 7.382937 | 3.874346 | 8.778367 | 55728.71 |
| min | 14.6 | 40.866369 | 0.0 | 0.0 | 58.8 | 6.819345 | 5.654776 | 50.668964 | 27.0 |
| 25% | 61.075 | 73.21124 | 5.811894 | 5.988154 | 84.075 | 14.118196 | 12.130013 | 80.569736 | 316.25 |
| 50% | 74.045 | 79.884299 | 8.839505 | 9.414203 | 94.35 | 18.278716 | 14.622246 | 87.442369 | 946.5 |
| 75% | 81.78 | 86.375442 | 13.725421 | 13.822862 | 101.025 | 22.151323 | 16.917358 | 92.149169 | 3717.75 |
| max | 96.73 | 99.159664 | 30.069723 | 45.240859 | 129.2 | 46.321718 | 29.16935 | 100.0 | 1608746.0 |


Unfortunately, educational and income data are available for only a minority of the municipalities (between 152 and 462 of the 946, depending on the variable). In contrast, place of birth is available for all of them and language knowledge for 936. A few comments on the data:
 - __The distribution of the population__: the __'Tot'__ variable is the number of inhabitants of each municipality. 75% of them have fewer than about 3,700 inhabitants. The maximum corresponds to Barcelona, with more than 1.6 million people; the minimum, to a village of only 27 souls.
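 - __The distribution of independence support__: across municipalities, __'Independ_pct'__ ranges from 14.6% to 96.7%, with a mean of 69.5% and a median of 74.0%. These figures are not weighted by population, so a village of a few hundred inhabitants counts as much as Barcelona.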
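
The uneven coverage of the variables can be quantified directly. The sketch below uses a hypothetical four-row frame (the column names are borrowed from the dataset, the values and the pattern of missing entries are made up) to show two things: the share of rows that have a value for each column, and how few rows survive once every column is required to be present.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; NaN marks a value that is not published
# for a municipality. Values are illustrative only.
toy = pd.DataFrame({
    "pct_spa": [5.8, 8.8, 13.7, 30.1],
    "RFDB_idx": [np.nan, 94.3, np.nan, 101.0],
    "pct_Univ": [np.nan, np.nan, np.nan, 22.1],
})

# Share of rows with a value, per column (sorted from worst to best).
coverage = toy.notna().mean().sort_values()
print(coverage)

# Rows that remain when every column must be present.
complete_rows = toy.dropna()
print(len(complete_rows), "of", len(toy), "rows are complete")
```

Any model fitted after a plain `dropna()` only sees the complete rows, so its conclusions apply to that subset rather than to all municipalities.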

A first step towards understanding potential relationships between the dependent and the independent variables is to compute correlations. pandas again has a handy method, 'corr', to do exactly that.

In [3]:
# Pearson correlation between every pair of selected columns
corr_matrix = DF_ALL_MUNIC_DATA[attributes].corr()
# sort ascending and drop the last entry, which is Independ_pct against itself (1.0)
corr_matrix['Independ_pct'].sort_values()[:-1]

pct_spa            -0.850738
pct_foreign        -0.266279
Tot                -0.192456
RFDB_idx           -0.106550
pct_1erGrau         0.200860
pct_Univ            0.396213
pct_cat             0.694831
pct_cat_speakers    0.751488
Name: Independ_pct, dtype: float64
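
One detail worth keeping in mind when reading these coefficients: `DataFrame.corr` excludes missing values pair by pair, so coefficients involving sparsely populated columns such as 'RFDB_idx' or 'pct_Univ' rest on far fewer municipalities than the others. The sketch below illustrates this with a hypothetical toy frame (column names `y`, `a`, `b` and all values are made up).

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: column `b` is missing for the first two rows.
toy = pd.DataFrame({
    "y": [1.0, 2.0, 3.0, 4.0, 5.0],
    "a": [2.0, 4.1, 5.9, 8.2, 9.9],
    "b": [np.nan, np.nan, 3.0, 2.0, 1.0],
})

# Coefficient as returned by pandas (missing values dropped pairwise).
r_pairwise = toy.corr().loc["y", "b"]

# Same coefficient computed by hand on the rows where both values exist.
sub = toy[["y", "b"]].dropna()
r_manual = np.corrcoef(sub["y"], sub["b"])[0, 1]
n_used = len(sub)

print(r_pairwise, r_manual, n_used)
```

Reporting the number of rows behind each coefficient alongside the coefficient itself makes the comparison between columns fairer.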

Clearly, the strongest correlations of independence support are with place of birth and language knowledge. We will see later that place of birth and language knowledge are highly correlated, and thus they can be considered as a single predictor. The correlation is particularly strong for the share of people born in other Spanish regions (a very strong negative correlation of -0.85).

The percentage of people born in Spain outside Catalonia is actually a proxy for the intensity of Spanish cultural origins among the population of the municipality. The variable __'pct_spa'__ measures the immigration level from other Spanish regions, but it does not tell whether this immigration is recent or dates from a few decades ago. However, we know that during the second half of the 20th century there was a massive [immigration wave](https://ca.wikipedia.org/wiki/Demografia_de_Catalunya) that concentrated in the urban area around Barcelona (total population grew from 3.2 million in 1950 to 5.1 million in 1970). In more recent decades, the migration pressure from other Spanish regions has decreased. It is therefore likely that the higher the __'pct_spa'__ value, the more Catalan-born people in the municipality have recent Spanish roots in their family: they are the offspring of the migrants of the 1950s to 1970s. That is to say, the percentage of people with Spanish origins is in general higher than the value expressed by __'pct_spa'__, because this indicator is high in places where immigration was high in past decades.

Conversely, a high level of __'pct_cat'__ tends to indicate that most people have Catalan roots, but it masks the fact that some of the people born in Catalonia have non-Catalan origins because of the high immigration levels in the second half of the 20th century. Therefore, the percentage of people with Catalan roots is in general lower than the percentage expressed by __'pct_cat'__.

Since the correlation coefficient only measures linear relationships and may completely miss nonlinear ones, it is always useful to visualize the relations between variables in order to explore potential nonlinearities.
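
A small synthetic example shows how the Pearson coefficient can miss a perfectly deterministic relationship. The data below is generated for illustration only and has nothing to do with the electoral dataset: `y_quadratic` is an exact function of `x`, yet its linear correlation with `x` is essentially zero because the relationship is symmetric around zero.

```python
import numpy as np

# Synthetic data, for illustration only.
x = np.linspace(-1.0, 1.0, 201)
y_linear = 2.0 * x + 1.0   # exact linear dependence on x
y_quadratic = x ** 2       # exact, but nonlinear, dependence on x

r_linear = np.corrcoef(x, y_linear)[0, 1]
r_quadratic = np.corrcoef(x, y_quadratic)[0, 1]

print(f"linear:    r = {r_linear:.3f}")
print(f"quadratic: r = {r_quadratic:.3f}")
```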

#### Graphs

Scatter matrix from dataframe
<img alt="" src="figures/catalan_elections/data_scatter_matrix.png"/>
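
A figure of this kind can be produced with `pandas.plotting.scatter_matrix`. The sketch below is not the code that generated the image above: it runs on synthetic data (random numbers with an arbitrary built-in linear trend, reusing three column names from the dataset purely as labels) so that it is self-contained.

```python
import numpy as np
import pandas as pd
from pandas.plotting import scatter_matrix

# Synthetic data, for illustration only: NOT the electoral dataset.
rng = np.random.default_rng(0)
pct_spa = rng.uniform(0, 30, size=200)
toy = pd.DataFrame({
    "pct_spa": pct_spa,
    "pct_cat_speakers": 95 - pct_spa + rng.normal(0, 3, size=200),
    "Independ_pct": 90 - 2 * pct_spa + rng.normal(0, 5, size=200),
})

# One scatter plot per pair of columns, histograms on the diagonal.
axes = scatter_matrix(toy, figsize=(8, 8), diagonal="hist", alpha=0.5)
```

With the real data, the same call would presumably be applied to `DF_ALL_MUNIC_DATA[attributes]`.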

#### Feature selection : random forests

#### Regression

#### Conclusion

 
Results should be taken with caution. We have the precedent of Ukraine, where two referendums were held within a single year with strikingly different results. The allegiances of some parts of society may switch rapidly.

In [None]:
import numpy as np
import pandas as pd
url = "http://www.historiaelectoral.com/percentcat.html"
tbs = pd.read_html(url)

In [None]:
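# take columns 0 and 1 of the second table on the page,
# skipping its first two and last two rows
df = pd.DataFrame(tbs[1][[0,1]].values[2:-2], columns = ['elect_type', 'part_pct'])
df['part_pct'] = df['part_pct'].astype(int)

# order elections by turnout and renumber the rows
df = df.sort_values(by='part_pct')
df = df.reset_index(drop=True)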

In [None]:
df.count()

In [None]:
import re

In [None]:
df['elect_type'][0]

In [None]:
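# regex=False so that 'G.' is matched literally; with the default regex=True
# the dot matches any character
df[df['elect_type'].str.contains('G.', regex=False)]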

In [None]:
from matplotlib import pyplot as plt
plt.rcParams["figure.figsize"] = (10, 8)

In [27]:
from sklearn.datasets import load_iris

In [28]:
iris = load_iris()

In [55]:
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor

In [65]:
dt_reg = DecisionTreeRegressor(max_depth=6)
rf_reg = RandomForestRegressor(max_depth=4)

In [66]:
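# keep only the municipalities that have a value for every selected column
df = DF_ALL_MUNIC_DATA[attributes]
df = df.dropna()
# features: every column except the target; target: independence support
X = df.drop('Independ_pct', axis=1).values
y = df['Independ_pct'].values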

In [67]:
dt_reg.fit(X, y)

DecisionTreeRegressor(criterion='mse', max_depth=6, max_features=None,
           max_leaf_nodes=None, min_impurity_decrease=0.0,
           min_impurity_split=None, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           presort=False, random_state=None, splitter='best')

In [68]:
df.drop('Independ_pct', axis=1).keys()

Index(['pct_cat', 'pct_spa', 'pct_foreign', 'RFDB_idx', 'pct_Univ',
       'pct_1erGrau', 'pct_cat_speakers', 'Tot'],
      dtype='object')

In [69]:
dt_reg.feature_importances_

array([ 0.00250801,  0.81381858,  0.00500208,  0.0297963 ,  0.00849649,
        0.01833272,  0.11087233,  0.0111735 ])

In [53]:
rf_reg.fit(X, y)

RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=4,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
           oob_score=False, random_state=None, verbose=0, warm_start=False)

In [54]:
rf_reg.feature_importances_

array([ 0.00173497,  0.90802415,  0.00396083,  0.01225879,  0.00773688,
        0.0094459 ,  0.0457114 ,  0.01112708])

In [51]:
df.drop('Independ_pct', axis=1).columns

Index(['pct_cat', 'pct_spa', 'pct_foreign', 'RFDB_idx', 'pct_Univ',
       'pct_1erGrau', 'pct_cat_speakers', 'Tot'],
      dtype='object')
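
The importance arrays above are easier to read when paired with the column names. The sketch below copies the random forest importances and the column list from the outputs above; it assumes that the order of the importances matches the order of the columns, which holds here because `X` was built from `df.drop('Independ_pct', axis=1)`.

```python
import pandas as pd

# Column names and random forest importances copied from the outputs above.
feature_names = ['pct_cat', 'pct_spa', 'pct_foreign', 'RFDB_idx', 'pct_Univ',
                 'pct_1erGrau', 'pct_cat_speakers', 'Tot']
rf_importances = [0.00173497, 0.90802415, 0.00396083, 0.01225879, 0.00773688,
                  0.0094459, 0.0457114, 0.01112708]

# Label each importance with its column and rank from most to least important.
ranked = pd.Series(rf_importances, index=feature_names).sort_values(ascending=False)
print(ranked)
```

In the notebook itself the same series could be built from the fitted model with `pd.Series(rf_reg.feature_importances_, index=df.drop('Independ_pct', axis=1).columns)`. Since the forest was fitted with `n_estimators=10` and `random_state=None`, the exact values will vary somewhat from run to run.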