# Data mining votes data

An MEP's "political position" can be treated as a function of all the votes they cast (10,000+ over the past 5 years). Principal Component Analysis (PCA) reduces this information to a few dimensions.

## PCA
Votes data is filtered to take only votes between 2014-07-01 and 2019-07-01, and only MEPs who were active in that period. The resulting Principal Components (PCs) are stored in `computed/meps_pcs.json`.
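
As a rough illustration of the reduction described above, here is a minimal sketch on synthetic data. It assumes scikit-learn's `PCA`; the internals of `eu_utils.compute_pcas` are not shown in this notebook and may differ. The matrix `X` and its dimensions are made up for the example.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Synthetic stand-in for the MEP x vote matrix: 200 "MEPs", 50 "votes",
# driven by 3 hidden factors plus noise (not real voting data).
latent = rng.normal(size=(200, 3))
mixing = rng.normal(size=(3, 50))
X = latent @ mixing + 0.5 * rng.normal(size=(200, 50))

# A float in (0, 1) asks scikit-learn to keep the smallest number of
# components whose cumulative explained variance exceeds that fraction.
# svd_solver="full" is set explicitly because a float needs an exact solver.
pca = PCA(n_components=0.6, svd_solver="full")
scores = pca.fit_transform(X)

print("components kept:", pca.n_components_)
print("explained variance:", pca.explained_variance_ratio_.sum())
```

Each row of `scores` is one synthetic "MEP" expressed in the reduced space, which is the role the `PCA<n>` columns play in the table further down.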

In [170]:
import eu_utils

# Read all the data - takes about a minute
meps_details, votes_details, votes_data, group_ids = eu_utils.init_data()

Reading data... done


In [174]:
import pandas as pd

# number of principal components (PCs) to compute
# a float such as .6 lets PCA choose the number of PCs needed to explain 60% of the variance
COMPONENT_NB = .6

# Extract a data slice for a given time interval and compute PCs from it
# That means taking only votes that happened in that time window
data_with_count, selected_MEPs, selected_indices = eu_utils.temporal_slice('2014-07-01', '2019-07-01', votes_details, votes_data)
selected_votes = votes_data.iloc[:, selected_indices].columns
print('selected %d votes from %d MEPs for analysis' % (len(selected_indices), len(selected_MEPs)))

# compute Principal Components (PCs) - see eu_utils.compute_pcas
principalComponents, pca = eu_utils.compute_pcas(data_with_count.iloc[selected_MEPs, selected_indices], COMPONENT_NB)

# join MEP metadata with the PC scores, keeping only MEPs present in both
all_data = pd.concat([data_with_count[['mep_id', 'group', 'country', 'votes_count']], principalComponents], axis=1, sort=False, join='inner')

# store computed PCs for each MEP
all_data.to_json('computed/meps_pcs.json', orient='records')

# print sample
all_data[:10]

selected 10227 votes from 824 MEPs for analysis
PCA(8) done


index,mep_id,group,country,votes_count,PCA0,PCA1,PCA2,PCA3,PCA4,PCA5,PCA6,PCA7
0,96997,PPE,Germany,6236,-28.235454,7.834083,14.929969,7.964872,-2.147837,18.457425,1.630457,3.040508
9,96993,ENF,Italy,6569,42.647004,44.808131,-1.322115,13.781579,-1.641579,-10.299622,9.970715,5.113137
10,23816,PPE,Hungary,6998,-32.698046,16.100594,11.065426,9.575744,-5.848232,10.935226,0.217553,1.466658
29,128588,Verts/ALE,Sweden,6237,41.819829,-13.912177,14.387661,-13.481719,-4.024967,-1.500645,-16.753546,10.80269
37,124944,S&D,United Kingdom,8957,5.724098,-43.356725,-23.379537,4.507312,-2.313912,-4.5792,-3.330182,-3.095479
53,125064,PPE,Greece,6643,-28.833811,6.298887,12.185261,9.314009,-2.166416,12.970508,3.11471,0.621291
54,38398,PPE,Netherlands,9029,-47.024328,3.854118,11.510935,1.233932,1.96478,0.428552,-0.53233,-2.289941
55,124759,PPE,Lithuania,1663,10.926068,23.808086,3.801808,9.54073,6.282302,16.223866,-1.74536,1.840359
56,124758,ENF,France,9704,51.74773,66.298593,-3.650526,36.477489,-4.025577,-26.980113,24.841338,29.29727
57,124757,ENF,France,9045,50.151187,64.103452,-3.427633,33.706566,-3.379075,-25.790149,21.956354,26.489858


The PCA coefficients are stored in `computed/votes_pcs_coefficients.json` for further analysis: they show which votes weigh most heavily in each principal component.

In [185]:
# build a table where each row holds the PC coefficients of one vote: [vote_id, pca<n>_coeff...]
votes_pca_coeff = pd.DataFrame({'vote_id': selected_votes}, index=selected_votes)
for comp_idx, comp in enumerate(pca.components_):
    votes_pca_coeff['pca%d_coeff' % comp_idx] = comp

votes_pca_coeff.to_json('computed/votes_pcs_coefficients.json', orient='index')

# print sample
votes_pca_coeff[:10]

index,vote_id,pca0_coeff,pca1_coeff,pca2_coeff,pca3_coeff,pca4_coeff,pca5_coeff,pca6_coeff,pca7_coeff
73468,73468,0.006188,0.015818,-0.013466,-1.1e-05,-0.004833,-0.009701,-0.003199,0.003929
98690,98690,-0.011887,0.000158,-0.00976,0.004123,-0.006131,-0.022659,0.019064,-0.000612
73396,73396,0.011997,-0.018872,-0.00745,-0.008521,0.013538,0.003351,0.007445,-0.001541
103692,103692,-0.002231,-0.005045,0.001908,-0.006636,-0.005724,-0.02185,0.002471,-0.001113
73394,73394,-0.008275,-0.014527,0.008249,0.010596,0.005272,-0.000888,0.002998,0.023751
101767,101767,-0.009363,0.013584,0.007612,-0.006789,0.007334,-0.010078,0.017324,0.000537
84206,84206,-0.013576,0.006716,0.01594,0.008612,-0.008114,-0.000243,-0.003101,-0.005425
94745,94745,0.015247,-0.007919,0.005201,0.006682,-0.009101,-0.000476,0.004465,0.009228
55292,55292,0.015129,0.003277,0.018022,-0.003186,0.001053,-0.006365,-0.013568,0.000121
55293,55293,0.015078,0.005096,0.014414,-0.007891,-0.007224,-0.00735,-0.012767,0.000236
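
One possible way to read such a table is to rank votes by the absolute value of their coefficient on a given component. The sketch below does this on a made-up miniature table; the helper `top_votes` and the values are hypothetical and only the column naming (`vote_id`, `pca<n>_coeff`) follows the table above.

```python
import pandas as pd

# Hypothetical miniature of the coefficient table (values are made up).
coeffs = pd.DataFrame(
    {
        "vote_id": [101, 102, 103, 104],
        "pca0_coeff": [0.006, -0.012, 0.015, -0.002],
        "pca1_coeff": [0.016, 0.000, -0.008, -0.019],
    }
)


def top_votes(table, component, n=2):
    """Return the n votes with the largest absolute coefficient on a component."""
    column = "pca%d_coeff" % component
    # The sign only gives the direction along the component,
    # so votes are ranked on the magnitude of the coefficient.
    strongest = table[column].abs().nlargest(n).index
    return table.loc[strongest, ["vote_id", column]]


print(top_votes(coeffs, component=0))
```

Ranking on the absolute value keeps votes that pull strongly in either direction along the component.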


## Use ExtraTrees to identify votes of importance
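An attempt at using extra trees (extremely randomized trees) to rank votes by how much they contribute to predicting the first principal component (`PCA0`). Not used in the main report so far.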
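
To show the idea in isolation, here is a small sketch on synthetic data, assuming votes are coded as -1, 0 or 1 (an assumption made for this example only). The target is built so that it depends mostly on one column, and the forest's feature importances should recover that column.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor

rng = np.random.default_rng(0)

# Synthetic votes: 300 "MEPs" x 10 "votes", each coded -1, 0 or 1 (made-up data).
X = rng.integers(-1, 2, size=(300, 10)).astype(float)
# Made-up target standing in for a PC score:
# driven mostly by column 3, slightly by column 7, plus a little noise.
y = 5.0 * X[:, 3] + 1.0 * X[:, 7] + 0.1 * rng.normal(size=300)

forest = ExtraTreesRegressor(n_estimators=100, random_state=0)
forest.fit(X, y)

# Impurity-based importances, normalised to sum to 1 across features.
importances = forest.feature_importances_
ranking = np.argsort(importances)[::-1]
for rank, idx in enumerate(ranking[:3], start=1):
    print("%d. vote column %d (importance %.3f)" % (rank, idx, importances[idx]))
```

On real voting data many votes are strongly correlated, so importance may be spread across near-duplicate votes rather than concentrated on one.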


In [127]:
import numpy as np
import matplotlib.pyplot as plt

from sklearn.ensemble import ExtraTreesRegressor
from sklearn.impute import SimpleImputer

# Fill missing values with 0
imp = SimpleImputer(missing_values=np.nan, strategy='constant', fill_value=0)
imp.fit(data_with_count.iloc[selected_MEPs, selected_indices])
data_with_count_filled = imp.transform(data_with_count.iloc[selected_MEPs, selected_indices])

# Build a forest and compute the feature importances
forest = ExtraTreesRegressor(n_estimators=100,
                             n_jobs=6,
                             random_state=0)
# X: imputed vote matrix, y: score of each MEP on the first PC
X = data_with_count_filled
y = principalComponents['PCA0'].tolist()
forest.fit(X, y)

importances = forest.feature_importances_
# spread of each feature's importance across the individual trees
std = np.std([tree.feature_importances_ for tree in forest.estimators_],
             axis=0)
# feature indices sorted from most to least important
indices = np.argsort(importances)[::-1]

# Print the feature ranking
print("Feature ranking:")

# display at most 20 features
for f in range(min(X.shape[1], 20)):
    vote_id = str(data_with_count.columns[selected_indices[indices[f]]])
    print("%d. (%f) vote %s : %s" % (f, importances[indices[f]], vote_id, votes_details[vote_id]['title'] ))


Feature ranking:
0. (0.297816) vote 75652 : A8-0344/2016 -  Richard Corbett - Am 398
1. (0.156305) vote 71458 : A8-0223/2016 -  Jeppe Kofod et  Michael Theurer - Considérant AF
2. (0.074999) vote 71454 : A8-0223/2016 -  Jeppe Kofod et  Michael Theurer - Considérant AC
3. (0.072532) vote 79122 : A8-0041/2015 -  Neena Gill - Am 12
4. (0.043879) vote 71461 : A8-0223/2016 -  Jeppe Kofod et  Michael Theurer - Am 29
5. (0.028480) vote 71435 : A8-0223/2016 -  Jeppe Kofod et  Michael Theurer - Am 16
6. (0.025509) vote 76050 : A8-0360/2016 -  Elmar Brok - § 3
7. (0.025071) vote 89134 :  -  Jan Olbrycht et  Isabelle Thomas - Am 30/1
8. (0.019913) vote 75742 : A8-0344/2016 -  Richard Corbett - Am 188
9. (0.019417) vote 77629 : A8-0039/2017 -  Gunnar Hökmark - Considérant O
10. (0.016902) vote 75599 : A8-0344/2016 -  Richard Corbett - Am 387/1
11. (0.014039) vote 75860 : A8-0344/2016 -  Richard Corbett - Am 374
12. (0.011349) vote 75