### TSE data example
This is a small example of how to use the TSE data.
In this script, we filter the candidates to obtain a dataframe containing only politicians who were elected at least once.

In [1]:
import pandas as pd
import numpy as np
import os


In [2]:
DATASET_PATH = os.path.join(os.pardir, 'data', '2017-05-10-tse-candidates.xz')

In [3]:
# Load the CSV. Setting dtype to 'category' instead of str cuts RAM usage by more than half.
cand_df = pd.read_csv(DATASET_PATH, encoding='utf-8', dtype='category')
cand_df.columns

Index(['year', 'phase', 'description', 'state', 'location', 'post', 'name',
       'electoral_id', 'cpf', 'voter_id', 'result'],
      dtype='object')
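
The memory effect of `dtype='category'` can be checked on a small scale. The sketch below is only an illustration on made-up data: `toy_df` and its values are hypothetical, chosen to mimic columns with few distinct values such as `state` or `post`. Columns whose values are almost all distinct (for example `cpf` or `name`) will likely benefit much less.

```python
import pandas as pd

# Hypothetical toy data: a few distinct strings repeated many times.
toy_df = pd.DataFrame({
    'state': ['PI', 'RJ', 'MT', 'GO'] * 25000,
    'post': ['mayor', 'city_councilman'] * 50000,
})

# deep=True also counts the memory held by the strings themselves.
str_bytes = toy_df.memory_usage(deep=True).sum()
cat_bytes = toy_df.astype('category').memory_usage(deep=True).sum()

print(f'string columns:   {str_bytes:,} bytes')
print(f'category columns: {cat_bytes:,} bytes')
```

The same `memory_usage(deep=True)` call can be run on `cand_df` to measure the real dataset.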

In [4]:
# Quickly check the data, using value_counts() to get the frequency of each value in every column
for name, col in cand_df.items():
    print('\t', name, '\n')
    print(col.value_counts())

	 year 

2016    497284
2012    483825
2004    401789
2008    384532
2014     26222
2010     22577
2006     20790
Name: year, dtype: int64
	 phase 

1    1836151
2        868
Name: phase, dtype: int64
	 description 

Eleições Municipais 2016                  497123
ELEIÇÃO MUNICIPAL 2012                    483068
ELEICOES 2004                             401789
ELEIÇÕES 2008                             383530
Eleições Gerais 2014                       26222
ELEIÇÕES 2010                              22577
ELEICOES 2006                              20790
ELEIÇÕES SUPLEMENTARES 2008                  990
JUARA                                         14
Eleição Suplementar de Foz do Iguaçu          13
ELEIÇÃO SUPLEMENTAR CAMAMU-BA                 13
 ELEIÇÃO SUPLEMENTAR DE MARITUBA/PA           12
ELEIÇÃO SUPLEMENTAR MOSSORÓ-RN                12
Eleição Suplementar de Gravataí               12
MAJORITÁRIA - CRICIÚMA                        12
NOVA ELEIÇÃO SANTANA DO PARNAÍBA              12

### Only elected politicians
Now we process the candidacy data to obtain a list of elected politicians.
We use 'result' to figure out who has been elected. It is better to use the description column than the code column, since the codes don't seem to be consistent across the years.


In [5]:
ind_elected = (cand_df.result == 'elected') | (cand_df.result == 'elected_by_party_quota')
# ind_elected |= cand_df.result == 'alternate'  # should we consider it?
ind_elected = cand_df.index[ind_elected]
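
An alternative way to write the same filter is `Series.isin`, which keeps the accepted result values in one list and makes it easy to add `'alternate'` later if desired. This is a minimal sketch on made-up data, not the notebook's own code: `toy_cand_df`, `ELECTED_RESULTS`, and the `'not_elected'` label are hypothetical names used only for illustration.

```python
import pandas as pd

# Hypothetical toy candidacies; 'not_elected' is an assumed label.
toy_cand_df = pd.DataFrame({
    'name': ['A', 'B', 'C', 'D'],
    'result': ['elected', 'not_elected', 'elected_by_party_quota', 'alternate'],
}).astype('category')

# Result values that count as "elected" in this sketch.
ELECTED_RESULTS = ['elected', 'elected_by_party_quota']

# Boolean mask: True where the result is one of the accepted values.
is_elected = toy_cand_df['result'].isin(ELECTED_RESULTS)
toy_elected_df = toy_cand_df[is_elected]

print(toy_elected_df)
```

The boolean mask can be passed directly to `.loc`, so converting it to index labels first is optional.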

In [6]:
politicians_df=cand_df.loc[ind_elected,['cpf','name','post','location','state','year']]

In [7]:
politicians_df.sort_values('name')


Unnamed: 0,cpf,name,post,location,state,year
631490,13178547304,AARÃO CRUZ MENDES,mayor,BENEDITINOS,PI,2008
1096815,13178547304,AARÃO CRUZ MENDES,mayor,BENEDITINOS,PI,2012
671307,58270876704,AARÃO DE MOURA BRITO NETO,mayor,MANGARATIBA,RJ,2008
259280,58270876704,AARÃO DE MOURA BRITO NETO,mayor,MANGARATIBA,RJ,2004
1657382,58270876704,AARÃO DE MOURA BRITO NETO,mayor,MANGARATIBA,RJ,2016
1558597,16133358149,ABADIA DE MOURA MORAES,city_councilman,ARAPUTANGA,MT,2016
492413,24937630253,ABADIA DELFINO DUARTE SOUZA,city_councilman,GOIANÉSIA,GO,2008
106975,59857277691,ABADIA NUNES DE OLIVEIRA ANDRADE,city_councilman,BONFINOPOLIS DE MINAS,MG,2004
1311039,37736744149,ABADIO RODRIGUES DA SILVA,city_councilman,TALISMÃ,TO,2012
939880,09482822315,ABDALA DA COSTA SOUSA,city_councilman,BOM JESUS DAS SELVAS,MA,2012


Curiously, when the list is ordered alphabetically in this way, the last politician is called...

If we want only the list of politicians, we keep just the cpf and name columns and remove the duplicates.

In [8]:
politicians_df[['cpf','name']].drop_duplicates().sort_values('name')

Unnamed: 0,cpf,name
631490,13178547304,AARÃO CRUZ MENDES
259280,58270876704,AARÃO DE MOURA BRITO NETO
1558597,16133358149,ABADIA DE MOURA MORAES
492413,24937630253,ABADIA DELFINO DUARTE SOUZA
106975,59857277691,ABADIA NUNES DE OLIVEIRA ANDRADE
1311039,37736744149,ABADIO RODRIGUES DA SILVA
939880,09482822315,ABDALA DA COSTA SOUSA
940015,00917446364,ABDALA DA COSTA SOUSA FILHO
808132,33460825200,ABDALA HABIB FRAXE JUNIOR
776856,04365099892,ABDALA SALOMÃO NETO
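
One thing to keep in mind: `drop_duplicates()` on `['cpf', 'name']` removes only rows where both values match. If the same cpf were registered with slightly different spellings of the name in different years (an assumption, not something verified in this dataset), that person would appear more than once. The sketch below uses made-up data (`toy_politicians_df` and its values are hypothetical) to compare deduplicating on the pair with deduplicating on cpf alone.

```python
import pandas as pd

# Hypothetical toy data: cpf '111' appears with two spellings of the name.
toy_politicians_df = pd.DataFrame({
    'cpf': ['111', '111', '222'],
    'name': ['JOSE DA SILVA', 'JOSÉ DA SILVA', 'MARIA SOUZA'],
    'year': [2008, 2012, 2016],
})

# Deduplicate on the (cpf, name) pair: both spellings survive.
by_pair = toy_politicians_df[['cpf', 'name']].drop_duplicates()

# Deduplicate on cpf only: the first row seen for each cpf is kept.
by_cpf = toy_politicians_df.drop_duplicates(subset='cpf')[['cpf', 'name']]

print(len(by_pair), len(by_cpf))
```

Comparing the two lengths on the real `politicians_df` would show whether this situation actually occurs.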
