# First Study on Brazilian Cities Transparency Portal
This dataset contains a population projection for each Brazilian city for the year 2013.



In [1]:
import pandas as pd
import numpy as np

# We first load the population estimates;
# we can use them later for comparisons
cities = pd.read_excel('../data/Cidades - estimativa 2013.xlsx',
                       converters={'COD. UF': str, 'COD. MUNIC': str},
                       sheet_name=None, header=0)

In [2]:
# `cities` maps each sheet name to a DataFrame; stack all sheets into one table
data = pd.concat(list(cities.values()), ignore_index=True)
data.shape

(5584, 5)

We should see 5570 rows, because that is the number of cities IBGE reports for Brazil. The 14 extra rows lead me to believe that metadata from the `.xlsx` file is mixed in with our data.
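One way to test that hypothesis is to look for rows whose municipality code is not a plain number. The sketch below is only an illustration on a made-up three-row frame: the `'Fonte: IBGE'` footnote row is an assumption about what such metadata might look like, not something taken from the real spreadsheet.

```python
import pandas as pd

# Toy stand-in for the concatenated sheets; the last row mimics a footnote line
# (hypothetical content, not copied from the real file).
sample = pd.DataFrame({
    'UF': ['RO', 'RO', 'Fonte: IBGE'],
    'COD. UF': ['11', '11', None],
    'COD. MUNIC': ['15', '379', None],
})

# A real city row has a purely numeric municipality code;
# missing values are treated as "not a city" through na=False.
is_city = sample['COD. MUNIC'].str.fullmatch(r'\d+', na=False)

suspicious_rows = sample[~is_city]
clean = sample[is_city].reset_index(drop=True)

print(suspicious_rows)
print(clean.shape)
```

Inspecting `suspicious_rows` before dropping anything keeps the cleanup auditable.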

## Translating column names

In [3]:
data.rename(columns={
        'UF': 'state',
        'COD. UF': 'state_id',
        'COD. MUNIC': 'city_id',
        'NOME DO MUNICÍPIO': 'city_name',
        'POPULAÇÃO ESTIMADA': 'population_projection'
    }, inplace=True)
data.head()

Unnamed: 0,state,state_id,city_id,city_name,population_projection
0,RO,11,15,Alta Floresta D'Oeste,25728
1,RO,11,379,Alto Alegre dos Parecis,13827
2,RO,11,403,Alto Paraíso,19459
3,RO,11,346,Alvorada D'Oeste,17399
4,RO,11,23,Ariquemes,101269


## Formatting `city_id`

Formatting `city_id` to conform with the ids displayed in the Brazilian census files

In [4]:
data['city_id'] = data['city_id'].str.zfill(5)

## Checking out a `unique_id` for each city

In [5]:
data[data['city_id'] == '00108']

Unnamed: 0,state,state_id,city_id,city_name,population_projection
1831,BA,29,108,Abaíra,9132
5583,DF,53,108,Brasília,2789761
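The two rows above share the same `city_id`, so the city code alone cannot identify a city. A minimal pure-Python sketch of the idea used next (zero-pad the city code, then prefix it with the state code), using only the two rows shown above; the `unique_id` key is a name chosen for this illustration:

```python
rows = [
    {'city_name': 'Abaíra', 'state_id': '29', 'city_id': '108'},
    {'city_name': 'Brasília', 'state_id': '53', 'city_id': '108'},
]

for row in rows:
    # Pad the city code to 5 digits, then prepend the 2-digit state code
    padded_city_id = row['city_id'].zfill(5)
    row['unique_id'] = row['state_id'] + padded_city_id

unique_ids = [row['unique_id'] for row in rows]
print(unique_ids)  # ['2900108', '5300108']
```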


In [6]:
UNIQUE_IDS = data.loc[:, ['state_id', 'city_id']].copy()

# Build the full id by concatenating the state code and the zero-padded city code
UNIQUE_IDS['ids'] = UNIQUE_IDS['state_id'].astype(str) + UNIQUE_IDS['city_id'].astype(str)

UNIQUE_IDS.head()

Unnamed: 0,state_id,city_id,ids
0,11,15,1100015
1,11,379,1100379
2,11,403,1100403
3,11,346,1100346
4,11,23,1100023


In [7]:
len(set(UNIQUE_IDS['ids']))

5570

In [8]:
UNIQUE_IDS.shape

(5584, 3)

In [9]:
brazilian_states = {'RO': 'rondonia',
                    'AC': 'acre',
                    'AM': 'amazonas',
                    'RR': 'roraima',
                    'PA': 'para',
                    'AP': 'amapa',
                    'TO': 'tocantins',
                    'MA': 'maranhao',
                    'PI': 'piaui',
                    'CE': 'ceara',
                    'RN': 'rio_grande_do_norte',
                    'PB': 'paraiba',
                    'PE': 'pernambuco',
                    'AL': 'alagoas',
                    'SE': 'sergipe',
                    'BA': 'bahia',
                    'MG': 'minas_gerais',
                    'ES': 'espirito_santo',
                    'RJ': 'rio_de_janeiro',
                    'SP': 'sao_paulo',
                    'PR': 'parana',
                    'SC': 'santa_catarina', 
                    'RS': 'rio_grande_do_sul',
                    'MS': 'mato_grosso_do_sul',
                    'MT': 'mato_grosso',
                    'GO': 'goias',
                    'DF': 'distrito_federal'}

census_link = "ftp.ibge.gov.br/Censos/Censo_Demografico_2010/resultados/total_populacao_{}.zip"

## Gathering cities with @cuducos's Brazilian Cities script

@cuducos had already written a script that lists every Brazilian city with its code and state, available in [this repository](https://github.com/cuducos/brazilian-cities).

We checked it, and it turned out to be the most reliable way to get the list of cities.

In [10]:
from serenata_toolbox.datasets import fetch

fetch('2017-05-22-brazilian-cities.csv', '../data')

Downloading 2017-05-22-brazilian-cities.csv: 100%|██████████| 134K/134K [00:01<00:00, 97.1Kb/s]


In [11]:
brazilian_cities = pd.read_csv('../data/2017-05-22-brazilian-cities.csv')
brazilian_cities.head()

Unnamed: 0,code,name,state
0,520005,Abadia de Goiás,GO
1,310010,Abadia dos Dourados,MG
2,520010,Abadiânia,GO
3,150010,Abaetetuba,PA
4,310020,Abaeté,MG


In [12]:
brazilian_cities.shape

(5570, 3)

## Normalizing the city names

We need to normalize this information before we can use it to build URLs, so we:
- Lowercase all state abbreviations
- Remove all accents and lowercase the city names
- Remove spaces, to generate the pattern we want

In [13]:
brazilian_cities['state'] = brazilian_cities['state'].apply(str.lower)

In [14]:
import unicodedata

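def normalize_string(string):
    """Lowercase `string` and strip its accents; non-string input yields None."""
    if isinstance(string, str):
        # NFKD splits accented letters into a base letter plus a combining mark
        nfkd_form = unicodedata.normalize('NFKD', string.lower())
        # Dropping everything outside ASCII removes the combining marks
        return nfkd_form.encode('ASCII', 'ignore').decode('ASCII')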
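To see why this removes accents, here is a small standalone sketch of the same two steps applied to one city name from the dataset. It only uses the standard library:

```python
import unicodedata

name = 'Abaíra'

# NFKD decomposition turns 'í' into 'i' followed by a combining acute accent (U+0301)
decomposed = unicodedata.normalize('NFKD', name.lower())
print([hex(ord(char)) for char in decomposed])

# Encoding to ASCII with errors='ignore' silently drops the combining accent
ascii_only = decomposed.encode('ASCII', 'ignore').decode('ASCII')
print(ascii_only)  # abaira
```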

In [15]:
brazilian_cities['normalized_name'] = brazilian_cities['name'].apply(normalize_string)
brazilian_cities['normalized_name'] = brazilian_cities['normalized_name'].apply(lambda x: x.replace(' ', ''))

In [16]:
brazilian_cities.head()

Unnamed: 0,code,name,state,normalized_name
0,520005,Abadia de Goiás,go,abadiadegoias
1,310010,Abadia dos Dourados,mg,abadiadosdourados
2,520010,Abadiânia,go,abadiania
3,150010,Abaetetuba,pa,abaetetuba
4,310020,Abaeté,mg,abaete


## Getting all cities that are part of Transparency Portal

We already know that some cities have a page with transparency and open data. The main objective here is to find out how many cities have one.

Pattern: `{city}-{state}.portaltp.com.br`

In [17]:
portal_url = 'https://{}-{}.portaltp.com.br/'
brazilian_cities['transparency_portal_url'] = brazilian_cities.apply(
    lambda row: portal_url.format(row['normalized_name'], row['state']),
    axis=1)
brazilian_cities.head(20)

Unnamed: 0,code,name,state,normalized_name,transparency_portal_url
0,520005,Abadia de Goiás,go,abadiadegoias,https://abadiadegoias-go.portaltp.com.br/
1,310010,Abadia dos Dourados,mg,abadiadosdourados,https://abadiadosdourados-mg.portaltp.com.br/
2,520010,Abadiânia,go,abadiania,https://abadiania-go.portaltp.com.br/
3,150010,Abaetetuba,pa,abaetetuba,https://abaetetuba-pa.portaltp.com.br/
4,310020,Abaeté,mg,abaete,https://abaete-mg.portaltp.com.br/
5,230010,Abaiara,ce,abaiara,https://abaiara-ce.portaltp.com.br/
6,290020,Abaré,ba,abare,https://abare-ba.portaltp.com.br/
7,410010,Abatiá,pr,abatia,https://abatia-pr.portaltp.com.br/
8,290010,Abaíra,ba,abaira,https://abaira-ba.portaltp.com.br/
9,420005,Abdon Batista,sc,abdonbatista,https://abdonbatista-sc.portaltp.com.br/


(Getting the status code for every city might take a while, so we print each one as feedback.)

In [18]:
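import requests

def get_status(url):
    try:
        # Send a single HEAD request and reuse its status code
        status_code = requests.head(url).status_code
    except requests.ConnectionError:
        # Treat an unreachable host as "not found"
        status_code = 404
    print(status_code)
    return status_code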

In [19]:
%%time
colatina = brazilian_cities[brazilian_cities['code'] == 320150]['transparency_portal_url'].values[0]
statusOK = get_status(colatina)

abaete = brazilian_cities[brazilian_cities['code'] == 310020]['transparency_portal_url'].values[0]
statusNOK = get_status(abaete)

200
404
CPU times: user 64.3 ms, sys: 13.1 ms, total: 77.4 ms
Wall time: 2.28 s


In [20]:
br_cities = brazilian_cities.loc[:10,:].copy()

In [21]:
%%time
br_cities.loc[:,'status_code'] = br_cities.apply(lambda x: get_status(x['transparency_portal_url']), axis=1)

404
404
404
404
404
404
200
404
200
404
404
CPU times: user 128 ms, sys: 20.6 ms, total: 149 ms
Wall time: 4.21 s


In [22]:
br_cities

Unnamed: 0,code,name,state,normalized_name,transparency_portal_url,status_code
0,520005,Abadia de Goiás,go,abadiadegoias,https://abadiadegoias-go.portaltp.com.br/,404
1,310010,Abadia dos Dourados,mg,abadiadosdourados,https://abadiadosdourados-mg.portaltp.com.br/,404
2,520010,Abadiânia,go,abadiania,https://abadiania-go.portaltp.com.br/,404
3,150010,Abaetetuba,pa,abaetetuba,https://abaetetuba-pa.portaltp.com.br/,404
4,310020,Abaeté,mg,abaete,https://abaete-mg.portaltp.com.br/,404
5,230010,Abaiara,ce,abaiara,https://abaiara-ce.portaltp.com.br/,404
6,290020,Abaré,ba,abare,https://abare-ba.portaltp.com.br/,200
7,410010,Abatiá,pr,abatia,https://abatia-pr.portaltp.com.br/,404
8,290010,Abaíra,ba,abaira,https://abaira-ba.portaltp.com.br/,200
9,420005,Abdon Batista,sc,abdonbatista,https://abdonbatista-sc.portaltp.com.br/,404


This will take too long considering we have 5570 cities to address.
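A rough back-of-the-envelope estimate, extrapolating from the wall time reported by `%%time` above for the 11 rows selected by `.loc[:10, :]`. Response times vary a lot between hosts, so this is only an order of magnitude:

```python
sample_size = 11        # rows selected by .loc[:10, :] (the slice is inclusive)
sample_seconds = 4.21   # wall time reported for that sample
total_cities = 5570

seconds_per_city = sample_seconds / sample_size
estimated_minutes = seconds_per_city * total_cities / 60

print(round(estimated_minutes, 1))  # 35.5
```

That is more than half an hour of sequential requests for a single URL pattern.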

Let's try using [grequests](https://pypi.python.org/pypi/grequests).

We know that the small sample of cities tested above returns two different status codes, so let's use that same sample to test grequests ;)

In [23]:
import grequests

rs = (grequests.get(u) for u in list(br_cities['transparency_portal_url']))

In [24]:
def exception_handler(request, exception):
    return 404

responses = grequests.map(rs, exception_handler=exception_handler)

In [25]:
codes = [int(x) for x in br_cities['status_code'].values]

print(pd.unique(codes), pd.unique(responses))

[404 200] [404]


In [26]:
responses

[404, 404, 404, 404, 404, 404, 404, 404, 404, 404, 404]

The result above made me wonder where the 200 status codes we saw before had gone. I ran the same code from the command line and they showed up there. After a little research, I found that it is apparently not easy to run async tasks inside a Jupyter notebook ([ref](http://ipywidgets.readthedocs.io/en/latest/examples/Widget%20Asynchronous.html)).

With that in mind, we decided to write a script that generates the information we want: the Open Data URL for each Brazilian city.
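The script itself is not shown in this notebook. As a hedged illustration only, the sketch below shows one way a standalone script could check many URLs concurrently using just the standard library (a thread pool instead of grequests, which is a swapped-in technique and not necessarily what the real script does). The names `head_status` and `check_all` are hypothetical.

```python
import urllib.error
import urllib.request
from concurrent.futures import ThreadPoolExecutor


def head_status(url, timeout=10):
    """Return the status code of a HEAD request, or 404 if the host is unreachable."""
    request = urllib.request.Request(url, method='HEAD')
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        # The server answered, but with an error status such as 404 or 500
        return error.code
    except OSError:
        # DNS failures, refused connections and timeouts end up here
        return 404


def check_all(urls, check=head_status, max_workers=20):
    """Run `check` on every URL in a thread pool, keeping the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        return list(executor.map(check, urls))
```

Calling `check_all(list(br_cities['transparency_portal_url']))` would return one status code per URL, in the same order as the input. The `check` parameter exists so the network call can be replaced by a stub when testing.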

In [27]:
data = br_cities[br_cities['status_code'] == 404].copy().reset_index(drop=True)
data

Unnamed: 0,code,name,state,normalized_name,transparency_portal_url,status_code
0,520005,Abadia de Goiás,go,abadiadegoias,https://abadiadegoias-go.portaltp.com.br/,404
1,310010,Abadia dos Dourados,mg,abadiadosdourados,https://abadiadosdourados-mg.portaltp.com.br/,404
2,520010,Abadiânia,go,abadiania,https://abadiania-go.portaltp.com.br/,404
3,150010,Abaetetuba,pa,abaetetuba,https://abaetetuba-pa.portaltp.com.br/,404
4,310020,Abaeté,mg,abaete,https://abaete-mg.portaltp.com.br/,404
5,230010,Abaiara,ce,abaiara,https://abaiara-ce.portaltp.com.br/,404
6,410010,Abatiá,pr,abatia,https://abatia-pr.portaltp.com.br/,404
7,420005,Abdon Batista,sc,abdonbatista,https://abdonbatista-sc.portaltp.com.br/,404
8,150013,Abel Figueiredo,pa,abelfigueiredo,https://abelfigueiredo-pa.portaltp.com.br/,404


We already know of some cities that have a transparency and open data page, but whose URL follows a different pattern from the one above.

Second Pattern: `cm{city}-{state}.portaltp.com.br`
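Both patterns seen so far are built from the same normalized city name and state abbreviation. A small sketch that generates every candidate URL for one city; `PORTAL_PATTERNS` and `candidate_urls` are names made up for this illustration:

```python
# The two URL patterns described in this notebook
PORTAL_PATTERNS = (
    'https://{city}-{state}.portaltp.com.br/',
    'https://cm{city}-{state}.portaltp.com.br/',
)


def candidate_urls(normalized_name, state):
    """Return every URL worth trying for a city, in the order the patterns are listed."""
    return [pattern.format(city=normalized_name, state=state)
            for pattern in PORTAL_PATTERNS]


print(candidate_urls('abaira', 'ba'))
```

Keeping the patterns in one tuple means a third pattern, if one is found, only needs one extra line.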

In [28]:
portal_url = 'https://cm{}-{}.portaltp.com.br/'
data['transparency_portal_url'] = data.apply(lambda row: portal_url.format(
                                                                           row['normalized_name'],
                                                                           row['state']), axis=1)
data

Unnamed: 0,code,name,state,normalized_name,transparency_portal_url,status_code
0,520005,Abadia de Goiás,go,abadiadegoias,https://cmabadiadegoias-go.portaltp.com.br/,404
1,310010,Abadia dos Dourados,mg,abadiadosdourados,https://cmabadiadosdourados-mg.portaltp.com.br/,404
2,520010,Abadiânia,go,abadiania,https://cmabadiania-go.portaltp.com.br/,404
3,150010,Abaetetuba,pa,abaetetuba,https://cmabaetetuba-pa.portaltp.com.br/,404
4,310020,Abaeté,mg,abaete,https://cmabaete-mg.portaltp.com.br/,404
5,230010,Abaiara,ce,abaiara,https://cmabaiara-ce.portaltp.com.br/,404
6,410010,Abatiá,pr,abatia,https://cmabatia-pr.portaltp.com.br/,404
7,420005,Abdon Batista,sc,abdonbatista,https://cmabdonbatista-sc.portaltp.com.br/,404
8,150013,Abel Figueiredo,pa,abelfigueiredo,https://cmabelfigueiredo-pa.portaltp.com.br/,404


We still need to update the status code column for the new URLs.

In [29]:
%%time
data.loc[:,'status_code'] = data.apply(lambda x: get_status(x['transparency_portal_url']), axis=1)

404
404
404
404
404
404
404
404
404
CPU times: user 44.6 ms, sys: 12.1 ms, total: 56.8 ms
Wall time: 118 ms


In [30]:
data

Unnamed: 0,code,name,state,normalized_name,transparency_portal_url,status_code
0,520005,Abadia de Goiás,go,abadiadegoias,https://cmabadiadegoias-go.portaltp.com.br/,404
1,310010,Abadia dos Dourados,mg,abadiadosdourados,https://cmabadiadosdourados-mg.portaltp.com.br/,404
2,520010,Abadiânia,go,abadiania,https://cmabadiania-go.portaltp.com.br/,404
3,150010,Abaetetuba,pa,abaetetuba,https://cmabaetetuba-pa.portaltp.com.br/,404
4,310020,Abaeté,mg,abaete,https://cmabaete-mg.portaltp.com.br/,404
5,230010,Abaiara,ce,abaiara,https://cmabaiara-ce.portaltp.com.br/,404
6,410010,Abatiá,pr,abatia,https://cmabatia-pr.portaltp.com.br/,404
7,420005,Abdon Batista,sc,abdonbatista,https://cmabdonbatista-sc.portaltp.com.br/,404
8,150013,Abel Figueiredo,pa,abelfigueiredo,https://cmabelfigueiredo-pa.portaltp.com.br/,404


In [31]:
# For study purposes only: pretend the last city answered with 200
data.loc[8, 'status_code'] = 200
data

Unnamed: 0,code,name,state,normalized_name,transparency_portal_url,status_code
0,520005,Abadia de Goiás,go,abadiadegoias,https://cmabadiadegoias-go.portaltp.com.br/,404
1,310010,Abadia dos Dourados,mg,abadiadosdourados,https://cmabadiadosdourados-mg.portaltp.com.br/,404
2,520010,Abadiânia,go,abadiania,https://cmabadiania-go.portaltp.com.br/,404
3,150010,Abaetetuba,pa,abaetetuba,https://cmabaetetuba-pa.portaltp.com.br/,404
4,310020,Abaeté,mg,abaete,https://cmabaete-mg.portaltp.com.br/,404
5,230010,Abaiara,ce,abaiara,https://cmabaiara-ce.portaltp.com.br/,404
6,410010,Abatiá,pr,abatia,https://cmabatia-pr.portaltp.com.br/,404
7,420005,Abdon Batista,sc,abdonbatista,https://cmabdonbatista-sc.portaltp.com.br/,404
8,150013,Abel Figueiredo,pa,abelfigueiredo,https://cmabelfigueiredo-pa.portaltp.com.br/,200


In [32]:
data.loc[data['status_code'] == 404, 'transparency_portal_url'] = None
data

Unnamed: 0,code,name,state,normalized_name,transparency_portal_url,status_code
0,520005,Abadia de Goiás,go,abadiadegoias,,404
1,310010,Abadia dos Dourados,mg,abadiadosdourados,,404
2,520010,Abadiânia,go,abadiania,,404
3,150010,Abaetetuba,pa,abaetetuba,,404
4,310020,Abaeté,mg,abaete,,404
5,230010,Abaiara,ce,abaiara,,404
6,410010,Abatiá,pr,abatia,,404
7,420005,Abdon Batista,sc,abdonbatista,,404
8,150013,Abel Figueiredo,pa,abelfigueiredo,https://cmabelfigueiredo-pa.portaltp.com.br/,200


In [33]:
br_cities.loc[br_cities['status_code'] == 404, 'transparency_portal_url'] = None
br_cities

Unnamed: 0,code,name,state,normalized_name,transparency_portal_url,status_code
0,520005,Abadia de Goiás,go,abadiadegoias,,404
1,310010,Abadia dos Dourados,mg,abadiadosdourados,,404
2,520010,Abadiânia,go,abadiania,,404
3,150010,Abaetetuba,pa,abaetetuba,,404
4,310020,Abaeté,mg,abaete,,404
5,230010,Abaiara,ce,abaiara,,404
6,290020,Abaré,ba,abare,https://abare-ba.portaltp.com.br/,200
7,410010,Abatiá,pr,abatia,,404
8,290010,Abaíra,ba,abaira,https://abaira-ba.portaltp.com.br/,200
9,420005,Abdon Batista,sc,abdonbatista,,404


In [34]:
unnecessary_columns = ['normalized_name', 'status_code']
br_cities = pd.merge(br_cities.drop(unnecessary_columns, axis=1),
                  data.drop(unnecessary_columns, axis=1),
                  on=['code', 'name', 'state'], how='left')

# Prefer the URL from the first pattern; fall back to the second one
br_cities['transparency_portal_url'] = br_cities.apply(
    lambda row: row['transparency_portal_url_x'] or row['transparency_portal_url_y'],
    axis=1)
    
unnecessary_columns = ['transparency_portal_url_x', 'transparency_portal_url_y']
br_cities = br_cities.drop(unnecessary_columns, axis=1)
br_cities

Unnamed: 0,code,name,state,transparency_portal_url
0,520005,Abadia de Goiás,go,
1,310010,Abadia dos Dourados,mg,
2,520010,Abadiânia,go,
3,150010,Abaetetuba,pa,
4,310020,Abaeté,mg,
5,230010,Abaiara,ce,
6,290020,Abaré,ba,https://abare-ba.portaltp.com.br/
7,410010,Abatiá,pr,
8,290010,Abaíra,ba,https://abaira-ba.portaltp.com.br/
9,420005,Abdon Batista,sc,
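The `x or y` expression used in the merge above relies on missing URLs being `None`. If a missing value were stored as `NaN` instead, `or` would keep it, because `NaN` is truthy. A possible alternative, sketched on a two-row toy frame (the column names `url`, `url_first` and `url_second` are made up here), is to let pandas fill the gaps with `fillna`:

```python
import pandas as pd

# Toy frames: one city found with the first pattern, one only with the second
first = pd.DataFrame({
    'code': [290020, 150013],
    'url': ['https://abare-ba.portaltp.com.br/', None],
})
second = pd.DataFrame({
    'code': [150013],
    'url': ['https://cmabelfigueiredo-pa.portaltp.com.br/'],
})

merged = pd.merge(first, second, on='code', how='left',
                  suffixes=('_first', '_second'))

# Keep the first-pattern URL and fill missing ones from the second pattern;
# fillna treats both None and NaN as missing
merged['url'] = merged['url_first'].fillna(merged['url_second'])
merged = merged.drop(['url_first', 'url_second'], axis=1)

print(merged)
```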


# Conclusions

This study found 279 cities with a transparency portal that follows these URL patterns; 19 of them return an Internal Server Error (status code 5XX).

That is about 5% of all 5570 Brazilian cities!

Below is a table with all the cities that have portals ;)

In [35]:
with_tp_portal = pd.read_csv('../data/2017-05-30-cities_with_tp_portal.csv')
with_tp_portal.shape

(279, 5)

In [36]:
with_tp_portal

Unnamed: 0,code,name,state,transparency_portal_url,status_code
0,290020,Abaré,BA,https://abare-ba.portaltp.com.br/,200
1,290010,Abaíra,BA,https://abaira-ba.portaltp.com.br/,200
2,320010,Afonso Cláudio,ES,https://afonsoclaudio-es.portaltp.com.br/,200
3,320020,Alegre,ES,https://alegre-es.portaltp.com.br/,200
4,320030,Alfredo Chaves,ES,https://alfredochaves-es.portaltp.com.br/,200
5,310170,Almenara,MG,https://almenara-mg.portaltp.com.br/,200
6,320035,Alto Rio Novo,ES,https://altorionovo-es.portaltp.com.br/,200
7,320040,Anchieta,ES,https://anchieta-es.portaltp.com.br/,200
8,290160,Antas,BA,https://antas-ba.portaltp.com.br/,200
9,320050,Apiacá,ES,https://apiaca-es.portaltp.com.br/,200
