# Taxable revenues in Barcelona

Open Data Barcelona provides lots of fun data about our city. You can access it here: https://opendata-ajuntament.barcelona.cat

We will study the mean tax revenues per neighborhood ("barri") in the years 2016 and 2015. Tax revenues are a proxy for (taxable) income, so we're in a way looking at how income varies across the city. 

Barcelona is divided into districts, and each district is divided into neighborhoods ("barris"). Each neighborhood is in turn divided into several census tracts. In '2016_renda.csv' we have the mean tax revenues (tax revenues / number of tax returns) for each census tract of the city.

The columns are in Catalan, so here's a quick explanation in English: 

Any = Year  
Codi_Districte = District Code  
Nom_Districte = District Name  
Codi_Barri = Neighborhood Code  
Nom_Barri = Neighborhood Name  
Seccio_Censal = Census Tract Number  
Import_Euros = Mean tax revenues

First, import the data in "2016_renda.csv" into a Pandas DataFrame.

In [14]:
import pandas as pd

df = pd.read_csv('2016_renda.csv')
df

Unnamed: 0,Any,Codi_Districte,Nom_Districte,Codi_Barri,Nom_Barri,Seccio_Censal,Import_Euros
0,2016,1,Ciutat Vella,1,el Raval,1,9977
1,2016,1,Ciutat Vella,1,el Raval,2,7366
2,2016,1,Ciutat Vella,1,el Raval,3,7657
3,2016,1,Ciutat Vella,1,el Raval,4,9510
4,2016,1,Ciutat Vella,1,el Raval,5,7714
...,...,...,...,...,...,...,...
1063,2016,10,Sant Martí,73,la Verneda i la Pau,143,10551
1064,2016,10,Sant Martí,65,el Clot,234,13866
1065,2016,10,Sant Martí,69,Diagonal Mar i el Front Marítim del Poblenou,235,12175
1066,2016,10,Sant Martí,69,Diagonal Mar i el Front Marítim del Poblenou,236,13028
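
If the Catalan column names get in the way, one option is to rename them with the glossary above. This is only a sketch: the English identifiers are this example's own choice, and the small frame is rebuilt by hand from the first rows shown above so that it runs without the CSV file.

```python
import pandas as pd

# Catalan -> English, following the glossary above.
# The English identifiers are hypothetical names chosen for this sketch.
COLUMN_TRANSLATIONS = {
    "Any": "year",
    "Codi_Districte": "district_code",
    "Nom_Districte": "district_name",
    "Codi_Barri": "neighborhood_code",
    "Nom_Barri": "neighborhood_name",
    "Seccio_Censal": "census_tract",
    "Import_Euros": "mean_tax_revenue",
}

# First three rows of the table above, typed in by hand.
sample = pd.DataFrame({
    "Any": [2016, 2016, 2016],
    "Codi_Districte": [1, 1, 1],
    "Nom_Districte": ["Ciutat Vella"] * 3,
    "Codi_Barri": [1, 1, 1],
    "Nom_Barri": ["el Raval"] * 3,
    "Seccio_Censal": [1, 2, 3],
    "Import_Euros": [9977, 7366, 7657],
})

# rename returns a new frame; the original keeps its Catalan names.
english = sample.rename(columns=COLUMN_TRANSLATIONS)
print(english.columns.tolist())
```

The rest of this notebook keeps the original Catalan names, so the renamed frame is for exploration only.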


Get the 5 neighborhoods (barris) with the highest average mean tax revenues.

In [15]:
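# Select the numeric column before averaging: recent pandas versions
# raise an error when .mean() is applied to string columns such as Nom_Districte.
df.groupby('Nom_Barri')['Import_Euros'] \
  .mean() \
  .reset_index() \
  .sort_values('Import_Euros', ascending=False) \
  .head(5)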

Unnamed: 0,Nom_Barri,Import_Euros
72,les Tres Torres,27626.818182
11,Pedralbes,26720.285714
17,Sant Gervasi - Galvany,24623.580645
18,Sant Gervasi - la Bonanova,23794.333333
23,Sarrià,23012.875
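
As a possible shortcut, `Series.nlargest` combines the sort and the slice in one call. The sketch below runs on a handful of tracts copied from the table at the top of the notebook, so its averages are for this sample only and not the real barri averages.

```python
import pandas as pd

# Nine tracts copied from the rows displayed earlier in the notebook.
sample = pd.DataFrame({
    "Nom_Barri": [
        "el Raval", "el Raval", "el Raval", "el Raval", "el Raval",
        "la Verneda i la Pau",
        "el Clot",
        "Diagonal Mar i el Front Marítim del Poblenou",
        "Diagonal Mar i el Front Marítim del Poblenou",
    ],
    "Import_Euros": [9977, 7366, 7657, 9510, 7714, 10551, 13866, 12175, 13028],
})

# Average per barri, then keep the 2 largest averages (sorted descending).
top2 = (
    sample.groupby("Nom_Barri")["Import_Euros"]
    .mean()
    .nlargest(2)
    .reset_index()
)
print(top2)
```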


Get the difference in mean revenue between the poorest census tract and the richest, within each district.

You should return a DataFrame with 2 columns:
the district name and the difference in mean revenues.

In [16]:
def get_inequality(df):
    return df.Import_Euros.max() - df.Import_Euros.min()

df.groupby('Nom_Districte') \
  .apply(get_inequality) \
  .sort_values() \
  .reset_index(name='gap')

Unnamed: 0,Nom_Districte,gap
0,Ciutat Vella,8795
1,Gràcia,9303
2,Sant Andreu,9946
3,Nou Barris,10159
4,Sants-Montjuïc,11268
5,Horta-Guinardó,11657
6,Eixample,15176
7,Les Corts,16251
8,Sarrià-Sant Gervasi,16397
9,Sant Martí,19000
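
The same gap can also be computed without `apply`, by subtracting the grouped minimum from the grouped maximum. A minimal sketch on a few tracts copied from the first table; the gaps it prints describe this sample, not the full districts.

```python
import pandas as pd

# Tracts copied from the rows displayed at the top of the notebook.
sample = pd.DataFrame({
    "Nom_Districte": ["Ciutat Vella"] * 5 + ["Sant Martí"] * 4,
    "Import_Euros": [9977, 7366, 7657, 9510, 7714, 10551, 13866, 12175, 13028],
})

tract_income = sample.groupby("Nom_Districte")["Import_Euros"]

# max() and min() each return one value per district, indexed by district,
# so the subtraction aligns on the district name.
gaps = (
    (tract_income.max() - tract_income.min())
    .sort_values()
    .reset_index(name="gap")
)
print(gaps)
```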


## Planning Your Attack

One way to make your code easier to understand is to ensure that it can be read on two levels: a "declarative" level, where someone can read (or write) *what* will happen, and an "imperative" level, where someone can read (or write!) *how* it happens.

Data preparation often involves a "pipeline", a uni-directional flow of transformations where the data is moved, one step at a time, towards the final format.
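
To make the two levels concrete, here is a small sketch of a pipeline built with `DataFrame.pipe`. The function names (`select_district`, `mean_revenue_per_barri`, `summarize_district`) are hypothetical and exist only for this illustration; the sample rows are copied from the first table.

```python
import pandas as pd

def select_district(df, name):
    # Imperative level: *how* we filter (a boolean mask).
    return df[df["Nom_Districte"] == name]

def mean_revenue_per_barri(df):
    # Imperative level: *how* we average (groupby + mean).
    return df.groupby("Nom_Barri", as_index=False)["Import_Euros"].mean()

def summarize_district(df, name):
    # Declarative level: reads as a list of *what* happens, in order.
    return (
        df.pipe(select_district, name)
          .pipe(mean_revenue_per_barri)
    )

sample = pd.DataFrame({
    "Nom_Districte": ["Ciutat Vella", "Sant Martí", "Sant Martí",
                      "Sant Martí", "Sant Martí"],
    "Nom_Barri": [
        "el Raval",
        "la Verneda i la Pau",
        "el Clot",
        "Diagonal Mar i el Front Marítim del Poblenou",
        "Diagonal Mar i el Front Marítim del Poblenou",
    ],
    "Import_Euros": [9977, 10551, 13866, 12175, 13028],
})

summary = summarize_district(sample, "Sant Martí")
print(summary)
```

Each step takes a DataFrame and returns a DataFrame, which is what lets the steps be chained in one direction.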

It's important to plan ahead when you design your pipeline.

One way to make a plan is to start from the final goal, and write out the following statement: 

1. "If I had ________ (INPUT), then it would be easy to make [FINAL GOAL], I would just need to ________ (step)."

Where you should think of INPUT as "data ______ in data structure ______".

That will be the final step of your pipeline. Now repeat the statement, with the FINAL GOAL being replaced with the INPUT of the previous step: 

2. "If I had ________ (INPUT), then it would be easy to make [PREVIOUS INPUT], I would just need to ________ (step)."

Let's see an example of this method of planning by working out an exercise:


Your goal will be the following: 

We want to understand the income variation (or "spatial inequality") within each "barri". However, each barri has a different size.
Larger barris will naturally show greater variation overall, even when there is little variation from one block to the next,
and that block-to-block variation is what we want spatial inequality to capture.

To deal with this, we will apply the naive solution of simply using the number of census tracts as a proxy for "physical size" of the barri. We will then divide the income gap (difference between
lowest and highest income tract) within each barri by the number of tracts as a way to "control for size". This will be our measure of "spatial inequality".
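
As a quick illustration of the measure, the sketch below applies it to two made-up barris. The income figures are hypothetical and chosen only to show the effect of dividing by the number of tracts.

```python
def spatial_inequality(tract_incomes):
    """Income gap (highest minus lowest tract) divided by the number of tracts."""
    gap = max(tract_incomes) - min(tract_incomes)
    return gap / len(tract_incomes)

# Hypothetical barris with the same income gap (6000) but different sizes.
small_barri = [10000, 12000, 16000]
large_barri = [10000, 11000, 12000, 13000, 14000, 16000]

print(spatial_inequality(small_barri))  # 6000 / 3
print(spatial_inequality(large_barri))  # 6000 / 6
```

With the same gap, the barri with more tracts gets the lower score, which is the "control for size" described above.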

Your job is to return a dataframe sorted by our index of spatial inequality where any barri with one tract (0 inequality) has been removed.


We will try to lay out a plan to solve the problem with the process we just went over:

1. If I had a *column with
   the income gap divided by the number of tracts*
   then it would be easy to *get the barris with 
   highest and lowest spatial inequality*, I 
   would just need to *sort the dataframe by that
   column*.

2. If I had *A. a column for the income gap and 
   B. a column for the number of tracts in a barri*
   then it would be easy to get a *column with
   the income gap divided by the number of tracts*
   I would just need to *divide the first column by the second*. 

3b. If I had *the raw data*, then it would be easy to add
   *a column with the number of tracts*, I would just need
   to *count the number of tracts per barri*.

3a. If I had *the raw data*, then it would be easy to add
   *a column with the income gap*, I would just need to
   *calculate the income difference between tracts in each 
   barri*. 

Now we can use this outline to write a declarative pipeline
function (in the opposite order of the steps we wrote): 

In [17]:
def spatial_inequality_in_barcelona(df):
    df = add_income_diff_for_barris(df)
    df = add_num_tracts_per_barri(df)
    df = add_inequality(df)
    return inequality_by_barri(df)

In the next exercises, you will write each of the functions in the pipeline above and, at the end, use this function to compare barris based on their spatial inequality.

Write the function: "add_income_diff_for_barris"

HINT: Make sure the returned dataframe is the same size as the original!


In [18]:
def add_diff(df):
    gap = get_inequality(df)
    return df.assign(gap=gap)

def add_income_diff_for_barris(df):
    return df.groupby('Nom_Barri') \
             .apply(add_diff) \
             .reset_index(drop=True)
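
An alternative worth knowing is `transform`, which returns one value per original row and so keeps the dataframe the same size without a `reset_index`. The function name below is hypothetical, and the sample rows are copied from the first table, so the gaps are for this sample only.

```python
import pandas as pd

def add_income_diff_with_transform(df):
    tract_income = df.groupby("Nom_Barri")["Import_Euros"]
    # transform broadcasts each group's max/min back onto that group's rows,
    # so the result lines up with df row by row.
    gap = tract_income.transform("max") - tract_income.transform("min")
    return df.assign(gap=gap)

sample = pd.DataFrame({
    "Nom_Barri": [
        "el Raval", "el Raval", "el Raval", "el Raval", "el Raval",
        "la Verneda i la Pau",
        "el Clot",
        "Diagonal Mar i el Front Marítim del Poblenou",
        "Diagonal Mar i el Front Marítim del Poblenou",
    ],
    "Import_Euros": [9977, 7366, 7657, 9510, 7714, 10551, 13866, 12175, 13028],
})

with_gap = add_income_diff_with_transform(sample)
print(with_gap)
```

Unlike the `apply` version, this one also preserves the original row order and index.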

Write the function: "add_num_tracts_per_barri"

In [19]:
def add_num_tracts_per_barri(df):
    return df.groupby('Nom_Barri') \
             .apply(lambda df: df.assign(num_tracts = df.shape[0])) \
             .reset_index(drop=True)
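
The tract count can be added in the same spirit with `transform("size")`. Again a sketch with a hypothetical function name, run on a few rows copied from the first table; the counts describe this sample, not the real barris.

```python
import pandas as pd

def add_num_tracts_with_transform(df):
    # "size" counts the rows in each group; transform repeats that count
    # on every row of the group.
    num_tracts = df.groupby("Nom_Barri")["Seccio_Censal"].transform("size")
    return df.assign(num_tracts=num_tracts)

sample = pd.DataFrame({
    "Nom_Barri": [
        "el Raval", "el Raval", "el Raval",
        "el Clot",
        "Diagonal Mar i el Front Marítim del Poblenou",
        "Diagonal Mar i el Front Marítim del Poblenou",
    ],
    "Seccio_Censal": [1, 2, 3, 234, 235, 236],
})

with_counts = add_num_tracts_with_transform(sample)
print(with_counts)
```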

Write the function: "add_inequality"

In [20]:
def add_inequality(df):
    return df.groupby('Nom_Barri') \
             .apply(lambda df: df.assign(inequality = df.gap/df.num_tracts)) \
             .reset_index(drop=True)
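
Since `gap` and `num_tracts` are already repeated on every row, this step arguably does not need a `groupby` at all: dividing the two columns works row by row. A minimal sketch with a hypothetical function name, using gap and tract-count values taken from the results table further down.

```python
import pandas as pd

def add_inequality_without_groupby(df):
    # Plain column arithmetic: each row divides its own gap by its own count.
    return df.assign(inequality=df["gap"] / df["num_tracts"])

# gap / num_tracts pairs as they appear in the results table of this notebook.
sample = pd.DataFrame({
    "Nom_Barri": ["Baró de Viver", "el Carmel", "la Teixonera"],
    "gap": [157, 2545, 1136],
    "num_tracts": [2, 22, 8],
})

with_inequality = add_inequality_without_groupby(sample)
print(with_inequality)
```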


Write the function "inequality_by_barri". Note that this function should ensure that the returned dataframe has exactly one row per barri. Also, remove any barri whose inequality is 0.

In [21]:
def inequality_by_barri(df):
    return df.drop_duplicates('Nom_Barri') \
             .drop(columns = ['Seccio_Censal']) \
             .sort_values('inequality') \
             .pipe(lambda df: df[df.inequality != 0])
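
The one-row-per-barri requirement can also be verified explicitly rather than assumed. The sketch below is one possible way to do that; the function name and the sample values are hypothetical.

```python
import pandas as pd

def inequality_by_barri_checked(df):
    # Every tract of a barri should carry the same inequality value;
    # otherwise keeping only the first row per barri would lose information.
    values_per_barri = df.groupby("Nom_Barri")["inequality"].nunique()
    if not values_per_barri.eq(1).all():
        raise ValueError("inequality is not constant within each barri")

    per_barri = df.drop_duplicates("Nom_Barri")
    # One row per barri is the shape the exercise asks for.
    assert len(per_barri) == df["Nom_Barri"].nunique()

    return (
        per_barri[per_barri["inequality"] != 0]
        .drop(columns=["Seccio_Censal"])
        .sort_values("inequality")
    )

# Hypothetical tracts: barri C has a single tract, hence inequality 0.
sample = pd.DataFrame({
    "Nom_Barri": ["barri A", "barri A", "barri B", "barri B", "barri C"],
    "Seccio_Censal": [1, 2, 3, 4, 5],
    "inequality": [142.0, 142.0, 78.5, 78.5, 0.0],
})

checked = inequality_by_barri_checked(sample)
print(checked)
```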

Try out the function we wrote out in the planning phase, `spatial_inequality_in_barcelona`. Does it work when given the raw data?

In [22]:
spatial_inequality_in_barcelona(df)

Unnamed: 0,Any,Codi_Districte,Nom_Districte,Codi_Barri,Nom_Barri,Import_Euros,gap,num_tracts,inequality
0,2016,9,Sant Andreu,58,Baró de Viver,8587,157,2,78.500000
492,2016,7,Horta-Guinardó,37,el Carmel,11044,2545,22,115.681818
139,2016,9,Sant Andreu,60,Sant Andreu,14460,5266,39,135.025641
964,2016,7,Horta-Guinardó,38,la Teixonera,12415,1136,8,142.000000
565,2016,7,Horta-Guinardó,35,el Guinardó,13484,3713,25,148.520000
...,...,...,...,...,...,...,...,...,...
460,2016,9,Sant Andreu,59,el Bon Pastor,16064,7602,7,1086.000000
47,2016,10,Sant Martí,69,Diagonal Mar i el Front Marítim del Poblenou,16881,8738,7,1248.285714
102,2016,4,Les Corts,21,Pedralbes,29364,8853,7,1264.714286
1007,2016,10,Sant Martí,67,la Vila Olímpica del Poblenou,16952,8310,5,1662.000000


Now let's "refactor". "Refactoring" means rewriting the code without changing the functionality. What we wrote works and is readable. 

But maybe breaking it down into so many separate steps, while didactic, isn't the most efficient. You probably grouped by "Nom_Barri" at least 3 separate times!

Try to rewrite the function `spatial_inequality_in_barcelona` so that it is shorter and more efficient (for example, by grouping by `Nom_Barri` only once).

In [23]:
def add_inequality(df):
    gap = df.Import_Euros.max() - df.Import_Euros.min()
    sections = df.shape[0]
    return df.assign(gap=gap, 
                     sections=sections, 
                     inequality=gap/sections)

def spatial_inequality_in_barcelona(df):
    return df.groupby('Nom_Barri') \
             .apply(add_inequality) \
             .reset_index(drop=True) \
             .sort_values('inequality') \
             .pipe(lambda df: df[df.gap != 0]) \
             [['Nom_Barri', 'gap', 'sections', 'inequality']]  

spatial_inequality_in_barcelona(df)

Unnamed: 0,Nom_Barri,gap,sections,inequality
0,Baró de Viver,157,2,78.500000
1,Baró de Viver,157,2,78.500000
513,el Carmel,2545,22,115.681818
512,el Carmel,2545,22,115.681818
511,el Carmel,2545,22,115.681818
...,...,...,...,...
1009,la Vila Olímpica del Poblenou,8310,5,1662.000000
1010,la Vila Olímpica del Poblenou,8310,5,1662.000000
363,"Vallvidrera, el Tibidabo i les Planes",9995,3,3331.666667
362,"Vallvidrera, el Tibidabo i les Planes",9995,3,3331.666667
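
Note that the output above has one row per census tract, whereas the first version of the pipeline returned one row per barri. If one row per barri is wanted, a possible variation is to aggregate instead of assigning. This is a sketch with a hypothetical function name, run on a few tracts copied from the first table, so its numbers describe that sample and not the real barris.

```python
import pandas as pd

def spatial_inequality_per_barri(df):
    tract_income = df.groupby("Nom_Barri")["Import_Euros"]
    # Both Series are indexed by barri, so they combine into one row per barri.
    summary = pd.DataFrame({
        "gap": tract_income.max() - tract_income.min(),
        "sections": tract_income.size(),
    })
    summary["inequality"] = summary["gap"] / summary["sections"]
    return (
        summary[summary["gap"] != 0]   # drops single-tract barris
        .sort_values("inequality")
        .reset_index()                 # Nom_Barri back as a column
    )

sample = pd.DataFrame({
    "Nom_Barri": [
        "el Raval", "el Raval", "el Raval", "el Raval", "el Raval",
        "la Verneda i la Pau",
        "el Clot",
        "Diagonal Mar i el Front Marítim del Poblenou",
        "Diagonal Mar i el Front Marítim del Poblenou",
    ],
    "Import_Euros": [9977, 7366, 7657, 9510, 7714, 10551, 13866, 12175, 13028],
})

per_barri = spatial_inequality_per_barri(sample)
print(per_barri)
```

This still groups by `Nom_Barri` only once, and the result no longer needs de-duplication.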
