# Taxable revenues in Barcelona

Open Data Barcelona provides lots of fun data about our city. You can access it here: https://opendata-ajuntament.barcelona.cat

We will study the mean tax revenues per neighborhood ("barri") in the year 2016. Tax revenues are a proxy for (taxable) income, so we're in a way looking at how income varies across the city. 

Barcelona is divided into districts, and each district is divided into neighborhoods ('barris'). Each neighborhood is in turn divided into several census tracts. In '2016_renda.csv' we have the mean tax revenue (tax revenues / number of tax returns) for each census tract of the city. 

The columns are in Catalan, so here's a quick explanation in English: 

Any = Year  
Codi_Districte = District Code  
Nom_Districte = District Name  
Codi_Barri = Neighborhood Code  
Nom_Barri = Neighborhood Name  
Seccio_Censal = Census Tract Number  
Import_Euros = Mean tax revenues
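
If you prefer to work with English column names, the glossary above can be turned into a rename mapping. This is only a sketch: the English names and the two sample rows are illustrative assumptions, and the rest of this notebook keeps the original Catalan names.

```python
import pandas as pd

# Hypothetical English names for the Catalan columns listed above.
ENGLISH_COLUMNS = {
    "Any": "year",
    "Codi_Districte": "district_code",
    "Nom_Districte": "district_name",
    "Codi_Barri": "neighborhood_code",
    "Nom_Barri": "neighborhood_name",
    "Seccio_Censal": "census_tract",
    "Import_Euros": "mean_tax_revenue",
}

# Two sample rows in the shape of the file, for illustration only.
sample = pd.DataFrame({
    "Any": [2016, 2016],
    "Codi_Districte": [1, 1],
    "Nom_Districte": ["Ciutat Vella", "Ciutat Vella"],
    "Codi_Barri": [1, 1],
    "Nom_Barri": ["el Raval", "el Raval"],
    "Seccio_Censal": [1, 2],
    "Import_Euros": [9977, 7366],
})

# rename returns a new DataFrame; the original column names are untouched.
renamed = sample.rename(columns=ENGLISH_COLUMNS)
print(list(renamed.columns))
```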

First, import the data in "2016_renda.csv" into a Pandas DataFrame.

In [1]:
import pandas as pd

df = pd.read_csv('2016_renda.csv')
df

Unnamed: 0,Any,Codi_Districte,Nom_Districte,Codi_Barri,Nom_Barri,Seccio_Censal,Import_Euros
0,2016,1,Ciutat Vella,1,el Raval,1,9977
1,2016,1,Ciutat Vella,1,el Raval,2,7366
2,2016,1,Ciutat Vella,1,el Raval,3,7657
3,2016,1,Ciutat Vella,1,el Raval,4,9510
4,2016,1,Ciutat Vella,1,el Raval,5,7714
...,...,...,...,...,...,...,...
1063,2016,10,Sant Martí,73,la Verneda i la Pau,143,10551
1064,2016,10,Sant Martí,65,el Clot,234,13866
1065,2016,10,Sant Martí,69,Diagonal Mar i el Front Marítim del Poblenou,235,12175
1066,2016,10,Sant Martí,69,Diagonal Mar i el Front Marítim del Poblenou,236,13028


Get the 5 neighborhoods (barris) with the highest average mean tax revenues.

In [2]:
df.groupby('Nom_Barri') \
  .mean(numeric_only=True) \
  .sort_values('Import_Euros', ascending=False) \
  [:5] \
  ['Import_Euros']

Nom_Barri
les Tres Torres               27626.818182
Pedralbes                     26720.285714
Sant Gervasi - Galvany        24623.580645
Sant Gervasi - la Bonanova    23794.333333
Sarrià                        23012.875000
Name: Import_Euros, dtype: float64
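
An equivalent way to get the same kind of ranking is to select the column first and use `nlargest`, which avoids averaging the code columns that are not needed here. A minimal sketch on made-up toy data (the neighborhood names and values are not from the file):

```python
import pandas as pd

# Toy data with made-up values, using the same column names as the file.
toy = pd.DataFrame({
    "Nom_Barri": ["A", "A", "B", "B", "C"],
    "Import_Euros": [10000, 12000, 20000, 22000, 15000],
})

# Average per neighborhood, then keep the 2 largest averages.
top2 = toy.groupby("Nom_Barri")["Import_Euros"].mean().nlargest(2)
print(top2)
```

On the real data, replacing `toy` with `df` and `2` with `5` should give the same five neighborhoods as above.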

Get the difference in mean revenue between the poorest census tract and the richest, within each district.

You should return a DataFrame with 2 columns:
the district name and the difference in mean revenues.

In [3]:
def get_inequality(df):
    return df.Import_Euros.max() - df.Import_Euros.min()

df.groupby('Nom_Districte') \
  .apply(get_inequality) \
  .sort_values() \
  .reset_index(name='gap')

Unnamed: 0,Nom_Districte,gap
0,Ciutat Vella,8795
1,Gràcia,9303
2,Sant Andreu,9946
3,Nou Barris,10159
4,Sants-Montjuïc,11268
5,Horta-Guinardó,11657
6,Eixample,15176
7,Les Corts,16251
8,Sarrià-Sant Gervasi,16397
9,Sant Martí,19000
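
The same gap can also be computed by selecting `Import_Euros` before aggregating, so that only one column is passed to the aggregation function. Some recent pandas releases warn when `groupby(...).apply` is handed the grouping columns, and this variant sidesteps that. A sketch on made-up toy data:

```python
import pandas as pd

# Toy data with made-up values, using the same column names as the file.
toy = pd.DataFrame({
    "Nom_Districte": ["X", "X", "X", "Y", "Y"],
    "Import_Euros": [8000, 9500, 12000, 15000, 15500],
})

# For each district: richest tract minus poorest tract.
gaps = (
    toy.groupby("Nom_Districte")["Import_Euros"]
       .agg(lambda s: s.max() - s.min())
       .sort_values()
       .reset_index(name="gap")
)
print(gaps)
```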


We want to understand the income variation (or "spatial inequality") within each neighborhood. However, neighborhoods differ in size. Larger neighborhoods will naturally show a greater variation overall, even when there is little variation between one block and the next, and that block-to-block variation is what we want spatial inequality to capture.
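
One way to see the size effect is a small simulation: if every tract's income is drawn from the same distribution, the gap between the richest and poorest tract still grows with the number of tracts. The normal distribution and its parameters below are assumptions chosen purely for illustration, not estimates from the data.

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible

def average_range(num_tracts, trials=2000):
    """Average max-min gap over many simulated neighborhoods of a given size."""
    total = 0.0
    for _ in range(trials):
        # Assumed distribution: mean 12000, standard deviation 2000.
        incomes = [random.gauss(12000, 2000) for _ in range(num_tracts)]
        total += max(incomes) - min(incomes)
    return total / trials

small = average_range(2)
large = average_range(40)
print(small, large)  # the 40-tract gap is larger on average
```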

We will apply the naive solution of simply using the number of census tracts as a proxy for "physical size" of the neighborhood. We will then divide the income gap (difference between lowest and highest income tract) within each neighborhood by the number of tracts as a way to "control for size". This will be our measure of "spatial inequality".
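
As a worked example of the index, take Baró de Viver, which has two census tracts with mean tax revenues of 8430 and 8587 (both values appear in the output further down):

```python
# The two Baró de Viver tracts from the dataset.
tract_incomes = [8430, 8587]

gap = max(tract_incomes) - min(tract_incomes)  # income gap
num_tracts = len(tract_incomes)                # proxy for size
inequality = gap / num_tracts                  # our index

print(gap, num_tracts, inequality)  # 157 2 78.5
```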

We want to have a single function, `spatial_inequality_in_barcelona`, that takes the 2016 dataframe and returns a dataframe sorted by our index of spatial inequality. Below you will find the `spatial_inequality_in_barcelona` function, which takes a dataframe as argument and successively applies several intermediary functions to it. You are asked to write those intermediary functions.

In [4]:
def spatial_inequality_in_barcelona(df):
    df = add_income_diff(df)
    df = add_num_tracts(df)
    df = add_inequality(df)
    return df

Here is what each intermediary function should do:
 - `add_income_diff`: add a column with the income gap in each neighborhood to the dataframe
 - `add_num_tracts`: add a column with the number of census tracts in each neighborhood to the dataframe
 - `add_inequality`: add a column with the income gap divided by the number of tracts in each neighborhood to the dataframe

Write the function: "add_income_diff"

In [5]:
def add_diff(df):
    gap = get_inequality(df)
    return df.assign(gap=gap)

def add_income_diff(df):
    return df.groupby('Nom_Barri',group_keys=False) \
             .apply(add_diff) \
             .reset_index(drop=True)
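
An alternative sketch uses `groupby(...).transform`, which broadcasts a per-group value back onto every row and keeps the original row order, so no `apply` or `reset_index` is needed. The function name `add_income_diff_transform` and the toy data are hypothetical.

```python
import pandas as pd

# Toy data with made-up values, using the same column names as the file.
toy = pd.DataFrame({
    "Nom_Barri": ["A", "A", "B", "B", "B"],
    "Import_Euros": [9000, 9500, 12000, 15000, 13000],
})

def add_income_diff_transform(df):
    # Per-neighborhood max and min, repeated on every row of the group.
    grouped = df.groupby("Nom_Barri")["Import_Euros"]
    return df.assign(gap=grouped.transform("max") - grouped.transform("min"))

result = add_income_diff_transform(toy)
print(result)
```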

Write the function: "add_num_tracts"

In [6]:
def add_num_tracts(df):
    return df.groupby('Nom_Barri',group_keys=False) \
             .apply(lambda df: df.assign(num_tracts = df.shape[0])) \
             .reset_index(drop=True)
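
The tract count can be obtained the same way with `transform("size")`, which writes the group size onto each row. Again a sketch with a hypothetical function name and made-up toy data:

```python
import pandas as pd

# Toy data with made-up values, using the same column names as the file.
toy = pd.DataFrame({
    "Nom_Barri": ["A", "A", "B", "B", "B"],
    "Seccio_Censal": [1, 2, 1, 2, 3],
})

def add_num_tracts_transform(df):
    # Number of rows (census tracts) in each neighborhood, per row.
    counts = df.groupby("Nom_Barri")["Seccio_Censal"].transform("size")
    return df.assign(num_tracts=counts)

result = add_num_tracts_transform(toy)
print(result)
```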

Write the function: "add_inequality"

In [7]:
def add_inequality(df):
    return df.groupby('Nom_Barri',group_keys=False) \
             .apply(lambda df: df.assign(inequality = df.gap/df.num_tracts)) \
             .reset_index(drop=True) \
             .sort_values('inequality')
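
Since `gap` and `num_tracts` are already constant within each neighborhood, the division does not strictly need a `groupby`: dividing the two columns row by row gives the same values. A sketch under that assumption, with a hypothetical function name and toy data:

```python
import pandas as pd

# Toy frame that already carries the two helper columns.
toy = pd.DataFrame({
    "Nom_Barri": ["A", "A", "B", "B", "B"],
    "gap": [500, 500, 3000, 3000, 3000],
    "num_tracts": [2, 2, 3, 3, 3],
})

def add_inequality_vectorized(df):
    # Element-wise division, then sort by the new index.
    return df.assign(inequality=df["gap"] / df["num_tracts"]) \
             .sort_values("inequality")

result = add_inequality_vectorized(toy)
print(result)
```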


Try your function `spatial_inequality_in_barcelona`. Does it work when given the raw data?

In [8]:
spatial_inequality_in_barcelona(df)

Unnamed: 0,Any,Codi_Districte,Nom_Districte,Codi_Barri,Nom_Barri,Seccio_Censal,Import_Euros,gap,num_tracts,inequality
251,2016,3,Sants-Montjuïc,12,la Marina del Prat Vermell,25,9411,0,1,0.000000
824,2016,8,Nou Barris,56,Vallbona,116,8445,0,1,0.000000
687,2016,7,Horta-Guinardó,42,la Clota,102,13651,0,1,0.000000
832,2016,9,Sant Andreu,58,Baró de Viver,7,8430,157,2,78.500000
831,2016,9,Sant Andreu,58,Baró de Viver,6,8587,157,2,78.500000
...,...,...,...,...,...,...,...,...,...,...
972,2016,10,Sant Martí,67,la Vila Olímpica del Poblenou,52,21522,8310,5,1662.000000
971,2016,10,Sant Martí,67,la Vila Olímpica del Poblenou,51,16952,8310,5,1662.000000
403,2016,5,Sarrià-Sant Gervasi,22,"Vallvidrera, el Tibidabo i les Planes",3,21524,9995,3,3331.666667
401,2016,5,Sarrià-Sant Gervasi,22,"Vallvidrera, el Tibidabo i les Planes",1,12967,9995,3,3331.666667


The resulting dataframe has several issues. It reports the neighborhood inequality once per census tract, so it contains a lot of repeated information. In addition, some neighborhoods have 0 inequality because they have only one census tract. Use `drop_duplicates()` to retain only the relevant information and a boolean mask to remove the neighborhoods with only 1 census tract.

In [12]:
df2 = spatial_inequality_in_barcelona(df)
df2[df2['num_tracts'] > 1][['Nom_Barri','inequality']].drop_duplicates()

Unnamed: 0,Nom_Barri,inequality
832,Baró de Viver,78.500000
659,el Carmel,115.681818
843,Sant Andreu,135.025641
673,la Teixonera,142.000000
613,el Guinardó,148.520000
...,...,...
838,el Bon Pastor,1086.000000
1065,Diagonal Mar i el Front Marítim del Poblenou,1248.285714
393,Pedralbes,1264.714286
975,la Vila Olímpica del Poblenou,1662.000000
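
The two cleaning steps can be seen in isolation on a toy frame. Filtering on the tract count is one way to express "only 1 census tract" directly, and `drop_duplicates()` keeps the first row of each repeated pair together with its original index label, which is why the index in the output above is not consecutive. The values here are made up.

```python
import pandas as pd

# Toy result in the shape produced by the pipeline (made-up values).
toy = pd.DataFrame({
    "Nom_Barri": ["A", "A", "B", "C", "C", "C"],
    "num_tracts": [2, 2, 1, 3, 3, 3],
    "inequality": [250.0, 250.0, 0.0, 1000.0, 1000.0, 1000.0],
})

# Boolean mask: True for rows in neighborhoods with more than one tract.
multi_tract = toy["num_tracts"] > 1

# Keep the two columns of interest, then collapse repeated rows.
summary = toy.loc[multi_tract, ["Nom_Barri", "inequality"]].drop_duplicates()
print(summary)
```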


Try to rewrite `spatial_inequality_in_barcelona` with fewer intermediary functions, correcting for duplicates and single-tract neighborhoods.

In [13]:
def add_inequality(df):
    # df holds the census tracts of a single neighborhood
    gap = df.Import_Euros.max() - df.Import_Euros.min()
    sections = df.shape[0]  # number of census tracts
    return df.assign(gap=gap, 
                     sections=sections, 
                     inequality=gap/sections)

def spatial_inequality_in_barcelona(df):
    return df.groupby('Nom_Barri',group_keys=False) \
             .apply(add_inequality) \
             .reset_index(drop=True) \
             .sort_values('inequality') \
             .pipe(lambda df: df[df.sections > 1]) \
             [['Nom_Barri', 'inequality']] \
             .drop_duplicates()

spatial_inequality_in_barcelona(df)

Unnamed: 0,Nom_Barri,inequality
832,Baró de Viver,78.500000
659,el Carmel,115.681818
843,Sant Andreu,135.025641
673,la Teixonera,142.000000
613,el Guinardó,148.520000
...,...,...
838,el Bon Pastor,1086.000000
1065,Diagonal Mar i el Front Marítim del Poblenou,1248.285714
393,Pedralbes,1264.714286
975,la Vila Olímpica del Poblenou,1662.000000
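
Another possible rewrite aggregates to one row per neighborhood from the start, so there are no duplicates to drop afterwards. This is a sketch, not the reference solution: the function name `spatial_inequality_agg` and the toy data are hypothetical, and unlike the version above it returns a fresh 0-based index.

```python
import pandas as pd

# Toy data with made-up values, using the same column names as the file.
toy = pd.DataFrame({
    "Nom_Barri": ["A", "A", "B", "C", "C", "C"],
    "Import_Euros": [9000, 9500, 11000, 12000, 15000, 13000],
})

def spatial_inequality_agg(df):
    # One row per neighborhood: poorest tract, richest tract, tract count.
    per_barri = df.groupby("Nom_Barri")["Import_Euros"].agg(
        low="min", high="max", num_tracts="size"
    )
    # Drop single-tract neighborhoods, whose gap is 0 by construction.
    per_barri = per_barri[per_barri["num_tracts"] > 1]
    inequality = (per_barri["high"] - per_barri["low"]) / per_barri["num_tracts"]
    return inequality.sort_values().reset_index(name="inequality")

result = spatial_inequality_agg(toy)
print(result)
```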
