# Tax Revenues (Income!) in Barcelona

Open Data Barcelona provides lots of fun data about our city.

You can access it here: https://opendata-ajuntament.barcelona.cat

We will be examining average tax returns per neighborhood ("barri") in the years 2015 and 2016. Tax revenues are, naturally, a proxy for income, so we're really looking at how (taxable) income varies across the city.

The columns are in Catalan, so here's a quick explanation in English: 

Any = Year
Codi_Districte = District Code
Nom_Districte = District Name
Codi_Barri = Neighborhood Code
Nom_Barri = Neighborhood Name
Seccio_Censal = Census Tract Number
Import_Euros = Tax Revenue (average over all individuals in the census tract)
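
If the Catalan names get in the way while exploring, one option is to rename the columns. The sketch below assumes the column names listed above; the English names and the single sample row are made up for illustration and are not taken from the real file. The exercises below refer to the original Catalan names (e.g. `Nom_Barri`), so treat this as optional.

```python
import pandas as pd

# Hypothetical English names for the Catalan columns listed above.
COLUMN_TRANSLATIONS = {
    "Any": "year",
    "Codi_Districte": "district_code",
    "Nom_Districte": "district_name",
    "Codi_Barri": "neighborhood_code",
    "Nom_Barri": "neighborhood_name",
    "Seccio_Censal": "census_tract",
    "Import_Euros": "tax_revenue_eur",
}

# A single made-up row, only to show the shape of the data.
sample = pd.DataFrame({
    "Any": [2016],
    "Codi_Districte": [1],
    "Nom_Districte": ["Ciutat Vella"],
    "Codi_Barri": [1],
    "Nom_Barri": ["el Raval"],
    "Seccio_Censal": [1],
    "Import_Euros": [10000.0],
})

# rename returns a new DataFrame; the original keeps its Catalan names.
translated = sample.rename(columns=COLUMN_TRANSLATIONS)
print(translated.columns.tolist())
```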

In [None]:
# Let's begin by reading the file "2016_renda.csv"
# into a DataFrame:


In [None]:
#
# 1)
# Get the (5) barris with the highest average tax revenues
# (i.e. average over the census tracts in each barri)


In [None]:
#
# 2)
# Get the difference in mean revenue between the 
# poorest census tract and the richest, within 
# each district.
#
# You should return a DataFrame with 2 columns:
# The district name and the difference in revenue.


## Planning Your Attack

One pattern to make your code more legible, and to make it easier to break down big problems, is to ensure that your code can be read on two levels: one "declarative" level, where someone can read (or write) *what* will happen, and another "imperative" level, where someone can read (or write!) *how* the thing is happening.

Data preparation often involves a "pipeline", a uni-directional flow of transformations where the data is moved, one step at a time, towards the final format.
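
As a rough illustration of the two levels, here is a minimal sketch on made-up shop sales data (unrelated to the tax data). All function and column names are hypothetical. The small helpers are the imperative level, and the pipeline function that strings them together is the declarative level.

```python
import pandas as pd

# Made-up data: a few shops and their daily sales.
sales = pd.DataFrame({
    "shop": ["A", "A", "B", "B", "B"],
    "units": [3, 5, 2, 0, 4],
    "price": [2.0, 2.0, 10.0, 10.0, 10.0],
})

# Imperative level: each helper says *how* one step is done.
def add_revenue(df):
    # assign returns a new DataFrame with the extra column.
    return df.assign(revenue=df["units"] * df["price"])

def drop_empty_days(df):
    # Keep only the rows where something was sold.
    return df[df["units"] > 0]

def total_revenue_per_shop(df):
    # One row per shop, with the summed revenue.
    return df.groupby("shop", as_index=False)["revenue"].sum()

# Declarative level: the pipeline says *what* happens, in order.
def revenue_by_shop(df):
    df = add_revenue(df)
    df = drop_empty_days(df)
    return total_revenue_per_shop(df)

result = revenue_by_shop(sales)
print(result)
```

Each helper takes a DataFrame and returns a DataFrame, which is what lets the steps chain in one direction. The same flow could also be written with `DataFrame.pipe`, e.g. `sales.pipe(add_revenue).pipe(drop_empty_days).pipe(total_revenue_per_shop)`.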

Creating a pipeline can be a big problem, so it's important to make a plan before you start.

One way to make a plan is to start from the final goal and write out the following statement:

1. "If I had ________ (INPUT), then it would be easy to make [FINAL GOAL], I would just need to ________ (step)."

Where you should think of INPUT as "data ______ in data structure ______".

That will be the final step of your pipeline. Now repeat the statement, with the FINAL GOAL being replaced with the INPUT of the previous step: 

2. "If I had ________ (INPUT), then it would be easy to make [PREVIOUS INPUT], I would just need to ________ (step)."

Let's see an example of this method of planning by working out an exercise:

In [None]:
#
# Your goal will be the following: 
#
# We want to understand the income variation 
# (or "spatial inequality") within each "barri".
# However, each barri is a different size.
# Larger barris will naturally have a greater
# variation, even if there isn't great variation
# between one block and the next, which is what
# we want to understand with spatial inequality.
# To deal with this, we will apply a naive solution
# of simply using the number of census tracts as
# a proxy for "physical size" of the barri. We 
# will then divide the income gap (difference between
# lowest and highest income tract) within each barri
# by the number of tracts as a way to "control for size".
# This will be our measure of "spatial inequality".
#
# Your job is to return a dataframe sorted by 
# spatial inequality, with any barri with one
# tract (0 inequality) removed.
#
#
# We will try to lay out a plan to solve the problem
# at hand with the process we just went over:

# 1. If I had <<an extra column on the dataframe of 
#    the income gap divided by the number of tracts>>
#    then it would be easy to <<get the barris with 
#    highest and lowest normalized income gap>>, I 
#    would just need to <<sort the dataframe by that
#    column>>.
#
# 2. If I had << A. a column for the income gap and 
#    B. a column for the number of tracts in a barri>>
#    then it would be easy to make << an extra column on the
#    dataframe of the income gap divided by the number of tracts>>
#    I would just need to <<divide one column by the other>>. 
#
#3b. If I had <<the raw data>>, then it would be easy to make
#    <<a column with the number of tracts>>, I would just need
#    to <<count the number of tracts per barri>>.
#
#3a. If I had <<the raw data>>, then it would be easy to make
#    <<a column with the income gap>>, I would just need to
#    <<calculate the income difference between tracts in each 
#    barri>>. 
#
# Now we can use this outline to write a declarative pipeline
# function (in the opposite order of the steps we wrote): 

def spatial_inequality_in_barcelona(df):
    df = add_income_diff_for_barris(df)
    df = add_num_tracts_per_barri(df)
    df = add_inequality(df)
    return inequality_by_barri(df)

# In the next exercises, you will write each of those functions,
# and in the end, use this function to compare barris based on
# their spatial inequality.

In [None]:
#
# 3)
# Write the function: "add_income_diff_for_barris"
#
# HINT: Make sure the returned dataframe is the
# same size as the original!
#


In [None]:
#
# 4)
# Create the function: "add_num_tracts_per_barri"


In [None]:
#
# 5)
# Create the function: "add_inequality"


In [None]:
#
# 6)
# Add the function "inequality_by_barri"
# 
# Note that this function should probably 
# make sure that the dataframe has the
# same number of rows as number of barris
# (i.e. one barri per row).
#
# Also note that some barris have an inequality
# of 0. Let's go ahead and remove them!


In [None]:
# 
# 7) 
# Try out the function we wrote out in the planning
# phase, spatial_inequality_in_barcelona.
# Does it work when given the raw data?
# 
# Now let's go ahead and "refactor".
# "Refactoring" means rewriting the code without
# changing the functionality. What we wrote works,
# and is great and legible. 
# 
# But maybe breaking it down into so many separate 
# steps, while didactic, could be considered overkill
# and maybe isn't the most efficient. You probably
# grouped by "Nom_Barri" at least 3 separate times!
#
# Try to rewrite the function spatial_inequality_in_barcelona
# to be more efficient (to only groupby Nom_Barri once!)
# and a bit shorter.


In [None]:
# Open Data Barcelona provides the tax data for years
# 2015 and 2016 in different CSV files. Read in the tax data
# for year 2015 so we can see how incomes have changed
# between the years. 

#
# 8)
# Get the growth of the mean tax revenue per census
# tract. Create a DataFrame that has the district, barri,
# and census tract as well as the difference in revenue
# between the years for each tract.
#
# Sort by the difference per tract.


In [None]:
#
# 9)
# Get the mean growth per barri. 
# Sort by mean growth.
