# Tidy Tuesday: Income Inequality Before and After Taxes
**August 5, 2025**

Today's goal is to answer the [5 questions](https://github.com/rfordatascience/tidytuesday/blob/main/data/2025/2025-08-05/readme.md#:~:text=Which%20countries%20have,the%20available%20data%3F) from the readme file. They are:
* Which countries have the highest Gini coefficient before taxes?
* Which countries have the highest Gini coefficient after taxes?
* Which countries have the highest shifts in Gini coefficient?
* Which countries have the lowest shifts in Gini coefficient?
* Which countries have had the highest changes in redistribution in the available data?

## Prepare
### Knowing the data
From the Tidy Tuesday page:\
`The Gini coefficient measures inequality on a scale from 0 to 1. Higher values indicate higher inequality ... Income has been equivalized – adjusted to account for the fact that people in the same household can share costs like rent and heating.`
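
To make the 0 to 1 scale concrete, here is a minimal sketch of one common way to compute a Gini coefficient from a list of incomes (a rank-weighted formula). The function name `gini` and the toy incomes are assumptions for illustration. The dataset already ships precomputed coefficients, and this is not necessarily how they were derived.

```python
def gini(incomes):
    """Gini coefficient of a list of non-negative incomes (0 means perfectly equal)."""
    x = sorted(float(v) for v in incomes)
    n = len(x)
    total = sum(x)
    # Rank-weighted form: G = 2 * sum(i * x_i) / (n * sum(x)) - (n + 1) / n, ranks i = 1..n
    weighted = sum(rank * value for rank, value in enumerate(x, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n

# Hypothetical incomes, not taken from the dataset
equal = gini([10, 10, 10, 10])   # everyone earns the same
unequal = gini([0, 0, 0, 40])    # one person earns everything
print(equal, unequal)
```

With only four people the most unequal case reaches (n - 1) / n rather than exactly 1. The upper bound approaches 1 as the population grows.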

In [48]:
# Import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [49]:
# Import data
income = pd.read_csv('income_inequality_processed.csv')
income

Unnamed: 0,Entity,Code,Year,gini_mi_eq,gini_dhi_eq
0,Australia,AUS,1989,0.431,0.304
1,Australia,AUS,1995,0.470,0.311
2,Australia,AUS,2001,0.481,0.320
3,Australia,AUS,2003,0.469,0.316
4,Australia,AUS,2004,0.467,0.316
...,...,...,...,...,...
942,Vietnam,VNM,2005,,0.369
943,Vietnam,VNM,2007,,0.401
944,Vietnam,VNM,2009,,0.398
945,Vietnam,VNM,2011,,0.364


### Cleaning the data

In [50]:
print('Data Types:')
print(income.dtypes)
print()
print('Count of NAs:')
print(income.isna().sum())

Data Types:
Entity          object
Code            object
Year             int64
gini_mi_eq     float64
gini_dhi_eq    float64
dtype: object

Count of NAs:
Entity           0
Code             0
Year             0
gini_mi_eq     398
gini_dhi_eq      0
dtype: int64


The data types look right, but `gini_mi_eq` has 398 missing values while `gini_dhi_eq` has none. Since the questions center on the shift between the pre-tax (`gini_mi_eq`) and post-tax (`gini_dhi_eq`) Gini coefficients, a row is only useful if it has both values. I will therefore remove the rows with NAs.
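
Before dropping rows, it can help to see where the missing values sit, because a country with no pre-tax values at all (the Vietnam rows shown earlier have none, for example) disappears from the analysis entirely. The sketch below uses a small made-up frame named `toy` rather than the real file; only the column names come from the dataset.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real table; the values are made up for illustration
toy = pd.DataFrame({
    'Entity': ['A', 'A', 'B', 'B'],
    'Year': [2000, 2001, 2000, 2001],
    'gini_mi_eq': [0.45, 0.46, np.nan, np.nan],
    'gini_dhi_eq': [0.30, 0.31, 0.37, 0.40],
})

# Count missing pre-tax values per country
missing_by_country = toy['gini_mi_eq'].isna().groupby(toy['Entity']).sum()

# Keep only rows that have both Gini values
complete = toy.dropna(subset=['gini_mi_eq', 'gini_dhi_eq'])

# Countries that vanish completely once incomplete rows are dropped
lost = sorted(set(toy['Entity']) - set(complete['Entity']))
print(missing_by_country)
print(lost)
```

Passing `subset=` makes the intent explicit: rows are dropped only when one of the two Gini columns is missing, not because of an NA in some unrelated column.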

In [51]:
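# Drop rows with any missing value. reset_index() renumbers the rows and
# keeps the old row labels in a new 'index' column.
income = income.dropna().reset_index()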
income

Unnamed: 0,index,Entity,Code,Year,gini_mi_eq,gini_dhi_eq
0,0,Australia,AUS,1989,0.431,0.304
1,1,Australia,AUS,1995,0.470,0.311
2,2,Australia,AUS,2001,0.481,0.320
3,3,Australia,AUS,2003,0.469,0.316
4,4,Australia,AUS,2004,0.467,0.316
...,...,...,...,...,...,...
544,912,United States,USA,2019,0.505,0.394
545,913,United States,USA,2020,0.521,0.376
546,914,United States,USA,2021,0.517,0.371
547,915,United States,USA,2022,0.512,0.393


I will also rename `gini_mi_eq` and `gini_dhi_eq` to `PreTax` and `PostTax`, respectively. This will help with readability and referencing.

In [52]:
income = income.rename(columns={'gini_mi_eq': 'PreTax', 'gini_dhi_eq': 'PostTax'})
income.columns

Index(['index', 'Entity', 'Code', 'Year', 'PreTax', 'PostTax'], dtype='object')

Lastly, I need an additional column holding the shift from `PreTax` to `PostTax` (computed as `PostTax - PreTax`), which the questions about shifts rely on.

In [57]:
# Shift in the Gini coefficient: post-tax minus pre-tax.
# Negative values mean inequality is lower after taxes.
income['Shifts'] = income['PostTax'] - income['PreTax']

In [59]:
income

Unnamed: 0,index,Entity,Code,Year,PreTax,PostTax,Shifts
0,0,Australia,AUS,1989,0.431,0.304,-0.127
1,1,Australia,AUS,1995,0.470,0.311,-0.159
2,2,Australia,AUS,2001,0.481,0.320,-0.161
3,3,Australia,AUS,2003,0.469,0.316,-0.153
4,4,Australia,AUS,2004,0.467,0.316,-0.151
...,...,...,...,...,...,...,...
544,912,United States,USA,2019,0.505,0.394,-0.111
545,913,United States,USA,2020,0.521,0.376,-0.145
546,914,United States,USA,2021,0.517,0.371,-0.146
547,915,United States,USA,2022,0.512,0.393,-0.119


The data is now ready to use.
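
As a rough sketch of how the `Shifts` column could feed into the questions above, the example below ranks countries by their average shift. It runs on a made-up frame named `toy`, not the real data, and averaging across years is an assumption; comparing only the latest year per country would be an equally reasonable choice.

```python
import pandas as pd

# Toy stand-in for the prepared table; the values are made up for illustration
toy = pd.DataFrame({
    'Entity': ['A', 'A', 'B', 'B'],
    'PreTax': [0.45, 0.47, 0.50, 0.52],
    'PostTax': [0.30, 0.31, 0.45, 0.46],
})
toy['Shifts'] = toy['PostTax'] - toy['PreTax']

# Average shift per country, sorted so the most negative value comes first
mean_shift = toy.groupby('Entity')['Shifts'].mean().sort_values()

# Most negative average shift = largest average drop in the Gini coefficient
largest_reduction = mean_shift.index[0]
print(mean_shift)
print(largest_reduction)
```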