# Assignment Part 1: Economy Data Visualization

In this assignment, you will work with city-level economic data for China from 2010 to 2019. The data includes population size, urbanization rate, and disposable income for various Chinese cities. You will start by reading the data into a Pandas dataframe, then use NumPy and Pandas to manipulate and summarize it. Finally, you will use Matplotlib and Seaborn to create plots and charts that reveal trends and patterns in the Chinese economy over the decade.

By completing this assignment, you will gain hands-on experience with these important tools for data analysis and visualization in Python, and develop your skills in working with real-world data.

In [None]:
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import seaborn as sns

census = pd.read_csv('census.csv')
census = census.fillna(0)

# Task 1: Urbanization Analysis

In [None]:
"""
Preprocess dataframe.
"""
dataframe = pd.DataFrame(census, columns=['year', 'dum', 'urbanization_rate', 'urban_disposable_income',
                                          'rural_disposable_income'])
dataframe

We have the urbanization_rate of several cities between 2010 and 2019.
Now, we are going to find the most `successful` city in terms of urbanization.
For example, Beijing's urbanization_rate is 0.860 in 2010 and 0.866 in 2019,
so Beijing improved by 0.006.
The task is to find the city with the largest improvement.
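
As a minimal sketch of this computation on toy numbers (only the Beijing rates come from the example above; `CityB` and its rates are hypothetical), the improvement is the 2019 rate minus the 2010 rate:

```python
import pandas as pd

# Toy data: the Beijing rates are from the example above; CityB is hypothetical.
toy = pd.DataFrame({
    'dum': ['Beijing', 'Beijing', 'CityB', 'CityB'],
    'year': [2010, 2019, 2010, 2019],
    'urbanization_rate': [0.860, 0.866, 0.500, 0.620],
})

# One row per city, one column per year.
wide = toy.pivot(index='dum', columns='year', values='urbanization_rate')

# Improvement = rate in 2019 minus rate in 2010.
improvement = wide[2019] - wide[2010]
best_city = improvement.idxmax()
print(improvement.round(3))
print(best_city)
```

Here the hypothetical `CityB` wins because its rate rises by about 0.12, compared with about 0.006 for Beijing.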

We first visualize the average urbanization_rate from 2010 to 2019.
Encouragingly, the urbanization_rate has increased gradually over these years.
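
The plot below passes `estimator='mean'` to `sns.lineplot`, so the curve is the per-year average across cities. A small sketch with hypothetical values (the city names and rates are made up, not read from `census.csv`) shows the equivalent `groupby` computation:

```python
import pandas as pd

# Hypothetical rates for two cities in two years (not taken from census.csv).
toy = pd.DataFrame({
    'dum': ['CityA', 'CityB', 'CityA', 'CityB'],
    'year': [2010, 2010, 2019, 2019],
    'urbanization_rate': [0.50, 0.70, 0.60, 0.80],
})

# Average urbanization rate across cities for each year.
yearly_mean = toy.groupby('year')['urbanization_rate'].mean()
print(yearly_mean)
```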

In [None]:
plt.figure(figsize=(10, 5))
sns.lineplot(data=dataframe, x='year', y='urbanization_rate', estimator='mean')
plt.show()

Now, we need to collect the improvements for all the cities.
Specifically, `collected_urban_rates` is a NumPy array with shape [num_cities, 2].
Each row contains the urbanization rates of one city in 2010 and 2019, respectively.

In [None]:
# Unique city names in order of first appearance, so the result is reproducible across runs.
cities = dataframe['dum'].unique().tolist()

# One row per city: [rate in 2010, rate in 2019].
collected_urban_rates = np.empty((len(cities), 2))

for idx, city in enumerate(cities):
    city_data = dataframe[dataframe['dum'] == city]
    r1 = city_data[city_data['year'] == 2010]['urbanization_rate'].values[0]
    r2 = city_data[city_data['year'] == 2019]['urbanization_rate'].values[0]
    collected_urban_rates[idx] = r1, r2

# Index of the city with the largest increase from 2010 to 2019.
best = np.argmax(collected_urban_rates[:, 1] - collected_urban_rates[:, 0])
print(
    f"The most successful city is {cities[best]}; its urbanization rate improves "
    f"from {collected_urban_rates[best, 0]} to {collected_urban_rates[best, 1]}.")

# Task 2: Economy Data Visualization

In this task, we are going to visualize some of the economic data and trends in the dataset.

In [None]:
df = dataframe.loc[dataframe['dum'].isin(('Beijing', 'Guangzhou City', 'Shanghai', 'Shenzhen'))]
plt.figure(figsize=(10, 5))
sns.lineplot(data=df, x='year', y='urban_disposable_income', hue='dum')
plt.show()

The Theil index is a statistic primarily used to measure economic inequality. For more information, see https://en.wikipedia.org/wiki/Theil_index
Here, let's find the cities whose Theil index is higher in 2019 than in 2010, and then draw a bar plot of their change from 2010 to 2019.
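
For intuition, the Theil T index of incomes $x_1, \dots, x_N$ with mean $\mu$ is $\frac{1}{N}\sum_i \frac{x_i}{\mu}\ln\frac{x_i}{\mu}$. The sketch below applies this textbook definition to hypothetical incomes. The `theil_index` column in the dataset is precomputed, and how it was derived is not stated here, so the helper `theil_t` is an illustrative assumption only.

```python
import numpy as np

def theil_t(incomes):
    """Theil T index of a 1-D array of strictly positive incomes."""
    x = np.asarray(incomes, dtype=float)
    ratio = x / x.mean()
    return float(np.mean(ratio * np.log(ratio)))

# Hypothetical incomes, for illustration only.
equal = theil_t([10, 10, 10, 10])   # everyone has the same income
unequal = theil_t([1, 1, 1, 37])    # one person holds most of the income
print(equal, unequal)
```

The index is zero under perfect equality and grows as income becomes more concentrated, which is why a positive `change` in the cell below indicates rising inequality.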

In [None]:
df_theil = pd.DataFrame(census, columns=['year', 'dum', 'theil_index'])
df_theil = df_theil.groupby(['dum', 'year']).theil_index.min().reset_index()
df_theil = df_theil.pivot(index='dum', columns='year', values='theil_index')
df_theil[2019] = pd.to_numeric(df_theil[2019], errors='coerce')
df_theil[2010] = pd.to_numeric(df_theil[2010], errors='coerce')
df_theil['change'] = df_theil[2019] - df_theil[2010]

target = df_theil[df_theil['change'] > 0]

plt.figure(figsize=(6, 3))
sns.barplot(data=target, x=target.index, y='change')
plt.show()

# Task 3: Disposable income

In this task, you need to analyze the average disposable income in the dataset.

First, you need to calculate the disposable income for each city. Assume it can be computed as the average of urban and rural disposable income, weighted by the urbanization rate:
$$ \text{disposable\_income} =\text{urban\_disposable\_income}\times \text{urbanization\_rate}+\text{rural\_disposable\_income}\times(1-\text{urbanization\_rate})$$
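
As a quick numeric check of the formula, the sketch below uses hypothetical values that are not taken from the dataset: an urbanization rate of 0.6, an urban disposable income of 50000, and a rural disposable income of 20000.

```python
# Hypothetical inputs for a single city in a single year.
urbanization_rate = 0.6
urban_disposable_income = 50000.0
rural_disposable_income = 20000.0

# Average of the urban and rural incomes, weighted by the urbanization rate.
disposable_income = (urban_disposable_income * urbanization_rate
                     + rural_disposable_income * (1 - urbanization_rate))
print(disposable_income)
```

The result is about 38000, which lies between the rural and urban values and closer to the urban one because the urbanization rate exceeds 0.5.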

In [None]:
df_disp = pd.DataFrame(census, columns=['year', 'dum', 'urbanization_rate', 'urban_disposable_income',
                                        'rural_disposable_income', 'urban_total_income', 'rural_total_income'])
df_disp['disposable_income'] = df_disp['urban_disposable_income'] * df_disp['urbanization_rate'] + \
                               df_disp['rural_disposable_income'] * (1 - df_disp['urbanization_rate'])
df_disp['total_income'] = df_disp['urban_total_income'] * df_disp['urbanization_rate'] + \
                          df_disp['rural_total_income'] * (1 - df_disp['urbanization_rate'])
df_disp

Then, calculate the average disposable income and total income for each city from 2011 to 2019.

In [None]:
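# Keep 2011 to 2019 only.
df_disp = df_disp[df_disp['year'] != 2010]
# Select multiple columns with a list (double brackets); tuple-style selection fails in pandas 2.x.
res = df_disp.groupby('dum')[['disposable_income', 'total_income']].mean()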
res

Finally, use regression analysis to examine the relationship between disposable_income and total_income.
To do so, use seaborn.regplot to regress disposable_income on total_income.
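
By default, `seaborn.regplot` draws a scatter plot together with a fitted linear regression line, but it does not return the fitted coefficients. To read off the slope and intercept numerically, one option is an ordinary least-squares fit with `np.polyfit`. The sketch below uses hypothetical data rather than the `res` table.

```python
import numpy as np

# Hypothetical per-city averages (not from census.csv).
total_income = np.array([20000.0, 30000.0, 40000.0, 50000.0])
disposable_income = 0.8 * total_income + 1000.0  # exactly linear by construction

# Degree-1 least-squares fit: disposable_income ~ slope * total_income + intercept.
slope, intercept = np.polyfit(total_income, disposable_income, 1)
print(slope, intercept)
```

With the real data, `res['total_income']` and `res['disposable_income']` would take the place of the two hypothetical arrays.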

In [None]:
plt.figure(figsize=(16, 3))
sns.regplot(data=res, x='total_income', y='disposable_income')
plt.show()