# Final project: Identification of vulnerable population groups

## 1. Problem statement

According to a December 2021 [inFOM survey](https://www.cbr.ru/Collection/Collection/File/39633/inFOM_21-12.pdf), 27% of Russians have enough money only for food, and another 9% cannot afford a nutritious diet. These people are especially attentive to prices, and the rate of growth of food prices usually exceeds the average rate of inflation. At the same time, Rosstat believes that food expenses should make up approximately 36% of a Russian's average monthly expenses (another 10% goes to utilities and housing, 4% goes to medicines). 

Until 2021, the "poverty line" (living below the subsistence minimum) in Russia was determined by the cost of the [minimum food basket](https://base.garant.ru/70306880/). In 2021, the government decoupled the poverty level from the prices of basic products: since then, the subsistence minimum has been calculated as 44.2% of the median income of Russian citizens for the previous year.
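
As a small worked example of the 44.2% rule, the sketch below computes the subsistence minimum from a median income. The function name and the median income value are hypothetical and serve only to illustrate the arithmetic.

```python
# Share of the previous year's median income, as stated in the text above.
SUBSISTENCE_SHARE = 0.442


def subsistence_minimum(median_income_prev_year: float) -> float:
    """Return the subsistence minimum implied by the 44.2% rule."""
    return SUBSISTENCE_SHARE * median_income_prev_year


# Hypothetical median income in roubles per month, for illustration only.
example_median = 27_000.0
print(round(subsistence_minimum(example_median), 2))
```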

You have at your disposal data on income, morbidity, socially vulnerable groups of the Russian population and other economic and demographic data.

Your task as a data scientist:
* cluster the regions of Russia and determine which of them are in the greatest need of assistance to low-income/disadvantaged segments of the population;
* describe the population groups facing poverty;
* determine:
    * whether the number of children, pensioners and other socially vulnerable groups affects the poverty level in the region;
    * whether the level of poverty/social disadvantage is related to production and consumption in the region;
    * what other dependencies can be observed in relation to socially vulnerable segments of the population.

## 2. Getting to know the data

To reduce data redundancy, data for districts uniting multiple regions has been removed from the initial data sources. The following former regions have also been removed because they contain a significant number of empty values:
* Агинский Бурятский округ (Забайкальский край)
* Коми-Пермяцкий округ, входящий в состав Пермского края
* Корякский округ, входящий в состав Камчатского края
* Крымский федеральный округ
* Таймырский (Долгано-Ненецкий) автономный округ (Красноярский край)
* Усть-Ордынский Бурятский округ
* Эвенкийский автономный округ (Красноярский край)
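
The removal itself is not shown in this notebook. A minimal sketch of how aggregate rows could be filtered out, assuming they can be recognised by a substring of the region name (the substring rule and the toy values are assumptions, not the procedure actually used):

```python
import pandas as pd

# Toy table; the values are made up, the region names appear in this notebook.
toy = pd.DataFrame({
    'region': ['Крымский федеральный округ', 'Белгородская область', 'Алтайский край'],
    'value': [100, 40, 60],
})

# Assumption: aggregate rows can be recognised by the words "федеральный округ".
is_district = toy['region'].str.contains('федеральный округ', regex=False)
regions_only = toy[~is_district].reset_index(drop=True)
print(regions_only)
```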

### 2.0 Import dependencies and define helper code

#### 2.0.1 Import dependencies

In [19]:
import pandas as pd

#### 2.0.2 Define helper code

In [20]:
def rename_columns(df, prefix):
    dict_columns = {}
    for i in range(1, df.shape[1]):
        dict_columns[df.columns[i]] = prefix + str(df.columns[i])

    return df.rename(columns=dict_columns)
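
To make the helper's behaviour concrete, the sketch below applies an equivalent version to a toy table (the toy data is made up). Note that the helper leaves the first column untouched, so it relies on `region` being the first column of every source table.

```python
import pandas as pd


def rename_columns(df, prefix):
    # Equivalent to the helper above: every column except the first gets the prefix.
    return df.rename(columns={col: prefix + str(col) for col in df.columns[1:]})


toy = pd.DataFrame({'region': ['A', 'B'], 1999: [10, 20], 2000: [11, 21]})
renamed = rename_columns(toy, 'population_')
print(list(renamed.columns))
```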

### 2.1 Population data

The data source contains the population dynamics of the Russian regions for the years 1999 to 2022. Columns:
1. `region` - name of the region of the Russian Federation.
2. `population_1999-population_2022` - columns containing the population count for each of the regions in the corresponding year.

In [21]:
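data_population = pd.read_excel('data/population.xlsx', sheet_name='report')
# Each region spans two rows: a row with the name, then a row with the yearly values.
# Copy the values up into the name row and drop the now redundant value row.
for i in range(data_population.shape[0]//2):
    data_population.iloc[i, 1:] = data_population.iloc[i+1, 1:]
    data_population.drop(index=i+1, inplace=True)
    data_population = data_population.reset_index(drop=True)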

data_population = rename_columns(data_population, 'population_')
data_population = data_population.set_index('region')
data_population = data_population.sort_index()
data_population.to_csv('population.csv')
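
The loop above appears to assume that the sheet stores each region as two consecutive rows: one holding the name and one holding the values. Under that assumption, the same collapse can be sketched without a loop by slicing every second row. The toy data below is made up.

```python
import pandas as pd

# Toy sheet in the assumed layout: a name row followed by a value row.
raw = pd.DataFrame({
    'region': ['Region A', None, 'Region B', None],
    1999: [None, 100, None, 200],
    2000: [None, 110, None, 210],
})

names = raw['region'].iloc[::2].reset_index(drop=True)     # rows 0, 2, ...
values = raw.iloc[1::2, 1:].reset_index(drop=True)         # rows 1, 3, ...
collapsed = pd.concat([names, values], axis=1)
print(collapsed)
```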

### 2.2 Child mortality in rural areas

The data source contains the number of children who died in their first year of life in rural areas, per Russian region, for the years 1990 to 2021. Columns:
1. `region` - name of the region of the Russian Federation. Data for districts uniting multiple regions has been removed.
2. `child_mortality_rural_1990-child_mortality_rural_2021` - columns containing the child mortality in rural areas for each of the regions in the corresponding year.

In [22]:
data_child_mortality_rural = pd.read_excel('data/child_mortality_rural_1990_2021.xlsx')
data_child_mortality_rural = rename_columns(data_child_mortality_rural, 'child_mortality_rural_')
data_child_mortality_rural['region'] = data_child_mortality_rural['region'].str.strip()
data_child_mortality_rural = data_child_mortality_rural.set_index('region', drop=True)
data_child_mortality_rural = data_child_mortality_rural.sort_index()
data_child_mortality_rural.head()

region,child_mortality_rural_1990,child_mortality_rural_1991,child_mortality_rural_1992,child_mortality_rural_1993,child_mortality_rural_1994,child_mortality_rural_1995,child_mortality_rural_1996,child_mortality_rural_1997,child_mortality_rural_1998,child_mortality_rural_1999,...,child_mortality_rural_2012,child_mortality_rural_2013,child_mortality_rural_2014,child_mortality_rural_2015,child_mortality_rural_2016,child_mortality_rural_2017,child_mortality_rural_2018,child_mortality_rural_2019,child_mortality_rural_2020,child_mortality_rural_2021
Алтайский край,217.0,245.0,230.0,249.0,226.0,261.0,220.0,213.0,176.0,196.0,...,173.0,154,117,94,87,79,70,60,44,42.0
Амурская область,142.0,124.0,106.0,108.0,111.0,99.0,100.0,118.0,86.0,91.0,...,63.0,50,39,31,23,20,17,16,13,15.0
Архангельская область,116.0,118.0,90.0,94.0,80.0,65.0,71.0,46.0,56.0,58.0,...,32.0,31,31,26,18,22,15,16,7,12.0
Архангельская область (кроме Ненецкого автономного округа),,,,,,,,,,,...,,29,29,25,17,22,14,16,5,10.0
Астраханская область,103.0,73.0,55.0,63.0,62.0,71.0,57.0,79.0,92.0,56.0,...,29.0,35,42,30,21,22,14,18,13,11.0


In [23]:
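# Strip the prefixes so both tables align on the year; dividing the prefixed
# frames directly matches no column labels and yields only NaN.
mortality = data_child_mortality_rural.rename(columns=lambda c: str(c).split('_')[-1])
population = data_population.rename(columns=lambda c: str(c).split('_')[-1])
df = mortality / population
df.to_csv('text.csv')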
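
Arithmetic between two DataFrames aligns on both index and column labels, which matters for a division like the one in this cell. The sketch below illustrates this on a one-row toy example. The death count is the 2020 value for Алтайский край shown above, while the population figure and the `to_year` helper are hypothetical.

```python
import pandas as pd

deaths = pd.DataFrame({'child_mortality_rural_2020': [44.0]}, index=['Алтайский край'])
# Hypothetical population figure, for illustration only.
people = pd.DataFrame({'population_2020': [2_000_000.0]}, index=['Алтайский край'])

# The column labels differ, so nothing aligns and every cell is NaN.
misaligned = deaths / people


def to_year(name):
    # Keep only the part after the last underscore, i.e. the year.
    return str(name).rsplit('_', 1)[-1]


# Renaming both frames to the bare year makes the columns line up.
per_capita = deaths.rename(columns=to_year) / people.rename(columns=to_year)
print(misaligned.isna().all().all(), per_capita.loc['Алтайский край', '2020'])
```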

### 2.3 Child mortality in urban areas

The data source contains the number of children who died in their first year of life in urban areas, per Russian region, for the years 1990 to 2021. Columns:
1. `region` - name of the region of the Russian Federation. Data for districts uniting multiple regions has been removed.
2. `child_mortality_urban_1990-child_mortality_urban_2021` - columns containing the child mortality in urban areas for each of the regions in the corresponding year.

In [24]:
data_child_mortality_urban = pd.read_excel('data/child_mortality_urban_1990_2021.xlsx')
data_child_mortality_urban = rename_columns(data_child_mortality_urban, 'child_mortality_urban_')
data_child_mortality_urban['region'] = data_child_mortality_urban['region'].str.strip()
data_child_mortality_urban = data_child_mortality_urban.set_index('region', drop=True)
data_child_mortality_urban = data_child_mortality_urban.sort_index()
data_child_mortality_urban.head()

region,child_mortality_urban_1990,child_mortality_urban_1991,child_mortality_urban_1992,child_mortality_urban_1993,child_mortality_urban_1994,child_mortality_urban_1995,child_mortality_urban_1996,child_mortality_urban_1997,child_mortality_urban_1998,child_mortality_urban_1999,...,child_mortality_urban_2012,child_mortality_urban_2013,child_mortality_urban_2014,child_mortality_urban_2015,child_mortality_urban_2016,child_mortality_urban_2017,child_mortality_urban_2018,child_mortality_urban_2019,child_mortality_urban_2020,child_mortality_urban_2021
Алтайский край,305.0,322.0,299.0,311.0,271.0,229.0,185.0,162.0,164.0,195.0,...,157.0,136,158,128,124,106,106,54,53,46.0
Амурская область,201.0,169.0,174.0,136.0,131.0,151.0,145.0,150.0,160.0,179.0,...,98.0,69,62,52,30,28,29,27,29,21.0
Архангельская область,198.0,170.0,174.0,152.0,156.0,152.0,136.0,118.0,122.0,120.0,...,78.0,87,70,62,63,54,39,43,26,23.0
Архангельская область (кроме Ненецкого автономного округа),,,,,,,,,,,...,,84,68,59,62,50,39,42,24,22.0
Астраханская область,147.0,162.0,133.0,153.0,140.0,123.0,135.0,109.0,129.0,98.0,...,113.0,95,106,89,61,52,57,54,51,35.0


### 2.4 Disability statistics

The data source is loaded and then transformed to contain the disability data per Russian region for the years 2017 to 2022. Columns:
1. `region` - name of the region of the Russian Federation. Data for districts uniting multiple regions has been removed.
2. `disability_{year}_total, disability_{year}_18_30, disability_{year}_31_40, disability_{year}_41_50, disability_{year}_51_60, disability_{year}_60_` - columns containing the disability counts across the whole population and across different age groups for each of the regions as of January 1 of the corresponding year.

In [25]:
data_disability = pd.read_csv('data/disabled_total_by_age_2017_2022.csv', sep=';')
data_disability = data_disability[data_disability['date'].str.endswith('01-01')]


def flatten_table(df, years):
    result = []
    for year in years:
        df_year = df[df['date'].str.startswith(year)]
        df_year = df_year.drop(columns='date')
        df_year = rename_columns(df_year, f'disability_{year}_')
        df_year = df_year.set_index('region')
        result.append(df_year)
    return pd.concat(result, axis=1)


data_disability = flatten_table(data_disability, ['2017', '2018', '2019', '2020', '2021', '2022'])
data_disability.head()

region,disability_2017_total,disability_2017_18_30,disability_2017_31_40,disability_2017_41_50,disability_2017_51_60,disability_2017_60_,disability_2018_total,disability_2018_18_30,disability_2018_31_40,disability_2018_41_50,...,disability_2021_31_40,disability_2021_41_50,disability_2021_51_60,disability_2021_60_,disability_2022_total,disability_2022_18_30,disability_2022_31_40,disability_2022_41_50,disability_2022_51_60,disability_2022_60_
Алтайский край,182975.0,8930.0,13527.0,13836.0,30477.0,116205.0,178491.0,8238.0,13542.0,14296.0,...,13358.0,16039.0,23274.0,107520.0,162551.0,7308.0,13058.0,16684.0,21956.0,103545.0
Амурская область,69063.0,3687.0,5260.0,6266.0,12679.0,41171.0,67042.0,3428.0,5297.0,6263.0,...,5143.0,6338.0,9427.0,38004.0,59472.0,3295.0,5041.0,6331.0,8689.0,36116.0
Архангельская область,88792.0,3488.0,5159.0,6747.0,13834.0,59564.0,87258.0,3276.0,5140.0,6671.0,...,5154.0,7067.0,11368.0,56370.0,79706.0,2977.0,5063.0,7242.0,10727.0,53697.0
Астраханская область,44995.0,3159.0,3974.0,4749.0,9136.0,23977.0,44111.0,2969.0,3940.0,4730.0,...,4018.0,4980.0,7708.0,23379.0,42096.0,2686.0,3997.0,5143.0,7487.0,22783.0
Белгородская область,223030.0,6318.0,10383.0,16596.0,37444.0,152289.0,214048.0,5808.0,10127.0,15706.0,...,9474.0,14917.0,28836.0,133344.0,182223.0,4868.0,9121.0,14645.0,26875.0,126714.0
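
The `flatten_table` helper turns a long table (one row per region and date) into a wide one (one row per region). As a possible alternative, the same reshaping can be sketched with `DataFrame.pivot`. The toy values below are made up, and the source column names `total` and `18_30` are inferred from the output above rather than read from the file.

```python
import pandas as pd

long = pd.DataFrame({
    'region': ['A', 'A', 'B', 'B'],
    'date': ['2017-01-01', '2018-01-01', '2017-01-01', '2018-01-01'],
    'total': [10, 11, 20, 21],
    '18_30': [1, 2, 3, 4],
})
long['year'] = long['date'].str[:4]

wide = long.pivot(index='region', columns='year', values=['total', '18_30'])
# Flatten the (measure, year) column MultiIndex into disability_{year}_{measure}.
wide.columns = [f'disability_{year}_{measure}' for measure, year in wide.columns]
print(wide)
```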


### 2.5 Disease Incidence Statistics

The data source is loaded and then transformed to contain the disease incidence data per Russian region for the years 2005 to 2016. The data has been reduced to the total number of diseases per region and age group; statistics for specific diseases have been removed. The column names keep the `desease` spelling used in the code. Columns:
1. `region` - name of the region of the Russian Federation. Data for districts uniting multiple regions has been removed.
2. `desease_incidence_0-14 years_{year}, desease_incidence_15-17 years_{year}, desease_incidence_18 years and older_{year}, desease_incidence_Total_{year}` - columns containing the disease incidence across the whole population and across different age groups for each of the regions in the corresponding year.

In [None]:
data_desease_incidence = pd.read_excel(
    'data/morbidity_2005_2016_age_disease.xlsx',
    index_col=[0, 1]
)
data_desease_incidence_list = []
age_groups = ['0-14 years', '15-17 years', '18 years and older', 'Total']
for age_group in age_groups:
    df = data_desease_incidence[data_desease_incidence.index.get_level_values('age_group') == age_group]
    df = df.reset_index()
    df = df.drop('age_group', axis=1)
    df = rename_columns(df, f'desease_incidence_{age_group}_')
    df['region'] = df['region'].str.strip()
    df = df.set_index('region')
    data_desease_incidence_list.append(df)
data_desease_incidence = pd.concat(data_desease_incidence_list, axis=1)
data_desease_incidence.head()
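
The loop above filters the `(region, age_group)` MultiIndex one age group at a time. A shorter way to express one such selection is `DataFrame.xs`, sketched here on made-up toy data; this is an alternative formulation, not the code used in the notebook.

```python
import pandas as pd

index = pd.MultiIndex.from_tuples(
    [('Region A', 'Total'), ('Region A', '0-14 years'),
     ('Region B', 'Total'), ('Region B', '0-14 years')],
    names=['region', 'age_group'],
)
toy = pd.DataFrame({2015: [500, 120, 300, 80], 2016: [510, 115, 310, 85]}, index=index)

# Select one age group and drop that level, leaving a region-indexed frame.
total = toy.xs('Total', level='age_group').add_prefix('desease_incidence_Total_')
print(total)
```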

### 2.6 Welfare Expense Share

The data source is loaded and then transformed to contain the welfare expense data per Russian region for the years 2015 to 2020. Columns:
1. `region` - name of the region of the Russian Federation. Data for districts uniting multiple regions has been removed.
2. `welfare_share_2015-welfare_share_2020` - columns containing the welfare expense percentage across the whole population for each of the regions in the corresponding year.

In [None]:
data_welfare_share = pd.read_excel('data/welfare_expense_share_2015_2020.xlsx')
data_welfare_share = rename_columns(data_welfare_share, 'welfare_share_')
data_welfare_share = data_welfare_share.set_index('region', drop=True)
data_welfare_share = data_welfare_share.sort_index()
data_welfare_share.head()

### 2.7 Poverty Statistics

The data source is loaded and then transformed to contain the poverty data per Russian region for the years 1992 to 2020. Columns:
1. `region` - name of the region of the Russian Federation. Data for districts uniting multiple regions has been removed.
2. `poverty_1992_poverty_percent-poverty_2020_poverty_percent` - columns containing the poverty percentage across the whole population for each of the regions in the corresponding year.

In [26]:
data_poverty = pd.read_csv(
    'data/poverty_percent_by_regions_1992_2020.csv', 
    sep=';', 
    skipinitialspace=True
)


def flatten_table(df, years):
    result = []
    for year in years:
        df_year = df[df['year'] == year]
        df_year = df_year.drop(columns='year')
        df_year = rename_columns(df_year, f'poverty_{year}_')
        df_year = df_year.set_index('region')
        result.append(df_year)
    return pd.concat(result, axis=1)


data_poverty = flatten_table(data_poverty, list(range(1992, 2021)))
data_poverty = data_poverty.sort_index()
data_poverty.head()

region,poverty_1992_poverty_percent,poverty_1993_poverty_percent,poverty_1994_poverty_percent,poverty_1995_poverty_percent,poverty_1996_poverty_percent,poverty_1997_poverty_percent,poverty_1998_poverty_percent,poverty_1999_poverty_percent,poverty_2000_poverty_percent,poverty_2001_poverty_percent,...,poverty_2011_poverty_percent,poverty_2012_poverty_percent,poverty_2013_poverty_percent,poverty_2014_poverty_percent,poverty_2015_poverty_percent,poverty_2016_poverty_percent,poverty_2017_poverty_percent,poverty_2018_poverty_percent,poverty_2019_poverty_percent,poverty_2020_poverty_percent
Алтайский край,,,,33.7,46.8,45.7,52.9,53.8,53.9,47.3,...,22.6,20.6,17.6,17.1,18.0,17.8,17.5,17.4,17.6,17.5
Амурская область,,,,36.1,28.2,26.3,31.2,38.0,47.7,45.3,...,20.4,16.0,16.2,14.8,15.2,17.0,16.7,15.6,15.7,15.2
Архангельская область,,,,26.2,26.8,23.9,31.6,42.6,33.5,27.4,...,14.4,13.2,14.6,14.8,16.5,15.2,14.3,13.5,13.6,12.8
Архангельская область (кроме Ненецкого автономного округа),,,,,,,,,,,...,,13.0,14.5,14.5,16.2,14.9,13.9,12.5,12.7,12.3
Астраханская область,,,,32.3,30.7,25.0,30.4,36.0,33.4,31.1,...,14.2,12.5,11.8,12.0,14.2,16.1,15.5,15.1,15.5,15.6


### 2.8 Poverty Statistics in Demographic Context

The data source is loaded and then transformed to contain the poverty data within demographic groups per Russian region for the years 2017 to 2020. Columns:
1. `region` - name of the region of the Russian Federation. Data for districts uniting multiple regions has been removed.
2. `poverty_dem_children_{year}, poverty_dem_seniors_{year}, poverty_dem_adults_{year}` - columns containing the percentage distribution of the poor population across the demographic groups for each of the regions in the corresponding year (the three values of a year sum to approximately 100).

In [58]:
data_poverty_dem_list = []
for year in range(2017, 2021):
    data_poverty_dem = pd.read_excel(f'data/poverty_socdem_{year}.xlsx')
    data_poverty_dem = rename_columns(data_poverty_dem, 'poverty_dem_')
    data_poverty_dem['region'] = data_poverty_dem['region'].str.strip()
    data_poverty_dem = data_poverty_dem.set_index('region', drop=True)
    data_poverty_dem_list.append(data_poverty_dem)
data_poverty_dem = pd.concat(data_poverty_dem_list, axis=1)

data_poverty_dem = data_poverty_dem.sort_index()
data_poverty_dem.head()

region,poverty_dem_children_2017,poverty_dem_seniors_2017,poverty_dem_adults_2017,poverty_dem_children_2018,poverty_dem_seniors_2018,poverty_dem_adults_2018,poverty_dem_children_2019,poverty_dem_seniors_2019,poverty_dem_adults_2019,poverty_dem_children_2020,poverty_dem_seniors_2020,poverty_dem_adults_2020
Алтайский край,38.0,6.4,55.6,42.1,5.1,52.8,38.7,3.9,57.4,31.4,9.2,59.4
Амурская область,39.9,6.0,54.1,40.6,4.6,54.8,33.9,7.2,58.9,38.4,4.7,56.9
Архангельская область,38.1,6.8,55.1,39.5,6.5,54.0,33.0,8.1,59.0,30.2,6.7,63.1
Архангельская область (кроме Ненецкого автономного округа),37.4,7.1,55.5,38.8,6.6,54.6,31.3,8.9,59.8,28.5,7.2,64.3
Астраханская область,40.2,4.2,55.5,35.0,5.2,59.8,43.4,4.7,51.9,42.1,5.5,52.4
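
A quick sanity check on how to read these columns: if the three values of a year describe the composition of the poor population, they should add up to roughly 100 per region. The sketch below runs that check on the 2017 values of the first two rows above. The tolerance of 0.2 is an assumption to absorb rounding.

```python
import pandas as pd

# 2017 values copied from the first two rows of the output above.
sample = pd.DataFrame(
    {
        'poverty_dem_children_2017': [38.0, 39.9],
        'poverty_dem_seniors_2017': [6.4, 6.0],
        'poverty_dem_adults_2017': [55.6, 54.1],
    },
    index=['Алтайский край', 'Амурская область'],
)

row_sums = sample.sum(axis=1)
# Allow a small tolerance because the published figures are rounded to one decimal.
looks_like_shares = (row_sums - 100).abs() <= 0.2
print(row_sums.round(1).to_dict(), bool(looks_like_shares.all()))
```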


### 2.9 Housing Statistics 2020

#### 2.9.1 Living conditions - size

The data source describes the housing conditions in terms of living space across the Russian regions in 2020. Columns:
1. `housing_size_в том числе домохозяйства, указавшие, что при проживании не испытывают стесненности` - percentage of households that reported having sufficient living space.
2. `housing_size_в том числе домохозяйства, указавшие, что при проживании испытывают определенную стесненность` - percentage of households that reported some shortage of living space.
3. `housing_size_в том числе домохозяйства, указавшие, что при проживании испытывают большую стесненность` - percentage of households that reported a considerable shortage of living space.
4. `housing_size_затруднились ответить` - percentage of households that found it difficult to answer.
5. `housing_size_Размер общей площади в расчете на члена домохозяйства` - total floor area per household member.
6. `housing_size_Размер жилой площади в расчете на члена домохозяйства` - living area per household member.
7. `housing_size_Число жилых комнат в расчете на одно домохозяйство` - number of rooms per household.

In [87]:
data_housing_size = pd.read_excel(
    'data/housing_2020.xlsx', 
    sheet_name='housing_cond'
)
data_housing_size = rename_columns(data_housing_size, 'housing_size_')
data_housing_size = data_housing_size.set_index('region', drop=True)
data_housing_size = data_housing_size.sort_index()
data_housing_size.head()

region,"housing_size_в том числе домохозяйства, указавшие, что при проживании не испытывают стесненности","housing_size_в том числе домохозяйства, указавшие, что при проживании испытывают определенную стесненность","housing_size_в том числе домохозяйства, указавшие, что при проживании испытывают большую стесненность",housing_size_затруднились ответить,housing_size_Размер общей площади в расчете на члена домохозяйства,housing_size_Размер жилой площади в расчете на члена домохозяйства,housing_size_Число жилых комнат в расчете на одно домохозяйство
Алтайский край,83.2,13.4,3.3,0.1,25.9,18.0,2.4
Амурская область,77.3,16.8,5.9,0.0,23.2,17.3,2.3
Архангельская область (кроме Ненецкого автономного округа),83.3,14.5,2.2,0.0,24.6,16.9,2.4
Астраханская область,77.5,19.4,3.1,0.0,24.9,19.2,2.3
Белгородская область,86.3,11.7,2.0,0.0,25.1,17.8,2.6
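
The survey column labels are long, which makes later code hard to read. One option, sketched below, is to map them to short aliases with `DataFrame.rename`. The alias names are hypothetical; the keys are two of the column labels shown above, and the values come from the first two rows of the output.

```python
import pandas as pd

# Hypothetical short names; the keys are two of the column labels shown above.
short_names = {
    'housing_size_затруднились ответить': 'housing_size_no_answer',
    'housing_size_Число жилых комнат в расчете на одно домохозяйство': 'housing_size_rooms_per_household',
}

toy = pd.DataFrame(
    {
        'housing_size_затруднились ответить': [0.1, 0.0],
        'housing_size_Число жилых комнат в расчете на одно домохозяйство': [2.4, 2.3],
    },
    index=pd.Index(['Алтайский край', 'Амурская область'], name='region'),
)
renamed = toy.rename(columns=short_names)
print(list(renamed.columns))
```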


#### 2.9.2 Living conditions - state

The data source describes the housing conditions in terms of the state of the dwelling across the Russian regions in 2020. Columns:
1. `housing_state_из них домохозяйства, собирающиеся улучшить свои жилищные условия` - percentage of households that plan to improve their housing conditions.
2. `housing_state_из них указавшие: на стесненность проживания` - percentage of households that cited insufficient living space.
3. `housing_state_из них указавшие: на плохое или очень плохое состояние жилого помещения` - percentage of households that cited a poor or very poor state of the dwelling.
4. `housing_state_из них указавшие: на плохое состояние или очень плохое состояние жилого помещения и на стесненность проживания` - percentage of households that cited both a poor or very poor state of the dwelling and insufficient living space.
5. `housing_state_из числа домохозяйств, собирающихся улучшить свои жилищные условия: планируют вселиться в жилое помещение, строительство которого ведут (участвуют в долевом строительстве)` - percentage of households that plan to improve their living conditions by moving into a dwelling that they are currently building (including shared-equity construction).
6. `housing_state_из числа домохозяйств, собирающихся улучшить свои жилищные условия: собираются подать документы для постановки на очередь (и/или ожидают прохождения очереди)` - percentage of households that plan to apply for a place on the housing waiting list (and/or are already waiting in the queue).
7. `housing_state_из числа домохозяйств, собирающихся улучшить свои жилищные условия: рассчитывают на получение нового жилья в связи со сносом дома` - percentage of households that expect to receive new housing because their current house is to be demolished.
8. `housing_state_из числа домохозяйств, собирающихся улучшить свои жилищные условия: собираются купить (построить) другое жилье` - percentage of households that plan to improve their living conditions by buying (or building) other housing.
9. `housing_state_из числа домохозяйств, собирающихся улучшить свои жилищные условия: собираются снимать жилье` - percentage of households that plan to improve their living conditions by renting housing.
10. `housing_state_из числа домохозяйств, собирающихся улучшить свои жилищные условия: собираются улучшить свои жилищные условия другим способом` - percentage of households that plan to improve their living conditions by some other means.
11. `housing_state_затруднились ответить` - percentage of households that found it difficult to answer.
12. `housing_state_домохозяйства, не собирающиеся улучшать свои жилищные условия` - percentage of households that do not plan to improve their living conditions.

In [None]:
data_housing_state = pd.read_excel(
    'data/housing_2020.xlsx', 
    sheet_name='housing_intent'
)
data_housing_state = rename_columns(data_housing_state, 'housing_state_')
data_housing_state = data_housing_state.set_index('region', drop=True)
data_housing_state = data_housing_state.sort_index()
data_housing_state.head()

region,"housing_state_из них домохозяйства, собирающиеся улучшить свои жилищные условия",housing_state_из них указавшие: на стесненность проживания,housing_state_из них указавшие: на плохое или очень плохое состояние жилого помещения,housing_state_из них указавшие: на плохое состояние или очень плохое состояние жилого помещения и на стесненность проживания,"housing_state_из числа домохозяйств, собирающихся улучшить свои жилищные условия: планируют вселиться в жилое помещение, строительство которого ведут (участвуют в долевом строительстве)","housing_state_из числа домохозяйств, собирающихся улучшить свои жилищные условия: собираются подать документы для постановки на очередь (и/или ожидают прохождения очереди)","housing_state_из числа домохозяйств, собирающихся улучшить свои жилищные условия: рассчитывают на получение нового жилья в связи со сносом дома","housing_state_из числа домохозяйств, собирающихся улучшить свои жилищные условия: собираются купить (построить) другое жилье","housing_state_из числа домохозяйств, собирающихся улучшить свои жилищные условия: собираются снимать жилье","housing_state_из числа домохозяйств, собирающихся улучшить свои жилищные условия: собираются улучшить свои жилищные условия другим способом",housing_state_затруднились ответить,"housing_state_домохозяйства, не собирающиеся улучшать свои жилищные условия"
Алтайский край,14.9,6.3,0.3,0.2,3.1,5.0,0.8,31.5,3.9,56.2,0.6,85.1
Амурская область,14.7,6.1,1.3,0.3,16.6,5.5,1.8,32.6,7.1,35.1,1.3,84.6
Архангельская область (кроме Ненецкого автономного округа),13.4,5.9,1.4,0.7,15.7,3.4,6.3,32.7,4.6,37.3,0.0,86.6
Астраханская область,16.4,5.3,0.8,0.2,7.5,8.1,9.6,36.3,0.9,36.2,1.3,81.8
Белгородская область,12.0,4.2,0.7,0.4,11.2,1.6,1.6,29.5,0.0,56.9,0.0,88.0


### 2.10 Gross Regional Product statistics per capita

The data source contains the per capita gross regional product in roubles per Russian region for the years 1996 to 2020. Columns:
1. `region` - name of the region of the Russian Federation. Data for districts uniting multiple regions has been removed.
2. `gross_product_1996-gross_product_2020` - columns containing the per capita gross regional product in roubles for each of the regions in the corresponding year.

In [97]:
data_gross_product = pd.read_excel(
    'data/gross_regional_product_1996_2020.xlsx', 
    sheet_name='data'
)
data_gross_product['region'] = data_gross_product['region'].str.strip()
data_gross_product = rename_columns(data_gross_product, 'gross_product_')
data_gross_product = data_gross_product.set_index('region', drop=True)
data_gross_product = data_gross_product.sort_index()
data_gross_product.head()


# df = pd.concat([data_population, data_gross_product], axis=1)
# df.to_csv('text.csv')

region,gross_product_1996,gross_product_1997,gross_product_1998,gross_product_1999,gross_product_2000,gross_product_2001,gross_product_2002,gross_product_2003,gross_product_2004,gross_product_2005,...,gross_product_2011,gross_product_2012,gross_product_2013,gross_product_2014,gross_product_2015,gross_product_2016,gross_product_2017,gross_product_2018,gross_product_2019,gross_product_2020
Алтайский край,7605.8,7431.0,8012.4,12204.8,17660.5,23509.0,27991.2,34295.8,44934.9,53812.4,...,137677.2,153556.7,173763.5,186798.6,204933.1,224525.8,231268.4,247599.3,271319.7,291156.9
Амурская область,13155.7,15818.0,15103.7,21935.4,28317.2,42578.3,50449.6,59480.3,72937.0,88597.1,...,273175.8,280023.9,258817.0,286282.6,343385.7,370192.4,373935.1,419905.2,521060.1,571362.1
Архангельская область,12218.6,14133.4,15755.3,25622.9,44797.4,49474.0,61989.4,78507.0,109045.7,128965.3,...,360165.9,391146.2,417776.4,456985.8,532533.7,609484.0,654268.9,753081.9,780623.9,697648.2
Архангельская область (кроме Ненецкого автономного округа),,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,232540.7,270662.9,283264.5,310817.4,352837.9,400764.6,441961.6,493205.1,509917.0,514200.4
Астраханская область,7643.9,8559.6,10172.2,15805.2,27815.3,32037.2,40786.8,50386.3,56357.5,69814.0,...,170504.7,206677.1,269821.7,290822.2,315996.9,361704.8,434701.5,570206.4,596388.2,526950.9
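
The output above shows `0.0` for several early years of one row, which most likely stands for "not reported" rather than a true zero product. Assuming that reading is correct, such placeholders could be turned into missing values before any analysis. The sketch below uses two rows and two columns copied from the output.

```python
import pandas as pd

toy = pd.DataFrame(
    {'gross_product_1998': [8012.4, 0.0], 'gross_product_2020': [291156.9, 514200.4]},
    index=['Алтайский край', 'Архангельская область (кроме Ненецкого автономного округа)'],
)

# Assumption: a per capita GRP of exactly 0 means "not reported", not a true zero.
cleaned = toy.replace(0.0, float('nan'))
print(cleaned.isna().sum())
```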


### 2.X Income situation

#### 2.X.1 Per capita monthly cash income

The data source contains the per capita monthly cash income in roubles per Russian region for the years 2015 to 2020. Columns:
1. `region` - name of the region of the Russian Federation. Data for districts uniting multiple regions has been removed.
2. `monthly_income_2015-monthly_income_2020` - columns containing the per capita monthly cash income in roubles for each of the regions in the corresponding year.

In [93]:
data_monthly_cash_income = pd.read_excel(
    'data/cash_real_income_wages_2015_2020.xlsx', 
    sheet_name='per_capita_cash_income'
)
data_monthly_cash_income = rename_columns(data_monthly_cash_income, 'monthly_income_')
data_monthly_cash_income = data_monthly_cash_income.set_index('region', drop=True)
data_monthly_cash_income = data_monthly_cash_income.sort_index()
data_monthly_cash_income.head()

region,monthly_income_2015,monthly_income_2016,monthly_income_2017,monthly_income_2018,monthly_income_2019,monthly_income_2020
Алтайский край,20860,21256,22139,22829,23937,23864
Амурская область,28240,27976,29213,30937,33304,35499
Архангельская область,31285,31394,32310,33831,35693,36779
Архангельская область (кроме Ненецкого автономного округа),29716,29837,30707,32054,33874,34852
Астраханская область,23832,22841,22884,23670,24971,25199


#### 2.X.2 Real cash income

The data source contains the real cash income as a percentage of the previous year's value per Russian region for the years 2015 to 2020. Columns:
1. `region` - name of the region of the Russian Federation. Data for districts uniting multiple regions has been removed.
2. `real_income_2015-real_income_2020` - columns containing the real cash income as a percentage of the previous year's value for each of the regions in the corresponding year.

In [92]:
data_real_cash_income = pd.read_excel(
    'data/cash_real_income_wages_2015_2020.xlsx', 
    sheet_name='real_incomes'
)
data_real_cash_income = rename_columns(data_real_cash_income, 'real_income_')
data_real_cash_income = data_real_cash_income.set_index('region', drop=True)
data_real_cash_income = data_real_cash_income.sort_index()
data_real_cash_income.head()

region,real_income_2015,real_income_2016,real_income_2017,real_income_2018,real_income_2019,real_income_2020
Алтайский край,99.1,94.7,100.0,99.7,99.6,95.5
Амурская область,96.1,92.1,101.1,102.4,101.7,100.3
Архангельская область,95.1,92.9,98.9,102.0,100.1,98.6
Архангельская область (кроме Ненецкого автономного округа),95.1,93.0,98.7,101.7,100.2,98.4
Астраханская область,94.0,90.1,97.1,100.6,100.7,97.1
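
Because each value is an index relative to the previous year (previous year = 100), the change over several years is obtained by chaining the yearly indices rather than adding them. A minimal worked example, using the Алтайский край row from the output above:

```python
# Year-on-year real income indices for Алтайский край, 2015-2020, from the output above.
yearly_index = [99.1, 94.7, 100.0, 99.7, 99.6, 95.5]

cumulative = 100.0  # level of the base year 2014
for value in yearly_index:
    # Each index is expressed relative to the previous year = 100.
    cumulative *= value / 100.0

print(f'Real income in 2020 relative to 2014 = 100: {cumulative:.1f}')
```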


#### 2.X.3 Per capita monthly formal wage

The data source contains the per capita monthly formal wage in roubles per Russian region for the years 2015 to 2020. Columns:
1. `region` - name of the region of the Russian Federation. Data for districts uniting multiple regions has been removed.
2. `monthly_wage_2015-monthly_wage_2020` - columns containing the per capita monthly formal wage in roubles for each of the regions in the corresponding year.

In [None]:
data_monthly_formal_wage = pd.read_excel(
    'data/cash_real_income_wages_2015_2020.xlsx', 
    sheet_name='formal_wage_paid'
)
data_monthly_formal_wage = rename_columns(data_monthly_formal_wage, 'monthly_wage_')
data_monthly_formal_wage = data_monthly_formal_wage.set_index('region', drop=True)
data_monthly_formal_wage = data_monthly_formal_wage.sort_index()
data_monthly_formal_wage.head()

#### 2.X.4 Real wage

The data source contains the real wage as a percentage of the previous year's value per Russian region for the years 2015 to 2020. Columns:
1. `region` - name of the region of the Russian Federation. Data for districts uniting multiple regions has been removed.
2. `real_wage_2015-real_wage_2020` - columns containing the real wage as a percentage of the previous year's value for each of the regions in the corresponding year.

In [None]:
data_real_wage = pd.read_excel(
    'data/cash_real_income_wages_2015_2020.xlsx', 
    sheet_name='real_pay'
)
data_real_wage = rename_columns(data_real_wage, 'real_wage_')
data_real_wage = data_real_wage.set_index('region', drop=True)
data_real_wage = data_real_wage.sort_index()
data_real_wage.head()
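
All tables above share the `region` index, so they can be joined side by side for the later analysis. The sketch below shows one possible way to do that and to list regions that are missing from at least one source. The two toy frames stand in for the real tables and hold values copied from the outputs above; the choice of an outer join is an assumption.

```python
import pandas as pd

# Toy stand-ins for two of the region-indexed tables built above.
poverty = pd.DataFrame(
    {'poverty_2020_poverty_percent': [17.5, 15.2]},
    index=pd.Index(['Алтайский край', 'Амурская область'], name='region'),
)
income = pd.DataFrame(
    {'monthly_income_2020': [23864, 36779]},
    index=pd.Index(['Алтайский край', 'Архангельская область'], name='region'),
)

# An outer join keeps every region and exposes gaps as NaN.
combined = pd.concat([poverty, income], axis=1, join='outer').sort_index()
incomplete = combined[combined.isna().any(axis=1)].index.tolist()
print(incomplete)
```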