# Final project: Identification of vulnerable population groups

## 1. Problem statement

According to a December 2021 [inFOM survey](https://www.cbr.ru/Collection/Collection/File/39633/inFOM_21-12.pdf), 27% of Russians have enough money only for food, and another 9% cannot afford a nutritious diet. These people are especially attentive to prices, and the rate of growth of food prices usually exceeds the average rate of inflation. At the same time, Rosstat believes that food expenses should make up approximately 36% of a Russian's average monthly expenses (another 10% goes to utilities and housing, 4% goes to medicines). 

Until 2021, the "poverty line" (living below the subsistence minimum) in Russia was determined by the cost of the [minimum food basket](https://base.garant.ru/70306880/). In 2021, the government "untied" the poverty level from the prices of basic products: since then, the subsistence minimum has been calculated as 44.2% of the median income of Russian citizens for the previous year.

You have at your disposal data on income, morbidity, socially vulnerable groups of the Russian population and other economic and demographic data.

Your task as a data scientist:
* cluster the regions of Russia and determine which of them are in the greatest need of assistance to low-income/disadvantaged segments of the population;
* describe the population groups facing poverty;
* determine:
    * whether the number of children, pensioners and other socially vulnerable groups affects the poverty level in the region;
    * whether the level of poverty/social disadvantage is related to production and consumption in the region;
    * what other dependencies can be observed in relation to socially vulnerable segments of the population.

## 2. Getting to know the data

To reduce redundancy, data for districts that aggregate several regions has been removed from the initial data sources. The following former regions have also been removed, because a significant share of their values is missing:
* Агинский Бурятский округ (Забайкальский край)
* Коми-Пермяцкий округ, входящий в состав Пермского края
* Корякский округ, входящий в состав Камчатского края
* Крымский федеральный округ
* Таймырский (Долгано-Ненецкий) автономный округ (Красноярский край)
* Усть-Ордынский Бурятский округ
* Эвенкийский автономный округ (Красноярский край)

### 2.0 Import dependencies and define helper code

#### 2.0.1 Import dependencies

In [230]:
import pandas as pd

from sklearn import preprocessing

#### 2.0.2 Define helper code

In [252]:
regions_count = 85


def drop_empty_years(df):
    missing = df.isnull().sum()
    missing_columns = missing[missing > regions_count * 0.6].index
    df = df.drop(missing_columns, axis=1)
    return df


def normalize_data(df):
    mm_scaler = preprocessing.MinMaxScaler()
    df_mm = mm_scaler.fit_transform(df)
    df_mm = pd.DataFrame(
        df_mm, 
        columns=df.columns, 
        index=df.index
    )
    return df_mm


def rename_columns(df, prefix, start_col=1):
    dict_columns = {}
    for i in range(start_col, df.shape[1]):
        dict_columns[df.columns[i]] = prefix + str(df.columns[i])

    return df.rename(columns=dict_columns)
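
To make the threshold used by `drop_empty_years` concrete, here is a minimal sketch on a toy table. The function name `drop_sparse_columns`, its `n_rows` parameter and the toy values are hypothetical (the notebook itself compares against the global `regions_count = 85`); the rule is the same: a column is dropped only when more than 60 % of its values are missing.

```python
import numpy as np
import pandas as pd


def drop_sparse_columns(df, n_rows, max_missing_share=0.6):
    """Drop columns whose number of missing values exceeds the given share of n_rows."""
    missing = df.isnull().sum()
    sparse = missing[missing > n_rows * max_missing_share].index
    return df.drop(columns=sparse)


# Hypothetical toy data: five "regions", three "years".
toy = pd.DataFrame(
    {
        '1998': [np.nan, np.nan, np.nan, np.nan, 1.0],  # 4 of 5 missing (80 %)
        '1999': [np.nan, np.nan, 2.0, 3.0, 4.0],        # 2 of 5 missing (40 %)
        '2000': [5.0, 6.0, 7.0, 8.0, 9.0],              # complete
    },
    index=['A', 'B', 'C', 'D', 'E'],
)

cleaned = drop_sparse_columns(toy, n_rows=len(toy))
print(list(cleaned.columns))  # ['1999', '2000']
```

Only `1998` is removed; a column that is 40 % empty is kept, and its remaining gaps still have to be handled before clustering.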

### 2.1 Population data

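The data source contains the population of each Russian region for the years 1999 to 2022. Columns:
1. `region` - name of the region of the Russian Federation. 
2. `1999` to `2022` - population count of the region in the corresponding year. The `population_` prefix is not applied (the `rename_columns` call in the cell is commented out), so the columns keep plain year labels that match the year columns of the tables later divided by the population.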

In [257]:
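data_population = pd.read_excel('data/population.xlsx', sheet_name='report')
# The sheet appears to store each region in two consecutive rows: the first
# holds the name, the second the yearly values. Copy the values up into the
# name row, drop the value row and re-number the rows before the next step.
for i in range(data_population.shape[0]//2):
    data_population.iloc[i, 1:] = data_population.iloc[i+1, 1:]
    data_population.drop(index=i+1, inplace=True)
    data_population = data_population.reset_index(drop=True)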

#data_population = rename_columns(data_population, 'population_')
data_population = data_population.set_index('region')
data_population = data_population.sort_index()
data_population.to_csv('population.csv')
data_population.head()

region,1999,2000,2001,2002,2003,2004,2005,2006,2007,2008,...,2013,2014,2015,2016,2017,2018,2019,2020,2021,2022
Алтайский край,2662738.0,2651628.0,2641079.0,2621050.0,2602595.0,2571987.0,2539430.0,2503510.0,2473024.0,2453455.0,...,2398751.0,2390638.0,2384812.0,2376774.0,2365680.0,2350080.0,2332813.0,2317153.0,2296353.0,2268179.0
Амурская область,949526.0,935607.0,923055.0,911381.0,901044.0,887781.0,874018.0,861056.0,850502.0,844290.0,...,816910.0,811274.0,809873.0,805689.0,801752.0,798424.0,793194.0,790044.0,781846.0,772525.0
Архангельская область,1414144.0,1390334.0,1369118.0,1350448.0,1332655.0,1315549.0,1299218.0,1281838.0,1266667.0,1257164.0,...,1202295.0,1191785.0,1183323.0,1174078.0,1165750.0,1155028.0,1144119.0,1136535.0,1127051.0,1114322.0
Архангельская область (кроме Ненецкого автономного округа),,,,,,1273668.0,1257312.0,1239924.0,1224813.0,1215264.0,...,1159506.0,1148760.0,1139950.0,1130240.0,1121813.0,1111031.0,1100290.0,1092424.0,1082662.0,1069782.0
Астраханская область,1016372.0,1012385.0,1009281.0,1005510.0,1004780.0,1006073.0,1006467.0,1002517.0,1001300.0,1005897.0,...,1013840.0,1016516.0,1021287.0,1018626.0,1018866.0,1017514.0,1014065.0,1005782.0,997778.0,989430.0
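
The row-merging loop above can also be written without a loop. The sketch below assumes the layout that the loop implies (a row with the region name followed by a row with that region's values); the toy frame and its numbers are hypothetical and only illustrate the idea, they are not taken from `population.xlsx`.

```python
import numpy as np
import pandas as pd

# Hypothetical frame mimicking the assumed sheet layout:
# name row, value row, name row, value row, ...
raw = pd.DataFrame({
    'region': ['Region B', np.nan, 'Region A', np.nan],
    1999: [np.nan, 200.0, np.nan, 100.0],
    2000: [np.nan, 190.0, np.nan, 110.0],
})

names = raw['region'].iloc[0::2].to_numpy()  # even rows carry the region names
values = raw.iloc[1::2, 1:].to_numpy()       # odd rows carry the yearly values

population = pd.DataFrame(
    values,
    index=pd.Index(names, name='region'),
    columns=raw.columns[1:],
).sort_index()
print(population)
```

Slicing with a step of two pairs each name with the values in the row below it, which is what the loop does one region at a time.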


### 2.2 Child mortality in rural areas

The data source contains child mortality in rural areas (deaths in the first year of life), as absolute counts per Russian region, for the years 1990 to 2021. Columns:
1. `region` - name of the region of the Russian Federation. Data for districts uniting multiple regions has been removed.
2. `child_mortality_rural_1990-child_mortality_rural_2021` - columns containing the child mortality in rural areas for each of the regions in the corresponding year.

In [233]:
data_child_mortality_rural = pd.read_excel('data/child_mortality_rural_1990_2021.xlsx')
data_child_mortality_rural['region'] = data_child_mortality_rural['region'].str.strip()
data_child_mortality_rural = data_child_mortality_rural.set_index('region', drop=True)
data_child_mortality_rural = data_child_mortality_rural.sort_index()

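Convert rural child mortality to a rate per 1,000 inhabitants by dividing it by the regional population. Years without population data (before 1999) become empty after the division and are dropped in the same cell.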

In [234]:
data_child_mortality_rural = data_child_mortality_rural / data_population * 1000

# Drop the years with more than 60 % missing values (relative to regions_count).
data_child_mortality_rural = drop_empty_years(data_child_mortality_rural)
data_child_mortality_rural = rename_columns(data_child_mortality_rural, 'child_mortality_rural_', 0)
data_child_mortality_rural.head()

region,child_mortality_rural_1999,child_mortality_rural_2000,child_mortality_rural_2001,child_mortality_rural_2002,child_mortality_rural_2003,child_mortality_rural_2004,child_mortality_rural_2005,child_mortality_rural_2006,child_mortality_rural_2007,child_mortality_rural_2008,...,child_mortality_rural_2012,child_mortality_rural_2013,child_mortality_rural_2014,child_mortality_rural_2015,child_mortality_rural_2016,child_mortality_rural_2017,child_mortality_rural_2018,child_mortality_rural_2019,child_mortality_rural_2020,child_mortality_rural_2021
Алтайский край,0.073608,0.075425,0.067397,0.071345,0.064551,0.075817,0.063006,0.05712,0.063485,0.056247,...,0.071867,0.0642,0.048941,0.039416,0.036604,0.033394,0.029786,0.02572,0.018989,0.01829
Амурская область,0.095837,0.09085,0.114836,0.09546,0.077688,0.078848,0.086955,0.081296,0.085832,0.085279,...,0.076682,0.061206,0.048073,0.038278,0.028547,0.024945,0.021292,0.020172,0.016455,0.019185
Архангельская область,0.041014,0.033086,0.045285,0.036284,0.037519,0.037247,0.033867,0.048368,0.040263,0.038977,...,0.026369,0.025784,0.026011,0.021972,0.015331,0.018872,0.012987,0.013985,0.006159,0.010647
Архангельская область (кроме Ненецкого автономного округа),,,,,,,,,,,...,,0.025011,0.025245,0.021931,0.015041,0.019611,0.012601,0.014542,0.004577,0.009236
Астраханская область,0.055098,0.060254,0.069356,0.066633,0.057724,0.05765,0.05564,0.052867,0.055927,0.054678,...,0.028572,0.034522,0.041318,0.029375,0.020616,0.021593,0.013759,0.01775,0.012925,0.011024
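
The division by `data_population` relies on pandas aligning both frames on their row index and on their column labels. A minimal sketch with hypothetical toy frames (`deaths`, `population` and their values are made up) shows what happens where the labels do not overlap:

```python
import pandas as pd

# Hypothetical counts: years 1998-1999, regions A and B.
deaths = pd.DataFrame({1998: [10, 20], 1999: [12, 18]}, index=['A', 'B'])
# Hypothetical population: years 1999-2000, regions A and C.
population = pd.DataFrame({1999: [1000, 2000], 2000: [1100, 1900]}, index=['A', 'C'])

# Division aligns on index AND columns; cells present in only one frame become NaN.
rate = deaths / population * 1000
print(rate)
```

Only the cell for region `A` in 1999 exists in both frames, so it is the only one with a value (12 deaths per 1,000 inhabitants); every other cell is `NaN`. This is the mechanism by which years outside the population table end up empty and are then removed by `drop_empty_years`.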


### 2.3 Child mortality in urban areas

The data source contains child mortality in urban areas (deaths in the first year of life), as absolute counts per Russian region, for the years 1990 to 2021. Columns:
1. `region` - name of the region of the Russian Federation. Data for districts uniting multiple regions has been removed.
2. `child_mortality_urban_1990-child_mortality_urban_2021` - columns containing the child mortality in urban areas for each of the regions in the corresponding year.

In [235]:
data_child_mortality_urban = pd.read_excel('data/child_mortality_urban_1990_2021.xlsx')
data_child_mortality_urban['region'] = data_child_mortality_urban['region'].str.strip()
data_child_mortality_urban = data_child_mortality_urban.set_index('region', drop=True)
data_child_mortality_urban = data_child_mortality_urban.sort_index()

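Convert urban child mortality to a rate per 1,000 inhabitants by dividing it by the regional population. Years without population data (before 1999) become empty after the division and are dropped in the same cell.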

In [236]:
data_child_mortality_urban = data_child_mortality_urban / data_population * 1000

# Drop the years with more than 60 % missing values (relative to regions_count).
data_child_mortality_urban = drop_empty_years(data_child_mortality_urban)
data_child_mortality_urban = rename_columns(data_child_mortality_urban, 'child_mortality_urban_', 0)
data_child_mortality_urban.head()

region,child_mortality_urban_1999,child_mortality_urban_2000,child_mortality_urban_2001,child_mortality_urban_2002,child_mortality_urban_2003,child_mortality_urban_2004,child_mortality_urban_2005,child_mortality_urban_2006,child_mortality_urban_2007,child_mortality_urban_2008,...,child_mortality_urban_2012,child_mortality_urban_2013,child_mortality_urban_2014,child_mortality_urban_2015,child_mortality_urban_2016,child_mortality_urban_2017,child_mortality_urban_2018,child_mortality_urban_2019,child_mortality_urban_2020,child_mortality_urban_2021
Алтайский край,0.073233,0.064489,0.061717,0.066004,0.056866,0.058709,0.057887,0.057519,0.051354,0.056247,...,0.06522,0.056696,0.066091,0.053673,0.052172,0.044807,0.045105,0.023148,0.022873,0.020032
Амурская область,0.188515,0.148567,0.136503,0.121793,0.132069,0.132916,0.143018,0.130073,0.122281,0.121996,...,0.119283,0.084465,0.076423,0.064208,0.037235,0.034924,0.036322,0.03404,0.036707,0.02686
Архангельская область,0.084857,0.089906,0.105177,0.092562,0.094548,0.072973,0.1016,0.063191,0.085263,0.068408,...,0.064275,0.072362,0.058735,0.052395,0.053659,0.046322,0.033765,0.037584,0.022877,0.020407
Архангельская область (кроме Ненецкого автономного округа),,,,,,,,,,,...,,0.072445,0.059194,0.051757,0.054856,0.044571,0.035103,0.038172,0.021969,0.02032
Астраханская область,0.096421,0.102728,0.100071,0.089507,0.093553,0.101384,0.102338,0.078802,0.073904,0.062631,...,0.111333,0.093703,0.104278,0.087145,0.059885,0.051037,0.056019,0.053251,0.050707,0.035078


### 2.4 Disability statistics

The data source is loaded and then reshaped to contain the disability data per Russian region for the years 2017 to 2022, in absolute values. Columns:
1. `region` - name of the region of the Russian Federation. Data for districts uniting multiple regions has been removed.
2. `disability_{year}_total, disability_{year}_18_30, disability_{year}_31_40, disability_{year}_41_50, disability_{year}_51_60, disability_{year}_60_` - columns containing the disability counts across the whole population and across different age groups for each of the regions in the corresponding year.

In [258]:
data_disability = pd.read_csv('data/disabled_total_by_age_2017_2022.csv', sep=';')
data_disability = data_disability[data_disability['date'].str.endswith('01-01')]


def flatten_table(df, years):
    result = []
    for year in years:
        df_year = df[df['date'].str.startswith(year)]
        df_year = df_year.drop(columns='date')
        df_year = rename_columns(df_year, f'disability_{year}_')
        df_year = df_year.set_index('region')
        result.append(df_year)
    return pd.concat(result, axis=1)


data_disability = flatten_table(data_disability, ['2017', '2018', '2019', '2020', '2021', '2022'])
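
`flatten_table` turns a long table (one row per region and date) into a wide one (one row per region, one column per year and age group) by slicing year by year and concatenating. The same reshaping can be sketched with `DataFrame.pivot`, which is a different technique from the loop above. The toy frame is hypothetical; its column names `total` and `18_30` are assumptions inferred from the resulting column names.

```python
import pandas as pd

# Hypothetical long-format input: one row per region and snapshot date.
long_df = pd.DataFrame({
    'region': ['A', 'B', 'A', 'B'],
    'date': ['2017-01-01', '2017-01-01', '2018-01-01', '2018-01-01'],
    'total': [10, 20, 11, 19],
    '18_30': [1, 2, 3, 4],
})
long_df['year'] = long_df['date'].str[:4]

# Pivot to one row per region; columns become (measure, year) pairs.
wide = long_df.pivot(index='region', columns='year', values=['total', '18_30'])

# Flatten the two-level column index into names like disability_2017_total.
wide.columns = [f'disability_{year}_{name}' for name, year in wide.columns]
print(wide)
```

With `pivot` the columns come out grouped by measure rather than by year, so a final reordering would be needed to reproduce the exact column order of `flatten_table`.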

Disability data is hard to normalize by population: the statistics are split into multiple age groups, but there is no population data for those age groups. However, clustering algorithms need features on a comparable scale, so the data must still be normalized. We'll use `MinMaxScaler` for that.

In [259]:
# Drop the columns with more than 60 % missing values (relative to regions_count).
data_disability = drop_empty_years(data_disability)

data_disability = normalize_data(data_disability)
data_disability.head()

region,disability_2017_total,disability_2017_18_30,disability_2017_31_40,disability_2017_41_50,disability_2017_51_60,disability_2017_60_,disability_2018_total,disability_2018_18_30,disability_2018_31_40,disability_2018_41_50,...,disability_2021_31_40,disability_2021_41_50,disability_2021_51_60,disability_2021_60_,disability_2022_total,disability_2022_18_30,disability_2022_31_40,disability_2022_41_50,disability_2022_51_60,disability_2022_60_
Алтайский край,0.170722,0.240725,0.320237,0.206124,0.186609,0.151663,0.169545,0.217319,0.313407,0.216264,...,0.289868,0.243231,0.174882,0.153223,0.173537,0.20129,0.27579,0.252781,0.175515,0.154996
Амурская область,0.063526,0.096861,0.121453,0.091096,0.076194,0.05332,0.062755,0.088285,0.119584,0.092475,...,0.108769,0.093473,0.069094,0.053639,0.062357,0.088786,0.103673,0.093059,0.067647,0.05348
Архангельская область,0.082092,0.091401,0.119025,0.098404,0.08336,0.077427,0.082126,0.084207,0.115894,0.098763,...,0.109012,0.104727,0.083923,0.079949,0.084181,0.079871,0.104146,0.107114,0.084217,0.079949
Астраханская область,0.040878,0.082373,0.090531,0.068044,0.054215,0.030785,0.040783,0.075972,0.087684,0.068852,...,0.083969,0.072509,0.055961,0.032689,0.043615,0.071713,0.08126,0.074731,0.057874,0.033407
Белгородская область,0.208415,0.169054,0.244638,0.248063,0.22983,0.198957,0.203616,0.152131,0.233127,0.237992,...,0.204246,0.22591,0.217374,0.190216,0.194755,0.132885,0.191266,0.221324,0.21551,0.189877
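
With its default settings, `MinMaxScaler` rescales every column independently to the range from 0 to 1 using `(x - min) / (max - min)`. The sketch below reproduces that formula with pandas only, on hypothetical toy counts, to show what the scaled values mean:

```python
import numpy as np
import pandas as pd

# Hypothetical absolute counts for three regions.
counts = pd.DataFrame(
    {
        'disability_2017_total': [100.0, 300.0, 500.0],
        'disability_2018_total': [110.0, np.nan, 510.0],
    },
    index=['A', 'B', 'C'],
)

# Column-wise min-max scaling; min() and max() skip missing values,
# and missing cells stay missing in the result.
scaled = (counts - counts.min()) / (counts.max() - counts.min())
print(scaled)
```

Because the inputs are absolute counts, a scaled value tells where a region sits between the smallest and the largest count in that column; it is not a per-capita rate.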


### 2.5 Disease Incidence Statistics

The data source is loaded and then reshaped to contain the disease incidence (cases per 100,000 people in the corresponding age group) per Russian region for the years 2005 to 2016. Only the totals over all diseases per region and age group are kept; statistics for specific diseases have been removed. Columns:
1. `region` - name of the region of the Russian Federation. Data for districts uniting multiple regions has been removed.
2. `desease_incidence_0-14 years_{year}, desease_incidence_15-17 years_{year}, desease_incidence_18 years and older_{year}, desease_incidence_Total_{year}` - columns containing the disease incidence across the whole population and across different age groups for each of the regions in the corresponding year.

In [261]:
data_desease_incidence = pd.read_excel(
    'data/morbidity_2005_2016_age_disease.xlsx',
    index_col=[0, 1]
)
data_desease_incidence_list = []
age_groups = ['0-14 years', '15-17 years', '18 years and older', 'Total']
for age_group in age_groups:
    df = data_desease_incidence[data_desease_incidence.index.get_level_values('age_group') == age_group]
    df = df.reset_index()
    df = df.drop('age_group', axis=1)
    df = rename_columns(df, f'desease_incidence_{age_group}_')
    df['region'] = df['region'].str.strip()
    df = df.set_index('region')
    data_desease_incidence_list.append(df)
data_desease_incidence = pd.concat(data_desease_incidence_list, axis=1)
data_desease_incidence = data_desease_incidence.sort_index()


Disease data is hard to normalize by population: the statistics are split into multiple age groups, but there is no population data for those age groups. However, clustering algorithms need features on a comparable scale, so the data must still be normalized. We'll use `MinMaxScaler` for that.

In [262]:
# Drop the columns with more than 60 % missing values (relative to regions_count).
data_desease_incidence = drop_empty_years(data_desease_incidence)

data_desease_incidence = normalize_data(data_desease_incidence)
data_desease_incidence.head()

region,desease_incidence_0-14 years_2005,desease_incidence_0-14 years_2006,desease_incidence_0-14 years_2007,desease_incidence_0-14 years_2008,desease_incidence_0-14 years_2009,desease_incidence_0-14 years_2010,desease_incidence_0-14 years_2011,desease_incidence_0-14 years_2012,desease_incidence_0-14 years_2013,desease_incidence_0-14 years_2014,...,desease_incidence_Total_2006,desease_incidence_Total_2007,desease_incidence_Total_2008,desease_incidence_Total_2009,desease_incidence_Total_2010,desease_incidence_Total_2011,desease_incidence_Total_2012,desease_incidence_Total_2013,desease_incidence_Total_2015,desease_incidence_Total_2016
Алтайский край,0.540326,0.423107,0.382056,0.396793,0.382722,0.37719,0.376471,0.362263,0.39359,0.411102,...,0.454186,0.464077,0.471435,0.446491,0.438596,0.476858,0.503423,0.566883,0.659831,0.72745
Амурская область,0.462399,0.379026,0.38681,0.429858,0.40662,0.403409,0.420063,0.445003,0.445859,0.476248,...,0.213058,0.239134,0.236702,0.24986,0.25826,0.284827,0.31143,0.319947,0.390439,0.413315
Архангельская область,0.733238,0.74757,0.693187,0.699794,0.69153,0.699245,0.732198,0.718222,0.698693,,...,0.43283,0.426431,0.425356,0.42312,0.45347,0.481252,0.486296,0.492014,,
Архангельская область (кроме Ненецкого автономного округа),,,,,,,,,,0.743857,...,,,,,,,,,0.5754,0.589151
Астраханская область,0.514894,0.409359,0.374704,0.384646,0.389905,0.361651,0.357596,0.320149,0.292137,0.265183,...,0.255263,0.252347,0.244003,0.261117,0.254921,0.248056,0.221172,0.207308,0.161564,0.212842
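
Several loaders in this notebook call `.str.strip()` on the `region` column before using it as the index. Presumably this is because tables are later combined by region name, and a stray space is enough to break the match. The sketch below uses hypothetical toy frames (`by_age`, `total` and the helper `to_indexed` are made up for illustration) to show the effect with `pd.concat(..., axis=1)`:

```python
import pandas as pd

# Hypothetical inputs: the first table has a trailing space in one region name.
by_age = pd.DataFrame({'region': ['A ', 'B'], 'x_2005': [1.0, 2.0]})
total = pd.DataFrame({'region': ['A', 'B'], 'y_2005': [3.0, 4.0]})


def to_indexed(df, strip):
    """Return a copy indexed by region, optionally stripping whitespace first."""
    df = df.copy()
    if strip:
        df['region'] = df['region'].str.strip()
    return df.set_index('region')


# Column-wise concat aligns rows by index label (outer join by default).
unstripped = pd.concat([to_indexed(by_age, False), to_indexed(total, False)], axis=1)
stripped = pd.concat([to_indexed(by_age, True), to_indexed(total, True)], axis=1)

print(unstripped.shape, stripped.shape)  # (3, 2) (2, 2)
```

Without stripping, `'A '` and `'A'` are treated as two different regions and each gets a row that is half empty; after stripping, the rows line up.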


### 2.6 Welfare Expense Share

The data source is loaded and then transformed to contain the welfare expense data per Russian region for the years 2015 to 2020. Columns:
1. `region` - name of the region of the Russian Federation. Data for districts uniting multiple regions has been removed.
2. `welfare_share_2015-welfare_share_2020` - columns containing the welfare expense percentage across the whole population for each of the regions in the corresponding year.

In [267]:
data_welfare_share = pd.read_excel('data/welfare_expense_share_2015_2020.xlsx')
data_welfare_share = rename_columns(data_welfare_share, 'welfare_share_')
data_welfare_share = data_welfare_share.set_index('region', drop=True)
data_welfare_share = data_welfare_share.sort_index()

Welfare expense is already provided as a percentage. We will normalize this data by dividing it by 100, which converts it to a fraction between 0 and 1.

In [266]:
# Drop the years with more than 60 % missing values (relative to regions_count).
data_welfare_share = drop_empty_years(data_welfare_share)

data_welfare_share = data_welfare_share / 100
data_welfare_share.head()

region,welfare_share_2015,welfare_share_2016,welfare_share_2017,welfare_share_2018,welfare_share_2019,welfare_share_2020
Алтайский край,0.188,0.204,0.309,0.298,0.283,0.297
Амурская область,0.192,0.289,0.263,0.241,0.227,0.217
Архангельская область,0.154,0.163,0.248,0.242,0.224,0.214
Астраханская область,0.188,0.205,0.277,0.264,0.258,0.314
Белгородская область,0.113,0.118,0.164,0.156,0.138,0.161


### 2.7 Poverty Statistics

The data source is loaded and then reshaped to contain the poverty data per Russian region for the years 1992 to 2020. Columns:
1. `region` - name of the region of the Russian Federation. Data for districts uniting multiple regions has been removed.
2. `poverty_{year}_poverty_percent` - columns containing the poverty percentage across the whole population for each of the regions in the corresponding year. The suffix comes from the `poverty_percent` column of the source file, to which the `poverty_{year}_` prefix is added.

In [269]:
data_poverty = pd.read_csv(
    'data/poverty_percent_by_regions_1992_2020.csv', 
    sep=';', 
    skipinitialspace=True
)


def flatten_table(df, years):
    result = []
    for year in years:
        df_year = df[df['year'] == year]
        df_year = df_year.drop(columns='year')
        df_year = rename_columns(df_year, f'poverty_{year}_')
        df_year = df_year.set_index('region')
        result.append(df_year)
    return pd.concat(result, axis=1)


data_poverty = flatten_table(data_poverty, list(range(1992, 2021)))
data_poverty = data_poverty.sort_index()

Poverty data is already provided as a percentage. We will normalize this data by dividing it by 100, which converts it to a fraction between 0 and 1.

In [270]:
# Drop the years with more than 60 % missing values (relative to regions_count).
data_poverty = drop_empty_years(data_poverty)

data_poverty = data_poverty / 100
data_poverty.head()

region,poverty_1995_poverty_percent,poverty_1996_poverty_percent,poverty_1997_poverty_percent,poverty_1998_poverty_percent,poverty_1999_poverty_percent,poverty_2000_poverty_percent,poverty_2001_poverty_percent,poverty_2002_poverty_percent,poverty_2003_poverty_percent,poverty_2004_poverty_percent,...,poverty_2011_poverty_percent,poverty_2012_poverty_percent,poverty_2013_poverty_percent,poverty_2014_poverty_percent,poverty_2015_poverty_percent,poverty_2016_poverty_percent,poverty_2017_poverty_percent,poverty_2018_poverty_percent,poverty_2019_poverty_percent,poverty_2020_poverty_percent
Алтайский край,0.337,0.468,0.457,0.529,0.538,0.539,0.473,0.389,0.339,0.309,...,0.226,0.206,0.176,0.171,0.18,0.178,0.175,0.174,0.176,0.175
Амурская область,0.361,0.282,0.263,0.312,0.38,0.477,0.453,0.446,0.356,0.338,...,0.204,0.16,0.162,0.148,0.152,0.17,0.167,0.156,0.157,0.152
Архангельская область,0.262,0.268,0.239,0.316,0.426,0.335,0.274,0.265,0.235,0.196,...,0.144,0.132,0.146,0.148,0.165,0.152,0.143,0.135,0.136,0.128
Архангельская область (кроме Ненецкого автономного округа),,,,,,,,,,,...,,0.13,0.145,0.145,0.162,0.149,0.139,0.125,0.127,0.123
Астраханская область,0.323,0.307,0.25,0.304,0.36,0.334,0.311,0.262,0.229,0.203,...,0.142,0.125,0.118,0.12,0.142,0.161,0.155,0.151,0.155,0.156


### 2.8 Poverty Statistics in Demographic Context

The data source is loaded and then transformed to contain the poverty data within demographic groups per Russian region for the years 2017 to 2020. Columns:
1. `region` - name of the region of the Russian Federation. Data for districts uniting multiple regions has been removed.
2. `poverty_dem_children_{year}, poverty_dem_seniors_{year}, poverty_dem_adults_{year}` - columns containing the poverty percentage across the demographic groups for each of the regions in the corresponding year.

In [272]:
data_poverty_dem_list = []
for year in range(2017, 2021):
    data_poverty_dem = pd.read_excel(f'data/poverty_socdem_{year}.xlsx')
    data_poverty_dem = rename_columns(data_poverty_dem, 'poverty_dem_')
    data_poverty_dem['region'] = data_poverty_dem['region'].str.strip()
    data_poverty_dem = data_poverty_dem.set_index('region', drop=True)
    data_poverty_dem_list.append(data_poverty_dem)
data_poverty_dem = pd.concat(data_poverty_dem_list, axis=1)

data_poverty_dem = data_poverty_dem.sort_index()

Poverty data in demographic context is already provided as a percentage. We will normalize this data by dividing it by 100, which converts it to a fraction between 0 and 1.

In [273]:
# Drop the columns with more than 60 % missing values (relative to regions_count).
data_poverty_dem = drop_empty_years(data_poverty_dem)

data_poverty_dem = data_poverty_dem / 100
data_poverty_dem.head()

region,poverty_dem_children_2017,poverty_dem_seniors_2017,poverty_dem_adults_2017,poverty_dem_children_2018,poverty_dem_seniors_2018,poverty_dem_adults_2018,poverty_dem_children_2019,poverty_dem_seniors_2019,poverty_dem_adults_2019,poverty_dem_children_2020,poverty_dem_seniors_2020,poverty_dem_adults_2020
Алтайский край,0.38,0.064,0.556,0.421,0.051,0.528,0.387,0.039,0.574,0.314,0.092,0.594
Амурская область,0.399,0.06,0.541,0.406,0.046,0.548,0.339,0.072,0.589,0.384,0.047,0.569
Архангельская область,0.381,0.068,0.551,0.395,0.065,0.54,0.33,0.081,0.59,0.302,0.067,0.631
Архангельская область (кроме Ненецкого автономного округа),0.374,0.071,0.555,0.388,0.066,0.546,0.313,0.089,0.598,0.285,0.072,0.643
Астраханская область,0.402,0.042,0.555,0.35,0.052,0.598,0.434,0.047,0.519,0.421,0.055,0.524
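
Judging by the printed values, the three demographic shares of a given year appear to split the poor population of a region into children, seniors and adults, so after division by 100 they should add up to roughly 1. This is an inference from the output above, not something the data source documents. A small sanity check, using the 2017 values of three regions copied from that output:

```python
import pandas as pd

# 2017 shares copied from the output above (three regions only).
shares_2017 = pd.DataFrame(
    {
        'poverty_dem_children_2017': [0.380, 0.399, 0.402],
        'poverty_dem_seniors_2017': [0.064, 0.060, 0.042],
        'poverty_dem_adults_2017': [0.556, 0.541, 0.555],
    },
    index=['Алтайский край', 'Амурская область', 'Астраханская область'],
)

row_sums = shares_2017.sum(axis=1)

# Allow a small tolerance for rounding in the published percentages.
sums_to_one = (row_sums - 1).abs() <= 0.002
print(sums_to_one)
```

If the check failed for some region, that would hint at a misread column or a mixed-up year in the source files rather than at a real demographic effect.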


### 2.9 Housing Statistics 2020

#### 2.9.1 Living conditions - size

The data source contains the housing conditions in terms of living space in 2020 across the Russian regions. Columns:
1. `housing_size_в том числе домохозяйства, указавшие, что при проживании не испытывают стесненности` - percentage of the households which reported having sufficient living space.
2. `housing_size_в том числе домохозяйства, указавшие, что при проживании испытывают определенную стесненность` - percentage of the households which reported some shortage of living space.
3. `housing_size_в том числе домохозяйства, указавшие, что при проживании испытывают большую стесненность` - percentage of the households which reported a considerable shortage of living space.
4. `housing_size_затруднились ответить` - percentage of the households which found it difficult to answer.
5. `housing_size_Размер общей площади в расчете на члена домохозяйства` - total area per household member.
6. `housing_size_Размер жилой площади в расчете на члена домохозяйства` - living area per household member.
7. `housing_size_Число жилых комнат в расчете на одно домохозяйство` - number of living rooms per household.

In [275]:
data_housing_size = pd.read_excel(
    'data/housing_2020.xlsx', 
    sheet_name='housing_cond'
)
data_housing_size = rename_columns(data_housing_size, 'housing_size_')
data_housing_size = data_housing_size.set_index('region', drop=True)
data_housing_size = data_housing_size.sort_index()

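The living-space answers are already provided as percentages. We will normalize this data by dividing it by 100. The same division is applied to the last three columns (area per household member and rooms per household), although they are not percentages.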

In [276]:
# Drop the columns with more than 60 % missing values (relative to regions_count).
data_housing_size = drop_empty_years(data_housing_size)

data_housing_size = data_housing_size / 100
data_housing_size.head()

region,"housing_size_в том числе домохозяйства, указавшие, что при проживании не испытывают стесненности","housing_size_в том числе домохозяйства, указавшие, что при проживании испытывают определенную стесненность","housing_size_в том числе домохозяйства, указавшие, что при проживании испытывают большую стесненность",housing_size_затруднились ответить,housing_size_Размер общей площади в расчете на члена домохозяйства,housing_size_Размер жилой площади в расчете на члена домохозяйства,housing_size_Число жилых комнат в расчете на одно домохозяйство
Алтайский край,0.832,0.134,0.033,0.001,0.259,0.18,0.024
Амурская область,0.773,0.168,0.059,0.0,0.232,0.173,0.023
Архангельская область (кроме Ненецкого автономного округа),0.833,0.145,0.022,0.0,0.246,0.169,0.024
Астраханская область,0.775,0.194,0.031,0.0,0.249,0.192,0.023
Белгородская область,0.863,0.117,0.02,0.0,0.251,0.178,0.026


#### 2.9.2 Living conditions - state

The data source contains the housing conditions in terms of the state of the dwelling, and the households' plans to improve them, in 2020 across the Russian regions. Columns:
1. `housing_state_из них домохозяйства, собирающиеся улучшить свои жилищные условия` - percentage of the households which intend to improve their housing conditions.
2. `housing_state_из них указавшие: на стесненность проживания` - percentage of the households which reported insufficient living space.
3. `housing_state_из них указавшие: на плохое или очень плохое состояние жилого помещения` - percentage of the households which reported a poor or very poor state of the dwelling.
4. `housing_state_из них указавшие: на плохое состояние или очень плохое состояние жилого помещения и на стесненность проживания` - percentage of the households which reported both a poor or very poor state of the dwelling and insufficient living space.
5. `housing_state_из числа домохозяйств, собирающихся улучшить свои жилищные условия: планируют вселиться в жилое помещение, строительство которого ведут (участвуют в долевом строительстве)` - percentage of the households which plan to move into a dwelling that they are currently building (including shared-equity construction).
6. `housing_state_из числа домохозяйств, собирающихся улучшить свои жилищные условия: собираются подать документы для постановки на очередь (и/или ожидают прохождения очереди)` - percentage of the households which plan to apply for a place on the housing waiting list (and/or are already waiting in the queue).
7. `housing_state_из числа домохозяйств, собирающихся улучшить свои жилищные условия: рассчитывают на получение нового жилья в связи со сносом дома` - percentage of the households which expect to receive new housing because their current building is being demolished.
8. `housing_state_из числа домохозяйств, собирающихся улучшить свои жилищные условия: собираются купить (построить) другое жилье` - percentage of the households which plan to buy (or build) other housing.
9. `housing_state_из числа домохозяйств, собирающихся улучшить свои жилищные условия: собираются снимать жилье` - percentage of the households which plan to rent housing.
10. `housing_state_из числа домохозяйств, собирающихся улучшить свои жилищные условия: собираются улучшить свои жилищные условия другим способом` - percentage of the households which plan to improve their housing conditions by some other means.
11. `housing_state_затруднились ответить` - percentage of the households which found it difficult to answer.
12. `housing_state_домохозяйства, не собирающиеся улучшать свои жилищные условия` - percentage of the households which do not plan to improve their housing conditions.

In [290]:
data_housing_state = pd.read_excel(
    'data/housing_2020.xlsx', 
    sheet_name='housing_intent'
)
data_housing_state = rename_columns(data_housing_state, 'housing_state_')
data_housing_state = data_housing_state.set_index('region', drop=True)
data_housing_state = data_housing_state.sort_index()

The data on the state of housing is already expressed as percentages. We will normalize it to the range 0 to 1 by dividing it by 100.

In [291]:
# Delete the years which have up to 40 % of empty values.
data_housing_state = drop_empty_years(data_housing_state)

data_housing_state = data_housing_state / 100
data_housing_state.head()

Unnamed: 0_level_0,"housing_state_из них домохозяйства, собирающиеся улучшить свои жилищные условия",housing_state_из них указавшие: на стесненность проживания,housing_state_из них указавшие: на плохое или очень плохое состояние жилого помещения,housing_state_из них указавшие: на плохое состояние или очень плохое состояние жилого помещения и на стесненность проживания,"housing_state_из числа домохозяйств, собирающихся улучшить свои жилищные условия: планируют вселиться в жилое помещение, строительство которого ведут (участвуют в долевом строительстве)","housing_state_из числа домохозяйств, собирающихся улучшить свои жилищные условия: собираются подать документы для постановки на очередь (и/или ожидают прохождения очереди)","housing_state_из числа домохозяйств, собирающихся улучшить свои жилищные условия: рассчитывают на получение нового жилья в связи со сносом дома","housing_state_из числа домохозяйств, собирающихся улучшить свои жилищные условия: собираются купить (построить) другое жилье","housing_state_из числа домохозяйств, собирающихся улучшить свои жилищные условия: собираются снимать жилье","housing_state_из числа домохозяйств, собирающихся улучшить свои жилищные условия: собираются улучшить свои жилищные условия другим способом",housing_state_затруднились ответить,"housing_state_домохозяйства, не собирающиеся улучшать свои жилищные условия"
region,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1
Алтайский край,0.149,0.063,0.003,0.002,0.031,0.05,0.008,0.315,0.039,0.562,0.006,0.851
Амурская область,0.147,0.061,0.013,0.003,0.166,0.055,0.018,0.326,0.071,0.351,0.013,0.846
Архангельская область (кроме Ненецкого автономного округа),0.134,0.059,0.014,0.007,0.157,0.034,0.063,0.327,0.046,0.373,0.0,0.866
Астраханская область,0.164,0.053,0.008,0.002,0.075,0.081,0.096,0.363,0.009,0.362,0.013,0.818
Белгородская область,0.12,0.042,0.007,0.004,0.112,0.016,0.016,0.295,0.0,0.569,0.0,0.88


### 2.10 Gross Regional Product statistics per capita

The data source contains per capita gross regional product in roubles for each Russian region for the years 1996 to 2020. Columns:
1. `region` - name of the region of the Russian Federation. Data for districts uniting multiple regions has been removed.
2. `gross_product_1996-gross_product_2020` - columns containing the per capita gross regional product in roubles for each of the regions in the corresponding year.

In [293]:
data_gross_product = pd.read_excel(
    'data/gross_regional_product_1996_2020.xlsx', 
    sheet_name='data'
)
data_gross_product['region'] = data_gross_product['region'].str.strip()
data_gross_product = rename_columns(data_gross_product, 'gross_product_')
data_gross_product = data_gross_product.set_index('region', drop=True)
data_gross_product = data_gross_product.sort_index()

Gross regional product is already expressed per capita, but the values are large compared to the normalized data of other features. I will use the `MinMaxScaler` to normalize it.

In [294]:
# Delete the years which have up to 40 % of empty values.
data_gross_product = drop_empty_years(data_gross_product)

data_gross_product = normalize_data(data_gross_product)
data_gross_product.head()

Unnamed: 0_level_0,gross_product_1996,gross_product_1997,gross_product_1998,gross_product_1999,gross_product_2000,gross_product_2001,gross_product_2002,gross_product_2003,gross_product_2004,gross_product_2005,...,gross_product_2011,gross_product_2012,gross_product_2013,gross_product_2014,gross_product_2015,gross_product_2016,gross_product_2017,gross_product_2018,gross_product_2019,gross_product_2020
region,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
Алтайский край,0.154473,0.135022,0.136759,0.124374,0.099823,0.101229,0.101556,0.100531,0.096055,0.079934,...,0.019249,0.020975,0.020819,0.018285,0.019199,0.01755,0.017839,0.015458,0.017008,0.029414
Амурская область,0.26719,0.287414,0.257796,0.223534,0.160058,0.18334,0.183039,0.174354,0.155914,0.131604,...,0.054443,0.056027,0.042383,0.041862,0.04633,0.042482,0.04194,0.039525,0.050826,0.084746
Архангельская область,0.248158,0.256805,0.268918,0.261112,0.25321,0.213033,0.224907,0.230127,0.233101,0.191568,...,0.077038,0.086826,0.082684,0.082319,0.083395,0.083439,0.089297,0.086061,0.085974,0.109684
Архангельская область (кроме Ненецкого автономного округа),,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.043888,0.053433,0.048581,0.047677,0.048182,0.047715,0.053432,0.049763,0.049317,0.073459
Астраханская область,0.155246,0.155529,0.173623,0.161064,0.157222,0.137951,0.147981,0.147697,0.120473,0.103703,...,0.027775,0.035698,0.045173,0.042938,0.040963,0.04103,0.052205,0.060518,0.061026,0.075976
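
The helper `normalize_data` is defined elsewhere in the notebook, so its exact behaviour is not shown here. As a minimal sketch, assuming it rescales every year column independently to the range 0 to 1 (the default behaviour of scikit-learn's `MinMaxScaler`), the same transformation can be written in plain pandas. The function name `min_max_scale` and the toy values are hypothetical.

```python
import numpy as np
import pandas as pd


def min_max_scale(df: pd.DataFrame) -> pd.DataFrame:
    """Rescale each column to [0, 1]; missing values stay missing."""
    col_min = df.min()              # per-column minimum, NaN ignored
    col_range = df.max() - col_min  # per-column spread
    # Note: a constant column has a zero range and would become NaN here.
    return (df - col_min) / col_range


# Hypothetical per capita values for three regions.
toy = pd.DataFrame(
    {
        'gross_product_2019': [100.0, 300.0, 200.0],
        'gross_product_2020': [np.nan, 50.0, 150.0],
    },
    index=['region_a', 'region_b', 'region_c'],
)

scaled = min_max_scale(toy)
print(scaled)
```

With this scaling the poorest region in a given year gets 0 and the richest gets 1, so a single very rich region compresses the values of all the others towards 0.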


### 2.11 Regional production statistics

The data source contains the total production of each Russian region in roubles for the years 2005 to 2020. Data from different industries has been aggregated to form a single total value. Columns:
1. `region` - name of the region of the Russian Federation. Data for districts uniting multiple regions has been removed.
2. `production_2005-production_2020` - columns containing the regions' total production in roubles for each of the regions in the corresponding year.

In [304]:
data_production = pd.read_excel('data/regional_production_2005_2016.xlsx', header=1)
data_production['region'] = data_production['region'].str.strip()
#data_production = rename_columns(data_production, 'production_')
data_production = data_production.groupby('region').sum()

data_production_2 = pd.read_excel('data/regional_production_2017_2020.xlsx', header=1)
data_production_2['region'] = data_production_2['region'].str.strip()
#data_production_2 = rename_columns(data_production_2, 'production_')
data_production_2 = data_production_2.groupby('region').sum()

data_production = pd.concat([data_production, data_production_2], axis=1)
data_production.head()

Unnamed: 0_level_0,2005,2006,2007,2008,2009,2010,2011,2012,2013,2014,2015,2016,2017,2018,2019,2020
region,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1
Алтайский край,92619170.0,116484816.7,130286000.0,181655200.0,148133600.0,204947800.0,227310900.0,240582400.0,247418900.0,254337500.0,321526300.0,315781900.0,735976400.0,771192700.0,838317900.0,852443400.0
Амурская область,31814720.0,35861289.8,38750650.0,46798070.0,60866960.0,72102320.0,96691240.0,102412700.0,99246420.0,109393100.0,138602800.0,125936800.0,268844100.0,268760800.0,340026700.0,404576500.0
Архангельская область,122474700.0,131390413.1,165489500.0,198561800.0,228895100.0,282822600.0,306934000.0,330392400.0,483427200.0,384235300.0,444329800.0,462384300.0,1296215000.0,1347627000.0,1264614000.0,1190018000.0
Архангельская область (кроме Ненецкого автономного округа),0.0,0.0,0.0,0.0,0.0,126413700.0,133019100.0,150346000.0,298007200.0,187859100.0,211503700.0,203094200.0,709009800.0,606471100.0,533584900.0,664107000.0
Астраханская область,40922340.0,45566202.3,51948870.0,84692280.0,62969240.0,76683570.0,95723810.0,125355500.0,162312800.0,173012200.0,211627000.0,227798200.0,626569700.0,883766500.0,901514700.0,724597800.0
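
The `groupby('region').sum()` step collapses the per-industry rows into one total per region. A minimal sketch on made-up numbers (the rows and values are hypothetical) shows the aggregation and one side effect worth knowing: by default `sum()` returns 0.0 for a group whose values are all missing, which may be why some regions show 0.0 in the early years above. Passing `min_count=1` keeps such cells as `NaN` instead.

```python
import numpy as np
import pandas as pd

# Two hypothetical industry rows per region; region_b has no data for 2005.
toy = pd.DataFrame({
    'region': ['region_a', 'region_a', 'region_b', 'region_b'],
    2005: [1.0, 2.0, np.nan, np.nan],
    2006: [3.0, 4.0, 5.0, np.nan],
})

# Default behaviour: an all-missing group sums to 0.0.
totals = toy.groupby('region').sum()

# Require at least one non-missing value, otherwise keep NaN.
totals_keep_nan = toy.groupby('region').sum(min_count=1)

print(totals)
print(totals_keep_nan)
```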


In [301]:
data_production = data_production / data_population #* 1000

# Delete the years which have up to 40 % of empty values.
#data_production = drop_empty_years(data_production)
#data_production = rename_columns(data_production, 'production_', 0)
data_production.head()

Unnamed: 0_level_0,2005,2006,2007,2008,2009,2010,2011,2012,2013,2014,...,2013,2014,2015,2016,2017,2018,2019,2020,2021,2022
region,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
Алтайский край,,,,,,,,,,,...,,,,,,,,,,
Амурская область,,,,,,,,,,,...,,,,,,,,,,
Архангельская область,,,,,,,,,,,...,,,,,,,,,,
Архангельская область (кроме Ненецкого автономного округа),,,,,,,,,,,...,,,,,,,,,,
Астраханская область,,,,,,,,,,,...,,,,,,,,,,
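
The division above returns only `NaN`, and the result lists the year labels twice. pandas aligns both operands on index and column labels before dividing, so this pattern typically appears when the labels do not match, for example when one frame stores the years as strings and the other as integers. Whether that is the actual cause here is an assumption; the sketch below reproduces the symptom on made-up frames and shows one way to align the labels first. All names and values are hypothetical.

```python
import pandas as pd

# Year labels stored as strings in one frame ...
production = pd.DataFrame(
    {'2019': [100.0, 300.0], '2020': [120.0, 330.0]},
    index=['region_a', 'region_b'],
)
# ... and as integers in the other.
population = pd.DataFrame(
    {2019: [10.0, 20.0], 2020: [10.0, 22.0], 2021: [11.0, 22.0]},
    index=['region_a', 'region_b'],
)

# No column label matches, so every cell of the result is NaN.
broken = production / population

# Cast the labels to a common type and divide only the shared years.
aligned = production.copy()
aligned.columns = aligned.columns.astype(int)
common_years = aligned.columns.intersection(population.columns)
per_capita = aligned[common_years] / population[common_years]

print(per_capita)
```

Checking `data_production.columns` and `data_population.columns` (and the two indexes) before dividing would confirm or rule out this explanation.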


### 2.12 Retail turnover statistics

The data source contains per capita yearly retail turnover in roubles for each Russian region for the years 2000 to 2021. Columns:
1. `region` - name of the region of the Russian Federation. Data for districts uniting multiple regions has been removed.
2. `retail_turnover_2000-retail_turnover_2021` - columns containing the per capita yearly retail turnover in roubles for each of the regions in the corresponding year.

In [324]:
data_retail_turnover = pd.read_excel('data/retail_turnover_per_capita_2000_2021.xlsx')
data_retail_turnover['region'] = data_retail_turnover['region'].str.strip()
#data_retail_turnover = rename_columns(data_retail_turnover, 'retail_turnover_')
data_retail_turnover = data_retail_turnover.set_index('region')
data_retail_turnover = data_retail_turnover.sort_index()
data_retail_turnover.head()

Unnamed: 0_level_0,2000,2001,2002,2003,2004,2005,2006,2007,2008,2009,...,2012,2013,2014,2015,2016,2017,2018,2019,2020,2021
region,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
Алтайский край,9221,12280,15839,20534,25927,32543,42444,53937,70308,65303,...,105754,118096,128376,134925,137844,143873,150444,159514,153605,174321
Амурская область,12303,14786,17229,20378,26781,32829,40948,49413,64081,71086,...,127200,145301,163781,182491,191523,202038,214688,231113,245233,276635
Архангельская область,11845,17821,21924,26959,33554,41707,50781,61423,79990,87768,...,135625,154320,176491,194266,203019,217241,229576,239516,249101,278284
Архангельская область (кроме Ненецкого автономного округа),0,0,0,0,0,0,0,0,0,0,...,0,154177,176420,194345,202977,217332,229922,240155,250033,280050
Астраханская область,10876,14265,17668,21623,25775,32717,42007,55755,77056,83032,...,131101,147954,162393,170883,164241,163829,170710,179153,174527,196096


In [325]:
data_retail_turnover / data_population

Unnamed: 0_level_0,1999,2000,2001,2002,2003,2004,2005,2006,2007,2008,...,2013,2014,2015,2016,2017,2018,2019,2020,2021,2022
region,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
Алтайский край,,0.003477,0.004650,0.006043,0.007890,0.010081,0.012815,0.016954,0.021810,0.028657,...,0.049232,0.053699,0.056577,0.057996,0.060817,0.064017,0.068378,0.066290,0.075912,
Амурская область,,0.013150,0.016019,0.018904,0.022616,0.030166,0.037561,0.047556,0.058099,0.075899,...,0.177867,0.201881,0.225333,0.237713,0.251996,0.268890,0.291370,0.310404,0.353823,
Архангельская область,,0.008520,0.013016,0.016235,0.020230,0.025506,0.032102,0.039616,0.048492,0.063627,...,0.128355,0.148090,0.164170,0.172918,0.186353,0.198762,0.209345,0.219176,0.246913,
Архангельская область (кроме Ненецкого автономного округа),,,,,,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.132968,0.153574,0.170486,0.179588,0.193733,0.206945,0.218265,0.228879,0.258668,
Астраханская область,,0.010743,0.014134,0.017571,0.021520,0.025619,0.032507,0.041902,0.055683,0.076604,...,0.145934,0.159754,0.167321,0.161238,0.160795,0.167772,0.176668,0.173524,0.196533,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Чеченская Республика,,0.000000,0.000000,0.000000,0.000000,0.007231,0.009038,0.007764,0.011321,0.020248,...,0.058190,0.065588,0.077456,0.077848,0.078505,0.080572,0.082404,0.083826,0.094326,
Чувашская Республика - Чувашия,,0.005585,0.007590,0.009084,0.011437,0.014108,0.017967,0.022844,0.030111,0.042537,...,0.077636,0.085843,0.089259,0.089723,0.093272,0.101111,0.111404,0.116244,0.137556,
Чукотский автономный округ,,0.222729,0.384677,0.596268,0.719894,0.791616,0.799093,0.861446,0.983312,1.548661,...,2.131115,2.033330,2.361199,3.070738,3.688752,3.918923,4.225258,4.316915,4.556525,
Ямало-Ненецкий автономный округ (Тюменская область),,0.052667,0.069691,0.086786,0.104231,0.133234,0.179983,0.230569,0.286727,0.365644,...,0.409323,0.436399,0.454752,0.429195,0.443258,0.476100,0.483779,0.483271,0.536941,


### 2.13 Income situation

#### 2.13.1 Per capita monthly cash income

The data source contains per capita monthly cash income in roubles for each Russian region for the years 2015 to 2020. Columns:
1. `region` - name of the region of the Russian Federation. Data for districts uniting multiple regions has been removed.
2. `monthly_income_2015-monthly_income_2020` - columns containing the per capita monthly cash income in roubles for each of the regions in the corresponding year.

In [139]:
data_monthly_cash_income = pd.read_excel(
    'data/cash_real_income_wages_2015_2020.xlsx', 
    sheet_name='per_capita_cash_income'
)
data_monthly_cash_income = rename_columns(data_monthly_cash_income, 'monthly_income_')
data_monthly_cash_income = data_monthly_cash_income.set_index('region', drop=True)
data_monthly_cash_income = data_monthly_cash_income.sort_index()
data_monthly_cash_income.head()

| region | monthly_income_2015 | monthly_income_2016 | monthly_income_2017 | monthly_income_2018 | monthly_income_2019 | monthly_income_2020 |
| --- | --- | --- | --- | --- | --- | --- |
| Алтайский край | 20860 | 21256 | 22139 | 22829 | 23937 | 23864 |
| Амурская область | 28240 | 27976 | 29213 | 30937 | 33304 | 35499 |
| Архангельская область | 31285 | 31394 | 32310 | 33831 | 35693 | 36779 |
| Архангельская область (кроме Ненецкого автономного округа) | 29716 | 29837 | 30707 | 32054 | 33874 | 34852 |
| Астраханская область | 23832 | 22841 | 22884 | 23670 | 24971 | 25199 |


#### 2.13.2 Real cash income

The data source contains real cash income as a percentage of the previous year's value for each Russian region for the years 2015 to 2020. Columns:
1. `region` - name of the region of the Russian Federation. Data for districts uniting multiple regions has been removed.
2. `real_income_2015-real_income_2020` - columns containing the real cash income in percent compared to the previous year for each of the regions in the corresponding year.

In [140]:
data_real_cash_income = pd.read_excel(
    'data/cash_real_income_wages_2015_2020.xlsx', 
    sheet_name='real_incomes'
)
data_real_cash_income = rename_columns(data_real_cash_income, 'real_income_')
data_real_cash_income = data_real_cash_income.set_index('region', drop=True)
data_real_cash_income = data_real_cash_income.sort_index()
data_real_cash_income.head()

| region | real_income_2015 | real_income_2016 | real_income_2017 | real_income_2018 | real_income_2019 | real_income_2020 |
| --- | --- | --- | --- | --- | --- | --- |
| Алтайский край | 99.1 | 94.7 | 100.0 | 99.7 | 99.6 | 95.5 |
| Амурская область | 96.1 | 92.1 | 101.1 | 102.4 | 101.7 | 100.3 |
| Архангельская область | 95.1 | 92.9 | 98.9 | 102.0 | 100.1 | 98.6 |
| Архангельская область (кроме Ненецкого автономного округа) | 95.1 | 93.0 | 98.7 | 101.7 | 100.2 | 98.4 |
| Астраханская область | 94.0 | 90.1 | 97.1 | 100.6 | 100.7 | 97.1 |
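
Because each value is an index relative to the previous year (100 means no change), the cumulative change over several years is the product of the yearly ratios. As a small sketch, using the Алтайский край row from the table above (the variable names are hypothetical):

```python
import pandas as pd

# Year-over-year real income indices, in percent of the previous year.
real_income = pd.DataFrame(
    {
        'real_income_2015': [99.1],
        'real_income_2016': [94.7],
        'real_income_2017': [100.0],
        'real_income_2018': [99.7],
        'real_income_2019': [99.6],
        'real_income_2020': [95.5],
    },
    index=['Алтайский край'],
)

# Convert percentages to ratios and chain them along the year axis.
# Each cell is then the real income level relative to 2014.
cumulative = (real_income / 100).cumprod(axis=1)

print(cumulative.round(3))
```

The last column comes out at roughly 0.89, meaning that real cash income in this region in 2020 was about 11% below its 2014 level.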


#### 2.13.3 Per capita monthly formal wage

The data source contains per capita monthly formal wage in roubles for each Russian region for the years 2015 to 2020. Columns:
1. `region` - name of the region of the Russian Federation. Data for districts uniting multiple regions has been removed.
2. `monthly_wage_2015-monthly_wage_2020` - columns containing per capita monthly formal wage in roubles for each of the regions in the corresponding year.

In [141]:
data_monthly_formal_wage = pd.read_excel(
    'data/cash_real_income_wages_2015_2020.xlsx', 
    sheet_name='formal_wage_paid'
)
data_monthly_formal_wage = rename_columns(data_monthly_formal_wage, 'monthly_wage_')
data_monthly_formal_wage = data_monthly_formal_wage.set_index('region', drop=True)
data_monthly_formal_wage = data_monthly_formal_wage.sort_index()
data_monthly_formal_wage.head()

| region | monthly_wage_2015 | monthly_wage_2016 | monthly_wage_2017 | monthly_wage_2018 | monthly_wage_2019 | monthly_wage_2020 |
| --- | --- | --- | --- | --- | --- | --- |
| Алтайский край | 20090 | 21202 | 22743 | 25519 | 27962 | 30072 |
| Амурская область | 32902 | 33837 | 37368 | 42315 | 47234 | 52430 |
| Архангельская область | 38300 | 40790 | 42950 | 48307 | 52434 | 55891 |
| Архангельская область (кроме Ненецкого автономного округа) | 35592 | 38118 | 40352 | 45427 | 49435 | 52779 |
| Астраханская область | 25499 | 27493 | 29599 | 33630 | 36093 | 38885 |


#### 2.13.4 Real wage

The data source contains the real wage as a percentage of the previous year's value for each Russian region for the years 2015 to 2020. Columns:
1. `region` - name of the region of the Russian Federation. Data for districts uniting multiple regions has been removed.
2. `real_wage_2015-real_wage_2020` - columns containing the real wage in percent compared to the previous year for each of the regions in the corresponding year.

In [142]:
data_real_wage = pd.read_excel(
    'data/cash_real_income_wages_2015_2020.xlsx', 
    sheet_name='real_pay'
)
data_real_wage = rename_columns(data_real_wage, 'real_wage_')
data_real_wage = data_real_wage.set_index('region', drop=True)
data_real_wage = data_real_wage.sort_index()
data_real_wage.head()

| region | real_wage_2015 | real_wage_2016 | real_wage_2017 | real_wage_2018 | real_wage_2019 | real_wage_2020 |
| --- | --- | --- | --- | --- | --- | --- |
| Алтайский край | 90.0 | 98.4 | 103.6 | 109.3 | 104.9 | 103.8 |
| Амурская область | 88.0 | 96.0 | 107.4 | 110.1 | 106.0 | 105.2 |
| Архангельская область | 92.7 | 99.4 | 102.0 | 110.6 | 103.8 | 102.8 |
| Архангельская область (кроме Ненецкого автономного округа) | 92.4 | 100.0 | 102.5 | 110.7 | 104.0 | 102.9 |
| Астраханская область | 89.9 | 101.4 | 104.5 | 110.7 | 103.0 | 104.5 |
