# Final project: Identification of vulnerable population groups

## 1. Problem statement

According to a December 2021 [inFOM survey](https://www.cbr.ru/Collection/Collection/File/39633/inFOM_21-12.pdf), 27% of Russians have enough money only for food, and another 9% cannot afford a nutritious diet. These people are especially sensitive to prices, and food prices usually grow faster than average inflation. At the same time, Rosstat estimates that food should make up approximately 36% of a Russian's average monthly expenses (another 10% goes to utilities and housing, and 4% to medicines).

Until 2021, the "poverty line" (living below the subsistence minimum) in Russia was determined by the cost of the [minimum food basket](https://base.garant.ru/70306880/). In 2021, the government "untied" the poverty level from the prices of basic products: since then, the subsistence minimum has been calculated as 44.2% of the median income of Russian citizens for the previous year.
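
The 44.2% rule described above reduces to a single multiplication. A minimal sketch, assuming a hypothetical median income (the figure below is illustrative only, not an official statistic; the function name is also an assumption):

```python
# Share of the previous year's median income used since 2021 (from the text above).
SUBSISTENCE_SHARE = 0.442


def subsistence_minimum(median_income_prev_year):
    """Return the subsistence minimum implied by the 44.2% rule."""
    return SUBSISTENCE_SHARE * median_income_prev_year


# Hypothetical median monthly income in roubles, for illustration only.
example_median_income = 27_000
print(round(subsistence_minimum(example_median_income), 2))
```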

You have at your disposal data on income, morbidity, socially vulnerable groups of the Russian population and other economic and demographic data.

Your task as a data scientist:
* cluster the regions of Russia and determine which of them most need assistance for low-income/disadvantaged segments of the population;
* describe the population groups facing poverty;
* determine:
    * whether the number of children, pensioners and other socially vulnerable groups affects the poverty level in the region;
    * whether the level of poverty/social disadvantage is related to production and consumption in the region;
    * what other dependencies can be observed in relation to socially vulnerable segments of the population.

## 2. Getting to know the data

### 2.0 Import dependencies and define helper code

#### 2.0.1 Import dependencies

In [57]:
import pandas as pd

#### 2.0.2 Define helper code

In [None]:
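def rename_columns(df, prefix):
    """Return a copy of df with prefix added to every column name except the first."""
    dict_columns = {col: prefix + str(col) for col in df.columns[1:]}
    return df.rename(columns=dict_columns)


def flatten_table(df, years, prefix='disability'):
    """Turn a long table (one row per region and date) into a wide one (one row per region).

    For each year, the rows whose 'date' starts with that year become a block of
    columns named '<prefix>_<year>_<original column>'.
    """
    result = []
    for year in years:
        # 'date' holds strings that start with the year.
        df_year = df[df['date'].str.startswith(year)]
        df_year = df_year.drop(columns='date')
        # 'region' must be the first remaining column: rename_columns skips it.
        df_year = rename_columns(df_year, f'{prefix}_{year}_')
        df_year = df_year.set_index('region')
        result.append(df_year)
    # Align the yearly blocks on the region index.
    return pd.concat(result, axis=1)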
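
To make the reshaping concrete, here is a small sketch that runs the two helpers on a made-up long-format table. The helpers are repeated in compact form so the snippet runs on its own; the toy regions and values are assumptions, not project data.

```python
import pandas as pd


def rename_columns(df, prefix):
    # Same behaviour as the helper above: prefix every column except the first.
    return df.rename(columns={col: prefix + str(col) for col in df.columns[1:]})


def flatten_table(df, years):
    # Same behaviour as the helper above: one block of columns per year.
    result = []
    for year in years:
        df_year = df[df['date'].str.startswith(year)].drop(columns='date')
        df_year = rename_columns(df_year, f'disability_{year}_').set_index('region')
        result.append(df_year)
    return pd.concat(result, axis=1)


# Made-up long-format table: one row per region and reporting date.
toy = pd.DataFrame({
    'region': ['A', 'B', 'A', 'B'],
    'total': [10, 20, 11, 21],
    'date': ['2017-01-01', '2017-01-01', '2018-01-01', '2018-01-01'],
})

wide = flatten_table(toy, ['2017', '2018'])
print(wide)
```

Note that `region` has to be the first column of the input, because `rename_columns` leaves the first column untouched.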

### 2.1 Population data

The data source contains the population of each Russian region from 1999 to 2022. Columns:
1. region - name of the region of the Russian Federation. Data for districts uniting multiple regions has been removed.
2. population_1999-population_2022 - columns containing the population count for each of the regions in the corresponding year.

In [59]:
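data_population = pd.read_excel('data/population.xlsx', sheet_name='report')
# The loop treats each region as two consecutive rows: the name row, then the values row.
# It copies the values up into the name row and drops the now-redundant values row.
for i in range(data_population.shape[0]//2):
    data_population.iloc[i, 1:] = data_population.iloc[i+1, 1:]
    data_population.drop(index=i+1, inplace=True)
    data_population = data_population.reset_index(drop=True)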

data_population = rename_columns(data_population, 'population_')
data_population.head()

Unnamed: 0,region,population_1999,population_2000,population_2001,population_2002,population_2003,population_2004,population_2005,population_2006,population_2007,...,population_2013,population_2014,population_2015,population_2016,population_2017,population_2018,population_2019,population_2020,population_2021,population_2022
0,Белгородская область,1494868.0,1501699.0,1506976.0,1508137.0,1511899.0,1513860.0,1511662.0,1511715.0,1514153.0,...,1540985.0,1544108.0,1547936.0,1550137.0,1552865.0,1549876.0,1547418.0,1549151.0,1541259.0,1531917.0
1,Брянская область,1437471.0,1423178.0,1407965.0,1391430.0,1375004.0,1360249.0,1344132.0,1327652.0,1312748.0,...,1253666.0,1242599.0,1232940.0,1225741.0,1220530.0,1210982.0,1200187.0,1192491.0,1182682.0,1168771.0
2,Владимирская область,1592184.0,1575507.0,1558052.0,1539179.0,1520057.0,1509571.0,1497598.0,1486453.0,1475861.0,...,1421742.0,1413321.0,1405613.0,1397168.0,1389599.0,1378337.0,1365805.0,1358416.0,1342099.0,1323659.0
3,Воронежская область,2458558.0,2441337.0,2422371.0,2397111.0,2374461.0,2367457.0,2364932.0,2360912.0,2353805.0,...,2330377.0,2328959.0,2331147.0,2333477.0,2335408.0,2333768.0,2327821.0,2324205.0,2305608.0,2287678.0
4,Ивановская область,1210603.0,1194595.0,1178969.0,1161861.0,1144540.0,1131027.0,1116739.0,1101862.0,1089837.0,...,1048961.0,1043130.0,1036909.0,1029838.0,1023170.0,1014646.0,1004180.0,997135.0,987032.0,976918.0
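
The row-merging loop in the cell above can also be written with slicing. This is a sketch under an assumption inferred from that loop: every region occupies exactly two consecutive rows, with the name on the first and the values on the second. The toy frame is made up.

```python
import numpy as np
import pandas as pd

# Made-up sheet: the region name sits on one row, its values on the next row.
raw = pd.DataFrame({
    'region': ['A', np.nan, 'B', np.nan],
    1999: [np.nan, 100.0, np.nan, 200.0],
    2000: [np.nan, 110.0, np.nan, 210.0],
})

names = raw.iloc[0::2, 0].reset_index(drop=True)    # rows 0, 2, ...: region names
values = raw.iloc[1::2, 1:].reset_index(drop=True)  # rows 1, 3, ...: yearly values
tidy = pd.concat([names, values], axis=1)
print(tidy)
```

Unlike the loop, this version does not modify the frame while iterating over it, but it relies on the sheet having an even number of rows.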


### 2.2 Child mortality in rural areas

The data source contains the number of deaths of children under one year of age in rural areas for each Russian region from 1990 to 2021. Columns:
1. region - name of the region of the Russian Federation. Data for districts uniting multiple regions has been removed.
2. child_mortality_rural_1990-child_mortality_rural_2021 - columns containing the child mortality in rural areas for each of the regions in the corresponding year.

In [60]:
data_child_mortality_rural = pd.read_excel('data/child_mortality_rural_1990_2021.xlsx')
data_child_mortality_rural = rename_columns(data_child_mortality_rural, 'child_mortality_rural_')
# The first column is called 'child_mortality' in the source file; rename it to match the other tables.
data_child_mortality_rural = data_child_mortality_rural.rename(columns={'child_mortality': 'region'})
data_child_mortality_rural.head()

Unnamed: 0,region,child_mortality_rural_1990,child_mortality_rural_1991,child_mortality_rural_1992,child_mortality_rural_1993,child_mortality_rural_1994,child_mortality_rural_1995,child_mortality_rural_1996,child_mortality_rural_1997,child_mortality_rural_1998,...,child_mortality_rural_2012,child_mortality_rural_2013,child_mortality_rural_2014,child_mortality_rural_2015,child_mortality_rural_2016,child_mortality_rural_2017,child_mortality_rural_2018,child_mortality_rural_2019,child_mortality_rural_2020,child_mortality_rural_2021
0,Белгородская область,103.0,92.0,75.0,79.0,80.0,72.0,72.0,67.0,61.0,...,43.0,48.0,41.0,42.0,36.0,34.0,33.0,16.0,22.0,20.0
1,Брянская область,124.0,109.0,83.0,121.0,99.0,104.0,96.0,67.0,75.0,...,46.0,47.0,39.0,44.0,36.0,31.0,12.0,11.0,12.0,13.0
2,Владимирская область,80.0,58.0,60.0,62.0,46.0,50.0,47.0,38.0,39.0,...,30.0,31.0,23.0,31.0,28.0,17.0,22.0,15.0,10.0,14.0
3,Воронежская область,138.0,179.0,156.0,149.0,154.0,137.0,133.0,132.0,125.0,...,32.0,33.0,33.0,25.0,24.0,18.0,22.0,7.0,12.0,10.0
4,Ивановская область,74.0,44.0,40.0,57.0,50.0,41.0,31.0,39.0,27.0,...,13.0,19.0,10.0,16.0,15.0,4.0,7.0,9.0,3.0,4.0


### 2.3 Child mortality in urban areas

The data source contains the number of deaths of children under one year of age in urban areas for each Russian region from 1990 to 2021. Columns:
1. region - name of the region of the Russian Federation. Data for districts uniting multiple regions has been removed.
2. child_mortality_urban_1990-child_mortality_urban_2021 - columns containing the child mortality in urban areas for each of the regions in the corresponding year.

In [61]:
data_child_mortality_urban = pd.read_excel('data/child_mortality_urban_1990_2021.xlsx')
data_child_mortality_urban = rename_columns(data_child_mortality_urban, 'child_mortality_urban_')
data_child_mortality_urban.head()

Unnamed: 0,region,child_mortality_urban_1990,child_mortality_urban_1991,child_mortality_urban_1992,child_mortality_urban_1993,child_mortality_urban_1994,child_mortality_urban_1995,child_mortality_urban_1996,child_mortality_urban_1997,child_mortality_urban_1998,...,child_mortality_urban_2012,child_mortality_urban_2013,child_mortality_urban_2014,child_mortality_urban_2015,child_mortality_urban_2016,child_mortality_urban_2017,child_mortality_urban_2018,child_mortality_urban_2019,child_mortality_urban_2020,child_mortality_urban_2021
0,Белгородская область,209.0,198.0,165.0,165.0,153.0,131.0,102.0,100.0,99.0,...,84.0,68.0,62.0,68.0,72.0,43.0,40.0,23.0,25.0,34.0
1,Брянская область,198.0,195.0,200.0,176.0,157.0,125.0,116.0,135.0,107.0,...,81.0,77.0,87.0,67.0,67.0,65.0,33.0,26.0,29.0,14.0
2,Владимирская область,221.0,209.0,179.0,148.0,165.0,146.0,114.0,123.0,130.0,...,97.0,86.0,87.0,76.0,72.0,59.0,45.0,43.0,51.0,52.0
3,Воронежская область,261.0,262.0,262.0,213.0,194.0,184.0,140.0,143.0,177.0,...,134.0,153.0,112.0,101.0,96.0,90.0,77.0,74.0,58.0,73.0
4,Ивановская область,181.0,187.0,174.0,168.0,142.0,141.0,128.0,127.0,110.0,...,58.0,67.0,58.0,48.0,51.0,31.0,28.0,31.0,25.0,20.0
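
The rural and urban tables share the same regions, so they can be joined to obtain totals. A minimal sketch, assuming both tables carry the region name in a column called `region`; the values are copied from the two previews, and the `child_mortality_total_*` column names are made up for illustration.

```python
import pandas as pd

# Values for 1990 and 2021 copied from the two previews above.
rural = pd.DataFrame({
    'region': ['Белгородская область', 'Брянская область'],
    'child_mortality_rural_1990': [103.0, 124.0],
    'child_mortality_rural_2021': [20.0, 13.0],
})
urban = pd.DataFrame({
    'region': ['Белгородская область', 'Брянская область'],
    'child_mortality_urban_1990': [209.0, 198.0],
    'child_mortality_urban_2021': [34.0, 14.0],
})

# Join on the region name; validate guards against duplicated regions.
merged = rural.merge(urban, on='region', how='inner', validate='one_to_one')
for year in (1990, 2021):
    merged[f'child_mortality_total_{year}'] = (
        merged[f'child_mortality_rural_{year}']
        + merged[f'child_mortality_urban_{year}']
    )

print(merged[['region', 'child_mortality_total_1990', 'child_mortality_total_2021']])
```

These are absolute counts, so comparing regions would still require relating them to births or population.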


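### 2.4 Disability statistics

The data source contains the number of persons with disabilities by age group for each Russian region from 2017 to 2022. Only the records dated 1 January of each year are kept, and the table is reshaped to one row per region.

In [None]:
data_disability = pd.read_csv('data/disabled_total_by_age_2017_2022.csv', sep=';')
# Keep only the records dated 1 January of each year.
data_disability = data_disability[data_disability['date'].str.endswith('01-01')]

# Reshape to one row per region and keep the result for later use.
data_disability = flatten_table(data_disability, ['2017', '2018', '2019', '2020', '2021', '2022'])
data_disability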

region,disability_2017_total,disability_2017_18_30,disability_2017_31_40,disability_2017_41_50,disability_2017_51_60,disability_2017_60_,disability_2018_total,disability_2018_18_30,disability_2018_31_40,disability_2018_41_50,...,disability_2021_31_40,disability_2021_41_50,disability_2021_51_60,disability_2021_60_,disability_2022_total,disability_2022_18_30,disability_2022_31_40,disability_2022_41_50,disability_2022_51_60,disability_2022_60_
Алтайский край,182975.0,8930.0,13527.0,13836.0,30477.0,116205.0,178491.0,8238.0,13542.0,14296.0,...,13358.0,16039.0,23274.0,107520.0,162551.0,7308.0,13058.0,16684.0,21956.0,103545.0
Амурская область,69063.0,3687.0,5260.0,6266.0,12679.0,41171.0,67042.0,3428.0,5297.0,6263.0,...,5143.0,6338.0,9427.0,38004.0,59472.0,3295.0,5041.0,6331.0,8689.0,36116.0
Архангельская область,88792.0,3488.0,5159.0,6747.0,13834.0,59564.0,87258.0,3276.0,5140.0,6671.0,...,5154.0,7067.0,11368.0,56370.0,79706.0,2977.0,5063.0,7242.0,10727.0,53697.0
Астраханская область,44995.0,3159.0,3974.0,4749.0,9136.0,23977.0,44111.0,2969.0,3940.0,4730.0,...,4018.0,4980.0,7708.0,23379.0,42096.0,2686.0,3997.0,5143.0,7487.0,22783.0
Белгородская область,223030.0,6318.0,10383.0,16596.0,37444.0,152289.0,214048.0,5808.0,10127.0,15706.0,...,9474.0,14917.0,28836.0,133344.0,182223.0,4868.0,9121.0,14645.0,26875.0,126714.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Чеченская Республика,141656.0,21645.0,31750.0,35620.0,32415.0,20226.0,146928.0,21887.0,33169.0,36467.0,...,40314.0,42005.0,36364.0,28982.0,181184.0,27995.0,42029.0,43462.0,36874.0,30824.0
Чувашская Республика,80549.0,3985.0,5122.0,7547.0,15226.0,48669.0,80106.0,3827.0,5117.0,7445.0,...,5166.0,7101.0,12712.0,46496.0,73021.0,3511.0,5225.0,7112.0,12186.0,44987.0
Чукотский автономный округ,1556.0,172.0,227.0,271.0,397.0,489.0,1549.0,166.0,223.0,262.0,...,218.0,283.0,383.0,560.0,1659.0,178.0,220.0,299.0,369.0,593.0
Ямало-Ненецкий автономный округ,13484.0,1313.0,1324.0,1673.0,3697.0,5477.0,13449.0,1285.0,1401.0,1665.0,...,1485.0,1785.0,2981.0,6217.0,13989.0,1339.0,1542.0,1813.0,2932.0,6363.0
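
Absolute counts are hard to compare across regions of different size, so one option is to look at the age structure instead. A sketch using the 2017 figures of two regions from the preview, assuming the `60_` columns denote the 60-and-over group; the `share` column name is made up for illustration.

```python
import pandas as pd

# 2017 figures copied from the preview above.
disability = pd.DataFrame(
    {
        'disability_2017_total': [182975.0, 141656.0],
        'disability_2017_60_': [116205.0, 20226.0],
    },
    index=pd.Index(['Алтайский край', 'Чеченская Республика'], name='region'),
)

# Share of the oldest age group among all registered persons with disabilities.
disability['disability_2017_share_60_'] = (
    disability['disability_2017_60_'] / disability['disability_2017_total']
)
print(disability['disability_2017_share_60_'].round(3))
```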


### 2.5 Income situation

#### 2.5.1 Per capita monthly cash income

The data source contains per capita monthly cash income in roubles for each Russian region from 2015 to 2020. Columns:
1. region - name of the region of the Russian Federation. Data for districts uniting multiple regions has been removed.
2. monthly_income_2015-monthly_income_2020 - columns containing the per capita monthly cash income in roubles for each of the regions in the corresponding year.

In [65]:
data_monthly_cash_income = pd.read_excel(
    'data/cash_real_income_wages_2015_2020.xlsx', 
    sheet_name='per_capita_cash_income'
)
data_monthly_cash_income = rename_columns(data_monthly_cash_income, 'monthly_income_')
data_monthly_cash_income.head()

Unnamed: 0,region,monthly_income_2015,monthly_income_2016,monthly_income_2017,monthly_income_2018,monthly_income_2019,monthly_income_2020
0,Белгородская область,28043.0,29799.0,30342.0,30778.0,32352.0,32841.0
1,Брянская область,23428.0,24006.0,25107.0,26585.0,28371.0,28596.0
2,Владимирская область,22712.0,22365.0,23554.0,23539.0,25358.0,25922.0
3,Воронежская область,29366.0,29284.0,29498.0,30289.0,32022.0,32078.0
4,Ивановская область,22297.0,23676.0,24860.0,24503.0,25794.0,26277.0


#### 2.5.2 Real cash income

The data source contains real cash income for each Russian region from 2015 to 2020, expressed as a percentage of the previous year's value. Columns:
1. region - name of the region of the Russian Federation. Data for districts uniting multiple regions has been removed.
2. real_income_2015-real_income_2020 - columns containing the real cash income in percent compared to the previous year for each of the regions in the corresponding year.

In [64]:
data_real_cash_income = pd.read_excel(
    'data/cash_real_income_wages_2015_2020.xlsx', 
    sheet_name='real_incomes'
)
data_real_cash_income = rename_columns(data_real_cash_income, 'real_income_')
data_real_cash_income.head()

Unnamed: 0,region,real_income_2015,real_income_2016,real_income_2017,real_income_2018,real_income_2019,real_income_2020
0,Белгородская область,99.3,100.8,99.1,98.7,100.6,98.1
1,Брянская область,97.0,95.0,99.4,102.1,100.5,96.3
2,Владимирская область,99.5,92.2,100.9,96.4,101.9,98.0
3,Воронежская область,101.1,93.6,97.4,100.0,101.1,95.4
4,Ивановская область,95.5,98.5,100.4,94.4,99.5,97.2
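
Because each value is a percentage of the previous year, the indices cannot be averaged or compared directly across years; the change over the whole period is the product of the yearly ratios. A minimal sketch using the Белгородская область row from the preview (the variable names are assumptions):

```python
import pandas as pd

# Year-on-year real income indices for Белгородская область, copied from the preview above.
real_income = pd.Series(
    [99.3, 100.8, 99.1, 98.7, 100.6, 98.1],
    index=range(2015, 2021),
    name='Белгородская область',
)

# Each value is "percent of the previous year", so the cumulative change
# relative to 2014 is the running product of the yearly ratios.
cumulative = (real_income / 100).cumprod()
print(cumulative.round(4))
```

A final value below 1 means that real income at the end of the period is lower than in the base year 2014.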


#### 2.5.3 Per capita monthly formal wage

The data source contains the per capita monthly formal wage in roubles for each Russian region from 2015 to 2020. Columns:
1. region - name of the region of the Russian Federation. Data for districts uniting multiple regions has been removed.
2. monthly_wage_2015-monthly_wage_2020 - columns containing per capita monthly formal wage in roubles for each of the regions in the corresponding year.

In [67]:
data_monthly_formal_wage = pd.read_excel(
    'data/cash_real_income_wages_2015_2020.xlsx', 
    sheet_name='formal_wage_paid'
)
data_monthly_formal_wage = rename_columns(data_monthly_formal_wage, 'monthly_wage_')
data_monthly_formal_wage.head()

Unnamed: 0,region,monthly_wage_2015,monthly_wage_2016,monthly_wage_2017,monthly_wage_2018,monthly_wage_2019,monthly_wage_2020
0,Белгородская область,25456.0,27091.0,29066.0,31852.0,34615.0,37442.0
1,Брянская область,21679.0,22923.0,24743.0,27251.0,29853.0,31946.0
2,Владимирская область,23877.0,25135.0,26975.0,30460.0,33076.0,35240.0
3,Воронежская область,24906.0,26335.0,28007.0,31207.0,33690.0,36317.0
4,Ивановская область,21161.0,22144.0,23470.0,25729.0,27553.0,29083.0
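
Wage levels can be turned into a growth rate over the period, which is easier to compare across regions. A sketch with two rows from the preview; the `wage_growth_2015_2020` column name is made up, and the result is nominal growth, not adjusted for inflation.

```python
import pandas as pd

# 2015 and 2020 wages copied from the preview above.
wages = pd.DataFrame({
    'region': ['Белгородская область', 'Ивановская область'],
    'monthly_wage_2015': [25456.0, 21161.0],
    'monthly_wage_2020': [37442.0, 29083.0],
})

# Nominal growth over the period as a fraction of the 2015 level.
wages['wage_growth_2015_2020'] = (
    wages['monthly_wage_2020'] / wages['monthly_wage_2015'] - 1
)
print(wages[['region', 'wage_growth_2015_2020']].round(3))
```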


#### 2.5.4 Real wage

The data source contains real wages for each Russian region from 2015 to 2020, expressed as a percentage of the previous year's value. Columns:
1. region - name of the region of the Russian Federation. Data for districts uniting multiple regions has been removed.
2. real_wage_2015-real_wage_2020 - columns containing the real wage in percent compared to the previous year for each of the regions in the corresponding year.

In [68]:
data_real_wage = pd.read_excel(
    'data/cash_real_income_wages_2015_2020.xlsx', 
    sheet_name='real_pay'
)
data_real_wage = rename_columns(data_real_wage, 'real_wage_')
data_real_wage.head()

Unnamed: 0,region,real_wage_2015,real_wage_2016,real_wage_2017,real_wage_2018,real_wage_2019,real_wage_2020
0,Белгородская область,93.2,100.8,104.5,106.8,104.0,104.8
1,Брянская область,89.0,98.5,103.2,107.0,104.0,102.9
2,Владимирская область,91.0,99.2,103.6,109.9,103.5,103.0
3,Воронежская область,89.1,99.1,102.8,108.7,103.4,103.2
4,Ивановская область,87.9,97.6,102.1,106.0,102.1,101.6
