# Final project: Identification of vulnerable population groups

## 1. Problem statement

According to a December 2021 [inFOM survey](https://www.cbr.ru/Collection/Collection/File/39633/inFOM_21-12.pdf), 27% of Russians have enough money only for food, and another 9% cannot afford a nutritious diet. These people are especially sensitive to prices, and food prices usually grow faster than the average rate of inflation. At the same time, Rosstat believes that food should account for approximately 36% of a Russian's average monthly expenses (another 10% goes to utilities and housing, and 4% to medicines).

Until 2021, the "poverty line" (living below the subsistence minimum) in Russia was determined by the cost of the [minimum food basket](https://base.garant.ru/70306880/). In 2021, the government "untied" the poverty level from the prices of basic products: since then, the subsistence minimum has been calculated as 44.2% of the median income of Russian citizens for the previous year.
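The arithmetic behind the new rule can be sketched as follows. The median income used here is a made-up figure chosen only to show the calculation, and the function name is hypothetical.

```python
# Share of the previous year's median income that defines the subsistence minimum.
SUBSISTENCE_SHARE = 0.442


def subsistence_minimum(median_income_prev_year: float) -> float:
    """Return the subsistence minimum implied by the previous year's median income."""
    return SUBSISTENCE_SHARE * median_income_prev_year


# Hypothetical median income in roubles, not an official statistic.
example_median = 30_000
example_minimum = subsistence_minimum(example_median)
print(round(example_minimum, 2))
```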

You have at your disposal data on income, morbidity, socially vulnerable groups of the Russian population and other economic and demographic data.

Your task as a data scientist:
* cluster the regions of Russia and determine which of them are in the greatest need of assistance to low-income/disadvantaged segments of the population;
* describe the population groups facing poverty;
* determine:
    * whether the number of children, pensioners and other socially vulnerable groups affects the poverty level in the region;
    * whether the level of poverty/social disadvantage is related to production and consumption in the region;
    * what other dependencies can be observed in relation to socially vulnerable segments of the population.
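A possible first look at the questions above is a correlation table between the poverty level and the candidate factors. The sketch below uses invented numbers and hypothetical column names; it only shows the mechanics, and a correlation by itself does not establish that one factor affects another.

```python
import pandas as pd

# Hypothetical regional indicators; the values are illustrative, not real data.
toy = pd.DataFrame(
    {
        'poverty_percent': [10.0, 15.0, 20.0, 25.0],
        'children_share': [0.18, 0.20, 0.22, 0.24],
        'pensioners_share': [0.30, 0.26, 0.27, 0.22],
    },
    index=['region_a', 'region_b', 'region_c', 'region_d'],
)

# Pearson correlation of every indicator with the poverty level.
correlations = toy.corr()['poverty_percent'].drop('poverty_percent')
print(correlations)
```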

## 2. Getting to know the data

To reduce redundancy, rows for federal districts that unite multiple regions have been removed from the initial data sources. The following former regions have also been removed, since they contain a significant number of empty values:
* Agin-Buryat Okrug (Zabaykalsky Krai)
* Komi-Permyak Okrug, part of Perm Krai
* Koryak Okrug, part of Kamchatka Krai
* Crimean Federal District
* Taymyr (Dolgano-Nenets) Autonomous Okrug (Krasnoyarsk Krai)
* Ust-Orda Buryat Okrug
* Evenk Autonomous Okrug (Krasnoyarsk Krai)

### 2.0 Import dependencies and define helper code

#### 2.0.1 Import dependencies

In [178]:
import pandas as pd

#### 2.0.2 Define helper code

In [179]:
def rename_columns(df, prefix):
    """Prefix every column name except the first one (the region column)."""
    dict_columns = {col: prefix + str(col) for col in df.columns[1:]}
    return df.rename(columns=dict_columns)
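A short usage sketch of the helper on a toy table (the values are invented). It assumes the first column holds the region name and the remaining columns are years, which may arrive from Excel as integers.

```python
import pandas as pd


def rename_columns(df, prefix):
    # Same logic as the helper above: prefix every column except the first one.
    renamed = {col: prefix + str(col) for col in df.columns[1:]}
    return df.rename(columns=renamed)


# Toy table with integer year columns.
toy = pd.DataFrame({'region': ['A', 'B'], 1999: [100, 200], 2000: [110, 210]})
renamed = rename_columns(toy, 'population_')
print(list(renamed.columns))
```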

### 2.1 Population data

The data source contains the population of each Russian region from 1999 to 2022. Columns:
1. `region` - name of the region of the Russian Federation.
2. `population_1999` to `population_2022` - population count of each region in the corresponding year.

In [None]:
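data_population = pd.read_excel('data/population.xlsx', sheet_name='report')

# The loop assumes each region spans two consecutive rows: the region name,
# then its yearly values. Move the values up and drop the second row.
for i in range(data_population.shape[0]//2):
    data_population.iloc[i, 1:] = data_population.iloc[i+1, 1:]
    data_population.drop(index=i+1, inplace=True)
    data_population = data_population.reset_index(drop=True)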

data_population = rename_columns(data_population, 'population_')
data_population = data_population.set_index('region')
data_population = data_population.sort_index()
data_population.to_csv('population.csv')
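The row-pairing step can also be written with positional slicing instead of a loop. The sketch below is an alternative, not the notebook's method, and it runs on a toy frame whose layout (a name row followed by a values row) is an assumption inferred from the loop above.

```python
import pandas as pd

# Toy layout: a row with the region name followed by a row with its yearly values.
raw = pd.DataFrame(
    {
        'region': ['Region A', None, 'Region B', None],
        1999: [None, 100.0, None, 200.0],
        2000: [None, 110.0, None, 210.0],
    }
)

names = raw.iloc[0::2, 0].reset_index(drop=True)    # even rows: region names
values = raw.iloc[1::2, 1:].reset_index(drop=True)  # odd rows: yearly values
tidy = pd.concat([names, values], axis=1)
print(tidy)
```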

### 2.2 Child mortality in rural areas

The data source contains child mortality in rural areas (deaths in the first year of life, in persons) for each Russian region from 1990 to 2021. Columns:
1. `region` - name of the region of the Russian Federation. Data for districts uniting multiple regions has been removed.
2. `child_mortality_rural_1990` to `child_mortality_rural_2021` - child mortality in rural areas of each region in the corresponding year.

In [None]:
data_child_mortality_rural = pd.read_excel('data/child_mortality_rural_1990_2021.xlsx')
data_child_mortality_rural = rename_columns(data_child_mortality_rural, 'child_mortality_rural_')
data_child_mortality_rural['region'] = data_child_mortality_rural['region'].str.strip()
data_child_mortality_rural = data_child_mortality_rural.set_index('region', drop=True)
data_child_mortality_rural = data_child_mortality_rural.sort_index()
data_child_mortality_rural.head()

In [187]:
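# Deaths per inhabitant. DataFrame division aligns on column labels, so strip
# the prefixes first; otherwise no columns match and the result is all NaN.
rural = data_child_mortality_rural.rename(columns=lambda c: c.replace('child_mortality_rural_', ''))
population = data_population.rename(columns=lambda c: c.replace('population_', ''))
df = rural / population
df.to_csv('text.csv')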
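Arithmetic between two DataFrames aligns on both row and column labels, so a per-capita ratio only produces numbers where the labels match. A toy illustration with made-up values:

```python
import pandas as pd

deaths = pd.DataFrame({'child_mortality_rural_2000': [5, 8]}, index=['A', 'B'])
people = pd.DataFrame({'population_2000': [1000, 2000]}, index=['A', 'B'])

# The column labels differ, so no column matches and every cell is NaN.
misaligned = deaths / people

# Strip the prefixes so both tables share the plain year label.
deaths_by_year = deaths.rename(columns=lambda c: c.replace('child_mortality_rural_', ''))
people_by_year = people.rename(columns=lambda c: c.replace('population_', ''))
rate = deaths_by_year / people_by_year
print(rate)
```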

### 2.3 Child mortality in urban areas

The data source contains child mortality in urban areas (deaths in the first year of life, in persons) for each Russian region from 1990 to 2021. Columns:
1. `region` - name of the region of the Russian Federation. Data for districts uniting multiple regions has been removed.
2. `child_mortality_urban_1990` to `child_mortality_urban_2021` - child mortality in urban areas of each region in the corresponding year.

In [None]:
data_child_mortality_urban = pd.read_excel('data/child_mortality_urban_1990_2021.xlsx')
data_child_mortality_urban = rename_columns(data_child_mortality_urban, 'child_mortality_urban_')
data_child_mortality_urban['region'] = data_child_mortality_urban['region'].str.strip()
data_child_mortality_urban = data_child_mortality_urban.set_index('region', drop=True)
data_child_mortality_urban = data_child_mortality_urban.sort_index()
data_child_mortality_urban.head()

### 2.4 Disability statistics

The data source is loaded and then transformed to contain disability data for each Russian region from 2017 to 2022. Columns:
1. `region` - name of the region of the Russian Federation. Data for districts uniting multiple regions has been removed.
2. `disability_{year}_total`, `disability_{year}_18_30`, `disability_{year}_31_40`, `disability_{year}_41_50`, `disability_{year}_60_` - disability counts for the whole population and for individual age groups of each region in the corresponding year.

In [None]:
data_disability = pd.read_csv('data/disabled_total_by_age_2017_2022.csv', sep=';')
# Keep only the records whose date ends with '01-01' (one snapshot per year).
data_disability = data_disability[data_disability['date'].str.endswith('01-01')]


def flatten_table(df, years):
    """Turn the long table into one row per region with year-prefixed columns."""
    result = []
    for year in years:
        df_year = df[df['date'].str.startswith(year)]
        df_year = df_year.drop(columns='date')
        df_year = rename_columns(df_year, f'disability_{year}_')
        df_year = df_year.set_index('region')
        result.append(df_year)
    return pd.concat(result, axis=1)


data_disability = flatten_table(data_disability, ['2017', '2018', '2019', '2020', '2021', '2022'])
data_disability.head()
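The long-to-wide reshaping above can be traced on a toy table. This sketch uses pandas' built-in `add_prefix` in place of the notebook's `rename_columns` helper, and the `total` column name and all values are assumptions made for illustration.

```python
import pandas as pd

# Toy long-format table: one row per region and date.
long_df = pd.DataFrame(
    {
        'region': ['A', 'B', 'A', 'B'],
        'date': ['2017-01-01', '2017-01-01', '2018-01-01', '2018-01-01'],
        'total': [10, 20, 11, 21],
    }
)

frames = []
for year in ['2017', '2018']:
    # Select one year, index by region, and prefix the value columns with the year.
    part = long_df[long_df['date'].str.startswith(year)].drop(columns='date')
    part = part.set_index('region').add_prefix(f'disability_{year}_')
    frames.append(part)

# Place the per-year blocks side by side, aligned on the region index.
wide = pd.concat(frames, axis=1)
print(wide)
```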

### 2.5 Poverty statistics

The data source is loaded and then transformed to contain poverty data for each Russian region from 1992 to 2020. Columns:
1. `region` - name of the region of the Russian Federation. Data for districts uniting multiple regions has been removed.
2. `poverty_1992` to `poverty_2020` - percentage of the whole population living in poverty in each region in the corresponding year.

In [None]:
data_poverty = pd.read_csv(
    'data/poverty_percent_by_regions_1992_2020.csv', 
    sep=';', 
    skipinitialspace=True
)


def flatten_table(df, years):
    """Poverty variant of the flattening helper: selects rows by the 'year' column."""
    result = []
    for year in years:
        df_year = df[df['year'] == year]
        df_year = df_year.drop(columns='year')
        df_year = rename_columns(df_year, f'poverty_{year}_')
        df_year = df_year.set_index('region')
        result.append(df_year)
    return pd.concat(result, axis=1)


data_poverty = flatten_table(data_poverty, list(range(1992, 2021)))
data_poverty = data_poverty.sort_index()
data_poverty.head()
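When the long table has a single value column, the same reshaping can be expressed with `DataFrame.pivot`. The sketch below is an alternative under that assumption; `poverty_percent` is a hypothetical column name and the numbers are invented.

```python
import pandas as pd

# Toy long-format table: one row per region and year.
long_df = pd.DataFrame(
    {
        'region': ['A', 'A', 'B', 'B'],
        'year': [1992, 1993, 1992, 1993],
        'poverty_percent': [30.0, 28.5, 40.0, 41.2],
    }
)

# One row per region, one column per year.
wide = long_df.pivot(index='region', columns='year', values='poverty_percent')
wide.columns = [f'poverty_{year}' for year in wide.columns]
wide = wide.sort_index()
print(wide)
```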

### 2.6 Poverty statistics in demographic context

In [None]:
data_poverty_dem = pd.read_excel('data/poverty_socdem_2017.xlsx')
data_poverty_dem.head()

### 2.X Income situation

#### 2.X.1 Per capita monthly cash income

The data source contains per capita monthly cash income in roubles for each Russian region from 2015 to 2020. Columns:
1. `region` - name of the region of the Russian Federation. Data for districts uniting multiple regions has been removed.
2. `monthly_income_2015` to `monthly_income_2020` - per capita monthly cash income in roubles of each region in the corresponding year.

In [None]:
data_monthly_cash_income = pd.read_excel(
    'data/cash_real_income_wages_2015_2020.xlsx', 
    sheet_name='per_capita_cash_income'
)
data_monthly_cash_income = rename_columns(data_monthly_cash_income, 'monthly_income_')
data_monthly_cash_income = data_monthly_cash_income.set_index('region', drop=True)
data_monthly_cash_income = data_monthly_cash_income.sort_index()
data_monthly_cash_income.head()

#### 2.X.2 Real cash income

The data source contains real cash income as a percentage of the previous year's value for each Russian region from 2015 to 2020. Columns:
1. `region` - name of the region of the Russian Federation. Data for districts uniting multiple regions has been removed.
2. `real_income_2015` to `real_income_2020` - real cash income of each region as a percentage of the previous year, in the corresponding year.

In [None]:
data_real_cash_income = pd.read_excel(
    'data/cash_real_income_wages_2015_2020.xlsx', 
    sheet_name='real_incomes'
)
data_real_cash_income = rename_columns(data_real_cash_income, 'real_income_')
data_real_cash_income = data_real_cash_income.set_index('region', drop=True)
data_real_cash_income = data_real_cash_income.sort_index()
data_real_cash_income.head()
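Because each column is an index relative to the previous year, the change over several years is obtained by chaining the yearly indices rather than adding them. A minimal sketch, assuming that 100 means "no change" and using invented values:

```python
import pandas as pd

# Hypothetical year-over-year indices in percent (100 means no change).
toy = pd.DataFrame(
    {'real_income_2016': [95.0, 102.0], 'real_income_2017': [100.0, 98.0]},
    index=['A', 'B'],
)

# Chain the yearly indices: cumulative index = 100 * product(index_t / 100).
cumulative = 100 * (toy / 100).prod(axis=1)
print(cumulative)
```

In this toy case region B gains 2% and then loses 2%, which leaves it slightly below the starting level rather than exactly at it.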

#### 2.X.3 Per capita monthly formal wage

The data source contains the per capita monthly formal wage in roubles for each Russian region from 2015 to 2020. Columns:
1. `region` - name of the region of the Russian Federation. Data for districts uniting multiple regions has been removed.
2. `monthly_wage_2015` to `monthly_wage_2020` - per capita monthly formal wage in roubles of each region in the corresponding year.

In [None]:
data_monthly_formal_wage = pd.read_excel(
    'data/cash_real_income_wages_2015_2020.xlsx', 
    sheet_name='formal_wage_paid'
)
data_monthly_formal_wage = rename_columns(data_monthly_formal_wage, 'monthly_wage_')
data_monthly_formal_wage = data_monthly_formal_wage.set_index('region', drop=True)
data_monthly_formal_wage = data_monthly_formal_wage.sort_index()
data_monthly_formal_wage.head()

#### 2.X.4 Real wage

The data source contains the real wage as a percentage of the previous year's value for each Russian region from 2015 to 2020. Columns:
1. `region` - name of the region of the Russian Federation. Data for districts uniting multiple regions has been removed.
2. `real_wage_2015` to `real_wage_2020` - real wage of each region as a percentage of the previous year, in the corresponding year.

In [None]:
data_real_wage = pd.read_excel(
    'data/cash_real_income_wages_2015_2020.xlsx', 
    sheet_name='real_pay'
)
data_real_wage = rename_columns(data_real_wage, 'real_wage_')
data_real_wage = data_real_wage.set_index('region', drop=True)
data_real_wage = data_real_wage.sort_index()
data_real_wage.head()
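Since every prepared table is indexed by `region`, the tables can later be placed side by side in one frame. The following is a sketch on toy stand-ins with invented values, not the notebook's final merge step.

```python
import pandas as pd

# Toy stand-ins for two prepared tables indexed by region.
income = pd.DataFrame({'monthly_income_2020': [30000, 25000]}, index=['A', 'B'])
wage = pd.DataFrame({'monthly_wage_2020': [45000, 38000]}, index=['A', 'C'])

# concat with axis=1 performs an outer join on the index, so regions that are
# missing from one table are kept and filled with NaN.
combined = pd.concat([income, wage], axis=1).sort_index()
print(combined)
```

Inspecting the NaN cells of such a frame is one way to spot region names that differ between sources, for example because of stray whitespace.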