# Average daily number of persons reported missing on Eastern Orthodox Church Holidays in Ukraine, 2010-2019

This visualization project addresses the following question: is the average daily number of persons reported missing on Eastern Orthodox Church holidays different from the average daily number of persons reported missing on non-holidays in Ukraine?

# Data sources
Missing persons JSON:
https://data.gov.ua/dataset/8851831d-b5ce-4ca8-8685-eafbc3f57eca/resource/6cfff17e-84ac-4141-b0fd-89abb68e9f31/download/mvswantedbezvesti_1.json

Wikipedia article about Church holidays in Ukraine:
https://uk.wikipedia.org/wiki/%D0%A6%D0%B5%D1%80%D0%BA%D0%BE%D0%B2%D0%BD%D1%96_%D1%81%D0%B2%D1%8F%D1%82%D0%B0


In [None]:
# Data manipulation
import pandas as pd
import numpy as np
# Visualization
import seaborn as sns
import matplotlib.pyplot as plt
# HTTP library
import requests

In [None]:
# Notebook configuration
%config InlineBackend.figure_format = 'svg'
%matplotlib inline

Open data published by the Ukrainian Government on persons reported missing was used for this study. We limited the analysis to a 10-year period (2010-2019) and discarded cases after 2019, so the data includes only persons who have been reported missing for more than one year.

In [None]:
# Loading JSON data on missing persons
url = 'https://data.gov.ua/dataset/8851831d-b5ce-4ca8-8685-eafbc3f57eca/resource/6cfff17e-84ac-4141-b0fd-89abb68e9f31/download/mvswantedbezvesti_1.json'
r = requests.get(url)
# Stop early if the download failed
r.raise_for_status()

In [None]:
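# Creating DataFrame
# Wrapping the text in StringIO, since newer pandas versions deprecate passing a raw JSON string
from io import StringIO
df = pd.read_json(StringIO(r.text))
df.head()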

In [None]:
# Checking for invalid dates
df.sort_values('LOST_DATE')

In [None]:
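# One entry has an invalid date
# Dropping entries whose date cannot be parsed, instead of relying on a hard-coded row position
df = df[pd.to_datetime(df['LOST_DATE'], errors='coerce').notna()]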

In [None]:
# Converting format
df['LOST_DATE'] = pd.to_datetime(df['LOST_DATE'])

In [None]:
# Period of interest
df = df[((df['LOST_DATE'].dt.year > 2009) & (df['LOST_DATE'].dt.year < 2020))]

In [None]:
# Getting data
df = df[['LOST_DATE', 'ID']]
df = df.rename({'LOST_DATE': 'DATE', 'ID': 'CASES'}, axis=1)
# Counting number of cases per day
df = df.groupby('DATE').count()
df

In [None]:
# Generating full range of period
all_dates = np.arange('2010-01-01', '2020-01-01', dtype='datetime64[D]')
all_dates = pd.DataFrame(all_dates)
all_dates['Z'] = np.zeros(len(all_dates))
all_dates = all_dates.rename({0: 'DATE', 'Z': 'CASES'}, axis=1)
all_dates = all_dates.set_index('DATE')

In [None]:
# Adding number of cases to range, zeros for non cases
days = all_dates + df
days = days.fillna(0)
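
The zero-filling step above can also be written with `reindex`, which avoids building a helper frame of zeros. The following is a minimal sketch on toy data, not the notebook's actual dataset; the names `counts`, `full_range`, and `toy_days` are illustrative only.

```python
import pandas as pd

# Toy daily counts with gaps (hypothetical values, not the real dataset)
counts = pd.DataFrame(
    {'CASES': [2, 1, 3]},
    index=pd.to_datetime(['2010-01-01', '2010-01-03', '2010-01-06']),
)
counts.index.name = 'DATE'

# Full calendar for the toy period, inclusive of both ends
full_range = pd.date_range('2010-01-01', '2010-01-07', freq='D', name='DATE')

# Days absent from counts get CASES = 0
toy_days = counts.reindex(full_range, fill_value=0)
print(toy_days['CASES'].tolist())
```

Because `fill_value=0` is applied during reindexing, the counts stay integers rather than being converted to floats by an intermediate `NaN`.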

In [None]:
# Average value for all days
days_mean = days.mean().squeeze()

The Wikipedia article was scraped for data about religious holidays in Ukraine. We selected 14 major holidays: 9 with fixed dates and 5 whose dates move each year. Using the article, we generated the dates of all holidays in the period.

In [None]:
# Getting data for holidays
# Part 1 - Holidays with fixed date
# Article about religious holidays in Ukraine
wiki_page = pd.read_html('https://uk.wikipedia.org/wiki/%D0%A6%D0%B5%D1%80%D0%BA%D0%BE%D0%B2%D0%BD%D1%96_%D1%81%D0%B2%D1%8F%D1%82%D0%B0')

# Text from article
raw = '''Різдво Христове (07.01),
Водохреща — Йордан (19.01),
Стрітення (15.02),
Благовіщення (07.04),
Преображення — Спаса (19.08),
Успіння Пресвятої Богородиці (28.08),
Різдво Пресвятої Богородиці (21.09),
Воздвиження Чесного Хреста (27.09),
Введення Богородиці у храм (04.12).
'''
# Translation
translated = '''Christmas Day (07.01),
Baptism of the Lord (19.01),
Candlemas (15.02),
Annunciation (07.04),
Feast of the Transfiguration (19.08),
Dormition of the Mother of God (28.08),
Nativity of Mary (21.09),
Feast of the Cross (27.09),
Presentation of Mary (04.12).
'''

In [None]:
# Getting titles of the holidays and yearly dates
fixed_lst = translated[:-2].split(',')
fixed_lst = [
    (f.split('(')[0].strip(), f.split('(')[1][:-1]) for f in fixed_lst
    ]

In [None]:
# Mean values of holidays
hol_means = [
    days[
        (days.index.month == int(h[1][-2:])) & (days.index.day == int(h[1][:2]))
        ].mean().squeeze() for h in fixed_lst
        ]
# Titles of holidays
hols = [h[0] for h in fixed_lst]

In [None]:
# Part 2 - Holidays with movable dates
# Loading table with movable holidays from the article
move = wiki_page[1]

# Titles in Ukrainian
mhols = move.columns
mhols = mhols[2:-1]

# Translation
translated = [
    'Quinquagesima', 'Triumphal entry into Jerusalem',
    'Easter', 'Ascension of Jesus', 'Pentecost'
    ]

In [None]:
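# Generating holiday dates for the period
# Dates are assumed to be written day-first (dd.mm), as in the fixed-date list above
move['Рік'] = '.' + move['Рік'].astype(str)
for h in mhols:
    move[h] = pd.to_datetime(move[h].astype(str) + move['Рік'], dayfirst=True)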

# Mean values of movable holidays
mhol_means = [
    days[days.index.isin(move[h].values)].mean().squeeze() for h in mhols
    ]
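
The movable dates taken from the table can be cross-checked independently. The sketch below is not the method used in this notebook: it computes Orthodox Easter with the Meeus Julian algorithm and derives three other holidays from fixed day offsets. The function name `orthodox_easter` and the `offsets` mapping are illustrative, and the pre-Lent holiday is omitted because its offset depends on which observance the table column refers to.

```python
from datetime import date, timedelta


def orthodox_easter(year):
    """Gregorian date of Orthodox Easter (Meeus Julian algorithm).

    The 13-day Julian-to-Gregorian shift used here is valid for 1900-2099 only.
    """
    a = year % 4
    b = year % 7
    c = year % 19
    d = (19 * c + 15) % 30
    e = (2 * a + 4 * b - d + 34) % 7
    month = (d + e + 114) // 31
    day = (d + e + 114) % 31 + 1
    # The formula yields a Julian-calendar date; shift it to the Gregorian calendar
    return date(year, month, day) + timedelta(days=13)


# Offsets in days relative to Easter Sunday
offsets = {
    'Triumphal entry into Jerusalem': -7,
    'Easter': 0,
    'Ascension of Jesus': 39,
    'Pentecost': 49,
}

# Dates of each holiday for every year of the study period
computed = {
    name: [orthodox_easter(y) + timedelta(days=off) for y in range(2010, 2020)]
    for name, off in offsets.items()
}
print(computed['Easter'][0], computed['Easter'][-1])
```

Comparing `computed` with the parsed columns of `move` would flag any date that was misread from the table.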

In [None]:
# Combining mean values and titles of fixed and movable holidays
hol_means.extend(mhol_means)
hols.extend(translated)

# Sorting data for plotting
hol_means, hols = (
    list(t) for t in zip(*sorted(zip(hol_means, hols), reverse=True))
    )

In [None]:
# Getting total number of cases for all holidays
hol_sum = [
    days[
        (days.index.month == int(h[1][-2:])) & (days.index.day == int(h[1][:2]))
        ].sum().squeeze() for h in fixed_lst
        ]
mhol_sum = [
    days[days.index.isin(move[h].values)].sum().squeeze() for h in mhols
    ]
hs = sum(hol_sum) + sum(mhol_sum)
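# Number of non-holidays: each of the 14 holidays is assumed to occur
# once per year over the 10 years, with no two holidays on the same day
wd = len(days) - len(hols) * 10
# Total number of cases for all period
ds = days.sum().squeeze()
# Total number of cases for non-holidays
ws = ds - hs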
# Mean value for non-holidays
mw = ws / wd
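
The calculation above relies on every holiday falling exactly 10 times in the period and on no two holidays sharing a day. A boolean mask over the calendar avoids both assumptions. This is a minimal sketch on toy data; `toy_days`, `holiday_dates`, and `is_holiday` are illustrative names, and the values are made up.

```python
import pandas as pd

# Toy daily counts for ten days (hypothetical values)
toy_days = pd.DataFrame(
    {'CASES': [4, 1, 2, 6, 1, 3, 3, 5, 2, 7]},
    index=pd.date_range('2010-01-01', periods=10, freq='D', name='DATE'),
)

# Hypothetical holiday dates; the duplicate mimics a movable holiday
# landing on the same day as a fixed one
holiday_dates = pd.to_datetime(['2010-01-02', '2010-01-07', '2010-01-07'])

# True for every calendar day that is a holiday, counted once
is_holiday = toy_days.index.isin(holiday_dates)

holiday_mean = toy_days.loc[is_holiday, 'CASES'].mean()
non_holiday_mean = toy_days.loc[~is_holiday, 'CASES'].mean()
print(holiday_mean, non_holiday_mean)
```

With a mask, a day that hosts two holidays is counted once in both the holiday total and the day count.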

A horizontal bar plot was used, with one bar per holiday. We sorted the bars in descending order to help the reader see which holidays have the highest and lowest average numbers of missing persons.

Two vertical lines mark the average for holidays and the average for non-holidays, so the reader can see the difference directly.

In [None]:
# Plotting
# Seaborn style, etc. (set before creating the figure so that it applies to the axes)
sns.set_style('white')
sns.set_color_codes('pastel')
sns.set_context('notebook')

fig, ax = plt.subplots(figsize=(16, 9))

# Horizontal barplot
sns.barplot(y=hols, x=hol_means, color='lightsteelblue', orient='h', ax=ax)
sns.despine(left=True, bottom=True)

# Line to mark mean for holidays
hml = plt.axvline(np.mean(hol_means), color='grey')
# Line to mark mean for non-holidays
wml = plt.axvline(mw, color='r')

# Labels, legend, etc.
plt.title(
    'Average daily number of persons reported missing on Eastern Orthodox Church Holidays in Ukraine, 2010-2019',
    fontsize=15, pad=15
    )
plt.xlabel('Missing Persons')
plt.ylabel('Holidays')
ax.legend(
    [hml, wml],
    [
        f'{np.mean(hol_means):.2f} - Average daily number of persons reported missing on Church Holidays',
        f'{mw:.2f} - Average daily number of persons reported missing on non-holidays'
        ],
    frameon=False
    )

# Saving picture
plt.savefig("missing.png", bbox_inches='tight', dpi=300)
# Showing plot
plt.show()

The average daily number of persons reported missing on Eastern Orthodox Church Holidays is lower than the average daily number of persons reported missing on non-holidays in Ukraine.

Addressing Alberto Cairo’s principles of good data visualization:

Truthfulness - Dates for movable holidays were generated for each year, so every holiday is matched to the correct days in the period.

Beauty - Calm colors were used because this visualization could be emotionally difficult for some readers.

Functionality - Data is presented without extra information on the different dates of movable holidays, thus decreasing distraction from the main topic.

Insightfulness - The two lines marking the averages show the answer to our question at a glance.