This is part of a coding/visualization exercise for candidacy for the Data Scientist position at the Saudi Commission for Health Specialties.
<br>
Created by: Abdulrahman Alzahrani

-----------------------------------------------------------------------------------------------------------------------

### Question
A friend once told me some in the Muslim community think people tend to die more (often) in the month of Shaban. As a data scientist, what do you think?

### Answer
To start answering this question, a null hypothesis (${ H }_{ 0 }$) and an alternative hypothesis (${ H }_{ 1 }$) will be set up.

${ H }_{ 0 }$ : Deaths are _equally likely_ in each month of the year.

${ H }_{ 1 }$ : People tend to die more _(often) in the month of Shaban_.

In [1]:
# install the Python package 'ummalqura', which converts Gregorian dates to Hijri
# https://pypi.org/project/ummalqura/
#!pip install ummalqura

In [2]:
# import all the needed modules
import pandas as pd
import numpy as np
import scipy.stats as stats
import seaborn as sns
import matplotlib.pyplot as plt
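# Imputer was removed from sklearn.preprocessing in scikit-learn 0.22;
# SimpleImputer (default strategy: mean) is its replacement
from sklearn.impute import SimpleImputer as Imputer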
from ummalqura.hijri_date import HijriDate as h
from datetime import date

### Madinah Data: 
Going through the data provided by the Madinah municipality, I found that the data before 1420 Hijri are incomplete: many months have null values for the death records. I therefore chose to use the data from 1420 Hijri through 1440 Hijri. The data for 1440 Hijri are missing the last three months, since we are currently in the 10th month of 1440 Hijri, so I propose applying an [Imputer](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.Imputer.html) to replace the missing values.
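
As a rough illustration of what mean imputation does, the sketch below fills each missing entry with the mean of the observed values in the same column. It uses only the standard library. The helper `impute_column_mean` and the three-value column are hypothetical, and the notebook itself relies on scikit-learn rather than on this function.

```python
from statistics import mean

def impute_column_mean(column):
    """Return a copy of `column` with None entries replaced by the mean of the observed values."""
    observed = [value for value in column if value is not None]
    fill_value = mean(observed)
    return [fill_value if value is None else value for value in column]

# hypothetical column: two observed monthly counts and one missing month
month_10 = [307.0, 335.0, None]
print(impute_column_mean(month_10))
```

Because the fill value is the column mean, imputation leaves each column's mean unchanged but slightly shrinks its spread.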

In [3]:
# load the Madinah municipality data requested by SCFHS
madinah_df = pd.read_excel('madinah_data.xlsx')

In [4]:
# data exploration for the Madinah municipality data
madinah_df.head(2)

Unnamed: 0,year,month_1,month_2,month_3,month_4,month_5,month_6,month_7,month_8,month_9,month_10,month_11,month_12
0,1420,304,230,248,243,256,231,242,241,339,307.0,330.0,454.0
1,1421,294,231,238,272,282,228,239,252,314,335.0,345.0,412.0


In [5]:
madinah_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21 entries, 0 to 20
Data columns (total 13 columns):
year        21 non-null int64
month_1     21 non-null int64
month_2     21 non-null int64
month_3     21 non-null int64
month_4     21 non-null int64
month_5     21 non-null int64
month_6     21 non-null int64
month_7     21 non-null int64
month_8     21 non-null int64
month_9     21 non-null int64
month_10    20 non-null float64
month_11    20 non-null float64
month_12    20 non-null float64
dtypes: float64(3), int64(10)
memory usage: 2.2 KB


In [6]:
# check for missing values before applying an Imputer
madinah_df.isnull().sum().sum()

3

In [7]:
# apply an Imputer with its default parameters, which fills missing values with the mean of each column
imputed_madinah_df = pd.DataFrame(Imputer().fit_transform(madinah_df), columns=madinah_df.columns)



In [8]:
# check for missing values after applying an Imputer
imputed_madinah_df.isnull().sum().sum()

0

In [9]:
# check the mean and std of each column
imputed_madinah_df.describe()

Unnamed: 0,year,month_1,month_2,month_3,month_4,month_5,month_6,month_7,month_8,month_9,month_10,month_11,month_12
count,21.0,21.0,21.0,21.0,21.0,21.0,21.0,21.0,21.0,21.0,21.0,21.0,21.0
mean,1430.0,402.52381,328.904762,365.142857,400.904762,380.666667,371.571429,356.904762,366.52381,411.52381,373.35,402.05,445.2
std,6.204837,69.429546,80.41387,129.982032,146.445179,122.533397,108.173736,112.274175,98.530513,79.807029,75.207895,50.783339,38.057325
min,1420.0,294.0,175.0,191.0,243.0,255.0,228.0,238.0,241.0,285.0,259.0,325.0,338.0
25%,1425.0,353.0,255.0,266.0,280.0,284.0,286.0,263.0,293.0,339.0,318.0,357.0,427.0
50%,1430.0,390.0,330.0,304.0,353.0,306.0,341.0,300.0,323.0,412.0,367.0,404.0,445.2
75%,1435.0,471.0,411.0,533.0,528.0,495.0,474.0,477.0,454.0,482.0,411.0,426.0,478.0
max,1440.0,495.0,429.0,573.0,684.0,584.0,562.0,551.0,599.0,524.0,543.0,501.0,500.0


### Test the hypothesis that the mean number of deaths in the month of Shaban is greater than the mean number of deaths in the other months of the year.

1. Define hypotheses
2. Set alpha = 0.05
3. Calculate test statistic
4. Find the p-value
5. Interpret results
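
The five steps can be sketched with the standard library alone. This is an illustration under stated assumptions, not the notebook's method: the function `pooled_t_statistic` and the two small samples are hypothetical, and the critical value 2.132 (the one-sided 5% point of Student's t with 4 degrees of freedom) stands in for the p-value of step 4. The statistic itself is the equal-variance form that `scipy.stats.ttest_ind` computes by default.

```python
from math import sqrt
from statistics import mean, variance

def pooled_t_statistic(sample_a, sample_b):
    """Equal-variance (pooled) two-sample t statistic for mean(sample_a) - mean(sample_b)."""
    n_a, n_b = len(sample_a), len(sample_b)
    pooled_var = ((n_a - 1) * variance(sample_a) + (n_b - 1) * variance(sample_b)) / (n_a + n_b - 2)
    return (mean(sample_a) - mean(sample_b)) / sqrt(pooled_var * (1 / n_a + 1 / n_b))

# Step 1: H_0: mu_a <= mu_b, H_A: mu_a > mu_b
# Step 2: significance level
alpha = 0.05

# hypothetical samples, three observations each (4 degrees of freedom)
sample_a = [12, 14, 16]
sample_b = [10, 12, 14]

# Step 3: test statistic
t_stat = pooled_t_statistic(sample_a, sample_b)

# Step 4: one-sided critical value for alpha = 0.05 and 4 degrees of freedom
# (used here in place of a p-value)
critical_value = 2.132

# Step 5: reject H_0 only if the statistic exceeds the critical value
reject_null = t_stat > critical_value
print(round(t_stat, 4), reject_null)
```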

In [10]:
## Step 1: Define hypotheses.
### H_0: mu_ShabanDeath <= mu_OtherMonthsDeath
### H_A: mu_ShabanDeath > mu_OtherMonthsDeath

In [11]:
## Step 2: Set alpha = 0.05
alpha = 0.05

In [12]:
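# Shaban is the 8th month of the Hijri calendar
shaban = imputed_madinah_df['month_8']
# every month column, month_8 included, so Shaban is also compared with itself (t = 0, p = 1)
other_months = imputed_madinah_df.drop('year',axis=1)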

In [13]:
t_test_result = stats.ttest_ind(shaban, other_months)

In [14]:
## Step 3: Calculate test statistic.
t_test_result.statistic

array([-1.36866895,  1.35550139,  0.03879879, -0.8926218 , -0.41219137,
       -0.15808466,  0.29509063,  0.        , -1.62634969, -0.25236492,
       -1.46869514, -3.41339551])

In [15]:
## Step 4: Find p-value.
t_test_result.pvalue

array([0.17874084, 0.18286262, 0.96924381, 0.37739855, 0.68239993,
       0.87518548, 0.76945093, 1.        , 0.11172899, 0.80205133,
       0.14973649, 0.00148179])

In [16]:
## SciPy always returns a signed test statistic, and the p-value here is two-tailed. So, for a one-tailed greater-than test,
## reject the null hypothesis when
## p/2 < alpha and t > 0
## https://stackoverflow.com/questions/15984221/how-to-perform-two-sample-one-tailed-t-test-with-numpy-scipy
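
The decision rule in the comment above can be written as a small helper. This is a minimal sketch: the function name `reject_greater_than` is hypothetical, the first two calls reuse the (t, p) pairs reported above for month_2 and month_12, and the third pair is made up to show a rejection. Newer SciPy releases (1.6 and later) also accept `alternative='greater'` in `stats.ttest_ind`, which returns the one-tailed p-value directly.

```python
def reject_greater_than(t_statistic, two_tailed_p, alpha=0.05):
    """One-tailed 'greater than' decision from a signed t statistic and a two-tailed p-value."""
    return t_statistic > 0 and two_tailed_p / 2 < alpha

# month_2: positive statistic, but p/2 is above alpha
print(reject_greater_than(1.35550139, 0.18286262))
# month_12: small p-value, but the statistic has the wrong sign
print(reject_greater_than(-3.41339551, 0.00148179))
# hypothetical pair: positive statistic and p/2 below alpha
print(reject_greater_than(2.5, 0.02))
```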

In [17]:
## Step 5: Interpret results
for i in range(len(t_test_result.statistic)):
    if (t_test_result.statistic[i] > 0) and (t_test_result.pvalue[i]/2 < alpha):
        print("Reject the null hypothesis", other_months.columns[i])
    else:
        print("Fail to reject the null hypothesis", other_months.columns[i])

Fail to reject the null hypothesis month_1
Fail to reject the null hypothesis month_2
Fail to reject the null hypothesis month_3
Fail to reject the null hypothesis month_4
Fail to reject the null hypothesis month_5
Fail to reject the null hypothesis month_6
Fail to reject the null hypothesis month_7
Fail to reject the null hypothesis month_8
Fail to reject the null hypothesis month_9
Fail to reject the null hypothesis month_10
Fail to reject the null hypothesis month_11
Fail to reject the null hypothesis month_12


### UK Data: 
I chose to test the null hypothesis further with different data, provided by the UK [Office for National Statistics](https://www.ons.gov.uk/). The dataset is entitled [Deaths by date of death and local authority, 2010 to 2014 occurrences](https://www.ons.gov.uk/peoplepopulationandcommunity/birthsdeathsandmarriages/deaths/adhocs/005798deathsbydateofdeathandlocalauthority2010to2014occurrences).

In [18]:
# load in the UK Data for 2010
uk_2010_df = pd.read_excel('dailydeathsbylocalauthority20102014.xls', '2010')

In [19]:
# data exploration for the UK data of 2010
uk_2010_df.head(28)

Unnamed: 0,"Number of daily deaths by local authority, England and Wales, deaths occurred in 2010 1,2,3",Unnamed: 1,Unnamed: 2,Unnamed: 3,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,Unnamed: 9,...,Unnamed: 23,Unnamed: 24,Unnamed: 25,Unnamed: 26,Unnamed: 27,Unnamed: 28,Unnamed: 29,Unnamed: 30,Unnamed: 31,Unnamed: 32
0,,,,,,,,,,,...,,,,,,,,,,
1,Local Authority,Month,1.0,2.0,3.0,4.0,5.0,6.0,7.0,8.0,...,22.0,23.0,24.0,25.0,26.0,27.0,28.0,29.0,30.0,31.0
2,Adur,January,3.0,0.0,0.0,1.0,0.0,4.0,1.0,2.0,...,1.0,1.0,4.0,6.0,5.0,1.0,2.0,3.0,5.0,2.0
3,,February,1.0,4.0,3.0,2.0,0.0,1.0,0.0,2.0,...,5.0,3.0,0.0,3.0,2.0,1.0,2.0,0.0,0.0,0.0
4,,March,3.0,2.0,2.0,3.0,1.0,4.0,1.0,2.0,...,3.0,1.0,1.0,0.0,4.0,1.0,2.0,2.0,2.0,5.0
5,,April,2.0,3.0,2.0,1.0,4.0,0.0,6.0,2.0,...,2.0,1.0,1.0,2.0,4.0,2.0,2.0,3.0,1.0,0.0
6,,May,0.0,1.0,2.0,2.0,1.0,1.0,5.0,0.0,...,2.0,1.0,1.0,2.0,0.0,3.0,1.0,2.0,0.0,0.0
7,,June,1.0,0.0,0.0,3.0,2.0,3.0,0.0,1.0,...,1.0,3.0,2.0,0.0,2.0,0.0,3.0,1.0,1.0,0.0
8,,July,0.0,1.0,1.0,0.0,1.0,2.0,4.0,2.0,...,0.0,0.0,3.0,2.0,2.0,3.0,3.0,1.0,1.0,2.0
9,,August,1.0,1.0,1.0,0.0,0.0,1.0,2.0,0.0,...,2.0,2.0,1.0,0.0,0.0,2.0,3.0,2.0,2.0,1.0


In [20]:
def clean_year_data(uncleaned_df):
    # note: the drops below use inplace=True, so the DataFrame passed in is modified
    
    # delete unwanted rows and columns
    uncleaned_df.drop(uncleaned_df.columns[0], axis=1, inplace=True)
    uncleaned_df.drop(0, axis=0, inplace=True)
    uncleaned_df.reset_index(drop=True,inplace=True)
    
    # delete the unwanted rows that used to hold the name of the local authority
    for i in range(13,len(uncleaned_df),13):
        if i > 4495:
            break
        uncleaned_df.drop(i, axis=0, inplace=True)
        
    # add all the local authorities into one matrix holding the whole-year sums (rows: months, columns: days)
    df = uncleaned_df.iloc[1:13,1:].values
    for i in range(13,len(uncleaned_df),12):
        x = uncleaned_df.iloc[i:i+12,1:].values
        df = df + x
        
    # create a DataFrame for the values
    cleaned_df = pd.DataFrame(df)
    return cleaned_df

In [21]:
# cleaning the data of the year 2010 using the function clean_year_data and inspecting the results
cleaned_2010 = clean_year_data(uk_2010_df)
cleaned_2010

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,21,22,23,24,25,26,27,28,29,30
0,1630.0,1680.0,1521.0,1597.0,1719.0,1833.0,1687.0,1707.0,1680.0,1596.0,...,1568.0,1549.0,1457.0,1469.0,1575.0,1533.0,1536.0,1523.0,1459.0,1412.0
1,1485.0,1552.0,1461.0,1562.0,1532.0,1436.0,1347.0,1416.0,1415.0,1425.0,...,1454.0,1419.0,1471.0,1503.0,1481.0,1497.0,1335.0,0.0,0.0,0.0
2,1337.0,1403.0,1340.0,1363.0,1358.0,1403.0,1312.0,1438.0,1384.0,1410.0,...,1370.0,1342.0,1362.0,1372.0,1341.0,1398.0,1273.0,1336.0,1361.0,1312.0
3,1313.0,1393.0,1340.0,1287.0,1329.0,1401.0,1312.0,1316.0,1342.0,1337.0,...,1320.0,1382.0,1340.0,1320.0,1355.0,1284.0,1291.0,1328.0,1280.0,0.0
4,1203.0,1269.0,1270.0,1204.0,1348.0,1278.0,1268.0,1243.0,1300.0,1297.0,...,1315.0,1313.0,1385.0,1189.0,1215.0,1190.0,1250.0,1287.0,1238.0,1214.0
5,1296.0,1255.0,1300.0,1342.0,1248.0,1289.0,1188.0,1229.0,1165.0,1209.0,...,1276.0,1317.0,1234.0,1298.0,1317.0,1243.0,1318.0,1373.0,1258.0,0.0
6,1294.0,1190.0,1251.0,1117.0,1203.0,1186.0,1234.0,1206.0,1186.0,1216.0,...,1131.0,1132.0,1197.0,1176.0,1228.0,1167.0,1161.0,1154.0,1192.0,1168.0
7,1137.0,1169.0,1205.0,1173.0,1189.0,1141.0,1179.0,1158.0,1197.0,1229.0,...,1216.0,1190.0,1167.0,1175.0,1177.0,1222.0,1164.0,1205.0,1161.0,1198.0
8,1247.0,1249.0,1243.0,1247.0,1218.0,1172.0,1245.0,1253.0,1228.0,1215.0,...,1364.0,1311.0,1194.0,1131.0,1176.0,1380.0,1356.0,1320.0,1199.0,0.0
9,1274.0,1282.0,1344.0,1271.0,1342.0,1290.0,1251.0,1336.0,1278.0,1274.0,...,1453.0,1391.0,1234.0,1256.0,1352.0,1397.0,1381.0,1383.0,1318.0,1410.0


In [22]:
# load in the UK Data for the year of 2011 and cleaning the data using the function clean_year_data
cleaned_2011 = clean_year_data(pd.read_excel('dailydeathsbylocalauthority20102014.xls', '2011'))
# load in the UK Data for the year of 2012 and cleaning the data using the function clean_year_data
cleaned_2012 = clean_year_data(pd.read_excel('dailydeathsbylocalauthority20102014.xls', '2012'))
# load in the UK Data for the year of 2013 and cleaning the data using the function clean_year_data
cleaned_2013 = clean_year_data(pd.read_excel('dailydeathsbylocalauthority20102014.xls', '2013'))
# load in the UK Data for the year of 2014 and cleaning the data using the function clean_year_data
cleaned_2014 = clean_year_data(pd.read_excel('dailydeathsbylocalauthority20102014.xls', '2014'))

#### Note:
Because of the time constraint, I ran the experiment below, which suggests that the same month numbering can be used for Gregorian and Hijri months. This is not strictly true, and it could lead to wrong results.
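
The mismatch between the two calendars can be quantified with simple arithmetic. The sketch below assumes a mean lunar month of about 29.53 days and a mean Gregorian year of 365.2425 days. These are averages, not the Umm al-Qura rules that the `ummalqura` package implements, so the figures are approximate.

```python
MEAN_LUNAR_MONTH_DAYS = 29.530588      # mean synodic month (approximation)
MEAN_GREGORIAN_YEAR_DAYS = 365.2425    # mean Gregorian year

# a Hijri year has 12 lunar months
hijri_year_days = 12 * MEAN_LUNAR_MONTH_DAYS
# how much earlier each Hijri month starts per Gregorian year
drift_per_year = MEAN_GREGORIAN_YEAR_DAYS - hijri_year_days
# Gregorian years needed for a Hijri month to move through all seasons once
years_per_full_cycle = MEAN_GREGORIAN_YEAR_DAYS / drift_per_year

print(f"Hijri year: {hijri_year_days:.2f} days")
print(f"Drift per Gregorian year: {drift_per_year:.2f} days")
print(f"Gregorian years for a full cycle: {years_per_full_cycle:.1f}")
```

A given Hijri month, Shaban included, therefore starts roughly 11 days earlier each Gregorian year, so a fixed Gregorian month such as August overlaps Shaban only in some years.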

In [23]:
# add the year and month columns (new_2010 refers to the same DataFrame as cleaned_2010, not a copy)
new_2010 = cleaned_2010
new_2010['year'] = 2010
new_2010['month'] = 0

In [24]:
# for each Gregorian month of 2010, record the Hijri year and month of its first day
months = list(range(1,13))
for month in months:
    cleaned_2010.at[int(month-1), 'year'] = h(2010, month, 1, gr=True).year
    cleaned_2010.at[int(month-1), 'month'] = h(2010, month, 1, gr=True).month

In [25]:
new_2010

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,23,24,25,26,27,28,29,30,year,month
0,1630.0,1680.0,1521.0,1597.0,1719.0,1833.0,1687.0,1707.0,1680.0,1596.0,...,1457.0,1469.0,1575.0,1533.0,1536.0,1523.0,1459.0,1412.0,1431,1
1,1485.0,1552.0,1461.0,1562.0,1532.0,1436.0,1347.0,1416.0,1415.0,1425.0,...,1471.0,1503.0,1481.0,1497.0,1335.0,0.0,0.0,0.0,1431,2
2,1337.0,1403.0,1340.0,1363.0,1358.0,1403.0,1312.0,1438.0,1384.0,1410.0,...,1362.0,1372.0,1341.0,1398.0,1273.0,1336.0,1361.0,1312.0,1431,3
3,1313.0,1393.0,1340.0,1287.0,1329.0,1401.0,1312.0,1316.0,1342.0,1337.0,...,1340.0,1320.0,1355.0,1284.0,1291.0,1328.0,1280.0,0.0,1431,4
4,1203.0,1269.0,1270.0,1204.0,1348.0,1278.0,1268.0,1243.0,1300.0,1297.0,...,1385.0,1189.0,1215.0,1190.0,1250.0,1287.0,1238.0,1214.0,1431,5
5,1296.0,1255.0,1300.0,1342.0,1248.0,1289.0,1188.0,1229.0,1165.0,1209.0,...,1234.0,1298.0,1317.0,1243.0,1318.0,1373.0,1258.0,0.0,1431,6
6,1294.0,1190.0,1251.0,1117.0,1203.0,1186.0,1234.0,1206.0,1186.0,1216.0,...,1197.0,1176.0,1228.0,1167.0,1161.0,1154.0,1192.0,1168.0,1431,7
7,1137.0,1169.0,1205.0,1173.0,1189.0,1141.0,1179.0,1158.0,1197.0,1229.0,...,1167.0,1175.0,1177.0,1222.0,1164.0,1205.0,1161.0,1198.0,1431,8
8,1247.0,1249.0,1243.0,1247.0,1218.0,1172.0,1245.0,1253.0,1228.0,1215.0,...,1194.0,1131.0,1176.0,1380.0,1356.0,1320.0,1199.0,0.0,1431,9
9,1274.0,1282.0,1344.0,1271.0,1342.0,1290.0,1251.0,1336.0,1278.0,1274.0,...,1234.0,1256.0,1352.0,1397.0,1381.0,1383.0,1318.0,1410.0,1431,10


In [26]:
cleaned_2010.drop(['year','month'], axis=1,inplace=True)

In [27]:
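## Test the hypothesis for the year of 2010, treating Gregorian month 8 (index 7, August) as Shaban
## Note: the first argument is a single monthly total; one value has no sample variance,
## so the statistic is NaN and the check below always fails to reject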
t_test_result = stats.ttest_ind(cleaned_2010.sum(axis=1)[7], cleaned_2010.sum(axis=1))
if (t_test_result.statistic > 0) and (t_test_result.pvalue/2 < alpha):
    print("Reject the null hypothesis")
else:
    print("Fail to reject the null hypothesis")

Fail to reject the null hypothesis




In [28]:
## Test the hypothesis for the year of 2011 (index 7 is August; a single value as the first sample gives a NaN statistic)
t_test_result = stats.ttest_ind(cleaned_2011.sum(axis=1)[7], cleaned_2011.sum(axis=1))
if (t_test_result.statistic > 0) and (t_test_result.pvalue/2 < alpha):
    print("Reject the null hypothesis")
else:
    print("Fail to reject the null hypothesis")

Fail to reject the null hypothesis


In [29]:
## Test the hypothesis for the year of 2012 (index 7 is August; a single value as the first sample gives a NaN statistic)
t_test_result = stats.ttest_ind(cleaned_2012.sum(axis=1)[7], cleaned_2012.sum(axis=1))
if (t_test_result.statistic > 0) and (t_test_result.pvalue/2 < alpha):
    print("Reject the null hypothesis")
else:
    print("Fail to reject the null hypothesis")

Fail to reject the null hypothesis


In [30]:
## Test the hypothesis for the year of 2013 (index 7 is August; a single value as the first sample gives a NaN statistic)
t_test_result = stats.ttest_ind(cleaned_2013.sum(axis=1)[7], cleaned_2013.sum(axis=1))
if (t_test_result.statistic > 0) and (t_test_result.pvalue/2 < alpha):
    print("Reject the null hypothesis")
else:
    print("Fail to reject the null hypothesis")

Fail to reject the null hypothesis


In [31]:
## Test the hypothesis for the year of 2014 (index 7 is August; a single value as the first sample gives a NaN statistic)
t_test_result = stats.ttest_ind(cleaned_2014.sum(axis=1)[7], cleaned_2014.sum(axis=1))
if (t_test_result.statistic > 0) and (t_test_result.pvalue/2 < alpha):
    print("Reject the null hypothesis")
else:
    print("Fail to reject the null hypothesis")

Fail to reject the null hypothesis
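
Each per-year test above passes a single monthly total as the first sample. A sample of one has no defined sample variance, so the usual two-sample t statistic cannot be computed from it. One possible alternative, sketched here as an assumption rather than as the notebook's method, compares the single value with the remaining months using the spread of those months only: t = (x - mean) / (s * sqrt(1 + 1/n)), with n - 1 degrees of freedom. The function name `single_value_t_statistic` and the monthly totals are hypothetical.

```python
from math import sqrt
from statistics import StatisticsError, mean, stdev, variance

def single_value_t_statistic(value, reference):
    """t statistic for one observation against a reference sample (df = len(reference) - 1)."""
    n = len(reference)
    return (value - mean(reference)) / (stdev(reference) * sqrt(1 + 1 / n))

# the sample variance of a single observation is undefined
try:
    variance([1200.0])
    single_value_variance_defined = True
except StatisticsError:
    single_value_variance_defined = False

# hypothetical monthly totals: one candidate month against the other months
other_month_totals = [100, 102, 98, 101, 99]
candidate_month_total = 110
t_stat = single_value_t_statistic(candidate_month_total, other_month_totals)
print(single_value_variance_defined, round(t_stat, 4))
```

The resulting statistic would then be compared with a one-sided critical value of Student's t distribution with n - 1 degrees of freedom.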


### I conclude, relying solely on the data from the Madinah municipality, that deaths are equally likely in each month of the year and that there is no tendency for people to die more often in the month of Shaban.