# Programming for Data Analysis Project 2019

![Image](Images/pythonpandas.JPG "Image")


## Problem Statement

Create a dataset by simulating a real-world phenomenon. Rather than collecting data associated with the chosen real-world phenomenon, the data should be modelled and synthesised using Python's **numpy.random** package.



## Real World Phenomenon

The real-world phenomenon we will be simulating is the Irish Emergency Call Answering Service (ECAS). This is a call centre, open 24 hours a day, 365 days a year, that answers emergency calls made to 112 or 999.

When 112 or 999 is dialled from any PSTN phone in Ireland, the call is put through to an emergency front line operator in the ECAS call centre. The caller is then given the following options:

1. Garda 
2. Fire 
3. Ambulance
4. Coast Guard 

Depending on the emergency, the front line operator puts the call through to the appropriate emergency service. The operator continues to monitor the call until the emergency service takes responsibility for the call and for responding to the emergency.

They are also responsible for taking the location of the emergency call.

According to the call statistics published on the Department of Communications website, the ECAS service answered over 1.8 million calls in 2017. This is over 200 calls per hour. The average duration of calls was 7.57 seconds.

References: 
https://www.dccae.gov.ie/en-ie/news-and-media/press-releases/Pages/Minister-Denis-Naughten-Publishes-New-Figures-on-Ireland%E2%80%99s-Emergency-Call-Answering-Service.aspx
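
The hourly rate quoted above can be checked with a few lines of arithmetic. This is only a sketch: it treats "over 1.8 million" as exactly 1,800,000 calls and assumes a 365-day year, so the result is a rough lower estimate.

```python
# Figure quoted above, rounded down to exactly 1.8 million calls for 2017.
calls_2017 = 1_800_000
hours_in_year = 365 * 24          # 8,760 hours in a non-leap year

calls_per_hour = calls_2017 / hours_in_year
calls_per_minute = calls_per_hour / 60

print(f"{calls_per_hour:.1f} calls per hour")
print(f"{calls_per_minute:.2f} calls per minute")
```

The result is a little over 205 calls per hour, which is consistent with the "over 200 calls per hour" figure.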

## Variables

To simulate the ECAS call centre, which is open 24 hours a day, 365 days a year, we will use Python's **numpy.random** package. The dataset will have the six variables listed below.

The first variable is a time series variable called **period**. It covers the first week of January 2017 in hourly intervals, from midnight on 1 January to midnight on 7 January inclusive. This gives 145 rows in the dataframe.

1. period
2. garda
3. fire
4. ambulance
5. coast_guard
6. total_calls

To simulate the calls to the ECAS call centre, the Numpy random.poisson distribution is used. The garda, fire, ambulance and coast_guard calls are all drawn from a Numpy random.poisson distribution.

The total_calls variable is a sum total of calls for garda, fire, ambulance and coast_guard.

References: 
https://en.wikipedia.org/wiki/Poisson_distribution
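
A Poisson distribution has a single parameter $\lambda$, which is both its mean and its variance. The sketch below illustrates this with a large sample. The rate of 100 calls per hour and the seed are arbitrary choices for the example, not values taken from the ECAS data.

```python
import numpy as np

# Seeded legacy generator so the sketch is repeatable; seed and rate are arbitrary.
rng = np.random.RandomState(42)
lam = 100                      # assumed hourly call rate, for illustration only
samples = rng.poisson(lam, size=100_000)

# For a Poisson distribution both of these should be close to lam.
sample_mean = samples.mean()
sample_var = samples.var()
print(f"mean={sample_mean:.2f}, variance={sample_var:.2f}, lambda={lam}")
```

Because the variance equals the mean, hourly counts simulated with a larger $\lambda$ also spread more widely around it.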


## Building the Dataframe

### Importing Python Packages

The first step is to import the following Python packages. Numpy is imported to give the numpy random packages to simulate data for the dataset. Pandas constructs the actual dataframe consisting of rows and columns. Matplotlib.pyplot and seaborn are imported as graphic libraries to generate plots from the simulated data.

In [1]:
#The following Python packages are imported
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

### Defining the sampling start, end and frequency

Once the Python packages are imported, the next step is to define the sampling start, end and frequency of the time series.


In [2]:
# The period of a week is chosen starting from the 1st January 2017,
# the week is divided up into hourly periods, giving 145 data points.

period =  pd.date_range('01-01-2017', '01-07-2017', freq='H')

#Reference: 
#https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.date_range.html
#https://jakevdp.github.io/PythonDataScienceHandbook/03.11-working-with-time-series.html

In [3]:
#period datetime index is printed out, the data type is datetime64 
period

DatetimeIndex(['2017-01-01 00:00:00', '2017-01-01 01:00:00',
               '2017-01-01 02:00:00', '2017-01-01 03:00:00',
               '2017-01-01 04:00:00', '2017-01-01 05:00:00',
               '2017-01-01 06:00:00', '2017-01-01 07:00:00',
               '2017-01-01 08:00:00', '2017-01-01 09:00:00',
               ...
               '2017-01-06 15:00:00', '2017-01-06 16:00:00',
               '2017-01-06 17:00:00', '2017-01-06 18:00:00',
               '2017-01-06 19:00:00', '2017-01-06 20:00:00',
               '2017-01-06 21:00:00', '2017-01-06 22:00:00',
               '2017-01-06 23:00:00', '2017-01-07 00:00:00'],
              dtype='datetime64[ns]', length=145, freq='H')
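
The length of 145 follows from both end points being included: six days of hourly steps give 6 × 24 + 1 timestamps. The sketch below reproduces that count and, as a hypothetical alternative that is not used in the rest of this notebook, shows how a full seven days could be requested by count with the `periods` argument. The frequency is passed as a `pd.Timedelta` here only to avoid depending on a particular frequency alias.

```python
import pandas as pd

hourly = pd.Timedelta(hours=1)

# Both end points are included, so six days of hourly steps give 6 * 24 + 1 stamps.
inclusive = pd.date_range('2017-01-01', '2017-01-07', freq=hourly)

# Hypothetical alternative: exactly seven days of hourly stamps, fixed by count.
full_week = pd.date_range('2017-01-01', periods=7 * 24, freq=hourly)

print(len(inclusive), inclusive[-1])
print(len(full_week), full_week[-1])
```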

### Defining ecas_garda dataframe

The ecas_garda pandas dataframe is defined to capture garda emergency calls to the ECAS call centre for the period defined in the previous section. It consists of the datetime column and a column of garda emergency calls drawn from the Numpy random.poisson distribution. The Poisson distribution's $\lambda$ is itself a random integer.

Defining the ecas_garda dataframe consists of two steps.

1. Define the Poisson distribution's $\lambda$ as a random integer
2. Define the calls column as a Numpy **random.poisson** distribution over the entire time period

References: https://www.calvin.edu/~rpruim/courses/s341/S17/from-class/MathinRmd.html


In [4]:
#Define a random integer for the Poisson distribution lambda value for calls for a garda emergency.
#The random integer is anywhere in the range of 45 to 184 calls per hour (the upper bound of randint is exclusive), this range is chosen arbitrarily.
garda_randomint = np.random.randint(45, 185)

#The random integer is printed out below
garda_randomint

#Reference: https://docs.scipy.org/doc/numpy-1.15.1/reference/generated/numpy.random.randint.html

172

Construct the **ecas_garda** dataframe using the **period** variable and label this column datetime. This dataframe will have the same number of rows as the period variable. The next line of code adds the **garda** column on to the new dataframe **ecas_garda** using a Numpy **random.poisson** function in Python. The resulting pandas dataframe has two columns, **datetime** and **garda**.

In [5]:
#The ecas_garda pandas Dataframe is first constructed from the period, producing a datetime64 column labelled 'datetime'.
#The datetime column is not the index of the dataframe and can therefore be used on either the x or y axis of the plots.
#The garda column is added to the new dataframe using the Numpy random.poisson function.
#The first 5 rows of the new dataframe are printed out.

ecas_garda = pd.DataFrame(period, columns=['datetime'])
ecas_garda['garda'] = np.random.poisson(garda_randomint, len(period))
ecas_garda.head()

#Reference: 
#https://web.microsoftstream.com/video/db8801fe-9e42-4663-a508-5d6f38bb7327
#https://docs.scipy.org/doc/numpy-1.15.1/reference/generated/numpy.random.poisson.html
#https://www.dataquest.io/blog/tutorial-time-series-analysis-with-pandas/

|   | datetime            | garda |
|---|---------------------|-------|
| 0 | 2017-01-01 00:00:00 | 169   |
| 1 | 2017-01-01 01:00:00 | 181   |
| 2 | 2017-01-01 02:00:00 | 163   |
| 3 | 2017-01-01 03:00:00 | 207   |
| 4 | 2017-01-01 04:00:00 | 162   |


In [6]:
#This command verifies the data types in the ecas_garda dataframe.
#The result is a datetime64 column and a 32-bit integer column

ecas_garda.dtypes

#References:https://docs.scipy.org/doc/numpy-1.15.0/reference/arrays.datetime.html

datetime    datetime64[ns]
garda                int32
dtype: object
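
The same two steps are repeated for each emergency service in the sections that follow. As a sketch of how they could be wrapped up, the hypothetical helper `simulate_service` below (not part of the original notebook) draws a random $\lambda$ and then the hourly counts. It uses a seeded `RandomState` so the example is repeatable, whereas the notebook cells use the unseeded global generator.

```python
import numpy as np
import pandas as pd

def simulate_service(period, name, low, high, rng):
    """Return (lam, frame): a random Poisson rate and hourly call counts for one service."""
    lam = rng.randint(low, high)                  # step 1: low <= lam < high
    frame = pd.DataFrame({'datetime': period})
    frame[name] = rng.poisson(lam, len(period))   # step 2: one draw per timestamp
    return lam, frame

rng = np.random.RandomState(0)   # arbitrary seed, for repeatability only
period = pd.date_range('2017-01-01', '2017-01-07', freq=pd.Timedelta(hours=1))
garda_lam, ecas_garda_demo = simulate_service(period, 'garda', 45, 185, rng)
print(garda_lam)
print(ecas_garda_demo.head())
```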

### Defining ecas_fire dataframe

Again, the ecas_fire pandas dataframe is defined to capture fire emergency calls to the ECAS call centre for the period defined in the previous section. It consists of the datetime column and a column of emergency calls for the fire service drawn from the Numpy random.poisson distribution. The Poisson distribution's $\lambda$ is itself a random integer.

Defining the ecas_fire dataframe consists of two steps.

1. Define the Poisson distribution's $\lambda$ as a random integer
2. Define the calls column as a Numpy **random.poisson** distribution over the entire time period.

References: https://www.calvin.edu/~rpruim/courses/s341/S17/from-class/MathinRmd.html

In [7]:
#Define a random integer for the Poisson distribution lambda value for calls for a fire emergency.
#The random integer is anywhere in the range of 50 to 159 calls per hour (the upper bound of randint is exclusive), this range is chosen arbitrarily.

fire_randomint = np.random.randint(50,160)

# The random integer is printed out.
fire_randomint

#Reference: https://docs.scipy.org/doc/numpy-1.15.1/reference/generated/numpy.random.randint.html

127

Construct the **ecas_fire** dataframe using the **period** variable and label this column datetime. This dataframe will have the same number of rows as the period variable. The next line of code adds the **fire** column on to the new dataframe **ecas_fire** using a Numpy **random.poisson** function in Python. The resulting pandas dataframe has two columns, **datetime** and **fire**.

In [8]:
#The ecas_fire pandas Dataframe is first constructed from the period, producing a datetime64 column labelled 'datetime'.
#The datetime column is not the index of the dataframe and can therefore be used on either the x or y axis of the plots.
#The fire column is added to the new dataframe using the Numpy random.poisson function.
#The first 5 rows of the new dataframe are printed out.

ecas_fire = pd.DataFrame(period, columns=['datetime'])
ecas_fire['fire'] = np.random.poisson(fire_randomint, len(period))
ecas_fire.head()

#Reference: 
#https://web.microsoftstream.com/video/db8801fe-9e42-4663-a508-5d6f38bb7327
#https://docs.scipy.org/doc/numpy-1.15.1/reference/generated/numpy.random.poisson.html
#https://www.dataquest.io/blog/tutorial-time-series-analysis-with-pandas/

|   | datetime            | fire |
|---|---------------------|------|
| 0 | 2017-01-01 00:00:00 | 130  |
| 1 | 2017-01-01 01:00:00 | 127  |
| 2 | 2017-01-01 02:00:00 | 113  |
| 3 | 2017-01-01 03:00:00 | 128  |
| 4 | 2017-01-01 04:00:00 | 128  |


In [9]:
#This command verifies the data types in the ecas_fire dataframe.
#The result is a datetime64 column and a 32-bit integer column

ecas_fire.dtypes

#References:https://docs.scipy.org/doc/numpy-1.15.0/reference/arrays.datetime.html

datetime    datetime64[ns]
fire                 int32
dtype: object

### Defining ecas_ambulance dataframe

The ecas_ambulance pandas dataframe is defined to capture ambulance emergency calls to the ECAS call centre for the period defined in the previous section. It consists of the datetime column and a column of ambulance calls drawn from the Numpy random.poisson distribution. The Poisson distribution's $\lambda$ is itself a random integer.

Defining the ecas_ambulance dataframe consists of two steps.

1. Define the Poisson distribution's $\lambda$ as a random integer
2. Define the calls column as a Numpy **random.poisson** distribution over the entire time period.

References: https://www.calvin.edu/~rpruim/courses/s341/S17/from-class/MathinRmd.html

In [10]:
#Define a random integer for the Poisson distribution lambda value for calls for an ambulance emergency.
#The random integer is anywhere in the range of 35 to 269 calls per hour (the upper bound of randint is exclusive), this range is chosen arbitrarily.

ambulance_randomint = np.random.randint(35,270)
ambulance_randomint

#Reference: https://docs.scipy.org/doc/numpy-1.15.1/reference/generated/numpy.random.randint.html

132

Construct the **ecas_ambulance** dataframe using the **period** variable and label this column datetime. This dataframe will have the same number of rows as the period variable. The next line of code adds the **ambulance** column on to the new dataframe **ecas_ambulance** using a Numpy **random.poisson** function in Python. The resulting pandas dataframe has two columns, **datetime** and **ambulance**.

In [11]:
#The ecas_ambulance pandas Dataframe is first constructed from the period, producing a datetime64 column labelled 'datetime'.
#The datetime column is not the index of the dataframe and can therefore be used on either the x or y axis of the plots.
#The ambulance column is added to the new dataframe using the Numpy random.poisson function.
#The first 5 rows of the new dataframe are printed out.

ecas_ambulance = pd.DataFrame(period, columns=['datetime'])
ecas_ambulance['ambulance'] = np.random.poisson(ambulance_randomint, len(period))
ecas_ambulance.head()

#Reference: 
#https://web.microsoftstream.com/video/db8801fe-9e42-4663-a508-5d6f38bb7327
#https://docs.scipy.org/doc/numpy-1.15.1/reference/generated/numpy.random.poisson.html
#https://www.dataquest.io/blog/tutorial-time-series-analysis-with-pandas/

|   | datetime            | ambulance |
|---|---------------------|-----------|
| 0 | 2017-01-01 00:00:00 | 120       |
| 1 | 2017-01-01 01:00:00 | 137       |
| 2 | 2017-01-01 02:00:00 | 130       |
| 3 | 2017-01-01 03:00:00 | 113       |
| 4 | 2017-01-01 04:00:00 | 136       |


In [12]:
#This command verifies the data types in the ecas_ambulance dataframe.
#The result is a datetime64 column and a 32-bit integer column

ecas_ambulance.dtypes

#References:https://docs.scipy.org/doc/numpy-1.15.0/reference/arrays.datetime.html

datetime     datetime64[ns]
ambulance             int32
dtype: object

### Defining ecas_coast_guard dataframe

The ecas_coast_guard pandas dataframe is defined to capture coast_guard emergency calls to the ECAS call centre for the period defined in the previous section. It consists of the datetime column and a column of emergency calls for the coast guard drawn from the Numpy random.poisson distribution. The Poisson distribution's $\lambda$ is itself a random integer.

Defining the ecas_coast_guard dataframe consists of two steps.

1. Define the Poisson distribution's $\lambda$ as a random integer
2. Define the calls column as a Numpy **random.poisson** distribution over the entire time period.

References: https://www.calvin.edu/~rpruim/courses/s341/S17/from-class/MathinRmd.html

In [13]:
#Define a random integer for the Poisson distribution lambda value for calls for a coast_guard emergency.
#The random integer is anywhere in the range of 30 to 264 calls per hour (the upper bound of randint is exclusive), this range is chosen arbitrarily.

coast_guard_randomint = np.random.randint(30, 265)
coast_guard_randomint

#Reference: https://docs.scipy.org/doc/numpy-1.15.1/reference/generated/numpy.random.randint.html

212

Construct the **ecas_coast_guard** dataframe using the **period** variable and label this column datetime. This dataframe will have the same number of rows as the period variable. The next line of code adds the **coast_guard** column on to the new dataframe **ecas_coast_guard** using a Numpy **random.poisson** function in Python. The resulting pandas dataframe has two columns, **datetime** and **coast_guard**.

In [14]:
#The ecas_coast_guard pandas Dataframe is first constructed from the period, producing a datetime64 column labelled 'datetime'.
#The datetime column is not the index of the dataframe and can therefore be used on either the x or y axis of the plots.
#The coast_guard column is added to the new dataframe using the Numpy random.poisson function.
#The first 5 rows of the new dataframe are printed out.

ecas_coast_guard = pd.DataFrame(period, columns=['datetime'])
ecas_coast_guard['coast_guard'] = np.random.poisson(coast_guard_randomint, len(period))
ecas_coast_guard.head()

#Reference: 
#https://web.microsoftstream.com/video/db8801fe-9e42-4663-a508-5d6f38bb7327
#https://docs.scipy.org/doc/numpy-1.15.1/reference/generated/numpy.random.poisson.html
#https://www.dataquest.io/blog/tutorial-time-series-analysis-with-pandas/

|   | datetime            | coast_guard |
|---|---------------------|-------------|
| 0 | 2017-01-01 00:00:00 | 186         |
| 1 | 2017-01-01 01:00:00 | 220         |
| 2 | 2017-01-01 02:00:00 | 208         |
| 3 | 2017-01-01 03:00:00 | 215         |
| 4 | 2017-01-01 04:00:00 | 213         |


In [15]:
#This command verifies the data types in the ecas_coast_guard dataframe.
#The result is a datetime64 column and a 32-bit integer column

ecas_coast_guard.dtypes

#References:https://docs.scipy.org/doc/numpy-1.15.0/reference/arrays.datetime.html

datetime       datetime64[ns]
coast_guard             int32
dtype: object

### Joining the dataframes together

We now have the following four dataframes, one for the calls to each emergency service:

1. ecas_garda
2. ecas_fire
3. ecas_ambulance
4. ecas_coast_guard

Each of the dataframes above has the same datetime column. We will extract the calls column values from each dataframe, use them to define a new dataframe **ecas**, and add a total_calls column that sums all calls to the ECAS call centre for the period defined. We can still reference the individual dataframes to analyse each emergency call category separately.
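
Since every frame shares the same datetime column, one alternative to copying `.values` column by column is to merge the frames on that column. The sketch below is only an illustration of that idea under assumed inputs: the five-row period, the rates and the name `ecas_demo` are made up for the example and are not the notebook's data.

```python
from functools import reduce
import numpy as np
import pandas as pd

rng = np.random.RandomState(1)   # arbitrary seed
period = pd.date_range('2017-01-01', periods=5, freq=pd.Timedelta(hours=1))

# Small stand-in frames; the rates are made up for the example.
rates = {'garda': 170, 'fire': 130, 'ambulance': 130, 'coast_guard': 210}
frames = []
for name, lam in rates.items():
    frame = pd.DataFrame({'datetime': period})
    frame[name] = rng.poisson(lam, len(period))
    frames.append(frame)

# Merge pairwise on the shared datetime column, then add the row total.
ecas_demo = reduce(lambda left, right: left.merge(right, on='datetime'), frames)
ecas_demo['total_calls'] = ecas_demo[list(rates)].sum(axis=1)
print(ecas_demo)
```

Merging on `datetime` aligns rows by timestamp rather than by position, which is safer if the frames ever differ in order or length.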

In [16]:
# Creating an array of values from the Dataframe ecas_fire
# Reference: https://stackoverflow.com/questions/46396257/adding-a-new-column-in-pandas-dataframe-from-another-dataframe-with-differing-in
ecas_fire['fire'].values

array([130, 127, 113, 128, 128, 116, 141, 123, 131, 114, 134, 135, 135,
       123, 155, 130, 122, 119, 117, 144, 128, 128, 110, 120, 127, 124,
       122, 112, 145, 125, 134, 134, 142, 112, 117, 127, 125, 111, 122,
       110, 129, 112, 116, 114, 123, 121, 113, 140, 135, 153, 132, 139,
       122, 108, 119, 136, 126, 128, 126, 142, 122, 114, 124, 110, 141,
       129, 150, 140, 129, 120, 119, 155, 128, 120, 130, 117, 120, 143,
       123, 144, 113, 115, 138, 142, 142, 124, 123, 126, 130, 133, 105,
       124, 136, 129, 128, 119, 141, 127, 120, 131, 137, 130, 113, 113,
       142, 135, 118, 109, 128, 136, 129, 126, 130, 131, 131, 114, 117,
       115, 109, 137, 128, 117, 104, 135, 136, 110, 111, 128, 140, 106,
       130, 128, 114, 142, 134, 144, 118, 115, 123, 124, 114, 136, 123,
       133, 133])

In [17]:
ecas_ambulance['ambulance'].values

array([120, 137, 130, 113, 136, 121, 129, 138, 133, 120, 144, 125, 145,
       157, 140, 133, 144, 130, 130, 140, 141, 114, 133, 129, 133, 121,
       118, 129, 143, 125, 149, 150, 133, 138, 126, 110, 136, 130, 142,
       123, 147, 142, 134, 118, 143, 127, 141, 138, 115, 137, 125, 128,
       131, 124, 125, 136, 113, 131, 148, 161, 135, 134, 131, 133, 118,
       142, 140, 129, 151, 147, 138, 125, 122, 128, 124, 125, 139, 125,
       136, 134, 159, 149, 108, 160, 137, 143, 167, 110, 120, 142, 136,
       125, 119, 106, 145, 125, 127, 128, 112, 110, 138, 125, 156, 125,
       139, 132, 122, 133, 131, 128, 144, 146, 131, 133, 135, 128, 131,
       129, 137, 106, 131, 128, 133, 115, 117, 137, 130, 139, 127, 105,
       129, 133, 118, 128, 120, 131, 123, 130, 140, 127, 145, 143, 119,
       126, 130])

In [18]:
# Create the ecas Dataframe as a copy of ecas_garda (datetime and garda columns),
# then add the fire column, so the ecas columns will consist of datetime, garda and fire
ecas = ecas_garda.copy()
ecas['fire'] = ecas_fire['fire'].values
ecas.head()

In [None]:
# Creating an array of values from the Dataframe ecas_ambulance
# Reference: https://stackoverflow.com/questions/46396257/adding-a-new-column-in-pandas-dataframe-from-another-dataframe-with-differing-in
ecas_ambulance['ambulance'].values

In [None]:
# Adding the ambulance columns to the ecas Dataframe, the ecas columns will consist of garda, fire and ambulance
ecas['ambulance'] = ecas_ambulance['ambulance'].values
ecas.head()

In [None]:
# Creating an array of values from the Dataframe ecas_coast_guard
# Reference: https://stackoverflow.com/questions/46396257/adding-a-new-column-in-pandas-dataframe-from-another-dataframe-with-differing-in
ecas_coast_guard['coast_guard'].values

In [None]:
# Adding the coast_guard column to the ecas Dataframe, the ecas columns will consist of garda, fire, ambulance and coast_guard
ecas['coast_guard'] = ecas_coast_guard['coast_guard'].values
ecas.head()

In [None]:
# Sum columns garda, fire, ambulance, coast_guard to give a new column total_calls
# Reference: https://stackoverflow.com/questions/34023918/make-new-column-in-panda-dataframe-by-adding-values-from-other-columns/34023971
ecas['total_calls'] = ecas['garda'] + ecas['fire'] + ecas['ambulance'] + ecas['coast_guard']

ecas
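
A useful property here is that the sum of independent Poisson variables is itself Poisson, with a rate equal to the sum of the individual rates. The sketch below checks this numerically using the four $\lambda$ values printed earlier in this notebook (172, 127, 132 and 212). The seed and the sample size are arbitrary choices, and the sample is much larger than the 145 rows of the dataframe so that the averages settle.

```python
import numpy as np

rng = np.random.RandomState(7)                 # arbitrary seed
lams = {'garda': 172, 'fire': 127, 'ambulance': 132, 'coast_guard': 212}

n = 100_000                                    # many more draws than the 145 rows
total = sum(rng.poisson(lam, n) for lam in lams.values())

# For a Poisson total, both the mean and the variance should be near this rate.
expected_rate = sum(lams.values())
print(expected_rate, total.mean(), total.var())
```

For these values the combined rate is 643 calls per hour, so the total_calls column should fluctuate around that figure.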

In [None]:
ecas['total_calls'].sum()

In [None]:
ecas.dtypes

In [None]:
ecas.describe()

In [None]:
# Overlaid histograms of the hourly calls for each emergency service.
# Note: sns.distplot is deprecated in newer seaborn releases; sns.histplot is its replacement.
#sns.distplot(ecas['total_calls'], hist=True, kde=False, label = 'Total Calls', color='yellow')
plt.rcParams['figure.figsize'] = (10, 10)
sns.distplot(ecas['garda'], hist=True, kde=False, label = 'garda' , color='blue')
sns.distplot(ecas['fire'], hist=True, kde=False, label = 'fire' , color='red')
sns.distplot(ecas['ambulance'], hist=True, kde=False, label = 'ambulance' , color='pink')
sns.distplot(ecas['coast_guard'], hist=True, kde=False, label = 'coast_guard' , color='green')
plt.legend()

In [None]:
plt.rcParams['figure.figsize'] = (10, 10)
sns.lineplot(x="datetime", y="garda", data=ecas)

In [None]:
plt.rcParams['figure.figsize'] = (10, 10)
sns.lineplot(x="datetime", y="total_calls", data=ecas)


## Average Call Duration

The average ECAS call duration was 7.57 seconds in 2017, according to the Department of Communications, Climate Action and Environment press release referenced below.

In this section we will build this into our model to simulate how many agents need to be available so that no call is queued.

There will be two more variables added to the ecas dataframe

1. total_avg_talk_time
2. agents_required

Reference: 
https://www.dccae.gov.ie/en-ie/news-and-media/press-releases/Pages/Minister-Denis-Naughten-Publishes-New-Figures-on-Ireland%E2%80%99s-Emergency-Call-Answering-Service.aspx

### Add total_avg_talk_time and agents_required

The next section of code adds total_avg_talk_time to the ecas dataframe, the first step towards agents_required.

In [None]:
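# The 7.57 second average call duration is rounded up to 8 seconds,
# so total_avg_talk_time is the total talk time in seconds for each hour
ecas['total_avg_talk_time'] = ecas['total_calls'] * 8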
ecas.head()
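
The agents_required variable listed earlier could be derived from the talk time in several ways. The sketch below shows one simple possibility, not a definitive staffing model: it divides the total talk time in an hour by the 3,600 seconds one agent can handle and rounds up. It assumes calls are spread evenly across the hour, so it is a lower bound; real arrivals cluster, and a queueing model would be needed to guarantee that no call waits. The hourly totals and the name `demo` are made up for the example.

```python
import math
import pandas as pd

AVG_CALL_SECONDS = 7.57        # average call duration quoted above
SECONDS_PER_HOUR = 3600

# Made-up hourly totals standing in for ecas['total_calls'].
demo = pd.DataFrame({'total_calls': [640, 655, 612]})

demo['total_avg_talk_time'] = demo['total_calls'] * AVG_CALL_SECONDS
# Assumption: calls are spread evenly over the hour, so this is a lower bound on staffing.
demo['agents_required'] = (demo['total_avg_talk_time'] / SECONDS_PER_HOUR).apply(math.ceil)
print(demo)
```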

In [None]:
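# datetime is a column, not the index, so it is set as the index before slicing by date.
# The simulated period falls in January 2017.
ecas_jan = ecas.set_index('datetime').loc['2017-01-01':'2017-01-31']
ecas_jan.head()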

In [None]:
ecas_jan['total_calls'].sum()

In [None]:
ecas_jan.head()

In [None]:
ecas[ecas['datetime'].dt.month == 1]['total_calls'].sum()

In [None]:
ecas[ecas['datetime'] == '2017-01-01 00:00:00']['total_calls'].values

### Poisson Probability

The Poisson probability of a given number of calls arriving in a particular interval can be calculated as follows:


In [None]:

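from scipy.special import factorial

# First hour of the simulated period
time_p = '2017-01-01 00:00:00'

# Average calls per minute in that hour (datetime is a column, not the index)
calls_per_min = ecas[ecas['datetime'] == time_p]['total_calls'].values / 60

minutes = 5

# Expected number of calls (lambda) in the interval
lam = calls_per_min[0] * minutes

# Probability of exactly k calls in the interval
k = 2
pk = np.exp(-lam) * np.power(lam, k) / factorial(k)
print(f'The probability of {k} calls in {minutes} minutes is {100*pk:.2f}% for the time period {time_p}')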

#Reference: https://github.com/WillKoehrsen/Data-Analysis/blob/master/poisson/poisson.ipynb
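
The calculation above uses the Poisson probability mass function, $P(k) = \lambda^k e^{-\lambda} / k!$. As a sketch using only the standard library, the same formula can be evaluated in log space, which avoids very large factorials when $k$ is large. The helper names `poisson_pmf` and `poisson_cdf` and the hourly rate of 643 (the sum of the four $\lambda$ values printed earlier) are choices made for this example.

```python
import math

def poisson_pmf(k, lam):
    """P(exactly k events) for a Poisson rate lam, computed in log space."""
    return math.exp(k * math.log(lam) - lam - math.lgamma(k + 1))

def poisson_cdf(k, lam):
    """P(at most k events)."""
    return sum(poisson_pmf(i, lam) for i in range(k + 1))

calls_per_hour = 643            # assumed hourly rate, for illustration
minutes = 5
lam = calls_per_hour / 60 * minutes   # expected calls in the interval

for k in (40, 54, 70):
    print(f"P({k} calls in {minutes} min) = {poisson_pmf(k, lam):.4f}")
print(f"P(at most 60 calls) = {poisson_cdf(60, lam):.4f}")
```

With an expected count of roughly 54 calls in five minutes, the probability of seeing only 2 calls is vanishingly small, so values of $k$ near the expected count are more informative.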