# Project 2019 - Programming for Data Analysis

# Simulation Dataset - Installed Base Dataset

## Introduction
The following assignment concerns the numpy.random package in Python 3. I have created a Jupyter notebook explaining the use of the package, including detailed explanations of six of the distributions provided for in the package.

## Problem statement

The objective of this project is to create a data set by simulating a real-world phenomenon of my choosing.

* Instead of collecting data, I model and synthesise the data using Python packages, such as numpy.random.

Specifically, in this project:

* I simulate two hundred data points across four different variables.
* I investigate the types of variables involved, their likely distributions, and their relationships with each other.
* I simulate a data set that matches their properties as closely as possible.
* I detail my research and implement the simulation in a Jupyter notebook.
* I display the final dataset in an output cell within the notebook.



## About My Dataset
The objective of my dataset is to simulate typical data about a company's installed base. According to Kurvinen (2017), typical installed base data includes a listing of the products installed at a given customer site. Furthermore, it can include additional variables such as serial numbers, hardware and software revisions, warranties and service contracts.

Installed base data can be used by many departments in an organisation, including field service engineering, sales, spare parts planning and quality. The dataset helps to answer questions such as:

* What is the current configuration of the product to be serviced?
* Where is the faulty product physically located and where is the part to be replaced located?
* Is the unit covered under warranty or service contract?
* When was the unit installed, upgraded and/or last serviced?

![Install Base Image](https://cdn.myonlinestore.eu/945f2dab-6be1-11e9-a722-44a8421b9960/images/World%20map%20installed%20base.png)

## Variable Types

1. Python Objects  
Serial numbers are usually unique alphanumeric strings of a fixed length, so I use the uuid module in Python to generate random unique IDs, which have similar properties to serial numbers. The uuid module provides "immutable UUID objects (the UUID class) and the functions uuid1(), uuid3(), uuid4(), uuid5() for generating version 1, 3, 4, and 5 UUIDs as specified in RFC 4122" (Python Software Foundation, 2019). The IDs are converted to strings, and the dtype used to store them is the Python object dtype.

2. datetime64  
datetime64 is a NumPy data type that supports datetime functionality (The Scipy Community, 2017). It is used to store the datetime variables 'install date' and 'factory warranty'.

3. Int32  
This is a 32-bit integer data type, used for the variable 'extended warranty'.
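
The three variable types can be seen side by side in a small sketch. The frame below is illustrative only: the name `demo` and its three rows are made up and are not part of the simulated dataset. Depending on the pandas version, the string column may be reported as object or as a dedicated string dtype.

```python
import uuid

import numpy as np
import pandas as pd

# A tiny three-row frame with one column per variable type (values are made up).
demo = pd.DataFrame({
    "serial": [str(uuid.uuid4()) for _ in range(3)],
    "install date": pd.to_datetime(["2017-01-01", "2018-06-01", "2019-12-01"]),
    "extended warranty": np.array([0, 1, 3], dtype="int32"),
})

# A UUID rendered as a string is always 36 characters (32 hex digits and 4 hyphens).
print(demo["serial"].str.len().unique())

# One dtype per column: string/object, datetime64 and int32.
print(demo.dtypes)
```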
 
 

## 1. Defining the main parameters
We start by importing the necessary dependencies and defining the main parameters: 200 serial numbers, 36 months, and a starting month of January 2017.

In [32]:
# importing all the libraries
import pandas as pd
import numpy as np
import uuid
from datetime import datetime
from dateutil.relativedelta import relativedelta

In [33]:
# We then set the main parameters of our final dataset. The number of units we’d like to generate data for, maximum number of months warranty per unit and the start of the data collection.
# number of serial numbers 
num_serial_num = 200

# number of months since first unit was installed
num_months = 36

# starting month when units first installed
start_month = '2017-01-01'

In [34]:
# generating 200 unique serial numbers, one per unit
serial = pd.Series([str(uuid.uuid4()) for _ in range(num_serial_num)])
installbase = pd.DataFrame()
installbase['serial'] = serial
installbase


|  | serial |
|---|---|
| 0 | b2877847-fda8-450d-8ce1-d7ce2360b58b |
| 1 | f9263644-fbb2-4a88-9c5d-b8a2b2c37e88 |
| 2 | b6f7b07b-cc38-4c9d-988a-4c840045bf43 |
| 3 | 47a6083c-d277-4131-a453-d97884903597 |
| 4 | ec537cfc-92d5-416d-a392-ad7b1a505bba |
| 5 | e48ef7df-8f4b-477d-a731-4ed49e5bbbda |
| 6 | bc650ce2-73ef-4eca-845e-9c7dcb13d29e |
| 7 | b5bdc4c2-7068-4358-812b-08a79a0e5b9a |
| 8 | 1bab0c99-5221-4285-b4d2-7c5e77dc8c8e |
| 9 | 62e19f7e-70e1-4e1f-a36f-67730f20587b |


## Generating datetime

In [35]:
# resetting the index (drop=True discards the old index instead of keeping it as a column)
installbase = installbase.reset_index(drop=True)

# defining starting month and ending month
start_month_ts = pd.to_datetime(start_month)
end_month_ts = start_month_ts + relativedelta(months=+num_months - 1)

# making a Series out of the starting and ending month
months = pd.Series(pd.date_range(start_month_ts, end_month_ts, freq='MS'))

str(start_month_ts), str(end_month_ts)

('2017-01-01 00:00:00', '2019-12-01 00:00:00')

We can see that with the specified parameters (starting month, number of months) we will generate data from January 2017 to December 2019.
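
The month arithmetic above can be checked with a short sketch. It reuses the same parameter values; the names `start_ts`, `end_ts` and `same_months` are illustrative. Because the range includes both endpoints, the end month is the start month plus `num_months - 1`, and passing `periods=num_months` is an equivalent way to build the same range.

```python
import pandas as pd
from dateutil.relativedelta import relativedelta

start_month = "2017-01-01"
num_months = 36

start_ts = pd.to_datetime(start_month)
# The range is inclusive at both ends, so the last month is start + (num_months - 1).
end_ts = start_ts + relativedelta(months=num_months - 1)

# "MS" means month start: one timestamp on the first day of each month.
months = pd.date_range(start_ts, end_ts, freq="MS")

# Equivalent construction that states the number of months directly.
same_months = pd.date_range(start_ts, periods=num_months, freq="MS")

print(len(months), months[0].date(), months[-1].date())
```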

In [36]:
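np.random.seed(2)  # seed the random number generator so the results are reproducible
# draw one install month per serial number, with replacement
installdate = pd.Series(np.random.choice(months, size=num_serial_num))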
installdate.head()

0   2018-04-01
1   2017-09-01
2   2018-11-01
3   2018-07-01
4   2017-12-01
dtype: datetime64[ns]
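
Seeding is what makes the draw above repeatable. The sketch below illustrates this with a hypothetical helper, `draw_install_dates`, which is not part of the notebook: reseeding with the same value before each draw returns the same sequence of install months.

```python
import numpy as np
import pandas as pd

months = pd.date_range("2017-01-01", periods=36, freq="MS")

def draw_install_dates(seed, size=200):
    """Seed NumPy's global generator, then draw `size` install months with replacement."""
    np.random.seed(seed)
    return pd.Series(np.random.choice(months, size=size))

first = draw_install_dates(seed=2)
second = draw_install_dates(seed=2)

# Same seed, same draws: the two Series are identical.
print(first.equals(second))
```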

In [37]:
# Adding install date column to the dataframe (one install date per serial number)
installbase['install date'] = installdate
installbase

|  | serial | install date |
|---|---|---|
| 0 | b2877847-fda8-450d-8ce1-d7ce2360b58b | 2018-04-01 |
| 1 | f9263644-fbb2-4a88-9c5d-b8a2b2c37e88 | 2017-09-01 |
| 2 | b6f7b07b-cc38-4c9d-988a-4c840045bf43 | 2018-11-01 |
| 3 | 47a6083c-d277-4131-a453-d97884903597 | 2018-07-01 |
| 4 | ec537cfc-92d5-416d-a392-ad7b1a505bba | 2017-12-01 |
| 5 | e48ef7df-8f4b-477d-a731-4ed49e5bbbda | 2017-08-01 |
| 6 | bc650ce2-73ef-4eca-845e-9c7dcb13d29e | 2019-11-01 |
| 7 | b5bdc4c2-7068-4358-812b-08a79a0e5b9a | 2019-08-01 |
| 8 | 1bab0c99-5221-4285-b4d2-7c5e77dc8c8e | 2017-12-01 |
| 9 | 62e19f7e-70e1-4e1f-a36f-67730f20587b | 2018-10-01 |


In [38]:
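# Factory warranty expires one year after installation.
# Note: np.timedelta64(1, 'Y') is a fixed span of 365.2425 days (an average year),
# which is why the results below carry a 05:49:12 time component and some land a day early.
factorywarrantyexpiry = installdate + np.timedelta64(1, 'Y')
factorywarrantyexpiry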

0     2019-04-01 05:49:12
1     2018-09-01 05:49:12
2     2019-11-01 05:49:12
3     2019-07-01 05:49:12
4     2018-12-01 05:49:12
5     2018-08-01 05:49:12
6     2020-10-31 05:49:12
7     2020-07-31 05:49:12
8     2018-12-01 05:49:12
9     2019-10-01 05:49:12
10    2020-07-31 05:49:12
11    2020-02-29 05:49:12
12    2019-09-01 05:49:12
13    2018-04-01 05:49:12
14    2018-05-01 05:49:12
15    2020-09-30 05:49:12
16    2018-04-01 05:49:12
17    2018-06-01 05:49:12
18    2020-01-01 05:49:12
19    2018-05-01 05:49:12
20    2018-07-01 05:49:12
21    2020-07-31 05:49:12
22    2019-08-01 05:49:12
23    2020-07-31 05:49:12
24    2018-03-01 05:49:12
25    2019-05-01 05:49:12
26    2019-01-01 05:49:12
27    2018-05-01 05:49:12
28    2020-02-29 05:49:12
29    2019-04-01 05:49:12
              ...        
170   2019-10-01 05:49:12
171   2019-08-01 05:49:12
172   2018-05-01 05:49:12
173   2020-10-31 05:49:12
174   2019-01-01 05:49:12
175   2019-04-01 05:49:12
176   2019-10-01 05:49:12
177   2018-0
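
The 05:49:12 time component in the output above arises because a NumPy year is a fixed span of 365.2425 days rather than a calendar year, and recent pandas versions may reject that unit altogether. The sketch below is one possible alternative, not the method used in this notebook: `pd.DateOffset(years=1)` moves each date to the same day one year later. The three install dates are taken from the values shown earlier, and the variable names are illustrative.

```python
import pandas as pd

install = pd.Series(pd.to_datetime(["2018-04-01", "2019-03-01", "2019-11-01"]))

# 365.2425 days is 365 days, 5 hours, 49 minutes and 12 seconds.
average_year = pd.Timedelta(days=365, hours=5, minutes=49, seconds=12)
fixed_span = install + average_year

# A calendar offset keeps the day of month and adds no time component.
calendar_year = install + pd.DateOffset(years=1)

print(fixed_span)
print(calendar_year)
```

With the fixed span, dates whose following year contains 29 February end up a day short of the anniversary; the calendar offset does not have that problem.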

In [39]:
# Adding factory warranty expiry column to the dataframe
installbase['factory warranty'] = factorywarrantyexpiry
installbase.head()

|  | serial | install date | factory warranty |
|---|---|---|---|
| 0 | b2877847-fda8-450d-8ce1-d7ce2360b58b | 2018-04-01 | 2019-04-01 05:49:12 |
| 1 | f9263644-fbb2-4a88-9c5d-b8a2b2c37e88 | 2017-09-01 | 2018-09-01 05:49:12 |
| 2 | b6f7b07b-cc38-4c9d-988a-4c840045bf43 | 2018-11-01 | 2019-11-01 05:49:12 |
| 3 | 47a6083c-d277-4131-a453-d97884903597 | 2018-07-01 | 2019-07-01 05:49:12 |
| 4 | ec537cfc-92d5-416d-a392-ad7b1a505bba | 2017-12-01 | 2018-12-01 05:49:12 |


In [40]:
# Customers can decline extended warranty or purchase an additional 1, 2 or 3 years
num_warranty = 3

# assign an extended warranty length (in years) to each unit at random;
# randint excludes `high`, so add 1 to make the 3-year option possible
warranty = pd.DataFrame()
warranty['warranty extension'] = np.random.randint(low=0, high=num_warranty + 1, size=num_serial_num)
warranty.sample(5)



|  | warranty extension |
|---|---|
| 30 | 1 |
| 54 | 2 |
| 71 | 1 |
| 162 | 2 |
| 119 | 0 |
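
Note that `np.random.randint` excludes its `high` bound, so the upper bound has to be one more than the largest value wanted. The sketch below illustrates this with a large number of draws; the names `without_three` and `with_three` are illustrative.

```python
import numpy as np

np.random.seed(2)
num_warranty = 3  # longest extension on offer, in years

# `high` is exclusive, so this can only ever return 0, 1 or 2.
without_three = np.random.randint(low=0, high=num_warranty, size=10_000)

# Adding 1 to `high` makes the 3-year option reachable.
with_three = np.random.randint(low=0, high=num_warranty + 1, size=10_000)

print(np.unique(without_three))
print(np.unique(with_three))
```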


In [46]:
# Adding extended warranty column (in years) to the dataframe
installbase['extended warranty'] = warranty['warranty extension']
installbase.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200 entries, 0 to 199
Data columns (total 4 columns):
serial               200 non-null object
install date         200 non-null datetime64[ns]
factory warranty     200 non-null datetime64[ns]
extended warranty    200 non-null int32
dtypes: datetime64[ns](2), int32(1), object(1)
memory usage: 5.5+ KB
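
With these four columns in place, the dataset can start to answer one of the questions posed earlier: is a unit covered under warranty on a given date? The sketch below uses made-up rows rather than the simulated data, and it rests on an assumption that the notebook does not state: each extended-warranty year is added on top of the factory warranty. The names `sample`, `coverage end`, `covered` and `as_of` are illustrative.

```python
import pandas as pd

# Made-up rows with the same columns as the simulated dataset.
sample = pd.DataFrame({
    "serial": ["unit-a", "unit-b", "unit-c"],
    "install date": pd.to_datetime(["2017-03-01", "2018-09-01", "2019-06-01"]),
    "extended warranty": [0, 2, 1],
})

# Factory warranty: one calendar year after installation.
sample["factory warranty"] = sample["install date"] + pd.DateOffset(years=1)

# ASSUMPTION: each extended-warranty year is added on top of the factory warranty.
sample["coverage end"] = [
    expiry + pd.DateOffset(years=int(years))
    for expiry, years in zip(sample["factory warranty"], sample["extended warranty"])
]

as_of = pd.Timestamp("2019-12-01")  # hypothetical service date
sample["covered"] = sample["coverage end"] >= as_of

print(sample[["serial", "coverage end", "covered"]])
```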


## 4. Generating categorical features

In [42]:
# Defining the variables
platforms = ['iOS', 'Android']
countries = ['IE', 'GB', 'NL', 'FR', 'DE', 'BE', 'DK']
service_contract = [False, True]

### 4.1. Generating categorical feature weights
Here we define weights that set how likely each categorical value is to be assigned to an individual unit.
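
A minimal sketch of weighted sampling, assuming the weights are passed to `np.random.choice` through its `p` argument. The weight values below are hypothetical placeholders chosen for illustration, not figures from research, and the names `country_weights`, `contract_weights` and `features` are illustrative.

```python
import numpy as np
import pandas as pd

np.random.seed(2)
num_serial_num = 200

countries = ['IE', 'GB', 'NL', 'FR', 'DE', 'BE', 'DK']
service_contract = [False, True]

# HYPOTHETICAL weights, one per category; each list must sum to 1.
country_weights = [0.30, 0.25, 0.10, 0.10, 0.15, 0.05, 0.05]
contract_weights = [0.6, 0.4]

features = pd.DataFrame({
    'country': np.random.choice(countries, size=num_serial_num, p=country_weights),
    'service contract': np.random.choice(service_contract, size=num_serial_num, p=contract_weights),
})

# Observed shares should sit close to the weights, though not match them exactly.
print(features['country'].value_counts(normalize=True))
```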

## References
1. Kurvinen, M. (2017) *INSTALLED BASE AND TRACEABILITY* [Online] Available at: http://sd-ize.com/installed-base.html [Accessed 1 December 2019].
2. Python Software Foundation (2019) *UUID objects according to RFC 4122* [Online] Available at: https://docs.python.org/2/library/uuid.html [Accessed 1 December 2019].
3. The Scipy Community (2017) *Datetimes and Timedeltas* [Online] Available at: https://docs.scipy.org/doc/numpy-1.13.0/reference/arrays.datetime.html [Accessed 3 December 2019].
