# Project 2019 - Programming for Data Analysis

# Simulation Dataset - Installed Base Dataset

## Introduction
The following assignment concerns the numpy.random package in Python 3. I have created a Jupyter notebook explaining the use of the package, including detailed explanations of six of the distributions provided for in the package.

## Problem statement

The objective of this project is to create a data set by simulating a real-world phenomenon of my choosing.

* Instead of collecting data, I model and synthesise the data using Python packages, such as numpy.random.

Specifically, in this project:

* I simulate two hundred data points across four different variables.
* I investigate the types of variables involved, their likely distributions, and their relationships with each other.
* I simulate a data set as closely matching their properties as possible.
* I detail my research and implement the simulation in a Jupyter notebook.
* The final dataset itself is displayed in an output cell within the notebook.



## About My Dataset
The objective of my dataset is to simulate typical data about a company's installed base. According to Kurvinen (2017), typical installed base data includes a listing of the products installed at a given customer site. Furthermore, it can include additional variables such as serial numbers, hardware and software revisions, warranties and service contracts.

Installed base data can be used by many departments in an organisation, including field service engineers, sales, spare parts planners and quality. The dataset helps to answer questions such as:

* What is the current configuration of the product to be serviced?
* Where is the faulty product physically located and where is the part to be replaced located?
* Is the unit covered under warranty or service contract?
* When was the unit installed, upgraded and/or last serviced?

![Install Base Image](https://cdn.myonlinestore.eu/945f2dab-6be1-11e9-a722-44a8421b9960/images/World%20map%20installed%20base.png)

## Basic Description of the Dataset

The dataset contains the following data:

1. Part Number: 5 Digit identifier for each product installed. 
2. Serial Numbers: Unique IDs for each unit sold. There are 200 serial numbers.
3. Installation Date: Assume all units sold require a start-up / installation by a field service engineer, who records this date on the company CRM system (Salesforce.com). The dataset starts in January 2017 and spans 36 months.
4. Factory Warranty Expiration Date: Factory Warranty usually expires 1 year after installation.
5. Extended Warranty: Number of years of extended warranty cover purchased.

## Variable Types
 
 1. Python Objects  
 Serial numbers are usually unique alphanumeric strings of a fixed length, so I use the uuid module in Python to generate random unique IDs with similar properties. The uuid module provides "immutable UUID objects (the UUID class) and the functions uuid1(), uuid3(), uuid4(), uuid5() for generating version 1, 3, 4, and 5 UUIDs as specified in RFC 4122" (Python Software Foundation, 2019). The UUIDs are converted to strings, which pandas stores with the object dtype.
 
 2. datetime64
 Datetime64 is a NumPy data type which supports datetime functionality (The Scipy Community, 2017). This datatype will be used to store the datetime variables 'install date' and 'factory warranty'.
 
 3. Int32
 This datatype is a 32bit integer, which will be used for variable 'extended warranty'.
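
As a quick check of the three variable types, the following minimal sketch builds a hypothetical three-row frame (the name `demo` and its values are illustrative and not part of the dataset) and prints the dtype pandas assigns to each column. Serial numbers stored as strings are typically reported as `object`, although the exact label may differ between pandas versions.

```python
import uuid

import numpy as np
import pandas as pd

# Hypothetical three-row frame, used only to inspect the dtypes
demo = pd.DataFrame({
    'serial number': [str(uuid.uuid4()) for _ in range(3)],
    'install date': pd.to_datetime(['2017-01-01', '2017-02-01', '2017-03-01']),
    'extended warranty': np.array([0, 1, 2], dtype=np.int32),
})

# One dtype per column: string/object, datetime64, int32
print(demo.dtypes)
```

The integer column is created with an explicit `dtype=np.int32` here because the default integer width of NumPy arrays can depend on the platform.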
 
 

## 1. Defining the main parameters

I begin by importing all the necessary dependencies and defining the main parameters mentioned above (200 serial numbers, 36 months, starting month of January 2017).

In [None]:
# importing all the libraries
import pandas as pd
import numpy as np
import uuid
from datetime import datetime
from dateutil.relativedelta import relativedelta
import seaborn as sns
import matplotlib.pyplot as plt
from matplotlib.dates import DateFormatter
import matplotlib.dates as mdates

In [None]:
# Define the main parameters of the dataset.

# number of serial numbers which are contained in the install base dataset
num_serial_num = 200

# number of months since first unit was installed
num_months = 36

# starting month when units first installed
start_month = '2017-01-01'

## 2. Simulate List of Part Numbers
Part numbers can be classified as categorical data.

Typically, any data attribute which is categorical in nature represents discrete values which belong to a specific finite set of categories or classes. These are also often known as classes or labels in the context of attributes or variables which are to be predicted by a model (popularly known as response variables). These discrete values can be text or numeric in nature (or even unstructured data like images!). There are two major classes of categorical data, nominal and ordinal.

In any nominal categorical data attribute, there is no concept of ordering amongst the values of that attribute. Consider a simple example of weather categories: there is no concept or notion of order among them (windy doesn't always occur before sunny, nor is it smaller or bigger than sunny) (Sarkar, 2018). Part numbers are nominal in the same sense: they identify a product, and their numeric order carries no meaning.

In [None]:
# Use the NumPy random.randint function to generate 200 random part numbers

# 5-digit part numbers in the range 10000 to 99999
# randint excludes the upper bound, so high is set to 100000
low, high, size = (10000, 100000, 200) # Define parameters for numpy.random.randint() function

np.random.seed(2) # seed the random number generator so results are repeatable

# Create a pandas DataFrame - solution adapted from https://stackoverflow.com/a/23671779
# One column of 200 five-digit integers
parts = pd.DataFrame(np.random.randint(low, high, size), columns=['parts'])
# Display first 5 rows
parts.head()
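
`numpy.random.randint` draws from the half-open interval `[low, high)`, so the upper bound itself is never returned. The sketch below is a small standalone check (the name `sample` is illustrative) that every value drawn this way has five digits.

```python
import numpy as np

np.random.seed(2)  # fixed seed so the check is repeatable

# high is exclusive: 100000 is needed for 99999 to be a possible value
sample = np.random.randint(10000, 100000, size=200)

# Smallest and largest part numbers drawn
print(sample.min(), sample.max())
# Number of distinct part numbers among the 200 draws
print(len(np.unique(sample)), "distinct part numbers out of", sample.size)
```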

## 3. Create Serial Numbers
To generate serial numbers I use the uuid4() function from Python’s uuid module. A UUID is a Universally Unique IDentifier, a 128-bit value (32 hexadecimal digits, or a 36-character string once the hyphens are included) designed to “guarantee uniqueness across space and time” (Solomon, 2018).

In [None]:
# generating 200 serial numbers using the uuid4 function
# Adapted from https://towardsdatascience.com/generating-product-usage-data-from-scratch-with-pandas-319487590c6d

# create a 1D Series called serial, using the uuid4 function to generate 200 random UUIDs
serial = pd.Series([str(uuid.uuid4()) for i in range(0,num_serial_num)])
# create an empty DataFrame called installbase
installbase = pd.DataFrame()
# insert the Series serial into the installbase DataFrame
installbase['serial number'] = serial
# display dataframe (scrolling)
installbase
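
The properties that make UUIDs a reasonable stand-in for serial numbers (fixed length, no repeats) can be checked directly. The following is a minimal sketch; the names `serials` and `versions` are illustrative and do not appear elsewhere in the notebook.

```python
import uuid

import pandas as pd

# 200 random version-4 UUIDs in their string form
serials = pd.Series([str(uuid.uuid4()) for _ in range(200)])

# Each string has 32 hexadecimal digits plus 4 hyphens
print(serials.str.len().unique())
# True when no serial number repeats
print(serials.is_unique)
# Parse each string back into a UUID object and read its version
versions = serials.map(lambda s: uuid.UUID(s).version)
print(versions.unique())
```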


In [None]:
# Adding part number column to the dataframe
# Adapted from https://towardsdatascience.com/generating-product-usage-data-from-scratch-with-pandas-319487590c6d

# Assign the 'parts' column to the dataframe, resetting its index so that rows align with the serial numbers.
installbase['part number'] = parts['parts'].reset_index(drop=True)
installbase

The output is a dataframe with 200 rows and two columns, serial number and part number.

## 4. Generating Installation Date

I use the pandas.to_datetime and pandas.date_range functions to generate a range of monthly datetime values which simulate a range of installation dates. Then I select a random sample of these dates to populate the dataset using the np.random.choice function.

In [None]:
# Create Date Range using the specified parameters (starting month, number of months).
# Adapted from https://towardsdatascience.com/generating-product-usage-data-from-scratch-with-pandas-319487590c6d

# resetting the index without inserting it as a column in the new DataFrame - https://www.geeksforgeeks.org/python-pandas-series-reset_index/
installbase = installbase.reset_index(drop=True)

# defining range of installation dates: starting month 
start_month_ts = pd.to_datetime(start_month)
# define end month as start month plus 35 months (36 months in total, counting the start month), using the relativedelta utility https://dateutil.readthedocs.io/en/stable/relativedelta.html
end_month_ts = start_month_ts + relativedelta(months=+num_months - 1)

# making a Series out of the starting and ending month
months = pd.Series(pd.date_range(start_month_ts, end_month_ts, freq='MS'))
# Display start and end month
str(start_month_ts), str(end_month_ts)

This cell uses the specified parameters (starting month, number of months) to generate a range of month-start dates from January 2017 to December 2019.
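
The `num_months - 1` offset is easy to get wrong, so the sketch below rebuilds the range on its own and counts the entries. The variable names `start`, `end` and `alt` are illustrative; `alt` shows an alternative that asks `pd.date_range` for a number of periods instead of an end date.

```python
import pandas as pd
from dateutil.relativedelta import relativedelta

start = pd.to_datetime('2017-01-01')
# 35 months after the start month gives 36 month starts in total
end = start + relativedelta(months=36 - 1)

# 'MS' = month start frequency; both endpoints are included
months = pd.date_range(start, end, freq='MS')
print(len(months), months[0].date(), months[-1].date())
# prints: 36 2017-01-01 2019-12-01

# Alternative: specify the number of periods rather than the end date
alt = pd.date_range(start, periods=36, freq='MS')
print(months.equals(alt))
```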

In [None]:
# Select random installation dates from range of dates created above

# seed the random number generator so results are repeatable
np.random.seed(2)
# create a series called installdate which contains 200 dates chosen at random (with replacement) from the date range "months" using the random.choice() function
installdate = pd.Series(np.random.choice(months, size=num_serial_num))
# display first 5 rows
installdate.head()

In [None]:
# Adding installdate column to the dataframe

# Assign the installdate series to the installbase dataframe; both share the default index 0-199, so rows align one to one.
installbase['install date'] = installdate
# display the dataframe
installbase

In [None]:
df = installbase.groupby(['install date']).count() # Adapted from McKinney 2019
# Resize the plot to a figure of 15 (width) x 10 (height) inches - Adapted from https://stackoverflow.com/a/36368418
plt.figure(figsize=(15,10))
plt.title('Distribution of Install Date Samples') # Plot Title
plt.xlabel('Install Date') # Label x Axis 
plt.ylabel('Frequency') # Label y Axis 
plt.plot(df['serial number'])
plt.xticks(rotation='vertical')

In [None]:
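# Factory warranty expires one calendar year after installation.
# pd.DateOffset adds a calendar year; np.timedelta64(1, 'Y') is not a calendar year and may be rejected by recent pandas versions.
factorywarrantyexpiry = installdate + pd.DateOffset(years=1)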
factorywarrantyexpiry

In [None]:
# Assign the expiry dates to the dataframe; the series shares the dataframe's index, so rows align one to one.
installbase['factory warranty'] = factorywarrantyexpiry
installbase.head()
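
Adding one calendar year to a date is not the same as adding a fixed number of days. The sketch below uses `pd.DateOffset(years=1)` on three hypothetical install dates (the leap day is included only for illustration, since the simulated dates all fall on the first of a month) to show that the day and month are preserved where possible.

```python
import pandas as pd

# Hypothetical install dates, including a leap day as an edge case
install = pd.Series(pd.to_datetime(['2017-01-01', '2019-03-01', '2020-02-29']))

# Calendar-aware offset: same day and month, one year later
expiry = install + pd.DateOffset(years=1)

# Side-by-side view of install date and warranty expiry
print(pd.DataFrame({'install date': install, 'factory warranty': expiry}))
```

Under this approach a leap-day install date rolls to 28 February of the following year, because 29 February does not exist in that year.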

In [None]:
# Customers can decline extended warranty (0) or purchase an additional 1, 2 or 3 years of extended warranty
num_warranty = 3  

# assign an extended warranty length to each unit at random
warranty = pd.DataFrame()
# randint excludes the upper bound, so high is num_warranty + 1 to include 3 years
warranty['warranty extension'] = np.random.randint(low=0, high=num_warranty + 1, size=num_serial_num)
warranty.sample(5)



In [None]:
# Assign the warranty extension column to the dataframe; both share the default index, so rows align one to one.
installbase['extended warranty'] = warranty['warranty extension']
installbase
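
Because the upper bound of `np.random.randint` is exclusive, it is worth confirming which warranty lengths actually occur. The following standalone sketch (the name `ext` is illustrative) counts how often each value from 0 to 3 years is drawn when `high` is set to 4.

```python
import numpy as np
import pandas as pd

np.random.seed(2)  # fixed seed so the check is repeatable

# high=3 + 1 so that 0, 1, 2 and 3 years are all possible values
ext = pd.Series(np.random.randint(low=0, high=3 + 1, size=200),
                name='extended warranty')

# Frequency of each warranty length, ordered by number of years
print(ext.value_counts().sort_index())
```

Each length is equally likely under this scheme, so the counts should be roughly even.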

## 5. Generating categorical features

In [None]:
# Defining the variables
platforms = ['iOS', 'Android']
countries = ['IE', 'GB', 'NL', 'FR', 'DE', 'BE', 'DK']
service_contract = [False, True]

### 5.1. Generating categorical feature weights
Defining weights for the likelihood of a categorical feature associated with an individual unit.

In [None]:
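# Define the parameters; shape, scale and sample size
a = 3. # Shape
m = 2. # Scale
size = 1000 # sample size 

# create a NumPy array called n of Lomax (Pareto II) samples scaled by m
# (np.random.pareto draws from the Lomax distribution; adding 1 before scaling gives a classical Pareto with mode m)
n=np.random.pareto(a, size)*m

#randomInts = np.random.normal(loc=10, scale=3, size=10000).astype(int)-10
n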
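
The notebook does not show how these samples become feature weights. One possible approach, offered here only as an assumption, is to draw one Pareto-distributed value per category, normalise the values so they sum to 1, and pass them to `np.random.choice` as probabilities. The names `raw`, `weights` and `assigned` are hypothetical.

```python
import numpy as np

np.random.seed(2)  # fixed seed so the sketch is repeatable

countries = ['IE', 'GB', 'NL', 'FR', 'DE', 'BE', 'DK']

# One heavy-tailed draw per country (assumed weighting scheme)
raw = np.random.pareto(3.0, size=len(countries))
# Normalise so the weights form a probability distribution
weights = raw / raw.sum()

# Assign a country to each of 200 units according to the weights
assigned = np.random.choice(countries, size=200, p=weights)

# Count how many units landed in each country
values, counts = np.unique(assigned, return_counts=True)
print(dict(zip(values, counts)))
```

A heavy-tailed draw tends to produce a few dominant categories and several rare ones, which may resemble an installed base concentrated in a small number of countries.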

## References
1. Kurvinen, M (2017) *INSTALLED BASE AND TRACEABILITY* [Online] Available at: http://sd-ize.com/installed-base.html [Accessed 1 December 2019].
2. Python Software Foundation (2019) *UUID objects according to RFC 4122* [Online] Available at: https://docs.python.org/2/library/uuid.html [Accessed 1 December 2019].
3. Osolnik, J (2017) *Simulating product usage data with Pandas* [Online] Available at: https://towardsdatascience.com/generating-product-usage-data-from-scratch-with-pandas-319487590c6d [Accessed 1 December 2019].
4. The Scipy Community (2017) *Datetimes and Timedeltas* [Online] Available at: https://docs.scipy.org/doc/numpy-1.13.0/reference/arrays.datetime.html [Accessed 3 December 2019].
5. Sarkar, D (2018) *Categorical Data* [Online] Available at: https://towardsdatascience.com/understanding-feature-engineering-part-2-categorical-data-f54324193e63 [Accessed 21 November 2019].
6. Solomon, B (2018) *Generating Random Data in Python (Guide)* [Online] Available at: https://realpython.com/python-random/ [Accessed 7 December 2019].

