# Deployment Risk Profile Prediction

# Problem: Predict the risk profile of a deployment

## Introduction to business scenario

Change management teams, processes and forums struggle to keep up in large organisations.

This solution predicts the risk profile of each deployment, which removes the need for committees to review every change from every team across the whole organisation.

If the predicted risk profile is LOW, teams should be allowed to proceed by default. If it is MEDIUM or HIGH, the standard committees apply.

The factors that drive a MEDIUM or HIGH prediction should be continually identified and discussed with the relevant stakeholders, so that the organisation can increase its velocity.
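The routing rule described above can be sketched as a small function. This is only an illustration under assumptions: the function name `route_change` and the returned labels are hypothetical, and only the LOW, MEDIUM and HIGH levels come from the text.

```python
def route_change(risk_profile: str) -> str:
    """Return the review path for a predicted risk profile (hypothetical helper)."""
    level = risk_profile.strip().upper()
    if level == "LOW":
        return "proceed"            # low-risk changes go ahead by default
    if level in ("MEDIUM", "HIGH"):
        return "committee_review"   # the standard committees still apply
    raise ValueError(f"Unknown risk profile: {risk_profile!r}")


print(route_change("LOW"))
print(route_change("HIGH"))
```

Raising an error on an unknown level, rather than defaulting to either path, keeps a mislabelled prediction from silently bypassing review.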



### Features

**Data columns**




- `incidents`: past incidents per business unit, department and system
- `documentation`: how many times developers read strategy, architecture and design documents
- `architecture catalog`: whether the feature change is being made on a strategic asset or a legacy asset
- `work management tool` (e.g. Jira): much can be captured here, but ideally the date a feature is created and the date it is completed
- `source control`: commits per feature
- `CI`: build data for features
- `CD`: deployment data for features per environment
- `testing data`: test execution and test result data for features
- `ITSM data` (e.g. ServiceNow): change management data
- `logging`: the date when a feature is actually consumed by a real customer
- `monitoring`: stability data of a system after a feature is deployed

**Data format**
- Tab `\t` separated text file, without quote or escape characters
- First line in each file is header; 1 line corresponds to 1 record
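A file in this format could be read as sketched below. The sample records and column names are hypothetical; `quoting=csv.QUOTE_NONE` is one way to tell pandas that quote characters carry no special meaning, which matches the "without quote or escape characters" rule above.

```python
import csv
import io

import pandas as pd

# Hypothetical two-record sample: header line first, one record per line
sample = 'feature_id\tsystem_id\ttest_result\nF-1\tS-10\tpassed\nF-2\tS-11\t"flaky"\n'

df_sample = pd.read_csv(
    io.StringIO(sample),
    sep="\t",
    quoting=csv.QUOTE_NONE,  # treat quote characters as ordinary text
    header=0,                # first line is the header
)
print(df_sample)
```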

### Data standard

Not yet defined


# Step 1: Data collection



### Setup

Now that we have decided where to focus our energy, let's set things up so you can start working on solving the problem.

**Note:** This notebook was created and tested on an `ml.m4.xlarge` notebook instance. 

Start by specifying:
- The Amazon Simple Storage Service (Amazon S3) bucket and prefix that you want to use for training and model data. This should be within the same Region as the Notebook Instance, training, and hosting.
- The AWS Identity and Access Management (IAM) role [Amazon Resource Name (ARN)](https://docs.aws.amazon.com/general/latest/gr/aws-arns-and-namespaces.html) used to give training and hosting access to your data. See the documentation for how to create these.

**Note:** If more than one role is required for notebook instances, training, and/or hosting, replace the `get_execution_role()` call with the appropriate full IAM role ARN string(s).

Replace **`<DataBucketName>`** with the resource name that was provided with your lab account.

In [1]:
# Change the bucket and prefix according to your information
bucket = '<DataBucketName>'

In [2]:
%%capture

%pip install --upgrade boto3 -q
%pip install mxnet -q

In [3]:
import os, subprocess
import warnings
import pandas as pd
import numpy as np
import sagemaker
from sagemaker.mxnet import MXNet
import boto3
import json
import matplotlib.pyplot as plt
import seaborn as sns

role = sagemaker.get_execution_role()
prefix = 'sagemaker-fm' 

# Add this to display all the outputs in the cell and not just the last one
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

# Ignore warnings
warnings.filterwarnings("ignore")

# Step 2: Data preprocessing and visualization  
In this data preprocessing phase, you should take the opportunity to explore and visualize your data to better understand it. First, import the necessary libraries and read the data into a Pandas dataframe. After that, explore your data. Look for the shape of the dataset and explore your columns and the types of columns you're working with (numerical, categorical). Consider performing basic statistics on the features to get a sense of feature means and ranges. Take a close look at your target column and determine its distribution.
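The exploration steps described above might look like the following on a small frame. The data and the column names (including `risk_profile` as the target) are hypothetical stand-ins, since the data standard is not yet defined.

```python
import pandas as pd

# Hypothetical toy data; `risk_profile` stands in for the target column
df_toy = pd.DataFrame({
    "incident_count": [0, 2, 5, 1, 0, 3],
    "asset_type": ["strategic", "legacy", "legacy", "strategic", "strategic", "legacy"],
    "risk_profile": ["LOW", "MEDIUM", "HIGH", "LOW", "LOW", "MEDIUM"],
})

print(df_toy.shape)       # (rows, columns)
print(df_toy.dtypes)      # numerical vs. categorical columns
print(df_toy.describe())  # mean, std, min and max of numerical features

# Share of each class in the target column
target_share = df_toy["risk_profile"].value_counts(normalize=True)
print(target_share)
```

Looking at class shares rather than raw counts makes an imbalanced target easier to spot.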

### Specific questions to consider
1. What can you deduce from the basic statistics you ran on the features? 

2. What can you deduce from the distributions of the target classes?

3. Is there anything else you deduced from exploring the data?

#### <span style="color: blue;">Project presentation: Include a summary of your answers to these and other similar questions in your project presentations.</span>

Start by bringing in the dataset from an Amazon S3 public bucket to this notebook environment.

In [4]:
# Check whether the file is already in the desired path or if it needs to be downloaded

base_path = '/home/ec2-user/SageMaker/project/data/DeploymentRiskProfile'
file_path = '/Company_Deployment_Data.tsv.gz'

if not os.path.isfile(base_path + file_path):
    subprocess.run(['mkdir', '-p', base_path])
    subprocess.run(['aws', 's3', 'cp', 's3://DeploymentRiskProfile/tsv' + file_path, base_path])
else:
    print('File already downloaded!')

CompletedProcess(args=['mkdir', '-p', '/home/ec2-user/SageMaker/project/data/AmazonReviews'], returncode=0)

download: s3://amazon-reviews-pds/tsv/amazon_reviews_us_Digital_Video_Download_v1_00.tsv.gz to ../project/data/AmazonReviews/amazon_reviews_us_Digital_Video_Download_v1_00.tsv.gz


CompletedProcess(args=['aws', 's3', 'cp', 's3://amazon-reviews-pds/tsv/amazon_reviews_us_Digital_Video_Download_v1_00.tsv.gz', '/home/ec2-user/SageMaker/project/data/AmazonReviews'], returncode=0)

### Reading the dataset

Read the data into a Pandas dataframe so that you know what you are dealing with.

**Note:** Set `on_bad_lines="skip"` when reading the file, because a very small number of records would otherwise cause a parsing error. (Older pandas versions used `error_bad_lines=False` for this; that argument was deprecated in pandas 1.3 and removed in 2.0.)

**Hint:** You can use the pandas `read_csv` function ([documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html)). Pass the file path directly to `read_csv` with `delimiter='\t'`.

For example: `pd.read_csv('filename.tsv.gz', delimiter='\t', on_bad_lines='skip')`
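To see what skipping bad lines does in practice, here is a minimal sketch on an in-memory sample. The records are hypothetical, and it assumes pandas 1.3 or later, where `on_bad_lines="skip"` replaces the older `error_bad_lines` argument.

```python
import io

import pandas as pd

# Hypothetical sample: the second record has one field too many
sample = (
    "feature_id\tsystem_id\ttest_result\n"
    "F-1\tS-10\tpassed\n"
    "F-2\tS-11\tfailed\textra\n"
    "F-3\tS-12\tpassed\n"
)

df_sample = pd.read_csv(io.StringIO(sample), delimiter="\t", on_bad_lines="skip")
print(df_sample)  # the malformed F-2 record is dropped
```

Because skipped records disappear silently, it is worth comparing the row count against the source file when data loss matters.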

In [5]:
df = pd.read_csv(base_path + file_path, 
                 delimiter='\t',
                 on_bad_lines="skip")

Print the first few rows of your dataset.

**Hint**: Use the `<dataframe>.head(<number>)` method to print the rows.

In [6]:
df.head(3)

Unnamed: 0,marketplace,customer_id,review_id,product_id,product_parent,product_title,product_category,star_rating,helpful_votes,total_votes,vine,verified_purchase,review_headline,review_body,review_date
0,US,12190288,R3FU16928EP5TC,B00AYB1482,668895143,Enlightened: Season 1,Digital_Video_Download,5,0,0,N,Y,I loved it and I wish there was a season 3,I loved it and I wish there was a season 3... ...,2015-08-31
1,US,30549954,R1IZHHS1MH3AQ4,B00KQD28OM,246219280,Vicious,Digital_Video_Download,5,0,0,N,Y,As always it seems that the best shows come fr...,As always it seems that the best shows come fr...,2015-08-31
2,US,52895410,R52R85WC6TIAH,B01489L5LQ,534732318,After Words,Digital_Video_Download,4,17,18,N,Y,Charming movie,"This movie isn't perfect, but it gets a lot of...",2015-08-31


What information do the columns contain?

### Anatomy of the dataset

Get a little more comfortable with the data and see what features are at hand.

**Data scientist input required here.**

### Analyzing and processing the dataset

#### Exploring the data

**Question:** How many rows and columns do you have in the dataset?

Check the size of the dataset.  

**Hint**: Use the `<dataframe>.shape` attribute to check the size of your dataframe.

In [7]:
df.shape

(3998345, 15)

**Answer:** (3998345, 15)

**Question:** Which columns contain null values, and how many null values do they contain?

Print a summary of the dataset.

**Hint**: Use the `<dataframe>.info` method with the keyword argument `show_counts=True` (older pandas versions called this `null_counts`).

In [8]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3998345 entries, 0 to 3998344
Data columns (total 15 columns):
 #   Column             Dtype 
---  ------             ----- 
 0   marketplace        object
 1   customer_id        int64 
 2   review_id          object
 3   product_id         object
 4   product_parent     int64 
 5   product_title      object
 6   product_category   object
 7   star_rating        int64 
 8   helpful_votes      int64 
 9   total_votes        int64 
 10  vine               object
 11  verified_purchase  object
 12  review_headline    object
 13  review_body        object
 14  review_date        object
dtypes: int64(5), object(10)
memory usage: 457.6+ MB


**Answer:** `review_headline`: 25, `review_body`: 78, `review_date`: 138
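Null counts per column can also be computed directly with `isna().sum()`, which is useful because `info()` may omit the non-null counts on large dataframes unless `show_counts=True` is passed. The sketch below uses a hypothetical toy frame rather than the real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with a few missing values
df_toy = pd.DataFrame({
    "feature_id": ["F-1", "F-2", "F-3"],
    "test_result": ["passed", None, "failed"],
    "completed_date": [np.nan, np.nan, "2024-01-05"],
})

null_counts = df_toy.isna().sum()
print(null_counts[null_counts > 0])  # only the columns that contain nulls
```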

**Question:** Are there any duplicate rows? If yes, how many are there?

**Hint**: Filter the dataframe using `dataframe.duplicated()` ([documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.duplicated.html#pandas.DataFrame.duplicated)) and check the length of the new dataframe.

In [9]:
duplicates = df[df.duplicated()]

len(duplicates)

0

**Answer:** There are no duplicated rows.

In [None]:
### Data preprocessing

Now it's time to decide what features you are going to use and how you are going to prepare them for your model. For this example, limit yourself to `system_id`, `incident_no`, `asset_type`, `test_run_id`, `test_result`, and `feature_id`. Including additional features in the model could be beneficial but would require substantial processing (particularly for text data), which is beyond the scope of this notebook.

Reduce this dataset and only use the columns mentioned.

**Hint:** Select multiple columns as a dataframe by passing the columns as a list. For example: `df[['column_name 1', 'column_name 2']]`

In [10]:
df_reduced = df[['system_id', 'incident_no', 'asset_type', 'test_run_id', 'test_result', 'feature_id']]

Check if you have duplicates after reducing the dataset. 

In [11]:
duplicates = df_reduced[df_reduced.duplicated()]

len(duplicates)

131

**Answer**: 131

**Question:** Why do you have duplicates in your dataset now? What changed after you reduced the dataset? Review the first 20 lines of the duplicates.

**Hint**: Use the `<dataframe>.head(<number>)` method to print the rows.

In [12]:
duplicates.head(20)

Unnamed: 0,customer_id,product_id,star_rating,product_title
565194,41454255,B00Y2UYRFS,1,unseen 2
594322,17570065,B00R3EEO2G,2,The Maze Runner
611264,15703996,B00I3MQNWG,5,Bosch Season 1
612471,28456429,B008Y6W7J4,5,Rabbit Hole
613791,52388381,B00YORA25I,5,"McFarland, USA (Theatrical)"
685156,31828958,B00TT53YSW,5,"Bates Motel, Season 3"
1204110,19462,B00QWUL4AW,5,Exodus: Gods and Kings
1564816,24892653,B00L2GPYKW,5,The Escape Artist Season 1
1601662,44513234,B00NY4UIKG,5,The Equalizer
1616429,43345475,B00P5968FC,3,The Babadook


**Hint:** Take a look at the first two elements in the duplicates dataframe, and query the original dataframe df to see what the data looks like. You can use the `query` function ([documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.query.html)).

For example:

```
df_eg = pd.DataFrame({
    'A': [1, 2, 3, 4],
    'B': [2, 0, 5, 2]
})
# Keep only the rows where both conditions hold
df_eg.query('A > 1 & B > 0')
```
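One way to see why reducing the columns can introduce duplicates is the toy sketch below. The records are hypothetical: the rows differ only in `feature_id`, so they become identical once that column is dropped. Passing `keep=False` to `duplicated` marks every copy rather than only the repeats, which helps when inspecting them.

```python
import pandas as pd

# Hypothetical records: the first two rows differ only in `feature_id`
df_toy = pd.DataFrame({
    "feature_id": ["F-1", "F-2", "F-3"],
    "system_id": ["S-10", "S-10", "S-11"],
    "test_result": ["passed", "passed", "failed"],
})

n_before = df_toy.duplicated().sum()   # every full row is unique

reduced = df_toy[["system_id", "test_result"]]
n_after = reduced.duplicated().sum()   # rows 0 and 1 now collide

# keep=False flags all copies of a duplicated row, not just the later ones
all_copies = reduced[reduced.duplicated(keep=False)]
print(n_before, n_after)
print(all_copies)
```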

Before continuing, remove the duplicate rows.

**Hint**: Use the `~` operator to select all the rows that aren't duplicated. For example:
    
```
df_eg = pd.DataFrame({
            'A': [1,2,3,4],
            'B': [2,0,5,2]
        })
df_eg[~(df_eg['B'] > 0)]
```
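As an alternative to the `~` mask, pandas provides `drop_duplicates`, which by default keeps the first occurrence of each row. The sketch below, on a hypothetical toy frame, checks that both approaches give the same result.

```python
import pandas as pd

df_eg = pd.DataFrame({
    "A": [1, 1, 2, 2],
    "B": [3, 3, 4, 5],
})

via_mask = df_eg[~df_eg.duplicated()]  # boolean-mask approach
via_method = df_eg.drop_duplicates()   # built-in equivalent, keeps the first occurrence
print(via_mask.equals(via_method))
```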

In [16]:
df_reduced = df_reduced[~df_reduced.duplicated()]

In [None]:
### Visualize some of the rows in the dataset
If you haven't done so above, you can use the space below to further visualize some of your data. Look specifically at the distribution of features like `test_result`, `incident_no`, and `asset_type`.

**Specific questions to consider**

1. After looking at the distributions of features, to what extent might those features help your model? Is there anything you can deduce from those distributions that might be helpful in better understanding your data? 

2. Should you use all the data? What features should you use?

3. What month has the highest count of incidents?

Use the cells below to visualize your data and answer these and other questions that might be of interest to you. Insert and delete cells where needed.

#### <span style="color: blue;">Project presentation: Include a summary of your answers to these and similar questions in your project presentations.</span>

Use `sns.barplot` ([documentation](https://seaborn.pydata.org/generated/seaborn.barplot.html)) to plot the distribution of `test_result`.

In [None]:
**Question:** What month contains the highest count of incidents?

**Hint**:  
1. Use `pd.to_datetime` to convert the `review_date` column to a datetime column.  
2. Use the month from the `review_date` column. You can access it for a datetime column using `<column_name>.dt.month`.
3. Use `groupby` to count the records per month, then `idxmax` to find the month with the highest count.
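The three steps in the hint can be sketched as follows. The data is a hypothetical toy incident log, and `incident_date` is an assumed column name used only for illustration; substitute the date column from your own dataset.

```python
import pandas as pd

# Hypothetical incident log; `incident_date` is an assumed column name
df_toy = pd.DataFrame({
    "incident_no": ["INC-1", "INC-2", "INC-3", "INC-4"],
    "incident_date": ["2024-01-15", "2024-03-02", "2024-03-20", "2024-07-09"],
})

# 1. Convert the text column to datetime
df_toy["incident_date"] = pd.to_datetime(df_toy["incident_date"])

# 2. Group by month number and count the records in each month
per_month = df_toy.groupby(df_toy["incident_date"].dt.month).size()

# 3. idxmax returns the index label (the month) with the largest count
busiest_month = per_month.idxmax()
print(busiest_month)
```

Note that grouping by month number alone pools the same month across different years; group by year and month together if that distinction matters.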


### Cleaning data