# Programming For Data Analytics Project


### PROBLEM STATEMENT

The fictitious pharmaceutical company ‘Ollopa Pharmaceuticals’ manufactures the active ingredient for its small-molecule blockbuster drug ‘Asclepius’ by organic synthesis in a processing plant. The Manufacturing Science Group has been asked to optimise the production process to increase the yield of the active ingredient. The theoretical yield is between 28 and 32 kg per batch based on standard inputs, and management would like to see yields approach the maximum theoretical yield of 32 kg. The Manufacturing Science Group has identified four process variables and would like to investigate whether any of them correlates with the yield.

### 1. CREATE THE VARIABLES

**Critical Quality Attributes (CQAs)**

| Yield | Max Reaction Temp | Pre-Crystallisation pH | Crystallisation Hold Time | Solvent Exchange IPC Result |
|:-------:|:-----------------:|:--------------------:|:-----------------------:|:-------------------------:|
| 28-32 kg | 15-25 $^\circ$C | 6.5-7.5 | 3.5-4 hrs | 0-0.5% |

The Critical Quality Attributes are the parameters, defined by quality and agreed with the regulatory authorities, that must be adhered to for the active ingredient to meet predefined quality standards. A batch whose manufacturing process deviates from these CQAs could be scrapped or re-worked. Such events are outside the scope of this study, so the dataset investigated here is intended to lie within the CQA limits above.
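One way to make the "within the CQA limits" requirement explicit is to check values against the ranges in the table above. The sketch below is illustrative only: the `CQA_LIMITS` dictionary and the `within_limits` helper are hypothetical names that do not appear in the original notebook, and the three pH readings are made up for the example.

```python
import numpy as np

# Limits copied from the CQA table above (hypothetical dictionary name).
CQA_LIMITS = {
    "Yield (kg)": (28.0, 32.0),
    "Max Reaction Temp (C)": (15.0, 25.0),
    "Pre-Crystallisation pH": (6.5, 7.5),
    "Crystallisation Hold Time (h)": (3.5, 4.0),
    "Solvent Exchange IPC Result (%)": (0.0, 0.5),
}


def within_limits(values, low, high):
    """Return a boolean array: True where low <= value <= high."""
    values = np.asarray(values, dtype=float)
    return (values >= low) & (values <= high)


# Three made-up pH readings; the last one is out of range.
ph_low, ph_high = CQA_LIMITS["Pre-Crystallisation pH"]
mask = within_limits([6.8, 7.2, 7.6], ph_low, ph_high)
print(mask)        # [ True  True False]
print(mask.all())  # False
```

Applying the same check to each simulated array would show whether any generated batch falls outside its CQA range.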

To create the dataset, we will generate 100 data points for each CQA (or variable); that is, 100 batches will be studied, each of which has a value for every variable.

The effect or contribution (if any) of these variables to the batch yield will be investigated.




In [31]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

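# Number of observations (batches)

size = 100

# Simulate each variable as normally distributed
# (loc = mean, scale = standard deviation), rounded to 2 decimal places

Yield = np.round(np.random.normal(loc=30, scale=0.25, size=size), 2)  # kg

MaxRxnTemp = np.round(np.random.normal(loc=20, scale=0.5, size=size), 2)  # degrees C

pH = np.round(np.random.normal(loc=7.0, scale=0.25, size=size), 2)

# Hours; note that a mean of 2 lies below the 3.5-4 hr range in the CQA table
HoldTime = np.round(np.random.normal(loc=2, scale=0.25, size=size), 2)

IPC = np.round(np.random.normal(loc=0.25, scale=0.025, size=size), 2)  # %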
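Because `np.random.normal` draws fresh values on every run, re-executing the cell above will not reproduce the numbers shown in this notebook. A minimal sketch of a reproducible variant follows, assuming NumPy's `default_rng` generator; the seed value of 42 and the `rng` name are arbitrary choices for illustration, and the resulting values will differ from those printed below.

```python
import numpy as np

size = 100  # number of batches, as above

# Seeded generator: the same seed always produces the same draws.
rng = np.random.default_rng(seed=42)  # seed value is an arbitrary assumption

Yield = np.round(rng.normal(loc=30, scale=0.25, size=size), 2)
MaxRxnTemp = np.round(rng.normal(loc=20, scale=0.5, size=size), 2)
pH = np.round(rng.normal(loc=7.0, scale=0.25, size=size), 2)
HoldTime = np.round(rng.normal(loc=2, scale=0.25, size=size), 2)
IPC = np.round(rng.normal(loc=0.25, scale=0.025, size=size), 2)

print(Yield.shape)
```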



In [32]:

# Print each variable with its label
print('Yield =', Yield)
print('pH =', pH)
print('Max Reaction Temp(C) =', MaxRxnTemp)
print('Hold Time =', HoldTime)
print('IPC Result % =', IPC)

Yield = [29.77 30.09 29.93 30.   29.85 29.61 30.13 30.3  29.99 29.85 29.7  30.21
 30.23 29.71 29.9  29.91 29.92 29.76 29.39 30.04 29.56 30.29 29.82 29.52
 29.89 29.57 29.86 29.85 30.29 29.76 30.37 30.01 29.94 29.82 30.12 29.93
 30.16 29.61 30.42 29.65 29.86 30.17 30.17 30.05 29.95 30.2  30.43 30.19
 30.02 29.49 30.1  30.43 29.9  29.76 30.41 29.68 30.28 30.32 29.97 29.84
 30.18 30.24 30.08 30.39 29.82 30.07 29.95 29.68 30.49 30.68 29.94 29.98
 30.01 30.26 30.08 29.98 30.19 29.97 30.14 30.22 29.82 30.24 29.88 30.04
 29.96 30.23 30.13 29.66 30.23 30.45 30.47 29.94 29.86 30.39 29.95 30.15
 30.05 29.71 30.09 30.57]
pH = [7.12 6.69 6.94 7.23 7.17 6.99 7.42 6.88 6.97 6.78 6.78 7.41 7.35 7.45
 6.61 7.13 7.24 7.09 6.44 7.04 7.52 7.02 7.18 7.19 6.84 7.04 7.04 6.69
 6.97 7.17 6.91 6.89 7.15 6.74 6.82 7.03 7.3  6.86 7.07 ...
Max Reaction Temp(C) = [19.77 20.29 20.25 20.87 20.03 19.95 20.13 19.43 19.96 19.99 19.46 20.43
 19.62 19.74 19.8  20.   20.31 20.97 18.81 19.57 19.76 19.58 19.56 20.6
 19.68 19.56 20.13 19.84 19.46 20.03 19.99 20.78 20.72 19.43 19.74 19.76
 20.43 19.15 20.69 20.02 20.03 19.43 19.86 20.63 19.95 20.38 19.33 20.12
 20.48 20.1  20.76 20.38 20.17 19.16 20.64 19.56 20.13 19.86 20.11 19.62
 20.32 19.92 19.39 18.95 20.33 20.54 20.2  20.06 19.88 20.72 20.54 19.7
 19.16 20.   19.5  19.39 19.81 19.67 20.44 19.34 19.24 20.31 20.86 19.58
 20.39 19.67 19.94 20.4  20.53 21.07 20.53 19.67 20.82 19.26 20.69 20.08
 19.78 19.67 19.89 19.47]
Hold Time = [2.16 1.85 1.87 1.89 2.18 1.84 2.31 2.38 1.95 1.53 1.59 1.58 1.77 2.28
 1.87 2.09 2.27 1.77 1.32 2.   ...
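Printing 100 values per variable is hard to read at a glance. As an alternative, a short summary (mean, sample standard deviation, minimum, maximum) can be printed for each variable. The sketch below generates stand-in data with a seeded generator so that it runs on its own; the `summarise` helper and the seed are hypothetical and not part of the original notebook.

```python
import numpy as np


def summarise(values):
    """Return mean, sample standard deviation, min and max of an array."""
    values = np.asarray(values, dtype=float)
    return {
        "mean": values.mean(),
        "std": values.std(ddof=1),
        "min": values.min(),
        "max": values.max(),
    }


# Stand-in data so the example is self-contained (seed is arbitrary).
rng = np.random.default_rng(0)
variables = {
    "Yield": np.round(rng.normal(30, 0.25, 100), 2),
    "pH": np.round(rng.normal(7.0, 0.25, 100), 2),
}

for name, values in variables.items():
    stats = summarise(values)
    print(f"{name}: mean={stats['mean']:.2f} std={stats['std']:.2f} "
          f"min={stats['min']:.2f} max={stats['max']:.2f}")
```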

### 2. CREATE A DATAFRAME OF ALL VARIABLES

**The five arrays will now be combined into a single DataFrame**

In [33]:
# Map each column label to its array of simulated values
data = {
    "Yield(kg)": Yield,
    "pH": pH,
    "Max Reaction Temp(C)": MaxRxnTemp,
    "Hold Time (Hrs:Mins)": HoldTime,
    "IPC Result(%)": IPC,
}
Batchdf = pd.DataFrame(data)  # column order follows the dict's insertion order

Batchdf

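|   | Yield(kg) | pH | Max Reaction Temp(C) | Hold Time (Hrs:Mins) | IPC Result(%) |
|:-:|:---------:|:--:|:--------------------:|:--------------------:|:-------------:|
| 0 | 29.77 | 7.12 | 19.77 | 2.16 | 0.29 |
| 1 | 30.09 | 6.69 | 20.29 | 1.85 | 0.22 |
| 2 | 29.93 | 6.94 | 20.25 | 1.87 | 0.22 |
| 3 | 30.00 | 7.23 | 20.87 | 1.89 | 0.25 |
| 4 | 29.85 | 7.17 | 20.03 | 2.18 | 0.25 |
| 5 | 29.61 | 6.99 | 19.95 | 1.84 | 0.23 |
| 6 | 30.13 | 7.42 | 20.13 | 2.31 | 0.24 |
| 7 | 30.30 | 6.88 | 19.43 | 2.38 | 0.27 |
| 8 | 29.99 | 6.97 | 19.96 | 1.95 | 0.25 |
| 9 | 29.85 | 6.78 | 19.99 | 1.53 | 0.28 |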


**We have now created an artificial dataset, stored in the DataFrame `Batchdf`**
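Since the stated aim is to look for correlation between the process variables and the yield, one possible next step is a correlation matrix. The sketch below rebuilds a DataFrame with the same column labels from seeded stand-in data so that it can run on its own; the seed and the `yield_corr` name are assumptions for illustration. Because every column here is drawn independently, the coefficients are expected to be small.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)  # arbitrary seed, for reproducibility only
size = 100

Batchdf = pd.DataFrame({
    "Yield(kg)": np.round(rng.normal(30, 0.25, size), 2),
    "pH": np.round(rng.normal(7.0, 0.25, size), 2),
    "Max Reaction Temp(C)": np.round(rng.normal(20, 0.5, size), 2),
    "Hold Time (Hrs:Mins)": np.round(rng.normal(2, 0.25, size), 2),
    "IPC Result(%)": np.round(rng.normal(0.25, 0.025, size), 2),
})

# Pearson correlation of each process variable with the yield column.
yield_corr = Batchdf.corr()["Yield(kg)"].drop("Yield(kg)")
print(yield_corr.round(3))
```

`DataFrame.corr()` uses the Pearson coefficient by default, which only captures linear relationships.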