# Draft analysis 

---

Group name:

---


## Introduction

Diabetic retinopathy is a serious illness that is expected to affect more than 200 million people by the year 2025 [1]. It is an eye disease that causes blindness in over 10,000 people with diabetes per year [2]. To help these patients, Boehringer Ingelheim is investing in research and development of biopharmaceuticals and is screening for new active ingredients that have the potential to slow or even stop the progression of this disease [3]. A unique characteristic of these medications is the intravitreal application, which means that the drug product is injected directly into the vitreous humor, the gel-like substance inside the eye (see picture below).

<img src="IntravitrealApplication.png" alt="Example image" width="400">

*Picture: Illustration of the ocular anatomy and intravitreal injection for the treatment of ocular diseases [4]*



One major challenge in developing drug products for intravitreal application is the requirement for a low viscosity of the drug product solution. Viscosity is a measure of how strongly a fluid resists flow; thicker liquids like honey have high viscosity, while thinner ones like water have low viscosity. It reflects the internal resistance of a liquid's molecules to movement or flow. A high viscosity of the drug product solution in the syringe means that a higher injection force is needed to apply the medication into the eye. The European Pharmacopoeia (EP) provides specific guidelines regarding the viscosity of intravitreally applied biopharmaceuticals to ensure safe and effective injection [5].
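To make the link between viscosity and injection force concrete, the following sketch estimates the plunger force with the Hagen-Poiseuille equation for laminar flow through a needle. It is a rough illustration only: the needle and syringe dimensions, the flow rate, and the function name `injection_force_newton` are assumptions, not values from this study, and plunger friction is ignored.

```python
import math


def injection_force_newton(viscosity_mpas, flow_rate_ml_s=0.01,
                           needle_length_mm=13.0,
                           needle_inner_diameter_mm=0.16,
                           plunger_diameter_mm=4.7):
    """Estimate the plunger force in N (all default dimensions are assumed)."""
    mu = viscosity_mpas * 1e-3                      # mPa*s -> Pa*s
    flow = flow_rate_ml_s * 1e-6                    # mL/s -> m^3/s
    length = needle_length_mm * 1e-3                # mm -> m
    radius = needle_inner_diameter_mm * 1e-3 / 2    # mm -> m
    plunger_area = math.pi * (plunger_diameter_mm * 1e-3 / 2) ** 2

    # Hagen-Poiseuille: pressure drop across the needle for laminar flow
    pressure_drop = 8 * mu * length * flow / (math.pi * radius ** 4)

    # The force on the plunger is pressure times plunger cross-section
    return pressure_drop * plunger_area


# Illustrative viscosities in mPa*s
for viscosity in (1.0, 4.0, 28.0):
    print(f"{viscosity:5.1f} mPas -> {injection_force_newton(viscosity):.2f} N")
```

Under these assumptions the force grows linearly with viscosity, so doubling the viscosity doubles the force needed at a given injection speed.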

For this reason, viscosity is a very important quantity and is determined several times during the early development stage of every new product. It is tested under different experimental conditions, such as temperature and product concentration. To shorten the development time up to the commercial launch of a new drug product and to reduce costs for laboratory equipment and personnel, the long-term goal is to predict the viscosity of every new agent without any laboratory experiments.

The data set I want to explore consists of viscosity data, where each observation corresponds to one measured value. The data were collected as part of a characterization study for various biopharmaceutical products. These products consist of different types of proteins (IgG2, IgG4, Knob/Hole, DoppelMab), which differ in characteristics such as molecular weight, isoelectric point, and extinction coefficient. To determine the effect of product concentration on viscosity, each product was measured at two concentrations (10 mg/mL and 62.5 mg/mL). Furthermore, viscosity was measured at different temperatures (2 °C to 40 °C) to assess the impact of temperature variations. The data set consists of the following variables:



| Name | Description | Role | Type | Format |
|------|-------------|------|------|--------|
| viscosity_mpas | Measured viscosity of the sample in mPa·s | response | numeric | float |
| replicate | Replicate number. Each measurement was conducted twice as technical replicates | ID | numeric | int |
| entered_on | Date on which the measurement was conducted | predictor | numeric | date |
| instrument | Instrument used to measure the viscosity | predictor | nominal | category |
| temperature | Temperature at which the measurement was conducted, in °C | predictor | numeric | float |
| product_concentration_mg_ml | Concentration of the product in the aqueous solution in mg/mL | predictor | numeric | float |
| product | Internal product name as a unique code | ID | nominal | category |
| protein_format | Protein format of the investigated product | predictor | nominal | category |
| molecular_weight_kda | Molecular weight of the investigated product in kDa. A measure of the size of the protein | predictor | numeric | float |
| extinction_coefficient_l_molcm | Extinction coefficient of the investigated product in L·mol⁻¹·cm⁻¹. A measure of the light absorption ability of the molecule | predictor | numeric | float |
| isoelectric_point | Isoelectric point of the investigated product. A measure of the charge of the molecule | predictor | numeric | float |

In this work, the impact of different experimental conditions on the measured viscosity of the drug product is examined. Variables such as the temperature and characteristics of the product solution (product concentration, protein format, molecular weight, and others) are considered as predictors that might have an impact on the response variable. After analysing the relationship between the variables, a model will be fitted, which makes further investigations possible.



## Setup

In [171]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
import seaborn as sns
import matplotlib.pyplot as plt
import altair as alt

## Data

### Import data

In [172]:
df = pd.read_csv("viscosity_data.csv", sep=";")


### Data structure

In [173]:
df.head()

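|   | viscosity_mpas | replicate | entered_on | instrument | temperature | product_concentration_mg_ml | product | protein_format | molecular_weight_kda | extinction_coefficient_l_molcm | isoelectric_point |
|---|----------------|-----------|------------|------------|-------------|-----------------------------|---------|----------------|----------------------|--------------------------------|-------------------|
| 0 | 3.93 | 1 | 15.03.2019 | VISCOSIMETER_02 | 2 | 10.0 | BI655300 | IgG2 | 148830 | 220.42 | 8.54 |
| 1 | 4.28 | 2 | 16.03.2019 | VISCOSIMETER_02 | 2 | 10.0 | BI655300 | IgG2 | 148830 | 220.42 | 8.54 |
| 2 | 3.42 | 1 | 15.03.2019 | VISCOSIMETER_02 | 5 | 10.0 | BI655300 | IgG2 | 148830 | 220.42 | 8.54 |
| 3 | 3.69 | 2 | 15.03.2019 | VISCOSIMETER_02 | 5 | 10.0 | BI655300 | IgG2 | 148830 | 220.42 | 8.54 |
| 4 | 2.89 | 1 | 15.03.2019 | VISCOSIMETER_02 | 10 | 10.0 | BI655300 | IgG2 | 148830 | 220.42 | 8.54 |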


In [174]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 502 entries, 0 to 501
Data columns (total 11 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   viscosity_mpas                  502 non-null    float64
 1   replicate                       502 non-null    int64  
 2   entered_on                      502 non-null    object 
 3   instrument                      502 non-null    object 
 4   temperature                     502 non-null    int64  
 5   product_concentration_mg_ml     502 non-null    float64
 6   product                         502 non-null    object 
 7   protein_format                  502 non-null    object 
 8   molecular_weight_kda            502 non-null    int64  
 9   extinction_coefficient_l_molcm  502 non-null    float64
 10  isoelectric_point               502 non-null    float64
dtypes: float64(4), int64(3), object(4)
memory usage: 43.3+ KB


### Data corrections
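`df.info()` shows `entered_on` stored as `object` (text) and the nominal variables as plain strings. A minimal sketch of possible corrections follows, assuming all dates use the day.month.year pattern seen in `df.head()`. It works on a small stand-in frame (`sample`, a hypothetical name) instead of the full file so that it runs on its own.

```python
import pandas as pd

# Small stand-in for the real data (values copied from df.head())
sample = pd.DataFrame({
    "entered_on": ["15.03.2019", "16.03.2019", "15.03.2019"],
    "instrument": ["VISCOSIMETER_02"] * 3,
    "product": ["BI655300"] * 3,
    "protein_format": ["IgG2"] * 3,
})

# Parse dates written as day.month.year into datetime64 values
sample["entered_on"] = pd.to_datetime(sample["entered_on"], format="%d.%m.%Y")

# Store the nominal variables as pandas categories
for column in ["instrument", "product", "protein_format"]:
    sample[column] = sample[column].astype("category")

print(sample.dtypes)
```

The same statements could be applied to `df` directly once the date format has been checked for all rows.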

### Variable lists

In [175]:
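# Predictors and response used for the linear model
X = df[['temperature', 'product_concentration_mg_ml']]
y = df['viscosity_mpas']

# Name of the response column, used for the plot axes
y_label = 'viscosity_mpas'

# All numeric predictors
list_numeric = ['temperature', 'product_concentration_mg_ml','molecular_weight_kda','extinction_coefficient_l_molcm','isoelectric_point']

# Predictors with few distinct levels, shown as boxplots
# (note: protein_format is nominal, not numeric)
list_numeric_discrete =['temperature', 'product_concentration_mg_ml','protein_format']

# Continuous product properties, shown as scatter plots
list_numeric_continuous = ['molecular_weight_kda','extinction_coefficient_l_molcm','isoelectric_point']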



### Data splitting

In [176]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

## Analysis

### Descriptive statistics

In [177]:
df.describe().T

|   | count | mean | std | min | 25% | 50% | 75% | max |
|---|-------|------|-----|-----|-----|-----|-----|-----|
| viscosity_mpas | 502.0 | 3.894861 | 2.990443 | 0.54 | 2.2 | 3.165 | 4.5475 | 27.87 |
| replicate | 502.0 | 1.5 | 0.500499 | 1.0 | 1.0 | 1.5 | 2.0 | 2.0 |
| temperature | 502.0 | 20.203187 | 12.61065 | 2.0 | 10.0 | 20.0 | 30.0 | 40.0 |
| product_concentration_mg_ml | 502.0 | 36.354582 | 26.275976 | 10.0 | 10.0 | 62.5 | 62.5 | 62.5 |
| molecular_weight_kda | 502.0 | 161211.103586 | 22382.41398 | 146286.0 | 148783.0 | 149601.0 | 155089.0 | 206428.0 |
| extinction_coefficient_l_molcm | 502.0 | 241.135538 | 50.327715 | 201.4 | 207.36 | 220.42 | 236.92 | 355.71 |
| isoelectric_point | 502.0 | 8.094382 | 0.580659 | 6.97 | 7.75 | 8.27 | 8.42 | 9.36 |


### Exploratory data analysis

In [178]:
charts1 = []  # list that collects the individual boxplots

for x in list_numeric_discrete:
    boxplot = (
        alt.Chart(df)
        .mark_boxplot()
        .encode(
            x=alt.X(x, title=x),
            y=alt.Y(y_label, title=y_label)
        )
        .properties(
            title=f'Impact of {x}',
            width=300,
            height=300
        )
    )
    charts1.append(boxplot)

# Concatenate once, after all charts have been created
final_chart1 = alt.hconcat(*charts1)


In [179]:
charts2 = []  # list that collects the individual scatter plots

for x in list_numeric_continuous:
    chart = (
        alt.Chart(df)
        .mark_circle(size=60)
        .encode(
            x=alt.X(x, title=x),
            y=alt.Y(y_label, title=y_label),
            tooltip=[x, y_label]
        )
        .properties(
            title=f'Impact of {x}',
            width=300,
            height=300
        )
    )
    charts2.append(chart)

# Concatenate once, after all charts have been created
final_chart2 = alt.hconcat(*charts2)

In [180]:
final_chart = alt.vconcat(final_chart1, final_chart2)
final_chart

In [181]:
df_predictors = df[['viscosity_mpas','temperature', 'product_concentration_mg_ml','molecular_weight_kda','extinction_coefficient_l_molcm', 'isoelectric_point']]

### Relationships

In [187]:
corr = df_predictors.corr(method='pearson').round(2)

In [183]:
corr

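|   | viscosity_mpas | temperature | product_concentration_mg_ml | molecular_weight_kda | extinction_coefficient_l_molcm | isoelectric_point |
|---|----------------|-------------|-----------------------------|----------------------|--------------------------------|-------------------|
| viscosity_mpas | 1.0 | -0.5 | 0.43 | 0.36 | 0.35 | -0.16 |
| temperature | -0.5 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| product_concentration_mg_ml | 0.43 | 0.0 | 1.0 | -0.0 | -0.0 | -0.0 |
| molecular_weight_kda | 0.36 | 0.0 | -0.0 | 1.0 | 0.98 | -0.32 |
| extinction_coefficient_l_molcm | 0.35 | 0.0 | -0.0 | 0.98 | 1.0 | -0.19 |
| isoelectric_point | -0.16 | 0.0 | -0.0 | -0.32 | -0.19 | 1.0 |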
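
The correlation matrix can also be screened programmatically for strongly correlated predictor pairs, which matter for a linear model because collinear predictors carry nearly the same information. The sketch below uses three predictors with values copied from the matrix above. The threshold of 0.8 and the names `corr_sample` and `high_pairs` are assumptions for illustration, not a fixed rule.

```python
import pandas as pd

predictors = [
    "molecular_weight_kda",
    "extinction_coefficient_l_molcm",
    "isoelectric_point",
]

# Pearson correlations copied from the matrix above
corr_sample = pd.DataFrame(
    [[1.0, 0.98, -0.32],
     [0.98, 1.0, -0.19],
     [-0.32, -0.19, 1.0]],
    index=predictors,
    columns=predictors,
)

threshold = 0.8  # arbitrary cut-off, chosen for illustration

# Visit each unordered pair once and keep those above the threshold
high_pairs = [
    (a, b, float(corr_sample.loc[a, b]))
    for i, a in enumerate(predictors)
    for b in predictors[i + 1:]
    if abs(corr_sample.loc[a, b]) > threshold
]
print(high_pairs)
```

With these values only molecular weight and extinction coefficient exceed the threshold, which suggests that one of the two may be sufficient if product properties are added to the model later.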


## Model

### Select model

In [184]:
model = LinearRegression()

### Training and validation

In [185]:
model.fit(X_train, y_train)

### Fit model

### Evaluation on test set

In [186]:
y_pred = model.predict(X_test)
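
`mean_squared_error` is imported in the setup but not yet used. A possible way to quantify the test-set performance is sketched below. To keep the example self-contained it fits the same kind of model on synthetic data; the synthetic coefficients, the noise level, and all `*_demo` names are assumptions and do not reflect the real measurements.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: temperature in degrees C and concentration in mg/mL
rng = np.random.default_rng(42)
temperature = rng.uniform(2, 40, size=200)
concentration = rng.choice([10.0, 62.5], size=200)
X_demo = np.column_stack([temperature, concentration])
y_demo = (5.0 - 0.1 * temperature + 0.05 * concentration
          + rng.normal(0, 0.5, size=200))

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)
demo_model = LinearRegression().fit(X_tr, y_tr)
y_hat = demo_model.predict(X_te)

mse = mean_squared_error(y_te, y_hat)   # mean squared error
rmse = float(np.sqrt(mse))              # same unit as the response
r2 = r2_score(y_te, y_hat)              # share of variance explained
print(f"MSE: {mse:.3f}  RMSE: {rmse:.3f}  R2: {r2:.3f}")
```

With the notebook's own objects, the corresponding calls would be `mean_squared_error(y_test, y_pred)` and `r2_score(y_test, y_pred)`.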

### Save model



Save your model in the folder `models/`. Use a meaningful name and a timestamp.
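
One way to follow this instruction is `joblib.dump`. The sketch below is only an assumption about how the saving step could look: the helper `save_model` and the file name pattern are hypothetical, and the demo writes into a temporary directory instead of `models/` so that it leaves no files behind.

```python
import tempfile
from datetime import datetime
from pathlib import Path

import joblib
import numpy as np
from sklearn.linear_model import LinearRegression


def save_model(model, folder, name="viscosity_linear_regression"):
    """Save a fitted model as <name>_<timestamp>.joblib and return the path."""
    folder = Path(folder)
    folder.mkdir(parents=True, exist_ok=True)
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    path = folder / f"{name}_{timestamp}.joblib"
    joblib.dump(model, path)
    return path


# Tiny demo model; in the notebook this would be the fitted `model`
demo_model = LinearRegression().fit(
    np.array([[2.0, 10.0], [40.0, 62.5], [20.0, 10.0]]),
    np.array([4.0, 3.0, 2.5]),
)

with tempfile.TemporaryDirectory() as tmp:
    saved_path = save_model(demo_model, Path(tmp) / "models")
    reloaded = joblib.load(saved_path)
    same_coefficients = bool(np.allclose(reloaded.coef_, demo_model.coef_))

print(saved_path.name, same_coefficients)
```

In the notebook, `save_model(model, "models")` would write the file into the `models/` folder.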

## Conclusions