# Illumina QC prediction run {{ RUN_ID }}
* **Notebook version:** v0.0.1
* **Created by:** NIHR Imperial BRC Genomics Facility
* **Maintained by:** NIHR Imperial BRC Genomics Facility
* **Docker image path:** [Dockerfile](https://github.com/imperial-genomics-facility/igf-dockerfiles/tree/main/illumina-interop/Dockerfile_v1)
* **Notebook code path:** [Templates](https://github.com/imperial-genomics-facility/igf-dockerfiles/tree/main/illumina-interop/templates)
* **Created on:** {{ DATE_TAG }}
* **Contact us:** [NIHR Imperial BRC Genomics Facility - Contact us](https://www.imperial.ac.uk/medicine/research-and-impact/facilities/genomics-facility/contact-us/)
* **License:** [Apache License 2.0](https://github.com/imperial-genomics-facility/igf-dockerfiles/blob/main/LICENSE)

Send us your suggestions (or PRs) about how to improve this notebook.

Please add the following statement to all publications if you use any part of this notebook for your analysis: _“The NIHR Imperial BRC Genomics Facility has provided resources and support that have contributed to the research results reported within this paper.”_

## Input files and settings used:
  * Model path: {{ MODEL_PATH }}
  * Tile data for run: {{ PARQUET_PATH }}
  * Number of CPUs to use: {{ NUM_CPU }}

In [None]:
## import libraries
import pickle
import pandas as pd
from pyspark.conf import SparkConf
from pyspark.context import SparkContext
from pyspark.sql import SparkSession
from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer
import altair as alt
from IPython.display import HTML
alt.renderers.enable("html")

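## load the pickled model; the context manager closes the file handle
file_name = "{{ MODEL_PATH }}"
with open(file_name, "rb") as model_file:
    model = pickle.load(model_file)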

## load parquet data
conf = SparkConf()
conf = \
    conf.\
    setMaster("local[{{ NUM_CPU }}]").\
    setAppName("InterOpReport").\
    set("spark.log.level", "OFF").\
    set("spark.driver.extraJavaOptions", "-Dlog4j.logger.org=OFF").\
    set("spark.sql.execution.arrow.pyspark.enabled", "true").\
    set("spark.executor.memory", "{{ RAM_GB }}g").\
    set("spark.executor.cores", "{{ NUM_CPU }}")
sc = SparkContext(conf=conf)
## reuse the SparkContext created above
spark = SparkSession.builder.getOrCreate()
pred_df = spark.read.parquet('{{ PARQUET_PATH }}')

## convert to Pandas DF
pred_pdf = pred_df.toPandas()
## change column type
pred_pdf = \
    pred_pdf.astype({
        'PCT_ClusterCountPF': float,
        'PCT_DensityPF': float,
        'PCT_Occupied': float})
## get subset of columns for prediction
X_pred = pred_pdf[['PCT_ClusterCountPF', 'PCT_DensityPF',
         'mean_CalledCount_A', 'mean_CalledCount_T', 'mean_CalledCount_G',
         'mean_CalledCount_C', 'PCT_Q30', 'PCT_Occupied',
         'intensity_c1', 'slope_p', 'offset_p', 'slope_pr', 'offset_pr']]
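## scale the called-count columns (the scaler is fitted on this run's data);
## the output holds the scaled columns first, then the passthrough columns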
ct = ColumnTransformer(
        [("scaling",
          StandardScaler(), 
          ['mean_CalledCount_A', 
           'mean_CalledCount_T', 
           'mean_CalledCount_G',
           'mean_CalledCount_C'])],
        remainder='passthrough')
X_pred = ct.fit_transform(X_pred)
## predict tile failure status (stored below as is_failed)
y_pred = model.predict(X_pred)
## add labels back to Pandas DF for plotting
pred_pdf['is_failed'] = y_pred
## split the tile number into its first digit (h_x) and remaining digits (h_y)
## to lay the tiles out on a grid for the heatmap
tile_str = pred_pdf['Tile'].astype(int).astype(str)
pred_pdf['h_x'] = tile_str.str[0].astype(int)
pred_pdf['h_y'] = tile_str.str[1:].astype(int)

display(HTML("<h2>Prediction status of Illumina flowcell</h2>"))
for lane, l_data in pred_pdf.groupby('Lane'):
    chart1 = \
            alt.Chart(l_data).mark_rect().encode(
                x=alt.X('h_y:O', axis=alt.Axis(labels=False, ticks=False, title=None)),
                y=alt.Y('h_x:O', axis=alt.Axis(labels=False, ticks=False, title=None)),
                color=alt.Color('is_failed:O').scale(scheme='lightgreyred'),
                tooltip=['Lane:N', 'Tile:O', 'is_failed:O']
            ).configure_view(
                step=200,
                strokeWidth=5
            ).configure_axis(
                domain=False
            ).properties(
                title=f'Prediction of Lane {lane}',
                width=1080,
                height=100
            )
    display(chart1)

## Get percentage of tiles failed
display(HTML("<h2>PCT tiles failed</h2>"))
pct_failed = (pred_pdf['is_failed'] == 1).mean() * 100
print(f"PCT tiles failed: {pct_failed:.2f}%")
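
A single run-wide figure can hide one bad lane. The following is a minimal sketch of a per-lane breakdown, assuming `is_failed` holds 0/1 labels as produced by the prediction step. The helper `pct_failed_by_lane` and the toy frame `toy_pdf` are hypothetical names used only for illustration and are not part of the notebook.

```python
import pandas as pd


def pct_failed_by_lane(df: pd.DataFrame) -> pd.Series:
    """Return the percentage of tiles flagged as failed in each lane."""
    # is_failed is assumed to hold 0/1 labels, so its mean is the failed fraction
    return df.groupby('Lane')['is_failed'].mean().mul(100).round(2)


# hypothetical toy data standing in for pred_pdf
toy_pdf = pd.DataFrame({
    'Lane': [1, 1, 1, 1, 2, 2],
    'Tile': [1101, 1102, 1103, 1104, 1101, 1102],
    'is_failed': [0, 1, 0, 0, 1, 1]})

lane_summary = pct_failed_by_lane(toy_pdf)
print(lane_summary)
```

In the notebook itself, calling `pct_failed_by_lane(pred_pdf)` would give the same breakdown for the real run.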