In [1]:
# --- CSS STYLE ---
from IPython.display import HTML
def css_styling():
    # Read the stylesheet and wrap it in a <style> tag so the notebook renders it
    with open("../input/2020-cost-of-living/alerts.css", "r") as f:
        styles = f.read()
    return HTML("<style>"+styles+"</style>")
css_styling()

The cell above **embeds our custom CSS (as HTML) into the IPython output**.

<img src="https://i.imgur.com/YcUJEhW.png">

<center><h1>SIIM-FISABIO-RSNA COVID-19 Detection</h1></center>

This is **an object detection and classification problem**, meaning that for each instance we'll have to *predict* a **bounding box** and a **class**.

<div class="alert success-alert">
📌 <b>Project Goal</b>: Categorize chest radiographs as negative for pneumonia, typical, indeterminate, or atypical for COVID-19. If some abnormalities are found, provide the bounding boxes.
</div>

OK! On chest radiographs, COVID-19 symptoms look very similar to those of other viral or bacterial pneumonias, so it's much harder to diagnose correctly (and, I might add, quickly).

### ⬇️ Libraries
* Link to my W&B Dashboard here: https://wandb.ai/mehreet/siim-covid19?workspace=user-mehreet
* How to use W&B: [Experiment Tracking with Weights and Biases](https://www.kaggle.com/ayuraj/experiment-tracking-with-weights-and-biases)

# LIBRARIES :
Now we import the libraries needed in this notebook:


- **os**: access operating-system functionality (environment variables, file paths) from Python.
- **re (regular expressions)**: a regular expression specifies a set of strings that match it; the functions in this module let you check whether a particular string matches a given regular expression.
- **wandb (Weights & Biases)**: a Python package that lets us monitor our training in real time. It covers:
    - Experiment tracking
    - Dataset versioning
    - Model management
    - Prediction visualization

    With it you can keep all your results in one place, visualize and share them, and log metrics directly to W&B, which then shows the metrics and performance in a dashboard.

- **tqdm**: a Python library for creating progress bars.
- **warnings**: warns the developer about situations that aren't necessarily exceptions.
- **glob**: finds files whose names match a pattern containing wildcard characters.
- **ast**: `ast.literal_eval` safely parses a string containing a Python literal; here it turns the bounding boxes stored as strings back into lists of dictionaries.
- **cv2**: the OpenCV Python bindings, designed to solve computer vision problems. `cv2.imread()` loads an image from the specified file. If the image cannot be read (missing file, improper permissions, unsupported or invalid format), the Python binding returns `None` instead of raising an error.
- **math**: mathematical functions.
- **pandas**: data wrangling and analysis: cleaning, transforming, manipulating and analyzing data.
- **numpy**: a Python library for working with arrays.
- **IPython.display import display_html**: embeds rendered HTML into the IPython output.
- **seaborn**: statistical visualization.
- **matplotlib**: visualization.
- **import matplotlib.patches as patches**: patches are arbitrary two-dimensional regions. There are many convenient wrappers and helpers, such as Rectangles, Circles, Boxes and Ellipses.
- **import matplotlib.pyplot as plt**: a collection of command-style functions that make matplotlib work like MATLAB. Each pyplot function makes some change to a figure: e.g., creates a figure, creates a plotting area in a figure, plots some lines in a plotting area, or decorates the plot with labels.
- **from scipy.stats import pearsonr**: the Pearson correlation coefficient measures the linear relationship between two datasets. Like other correlation coefficients, it varies between -1 and +1, with 0 implying no correlation. Correlations of -1 or +1 imply an exact linear relationship. Positive correlations imply that as x increases, so does y. Negative correlations imply that as x increases, y decreases.
- **from matplotlib.offsetbox import AnnotationBbox, OffsetImage**: an OffsetBox is a simple container artist whose child artists are drawn at a position relative to their parent. Being an artist itself, it passes all parameters on to Artist. `AnnotationBbox` creates an annotation using an OffsetBox and provides more fine-grained control than `Axes.annotate`.

- **pydicom**: a pure Python package for working with DICOM files such as medical images, reports, and radiotherapy objects. It is designed to let you manipulate data elements in DICOM files with Python code. DICOM files are images that come digitally from medical scans, such as MRIs and ultrasounds. You can view these files with a free online viewer called Jack Image viewer on any computer.
- **from pydicom.pixel_data_handlers.util import apply_voi_lut**: LUT means Look-Up Table. There are 4 types of LUT within DICOM (source: https://www.medicalconnections.co.uk/kb/lookup-tables/):
    1. Modality LUT: the purpose of the "Modality LUT" step in the theoretical grayscale pipeline is to map stored pixel values to some kind of meaningful physical unit. This may be a linear mapping or a lookup table. Not all image objects support this feature; it depends on the modality, since some modalities like CT have a meaningful physical unit space.
    2. Identity (no LUT)
    3. VOI LUT
    4. Presentation LUT

    **VOI LUT Sequence**: the Value Of Interest (VOI) LUT transformation turns the modality pixel values into pixel values that are meaningful for the user or the application. The VOI LUT is described by the VOI LUT Sequence, which must contain one or more Items. It is required if Window Center (0028,1050) is not present, and may be present otherwise.

    **Windowing**: this can be linear or sigmoid. Both types are defined by the Window Center (0028,1050) and Window Width (0028,1051) values, but the resulting shapes differ at the extremes of the range.
- **from sklearn.cluster import KMeans**: the K-means clustering algorithm.
- **from skimage import morphology, measure**: image morphology and region measurement tools. For example, `skimage.morphology.binary_opening(image, footprint=None, out=None)` returns the fast binary morphological opening of an image; it gives the same result as grayscale opening but is faster for binary images.
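
To make the windowing description above more concrete, here is a minimal NumPy sketch of linear windowing as I understand it from the DICOM standard. The helper name `linear_window` and the sample pixel values are made up for illustration; in this notebook the real work is done by `apply_voi_lut`, so treat this only as a sketch of the idea (it assumes `width > 1`).

```python
import numpy as np

def linear_window(pixels, center, width, y_min=0.0, y_max=255.0):
    """Hypothetical helper: linear DICOM-style windowing (assumes width > 1)."""
    pixels = np.asarray(pixels, dtype=np.float64)
    lower = center - 0.5 - (width - 1) / 2   # at or below this value -> y_min
    upper = center - 0.5 + (width - 1) / 2   # above this value -> y_max
    # Linear ramp between the two thresholds
    scaled = ((pixels - (center - 0.5)) / (width - 1) + 0.5) * (y_max - y_min) + y_min
    out = np.where(pixels <= lower, y_min, scaled)
    out = np.where(pixels > upper, y_max, out)
    return out

# Made-up pixel values, windowed with center=2000 and width=2000
windowed = linear_window([0, 1000, 2000, 3000, 4000], center=2000, width=2000)
print(windowed)
```

Values at or below the lower threshold are clamped to `y_min`, values above the upper threshold to `y_max`, and everything in between is stretched linearly. That is why a narrow window gives high contrast over a small intensity range.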

In [2]:
# Libraries
import os
import re
import wandb
import tqdm
import warnings
import glob
import ast
import cv2
import math
import pandas as pd
import numpy as np
from IPython.display import display_html
import seaborn as sns
import matplotlib as mpl
import matplotlib.patches as patches
import matplotlib.pyplot as plt
from scipy.stats import pearsonr
from matplotlib.offsetbox import AnnotationBbox, OffsetImage
import pydicom
from pydicom.pixel_data_handlers.util import apply_voi_lut
from sklearn.cluster import KMeans
from skimage import morphology, measure

In [3]:
# Environment check
warnings.filterwarnings("ignore")
os.environ["WANDB_SILENT"] = "true"
CONFIG = {'competition': 'siim-fisabio-rsna', '_wandb_kernel': 'aot'}


- **warnings.filterwarnings("ignore")**: suppresses Python warnings.
- **os.environ["WANDB_SILENT"] = "true"**: silences wandb log statements. When this is set, all logs are written to WANDB_DIR/debug.log.
- **CONFIG** is a dictionary that is later passed to `wandb.init` (for example inside the function `save_dataset_artifact`); here we are just setting the key-value pairs.




In [4]:
# Secrets 🤫
from kaggle_secrets import UserSecretsClient
user_secrets = UserSecretsClient()
secret_value_0 = user_secrets.get_secret("mehreet_secret_wand")


**Secrets** are a Kaggle feature for storing sensitive values (API keys, access tokens, etc.) more securely in Kernels!

Here we read the secret key used to call the wandb API into `secret_value_0`.

Next, we **select** some of our **favourite colors** for the seaborn plots.

In [5]:

# Custom colors
my_colors = ["#E97777", "#E9B077", "#E9E977", 
             "#77E977", "#7777E9", "#6F6BAC", "#B677E9"]
sns.palplot(sns.color_palette(my_colors))

# Set Style
sns.set_style("white")
mpl.rcParams['xtick.labelsize'] = 18
mpl.rcParams['ytick.labelsize'] = 18
mpl.rcParams['axes.spines.left'] = False
mpl.rcParams['axes.spines.right'] = False
mpl.rcParams['axes.spines.top'] = False
plt.rcParams.update({'font.size': 22})

class color:
    BOLD = '\033[1m' + '\033[93m'
    END = '\033[0m'

`sns` is seaborn; `my_colors` holds the hex codes of our palette and `sns.palplot` displays them.
**sns.set_style**: sets the parameters that control the general style of the plots.

**Customizing Matplotlib with style sheets and rcParams**: you can dynamically change the default rc (runtime configuration) settings in a Python script or interactively from the Python shell.

All rc settings are stored in a dictionary-like variable called `matplotlib.rcParams`.
 
 

> 📌 **Note**: If the `wandb login` line below throws an error, try using `wandb.login()` instead. It will ask for the API key to log in, which you can get from your W&B profile (click on Profile -> Settings -> scroll to API keys).

In [6]:
! wandb login $secret_value_0

### ⬇️ Handy Functions

In [7]:
def show_values_on_bars(axs, h_v="v", space=0.4):
    '''Plots the value at the end of each bar of a seaborn barplot.
    axs: the ax of the plot
    h_v: whether the barplot is vertical ("v") or horizontal ("h")'''
    
    def _show_on_single_plot(ax):
        if h_v == "v":
            for p in ax.patches:
                _x = p.get_x() + p.get_width() / 2
                _y = p.get_y() + p.get_height()
                value = int(p.get_height())
                ax.text(round(_x, 5), round(_y, 5), format(round(value, 5), ','), ha="center") 
        elif h_v == "h":
            for p in ax.patches:
                _x = p.get_x() + p.get_width() + float(space)
                _y = p.get_y() + p.get_height()
                value = int(p.get_width())
                ax.text(_x, _y, format(value, ','), ha="left")

    if isinstance(axs, np.ndarray):
        for idx, ax in np.ndenumerate(axs):
            _show_on_single_plot(ax)
    else:
        _show_on_single_plot(axs)
        
        
def offset_png(x, y, path, ax, zoom, offset, border=2):
    '''For adding other .png images to the graph.
    source: https://stackoverflow.com/questions/61971090/how-can-i-add-images-to-bars-in-axes-matplotlib'''
    
    img = plt.imread(path)
    im = OffsetImage(img, zoom=zoom)
    im.image.axes = ax
    x_offset = offset
    ab = AnnotationBbox(im, (x, y), xybox=(x_offset, 0), frameon=False,
                        xycoords='data', boxcoords="offset points", pad=0)
    ax.add_artist(ab)
    
    
def get_image_metadata(study_id, df):
    '''Returns the label and bounding boxes (if any)
    for a specific study id.'''
    
    data = df[df["study_id"] == study_id]
    
    if data["Negative for Pneumonia"].values == 1:
        label = "negative_for_pneumonia"
    elif data["Typical Appearance"].values == 1:
        label = "typical"
    elif data["Indeterminate Appearance"].values == 1:
        label = "indeterminate"
    else:
        label = "atypical"
        
    bbox = list(data["boxes"].values)
    
    return label, bbox


def save_dataset_artifact(run_name, artifact_name, path):
    '''Saves dataset to W&B Artifactory.
    run_name: name of the experiment
    artifact_name: under what name should the dataset be stored
    path: path to the dataset'''
    
    run = wandb.init(project='siim-covid19', 
                     name=run_name, 
                     config=CONFIG, anonymous="allow")
    artifact = wandb.Artifact(name=artifact_name, 
                              type='dataset')
    artifact.add_file(path)

    wandb.log_artifact(artifact)
    wandb.finish()
    print("Artifact has been saved successfully.")
    
    
def create_wandb_plot(x_data=None, y_data=None, x_name=None, y_name=None, title=None, log=None, plot="line"):
    '''Create and save lineplot/barplot in W&B Environment.
    x_data & y_data: Pandas Series containing x & y data
    x_name & y_name: strings containing axis names
    title: title of the graph
    log: string containing name of log'''
    
    data = [[label, val] for (label, val) in zip(x_data, y_data)]
    table = wandb.Table(data=data, columns = [x_name, y_name])
    
    if plot == "line":
        wandb.log({log : wandb.plot.line(table, x_name, y_name, title=title)})
    elif plot == "bar":
        wandb.log({log : wandb.plot.bar(table, x_name, y_name, title=title)})
    elif plot == "scatter":
        wandb.log({log : wandb.plot.scatter(table, x_name, y_name, title=title)})
        
        
def create_wandb_hist(x_data=None, x_name=None, title=None, log=None):
    '''Create and save histogram in W&B Environment.
    x_data: Pandas Series containing x values
    x_name: strings containing axis name
    title: title of the graph
    log: string containing name of log'''
    
    data = [[x] for x in x_data]
    table = wandb.Table(data=data, columns=[x_name])
    wandb.log({log : wandb.plot.histogram(table, x_name, title=title)})
    
    
def return_coords(box):
    '''Returns coordinates from a bbox'''
    # Get the list of dictionaries
    box = ast.literal_eval(box)[0]
    # Get the exact x and y coordinates
    x1, y1, x2, y2 = box["x"], box["y"], box["x"] + box["width"], box["y"] + box["height"]
    # Save coordinates
    return (int(x1), int(y1), int(x2), int(y2))


def fix_inverted_radiograms(data, img):
    '''Fixes inverted radiograms - with PhotometricInterpretation == "MONOCHROME1"
    data: the .dcm dataset
    img: the .dcm pixel_array'''
    
    if data.PhotometricInterpretation == "MONOCHROME1":
        img = np.amax(img) - img
    
    img = img - np.min(img)
    img = img / np.max(img)
    img = (img * 255).astype(np.uint8)
    
    return img

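**Function dictionary:**
1. `show_values_on_bars(axs, h_v="v", space=0.4)`: plots the value at the end of each bar of a seaborn barplot. `axs` is the ax (or array of axes) of the plot; `h_v` says whether the barplot is vertical (`"v"`) or horizontal (`"h"`).
2. `_show_on_single_plot(ax)`: inner helper that does the work for a single ax. If `h_v == "v"`, it writes each bar's height above the bar; if `h_v == "h"`, it writes each bar's width to the right of the bar, shifted by `space`.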
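
One thing worth noticing in the handy functions: `return_coords` only reads the first dictionary of a `boxes` string (`[0]`), while an image can have several boxes. As a rough sketch, a variant that keeps every box could look like the code below. The name `all_box_coords` and the sample string are hypothetical; the string just mimics the format of the `boxes` column.

```python
import ast

def all_box_coords(boxes_str):
    """Hypothetical variant of return_coords that keeps every box."""
    coords = []
    # literal_eval turns the string into a list of dictionaries
    for box in ast.literal_eval(boxes_str):
        x1, y1 = box["x"], box["y"]
        x2, y2 = x1 + box["width"], y1 + box["height"]
        coords.append((int(x1), int(y1), int(x2), int(y2)))
    return coords

# Made-up example in the same format as the `boxes` column
sample = ("[{'x': 10.5, 'y': 20.0, 'width': 100.0, 'height': 50.0}, "
          "{'x': 300.0, 'y': 40.0, 'width': 80.0, 'height': 120.0}]")
coords = all_box_coords(sample)
print(coords)  # [(10, 20, 110, 70), (300, 40, 380, 160)]
```

`ast.literal_eval` only evaluates Python literals, so it is a safer choice than `eval` for strings read from a `.csv` file.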

# 1. 🗃 Metadata

Our data consists of images + `.csv` files, containing custom information for each radiography.

**Metadata structure**:
1. `train_study_level.csv` - contains one row for each study, including correct labels.
2. `train_image_level.csv` - contains one row for each image, including both correct labels and any bounding boxes in a dictionary format.

<center><img src="https://i.imgur.com/WFXxolI.png" width=650></center>

> 📌 **Important**: An image can have *multiple bounding boxes*.

In [8]:
# Save data to W&B Dashboard
save_dataset_artifact(run_name='save-train1',
                      artifact_name='train_study_level', 
                      path="../input/siim-covid19-detection/train_study_level.csv")

save_dataset_artifact(run_name='save-train2',
                      artifact_name='train_image_level', 
                      path="../input/siim-covid19-detection/train_image_level.csv")

- The function **save_dataset_artifact** saves a dataset to W&B Artifacts:
    - **run_name**: name of the experiment
    - **artifact_name**: under what name the dataset should be stored
    - **path**: path to the dataset

In [9]:
# Read in metadata
train_study = pd.read_csv("../input/siim-covid19-detection/train_study_level.csv")
train_image = pd.read_csv("../input/siim-covid19-detection/train_image_level.csv")

print(color.BOLD + "Train Study Shape:" + color.END, train_study.shape, "\n" +
      color.BOLD + "Train Image Shape:" + color.END, train_image.shape, "\n" +
      "\n" +
      "Note: There are {} missing values in train_image.".\
                              format(train_image["boxes"].isna().sum()), "\n" +
      "This happens for labels = 'none' - no bounding boxes.", 3*"\n")



`train_study.shape` and `train_image.shape` give us the **number of rows and columns** of each DataFrame.
**color.BOLD** switches the printed text to bold yellow until we reset it with **color.END**.
Note that we are not removing missing values here, only counting them: `train_image["boxes"].isna().sum()` tells us how many null values there are in the `boxes` column, and `format()` inserts that number into the `{}` placeholder.

So here we get to know that
**Train Study has 6054 rows and 5 columns and Train Image has 6334 rows and 4 columns.**
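
As a small illustration of the counting step, here is a toy DataFrame (made-up values, not the competition data) showing that `isna().sum()` only counts the missing entries, while actually removing them would need something like `dropna`:

```python
import numpy as np
import pandas as pd

# Toy stand-in for train_image: two of the three rows have no boxes
toy = pd.DataFrame({
    "id": ["a_image", "b_image", "c_image"],
    "boxes": ["[{'x': 1, 'y': 2, 'width': 3, 'height': 4}]", np.nan, np.nan],
})

n_missing = toy["boxes"].isna().sum()             # counts NaN entries
message = "There are {} missing values in toy.".format(n_missing)
print(message)

# Counting does not change the data; dropping rows is a separate step
cleaned = toy.dropna(subset=["boxes"])
print(len(toy), len(cleaned))
```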

In [10]:
# Head of our 2 training metadata
df1_styler = train_study.head(3).style.set_table_attributes("style='display:inline'").\
                                set_caption('TRAIN STUDY')
df2_styler = train_image.head(3).style.set_table_attributes("style='display:inline'").\
                                set_caption('TRAIN IMAGE')



We take the first few rows (3, to be exact) of each DataFrame.
- `style.set_table_attributes`: sets HTML attributes on the table; `display:inline` lets the two tables sit next to each other.
- `set_caption`: sets the caption of the table.
- `display_html`: displays HTML in the output of a Jupyter notebook.

In [11]:
print(df1_styler)

If we print a Styler object directly, we don't get the table. We only get its default representation, which ends in the memory address where the object is stored (a hexadecimal number, which is why it contains letters such as a, b, c).

In [12]:
#df1_styler + df2_styler

This gives an error, since two Styler objects cannot be added together with `+`. That's why we use the `display_html` method instead.

In [13]:
display_html(df1_styler._repr_html_() + df2_styler._repr_html_(), raw=True)

Using `display_html` on the concatenated HTML strings, we are able to show both tables at once.
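
The same idea works without a Styler. The sketch below is an alternative, not what this notebook does: it concatenates the HTML of two plain DataFrames, each wrapped in an inline-block `div`. The helper name `side_by_side_html` and the toy tables are hypothetical.

```python
import pandas as pd

left = pd.DataFrame({"a": [1, 2]})
right = pd.DataFrame({"b": [3, 4]})

def side_by_side_html(*frames):
    """Hypothetical helper: wrap each table in an inline-block div."""
    return "".join(
        f"<div style='display:inline-block; margin-right:20px'>{frame.to_html()}</div>"
        for frame in frames
    )

html = side_by_side_html(left, right)
# In a notebook you would then render it with:
# from IPython.display import display_html
# display_html(html, raw=True)
```

The key point is that HTML strings can be concatenated freely, while the objects that produce them cannot.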

## 1.1 train_study analysis

🔎 **Findings**:
1. `id`: there are 6,054 unique ids - there are **no duplicates**
2. `target`: one of our targets is to predict whether the radiograph is `negative_for_pneumonia`, has `typical` appearance, has `indeterminate` appearance or is just `atypical`.
3. an image can have a **positive value for only 1 label**. For example, there aren't any images which are both `negative_for_pneumonia` and `indeterminate appearance` at the same time. It's only one or the other.
4. **class imbalance is present** - especially for `Indeterminate Appearance` and `Atypical Appearance`

### Target labels distribution
> Now let's see how the **labels we'll have to predict** are laid out.

In [14]:
run = wandb.init(project='siim-covid19', name='metadata_eda', config=CONFIG, anonymous="allow")

**wandb.init()** spawns a new background process to log data to a run, and it also syncs data to wandb.ai by default so you can see live visualizations.
Call **wandb.init()** to start a run before logging data with wandb.log():

**project**
(str, optional) The name of the project where you're sending the new run. If the project is not specified, the run is put in an "Uncategorized" project.

**name**
(str, optional) A short display name for this run, which is how you'll identify this run in the UI. By default we generate a random two-word name that lets you easily cross-reference runs from the table to charts. Keeping these run names short makes the chart legends and tables easier to read. If you're looking for a place to save your hyperparameters, we recommend saving those in config.

**config**
(dict, argparse, absl.flags, str, optional) This sets wandb.config, a dictionary-like object for saving inputs to your job, like hyperparameters for a model or settings for a data preprocessing job. The config will show up in a table in the UI that you can use to group, filter, and sort runs. Keys should not contain . in their names, and values should be under 10 MB. If dict, argparse or absl.flags: will load the key value pairs into the wandb.config object. If str: will look for a yaml file by that name, and load config from that file into the wandb.config object.

**anonymous**
(str, optional) Controls anonymous data logging. Options: - "never" (default): requires you to link your W&B account before tracking the run so you don't accidentally create an anonymous run. - "allow": lets a logged-in user track runs with their account, but lets someone who is running the script without a W&B account see the charts in the UI. - "must": sends the run to an anonymous account instead of to a signed-up user account.

In [15]:
print(train_study["id"].head(5))

In [16]:
# Process id
train_study["study_id"] = train_study["id"].apply(lambda x: x.split("_")[0])
print(train_study["study_id"].head(5))

Here we **create a new column** called **study_id** in `train_study`, derived from the `id` column.
**.apply** applies a function to every value in the column.

A **lambda function** is a small function containing a single expression. Lambda functions are anonymous: they don't require a name. They are very helpful when we have to perform small tasks with little code, especially when we have to pass a small function to another function.
Here `lambda x: x.split("_")[0]` is passed to `.apply`.
It splits `00086460a852_study` at the underscore into two parts: `00086460a852` at index `[0]` and `study` at index `[1]`. We just want the id number, without `study` attached to it.
So we do x.split("_")[0]
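
Here is the same step on a tiny Series, as a sketch. The first id is the one from the explanation above; the second one is made up. It also shows pandas' vectorised `.str` accessor as an alternative to `apply` with a lambda:

```python
import pandas as pd

ids = pd.Series(["00086460a852_study", "000c9c05fd14_study"])

# Version used in this notebook: apply a lambda to each value
with_lambda = ids.apply(lambda x: x.split("_")[0])

# Alternative: the vectorised string accessor
with_str = ids.str.split("_").str[0]

print(with_lambda.tolist())  # ['00086460a852', '000c9c05fd14']
```

Both give the same ids; which one to use is mostly a matter of taste for data of this size.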



In [17]:
# Data for plots
pneumonia = train_study["Negative for Pneumonia"]
typical = train_study["Typical Appearance"]
indeterminate = train_study["Indeterminate Appearance"]
atypical = train_study["Atypical Appearance"]

Storing values of different columns in different variables for easy implementation 

In [18]:
# Plotting
fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2, figsize=(21,20))
axs = [ax1, ax2, ax3, ax4]
dfs = [pneumonia, typical, indeterminate, atypical]
titles = ["Pneumonia", "Typical", "Indeterminate", "Atypical"]

for ax, df, title in zip(axs, dfs, titles):
    sns.countplot(y=df, ax=ax, palette=my_colors[1:])
    ax.set_title(title, fontsize=25, weight='bold')
    show_values_on_bars(ax, h_v="h", space=0.4)
    ax.set_xticklabels([])
    ax.set_ylabel('')
    ax.set_xlabel('')
    
# Virus png
path='../input/siimfisabiorsna-covid-2021/PinClipart.com_virus-clip-art_742280.png'
offset_png(x=4378, y=1, path=path, ax=ax1, zoom=0.05, offset=-360, border=1)
offset_png(x=2855, y=1, path=path, ax=ax2, zoom=0.05, offset=-50, border=1)
offset_png(x=1049, y=1, path=path, ax=ax3, zoom=0.05, offset=-43, border=1)
offset_png(x=474, y=1, path=path, ax=ax4, zoom=0.05, offset=40, border=1)

**plt.subplots**: creates a figure and a grid of axes in one call (the most useful way to do it).
`pyplot.subplots(nrows=1, ncols=1, *, sharex=False, sharey=False, squeeze=True, subplot_kw=None, gridspec_kw=None, **fig_kw)`
- `nrows`, `ncols` (int, default 1): number of rows/columns of the subplot grid. Here we pass 2, 2, so we get 2 rows and 2 columns.
- Returns `fig` (the Figure) together with the axes.

Then we build three lists:
- `axs = [ax1, ax2, ax3, ax4]`: the axes
- `dfs = [pneumonia, typical, indeterminate, atypical]`: the data for the different categories
- `titles = ["Pneumonia", "Typical", "Indeterminate", "Atypical"]`: the titles

The for loop uses **zip(\*iterables)**. The function takes in iterables as arguments and returns an iterator. This iterator generates a series of tuples containing elements from each iterable. zip() can accept any type of iterable, such as files, lists, tuples, dictionaries, sets, and so on.

**sns.countplot**: the seaborn function for creating a count plot.
`seaborn.countplot(*, x=None, y=None, hue=None, data=None, order=None, hue_order=None, orient=None, color=None, palette=None, saturation=0.75, dodge=True, ax=None, **kwargs)`
- `x`, `y`, `hue`: names of variables in `data` or vector data, optional. The input for building the plot.
- `data`: DataFrame, array, or list of arrays, optional. The data to plot.
- `order`, `hue_order`: lists of strings, optional. The order in which to plot the categorical levels.
- `orient`: "v" | "h", optional. Sets the orientation of the plot to vertical or horizontal.
- `color`: matplotlib color, optional. The color of the plot.
- `palette`: palette name, list, or dict. Decides the colors of the graph.
- `ax`: matplotlib Axes, optional. The axes on which the plot is drawn.

**ax.set_title**: sets the title.
**show_values_on_bars(ax, h_v="h", space=0.4)**: writes the values next to the bars of a horizontal plot, with a gap of 0.4.
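
A quick sketch of how `zip` pairs things up, using plain strings and numbers as made-up stand-ins for the axes, data and titles:

```python
slots = ["ax1", "ax2", "ax3"]          # stand-ins for the axes
counts = [10, 20, 30]                  # stand-ins for the data
names = ["Pneumonia", "Typical", "Indeterminate"]

# Each step of the loop receives one element from every list
pairs = list(zip(slots, counts, names))
for slot, count, name in pairs:
    print(slot, count, name)

# zip stops at the shortest iterable
short = list(zip([1, 2, 3], "ab"))
print(short)  # [(1, 'a'), (2, 'b')]
```

Because `zip` silently stops at the shortest input, it is worth making sure the lists have the same length before looping.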

In [19]:
# Save plots into W&B Dashboard
for title, df in zip(titles, dfs):
    create_wandb_plot(x_data=[0, 1], 
                      y_data=df.value_counts().values, 
                      x_name="Flag", y_name="Freq", title=title, 
                      log=title, plot="bar")

Here we create one bar plot per label in the W&B dashboard.

### How many instances per label?
> 📌 **Note**: 50% of our images have a typical appearance. The remaining 50% is split, in order, into *negative for pneumonia* (28%), *indeterminate* (17%) and *atypical* (the rest).

In [20]:
# Get data and transform frequencies to percentages
df = train_study.groupby(['Negative for Pneumonia', 'Typical Appearance',
       'Indeterminate Appearance', 'Atypical Appearance']).count().reset_index()

df["label"] = ['Atypical Appearance', 'Indeterminate Appearance',
               'Typical Appearance', 'Negative for Pneumonia']
df["perc"] = df["id"]/df["id"].sum()*100

# Plot
bar,ax = plt.subplots(figsize=(21,10))
ax = sns.barplot(x=df["label"], y=df["perc"], 
                 ci=None, palette=my_colors, orient='v')
ax.set_title("Label % in Total Observations", fontsize=25,
             weight = "bold")
ax.set_xlabel(" ")
ax.set_ylabel("Percentage")
for rect in ax.patches:
    ax.text (rect.get_x() + rect.get_width() / 2,rect.get_height(),
             "%.1f%%"% rect.get_height(), weight='bold')

**groupby**: lets you split your data into separate groups and perform computations on each group.
Here we group by 'Negative for Pneumonia', 'Typical Appearance', 'Indeterminate Appearance' and 'Atypical Appearance'.
We then plot each label against the percentage of studies in which it appears (these are chest radiographs, not MRIs).
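
The cell above relies on the order in which `groupby` returns the four combinations to attach the label names. As a sketch of an alternative that reads the label from the data itself, here is a toy example (made-up rows, same column names) using `idxmax`:

```python
import pandas as pd

label_cols = ["Negative for Pneumonia", "Typical Appearance",
              "Indeterminate Appearance", "Atypical Appearance"]

# Made-up one-hot rows: exactly one column is 1 in each row
toy = pd.DataFrame(
    [[1, 0, 0, 0], [0, 1, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0]],
    columns=label_cols,
)

# idxmax(axis=1) returns the name of the column holding the 1
toy["label"] = toy[label_cols].idxmax(axis=1)

# normalize=True gives fractions; multiply by 100 for percentages
perc = toy["label"].value_counts(normalize=True) * 100
print(perc)
```

This assumes each row has exactly one positive label, which matches finding 3 in the list above.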

In [21]:
create_wandb_plot(x_data=df["label"].values,
                  y_data=df["perc"].values,
                  x_name="Label", y_name="Percentage", 
                  title="Label % in Total Observations", 
                  log="perc", plot="bar")

In [22]:
wandb.finish()

## 1.2 train_image analysis

🔎 **Findings**:
1. `StudyInstanceUID` corresponds 1:1 to the `study_id` created for `train_study`.
2. `image_id` is unique in the `train_image` data.
3. There are images with multiple bounding boxes!
4. There can be multiple images per study!

### How many images have some sort of abnormality?

To have a bounding box we need to have something weird detected in the scan. If there is nothing weird ... then we don't need a bounding box!

In [23]:
run = wandb.init(project='siim-covid19', name='train_imgs_eda', config=CONFIG, anonymous="allow")

Initialising the wandb environment for our image analysis.

In [24]:
train_image.head(5)

Now we can see that in the **id column the `_image` suffix is unnecessary**, so we remove it using the code below.

In [25]:
# Process id
train_image["image_id"] = train_image["id"].apply(lambda x: x.split("_")[0])

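# Data for plotting
# Explicit names keep the columns "index" (label text) and "label" (count)
# the same across pandas versions (value_counts naming changed in pandas 2.0)
df = train_image["label"].apply(lambda x: x.split(" ")[0]).\
                                    value_counts().rename_axis("index").\
                                    reset_index(name="label")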

# Plot
plt.figure(figsize=(21, 15))
ax = sns.barplot(data=df, y="index", x="label", palette=my_colors[3:])
show_values_on_bars(ax, h_v="h", space=0.4)
plt.title("How many images have a bounding box present?", 
          fontsize=30, weight='bold')
plt.xticks([])
plt.ylabel('')
plt.xlabel('');

# Virus png
path='../input/siimfisabiorsna-covid-2021/PinClipart.com_virus-clip-art_742280.png'
offset_png(x=4294, y=0, path=path, ax=ax, zoom=0.1, offset=-110, border=1)

Now we plot the label (column `index`) against its count (column `label`) to see how many images have a bounding box present.
- **plt.figure()**: creates a figure object. It is useful when we want to tweak the size of the figure or add multiple Axes objects to a single figure.

- **sns.barplot**: makes a bar plot.

Here we can observe that 4,294 images have bounding boxes present and 2,040 have no bounding boxes.

In [26]:
create_wandb_plot(x_data=df["index"], 
                  y_data=df["label"].values, 
                  x_name="BBox", y_name="Freq", 
                  title="Images with bbox",
                  log="bbox", plot="bar")

### How many images per each study?

> The majority of studies have only 1 image. That being said, we have ~230 studies that have multiple images available (up to 9).

In [27]:
train_image.head(5)

In [28]:
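# Create df: "index" = study id, "StudyInstanceUID" = number of images per study
# (explicit names keep this working across pandas versions)
df = train_image["StudyInstanceUID"].value_counts().rename_axis("index").\
                        reset_index(name="StudyInstanceUID").\
                        sort_values("StudyInstanceUID", ascending=False)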
print(color.BOLD + "Max number of images available per study:" + color.END, 
      df["StudyInstanceUID"].max(), "\n" +
      color.BOLD + "Min number of images available per study:" + color.END, 
      df["StudyInstanceUID"].min(), 2*"\n")

- **.value_counts()**: returns a Series containing the counts of unique values
- **.reset_index()**: turns the index of the result back into a regular column

Here we check the minimum and maximum number of images available per study.
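
A toy version of this counting step, with made-up study ids. One caveat: the column names produced by `value_counts().reset_index()` differ between pandas versions (pandas 2.0 changed them), so this sketch names both columns explicitly; `study` and `n_images` are names chosen here for illustration.

```python
import pandas as pd

# Made-up data: study s1 has three images, s2 and s3 have one each
toy = pd.DataFrame({"StudyInstanceUID": ["s1", "s1", "s2", "s3", "s1"]})

per_study = (
    toy["StudyInstanceUID"]
    .value_counts()                 # images per study, largest first
    .rename_axis("study")           # name the index explicitly
    .reset_index(name="n_images")   # name the counts column explicitly
)

print(per_study["n_images"].max(), per_study["n_images"].min())
```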

In [29]:

# Plot
plt.figure(figsize=(21, 14))
sns.distplot(a=df["StudyInstanceUID"], color=my_colors[6], 
             hist=False, kde_kws=dict(lw=7, ls="-"))
plt.title("How many images per study?", 
          fontsize=30, weight='bold')
plt.xticks([])
plt.ylabel('Study Frequency')
plt.xlabel('Image Ids');

In [30]:
wandb.log({"max_images_on_study" : df["StudyInstanceUID"].max()})
create_wandb_hist(x_data=df["StudyInstanceUID"], 
                  x_name="Image Freq", title="No. images per study", 
                  log="hist")

In [31]:
wandb.finish()

> After EDA, my [W&B Dashboard](https://wandb.ai/andrada/siim-covid19?workspace=user-andrada) looks like this:

<center><img src="https://i.imgur.com/HQBdtSd.gif" width=600></center>

### Create the full train dataset

This is also how the `final_label` will need to look before submission.

<center><img src="https://i.imgur.com/8Ckg1L0.png" width=600></center>

In [32]:
# Merge all info together
train = pd.merge(train_image, train_study, 
                 left_on="StudyInstanceUID", right_on="study_id")

train.drop(["id_x", "StudyInstanceUID", "id_y"], axis=1, inplace=True)

Using the **.merge** function, we combine the two DataFrames (matching `StudyInstanceUID` with `study_id`) and then drop the columns we no longer need.
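
To see what the merge does, here is a sketch on two tiny made-up tables. Each image row picks up the labels of its study, so a study with two images simply appears twice:

```python
import pandas as pd

# Made-up image-level and study-level tables
images = pd.DataFrame({"image_id": ["i1", "i2", "i3"],
                       "StudyInstanceUID": ["s1", "s1", "s2"]})
studies = pd.DataFrame({"study_id": ["s1", "s2"],
                        "Typical Appearance": [1, 0]})

# The key columns have different names, hence left_on / right_on
merged = pd.merge(images, studies,
                  left_on="StudyInstanceUID", right_on="study_id",
                  how="inner")

# Both key columns survive the merge, so one of them can be dropped
merged = merged.drop(columns=["StudyInstanceUID"])
print(merged)
```

`how="inner"` is the default for `pd.merge`; it is spelled out here only to make the behaviour explicit.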

In [33]:
train.head(5)

> Let's also look at the study id `0fd2db233deb`, which has the most images available. 8 images have nothing unusual in them and 1 is labeled as `Indeterminate Appearance`.

In [34]:
train[train["study_id"] == "0fd2db233deb"]

# 2. 📷 Images

Good, now that we've explored the metadata and the `target` labels, we can start focusing on the good stuff - meaning the **scans** themselves (chest radiographs stored as `.dcm` files).

<center><img src="https://i.imgur.com/DFDDdvD.png" width=550></center>

### WHAT ARE CT SCANS?

"A **Computerized Tomography** scan (CT or CAT scan) uses computers and rotating X-ray machines to **create cross-sectional images** of the body. These images provide **more detailed** information than normal X-ray images. They can show the **soft tissues**, **blood vessels**, and **bones** in various parts of the body."

<center><img src="https://i.imgur.com/5WUxAHZ.png" width=550></center>

In [35]:
run = wandb.init(project='siim-covid19', name='image_explore', config=CONFIG, anonymous="allow")

Let's create a function, **show_dcm_info()**, which reads image files saved in the Digital Imaging and Communications in Medicine (**DICOM**) format and displays them as regular images.

How the function works:
- The figure has 2 rows and 3 columns, with a figure size of (21, 10).
- `glob` collects the file paths that match a pattern. The training set is stored under `../input/siim-covid19-detection/train`, and we keep one `.dcm` path for each `study_id` passed to the function.
- The DICOM files are read with `pydicom`: after `import pydicom`, calling `pydicom.dcmread('path/to/file')` returns the dataset.
- `apply_voi_lut` applies the VOI lookup table (or windowing) stored in the file to the pixel array, so the image is displayed with the intended contrast.
- Every plotted image is also appended to `wandb_logs` and logged to W&B.
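
The path-collection step can be tried out without the competition data. The sketch below builds a throwaway folder tree that mimics the `train/<study_id>/<series_id>/<image>.dcm` layout; the ids and the `first_dcm_path` helper are invented for illustration.

```python
import glob
import os
import tempfile

# Build a temporary tree: <root>/<study_id>/<series_id>/<image_id>.dcm
root = tempfile.mkdtemp()
for study_id, series_id, image_id in [("studyA", "series1", "img1"),
                                      ("studyA", "series2", "img2"),
                                      ("studyB", "series3", "img3")]:
    folder = os.path.join(root, study_id, series_id)
    os.makedirs(folder, exist_ok=True)
    open(os.path.join(folder, f"{image_id}.dcm"), "w").close()

def first_dcm_path(root, study_id):
    """Return one .dcm path for a study (hypothetical helper)."""
    # glob does not guarantee an order, so sort to make the pick reproducible
    matches = sorted(glob.glob(os.path.join(root, study_id, "*", "*")))
    return matches[0]

paths = [first_dcm_path(root, s) for s in ["studyA", "studyB"]]
print([os.path.basename(p) for p in paths])
```

Note that taking index `[0]` keeps only one image per study, even when a study contains several.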

In [36]:
def show_dcm_info(study_ids, df):
    '''Show .dcm images along with description.'''
    wandb_logs = []
    
    fig, axes = plt.subplots(nrows=2, ncols=3, figsize=(21,10))

    # Get .dcm paths
    dcm_paths = [glob.glob(f"../input/siim-covid19-detection/train/{study_id}/*/*")[0]
                 for study_id in study_ids]
    datasets = [pydicom.dcmread(path) for path in dcm_paths]
    images = [apply_voi_lut(dataset.pixel_array, dataset) for dataset in datasets]

    # Loop through the information
    for study_id, data, img, i in zip(study_ids, datasets, images, range(2*3)):
        # Fix inverted images
        img = fix_inverted_radiograms(data, img)

        # Below function available in functions section ;)
        label, bbox = get_image_metadata(study_id, df)
        
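        # Draw the bounding boxes when there are any.
        # For no bbox the list is [nan]: math.isnan() succeeds and nothing is drawn.
        # Otherwise bbox[0] is not a number, math.isnan() raises TypeError,
        # and the boxes are drawn in the except branch.
        try: 
            math.isnan(bbox[0])
        except TypeError:
            # Retrieve the bounding box coordinates
            all_coords = [return_coords(box) for box in bbox]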

            for (x1, y1, x2, y2) in all_coords:
                cv2.rectangle(img, (x1, y1), (x2, y2), (0, 80, 255), 15)
                cv2.putText(img, label, (x1, y1-14), 
                            cv2.FONT_HERSHEY_SIMPLEX, 3, (0, 0, 0), 4)
                
        # Plot the image
        x = i // 3
        y = i % 3
        
        axes[x, y].imshow(img, cmap="rainbow")
        axes[x, y].set_title(f"Label: {label} \n Sex: {data.PatientSex} | Body Part: {data.BodyPartExamined}", 
                  fontsize=14, weight='bold')
        axes[x, y].axis('off');
        
        # Save to W&B
        wandb_logs.append(wandb.Image(img, 
                                      caption=f"Label: {label} \n Sex: {data.PatientSex} | Body Part: {data.BodyPartExamined}"))
          
    wandb.log({f"{label}": wandb_logs})

In short, the function finds the `.dcm` files, reads them, and converts them into displayable images. It then looks up the bounding boxes for each study, draws them when they exist (images without boxes are simply plotted as they are), and appends every image to `wandb_logs` as a `wandb.Image` before logging the list to W&B.
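
The "is there a bounding box?" check relies on `math.isnan` raising `TypeError` for anything that is not a number. The same idea can be written as an explicit test. This is a minimal sketch, assuming that real boxes are stored as dictionaries and that a missing box is a float NaN; `has_boxes` is a hypothetical helper, not part of the notebook.

```python
import math

def has_boxes(bbox):
    """Return True when bbox holds real boxes, False for the [nan] placeholder."""
    first = bbox[0]
    # A float NaN marks "no bounding box"; anything else is treated as a box
    return not (isinstance(first, float) and math.isnan(first))

print(has_boxes([float("nan")]))
print(has_boxes([{"x": 10, "y": 20, "width": 30, "height": 40}]))
```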

### Typical Appearance

In [37]:
show_dcm_info(study_ids=["72044bb44d41", "5b65a69885b6", "6aa32e76f998",
                         "c9ffe6312921", "082cafb03942", "d3e83031ebea"], 
              df=train)

Here we can see how the `.dcm` data is rendered as chest radiographs, with bounding boxes drawn around the regions labeled as typical appearance.

### Atypical Appearance

In [38]:
show_dcm_info(study_ids=["f807cd855d31", "8087e3bc0efe", "7249de10ed69",
                         "e300a4e86207", "4bac6c7da8b8", "f2d30ac37f7b"], 
              df=train)

Here we can see how the `.dcm` data is rendered as chest radiographs, with bounding boxes drawn around the regions labeled as atypical appearance.

### Indeterminate Appearance

In [39]:
show_dcm_info(study_ids=["b949689a9ef1", "fe7e6015560d", "feffa20fac13",
                         "747483509d0e", "c70369caef91", "1e1b4b1b53cb"], 
              df=train)

### Negative for Pneumonia

In [40]:
show_dcm_info(study_ids=["612ea5194007", "db14e640e037", "d4ab797396b4",
                         "6ae8a88c4b0c", "b3cf474bee3b", "0ba55e5422ab"], 
              df=train)

## 2.2 Magic - save images & bounding boxes to W&B

> Here we create a function that takes an image, its bounding boxes (`bboxes`), the `true_label`, and the `class_id_to_label` mapping, and wraps them in a `wandb.Image` so that images and bounding boxes can be saved to W&B.
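
Before reading the function in the next cell, it may help to look at the dictionary it builds for each box. The sketch below reproduces that layout without calling W&B; `make_box_data` is a hypothetical helper and the coordinates are invented.

```python
def make_box_data(bbox, class_id, class_id_to_label):
    """Build one box dictionary in the layout used by wandb_bbox (hypothetical helper)."""
    x1, y1, x2, y2 = bbox
    return {
        "position": {"minX": x1, "minY": y1, "maxX": x2, "maxY": y2},
        "class_id": int(class_id),
        "box_caption": class_id_to_label[class_id],
        # "pixel" means the coordinates are absolute pixels, not 0-1 fractions
        "domain": "pixel",
    }

class_id_to_label = {0: "atypical", 1: "indeterminate", 2: "typical"}
box = make_box_data((20, 35, 120, 160), class_id=2,
                    class_id_to_label=class_id_to_label)
print(box)
```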

In [41]:
def wandb_bbox(image, bboxes, true_label, class_id_to_label):
    all_boxes = []
    for bbox in bboxes:
        box_data = {"position": {
                        "minX": bbox[0],
                        "minY": bbox[1],
                        "maxX": bbox[2],
                        "maxY": bbox[3]
                    },
                     "class_id" : int(true_label),
                     "box_caption": class_id_to_label[true_label],
                     "domain" : "pixel"}
        all_boxes.append(box_data)
    

    return wandb.Image(image, boxes={
        "ground_truth": {
            "box_data": all_boxes,
          "class_labels": class_id_to_label
        }
    })


def resize_img_and_coord(img, coord, resize):
    '''Resizes the image and its coordinates.
    img: the pixel_array image (2D, grayscale)
    coord: list of (x1, y1, x2, y2) tuples from the return_coords() function
    resize: an integer specifying the side length of the new square image'''
    
    # Original size: NumPy shape is (rows, columns) = (height, width)
    h_old, w_old = img.shape

    # Resize the image
    img = cv2.resize(img, (resize, resize))

    # Resize the coordinates: x scales with the width, y with the height
    new_coord = [(int(x1 / (w_old/resize)), int(y1 / (h_old/resize)),
                  int(x2 / (w_old/resize)), int(y2 / (h_old/resize)))
                 for (x1, y1, x2, y2) in coord]
    
    return img, new_coord
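
When rescaling boxes, keep in mind that NumPy reports an image shape as (rows, columns), i.e. (height, width), so x coordinates scale with the width and y coordinates with the height. A minimal sketch of that arithmetic, using a hypothetical `scale_box` helper and invented numbers:

```python
def scale_box(box, old_shape, new_size):
    """Scale an (x1, y1, x2, y2) box from an image of old_shape=(height, width)
    to a square image of side new_size (hypothetical helper)."""
    h_old, w_old = old_shape  # NumPy order: rows (height) first
    x1, y1, x2, y2 = box
    return (int(x1 * new_size / w_old), int(y1 * new_size / h_old),
            int(x2 * new_size / w_old), int(y2 * new_size / h_old))

# A 2000 x 4000 (height x width) image squeezed to 200 x 200
print(scale_box((400, 500, 2000, 1500), old_shape=(2000, 4000), new_size=200))
# → (20, 50, 100, 150)
```

Because the target is square while most radiographs are not, the box is stretched by a different factor along each axis.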

Let's look at an **example of 10 images** (below you can see how the output looks in the [W&B Dashboard](https://wandb.ai/mehreet/siim-covid19?workspace=user-mehreet)!)
<center><video src="https://i.imgur.com/q0jCbza.mp4" width=650 controls></video></center>

In [42]:
# Get a few example ids (image_id)
example_ids = ["000a312787f2", "0012ff7358bc", "001398f4ff4f",
               "001bd15d1891", "002e9b2128d0", "ffbeafe30b77",
               "0022227f5adf", "00a129830f4e", "01376c1ba556", "008ca392cff3"]

# Read in the data
study_ids = train[train["image_id"].isin(example_ids)]["study_id"].values
paths = [glob.glob(f"../input/siim-covid19-detection/train/{i}/*/*")[0]
         for i in study_ids]

# Retrieve resized information
images, coords, labels = [], [], []
for path, study_id in zip(paths, study_ids):
    try:
        # Read data file
        data = pydicom.dcmread(path)
        # Get image data
        img = apply_voi_lut(data.pixel_array, data)
        # Get image coordinates
        label, bbox = get_image_metadata(study_id=study_id, df=train)
        coord = [return_coords(box) for box in bbox]

        # Fix inverted radiograms + resize
        img = fix_inverted_radiograms(data, img)
        resized_img, resized_coord = resize_img_and_coord(img, coord, resize=200)

        images.append(resized_img)
        coords.append(resized_coord)
        labels.append(label)
    except RuntimeError:
        pass

In [43]:
# Map each label to a number
class_label_to_id = {'atypical': 0, 'indeterminate': 1, 'typical': 2}
# And each number to a label
class_id_to_label = {val: key for key, val in class_label_to_id.items()}

# Log each image
wandb_bbox_list = []

for image, coord, label in zip(images, coords, labels):
    wandb_bbox_list.append(wandb_bbox(image=image,
                                      bboxes=coord, 
                                      true_label=class_label_to_id[label],
                                      class_id_to_label=class_id_to_label))

# Save images to W&B Dashboard
wandb.log({"radiograph": wandb_bbox_list})

print(color.BOLD + "Finished! Your Images were uploaded in your W&B Dashboard!" + color.END)

In [44]:
wandb.finish()

# 3. Extract Metadata from .dcm

> 📌 **Important**: We can create *more* **metadata** (more information on the images) from the information stored in the `.dcm` files. Below I extract the features stored in each `.dcm` file and store them in a separate dataframe.

## 3.1 Store and Save Metadata

<center><img src="https://i.imgur.com/8Fhupoe.png" width=650></center>

In [45]:
def get_observation_data(path):
    """Get information from the .dcm files.
    path: complete path to the .dcm file"""

    # dcmread is the current name; read_file is a deprecated alias
    image_data = pydicom.dcmread(path)
    
    # Dictionary to store the information from the image
    observation_data = {
        "FileNumber" : path.split("/")[5],
        "Rows" : image_data.get("Rows"),
        "Columns" : image_data.get("Columns"),
        "PatientID" : image_data.get("PatientID"),
        "PatientName" : image_data.get("PatientName"),
        "PhotometricInterpretation" : image_data.get("PhotometricInterpretation"),
        "StudyInstanceUID" : image_data.get("StudyInstanceUID"),
        "SamplesPerPixel" : image_data.get("SamplesPerPixel"),
        "BitsAllocated" : image_data.get("BitsAllocated"),
        "BitsStored" : image_data.get("BitsStored"),
        "HighBit" : image_data.get("HighBit"),
        "PixelRepresentation" : image_data.get("PixelRepresentation"),
    }

    # String columns
    str_columns = ["ImageType", "Modality", "PatientSex", "BodyPartExamined"]
    for k in str_columns:
        observation_data[k] = str(image_data.get(k)) if k in image_data else None

    
    return observation_data
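
With the folder layout used in this notebook, `path.split("/")[5]` picks the series folder rather than the `.dcm` file name. If the study, series, and image ids are all needed, they can be read from the end of the path instead. A minimal sketch, assuming the `<study_id>/<series_id>/<image_id>.dcm` layout; `split_dcm_path` is a hypothetical helper:

```python
from pathlib import PurePosixPath

def split_dcm_path(path):
    """Split .../<study_id>/<series_id>/<image_id>.dcm into its three ids
    (hypothetical helper; assumes the folder layout used in this notebook)."""
    parts = PurePosixPath(path).parts
    study_id, series_id, file_name = parts[-3:]
    return {"study_id": study_id,
            "series_id": series_id,
            "image_id": PurePosixPath(file_name).stem}

p = "../input/siim-covid19-detection/train/00792b5c8852/1f52bcb3143e/3fadf4b48db3.dcm"
print(split_dcm_path(p))
```

Indexing from the end keeps the helper working even if the dataset is mounted under a different root folder.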

In [46]:
# # An example
# p = "../input/siim-covid19-detection/train/00792b5c8852/1f52bcb3143e/3fadf4b48db3.dcm"
# example = get_observation_data(p)
# print(example)

# # === GET ALL METADATA ===
# # Get all paths to .dcm files
# all_paths = glob.glob("../input/siim-covid19-detection/train/*/*/*")

# # === Get metadata ===
# exceptions = 0
# dicts = []

# for path in tqdm.tqdm(all_paths):
#     # Get .dcm metadata
#     ### TODO: add .dcm id
#     try:
#         d = get_observation_data(path)
#         dicts.append(d)
#     except Exception as e:
#         exceptions += 1
#         continue
        
# # === SAVE METADATA ===
# # Convert into df
# meta_train_data = pd.DataFrame(data=dicts, columns=example.keys())
# # Export information
# meta_train_data.to_csv("meta_train.csv", index=False)

# print("Metadata processed & saved successfully :)")

## 3.2 Let's Analyse the new information

In [None]:
# Save data to W&B Dashboard
save_dataset_artifact(run_name='dave-dcm-meta',
                      artifact_name='dcm_metadata', 
                      path="../input/siimfisabiorsna-covid-2021/meta_train.csv")

In [None]:
# Import
dcm_meta = pd.read_csv("../input/siimfisabiorsna-covid-2021/meta_train.csv")
dcm_meta = pd.concat([dcm_meta, train], axis=1)

dcm_meta.head(2)
#dcm_meta.info()

In [None]:
#Show data from particular column
dcm_meta[dcm_meta["Negative for Pneumonia"].isin([0, 1])]

### Patient's Gender

In [None]:
# Get the Data
labels = ['Negative for Pneumonia', 'Typical Appearance', 
          'Indeterminate Appearance', 'Atypical Appearance']
dt = dcm_meta.groupby("PatientSex")[labels].sum().reset_index()

# Plotting
fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2, figsize=(21,20))
axs = [ax1, ax2, ax3, ax4]

for ax, title in zip(axs, labels):
    sns.barplot(data=dt, x=title, y="PatientSex", 
                ax=ax, palette=my_colors[0:])
    ax.set_title(title, fontsize=25, weight='bold')
    show_values_on_bars(ax, h_v="h", space=0.4)
    ax.set_xticklabels([])
    ax.set_ylabel('')
    ax.set_xlabel('')
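
The aggregation behind these bar plots is a grouped sum over the one-hot label columns. Here is a minimal sketch on invented data with only two of the four label columns:

```python
import pandas as pd

# Toy data: one row per image, labels are one-hot encoded (values invented)
toy = pd.DataFrame({
    "PatientSex": ["F", "M", "M", "F", "M"],
    "Negative for Pneumonia": [1, 0, 1, 0, 0],
    "Typical Appearance":     [0, 1, 0, 1, 1],
})
label_cols = ["Negative for Pneumonia", "Typical Appearance"]

# Summing the 0/1 columns per group counts the images of each class
counts = toy.groupby("PatientSex")[label_cols].sum().reset_index()
print(counts)
```

The same pattern applies to the `BodyPartExamined` and `PhotometricInterpretation` groupings further down.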

### Location of X-rays

In [None]:
# Get the Data
labels = ['Negative for Pneumonia', 'Typical Appearance', 
          'Indeterminate Appearance', 'Atypical Appearance']
dt = dcm_meta.groupby("BodyPartExamined")[labels].sum().reset_index()
dt = dt.sort_values("Negative for Pneumonia", ascending=False)

# Plotting
fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2, figsize=(21,20))
axs = [ax1, ax2, ax3, ax4]

for ax, title in zip(axs, labels):
    sns.barplot(data=dt, x=title, y="BodyPartExamined", 
                ax=ax, palette=my_colors)
    ax.set_title(title, fontsize=25, weight='bold')
    show_values_on_bars(ax, h_v="h", space=0.4)
    ax.set_xticklabels([])
    ax.set_ylabel('')
    ax.set_xlabel('')
    fig.tight_layout()

### MONOCHROME1 or MONOCHROME2? That's the question ...

In [None]:
# Get the Data
labels = ['Negative for Pneumonia', 'Typical Appearance', 
          'Indeterminate Appearance', 'Atypical Appearance']
dt = dcm_meta.groupby("PhotometricInterpretation")[labels].sum().reset_index()

# Plotting
fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2, figsize=(21,20))
axs = [ax1, ax2, ax3, ax4]

for ax, title in zip(axs, labels):
    sns.barplot(data=dt, x=title, y="PhotometricInterpretation", 
                ax=ax, palette=my_colors[5:])
    ax.set_title(title, fontsize=25, weight='bold')
    show_values_on_bars(ax, h_v="h", space=0.4)
    ax.set_xticklabels([])
    ax.set_ylabel('')
    ax.set_xlabel('')
    fig.tight_layout()

The DICOM standard defines the `PhotometricInterpretation` values as follows:

- **MONOCHROME1**: Pixel data represent a single monochrome image plane. The minimum sample value is intended to be displayed as white after any VOI gray scale transformations have been performed. See PS3.4. This value may be used only when Samples per Pixel (0028,0002) has a value of 1. May be used for pixel data in a Native (uncompressed) or Encapsulated (compressed) format; see Section 8.2 in PS3.5.
- **MONOCHROME2**: Pixel data represent a single monochrome image plane. The minimum sample value is intended to be displayed as black after any VOI gray scale transformations have been performed. See PS3.4. This value may be used only when Samples per Pixel (0028,0002) has a value of 1. May be used for pixel data in a Native (uncompressed) or Encapsulated (compressed) format; see Section 8.2 in PS3.5.
- **PALETTE COLOR**: Pixel data describe a color image with a single sample per pixel (single image plane). The pixel value is used as an index into each of the Red, Blue, and Green Palette Color Lookup Tables (0028,1101-1103&1201-1203). This value may be used only when Samples per Pixel (0028,0002) has a value of 1. May be used for pixel data in a Native (uncompressed) or Encapsulated (compressed) format; see Section 8.2 in PS3.5. When the Photometric Interpretation is Palette Color, the Red, Blue, and Green Palette Color Lookup Tables shall be present.
- **RGB**: Pixel data represent a color image described by red, green, and blue image planes. The minimum sample value for each color plane represents minimum intensity of the color. This value may be used only when Samples per Pixel (0028,0002) has a value of 3. Planar Configuration (0028,0006) may be 0 or 1. May be used for pixel data in a Native (uncompressed) or Encapsulated (compressed) format; see Section 8.2 in PS3.5.
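
In practice, the difference between MONOCHROME1 and MONOCHROME2 means that MONOCHROME1 images look inverted unless their pixel values are flipped. A common way to do this is to subtract every pixel from the image maximum. The sketch below is an illustration with a hypothetical `to_monochrome2` helper and a tiny invented array; it is not necessarily how `fix_inverted_radiograms` is implemented in this notebook.

```python
import numpy as np

def to_monochrome2(img, photometric_interpretation):
    """Flip MONOCHROME1 pixel values so that low values display as black
    (hypothetical helper for illustration)."""
    if photometric_interpretation == "MONOCHROME1":
        return np.amax(img) - img
    return img

img = np.array([[0, 100], [200, 255]], dtype=np.uint16)
print(to_monochrome2(img, "MONOCHROME1"))
```

Images that are already MONOCHROME2 are returned unchanged.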



> The Exploratory Dashboard in W&B!
<center><video src="https://i.imgur.com/P8tdXpU.mp4" width=650 controls></video></center>

