In [1]:
#| default_exp app_v1

### Parkinson's Disease Progression Predictions

Competition [Link](https://www.kaggle.com/competitions/amp-parkinsons-disease-progression-prediction/overview)

Submissions are evaluated on SMAPE between forecasts and actual values. We define SMAPE = 0 when the actual and predicted values are both 0.

For each patient visit where a protein/peptide sample was taken you will need to estimate both their UPDRS scores for that visit and predict their scores for any potential visits 6, 12, and 24 months later. Predictions for any visits that didn't ultimately take place are ignored.

You must submit to this competition using the provided Python time-series API, which ensures that models do not peek forward in time. To use the API, follow this template in Kaggle Notebooks:

```Python
import numpy as np
import amp_pd_peptide

env = amp_pd_peptide.make_env()   # initialize the environment
iter_test = env.iter_test()    # an iterator which loops over the test files
for (test, test_peptides, test_proteins, sample_submission) in iter_test:
    sample_submission['rating'] = np.arange(len(sample_submission))  # make your predictions here
    env.predict(sample_submission)   # register your predictions
```

Dataset Description

The goal of this competition is to predict the course of Parkinson's disease (PD) using protein abundance data. The complete set of proteins involved in PD remains an open research question and any proteins that have predictive value are likely worth investigating further. The core of the dataset consists of protein abundance values derived from mass spectrometry readings of cerebrospinal fluid (CSF) samples gathered from several hundred patients. Each patient contributed several samples over the course of multiple years while they also took assessments of PD severity.

This is a time-series code competition: you will receive test set data and make predictions with Kaggle's time-series API.

Files
-----

**train\_peptides.csv** Mass spectrometry data at the peptide level. Peptides are the component subunits of proteins.

*   `visit_id` - ID code for the visit.
*   `visit_month` - The month of the visit, relative to the first visit by the patient.
*   `patient_id` - An ID code for the patient.
*   `UniProt` - [The UniProt ID code](https://www.uniprot.org/id-mapping) for the associated protein. There are often several peptides per protein.
*   `Peptide` - The sequence of amino acids included in the peptide. See [this table](https://en.wikipedia.org/wiki/Amino_acid#Physicochemical_properties_of_amino_acids) for the relevant codes. Some rare annotations may not be included in the table. The test set may include peptides not found in the train set.
*   `PeptideAbundance` - The frequency of the peptide in the sample.

**train\_proteins.csv** Protein expression frequencies aggregated from the peptide level data.

*   `visit_id` - ID code for the visit.
*   `visit_month` - The month of the visit, relative to the first visit by the patient.
*   `patient_id` - An ID code for the patient.
*   `UniProt` - [The UniProt ID code](https://www.uniprot.org/id-mapping) for the associated protein. There are often several peptides per protein. The test set may include proteins not found in the train set.
*   `NPX` - Normalized protein expression. The frequency of the protein's occurrence in the sample. May not have a 1:1 relationship with the component peptides as some proteins contain repeated copies of a given peptide.

**train\_clinical\_data.csv**

*   `visit_id` - ID code for the visit.
*   `visit_month` - The month of the visit, relative to the first visit by the patient.
*   `patient_id` - An ID code for the patient.
*   `updrs_[1-4]` - The patient's score for part N of the [Unified Parkinson's Disease Rating Scale](https://www.movementdisorders.org/MDS/MDS-Rating-Scales/MDS-Unified-Parkinsons-Disease-Rating-Scale-MDS-UPDRS.htm). Higher numbers indicate more severe symptoms. Each sub-section covers a distinct category of symptoms, such as mood and behavior for Part 1 and motor functions for Part 3.
*   `upd23b_clinical_state_on_medication` - Whether or not the patient was taking medication such as Levodopa during the UPDRS assessment. Expected to mainly affect the scores for Part 3 (motor function). These medications wear off fairly quickly (on the order of one day) so it's common for patients to take the motor function exam twice in a single month, both with and without medication.

**supplemental\_clinical\_data.csv** Clinical records without any associated CSF samples. This data is intended to provide additional context about the typical progression of Parkinson's. Uses the same columns as **train\_clinical\_data.csv**.

**example\_test\_files/** Data intended to illustrate how the API functions. Includes the same columns delivered by the API (i.e., no updrs columns).

**amp\_pd\_peptide/** Files that enable the API. Expect the API to deliver all of the data (less than 1,000 additional patients) in under five minutes and to reserve less than 0.5 GB of memory. A brief demonstration of what the API delivers [is available here](https://www.kaggle.com/code/sohier/basic-api-demo).

**public\_timeseries\_testing\_util.py** An optional file intended to make it easier to run custom offline API tests. See the script's docstring for details.

#### Imports

In [2]:
#| export
from fastai.tabular.all import *

import seaborn as sns

pd.options.display.max_rows = 20
pd.options.display.max_columns = 8

In [3]:
#| export
try: import fastkaggle
except ModuleNotFoundError:
    !pip install -Uq fastkaggle

from fastkaggle import *

#### Downloading Datasets

In [4]:
#| export
comp = 'amp-parkinsons-disease-progression-prediction'
path = setup_comp(comp, install='fastai')

#### Create Dataframes

Training Data

In [5]:
#| export
protein_train_df = pd.read_csv(path/"train_proteins.csv", low_memory=False)
clinical_train_df = pd.read_csv(path/"train_clinical_data.csv", low_memory=False)
peptide_train_df = pd.read_csv(path/"train_peptides.csv", low_memory=False)
supplement_train_df = pd.read_csv(path/"supplemental_clinical_data.csv", low_memory=False)

Test Data

In [6]:
#| export
protein_test_df = pd.read_csv(path/"example_test_files/test_proteins.csv", low_memory=False)
peptide_test_df = pd.read_csv(path/"example_test_files/test_peptides.csv", low_memory=False)
updrs_test_df = pd.read_csv(path/"example_test_files/test.csv", low_memory=False)

#### TEMP - Replace NA with median

In [7]:
clinical_train_df.dtypes

visit_id                                object
patient_id                               int64
visit_month                              int64
updrs_1                                float64
updrs_2                                float64
updrs_3                                float64
updrs_4                                float64
upd23b_clinical_state_on_medication     object
dtype: object

In [8]:
def get_df_name(df):
    # Return the first global variable name bound to this exact DataFrame object
    name = [x for x in globals() if globals()[x] is df][0]
    return name

In [9]:
def contains_nan(df):
    return df.isnull().values.any()

In [10]:
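def replace_nan(df, col):
    # Fill NaNs in a column with its median; report columns that need manual handling
    try:
        df[col] = df[col].fillna(df[col].median())
    except (TypeError, ValueError):
        # median() is undefined for non-numeric columns such as the On/Off medication flag
        print(f"Manually Fill: {get_df_name(df)}, {col}")
    return df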

In [11]:
train_dfs = [protein_train_df, clinical_train_df, peptide_train_df, supplement_train_df]
for x in train_dfs:
    if contains_nan(x):
        for y in x.columns:
            replace_nan(x, y)

Manually Fill: clinical_train_df, upd23b_clinical_state_on_medication
Manually Fill: supplement_train_df, upd23b_clinical_state_on_medication
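
The same fill can be written without the `try`/`except` by restricting it to numeric columns. The following is a minimal sketch on a toy frame with made-up values, not the notebook's data; `fill_numeric_with_median` is a hypothetical helper name.

```python
import numpy as np
import pandas as pd

def fill_numeric_with_median(df):
    """Fill NaNs in numeric columns with the column median.

    Returns the frame and the non-numeric columns that still contain NaNs.
    """
    numeric_cols = df.select_dtypes(include="number").columns
    df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].median())
    skipped = [c for c in df.columns if c not in numeric_cols and df[c].isna().any()]
    return df, skipped

toy = pd.DataFrame({
    "updrs_1": [1.0, np.nan, 5.0],
    "state": ["On", None, "Off"],
})
toy, needs_manual_fill = fill_numeric_with_median(toy)
print(toy["updrs_1"].tolist())   # [1.0, 3.0, 5.0]
print(needs_manual_fill)         # ['state']
```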


In [12]:
clinical_train_df.upd23b_clinical_state_on_medication.value_counts()

On     775
Off    513
Name: upd23b_clinical_state_on_medication, dtype: int64

In [13]:
# Caution: the column holds 'On'/'Off' (see value_counts above), not 'Yes'/'No', so the
# replace calls match nothing and astype(bool) yields True for every row, NaN included.
clinical_train_df['upd23b_clinical_state_on_medication'] = clinical_train_df['upd23b_clinical_state_on_medication'].replace('No', 0).astype(bool)
clinical_train_df['upd23b_clinical_state_on_medication'] = clinical_train_df['upd23b_clinical_state_on_medication'].replace('Yes', 1).astype(bool)
supplement_train_df['upd23b_clinical_state_on_medication'] = supplement_train_df['upd23b_clinical_state_on_medication'].replace('No', 0).astype(bool)
supplement_train_df['upd23b_clinical_state_on_medication'] = supplement_train_df['upd23b_clinical_state_on_medication'].replace('Yes', 1).astype(bool)

In [14]:
supplement_train_df.head()

Unnamed: 0,visit_id,patient_id,visit_month,updrs_1,updrs_2,updrs_3,updrs_4,upd23b_clinical_state_on_medication
0,35_0,35,0,5.0,3.0,16.0,0.0,True
1,35_36,35,36,6.0,4.0,20.0,0.0,True
2,75_0,75,0,4.0,6.0,26.0,0.0,True
3,75_36,75,36,1.0,8.0,38.0,0.0,True
4,155_0,155,0,5.0,5.0,0.0,0.0,True
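
Since `value_counts()` reports the labels `On` and `Off`, an explicit mapping is one way to keep that information instead of collapsing every row to `True`. This is a sketch on a toy Series with made-up values; using pandas' nullable `boolean` dtype to keep missing entries missing is an assumption, not something the notebook does.

```python
import numpy as np
import pandas as pd

# object dtype, matching what the notebook's dtypes output shows for this column
state = pd.Series(["On", "Off", np.nan, "On"], dtype=object,
                  name="upd23b_clinical_state_on_medication")

# Naive cast: every non-empty string is truthy, and so is NaN
naive = state.astype(bool)

# Explicit mapping: On -> True, Off -> False, missing stays missing
mapped = state.map({"On": True, "Off": False}).astype("boolean")

print(naive.tolist())    # [True, True, True, True]
print(mapped.tolist())   # [True, False, <NA>, True]
```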


#### Transform Dataframes

Merge Dataframes on common values

In [15]:
#| export
df_train = peptide_train_df.merge(protein_train_df, on=['patient_id', 'visit_id', 'visit_month', 'UniProt'], how='left')
df_train = df_train.merge(clinical_train_df, on=['patient_id', 'visit_id', 'visit_month'], how='left')


In [16]:
len(peptide_train_df) + len(protein_train_df) + len(clinical_train_df)

1217190

In [17]:
#| export
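# Caution: this overwrites the merged frame built above. axis=1 concatenation pairs rows by
# index position and repeats the key columns, so rows from different visits can end up side by side.
df_train = pd.concat([protein_train_df, clinical_train_df, peptide_train_df], axis=1)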

In [18]:
df_train.head()

Unnamed: 0,visit_id,visit_month,patient_id,UniProt,...,patient_id.1,UniProt.1,Peptide,PeptideAbundance
0,55_0,0.0,55.0,O00391,...,55,O00391,NEQEQPLGQWHLS,11254.3
1,55_0,0.0,55.0,O00533,...,55,O00533,GNPEPTFSWTK,102060.0
2,55_0,0.0,55.0,O00584,...,55,O00533,IEIPSSVQQVPTIIK,174185.0
3,55_0,0.0,55.0,O14498,...,55,O00533,KPQSAVYSTGSNGILLC(UniMod_4)EAEGEPQPTIK,27278.9
4,55_0,0.0,55.0,O14773,...,55,O00533,SMEQNGPGLEYR,30838.7


In [19]:
df_train.shape

(981834, 19)
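
The difference between joining on keys and concatenating side by side is easy to see on toy frames. In this sketch (made-up values, not the competition data), `pd.concat(..., axis=1)` pairs rows by index position and repeats the key column, while `merge` pairs rows by `visit_id`. The `validate='many_to_one'` argument is an optional safeguard that raises if the right-hand keys are not unique.

```python
import pandas as pd

peptides = pd.DataFrame({
    "visit_id": ["55_0", "55_0", "55_6"],
    "Peptide": ["AAA", "BBB", "CCC"],
})
clinical = pd.DataFrame({
    "visit_id": ["55_6", "55_0"],
    "updrs_1": [8.0, 10.0],
})

# Side-by-side concatenation: aligned on the row index, key column duplicated
side_by_side = pd.concat([peptides, clinical], axis=1)

# Key-based join: each peptide row receives the score of its own visit
joined = peptides.merge(clinical, on="visit_id", how="left", validate="many_to_one")

print(side_by_side.columns.tolist())  # ['visit_id', 'Peptide', 'visit_id', 'updrs_1']
print(side_by_side.shape)             # (3, 4)
print(joined["updrs_1"].tolist())     # [10.0, 10.0, 8.0]
```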

In [20]:
df_test = peptide_test_df.merge(protein_test_df, on=['patient_id', 'visit_id', 'visit_month'], how='left')

In [21]:
df_test = df_test.merge(updrs_test_df, on=['patient_id', 'visit_id', 'visit_month'], how='left')

In [22]:
df_test

Unnamed: 0,visit_id,visit_month,patient_id,UniProt_x,...,group_key_y,updrs_test,row_id,group_key
0,50423_0,0,50423,O00391,...,0,updrs_1,50423_0_updrs_1,0
1,50423_0,0,50423,O00391,...,0,updrs_2,50423_0_updrs_2,0
2,50423_0,0,50423,O00391,...,0,updrs_3,50423_0_updrs_3,0
3,50423_0,0,50423,O00391,...,0,updrs_4,50423_0_updrs_4,0
4,50423_0,0,50423,O00391,...,0,updrs_1,50423_0_updrs_1,0
...,...,...,...,...,...,...,...,...,...
1863607,3342_6,6,3342,Q9Y6R7,...,6,updrs_4,3342_6_updrs_4,6
1863608,3342_6,6,3342,Q9Y6R7,...,6,updrs_1,3342_6_updrs_1,6
1863609,3342_6,6,3342,Q9Y6R7,...,6,updrs_2,3342_6_updrs_2,6
1863610,3342_6,6,3342,Q9Y6R7,...,6,updrs_3,3342_6_updrs_3,6


In [23]:
df_test.columns

Index(['visit_id', 'visit_month', 'patient_id', 'UniProt_x', 'Peptide',
       'PeptideAbundance', 'group_key_x', 'UniProt_y', 'NPX', 'group_key_y',
       'updrs_test', 'row_id', 'group_key'],
      dtype='object')

In [24]:
df_test[-1:]

Unnamed: 0,visit_id,visit_month,patient_id,UniProt_x,...,group_key_y,updrs_test,row_id,group_key
1863611,3342_6,6,3342,Q9Y6R7,...,6,updrs_4,3342_6_updrs_4,6


In [25]:
df_test.shape

(1863612, 13)
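
The `UniProt_x`/`UniProt_y` suffixes in the columns above are what pandas produces when a column shared by both frames is left out of `on`. Each peptide row is then paired with every protein row of the same visit, which inflates the row count. A toy sketch with made-up values:

```python
import pandas as pd

peptides = pd.DataFrame({
    "visit_id": ["1_0"] * 3,
    "UniProt": ["P1", "P1", "P2"],
    "Peptide": ["AAA", "BBB", "CCC"],
})
proteins = pd.DataFrame({
    "visit_id": ["1_0"] * 2,
    "UniProt": ["P1", "P2"],
    "NPX": [100.0, 200.0],
})

# 'UniProt' omitted from the keys: 3 peptides x 2 proteins for the visit
without_key = peptides.merge(proteins, on=["visit_id"], how="left")

# 'UniProt' included: each peptide is matched only to its own protein
with_key = peptides.merge(proteins, on=["visit_id", "UniProt"], how="left")

print(without_key.shape)          # (6, 5)
print(with_key.shape)             # (3, 4)
print(with_key["NPX"].tolist())   # [100.0, 100.0, 200.0]
```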

The preprocessing steps applied later do not fill NaN values in the dependent variables, so the targets are filled with their medians here to get a baseline model running.

In [26]:
#| export
median_targs = df_train[['updrs_1', 'updrs_2', 'updrs_3', 'updrs_4']].median()

In [27]:
#| export
df_train[['updrs_1', 'updrs_2', 'updrs_3', 'updrs_4']] = df_train[['updrs_1', 'updrs_2', 'updrs_3', 'updrs_4']].fillna(median_targs)

Specify Targets

In [28]:
#| export
dep_var = ['updrs_1', 'updrs_2', 'updrs_3', 'updrs_4']

Add processes to clean and normalize the data, split the columns into continuous and categorical variables, and create random training and validation sets with an 80%/20% split.

In [34]:
df_train.dtypes

visit_id                                object
visit_month                            float64
patient_id                             float64
UniProt                                 object
NPX                                    float64
visit_id                                object
patient_id                             float64
visit_month                            float64
updrs_1                                float64
updrs_2                                float64
updrs_3                                float64
updrs_4                                float64
upd23b_clinical_state_on_medication     object
visit_id                                object
visit_month                              int64
patient_id                               int64
UniProt                                 object
Peptide                                 object
PeptideAbundance                       float64
dtype: object

In [35]:
#| export
procs = [Categorify, FillMissing, Normalize]


In [36]:
cont, cat = cont_cat_split(df_train, dep_var=dep_var, max_card=1)


AttributeError: 'DataFrame' object has no attribute 'dtype'
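
A likely cause of this error is the duplicated column names listed in `df_train.dtypes` above. When a label occurs more than once, `df[label]` returns a DataFrame rather than a Series, and a DataFrame has `dtypes` but no `dtype`. The sketch below reproduces that on a toy frame and shows one possible workaround, keeping only the first occurrence of each name. Whether dropping the repeats is appropriate for this dataset is an assumption.

```python
import pandas as pd

df = pd.DataFrame([[1, 2.0, 3]], columns=["patient_id", "NPX", "patient_id"])

# Selecting a duplicated label returns every matching column as a DataFrame
selected = df["patient_id"]
print(type(selected).__name__)      # DataFrame
print(hasattr(selected, "dtype"))   # False

# Keep only the first occurrence of each column name
deduped = df.loc[:, ~df.columns.duplicated()]
print(deduped.columns.tolist())               # ['patient_id', 'NPX']
print(type(deduped["patient_id"]).__name__)   # Series
```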

In [None]:
splits = RandomSplitter(valid_pct=0.2)(range_of(df_train))
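
Conceptually, a random 80/20 split shuffles the row indices and cuts them at the 20% mark. The NumPy sketch below illustrates the idea only; it is not fastai's implementation, and `random_split` is a hypothetical helper.

```python
import numpy as np

def random_split(n_rows, valid_pct=0.2, seed=0):
    """Return (train_idx, valid_idx) as disjoint arrays of shuffled row indices."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_rows)
    cut = int(valid_pct * n_rows)
    return idx[cut:], idx[:cut]

train_idx, valid_idx = random_split(10)
print(len(train_idx), len(valid_idx))  # 8 2
```

Because the split is made per row, rows from the same patient can land on both sides, so validation scores may look better than they would on unseen patients.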

Create a `TabularPandas` object to prepare the DataFrames to be passed to `DataLoaders`

In [30]:
#| export
to = TabularPandas(df_train, procs, cat, cont, y_names=dep_var, splits=splits)

NameError: name 'cat' is not defined

This contest is scored on the SMAPE metric, so we create a class to calculate the SMAPE score for multi-target models

In [None]:
#| export
class MultiTargetSMAPE(Metric):
    def __init__(self):
        super().__init__()
    
    def reset(self):
        self.total = 0.
        self.count = 0
        
    def accumulate(self, learn):
        pred,targ = learn.pred, learn.y
        # Per-element |pred - targ| / (|pred| + |targ|), defined as 0 where both are 0
        denom = torch.abs(pred) + torch.abs(targ)
        non_zero_denom = denom != 0
        num = torch.abs(pred - targ)
        smape = torch.zeros_like(num)
        smape[non_zero_denom] = num[non_zero_denom] / denom[non_zero_denom]
        # Sum over the batch dimension, keeping one running total per target
        self.total += smape.sum(dim=0)
        self.count += learn.y.size(0)
    
    @property
    def value(self):
        return (self.total / self.count).mean().item() * 100  # SMAPE in percentage
    
    @property
    def name(self):
        return 'multi_target_smape'
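
For a quick sanity check outside the training loop, the same per-element formula can be written in NumPy. This sketch follows the class above, including the convention stated at the top of the notebook that a term is 0 when the actual and predicted values are both 0. The `double` flag is an assumption added for illustration: some SMAPE definitions divide by the mean of `|actual|` and `|predicted|`, which doubles every term, so it is worth checking which form the leaderboard uses.

```python
import numpy as np

def smape(actual, predicted, double=False):
    """SMAPE in percent; terms with actual == predicted == 0 count as 0."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    num = np.abs(predicted - actual)
    denom = np.abs(actual) + np.abs(predicted)
    # Leave the pre-filled 0 in place wherever the denominator is 0
    ratio = np.divide(num, denom, out=np.zeros_like(num), where=denom != 0)
    scale = 2.0 if double else 1.0
    return 100.0 * scale * ratio.mean()

actual = [0, 1, 2, 3]
predicted = [0, 1, 6, 3]
print(smape(actual, predicted))               # 12.5
print(smape(actual, predicted, double=True))  # 25.0
```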


Load data to dataloaders

In [None]:
#| export
dls = to.dataloaders(bs=256)

Create `learner` model

In [None]:
#| export
learn = tabular_learner(dls, layers=[200,100], metrics=[MultiTargetSMAPE()], n_out=4, y_range=(0, 80), loss_func=mse)

Find optimal learning rate

In [None]:
learn.lr_find()

Train Model

In [None]:
#| export
learn.fit_one_cycle(10, 1e-3)

The metric is still improving with each epoch, but the model needs to be checked against held-out test data for overfitting before these can be counted as real improvements

In [None]:
#| export
xs, ys = to.train.xs, to.train.ys
valid_xs, valid_ys = to.valid.xs, to.valid.ys

In [None]:
updrs_test_df.head()

In [None]:
test_dl = learn.dls.test_dl(updrs_test_df)

In [None]:
import nbdev
nbdev.export.nb_export('pb_parkinsons_prog.ipynb', 'app_v1')
print("export successful")