In [None]:
#| default_exp app_v1

### Parkinson's Disease Progression Predictions

Competition [Link](https://www.kaggle.com/competitions/amp-parkinsons-disease-progression-prediction/overview)

Submissions are evaluated on SMAPE between forecasts and actual values. We define SMAPE = 0 when the actual and predicted values are both 0.
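
The metric above can be sketched in a few lines. This is a minimal version that assumes the common definition of SMAPE, the mean of |F - A| / ((|A| + |F|) / 2) expressed as a percentage; the competition's exact scaling is not stated here, and the function name `smape` is an illustrative choice, not part of any competition API.

```python
import numpy as np

def smape(y_true, y_pred):
    """Symmetric mean absolute percentage error, in percent (assumed definition)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    diff = np.abs(y_pred - y_true)
    denom = (np.abs(y_true) + np.abs(y_pred)) / 2.0
    # Where actual and predicted are both 0 the denominator is 0;
    # those terms are defined as 0 instead of dividing by zero.
    terms = np.divide(diff, denom, out=np.zeros_like(diff), where=denom != 0)
    return 100.0 * terms.mean()
```

Under this definition each term ranges from 0 to 2, so the score ranges from 0 (perfect) to 200.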

For each patient visit where a protein/peptide sample was taken you will need to estimate both their UPDRS scores for that visit and predict their scores for any potential visits 6, 12, and 24 months later. Predictions for any visits that didn't ultimately take place are ignored.
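
To make the size of the task concrete, the sketch below enumerates what has to be predicted for a single visit: each of the four UPDRS parts at horizons of 0, 6, 12, and 24 months. The function name `prediction_slots` and the dictionary keys are hypothetical and only illustrate the bookkeeping; they are not the submission format.

```python
from itertools import product

HORIZONS = (0, 6, 12, 24)  # months after the visit, as described above
TARGETS = ("updrs_1", "updrs_2", "updrs_3", "updrs_4")

def prediction_slots(visit_month):
    """List every (target, horizon) pair to predict for one visit (illustrative only)."""
    return [
        {"target": target, "horizon": horizon, "target_month": visit_month + horizon}
        for target, horizon in product(TARGETS, HORIZONS)
    ]

slots = prediction_slots(6)
print(len(slots))  # 4 targets x 4 horizons
```

Slots whose target visit never takes place are simply not scored, so a model can emit all of them without penalty.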

You must submit to this competition using the provided Python time-series API, which ensures that models do not peek forward in time. To use the API, follow this template in Kaggle Notebooks:

```Python
import numpy as np
import amp_pd_peptide

env = amp_pd_peptide.make_env()   # initialize the environment
iter_test = env.iter_test()    # an iterator which loops over the test files
for (test, test_peptides, test_proteins, sample_submission) in iter_test:
    sample_submission['rating'] = np.arange(len(sample_submission))  # make your predictions here
    env.predict(sample_submission)   # register your predictions
```

#### Dataset Description

The goal of this competition is to predict the course of Parkinson's disease (PD) using protein abundance data. The complete set of proteins involved in PD remains an open research question and any proteins that have predictive value are likely worth investigating further. The core of the dataset consists of protein abundance values derived from mass spectrometry readings of cerebrospinal fluid (CSF) samples gathered from several hundred patients. Each patient contributed several samples over the course of multiple years while they also took assessments of PD severity.

This is a time-series code competition: you will receive test set data and make predictions with Kaggle's time-series API.

Files
-----

**train\_peptides.csv** Mass spectrometry data at the peptide level. Peptides are the component subunits of proteins.

*   `visit_id` - ID code for the visit.
*   `visit_month` - The month of the visit, relative to the first visit by the patient.
*   `patient_id` - An ID code for the patient.
*   `UniProt` - [The UniProt ID code](https://www.uniprot.org/id-mapping) for the associated protein. There are often several peptides per protein.
*   `Peptide` - The sequence of amino acids included in the peptide. See [this table](https://en.wikipedia.org/wiki/Amino_acid#Physicochemical_properties_of_amino_acids) for the relevant codes. Some rare annotations may not be included in the table. The test set may include peptides not found in the train set.
*   `PeptideAbundance` - The frequency of the peptide in the sample.

**train\_proteins.csv** Protein expression frequencies aggregated from the peptide level data.

*   `visit_id` - ID code for the visit.
*   `visit_month` - The month of the visit, relative to the first visit by the patient.
*   `patient_id` - An ID code for the patient.
*   `UniProt` - [The UniProt ID code](https://www.uniprot.org/id-mapping) for the associated protein. There are often several peptides per protein. The test set may include proteins not found in the train set.
*   `NPX` - Normalized protein expression. The frequency of the protein's occurrence in the sample. May not have a 1:1 relationship with the component peptides as some proteins contain repeated copies of a given peptide.
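
Because the protein file is in long format (one row per visit and protein), a common first step is to pivot it into one row per visit. The sketch below does this on a toy frame with the columns listed above; the visit IDs, protein codes, and NPX values are made-up placeholders, not real data.

```python
import pandas as pd

# Toy stand-in for train_proteins.csv; all values are placeholders.
proteins = pd.DataFrame({
    "visit_id": ["visit_a", "visit_a", "visit_b", "visit_b"],
    "visit_month": [0, 0, 6, 6],
    "patient_id": [1, 1, 1, 1],
    "UniProt": ["PROT_A", "PROT_B", "PROT_A", "PROT_C"],
    "NPX": [11254.3, 732430.0, 13163.6, 28723.0],
})

# One row per visit, one column per protein; proteins that were not
# measured at a visit show up as NaN.
wide = proteins.pivot_table(
    index="visit_id", columns="UniProt", values="NPX", aggfunc="mean"
)
print(wide)
```

Since the test set may contain proteins not seen in training, any model built on the wide table needs a fixed column list and a strategy for missing values.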

**train\_clinical\_data.csv**

*   `visit_id` - ID code for the visit.
*   `visit_month` - The month of the visit, relative to the first visit by the patient.
*   `patient_id` - An ID code for the patient.
*   `updrs_[1-4]` - The patient's score for part N of the [Unified Parkinson's Disease Rating Scale](https://www.movementdisorders.org/MDS/MDS-Rating-Scales/MDS-Unified-Parkinsons-Disease-Rating-Scale-MDS-UPDRS.htm). Higher numbers indicate more severe symptoms. Each sub-section covers a distinct category of symptoms, such as mood and behavior for Part 1 and motor functions for Part 3.
*   `upd23b_clinical_state_on_medication` - Whether or not the patient was taking medication such as Levodopa during the UPDRS assessment. Expected to mainly affect the scores for Part 3 (motor function). These medications wear off fairly quickly (on the order of one day) so it's common for patients to take the motor function exam twice in a single month, both with and without medication.
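
The `updrs_[1-4]` pattern above maps directly onto a column filter. The sketch below picks out the four target columns from a toy clinical frame and reshapes them into one row per (visit, target), dropping scores that were not recorded. The scores are invented, and the medication column is left out of the toy frame for brevity.

```python
import pandas as pd

# Toy stand-in for train_clinical_data.csv; all values are placeholders.
clinical = pd.DataFrame({
    "visit_id": ["visit_a", "visit_b", "visit_c"],
    "visit_month": [0, 6, 12],
    "patient_id": [1, 1, 1],
    "updrs_1": [10.0, 8.0, 10.0],
    "updrs_2": [6.0, 10.0, 10.0],
    "updrs_3": [15.0, 34.0, 41.0],
    "updrs_4": [None, None, 0.0],  # missing at the first two visits
})

# Select the target columns by the naming pattern.
targets = clinical.filter(regex=r"^updrs_[1-4]$").columns.tolist()

# Long format: one row per visit and UPDRS part, without missing scores.
long_scores = clinical.melt(
    id_vars=["visit_id", "patient_id", "visit_month"],
    value_vars=targets,
    var_name="target",
    value_name="score",
).dropna(subset=["score"])
print(long_scores)
```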

**supplemental\_clinical\_data.csv** Clinical records without any associated CSF samples. This data is intended to provide additional context about the typical progression of Parkinson's. Uses the same columns as **train\_clinical\_data.csv**.

**example\_test\_files/** Data intended to illustrate how the API functions. Includes the same columns delivered by the API (i.e., no updrs columns).

**amp\_pd\_peptide/** Files that enable the API. Expect the API to deliver all of the data (fewer than 1,000 additional patients) in under five minutes and to reserve less than 0.5 GB of memory. A brief demonstration of what the API delivers [is available here](https://www.kaggle.com/code/sohier/basic-api-demo).

**public\_timeseries\_testing\_util.py** An optional file intended to make it easier to run custom offline API tests. See the script's docstring for details.

#### Imports

In [1]:
#| export
from fastai.tabular.all import *

pd.options.display.max_rows = 20
pd.options.display.max_columns = 8

In [2]:
#| export
try: import fastkaggle
except ModuleNotFoundError:
    !pip install -Uq fastkaggle

from fastkaggle import *

#### Downloading Datasets

In [3]:
#| export
comp = 'amp-parkinsons-disease-progression-prediction'
path = setup_comp(comp, install='fastai')

Downloading amp-parkinsons-disease-progression-prediction.zip to /home/petewin/kaggle_comps/parkinsons-progression


100%|██████████| 16.1M/16.1M [00:08<00:00, 1.97MB/s]





In [7]:
#| export
df_train_proteins = pd.read_csv(path/"train_proteins.csv")

In [None]:
import nbdev
nbdev.export.nb_export('pb_parkinsons_prog.ipynb', 'app_v1')
print("export successful")