I have been working in bioinformatics for about two years, and I am currently enrolled in the MSc AI Specialization program at the University of Windsor, Canada. As part of the program I have been reading papers on drugs, diseases, genes and related topics. About four months into my studies, I decided to focus on Graph Neural Networks (GNNs), and I am also a fan of Transformers and attention in general. I have been working towards a paper, or at least a proposal, but most of my ideas have not looked promising in initial experiments. So I have decided to revise the concepts I have learnt and am working with, and this first blog post, on understanding the SIDER dataset, is part of that revision.

For now, I am planning to work on drug side-effect prediction, which is a multi-label classification problem: each drug can have many side effects at once. Drugs are molecules, so they can be natively represented as molecular graphs, which makes them a perfect fit for exploring GNNs. For the side-effect information, many datasets are available, and one of the best known is <b>SIDER</b>.
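
To make the "molecule as a graph" idea concrete, here is a minimal sketch that encodes ethanol (heavy atoms only) as a node list and an edge list, which is roughly the layout GNN libraries expect. The choice of molecule and all variable names are my own illustrative assumptions; nothing here comes from SIDER itself.

```python
# Ethanol (CH3-CH2-OH), heavy atoms only; hydrogens are left implicit.
atoms = ['C', 'C', 'O']   # node features: one element symbol per atom
bonds = [(0, 1), (1, 2)]  # undirected single bonds as (atom_i, atom_j)

# Store every undirected bond in both directions, as message passing needs.
edge_index = [(i, j) for i, j in bonds] + [(j, i) for i, j in bonds]

# Adjacency list: the neighbours each atom would exchange messages with.
neighbours = {idx: [] for idx in range(len(atoms))}
for src, dst in edge_index:
    neighbours[src].append(dst)

for idx, symbol in enumerate(atoms):
    print(idx, symbol, sorted(neighbours[idx]))
```

In a real pipeline the node features would be richer (atom type, charge, aromaticity and so on), but the node/edge structure stays the same.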

<b>SIDER</b> stands for Side Effect Resource[1](http://sideeffects.embl.de/) and contains information on marketed medicines and their recorded adverse drug reactions. The information is extracted from public documents and package inserts (drug labels). It also contains information about side-effect frequency, drug and side-effect classifications, as well as links to further information, for example drug-target relations.

The table below summarizes the information contained in SIDER 4.1, which was released on October 21, 2015.

| # of SE | # of Drugs | # of drug-SE pairs |
|--------------------|--------------------|--------------------|
|5868 | 1430 | 139756 |
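
These three numbers already hint at how sparse the label space is. The short calculation below is a back-of-the-envelope sketch using only the figures from the table; the variable names are mine.

```python
n_side_effects = 5868
n_drugs = 1430
n_pairs = 139756

# Average number of recorded side effects per drug.
avg_se_per_drug = n_pairs / n_drugs
# Fraction of all possible (drug, side effect) cells that are positive.
density = n_pairs / (n_drugs * n_side_effects)

print(f'Average side effects per drug: {avg_se_per_drug:.1f}')
print(f'Fraction of positive drug-SE cells: {density:.2%}')
```

So a drug has a little under 100 side effects on average, while fewer than 2% of all possible drug-side-effect combinations are actually recorded.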

The files we are really interested in are drug_atc.tsv and meddra_all_se.tsv.gz. For a complete description of all the data provided, please read the README that ships with SIDER.

<b>drug_atc.tsv:</b> This file contains the drug_id and the corresponding ATC code. It can be loaded in Python using the snippet below:
```python
import pandas as pd
drug_atc_df = pd.read_csv('raw_data/sider/drug_atc.tsv', sep='\t', header=None).rename(columns={0:'drug_id', 1:'atc_code'})
```
Below is a sample of the data:

| drug_id | atc_code |
|----------------|----------------|
| CID100000085 | A16AA01 |
| CID100000119 | L03AA03 |
| CID100000119 | N03AG03 |
| CID100000137 | L01XD04 |
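
Note that CID100000119 appears twice in the sample, so one drug can carry several ATC codes. The sketch below collapses the table to one row per drug. It inlines the four sample rows instead of reading the file, and the names `sample_atc_df`, `atc_per_drug` and `atc_level1` are my own. The extra column relies on the standard ATC convention that the first letter of a code is its top-level anatomical group.

```python
import pandas as pd

# The four sample rows shown above, inlined so the sketch runs without the file.
sample_atc_df = pd.DataFrame({
    'drug_id': ['CID100000085', 'CID100000119', 'CID100000119', 'CID100000137'],
    'atc_code': ['A16AA01', 'L03AA03', 'N03AG03', 'L01XD04'],
})

# One row per drug, with all of its ATC codes collected into a sorted list.
atc_per_drug = (
    sample_atc_df.groupby('drug_id')['atc_code']
    .apply(sorted)
    .reset_index()
)

# First letter of each ATC code = top-level anatomical group.
atc_per_drug['atc_level1'] = atc_per_drug['atc_code'].apply(
    lambda codes: sorted({code[0] for code in codes})
)
print(atc_per_drug)
```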

Similarly, let's extract meddra_all_se.tsv.gz and load the resulting meddra_all_se.tsv file.
```python
drug_all_se_df = pd.read_csv('raw_data/sider/meddra_all_se.tsv', \
                            sep='\t', header=None).rename(
                                columns={0:'STITCH ID FLAT', 
                                        1:'STITCH ID STEREO',
                                        2:'UMLS CONCEPT ID',
                                        3:'MEDDRA CONCEPT TYPE',
                                        4:'MEDRA TERM UMLS CONCEPT ID',
                                        5:'SIDE EFFECT NAME'
                                        })
```
Below is a sample of the data:

| STITCH ID FLAT | STITCH ID STEREO | UMLS CONCEPT ID | MEDDRA CONCEPT TYPE | MEDRA TERM UMLS CONCEPT ID | SIDE EFFECT NAME |
|----------------|------------------|-----------------|---------------------|----------------------------|------------------|
| CID100000085 | CID000010917 | C0000729 | LLT | C0000729 | Abdominal cramps |
| CID100000085 | CID000010917 | C0000729 | PT | C0000737 | Abdominal pain |
| CID100000085 | CID000010917 | C0000737 | LLT | C0000737 | Abdominal pain |
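
As a side note, the manual extraction step is optional: `pd.read_csv` infers gzip compression from the `.gz` extension by default. The sketch below demonstrates this on a tiny stand-in file written to a temporary directory, so it runs without the real download. The two rows are copied from the sample above, and `demo_se_df` and the temporary path are placeholders of mine.

```python
import gzip
import os
import tempfile

import pandas as pd

columns = ['STITCH ID FLAT', 'STITCH ID STEREO', 'UMLS CONCEPT ID',
           'MEDDRA CONCEPT TYPE', 'MEDRA TERM UMLS CONCEPT ID', 'SIDE EFFECT NAME']

# Stand-in for the real download: two tab-separated rows written to a .gz file.
rows = ('CID100000085\tCID000010917\tC0000729\tLLT\tC0000729\tAbdominal cramps\n'
        'CID100000085\tCID000010917\tC0000729\tPT\tC0000737\tAbdominal pain\n')
tmp_dir = tempfile.mkdtemp()
gz_path = os.path.join(tmp_dir, 'meddra_all_se.tsv.gz')
with gzip.open(gz_path, 'wt', encoding='utf-8') as handle:
    handle.write(rows)

# The .gz extension is enough for pandas to decompress on the fly.
demo_se_df = pd.read_csv(gz_path, sep='\t', header=None, names=columns)
print(demo_se_df.shape)
```

Passing `names=` at read time is equivalent to the `rename` call used above; it is just a slightly shorter way to attach the column names.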

Let's look at the header and introduce these terms: <br><br>
<b>STITCH ID FLAT: </b> Represents a compound without considering its stereochemistry, i.e. the arrangement of its atoms in 3-D space is ignored. Suitable for general-purpose analyses and for interactions that do not depend on stereochemistry.<br><br>
<b>STITCH ID STEREO: </b> Represents a compound while preserving its stereochemical information, including the configuration of its chiral centers. A chiral center is an atom within a molecule that has four different substituents attached to it. Chiral centers give rise to stereoisomers, which are molecules that have the same connectivity of atoms but differ in their spatial arrangement.<br><br>
<b>UMLS Concept ID: </b> The unique identifier assigned to a specific concept within the Unified Medical Language System (UMLS) Metathesaurus. UMLS is a comprehensive biomedical and healthcare terminology system developed by the National Library of Medicine (NLM) in the United States. The Metathesaurus integrates various controlled vocabularies, classifications, and ontologies from multiple sources, allowing for the mapping and interoperability of the different terminologies used in the medical field.<br><br>
<b>MedDRA Concept Type: </b> LLT = lowest level term, PT = preferred term; in a few cases the term is neither LLT nor PT. All side effects found on the labels are given as LLT. Additionally, the PT is shown. There is at least one PT for every LLT, but sometimes the PT is the same as the LLT. LLTs are sometimes too detailed, and therefore you might want to filter for PT.<br><br>
<b>MedDRA Term UMLS Concept ID: </b> The UMLS concept ID of the MedDRA term.<br><br>
<b>Side Effect Name: </b> The name of the side effect.<br><br>
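
Judging from the sample rows, a STITCH ID looks like `CID`, then one flag digit (1 for flat, 0 for stereo), then an 8-digit zero-padded number. My understanding is that this number is the PubChem compound ID, which is what you would need to fetch the molecular structure, but please treat that as an assumption and confirm it against the SIDER README. The helpers `stitch_to_pubchem_cid` and `is_stereo` below are hypothetical names of mine.

```python
def stitch_to_pubchem_cid(stitch_id: str) -> int:
    """Strip the 'CID' prefix and the flat/stereo flag digit (assumed layout)."""
    if not stitch_id.startswith('CID') or len(stitch_id) != 12:
        raise ValueError(f'Unexpected STITCH id: {stitch_id!r}')
    return int(stitch_id[4:])


def is_stereo(stitch_id: str) -> bool:
    """Assumed convention: flag digit 0 = stereo, 1 = flat."""
    return stitch_id[3] == '0'


print(stitch_to_pubchem_cid('CID100000085'), is_stereo('CID100000085'))
print(stitch_to_pubchem_cid('CID000010917'), is_stereo('CID000010917'))
```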


For our work, we are interested in the STITCH ID STEREO column and the PT terms. We deliberately leave out the LLT terms as they are sometimes too detailed. As the next step, we filter <b>drug_all_se_df</b> and keep only the rows whose MedDRA concept type is PT. We then remove every MedDRA term UMLS concept ID that occurs fewer than 5 times.
```python
# keep only PT terms with min freq 5
print(f'Original All_Drug_SE Association {drug_all_se_df.shape}')
drug_all_se_study_df = drug_all_se_df[drug_all_se_df['MEDDRA CONCEPT TYPE'] == 'PT']
print(f'Only PT All_Drug_SE Association {drug_all_se_study_df.shape}')
print(f'Unique SE PT {drug_all_se_study_df["MEDRA TERM UMLS CONCEPT ID"].nunique()}')
# drug_all_se_study_df.head()
# count how many rows each PT concept id appears in
concept_counts = drug_all_se_study_df['MEDRA TERM UMLS CONCEPT ID'].value_counts()
# keep only the rows whose PT concept id occurs at least 5 times
filtered_df = drug_all_se_study_df[drug_all_se_study_df['MEDRA TERM UMLS CONCEPT ID'].isin(concept_counts[concept_counts >= 5].index)]
print(f'PT with freq >= 5 {filtered_df.shape}')
print(filtered_df.head())
print(f'Unique PT with freq >= 5 {filtered_df["MEDRA TERM UMLS CONCEPT ID"].nunique()}')
```
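
To check that the `value_counts` plus `isin` pattern behaves as intended, here is the same logic on a toy table where the counts are easy to verify by eye. The helper name `keep_frequent`, the toy data and the threshold of 2 are all made up for illustration.

```python
import pandas as pd


def keep_frequent(df: pd.DataFrame, column: str, min_count: int) -> pd.DataFrame:
    """Keep rows whose value in `column` occurs at least `min_count` times."""
    counts = df[column].value_counts()
    frequent_values = counts[counts >= min_count].index
    return df[df[column].isin(frequent_values)]


# Toy data: C1 appears 3 times, C2 twice, C3 once.
toy_df = pd.DataFrame({
    'drug': ['d1', 'd2', 'd3', 'd1', 'd2', 'd3'],
    'concept': ['C1', 'C1', 'C1', 'C2', 'C2', 'C3'],
})
toy_filtered = keep_frequent(toy_df, 'concept', min_count=2)
print(sorted(toy_filtered['concept'].unique()))
```

With `min_count=2`, the single C3 row is dropped and the five C1 and C2 rows survive.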

Through these steps, we went from almost 309,000 associations to about 159,000, with a significant reduction in the number of distinct side effects (from the original 4,251 to the 2,061 side effects that occur at least 5 times). Finally, we group the entries by STITCH ID STEREO and list all the unique PT terms associated with each drug.


```python
# group by drug (STITCH ID STEREO) and list the unique PT concept ids of each drug
# (uses filtered_df so the min-frequency filter applies; no .head(), which would keep only 5 drugs)
drug_se_list_df = filtered_df.groupby('STITCH ID STEREO')['MEDRA TERM UMLS CONCEPT ID'].apply(set).apply(list).reset_index()
```
Below is a sample of the drug_se_list_df

|STITCH ID STEREO	|MEDRA TERM UMLS CONCEPT ID|
|-------------------|--------------------------|
|	CID000000119	|[C0030193, C0042109, C0002792, C0002994, C0151,..]|
|	CID000000137	|[C0041834, C0340726, C0004936, C0702166, C0040,..]|
|	CID000000143	|[C0002792, C0041834, C0340865, C0746883, C0009,..]|

The resulting dataset, drug_se_list_df, now holds each STITCH ID together with the list of UMLS concept IDs of its associated, filtered MedDRA terms. Next, we remove all drugs that are associated with fewer than 5 or more than 400 side effects. Most drugs have few drug side effects (DSEs), while only a few drugs are associated with a large number of DSEs; this can lead to an imbalanced class distribution and, in turn, to a biased model.
We can plot the distribution of the number of side effects per drug using the code below:
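```python
import matplotlib.pyplot as plt
import seaborn as sns

# number of unique side effects per drug (assumed definition of se_count)
drug_se_list_df['se_count'] = drug_se_list_df['MEDRA TERM UMLS CONCEPT ID'].apply(len)
# histogram of se_count with labelled x and y axes
plt.figure(figsize=(10, 6))
sns.histplot(data=drug_se_list_df, x='se_count', bins=100)
plt.xlabel('Number of Side Effects')
plt.ylabel('Number of Drugs')
```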
![Distribution of # of side-effects vs # of drugs](imgs/output.png)

Next, we perform the filtering and keep only the drugs with at least 5 and at most 400 side effects.
```python
print(drug_se_list_df.shape)
# keep drugs with 5 <= se_count <= 400 (drops fewer than 5 or more than 400)
drug_se_list_df_final = drug_se_list_df[(drug_se_list_df['se_count'] >= 5) & (drug_se_list_df['se_count'] <= 400)]
print(drug_se_list_df_final.shape)
```
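
Since the end goal is multi-label prediction, each drug's list of side effects eventually has to become a fixed-length 0/1 target vector. The sketch below shows one way to do that in plain Python. It uses a small subset of the concept IDs from the sample table rather than the real `drug_se_list_df_final`, and all variable names are my own.

```python
# Toy stand-in for the final table: drug id -> list of PT concept ids
# (a small subset of the ids shown in the sample above).
drug_to_side_effects = {
    'CID000000119': ['C0030193', 'C0002792'],
    'CID000000137': ['C0041834'],
    'CID000000143': ['C0002792', 'C0041834'],
}

# Fixed column order: one column per side effect seen in the data.
label_vocab = sorted({se for ses in drug_to_side_effects.values() for se in ses})
label_index = {se: col for col, se in enumerate(label_vocab)}

# One multi-hot row per drug: 1 where the drug has that side effect, else 0.
label_matrix = {}
for drug_id, side_effects in drug_to_side_effects.items():
    row = [0] * len(label_vocab)
    for se in side_effects:
        row[label_index[se]] = 1
    label_matrix[drug_id] = row

print(label_vocab)
for drug_id, row in label_matrix.items():
    print(drug_id, row)
```

On the real data each row would have one column per retained PT term, and most entries would be 0, which is where the class-imbalance concern mentioned above comes from.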