# Homework 2: Drug Side Effects

*In the second homework, you are going to work with the SIDER side effects dataset.*

**Submission Instructions**

---
It is important that you follow the submission instructions. 
1. Copy this assignment notebook to your Drive. <font color = 'red'> `File` --> `Save a copy in Drive`</font>. Rename it as <font color = 'green'>`Lastname_Firstname_hw2`</font>.

2. Write your solutions in the cells  marked <font color = 'green'>`# your code`</font>.

3. Do not delete your outputs. They are essential for the grading. Make sure that cells containing your solutions are executed, and the results are displayed on the notebook.

4. When you're done, please submit your solutions as an <font color="red">`.ipynb`</font> file. To do so:

   1. Click on <font color="red">`File`</font> at the top left of the Colab screen, then click on <font color = 'red'>`Download .ipynb`</font>.
   2. Then submit the downloaded <font color="red">`.ipynb`</font> version of your work on SUCourse.


If you have any questions, you may send an email to the TAs and LAs.

---

In this homework, you will work on a dataset from [SIDER Side Effect Resource](http://sideeffects.embl.de/). SIDER contains information on marketed drugs and their recorded adverse drug reactions (ADR).

For this homework, you will use the provided file `meddra_all_se.csv`. This is a modified and simplified version of the original dataset, which contains possible side effects of different drugs.
<!-- This is not the original data file, we modified and eliminated some parts to make your work easier. -->

As listed in their package inserts, drugs can cause side effects in addition to their therapeutic effects. This dataset contains entries of drugs and their potential side effects.

In the dataset, each row has four attribute values separated by `,`. These attributes are described below in the order in which they appear in each row.


  1.   **STITCH compound id:** Refers to ID of a particular drug 

  2.   **UMLS concept id:** Unified Medical Language System ID

  3.   **UMLS concept id for MedDRA term** 

  4.   **Side effect name:** Contains the possible side effect entry.

**You will consider the STITCH compound id as the id of a particular drug. Therefore, rows with the same STITCH compound id refer to possible side effects of the same particular drug.**

A snapshot from the dataset containing three sample rows is also provided below.
```
...
CID100000085,CID000010917,C0015230,Rash
CID100000085,CID000010917,C0015397,Eye disorder
CID100000085,CID000010917,C0015967,Body temperature increased

...
```
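
As a minimal sketch of the row layout described above, the three sample rows can be parsed from an in-memory string into a two-dimensional NumPy array of strings. The names `sample` and `rows` are illustrative assumptions, and the real file would be read from your Drive rather than from a string.

```python
import io
import numpy as np

# The three sample rows from the snapshot above, held in memory
# instead of being read from the real file.
sample = io.StringIO(
    "CID100000085,CID000010917,C0015230,Rash\n"
    "CID100000085,CID000010917,C0015397,Eye disorder\n"
    "CID100000085,CID000010917,C0015967,Body temperature increased\n"
)

# dtype=str keeps every field as text; delimiter="," splits the four columns.
rows = np.genfromtxt(sample, delimiter=",", dtype=str)

print(rows.shape)   # (number of rows, number of columns)
print(rows[:, 0])   # column 0: STITCH compound id of each row
print(rows[:, 3])   # column 3: side effect name of each row
```

Column slicing such as `rows[:, 0]` returns one attribute for every row at once, which avoids writing an explicit loop over the rows.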

*Do not forget to add the shared `meddra_all_se.csv` file to your Drive and to mount your Drive in the notebook. Otherwise, you won't be able to read the file.*

**!!!You are not allowed to use `pandas` in this homework**

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from os.path import join

%matplotlib inline

In [None]:
from google.colab import drive
drive.mount('./drive', force_remount=True)

path_prefix = "./drive/My Drive"

Mounted at ./drive


## Q1: Descriptive Statistics of the Dataset

In this question, your task is to gather some descriptive information about the dataset. To do this, first read the provided dataset and store it in a two-dimensional NumPy array. Then, print the following descriptive statistics about the dataset.

*   Shape of the dataset, i.e. number of rows and columns
*   Number of unique drugs 
*   Number of unique side effects

In [44]:
# Read the CSV into a 2-D array of strings (one row per drug/side-effect entry).
with open(join(path_prefix, 'meddra_all_se.csv')) as csvfile:
  drugArray = np.genfromtxt(csvfile, delimiter=',', dtype=str)

print("Shape of the dataset: ", drugArray.shape, sep='')
(row, column) = drugArray.shape # shape is a tuple; row and column are plain integers.

# Column 0 holds the STITCH compound id, i.e. the drug id.
print('Number of unique drugs: ', len(np.unique(drugArray[:, 0])), sep='')

# Column 3 holds the side effect name.
print('Number of unique side effects: ', len(np.unique(drugArray[:, 3])), sep='')


Shape of the dataset: (91281, 4)
Number of unique drugs: 953
Number of unique side effects: 5064


## Q2: Side Effects of Drugs

### Part A 

As explained above, side effect entries in the dataset correspond to possible side effects of different drugs.

In this part, your goal is to find the drug which has the most side effects in the dataset. <!-- Print the `drug id` *(STITCH compound id)* of that drug and `number of its indications`. --> Print the `drug id` *(STITCH compound id)* of that drug and the number of possible side effects it has.

Print your results in the following format.

``` py
drug id: CID100002771
number of side effects: 766
```




In [50]:
# Count how many rows (side effect entries) each drug id has.
# Unlike a running counter, this also covers the last drug in the file
# and does not depend on rows of the same drug being adjacent.
drug_ids, counts = np.unique(drugArray[:, 0], return_counts=True)

# Index of the drug with the largest count (the first one in case of a tie).
most = np.argmax(counts)

print("drug id: ", drug_ids[most], sep='')
print("number of side effects: ", counts[most], sep='')

drug id: CID100002771
number of side effects: 766


### Part B 

Now, find the number of side effects for each drug and display the distribution as a histogram.

An example figure is shown below.

![](https://i.ibb.co/7zPSnkR/hist.jpg)
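
One possible way to approach this, shown on hypothetical toy data rather than the real dataset, is to count the rows per drug id with `np.unique(..., return_counts=True)` and pass the counts to `hist`. The array `toy_drug_ids`, the bin count, and the axis labels are assumptions made for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical toy data: one drug id per (drug, side effect) row.
toy_drug_ids = np.array(["D1", "D1", "D1", "D2", "D3", "D3"])

# return_counts=True gives the number of rows for each distinct id.
unique_ids, effects_per_drug = np.unique(toy_drug_ids, return_counts=True)

# One histogram sample per drug: how many side effects it has.
# In Colab, the inline backend renders the figure when the cell finishes.
fig, ax = plt.subplots()
ax.hist(effects_per_drug, bins=3)
ax.set_xlabel("number of side effects")
ax.set_ylabel("number of drugs")
```

Note that the histogram is built from one value per drug, not one value per row, so its samples are the per-drug counts.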



In [None]:
# your code

## Q3: The Most Frequent Side Effects

Here, you are going to analyze which side effects are most commonly caused by drugs.

Since we have a dataset of drugs and their potential side effects, one of the first questions that comes to mind is which side effects are the most common.

*   Plot a `bar chart` showing the 15 most frequent side effects in the dataset.

![](https://i.ibb.co/Y8nQr3p/barh.jpg)

*You can choose to make the bar chart with a different style, but the bars should look like the chart above.* 
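
As a rough sketch on hypothetical toy data, the most frequent values of a column can be found by sorting the counts returned by `np.unique` and then drawn with `barh`. The array `toy_effects`, the value `k = 2`, and the axis label are illustrative assumptions only.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical toy column of side effect names, one per row.
toy_effects = np.array(["Rash", "Nausea", "Rash", "Headache", "Nausea", "Rash"])

names, counts = np.unique(toy_effects, return_counts=True)

# Indices of the k largest counts, most frequent first.
k = 2
top = np.argsort(counts)[::-1][:k]

fig, ax = plt.subplots()
# barh draws the first item at the bottom, so reverse the order
# to place the most frequent side effect at the top.
ax.barh(names[top][::-1], counts[top][::-1])
ax.set_xlabel("frequency")
```

Sorting the counts instead of the names keeps each name aligned with its count, because both arrays are indexed with the same `top` indices.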


In [None]:
# your code

## Q4: Conditional Probability

In probability theory, conditional probability is a measure of the probability of an event occurring, given that another event has already occurred. The formula for conditional probability is given below.

```
P(B|A) = P(A and B) / P(A)
```
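
To make the formula concrete, here is a small worked example on hypothetical toy data, where each probability is estimated as a fraction of drugs. The dictionary `toy`, the drug ids, and the side effect labels `A`, `B`, and `C` are invented for illustration and do not come from the dataset.

```python
# Hypothetical toy data: each drug id maps to the set of its side effects.
toy = {
    "D1": {"A", "B"},
    "D2": {"A"},
    "D3": {"A", "B", "C"},
    "D4": {"B"},
}

n_drugs = len(toy)

# Drugs that have side effect A, and those among them that also have B.
has_a = {d for d, effects in toy.items() if "A" in effects}
has_a_and_b = {d for d in has_a if "B" in toy[d]}

p_a = len(has_a) / n_drugs              # P(A) = 3/4
p_a_and_b = len(has_a_and_b) / n_drugs  # P(A and B) = 2/4
p_b_given_a = p_a_and_b / p_a           # P(B|A) = 2/3

print(round(p_b_given_a, 3))  # 0.667
```

Because both probabilities share the same denominator, `P(B|A)` reduces to the number of drugs with both side effects divided by the number of drugs with `A`.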

As our dataset shows, drugs can have multiple side effects. With conditional probability, we can study the chance that a drug has a particular side effect given that it has another one.

So, the following question can be answered with our dataset. 

**If a drug has the `Headache` side effect, what is the probability that it also has the `Vomiting` side effect?**

Please calculate this conditional probability from the dataset and report the resulting probability score.

In [None]:
# your code