# Description
#### author Houcmeddine Othman, September 2019, [My Github page](https://github.com/hothman) 

The code in this notebook is part of the data-cleaning workflow for the manually curated data (the raw file is in xlsx format).
The tasks include: 
* Discarding contaminating data. 
* Normalizing some of the labels used to describe the classes. 

### Requirements:
* Python 3
* pandas, NumPy

---
## Credits
### Without the contribution of these people, I would have ended up with ugly, empty output files and a big ERROR message. 

#### Manual Curation (in alphabetical order) 
Ayoub Ksouri, Chaimae SAMTAL, Chiamaka Jessica Okeke, Fouzia Radouani, Haifa Jmal, Kais Ghedira, Lyndon Zass, Melek Chaouch, Olivier, Reem Sallam, Rym Kefi, Samah Ahmed, Samar kamel Kassem, Yosr Hamdi.

#### Data mining and wrangling
Jorge da Rocha and Lyndon Zass

## Cleansing of the curation document

In [194]:
import pandas as pd 
import numpy as np

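# Reading an Excel file requires an engine: xlrd at the time of writing ( conda install -c anaconda xlrd );
# recent pandas versions read .xlsx files with openpyxl instead
data = pd.read_excel("../data/curation_22Aug.xlsx")

# Strip the URL prefixes so only the identifiers remain, then rename columns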
data.replace(to_replace="https://www.pharmgkb.org/variantAnnotation/", value='', regex=True, inplace=True)
data.replace(to_replace="https://www.ncbi.nlm.nih.gov/pubmed/", value='', regex=True, inplace=True)
data.rename(columns={'link_to_variant': 'id_in_source', 'pubmed_link ': 'reference_id' ,
                     'Country of Participants':'Country_of_Participants'}, inplace=True)
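
With `regex=True`, pandas treats the pattern as a regular expression, so the dots in the URLs act as wildcards. The `'pubmed_link '` key also only matches if the raw header really has a trailing space. A minimal sketch of a stricter variant on a toy frame follows; the toy values and the `strip_prefix` helper are assumptions for illustration, not part of the notebook.

```python
import re
import pandas as pd

# Toy frame mimicking the raw column headers (values are made up for illustration).
toy = pd.DataFrame({
    "link_to_variant": ["https://www.pharmgkb.org/variantAnnotation/608431789"],
    "pubmed_link ": ["https://www.ncbi.nlm.nih.gov/pubmed/20072124"],
})

def strip_prefix(frame, prefix):
    """Hypothetical helper: remove a literal URL prefix from every string cell."""
    pattern = "^" + re.escape(prefix)  # escape the dots so they match literally
    return frame.replace(to_replace=pattern, value="", regex=True)

toy = strip_prefix(toy, "https://www.pharmgkb.org/variantAnnotation/")
toy = strip_prefix(toy, "https://www.ncbi.nlm.nih.gov/pubmed/")

# Strip stray whitespace from the headers before renaming.
toy.columns = toy.columns.str.strip()
toy = toy.rename(columns={"link_to_variant": "id_in_source",
                          "pubmed_link": "reference_id"})
print(toy)
```

Stripping the headers first means the rename no longer depends on the exact whitespace in the spreadsheet.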

Let us check the categories in the `PharmGKB Category` column.

In [195]:
data.groupby(["PharmGKB Category"]).count()

| PharmGKB Category | id_in_source | reference_id | P-value | Country_of_Participants | Volunteer 1 | Volunteer 2 | IF No p value - not BLANK use @ | Remarks |
|---|---|---|---|---|---|---|---|---|
| African American/Afro-Caribbean | 224 | 224 | 224 | 223 | 224 | 224 | 3 | 0 |
| East Asian | 2 | 2 | 2 | 2 | 2 | 0 | 0 | 0 |
| European | 7 | 7 | 7 | 7 | 7 | 4 | 1 | 0 |
| Mixed Population | 24 | 24 | 24 | 23 | 24 | 12 | 15 | 0 |
| Near Eastern | 54 | 54 | 54 | 54 | 54 | 54 | 0 | 0 |
| Sub-Saharan Africa | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 |
| Sub-Saharan African | 166 | 166 | 166 | 166 | 166 | 166 | 1 | 0 |
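
`groupby(...).count()` reports the non-null count of every other column, which is more than is needed to spot inconsistent labels. A minimal sketch using `value_counts` on made-up labels gives one number per category:

```python
import pandas as pd

# Toy column with made-up rows, including the two spellings seen in the output above.
toy = pd.DataFrame({"PharmGKB Category": [
    "Sub-Saharan African", "Sub-Saharan Africa", "Near Eastern", "Sub-Saharan African",
]})

# One entry per label with its number of occurrences (missing labels are counted too).
counts = toy["PharmGKB Category"].value_counts(dropna=False)
print(counts)
```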


Some changes to make: 
* The study with PMID 25393304 used Egyptian subjects as the control group. 
* The study with PMID 29580174 contains only Finnish subjects.
* Remove the East Asian category.
* Use a single notation for Sub-Saharan African.

Let's now check the values in the `Country_of_Participants` column.

In [196]:
data

| | id_in_source | reference_id | P-value | PharmGKB Category | Country_of_Participants | Volunteer 1 | Volunteer 2 | IF No p value - not BLANK use @ | Remarks |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 608431789 | 20072124 | < 0.001 | African American/Afro-Caribbean | USA | Jorge | Chaimae | IF multiple p values - ERROR | |
| 1 | 608431793 | 20072124 | < 0.001 | African American/Afro-Caribbean | USA | Samar/done | Kais | | |
| 2 | 608431781 | 20072124 | 0.023 | African American/Afro-Caribbean | USA | Samar/done | Kais | | |
| 3 | 608431785 | 20072124 | < 0.001 | African American/Afro-Caribbean | USA | Samar/done | Kais | | |
| 4 | 637879876 | 20200517 | 0.03 | African American/Afro-Caribbean | USA | Samar/done | Kais | | |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 474 | 1449270311 | 29580174 | ERROR | European | Finland | Kais | Olivier | | |
| 475 | 1449270340 | 29580174 | 0.016 | European | Finland | Kais | Olivier | | |
| 476 | 1450367980 | 30672385 | 0.016 | Sub-Saharan African | Egypt | Kais | Olivier | | |
| 477 | 1450377018 | 30767719 | 0.028 | Sub-Saharan African | NIgeria | Kais | Olivier | | |


In [197]:
data.replace("Sub-Saharan Africa", "Sub-Saharan African", inplace=True)


# Collect the variant ids of the study with PMID 25393304
var_to_remove = list(data[data["reference_id"] == "25393304"].id_in_source)

# The study from Finland (PMID 29580174) will be removed as well
var_to_remove.extend(data[data["reference_id"] == "29580174"].id_in_source)
# Note: var_to_remove is not used by the steps below; the rows are dropped by country instead

# Remove rows from Finland, Germany & Egypt (study PMID 25393304), Oman, and Israel
index_to_remove = data[data['Country_of_Participants'].isin(['Germany & Egypt', 'Oman', 'Israel', 'Finland'])].index
data.drop(index_to_remove, inplace = True)

# remove East Asians from the table 
index_to_remove = data[data['PharmGKB Category'] == 'East Asian'].index
data.drop(index_to_remove, inplace = True)

# Change the tag for mixed population (assign the result; in-place replace on a selected column is unreliable in recent pandas)
data["PharmGKB Category"] = data["PharmGKB Category"].replace({"Mixed Population":"Mixed Population containing african descendant groups"})

# Remove the European category
index_to_remove = data[data["PharmGKB Category"] == "European"].index
data.drop(index_to_remove, inplace = True)
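
The drops in the cell above can also be expressed as a single boolean mask, which makes the exclusion criteria easier to audit. The following is a minimal sketch on made-up rows; the `excluded_countries` and `excluded_categories` names and the toy countries are assumptions for illustration.

```python
import pandas as pd

# Made-up rows covering each case handled in the cell above.
toy = pd.DataFrame({
    "PharmGKB Category": ["Sub-Saharan Africa", "East Asian", "European",
                          "Mixed Population", "Near Eastern"],
    "Country_of_Participants": ["Nigeria", "Japan", "Finland", "Brazil", "Tunisia"],
})

excluded_countries = ["Germany & Egypt", "Oman", "Israel", "Finland"]
excluded_categories = ["East Asian", "European"]

# Rows to discard: an excluded country or an excluded category.
to_drop = (toy["Country_of_Participants"].isin(excluded_countries)
           | toy["PharmGKB Category"].isin(excluded_categories))
cleaned = toy[~to_drop].copy()

# Normalize the remaining labels by assignment rather than chained in-place calls.
cleaned["PharmGKB Category"] = cleaned["PharmGKB Category"].replace({
    "Sub-Saharan Africa": "Sub-Saharan African",
    "Mixed Population": "Mixed Population containing african descendant groups",
})
print(cleaned)
```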


### Correct entries from Near Eastern countries: 
* Assign "North African" to Tunisia, Egypt, Morocco

In [198]:
# Select the rows currently labelled "Near Eastern"
cp = data[data["PharmGKB Category"].str.contains("Near Eastern", na=False)] 
# Relabel them as "North African"
data["PharmGKB Category"] = data["PharmGKB Category"].replace({"Near Eastern":"North African"})
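
The cell above relabels every `Near Eastern` row, whereas the bullet names three countries. If the relabeling should depend on the country instead, one possible sketch uses a boolean mask with `.loc`. The toy rows are made up, and this is an alternative under that assumption, not what the notebook runs.

```python
import pandas as pd

# Made-up rows: two from listed countries, one from elsewhere.
toy = pd.DataFrame({
    "PharmGKB Category": ["Near Eastern", "Near Eastern", "Sub-Saharan African"],
    "Country_of_Participants": ["Tunisia", "Morocco", "Nigeria"],
})

north_african_countries = ["Tunisia", "Egypt", "Morocco"]

# Relabel only the rows whose country is in the list.
is_north_african = toy["Country_of_Participants"].isin(north_african_countries)
toy.loc[is_north_african, "PharmGKB Category"] = "North African"
print(toy)
```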

### Replacing "@" and "ERROR"

In [199]:
# "@" marks a missing p-value and "ERROR" marks multiple p-values
data["P-value"] = data["P-value"].replace({"@":"", "ERROR":"ambiguous"})
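
After this step the `P-value` column still mixes numbers, inequality strings such as `< 0.001`, empty strings, and the `ambiguous` tag. If a numeric column is needed downstream, a minimal sketch could look like the following. The `p_value_numeric` column name is hypothetical, and dropping the `<` sign discards the inequality.

```python
import pandas as pd

# Made-up values covering the kinds of entries left in the column.
toy = pd.DataFrame({"P-value": ["< 0.001", 0.023, "", "ambiguous"]})

# Remove the "<" sign and surrounding spaces, then coerce non-numeric entries to NaN.
as_text = toy["P-value"].astype(str).str.replace("<", "", regex=False).str.strip()
toy["p_value_numeric"] = pd.to_numeric(as_text, errors="coerce")
print(toy)
```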

### Rename columns and output to csv


In [200]:
data.rename(columns={'PharmGKB Category': 'region'}, inplace=True)
data.drop(columns=["Remarks"], inplace=True)

data.to_csv("../data/clean_curation.csv", index=False)
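
When the CSV is read back, `read_csv` by default converts empty fields to NaN and digit-only identifiers to integers. A minimal sketch of a round trip that keeps everything as text uses an in-memory buffer in place of the real file; the toy row is made up.

```python
import io
import pandas as pd

# Made-up row with an identifier, a region label, and an empty p-value.
toy = pd.DataFrame({"id_in_source": ["608431789"],
                    "region": ["North African"],
                    "P-value": [""]})

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)

# dtype=str keeps identifiers as text; keep_default_na=False keeps "" as "".
reloaded = pd.read_csv(buffer, dtype=str, keep_default_na=False)
print(reloaded)
```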