# Unique Phenotypes

You have been given a file of phenotypes that have been assigned to individual cardiology patients. The data was supposed to be entered as a table with two columns per patient: the first column holding a single, primary phenotype, and the second a comma-separated list of secondary phenotypes for that patient. Unfortunately, data entry was inconsistent and did not follow this prescription. Sometimes multiple phenotypes are provided in column one, and sometimes a delimiter other than a comma is used.

Your task is to provide an alphabetically sorted list of unique phenotypes encountered in the data. Missing values are denoted by an empty string and should **not** be treated as a phenotype.

As a side note, the researcher responsible for this data would also like to know what delimiters were used in data entry.
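One way to approach the delimiter question is to count the punctuation characters that appear in the entries and then judge by eye which ones act as separators. The following is only a sketch: `sample_rows` is a hypothetical hand-picked sample in the style of the entries in this notebook, not the real file, and the regular expression is an assumption about what counts as a candidate delimiter.

```python
import re
from collections import Counter

# Hypothetical sample rows in the style of the notebook's data (not the real file).
sample_rows = [
    "AS, BAV",
    "CHD & Cardiology",
    "CHD/Cardiology",
    "Cardiology, PA/IVS",
]

# Count every character that is not a word character (letter, digit, underscore)
# and not whitespace. Some hits are real delimiters; others, such as the slash
# in "PA/IVS", belong to a phenotype name and have to be judged by eye.
candidate_counts = Counter(
    ch for row in sample_rows for ch in re.findall(r"[^\w\s]", row)
)
print(candidate_counts.most_common())
```

On the real data, the same count could be run over the list of joined rows; the result is a list of candidates to review, not a final answer.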

I have used Pandas to read the data in and convert it to a list for you to start with.

In [1]:
import pandas as pd

In [13]:
data = pd.read_excel("./create_uniq_phenotypes.xlsx", header=None)
print(data.columns)
data.head(20)

Int64Index([0, 1], dtype='int64')


|   | 0 | 1 |
|---|---|---|
| 0 | AS, BAV | Cardiology |
| 1 | Capillary malformation | Lymph F#69-01 |
| 2 | Congenital Heart Defect | |
| 3 | Congenital Heart Defect | TAPVR |
| 4 | CHD | |
| 5 | CHD | |
| 6 | CHD | |
| 7 | CHD | |
| 8 | CHD | |
| 9 | CHD | |


In [14]:
data = [", ".join(d) for d in data.fillna("").values.tolist()]

In [15]:
data

['AS, BAV, Cardiology',
 'Capillary malformation, Lymph F#69-01',
 'Congenital Heart Defect, ',
 'Congenital Heart Defect, TAPVR',
 'CHD, ',
 'CHD, ',
 'CHD, ',
 'CHD, ',
 'CHD, ',
 'CHD, ',
 'CHD, ',
 'CHD, ',
 'CHD & Cardiology, AVC (balanced) subvalvar aortic fibrous ring',
 'CHD/Cardiology, ',
 'CHD, ',
 'CHD, ',
 'Congenital Heart Defect & Cardiology, D-TGA, VSD muscular',
 'Cardiology & Congenital Heart Defect, L-TGA, PS, LV-PA',
 'Congenital Heart Defect & Cardiology, Tricupsid Atresia, D-TGA, Hypoplastic p-ventricle & aorta w/VSD',
 'Cardiology, ',
 'Congenital Heart Disease & Cardiology, HLHS',
 'Cardiology, PA/IVS',
 'Cardiology, ',
 'Cardiology, ',
 'Cardiology, ',
 'Cardiology, ',
 'Cardiology, ',
 'Cardiology, ',
 'Cardiology, ',
 'Cardiology, ',
 'Cardiology, ',
 'Cardiology, ',
 'Cardiology, Kawisaki',
 'Cardiology, Hypoplastic left Heart Syndrome',
 'Cardiology, ',
 'Cardiology, ',
 'Cardiology, Tetraology of fallot/PulmonaryArtresea',
 'Cardiology, TOF',
 'Cardiology, 

In [32]:
# Goal: an alphabetically sorted list of unique phenotypes

# First, pull every phenotype out of the rows.
# data is a list of strings; split each one on commas.
phenotypes = []
for string in data:
    phenotypes.extend(string.split(","))  # add the pieces of this row to the list
# print(phenotypes)
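Because the entries do not always use commas, splitting on several delimiters at once may be worth trying. The sketch below assumes the delimiter set is comma, ampersand, and semicolon; `DELIMITERS` and `split_phenotypes` are hypothetical names introduced here for illustration. The slash is deliberately left out, since it separates phenotypes in 'CHD/Cardiology' but is part of the name in 'PA/IVS'.

```python
import re

# Assumed delimiter set: comma, ampersand, semicolon. Adjust after inspecting the data.
DELIMITERS = re.compile(r"[,&;]")

def split_phenotypes(row):
    """Split one row on any assumed delimiter and drop empty pieces."""
    pieces = (piece.strip() for piece in DELIMITERS.split(row))
    return [piece for piece in pieces if piece]

print(split_phenotypes("CHD & Cardiology, AVC (balanced) subvalvar aortic fibrous ring"))
# ['CHD', 'Cardiology', 'AVC (balanced) subvalvar aortic fibrous ring']
print(split_phenotypes("CHD, "))
# ['CHD']
```

Splitting on '&' can also break up a single phenotype that happens to contain one, so the result still needs a manual look.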

In [44]:
# Preprocess the data.
# strip() removes leading and trailing whitespace; upper() normalizes case so that
# entries differing only in case collapse to one item in the set and sort together.
# The condition after the for clause keeps an item only when it is truthy. It tests the
# stripped value because a whitespace-only piece such as ' ' is truthy but strips to ''.
# Alternatively, drop '' afterwards with the set's remove or discard method.
phenotypes = [p.strip().upper() for p in phenotypes if p.strip()]

In [45]:
# A set keeps only one copy of each value, which gives the unique phenotypes
phenotypes_set = set(phenotypes)

In [47]:
'' in phenotypes_set

False
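The membership check above guards against empty strings surviving the cleanup. A small sketch of why the filter condition matters, using a single made-up row rather than the real data:

```python
pieces = "CHD, ".split(",")  # ['CHD', ' ']

# A single space is truthy, so `if p` keeps it, and strip() then turns it into ''.
kept_raw = [p.strip().upper() for p in pieces if p]

# Testing the stripped value drops whitespace-only pieces as well.
kept_stripped = [p.strip().upper() for p in pieces if p.strip()]

print("" in set(kept_raw))       # True
print("" in set(kept_stripped))  # False
```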

In [55]:
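# plist is created from phenotypes_set in a later cell of this notebook (plist = list(phenotypes_set))
help(plist.sort)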

Help on built-in function sort:

sort(...) method of builtins.list instance
    L.sort(key=None, reverse=False) -> None -- stable sort *IN PLACE*



In [60]:
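# key takes a function that is called on each item; the list is sorted by the return values
plist.sort(key=len, reverse=True)  # longest phenotype first

# Anonymous functions are defined using the lambda keyword.
# This second sort orders by last character, descending. Because list.sort is stable,
# items that share a last character stay in the length order from the first sort.
plist.sort(key=lambda x: x[-1], reverse=True)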
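To see how two consecutive sorts interact, here is a small sketch on a made-up list (`words` is a hypothetical sample, not the full phenotype list). It relies on `list.sort` being stable, so the second sort only reorders items whose keys differ.

```python
words = ["BAV", "PROBABLE BAV", "WPW", "HETEROTAXY", "CARDIOLOGY"]

two_step = list(words)
two_step.sort(key=len, reverse=True)               # longest first
two_step.sort(key=lambda x: x[-1], reverse=True)   # then by last character, descending
print(two_step)
# ['HETEROTAXY', 'CARDIOLOGY', 'WPW', 'PROBABLE BAV', 'BAV']

# The same ordering in one pass, using a tuple key: last character first, then length.
one_step = sorted(words, key=lambda x: (x[-1], len(x)), reverse=True)
print(one_step == two_step)  # True
```

Within the group ending in 'V', the longer 'PROBABLE BAV' stays ahead of 'BAV' because the first sort already placed it there.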

In [61]:
plist

['DCM POSSIBLY 2TO DAUNORUBICIN THERAPY',
 'CONGENITAL HEART DISEASE & CARDIOLOGY',
 'CONGENITAL HEART DEFECT & CARDIOLOGY',
 'NONCOMPACTION LV CARDIOMYOPATHY',
 "AV CANAL WITH EBSTEIN'S ANOMALY",
 'DILATED CARDIOMYOPATHY',
 'INNOMINATE ARTERY',
 "EBSTEIN'S ANOMALY",
 'EBSTEINS ANOMALY',
 'CHD & CARDIOLOGY',
 'KD W/O CORONARY',
 'EBSTEIN ANAMDY',
 'CHD/CARDIOLOGY',
 'KD W/CORONARY',
 'HYPERTROPHY',
 'LATERALITY',
 'HETEROTAXY',
 'CARDIOLOGY',
 'CARDIOLGOY',
 'X',
 'HYPOPLASTIC L AW',
 'WPW',
 'VSD (PERIMEBRANOUS) LSVC BAV',
 'SEVERLY HYPOPLASTIC RV',
 'HYPOPLASTIC RV',
 'CLEFT LEFT AV',
 'PROBABLE BAV',
 'HYPLAS LV',
 'CLEFT MV',
 'L AV',
 'HRRV',
 'HPLV',
 'DORV',
 'DIRV',
 'DILV',
 'TGV',
 'BAV',
 'POSSIBLE VSD NOTED IN CLINIC NOTES BUT NOT ON ECHO REPORT',
 'CARDIOLOGY & CONGENITAL HEART DEFECT',
 'KD W/O CORONARY INVOLVEMENT',
 'COMPLETE AV SEPTAL DEFECT',
 "SHONE'S COMPLEX VARIANT",
 'PARTIAL AC CANAL DEFECT',
 'CONGENITAL HEART DEFECT',
 'PS (UNKNOWN TYPE) SVT',
 'TETROLOGY OF FA

In [62]:
# Convert the set back to a list so it can be sorted
plist = list(phenotypes_set)
# Sort alphabetically, in place
plist.sort()  # pass reverse=True to sort in descending order
# print(plist)
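As an alternative to converting and then sorting in place, the built-in `sorted` accepts any iterable, including a set, and returns a new list. A minimal sketch with a made-up set (`demo_set` is hypothetical):

```python
demo_set = {"TOF", "CHD", "BAV", "CARDIOLOGY"}

# sorted() builds a new sorted list and leaves the set untouched.
demo_list = sorted(demo_set)
print(demo_list)  # ['BAV', 'CARDIOLOGY', 'CHD', 'TOF']

# list.sort() sorts in place and returns None, so its result should not be assigned.
result = demo_list.sort(reverse=True)
print(result)     # None
print(demo_list)  # ['TOF', 'CHD', 'CARDIOLOGY', 'BAV']
```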

In [63]:
help(lambda)

SyntaxError: invalid syntax (<ipython-input-63-9b4a285515f7>, line 1)
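The call fails because `lambda` is a reserved keyword, not an object, so it cannot be passed as a bare name. Passing the keyword as a string asks the help system for its topic page instead. A short sketch (`last_char` is a hypothetical name used only for illustration):

```python
import keyword

# lambda is a reserved word, which is why help(lambda) is a syntax error.
print(keyword.iskeyword("lambda"))  # True

# Quoting the keyword makes help() look up the documentation topic.
help("lambda")

# A lambda is an ordinary function object once it is bound to a name.
last_char = lambda x: x[-1]
print(last_char("BAV"))  # V
```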