# PubChemPy for the Bioinformatics Club
This notebook is designed to introduce you to PubChemPy, a library for working with the [PubChem](https://www.example.com) resource. To use pubchempy, you'll need to either run the command

```pip install pubchempy```

on your command line or use the command

```!pip install pubchempy```

in the first coding cell in this notebook.

In [None]:
!pip install pubchempy

Once you have installed pubchempy on your computer, you'll need to import it to use it. The standard abbreviation for pubchempy is pcp.
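
If you are not sure whether the install step worked, a small check like the one below can tell you before you try the import. This is only a sketch: `is_installed` is a helper name made up for this notebook, not part of pubchempy, and it relies on the standard library's `importlib.util.find_spec`.

```python
import importlib.util


def is_installed(package_name):
    """Return True if `package_name` can be found in this Python environment."""
    # find_spec returns None when no top-level module of that name is importable.
    return importlib.util.find_spec(package_name) is not None


# Hypothetical usage: remind yourself to install before importing.
if not is_installed("pubchempy"):
    print("pubchempy is missing; run `pip install pubchempy` first.")
```

The check only looks the package up; it does not import it, so it is safe to run before the install cell.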

In [1]:
import pubchempy as pcp

Now let's play with it a bit. We're going to learn about the compound object that pubchempy creates, starting with NAD+, a compound I worked with every day in graduate school. In the next cell, use the 

```Compound.from_cid(compound#)```

command to pull NAD+ from PubChem.

In [2]:
molecule = pcp.Compound.from_cid(5287958)

Now we will explore the contents of the compound object, which can be read using the pattern

```molecule.trait```

where trait can be molecular_weight, molecular_formula, isomeric_smiles, xlogp, iupac_name, or synonyms. Try each one.
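
If you would rather look at several traits at once, one option is to loop over the trait names with Python's built-in `getattr`. The sketch below assumes nothing about pubchempy beyond attribute access: `describe` and `TRAITS` are names invented here, and the `SimpleNamespace` is a stand-in for a real compound so the cell runs without a network call. Its values are copied from this notebook's outputs.

```python
from types import SimpleNamespace

# Trait names taken from the list above (a subset, to keep the output short).
TRAITS = ["molecular_weight", "molecular_formula", "xlogp"]


def describe(compound, traits=TRAITS):
    """Collect the named attributes of `compound` into a dict.

    Attributes the object does not have come back as None.
    """
    return {trait: getattr(compound, trait, None) for trait in traits}


# Stand-in object, NOT a real pubchempy Compound; in the notebook you
# would pass `molecule` instead.
stand_in = SimpleNamespace(
    molecular_weight=663.4,
    molecular_formula="C21H27N7O14P2",
    xlogp=-6.2,
)

summary = describe(stand_in)
for trait, value in summary.items():
    print(f"{trait}: {value}")
```

With a real compound loaded, `describe(molecule)` should give the same kind of dictionary.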

In [3]:
print(molecule.molecular_weight)

663.4


In [4]:
print(molecule.molecular_formula)

C21H27N7O14P2
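
As a sanity check, the molecular weight can be approximated from the molecular formula by summing standard atomic weights. The sketch below is not how PubChem computes its value. `formula_weight` and `ATOMIC_WEIGHTS` are names made up for this example, the weights are rounded, and the parser only handles simple formulas with no parentheses or charges.

```python
import re

# Rounded standard atomic weights in g/mol; only the elements needed here.
ATOMIC_WEIGHTS = {"C": 12.011, "H": 1.008, "N": 14.007, "O": 15.999, "P": 30.974}


def formula_weight(formula):
    """Sum atomic weights for a simple formula such as 'C21H27N7O14P2'."""
    total = 0.0
    # Each match is an element symbol followed by an optional count.
    for symbol, count in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        total += ATOMIC_WEIGHTS[symbol] * (int(count) if count else 1)
    return total


nad_weight = formula_weight("C21H27N7O14P2")
print(round(nad_weight, 1))  # prints 663.4
```

The result agrees with the `molecular_weight` printed earlier to one decimal place.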


In [5]:
print(molecule.isomeric_smiles)

C1=C(C=[NH+]C=C1C(=O)N)[C@H]2[C@@H]([C@@H]([C@H](O2)COP(=O)([O-])OP(=O)(O)OC[C@@H]3[C@H]([C@H]([C@@H](O3)N4C=NC5=C(N=CN=C54)N)O)O)O)O


In [6]:
print(molecule.xlogp)

-6.2


In [7]:
print(molecule.iupac_name)

[[(2R,3S,4R,5R)-5-(6-aminopurin-9-yl)-3,4-dihydroxyoxolan-2-yl]methoxy-hydroxyphosphoryl] [(2R,3S,4R,5S)-5-(5-carbamoylpyridin-1-ium-3-yl)-3,4-dihydroxyoxolan-2-yl]methyl phosphate


In [8]:
print(molecule.synonyms)

['5-BETA-D-RIBOFURANOSYLNICOTINAMIDE ADENINE DINUCLEOTIDE', 'DB03020']


What if you don't know the PubChem CID for your compound of interest? pubchempy has a get_compounds function that addresses this.

In [9]:
results = pcp.get_compounds('C21H27N7O14P2', 'formula')
print(results)

[Compound(5892), Compound(5288979), Compound(21604869), Compound(444170), Compound(10897651), Compound(5289104), Compound(444215), Compound(127255362), Compound(16219771), Compound(925), Compound(24916815), Compound(72200610), Compound(9874504), Compound(25162925), Compound(111288), Compound(90663709), Compound(163190097), Compound(6604186), Compound(12358825), Compound(4231851), Compound(4349538), Compound(5287958), Compound(196623), Compound(134720244), Compound(46936879), Compound(146167235), Compound(16758169), Compound(5315996), Compound(134559577), Compound(90657086), Compound(138105875), Compound(146019234), Compound(154701110), Compound(154701119), Compound(45109817), Compound(129630323), Compound(86289063), Compound(6419894), Compound(45105095), Compound(169424640), Compound(3283972), Compound(23644209), Compound(44297758), Compound(44297952), Compound(46936557), Compound(46936558), Compound(46936878), Compound(59148228), Compound(59148240), Compound(71751003), Compound(891309

In [10]:
pcp.get_compounds('tylenol', 'name', record_type='3d')

[Compound(1983)]

In [11]:
tylenol = pcp.Compound.from_cid(1983)
print(tylenol.iupac_name)
print(tylenol.molecular_weight)
print(tylenol.molecular_formula)

N-(4-hydroxyphenyl)acetamide
151.16
C8H9NO2


In [12]:
pcp.get_compounds('benzene', 'name')

[Compound(241)]

In [13]:
benzene = pcp.Compound.from_cid(241)
print(benzene.isomeric_smiles)

C1=CC=CC=C1


In [18]:
pcp.get_compounds('cocaine', 'name')

[Compound(446220)]

In [20]:
cids = pcp.get_cids('2-nonenal', 'name', 'substance', list_return='flat')

In [21]:
[pcp.Compound.from_cid(cid) for cid in cids]

[Compound(17166), Compound(5283335), Compound(5354833)]

In [26]:
results = pcp.get_substances('cocaine', 'name')

In [27]:
print(results)

[Substance(4603), Substance(822487), Substance(830115), Substance(841206), Substance(7886706), Substance(7978982), Substance(10317514), Substance(14717642), Substance(46506326), Substance(49846233), Substance(53786861), Substance(85756453), Substance(123092435), Substance(126621786), Substance(134971337), Substance(135652677), Substance(160779671), Substance(162220988), Substance(175266196), Substance(175443665), Substance(241088610), Substance(250229866), Substance(250231451), Substance(274117514), Substance(312228488), Substance(315677870), Substance(318160701), Substance(319223586), Substance(340516292), Substance(341138794), Substance(347950541), Substance(348283521), Substance(349083927), Substance(349979172), Substance(363592834), Substance(374912610), Substance(381330040), Substance(385640345), Substance(386264694), Substance(387185910), Substance(433778754), Substance(433788092), Substance(441072251), Substance(481107971), Substance(482183260), Substance(482322432), Substance(4

In [32]:
substance = pcp.Substance.from_sid(822487)
print(substance.synonyms)
print(substance.source_id)
print(substance.standardized_cid)
print(substance.standardized_compound)

['COCAINE', 'COC']
16807.5
446220
Compound(446220)


In [33]:
p = pcp.get_properties('IsomericSMILES', 'CC', 'smiles', searchtype='superstructure')

In [34]:
print(p)

[{'CID': 297, 'IsomericSMILES': 'C'}, {'CID': 783, 'IsomericSMILES': '[HH]'}, {'CID': 6324, 'IsomericSMILES': 'CC'}, {'CID': 5462310, 'IsomericSMILES': '[C]'}, {'CID': 24523, 'IsomericSMILES': '[2H][2H]'}, {'CID': 24824, 'IsomericSMILES': '[3H][3H]'}, {'CID': 1038, 'IsomericSMILES': '[H+]'}, {'CID': 123070, 'IsomericSMILES': '[2H]C([2H])([2H])[2H]'}, {'CID': 137127, 'IsomericSMILES': '[2H]C([2H])([2H])C([2H])([2H])[2H]'}, {'CID': 167583, 'IsomericSMILES': '[2HH]'}, {'CID': 26873, 'IsomericSMILES': '[14CH4]'}, {'CID': 5362549, 'IsomericSMILES': '[H]'}, {'CID': 114789, 'IsomericSMILES': '[11CH4]'}, {'CID': 160315322, 'IsomericSMILES': '[HH].[HH].[HH].[HH].[HH].[HH].[HH]'}, {'CID': 166653, 'IsomericSMILES': '[H-]'}, {'CID': 12669, 'IsomericSMILES': '[2H]C'}, {'CID': 138525, 'IsomericSMILES': '[2H]CC'}, {'CID': 10866248, 'IsomericSMILES': '[2H]C[2H]'}, {'CID': 10953798, 'IsomericSMILES': '[2H]C([2H])[2H]'}, {'CID': 12053198, 'IsomericSMILES': '[2H]C([2H])C([2H])([2H])[2H]'}, {'CID': 152445
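
The output above is a list of dictionaries, one per compound. For lookups it is often handier to index that list by CID. The sketch below uses a hand-copied sample of the first few records rather than a live query, and `index_by_cid` is a helper name invented for this notebook.

```python
# First entries of the list printed above, copied by hand.
records = [
    {"CID": 297, "IsomericSMILES": "C"},
    {"CID": 783, "IsomericSMILES": "[HH]"},
    {"CID": 6324, "IsomericSMILES": "CC"},
]


def index_by_cid(records, prop):
    """Map each CID to the value of `prop`, skipping records that lack it."""
    return {record["CID"]: record[prop] for record in records if prop in record}


smiles_by_cid = index_by_cid(records, "IsomericSMILES")
print(smiles_by_cid[6324])  # prints CC
```

In the notebook, `index_by_cid(p, 'IsomericSMILES')` would apply the same idea to the full result.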

In [36]:
pcp.get_synonyms('THC', 'name', 'substance')

[{'SID': 9186,
  'Synonym': ['Dronabinol',
   'TETRAHYDROCANNABINOL',
   'delta9-Tetrahydrocannabinol',
   '1972-08-3',
   'THC',
   'C06972']},
 {'SID': 841255,
  'Synonym': ['Dronabinol',
   'Marinol',
   'TETRAHYDROCANNABINOL',
   'Deltanyne',
   'delta9-THC',
   'delta9-Tetrahydrocannabinol',
   'Abbott 40566',
   'delta1-THC',
   'delta(sup 1)-Thc',
   'delta(sup 9)-Thc',
   'delta1-Tetrahydrocannabinol',
   'QCD 84924',
   'SP 104',
   'Cannabinol, delta1-tetrahydro-',
   'CCRIS 4726',
   '(-)-trans-Delta9-THC',
   'delta(sup 1)-Tetrahydrocannabinol',
   'delta(sup 9)-Tetrahydrocannabinol',
   '(-)-delta9-trans-Tetrahydrocannabinol',
   'L-delta1-trans-Tetrahydrocannabinol',
   '(-)-delta9-Tetrahydrocannabinol',
   '1972-08-3',
   '(l)-delta1-Tetrahydrocannabinol',
   '(-)-delta1-Tetrahydrocannabinol',
   'delta9-trans-Tetrahydrocannabinol',
   'THC',
   'trans-delta9-Tetrahydrocannabinol',
   'L-trans-delta9-Tetrahydrocannabinol',
   'Cannabinol, tetrahydro- (6CI)',
   'DRG-0138

In [37]:
pcp.get_sids(446220)

[{'CID': 446220,
  'SID': [4603,
   822487,
   830115,
   841206,
   7886706,
   7978982,
   10299853,
   10317514,
   14717642,
   14849677,
   16061210,
   23856122,
   24893116,
   36511843,
   46392192,
   46392962,
   46506326,
   49846233,
   53786861,
   57404730,
   78299047,
   85756453,
   104636021,
   123092435,
   126621786,
   129937523,
   134221937,
   134971337,
   135652677,
   152104973,
   160779671,
   162220988,
   175266196,
   175443665,
   179252632,
   226411107,
   241088610,
   250229866,
   250231451,
   257083036,
   274117514,
   274130708,
   275593489,
   310130212,
   312228488,
   315677870,
   318160701,
   319223586,
   329774761,
   340514937,
   340514938,
   340516292,
   341138794,
   342583730,
   347950541,
   348283521,
   349083927,
   349979172,
   363592834,
   374912610,
   381330040,
   385464889,
   385640345,
   385980703,
   386264694,
   386634583,
   387185910,
   403383507,
   404760972,
   419507876,
   433778754,
   433788092,
  

In [38]:
pcp.get_aids(446220)

[{'CID': 446220,
  'AID': [6428,
   7547,
   7738,
   7783,
   7931,
   7932,
   7933,
   7934,
   7935,
   7936,
   7937,
   7938,
   7939,
   7940,
   7941,
   7942,
   7943,
   7944,
   7945,
   7946,
   8002,
   15715,
   16309,
   19262,
   19424,
   19563,
   23920,
   26362,
   28233,
   28235,
   28236,
   29337,
   29423,
   29925,
   36995,
   42235,
   42236,
   42844,
   52193,
   52195,
   52197,
   52198,
   52199,
   52200,
   52202,
   52203,
   52204,
   61818,
   61819,
   62290,
   63021,
   63033,
   64205,
   64214,
   64341,
   64342,
   64356,
   64361,
   64362,
   64364,
   64367,
   64377,
   64503,
   64506,
   64508,
   64511,
   64513,
   64526,
   64530,
   64539,
   64541,
   64542,
   64677,
   64678,
   64679,
   64682,
   64683,
   64687,
   64690,
   64692,
   64695,
   64697,
   64698,
   64701,
   64703,
   64704,
   64711,
   64833,
   64841,
   64842,
   64843,
   64846,
   64847,
   64849,
   64852,
   64853,
   64857,
   64860,
   64862,
   6486
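
Both `get_sids` and `get_aids` return a list of dictionaries here, with the identifiers nested under an `'SID'` or `'AID'` key. The earlier `get_cids` cell avoided this nesting by passing `list_return='flat'`. If you already have the nested form, a small helper can flatten it. The sketch below uses short hand-copied samples of the outputs above, and `flatten_ids` is a name invented for this example.

```python
def flatten_ids(records, key):
    """Collect every identifier stored under `key` across all records."""
    ids = []
    for record in records:
        # Records without the key contribute nothing.
        ids.extend(record.get(key, []))
    return ids


# Shortened samples of the two outputs shown above.
sid_records = [{"CID": 446220, "SID": [4603, 822487, 830115]}]
aid_records = [{"CID": 446220, "AID": [6428, 7547, 7738]}]

print(flatten_ids(sid_records, "SID"))  # prints [4603, 822487, 830115]
print(flatten_ids(aid_records, "AID"))  # prints [6428, 7547, 7738]
```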

In [42]:
df1 = pcp.get_compounds('C20H41Br', 'formula', as_dataframe=True)
df2 = pcp.get_substances([9,99,999,9999], as_dataframe=True)
df3 = pcp.get_properties(['isomeric_smiles', 'xlogp', 'rotatable_bond_count'], 'C20H41Br', 'formula', as_dataframe=True)

In [40]:
df3.head()

| CID | IsomericSMILES | XLogP | RotatableBondCount |
|---|---|---|---|
| 20271 | `CCCCCCCCCCCCCCCCCCCCBr` | 11.4 | 18 |
| 23148745 | `CCCCCCCCCCC(CCCCCCCC)CBr` | 10.6 | 17 |
| 10808570 | `CC(C)CCCC(C)CCCC(C)CCCC(C)CCBr` | 9.9 | 14 |
| 14350910 | `C[C@H](CCC[C@H](C)CCCC(C)C)CCC[C@@H](C)CCBr` | 9.9 | 14 |
| 15340915 | `CCCCC(C)CCCC(C)CCCCCCCCCBr` | 10.8 | 16 |


In [43]:
df2.head()

| sid | source_id | source_name | standardized_cid | synonyms |
|---|---|---|---|---|
| 9 | MOLI000011 | MOLI | | `[MOLI000011]` |
| 99 | MOLI000104 | MOLI | 449713.0 | `[MOLI000104]` |
| 999 | MOLI001049 | MOLI | 450537.0 | `[MOLI001049]` |
| 9999 | C07797 | KEGG | | |


In [44]:
df1.head()

| cid | atom_stereo_count | atoms | bond_stereo_count | bonds | cactvs_fingerprint | canonical_smiles | charge | complexity | conformer_id_3d | conformer_rmsd_3d | ... | pharmacophore_features_3d | record | rotatable_bond_count | shape_fingerprint_3d | shape_selfoverlap_3d | tpsa | undefined_atom_stereo_count | undefined_bond_stereo_count | volume_3d | xlogp |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 20271 | 0 | `[{'aid': 1, 'number': 35, 'element': 'Br', 'y'...` | 0 | `[{'aid1': 1, 'aid2': 20, 'order': 1}, {'aid1':...` | 1111000001111000000000000000000000000000000100... | `CCCCCCCCCCCCCCCCCCCCBr` | 0 | 167 | | | ... | | `{'id': {'id': {'cid': 20271}}, 'atoms': {'aid'...` | 18 | | | 0 | 0 | 0 | | 11.4 |
| 23148745 | 1 | `[{'aid': 1, 'number': 35, 'element': 'Br', 'y'...` | 0 | `[{'aid1': 1, 'aid2': 10, 'order': 1}, {'aid1':...` | 1111000001111000000000000000000000000000000100... | `CCCCCCCCCCC(CCCCCCCC)CBr` | 0 | 179 | | | ... | | `{'id': {'id': {'cid': 23148745}}, 'atoms': {'a...` | 17 | | | 0 | 1 | 0 | | 10.6 |
| 10808570 | 3 | `[{'aid': 1, 'number': 35, 'element': 'Br', 'y'...` | 0 | `[{'aid1': 1, 'aid2': 21, 'order': 1}, {'aid1':...` | 1111000001111000000000000000000000000000000100... | `CC(C)CCCC(C)CCCC(C)CCCC(C)CCBr` | 0 | 212 | | | ... | | `{'id': {'id': {'cid': 10808570}}, 'atoms': {'a...` | 14 | | | 0 | 3 | 0 | | 9.9 |
| 14350910 | 3 | `[{'aid': 1, 'number': 35, 'element': 'Br', 'y'...` | 0 | `[{'aid1': 1, 'aid2': 21, 'order': 1}, {'aid1':...` | 1111000001111000000000000000000000000000000100... | `CC(C)CCCC(C)CCCC(C)CCCC(C)CCBr` | 0 | 212 | | | ... | | `{'id': {'id': {'cid': 14350910}}, 'atoms': {'a...` | 14 | | | 0 | 0 | 0 | | 9.9 |
| 15340915 | 2 | `[{'aid': 1, 'number': 35, 'element': 'Br', 'y'...` | 0 | `[{'aid1': 1, 'aid2': 21, 'order': 1}, {'aid1':...` | 1111000001111000000000000000000000000000000100... | `CCCCC(C)CCCC(C)CCCCCCCCCBr` | 0 | 190 | | | ... | | `{'id': {'id': {'cid': 15340915}}, 'atoms': {'a...` | 16 | | | 0 | 2 | 0 | | 10.8 |


In [49]:
df4 = pcp.get_compounds('aspirin', 'name', as_dataframe=True)

In [50]:
df4

| cid | atom_stereo_count | atoms | bond_stereo_count | bonds | cactvs_fingerprint | canonical_smiles | charge | complexity | conformer_id_3d | conformer_rmsd_3d | ... | pharmacophore_features_3d | record | rotatable_bond_count | shape_fingerprint_3d | shape_selfoverlap_3d | tpsa | undefined_atom_stereo_count | undefined_bond_stereo_count | volume_3d | xlogp |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2244 | 0 | `[{'aid': 1, 'number': 8, 'element': 'O', 'y': ...` | 0 | `[{'aid1': 1, 'aid2': 5, 'order': 1}, {'aid1': ...` | 1100000001110000001110000000000000000000000000... | `CC(=O)OC1=CC=CC=C1C(=O)O` | 0 | 212 | | | ... | | `{'id': {'id': {'cid': 2244}}, 'atoms': {'aid':...` | 3 | | | 63.6 | 0 | 0 | | 1.2 |


In [52]:
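# get_sids and Substance live in the pubchempy namespace, imported above as pcp.
# list_return='flat' asks for a plain list of SIDs instead of a list of dicts.
sids = pcp.get_sids('Aspirin', 'name', list_return='flat')
for sid in sids:
    s = pcp.Substance.from_sid(sid)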