##### **Why revamp this post?**

Three reasons for this: 

1. The ChEMBL data downloaded straight from the ChEMBL database website is far too large to upload to GitHub. This is one of my very early posts, written when ChEMBL was completely new to me, so I downloaded the data without thinking too much about it. There are clearly better and more reproducible ways to source ChEMBL data, e.g. the ones used in my more recent posts or others described in the literature.

    Note: GitHub blocks files larger than 100 MiB. The unit here is the mebibyte, where 1 MiB is 1,048,576 bytes or about 1.04858 MB ([reference](https://www.ibm.com/docs/en/storage-insights?topic=overview-units-measurement-storage-data)), so the limit is about 104.86 MB. My bad before, as I'd read "MiB" as "MB" in this [GitHub doc](https://docs.github.com/en/repositories/working-with-files/managing-large-files/about-large-files-on-github)!

2. Polars seems to be a bit more integrated with scikit-learn now, so I'm wondering whether the Polars dataframe library can be used with scikit-learn on its own (i.e. without using Pandas at all).

3. This post is one of my earlier, less mature posts (very embarrassing when I look at it now...), so I want to improve it at least a little.
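For the GitHub size limit mentioned in the first reason above, the unit arithmetic can be checked in a few lines. This is only a small sketch of the conversion; the variable names are my own.

```python
# GitHub states its per-file limit in mebibytes (binary units), not megabytes.
MIB = 1024 ** 2  # 1 MiB = 1,048,576 bytes
MB = 1000 ** 2   # 1 MB  = 1,000,000 bytes

limit_bytes = 100 * MIB
limit_mb = limit_bytes / MB

print(f"100 MiB = {limit_bytes:,} bytes = {limit_mb:.2f} MB")
# prints: 100 MiB = 104,857,600 bytes = 104.86 MB
```

So a file needs to stay under roughly 104 MB (in decimal megabytes) to be accepted.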

<br>

##### **Previous post updates**

*Update on 19th April 2024 - Polars has been more integrated with scikit-learn since scikit-learn version 1.4 (released in January 2024), see this link re. [Polars output in set_output](https://scikit-learn.org/stable/auto_examples/release_highlights/plot_release_highlights_1_4_0.html#polars-output-in-set-output) for Polars dataframe outputs in scikit-learn, and also a few other Polars enhancements in the [version 1.4 changelog](https://scikit-learn.org/dev/whats_new/v1.4.html#).*

*Update on 16th August 2023 - some code updates only, please always refer to [Polars API reference](https://docs.pola.rs/py-polars/html/reference/index.html) for most up-to-date code.*

<br>

##### **Background**

This is the first part of a series of posts on building a logistic regression model using scikit-learn with the [Polars dataframe library](https://docs.pola.rs/) (note: the older version of this post also used Pandas). Polars is a fast (often described as "blazingly fast") dataframe library written in Rust, with a light Python binding, so it can be used from either Python or Rust. I'll be using Python throughout all posts in the series.

This post focuses only on getting the small molecule data ready. The data comes from the ChEMBL database via a straight website download, which is not recommended if you're doing research or virtual experiments that need a good level of data reproducibility (e.g. you'll need to record the data version); this is only a demonstration though, so I'll leave it as it is. The comma separated value (.csv) file is then converted into a parquet file (for better file compression) so that the data can be uploaded to GitHub.

<br>

##### **Install and import Polars**

In [None]:
## To install Polars dataframe library (or install in virtual environments)
#%pip install polars
## Update Polars version
#%pip install --upgrade polars

import polars as pl
pl.show_versions()

--------Version info---------
Polars:              1.6.0
Index type:          UInt32
Platform:            macOS-12.7.6-x86_64-i386-64bit
Python:              3.11.0 (v3.11.0:deaf509e8f, Oct 24 2022, 14:43:23) [Clang 13.0.0 (clang-1300.0.29.30)]

----Optional dependencies----
adbc_driver_manager  <not installed>
altair               5.4.1
cloudpickle          <not installed>
connectorx           <not installed>
deltalake            <not installed>
fastexcel            <not installed>
fsspec               <not installed>
gevent               <not installed>
great_tables         <not installed>
matplotlib           3.9.2
nest_asyncio         1.6.0
numpy                2.1.0
openpyxl             <not installed>
pandas               <not installed>
pyarrow              17.0.0
pydantic             <not installed>
pyiceberg            <not installed>
sqlalchemy           <not installed>
torch                <not installed>
xlsx2csv             <not installed>
xlsxwriter           <not install

<br>

##### **Download dataset**

The file used here is equivalent to a straight download from the home page of the ChEMBL database, obtained by clicking on "Distinct compounds" (see the circled area in the image below). The data can be downloaded in .csv, .tsv or .sdf format (the options are located at the top right of the page).

![Image adapted from ChEMBL database website at version 31](ChEMBL_cpds.jpg){fig-align="center"}

I'm reading the .csv file first to have an overall look at the data.

In [None]:
df = pl.read_csv("chembl_mols.csv")
df.head()

"ChEMBL ID"";""Name"";""Synonyms"";""Type"";""Max Phase"";""Molecular Weight"";""Targets"";""Bioactivities"";""AlogP"";""Polar Surface Area"";""HBA"";""HBD"";""#RO5 Violations"";""#Rotatable Bonds"";""Passes Ro3"";""QED Weighted"";""CX Acidic pKa"";""CX Basic pKa"";""CX LogP"";""CX LogD"";""Aromatic Rings"";""Structure Type"";""Inorganic Flag"";""Heavy Atoms"";""HBA (Lipinski)"";""HBD (Lipinski)"";""#RO5 Violations (Lipinski)"";""Molecular Weight (Monoisotopic)"";""Molecular Species"";""Molecular Formula"";""Smiles"";""Inchi Key"
str
"""CHEMBL1206185;"";"";Small molecu…"
"""CHEMBL539070;"";"";Small molecul…"
"""CHEMBL3335528;"";"";Small molecu…"
"""CHEMBL2419030;"";"";Small molecu…"
"""CHEMBL4301448;"";"";Small molecu…"


<br>

##### **Some data wrangling and converting a csv file into a parquet file**

Fields in a .csv file are separated by a delimiter, e.g. a comma, semicolon or tab. This file uses semicolons, so to read it properly we can pass the `separator` argument to `pl.read_csv()`, which splits each row into proper columns.

The other addition below deals with null values early, by reading both "None" and "" (empty) values in the file as null. This saves some hassle later on: I ran into this problem when trying to convert column data types, and found this may be the best way to resolve it.

In [None]:
df = pl.read_csv("chembl_mols.csv", separator = ";", null_values = ["None", ""])
df.head()
#df

ChEMBL ID,Name,Synonyms,Type,Max Phase,Molecular Weight,Targets,Bioactivities,AlogP,Polar Surface Area,HBA,HBD,#RO5 Violations,#Rotatable Bonds,Passes Ro3,QED Weighted,CX Acidic pKa,CX Basic pKa,CX LogP,CX LogD,Aromatic Rings,Structure Type,Inorganic Flag,Heavy Atoms,HBA (Lipinski),HBD (Lipinski),#RO5 Violations (Lipinski),Molecular Weight (Monoisotopic),Molecular Species,Molecular Formula,Smiles,Inchi Key
str,str,str,str,i64,f64,i64,i64,f64,f64,i64,i64,i64,i64,str,f64,f64,f64,f64,f64,i64,str,i64,i64,i64,i64,i64,f64,str,str,str,str
"""CHEMBL1206185""",,,"""Small molecule""",0,607.88,,,9.46,89.62,5,2,2,17,"""N""",0.09,-1.91,8.38,9.4,9.36,3,"""MOL""",-1,42,5,3,2,607.279,"""ACID""","""C35H45NO4S2""","""CCCCCCCCCCC#CC(N)c1ccccc1-c1cc…","""UFBLKYIDZFRLPR-UHFFFAOYSA-N"""
"""CHEMBL539070""",,,"""Small molecule""",0,286.79,1.0,1.0,2.28,73.06,6,2,0,5,"""N""",0.63,13.84,3.64,2.57,2.57,2,"""MOL""",-1,17,5,3,0,250.0888,"""NEUTRAL""","""C11H15ClN4OS""","""CCCOc1ccccc1-c1nnc(NN)s1.Cl""","""WPEWNRKLKLNLSO-UHFFFAOYSA-N"""
"""CHEMBL3335528""",,,"""Small molecule""",0,842.8,2.0,6.0,0.18,269.57,18,5,2,17,"""N""",0.09,3.2,,3.31,-0.14,3,"""MOL""",-1,60,19,5,2,842.2633,"""ACID""","""C41H46O19""","""COC(=O)[C@H](O[C@@H]1O[C@@H](C…","""KGUJQZWYZPYYRZ-LWEWUKDVSA-N"""
"""CHEMBL2419030""",,,"""Small molecule""",0,359.33,4.0,4.0,3.94,85.13,6,1,0,3,"""N""",0.66,,,3.66,3.66,2,"""MOL""",-1,24,6,1,0,359.0551,"""NEUTRAL""","""C14H12F3N3O3S""","""O=c1nc(NC2CCCC2)sc2c([N+](=O)[…","""QGDMYSDFCXOKML-UHFFFAOYSA-N"""
"""CHEMBL4301448""",,,"""Small molecule""",0,465.55,,,5.09,105.28,6,4,1,10,"""N""",0.15,,12.14,4.41,2.0,4,"""MOL""",-1,33,7,5,1,465.1635,"""BASE""","""C24H24FN5O2S""","""N=C(N)NCCCOc1ccc(CNc2nc3ccc(Oc…","""RXTJPHLPHOZLFS-UHFFFAOYSA-N"""


Below is a series of data checks and cleaning steps that'll reduce the original .csv file size (about 664.8 MB) to something more manageable. My goal is to get a parquet file under 104 MB, which can then be uploaded to GitHub without using Git Large File Storage (this will be the last resort if this fails).

I'm checking the "Type" column first.

In [None]:
df.group_by("Type").len()

Type,len
str,u32
"""Antibody""",974
"""Protein""",22682
"""Unclassified""",4
"""Small molecule""",1920366
"""Gene""",77
…,…
"""Oligosaccharide""",92
"""Enzyme""",118
,369155
"""Cell""",47


The dataframe is further reduced in size by filtering the data for small molecules only, which are what I aim to look at.

In [None]:
df_sm = df.filter((pl.col("Type") == "Small molecule"))
df_sm #1,920,366 entries

ChEMBL ID,Name,Synonyms,Type,Max Phase,Molecular Weight,Targets,Bioactivities,AlogP,Polar Surface Area,HBA,HBD,#RO5 Violations,#Rotatable Bonds,Passes Ro3,QED Weighted,CX Acidic pKa,CX Basic pKa,CX LogP,CX LogD,Aromatic Rings,Structure Type,Inorganic Flag,Heavy Atoms,HBA (Lipinski),HBD (Lipinski),#RO5 Violations (Lipinski),Molecular Weight (Monoisotopic),Molecular Species,Molecular Formula,Smiles,Inchi Key
str,str,str,str,i64,f64,i64,i64,f64,f64,i64,i64,i64,i64,str,f64,f64,f64,f64,f64,i64,str,i64,i64,i64,i64,i64,f64,str,str,str,str
"""CHEMBL1206185""",,,"""Small molecule""",0,607.88,,,9.46,89.62,5,2,2,17,"""N""",0.09,-1.91,8.38,9.4,9.36,3,"""MOL""",-1,42,5,3,2,607.279,"""ACID""","""C35H45NO4S2""","""CCCCCCCCCCC#CC(N)c1ccccc1-c1cc…","""UFBLKYIDZFRLPR-UHFFFAOYSA-N"""
"""CHEMBL539070""",,,"""Small molecule""",0,286.79,1,1,2.28,73.06,6,2,0,5,"""N""",0.63,13.84,3.64,2.57,2.57,2,"""MOL""",-1,17,5,3,0,250.0888,"""NEUTRAL""","""C11H15ClN4OS""","""CCCOc1ccccc1-c1nnc(NN)s1.Cl""","""WPEWNRKLKLNLSO-UHFFFAOYSA-N"""
"""CHEMBL3335528""",,,"""Small molecule""",0,842.8,2,6,0.18,269.57,18,5,2,17,"""N""",0.09,3.2,,3.31,-0.14,3,"""MOL""",-1,60,19,5,2,842.2633,"""ACID""","""C41H46O19""","""COC(=O)[C@H](O[C@@H]1O[C@@H](C…","""KGUJQZWYZPYYRZ-LWEWUKDVSA-N"""
"""CHEMBL2419030""",,,"""Small molecule""",0,359.33,4,4,3.94,85.13,6,1,0,3,"""N""",0.66,,,3.66,3.66,2,"""MOL""",-1,24,6,1,0,359.0551,"""NEUTRAL""","""C14H12F3N3O3S""","""O=c1nc(NC2CCCC2)sc2c([N+](=O)[…","""QGDMYSDFCXOKML-UHFFFAOYSA-N"""
"""CHEMBL4301448""",,,"""Small molecule""",0,465.55,,,5.09,105.28,6,4,1,10,"""N""",0.15,,12.14,4.41,2.0,4,"""MOL""",-1,33,7,5,1,465.1635,"""BASE""","""C24H24FN5O2S""","""N=C(N)NCCCOc1ccc(CNc2nc3ccc(Oc…","""RXTJPHLPHOZLFS-UHFFFAOYSA-N"""
…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…
"""CHEMBL2017916""",,,"""Small molecule""",0,312.35,3,3,2.86,77.0,6,1,0,4,"""N""",0.8,8.13,3.49,2.17,2.1,3,"""MOL""",-1,22,6,1,0,312.0681,"""NEUTRAL""","""C15H12N4O2S""","""COc1ccc(-c2nnc(NC(=O)c3cccnc3)…","""XIZUJGDKNPVNQA-UHFFFAOYSA-N"""
"""CHEMBL374652""",,,"""Small molecule""",0,403.83,1,1,5.98,36.02,2,2,1,4,"""N""",0.42,13.65,,5.36,5.36,3,"""MOL""",-1,26,2,2,1,403.0421,"""NEUTRAL""","""C18H14ClF4NOS""","""CC(O)(CSc1ccc(F)cc1)c1cc2cc(Cl…","""CRPQTBRTHURKII-UHFFFAOYSA-N"""
"""CHEMBL1416264""",,,"""Small molecule""",0,380.41,6,8,3.06,85.07,7,1,0,5,"""N""",0.54,13.85,3.86,2.47,2.47,4,"""MOL""",-1,27,7,1,0,380.0856,"""NEUTRAL""","""C18H13FN6OS""","""O=C(CSc1ccc2nnc(-c3cccnc3)n2n1…","""QVYIEKHEJKFNAT-UHFFFAOYSA-N"""
"""CHEMBL213734""",,,"""Small molecule""",0,288.26,2,3,2.32,101.7,5,2,0,5,"""N""",0.5,7.2,,2.36,1.95,2,"""MOL""",-1,21,7,2,0,288.0746,"""NEUTRAL""","""C14H12N2O5""","""O=C(COc1ccccc1)Nc1ccc([N+](=O)…","""PZTWAHGBGTWVEB-UHFFFAOYSA-N"""


I'm looking at the "Structure Type" column next.

In [None]:
df_sm.group_by("Structure Type").len()

Structure Type,len
str,u32
"""MOL""",1914876
"""SEQ""",1
"""BOTH""",4
"""NONE""",5485


There are 5,485 entries with "NONE" as their "Structure Type", which means their compound structures are unknown or not recorded in either the compound_structures table or the protein_therapeutics table. These entries will be removed.

Next, I'm rebuilding df_sm from the original dataframe with both conditions applied: small molecules only, and no "NONE" structure types.

In [None]:
df_sm = df.filter((pl.col("Type") == "Small molecule") & (pl.col("Structure Type") != "NONE"))

df_sm #1,914,881 entries

ChEMBL ID,Name,Synonyms,Type,Max Phase,Molecular Weight,Targets,Bioactivities,AlogP,Polar Surface Area,HBA,HBD,#RO5 Violations,#Rotatable Bonds,Passes Ro3,QED Weighted,CX Acidic pKa,CX Basic pKa,CX LogP,CX LogD,Aromatic Rings,Structure Type,Inorganic Flag,Heavy Atoms,HBA (Lipinski),HBD (Lipinski),#RO5 Violations (Lipinski),Molecular Weight (Monoisotopic),Molecular Species,Molecular Formula,Smiles,Inchi Key
str,str,str,str,i64,f64,i64,i64,f64,f64,i64,i64,i64,i64,str,f64,f64,f64,f64,f64,i64,str,i64,i64,i64,i64,i64,f64,str,str,str,str
"""CHEMBL1206185""",,,"""Small molecule""",0,607.88,,,9.46,89.62,5,2,2,17,"""N""",0.09,-1.91,8.38,9.4,9.36,3,"""MOL""",-1,42,5,3,2,607.279,"""ACID""","""C35H45NO4S2""","""CCCCCCCCCCC#CC(N)c1ccccc1-c1cc…","""UFBLKYIDZFRLPR-UHFFFAOYSA-N"""
"""CHEMBL539070""",,,"""Small molecule""",0,286.79,1,1,2.28,73.06,6,2,0,5,"""N""",0.63,13.84,3.64,2.57,2.57,2,"""MOL""",-1,17,5,3,0,250.0888,"""NEUTRAL""","""C11H15ClN4OS""","""CCCOc1ccccc1-c1nnc(NN)s1.Cl""","""WPEWNRKLKLNLSO-UHFFFAOYSA-N"""
"""CHEMBL3335528""",,,"""Small molecule""",0,842.8,2,6,0.18,269.57,18,5,2,17,"""N""",0.09,3.2,,3.31,-0.14,3,"""MOL""",-1,60,19,5,2,842.2633,"""ACID""","""C41H46O19""","""COC(=O)[C@H](O[C@@H]1O[C@@H](C…","""KGUJQZWYZPYYRZ-LWEWUKDVSA-N"""
"""CHEMBL2419030""",,,"""Small molecule""",0,359.33,4,4,3.94,85.13,6,1,0,3,"""N""",0.66,,,3.66,3.66,2,"""MOL""",-1,24,6,1,0,359.0551,"""NEUTRAL""","""C14H12F3N3O3S""","""O=c1nc(NC2CCCC2)sc2c([N+](=O)[…","""QGDMYSDFCXOKML-UHFFFAOYSA-N"""
"""CHEMBL4301448""",,,"""Small molecule""",0,465.55,,,5.09,105.28,6,4,1,10,"""N""",0.15,,12.14,4.41,2.0,4,"""MOL""",-1,33,7,5,1,465.1635,"""BASE""","""C24H24FN5O2S""","""N=C(N)NCCCOc1ccc(CNc2nc3ccc(Oc…","""RXTJPHLPHOZLFS-UHFFFAOYSA-N"""
…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…
"""CHEMBL2017916""",,,"""Small molecule""",0,312.35,3,3,2.86,77.0,6,1,0,4,"""N""",0.8,8.13,3.49,2.17,2.1,3,"""MOL""",-1,22,6,1,0,312.0681,"""NEUTRAL""","""C15H12N4O2S""","""COc1ccc(-c2nnc(NC(=O)c3cccnc3)…","""XIZUJGDKNPVNQA-UHFFFAOYSA-N"""
"""CHEMBL374652""",,,"""Small molecule""",0,403.83,1,1,5.98,36.02,2,2,1,4,"""N""",0.42,13.65,,5.36,5.36,3,"""MOL""",-1,26,2,2,1,403.0421,"""NEUTRAL""","""C18H14ClF4NOS""","""CC(O)(CSc1ccc(F)cc1)c1cc2cc(Cl…","""CRPQTBRTHURKII-UHFFFAOYSA-N"""
"""CHEMBL1416264""",,,"""Small molecule""",0,380.41,6,8,3.06,85.07,7,1,0,5,"""N""",0.54,13.85,3.86,2.47,2.47,4,"""MOL""",-1,27,7,1,0,380.0856,"""NEUTRAL""","""C18H13FN6OS""","""O=C(CSc1ccc2nnc(-c3cccnc3)n2n1…","""QVYIEKHEJKFNAT-UHFFFAOYSA-N"""
"""CHEMBL213734""",,,"""Small molecule""",0,288.26,2,3,2.32,101.7,5,2,0,5,"""N""",0.5,7.2,,2.36,1.95,2,"""MOL""",-1,21,7,2,0,288.0746,"""NEUTRAL""","""C14H12N2O5""","""O=C(COc1ccccc1)Nc1ccc([N+](=O)…","""PZTWAHGBGTWVEB-UHFFFAOYSA-N"""


In [None]:
# Check "NONE" entries are removed/filtered
df_sm.group_by("Structure Type").len()

Structure Type,len
str,u32
"""MOL""",1914876
"""SEQ""",1
"""BOTH""",4


I've tried filtering out data using the "Inorganic Flag" column previously, however it turned out to be not so suitable: it would rule out a lot of preclinical compounds with max phase 0, or compounds with max phase > 1, that have no calculated physicochemical properties, which means there may not be enough training data to build a machine learning model. So I'm opting for the "Targets" column here, ruling out compounds with no targets. These appear as null in the counts below, and the `> 0` comparison used in the filter drops null values as well as zeros.

In [None]:
df_sm.group_by("Targets").len()

Targets,len
i64,u32
,83321
42,401
18,8992
134,60
557,2
…,…
113,25
378,1
354,2
753,1


In [None]:
df_sm = df.filter((pl.col("Type") == "Small molecule") & (pl.col("Structure Type") != "NONE") & (pl.col("Targets") > 0 ))

df_sm #1,831,560 entries

ChEMBL ID,Name,Synonyms,Type,Max Phase,Molecular Weight,Targets,Bioactivities,AlogP,Polar Surface Area,HBA,HBD,#RO5 Violations,#Rotatable Bonds,Passes Ro3,QED Weighted,CX Acidic pKa,CX Basic pKa,CX LogP,CX LogD,Aromatic Rings,Structure Type,Inorganic Flag,Heavy Atoms,HBA (Lipinski),HBD (Lipinski),#RO5 Violations (Lipinski),Molecular Weight (Monoisotopic),Molecular Species,Molecular Formula,Smiles,Inchi Key
str,str,str,str,i64,f64,i64,i64,f64,f64,i64,i64,i64,i64,str,f64,f64,f64,f64,f64,i64,str,i64,i64,i64,i64,i64,f64,str,str,str,str
"""CHEMBL539070""",,,"""Small molecule""",0,286.79,1,1,2.28,73.06,6,2,0,5,"""N""",0.63,13.84,3.64,2.57,2.57,2,"""MOL""",-1,17,5,3,0,250.0888,"""NEUTRAL""","""C11H15ClN4OS""","""CCCOc1ccccc1-c1nnc(NN)s1.Cl""","""WPEWNRKLKLNLSO-UHFFFAOYSA-N"""
"""CHEMBL3335528""",,,"""Small molecule""",0,842.8,2,6,0.18,269.57,18,5,2,17,"""N""",0.09,3.2,,3.31,-0.14,3,"""MOL""",-1,60,19,5,2,842.2633,"""ACID""","""C41H46O19""","""COC(=O)[C@H](O[C@@H]1O[C@@H](C…","""KGUJQZWYZPYYRZ-LWEWUKDVSA-N"""
"""CHEMBL2419030""",,,"""Small molecule""",0,359.33,4,4,3.94,85.13,6,1,0,3,"""N""",0.66,,,3.66,3.66,2,"""MOL""",-1,24,6,1,0,359.0551,"""NEUTRAL""","""C14H12F3N3O3S""","""O=c1nc(NC2CCCC2)sc2c([N+](=O)[…","""QGDMYSDFCXOKML-UHFFFAOYSA-N"""
"""CHEMBL3827271""",,,"""Small molecule""",0,712.85,1,1,-2.84,319.06,10,11,2,16,"""N""",0.07,4.08,10.49,-6.88,-8.95,0,"""MOL""",-1,50,19,14,3,712.4232,"""ZWITTERION""","""C31H56N10O9""","""CC(C)C[C@@H]1NC(=O)[C@H](CCCNC…","""QJQNNLICZLLPMB-VUBDRERZSA-N"""
"""CHEMBL3465961""",,,"""Small molecule""",0,319.42,16,22,2.22,50.5,4,1,0,6,"""N""",0.87,,9.38,2.13,-0.44,1,"""MOL""",-1,23,4,1,0,319.206,"""BASE""","""C18H26FN3O""","""CC(O)CN1CCC(CN(C)Cc2cc(C#N)ccc…","""FZEVYCHTADTXPM-UHFFFAOYSA-N"""
…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…
"""CHEMBL2017916""",,,"""Small molecule""",0,312.35,3,3,2.86,77.0,6,1,0,4,"""N""",0.8,8.13,3.49,2.17,2.1,3,"""MOL""",-1,22,6,1,0,312.0681,"""NEUTRAL""","""C15H12N4O2S""","""COc1ccc(-c2nnc(NC(=O)c3cccnc3)…","""XIZUJGDKNPVNQA-UHFFFAOYSA-N"""
"""CHEMBL374652""",,,"""Small molecule""",0,403.83,1,1,5.98,36.02,2,2,1,4,"""N""",0.42,13.65,,5.36,5.36,3,"""MOL""",-1,26,2,2,1,403.0421,"""NEUTRAL""","""C18H14ClF4NOS""","""CC(O)(CSc1ccc(F)cc1)c1cc2cc(Cl…","""CRPQTBRTHURKII-UHFFFAOYSA-N"""
"""CHEMBL1416264""",,,"""Small molecule""",0,380.41,6,8,3.06,85.07,7,1,0,5,"""N""",0.54,13.85,3.86,2.47,2.47,4,"""MOL""",-1,27,7,1,0,380.0856,"""NEUTRAL""","""C18H13FN6OS""","""O=C(CSc1ccc2nnc(-c3cccnc3)n2n1…","""QVYIEKHEJKFNAT-UHFFFAOYSA-N"""
"""CHEMBL213734""",,,"""Small molecule""",0,288.26,2,3,2.32,101.7,5,2,0,5,"""N""",0.5,7.2,,2.36,1.95,2,"""MOL""",-1,21,7,2,0,288.0746,"""NEUTRAL""","""C14H12N2O5""","""O=C(COc1ccccc1)Nc1ccc([N+](=O)…","""PZTWAHGBGTWVEB-UHFFFAOYSA-N"""


The next step is to save the dataframe as a parquet file.

Reference: [Apache Parquet documentation](https://parquet.apache.org/docs/)

I have tried two main approaches. One uses `write_parquet()` with only the file compression level parameter added (the "without partition" way), and the other uses `use_pyarrow` and `pyarrow_options` to partition the dataset. The resulting parquet file sizes are shown in the following two tables.

```{{python}}
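# Without partitioning dataset
from pathlib import Path

# Save into the current working directory
path = Path.cwd() / "chembl_sm_mols.parquet"
# Polars compresses with zstd by default; 22 is the highest zstd level
df_sm.write_parquet(path, compression_level=22)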
```

+-------------------+-------------------------------------+--------------------------------------------+-------------------+
| Compression level | Data restrictions                   | File size                                  | Number of entries |
+===================+=====================================+============================================+===================+
| 22                | None                                | 127.3 MB                                   | 2,331,700         |
+-------------------+-------------------------------------+--------------------------------------------+-------------------+
| 22                | - Small molecules only              | 105.4 MB                                   | 1,920,366         |
|                   |                                     |                                            |                   |
+-------------------+-------------------------------------+--------------------------------------------+-------------------+
| 22                | - Small molecules only              | 105.1 MB                                   | 1,914,881         |
|                   | - Exclude structure type with "NONE"|                                            |                   |
+-------------------+-------------------------------------+--------------------------------------------+-------------------+
| 22                | - Small molecules only              | 100.4 MB                                   | 1,831,560         |
|                   | - Exclude structure type with "NONE"|                                            |                   |
|                   | - Remove compounds with no targets  |                                            |                   |
+-------------------+-------------------------------------+--------------------------------------------+-------------------+

: Parquet file size changes without data partitions (note: original .csv file size is 664.8 MB)

```{{python}}
# Partitioning dataset
path = Path.cwd() / "chembl_mols_type_part"
df.write_parquet(
    path,
    #compression_level=20,
    use_pyarrow=True,
    pyarrow_options={"partition_cols": ["Type"]},
)
```

+-------------------+----------------------+---------------------------------------------+-------------------+
| Compression level | Data restrictions    | File size                                   | Number of entries |
+===================+======================+=============================================+===================+
| default           | None                 | - using "Max Phase" as partition column     |                   |
|                   |                      | - max phase 0 > 104 MB                      | 2,331,700         |
|                   |                      | - max phases 1-4: each < 104 MB             |                   |
+-------------------+----------------------+---------------------------------------------+-------------------+
| 15                | None                 | - max phase 0 > 104 MB                      | 2,331,700         |
|                   |                      | - max phase 1-4: each < 104 MB              |                   |
+-------------------+----------------------+---------------------------------------------+-------------------+
| 20                | None                 | - similar sizes as mentioned above          | 2,331,700         |
+-------------------+----------------------+---------------------------------------------+-------------------+
| default           | None                 | - using "Type" as partition column          | 2,331,700         |
|                   |                      | - "Small molecule" file size = 135.2 MB     |                   |
+-------------------+----------------------+---------------------------------------------+-------------------+

: Parquet file size changes with data partitions (note: original .csv file size is 664.8 MB)
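The file sizes in the two tables can be checked against the GitHub limit with the standard library alone. The helper below is a hypothetical sketch of my own (the name `parquet_file_sizes` is not from any library); it handles either a single parquet file or a partitioned folder, and returns nothing if the path hasn't been written yet.

```python
from pathlib import Path

GITHUB_LIMIT_BYTES = 100 * 1024**2  # 100 MiB expressed in bytes


def parquet_file_sizes(path):
    """List (file, size in bytes, fits under the limit) for a parquet file or folder."""
    path = Path(path)
    if path.is_file():
        files = [path]
    elif path.is_dir():
        # Partitioned output is a folder tree, so search it recursively.
        files = sorted(path.rglob("*.parquet"))
    else:
        files = []  # nothing has been written at this location yet
    return [
        (f, f.stat().st_size, f.stat().st_size <= GITHUB_LIMIT_BYTES) for f in files
    ]


# File and folder names as used in the two write_parquet() calls above.
for target in ("chembl_sm_mols.parquet", "chembl_mols_type_part"):
    for file, size, fits in parquet_file_sizes(target):
        print(f"{file}: {size / 1_000_000:.1f} MB, under 100 MiB: {fits}")
```

Sizes are printed in decimal megabytes to match the tables, while the comparison itself is done in bytes to avoid the MiB/MB mix-up.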

In the end, the version with all three data restrictions at compression level 22 produced a file of 100.4 MB, which is under the limit. I'm reading this file into a dataframe below to see if it works.

In [None]:
df_pa = pl.read_parquet("chembl_sm_mols.parquet")
df_pa

ChEMBL ID,Name,Synonyms,Type,Max Phase,Molecular Weight,Targets,Bioactivities,AlogP,Polar Surface Area,HBA,HBD,#RO5 Violations,#Rotatable Bonds,Passes Ro3,QED Weighted,CX Acidic pKa,CX Basic pKa,CX LogP,CX LogD,Aromatic Rings,Structure Type,Inorganic Flag,Heavy Atoms,HBA (Lipinski),HBD (Lipinski),#RO5 Violations (Lipinski),Molecular Weight (Monoisotopic),Molecular Species,Molecular Formula,Smiles,Inchi Key
str,str,str,str,i64,f64,i64,i64,f64,f64,i64,i64,i64,i64,str,f64,f64,f64,f64,f64,i64,str,i64,i64,i64,i64,i64,f64,str,str,str,str
"""CHEMBL539070""",,,"""Small molecule""",0,286.79,1,1,2.28,73.06,6,2,0,5,"""N""",0.63,13.84,3.64,2.57,2.57,2,"""MOL""",-1,17,5,3,0,250.0888,"""NEUTRAL""","""C11H15ClN4OS""","""CCCOc1ccccc1-c1nnc(NN)s1.Cl""","""WPEWNRKLKLNLSO-UHFFFAOYSA-N"""
"""CHEMBL3335528""",,,"""Small molecule""",0,842.8,2,6,0.18,269.57,18,5,2,17,"""N""",0.09,3.2,,3.31,-0.14,3,"""MOL""",-1,60,19,5,2,842.2633,"""ACID""","""C41H46O19""","""COC(=O)[C@H](O[C@@H]1O[C@@H](C…","""KGUJQZWYZPYYRZ-LWEWUKDVSA-N"""
"""CHEMBL2419030""",,,"""Small molecule""",0,359.33,4,4,3.94,85.13,6,1,0,3,"""N""",0.66,,,3.66,3.66,2,"""MOL""",-1,24,6,1,0,359.0551,"""NEUTRAL""","""C14H12F3N3O3S""","""O=c1nc(NC2CCCC2)sc2c([N+](=O)[…","""QGDMYSDFCXOKML-UHFFFAOYSA-N"""
"""CHEMBL3827271""",,,"""Small molecule""",0,712.85,1,1,-2.84,319.06,10,11,2,16,"""N""",0.07,4.08,10.49,-6.88,-8.95,0,"""MOL""",-1,50,19,14,3,712.4232,"""ZWITTERION""","""C31H56N10O9""","""CC(C)C[C@@H]1NC(=O)[C@H](CCCNC…","""QJQNNLICZLLPMB-VUBDRERZSA-N"""
"""CHEMBL3465961""",,,"""Small molecule""",0,319.42,16,22,2.22,50.5,4,1,0,6,"""N""",0.87,,9.38,2.13,-0.44,1,"""MOL""",-1,23,4,1,0,319.206,"""BASE""","""C18H26FN3O""","""CC(O)CN1CCC(CN(C)Cc2cc(C#N)ccc…","""FZEVYCHTADTXPM-UHFFFAOYSA-N"""
…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…
"""CHEMBL2017916""",,,"""Small molecule""",0,312.35,3,3,2.86,77.0,6,1,0,4,"""N""",0.8,8.13,3.49,2.17,2.1,3,"""MOL""",-1,22,6,1,0,312.0681,"""NEUTRAL""","""C15H12N4O2S""","""COc1ccc(-c2nnc(NC(=O)c3cccnc3)…","""XIZUJGDKNPVNQA-UHFFFAOYSA-N"""
"""CHEMBL374652""",,,"""Small molecule""",0,403.83,1,1,5.98,36.02,2,2,1,4,"""N""",0.42,13.65,,5.36,5.36,3,"""MOL""",-1,26,2,2,1,403.0421,"""NEUTRAL""","""C18H14ClF4NOS""","""CC(O)(CSc1ccc(F)cc1)c1cc2cc(Cl…","""CRPQTBRTHURKII-UHFFFAOYSA-N"""
"""CHEMBL1416264""",,,"""Small molecule""",0,380.41,6,8,3.06,85.07,7,1,0,5,"""N""",0.54,13.85,3.86,2.47,2.47,4,"""MOL""",-1,27,7,1,0,380.0856,"""NEUTRAL""","""C18H13FN6OS""","""O=C(CSc1ccc2nnc(-c3cccnc3)n2n1…","""QVYIEKHEJKFNAT-UHFFFAOYSA-N"""
"""CHEMBL213734""",,,"""Small molecule""",0,288.26,2,3,2.32,101.7,5,2,0,5,"""N""",0.5,7.2,,2.36,1.95,2,"""MOL""",-1,21,7,2,0,288.0746,"""NEUTRAL""","""C14H12N2O5""","""O=C(COc1ccccc1)Nc1ccc([N+](=O)…","""PZTWAHGBGTWVEB-UHFFFAOYSA-N"""


So reading the parquet file back works. The next posts in this series will be about trying to use the Polars dataframe library all the way with scikit-learn.