# Data handling with Pandas and Visualisation with Matplotlib

## Pandas
### Brief Overview of the Pandas Library

Pandas is a fast, powerful, and flexible open-source data analysis and data manipulation Python library. Built on top of the Python programming language, it provides data structures like Series and DataFrames for handling and analyzing structured data.

### Data Structures:
- **Series**: 1D labeled array capable of holding any data type.
- **DataFrame**: 2D labeled data structure with columns that can be of different types.

**Functionalities**: Pandas supports tasks ranging from data cleaning, transformation, and visualization to more complex operations like aggregation and reshaping.

### Relevance in Data Analysis

- **Data Wrangling**: Pandas makes it straightforward to clean messy raw data and reshape it into a form suitable for analysis.

- **Exploratory Data Analysis (EDA)**: With functions that help to inspect data, compute summary statistics, and visualize distributions, Pandas is a go-to tool for preliminary data investigation.

### Why Pandas?

Importance in Data Manipulation and Analysis

- **Ease of Use**: The library’s syntax is straightforward, which makes it easy for beginners to get started.

- **Flexibility**: Can handle a variety of data formats (like CSV, Excel, SQL databases, and even HDF5), and its DataFrame structure allows you to transform data quickly.

- **Performance**: Built on top of NumPy, whose core routines are implemented in C, it’s fast and efficient for data manipulation.

- **Community Support**: As one of the most popular data science libraries, it has a strong community and a plethora of tutorials, making it easier to find help and resources.

Ease of Use

- **Intuitive Syntax**: The Pandas API is designed to be intuitive and easy, making it accessible for new users and comprehensive for experienced ones.

- **Comprehensive Documentation**: An array of examples and community-contributed tutorials mean that most problems you encounter have already been solved and documented.

By integrating seamlessly with other data science libraries like Matplotlib for plotting and Scikit-learn for machine learning, Pandas enables end-to-end data analysis right within Python, making it an essential tool for anyone engaged in data science or data analysis.

# Importing libraries
## Import pandas, NumPy, and Matplotlib

By convention, these libraries are imported under the following aliases.



In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

## Basic Data Structures in Pandas
### Series

A Pandas Series is essentially a one-dimensional array that can hold any data type. It comes with an index, which allows for both positional and label-based indexing.

    Syntax: pd.Series(data)


Creation of a Pandas Series: You can create a Series from a list, dictionary, or NumPy array.

- From a List


In [None]:
data = [1, 2, 3, 4]
ds = pd.Series(data)
ds

0    1
1    2
2    3
3    4
dtype: int64


- From a Dictionary


In [None]:
data = {'a': 1, 'b': 2, 'c': 3}
print("The data dictionary is ", data)
s = pd.Series(data)
s

The data dictionary is  {'a': 1, 'b': 2, 'c': 3}


a    1
b    2
c    3
dtype: int64

- From a NumPy Array


In [None]:
data = np.array([1, 2, 3])
print("The data as a numpy array is :", data)
# Build the Series from the array defined above
s = pd.Series(data)
s

The data as a numpy array is : [1 2 3]


0    1
1    2
2    3
dtype: int64
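
The index described above supports both label-based and positional access. The following is a minimal sketch, assuming a small Series with illustrative string labels (`'a'`, `'b'`, `'c'` are chosen here only for the example):

```python
import pandas as pd

# A Series with an explicit string index (labels are illustrative)
s = pd.Series([10, 20, 30], index=['a', 'b', 'c'])

by_label = s.loc['b']        # label-based access
by_position = s.iloc[1]      # position-based access (second element)
subset = s.loc[['a', 'c']]   # several labels at once returns a Series

print(by_label, by_position)
print(subset)
```

Using `.loc` and `.iloc` explicitly avoids ambiguity about whether a number refers to a label or a position.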


### DataFrame

**Definition**: A Pandas DataFrame is a 2D table with labeled axes (rows and columns). Each column can have its own data type, and you can perform operations much like you would in a SQL table or an Excel spreadsheet.

**Syntax**: ```pd.DataFrame(data, columns=columns, index=index)```

**Creation of a Pandas DataFrame**
DataFrames can be created from lists, dictionaries, Series, and even other DataFrames.

- From a List of Lists


In [None]:
df = pd.DataFrame([[1, 'a'], [2, 'b'], [3, 'c']], columns=['Number', 'Letter'])
df

Unnamed: 0,Number,Letter
0,1,a
1,2,b
2,3,c



 - From a Dictionary


In [None]:
df = pd.DataFrame({'Number': [1, 2, 3], 'Letter': ['a', 'b', 'c']})
df

Unnamed: 0,Number,Letter
0,1,a
1,2,b
2,3,c


- From Series


In [None]:
s1 = pd.Series([1, 2, 3])
s2 = pd.Series(['a', 'b', 'c'])
df = pd.DataFrame({'Number': s1, 'Letter': s2})
df

Unnamed: 0,Number,Letter
0,1,a
1,2,b
2,3,c



By understanding Series and DataFrames, you set the foundation for almost everything you'll do with Pandas, from data manipulation to advanced analysis.
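
To illustrate the `columns` and `index` arguments and the per-column data types mentioned in the definition, here is a small sketch. The table `toy`, its row labels, and the approximate molecular weights are hypothetical values used only for illustration:

```python
import pandas as pd

# Hypothetical toy table: each column holds a different data type
toy = pd.DataFrame(
    {
        'Name': ['water', 'benzene', 'naphthalene'],
        'Rings': [0, 1, 2],
        'Weight': [18.015, 78.114, 128.174],  # approximate values in g/mol
    },
    index=['m1', 'm2', 'm3'],  # custom row labels instead of 0, 1, 2
)

print(toy.dtypes)            # one dtype per column

weights = toy['Weight']      # a single column is a Series
print(type(weights).__name__)
print(toy.loc['m2', 'Name'])
```

Each column of a DataFrame is itself a Series that shares the DataFrame's row index.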

## Data Import and Export in Pandas

### Writing Data
Use `to_csv()` to write a Pandas DataFrame to a CSV file.

    Function: to_csv()
        Syntax: DataFrame.to_csv('file_path')

In [None]:
# Create a simple DataFrame
df = pd.DataFrame({'A': [1, 2, 3], 'B': ['a', 'b', 'c']})

# Write DataFrame to a CSV file
df.to_csv('output.csv', index=False)
!ls output.csv

output.csv


By setting index=False, you ensure that the DataFrame index doesn’t get written into the CSV file, keeping the file clean and ready for further use.


### Reading Data

Use `read_csv()` to import data from a CSV file into a DataFrame.

    Function: read_csv()
        Syntax: pd.read_csv('file_path')

In [None]:
# Read a CSV file into a DataFrame
df = pd.read_csv('output.csv')

# Display the first few rows of the DataFrame
print(df.head())
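
The effect of `index=False` when writing, and of `index_col=0` when reading, can be seen in a round trip. The sketch below works in memory with `io.StringIO` instead of files, and the name `df_small` is an assumption made for the example:

```python
import io
import pandas as pd

df_small = pd.DataFrame({'A': [1, 2, 3], 'B': ['a', 'b', 'c']})

# With no path argument, to_csv() returns the CSV text as a string
with_index = df_small.to_csv()                # index written as an unnamed first column
without_index = df_small.to_csv(index=False)  # index left out

# Reading the first version naively turns the old index into a data column
naive = pd.read_csv(io.StringIO(with_index))
# index_col=0 tells read_csv() to use the first column as the index instead
restored = pd.read_csv(io.StringIO(with_index), index_col=0)
clean = pd.read_csv(io.StringIO(without_index))

print(list(naive.columns))
print(list(restored.columns))
print(list(clean.columns))
```

When a CSV file already contains a saved index column, passing `index_col=0` keeps it from showing up as an extra `Unnamed: 0` column.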


# Molecular Property Data

Now, let us import data on molecular properties. For this tutorial, we take the [solubility](https://raw.githubusercontent.com/GLambard/Molecules_Dataset_Collection/master/latest/ESOL_delaney-processed.csv) data from [Guillaume Lambard](https://github.com/GLambard).
The first column of this file is an index column, so we pass `index_col=0` to `read_csv()`.

In [None]:
import pandas as pd
url="https://raw.githubusercontent.com/GLambard/Molecules_Dataset_Collection/master/latest/ESOL_delaney-processed.csv"
df = pd.read_csv(url, index_col=0)
df

Unnamed: 0,Compound ID,ESOL predicted log solubility in mols per litre,Minimum Degree,Molecular Weight,Number of H-Bond Donors,Number of Rings,Number of Rotatable Bonds,Polar Surface Area,measured log solubility in mols per litre,smiles
0,Amigdalin,-0.974,1,457.432,7,3,7,202.32,-0.770,N#CC(OC1OC(COC2OC(CO)C(O)C(O)C2O)C(O)C(O)C1O)C...
1,Fenfuram,-2.885,1,201.225,1,2,2,42.24,-3.300,CC1:O:C:C:C:1C(=O)NC1:C:C:C:C:C:1
2,citral,-2.579,1,152.237,0,0,4,17.07,-2.060,CC(C)=CCCC(C)=CC=O
3,Picene,-6.618,2,278.354,0,5,0,0.00,-7.870,C1:C:C:C2:C(:C:1):C:C:C1:C:2:C:C:C2:C3:C:C:C:C...
4,Thiophene,-2.232,2,84.143,0,1,0,0.00,-1.330,C1:C:C:S:C:1
...,...,...,...,...,...,...,...,...,...,...
1123,halothane,-2.608,1,197.381,0,0,0,0.00,-1.710,FC(F)(F)C(Cl)Br
1124,Oxamyl,-0.908,1,219.266,1,0,1,71.00,0.106,CNC(=O)ON=C(SC)C(=O)N(C)C
1125,Thiometon,-3.323,1,246.359,0,0,7,18.46,-3.091,CCSCCSP(=S)(OC)OC
1126,2-Methylbutane,-2.245,1,72.151,0,0,1,0.00,-3.180,CCC(C)C


What are the columns in this dataframe?

In [None]:
df.columns

Index(['Compound ID', 'ESOL predicted log solubility in mols per litre',
       'Minimum Degree', 'Molecular Weight', 'Number of H-Bond Donors',
       'Number of Rings', 'Number of Rotatable Bonds', 'Polar Surface Area',
       'measured log solubility in mols per litre', 'smiles'],
      dtype='object')

Compound ID:
  
    Description: A unique identifier for each chemical compound.
    Type: Categorical/String

ESOL predicted log solubility in mols per litre:

    Description: The predicted solubility of the compound in mols per litre, computed using the ESOL model.
    Type: Numerical/Float

Minimum Degree:

    Description: The minimum degree (number of edges connected to a node) in the molecular graph representation of the compound.
    Type: Numerical/Integer

Molecular Weight:

    Description: The molecular weight of the compound, usually in g/mol.
    Type: Numerical/Float

Number of H-Bond Donors:

    Description: The number of atoms in the compound that can act as hydrogen bond donors.
    Type: Numerical/Integer

Number of Rings:

    Description: The number of ring structures in the compound.
    Type: Numerical/Integer

Number of Rotatable Bonds:

    Description: The number of bonds in the molecule that are capable of rotation.
    Type: Numerical/Integer

Polar Surface Area:

    Description: The surface area of the molecule that is polar, usually measured in square angstroms (Å²).
    Type: Numerical/Float

Measured log solubility in mols per litre:

    Description: The experimentally measured solubility of the compound in mols per litre, usually obtained through lab experiments.
    Type: Numerical/Float

Smiles:

    Description: Simplified Molecular Input Line Entry System (SMILES) notation, representing the structural formula of the compound.
    Type: Categorical/String

## ESOL model
The ESOL (Estimated SOLubility) model is an empirical model designed to predict the aqueous solubility of organic compounds. It was developed by John S. Delaney and published in a 2004 paper titled "ESOL: Estimating Aqueous Solubility Directly from Molecular Structure". The model uses a small set of molecular descriptors computed from the structure, namely the calculated octanol-water partition coefficient (clogP), molecular weight, number of rotatable bonds, and aromatic proportion, to estimate the log solubility (log S) of a compound in water.

The ESOL model is relatively simple and interpretable compared to more complex machine learning models. It is often used as a baseline or comparative standard in solubility prediction tasks. The model is particularly useful for screening large compound libraries in drug discovery and other chemical engineering applications, where quick and reasonable solubility estimates are needed.

## Data Inspection



### Head and Tail

Use `head()` and `tail()` to inspect the first and last rows of a DataFrame.


In [None]:
df.head()

Unnamed: 0,Compound ID,ESOL predicted log solubility in mols per litre,Minimum Degree,Molecular Weight,Number of H-Bond Donors,Number of Rings,Number of Rotatable Bonds,Polar Surface Area,measured log solubility in mols per litre,smiles
0,Amigdalin,-0.974,1,457.432,7,3,7,202.32,-0.77,N#CC(OC1OC(COC2OC(CO)C(O)C(O)C2O)C(O)C(O)C1O)C...
1,Fenfuram,-2.885,1,201.225,1,2,2,42.24,-3.3,CC1:O:C:C:C:1C(=O)NC1:C:C:C:C:C:1
2,citral,-2.579,1,152.237,0,0,4,17.07,-2.06,CC(C)=CCCC(C)=CC=O
3,Picene,-6.618,2,278.354,0,5,0,0.0,-7.87,C1:C:C:C2:C(:C:1):C:C:C1:C:2:C:C:C2:C3:C:C:C:C...
4,Thiophene,-2.232,2,84.143,0,1,0,0.0,-1.33,C1:C:C:S:C:1


In [None]:
df.tail()

Unnamed: 0,Compound ID,ESOL predicted log solubility in mols per litre,Minimum Degree,Molecular Weight,Number of H-Bond Donors,Number of Rings,Number of Rotatable Bonds,Polar Surface Area,measured log solubility in mols per litre,smiles
1123,halothane,-2.608,1,197.381,0,0,0,0.0,-1.71,FC(F)(F)C(Cl)Br
1124,Oxamyl,-0.908,1,219.266,1,0,1,71.0,0.106,CNC(=O)ON=C(SC)C(=O)N(C)C
1125,Thiometon,-3.323,1,246.359,0,0,7,18.46,-3.091,CCSCCSP(=S)(OC)OC
1126,2-Methylbutane,-2.245,1,72.151,0,0,1,0.0,-3.18,CCC(C)C
1127,Stirofos,-4.32,1,365.964,0,1,5,44.76,-4.522,COP(=O)(OC)OC(=CCl)C1:C:C(Cl):C(Cl):C:C:1Cl


In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1128 entries, 0 to 1127
Data columns (total 10 columns):
 #   Column                                           Non-Null Count  Dtype  
---  ------                                           --------------  -----  
 0   Compound ID                                      1128 non-null   object 
 1   ESOL predicted log solubility in mols per litre  1128 non-null   float64
 2   Minimum Degree                                   1128 non-null   int64  
 3   Molecular Weight                                 1128 non-null   float64
 4   Number of H-Bond Donors                          1128 non-null   int64  
 5   Number of Rings                                  1128 non-null   int64  
 6   Number of Rotatable Bonds                        1128 non-null   int64  
 7   Polar Surface Area                               1128 non-null   float64
 8   measured log solubility in mols per litre        1128 non-null   float64
 9   smiles                        

In [None]:
df.shape

(1128, 10)
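
Both `head()` and `tail()` accept the number of rows to return, and `shape` is an attribute rather than a method. A minimal sketch on a hypothetical ten-row table (`toy` is not part of the solubility data):

```python
import pandas as pd

# Hypothetical toy table with ten rows
toy = pd.DataFrame({'x': range(10), 'y': [v * v for v in range(10)]})

first_three = toy.head(3)    # first 3 rows instead of the default 5
last_two = toy.tail(2)       # last 2 rows
n_rows, n_cols = toy.shape   # (rows, columns) tuple, accessed without parentheses

print(first_three)
print(last_two)
print(n_rows, n_cols)
```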

## Descriptive Statistics

Use `describe()` to get a statistical summary of the numeric columns.


In [None]:
df.describe()

Unnamed: 0,ESOL predicted log solubility in mols per litre,Minimum Degree,Molecular Weight,Number of H-Bond Donors,Number of Rings,Number of Rotatable Bonds,Polar Surface Area,measured log solubility in mols per litre
count,1128.0,1128.0,1128.0,1128.0,1128.0,1128.0,1128.0,1128.0
mean,-2.988192,1.058511,203.937074,0.701241,1.390957,2.177305,34.872881,-3.050102
std,1.68322,0.23856,102.738077,1.089727,1.318286,2.640974,35.383593,2.096441
min,-9.702,0.0,16.043,0.0,0.0,0.0,0.0,-11.6
25%,-3.94825,1.0,121.183,0.0,0.0,0.0,0.0,-4.3175
50%,-2.87,1.0,182.179,0.0,1.0,1.0,26.3,-2.86
75%,-1.84375,1.0,270.372,1.0,2.0,3.0,55.44,-1.6
max,1.091,2.0,780.949,11.0,8.0,23.0,268.68,1.58
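
Note that the summary above contains only the numeric columns. The sketch below, on a hypothetical toy table, shows how to pull a single statistic out of the result and how a text column can be summarised separately:

```python
import pandas as pd

# Hypothetical toy table with one text column and two numeric columns
toy = pd.DataFrame({
    'name': ['a', 'b', 'c', 'd'],
    'weight': [10.0, 20.0, 30.0, 40.0],
    'rings': [0, 1, 1, 2],
})

summary = toy.describe()                      # numeric columns only by default
mean_weight = summary.loc['mean', 'weight']   # one statistic for one column
median_weight = toy['weight'].median()        # the same value as the 50% row

name_summary = toy['name'].describe()         # count, unique, top, freq for text data

print(summary)
print(mean_weight, median_weight)
print(name_summary)
```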


In [None]:
df.index

Int64Index([   0,    1,    2,    3,    4,    5,    6,    7,    8,    9,
            ...
            1118, 1119, 1120, 1121, 1122, 1123, 1124, 1125, 1126, 1127],
           dtype='int64', length=1128)


## Data Filtering and Selection


### Row and Column Selection using loc[] and iloc[]

The `loc[]` indexer allows you to select rows and columns from a DataFrame based on their labels.

Syntax: `DataFrame.loc[<row_labels>, <column_labels>]`

Features

- Label-based: Uses the explicit index and column names for selection.

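- Sliceable: Allows slicing of rows and columns using their labels. Unlike ordinary Python slices, label slices include both endpoints, so `df.loc[1:3]` returns the rows labeled 1, 2, and 3.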

- Boolean Masking: Can be used with boolean conditions to filter data.

Examples

- Selecting a Single Row by Index Label



In [None]:
df.loc[1]

Compound ID                                                                 Fenfuram
ESOL predicted log solubility in mols per litre                               -2.885
Minimum Degree                                                                     1
Molecular Weight                                                             201.225
Number of H-Bond Donors                                                            1
Number of Rings                                                                    2
Number of Rotatable Bonds                                                          2
Polar Surface Area                                                             42.24
measured log solubility in mols per litre                                       -3.3
smiles                                             CC1:O:C:C:C:1C(=O)NC1:C:C:C:C:C:1
Name: 1, dtype: object


- Selecting Multiple Rows by Index Labels


In [None]:
rows = df.loc[1:3]
rows

Unnamed: 0,Compound ID,ESOL predicted log solubility in mols per litre,Minimum Degree,Molecular Weight,Number of H-Bond Donors,Number of Rings,Number of Rotatable Bonds,Polar Surface Area,measured log solubility in mols per litre,smiles
1,Fenfuram,-2.885,1,201.225,1,2,2,42.24,-3.3,CC1:O:C:C:C:1C(=O)NC1:C:C:C:C:C:1
2,citral,-2.579,1,152.237,0,0,4,17.07,-2.06,CC(C)=CCCC(C)=CC=O
3,Picene,-6.618,2,278.354,0,5,0,0.0,-7.87,C1:C:C:C2:C(:C:1):C:C:C1:C:2:C:C:C2:C3:C:C:C:C...


- Selecting Specific Columns with Rows


In [37]:
df.loc[1:3, ['Compound ID', 'smiles']]

Unnamed: 0,Compound ID,smiles
1,Fenfuram,CC1:O:C:C:C:1C(=O)NC1:C:C:C:C:C:1
2,citral,CC(C)=CCCC(C)=CC=O
3,Picene,C1:C:C:C2:C(:C:1):C:C:C1:C:2:C:C:C2:C3:C:C:C:C...
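
The label slicing and boolean masking features of `loc[]` can be combined with column selection in a single call. A minimal sketch, assuming a hypothetical five-row table with the default integer labels:

```python
import pandas as pd

# Hypothetical toy table with the default integer index 0..4
toy = pd.DataFrame({
    'name': ['A', 'B', 'C', 'D', 'E'],
    'rings': [0, 2, 1, 3, 0],
})

label_slice = toy.loc[1:3]             # labels 1, 2 and 3: the end label is included
mask = toy['rings'] >= 2               # boolean Series aligned on the index
ringed_names = toy.loc[mask, 'name']   # filter rows and pick one column at once

print(label_slice)
print(ringed_names)
```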


### Selecting columns

In [39]:
df['Molecular Weight']

0       457.432
1       201.225
2       152.237
3       278.354
4        84.143
         ...   
1123    197.381
1124    219.266
1125    246.359
1126     72.151
1127    365.964
Name: Molecular Weight, Length: 1128, dtype: float64

In [40]:
df[['Compound ID', 'smiles']]

Unnamed: 0,Compound ID,smiles
0,Amigdalin,N#CC(OC1OC(COC2OC(CO)C(O)C(O)C2O)C(O)C(O)C1O)C...
1,Fenfuram,CC1:O:C:C:C:1C(=O)NC1:C:C:C:C:C:1
2,citral,CC(C)=CCCC(C)=CC=O
3,Picene,C1:C:C:C2:C(:C:1):C:C:C1:C:2:C:C:C2:C3:C:C:C:C...
4,Thiophene,C1:C:C:S:C:1
...,...,...
1123,halothane,FC(F)(F)C(Cl)Br
1124,Oxamyl,CNC(=O)ON=C(SC)C(=O)N(C)C
1125,Thiometon,CCSCCSP(=S)(OC)OC
1126,2-Methylbutane,CCC(C)C


### Selecting with iloc[]

Syntax: `DataFrame.iloc[<row_positions>, <column_positions>]`

The `iloc[]` indexer allows you to select rows and columns from a DataFrame based on their integer positions.

Features

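- Position-based: Uses 0-based integer positions for selection, regardless of the index labels.

- Sliceable: Position slices exclude the end position, as in ordinary Python, so `df.iloc[0:3]` returns the first three rows.

- Boolean Masking: Can be used with boolean arrays or lists to filter data (a boolean Series generally has to be converted first, for example with `.to_numpy()`).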

Examples

- Selecting a Single Row by Position



In [41]:
df.iloc[0]

Compound ID                                                                                Amigdalin
ESOL predicted log solubility in mols per litre                                               -0.974
Minimum Degree                                                                                     1
Molecular Weight                                                                             457.432
Number of H-Bond Donors                                                                            7
Number of Rings                                                                                    3
Number of Rotatable Bonds                                                                          7
Polar Surface Area                                                                            202.32
measured log solubility in mols per litre                                                      -0.77
smiles                                             N#CC(OC1OC(COC2OC(CO)C(O)C(O)C2O)C(O)C(O

- Selecting Multiple Rows by Position

In [42]:
df.iloc[0:3]

Unnamed: 0,Compound ID,ESOL predicted log solubility in mols per litre,Minimum Degree,Molecular Weight,Number of H-Bond Donors,Number of Rings,Number of Rotatable Bonds,Polar Surface Area,measured log solubility in mols per litre,smiles
0,Amigdalin,-0.974,1,457.432,7,3,7,202.32,-0.77,N#CC(OC1OC(COC2OC(CO)C(O)C(O)C2O)C(O)C(O)C1O)C...
1,Fenfuram,-2.885,1,201.225,1,2,2,42.24,-3.3,CC1:O:C:C:C:1C(=O)NC1:C:C:C:C:C:1
2,citral,-2.579,1,152.237,0,0,4,17.07,-2.06,CC(C)=CCCC(C)=CC=O


- Selecting Specific Columns with Rows

In [43]:
df.iloc[0:3, 0:2]

Unnamed: 0,Compound ID,ESOL predicted log solubility in mols per litre
0,Amigdalin,-0.974
1,Fenfuram,-2.885
2,citral,-2.579


- Selecting Specific Cells


In [44]:

df.iloc[0, 1]

-0.974

Use-cases

- Data Inspection: Quickly access specific rows or columns based on their position.

- Data Cleaning: Use integer-based slicing for more control during cleaning.

- Data Transformation: Isolate specific sections of a DataFrame for analysis or modification.

Understanding iloc[] gives you greater flexibility in data selection by allowing for position-based operations, essential for various data manipulation or analysis tasks.
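
With the default index, labels and positions coincide, so `loc[]` and `iloc[]` can look interchangeable. They differ as soon as the rows are reordered or filtered. A small sketch on a hypothetical table sorted by weight:

```python
import pandas as pd

# Hypothetical toy table; sorting reorders the rows but keeps the original labels
toy = pd.DataFrame({'name': ['A', 'B', 'C', 'D'],
                    'weight': [30.0, 10.0, 40.0, 20.0]})
by_weight = toy.sort_values('weight')

lightest = by_weight.iloc[0]          # first row in the sorted order
label_zero = by_weight.loc[0]         # row whose index label is 0
first_two = by_weight.iloc[0:2]       # positions 0 and 1: the end position is excluded

print(list(by_weight.index))
print(lightest['name'], label_zero['name'])
print(first_two)
```

Here `iloc[0]` returns the lightest entry, while `loc[0]` still returns the row that was first before sorting.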


In [None]:
df.columns


## Conditional Selection

Rows can be filtered based on column values by indexing the DataFrame with a boolean condition.

How many molecules have more than five rings?

In [46]:
condition = df['Number of Rings']>5
df[condition]

Unnamed: 0,Compound ID,ESOL predicted log solubility in mols per litre,Minimum Degree,Molecular Weight,Number of H-Bond Donors,Number of Rings,Number of Rotatable Bonds,Polar Surface Area,measured log solubility in mols per litre,smiles
188,Diosgenin,-5.681,1,414.63,1,6,0,38.69,-7.32,CC1COC2CC3(C)C(CC4C5C=C6CC(O)CCC6(C)CC5CCC43C)...
219,"Etoposide (148-167,25mg/ml)",-3.292,1,588.562,3,7,5,160.83,-3.571,COC1:C:C(C2C3:C:C4:C(:C:C:3C(OC3OC5COC(C)OC5C(...
429,Kepone,-5.112,1,490.639,0,6,0,17.07,-5.259,O=C1C2(Cl)C3(Cl)C4(Cl)C(Cl)(Cl)C5(Cl)C3(Cl)C1(...
555,"Digoxin (L1=41,8mg/mL, L2=68,2mg/mL, Z=40,1mg/mL)",-5.312,1,780.949,6,8,7,203.06,-4.081,CC1OC(OC2C(O)CC(OC3C(O)CC(OC4CCC5(C)C(CCC6C5CC...
640,Digitoxin,-6.114,1,764.95,5,8,7,182.83,-5.293,CC1OC(OC2C(O)CC(OC3C(O)CC(OC4CCC5(C)C(CCC6C5CC...
676,Benzo[ghi]perylene,-6.446,2,276.338,0,6,0,0.0,-9.018,C1:C:C2:C:C:C3:C:C:C4:C:C:C5:C:C:C:C6:C(:C:1):...
718,Coronene,-6.885,2,300.36,0,7,0,0.0,-9.332,C1:C:C2:C:C:C3:C:C:C4:C:C:C5:C:C:C6:C:C:C:1:C1...
922,norbormide,-4.238,1,511.581,2,7,5,92.18,-3.931,O=C1NC(=O)C2C3C(C(O)(C4:C:C:C:C:C:4)C4:C:C:C:C...
988,Mirex,-6.155,1,545.546,0,6,0,0.0,-6.8,ClC1(Cl)C2(Cl)C3(Cl)C4(Cl)C(Cl)(Cl)C5(Cl)C3(Cl...
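
To answer a "how many" question directly, the boolean condition itself can be counted, since `True` counts as 1. A minimal sketch with hypothetical ring counts (the column name `rings` is shortened for the example):

```python
import pandas as pd

# Hypothetical toy table of ring counts
toy = pd.DataFrame({'name': ['A', 'B', 'C', 'D', 'E', 'F'],
                    'rings': [1, 6, 0, 7, 5, 8]})

condition = toy['rings'] > 5

n_matching = int(condition.sum())     # number of True values
n_matching_alt = len(toy[condition])  # same count via the filtered DataFrame
fraction = condition.mean()           # share of rows that satisfy the condition

print(n_matching, n_matching_alt, fraction)
```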


### Conditional Selection with Multiple Conditions


In [48]:
condition = (df['Number of Rings'] >3) & (df['Number of H-Bond Donors']>3)
df[condition]

Unnamed: 0,Compound ID,ESOL predicted log solubility in mols per litre,Minimum Degree,Molecular Weight,Number of H-Bond Donors,Number of Rings,Number of Rotatable Bonds,Polar Surface Area,measured log solubility in mols per litre,smiles
62,Triamcinolone,-2.734,1,394.439,4,4,2,115.06,-3.68,CC12C=CC(=O)C=C1CCC1C3CC(O)C(O)(C(=O)CO)C3(C)C...
555,"Digoxin (L1=41,8mg/mL, L2=68,2mg/mL, Z=40,1mg/mL)",-5.312,1,780.949,6,8,7,203.06,-4.081,CC1OC(OC2C(O)CC(OC3C(O)CC(OC4CCC5(C)C(CCC6C5CC...
640,Digitoxin,-6.114,1,764.95,5,8,7,182.83,-5.293,CC1OC(OC2C(O)CC(OC3C(O)CC(OC4CCC5(C)C(CCC6C5CC...
790,hematein,-1.795,1,300.266,4,4,0,107.22,-2.7,O=C1C=C2CC3(O)COC4:C(:C:C:C(O):C:4O)C3=C2C=C1O
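
Conditions are combined with `&` (and), `|` (or), and `~` (not), and each comparison needs its own parentheses because these operators bind more tightly than `>` or `==`. The sketch below uses a hypothetical table with shortened column names; `query()` is shown as an alternative spelling of the same filter:

```python
import pandas as pd

# Hypothetical toy table with shortened column names
toy = pd.DataFrame({
    'name': ['A', 'B', 'C', 'D'],
    'rings': [4, 5, 1, 6],
    'donors': [4, 1, 5, 6],
})

both = toy[(toy['rings'] > 3) & (toy['donors'] > 3)]     # AND: both conditions hold
either = toy[(toy['rings'] > 3) | (toy['donors'] > 3)]   # OR: at least one holds
few_rings = toy[~(toy['rings'] > 3)]                     # NOT: condition is False

# query() expresses the AND filter as a string
same_with_query = toy.query('rings > 3 and donors > 3')

print(both)
print(either)
print(few_rings)
```

Using Python's `and` or `or` between two Series raises an error, which is why the element-wise operators are needed.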


Which molecule has the highest molecular weight?

In [50]:
condition = df['Molecular Weight'] == df['Molecular Weight'].max()
df[condition]

Unnamed: 0,Compound ID,ESOL predicted log solubility in mols per litre,Minimum Degree,Molecular Weight,Number of H-Bond Donors,Number of Rings,Number of Rotatable Bonds,Polar Surface Area,measured log solubility in mols per litre,smiles
555,"Digoxin (L1=41,8mg/mL, L2=68,2mg/mL, Z=40,1mg/mL)",-5.312,1,780.949,6,8,7,203.06,-4.081,CC1OC(OC2C(O)CC(OC3C(O)CC(OC4CCC5(C)C(CCC6C5CC...
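
The comparison against `max()` returns every row that ties for the maximum. Two alternatives are sketched below on a hypothetical table: `idxmax()` gives the label of the first maximum, and `nlargest()` returns the top rows in descending order.

```python
import pandas as pd

# Hypothetical toy table
toy = pd.DataFrame({'name': ['A', 'B', 'C', 'D'],
                    'weight': [30.0, 10.0, 40.0, 20.0]})

heaviest_label = toy['weight'].idxmax()   # index label of the largest value
heaviest_row = toy.loc[heaviest_label]    # the full row for that label
top_two = toy.nlargest(2, 'weight')       # two heaviest rows, largest first

print(heaviest_label)
print(heaviest_row)
print(top_two)
```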


### Setting the index
Let us set the index to `'Compound ID'`. This returns a new DataFrame and leaves `df` unchanged.

In [None]:
df_new = df.set_index('Compound ID')

Or, we can modify `df` itself using `inplace=True`.

In [None]:
df.set_index('Compound ID', inplace=True)
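
Once a column becomes the index, it is no longer listed among the columns, and rows can be looked up by name with `loc[]`. A minimal sketch using three compounds and ring counts taken from the rows shown earlier; `reset_index()` reverses the operation:

```python
import pandas as pd

# Three rows with values taken from the table shown earlier
toy = pd.DataFrame({'Compound ID': ['citral', 'Picene', 'Thiophene'],
                    'Number of Rings': [0, 5, 1]})

indexed = toy.set_index('Compound ID')                     # returns a new DataFrame
picene_rings = indexed.loc['Picene', 'Number of Rings']    # look up a row by name
restored = indexed.reset_index()                           # index back to a column

print(list(indexed.columns))
print(picene_rings)
print(list(restored.columns))
```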

# Matplotlib

In [None]:
import matplotlib.pyplot as plt

In [51]:
df.columns

Index(['Compound ID', 'ESOL predicted log solubility in mols per litre',
       'Minimum Degree', 'Molecular Weight', 'Number of H-Bond Donors',
       'Number of Rings', 'Number of Rotatable Bonds', 'Polar Surface Area',
       'measured log solubility in mols per litre', 'smiles'],
      dtype='object')

In [None]:
# Scatter plot of measured solubility against molecular weight
df.plot(x='Molecular Weight', y='measured log solubility in mols per litre', kind='scatter')
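
The same kind of scatter plot can be built with Matplotlib directly, which gives more control over labels and styling. The sketch below uses five rows copied from the table shown earlier. The axis label text and the `Agg` backend line are assumptions added so that the example also runs without a display; in a notebook that line can be dropped.

```python
import matplotlib
matplotlib.use('Agg')  # assumption: non-interactive backend, omit inside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Five rows copied from the head of the solubility table
toy = pd.DataFrame({
    'Molecular Weight': [457.432, 201.225, 152.237, 278.354, 84.143],
    'measured log solubility in mols per litre': [-0.77, -3.3, -2.06, -7.87, -1.33],
})

fig, ax = plt.subplots(figsize=(5, 4))
ax.scatter(toy['Molecular Weight'],
           toy['measured log solubility in mols per litre'],
           alpha=0.7)                                  # slight transparency for overlaps
ax.set_xlabel('Molecular weight (g/mol)')
ax.set_ylabel('Measured log solubility (mol/L)')
ax.set_title('Solubility vs. molecular weight')
fig.tight_layout()
```

Working with the `fig` and `ax` objects makes it straightforward to add further layers or to save the figure with `fig.savefig(...)`.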

# Problem Statement: Predicting Molecular Solubility Using Molecular Descriptors

## Objective

To predict the solubility of a set of chemical compounds in water, given a dataset of molecular descriptors. Solubility is a critical property that affects the distribution and efficacy of a compound in biological systems, making it a key focus in computational chemistry.

## Dataset

The ESOL solubility dataset loaded above.

## Data Source

https://raw.githubusercontent.com/GLambard/Molecules_Dataset_Collection/master/latest/ESOL_delaney-processed.csv

**Expected Outcome**:

- Exploratory Data Analysis report identifying key features and correlations.
- A regression model to predict solubility from molecular descriptors.
- Visualizations to support the analysis and model evaluation.
- Insights into which molecular descriptors are most critical in determining solubility.
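
As a starting point for the first two outcomes, the sketch below computes a correlation matrix and fits a one-variable linear baseline with `np.polyfit`. The data are hypothetical toy values, not taken from the ESOL file, and the column names `weight`, `rings`, and `log_s` are shortened for the example. A single-descriptor line is only one possible baseline, not the intended final model.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: log_s falls roughly linearly with weight
toy = pd.DataFrame({
    'name': list('abcde'),
    'weight': [100.0, 150.0, 200.0, 250.0, 300.0],
    'rings': [0, 1, 1, 2, 3],
    'log_s': [-1.0, -2.1, -2.9, -4.2, -5.0],
})

# Correlations between numeric columns only
numeric = toy.select_dtypes(include='number')
corr = numeric.corr()

# Least-squares straight line: log_s ~ slope * weight + intercept
slope, intercept = np.polyfit(toy['weight'], toy['log_s'], deg=1)
predicted = slope * toy['weight'] + intercept

# Root-mean-square error of the fit on the same data
rmse = float(np.sqrt(((predicted - toy['log_s']) ** 2).mean()))

print(corr)
print(slope, intercept, rmse)
```

On the real dataset, the same error measure can be applied to the `ESOL predicted log solubility in mols per litre` column to see how a new model compares with the ESOL estimates.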