## ML-Project 1
## <span style="color:maroon"> **Predicting Drug-like Ligands Using Molecular Descriptors and Atom Encodings** </span>

This project helps identify potential drugs by predicting which molecules are drug-like. Using molecular descriptors and machine learning, thousands of molecules can be screened without testing each one in the lab, which saves time and money. The feature importance plot also shows which chemical properties matter most, which can guide the design of better and more effective drugs.

## What is Drug-Likeness?

> Drug-likeness describes whether a molecule has properties that make it suitable as a drug. Drugs should be absorbed well by the body and be safe and effective.
> Lipinski’s Rule of Five is used to check whether a molecule has drug-like qualities.
> In this project, a machine learning model learned patterns from molecules labelled as drugs and non-drugs, then classified new molecules based on those patterns.


## Workflow
1. **Read molecule data** from .sdf files that contain information about drug-like and non-drug-like molecules.
2. **Extracted molecular descriptors** to understand the molecule's chemical behavior.
3. **Trained a machine learning model** to predict if a molecule is drug-like or not.
4. **Evaluated the model** to see how well it works and saved the results for analysis.
5. **Created visualizations** to understand the model’s performance and most important properties.


### 1. Labelling 
- Created a table with the file paths, molecule names, and labels.
- The labels (1 or 0) are the answers the model learns to predict.

### 2. Molecular Descriptor Properties
  - **Molecular Weight (Mol_Weight)**: How heavy the molecule is.
  - **LogP**: How well the molecule dissolves in fat (important for absorption in the body).
  - **HBD (Hydrogen Bond Donors)**: How many parts of the molecule can donate hydrogen bonds (affects how it binds to proteins).
  - **HBA (Hydrogen Bond Acceptors)**: How many parts can accept hydrogen bonds.
  - **TPSA (Topological Polar Surface Area)**: How polar (water-attracting) the molecule is.
  - **Rotatable Bonds (RB)**: How flexible the molecule is.
  - **Aromatic Rings**: How many stable ring structures (like benzene) it has.
  - **Rings**: Total number of ring structures.
  - **Fsp3**: Fraction of carbon atoms with sp3 hybridization (affects 3D shape).
  - **Heteroatoms**: Non-carbon, non-hydrogen atoms (like oxygen or nitrogen).
  - **BertzCT**: How complex the molecule’s structure is.
  - **Heavy Atoms**: Number of non-hydrogen atoms.
- These properties describe a molecule’s size, shape, solubility, and chemical behavior, which determine whether it can work as a drug. Drugs need to be small enough to be absorbed and enter cells, and very large molecules are usually absorbed poorly.
- **Biological reasoning**: These properties affect how a molecule moves through the body, binds to targets (like proteins), and avoids toxicity. LogP indicates whether a molecule can pass through cell membranes, and TPSA indicates how polar it is and how well it will dissolve in water.
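
The descriptors listed above could be computed along the following lines. This is a minimal sketch that assumes RDKit as the toolkit (the text does not name the library used), and the dictionary keys are illustrative where the text does not give a column name. A SMILES string is used here only to keep the example self-contained; for the project's .sdf files, `Chem.SDMolSupplier` would supply the molecules instead.

```python
from rdkit import Chem
from rdkit.Chem import Descriptors, Lipinski, rdMolDescriptors


def compute_descriptors(mol):
    """Return the twelve descriptors described above for one RDKit molecule."""
    return {
        "Mol_Weight": Descriptors.MolWt(mol),
        "LogP": Descriptors.MolLogP(mol),
        "HBD": Lipinski.NumHDonors(mol),
        "HBA": Lipinski.NumHAcceptors(mol),
        "TPSA": rdMolDescriptors.CalcTPSA(mol),
        "RB": Lipinski.NumRotatableBonds(mol),
        "Aromatic": rdMolDescriptors.CalcNumAromaticRings(mol),
        "Rings": rdMolDescriptors.CalcNumRings(mol),
        "Fsp3": rdMolDescriptors.CalcFractionCSP3(mol),
        "Heteroatoms": rdMolDescriptors.CalcNumHeteroatoms(mol),
        "BertzCT": Descriptors.BertzCT(mol),
        "Heavy_Atoms": mol.GetNumHeavyAtoms(),
    }


# Ethanol (CCO) as a tiny example molecule, not part of the project's dataset.
ethanol = Chem.MolFromSmiles("CCO")
ethanol_descriptors = compute_descriptors(ethanol)
```

Each molecule becomes one row of numbers, which is the form a machine learning model needs as input.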


### 3. Checking Lipinski’s Rule of Five
- I counted how many of Lipinski’s rules each molecule violates. The rules are:
  - Molecular weight no more than 500.
  - LogP (fat solubility) no more than 5.
  - Hydrogen bond donors no more than 5.
  - Hydrogen bond acceptors no more than 10.
- If a molecule violates too many rules (more than 2), it is treated as unlikely to be a drug.
- For example, too many hydrogen bond donors make the molecule bind strongly to water, which makes it harder for it to cross cell membranes and enter cells.
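
The violation count can be sketched as a small function. This is an illustration rather than the project's exact code: it assumes a limit is violated only when the value exceeds it, and the function name and example values are hypothetical.

```python
def count_lipinski_violations(mol_weight, logp, hbd, hba):
    """Count how many of the four Rule of Five limits a molecule exceeds."""
    checks = [
        mol_weight > 500,  # too heavy
        logp > 5,          # too fat-soluble
        hbd > 5,           # too many hydrogen bond donors
        hba > 10,          # too many hydrogen bond acceptors
    ]
    return sum(checks)


# Illustrative values only, not taken from the project's dataset.
small_molecule = count_lipinski_violations(mol_weight=206.3, logp=3.1, hbd=1, hba=2)
large_molecule = count_lipinski_violations(mol_weight=780.9, logp=1.2, hbd=6, hba=14)
```

Here `small_molecule` has 0 violations, while `large_molecule` has 3 (weight, donors, and acceptors), which is above the cutoff of 2 used in this project.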

### 4. Preparing Data for Machine Learning
- Split the data into features and labels, and removed any rows with missing or invalid data.
- The model needs clean data to learn patterns. Features are the inputs, and labels are the outputs to be predicted.

### 5. Performing 5-Fold Cross-Validation
- Split the data into 5 parts, trained on 4 parts, and tested on the remaining part. This was repeated 5 times, with a different part used for testing each time.
- This checks whether the model works well on unseen data, which helps detect overfitting.
- The average accuracy across the 5 tests gives an estimate of the model’s reliability.
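
The splitting scheme described above can be illustrated in plain Python. This sketch only shows how the folds are formed. It does not shuffle or stratify the data, which a real run would normally do, and the function name is hypothetical.

```python
def k_fold_indices(n_samples, k=5):
    """Yield (train_idx, test_idx) pairs so that each sample is tested exactly once."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so fold sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size


# Ten samples split into five folds of two test samples each.
folds = list(k_fold_indices(n_samples=10, k=5))
```

Every sample appears in a test fold once and in a training fold four times, so the averaged score uses all of the data for testing without ever testing on a sample the model was trained on in that round.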

### 6. Training the Random Forest Model
- A Random Forest Classifier combines many decision trees to make predictions.
- It is good at finding patterns in complex data, and it is less likely to overfit than a single decision tree.
- Molecular properties interact in complex ways, and this model handles those interactions well.
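
A minimal training sketch, assuming scikit-learn's `RandomForestClassifier` (the text does not name the library). The data is synthetic and stands in for the descriptor table; the label is built from an interaction between two features to mirror the point above. The split ratio, tree count, and seeds are assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in for the descriptor table: 100 "molecules", 4 features.
X = rng.normal(size=(100, 4))
# The label depends on an interaction between features 0 and 1.
y = ((X[:, 0] * X[:, 1]) > 0).astype(int)

# Hold out 20% for testing, keeping the class ratio the same in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

predictions = model.predict(X_test)        # one 0/1 label per test row
importances = model.feature_importances_   # one value per feature, summing to 1
```

The `importances` array is what a feature importance bar plot would display, with one bar per descriptor.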



### 7. Making Predictions
- The model predicted if test molecules were drug-like or non-drug-like and saved the results to “predictions.csv”.

### 8. Evaluating the Model
- Several metrics were used to check the model’s performance. 
  - **Confusion Matrix**: Table showing correct and incorrect predictions (True Positives, True Negatives, False Positives, False Negatives).
  - **Classification Report**: Includes precision (how many predicted drugs are actually drugs), recall (how many actual drugs were predicted correctly), and F1-score (a balance of precision and recall).
  - **Accuracy**: Percentage of correct predictions.
  - **Mean Absolute Error (MAE)**: Average difference between predicted and actual labels. For 0/1 labels this equals the fraction of wrong predictions.
  - **Mean Squared Error (MSE)**: Similar to MAE but squares the errors. For 0/1 labels it gives the same value as MAE, so it is less useful for classification.
- These metrics tell me how well the model works. Accuracy shows overall correctness, precision checks that non-drugs have not been mislabeled as drugs, and recall checks that actual drugs have not been missed. MAE and MSE summarize the error rate.

- **Biological reasoning**: Accurate predictions reduce the cost of testing bad molecules and they increase the chance of finding effective drugs.
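
The metrics above can be computed by hand for a small example. This sketch assumes label 1 means drug-like and 0 means non-drug-like; the function name and the eight example labels are made up for illustration.

```python
def binary_metrics(actual, predicted):
    """Compute confusion-matrix counts and the derived metrics for 0/1 labels."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == 1 and p == 1 for a, p in pairs)  # drugs predicted as drugs
    tn = sum(a == 0 and p == 0 for a, p in pairs)  # non-drugs predicted as non-drugs
    fp = sum(a == 0 and p == 1 for a, p in pairs)  # non-drugs predicted as drugs
    fn = sum(a == 1 and p == 0 for a, p in pairs)  # drugs predicted as non-drugs
    n = len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "confusion": {"tp": tp, "tn": tn, "fp": fp, "fn": fn},
        "accuracy": (tp + tn) / n,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "mae": sum(abs(a - p) for a, p in pairs) / n,
        "mse": sum((a - p) ** 2 for a, p in pairs) / n,
    }


# Eight made-up test molecules with two mistakes (one missed drug, one false alarm).
actual = [1, 1, 1, 1, 0, 0, 0, 1]
predicted = [1, 1, 0, 1, 0, 1, 0, 1]
metrics = binary_metrics(actual, predicted)
```

With two errors out of eight, accuracy is 0.75 and both MAE and MSE are 0.25, which shows that for 0/1 labels they carry the same information as the error rate.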

### 9. Visualizing Results
- I created two graphs:
  - **Confusion Matrix Heatmap** (saved as “confusion_matrix.png”): A colorful grid showing correct and incorrect predictions.
  - **Feature Importance Bar Plot** (saved as “feature_importance.png”): A bar graph showing which molecular descriptors matter most for predictions.
- The heatmap makes it easy to see how many predictions were correct or wrong. The bar plot shows which properties are most important in deciding whether a molecule is drug-like.

## Output Files and Their Contents

1. **finaldataset.csv**:
   - Contains the full dataset with molecule names and all related information. 
   - **Purpose**: Saves all the data used for training and analysis so it can be reused easily.

2. **predictions.csv**:
   - Contains predictions for the test set, including: file paths, molecule names, actual labels, and predicted labels.
   - **Purpose**: Shows how well the model predicted on new data.

3. **confusion_matrix.png**:
   - A heatmap showing the number of correct and incorrect predictions.
   - **Analysis**: If the heatmap shows high numbers in the top-left and bottom-right cells, the model is accurate. Small numbers in the off-diagonal cells mean fewer errors.

4. **feature_importance.png**:
   - A bar plot showing which molecular descriptors are most important for predictions.
   - **Analysis**: Taller bars indicate more important features. This helps focus on optimizing these properties in drug design.

## Metrics details

- **Confusion Matrix Heatmap**: On the test set, the model identified both drugs and non-drugs correctly.

- **Feature Importance Bar Plot**: If LogP or TPSA have high importance, the model relies on fat solubility or polarity to decide drug-likeness. These properties affect how a molecule moves through the body and interacts with cells.

- **Classification Report**: 
  - **Precision**: How many molecules I predicted as drugs were actually drugs. High precision means fewer false positives, which is important to avoid wasting time on bad molecules.
  - **Recall**: How many actual drugs I correctly predicted. High recall means I’m not missing good drugs.
  - **F1-score**: A balance between precision and recall, useful when both are important.
  - **Accuracy**: The overall percentage of correct predictions. An accuracy of 1 (100%) should be checked against the class balance of the data (see below).

- **Accuracy of 1 and Unbalanced Data**:
  - The model’s accuracy is 1 (100%) because it correctly predicted all test molecules (as seen in “predictions.csv”). A score this high can be misleading when the data is unbalanced.
  - Unbalanced data means there are many more drug-like molecules than non-drug-like molecules. A model could then predict “drug” almost every time and still get high accuracy without learning properly.
  - The dataset should have a similar number of drug-like and non-drug-like molecules (50% each or close). This helps the model learn patterns for both groups equally.
  - To fix unbalanced data next time, I can collect more non-drug-like molecules or use techniques like oversampling (adding copies of non-drugs) or undersampling (removing some drugs).
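
Random oversampling can be sketched in plain Python as follows. The function name and the placeholder rows are hypothetical; the class counts (41 and 20) are the ones reported for this dataset.

```python
import random


def oversample_minority(rows, label_of, seed=0):
    """Duplicate randomly chosen rows of smaller classes until all classes match the largest."""
    rng = random.Random(seed)
    by_label = {}
    for row in rows:
        by_label.setdefault(label_of(row), []).append(row)
    target = max(len(group) for group in by_label.values())
    balanced = []
    for group in by_label.values():
        balanced.extend(group)
        # Draw random duplicates until this class is as large as the biggest one.
        balanced.extend(rng.choice(group) for _ in range(target - len(group)))
    return balanced


# Placeholder rows of (name, label): 41 drug-like (1) and 20 non-drug-like (0).
rows = [("drug", 1)] * 41 + [("non_drug", 0)] * 20
balanced_rows = oversample_minority(rows, label_of=lambda row: row[1])
```

Oversampling should be applied only to the training portion. If duplicated molecules end up in the test set, the reported accuracy becomes overly optimistic.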



# Analysis and Conclusion of My Drug-Likeness Prediction Project
## Analysis of the Results

### 1. Dataset Overview 
The dataset contains 61 molecules with their chemical properties (41 drug-like and 20 non-drug-like).
- **Molecular Weight**: Drug-like molecules usually have smaller weights than some non-drug-like molecules.
- **LogP**: Drug-like molecules have balanced solubility, while non-drug-like molecules like decane are very fat-soluble.
- **Lipinski Violations**: Most drug-like molecules have 0–2 violations, while non-drug-like molecules like insulin (3) or digoxin (3) have more, which makes them less drug-like.
- **TPSA**: Drug-like molecules have moderate TPSA, while non-drug-like molecules like insulin have extreme values.
- **Fsp3**: Non-drug-like molecules like alkanes have high Fsp3 (simple structures), while drug-like molecules like ibuprofen have more complex shapes.
- **Data Imbalance**: There are 41 drug-like and 20 non-drug-like molecules (about a 2:1 ratio), which can make the model favor drug-like predictions.
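
To put the imbalance in numbers, here is the accuracy that a model would reach on the full dataset by always predicting the larger class, using the counts above. The variable names are illustrative.

```python
n_drug_like = 41
n_non_drug_like = 20

# Accuracy of a "model" that always predicts the majority class (drug-like).
baseline_accuracy = max(n_drug_like, n_non_drug_like) / (n_drug_like + n_non_drug_like)
```

This gives about 0.67, so any reported accuracy should be compared against 67% rather than against 50%.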

### 2. Model Predictions 
The predictions file covers 14 test molecules. The model got all 14 correct, giving 100% accuracy.

- **Perfect Accuracy**: The model got every prediction right, but this result should be read with caution:

  - The test set is small (14 molecules), so it does not show the model’s true ability.
  - The dataset is imbalanced.
  - The model might have memorized the data instead of learning general patterns, which is called overfitting.

Correct predictions are good for drug discovery, but perfect accuracy on such a small test set suggests that the model might not work as well on new and different molecules.
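
A rough calculation shows how little 14 correct predictions can prove. If the model's true accuracy is p and the test molecules are independent (an assumption), the chance of getting all 14 right is p to the power 14. The sketch below finds the smallest p for which that outcome still has at least a 5% chance.

```python
n_test = 14
confidence = 0.95

# P(all n_test correct) = p ** n_test. Solve p ** n_test = 1 - confidence for p.
lower_bound = (1 - confidence) ** (1 / n_test)
```

The result is about 0.81, so a perfect score on 14 molecules is still consistent with a true accuracy of roughly 81%.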

### 3. Confusion Matrix  

- Since the model reached 100% accuracy, the matrix shows:
  - High numbers on the diagonal (correct predictions for drugs and non-drugs).
  - Zero errors (no false positives or false negatives).
- The heatmap’s dark blue diagonal shows that all predictions are correct.

### 4. Feature Importance 
The feature importance bar plot shows which chemical properties the model used most to make predictions.

**What I Found**:
- Important properties:
  - **LogP**: How fat-soluble a molecule is, critical for passing through cell membranes.
  - **TPSA**: How polar a molecule is, affecting how it dissolves in the body.
  - **Molecular Weight**: How big the molecule is, impacting absorption.
  - **Lipinski_Violations**: How many drug-likeness rules the molecule breaks.
- Less important properties can be **Aromatic** or **Rings**, which vary less between drugs and non-drugs.

- LogP and TPSA are key because they determine how a molecule moves through the body.

### 5. Evaluation Metrics 
- **Accuracy**: 1.0 (100%) as all predictions are correct.
- **Classification Report**:
  - **Precision**: 1.0 (all predicted drugs are drugs, all predicted non-drugs are non-drugs).
  - **Recall**: 1.0 (all actual drugs and non-drugs were predicted correctly).
  - **F1-score**: 1.0 (balance of precision and recall).
- **Mean Absolute Error (MAE)**: 0 (no errors).
- **Mean Squared Error (MSE)**: 0 (no errors).


