# DDInter Data Processing

This Jupyter notebook processes drug-drug interaction data from the DDInter database. The dataset, `ddinter_downloads_code_A.csv`, is sourced from [DDInter](https://ddinter.scbdd.com/static/media/download/ddinter_downloads_code_A.csv).

## Description

The notebook performs the following tasks:

1. Loads the dataset using pandas
2. Explores the data structure and content
3. Identifies and removes rows whose interaction `Level` is 'Unknown'
4. Extracts the unique drug names from both the `Drug_A` and `Drug_B` columns into a single list
5. Queries PubChem (through `pubchempy`) for the compound ID (CID) values associated with each drug name
6. Creates a DataFrame that stores every drug name with its corresponding CID values
7. Saves this DataFrame as a CSV file
8. Creates a separate DataFrame containing only the CID values
9. Saves the CID-only DataFrame as a CSV file for querying on the PubChem website (https://pubchem.ncbi.nlm.nih.gov/)
10. Prepares the data for further analysis and machine learning modeling

Removing the 'Unknown' interactions reduces the dataset from 56,367 to 41,600 rows and leaves only interactions with a defined level (Minor, Moderate, or Major), which matters when those levels are used as labels for a machine learning model. Saving the drug names with their CIDs in one CSV file and the CIDs alone in another makes it easier to query and analyze the compounds involved, both locally and on the PubChem website.


In [16]:
import pubchempy as pcp
import pandas as pd


In [None]:
# Load the CSV file into a DataFrame
file_path = 'data/ddinter_downloads_code_A.csv'
df = pd.read_csv(file_path)


In [42]:
# Print basic info about the dataset
print(f"DataFrame shape: {df.shape}")


DataFrame shape: (56367, 5)


In [40]:
print(df.head())

   DDInterID_A              Drug_A  DDInterID_B        Drug_B     Level
0  DDInter1263          Naltrexone     DDInter1      Abacavir  Moderate
1     DDInter1            Abacavir  DDInter1348      Orlistat  Moderate
2    DDInter58  Aluminum hydroxide   DDInter582  Dolutegravir     Major
3   DDInter112          Aprepitant   DDInter582  Dolutegravir     Minor
4   DDInter138         Attapulgite   DDInter582  Dolutegravir     Major


In [41]:
# Levels of interaction
unique_levels = df.Level.unique()
unique_levels

array(['Moderate', 'Major', 'Minor', 'Unknown'], dtype=object)

In [43]:
# Drop the interactions whose level is unknown
df = df.drop(df[df.Level == 'Unknown'].index)

In [45]:
# Check the resulting shape of the data frame
print(f"DataFrame shape: {df.shape}")

DataFrame shape: (41600, 5)
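
Before dropping rows, it can help to see how many interactions fall under each level. The sketch below is not part of the original notebook: it uses a small hypothetical DataFrame (`toy`, with made-up drug names) to show `value_counts` and a boolean-mask filter that, for a default integer index, keeps the same rows as the `drop(...index)` call above. On the real data the same calls would apply to `df`.

```python
import pandas as pd

# Hypothetical toy data with the same Level column as the DDInter file
toy = pd.DataFrame({
    "Drug_A": ["A", "B", "C", "D"],
    "Drug_B": ["B", "C", "D", "E"],
    "Level": ["Moderate", "Unknown", "Major", "Unknown"],
})

# Count rows per interaction level
level_counts = toy["Level"].value_counts()

# Keep only rows with a known level
known = toy[toy["Level"] != "Unknown"]

print(level_counts.to_dict())
print(known.shape)
```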


In [None]:
# Combine unique drug names
unique_drugs = pd.concat([df['Drug_A'], df['Drug_B']]).unique()
print(f"Unique drug names extracted: {len(unique_drugs)}")

Unique drug names extracted: 1757


In [None]:
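# Fetch the PubChem CIDs associated with a drug name
def fetch_cids(drug_name):
    """Return a list of integer CIDs, [] if none are found, or ["Error: ..."] on failure."""
    try:
        # Searching the 'substance' domain by name can return several CIDs per drug
        cids = pcp.get_cids(drug_name, 'name', 'substance', list_return='flat')
        return list(map(int, cids)) if cids else []
    except Exception as e:
        # Error strings are coerced to <NA> later, when the CIDs column is made numeric
        return [f"Error: {e}"]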

# Fetch CIDs for every unique drug name (one PubChem request per name)
drug_cids = {drug: fetch_cids(drug) for drug in unique_drugs}
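
Querying 1,757 names one by one can run into transient network errors or request throttling on the PubChem side. The following is a minimal sketch, not part of the original notebook, of a retry wrapper that pauses between attempts. The names `fetch_with_retry`, `lookup`, `retries`, and `delay` are hypothetical; `lookup` is any callable that maps a drug name to a list of CIDs, so the wrapper can be exercised without network access.

```python
import time


def fetch_with_retry(drug_name, lookup, retries=3, delay=0.5):
    """Call lookup(drug_name), retrying on any exception.

    Returns a list of integer CIDs, or an empty list if every attempt fails.
    """
    for attempt in range(retries):
        try:
            cids = lookup(drug_name)
            return [int(c) for c in cids] if cids else []
        except Exception:
            # Wait a little longer after each failed attempt
            time.sleep(delay * (attempt + 1))
    return []


# Stand-in lookup for demonstration: fails once, then succeeds
calls = {"n": 0}


def flaky_lookup(name):
    calls["n"] += 1
    if calls["n"] == 1:
        raise ConnectionError("temporary failure")
    return ["4428"]


result = fetch_with_retry("Naltrexone", flaky_lookup, delay=0.01)
print(result)
```

In the notebook, `lookup` could be `lambda name: pcp.get_cids(name, 'name', 'substance', list_return='flat')`, which matches the call used in `fetch_cids`. Returning an empty list on failure, rather than an error string, is a design choice of this sketch that keeps the CID values purely numeric.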



In [29]:
# Convert results to DataFrame
drug_cids_df = pd.DataFrame(list(drug_cids.items()), columns=["Drug Name", "CIDs"]).explode("CIDs")
drug_cids_df["CIDs"] = pd.to_numeric(drug_cids_df["CIDs"], errors="coerce").astype("Int64")  # Ensure numeric CIDs
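
To illustrate what `explode` and `errors="coerce"` do in this step, here is a small sketch on hypothetical lookup results (`toy_cids`; the names `NoMatchDrug` and `FailedDrug` are invented for the example). Each list element becomes its own row, repeating the original index, while empty lists and error strings end up as missing values.

```python
import pandas as pd

# Hypothetical lookup results: several CIDs, no match, and an error string
toy_cids = {
    "Naltrexone": [4428, 5360515],
    "NoMatchDrug": [],
    "FailedDrug": ["Error: timeout"],
}

toy_df = pd.DataFrame(list(toy_cids.items()), columns=["Drug Name", "CIDs"])
toy_df = toy_df.explode("CIDs")  # one row per (drug, CID) pair

# Non-numeric entries (error strings) and empty lists become <NA>
toy_df["CIDs"] = pd.to_numeric(toy_df["CIDs"], errors="coerce").astype("Int64")
print(toy_df)
```

This is also why the index value 0 repeats in the output of `drug_cids_df.head()`: every CID found for the first drug keeps that drug's original row index.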



In [30]:
drug_cids_df.head()

    Drug Name     CIDs
0  Naltrexone     4428
0  Naltrexone  5360515
0  Naltrexone  5702239
0  Naltrexone  6321302
0  Naltrexone  6604527


In [31]:
drug_cids_df.shape

(11783, 2)

In [32]:
# Save intermediate CID results
drug_cids_df.to_csv("data/drug_cids.csv", index=False)
print("CIDs saved to 'data/drug_cids.csv'")


CIDs saved to 'data/drug_cids.csv'


In [39]:
# Keep only the CID values (dropping missing entries) for upload to PubChem
cids = drug_cids_df["CIDs"].dropna()
cids.to_csv("data/cids.csv", index=False, header=False)
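
Different drug names may resolve to the same compound, so the CID column can contain repeated values. If the list sent to PubChem should contain each identifier only once, a sketch like the following could be used. It is an optional variation, not the notebook's own step, and the toy Series is hypothetical; `to_csv` is called without a path here so that it returns the CSV text instead of writing a file.

```python
import pandas as pd

# Hypothetical CID column with a missing value and a repeated CID
toy = pd.Series([4428, 5360515, pd.NA, 4428], dtype="Int64", name="CIDs")

# Drop missing values, then keep the first occurrence of each CID
unique_cids = toy.dropna().drop_duplicates()

# One CID per line, no header, as in data/cids.csv
csv_text = unique_cids.to_csv(index=False, header=False)
print(csv_text)
```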