# Experiment: Which drugs are "similar", from prescription data?

In the USA, the social program that covers prescription drugs is the "Medicare Part D Prescription Drug" program. Data on the drugs prescribed, broken down by provider and drug, are available at https://data.cms.gov/provider-summary-by-type-of-service/medicare-part-d-prescribers/medicare-part-d-prescribers-by-provider-and-drug/

On this website you can select the columns you want. I downloaded only the prescriber's identifier (`Prscrbr_NPI`), the type of prescriber (`Prscrbr_Type`), the generic name of the drug (`Gnrc_Name`) and the total number of claims this prescriber made for that drug (`Tot_Clms`):

<center><img src='./image.png' width=50%></img></center>

As a result, I got a 197MB zip file. I made a folder named `data` and placed the file there. Let's load the data:

In [None]:
import zipfile
from pathlib import Path

import pandas as pd

DATA_DIR = Path.cwd() / 'data'
CSV_FILE_NAME = 'Medicare_Part_D_Prescribers_by_Provider_and_Drug_2022.csv'


def load_data(data_dir=DATA_DIR, csv_file_name=CSV_FILE_NAME):
    """Load the prescribers CSV, extracting it from the zip file if needed."""
    csv_file_path = data_dir / csv_file_name

    # Extract the archive only when the CSV is not already on disk.
    if not csv_file_path.is_file():
        print(f'{csv_file_path} not found, extracting zip file')
        zip_file_path = data_dir / 'Medicare_Part_D_Prescribers_by_Provider_and_Drug_2022.zip'
        if not zip_file_path.is_file():
            raise FileNotFoundError(f'File not found: {zip_file_path}')
        with zipfile.ZipFile(zip_file_path, 'r') as zip_ref:
            zip_ref.extractall(data_dir)

    print(f'Loading data from {csv_file_path}')
    df = pd.read_csv(csv_file_path)
    return df

In [None]:
df = load_data()

## Exploratory analysis

What does the data look like?

In [None]:
df.shape

- About 25 million rows, 4 columns

In [None]:
df.info()

- The columns are:

| Column Name  | Description                  | Dtype   | Type meaning |
|--------------|------------------------------|---------|--------------|
| Prscrbr_NPI  | Doctor's ID                  | int64   | an integer   |  
| Prscrbr_Type | Doctor type                  | object  | a string     | 
| Gnrc_Name    | Drug name                    | object  | a string     | 
| Tot_Clms     | Number of prescriptions made | int64   | an integer   | 

In [None]:
df.head(10)

How many missing values (NaNs) are there in the dataset?

In [None]:
df.isna().sum()

So there are only five missing values, all in the `Prscrbr_Type` column. Let's discard the corresponding rows:

In [None]:
df = df.dropna()

In [None]:
df.shape

The `Prscrbr_NPI`, `Prscrbr_Type`, and `Gnrc_Name` columns contain categorical values. How many categories are there in each of them?

In [None]:
df[['Prscrbr_NPI', 'Prscrbr_Type', 'Gnrc_Name']].nunique()

If we were to create a matrix of `Prscrbr_NPI` by `Gnrc_Name`, with `Tot_Clms` as cell values, that matrix would be of size $1057564 \times 1757$! How many values would a matrix of this size store?

In [None]:
num_positions = 1057564 * 1757
print(f'{num_positions:,}')

Almost two billion positions! And most of them are zeros, because we only have $25,869,516$ non-zero values.

In [None]:
num_nonzero_positions = df.shape[0]
print(f'{num_nonzero_positions:,}')

The occupancy ratio of this matrix is very, very small:

In [None]:
occupancy_ratio = num_nonzero_positions / num_positions
print(f'{occupancy_ratio:.2%}')

Only about $1.39\%$ of the matrix is non-zero. Such matrices are called *sparse matrices*. There are algorithmic techniques to handle sparse matrices efficiently, both in the amount of computation performed and in memory usage. We will make use of them in a bit.
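
To make the memory argument concrete, here is a minimal sketch on a toy matrix (the values are made up for illustration). It shows the three arrays that the CSR format stores in place of the full grid, and estimates the footprint of the full $1057564 \times 1757$ matrix if it were stored densely, assuming 8-byte floats:

```python
import numpy as np
from scipy.sparse import csr_matrix

# Toy claims matrix (made-up values): 4 prescribers x 5 drugs, mostly zeros.
dense = np.array([
    [12.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 30.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 15.0, 0.0, 0.0, 11.0],
])
sparse = csr_matrix(dense)

# CSR keeps only three arrays instead of the full grid.
print(sparse.data)     # the non-zero values, row by row
print(sparse.indices)  # the column index of each stored value
print(sparse.indptr)   # where each row starts inside `data`

# Rough dense footprint of the full matrix, assuming float64 (8 bytes each).
dense_bytes = 1057564 * 1757 * 8
print(f'{dense_bytes / 1e9:.1f} GB')
```

The dense estimate comes to roughly 15 GB, whereas a sparse layout only needs space proportional to the number of stored values.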

But we should also consider whether all of the categories in `Prscrbr_NPI` and `Gnrc_Name` are truly useful. Under-represented categories could perhaps be discarded, since we want to focus only on the most common ones.

Let's start with the physician type:

In [None]:
count_Prscrbr_Type = df['Prscrbr_Type'].value_counts()

In [None]:
count_Prscrbr_Type.head(10)

There is a steep decline in representation from the most common type of doctor ("Family Practice") to the 10th most common ("Ophthalmology").

In [None]:
count_Prscrbr_Type.tail(10)

The least represented types of doctor have only one entry each.

Let's choose a cutoff value to filter out the least common types. The choice is fairly arbitrary, and it is worth considering whether removing this information will qualitatively affect the conclusions drawn from this dataset. Keep this in mind, ok?

In [None]:
import numpy as np

In [None]:
count_Prscrbr_Type.quantile(np.linspace(0, 1, 21))

Listing the quantiles from $0\%$ to $100\%$ in steps of $5\%$, we see an almost exponential growth in the number of rows per physician type. Let's cut aggressively, then: the rare types account for so few rows that removing them will not change the number of data points substantially.

In [None]:
q_Prscrbr_Type = 0.70
cut_value = count_Prscrbr_Type.quantile(q_Prscrbr_Type)

useful_Prscrbr_Type = count_Prscrbr_Type[count_Prscrbr_Type >= cut_value].index

In [None]:
useful_Prscrbr_Type

In [None]:
df = df[df['Prscrbr_Type'].isin(useful_Prscrbr_Type)]

In [None]:
df.shape

We only reduced the number of data points from $25,869,516$ to $25,754,507$!
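
Since the cutoff is arbitrary, it can help to check how much data survives for several candidate quantiles before committing to one. The sketch below does this on toy data; `retained_fraction` is a hypothetical helper introduced here for illustration, not part of the notebook, and it applies the same `counts >= cut_value` rule used above:

```python
import pandas as pd


def retained_fraction(categories: pd.Series, q: float) -> float:
    """Fraction of rows kept after dropping categories whose row count
    falls below the q-quantile of all category counts."""
    counts = categories.value_counts()
    cut_value = counts.quantile(q)
    kept = counts[counts >= cut_value]
    return float(kept.sum() / counts.sum())


# Toy column with a heavily skewed distribution, similar in spirit to Prscrbr_Type.
toy = pd.Series(['A'] * 90 + ['B'] * 7 + ['C'] * 2 + ['D'] * 1)

for q in (0.25, 0.50, 0.75):
    print(f'q={q:.2f}: keeps {retained_fraction(toy, q):.0%} of the rows')
```

On the real data, the same helper could be called as `retained_fraction(df['Prscrbr_Type'], 0.70)` before any filtering is applied.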

Now let's look into the physicians:

In [None]:
count_Prscrbr_NPI = df['Prscrbr_NPI'].value_counts()

In [None]:
count_Prscrbr_NPI.quantile(np.linspace(0, 1, 21))

Again, the number of rows per physician grows almost exponentially across quantiles. Note that this counts rows in the dataset, not the number of prescriptions recorded in `Tot_Clms`. Let's cut aggressively again, but remember: this may affect the quality of the subsequent analysis, so it may be worth revisiting these arbitrary decisions later.

In [None]:
q_Prscrbr_NPI = 0.5
cut_value = count_Prscrbr_NPI.quantile(q_Prscrbr_NPI)

useful_Prscrbr_NPI = count_Prscrbr_NPI[count_Prscrbr_NPI >= cut_value].index

In [None]:
len(useful_Prscrbr_NPI)

In [None]:
df = df[df['Prscrbr_NPI'].isin(useful_Prscrbr_NPI)]

In [None]:
df.shape

We are still keeping a large portion of the information: $24,297,207$ rows, down from $25,754,507$.

Now let's investigate the drugs:

In [None]:
count_Gnrc_Name = df['Gnrc_Name'].value_counts()

In [None]:
count_Gnrc_Name.head(10)

In [None]:
count_Gnrc_Name.tail(10)

In [None]:
count_Gnrc_Name.quantile(np.linspace(0, 1, 21))

Again, the quantiles grow almost exponentially. Let's cut:

In [None]:
q_Gnrc_Name = 0.5
cut_value = count_Gnrc_Name.quantile(q_Gnrc_Name)

useful_Gnrc_Name = count_Gnrc_Name[count_Gnrc_Name >= cut_value].index

In [None]:
useful_Gnrc_Name

In [None]:
df = df[df['Gnrc_Name'].isin(useful_Gnrc_Name)]

In [None]:
df.shape

Finally, after all this filtering, we went from $25,869,521$ items to $24,246,835$.

In [None]:
print(f'We dropped {1.0 - 24246835/25869521:.2%} of the data')

In [None]:
df['Prscrbr_NPI'].nunique()

In [None]:
df['Gnrc_Name'].nunique()

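In [None]:
# Size of the filtered prescriber-by-drug matrix, computed from the data.
num_positions = df['Prscrbr_NPI'].nunique() * df['Gnrc_Name'].nunique()
print(f'{num_positions:,}')

In [None]:
df.shape

In [None]:
# Occupancy ratio of the filtered matrix.
occupancy_ratio = df.shape[0] / num_positions
print(f'{occupancy_ratio:.2%}')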

## Creating a sparse matrix from the dataframe

In [None]:
df = df.reset_index(drop=True)

In [None]:
df['Prscrbr_NPI'] = df['Prscrbr_NPI'].astype('category')
df['Gnrc_Name'] = df['Gnrc_Name'].astype('category')
df['Prscrbr_Type'] = df['Prscrbr_Type'].astype('category')
df['Tot_Clms'] = df['Tot_Clms'].astype(float)

In [None]:
df.info()

In [None]:
df['Gnrc_Name'].cat.categories

In [None]:
df.head()

In [None]:
df['Gnrc_Name'].cat.codes

In [None]:
df['Gnrc_Name'].cat.categories[19]

In [None]:
row_index = df['Prscrbr_NPI'].cat.codes
num_rows = df['Prscrbr_NPI'].cat.categories.size

col_index = df['Gnrc_Name'].cat.codes
num_cols = df['Gnrc_Name'].cat.categories.size

values = df['Tot_Clms'].values

In [None]:
from scipy.sparse import csr_matrix

sparse_matrix = csr_matrix(
    (values, (row_index, col_index)),
    shape=(num_rows, num_cols),
)

In [None]:
sparse_matrix.shape
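
To see how the category codes end up as matrix coordinates, here is a small sketch on a toy dataframe. The NPIs, drug names and claim counts are made up; only the column names match the dataset. Each row of the dataframe becomes one `(row, column, value)` triple, and `csr_matrix` sums the values of any repeated `(row, column)` pair:

```python
import pandas as pd
from scipy.sparse import csr_matrix

# Toy data (made-up values) with the same column names as the real dataframe.
toy = pd.DataFrame({
    'Prscrbr_NPI': [111, 111, 222, 333],
    'Gnrc_Name': ['Atorvastatin', 'Metformin', 'Metformin', 'Lisinopril'],
    'Tot_Clms': [12.0, 30.0, 15.0, 11.0],
})
toy['Prscrbr_NPI'] = toy['Prscrbr_NPI'].astype('category')
toy['Gnrc_Name'] = toy['Gnrc_Name'].astype('category')

# Integer codes index into the sorted list of categories.
rows = toy['Prscrbr_NPI'].cat.codes.to_numpy()
cols = toy['Gnrc_Name'].cat.codes.to_numpy()

toy_matrix = csr_matrix(
    (toy['Tot_Clms'].to_numpy(), (rows, cols)),
    shape=(
        toy['Prscrbr_NPI'].cat.categories.size,
        toy['Gnrc_Name'].cat.categories.size,
    ),
)

print(toy['Gnrc_Name'].cat.categories.tolist())  # column order of the matrix
print(toy_matrix.toarray())                      # dense view, for inspection only
```

Calling `.toarray()` is only reasonable on a toy example like this one; on the full matrix it would allocate the dense array that the sparse format is meant to avoid.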

In [None]:
from scipy.sparse.linalg import svds

In [None]:
U, sigma, Vt = svds(sparse_matrix, k=20)

In [None]:
U.shape, sigma.shape, Vt.shape
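
The rows of `Vt.T` give one vector of `k` features per drug. As a hedged illustration of how such features might be compared, the sketch below runs `svds` on a made-up toy matrix and computes cosine similarities between drugs. Cosine similarity is an assumption chosen for this example, not necessarily the comparison intended for the exported files. The singular values are sorted explicitly, because the order returned by `svds` should not be assumed to match `np.linalg.svd` (largest first):

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.linalg import svds

# Toy prescriber x drug matrix (made-up values): drugs 0 and 1 tend to be
# prescribed by the same prescribers, and so do drugs 2 and 3.
toy = csr_matrix(np.array([
    [10.0, 8.0, 0.0, 0.0],
    [5.0, 6.0, 0.0, 0.0],
    [0.0, 0.0, 7.0, 9.0],
    [0.0, 0.0, 4.0, 3.0],
    [1.0, 0.0, 6.0, 5.0],
]))

U, sigma, Vt = svds(toy, k=2)

# Reorder the components so that the largest singular value comes first.
order = np.argsort(sigma)[::-1]
sigma, U, Vt = sigma[order], U[:, order], Vt[order]

# One row of k features per drug, normalized to unit length.
features = Vt.T
unit = features / np.linalg.norm(features, axis=1, keepdims=True)

# Cosine similarity between every pair of drugs.
cosine = unit @ unit.T
print(np.round(cosine, 2))
```

In this toy case, drugs that share prescribers should end up with a cosine similarity close to 1, while drugs from different groups should score much lower.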

In [None]:
np.savetxt(DATA_DIR / 'features.tsv', Vt.T, delimiter='\t')

In [None]:
np.savetxt(
    DATA_DIR / 'labels.tsv',
    df['Gnrc_Name'].cat.categories.values,
    delimiter='\t',
    fmt='%s',
    encoding='utf-8',
)