# D212 - Data Mining II - Performance Assessment Task 3
## Joshua T. Funderburk

#### Programming Environment

In [1]:
from platform import python_version
print(f"Python version: {python_version()}")

Python version: 3.12.8


#### Steps required prior to completing rubric requirements

In [2]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

from mlxtend.frequent_patterns import association_rules, apriori
from mlxtend.preprocessing import TransactionEncoder

In [3]:
# Read the CSV and load the data into a pandas DataFrame
df = pd.read_csv(r'C:\Users\funde\Desktop\WGU\D212\Task 3\medical_market_basket.csv')

# Print top 5 rows of the Dataframe
df.head()

Unnamed: 0,Presc01,Presc02,Presc03,Presc04,Presc05,Presc06,Presc07,Presc08,Presc09,Presc10,Presc11,Presc12,Presc13,Presc14,Presc15,Presc16,Presc17,Presc18,Presc19,Presc20
0,,,,,,,,,,,,,,,,,,,,
1,amlodipine,albuterol aerosol,allopurinol,pantoprazole,lorazepam,omeprazole,mometasone,fluconozole,gabapentin,pravastatin,cialis,losartan,metoprolol succinate XL,sulfamethoxazole,abilify,spironolactone,albuterol HFA,levofloxacin,promethazine,glipizide
2,,,,,,,,,,,,,,,,,,,,
3,citalopram,benicar,amphetamine salt combo xr,,,,,,,,,,,,,,,,,
4,,,,,,,,,,,,,,,,,,,,


In [4]:
df.shape

(15002, 20)

# Part I: Research Question

## A1: Proposal of Question

The question proposed for this analysis is "What medications are most frequently prescribed alongside Abilify, and how strong are these prescription associations?"

## A2: Defined Goal

The goal of this analysis is to identify and analyze medication combinations that have a statistically significant association with Abilify prescriptions to understand common prescribing patterns. Understanding co-prescription patterns is crucial to patient care and improving patient health outcomes. In a hospital system, this understanding can potentially help develop treatment protocols for patients on Abilify. Abilify is especially important to understand as it occurs far more frequently than any other prescription in this data set. From a clinical perspective, this understanding could support decision-making by providing insights about common prescription combinations, including medication associations that may warrant clinical investigation.

# Part II: Market Basket Justification

## B1: Explanation of Market Basket

Market Basket Analysis (MBA) is a data mining technique that discovers patterns in how items are purchased or used together (GeeksforGeeks, 2024). Through analyzing transaction data, MBA identifies relationships between items using association rules - if customers buy one item (the antecedent), what else are they likely to buy (the consequent)? Businesses use these insights to optimize product placement, design promotions, and improve their sales strategies.

Market Basket Analysis examines prescription data by transforming individual patient records into a binary matrix, where each row represents a transaction and columns indicate medication presence (True) or absence (False). This transformation is accomplished using Python's mlxtend.preprocessing.TransactionEncoder and then stored in a pandas DataFrame. The Apriori algorithm from mlxtend.frequent_patterns then identifies frequent medication combinations. Following this, mlxtend.frequent_patterns.association_rules generates association rules from these frequent patterns, which are sorted using metrics like lift, support, and confidence to evaluate relationship strength.
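
The encoding step can be illustrated on a toy example. The sketch below is a minimal pure-Python stand-in for the kind of binary matrix TransactionEncoder produces; it is not the mlxtend implementation, and the three transactions and the helper name `one_hot_encode` are hypothetical.

```python
def one_hot_encode(transactions):
    """Return (columns, matrix): sorted unique items and one boolean row per transaction."""
    columns = sorted({item for basket in transactions for item in basket})
    matrix = [[item in basket for item in columns] for basket in transactions]
    return columns, matrix

# Hypothetical toy transactions (medication names borrowed from the data set)
toy_transactions = [
    ["abilify", "metformin"],
    ["lisinopril", "carvedilol"],
    ["abilify", "lisinopril", "metformin"],
]

columns, matrix = one_hot_encode(toy_transactions)
print(columns)
for row in matrix:
    print(row)
```

Each column corresponds to one medication and each row to one transaction, which is the shape the Apriori step expects.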

For this medical dataset, the analysis focuses specifically on prescriptions containing Abilify. By filtering association rules where Abilify appears as either an antecedent or consequent, the analysis reveals common prescription patterns and their statistical significance. The expected outcomes include discovering which medications are most commonly prescribed alongside Abilify, understanding the strength of these prescription patterns through support, confidence, and lift metrics, and identifying clinically relevant medication combinations. Healthcare providers can use these insights to develop treatment protocols and support clinical decision-making. Additionally, the findings have practical applications in pharmacy inventory management and can inform manufacturer rebate programs within legal boundaries.

## B2: Transaction Example

The following code identifies the first transaction listed in the dataset:

In [5]:
# Show an example of a transaction in the dataset
df.iloc[1]

Presc01                 amlodipine
Presc02          albuterol aerosol
Presc03                allopurinol
Presc04               pantoprazole
Presc05                  lorazepam
Presc06                 omeprazole
Presc07                 mometasone
Presc08                fluconozole
Presc09                 gabapentin
Presc10                pravastatin
Presc11                     cialis
Presc12                   losartan
Presc13    metoprolol succinate XL
Presc14           sulfamethoxazole
Presc15                    abilify
Presc16             spironolactone
Presc17              albuterol HFA
Presc18               levofloxacin
Presc19               promethazine
Presc20                  glipizide
Name: 1, dtype: object

## B3: Market Basket Assumption

One key assumption of Market Basket Analysis is that the data set is a large, representative sample of overall customer transactions (Deniran, 2023). Without a representative sample, it would be difficult to draw any meaningful insights, and with an insufficient sample size any insights drawn could also be incorrect. To ensure that a sample is representative, the analyst should confirm:
1. The sample covers different time periods to account for seasonality
2. The samples come from different customer populations to capture diverse shopping behaviors
3. The data contains complete transaction information including item ids, store locations, and timestamps

This is particularly crucial for larger retail chains where transaction data can be segmented across multiple dimensions (geographic regions, store formats, demographic areas). Ensuring representative sampling across all these dimensions is essential for gathering meaningful insights.

# Part III: Data Preparation and Analysis

## C1: Transforming the Data Set

To transform the data set, the following steps are taken:
1. Remove blank rows from the dataset
2. Store the data in a list of lists
3. Use TransactionEncoder() to learn unique items and transform the data into a binary matrix
4. Convert the binary matrix into a DataFrame

Following data transformation, the DataFrame is written to a CSV file for task submission.

In [6]:
# Remove blank rows from the data set and renumber the index
# (chaining avoids modifying a slice of df in place)
bask = df[df['Presc01'].notna()].reset_index(drop=True)
bask

Unnamed: 0,Presc01,Presc02,Presc03,Presc04,Presc05,Presc06,Presc07,Presc08,Presc09,Presc10,Presc11,Presc12,Presc13,Presc14,Presc15,Presc16,Presc17,Presc18,Presc19,Presc20
0,amlodipine,albuterol aerosol,allopurinol,pantoprazole,lorazepam,omeprazole,mometasone,fluconozole,gabapentin,pravastatin,cialis,losartan,metoprolol succinate XL,sulfamethoxazole,abilify,spironolactone,albuterol HFA,levofloxacin,promethazine,glipizide
1,citalopram,benicar,amphetamine salt combo xr,,,,,,,,,,,,,,,,,
2,enalapril,,,,,,,,,,,,,,,,,,,
3,paroxetine,allopurinol,,,,,,,,,,,,,,,,,,
4,abilify,atorvastatin,folic acid,naproxen,losartan,,,,,,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7496,amphetamine,clotrimazole,lantus,,,,,,,,,,,,,,,,,
7497,citalopram,metoprolol,amphetamine salt combo xr,glyburide,celebrex,losartan,,,,,,,,,,,,,,
7498,clopidogrel,,,,,,,,,,,,,,,,,,,
7499,alprazolam,losartan,,,,,,,,,,,,,,,,,,


In [7]:
# Get counts of all medications
med_counts = bask.melt()['value'].value_counts()
med_counts

value
abilify                       1788
amphetamine salt combo xr     1348
carvedilol                    1306
glyburide                     1282
diazepam                      1230
                              ... 
flovent hfa 110mcg inhaler      29
cefdinir                        14
fluoxetine HCI                   7
finasteride                      5
hydrocortisone 2.5% cream        3
Name: count, Length: 119, dtype: int64

In [8]:
# Store the data in a list of lists
bask_list = bask.stack().groupby(level=0).apply(list).values.tolist()

# Instantiate TransactionEncoder
bask_encoder = TransactionEncoder()

# Fit the encoder to learn unique items and transform data into binary matrix
bask_matrix = bask_encoder.fit(bask_list).transform(bask_list)

# Convert binary matrix to DataFrame with prescription names as columns
bask_matrix = pd.DataFrame(bask_matrix, columns=bask_encoder.columns_)

# Display the DataFrame
bask_matrix

Unnamed: 0,Duloxetine,Premarin,Yaz,abilify,acetaminophen,actonel,albuterol HFA,albuterol aerosol,alendronate,allopurinol,...,trazodone HCI,triamcinolone Ace topical,triamterene,trimethoprim DS,valaciclovir,valsartan,venlafaxine XR,verapamil SR,viagra,zolpidem
0,False,False,False,True,False,False,True,True,False,True,...,False,False,False,False,False,False,False,False,False,False
1,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
3,False,False,False,False,False,False,False,False,False,True,...,False,False,False,False,False,False,False,False,False,False
4,False,False,False,True,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7496,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
7497,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
7498,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
7499,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False


In [9]:
bask_matrix.shape

(7501, 119)

In [10]:
# Write the bask_matrix to a csv file
bask_matrix.to_csv('bask_matrix_t3.csv', index=False)
bask_matrix.columns

Index(['Duloxetine', 'Premarin', 'Yaz', 'abilify', 'acetaminophen', 'actonel',
       'albuterol HFA', 'albuterol aerosol', 'alendronate', 'allopurinol',
       ...
       'trazodone HCI', 'triamcinolone Ace topical', 'triamterene',
       'trimethoprim DS', 'valaciclovir', 'valsartan', 'venlafaxine XR',
       'verapamil SR', 'viagra', 'zolpidem'],
      dtype='object', length=119)

## C2: Code Execution

The Apriori algorithm generates the frequent itemsets below, from which the association rules are derived in section C3. A minimum support threshold of 0.02 is used in this analysis. This means that an itemset must appear in at least 2% of all transactions to be considered frequent enough to generate rules from. It is a good starting point because it is low enough to catch meaningful patterns that are not extremely common, but high enough to filter out very rare combinations.
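
As a rough worked example, the threshold can be translated into a transaction count. The sketch below assumes the 7,501 transactions in bask_matrix and the Abilify count of 1,788 from the med_counts output above; the variable names are illustrative only.

```python
import math

n_transactions = 7501   # rows in bask_matrix
min_support = 0.02      # threshold passed to apriori

# Smallest whole number of transactions whose share is at least min_support
min_count = math.ceil(min_support * n_transactions)
print(min_count)

# Support of a single item is its transaction count divided by the total
abilify_support = 1788 / n_transactions
print(round(abilify_support, 6))
```

Under these assumptions, an itemset needs to appear in at least 151 transactions to be retained.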

In [11]:
# Generate frequent itemsets with the Apriori algorithm
freq_meds = apriori(bask_matrix, min_support=0.02, use_colnames=True)
print(freq_meds)

      support                          itemsets
0    0.046794                        (Premarin)
1    0.238368                         (abilify)
2    0.020397               (albuterol aerosol)
3    0.033329                     (allopurinol)
4    0.079323                      (alprazolam)
..        ...                               ...
98   0.023064            (diazepam, lisinopril)
99   0.023464              (diazepam, losartan)
100  0.022930            (diazepam, metoprolol)
101  0.020131  (glyburide, doxycycline hyclate)
102  0.028530             (glyburide, losartan)

[103 rows x 2 columns]

## C3: Association Rules Table

The association rules table is generated with lift as the metric and a minimum threshold of 1.5. Lift is a good metric of choice for this analysis because it accounts for the baseline frequency of both the antecedent and the consequent, whereas confidence adjusts only for the antecedent and support for neither. It also has a very intuitive scale: a lift of 1.0 means the items co-occur exactly as often as expected if they were independent. A threshold of 1.5 identifies items that appear together at least 50% more often than expected by chance. A threshold of 1.0 would admit any positive association, however weak, including ones that could be due to chance. A threshold of 2.0 was too limiting for this dataset.
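
To make the scale concrete, lift can be recomputed by hand from the supports. The sketch below uses the standard definition of lift and the supports mlxtend reports in this analysis for the abilify and atorvastatin pair; the helper name `lift` is illustrative, not part of mlxtend.

```python
def lift(support_ab, support_a, support_b):
    """Observed joint support divided by the joint support expected under independence."""
    return support_ab / (support_a * support_b)

# Supports reported in this analysis for abilify (A) and atorvastatin (B)
value = lift(support_ab=0.047994, support_a=0.238368, support_b=0.129583)
print(round(value, 2))
```

A value just above 1.5 means this pair clears the chosen threshold by a small margin.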

In [12]:
# Generate association rules
rules = association_rules(freq_meds, metric="lift", min_threshold=1.5)
rules

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,zhangs_metric
0,(abilify),(atorvastatin),0.238368,0.129583,0.047994,0.201342,1.553774,0.017105,1.08985,0.46795
1,(atorvastatin),(abilify),0.129583,0.238368,0.047994,0.37037,1.553774,0.017105,1.20965,0.409465
2,(clopidogrel),(abilify),0.059992,0.238368,0.022797,0.38,1.594172,0.008497,1.228438,0.396502
3,(abilify),(clopidogrel),0.238368,0.059992,0.022797,0.095638,1.594172,0.008497,1.039415,0.489364
4,(fenofibrate),(abilify),0.05106,0.238368,0.020131,0.394256,1.653978,0.00796,1.257349,0.416672
5,(abilify),(fenofibrate),0.238368,0.05106,0.020131,0.084452,1.653978,0.00796,1.036472,0.519145
6,(abilify),(glipizide),0.238368,0.065858,0.027596,0.115772,1.757904,0.011898,1.056449,0.566075
7,(glipizide),(abilify),0.065858,0.238368,0.027596,0.419028,1.757904,0.011898,1.310962,0.461536
8,(lisinopril),(abilify),0.098254,0.238368,0.040928,0.416554,1.747522,0.017507,1.305401,0.474369
9,(abilify),(lisinopril),0.238368,0.098254,0.040928,0.1717,1.747522,0.017507,1.088672,0.561638


## C4: Top Three Rules

The top three rules are identified by sorting the association rules by lift in descending order. The top three rules are as follows:

**Rule #33: carvedilol → lisinopril**
- Antecedent: carvedilol
- Consequent: lisinopril
- Metric analysis:
    - Confidence: 0.225115
        - Indicates that when carvedilol is prescribed, lisinopril is also prescribed 22.51% of the time
    - Lift: 2.291162
        - The lift >1 indicates a positive correlation between lisinopril and carvedilol
    - Support: 0.039195
        - 3.92% of transactions contain this combination

**Rule #32: lisinopril → carvedilol**
- Antecedent: lisinopril
- Consequent: carvedilol
- Metric analysis:
    - Confidence: 0.398915
        - Indicates that when lisinopril is prescribed, carvedilol is also prescribed 39.89% of the time
    - Lift: 2.291162
        - The lift >1 indicates a positive correlation between carvedilol and lisinopril (Same as Rule #33) 
    - Support: 0.039195
        - 3.92% of transactions contain this combination (Same as Rule #33)

**Rule #30: glipizide → carvedilol**
- Antecedent: glipizide
- Consequent: carvedilol
- Metric analysis:
    - Confidence: 0.348178
        - Indicates that when glipizide is prescribed, carvedilol is also prescribed 34.82% of the time
    - Lift: 1.999758
        - The lift >1 indicates a positive correlation between glipizide and carvedilol
    - Support: 0.022930
        - 2.29% of transactions contain this combination

Carvedilol appears in all of the top three rules, which makes sense to a certain extent as it is the third most common medication in the data set. Rules 32 and 33 describe the same pair (lisinopril and carvedilol) from opposite directions, which indicates that the two are often prescribed together. All three rules show strong lift: the combinations appear roughly 100% to 129% more often than would be expected by chance. The confidence values, between 0.23 and 0.40, show a meaningful "if-then" relationship in each case. The support for each of the top three rules is low, but this makes sense given the number of transactions in the data set and the number of possible medication combinations.

In [13]:
# Identify the top three rules
top_three_rules = rules.sort_values('lift', ascending=False).head(3)
top_three_rules

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,zhangs_metric
33,(carvedilol),(lisinopril),0.17411,0.098254,0.039195,0.225115,2.291162,0.022088,1.163716,0.682343
32,(lisinopril),(carvedilol),0.098254,0.17411,0.039195,0.398915,2.291162,0.022088,1.373997,0.624943
30,(glipizide),(carvedilol),0.065858,0.17411,0.02293,0.348178,1.999758,0.011464,1.267048,0.535186
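
The difference between Rules 32 and 33 can be checked by hand. The sketch below recomputes confidence in each direction and the shared lift from the supports in the table above, using the standard definitions; the variable names are illustrative.

```python
support_both = 0.039195        # carvedilol and lisinopril together
support_carvedilol = 0.17411
support_lisinopril = 0.098254

# Confidence divides by the antecedent's support, so it depends on direction
conf_carv_to_lis = support_both / support_carvedilol
conf_lis_to_carv = support_both / support_lisinopril

# Lift divides by both supports, so it is the same in either direction
lift_either_way = support_both / (support_carvedilol * support_lisinopril)

print(round(conf_carv_to_lis, 3), round(conf_lis_to_carv, 3), round(lift_either_way, 2))
```

Because lisinopril is the rarer of the two medications, the rule that starts from lisinopril has the higher confidence even though both rules share the same lift and support.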


# Part IV: Data Summary and Implications

## D1: Significance of Support, Lift, and Confidence Summary

With the association rules formed, the analysis can now turn towards rules where Abilify appears in the antecedent or the consequent. There are several rules where Abilify appears, so in order to perform a more directed analysis, a filter on the confidence metric is also set. The lift filter remains at 1.5, but the confidence is required to be greater than or equal to 0.35. These criteria leave 7 association rules. It is interesting to note that only a single medication is included in the antecedent and the consequent for each rule, as has seemed to be a trend for this data set. This could indicate great variance in prescriptions for this data set. Most notably, all seven rules have Abilify as the consequent, indicating these medications are strong predictors of Abilify prescriptions. This is partly a consequence of Abilify's high baseline support (0.238): the reverse rules, with Abilify as the antecedent, have confidences of about 0.20 or less and are removed by the confidence filter.

With the filters set and the association rules identified, this analysis can glean important prescription patterns involving Abilify from the support, lift, and confidence metrics. The support values range from 0.020 to 0.048, meaning that each of these medication pairs appears in roughly 2% to 4.8% of all transactions. While relatively low, this aligns with the idea that health care is complex: the wide range of medical conditions a patient could have leads to a very large number of possible medication combinations.

The confidence values reveal several medications showing strong predictive relationships with Abilify, especially since all 7 of the association rules have Abilify as the consequent. Metformin had the highest confidence at 0.456, meaning that when metformin is prescribed, Abilify is prescribed 45.6% of the time. The lowest of the confidence values in the association rules shows that when atorvastatin is prescribed, Abilify is also prescribed 37% of the time. Overall, all 7 of the rules showed at least a 37% predictive relationship with Abilify.

The lift values exceeding 1.5 for these associations demonstrate that these medication combinations occur at least 50% more frequently than would be expected by chance. The strongest lift values were observed with metformin (1.91 lift) and glipizide (1.76 lift) as antecedents, indicating particularly strong non-random associations that suggest meaningful clinical relationships rather than the medications being prescribed together randomly. This is particularly important to note when trying to understand patterns. The strength of these associations provides quantitative support for treating these prescription combinations as genuine patterns.
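
One way to read the metformin lift is to compare expected and observed transaction counts. The sketch below assumes the 7,501 transactions in bask_matrix and the supports reported for the metformin rule; the variable names are illustrative.

```python
n_transactions = 7501
support_metformin = 0.050527
support_abilify = 0.238368
support_both = 0.023064

# Co-occurrences expected if the two prescriptions were independent
expected_count = support_metformin * support_abilify * n_transactions
# Co-occurrences actually observed in the data set
observed_count = support_both * n_transactions

print(round(expected_count), round(observed_count))
print(round(observed_count / expected_count, 2))
```

Under these assumptions, about 90 joint transactions would be expected by chance while about 173 are observed, which is the ratio the lift value summarizes.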

These metrics together paint a comprehensive picture of prescription patterns, with support values confirming consistent co-occurrence despite medical complexity, confidence values showing strong predictive relationships, and lift values validating the non-random nature of these associations. The analysis reveals a clear pattern of certain medications being reliable predictors of Abilify prescriptions, with varying degrees of association strength.

In [14]:
# Filter rules by both lift and confidence thresholds
strong_rules = rules[
    (rules['lift'] >= 1.5) &
    (rules['confidence'] >= 0.35)
]

# Filter to rules whose antecedents or consequents contain Abilify
# (each is a frozenset of item names, so a direct membership test suffices)
abilify_rules = strong_rules[
    strong_rules['antecedents'].apply(lambda items: 'abilify' in items) |
    strong_rules['consequents'].apply(lambda items: 'abilify' in items)
]

# Sort by lift and show top relationships
abilify_sorted = abilify_rules.sort_values('lift', ascending=False)
print("\nStrongest Abilify Relationships (by lift):")
abilify_sorted


Strongest Abilify Relationships (by lift):


Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,zhangs_metric
11,(metformin),(abilify),0.050527,0.238368,0.023064,0.456464,1.914955,0.01102,1.401255,0.503221
7,(glipizide),(abilify),0.065858,0.238368,0.027596,0.419028,1.757904,0.011898,1.310962,0.461536
8,(lisinopril),(abilify),0.098254,0.238368,0.040928,0.416554,1.747522,0.017507,1.305401,0.474369
4,(fenofibrate),(abilify),0.05106,0.238368,0.020131,0.394256,1.653978,0.00796,1.257349,0.416672
2,(clopidogrel),(abilify),0.059992,0.238368,0.022797,0.38,1.594172,0.008497,1.228438,0.396502
12,(metoprolol),(abilify),0.095321,0.238368,0.035729,0.374825,1.572463,0.013007,1.21827,0.402413
1,(atorvastatin),(abilify),0.129583,0.238368,0.047994,0.37037,1.553774,0.017105,1.20965,0.409465


## D2: Practical Significance of Findings

This analysis has revealed 7 medications that are typically prescribed and purchased alongside Abilify, which was the goal of this analysis and answers the first part of the proposed question. The support, lift, and confidence metrics reveal strong associations between these medications, as described in section D1, which answers the other part of the proposed question. It would take clinical expertise to gather any medical insights from these relationships. At a high level, it can be assumed that these findings could support clinical decision-making processes. From an inventory management perspective, pharmacies can optimize their stock levels by anticipating that when prescriptions for medications like metformin are filled, there is a higher likelihood (45.6%) of needing Abilify. Overall, the analysis provides quantitative data to support inventory decisions. Within legal and ethical boundaries, pharmacies can use this information to make marketing or promotion decisions; copay cards are one common way a pharmacy is able to do this.

## D3: Course of Action

Based on the significant associations identified between Abilify and seven other medications, several courses of action are available. The first would be to set up data-driven inventory management. For example, inventory alerts can be set up for when metformin, glipizide, or lisinopril prescriptions are filled, since they showed the strongest associations with Abilify. These alerts could be based on a threshold obtained from the confidence values in this analysis. From a pharmacy operations perspective, the pharmacy could design workflows that ensure efficient processing of these commonly associated prescriptions. Clinical leaders could use this data to analyze relationships and establish standard prescribing procedures within a hospital system. Another interesting course of action would be to procure a dataset that has more information, such as transaction dates and pharmacy locations rather than a prescription record alone, so that more detailed recommendations can be produced.
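
A confidence-based alert rule of the kind described above could be sketched as follows. This is only an illustration: the 0.40 threshold, the helper name `alert_medications`, and the list name `abilify_confidences` are hypothetical, and the confidence values are copied from the D1 rules table.

```python
# (antecedent, confidence that Abilify is also prescribed), from the D1 rules table
abilify_confidences = [
    ("metformin", 0.456464),
    ("glipizide", 0.419028),
    ("lisinopril", 0.416554),
    ("fenofibrate", 0.394256),
    ("clopidogrel", 0.380000),
    ("metoprolol", 0.374825),
    ("atorvastatin", 0.370370),
]

def alert_medications(rules, min_confidence):
    """Return antecedents whose confidence meets the (hypothetical) alert threshold."""
    return [med for med, conf in rules if conf >= min_confidence]

# Hypothetical threshold; a pharmacy would tune this to its own stocking costs
print(alert_medications(abilify_confidences, min_confidence=0.40))
```

With a threshold of 0.40, only metformin, glipizide, and lisinopril would trigger an alert; lowering the threshold toward 0.35 would include all seven medications.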

# Part V: Attachments

## F: Sources of Third-Party Code

Kamara, K. (2025, January). Market Basket Analysis in Python. Lecture. 

Kamara, K. (n.d.-a). Market Basket Analysis PPT. https://srm--c.vf.force.com/apex/CourseArticle?id=kA0S60000000ilpKAA 

## G: Sources

Deniran, O. H. (2023, November 27). Boosting sales with data: The Power of Market Basket Analysis in retail. Medium. https://medium.com/@chemistry8526/boosting-sales-with-data-the-power-of-market-basket-analysis-in-retail-c79cc10a14df 

GeeksforGeeks. (2024, August 16). Market basket analysis in Data Mining. GeeksforGeeks. https://www.geeksforgeeks.org/market-basket-analysis-in-data-mining/ 

Kamara, K. (2025, January). Market Basket Analysis in Theory. Lecture.

Kamara, K. (n.d.-a). Market Basket Analysis PPT. https://srm--c.vf.force.com/apex/CourseArticle?id=kA0S60000000ilpKAA 