# Task 1: Symptom Co-occurrence Analysis

## Objective
This task analyses how symptoms co-occur within disease profiles. Specifically, it aims to identify combinations of symptoms that frequently appear together in the same disease.

## Method
Implement the Apriori algorithm to analyse the Disease Symptom dataset, identifying common combinations of symptoms that frequently co-occur within the same disease profile.
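
To make the method concrete, here is a minimal pure-Python sketch of the Apriori idea: count single items, keep those that meet the support threshold, then repeatedly join and prune to build larger candidates. The function name `apriori_sketch` and the toy transactions are illustrative assumptions. This is not the implementation behind `SymptomPatternMiner`, and it is not optimised for large datasets.

```python
from itertools import combinations


def apriori_sketch(transactions, min_support):
    """Return {frozenset(itemset): support} for every frequent itemset.

    Hypothetical helper for illustration only.
    """
    baskets = [frozenset(t) for t in transactions]
    n = len(baskets)

    def support(candidate):
        # Fraction of transactions that contain every item in the candidate
        return sum(candidate <= basket for basket in baskets) / n

    # Level 1: frequent single items
    items = {item for basket in baskets for item in basket}
    current = {}
    for item in items:
        candidate = frozenset([item])
        s = support(candidate)
        if s >= min_support:
            current[candidate] = s

    frequent = {}
    k = 1
    while current:
        frequent.update(current)
        k += 1
        # Join step: merge pairs of frequent (k-1)-itemsets into k-item candidates
        candidates = {a | b for a, b in combinations(current, 2) if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in current for sub in combinations(c, k - 1))
        }
        current = {}
        for c in candidates:
            s = support(c)
            if s >= min_support:
                current[c] = s
    return frequent


# Toy data (made up for this example, not taken from the dataset)
toy = [
    ["fever", "cough", "fatigue"],
    ["fever", "cough"],
    ["fever", "fatigue"],
    ["cough", "headache"],
]
result = apriori_sketch(toy, min_support=0.5)
```

On the toy data, `result` holds five frequent itemsets: three single symptoms and the pairs {fever, cough} and {fever, fatigue}. The triple is never counted because its subset {cough, fatigue} is not frequent, which is the pruning that keeps Apriori tractable.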


## 1. Setup and Environment


In [None]:
import sys
from pathlib import Path

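# Add the project root to the Python path so the custom modules can be imported.
# Assumes the kernel's working directory is two levels below the project root.
project_root = Path().resolve().parent.parent
if str(project_root) not in sys.path:
    sys.path.append(str(project_root))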

from src.processors.symptom_data_processor import SymptomDataProcessor
from src.analysis.symptom_pattern_miner import SymptomPatternMiner

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

print("Libraries and modules imported successfully.")


## 2. Data Loading and Preprocessing


In [None]:
# Define the path to the dataset relative to project_root.
# project_root is derived from the kernel's working directory in the setup cell,
# so the path is only correct when the notebook is run from its own directory.
DATA_PATH = project_root / 'data' / 'dataset.csv'

# Initialize and run the data processor
processor = SymptomDataProcessor(data_path=DATA_PATH)
transactions = processor.process_data()

print(f"Successfully processed {len(transactions)} transactions.")
print("Example transaction:", transactions[0])


## 3. Apriori Algorithm and Association Rule Mining


In [None]:
# Initialize the pattern miner.
# NOTE: A `min_support` of 0.01 is too low for this dataset and will likely cause a memory error, crashing the kernel.
# A value of 0.02 or higher is recommended for stable performance.
miner = SymptomPatternMiner(transactions, min_support=0.02)

# Mine for frequent itemsets
frequent_itemsets = miner.mine_frequent_itemsets()

# Generate association rules with a minimum confidence of 50%
rules = miner.generate_association_rules(metric="confidence", min_threshold=0.5)

# --- Save Results to CSV ---
# This allows the analysis section to load pre-computed results,
# either from this run or from a previous run of the `symptom_analysis.py` script.
output_dir = project_root / 'outputs'
output_dir.mkdir(exist_ok=True) # Ensure the directory exists

itemsets_path = output_dir / 'frequent_itemsets.csv'
rules_path = output_dir / 'association_rules.csv'

frequent_itemsets.to_csv(itemsets_path, index=False)
rules.to_csv(rules_path, index=False)

print("Frequent itemsets and association rules have been saved to the 'outputs' directory.")
print(f"Itemsets saved to: {itemsets_path}")
print(f"Rules saved to: {rules_path}")

## 4. Load and Analyze Results

This section loads the pre-computed results from the `outputs` directory. You can either generate these files by running the cell above, or by running the main analysis script: `python scripts/symptom_analysis.py`.

This allows you to separate the time-consuming mining process from the interactive analysis and visualization.
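
One consequence of the CSV round trip is that itemset columns, which hold `frozenset` objects in memory, come back as plain strings such as `"frozenset({'fatigue', 'vomiting'})"`. The sketch below shows one way to turn such a string back into a `frozenset`. The helper name `parse_frozenset` is an assumption for illustration, and it only handles the default `str()` form of a frozenset of simple literals.

```python
import ast


def parse_frozenset(text):
    """Turn "frozenset({'a', 'b'})" back into a frozenset.

    Hypothetical helper: ast.literal_eval cannot evaluate the frozenset(...)
    call itself, so the wrapper is stripped and only the inner set literal
    is evaluated.
    """
    text = text.strip()
    prefix = "frozenset("
    if text.startswith(prefix) and text.endswith(")"):
        text = text[len(prefix):-1]
    if not text:
        # str(frozenset()) is "frozenset()", which leaves nothing to evaluate
        return frozenset()
    return frozenset(ast.literal_eval(text))


original = frozenset({"fatigue", "vomiting"})
as_written = str(original)            # the text that ends up in the CSV cell
restored = parse_frozenset(as_written)
```

If the parsed sets are needed right after loading, a helper like this could be passed to `pd.read_csv` through its `converters` argument, for example `converters={'itemsets': parse_frozenset}`.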


In [None]:
# --- Load Results from CSV ---
# Load the frequent itemsets and association rules from the files saved in the 'outputs' directory.

itemsets_path = project_root / 'outputs' / 'frequent_itemsets.csv'
rules_path = project_root / 'outputs' / 'association_rules.csv'

try:
    frequent_itemsets_from_csv = pd.read_csv(itemsets_path)
    rules_from_csv = pd.read_csv(rules_path)
    
    print("Successfully loaded pre-computed results from CSV files.")
    
    print("\nFrequent Itemsets (Top 10):")
    display(frequent_itemsets_from_csv.head(10))
    
    print("\nAssociation Rules (Top 10):")
    display(rules_from_csv.head(10))
    
except FileNotFoundError:
    print("CSV files not found. Please run the mining cell above or the `symptom_analysis.py` script first.")



### Explanation of Association Rule Metrics

For a rule **"IF {A} THEN {B}"**:

| Column | Simple Explanation | In Technical Terms |
| :--- | :--- | :--- |
| **`antecedents`** | The "IF" part of the rule. This is symptom set {A}. | `antecedents` |
| **`consequents`** | The "THEN" part of the rule. This is symptom set {B}. | `consequents` |
| **`antecedent support`** | The fraction of all transactions that contain symptom set {A}. | `support(A)` |
| **`consequent support`** | The fraction of all transactions that contain symptom set {B}. | `support(B)` |
| **`support`** | The fraction of all transactions that contain {A} and {B} **together**. | `support(A U B)` |
| **`confidence`** | **The rule's reliability.** "If a patient has {A}, what's the probability they also have {B}?" Higher is better. | `support(A U B) / support(A)` |
| **`lift`** | **The rule's importance.** How many times more likely {B} is when {A} is present than it is overall. `1` means {A} and {B} are independent; `> 1` means a positive association, and higher is stronger. | `confidence(A->B) / support(B)` |
| **`leverage`** | The difference between how often {A} and {B} appear together versus how often they would if they were independent. `> 0` means they appear together more than expected. | `support(A U B) - (support(A) * support(B))` |
| **`conviction`** | A measure of the rule's implication. A high value means the consequent {B} is highly dependent on the antecedent {A}. An `inf` (infinity) value occurs when confidence is 1, i.e. the rule never fails in the data. | `(1 - support(B)) / (1 - confidence(A->B))` |
| **`zhangs_metric`** | A measure of association that ranges from -1 (perfect negative association) to +1 (perfect positive association). 0 indicates independence. | `leverage(A->B) / max(support(A U B) * (1 - support(A)), support(A) * (support(B) - support(A U B)))` |
| **`kulczynski`** | The average of the two confidence scores (`A->B` and `B->A`). It's a symmetric measure of how strongly the two are related. | `0.5 * (confidence(A->B) + confidence(B->A))` |
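
The formulas in the table can be checked by hand from just three numbers: `support(A)`, `support(B)`, and `support(A U B)`. The sketch below does this. The function name `rule_metrics` and the example supports are assumptions for illustration, and the Zhang's metric line uses the commonly cited definition (leverage divided by its maximum attainable magnitude), which may differ in edge-case handling from whatever library produced the CSV.

```python
def rule_metrics(support_a, support_b, support_ab):
    """Compute rule metrics for A -> B from three supports.

    Hypothetical helper; assumes 0 < support_a, support_b <= 1 and
    support_ab <= min(support_a, support_b).
    """
    confidence = support_ab / support_a
    lift = confidence / support_b
    leverage = support_ab - support_a * support_b
    # Conviction is infinite when the rule never fails (confidence == 1)
    if confidence == 1:
        conviction = float("inf")
    else:
        conviction = (1 - support_b) / (1 - confidence)
    # Zhang's metric: leverage scaled into the range [-1, 1]
    denom = max(
        support_ab * (1 - support_a),
        support_a * (support_b - support_ab),
    )
    zhangs_metric = leverage / denom if denom else 0.0
    # Kulczynski: mean of confidence(A->B) and confidence(B->A)
    kulczynski = 0.5 * (confidence + support_ab / support_b)
    return {
        "confidence": confidence,
        "lift": lift,
        "leverage": leverage,
        "conviction": conviction,
        "zhangs_metric": zhangs_metric,
        "kulczynski": kulczynski,
    }


# Made-up supports: A and B each appear in half of the transactions,
# and appear together in 37.5% of them.
metrics = rule_metrics(support_a=0.5, support_b=0.5, support_ab=0.375)
```

For these inputs the sketch gives confidence 0.75, lift 1.5, leverage 0.125, conviction 2.0, and kulczynski 0.75, with Zhang's metric at roughly 0.67. Under independence `support(A U B)` would be 0.25, so every metric agrees that the association is positive.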


## 5. Visualization


In [None]:
# This visualization uses the data loaded from the CSV files.
import ast

# Get the top 10 most frequent itemsets from the DataFrame loaded from the CSV.
# .copy() lets us add a label column without modifying the loaded DataFrame.
top_itemsets = frequent_itemsets_from_csv.nlargest(10, 'support').copy()

# The 'itemsets' column is loaded as a string, e.g., "frozenset({'fatigue', 'vomiting'})".
# ast.literal_eval only parses literals, not the frozenset(...) call, so we strip
# that wrapper and evaluate the inner set literal instead.
def format_itemset_str(itemset_str):
    inner = str(itemset_str).strip()
    if inner.startswith('frozenset(') and inner.endswith(')'):
        inner = inner[len('frozenset('):-1]
    try:
        # Safely evaluate the set literal, then sort for a stable, readable label
        return ', '.join(sorted(ast.literal_eval(inner)))
    except (ValueError, SyntaxError, TypeError):
        # If parsing fails, fall back to the original string
        return itemset_str

# Prepare data for plotting by creating a new, clean string column for the labels
top_itemsets['itemsets_str'] = top_itemsets['itemsets'].apply(format_itemset_str)

# Plotting
plt.figure(figsize=(12, 8))
sns.barplot(x='support', y='itemsets_str', data=top_itemsets, palette='viridis')
plt.title('Top 10 Most Frequent Symptom Combinations')
plt.xlabel('Support')
plt.ylabel('Symptom Sets')
plt.show()
