# Structural Parity in Cancer

By Annie Tran

I demonstrate below how simple rules derived from theory can predict a benign or malignant diagnosis from structural features, reaching 93%, 96%, and 98% accuracy on the Breast Cancer Wisconsin (Diagnostic) dataset.

No machine learning or black-box approaches were used.

---


**Repository:**  
[github.com/nntrn/cancer][Repo]

**Dataset:**  
[Breast Cancer Wisconsin (Diagnostic)][Data]


[Repo]: https://github.com/nntrn/cancer
[Github]: https://nntrn.github.io
[Data]: https://archive.ics.uci.edu/dataset/17/breast+cancer+wisconsin+diagnostic

## Setup

In [14]:
import pandas as pd

pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
pd.set_option('display.precision', 3)
pd.set_option('styler.format.precision',0)
pd.set_option("display.width", 90)

%precision 4

df = pd.read_csv('breast-cancer-wisconsin.csv')
# remove se columns
df = df.drop([y for y in df.keys() if '_se' in y],axis=1)

In [15]:
def accuracy_calc(data,pname="predicted",dname="diagnosis"):
  # m is keyed by actual+predicted: 'BM' = actual benign, predicted malignant
  k,m = [],{'BB':0, 'MM':0, 'BM':0, 'MB':0}
  for actual, pred in zip(data[dname], data[pname]):
    key=f"{pred}.{actual}"
    if pred == 'B' or pred == 'M':
      m[actual+pred] += 1
    elif key in m:
      m[key] += 1
    else:
      # any other predicted label is tallied under "pred.actual" and reported in 'test'
      m[key] = 1
      k.append(key)

  return {
    'matrix': m,
    'total': len(data),
    'wrong': m['BM']+m['MB'],
    'correct': m['BB']+m['MM'],
    # overall accuracy is a fraction; the per-class values below are percentages
    'accuracy':(m['BB']+m['MM'])/(m['BM']+m['MB']+m['BB']+m['MM']),
    'b.accuracy': (m['BB']/(m['BB']+m['BM']))*100,
    'm.accuracy': (m['MM']/(m['MM']+m['MB']))*100,
    'test': [[x,m[x],(m[x]/len(data))] for x in k ]
  }
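
The matrix keys concatenate the actual label and the predicted label, in that order, so `BM` counts benign cases predicted as malignant. A minimal, pandas-free sketch of that convention follows; the `actual` and `predicted` lists are hypothetical labels made up for illustration, not rows from the dataset.

```python
from collections import Counter

# Hypothetical labels for illustration only (not taken from the dataset).
actual    = ['B', 'B', 'B', 'M', 'M', 'B']
predicted = ['B', 'M', 'B', 'M', 'B', 'B']

# Key = actual label followed by predicted label, as in accuracy_calc.
matrix = Counter(a + p for a, p in zip(actual, predicted))

correct = matrix['BB'] + matrix['MM']
accuracy = correct / len(actual)

# Per-class rates: share of actual benign (or malignant) cases predicted correctly.
b_accuracy = matrix['BB'] / (matrix['BB'] + matrix['BM'])
m_accuracy = matrix['MM'] / (matrix['MM'] + matrix['MB'])

print(dict(matrix), accuracy, b_accuracy, m_accuracy)
```

Here three of the four benign cases and one of the two malignant cases are predicted correctly, so the per-class rates differ even though a single overall accuracy is reported.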

## Functions

### Core

In [16]:
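def closure_score(row):
  # perimeter/6.28 (~2*pi) is the radius of a circle with this perimeter;
  # the score is 1 when it matches radius_mean and falls toward 0 as they diverge
  radius = row['radius_mean']
  if radius == 0:
    # explicit guard: numpy floats divide by zero to inf/nan instead of raising
    return 0
  return 1 - min(abs(radius - row['perimeter_mean']/6.28) / radius, 1)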

def constraint_score(row):
  return min(row['smoothness_mean']/0.15,row['symmetry_mean']/0.3,1)
  
def asymmetry_score(row):
  return int((row['fractal_dimension_mean'] > 0.075) or (row['texture_mean'] > 20))
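
To make the closure score concrete: `perimeter_mean / 6.28` is the radius a perfect circle with that perimeter would have, so the score compares it with `radius_mean`. The sketch below restates the formula on plain dicts; both outlines are hypothetical values chosen for illustration, not dataset rows.

```python
import math

def closure_score(row):
    # Same formula as above; 6.28 approximates 2*pi.
    implied_radius = row['perimeter_mean'] / 6.28
    return 1 - min(abs(row['radius_mean'] - implied_radius) / row['radius_mean'], 1)

# Hypothetical outlines, not dataset rows.
circle = {'radius_mean': 10.0, 'perimeter_mean': 2 * math.pi * 10.0}
ragged = {'radius_mean': 10.0, 'perimeter_mean': 1.2 * 2 * math.pi * 10.0}

print(round(closure_score(circle), 3))  # 0.999
print(round(closure_score(ragged), 3))  # 0.799
```

A perimeter 20% longer than the circle implied by the radius costs about 0.2 of closure, which puts it well below the 0.96 cutoff used by the models.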

### Derived

In [17]:
def generative_score(row): 
  fd = (row['fractal_dimension_mean']*row['fractal_dimension_worst'])
  return (row['area_mean']/fd)/1000

def area_integrity_score(row):
  return abs(row['area_mean']/row['area_worst'])

def area_relative_score(row):
  if row['area_worst']==row['area_mean']: return 1
  return (row['area_worst']-row['area_mean'])/row['area_mean']

def diversity_score(row):
  return 1-(row['fractal_dimension_mean']/row['fractal_dimension_worst'])

def growth_tension_score(row): 
  normalized_fractal_mean=row['fractal_dimension_mean']*100
  return row['area_mean']/normalized_fractal_mean

def abnormal_count(row):
  abnormal = 0
  if asymmetry_score(row) == 1:
    abnormal += 1
  if constraint_score(row) < 0.7:
    abnormal += 1
  if generative_score(row) > 200:
    abnormal += 1
  if area_relative_score(row) > 0.5: 
    abnormal += 1
  if row['smoothness_mean'] > 0.11:
    abnormal += 1
  if row['concave_points_mean'] >0.05:
    abnormal += 1
  if row['concave_points_worst'] >0.2:
    abnormal += 1
  return abnormal

df['closure'] = df.apply(closure_score, axis=1)
df['constraint'] = df.apply(constraint_score, axis=1)
df['asymmetry'] = df.apply(asymmetry_score, axis=1)
df['generative'] = df.apply(generative_score, axis=1)
df['integrity'] = df.apply(area_integrity_score, axis=1)
df['magnitude'] = df.apply(area_relative_score, axis=1)
df['diversity'] = df.apply(diversity_score, axis=1)
df['growth_tension'] = df.apply(growth_tension_score, axis=1)
df['abnormal'] = df.apply(abnormal_count, axis=1)
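
Note that `integrity` and `magnitude` are computed from the same two columns, so each determines the other: for positive, unequal areas, `magnitude = 1/integrity - 1`. A small check on a hypothetical pair of areas (not a dataset row):

```python
def area_integrity_score(row):
    return abs(row['area_mean'] / row['area_worst'])

def area_relative_score(row):
    if row['area_worst'] == row['area_mean']:
        return 1
    return (row['area_worst'] - row['area_mean']) / row['area_mean']

# Hypothetical areas, not a dataset row.
row = {'area_mean': 400.0, 'area_worst': 500.0}

integrity = area_integrity_score(row)   # 400 / 500
magnitude = area_relative_score(row)    # (500 - 400) / 400

# For positive, unequal areas: magnitude == 1 / integrity - 1.
print(integrity, magnitude, 1 / integrity - 1)
```

Under this identity, `integrity > 0.8` corresponds to `magnitude < 0.25`. The one exception is the equal-area case, where `area_relative_score` returns 1 by construction.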

## Models


     Model   Rules   Accuracy  Classification Matrix                Correct
    ------- ------- ---------- ---------------------------------- ---------
       1       5       96%     BB: 343, MM: 204, BM: 14, MB: 8      547/569
       2       7       98%     BB: 350, MM: 208, BM: 7, MB: 4       558/569
       3       4       93%     BB: 333, MM: 198, BM: 24, MB: 14     531/569
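
The accuracy column follows directly from the matrix counts. A quick arithmetic check using only the numbers in the table (keys read as actual label followed by predicted label, as in `accuracy_calc`):

```python
# Counts copied from the table above; keys are actual + predicted.
models = {
    1: {'BB': 343, 'MM': 204, 'BM': 14, 'MB': 8},
    2: {'BB': 350, 'MM': 208, 'BM': 7, 'MB': 4},
    3: {'BB': 333, 'MM': 198, 'BM': 24, 'MB': 14},
}

summary = {}
for model, m in models.items():
    total = sum(m.values())
    correct = m['BB'] + m['MM']
    summary[model] = (correct, total, round(correct / total, 4))
    print(model, f"{correct}/{total}", summary[model][2])
```

Every row sums to 569 cases, split into 357 actual benign (`BB + BM`) and 212 actual malignant (`MM + MB`).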
  

### Model 1: 96%

Targets obvious and classic benign/malignant patterns first, then uses structure-based rules for ambiguous cases.

In [18]:
def predict(row): 
  if row['area_mean'] > 1000 or row['abnormal'] > 3:
    return 'M'
  elif row['area_mean'] < 550 and row['radius_worst'] < 15:
    return 'B'
  elif row['abnormal'] < 2 and row['closure'] > 0.96:
    return 'B'
  elif row['integrity'] > 0.8 and row['diversity'] < 0.25:
    return 'B'
  elif row['integrity'] > 0.8 and row['constraint'] < 0.55:
    return 'B'
  return 'M'

df['pred'] = df.apply(predict, axis=1)

accuracy_calc(df,"pred")

{'matrix': {'BB': 343, 'MM': 204, 'BM': 14, 'MB': 8},
 'total': 569,
 'wrong': 22,
 'correct': 547,
 'accuracy': 0.9613,
 'b.accuracy': 96.0784,
 'm.accuracy': 96.2264,
 'test': []}

### Model 2: 98%

Adds further feature checks that resolve almost all borderline cases, giving near-perfect classification.


In [19]:
def near_perfect_predict(row):
  if row['area_mean'] > 1000 or row['abnormal'] > 3 or row['area_worst'] > 1300:
    return 'M'
  elif row['area_mean'] < 550 and row['radius_worst'] < 15:
    return 'B'
  elif row['magnitude'] > 0.37 and row['smoothness_mean'] > 0.080:
    return 'M' 
  elif row['abnormal'] < 2 and row['closure'] > 0.96:
    return 'B'
  elif row['integrity'] > 0.8 and row['constraint'] < 0.615 and row['abnormal'] < 3:
    return 'B' 
  elif row['integrity'] > 0.79 and row['asymmetry'] < 1:
    return 'B'
  elif row['symmetry_mean'] < 0.3 and row['radius_mean'] < 13:
    return 'B'
  # Add new elif conditions here
  else:
    return 'M'

df['pred2'] = df.apply(near_perfect_predict, axis=1)
accuracy_calc(df,"pred2")

{'matrix': {'BB': 350, 'MM': 208, 'BM': 7, 'MB': 4},
 'total': 569,
 'wrong': 11,
 'correct': 558,
 'accuracy': 0.9807,
 'b.accuracy': 98.0392,
 'm.accuracy': 98.1132,
 'test': []}

**NOTE:**  
Big improvements can be made by adding rules that target 6 out of 7 'BM' predictions (actual=benign, predicted=malignant).


### Model 3: 93%

Uses only the strongest, minimal rules to show how much can be explained with as little as possible.

In [20]:
def min_abnormal_count(row):
  abnormal = 0
  if asymmetry_score(row) == 1:
    abnormal += 1
  if constraint_score(row) < 0.7:
    abnormal += 1
  if generative_score(row) > 200:
    abnormal += 1
  if area_relative_score(row) > 0.4: 
    abnormal += 1
  return abnormal

# comments below show correct/total
def minimal_predict(row): 
  if row['area_mean'] < 550 and row['radius_worst'] < 15:
    return 'B' # 263/269
  elif row['abnormal_m3'] < 2 and row['closure'] > 0.96:
    return 'B' # 233/236
  elif row['integrity'] > .8 and row['generative'] < 110: 
    return 'B' # 158/162
  elif row['integrity'] > .8 and row['constraint'] < 0.5:
    return 'B' # 56/58
  else:
    return 'M'

df['abnormal_m3'] = df.apply(min_abnormal_count, axis=1)

df['pred3'] = df.apply(minimal_predict, axis=1)
accuracy_calc(df,"pred3")

{'matrix': {'BB': 333, 'MM': 198, 'BM': 24, 'MB': 14},
 'total': 569,
 'wrong': 38,
 'correct': 531,
 'accuracy': 0.9332,
 'b.accuracy': 93.2773,
 'm.accuracy': 93.3962,
 'test': []}
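
The `correct/total` comments in `minimal_predict` can be reproduced by testing each rule on its own against every row. The sketch below assumes that is how the counts were taken (the four totals add up to more than 569, so they cannot be first-match counts). The rule names and the three rows are hypothetical, made up for illustration.

```python
# Each rule of minimal_predict as a standalone predicate with the label it returns.
# The rule names are hypothetical labels, not part of the notebook.
RULES = {
    'small':          (lambda r: r['area_mean'] < 550 and r['radius_worst'] < 15, 'B'),
    'closed':         (lambda r: r['abnormal_m3'] < 2 and r['closure'] > 0.96, 'B'),
    'low_generative': (lambda r: r['integrity'] > 0.8 and r['generative'] < 110, 'B'),
    'low_constraint': (lambda r: r['integrity'] > 0.8 and r['constraint'] < 0.5, 'B'),
}

def rule_tally(rows, rules=RULES):
    """Return {rule name: (correct, total)}, counting every row the rule matches."""
    out = {}
    for name, (test, label) in rules.items():
        hits = [r for r in rows if test(r)]
        correct = sum(r['diagnosis'] == label for r in hits)
        out[name] = (correct, len(hits))
    return out

# Hypothetical rows for illustration only (not taken from the dataset).
rows = [
    {'diagnosis': 'B', 'area_mean': 400, 'radius_worst': 13.0, 'abnormal_m3': 1,
     'closure': 0.97, 'integrity': 0.85, 'generative': 90, 'constraint': 0.60},
    {'diagnosis': 'M', 'area_mean': 500, 'radius_worst': 14.5, 'abnormal_m3': 3,
     'closure': 0.95, 'integrity': 0.70, 'generative': 150, 'constraint': 0.60},
    {'diagnosis': 'B', 'area_mean': 700, 'radius_worst': 16.0, 'abnormal_m3': 2,
     'closure': 0.98, 'integrity': 0.82, 'generative': 100, 'constraint': 0.45},
]

tally = rule_tally(rows)
for name, (correct, total) in tally.items():
    print(f"{name}: {correct}/{total}")
```

With a DataFrame, the same function should work on `df.to_dict('records')`, since each record is a plain dict keyed by column name.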

In [21]:
keys=['id','diagnosis',
  'closure','constraint','asymmetry',
  'integrity','generative','abnormal']

df[keys].loc[df['pred'] != df['diagnosis']].head(5)

              id diagnosis  closure  constraint  asymmetry  integrity  generative  abnormal
    81   8611161         B    0.968       0.647          0      0.846      74.154         2
    86  86135501         M    0.964       0.630          1      0.801     167.997         2
    89    861598         B    0.958       0.705          0      0.811     121.239         2
    91    861799         M    0.962       0.572          1      0.876     174.921         3
    92    861853         B    0.983       0.462          0      0.664     167.164         2


## Examine dataset

In [22]:
print(df.groupby(['diagnosis', 'asymmetry']).agg({
    'smoothness_mean': ['mean', 'std', 'min', 'max'],
    'symmetry_mean': ['mean', 'std', 'min', 'max'],
    'texture_mean': ['mean', 'std', 'min', 'max'],
    'radius_mean': ['mean', 'std', 'min', 'max'],
    'perimeter_mean': ['mean', 'std', 'min', 'max']
}).T)

diagnosis                   B                M         
asymmetry                   0       1        0        1
smoothness_mean mean    0.093   0.092    0.102    0.103
                std     0.012   0.016    0.011    0.013
                min     0.064   0.053    0.078    0.074
                max     0.129   0.163    0.133    0.145
symmetry_mean   mean    0.174   0.176    0.189    0.195
                std     0.024   0.027    0.028    0.028
                min     0.117   0.106    0.150    0.131
                max     0.274   0.255    0.304    0.291
texture_mean    mean   16.209  22.488   17.982   23.137
                std     2.243   4.075    1.680    3.352
                min     9.710  12.870   11.890   10.380
                max    19.960  33.810   19.980   39.280
radius_mean     mean   12.318  11.686   17.609   17.401
                std     1.716   1.877    3.054    3.274
                min     8.196   6.981   11.760   10.950
                max    17.850  14.990   28.110  

In [23]:
# df.loc[(df['asymmetry'] >0) & (df['generative'] < 70 )]

In [24]:
# df.to_csv("output.csv")

## Rules

*Two of Saturn's moons switch orbits every few years. This behavior appears anomalous for either moon on its own but is stable when the two are viewed as a pair.*


### Strong signals

#### Benign

Oddness

* `row["asymmetry"] > 0` and `row["abnormal"] < 3` and `row["generative"] < 100`
* `row["abnormal"] < 2` and `row["asymmetry"] < 1`
* `row["area_mean"] < 525` and `row["radius_worst"] < 14` and `row["abnormal"] < 3`

High integrity

* `row["integrity"] > 0.8` and `row["diversity"] < 0.25`
* `row["integrity"] > 0.8` and `row["constraint"] < 0.55` and `row["abnormal"] < 3`
* `row["integrity"] > 0.8` and `row["asymmetry"] < 1`

Using mean

* `row["magnitude"] > 0.45` and `row["smoothness_mean"] > 0.091`
* `row["texture_mean"] > 22` and `row["perimeter_mean"] < 80`
* `row["symmetry_mean"] < 0.2` and `row["radius_mean"] < 13`
* `row["smoothness_mean"] < 0.1` and `row["texture_mean"] > 22` and `row["radius_mean"] < 15`


#### Malignant

* `row["area_mean"] > 1000`
* `row["abnormal"] > 3`
* `row["integrity"] < .6`


### Interpreting dual signals

Example using integrity (Id) + constraint (K) to demonstrate signal strength: 

    CityA: high Id / low K
    CityB: high Id / high K
    CityC: medium Id / medium K
    CityD: low Id (regardless of K)

    Between a safe city with good cops (B) and a safe city with bad cops (A),
    CityA is the stronger signal that the city is safe.
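
One way to read the four configurations is as a lookup on the two scores. The sketch below is an assumption about where "high" and "low" begin: the cutoffs (`integrity > 0.8`, `integrity < 0.6`, `constraint < 0.55`) are borrowed from the rule lists above, the function name is hypothetical, and the medium band for K is not modeled because the text does not define it.

```python
def configuration(integrity, constraint):
    """Map an (integrity, constraint) pair to the city labels used above.

    Cutoffs are assumptions borrowed from the rule lists, not definitions
    given in the text.
    """
    if integrity < 0.6:
        return 'D'   # low Id, regardless of K
    if integrity <= 0.8:
        return 'C'   # medium Id (medium K is not modeled in this sketch)
    # high Id: low K is CityA, otherwise CityB
    return 'A' if constraint < 0.55 else 'B'

print(configuration(0.85, 0.40))  # A
print(configuration(0.85, 0.70))  # B
print(configuration(0.70, 0.60))  # C
print(configuration(0.50, 0.90))  # D
```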


## Notes

I assume a temporal relationship exists between `fractal_dimension_mean` and
`fractal_dimension_worst`, such that: 

    fractal_dimension_mean = past visits
    fractal_dimension_worst = most current visit
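
Under that reading, `diversity_score` (`1 - mean/worst`) is the change from past to current fractal dimension, expressed as a fraction of the current value. A small worked example on hypothetical values (not a dataset row):

```python
def diversity_score(row):
    return 1 - (row['fractal_dimension_mean'] / row['fractal_dimension_worst'])

# Hypothetical values, not a dataset row.
row = {'fractal_dimension_mean': 0.06, 'fractal_dimension_worst': 0.08}

diversity = diversity_score(row)

# Same quantity written as a relative change: (worst - mean) / worst.
relative_change = (
    row['fractal_dimension_worst'] - row['fractal_dimension_mean']
) / row['fractal_dimension_worst']

print(round(diversity, 3), round(relative_change, 3))  # 0.25 0.25
```

A value of 0.25 sits exactly on the `diversity < 0.25` cutoff used in Model 1, so this hypothetical row would not satisfy that rule.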

Definitions used:

    Asymmetry: enables generativity
    Closure: determines whether growth can resolve
    Constraint: limits how far generativity can spread
    Integrity: stability, resistance to change
    Growth tension: stress/strain from expansion
    Diversity: biodiversity, entropy

This model implies:

    * There is no total order of malignancy
    * There is no single monotonic risk axis
    * Predictions fail when treated as rankings
    * Meaning arises from relational configuration

