<a href="https://colab.research.google.com/github/kxk302/MBA/blob/main/MBA.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


In [None]:
!ls '/content/gdrive/MyDrive/Colab Notebooks/MBA_files'

bos_0223_0905_2863_7000  uke_0210_0910_2500_7000  ukl_0203_0926_2030_7000


In [None]:
import numpy as np
import pandas as pd

from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules

# Create a single integer representing a variant at a specific position with a specific allele frequency
# Pivot the data so we have all sample variants on a single line
def preprocess_input_file(in_file):
  # Read the COVID-19 sample file and keep only the relevant columns
  df = pd.read_csv(in_file, sep='\t')[['Sample', 'POS', 'AF', 'FUNCLASS']]

  # Replace "." with "NONE" in the FUNCLASS column. Both represent a non-coding variant
  df["FUNCLASS"] = df["FUNCLASS"].replace('.', 'NONE')

  # Bucket values in FUNCLASS and AF columns. We do not bucket the values in POS column.

  # Replace variants in FUNCLASS column with a distinct numeric value
  df.loc[df.FUNCLASS == "NONE", "FUNCLASS"] = 4
  df.loc[df.FUNCLASS == "MISSENSE", "FUNCLASS"] = 3
  df.loc[df.FUNCLASS == "NONSENSE", "FUNCLASS"] = 2 
  df.loc[df.FUNCLASS == "SILENT", "FUNCLASS"] = 1 

  # Replace allele frequency in AF column with a distinct numeric value
  df.loc[df.AF >= 0.80, "AF"] = 3
  df.loc[(df.AF >= 0.20) & (df.AF < 0.80), "AF"] = 2
  df.loc[df.AF < 0.20, "AF"] = 1

  # Convert AF values to integer
  df = df.astype({"AF": int}) 

  # Create a new column called 'Label', which is a string concatenation of POS, FUNCLASS, and AF values.
  # The idea is to represent each position + variant class + allele frequency bucket as a single item, to be used in MBA
  df["Label"] = df["POS"].astype(str) + df["FUNCLASS"].astype(str) + df["AF"].astype(str)

  # We do not need the POS, FUNCLASS, and AF columns anymore; work on an explicit copy
  df = df[["Sample", "Label"]].copy()

  # Add a new column called 'Value', prepopulated with 1
  df["Value"] = 1

  df = pd.pivot_table(df, index="Sample", columns="Label", values="Value")

  # Set all data frame NaN (not a number) values to 0
  df = df.fillna(0)
  # Convert all data frame values to integer
  df = df.astype(int)

  return df
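
The label encoding used by `preprocess_input_file` can be tried on a few rows without downloading any data. The following is a minimal sketch in plain Python, not the notebook's pandas code: the helper names `bucket_af` and `make_label` and the three sample rows are hypothetical, while the numeric codes and thresholds are copied from the function above.

```python
# Numeric codes copied from preprocess_input_file.
FUNCLASS_CODES = {"NONE": 4, "MISSENSE": 3, "NONSENSE": 2, "SILENT": 1}


def bucket_af(af):
    """Map an allele frequency to its bucket code (3, 2, or 1)."""
    if af >= 0.80:
        return 3
    if af >= 0.20:
        return 2
    return 1


def make_label(pos, funclass, af):
    """Concatenate POS, the FUNCLASS code, and the AF bucket into one label."""
    funclass = "NONE" if funclass == "." else funclass
    return f"{pos}{FUNCLASS_CODES[funclass]}{bucket_af(af)}"


# Hypothetical rows (Sample, POS, FUNCLASS, AF); not taken from the real datasets.
rows = [
    ("s1", 14408, "MISSENSE", 0.97),
    ("s1", 241, ".", 0.95),
    ("s2", 14408, "MISSENSE", 0.55),
]

# One row per sample, one 0/1 column per label, mirroring the pivot step.
labels = sorted({make_label(pos, fc, af) for _, pos, fc, af in rows})
present = {(s, make_label(pos, fc, af)) for s, pos, fc, af in rows}
samples = sorted({s for s, *_ in rows})
one_hot = {s: [int((s, lab) in present) for lab in labels] for s in samples}

print(labels)
print(one_hot)
```

Note that the same position with a different allele frequency bucket (`1440833` versus `1440832`) becomes a different item, so the two samples above share no items.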

In [None]:
def get_association_rules(in_file, min_support=0.20, min_confidence=0.80, min_lift=1.0, min_conviction=1.0):
  # Preprocess the input file
  pif = preprocess_input_file(in_file)

  # Get frequent item sets, with support of at least min_support, using the Apriori algorithm
  frequent_itemsets = apriori(pif, min_support=min_support, use_colnames=True)

  # Get association rules, with lift of at least min_lift
  rules = association_rules(frequent_itemsets, metric="lift", min_threshold=min_lift)

  # Filter association rules, keeping rules with confidence of at least min_confidence
  # and conviction of at least min_conviction
  rules = rules[ (rules['confidence'] >= min_confidence) & (rules['conviction'] >= min_conviction) ]

  return rules
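
The four thresholds passed to `get_association_rules` correspond to the standard rule metrics. As a minimal sketch, assuming the usual definitions (support of the combined itemset, confidence = support(A and C) / support(A), lift = confidence / support(C), conviction = (1 - support(C)) / (1 - confidence)), they can be computed by hand for one rule. The function name `rule_metrics` and the five transactions are hypothetical, and the antecedent is assumed to occur at least once.

```python
def rule_metrics(transactions, antecedent, consequent):
    """Compute support, confidence, lift, and conviction for antecedent -> consequent."""
    n = len(transactions)
    a, c = set(antecedent), set(consequent)

    # Fraction of transactions containing each itemset.
    support_a = sum(a <= t for t in transactions) / n
    support_c = sum(c <= t for t in transactions) / n
    support_ac = sum((a | c) <= t for t in transactions) / n

    confidence = support_ac / support_a
    lift = confidence / support_c
    # Conviction is infinite when the rule never fails.
    if confidence >= 1:
        conviction = float("inf")
    else:
        conviction = (1 - support_c) / (1 - confidence)

    return {
        "support": support_ac,
        "confidence": confidence,
        "lift": lift,
        "conviction": conviction,
    }


# Hypothetical samples, each a set of variant labels.
transactions = [
    {"1440833", "2340333"},
    {"1440833", "2340333"},
    {"1440833", "2340333"},
    {"1440833"},
    {"24143"},
]

metrics = rule_metrics(transactions, {"1440833"}, {"2340333"})
print({name: round(value, 3) for name, value in metrics.items()})
```

A lift above 1 means the consequent is more likely when the antecedent is present than it is overall, which is why the notebook filters on lift before applying the confidence and conviction thresholds.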

In [None]:
pd.set_option('display.max_columns', 9, 'display.expand_frame_repr', False)

bos_rules = get_association_rules(in_file="https://github.com/galaxyproject/SARS-CoV-2/raw/master/data/var/bos_by_sample.tsv.gz", min_support=0.223, min_confidence=0.905, min_lift=2.863, min_conviction=7.0)
num_rules = bos_rules.shape[0]
print('Boston dataset association rules: ')
print(bos_rules.head(num_rules))
print('\n\n')
# bos_rules.to_csv('/content/gdrive/MyDrive/Colab Notebooks/MBA_files/bos_0223_0905_2863_7000.csv', sep=',')

uke_rules = get_association_rules(in_file="https://github.com/galaxyproject/SARS-CoV-2/raw/master/data/var/cog_20200917_by_sample.tsv.gz", min_support=0.21, min_confidence=0.91, min_lift=2.5, min_conviction=7.0)
num_rules = uke_rules.shape[0]
print('UK early dataset association rules: ')
print(uke_rules.head(num_rules))
print('\n\n')
# uke_rules.to_csv('/content/gdrive/MyDrive/Colab Notebooks/MBA_files/uke_0210_0910_2500_7000.csv', sep=',')

ukl_rules = get_association_rules(in_file="https://github.com/galaxyproject/SARS-CoV-2/raw/master/data/var/cog_20201120_by_sample.tsv.gz", min_support=0.203, min_confidence=0.926, min_lift=2.03, min_conviction=7.0)
num_rules = ukl_rules.shape[0]
print('UK late dataset association rules: ')
print(ukl_rules.head(num_rules))
print('\n\n')
# ukl_rules.to_csv('/content/gdrive/MyDrive/Colab Notebooks/MBA_files/ukl_0203_0926_2030_7000.csv', sep=',')


---
**MBA parameters**

1.   **bos_rules**: *support*=0.223, *confidence*=0.905, *lift*=2.863, *conviction*=7.000
2.   **uke_rules**: *support*=0.210, *confidence*=0.910, *lift*=2.500, *conviction*=7.000
3.   **ukl_rules**: *support*=0.203, *confidence*=0.926, *lift*=2.030, *conviction*=7.000
---
The last digit of an entry encodes the allele frequency (**AF**) bucket

*   **3**: $\geq$ 0.80
*   **2**: $\geq$ 0.20 and $<$ 0.80
*   **1**: $<$ 0.20
---
The second-to-last digit encodes the functional class (**FUNCLASS**)

*  **4**: None (non-coding)
*  **3**: Missense
*  **2**: Nonsense
*  **1**: Silent
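
Reading an entry back into its parts follows directly from the two legends above: strip the last two digits to get the position, then look up the FUNCLASS and AF codes. A minimal sketch, where the function name `decode_label` is hypothetical and the lookup tables restate the legends:

```python
FUNCLASS_NAMES = {4: "None", 3: "Missense", 2: "Nonsense", 1: "Silent"}
AF_RANGES = {3: ">= 0.80", 2: ">= 0.20 and < 0.80", 1: "< 0.20"}


def decode_label(label):
    """Split a label into (position, functional class, allele frequency range)."""
    label = str(label)
    pos = int(label[:-2])        # everything except the last two digits
    funclass = int(label[-2])    # second-to-last digit
    af = int(label[-1])          # last digit
    return pos, FUNCLASS_NAMES[funclass], AF_RANGES[af]


print(decode_label("1440833"))
print(decode_label("24143"))
```

For example, `1440833` decodes to position 14408, a missense variant, with allele frequency of at least 0.80.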
---
**Boston association rules** (12 rules)

**Antecedent** entries

* 3037**13**  (5 times)
* 14408**33** (5 times)
* 23403**33** (5 times)
* 26542**31** (All 12)

**Consequent** entries

* 3037**13** (5 times)
* 14408**33** (5 times)
* 23403**33** (5 times)
* 26545**31** (All 12)
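
The counts in parentheses can be tallied programmatically, assuming they denote the number of rules whose antecedent (or consequent) contains the entry. A minimal sketch using `collections.Counter`; the function name `count_items` and the three itemsets are hypothetical and are not the actual Boston rules.

```python
from collections import Counter


def count_items(itemsets):
    """Count in how many itemsets each entry appears."""
    counts = Counter()
    for itemset in itemsets:
        counts.update(itemset)
    return counts


# Hypothetical antecedents, stored as frozensets of labels.
antecedents = [
    frozenset({"303713", "2654231"}),
    frozenset({"1440833", "2654231"}),
    frozenset({"2654231"}),
]

counts = count_items(antecedents)
for entry, n_rules in counts.most_common():
    print(entry, n_rules)
```

With the real rules, the same helper could be applied to the `antecedents` and `consequents` columns of the data frame returned by `get_association_rules`, for example `count_items(bos_rules["antecedents"])`.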

---
**UK early association rules** (10 rules)

**Antecedent** entries

* 241**43**   (9 times)
* 14408**33** (3 times)
* 23403**33** (4 times)
* 28881**32** (All 10)

**Consequent** entries

* 3037**12**  (All 10)
* 14408**33** (3 times)
* 23403**33** (3 times)
* 28883**33** (All 10)

---
**UK late association rules** (11 rules)

**Antecedent** entries

* 204**42** (All 11)
* 445**13** (All 11)
* 6286**13** (3 times)
* 21255**13** (All 11)
* 23403**33** (3 times)
* 28932**32** (1 time)

**Consequent** entries

* 6286**13** (5 times)
* 22227**32** (All 11)
* 23403**33** (5 times)
* 27944**13** (10 times)
* 28932**32** (9 times)

---

**Entries that show up in antecedents/consequents across datasets**

**Antecedent**

* 14408**33** (Boston, UKE)
* 23403**33** (Boston, UKE, UKL)

**Consequent**

* 3037**13** (Boston, UKE)
* 14408**33** (Boston, UKE)
* 23403**33** (Boston, UKE, UKL)
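
The cross-dataset comparison amounts to finding entries that occur in more than one dataset. A minimal sketch for the antecedent side, using the antecedent entries listed in the per-dataset summaries above; the function name `shared_entries` is hypothetical.

```python
# Antecedent entries per dataset, copied from the summaries above.
antecedents = {
    "Boston": {"303713", "1440833", "2340333", "2654231"},
    "UKE": {"24143", "1440833", "2340333", "2888132"},
    "UKL": {"20442", "44513", "628613", "2125513", "2340333", "2893232"},
}


def shared_entries(entries_by_dataset):
    """Return entries found in more than one dataset, with the datasets they occur in."""
    seen = {}
    for dataset, entries in entries_by_dataset.items():
        for entry in entries:
            seen.setdefault(entry, []).append(dataset)
    return {entry: datasets for entry, datasets in seen.items() if len(datasets) > 1}


for entry, datasets in sorted(shared_entries(antecedents).items()):
    print(entry, datasets)
```

Because the comparison is on the full label, an entry only counts as shared when position, functional class, and allele frequency bucket all match.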

