In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import grasp
from grasp import GrASP, CustomAttribute, remove_specialized_patterns
from sklearn.model_selection import train_test_split

## Load the data
- Download and unzip the spam dataset **if you have not done this before**

In [None]:
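import os
import urllib.request

url = 'http://www.dt.fee.unicamp.br/~tiago/smsspamcollection/smsspamcollection.zip'
filename = './data/smsspamcollection.zip'
os.makedirs(os.path.dirname(filename), exist_ok=True)  # urlretrieve does not create missing directories
urllib.request.urlretrieve(url, filename)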

In [None]:
!unzip ./data/smsspamcollection.zip -d ./data

- Load the data

In [3]:
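def get_data():
    """Read the SMS Spam Collection file and return (texts, labels), with ham = 0 and spam = 1."""
    label_ids = {'ham': 0, 'spam': 1}
    texts, labels = [], []
    # The corpus is UTF-8 encoded; the context manager closes the file after reading
    with open('data/SMSSpamCollection.txt', 'r', encoding='utf-8') as f:
        for line in f:
            # Each line has the form "<label>\t<text>"
            label, text = line.strip().split('\t', 1)
            if label not in label_ids:
                raise ValueError(f"Invalid label - {label}")
            texts.append(text)
            labels.append(label_ids[label])
    return texts, labels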

In [4]:
texts, labels = get_data()
X_train, X_test, y_train, y_test = train_test_split(texts, labels, test_size=0.2, random_state=1)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=1)
# len(texts), sum(labels), len(X_test), sum(y_test)
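
The two consecutive 80/20 splits leave roughly 64% of the data for training, 16% for validation, and 20% for testing. The sketch below reproduces the split sizes with plain arithmetic. It assumes the corpus holds 5,574 messages (check with `len(texts)`) and that scikit-learn rounds the test portion up; `split_sizes` is a hypothetical helper, not part of scikit-learn.

```python
import math

def split_sizes(n_samples, test_size):
    """Return (n_train, n_test) for a fractional test_size.

    Assumption: the test portion is rounded up and the rest goes to training.
    """
    n_test = math.ceil(test_size * n_samples)
    return n_samples - n_test, n_test

n_total = 5574  # assumed corpus size; verify with len(texts)
n_train_val, n_test = split_sizes(n_total, 0.2)   # first split: hold out the test set
n_train, n_val = split_sizes(n_train_val, 0.2)    # second split: hold out the validation set
print(f'train = {n_train}, val = {n_val}, test = {n_test}')
```

Under these assumptions the training set has 3,567 messages, which agrees with the positive and negative counts printed in the next cell.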

In [5]:
positive = [t for idx, t in enumerate(X_train) if y_train[idx]]
negative = [t for idx, t in enumerate(X_train) if not y_train[idx]]
print(f'Positive examples = {len(positive)}\nNegative examples = {len(negative)}')

Positive examples = 488
Negative examples = 3079
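
The training set is clearly imbalanced, which matters when reading the precision of the patterns later on. A quick calculation from the two counts above, as a minimal sketch:

```python
n_pos, n_neg = 488, 3079  # counts printed by the cell above
n_total = n_pos + n_neg

spam_share = n_pos / n_total
# Accuracy of a trivial classifier that always predicts the larger class
majority_baseline = max(n_pos, n_neg) / n_total

print(f'Spam share = {spam_share:.1%}')
print(f'Majority-class baseline accuracy = {majority_baseline:.1%}')
```

A pattern supporting the negative class is only informative if its precision exceeds this baseline share of negative examples.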


## Run GrASP

In [6]:
# Create the GrASP engine
grasp_model = GrASP(gaps_allowed = 2, num_patterns = 100, include_standard = ['TEXT', 'POS', 'HYPERNYM', 'SENTIMENT'],
                    correlation_threshold = 0.5, alphabet_size = 200)
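
To give an intuition for `gaps_allowed`, here is a minimal sketch of how a sequential pattern could be matched against an augmented text. It is a simplified reading, not the library's implementation: `matches_within_window` and the toy tokens are hypothetical, and the assumption that window size equals the number of slots plus `gaps_allowed` is inferred from the `Window size: 3` reported for one-slot patterns in the fit output.

```python
def matches_within_window(pattern, tokens, gaps_allowed=2):
    """Check whether `pattern` matches `tokens` inside one window.

    pattern: list of slots, each slot a list of attributes a token must carry.
    tokens:  list of attribute sets, one per token of the augmented text.
    Assumption: window size = number of slots + gaps_allowed.
    """
    window = len(pattern) + gaps_allowed

    def search(slot_idx, start, first):
        if slot_idx == len(pattern):
            return True
        # Once the first slot is placed, later slots must stay inside its window
        limit = len(tokens) if first is None else min(len(tokens), first + window)
        for pos in range(start, limit):
            if set(pattern[slot_idx]) <= tokens[pos]:
                anchor = pos if first is None else first
                if search(slot_idx + 1, pos + 1, anchor):
                    return True
        return False

    return search(0, 0, None)

# Hypothetical augmented text for "call us on 0800"
tokens = [
    {'TEXT:call', 'SPACY:POS-VERB'},
    {'TEXT:us', 'SPACY:POS-PRON'},
    {'TEXT:on', 'SPACY:POS-ADP'},
    {'TEXT:0800', 'SPACY:POS-NUM'},
]
pattern = [['TEXT:call'], ['SPACY:POS-NUM']]
print(matches_within_window(pattern, tokens, gaps_allowed=2))  # two gaps fit in the window
print(matches_within_window(pattern, tokens, gaps_allowed=1))  # window is one token too short
```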

In [7]:
# Fit GrASP to the dataset
the_patterns = grasp_model.fit_transform(positive, negative)

  0%|          | 0/488 [00:00<?, ?it/s]

Step 1: Create augmented texts


100%|██████████| 488/488 [00:20<00:00, 24.28it/s]
100%|██████████| 3079/3079 [01:22<00:00, 37.18it/s]


Step 2: Find frequent attributes


  1%|          | 8/1215 [00:00<00:16, 73.56it/s]

Total number of candidate alphabet = 1215, such as ['SPACY:POS-VERB', 'SPACY:POS-NOUN', 'SPACY:POS-PUNCT', 'SPACY:POS-PRON', 'SPACY:POS-ADV']
Step 3: Find alphabet set


100%|██████████| 1215/1215 [00:26<00:00, 45.70it/s]


Finding top k: 20 / 200
Finding top k: 40 / 200
Finding top k: 60 / 200
Finding top k: 80 / 200
Finding top k: 100 / 200
Finding top k: 120 / 200
Finding top k: 140 / 200
Finding top k: 160 / 200
Finding top k: 180 / 200
Finding top k: 200 / 200


  0%|          | 0/200 [00:00<?, ?it/s]

Total number of alphabet = 200
['SPACY:POS-NUM', 'SPACY:POS-PROPN', 'TEXT:call', 'TEXT:i', 'SPACY:POS-SYM', 'TEXT:free', 'TEXT:txt', 'TEXT:claim', 'TEXT:!', 'TEXT:mobile', 'HYPERNYM3:cost.n.01', 'SPACY:POS-ADP', 'HYPERNYM3:message.n.02', 'TEXT:to', 'TEXT:prize', 'HYPERNYM3:communication.n.02', 'HYPERNYM3:win.v.01', 'SENTIMENT:pos', 'HYPERNYM3:symbol.n.01', 'HYPERNYM3:statement.n.01', 'TEXT:your', 'SPACY:POS-ADJ', 'TEXT:or', 'TEXT:text', 'TEXT:stop', 'TEXT:-', 'TEXT:150p', 'TEXT:guaranteed', 'TEXT:urgent', 'HYPERNYM3:textbook.n.01', 'HYPERNYM3:act.n.02', 'TEXT:win', 'HYPERNYM3:challenge.v.01', 'TEXT:+', 'HYPERNYM3:written_communication.n.01', 'SPACY:POS-PRON', 'TEXT:16', 'TEXT:cash', 'HYPERNYM3:abstraction.n.06', 'TEXT:from', 'HYPERNYM3:assertion.n.01', 'TEXT:reply', 'TEXT:now', 'TEXT:.', 'TEXT:tone', 'TEXT:18', 'HYPERNYM3:acquisition.n.02', 'SPACY:POS-DET', 'TEXT:nokia', 'TEXT:a', 'HYPERNYM3:user.n.01', 'TEXT:our', 'HYPERNYM3:mobile.n.02', 'HYPERNYM3:minute.n.01', 'HYPERNYM3:person.n.0

100%|██████████| 200/200 [02:48<00:00,  1.18it/s]


Length 2 / 5; New candidates = 59900
Finding top k: 10 / 100
Finding top k: 20 / 100
Finding top k: 30 / 100
Finding top k: 40 / 100
Finding top k: 50 / 100
Finding top k: 60 / 100
Finding top k: 70 / 100
Finding top k: 80 / 100
Finding top k: 90 / 100
Finding top k: 100 / 100
Example of current patterns
Pattern: [['SPACY:POS-NUM']]
Window size: 3
Class: Negative
Precision: 0.526
Match: 936 (26.2%)
Gain = 0.223
Metric (global) = 0.223
Examples ~ Class Negative:
[MATCH]: Not planned yet : ) going to join company on jan 5:['SPACY:POS-NUM'] only.don know what will happen after that .
-------------------------
[MATCH]: I have 2:['SPACY:POS-NUM'] docs appointments next week.:/ I 'm tired of them shoving stuff up me . Ugh why could n't I have had a normal body ?
-------------------------
Counterexamples ~ Not class Negative:
[MATCH]: FreeMsg : Claim ur 250:['SPACY:POS-NUM'] SMS m

100%|██████████| 70/70 [02:10<00:00,  1.86s/it]


Length 3 / 5; New candidates = 27723
Finding top k: 10 / 100
Finding top k: 20 / 100
Finding top k: 30 / 100
Finding top k: 40 / 100
Finding top k: 50 / 100
Finding top k: 60 / 100
Finding top k: 70 / 100
Finding top k: 80 / 100
Finding top k: 90 / 100
Finding top k: 100 / 100
Example of current patterns
Pattern: [['SPACY:POS-NUM']]
Window size: 3
Class: Negative
Precision: 0.526
Match: 936 (26.2%)
Gain = 0.223
Metric (global) = 0.223
Examples ~ Class Negative:
[MATCH]: 1Apple:['SPACY:POS-NUM'] / Day = No Doctor . 1Tulsi Leaf / Day = No Cancer . 1Lemon / Day = No Fat . 1Cup Milk / day = No Bone Problms 3 Litres Watr / Day = No Diseases Snd ths 2 Whom U Care .. :- )
-------------------------
[MATCH]: HEY THERE BABE , HOW U DOIN ? WOT U UP 2:['SPACY:POS-NUM'] 2NITE LOVE ANNIE X.
-------------------------
Counterexamples ~ Not class Negative:
[MATCH]: Hello from Orange . For 1:['S

100%|██████████| 36/36 [01:03<00:00,  1.76s/it]


Length 4 / 5; New candidates = 14334
Finding top k: 10 / 100
Finding top k: 20 / 100
Finding top k: 30 / 100
Finding top k: 40 / 100
Finding top k: 50 / 100
Finding top k: 60 / 100
Finding top k: 70 / 100
Finding top k: 80 / 100
Finding top k: 90 / 100
Finding top k: 100 / 100
Example of current patterns
Pattern: [['SPACY:POS-NUM']]
Window size: 3
Class: Negative
Precision: 0.526
Match: 936 (26.2%)
Gain = 0.223
Metric (global) = 0.223
Examples ~ Class Negative:
[MATCH]: Can help u:['SPACY:POS-NUM'] swoop by picking u up from wherever ur other birds r meeting if u want .
-------------------------
[MATCH]: This is one:['SPACY:POS-NUM'] of the days you have a billion classes , right ?
-------------------------
Counterexamples ~ Not class Negative:
[MATCH]: SMS SERVICES . for your inclusive text credits , pls goto www.comuk.net login= 3qxj9:['SPACY:POS-NUM'] unsubscribe with ST

100%|██████████| 4/4 [00:07<00:00,  1.90s/it]


Length 5 / 5; New candidates = 1595
Finding top k: 10 / 100
Finding top k: 20 / 100
Finding top k: 30 / 100
Finding top k: 40 / 100
Finding top k: 50 / 100
Finding top k: 60 / 100
Finding top k: 70 / 100
Finding top k: 80 / 100
Finding top k: 90 / 100
Finding top k: 100 / 100
Example of current patterns
Pattern: [['SPACY:POS-NUM']]
Window size: 3
Class: Negative
Precision: 0.526
Match: 936 (26.2%)
Gain = 0.223
Metric (global) = 0.223
Examples ~ Class Negative:
[MATCH]: Ok no problem ... Yup i 'm going to sch at 4:['SPACY:POS-NUM'] if i rem correctly ...
-------------------------
[MATCH]: Single line with a big meaning : : : : : " Miss anything 4:['SPACY:POS-NUM'] ur " Best Life " but , do n't miss ur best life for anything ... Gud nyt ...
-------------------------
Counterexamples ~ Not class Negative:
[MATCH]: GENT ! We are trying to contact you . Last weekends draw shows that you won a

In [8]:
# Print the learned patterns
for idx, p in enumerate(the_patterns):
    print(f'Rank {idx+1}')
    print(p)

Rank 1
Pattern: [['SPACY:POS-NUM']]
Window size: 3
Class: Negative
Precision: 0.526
Match: 936 (26.2%)
Gain = 0.223
Metric (global) = 0.223
Examples ~ Class Negative:
[MATCH]: Ugh . Got ta drive back to sd:['SPACY:POS-NUM'] from la . My butt is sore .
-------------------------
[MATCH]: Haha ... Where got so fast lose weight , thk muz go 4:['SPACY:POS-NUM'] a month den got effect ... Gee , later we go aust put bk e weight .
-------------------------
Counterexamples ~ Not class Negative:
[MATCH]: Last Chance ! Claim ur £150 worth of discount vouchers today ! Text SHOP to 85023:['SPACY:POS-NUM'] now ! SavaMob , offers mobile ! T Cs SavaMob POBOX84 , M263UZ . £3.00 Sub . 16
-------------------------
[MATCH]: For ur chance to win a £250 wkly shopping spree TXT : SHOP to 80878:['SPACY:POS-NUM'] . T's&C 's www.txt-2-shop.com custcare 08715705022 , 1x1

In [11]:
print(f'  #    class Cov(%)    Prec    Gain    Pattern')
for idx, p in enumerate(the_patterns):
    print(f'{idx+1:>3} {p.support_class}   {round(p.coverage*100, 1):>4}   {p.precision:.3f}   {p.metric:.3f}    {p.get_pattern_id()}')

  #    class Cov(%)    Prec    Gain    Pattern
  1 Negative   26.2   0.526   0.223    [['SPACY:POS-NUM']]
  2 Positive    8.9   0.806   0.175    [['SPACY:POS-PROPN'], ['SPACY:POS-NUM']]
  3 Positive    5.1   0.984   0.152    [['TEXT:call'], ['SPACY:POS-NUM']]
  4 Positive   14.5   0.579   0.148    [['SPACY:POS-PROPN'], ['SPACY:POS-PROPN']]
  5 Positive   11.8   0.613   0.129    [['SPACY:POS-ADP'], ['SPACY:POS-NUM']]
  6 Positive    6.8   0.811   0.129    [['SPACY:POS-NUM'], ['SPACY:POS-PROPN']]
  7 Negative   18.1   0.526   0.119    [['SPACY:POS-NOUN'], ['SPACY:POS-PROPN']]
  8 Positive    5.4   0.856   0.114    [['TEXT:to'], ['SPACY:POS-NUM']]
  9 Negative   37.8   0.690   0.112    [['SPACY:POS-PROPN']]
 10 Positive    7.6   0.702   0.105    [['SPACY:POS-NOUN'], ['SPACY:POS-NOUN'], ['SPACY:POS-NUM']]
 11 Positive    7.1   0.722   0.104    [['SPACY:POS-PROPN'], ['SPACY:POS-PROPN'], ['SPACY:POS-PROPN']]
 12 Positive    5.7   0.797   0.101    [['TEXT:.'], ['SPACY:POS-NUM']]
 13 Positive 

## Post-process the patterns

In [12]:
# Keep only patterns whose precision is at least 0.70
selected_patterns = [p for p in the_patterns if p.precision >= 0.70]
print(f'No. of remaining patterns = {len(selected_patterns)}')

No. of remaining patterns = 54
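
For reference, the `Cov(%)` and `Prec` columns can be read as follows: coverage is the share of all training examples a pattern matches, and precision is the share of those matches that belong to the pattern's class. The sketch below recomputes both for the rank 1 pattern. Note the assumptions: `pattern_stats` is a hypothetical helper, the per-class match counts (444 positive, 492 negative) are back-calculated from the reported 936 matches and 0.526 precision rather than taken from the library, and choosing the class with more matches as the supported class is our reading of the output.

```python
def pattern_stats(pos_matches, neg_matches, n_pos, n_neg):
    """Return (support_class, coverage, precision) for one pattern."""
    total_matches = pos_matches + neg_matches
    coverage = total_matches / (n_pos + n_neg)
    # Assumption: the supported class is the one with more matched examples
    if pos_matches >= neg_matches:
        return 'Positive', coverage, pos_matches / total_matches
    return 'Negative', coverage, neg_matches / total_matches

# Back-calculated (assumed) match counts for [['SPACY:POS-NUM']]
support_class, coverage, precision = pattern_stats(
    pos_matches=444, neg_matches=492, n_pos=488, n_neg=3079)
print(f'{support_class}  coverage = {coverage:.1%}  precision = {precision:.3f}')
```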


In [13]:
# Remove every pattern p2 that is a specialization of another pattern p1 in the set
# and whose metric (here, precision) is lower than that of p1
selected_patterns = remove_specialized_patterns(selected_patterns, metric = lambda x: x.precision)
print(f'No. of remaining patterns = {len(selected_patterns)}')

No. of remaining patterns = 44
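
The idea behind this filter can be illustrated with a simplified sketch. It is not the code of `remove_specialized_patterns`: `ToyPattern`, `is_specialization`, and `remove_specialized` are hypothetical names, and we assume that p2 specializes p1 when the slots of p1 can be embedded, in order, into the slots of p2, with each slot of p1 being a subset of the slot it maps to. The first two precision values come from the results table; the third pattern is invented for illustration.

```python
from collections import namedtuple

ToyPattern = namedtuple('ToyPattern', ['slots', 'precision'])

def is_specialization(specific, general):
    """True if the slots of `general` embed, in order, into the slots of `specific`."""
    pos = 0
    for g_slot in general:
        # Greedily look for the next slot of `specific` that contains g_slot
        while pos < len(specific) and not set(g_slot) <= set(specific[pos]):
            pos += 1
        if pos == len(specific):
            return False
        pos += 1
    return True

def remove_specialized(patterns, metric):
    """Drop p2 if some other p1 generalizes it and scores strictly higher."""
    kept = []
    for p2 in patterns:
        dominated = any(
            p1 is not p2
            and is_specialization(p2.slots, p1.slots)
            and metric(p2) < metric(p1)
            for p1 in patterns
        )
        if not dominated:
            kept.append(p2)
    return kept

general = ToyPattern([['TEXT:call'], ['SPACY:POS-NUM']], 0.984)
better = ToyPattern([['TEXT:.'], ['TEXT:call'], ['SPACY:POS-NUM']], 0.985)
worse = ToyPattern([['TEXT:call'], ['SPACY:POS-NUM'], ['TEXT:now']], 0.900)  # invented example

kept = remove_specialized([general, better, worse], metric=lambda p: p.precision)
for p in kept:
    print(p.precision, p.slots)
```

Under these assumptions the more specific pattern with higher precision survives, which matches the results table, where both `[['TEXT:call'], ['SPACY:POS-NUM']]` and its specialization with a leading `TEXT:.` remain.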


In [14]:
# Print the remaining patterns sorted by precision
selected_patterns = sorted(selected_patterns, key = lambda x: x.precision, reverse = True)
print(f'  #    class Cov(%)    Prec  Gain    Pattern')
for idx, p in enumerate(selected_patterns):
    print(f'{idx+1:>3} {p.support_class}   {round(p.coverage*100, 1):>4}   {p.precision:.3f}   {p.metric:.3f}    {p.get_pattern_id()}')

  #    class Cov(%)    Prec  Gain    Pattern
  1 Positive    2.2   1.000   0.064    [['TEXT:claim']]
  2 Positive    1.6   1.000   0.046    [['TEXT:prize']]
  3 Positive    1.9   0.985   0.053    [['TEXT:.'], ['TEXT:call'], ['SPACY:POS-NUM']]
  4 Positive    5.1   0.984   0.152    [['TEXT:call'], ['SPACY:POS-NUM']]
  5 Negative   37.2   0.983   0.067    [['TEXT:i']]
  6 Positive    1.5   0.982   0.043    [['SPACY:POS-NOUN'], ['SPACY:POS-NOUN'], ['TEXT:to'], ['SPACY:POS-NUM']]
  7 Positive    2.4   0.942   0.060    [['TEXT:mobile']]
  8 Positive    2.7   0.938   0.067    [['TEXT:txt']]
  9 Positive    1.7   0.934   0.041    [['SPACY:POS-NUM'], ['TEXT:now']]
 10 Positive    3.4   0.926   0.082    [['SPACY:POS-PROPN'], ['SPACY:POS-ADP'], ['SPACY:POS-NUM']]
 11 Positive    2.2   0.923   0.052    [['HYPERNYM3:communication.n.02'], ['SPACY:POS-NUM']]
 12 Positive    2.2   0.922   0.051    [['HYPERNYM3:win.v.01', 'SENTIMENT:pos']]
 13 Positive    1.6   0.914   0.037    [['SPACY:POS-PROPN'], [

## Save the patterns to a JSON file
This JSON file can be loaded into the web demo tool to explore the learned patterns and the training data.

In [15]:
grasp_model.to_json('results/case_study_1.json', patterns = selected_patterns, comment = 'Rank and group patterns based on precision. The minimum precision was set at 0.70')

100%|██████████| 44/44 [00:00<00:00, 358.56it/s]
100%|██████████| 44/44 [00:00<00:00, 310.68it/s]


Successfully dump the results to results/case_study_1.json
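
To take a quick look at the saved file without the web demo tool, the standard `json` module is enough. The sketch below makes no assumption about the schema written by `to_json`; the hypothetical helper `describe_json` only reports the top-level structure, and the file is read only if it exists.

```python
import json
from pathlib import Path

def describe_json(path):
    """Load a JSON file and summarize its top-level structure."""
    with open(path, encoding='utf-8') as f:
        data = json.load(f)
    if isinstance(data, dict):
        # Map each top-level key to the type name of its value
        return {key: type(value).__name__ for key, value in data.items()}
    return type(data).__name__

results_path = Path('results/case_study_1.json')
if results_path.exists():
    print(describe_json(results_path))
```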
