In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import grasptext
from grasptext import GrASP
from sklearn.model_selection import train_test_split

## Load the data
- We use the SMS Spam Collection dataset from the following paper. Please download and unzip it by running the two cells below **only if you have not done this before**.

```
Almeida, T. A., Hidalgo, J. M. G., & Yamakami, A. (2011, September). Contributions to the study of SMS spam filtering: new collection and results. In Proceedings of the 11th ACM symposium on Document engineering (pp. 259-262).
```

In [None]:
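import os
import urllib.request

url = 'http://www.dt.fee.unicamp.br/~tiago/smsspamcollection/smsspamcollection.zip'
filename = './data/smsspamcollection.zip'
os.makedirs(os.path.dirname(filename), exist_ok=True)  # urlretrieve does not create missing directories
urllib.request.urlretrieve(url, filename)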

In [None]:
!unzip ./data/smsspamcollection.zip -d ./data
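
The "only if you have not done this before" condition can also be checked in code rather than by hand. The following is a minimal sketch, not part of the original notebook: `download_if_missing` is a hypothetical helper name, and it only uses the standard library.

```python
import os
import urllib.request


def download_if_missing(url, filename):
    """Download `url` to `filename` unless the file already exists.

    Returns True if a download happened, False if the file was already there.
    (Hypothetical helper for illustration; not part of grasptext.)
    """
    if os.path.exists(filename):
        return False
    # Create the target directory if needed; '.' covers bare file names.
    os.makedirs(os.path.dirname(filename) or '.', exist_ok=True)
    urllib.request.urlretrieve(url, filename)
    return True


# Example call with the URL and path used in this notebook
# (commented out here to avoid an accidental download):
# download_if_missing(
#     'http://www.dt.fee.unicamp.br/~tiago/smsspamcollection/smsspamcollection.zip',
#     './data/smsspamcollection.zip')
```

With a guard like this, the download cell can be re-run safely on every session.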

- Parse the file into a list of messages and a list of integer labels (0 = ham, 1 = spam)

In [5]:
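def get_data():
    """Return (texts, labels) from the SMS Spam Collection file; ham = 0, spam = 1."""
    label_ids = {'ham': 0, 'spam': 1}
    texts, labels = [], []
    # Each line is "<label>\t<message>". The messages contain non-ASCII characters
    # such as £, so read the file as UTF-8 rather than relying on the platform default.
    with open('./data/SMSSpamCollection.txt', 'r', encoding='utf-8') as f:
        for line in f:
            line = line.strip()
            tab_idx = line.index('\t')
            label = line[:tab_idx]
            text = line[tab_idx+1:]
            if label not in label_ids:
                raise ValueError(f"Invalid label - {label}")
            texts.append(text)
            labels.append(label_ids[label])
    return texts, labels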

In [6]:
texts, labels = get_data()
X_train, X_test, y_train, y_test = train_test_split(texts, labels, test_size=0.2, random_state=1)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=1)
# len(texts), sum(labels), len(X_test), sum(y_test)

In [7]:
positive = [t for idx, t in enumerate(X_train) if y_train[idx]]
negative = [t for idx, t in enumerate(X_train) if not y_train[idx]]
print(f'Positive examples = {len(positive)}\nNegative examples = {len(negative)}')

Positive examples = 488
Negative examples = 3079
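
The counts above show that the training split is imbalanced. As a small sketch (the helper name `class_balance` is an assumption, not something from the notebook or from grasptext), the share of each class can be computed from a label list like this:

```python
from collections import Counter


def class_balance(labels):
    """Map each label to (count, share of all examples)."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: (n, n / total) for label, n in sorted(counts.items())}


# Rebuild a label list with the counts printed above:
# 488 positive (spam = 1) and 3079 negative (ham = 0) training examples.
train_labels = [1] * 488 + [0] * 3079
for label, (n, share) in class_balance(train_labels).items():
    print(f'label {label}: {n} examples ({share:.1%})')
```

In the notebook itself, `class_balance(y_train)` would give the same breakdown. Roughly one training message in seven is spam, which is worth keeping in mind when reading the precision values reported later.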


## Run GrASP

In [8]:
# Create the GrASP engine
grasp_model = GrASP(include_standard = ['TEXT', 'POS', 'HYPERNYM', 'SENTIMENT'],
                    num_patterns = 100, gaps_allowed = 2)

In [9]:
# Fit GrASP to the dataset
the_patterns = grasp_model.fit_transform(positive, negative)

  0%|          | 0/488 [00:00<?, ?it/s]

Step 1: Create augmented texts


100%|██████████| 488/488 [00:22<00:00, 21.91it/s]
100%|██████████| 3079/3079 [01:42<00:00, 29.92it/s]


Step 2: Find frequent attributes


  1%|          | 7/1215 [00:00<00:20, 59.23it/s]

Total number of candidate alphabet = 1215, such as ['SPACY:POS-VERB', 'SPACY:POS-NOUN', 'SPACY:POS-PUNCT', 'SPACY:POS-PRON', 'SPACY:POS-ADV']
Step 3: Find alphabet set


100%|██████████| 1215/1215 [00:28<00:00, 42.92it/s]


Finding top k: 10 / 100
Finding top k: 20 / 100
Finding top k: 30 / 100
Finding top k: 40 / 100
Finding top k: 50 / 100
Finding top k: 60 / 100
Finding top k: 70 / 100
Finding top k: 80 / 100
Finding top k: 90 / 100


  0%|          | 0/100 [00:00<?, ?it/s]

Finding top k: 100 / 100
Total number of alphabet = 100
['SPACY:POS-NUM', 'SPACY:POS-PROPN', 'TEXT:call', 'TEXT:i', 'SPACY:POS-SYM', 'TEXT:free', 'TEXT:txt', 'TEXT:claim', 'TEXT:!', 'TEXT:mobile', 'HYPERNYM:cost.n.01', 'SPACY:POS-ADP', 'HYPERNYM:message.n.02', 'TEXT:to', 'TEXT:prize', 'HYPERNYM:communication.n.02', 'HYPERNYM:win.v.01', 'SENTIMENT:pos', 'HYPERNYM:symbol.n.01', 'HYPERNYM:statement.n.01', 'TEXT:your', 'SPACY:POS-ADJ', 'TEXT:or', 'TEXT:text', 'TEXT:stop', 'TEXT:-', 'TEXT:150p', 'TEXT:guaranteed', 'TEXT:urgent', 'HYPERNYM:textbook.n.01', 'HYPERNYM:act.n.02', 'TEXT:win', 'HYPERNYM:oppose.v.01', 'TEXT:+', 'HYPERNYM:written_communication.n.01', 'SPACY:POS-PRON', 'TEXT:16', 'TEXT:cash', 'HYPERNYM:abstraction.n.06', 'TEXT:from', 'HYPERNYM:assertion.n.01', 'TEXT:reply', 'TEXT:now', 'TEXT:.', 'TEXT:tone', 'TEXT:18', 'HYPERNYM:acquisition.n.02', 'SPACY:POS-DET', 'TEXT:nokia', 'TEXT:a', 'HYPERNYM:user.n.01', 'TEXT:our', 'HYPERNYM:mobile.n.02', 'HYPERNYM:minute.n.01', 'HYPERNYM:perso

100%|██████████| 100/100 [00:55<00:00,  1.79it/s]


Length 2 / 5; New candidates = 14950
Finding top k: 10 / 100
Finding top k: 20 / 100
Finding top k: 30 / 100
Finding top k: 40 / 100
Finding top k: 50 / 100
Finding top k: 60 / 100
Finding top k: 70 / 100
Finding top k: 80 / 100
Finding top k: 90 / 100
Finding top k: 100 / 100
Example of current patterns
Pattern: [['SPACY:POS-NUM']]
Window size: 3
Class: Negative
Precision: 0.526
Match: 936 (26.2%)
Gain = 0.223
Metric (global) = 0.223
Examples ~ Class Negative:
[MATCH]: HIYA COMIN 2:['SPACY:POS-NUM'] BRISTOL 1 ST WEEK IN APRIL . LES GOT OFF + RUDI ON NEW YRS EVE BUT I WAS SNORING.THEY WERE DRUNK ! U BAK AT COLLEGE YET ? MY WORK SENDS INK 2 BATH .
-------------------------
[MATCH]: Ok . But i finish at 6:['SPACY:POS-NUM'] .
-------------------------
Counterexamples ~ Not class Negative:
[MATCH]: -PLS STOP bootydelious ( 32/F:['SPACY:POS-NUM'] ) is inviting you to be her frie

100%|██████████| 68/68 [01:18<00:00,  1.16s/it]


Length 3 / 5; New candidates = 13344
Finding top k: 10 / 100
Finding top k: 20 / 100
Finding top k: 30 / 100
Finding top k: 40 / 100
Finding top k: 50 / 100
Finding top k: 60 / 100
Finding top k: 70 / 100
Finding top k: 80 / 100
Finding top k: 90 / 100
Finding top k: 100 / 100
Example of current patterns
Pattern: [['SPACY:POS-NUM']]
Window size: 3
Class: Negative
Precision: 0.526
Match: 936 (26.2%)
Gain = 0.223
Metric (global) = 0.223
Examples ~ Class Negative:
[MATCH]: We made it ! Eta at taunton is 12:30:['SPACY:POS-NUM'] as planned , hope that‘s still okday ? ! Good to see you ! : -xx
-------------------------
[MATCH]: Me also da , i feel yesterday night   wait til 2day:['SPACY:POS-NUM'] night dear .
-------------------------
Counterexamples ~ Not class Negative:
[MATCH]: Urgent ! call 09061749602:['SPACY:POS-NUM'] from Landline . Your complimentary 4 * Tenerife Holida

100%|██████████| 36/36 [00:37<00:00,  1.04s/it]


Length 4 / 5; New candidates = 7137
Finding top k: 10 / 100
Finding top k: 20 / 100
Finding top k: 30 / 100
Finding top k: 40 / 100
Finding top k: 50 / 100
Finding top k: 60 / 100
Finding top k: 70 / 100
Finding top k: 80 / 100
Finding top k: 90 / 100


  0%|          | 0/5 [00:00<?, ?it/s]

Finding top k: 100 / 100
Example of current patterns
Pattern: [['SPACY:POS-NUM']]
Window size: 3
Class: Negative
Precision: 0.526
Match: 936 (26.2%)
Gain = 0.223
Metric (global) = 0.223
Examples ~ Class Negative:
[MATCH]: , ,   and   picking them up from various points | going 2:['SPACY:POS-NUM'] yeovil | and they will do the motor project 4 3 hours | and then u take them home . || 12 2 5.30 max . || Very easy
-------------------------
[MATCH]: Wot about on we d nite I am 3:['SPACY:POS-NUM'] then but only til 9 !
-------------------------
Counterexamples ~ Not class Negative:
[MATCH]: FREE MESSAGE Activate your 500:['SPACY:POS-NUM'] FREE Text Messages by replying to this message with the word FREE For terms & conditions , visit www.07781482378.com
-------------------------
[MATCH]: from www . Applausestore.com MonthlySubscription@50p / msg max6/month T&CsC web

100%|██████████| 5/5 [00:05<00:00,  1.03s/it]


Length 5 / 5; New candidates = 994
Finding top k: 10 / 100
Finding top k: 20 / 100
Finding top k: 30 / 100
Finding top k: 40 / 100
Finding top k: 50 / 100
Finding top k: 60 / 100
Finding top k: 70 / 100
Finding top k: 80 / 100
Finding top k: 90 / 100
Finding top k: 100 / 100
Example of current patterns
Pattern: [['SPACY:POS-NUM']]
Window size: 3
Class: Negative
Precision: 0.526
Match: 936 (26.2%)
Gain = 0.223
Metric (global) = 0.223
Examples ~ Class Negative:
[MATCH]: True . It is passable . And if you get a high score and apply for phd , you get 5years:['SPACY:POS-NUM'] of salary . So it makes life easier .
-------------------------
[MATCH]: Rose for red , red for blood , blood for heart , heart for u. But u for me .... Send tis to all ur friends .. Including me .. If u like me .. If u get back , 1-u:['SPACY:POS-NUM'] r poor in relation ! 2-u need some 1 to support 3-u r frnd 2 many 4-some1 luvs u 5 + - some1 is pra

In [10]:
# Print the learned patterns
for idx, p in enumerate(the_patterns):
    print(f'Rank {idx+1}')
    print(p)

Rank 1
Pattern: [['SPACY:POS-NUM']]
Window size: 3
Class: Negative
Precision: 0.526
Match: 936 (26.2%)
Gain = 0.223
Metric (global) = 0.223
Examples ~ Class Negative:
[MATCH]: 8:['SPACY:POS-NUM'] at the latest , g 's still there if you can scrounge up some ammo and want to give the new ak a try
-------------------------
[MATCH]: Ok . Not much to do here though . H&M Friday , ca nt wait . Dunno wot the hell i m gon na do for another 3:['SPACY:POS-NUM'] weeks ! Become a slob- oh wait , already done that !
-------------------------
Counterexamples ~ Not class Negative:
[MATCH]: Great News ! Call FREEFONE 08006344447:['SPACY:POS-NUM'] to claim your guaranteed £1000 CASH or £2000 gift . Speak to a live operator NOW !
-------------------------
[MATCH]: You have 1:['SPACY:POS-NUM'] new voicemail . Please call 08719181503
-------------------------
Rank

In [11]:
print(f'  #    class Cov(%)    Prec    Gain    Pattern')
for idx, p in enumerate(the_patterns):
    print(f'{idx+1:>3} {p.support_class}   {round(p.coverage*100, 1):>4}   {p.precision:.3f}   {p.metric:.3f}    {p.get_pattern_id()}')

  #    class Cov(%)    Prec    Gain    Pattern
  1 Negative   26.2   0.526   0.223    [['SPACY:POS-NUM']]
  2 Positive    8.9   0.806   0.175    [['SPACY:POS-PROPN'], ['SPACY:POS-NUM']]
  3 Positive    5.1   0.984   0.152    [['TEXT:call'], ['SPACY:POS-NUM']]
  4 Positive   14.5   0.579   0.148    [['SPACY:POS-PROPN'], ['SPACY:POS-PROPN']]
  5 Positive   11.8   0.613   0.129    [['SPACY:POS-ADP'], ['SPACY:POS-NUM']]
  6 Positive    6.8   0.811   0.129    [['SPACY:POS-NUM'], ['SPACY:POS-PROPN']]
  7 Negative   18.1   0.526   0.119    [['SPACY:POS-NOUN'], ['SPACY:POS-PROPN']]
  8 Positive    5.4   0.856   0.114    [['TEXT:to'], ['SPACY:POS-NUM']]
  9 Negative   37.8   0.690   0.112    [['SPACY:POS-PROPN']]
 10 Positive    7.6   0.702   0.105    [['SPACY:POS-NOUN'], ['SPACY:POS-NOUN'], ['SPACY:POS-NUM']]
 11 Positive    7.1   0.722   0.104    [['SPACY:POS-PROPN'], ['SPACY:POS-PROPN'], ['SPACY:POS-PROPN']]
 12 Positive    5.7   0.797   0.101    [['TEXT:.'], ['SPACY:POS-NUM']]
 13 Positive 
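
The `Cov(%)`, `Prec` and `Gain` columns can be related to raw match counts. The sketch below is an illustration under stated assumptions, not grasptext's own code: it assumes coverage is the share of training texts a pattern matches, precision is the share of matched texts in the pattern's majority class, and gain is information gain in bits. The 444 / 492 split of the 936 matches for rank 1 is back-calculated from the reported precision and is not printed by the notebook.

```python
import math


def entropy(pos, neg):
    """Binary entropy, in bits, of a set with `pos` positive and `neg` negative items."""
    total = pos + neg
    h = 0.0
    for n in (pos, neg):
        if n:
            p = n / total
            h -= p * math.log2(p)
    return h


def pattern_stats(match_pos, match_neg, total_pos, total_neg):
    """Return (coverage, precision, information gain) for one pattern.

    Assumed definitions (see the text above); names are hypothetical.
    """
    matched = match_pos + match_neg
    total = total_pos + total_neg
    if matched == 0:
        return 0.0, 0.0, 0.0
    coverage = matched / total
    precision = max(match_pos, match_neg) / matched
    # Entropy remaining after splitting the data into matched / unmatched texts.
    miss_pos, miss_neg = total_pos - match_pos, total_neg - match_neg
    remainder = (matched * entropy(match_pos, match_neg)
                 + (total - matched) * entropy(miss_pos, miss_neg)) / total
    gain = entropy(total_pos, total_neg) - remainder
    return coverage, precision, gain


# Rank 1, [['SPACY:POS-NUM']]: 936 matches out of 488 + 3079 training texts.
# The 444 positive / 492 negative split is an inferred value, not notebook output.
coverage, precision, gain = pattern_stats(444, 492, 488, 3079)
print(f'coverage={coverage:.1%}  precision={precision:.3f}  gain={gain:.3f}')
```

Under these assumptions the numbers come out close to the rank 1 row of the table, which suggests that a pattern can have a high gain even when its precision is barely above 0.5, as long as it separates the classes much better than the base rate does.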

## Post-process the patterns

In [12]:
# Keep only patterns whose precision is at least 0.70
selected_patterns = [p for p in the_patterns if p.precision >= 0.70]
print(f'No. of remaining patterns = {len(selected_patterns)}')

No. of remaining patterns = 57


In [14]:
# Remove a pattern p2 if the set contains another pattern p1 such that
# p2 is a specialization of p1 and the metric (here, precision) of p2 is lower than that of p1
selected_patterns = grasptext.remove_specialized_patterns(selected_patterns, metric = lambda x: x.precision)
print(f'No. of remaining patterns = {len(selected_patterns)}')

No. of remaining patterns = 47
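
To make the pruning rule concrete, here is a simplified sketch. It is not the implementation of `grasptext.remove_specialized_patterns`: it assumes that a pattern is a list of slots (each slot a list of attributes), that p2 specializes p1 when p1's slots appear in order inside p2 with each slot contained in its counterpart, and it ignores window size and gaps. The function names are hypothetical.

```python
def is_specialization(specific, general):
    """True if every slot of `general` is matched, in order, by a superset slot of `specific`."""
    remaining = iter(specific)
    # any() consumes `remaining` up to the first matching slot, so order is preserved.
    return all(any(set(g) <= set(s) for s in remaining) for g in general)


def remove_specialized(patterns):
    """Drop (slots, score) pairs for which a strictly more general pattern scores higher."""
    kept = []
    for slots2, score2 in patterns:
        dominated = any(
            slots1 != slots2 and is_specialization(slots2, slots1) and score2 < score1
            for slots1, score1 in patterns
        )
        if not dominated:
            kept.append((slots2, score2))
    return kept


patterns = [
    # The first two patterns and their precision values come from the notebook output.
    ([['TEXT:call'], ['SPACY:POS-NUM']], 0.984),
    ([['TEXT:.'], ['TEXT:call'], ['SPACY:POS-NUM']], 0.985),
    # Hypothetical pattern and score, added only to show a removal.
    ([['TEXT:call'], ['SPACY:POS-NUM'], ['TEXT:now']], 0.900),
]
for slots, score in remove_specialized(patterns):
    print(f'{score:.3f}  {slots}')
```

The second pattern is more specific than the first but has a higher precision, so it survives. The hypothetical third pattern adds a slot and loses precision, so it is dropped.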


In [15]:
# Print the remaining patterns sorted by precision
selected_patterns = sorted(selected_patterns, key = lambda x: x.precision, reverse = True)
print(f'  #    class Cov(%)    Prec  Gain    Pattern')
for idx, p in enumerate(selected_patterns):
    print(f'{idx+1:>3} {p.support_class}   {round(p.coverage*100, 1):>4}   {p.precision:.3f}   {p.metric:.3f}    {p.get_pattern_id()}')

  #    class Cov(%)    Prec  Gain    Pattern
  1 Positive    2.2   1.000   0.064    [['TEXT:claim']]
  2 Positive    1.6   1.000   0.046    [['TEXT:prize']]
  3 Positive    1.2   1.000   0.036    [['TEXT:call'], ['SPACY:POS-NUM'], ['SPACY:POS-ADP']]
  4 Positive    1.2   1.000   0.036    [['SPACY:POS-PROPN'], ['SPACY:POS-NOUN'], ['TEXT:.'], ['SPACY:POS-NUM']]
  5 Positive    1.9   0.985   0.053    [['TEXT:.'], ['TEXT:call'], ['SPACY:POS-NUM']]
  6 Positive    5.1   0.984   0.152    [['TEXT:call'], ['SPACY:POS-NUM']]
  7 Negative   37.2   0.983   0.067    [['TEXT:i']]
  8 Positive    1.5   0.982   0.043    [['SPACY:POS-NOUN'], ['SPACY:POS-NOUN'], ['TEXT:to'], ['SPACY:POS-NUM']]
  9 Positive    2.4   0.942   0.060    [['TEXT:mobile']]
 10 Positive    2.7   0.938   0.067    [['TEXT:txt']]
 11 Positive    1.7   0.934   0.041    [['SPACY:POS-NUM'], ['TEXT:now']]
 12 Positive    3.4   0.926   0.082    [['SPACY:POS-PROPN'], ['SPACY:POS-ADP'], ['SPACY:POS-NUM']]
 13 Positive    2.2   0.923   0

## Save the patterns to a JSON file
This JSON file can be used as input to the web demo tool for exploring the learned patterns and the training data.

In [16]:
grasp_model.to_json('results/case_study_1.json', patterns = selected_patterns, comment = 'Rank and group patterns based on precision. The minimum precision was set at 0.70')

100%|██████████| 47/47 [00:00<00:00, 372.92it/s]
100%|██████████| 47/47 [00:00<00:00, 333.53it/s]


Successfully dump the results to results/case_study_1.json
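
Before loading the file into the web demo tool, it can be useful to check that it was written and is valid JSON. The sketch below makes no assumption about the schema that `to_json` produces; it only reports the top-level structure. `describe_json` is a hypothetical helper name.

```python
import json


def describe_json(path):
    """Load a JSON file and summarize its top level without assuming a schema.

    Returns {key: type name} for an object, or the type name otherwise.
    """
    with open(path, 'r', encoding='utf-8') as f:
        data = json.load(f)
    if isinstance(data, dict):
        return {key: type(value).__name__ for key, value in data.items()}
    return type(data).__name__


# Example call with the path used above (requires the file to exist):
# print(describe_json('results/case_study_1.json'))
```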
