In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import grasp
from grasp import GrASP
from sklearn.model_selection import train_test_split

## Load the data
- We use the SMS Spam Collection dataset from the following paper. Please download and unzip it by running the two cells below **only if you have not done this before**.

```
Almeida, T. A., Hidalgo, J. M. G., & Yamakami, A. (2011, September). Contributions to the study of SMS spam filtering: new collection and results. In Proceedings of the 11th ACM symposium on Document engineering (pp. 259-262).
```

In [None]:
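import os
import urllib.request

url = 'http://www.dt.fee.unicamp.br/~tiago/smsspamcollection/smsspamcollection.zip'
filename = './data/smsspamcollection.zip'
os.makedirs(os.path.dirname(filename), exist_ok=True)  # urlretrieve does not create missing directories
urllib.request.urlretrieve(url, filename)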

In [None]:
!unzip ./data/smsspamcollection.zip -d ./data
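If the `unzip` command is not available (for example on Windows), the archive can also be extracted from Python. The cell below is a minimal sketch using the standard-library `zipfile` module; the helper name `extract_archive` is ours and not part of the notebook or of GrASP.

```python
import zipfile
from pathlib import Path

def extract_archive(zip_path: str, out_dir: str) -> list:
    """Extract every member of `zip_path` into `out_dir` and return the member names."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)  # create the target folder if needed
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(out_dir)
        return archive.namelist()

# Usage in this notebook (same paths as the shell command above):
# extract_archive('./data/smsspamcollection.zip', './data')
```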

- Load the data

In [3]:
def get_data(path='data/SMSSpamCollection.txt'):
    """Read tab-separated `label<TAB>text` lines; return texts and 0/1 labels (1 = spam)."""
    label_map = {'ham': 0, 'spam': 1}
    texts, labels = [], []
    # The context manager closes the file; UTF-8 is stated explicitly to avoid platform-dependent decoding
    with open(path, 'r', encoding='utf-8') as f:
        for line in f:
            label, text = line.strip().split('\t', 1)
            if label not in label_map:
                raise ValueError(f"Invalid label - {label}")
            texts.append(text)
            labels.append(label_map[label])
    return texts, labels

In [4]:
texts, labels = get_data()
X_train, X_test, y_train, y_test = train_test_split(texts, labels, test_size=0.2, random_state=1)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=1)
# len(texts), sum(labels), len(X_test), sum(y_test)

In [5]:
positive = [t for idx, t in enumerate(X_train) if y_train[idx]]
negative = [t for idx, t in enumerate(X_train) if not y_train[idx]]
print(f'Positive examples = {len(positive)}\nNegative examples = {len(negative)}')

Positive examples = 488
Negative examples = 3079
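
- Splitting off 20% twice leaves roughly 64% of the data for training, 16% for validation and 20% for testing. The sketch below reproduces that arithmetic without scikit-learn. It assumes the held-out part is rounded up to a whole example, and the figure of 5,574 messages is the size usually reported for this collection rather than something printed in this notebook. Under those assumptions the training size comes out at 3567, which equals 488 + 3079 above. The helper name `split_sizes` is ours.

```python
def split_sizes(n_samples: int, holdout_pct: int = 20):
    """Return (n_train, n_val, n_test) for two successive holdout splits.

    Assumption: the held-out part is rounded up to a whole example.
    """
    def holdout(n):
        return (n * holdout_pct + 99) // 100  # integer ceiling of n * pct / 100

    n_test = holdout(n_samples)   # first split: test set
    n_rest = n_samples - n_test
    n_val = holdout(n_rest)       # second split: validation set
    return n_rest - n_val, n_val, n_test

print(split_sizes(1000))   # (640, 160, 200)
print(split_sizes(5574))   # (3567, 892, 1115)
```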


## Run GrASP

In [6]:
# Create the GrASP engine
grasp_model = GrASP(include_standard = ['TEXT', 'POS', 'HYPERNYM', 'SENTIMENT'],
                    num_patterns = 100, gaps_allowed = 2)

In [7]:
# Fit GrASP to the dataset
the_patterns = grasp_model.fit_transform(positive, negative)

  0%|          | 0/488 [00:00<?, ?it/s]

Step 1: Create augmented texts


100%|██████████| 488/488 [00:20<00:00, 23.36it/s]
100%|██████████| 3079/3079 [01:26<00:00, 35.76it/s]


Step 2: Find frequent attributes


  1%|          | 7/1215 [00:00<00:18, 64.99it/s]

Total number of candidate alphabet = 1215, such as ['SPACY:POS-VERB', 'SPACY:POS-NOUN', 'SPACY:POS-PUNCT', 'SPACY:POS-PRON', 'SPACY:POS-ADV']
Step 3: Find alphabet set


100%|██████████| 1215/1215 [00:27<00:00, 44.34it/s]


Finding top k: 10 / 100
Finding top k: 20 / 100
Finding top k: 30 / 100
Finding top k: 40 / 100
Finding top k: 50 / 100
Finding top k: 60 / 100
Finding top k: 70 / 100
Finding top k: 80 / 100
Finding top k: 90 / 100


  0%|          | 0/100 [00:00<?, ?it/s]

Finding top k: 100 / 100
Total number of alphabet = 100
['SPACY:POS-NUM', 'SPACY:POS-PROPN', 'TEXT:call', 'TEXT:i', 'SPACY:POS-SYM', 'TEXT:free', 'TEXT:txt', 'TEXT:claim', 'TEXT:!', 'TEXT:mobile', 'HYPERNYM:cost.n.01', 'SPACY:POS-ADP', 'HYPERNYM:message.n.02', 'TEXT:to', 'TEXT:prize', 'HYPERNYM:communication.n.02', 'HYPERNYM:win.v.01', 'SENTIMENT:pos', 'HYPERNYM:symbol.n.01', 'HYPERNYM:statement.n.01', 'TEXT:your', 'SPACY:POS-ADJ', 'TEXT:or', 'TEXT:text', 'TEXT:stop', 'TEXT:-', 'TEXT:150p', 'TEXT:guaranteed', 'TEXT:urgent', 'HYPERNYM:textbook.n.01', 'HYPERNYM:act.n.02', 'TEXT:win', 'HYPERNYM:contest.v.01', 'TEXT:+', 'HYPERNYM:written_communication.n.01', 'SPACY:POS-PRON', 'TEXT:16', 'TEXT:cash', 'HYPERNYM:abstraction.n.06', 'TEXT:from', 'HYPERNYM:assertion.n.01', 'TEXT:reply', 'TEXT:now', 'TEXT:.', 'TEXT:tone', 'TEXT:18', 'HYPERNYM:acquisition.n.02', 'SPACY:POS-DET', 'TEXT:nokia', 'TEXT:a', 'HYPERNYM:user.n.01', 'TEXT:our', 'HYPERNYM:mobile.n.02', 'HYPERNYM:minute.n.01', 'HYPERNYM:pers

100%|██████████| 100/100 [00:55<00:00,  1.80it/s]


Length 2 / 5; New candidates = 14950
Finding top k: 10 / 100
Finding top k: 20 / 100
Finding top k: 30 / 100
Finding top k: 40 / 100
Finding top k: 50 / 100
Finding top k: 60 / 100
Finding top k: 70 / 100
Finding top k: 80 / 100
Finding top k: 90 / 100
Finding top k: 100 / 100
Example of current patterns
Pattern: [['SPACY:POS-NUM']]
Window size: 3
Class: Negative
Precision: 0.526
Match: 936 (26.2%)
Gain = 0.223
Metric (global) = 0.223
[5m[7m[32mExamples[0m ~ Class Negative:
[5m[32m[MATCH][0m: I thk Ã¼ got ta go home by urself . Cos i 'll b going out shopping [7m[36m4:['SPACY:POS-NUM'][0m my frens present . 
-------------------------
[5m[32m[MATCH][0m: Ill be at yours in about [7m[36m3:['SPACY:POS-NUM'][0m mins but look out for me 
-------------------------
[5m[7m[31mCounterexamples[0m ~ Not class Negative:
[5m[32m[MATCH][0m: [7m[36m88066:['SPACY:POS-NUM'][0m FROM 88066 LOST 3POUND HELP 
-------------------------
[5m[32m[MATCH][0m: FREE MESSAGE Activate your

100%|██████████| 68/68 [00:59<00:00,  1.14it/s]


Length 3 / 5; New candidates = 13344
Finding top k: 10 / 100
Finding top k: 20 / 100
Finding top k: 30 / 100
Finding top k: 40 / 100
Finding top k: 50 / 100
Finding top k: 60 / 100
Finding top k: 70 / 100
Finding top k: 80 / 100
Finding top k: 90 / 100


  0%|          | 0/36 [00:00<?, ?it/s]

Finding top k: 100 / 100
Example of current patterns
Pattern: [['SPACY:POS-NUM']]
Window size: 3
Class: Negative
Precision: 0.526
Match: 936 (26.2%)
Gain = 0.223
Metric (global) = 0.223
[5m[7m[32mExamples[0m ~ Class Negative:
[5m[32m[MATCH][0m: Thank You for calling . Forgot to say Happy Onam to you Sirji . I am fine here and remembered you when i met an insurance person . Meet You in Qatar Insha Allah . Rakhesh , [7m[36mex:['SPACY:POS-NUM'][0m Tata AIG who joined TISSCO , Tayseer . 
-------------------------
[5m[32m[MATCH][0m: Ever green quote ever told by Jerry in cartoon " A Person Who Irritates u Always Is the [7m[36mone:['SPACY:POS-NUM'][0m Who Loves u Vry Much But Fails to Express It ... ! .. ! ! :-) :-) gud nyt 
-------------------------
[5m[7m[31mCounterexamples[0m ~ Not class Negative:
[5m[32m[MATCH][0m: Guess what ! Somebody you know secretly fancies you ! Wanna find out who it is ? Give us a call on [7m[36m09065394514:['SPACY:POS-NUM'][0m From Landl

100%|██████████| 36/36 [00:30<00:00,  1.18it/s]


Length 4 / 5; New candidates = 7137
Finding top k: 10 / 100
Finding top k: 20 / 100
Finding top k: 30 / 100
Finding top k: 40 / 100
Finding top k: 50 / 100
Finding top k: 60 / 100
Finding top k: 70 / 100
Finding top k: 80 / 100
Finding top k: 90 / 100


  0%|          | 0/5 [00:00<?, ?it/s]

Finding top k: 100 / 100
Example of current patterns
Pattern: [['SPACY:POS-NUM']]
Window size: 3
Class: Negative
Precision: 0.526
Match: 936 (26.2%)
Gain = 0.223
Metric (global) = 0.223
[5m[7m[32mExamples[0m ~ Class Negative:
[5m[32m[MATCH][0m: I love [7m[36mu:['SPACY:POS-NUM'][0m 2 babe ! R u sure everything is alrite . Is he being an idiot ? Txt bak girlie 
-------------------------
[5m[32m[MATCH][0m: No dice , art class [7m[36m6:['SPACY:POS-NUM'][0m thru 9 :( thanks though . Any idea what time I should come tomorrow ? 
-------------------------
[5m[7m[31mCounterexamples[0m ~ Not class Negative:
[5m[32m[MATCH][0m: Can U get [7m[36m2:['SPACY:POS-NUM'][0m phone NOW ? I wanna chat 2 set up meet Call me NOW on 09096102316 U can cum here 2moro Luv JANE xx CallsÂ£1/minmoremobsEMSPOBox45PO139WA 
-------------------------
[5m[32m[MATCH][0m: SMSSERVICES . for yourinclusive text credits , pls goto www.comuk.net login= [7m[36m3qxj9:['SPACY:POS-NUM'][0m unsubscrib

100%|██████████| 5/5 [00:04<00:00,  1.20it/s]


Length 5 / 5; New candidates = 994
Finding top k: 10 / 100
Finding top k: 20 / 100
Finding top k: 30 / 100
Finding top k: 40 / 100
Finding top k: 50 / 100
Finding top k: 60 / 100
Finding top k: 70 / 100
Finding top k: 80 / 100
Finding top k: 90 / 100
Finding top k: 100 / 100
Example of current patterns
Pattern: [['SPACY:POS-NUM']]
Window size: 3
Class: Negative
Precision: 0.526
Match: 936 (26.2%)
Gain = 0.223
Metric (global) = 0.223
[5m[7m[32mExamples[0m ~ Class Negative:
[5m[32m[MATCH][0m: Your opinion about me ? [7m[36m1:['SPACY:POS-NUM'][0m . Over 2 . Jada 3 . Kusruthi 4 . Lovable 5 . Silent 6 . Spl character 7 . Not matured 8 . Stylish 9 . Simple Pls reply .. 
-------------------------
[5m[32m[MATCH][0m: Yup i thk cine is better cos no need [7m[36m2:['SPACY:POS-NUM'][0m go down 2 plaza mah . 
-------------------------
[5m[7m[31mCounterexamples[0m ~ Not class Negative:
[5m[32m[MATCH][0m: 1st wk FREE ! Gr8 tones str8 [7m[36m2:['SPACY:POS-NUM'][0m u each wk .

In [8]:
# Print the learned patterns
for idx, p in enumerate(the_patterns):
    print(f'Rank {idx+1}')
    print(p)

Rank 1
Pattern: [['SPACY:POS-NUM']]
Window size: 3
Class: Negative
Precision: 0.526
Match: 936 (26.2%)
Gain = 0.223
Metric (global) = 0.223
[5m[7m[32mExamples[0m ~ Class Negative:
[5m[32m[MATCH][0m: Not planned yet : ) going to join company on jan [7m[36m5:['SPACY:POS-NUM'][0m only.don know what will happen after that . 
-------------------------
[5m[32m[MATCH][0m: Ard [7m[36m6:['SPACY:POS-NUM'][0m like dat lor . 
-------------------------
[5m[7m[31mCounterexamples[0m ~ Not class Negative:
[5m[32m[MATCH][0m: sexy sexy cum and text me i m wet and warm and ready for some porn ! u up for some fun ? THIS MSG IS FREE RECD MSGS 150P INC VAT [7m[36m2:['SPACY:POS-NUM'][0m CANCEL TEXT STOP 
-------------------------
[5m[32m[MATCH][0m: Someone has contacted our dating service and entered your phone because they fancy you ! To find out who it is call from a landline [7m[36m09111032124:['SPACY:POS-NUM'][0m . PoBox12n146tf150p 
-------------------------
Rank 2
Patter

Rank 86
Pattern: [['SPACY:POS-PROPN'], ['SPACY:POS-NOUN'], ['TEXT:.']]
Window size: 5
Class: Positive
Precision: 0.621
Match: 145 (4.1%)
Gain = 0.039
Metric (global) = 0.039
[5m[7m[32mExamples[0m ~ Class Positive:
[5m[32m[MATCH][0m: Congrats ! 2 mobile 3 G [7m[36mVideophones:['SPACY:POS-PROPN'][0m [7m[36mR:['SPACY:POS-NOUN'][0m yours [7m[36m.:['TEXT:.'][0m call 09061744553 now ! videochat wid ur mates , play java games , Dload polyH music , noline rentl . bx420 . ip4 . 5we . 150pm 
-------------------------
[5m[32m[MATCH][0m: PRIVATE ! Your 2003 Account Statement for 07815296484 shows 800 un - redeemed [7m[36mS.I.M.:['SPACY:POS-PROPN'][0m [7m[36mpoints:['SPACY:POS-NOUN'][0m [7m[36m.:['TEXT:.'][0m Call 08718738001 Identifier Code 41782 Expires 18/11/04 
-------------------------
[5m[7m[31mCounterexamples[0m ~ Not class Positive:
[5m[32m[MATCH][0m: [7m[36mTee:['SPACY:POS-PROPN'][0m [7m[36mhee:['SPACY:POS-NOUN'][0m [7m[36m.:['TEXT:.'][0m Off to 

In [9]:
print(f'  #    class Cov(%)    Prec    Gain    Pattern')
for idx, p in enumerate(the_patterns):
    print(f'{idx+1:>3} {p.support_class}   {round(p.coverage*100, 1):>4}   {p.precision:.3f}   {p.metric:.3f}    {p.get_pattern_id()}')

  #    class Cov(%)    Prec    Gain    Pattern
  1 Negative   26.2   0.526   0.223    [['SPACY:POS-NUM']]
  2 Positive    8.9   0.806   0.175    [['SPACY:POS-PROPN'], ['SPACY:POS-NUM']]
  3 Positive    5.1   0.984   0.152    [['TEXT:call'], ['SPACY:POS-NUM']]
  4 Positive   14.5   0.579   0.148    [['SPACY:POS-PROPN'], ['SPACY:POS-PROPN']]
  5 Positive   11.8   0.613   0.129    [['SPACY:POS-ADP'], ['SPACY:POS-NUM']]
  6 Positive    6.8   0.811   0.129    [['SPACY:POS-NUM'], ['SPACY:POS-PROPN']]
  7 Negative   18.1   0.526   0.119    [['SPACY:POS-NOUN'], ['SPACY:POS-PROPN']]
  8 Positive    5.4   0.856   0.114    [['TEXT:to'], ['SPACY:POS-NUM']]
  9 Negative   37.8   0.690   0.112    [['SPACY:POS-PROPN']]
 10 Positive    7.6   0.702   0.105    [['SPACY:POS-NOUN'], ['SPACY:POS-NOUN'], ['SPACY:POS-NUM']]
 11 Positive    7.1   0.722   0.104    [['SPACY:POS-PROPN'], ['SPACY:POS-PROPN'], ['SPACY:POS-PROPN']]
 12 Positive    5.7   0.797   0.101    [['TEXT:.'], ['SPACY:POS-NUM']]
 13 Positive 
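
The `Cov(%)` and `Gain` columns can be sanity-checked from the counts printed earlier. The sketch below assumes that `Gain` is the information gain (in bits) of splitting the training set by whether a pattern matches. The notebook does not state this, but the back-calculation for the rank 1 pattern lands close to the reported 0.223. The match counts per class are derived from the rounded precision, so they are approximate, and the function names are ours.

```python
from math import log2

def entropy(pos: int, neg: int) -> float:
    """Binary entropy (bits) of a set with `pos` positive and `neg` negative examples."""
    total = pos + neg
    return -sum(c / total * log2(c / total) for c in (pos, neg) if c)

def information_gain(pos_match, neg_match, pos_total, neg_total):
    """Entropy of the full set minus the weighted entropy of matched / unmatched parts."""
    n = pos_total + neg_total
    n_match = pos_match + neg_match
    pos_rest, neg_rest = pos_total - pos_match, neg_total - neg_match
    conditional = ((n_match / n) * entropy(pos_match, neg_match)
                   + ((n - n_match) / n) * entropy(pos_rest, neg_rest))
    return entropy(pos_total, neg_total) - conditional

# Rank 1 pattern: 936 matches, precision 0.526 for class Negative
neg_match = round(0.526 * 936)      # approximate, derived from the rounded precision
pos_match = 936 - neg_match
coverage = 936 / (488 + 3079)       # matches / training examples
gain = information_gain(pos_match, neg_match, 488, 3079)
print(f'coverage = {coverage:.3f}, gain = {gain:.3f}')
```

The pattern covers about 26.2% of the training set. It ranks first despite a precision of only 0.526, because the examples it does not match are almost all ham, which is what makes the split informative.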

## Post-process the patterns

In [10]:
# Keep only patterns whose precision is at least 0.70
selected_patterns = [p for p in the_patterns if p.precision >= 0.70]
print(f'No. of remaining patterns = {len(selected_patterns)}')

No. of remaining patterns = 57


In [11]:
# Remove a pattern p2 if the set contains another pattern p1 such that p2 is a specialization of p1 and p2 scores lower than p1 on the chosen metric (here, precision)
selected_patterns = grasp.remove_specialized_patterns(selected_patterns, metric = lambda x: x.precision)
print(f'No. of remaining patterns = {len(selected_patterns)}')

No. of remaining patterns = 47
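
The pruning rule above can be illustrated on a toy example. This is a sketch of the idea only, not GrASP's implementation. It assumes that a pattern is a list of slots and that p2 specializes p1 when every slot of p1 can be found, in order, inside p2 with the same or more attributes. The first two patterns and their precisions come from the tables in this notebook. The third one is hypothetical and added only to show a removal.

```python
from typing import List, Tuple

Pattern = List[List[str]]  # a sequence of slots, each slot a list of attributes

def is_specialization(specific: Pattern, general: Pattern) -> bool:
    """True if `general` embeds in `specific` as an ordered subsequence of slots
    (assumed definition of specialization, for illustration only)."""
    if specific == general:
        return False
    pos = 0
    for slot in general:
        # advance to the next slot of `specific` that contains all attributes of `slot`
        while pos < len(specific) and not set(slot) <= set(specific[pos]):
            pos += 1
        if pos == len(specific):
            return False
        pos += 1
    return True

def remove_specialized(scored: List[Tuple[Pattern, float]]) -> List[Tuple[Pattern, float]]:
    """Drop p2 when some p1 generalizes it and has a strictly higher metric."""
    return [(p2, m2) for p2, m2 in scored
            if not any(is_specialization(p2, p1) and m2 < m1 for p1, m1 in scored)]

scored = [
    ([['TEXT:call'], ['SPACY:POS-NUM']], 0.984),
    ([['TEXT:.'], ['TEXT:call'], ['SPACY:POS-NUM']], 0.985),   # more specific, higher precision
    ([['TEXT:call'], ['SPACY:POS-NUM'], ['TEXT:now']], 0.900), # hypothetical, lower precision
]
kept = remove_specialized(scored)
for pattern, precision in kept:
    print(precision, pattern)
```

Only the hypothetical third pattern is dropped. The second is also a specialization of the first, but it is more precise, so it survives, as it does in the notebook's results.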


In [12]:
# Print the remaining patterns sorted by precision
selected_patterns = sorted(selected_patterns, key = lambda x: x.precision, reverse = True)
print(f'  #    class Cov(%)    Prec    Gain    Pattern')
for idx, p in enumerate(selected_patterns):
    print(f'{idx+1:>3} {p.support_class}   {round(p.coverage*100, 1):>4}   {p.precision:.3f}   {p.metric:.3f}    {p.get_pattern_id()}')

  #    class Cov(%)    Prec    Gain    Pattern
  1 Positive    2.2   1.000   0.064    [['TEXT:claim']]
  2 Positive    1.6   1.000   0.046    [['TEXT:prize']]
  3 Positive    1.2   1.000   0.036    [['TEXT:call'], ['SPACY:POS-NUM'], ['SPACY:POS-ADP']]
  4 Positive    1.2   1.000   0.036    [['SPACY:POS-PROPN'], ['SPACY:POS-NOUN'], ['TEXT:.'], ['SPACY:POS-NUM']]
  5 Positive    1.9   0.985   0.053    [['TEXT:.'], ['TEXT:call'], ['SPACY:POS-NUM']]
  6 Positive    5.1   0.984   0.152    [['TEXT:call'], ['SPACY:POS-NUM']]
  7 Negative   37.2   0.983   0.067    [['TEXT:i']]
  8 Positive    1.5   0.982   0.043    [['SPACY:POS-NOUN'], ['SPACY:POS-NOUN'], ['TEXT:to'], ['SPACY:POS-NUM']]
  9 Positive    2.4   0.942   0.060    [['TEXT:mobile']]
 10 Positive    2.7   0.938   0.067    [['TEXT:txt']]
 11 Positive    1.7   0.934   0.041    [['SPACY:POS-NUM'], ['TEXT:now']]
 12 Positive    3.4   0.926   0.082    [['SPACY:POS-PROPN'], ['SPACY:POS-ADP'], ['SPACY:POS-NUM']]
 13 Positive    2.2   0.923   0

## Save the patterns to a JSON file
This JSON file can be used as input to the web demo tool for exploring the learned patterns and the training data.

In [13]:
grasp_model.to_json('results/case_study_1.json', patterns = selected_patterns, comment = 'Rank and group patterns based on precision. The minimum precision was set at 0.70')

100%|██████████| 47/47 [00:00<00:00, 380.03it/s]
100%|██████████| 47/47 [00:00<00:00, 362.51it/s]


Successfully dump the results to results/case_study_1.json
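
To confirm that the file was written and to get a first look at its structure, it can be read back with the standard `json` module. The sketch below makes no assumption about the schema GrASP uses. It only lists the top-level keys and the type of each value. The helper name `top_level_summary` is ours.

```python
import json

def top_level_summary(path: str) -> dict:
    """Map each top-level key of a JSON file to the type name of its value."""
    with open(path, 'r', encoding='utf-8') as f:
        data = json.load(f)
    if not isinstance(data, dict):
        # the root is a list or scalar rather than an object
        return {'<root>': type(data).__name__}
    return {key: type(value).__name__ for key, value in data.items()}

# Usage (path from the cell above):
# print(top_level_summary('results/case_study_1.json'))
```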
