In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import grasp
from grasp import GrASP, CustomAttribute, remove_specialized_patterns
from sklearn.model_selection import train_test_split
from typing import Iterable, List, Set, Callable, Optional, Union, Sequence

## Load the data
- We use the **IBM Debater® - Evidence Sentences** dataset from the following paper which can be downloaded [here](https://www.research.ibm.com/haifa/dept/vst/debating_data.shtml#Argument%20Detection). (Please select the ACL 2018 dataset)
```
Shnarch, E., Alzate, C., Dankin, L., Gleize, M., Hou, Y., Choshen, L., ... & Slonim, N. (2018, July). Will it blend? blending weak and strong labeled data in a neural network for argumentation mining. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers) (pp. 599-605).
```

- Load the data

In [4]:
import pandas as pd
def get_data(split = 'train'):
    df = pd.read_csv(f'data/IBMDebaterEvidenceSentences/{split}.csv')
    texts = df['candidate masked'].tolist()
    labels = list(map(int, df['label'].tolist()))
    return texts, labels

In [6]:
X_train, y_train = get_data(split = 'train')
positive = [t for idx, t in enumerate(X_train) if y_train[idx]]
negative = [t for idx, t in enumerate(X_train) if not y_train[idx]]
print(len(positive), len(negative))

1499 2566


## (Optional) Add a custom attribute
This attribute checks whether a word is in a given list of argumentative words or not.

In [7]:
try:
    ARGUMENTATIVE_LEXICON = [line.strip().lower() for line in open('data/argumentative_unigrams_lexicon_shortlist.txt', 'r') if line.strip() != '']
    def _argumentative_extraction(text: str, tokens: List[str]) -> List[Set[str]]:
        tokens = map(str.lower, tokens)
        ans = []
        for t in tokens:
            t_ans = []
            if t.lower() in ARGUMENTATIVE_LEXICON:
                t_ans.append('Yes')
            ans.append(set(t_ans))
        return ans

    def _argumentative_translation(attr:str, 
                          is_complement:bool = False) -> str:
        word = attr.split(':')[1]
        assert word == 'Yes'
        return 'an argumentative word'

    ArgumentativeAttribute = CustomAttribute(name = 'ARGUMENTATIVE', extraction_function = _argumentative_extraction, translation_function = _argumentative_translation)

except:
    ArgumentativeAttribute = None

print(ArgumentativeAttribute)

ARGUMENTATIVE


## Run GrASP

In [8]:
# Create the GrASP engine
if ArgumentativeAttribute is not None:
    grasp_model = GrASP(gaps_allowed = 2, num_patterns = 100, include_standard = ['LEMMA', 'POS', 'NER', 'HYPERNM', 'SENTIMENT'],
                        include_custom = [ArgumentativeAttribute],
                        correlation_threshold = 0.5, alphabet_size = 100)
else:
    grasp_model = GrASP(gaps_allowed = 2, num_patterns = 100, include_standard = ['LEMMA', 'POS', 'NER', 'HYPERNM', 'SENTIMENT'],
                        correlation_threshold = 0.5, alphabet_size = 100)

In [9]:
# Fit GrASP to the dataset
the_patterns = grasp_model.fit_transform(positive, negative)

  0%|          | 6/1499 [00:00<00:28, 52.77it/s]

Step 1: Create augmented texts


100%|██████████| 1499/1499 [00:29<00:00, 50.33it/s]
100%|██████████| 2566/2566 [00:48<00:00, 52.92it/s]


Step 2: Find frequent attributes


  1%|          | 6/742 [00:00<00:17, 42.66it/s]

Total number of candidate alphabet = 742, such as ['SPACY:POS-PUNCT', 'LEMMA:topic_concept', 'SPACY:POS-VERB', 'SPACY:POS-PROPN', 'SPACY:POS-NOUN']
Step 3: Find alphabet set


100%|██████████| 742/742 [00:28<00:00, 26.32it/s]


Finding top k: 10 / 100
Finding top k: 20 / 100
Finding top k: 30 / 100
Finding top k: 40 / 100
Finding top k: 50 / 100
Finding top k: 60 / 100
Finding top k: 70 / 100
Finding top k: 80 / 100
Finding top k: 90 / 100
Finding top k: 100 / 100


  0%|          | 0/100 [00:00<?, ?it/s]

Total number of alphabet = 100
['LEMMA:that', 'LEMMA:]', 'LEMMA:study', 'LEMMA:risk', 'ARGUMENTATIVE:Yes', 'LEMMA:ban', 'LEMMA:energy', 'LEMMA:reduce', 'SPACY:NER-PERCENT', 'LEMMA:believe', 'LEMMA:electricity', 'LEMMA:focus', 'LEMMA:second', 'LEMMA:of', 'LEMMA:conclude', 'LEMMA:effective', 'LEMMA:find', 'LEMMA:wind', 'LEMMA:hiv', 'LEMMA:support', 'LEMMA:oppose', 'LEMMA:report', 'LEMMA:use', 'LEMMA:"', 'LEMMA:be', 'LEMMA:cancer', 'LEMMA:poll', 'LEMMA:feature', 'LEMMA:google', 'LEMMA:renewable', 'LEMMA:advocate', 'LEMMA:increase', 'LEMMA:disease', 'LEMMA:evidence', 'LEMMA:service', 'LEMMA:capacity', 'LEMMA:parent', 'SPACY:POS-ADP', 'SPACY:POS-DET', 'LEMMA:her', 'LEMMA:argue', 'SPACY:POS-NUM', 'LEMMA:say', 'LEMMA:popular', 'LEMMA:transmission', 'LEMMA:offer', 'LEMMA:rate', 'LEMMA:health', 'LEMMA:survey', 'LEMMA:comprehensive', 'LEMMA:according', 'LEMMA:state', 'SPACY:POS-VERB', 'LEMMA:show', 'LEMMA:us', 'LEMMA:suggest', 'LEMMA:emission', 'SPACY:NER-DATE', 'LEMMA:infection', 'LEMMA:recomme

100%|██████████| 100/100 [01:18<00:00,  1.27it/s]


Length 2 / 5; New candidates = 14950
Finding top k: 10 / 100
Finding top k: 20 / 100
Finding top k: 30 / 100
Finding top k: 40 / 100
Finding top k: 50 / 100
Finding top k: 60 / 100
Finding top k: 70 / 100
Finding top k: 80 / 100
Finding top k: 90 / 100
Finding top k: 100 / 100
Example of current patterns
Pattern: [['LEMMA:that', 'SPACY:POS-ADP']]
Window size: 3
Class: Positive
Precision: 0.546
Match: 1239 (30.5%)
Gain = 0.042
Metric (global) = 0.042
[5m[7m[32mExamples[0m ~ Class Positive:
[5m[32m[MATCH][0m: Brockman has argued [7m[36mthat:['SPACY:POS-ADP', 'LEMMA:that'][0m gender equality is developing only slowly within the field of law , and that more can be done to eliminate TOPIC_CONCEPT in this area . 
-------------------------
[5m[32m[MATCH][0m: Although the Brazilian Government , the Catholic Church , and the United Nations , argued in favor of TOPIC_CONCEPT , it was argued successfully [7m[36mthat:['SPACY:POS-ADP', 'LEMMA:that'][0m guns are needed for personal s

100%|██████████| 72/72 [01:44<00:00,  1.45s/it]


Length 3 / 5; New candidates = 14204
Finding top k: 10 / 100
Finding top k: 20 / 100
Finding top k: 30 / 100
Finding top k: 40 / 100
Finding top k: 50 / 100
Finding top k: 60 / 100
Finding top k: 70 / 100
Finding top k: 80 / 100
Finding top k: 90 / 100
Finding top k: 100 / 100
Example of current patterns
Pattern: [['LEMMA:that', 'SPACY:POS-ADP']]
Window size: 3
Class: Positive
Precision: 0.546
Match: 1239 (30.5%)
Gain = 0.042
Metric (global) = 0.042
[5m[7m[32mExamples[0m ~ Class Positive:
[5m[32m[MATCH][0m: A study shows [7m[36mthat:['SPACY:POS-ADP', 'LEMMA:that'][0m the lifespan of elephants in European TOPIC_CONCEPT is about half as long as those living in protected areas in Africa and Asia [ REF ] . 
-------------------------
[5m[32m[MATCH][0m: In 1990 , the Supreme Court of Canada upheld the law which bans public solicitation of TOPIC_CONCEPT , arguing [7m[36mthat:['SPACY:POS-ADP', 'LEMMA:that'][0m the law had the goal to abolish TOPIC_CONCEPT , which was a valid go

100%|██████████| 33/33 [00:32<00:00,  1.03it/s]


Length 4 / 5; New candidates = 6548
Finding top k: 10 / 100
Finding top k: 20 / 100
Finding top k: 30 / 100
Finding top k: 40 / 100
Finding top k: 50 / 100
Finding top k: 60 / 100
Finding top k: 70 / 100
Finding top k: 80 / 100
Finding top k: 90 / 100


  0%|          | 0/9 [00:00<?, ?it/s]

Finding top k: 100 / 100
Example of current patterns
Pattern: [['LEMMA:that', 'SPACY:POS-ADP']]
Window size: 3
Class: Positive
Precision: 0.546
Match: 1239 (30.5%)
Gain = 0.042
Metric (global) = 0.042
[5m[7m[32mExamples[0m ~ Class Positive:
[5m[32m[MATCH][0m: In August 2008 , a group of college presidents calling itself the Amethyst Initiative asserted [7m[36mthat:['SPACY:POS-ADP', 'LEMMA:that'][0m lowering TOPIC_CONCEPT to 18 ( presumably ) was one way to curb the " culture of dangerous binge drinking " among college students [ REF ] . 
-------------------------
[5m[32m[MATCH][0m: The Washington Post reported in 2003 [7m[36mthat:['SPACY:POS-ADP', 'LEMMA:that'][0m Lois Boland ( USPTO Director of International Relations ) said " that TOPIC_CONCEPT runs counter to the mission of WIPO , which is to promote intellectual - property rights . " 
-------------------------
[5m[7m[31mCounterexamples[0m ~ Not class Positive:
[5m[32m[MATCH][0m: The lawsuit notes [7m[36mtha

100%|██████████| 9/9 [00:06<00:00,  1.34it/s]


Length 5 / 5; New candidates = 1788
Finding top k: 10 / 100
Finding top k: 20 / 100
Finding top k: 30 / 100
Finding top k: 40 / 100
Finding top k: 50 / 100
Finding top k: 60 / 100
Finding top k: 70 / 100
Finding top k: 80 / 100
Finding top k: 90 / 100
Finding top k: 100 / 100
Example of current patterns
Pattern: [['LEMMA:that', 'SPACY:POS-ADP']]
Window size: 3
Class: Positive
Precision: 0.546
Match: 1239 (30.5%)
Gain = 0.042
Metric (global) = 0.042
[5m[7m[32mExamples[0m ~ Class Positive:
[5m[32m[MATCH][0m: Sullivan is a vocal critic of the Environmental Protection Agency , claiming [7m[36mthat:['SPACY:POS-ADP', 'LEMMA:that'][0m TOPIC_CONCEPT being pushed by the Obama Administration are harmful to the U.S. economy . 
-------------------------
[5m[32m[MATCH][0m: The Anti - Defamation League has stated [7m[36mthat:['SPACY:POS-ADP', 'LEMMA:that'][0m " TOPIC_CONCEPT is a contemporary form of the classic anti - Semitic doctrine of the evil , manipulative and threatening world

In [10]:
# Print the learned patterns
for idx, p in enumerate(the_patterns):
    print(f'Rank {idx+1}')
    print(p)

Rank 1
Pattern: [['LEMMA:that', 'SPACY:POS-ADP']]
Window size: 3
Class: Positive
Precision: 0.546
Match: 1239 (30.5%)
Gain = 0.042
Metric (global) = 0.042
[5m[7m[32mExamples[0m ~ Class Positive:
[5m[32m[MATCH][0m: Zeigler believed Riley should be disciplined to say " Sir " and " Ma'am " to adults , and [7m[36mthat:['SPACY:POS-ADP', 'LEMMA:that'][0m TOPIC_CONCEPT was the best means of disciplining a child . 
-------------------------
[5m[32m[MATCH][0m: Researchers found [7m[36mthat:['SPACY:POS-ADP', 'LEMMA:that'][0m a removal of TOPIC_CONCEPT policies from all colleges and universities would result in a significant drop in minority presence in all institutions of higher education , upholding TOPIC_CONCEPT role in diversity in the American classroom [ REF ] . 
-------------------------
[5m[7m[31mCounterexamples[0m ~ Not class Positive:
[5m[32m[MATCH][0m: In a column published in the New York Times on June 15 , 2011 Kristof argued [7m[36mthat:['SPACY:POS-ADP', 'LE

In [12]:
print(f'  #    class Cov(%)    Prec    Gain    Pattern')
for idx, p in enumerate(the_patterns):
    print(f'{idx+1:>3} {p.support_class}   {round(p.coverage*100, 1):>4}   {p.precision:.3f}   {p.metric:.3f}    {p.get_pattern_id()}')

  #    class Cov(%)    Prec    Gain    Pattern
  1 Positive   30.5   0.546   0.042    [['LEMMA:that', 'SPACY:POS-ADP']]
  2 Negative   42.7   0.546   0.017    [['LEMMA:]']]
  3 Negative   27.7   0.518   0.015    [['LEMMA:that'], ['SPACY:POS-VERB']]
  4 Positive    8.9   0.579   0.013    [['ARGUMENTATIVE:Yes'], ['LEMMA:that']]
  5 Negative   26.6   0.529   0.012    [['SPACY:POS-VERB'], ['SPACY:POS-ADP'], ['SPACY:POS-VERB']]
  6 Positive    5.2   0.627   0.011    [['LEMMA:study']]
  7 Positive    2.5   0.728   0.010    [['LEMMA:risk']]
  8 Negative   27.6   0.541   0.009    [['ARGUMENTATIVE:Yes'], ['SPACY:POS-ADP']]
  9 Positive    4.9   0.607   0.009    [['LEMMA:ban']]
 10 Positive    4.8   0.607   0.009    [['ARGUMENTATIVE:Yes'], ['ARGUMENTATIVE:Yes']]
 11 Positive   10.5   0.525   0.009    [['LEMMA:that', 'SPACY:POS-ADP'], ['SPACY:POS-DET']]
 12 Positive    6.0   0.578   0.008    [['LEMMA:energy']]
 13 Positive    7.7   0.546   0.008    [['ARGUMENTATIVE:Yes'], ['SPACY:POS-ADP'], ['SPA

## Save the patterns to a json file
We can use this json file as an input of the web demo tool for exploring the learned patterns and the training data

In [13]:
grasp_model.to_json('results/case_study_2.json')

100%|██████████| 100/100 [00:00<00:00, 278.50it/s]
100%|██████████| 100/100 [00:00<00:00, 231.03it/s]


Successfully dump the results to results/case_study_2.json
