# Argument Map Class

This is sample code I wrote to represent argument maps from [this research project](https://www.researchgate.net/publication/359025097_argBERT_Using_Taxonomic_Distance_to_Recommend_Post_Placement_on_Deliberation_Maps). An argument map is a tree-structured network of arguments, each typed as an ISSUE, IDEA, PRO, or CON, under a single root of type MAP. The Python class below reads a TSV file of arguments and their children and builds a tree representation of the map, checking that the tree is valid (e.g. contains no cycles). The custom `__str__` method pretty-prints the argument map.

The research project revolved around one primary goal: given a new post as input, produce a short list of suggestions that includes, with high probability, an accurate parent for that post.

To do this, we trained a model to predict the `taxonomic distance` between two posts in an argument map, that is, the length of the shortest path between them. We would recommend the posts with the smallest predicted taxonomic distance from a new post as its candidate parents. This sample code also generates training data for the model through the `create_training_data` function.
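
The recommendation step described above can be sketched in a few lines. This is only an illustration under stated assumptions: `suggest_parents` and `toy_distance` are hypothetical names, and `toy_distance` is a made-up stand-in for the trained model (it simply counts the words two posts do not share).

```python
import heapq

def suggest_parents(predicted_distance, new_post, existing_posts, k=3):
    """Return the k existing posts with the smallest predicted distance."""
    return heapq.nsmallest(
        k, existing_posts, key=lambda post: predicted_distance(post, new_post)
    )

def toy_distance(post, new_post):
    # Hypothetical stand-in for the trained model: number of words
    # that appear in exactly one of the two posts.
    a = set(post.lower().split())
    b = set(new_post.lower().split())
    return len(a ^ b)

posts = [
    "greenhouse gases trap heat",
    "polar bears are at risk",
    "oceans are getting warmer",
]
suggestions = suggest_parents(
    toy_distance, "greenhouse gases warm the planet", posts, k=2
)
print(suggestions)
```

Any callable that maps a pair of posts to a number could take the place of `toy_distance` here.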

My role in this research project consisted of brainstorming the approach to the problem, implementing different approaches in code and conducting experiments, and writing the research report. [Dr. Mark Klein](https://mitsloan.mit.edu/staff/directory/mark-klein) advised me throughout this process.

[My Github](https://github.com/lievan)

In [117]:
import csv
from collections.abc import Callable

class ArgMap:
    def __init__(self, map_path: str):
        self.tree = {}
        self.types = {'MAP', 'ISSUE', 'IDEA', 'PRO', 'CON'}
        self.root = None

        # first initialize the tree
        with open(map_path, 'r') as f:
            reader = csv.reader(f, delimiter='\t')
            next(reader) #skip header
            for entity, arg_type, name, description, children in reader:
    
                ret = self._invalid_post(entity, arg_type, name, description)
                if ret:
                    raise ValueError(ret)
  
                if not children:
                    children = set()
                else:
                    children = set(children.split(' '))

                if arg_type == 'MAP':
                    self.root = entity #we've found our root
  
                node = {'type': arg_type, 
                        'name': name, 
                        'description': description, 
                        'children': children, 
                        'parent': None}

                self.tree[entity] = node
  
        #if self.root is still None, we failed to find a root
        if not self.root:
            raise ValueError("Missing root.")

        #populate parent pointers for each node
        self._populate_parents()

        # check for cycles or multiple roots
        self.validate()

    def add_post(self, entity: str, arg_type: str, name: str, 
                        description: str, parent: str) -> None:
        ret = self._invalid_post(entity, arg_type, name, description)
        if ret:
            raise ValueError(ret)

        if parent not in self.tree:
            raise ValueError('parent not found in map when adding new post')

        self.tree[parent]['children'].add(entity)

        node = {'type': arg_type, 
                'name': name, 
                'description': description, 
                'children': set(), 
                'parent': parent}
  
        self.tree[entity] = node
        self.validate()

    def create_training_data(self, save_path: str) -> None:
        # generates all pairs of posts with their taxonomic distances
        # and saves this data to `save_path`
        entities = list(self.tree.keys())
        data = []
        for i in range(len(entities)):
            for j in range(i + 1, len(entities)):
                p1, p2 = entities[i], entities[j]
                p1_sample  = "{} {} {}".format(self.tree[p1]['type'], 
                                              self.tree[p1]['name'], 
                                              self.tree[p1]['description'])
  
                p2_sample = "{} {} {}".format(self.tree[p2]['type'], 
                                              self.tree[p2]['name'], 
                                              self.tree[p2]['description'])
                dist = self.taxonomic_distance(p1, p2)
                data.append([p1_sample, p2_sample, dist])
        with open(save_path, 'w', newline='') as f:
            writer = csv.writer(f, delimiter='\t')
            writer.writerow(['p1', 'p2', 'distance'])
            for row in data:
                writer.writerow(row)

    def taxonomic_distance(self, p1: str, p2: str) -> int:
        # given nodes p1 and p2, returns shortest path 
        # distance between these two nodes
        # we find the shortest path by tracing the path 
        # from p1 and p2 to the root. We then
        # use these paths to find the lowest common 
        # ancestor of p1 and p2
        # the shortest path is then 
        # dist(LCA(p1, p2), p1) + dist(LCA(p1, p2), p2)

        def path_to_root(post):
            path = []
            curlen = 0
            while post in self.tree:
                path.append((post, curlen))
                curlen += 1
                post = self.tree[post]['parent']
            return path

        #returned path is in order of closest to furthest to p1
        p1_path = path_to_root(p1)

        #turn p2_path into dict for faster lookup time
        p2_path = {post: curlen for post, curlen in path_to_root(p2)}

        for entity, length in p1_path:
            if entity in p2_path:
                #found LCA(p1, p2)
                return length + p2_path[entity]
        return -1

    def find_similar(self, 
                    new_post_name: str, new_post_desc: str, 
                    dist_fn: Callable, top_k = 1) -> list:
        # simple search through all posts to find the 
        # most relevant post using distance metric defined by dist_fn
        preds = []
        for entity in self.tree:
            pred = dist_fn(self.tree[entity]['name'], 
                           self.tree[entity]['description'], 
                           new_post_name, 
                           new_post_desc)
            preds.append((entity, pred))
        return sorted(preds, key = lambda x:x[1])[:top_k]

    def validate(self) -> None:
        # a valid map has a single root (the 'MAP' node) and no cycles
        if self._multiple_roots():
            raise ValueError("Multiple roots detected")
        if self._cycle_detection():
            raise ValueError("Cycle detected")

    def _populate_parents(self) -> None:
        # populate the parent pointers within the tree
        for entity in self.tree:
            children = self.tree[entity]['children']
            for child in children:
                if child not in self.tree:
                    raise ValueError('child: {} not found in map'.format(child))
                if self.tree[child]['type'] == 'MAP':
                    raise ValueError('child: {} cant be type MAP'.format(child))
                if self.tree[child]['parent']:
                    raise ValueError('child: {} has > 1 parent'.format(child))
                self.tree[child]['parent'] = entity

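    def _cycle_detection(self) -> bool:
        visited = set()
        def dfs(entity):
            if entity in visited:
                return True
            visited.add(entity)
            for neighbor in self.tree[entity]['children']:
                if dfs(neighbor):
                    return True
            return False
        # a cycle that is disconnected from the root is never reached by
        # the DFS, so also flag any node the traversal did not visit
        return dfs(self.root) or len(visited) != len(self.tree)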

    def _multiple_roots(self) -> bool:
        for entity in self.tree:
            # only need to check for in-degree = 0 and type != MAP -- 
            # other root error checking done in __init__
            node = self.tree[entity]
            if not node['parent'] and node['type'] != 'MAP':
                return True
        return False

    def _invalid_post(self, entity: str, arg_type: str, 
                      name: str, description: str) -> str:
        if self.root and arg_type == 'MAP':
            return 'Multiple roots detected.'
        if not entity:
            return 'Entity missing from argument'
        if entity in self.tree:
            return 'Entity of {} already exists'.format(entity)
        if arg_type not in self.types:
            return 'Invalid argument type'
        if not name and not description:
            return 'Argument needs at least one of name and description'
        return None

    def __str__(self) -> str:
        #creates string representation of map for visualization
        tree_strs = []
        def build_str(entity, level):
            curpost = self.tree[entity]
            spaces = "   " * level
            template = "{}---- {} {} {} || {} "
            tree_strs.append(template.format(spaces, 
                                              entity, 
                                              curpost['type'], 
                                              curpost['name'], 
                                              curpost['description']))
            for child in curpost['children']:
                build_str(child, level + 1)
        build_str(self.root, 0)
        return "\n".join(tree_strs)
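
As a standalone illustration of what "valid" means for such a map, here is a minimal sketch that works on a plain dictionary instead of the class. `is_valid_tree` is a hypothetical helper, not part of `ArgMap`; it assumes `children` maps every node id to a set of child ids and that `root` is one of its keys.

```python
def is_valid_tree(children, root):
    """Check that `children` describes a single tree rooted at `root`."""
    parent = {}
    for node, kids in children.items():
        for kid in kids:
            # Reject unknown children, edges into the root, and second parents.
            if kid not in children or kid == root or kid in parent:
                return False
            parent[kid] = node

    # With unique parents and no edge into the root, each node is pushed
    # at most once, so this traversal always terminates.
    seen = set()
    stack = [root]
    while stack:
        node = stack.pop()
        seen.add(node)
        stack.extend(children[node])

    # Anything the traversal missed is either a second root or part of a
    # cycle that is disconnected from the root.
    return len(seen) == len(children)

good = {"0": {"1", "2"}, "1": set(), "2": set()}
detached_cycle = {"0": {"1"}, "1": set(), "a": {"b"}, "b": {"a"}}
two_parents = {"0": {"1", "2"}, "1": {"2"}, "2": set()}
print(is_valid_tree(good, "0"), is_valid_tree(detached_cycle, "0"))
```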

### Argument Map File -> ArgMap

Here is a TSV file representation of an argument map:

In [118]:
import pandas as pd
pd.read_csv('tiny_map.tsv', delimiter='\t')

| | Entity | Type | Name | Description | Children |
|---|---|---|---|---|---|
| 0 | 0 | MAP | This is the root | I am root | 1 2 |
| 1 | 1 | IDEA | Idea 1 | This is idea 1 | 3 4 |
| 2 | 2 | IDEA | idea 2 | This is idea 2 | 5 6 |
| 3 | 3 | PRO | Pro for idea 1 | idea 1 is great idea | |
| 4 | 4 | CON | Con for idea 1 | idea 1 is not so great | |
| 5 | 5 | PRO | Pro for idea 2 | yea idea 2 is so great | |
| 6 | 6 | PRO | Pro for idea 2 | idea 2 is really great | 7 |
| 7 | 7 | ISSUE | are you sure idea 2 is that great? | cuz I am not that sure | |
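
The file itself is not shown, so the following is a sketch for recreating something like it, assuming a five-column, tab-separated layout with a header row (which is what the `ArgMap` reader unpacks). `rows_to_tsv`, `HEADER`, and `TINY_MAP_ROWS` are hypothetical names; the row values are copied from the table above.

```python
import csv
import io

HEADER = ["Entity", "Type", "Name", "Description", "Children"]
TINY_MAP_ROWS = [
    ("0", "MAP", "This is the root", "I am root", "1 2"),
    ("1", "IDEA", "Idea 1", "This is idea 1", "3 4"),
    ("2", "IDEA", "idea 2", "This is idea 2", "5 6"),
    ("3", "PRO", "Pro for idea 1", "idea 1 is great idea", ""),
    ("4", "CON", "Con for idea 1", "idea 1 is not so great", ""),
    ("5", "PRO", "Pro for idea 2", "yea idea 2 is so great", ""),
    ("6", "PRO", "Pro for idea 2", "idea 2 is really great", "7"),
    ("7", "ISSUE", "are you sure idea 2 is that great?", "cuz I am not that sure", ""),
]

def rows_to_tsv(rows):
    """Serialize the rows as tab-separated text with a header line."""
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter="\t", lineterminator="\n")
    writer.writerow(HEADER)
    writer.writerows(rows)
    return buf.getvalue()

tsv_text = rows_to_tsv(TINY_MAP_ROWS)
print(tsv_text)
```

Writing `tsv_text` to `tiny_map.tsv` (opened with `newline=""`) should give a file the cells below can read, though the original file may differ in details such as quoting.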


Read the file and print the map:

In [119]:
mp = ArgMap('tiny_map.tsv')
print(mp)

---- 0 MAP This is the root || I am root 
   ---- 1 IDEA Idea 1 || This is idea 1 
      ---- 4 CON Con for idea 1 || idea 1 is not so great 
      ---- 3 PRO Pro for idea 1 || idea 1 is great idea 
   ---- 2 IDEA idea 2 || This is idea 2 
      ---- 5 PRO Pro for idea 2 || yea idea 2 is so great 
      ---- 6 PRO Pro for idea 2 || idea 2 is really great 
         ---- 7 ISSUE are you sure idea 2 is that great? || cuz I am not that sure 


Add a post, and the map will be updated

In [120]:
mp.add_post('8', 'ISSUE', 'question for idea 1', 'where did you come up with this idea?', '1')
print(mp)

---- 0 MAP This is the root || I am root 
   ---- 1 IDEA Idea 1 || This is idea 1 
      ---- 4 CON Con for idea 1 || idea 1 is not so great 
      ---- 8 ISSUE question for idea 1 || where did you come up with this idea? 
      ---- 3 PRO Pro for idea 1 || idea 1 is great idea 
   ---- 2 IDEA idea 2 || This is idea 2 
      ---- 5 PRO Pro for idea 2 || yea idea 2 is so great 
      ---- 6 PRO Pro for idea 2 || idea 2 is really great 
         ---- 7 ISSUE are you sure idea 2 is that great? || cuz I am not that sure 


In [121]:
assert mp.taxonomic_distance('2', '1') == 2
assert mp.taxonomic_distance('0', '7') == 3
assert mp.taxonomic_distance('6', '7') == 1
assert mp.taxonomic_distance('6', '6') == 0
assert mp.taxonomic_distance('5', '7') == 3
assert mp.taxonomic_distance('3', '7') == 5
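
As a cross-check on the lowest-common-ancestor approach, the same distances can be computed with a plain breadth-first search over the undirected tree. This is a sketch, not part of `ArgMap`: `bfs_distance` and `CHILDREN` are hypothetical names, and `CHILDREN` hard-codes the tiny map after post `8` was added.

```python
from collections import deque

CHILDREN = {
    "0": ["1", "2"],
    "1": ["3", "4", "8"],
    "2": ["5", "6"],
    "3": [], "4": [], "5": [],
    "6": ["7"],
    "7": [], "8": [],
}

def bfs_distance(children, start, goal):
    """Number of edges on the shortest path between start and goal, or -1."""
    # Build an undirected adjacency map from the parent -> children lists.
    neighbors = {node: set(kids) for node, kids in children.items()}
    for node, kids in children.items():
        for kid in kids:
            neighbors[kid].add(node)

    queue = deque([(start, 0)])
    seen = {start}
    while queue:
        node, dist = queue.popleft()
        if node == goal:
            return dist
        for nxt in neighbors[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return -1

print(bfs_distance(CHILDREN, "3", "7"))
```

BFS visits every node in the worst case, whereas walking both posts up to the root only touches their ancestors, which is presumably why the class uses the latter.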

Training data is created by taking every unique pair of posts and computing the taxonomic distance between the two posts.

In [122]:
mp.create_training_data('sample_training_data.tsv')
df = pd.read_csv('sample_training_data.tsv', delimiter='\t')
df.sample(5)

| | p1 | p2 | distance |
|---|---|---|---|
| 35 | ISSUE are you sure idea 2 is that great? cuz I... | ISSUE question for idea 1 where did you come u... | 5 |
| 1 | MAP This is the root I am root | IDEA idea 2 This is idea 2 | 1 |
| 19 | IDEA idea 2 This is idea 2 | ISSUE are you sure idea 2 is that great? cuz I... | 2 |
| 29 | CON Con for idea 1 idea 1 is not so great | ISSUE question for idea 1 where did you come u... | 2 |
| 8 | IDEA Idea 1 This is idea 1 | IDEA idea 2 This is idea 2 | 2 |
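
The pairing scheme can be sketched with `itertools.combinations`, which yields pairs in the same order as the nested `i < j` loops in `create_training_data`. The helper name `count_pairs` is hypothetical; the post ids assume the tiny map with posts `0` through `8`.

```python
from itertools import combinations
from math import comb

def count_pairs(num_posts):
    """Number of unordered pairs among num_posts posts: n * (n - 1) / 2."""
    return comb(num_posts, 2)

post_ids = [str(i) for i in range(9)]   # the tiny map after adding post 8
pairs = list(combinations(post_ids, 2))

print(len(pairs))        # one training row per unordered pair
print(pairs[8])          # row index 8 pairs the two IDEA posts
print(count_pairs(133))  # row count for a 133-post map
```

The row count grows quadratically with the number of posts, so large maps may call for sampling pairs instead of enumerating all of them.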


Let's try the same thing on a larger map

In [123]:
df = pd.read_csv('warming_map.tsv', delimiter='\t')
print("Num posts: {}".format(len(df)))
df.sample(5)

Num posts: 133


| | Entity | Type | Name | Description | Children |
|---|---|---|---|---|---|
| 5 | E-3OXYV4-589 | PRO | greenhouse gases increases are due to human ac... | Humans are emitting more than twice as much ... | E-3NNLOF-754 E-3NNLOF-756 |
| 7 | E-3NNLOF-755 | CON | Human emissions are not balanced by natural ab... | The CO2 that nature emits (from the ocean an... | |
| 47 | E-3OXYV4-614 | CON | this causal link has not been proven | &lt;a target=newin href= | |
| 59 | E-3NNLOF-693 | PRO | rising tropopause | The height of the tropopause (the boundary bet... | |
| 106 | E-3NNLOF-610 | PRO | climate dynamics are insufficiently understood | The Earth&apos;s climate is an incredibly comp... | |


In [124]:
warming_map = ArgMap('warming_map.tsv')
warming_map.create_training_data('warming_training_data.tsv')

In [125]:
df = pd.read_csv('warming_training_data.tsv', delimiter='\t')
df.sample(5)

| | p1 | p2 | distance |
|---|---|---|---|
| 3226 | PRO satellite transmissions can cause warming ... | IDEA The El Nino cycle The El Nino, a natura... | 3 |
| 5241 | PRO cosmic influx is declining NIL | PRO trend in record hot days versus record col... | 7 |
| 7229 | PRO Our oceans are getting warmer Our oceans a... | CON ice melting puts the polar bear at grave r... | 9 |
| 7302 | PRO surface temperature records show a warming... | CON Greenland's interior ice sheet has grow... | 5 |
| 2227 | PRO albedo decline is causing global warming NIL | CON This trend reversed around 1990 The redu... | 3 |


Such a dataset would then be used to train a model that predicts ```Model(Post, New Post) = Taxonomic Distance```.

## Searching the argument map

Let's use the ```find_similar``` function in the ```ArgMap``` class to discover posts on a map that are similar to a new post. The ```find_similar``` function takes a potential new post as input, along with a function that defines the distance between posts (smaller means more similar).

The following ```SemanticSearcher``` class uses [GloVe](https://nlp.stanford.edu/projects/glove/) word vectors to calculate the semantic similarity between two pieces of text. Word vectors provide vector representations of words, and similarity is calculated as one minus the cosine distance between vectors.

In [126]:
import numpy as np
from scipy import spatial
import gensim
import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')
stopword = set(stopwords.words('english'))

class SemanticSearcher:
    def __init__(self, path):
        self.embeddings_dict = {}
        print('Loading word vectors...')
        with open(path, 'r', encoding='utf-8') as f:
            for line in f:
                values = line.split()
                split = len(values) - 300
                word = values[split - 1]
                vector = np.asarray(values[split:], "float32")
                self.embeddings_dict[word] = vector
        print('Word vectors loaded')

    def get_mean_vector(self, text):
        words = self.preprocess_text(text)

        #only include words we have word vectors for
        words = filter(lambda word: word in self.embeddings_dict, words)
        word_vectors = [self.embeddings_dict[word] for word in words]

        #the sum points in the same direction as the mean, so the
        #cosine similarity computed from it is unchanged
        return np.sum(word_vectors, axis=0)

    def predict_distance(self, post1, post2):
        #despite the name, this returns cosine similarity
        #(higher means more similar), not a distance
        v0 = self.get_mean_vector(post1)
        v1 = self.get_mean_vector(post2)
        similarity = 1 - spatial.distance.cosine(v0, v1)
        return similarity

    def preprocess_text(self, text):
        # helper method to preprocess text
        text = gensim.utils.simple_preprocess(text)
        return filter(lambda tok: tok not in stopword, text)

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [127]:
searcher = SemanticSearcher('glove.6B.300d.txt')

Loading word vectors...
Word vectors loaded


In [128]:
print(searcher.predict_distance("I want to go to the store", "I wish to travel to shop"))
print(searcher.predict_distance("I want to go to the store", "Soccer is my favorite sport"))

0.7566205263137817
0.40228015184402466
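
To make the scores above less opaque, here is a dependency-free sketch of the same computation on made-up data. `TOY_VECTORS` holds invented 3-dimensional vectors, not GloVe values, and `summed_vector` and `cosine_similarity` are hypothetical helpers rather than the class's methods.

```python
import math

# Invented 3-dimensional "embeddings", for illustration only.
TOY_VECTORS = {
    "store": [0.9, 0.1, 0.0],
    "shop": [0.8, 0.2, 0.1],
    "soccer": [0.0, 0.1, 0.9],
}

def summed_vector(words, vectors):
    """Add up the vectors of the words we have embeddings for."""
    dims = len(next(iter(vectors.values())))
    total = [0.0] * dims
    for word in words:
        if word in vectors:
            total = [t + x for t, x in zip(total, vectors[word])]
    return total

def cosine_similarity(u, v):
    """Cosine of the angle between u and v; 0.0 if either has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

store = summed_vector(["store"], TOY_VECTORS)
shop = summed_vector(["shop"], TOY_VECTORS)
soccer = summed_vector(["soccer"], TOY_VECTORS)
unknown = summed_vector(["xyzzy"], TOY_VECTORS)

print(cosine_similarity(store, shop))    # close to 1: similar direction
print(cosine_similarity(store, soccer))  # close to 0: nearly orthogonal
```

Cosine similarity ignores vector length, so summing or averaging the word vectors gives the same score. The zero-length guard is a choice made for this sketch: text with no known words scores 0.0 instead of raising an error.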


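Define our similarity metric. Because ```find_similar``` sorts its results in ascending order, the function returns one minus the similarity, so smaller values mean more similar posts. We could replace this metric with any other model we have trained.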

In [129]:
def similarity_metric(p1_name, p1_desc, p2_name, p2_desc):
    p1_str = "{} {}".format(p1_name, p1_desc)
    p2_str = "{} {}".format(p2_name, p2_desc)
    return 1 - searcher.predict_distance(p1_str, p2_str)
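
To illustrate how the metric can be swapped, here is a sketch of an alternative with the same four-argument signature that needs no word vectors. `jaccard_distance` is a hypothetical example, not something used in the research project; it measures word overlap between the two posts.

```python
def jaccard_distance(p1_name, p1_desc, p2_name, p2_desc):
    """1 - |shared words| / |all words|; 0.0 means identical word sets."""
    a = set("{} {}".format(p1_name, p1_desc).lower().split())
    b = set("{} {}".format(p2_name, p2_desc).lower().split())
    if not a and not b:
        return 0.0  # two empty posts are treated as identical
    return 1 - len(a & b) / len(a | b)

print(jaccard_distance("greenhouse gases", "", "greenhouse effect", ""))
```

Since it returns a distance, it could be passed directly as `dist_fn`, e.g. `warming_map.find_similar(name, desc, jaccard_distance, top_k=5)`.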

Perform a search

In [130]:
name = "Climate change is anthropogenic"
desc = "humans release a ton of carbon emissions into the atmosphere"
for post_entity, _ in warming_map.find_similar(name, 
                                               desc, 
                                               similarity_metric, 
                                               top_k = 5):
    print(warming_map.tree[post_entity])

{'type': 'PRO', 'name': 'greenhouse gases cause global warming', 'description': 'greenhouse gases  increase the greenhouse effect that warms the planet.&#10;&#10;&#10;See  [E-3NNLOF-696] for additional arguments on this point.&#10;&#10;&#10;&lt;a target=newin href=', 'children': set(), 'parent': 'E-3MAORN-138'}
{'type': 'PRO', 'name': 'GHGs trap heat, warming the planet', 'description': '  Sunlight passes through our atmosphere and warms the earth. Greenhouse gases in the lower atmosphere reduce the amount of heat that is lost back into space, acting in effect like the glass in a greenhouse. ', 'children': set(), 'parent': 'E-3OXYV4-620'}
{'type': 'IDEA', 'name': 'Increases in greenhouse gases (GHGs).', 'description': '  Major greenhouse gases include Carbon Dioxide (CO2) as well as the more potent, but much rarer, Methane.&#10;&#10;&#10;&lt;a target=newin href= target=newin href=', 'children': {'E-3OXYV4-622', 'E-3OXYV4-620', 'E-3NNLOF-605'}, 'parent': 'E-3NNLOF-526'}
{'type': 'CON', 