# Omar Ali's Submission

## Plan for Cube

> In the interest of time, I have pre-generated RST trees for a selection of instances from the provided dataset. In practice this would be scaled up to an automated generation cycle: whenever an instance is added to the database, an RST tree would be generated for it.
> 
> For each instance, the RST tree would be weighted so that the most important parts score well above the less important sections. Following a similar approach to text summarisation, the least important parts would then be removed.
> 
> With the most important parts identified, we can train a model to determine which features within topics are indicative of emerging changes required by policymakers. This model allows our analysis to be scaled to the new instances that come through every day.
> 
> The hypothesis is that there is some level of consistency within information indicative of a policy change. I hope that the added layer of Rhetorical Structure Theory (RST) and its segmentation will allow these parts to be highlighted and learned from.
> 
> In future, I may also consider using the relationships between spans of text as an indication of emerging policies. For example, relations that represent attribution or cause-and-effect may be an indication of emerging changes to policy.

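As a rough illustration of the relation-based idea, the sketch below counts relation names in a parsed tree. It assumes the tree has already been converted to nested `(label, children)` tuples and that labels look like `Cause[N][S]`; the function name `collect_relations` and the example tree are hypothetical, invented only for illustration.

```python
import re
from collections import Counter

def collect_relations(tree, counts=None):
    """Count relation names in a nested (label, children) tuple-tree."""
    if counts is None:
        counts = Counter()
    label, children = tree
    # Strip nuclearity markers such as "[N]" and "[S]" to leave the relation name.
    name = re.sub(r'\[[A-Za-z]\]', '', label)
    if name != 'n/a':
        counts[name] += 1
    # Only nested tuples are subtrees; plain strings are EDU text.
    for child in children:
        if isinstance(child, tuple):
            collect_relations(child, counts)
    return counts

# Hypothetical tree, not taken from the dataset.
tree = ('Cause[N][S]', [
    'The regulator published new guidance',
    ('Attribution[S][N]', ['officials said', 'complaints had risen sharply']),
])
print(collect_relations(tree))
```

Counts like these could be appended to the feature set alongside the weighted EDUs, although whether they carry any signal is still an open question.
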
## Implementation
1. Weighting logic to rank our RST trees
2. Data taken from the dataset, processed using HILDA (an RST parser)

Output:
1. Ranked, indexed list of the most important, salient points taken from the text that are likely to indicate policy changes.

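One possible way to produce that ranked, indexed list is sketched below. It assumes the weighting step yields `[text, weight]` pairs; `rank_edus` and `top_k` are hypothetical names, and the example pairs are made up.

```python
def rank_edus(weights, top_k=None):
    """Return (rank, text, weight) tuples, highest weight first."""
    # sorted() is stable, so EDUs with equal weight keep their input order.
    ranked = sorted(weights, key=lambda pair: pair[1], reverse=True)
    if top_k is not None:
        ranked = ranked[:top_k]
    return [(rank, text, weight)
            for rank, (text, weight) in enumerate(ranked, start=1)]

# Hypothetical [text, weight] pairs, for illustration only.
example = [['first unit', 1.4], ['second unit', 0.84], ['third unit', 1.96]]
for rank, text, weight in rank_edus(example):
    print(rank, text, weight)
```
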
### Importing dependencies needed

In [None]:
!pip install transformers

### RST Ranker Logic
This section of the code aims to rank EDUs (elementary discourse units) based on their importance within the text. The text must first be processed using HILDA and stored.

In [None]:
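import ast
import csv
import re
import json

def convert_str_RST_to_tup(rst_str):
    # The stored HILDA output is a string of nested "ParseTree(...)" calls.
    # Dropping the constructor name leaves a plain (label, children) literal.
    rst_str = rst_str.replace('ParseTree', '')
    # ast.literal_eval accepts only Python literals, so stored text cannot
    # run arbitrary code the way eval() could. This assumes the remaining
    # string contains nothing but tuples, lists and strings.
    return ast.literal_eval(rst_str)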

w = []
# The weighting schemes, as (nucleus multiplier, satellite multiplier).
# HEAVY removes all satellites (think text summarisation); LIGHT only
# intensifies nuclei and subdues satellites.

HEAVY = (1, 0)
LIGHT = (1.4, 0.6)
STAGE_2 = (1, 1)

# Recursively weight the tuple-tree (nested tuples).
def weight_RST(weights, tup_obj, weight=1, weighting_scheme=LIGHT):
    # print ("Begin Weighting Process....")
    NUC = '[N]'
    SAT = '[S]'
    NA  = 'n/a'
    relation = tup_obj[0]
    subtree = tup_obj[1]
    relation_type = re.findall(r'\[[a-zA-Z]\]', relation)

    if relation == NA:
        weights.append([subtree[0], weighting_scheme[0]])
        return

    if isinstance(subtree[0], str):
        if relation_type[0] == NUC:
            weights.append ([subtree[0], weight*weighting_scheme[0]])
        if relation_type[0] == SAT:
            weights.append ([subtree[0], weight*weighting_scheme[1]])
    if isinstance(subtree[1], str):
        if relation_type[1] == NUC:
            weights.append ([subtree[1], weight*weighting_scheme[0]])
        if relation_type[1] == SAT:
            weights.append ([subtree[1], weight*weighting_scheme[1]])
    # Recurse into nested subtrees, passing the chosen scheme down so that
    # deeper levels do not silently fall back to the LIGHT default.
    if isinstance(subtree[1], tuple):
        if relation_type[1] == NUC:
            weight_RST(weights, subtree[1], weight*weighting_scheme[0], weighting_scheme)
        if relation_type[1] == SAT:
            weight_RST(weights, subtree[1], weight*weighting_scheme[1], weighting_scheme)
    if isinstance(subtree[0], tuple):
        if relation_type[0] == NUC:
            weight_RST(weights, subtree[0], weight*weighting_scheme[0], weighting_scheme)
        if relation_type[0] == SAT:
            weight_RST(weights, subtree[0], weight*weighting_scheme[1], weighting_scheme)
    # print ("FINISHED Weighting Process....")

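To make the multiplicative weighting concrete, here is a compact restatement of the same idea applied to a tiny tree. It is a sketch, not the cell above: `edu_weights` is a hypothetical helper that returns EDUs in text order, the tree is invented, and labels are assumed to carry one `[N]` or `[S]` marker per child.

```python
import re

LIGHT = (1.4, 0.6)  # (nucleus multiplier, satellite multiplier)
HEAVY = (1, 0)

def edu_weights(tree, scheme=LIGHT, weight=1.0):
    """Return [(edu_text, weight), ...] in text order for a (label, children) tree."""
    label, children = tree
    if label == 'n/a':  # a text that consists of a single EDU
        return [(children[0], scheme[0])]
    roles = re.findall(r'\[([NS])\]', label)  # e.g. ['N', 'S']
    out = []
    for role, child in zip(roles, children):
        # Each step down the tree multiplies by the nucleus or satellite factor.
        child_weight = weight * (scheme[0] if role == 'N' else scheme[1])
        if isinstance(child, tuple):
            out.extend(edu_weights(child, scheme, child_weight))
        else:
            out.append((child, child_weight))
    return out

# Hypothetical tree: A is the nucleus, and the satellite is itself a subtree.
tree = ('Elaboration[N][S]', ['A', ('Cause[N][S]', ['B', 'C'])])
print(edu_weights(tree, LIGHT))
print(edu_weights(tree, HEAVY))
```

Under `LIGHT`, A receives 1.4, B receives 0.6 × 1.4 ≈ 0.84 and C receives 0.6 × 0.6 ≈ 0.36, so a nucleus nested inside a satellite still ranks below the top-level nucleus. Under `HEAVY`, everything beneath a satellite drops to 0.
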
### Representation of outputs

> Below is a graph showing which parts of the text are most important for each instance.
> 
> We threshold these weights and use the result to train our subsequent model for detecting data indicative of emerging policy changes.
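
The thresholding step might look like the sketch below. The cut-off of `0.8`, the function name `threshold_edus` and the example weights are all assumptions chosen for illustration; a suitable threshold would need to be tuned on real data.

```python
def threshold_edus(weights, threshold=0.8):
    """Keep only the (text, weight) pairs whose weight meets the threshold."""
    return [(text, weight) for text, weight in weights if weight >= threshold]

# Hypothetical weighted EDUs, for illustration only.
weighted = [('A', 1.4), ('B', 0.84), ('C', 0.36)]
kept = threshold_edus(weighted)
print(kept)  # the surviving EDUs would form the training input
```
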