## Sentence Parsing
This notebook takes several sample comments, parses each sentence to obtain its dependency structure, and extracts the predicate (pred, the ROOT token of the parse) of each sentence as a feature.

If you cannot load "en_core_web_sm", try the following commands (from [Stack Overflow](https://stackoverflow.com/questions/66149878/e053-could-not-read-config-cfg-resumeparser)):
 - pip install nltk
 - pip install spacy==2.3.5
 - pip install https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.3.1/en_core_web_sm-2.3.1.tar.gz
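
Before pinning versions, it can help to see what is already installed, since a load failure like the one in the linked question usually comes from a spaCy release and a model release that do not match. The following is a minimal sketch using only the standard library; the helper name `installed_version` is hypothetical, and the distribution names passed to it are assumed to match the pip package names above.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of `package`, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Assumption: the model is registered under the same name used to load it.
for name in ("spacy", "en_core_web_sm", "nltk"):
    print(name, installed_version(name))
```

If the two spaCy-related lines report different major versions, reinstalling with the pinned commands above is a reasonable next step.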


In [1]:
from collections import defaultdict

import nltk
import pandas as pd
import spacy
from spacy import displacy

nlp = spacy.load("en_core_web_sm")

In [2]:
data_path = "../../data/Sample comments.csv"
df = pd.read_csv(data_path)
df

Unnamed: 0,Unique ID,Domain,Comment,Tags
0,6070f5c0800d871e0c75d919,Vaccine,I got my jab on March 29. Your literature says...,Vaccine effectiveness / delayed dosage
1,606ac6aa8d190c273ca7ebe3,Vaccine,How reliable the shipment is ?? \nSpending on ...,Data and tracking vaccines
2,601c05426c4b8d189822fcec,Vaccine,Critical missing info:\nFed Govt needs to make...,Data and tracking vaccines
3,604e366623caed19c087f936,Travel,When coming from Portugal and the itinerary ...,Hotels
4,604498689a91901f24b82c39,Travel,Pre-entry test requirements:\nYou must show pr...,Testing


In [3]:
def parsing_features_extract(df):
    """
    Parameters:
    ------------
        df: dataframe
    Return:
    ------------
        all_preds: list of preds (ROOT tokens) in each comment instance, one pred per sentence
            example:
            [
            [pred1,pred2,pred3], # comment 1
            [pred1,pred2], # comment 2
            [pred1,pred2,pred3,pred4] # comment 3
            ]
        all_chunks: list of chunks in each comment instance. all_chunks has one element per comment, each comment has one list per sentence, and each sentence has one list of tokens per chunk.
            example:
            [   
                [   # comment 1 has 3 sents
                    [[token1,token2],[token3,token4],[token5],[token6]], #comment 1 sent 1
                    [[token1],[token2,token3]], #comment 1 sent 2
                    [[token1]] #comment 1 sent 3
                ],
                [   # comment 2 has 2 sents
                    [],
                    []
                ],
                [   # comment 3 has 4 sents
                    [],
                    [],
                    [],
                    []
                ],
            ]
    
    """
    all_preds = []
    all_chunks = []
    for comment in df["Comment"]:
        # Split into sentences with NLTK, then split each sentence again on newlines.
        sents = []
        for sent in nltk.sent_tokenize(comment):
            sents.extend(sent.split("\n"))
        preds = [] # for each comment (a comment may include more than one sentence)
        chunks = [] # for each comment
        for sent in sents:
            if not sent.strip():
                continue # skip blank pieces: an empty string yields an empty Doc with no ROOT token
            doc = nlp(sent)
            chunks.append(chunking(doc))
            # the pred of a sentence is the first token whose dependency label is ROOT
            preds.append([token for token in doc if token.dep_ == "ROOT"][0])
            displacy.render(doc, style="dep")
        all_preds.append(preds)
        all_chunks.append(chunks)
    return all_preds, all_chunks
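
The two-stage splitting used above (sentence tokenizer first, then newlines) can be tried in isolation. This is a minimal sketch, not the notebook's code: `split_comment` is a hypothetical helper, the regex is a naive stand-in for `nltk.sent_tokenize` so the example runs without NLTK, and the input string is made up.

```python
import re


def split_comment(comment, sent_tokenize=None):
    """Split a comment into sentences, then into lines, dropping blank pieces."""
    if sent_tokenize is None:
        # Assumption: naive stand-in that splits after ., ! or ? followed by whitespace.
        # Pass nltk.sent_tokenize here to match the notebook.
        sent_tokenize = lambda text: re.split(r"(?<=[.!?])\s+", text.strip())
    pieces = []
    for sent in sent_tokenize(comment):
        pieces.extend(line.strip() for line in sent.split("\n"))
    # Blank pieces would parse to an empty document, so they are removed.
    return [p for p in pieces if p]


print(split_comment("First line:\nSecond line here. Is this third?"))
# ['First line:', 'Second line here.', 'Is this third?']
```

The newline split matters for comments like the sample ones, where a heading such as "Pre-entry test requirements:" is followed by a line break rather than sentence-final punctuation.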

In [4]:
def chunking(doc):
    """
    Get the chunks of a single sentence (not of a whole comment). One comment/instance may include more than one sentence.
    
    Parameters:
    -----------
        doc: the object returned by nlp() for one sentence
    Return:
    -----------
        chunks: list of chunks, each chunk being a list of token texts
    """
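    # Map each token to its direct children. Keys are token positions rather than
    # token texts, so a word that occurs twice in the sentence keeps two entries.
    dic = dict()
    chunks = []
    for token in doc:
        children = [child.i for child in token.children]
        if token.i == token.head.i: # the root is its own head
            dic["ROOT"] = children
        else:
            dic[token.i] = children
    # Every direct child of the root starts one chunk (the child plus its whole subtree).
    for i in dic.get("ROOT", []):
        chunks.append([doc[j].text for j in get_chunk(i, dic, [])])
    return chunks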
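
Keying the children map by token position rather than by token text keeps repeated words apart. The following spaCy-free sketch illustrates that idea under stated assumptions: `chunks_by_index` is a hypothetical helper, and the parse is a hand-made list of `(text, head_index)` pairs in which the root points to itself. With spaCy, `token.i` and `token.head.i` would supply these indices.

```python
from collections import defaultdict


def chunks_by_index(tokens):
    """tokens: list of (text, head_index) pairs; the root is its own head."""
    children = defaultdict(list)
    root = None
    for i, (_, head) in enumerate(tokens):
        if head == i:
            root = i
        else:
            children[head].append(i)

    def subtree(i):
        # Pre-order walk: the token itself, then each child's subtree.
        out = [tokens[i][0]]
        for child in children[i]:
            out.extend(subtree(child))
        return out

    if root is None:
        return []
    return [subtree(child) for child in children[root]]


# Hand-made parse of "the dog chased the big dog" (an assumption, not spaCy output).
toy = [("the", 1), ("dog", 2), ("chased", 2), ("the", 5), ("big", 5), ("dog", 2)]
print(chunks_by_index(toy))
# [['dog', 'the'], ['dog', 'the', 'big']]
```

A map keyed by text would hold a single entry for "dog", so both chunks would come out identical.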

In [5]:
def get_chunk(token, dic, lst):
    """
    A recursive function that collects a given token and all of its dependent tokens into lst (pre-order)
    """
    lst.append(token)
    for i in dic[token]:
        get_chunk(i, dic, lst)
    return lst
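
To see the traversal order, the recursion can be run on a hand-built children dictionary. This is a small sketch: the dictionary below is an assumption, written by hand (with words as keys for readability) to mirror the first chunk list the notebook prints for "I got my jab on March 29."; it is not produced by spaCy here.

```python
def get_chunk(token, dic, lst):
    """Append `token`, then recursively append each of its dependents, to `lst`."""
    lst.append(token)
    for child in dic[token]:
        get_chunk(child, dic, lst)
    return lst


# Hand-built children map; "ROOT" lists the direct children of the root ("got").
dic = {
    "ROOT": ["I", "jab", "on", "."],
    "I": [],
    "jab": ["my"],
    "my": [],
    "on": ["March"],
    "March": ["29"],
    "29": [],
    ".": [],
}

chunks = [get_chunk(child, dic, []) for child in dic["ROOT"]]
print(chunks)
# [['I'], ['jab', 'my'], ['on', 'March', '29'], ['.']]
```

Each chunk starts with the head word followed by its dependents, which is why the output reads 'jab', 'my' rather than in sentence order.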

In [6]:
parsing_features_extract(df)

([[got, says, got, change, Sinclair, 01/03/4],
  [is, is, make, Are, technology, Are, realize, need, guy],
  [info, needs, percent, Distribution, distributed],
  [coming, -frankfurt, Toronto, EDMONTON, done, done],
  [requirements, show, number, Resolved]],
 [[[['I'], ['jab', 'my'], ['on', 'March', '29'], ['.']],
   [['literature', 'Your'],
    ['need',
     'I',
     'shot',
     'my',
     '3rd',
     'within',
     'weeks',
     '3',
     'of',
     'first',
     'my',
     'undergoing',
     'as',
     'I',
     'was',
     'treatment',
     'cancer',
     'causing',
     'immunosuppression'],
    ['.']],
   [['I'],
    ['jab', 'my', 'at', 'Mike', 'St'],
    ["'s"],
    ['stated',
     'from',
     'which',
     'it',
     'was',
     'is',
     'shot',
     'my',
     '2nd',
     'due',
     'July',
     '16'],
    ['.']],
   [['Can'], ['you'], ['this'], ['to', 'April', '19'], ['.']],
   [['HILDY'], ['.']],
   []],
  [[['reliable', 'How'], ['shipment', 'the'], ['?'], ['?']],
   [[