# Natural language processing with spaCy

Using natural language processing (NLP), I will extract triples (subject, object, relation) from a text in order to identify both the nodes and the relations that the knowledge graph will contain.
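As a rough illustration of how such triples turn into a graph, the sketch below builds a node set and a labelled edge list from a few hand-written triples. The triples themselves, the `(subject, relation, object)` ordering, and the helper name `triples_to_graph` are assumptions made for this example; they are not output of the pipeline.

```python
def triples_to_graph(triples):
    """Return (nodes, edges) for a list of (subject, relation, object) triples."""
    nodes = set()
    edges = []
    for subj, rel, obj in triples:
        nodes.update((subj, obj))        # subjects and objects become nodes
        edges.append((subj, obj, rel))   # the relation labels the directed edge
    return nodes, edges


# Hypothetical triples, written by hand for illustration only
example_triples = [
    ("fungus", "attack", "vine"),
    ("fungus", "be", "pathogen"),
]

nodes, edges = triples_to_graph(example_triples)
print(sorted(nodes))
print(edges)
```

A node that appears in several triples is stored once, while every triple contributes its own edge.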

## Libraries

Several Python libraries exist for NLP; among the best known are NLTK and spaCy. This work uses spaCy (https://spacy.io/); as of 7 December 2021 the latest version is 3.2.0. The model for the language being processed must also be installed. The sources to be analysed are in English, so the English model will be downloaded. spaCy offers this model in several sizes:
_sm provides the basic NLP functionality in a small package. Its main drawback is that it is weak for word vectors (the small models do not ship with static vectors), so I will choose the medium-sized model (_md).

In [1]:
!pip install spacy==3.2.0
!python -m spacy download en_core_web_md


Collecting spacy==3.2.0
  Downloading spacy-3.2.0-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (6.0 MB)
Collecting thinc<8.1.0,>=8.0.12
  Downloading thinc-8.0.13-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (628 kB)
Collecting spacy-legacy<3.1.0,>=3.0.8
  Downloading spacy_legacy-3.0.8-py2.py3-none-any.whl (14 kB)
Collecting langcodes<4.0.0,>=3.2.0
  Downloading langcodes-3.3.0-py3-none-any.whl (181 kB)
Collecting spacy-loggers<2.0.0,>=1.0.0
  Downloading spacy_loggers-1.0.1-py3-none-any.whl (7.0 kB)
Collecting typer<0.5.0,>=0.3.0
  Downloading typer-0.4.0-py3-none-any.whl (27 kB)
Collecting srsly<3.0.0,>=2.4.1
  Downloading srsly-2.4.2-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (451 kB)
Collecting p

In [7]:
import spacy
import re
import time

In [3]:
print(spacy.__version__)


3.2.0
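In the same spirit as the version check, it can help to confirm that the language model is available before calling `spacy.load`. The sketch below assumes that `spacy download` installs the pipeline as an ordinary importable Python package named `en_core_web_md`; the helper name `is_package_installed` is made up for this example.

```python
import importlib.util


def is_package_installed(name):
    """Return True if a top-level package called `name` can be found."""
    return importlib.util.find_spec(name) is not None


# Report the status of the library and of the model package
for pkg in ("spacy", "en_core_web_md"):
    status = "installed" if is_package_installed(pkg) else "missing"
    print(f"{pkg}: {status}")
```

The check only looks the package up; it does not import it, so it stays cheap even for large models.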


In [4]:
non_nc = spacy.load('en_core_web_md')

nlp = spacy.load('en_core_web_md')
nlp.add_pipe('merge_noun_chunks')

SUBJECTS = ["nsubj", "nsubjpass", "csubj", "csubjpass", "agent", "expl"]
VERBS = ['ROOT', 'advcl']
OBJECTS = ["dobj", "dative", "attr", "oprd", 'pobj']
ENTITY_LABELS = ['PERSON', 'NORP', 'GPE', 'ORG', 'FAC', 'LOC', 'PRODUCT', 'EVENT', 'WORK_OF_ART']
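To make the role of these label lists concrete, here is a minimal sketch that sorts tokens into subjects, verbs and objects by dependency label alone. It works on hand-written `(text, dep)` pairs rather than on a real spaCy `Doc`; the pairs and the function name `bucket_by_dep` are assumptions for illustration, and the labels a real parse assigns may differ.

```python
SUBJECTS = ["nsubj", "nsubjpass", "csubj", "csubjpass", "agent", "expl"]
VERBS = ["ROOT", "advcl"]
OBJECTS = ["dobj", "dative", "attr", "oprd", "pobj"]


def bucket_by_dep(tagged_tokens):
    """Group token texts by the dependency-label list their label belongs to."""
    buckets = {"subjects": [], "verbs": [], "objects": []}
    for text, dep in tagged_tokens:
        if dep in SUBJECTS:
            buckets["subjects"].append(text)
        elif dep in VERBS:
            buckets["verbs"].append(text)
        elif dep in OBJECTS:
            buckets["objects"].append(text)
        # any other label (punctuation, modifiers, ...) is ignored
    return buckets


# Hypothetical, hand-labelled tokens for "The fungus is an obligate pathogen."
tagged = [
    ("the fungus", "nsubj"),
    ("is", "ROOT"),
    ("an obligate pathogen", "attr"),
    (".", "punct"),
]

buckets = bucket_by_dep(tagged)
print(buckets)
```

Multi-word entries such as "the fungus" stand in for what the `merge_noun_chunks` component produces when it merges a noun chunk into a single token.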

In [5]:
def remove_special_characters(text):
    
    regex = re.compile(r'[\n\r\t]')
    clean_text = regex.sub(" ", text)
    
    return clean_text


def remove_stop_words_and_punct(text, print_text=False):
    
    result_ls = []
    rsw_doc = non_nc(text)
    
    for token in rsw_doc:
        if print_text:
            print(token, token.is_stop)
            print('--------------')
        if not token.is_stop and not token.is_punct:
            result_ls.append(str(token))
    
    result_str = ' '.join(result_ls)

    return result_str


def create_svo_lists(doc, print_lists):
    
    subject_ls = []
    verb_ls = []
    object_ls = []

    for token in doc:
        if token.dep_ in SUBJECTS:
            subject_ls.append((token.lower_, token.idx))
        elif token.dep_ in VERBS:
            verb_ls.append((token.lemma_, token.idx))
        elif token.dep_ in OBJECTS:
            object_ls.append((token.lower_, token.idx))

    if print_lists:
        print('SUBJECTS: ', subject_ls)
        print('VERBS: ', verb_ls)
        print('OBJECTS: ', object_ls)
    
    return subject_ls, verb_ls, object_ls


def remove_duplicates(tup, tup_posn):
    # Keep only the first tuple seen for each distinct value at index tup_posn.
    check_val = set()
    result = []
    
    for i in tup:
        if i[tup_posn] not in check_val:
            result.append(i)
            check_val.add(i[tup_posn])
            
    return result


def remove_dates(tup_ls):
    # Drop triples whose object is purely numeric (for example a bare year such as "2021").
    clean_tup_ls = []
    for entry in tup_ls:
        if not entry[2].isdigit():
            clean_tup_ls.append(entry)
    return clean_tup_ls


def create_svo_triples(text, print_lists=False):
    
    clean_text = remove_special_characters(text)
    doc = nlp(clean_text)
    subject_ls, verb_ls, object_ls = create_svo_lists(doc, print_lists=print_lists)
    
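    graph_tup_ls = []
    dedup_tup_ls = []
    clean_tup_ls = []

    # Without at least one verb there is no relation to attach, and the
    # min() call below would raise ValueError on an empty sequence.
    if not verb_ls:
        return clean_tup_ls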
    
    for subj in subject_ls: 
        for obj in object_ls:
            
            dist_ls = []
            
            for v in verb_ls:
                
                # Assemble a list of distances between each object and each verb
                dist_ls.append(abs(obj[1] - v[1]))
                
            # Get the index of the verb with the smallest distance to the object 
            # and return that verb
            index_min = min(range(len(dist_ls)), key=dist_ls.__getitem__)
            
            # Remove stop words from subjects and objects. Note that we do this
            # late in the process, after parsing, to allow for proper sentence recognition.

            no_sw_subj = remove_stop_words_and_punct(subj[0])
            no_sw_obj = remove_stop_words_and_punct(obj[0])
            
            # Add entries to the graph iff neither subject nor object is blank
            if no_sw_subj and no_sw_obj:
                tup = (no_sw_subj, verb_ls[index_min][0], no_sw_obj)
                graph_tup_ls.append(tup)
        
        #clean_tup_ls = remove_dates(graph_tup_ls)
    
    dedup_tup_ls = remove_duplicates(graph_tup_ls, 2)
    clean_tup_ls = remove_dates(dedup_tup_ls)
    
    return clean_tup_ls
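The pairing and deduplication logic above can be tried in isolation, without spaCy. The sketch below is a simplified re-statement under stated assumptions: the `(text, character offset)` pairs are hand-written, stop-word removal is left out, and the names `nearest_verb` and `pair_and_dedup` are invented for this example.

```python
def nearest_verb(obj_offset, verbs):
    """Return the text of the verb whose character offset is closest to obj_offset."""
    return min(verbs, key=lambda v: abs(obj_offset - v[1]))[0]


def pair_and_dedup(subjects, verbs, objects):
    """Pair every subject with every object, then keep one triple per object."""
    triples = [
        (subj, nearest_verb(obj_off, verbs), obj)
        for subj, _ in subjects
        for obj, obj_off in objects
    ]
    seen, result = set(), []
    for triple in triples:
        # Deduplicate on the object, as remove_duplicates(graph_tup_ls, 2) does
        if triple[2] not in seen:
            seen.add(triple[2])
            result.append(triple)
    return result


# Hypothetical (text, character offset) pairs
subjects = [("fungus", 4), ("leaves", 60)]
verbs = [("attack", 20), ("develop", 70)]
objects = [("vine", 30), ("lesions", 80)]

triples = pair_and_dedup(subjects, verbs, objects)
print(triples)
```

In this toy input "leaves" never appears as a subject in the result: because deduplication keys on the object, every object is already claimed by the first subject it was paired with. The same effect is to be expected from `create_svo_triples` on longer texts.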

In [17]:
input_string ='''Powdery mildew
Disease symptoms
The fungus is an obligate pathogen which can attack all green parts of the vine.
Symptoms of this disease are frequently confused with those of powdery mildew. Infected leaves develop pale yellow-green lesions which gradually turn brown. Severely infected leaves often drop prematurely.
Infected petioles, tendrils, and shoots often curl, develop a shepherd's crook, and eventually turn brown and die.
Young berries are highly susceptible to infection and are often covered with white fruiting structures of the fungus. Infected older berries of white cultivars may turn dull gray-green, whereas those of black cultivars turn pinkish red.
Survival and spread
The fungus overwinters mainly in the fallen leaves which are the source of primary infection. Secondary infection occurs by motile zoospores by splashing rain.
Favourable conditions
The most serious outbreaks have been found to occur when a wet winter is followed by a wet spring and a warm summer with intermittent rains
Anthracnose
Disease symptoms
Powdery mildew, caused by the fungus Uncinulanecator, can infect all green tissues of the grapevine.'''

initial_tup_ls = create_svo_triples(input_string, print_lists=True)


SUBJECTS:  [('powdery mildew disease', 0), ('the fungus', 32), ('which', 67), ('symptoms', 113), ('infected leaves', 192), ('which', 242), ('severely infected leaves', 270), ('infected petioles', 319), ('shoots', 352), ('young berries', 434), ('infected older berries', 552), ('those', 628), ('which', 742), ('secondary infection', 785), ('the most serious outbreaks', 873), ('a wet winter', 930), ('by', 955), ('anthracnose disease', 1013), ('by', 1065)]
VERBS:  [('be', 43), ('confuse', 153), ('develop', 208), ('drop', 301), ('curl', 365), ('be', 448), ('turn', 598), ('turn', 653), ('survival', 671), ('occur', 805), ('favourable condition', 851), ('find', 910), ('follow', 946)]
OBJECTS:  [('an obligate pathogen', 46), ('all green parts', 84), ('the vine', 103), ('this disease', 125), ('those', 167), ('powdery mildew', 176), ('pale yellow-green lesions', 216), ("a shepherd's crook", 379), ('brown', 419), ('infection', 474), ('white fruiting structures', 511), ('the fungus', 540), ('white c

In [18]:
initial_tup_ls[:26]

[('powdery mildew disease', 'be', 'obligate pathogen'),
 ('powdery mildew disease', 'be', 'green parts'),
 ('powdery mildew disease', 'confuse', 'vine'),
 ('powdery mildew disease', 'confuse', 'disease'),
 ('powdery mildew disease', 'confuse', 'powdery mildew'),
 ('powdery mildew disease', 'develop', 'pale yellow green lesions'),
 ('powdery mildew disease', 'curl', 'shepherd crook'),
 ('powdery mildew disease', 'be', 'brown'),
 ('powdery mildew disease', 'be', 'infection'),
 ('powdery mildew disease', 'be', 'white fruiting structures'),
 ('powdery mildew disease', 'turn', 'fungus'),
 ('powdery mildew disease', 'turn', 'white cultivars'),
 ('powdery mildew disease', 'turn', 'green'),
 ('powdery mildew disease', 'turn', 'black cultivars'),
 ('powdery mildew disease', 'turn', 'pinkish red'),
 ('powdery mildew disease', 'survival', 'fungus overwinters'),
 ('powdery mildew disease', 'survival', 'fallen leaves'),
 ('powdery mildew disease', 'occur', 'source'),
 ('powdery mildew disease', 'oc