# Natural language processing with spaCy

Using natural language processing (NLP), I will extract triples (subject, object, relation) from a text in order to identify both the nodes and the relations that the knowledge graph will contain.
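To make the target structure concrete, the sketch below shows one way such triples could be turned into graph nodes and edges. The `Triple` class, the `build_graph` helper and the sample triples are hypothetical illustrations, not part of the notebook. The two sample triples are written by hand from sentences in the input text; they are not produced by spaCy.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """One extracted fact, in the order used in the text: subject, object, relation."""
    subject: str
    obj: str
    relation: str


def build_graph(triples):
    """Return (nodes, edges) for a list of triples.

    nodes is a sorted list of unique entity names;
    edges is a list of (source, relation, target) tuples.
    """
    nodes = sorted({name for t in triples for name in (t.subject, t.obj)})
    edges = [(t.subject, t.relation, t.obj) for t in triples]
    return nodes, edges


# Hand-written sample triples (assumption: for illustration only).
sample_triples = [
    Triple("fungus", "pathogen", "be"),
    Triple("leaves", "lesions", "develop"),
]

nodes, edges = build_graph(sample_triples)
print(nodes)
print(edges)
```

Each subject and object becomes a node, and each relation becomes a labelled, directed edge from subject to object.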

## Libraries

Several Python libraries exist for NLP; among the best known are NLTK and spaCy. This work uses spaCy (https://spacy.io/), whose latest version as of 7 December 2021 is 3.2.0. The model for the language being processed must also be installed. The sources to be analysed are in English, so the English model is downloaded. spaCy offers this model in several sizes:
`_sm` provides the basic NLP functionality in a small download. Its main drawback is that it is not very good at producing word vectors, so I will use the medium-sized model (`_md`).
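Word vectors matter for this choice because similarity between words is commonly measured as the cosine of the angle between their vectors. The following is a minimal sketch of that measure in plain Python. The three-dimensional vectors are made-up toy values (assumption), not taken from any spaCy model, and `cosine_similarity` is a hypothetical helper name.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy vectors (assumption: invented values, for illustration only).
toy_vectors = {
    "mildew": [0.9, 0.1, 0.0],
    "fungus": [0.8, 0.2, 0.1],
    "summer": [0.0, 0.1, 0.9],
}

related = cosine_similarity(toy_vectors["mildew"], toy_vectors["fungus"])
unrelated = cosine_similarity(toy_vectors["mildew"], toy_vectors["summer"])
print(related > unrelated)
```

With a loaded pipeline, spaCy exposes a comparable score through `Token.similarity`, which relies on the vectors shipped with the model.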

In [1]:
!pip install spacy==3.2.0
!python -m spacy download en_core_web_md


Collecting spacy==3.2.0
  Downloading spacy-3.2.0-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (6.0 MB)
Collecting thinc<8.1.0,>=8.0.12
  Downloading thinc-8.0.13-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (628 kB)
Collecting langcodes<4.0.0,>=3.2.0
  Downloading langcodes-3.3.0-py3-none-any.whl (181 kB)
Collecting spacy-loggers<2.0.0,>=1.0.0
  Downloading spacy_loggers-1.0.1-py3-none-any.whl (7.0 kB)
Collecting srsly<3.0.0,>=2.4.1
  Downloading srsly-2.4.2-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (451 kB)
Collecting spacy-legacy<3.1.0,>=3.0.8
  Downloading spacy_legacy-3.0.8-py2.py3-none-any.whl (14 kB)
Collecting pathy>=0.3.5
  Downloading pathy-0.6.1-py3-none-any.whl (42 kB)

In [2]:
import spacy
import re

In [3]:


print(spacy.__version__)
nlp = spacy.load('en_core_web_md')

3.2.0


In [13]:
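def remove_space_characters(input_text):
    """Replace every newline, tab and carriage return with a space."""
    return re.sub(r"[\n\t\r]", " ", input_text)

# Dependency labels (token.dep_) used to bucket tokens by grammatical role.
# Note: "agent" labels the preposition "by" in a passive clause, not the noun after it.
SUBJECTS = ["nsubj", "nsubjpass", "csubj", "csubjpass", "agent", "expl"]
# Clause heads treated as relations; ROOT can be a noun in verbless fragments such as headings.
VERBS = ['ROOT', 'advcl']
OBJECTS = ["dobj", "dative", "attr", "oprd", 'pobj']
# Named-entity labels of interest.
ENTITY_LABELS = ['PERSON', 'NORP', 'GPE', 'ORG', 'FAC', 'LOC', 'PRODUCT', 'EVENT', 'WORK_OF_ART']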

input_string ='''Powdery mildew
Disease symptoms
The fungus is an obligate pathogen which can attack all green parts of the vine.
Symptoms of this disease are frequently confused with those of powdery mildew. Infected leaves develop pale yellow-green lesions which gradually turn brown. Severely infected leaves often drop prematurely.
Infected petioles, tendrils, and shoots often curl, develop a shepherd's crook, and eventually turn brown and die.
Young berries are highly susceptible to infection and are often covered with white fruiting structures of the fungus. Infected older berries of white cultivars may turn dull gray-green, whereas those of black cultivars turn pinkish red.
Survival and spread
The fungus overwinters mainly in the fallen leaves which are the source of primary infection. Secondary infection occurs by motile zoospores by splashing rain.
Favourable conditions
The most serious outbreaks have been found to occur when a wet winter is followed by a wet spring and a warm summer with intermittent rains
Anthracnose
Disease symptoms
Powdery mildew, caused by the fungus Uncinulanecator, can infect all green tissues of the grapevine.'''

format_text = remove_space_characters(input_string)

doc = nlp(format_text)

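# Each list holds (text, character offset) pairs for one grammatical role.
subject_ls = []
verb_ls = []
object_ls = []

# Bucket every token by its dependency label; token.idx is its character offset in the text.
for token in doc:
    if token.dep_ in SUBJECTS:
        subject_ls.append((token.lower_, token.idx))
    elif token.dep_ in VERBS:
        # Store the lemma so that inflected forms map to the same relation.
        verb_ls.append((token.lemma_, token.idx))
    elif token.dep_ in OBJECTS:
        object_ls.append((token.lower_, token.idx))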

print('SUBJECTS: ', subject_ls)
print('VERBS: ', verb_ls)
print('OBJECTS: ', object_ls)


SUBJECTS:  [('disease', 15), ('fungus', 36), ('which', 67), ('symptoms', 113), ('leaves', 201), ('which', 242), ('leaves', 288), ('petioles', 328), ('shoots', 352), ('berries', 440), ('berries', 567), ('those', 628), ('which', 742), ('infection', 795), ('outbreaks', 890), ('winter', 936), ('by', 955), ('disease', 1025), ('by', 1065)]
VERBS:  [('be', 43), ('confuse', 153), ('develop', 208), ('drop', 301), ('curl', 365), ('be', 448), ('turn', 598), ('turn', 653), ('survival', 671), ('occur', 805), ('condition', 862), ('find', 910), ('follow', 946)]
OBJECTS:  [('pathogen', 58), ('parts', 94), ('vine', 107), ('disease', 130), ('those', 167), ('mildew', 184), ('lesions', 234), ('crook', 392), ('brown', 419), ('infection', 474), ('structures', 526), ('fungus', 544), ('cultivars', 584), ('green', 613), ('cultivars', 643), ('red', 666), ('overwinters', 702), ('leaves', 735), ('source', 756), ('infection', 774), ('zoospores', 822), ('rain', 845), ('spring', 964), ('rains', 1007), ('mildew', 105