<h1 style="text-align:center;"><strong>Corpus generation</strong></h1>

### 1) Import libraries

#### Libraries for extracting the data from Neo4j

In [1]:
from py2neo import Graph
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

#### Libraries for preprocessing the data

In [2]:
from unidecode import unidecode
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Nicolas\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [3]:
from string import punctuation 
from nltk.corpus import stopwords
from textblob import TextBlob
import json
import pickle

### 2) Extract the valid corpus data from Neo4j

#### Configure the connection to the Neo4j database

In [4]:
graph = Graph("bolt://localhost:7687", auth=("neo4j", "narias"))
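
The cell above hardcodes the URI and the credentials. As a minimal sketch, the same values could be read from the environment instead. The variable names `NEO4J_URI`, `NEO4J_USER`, and `NEO4J_PASSWORD` and the helper `neo4j_settings` are assumptions made for this example; they are not part of py2neo or of the original notebook.

```python
import os

def neo4j_settings(env=None):
    """Return (uri, (user, password)) read from environment-style variables.

    The variable names are hypothetical; adapt them to your own setup.
    """
    env = os.environ if env is None else env
    uri = env.get("NEO4J_URI", "bolt://localhost:7687")
    user = env.get("NEO4J_USER", "neo4j")
    password = env.get("NEO4J_PASSWORD")
    if password is None:
        # Fail early instead of falling back to a password stored in the notebook
        raise RuntimeError("NEO4J_PASSWORD is not set")
    return uri, (user, password)

# Usage with the notebook's own call (requires a running Neo4j instance):
# uri, auth = neo4j_settings()
# graph = Graph(uri, auth=auth)
```

Passing a plain dictionary as `env` makes the helper easy to try without touching the real environment.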

#### Extracting the data for the corpus

###### By article

In [None]:
query = """
match (ar:Article {corpus:'True'})
optional match (ar)-[r:USES]-(to:Topic)
RETURN ar.scopus_id, ar.title, ar.abstract, collect(to.name) as topics
"""
res = graph.run(query)

In [None]:
data = []
for item in res:
    row = item.data()  # convert the record to a dict once
    article = {}
    article['doc_id'] = row['ar.scopus_id']
    # One document per article: title, abstract, and topic names
    article['doc'] = (row['ar.title'] + " " + row['ar.abstract'] + " " + ' '.join(row['topics'])).rstrip()
    data.append(article)

###### By author

In [5]:
query = """
match (au:Author)-[:WROTE]->(ar:Article {corpus:'True'})
optional match (ar)-[r:USES]-(to:Topic)
return au.scopus_id, collect(distinct([ar.title, ar.abstract])) as articles, collect(to.name) as topics
"""
res = graph.run(query)

In [6]:
data = []
for item in res:
    row = item.data()  # convert the record to a dict once
    author = {}
    author['doc_id'] = row['au.scopus_id']
    # One document per author: every title and abstract, followed by the topic names
    author['doc'] = (' '.join([' '.join(article) for article in row['articles']]) + " " + ' '.join(row['topics'])).rstrip()
    data.append(author)
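
The way each author document is assembled can be tried without a database. The sketch below runs on a made-up row shaped like the query result. It also skips `None` values, because `' '.join` raises a `TypeError` on them; whether the graph actually contains articles without an abstract is not known from the notebook. The helper `build_doc` and the sample row are hypothetical.

```python
def build_doc(articles, topics):
    """Join [title, abstract] pairs and topic names into one document string.

    Missing values (None or empty strings) are skipped instead of raising.
    """
    parts = []
    for pair in articles:
        parts.extend(text for text in pair if text)
    parts.extend(name for name in topics if name)
    return ' '.join(parts)

# Made-up row with the same keys as the author query above
row = {
    'au.scopus_id': '0000000000',
    'articles': [['Title one', 'Abstract one.'], ['Title two', None]],
    'topics': ['topic a', 'topic b'],
}
author = {
    'doc_id': row['au.scopus_id'],
    'doc': build_doc(row['articles'], row['topics']),
}
print(author['doc'])  # Title one Abstract one. Title two topic a topic b
```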

In [7]:
len(data)

37569

In [8]:
data[1]

{'doc_id': '57581708700',
 'doc': "Facility Layout Design in Textile MSMEs. Literature Review of Resilient Indicators The capacity to respond and adapt to risks and problems in an organization is critical for business success. Any type of weakness causes inefficient use of resources. On the contrary, flexible facilities can ensure the continuity of operations in the face of disruptive events, which significantly harm the company. However, flexibility is not achieved only with the optimization of facilities, as resilient approaches can enhance it. This research synthesizes the variables and indicators with greater use in three different areas, business resilience, textile industry, and the facility layout problem (FLP). A systematic literature review was conducted, considering 99 studies published in 2010–2021. The documents were analyzed using the Atlas.ti software, a 4W (When, Who, What, and Where) analysis, and three research questions posed through the PICO strategy. The findings in

### 3) Data preprocessing

#### Create the tokenizer

In [9]:
tokenizer = TfidfVectorizer().build_tokenizer()
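
With default settings, `TfidfVectorizer().build_tokenizer()` returns a function that splits text with the pattern `(?u)\b\w\w+\b`, that is, runs of two or more word characters. The standard-library sketch below reproduces that behavior so it can be inspected in isolation; the name `tokenize` and the sample sentence are chosen only for this example.

```python
import re

# Default token_pattern of TfidfVectorizer: two or more word characters
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def tokenize(text):
    """Return the tokens the default scikit-learn pattern would extract."""
    return TOKEN_PATTERN.findall(text)

tokens = tokenize("Facility Layout Design in Textile MSMEs. A 4W (When, Who) analysis")
print(tokens)
# ['Facility', 'Layout', 'Design', 'in', 'Textile', 'MSMEs', '4W', 'When', 'Who', 'analysis']
```

Single-character tokens such as "A" and all punctuation are dropped at this stage, and case is preserved. Because the tokens never contain punctuation, punctuation entries in a stop list do not change the result of filtering them.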

#### Defining the stop_words

In [10]:
stop_words = [unidecode(stopW) for stopW in stopwords.words('english')]
non_words = list(punctuation)
non_words.extend(['¿', '¡', '...', '..'])
stop_words = stop_words + non_words

#### Generating the corpus

In [11]:
for article in data:
    article['preprocessed_doc'] = ' '.join([word.lower() for word in tokenizer(unidecode(article['doc'])) if word.lower() not in stop_words])
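
The loop above checks each token against a list, which is a linear scan per token. A sketch of the same pipeline (transliterate, tokenize, lowercase, filter) using a set for the lookup is shown below. To keep it free of third-party packages it uses `unicodedata` as a rough stand-in for `unidecode`: it only strips accents, whereas `unidecode` also transliterates other non-ASCII characters. The stop list and the input sentence are illustrative, and `strip_accents` and `preprocess` are hypothetical names.

```python
import re
import unicodedata

TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def strip_accents(text):
    # Rough stand-in for unidecode: decompose, then drop combining marks
    decomposed = unicodedata.normalize('NFKD', text)
    return ''.join(ch for ch in decomposed if not unicodedata.combining(ch))

def preprocess(doc, stop_set):
    """Lowercase the tokens of doc and drop those found in stop_set."""
    tokens = (tok.lower() for tok in TOKEN_PATTERN.findall(strip_accents(doc)))
    return ' '.join(tok for tok in tokens if tok not in stop_set)

# Tiny illustrative stop list, not the NLTK one; built once and reused
sample_stop_set = frozenset(['the', 'of', 'in', 'and', 'to'])

print(preprocess("The Fate of Wildlife in the Galápagos Islands", sample_stop_set))
# fate wildlife galapagos islands
```

With tens of thousands of documents, building the set once and reusing it avoids repeating the list scan for every token.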

In [12]:
data[3]

{'doc_id': '55960099200',
 'doc': "Facility Layout Design in Textile MSMEs. Literature Review of Resilient Indicators The capacity to respond and adapt to risks and problems in an organization is critical for business success. Any type of weakness causes inefficient use of resources. On the contrary, flexible facilities can ensure the continuity of operations in the face of disruptive events, which significantly harm the company. However, flexibility is not achieved only with the optimization of facilities, as resilient approaches can enhance it. This research synthesizes the variables and indicators with greater use in three different areas, business resilience, textile industry, and the facility layout problem (FLP). A systematic literature review was conducted, considering 99 studies published in 2010–2021. The documents were analyzed using the Atlas.ti software, a 4W (When, Who, What, and Where) analysis, and three research questions posed through the PICO strategy. The findings in

### 4) Storing the corpus

#### Define the corpus path

In [13]:
corpusPath = 'resources/corpus/'

#### Define the corpus file name

In [14]:
modelo = 'tf-idf'
version = 'v10.0'
corpusName = corpusPath+'corpus-'+modelo+"-"+version+".pkl"
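
Writing the pickle fails if `resources/corpus/` does not exist yet. One possible way to build the same file name while making sure the directory is there is sketched below with `pathlib`; the helper `corpus_file` is a hypothetical name introduced for this example.

```python
from pathlib import Path

def corpus_file(base_dir, model, version):
    """Build '<base_dir>/corpus-<model>-<version>.pkl', creating base_dir if needed."""
    directory = Path(base_dir)
    directory.mkdir(parents=True, exist_ok=True)
    return directory / f"corpus-{model}-{version}.pkl"

# Usage with the values from the cells above:
# corpusName = corpus_file('resources/corpus/', 'tf-idf', 'v10.0')
```

pandas accepts path objects in `to_pickle` and `read_pickle`, so the returned `Path` can be used in place of the concatenated string.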

#### Store the preprocessed data in a pickle

In [15]:
pd.DataFrame(data)[['doc_id', 'preprocessed_doc']].to_pickle(corpusName)

In [16]:
dfCorpus = pd.read_pickle(corpusName)

In [17]:
dfCorpus.iloc[[37525]]['preprocessed_doc']

37525    clinical trial bleomycin sulfate treatment cer...
Name: preprocessed_doc, dtype: object

In [18]:
dfCorpus

| | doc_id | preprocessed_doc |
|---|---|---|
| 0 | 57781304700 | metamodeling audio signals design process enco... |
| 1 | 57581708700 | facility layout design textile msmes literatur... |
| 2 | 57219597377 | facility layout design textile msmes literatur... |
| 3 | 55960099200 | facility layout design textile msmes literatur... |
| 4 | 57777188600 | facility layout design textile msmes literatur... |
| ... | ... | ... |
| 37564 | 56836372700 | fate wildlife galapagos islands 15th september... |
| 37565 | 35252391200 | chronic puerperal inversion uterus case puerpe... |
| 37566 | 6507581924 | molecular weight plasma substituting effective... |
| 37567 | 24782346900 | einfache wageburetten und ihr zweckmassiger ge... |
