# Generating the datasets

Building the dataset used for training is not a trivial task. In this case it was done in the following steps:
1. Select the Service Level Agreements (SLAs) to use.
2. Split the text into sentences with the spaCy library.
3. Generate a CSV with the sentences and the defined categories.
4. After classification, split or merge some of the sentences.

We start with installation and imports.

In [None]:
!pip install pandas spacy
!python3 -m spacy download en_core_web_sm

In [None]:
import spacy
import pandas as pd

nlp = spacy.load('en_core_web_sm', disable=['ner'])
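
To get a feel for what sentence splitting returns without downloading a model, the sketch below uses spaCy's rule-based `sentencizer` on a blank English pipeline. This is an assumption made for illustration only: the notebook itself relies on `en_core_web_sm`, whose sentence boundaries come from the parser and can differ from the punctuation-based rule shown here. The sample text is made up.

```python
import spacy

# Blank English pipeline plus the rule-based sentencizer (no model download needed).
# Note: this is NOT the segmentation used by en_core_web_sm in the notebook.
nlp_demo = spacy.blank("en")
nlp_demo.add_pipe("sentencizer")

doc = nlp_demo("This SLA applies to the service. Credits are issued on request.")

# Each item in doc.sents is a Span; .text gives the sentence as a plain string.
sentences = [sent.text for sent in doc.sents]
print(sentences)
```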

We convert each text to a single string so that spaCy can split it into sentences.

In [3]:
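def convert_list_to_string(lines):
    # Join the lines into one string and turn every newline into a space,
    # without shadowing the built-in names `list` and `str`.
    return "".join(lines).replace("\n", " ")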
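
Flattening matters because line breaks in the middle of a sentence may confuse sentence segmentation. The sketch below shows the effect on a few made-up lines; `flatten_lines` is a hypothetical stand-in that joins the lines and replaces every newline with a space.

```python
def flatten_lines(lines):
    """Join lines into one string, turning every newline into a space."""
    return "".join(lines).replace("\n", " ")

# Made-up sample: one sentence wrapped across two lines, then a second sentence.
lines = ["The provider will use\n", "reasonable efforts.\n", "Credits may apply.\n"]
flat = flatten_lines(lines)
print(repr(flat))
```

The wrapped sentence ends up on a single line, and the trailing newline of the last line becomes a trailing space.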

In [4]:
def read_agreement(path):
    with open(path, encoding="utf8") as f:  # close the file once it has been read
        return convert_list_to_string(f.readlines())

agreement1 = read_agreement("slas/aws_ec2_sla_may_2022.txt")
agreement2 = read_agreement("slas/google_engine_sla.txt")
agreement3 = read_agreement("slas/oracle_cloud_infrastructure_database_december_2022.txt")
# agreement4 reads the same file as agreement2.
agreement4 = read_agreement("slas/google_engine_sla.txt")
agreement5 = read_agreement("slas/microsoft_azure_kubernetes_sla_march_2020.txt")
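
Because the file list is maintained by hand, a quick duplicate check before loading can be useful: a document that is loaded twice contributes each of its sentences twice to the training set. The sketch below uses only the standard library; `paths` and `find_duplicates` are hypothetical names, and the paths are copied from the loading cell.

```python
from collections import Counter

# Paths as listed in the notebook's loading cell.
paths = [
    "slas/aws_ec2_sla_may_2022.txt",
    "slas/google_engine_sla.txt",
    "slas/oracle_cloud_infrastructure_database_december_2022.txt",
    "slas/google_engine_sla.txt",
    "slas/microsoft_azure_kubernetes_sla_march_2020.txt",
]

def find_duplicates(items):
    """Return the items that occur more than once, in first-seen order."""
    counts = Counter(items)
    return [item for item in counts if counts[item] > 1]

duplicates = find_duplicates(paths)
print(duplicates)
```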

In [5]:
print("Processing with spaCy...")
# The first four agreements are used for training, the fifth for validation.
textos_train = [agreement1, agreement2, agreement3, agreement4]
textos_test = [agreement5]
docs_train = list(nlp.pipe(textos_train))
docs_test = list(nlp.pipe(textos_test))
print("Done.")

Processing with spaCy...
Done.


We collect the list of sentences to classify for training the model.

In [6]:
sents_train = []

# Iterate over the documents directly and keep each sentence as a plain string.
for doc in docs_train:
    for sent in doc.sents:
        sents_train.append(sent.text)

We collect the list of sentences to classify for validating the model.

In [7]:
# There is a single validation document, so one comprehension is enough.
sents_test = [sent.text for sent in docs_test[0].sents]

We generate the datasets: one CSV per split, with every category column initialized to 0.

In [10]:
# One row per training sentence; every category column starts at 0.
df = pd.DataFrame(sents_train, columns=["text"])
for label in ["service", "metric", "objetive", "remedies", "claim", "exception", "definition"]:
    df[label] = 0
df.to_csv("slas/train_empty.csv", index=False, encoding='utf-8')
print(df.head())

                                                text  service  metric  \
0                         Last Updated: May 25, 2022        0       0   
1  This Amazon Compute Service Level Agreement (t...        0       0   
2  In the event of a conflict between the terms o...        0       0   
3  Capitalized terms used herein but not defined ...        0       0   
4  *For purposes of this SLA, Amazon EC2 includes...        0       0   

   objetive  remedies  claim  exception  definition  
0         0         0      0          0           0  
1         0         0      0          0           0  
2         0         0      0          0           0  
3         0         0      0          0           0  
4         0         0      0          0           0  


In [11]:
# Same layout for the validation sentences.
df = pd.DataFrame(sents_test, columns=["text"])
for label in ["service", "metric", "objetive", "remedies", "claim", "exception", "definition"]:
    df[label] = 0
df.to_csv("slas/validation_empty.csv", index=False, encoding='utf-8')
print(df.head())

                                                text  service  metric  \
0  SLA for Azure Kubernetes Service (AKS) Last up...        0       0   
1  For customers who have purchased an Azure Kube...        0       0   
2  The availability of the agent nodes in your AK...        0       0   
3  Please see the Virtual Machines SLA for more d...        0       0   
4  Introduction  This Service Level Agreement for...        0       0   

   objetive  remedies  claim  exception  definition  
0         0         0      0          0           0  
1         0         0      0          0           0  
2         0         0      0          0           0  
3         0         0      0          0           0  
4         0         0      0          0           0  
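
The training and validation cells differ only in the sentence list and the output path, so the shared logic could be factored into one helper. A minimal sketch, assuming pandas is installed; `LABELS` and `build_empty_dataset` are hypothetical names that do not appear in the original notebook, and the column names are kept exactly as spelled there.

```python
import pandas as pd

# Category columns as spelled in the notebook (including "objetive").
LABELS = ["service", "metric", "objetive", "remedies", "claim", "exception", "definition"]

def build_empty_dataset(sentences):
    """Return a DataFrame with one row per sentence and every label set to 0."""
    df = pd.DataFrame({"text": list(sentences)})
    for label in LABELS:
        df[label] = 0
    return df

# Made-up sentences, only to show the resulting layout.
example = build_empty_dataset(["First sentence.", "Second sentence."])
print(example.shape)
```

With such a helper, each split would reduce to a call like `build_empty_dataset(sents_train).to_csv("slas/train_empty.csv", index=False, encoding="utf-8")`.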
