<br><br><br>
<h2><font color="#004D7F" size=4>API Topic Analysis</font></h2>



<h1><font color="#004D7F" size=5>Swagger Topic Modeling</font></h1>

<br><br>
<div style="text-align: right">

<font color="#004D7F" size=3>Universidad Politécnica de Madrid</font>

</div>

---

<a id="indice"></a>
<h2><font color="#004D7F" size=5>Contents</font></h2>
<br>


* [1. Scientific justification](#section1)
* [2. Hypotheses and objectives](#section2)
* [3. Review of theoretical and experimental background](#section3)
* [4. Presentation of theoretical and experimental data](#section4)
    * [4.1. Preprocessing: data cleaning](#section4.1)
    * [4.2. Building a data matrix](#section4.2)
    * [4.3. Training the model](#section4.3)
    * [4.4. Studying the model](#section4.4)
    * [4.5. Analysis of the full text by party](#section4.5)
    * [4.6. TF-IDF](#section4.6)
    * [4.7. TextRank algorithm](#section4.7)
    * [4.8. RAKE algorithm](#section4.8)
* [5. Discussion and results](#section5)
* [6. Conclusions](#section6)
* [7. Bibliography](#section7)
---

<a id="section4"></a> 
## <font color="#004D7F">4. Presentation of theoretical and experimental data</font>
<br>

## <font color="#004D7F"> Installing the required packages </font>

In [None]:
#!pip install nltk
#!pip install pandas --upgrade
#!pip install pyLDAvis

In [None]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns
sns.set()

#%config InlineBackend.figure_format = 'retina'
%matplotlib inline

# Hide warnings
import warnings
warnings.simplefilter('ignore')

# spaCy for lemmatization
import spacy

# Plotting tools
import pyLDAvis
#import pyLDAvis.gensim

# Gensim
import gensim
import gensim.corpora as corpora
from gensim.models import CoherenceModel

#from IPython.core.display import display, HTML
#display(HTML("<style>.container { width:95% !important; }</style>"))

In [None]:
num_docs = 5000 
np.random.seed(0)

# Read the data (a multi-character separator requires the Python parsing engine)
df_apis = pd.read_csv('frame.csv', sep='---#---', names=['URI', 'description'], engine='python')
df_apis = df_apis.mask(df_apis.eq('None')).dropna()
df_apis = df_apis.reset_index(drop=True)
df_apis.head(1)

Unnamed: 0,URI,description
0,"""./openapi-directory/APIs/citycontext.com/1.0....","""2.0.api.citycontext.com./v1.City Context prov..."


<div style="text-align: right">
<a href="#indice"><font size=5><i class="fa fa-arrow-circle-up" aria-hidden="true" style="color:#004D7F"></i></font></a>
</div>

---

<a id="section2"></a> 
## <font color="#004D7F"> 4.2.1. Preprocessing: data cleaning</font>
<br>

### <font color="#004D7F"> Removing URLs and HTML tags</font>
<br>

Regular expressions are used to remove URLs and HTML tags.

In [None]:
import re

def removeHTML(text):
    # Remove HTML tags
    text = re.sub(r'<[^>]*>', ' ', text)
    # Remove URLs of the form scheme://host/path
    text = re.sub(r'\w+:\/{2}[\d\w-]+(\.[\d\w-]+)*(?:(?:\/[^\s/]*))*', ' ', text)
    # Remove single letters surrounded by whitespace
    text = re.sub(r'\s+[a-zA-Z]\s+', ' ', text)
    # Lowercase and replace non-word characters with spaces
    text = re.sub(r'\W+', ' ', text.lower())
    # Remove digits
    text = re.sub(r'\d+', ' ', text)
    # Collapse runs of whitespace into a single space
    text = re.sub(r'\s+', ' ', text)
    return text

# Test
texto = '0 09 Eliminar http://www.upm.es/     de cualquier URL </a> y caracteres %$ @#'
print(removeHTML(texto))

 eliminar de cualquier url caracteres 


### <font color="#004D7F"> Removing numbers and punctuation </font>
<br>

LDA (and other _Bag of Words_ methods) split documents into words or _tokens_. By default, any sequence of characters delimited by whitespace is considered a _token_. This includes punctuation marks, numbers, etc.

In this particular case, numbers and punctuation marks are irrelevant for modelling the topics that make up the documents, so they are discarded. The _tokens_ of interest can be selected with a regular expression through the `RegexpTokenizer` object of the `nltk` library. Its constructor takes the regular expression as an argument, and the tokenizer returns only the _tokens_ that match the expression.
 

In [None]:
import nltk
from nltk.tokenize import RegexpTokenizer

tokenizer = RegexpTokenizer(r'[a-zA-Z][a-zA-Z][a-zA-Z]+') # at least three letters
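
To see what this pattern keeps, the sketch below applies the same regular expression with the standard-library `re` module. It assumes that `RegexpTokenizer`, in its default (non-gap) mode, returns the same matches as `re.findall`; the sample sentence and the `tokenize` helper are made up for the illustration.

```python
import re

# Same pattern as the tokenizer above: runs of at least three ASCII letters
TOKEN_PATTERN = re.compile(r'[a-zA-Z][a-zA-Z][a-zA-Z]+')

def tokenize(text):
    """Return every run of three or more letters found in text."""
    return TOKEN_PATTERN.findall(text)

sample = "get user id 42 by uuid, v2 api"
tokens = tokenize(sample)
print(tokens)
```

Short words such as `id` and `by`, the number `42` and the mixed token `v2` are all dropped, which is the intended effect for topic modelling but also discards two-letter abbreviations.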

### <font color="#004D7F"> Removing _stop words_ </font>
<br>

_Stop words_ are the most common words in a language and generally carry no relevant information. In this particular case they contribute nothing useful for determining the groups that make up the documents, so they are removed.

The `stopwords.words` function returns the list of stop words for a given language.

In [None]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [None]:
from nltk.corpus import stopwords

stopwords_en = stopwords.words('english')
# Add filler words and OpenAPI type names as extra stop words.
# NOTE: 'interger' is presumably a typo for 'integer'; as written, 'integer' is not
# filtered and still appears in the cleaned text and keywords shown later.
new_words = ('ct', 'upon', 'isc', 'due', 'per', 'um', 'uh', 'er', 'ah', 'bio', 'interger', 'Integer',
             'boolean', 'Boolean', 'float', 'Float', 'string', 'String', 'true', 'True', 'false', 'False',
             'object', 'Object')
stopwords_en.extend(new_words)
print(stopwords_en)

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '
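
The effect of the extended list can be sketched without NLTK. The snippet below uses a small hand-written stop-word set (a hypothetical subset, not the NLTK list) and shows that the membership test is case-sensitive, so capitalised entries only matter when the text has not been lowercased beforehand.

```python
# Hypothetical subset of the stop-word list, for illustration only
STOP_WORDS = {"the", "of", "is", "in", "string", "String", "object"}

def drop_stop_words(tokens, stop_words=STOP_WORDS):
    """Keep the tokens that are not in stop_words (exact, case-sensitive match)."""
    return [t for t in tokens if t not in stop_words]

tokens = ["the", "format", "of", "the", "certificate", "string", "Object"]

kept = drop_stop_words(tokens)
kept_lowercased = drop_stop_words([t.lower() for t in tokens])

print(kept)
print(kept_lowercased)
```

Using a `set` instead of a list makes each lookup constant-time, which helps when the check runs once per token over thousands of descriptions.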

### <font color="#004D7F"> Lemmatization </font>
<br>

A lemma is a word with a concrete meaning that represents a group of words with a similar meaning. For example, the Spanish verb `gustar` is the lemma of all its conjugated forms (`gustó`, `gustará`, etc.). Lemmatization replaces each word with its lemma. This substantially reduces the size of the vocabulary while preserving most (if not all) of the relevant information.

Here the `nltk.stem.wordnet.WordNetLemmatizer` object is used. It relies on _WordNet_, a lexical database in which words are grouped into sets of synonyms (_synsets_) that each express a concept. Conceptual and lexical relations are in turn defined between the different _synsets_.

<div class="alert alert-block alert-warning">
    
<i class="fa fa-exclamation-circle" aria-hidden="true"></i>
__Important__: _WordNet_ must be downloaded before using the lemmatizer. 
</div>

In [None]:
nltk.download('wordnet')
from nltk.stem.wordnet import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


### <font color="#004D7F"> Cleaning the data </font>

In [None]:
def clean(text):
    tokens = [lemmatizer.lemmatize(token) for token in tokenizer.tokenize(text) if token not in stopwords_en]
    return " ".join(tokens)    

<font color="#004D7F"> <i class="fa fa-pencil-square-o" aria-hidden="true" style="color:#113D68"></i> </font>  Apply the cleaning function to the `description` column of the `data` dataset and store the result in a new column named `clean_text` (this takes several tens of seconds).

In [None]:
nltk.download('omw-1.4')

[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


True

In [None]:
from time import time

start = time()
data = df_apis.copy()
# Strip HTML tags, URLs, digits and punctuation from every description
data['description'] = data['description'].apply(removeHTML)
print("Time: {:0.3f}s.".format(time() - start))
data["description"][51]

Time: 32.203s.


' of pre university education is providing certificates of ii puc class xii march july and march examination marksheets state board department of pre university education karnataka openapi to verify class xii marksheet hscer date of birth in dd mm yyyy format string full name sunil kumar string enter pass year mar string enter reg no string aadhaar number string object components schemas consentartifactschema the format of the certificate in response string a unique transaction id for this request in uuid format it is used for tracking the request f f c b dfc c a f uuid string object request format response body contains contents of the certificate in pdf format the certificate data in response body in pdf xml or json format as requested in format parameter components responses error components responses error components responses error components responses error components responses error components responses error components responses error class xii marksheet string string object ba

In [None]:
data['clean_text'] = data['description'].map(clean)
# Elapsed time is measured from `start` in the previous cell, so it includes the HTML cleanup
print("Time: {:0.3f}s.".format(time() - start))
data["clean_text"][51]

Time: 74.425s.


'pre university education providing certificate puc class xii march july march examination marksheets state board department pre university education karnataka openapi verify class xii marksheet hscer date birth yyyy format full name sunil kumar enter pas year mar enter reg aadhaar number component schema consentartifactschema format certificate response unique transaction request uuid format used tracking request dfc uuid request format response body contains content certificate pdf format certificate data response body pdf xml json format requested format parameter component response error component response error component response error component response error component response error component response error component response error class xii marksheet bad request unauthorized access record found internal server error bad gateway service unavailable gateway timeout march senion school certificate examination integer integer integer integer integer integer economics subject array 

<a id="section41"></a> 
# <font color="#004D7F"> 4.1. TF-IDF</font>
<br>

<a id="section41"></a> 
## <font color="#004D7F"> 4.1. TF-IDF with phrases</font>
<br>

In [None]:
from nltk.stem.snowball import SnowballStemmer
stemmer = SnowballStemmer("english")
data['stemmed'] = data.clean_text.map(lambda x: ' '.join([stemmer.stem(y) for y in x.split(' ')]))
data.stemmed.head()

0    api citycontext com citi context provid straig...
1    via rapidapi com advic igor rodionov dynamicdo...
2    manag azur com domainregistrationprovid api cl...
3    axesso use api fetch inform amazon product apa...
4    product environ environ api team gdsteam inter...
Name: stemmed, dtype: object

### With stemmed text

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

n = 5 # top n TF-IDF words

tfidf = TfidfVectorizer(token_pattern=r"\w+") # no words are left out
X = tfidf.fit_transform(data['stemmed']).toarray()
# Column indices of the n highest-scoring terms in each row (unordered within the top n)
ind = np.argpartition(-X, n, axis=1)[:, :n]
# get_feature_names() was removed from recent scikit-learn releases
terms = np.array(tfidf.get_feature_names_out())

# One list of top terms per document, assigned in a single step
# (avoids chained assignment such as data[col][key] = ...)
data['tfidf_stemmed'] = terms[ind].tolist()

In [None]:
data['tfidf_stemmed'][51]

['marksheet', 'format', 'compon', 'certif', 'kumar']
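
For readers unfamiliar with the weighting, the sketch below computes TF-IDF scores by hand on three made-up documents and picks the top terms of each one. It assumes the smoothed formula that scikit-learn documents as the `TfidfVectorizer` default, idf(t) = ln((1 + N) / (1 + df(t))) + 1, and skips the L2 normalisation, so the absolute values differ from the vectorizer's output although the ranking inside each document is the same. The corpus and the `top_tfidf_terms` helper are assumptions for the example.

```python
import math
from collections import Counter

# Toy corpus (made up); each document is already cleaned and space-separated
docs = [
    "search radius park search school",
    "latex template compile document",
    "domain microsoft domain resource",
]

def top_tfidf_terms(docs, n=2):
    """Return, for each document, its n terms with the highest TF-IDF weight."""
    tokenized = [d.split() for d in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term
    df = Counter(term for tokens in tokenized for term in set(tokens))
    result = []
    for tokens in tokenized:
        tf = Counter(tokens)
        scores = {
            term: count * (math.log((1 + n_docs) / (1 + df[term])) + 1)
            for term, count in tf.items()
        }
        # Sort by descending score, breaking ties alphabetically
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        result.append(ranked[:n])
    return result

top = top_tfidf_terms(docs, n=2)
print(top)
```

Unlike `argpartition`, this version returns the terms in score order, which makes the output easier to read at the cost of a full sort per document.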

### With lemmatized text

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

n = 5 # top n TF-IDF words

# Unigrams and bigrams; no words are left out
tfidf = TfidfVectorizer(token_pattern=r"\w+", ngram_range=(1, 2))
X = tfidf.fit_transform(data['clean_text']).toarray()
# Column indices of the n highest-scoring terms in each row (unordered within the top n)
ind = np.argpartition(-X, n, axis=1)[:, :n]
# get_feature_names() was removed from recent scikit-learn releases
terms = np.array(tfidf.get_feature_names_out())

# One list of top terms per document, assigned in a single step
data['tfidf_lemmatize'] = terms[ind].tolist()

In [None]:
data['tfidf_lemmatize'][51]

['component response',
 'certificate',
 'response error',
 'error component',
 'format']

---
<div style="text-align: right">
<a href="#indice"><font size=5><i class="fa fa-arrow-circle-up" aria-hidden="true" style="color:#004D7F"></i></font></a>
</div>

---

<a id="section4.8"></a> 
# <font color="#004D7F"> 4.8. YAKE algorithm</font>
<br>

In [None]:
!pip install yake



In [None]:
import yake
kw_extractor = yake.KeywordExtractor(top=5, stopwords=stopwords_en)
keywords = kw_extractor.extract_keywords(data['clean_text'][0])
for kw, v in keywords:
  print("Keyphrase: ",kw, ": score", v)

Keyphrase:  search radius integer : score 5.0232015010041464e-05
Keyphrase:  ofsted report outstanding : score 9.552080195872224e-05
Keyphrase:  report outstanding inadequate : score 9.82977356604936e-05
Keyphrase:  last ofsted report : score 9.86100375855199e-05
Keyphrase:  search radius park : score 9.869949513795672e-05
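
When reading these scores, note that YAKE's documentation states that a lower score means a more relevant keyphrase. The short sketch below makes that explicit by ranking the pairs printed above in ascending order; the `pairs` list is copied from that output and the variable names are only illustrative.

```python
# (keyphrase, score) pairs copied from the output above
pairs = [
    ("search radius integer", 5.0232015010041464e-05),
    ("ofsted report outstanding", 9.552080195872224e-05),
    ("report outstanding inadequate", 9.82977356604936e-05),
    ("last ofsted report", 9.86100375855199e-05),
    ("search radius park", 9.869949513795672e-05),
]

# Lower YAKE score = more relevant, so sort ascending
ranked = sorted(pairs, key=lambda pair: pair[1])
best_phrase, best_score = ranked[0]
print(best_phrase, best_score)
```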


In [None]:
kw_extractor = yake.KeywordExtractor(top=5, stopwords=stopwords_en)

# extract_keywords returns (keyphrase, score) pairs; keep only the keyphrases.
# Building the whole column at once avoids chained assignment (data['yake'][i] = ...).
data['yake'] = [
    [kw for kw, _ in kw_extractor.extract_keywords(text)]
    for text in data['clean_text']
]

In [None]:
data['yake'][51]

['component response error',
 'response error component',
 'error component response',
 'integer integer integer',
 'utc includes miliseconds']

<a id="section4.9"></a> 
# <font color="#004D7F"> 4.9. TextRank algorithm</font>
<br>

In [None]:
#!pip install summa
#import pandas as pd
#data = pd.read_csv('data.csv', sep="¬", engine="python")



Row 43 has to be dropped because processing it exhausts the available RAM.

In [None]:
#data1 = data.copy()
#data = data1.copy()
data.drop(data.index[43], inplace=True)
data.reset_index(drop=True, inplace=True)
data

Unnamed: 0,URI,description,clean_text,stemmed,tfidf_stemmed,tfidf_lemmatize,yake
0,"""./openapi-directory/APIs/citycontext.com/1.0....",api citycontext com v city context provides s...,api citycontext com city context provides stra...,api citycontext com citi context provid straig...,"[search, park, radius, integ, school]","[radius, school, search radius, integer, park]","[search radius integer, ofsted report outstand..."
1,"""./openapi-directory/APIs/rapidapi.com/dynamic...",via rapidapi com advicement io igor rodionov ...,via rapidapi com advicement igor rodionov dyna...,via rapidapi com advic igor rodionov dynamicdo...,"[latex, advic, templat, compil, rapidapi]","[latex, compile, document, template, rapidapi]","[company pty ltd, doc url expires, numerical p..."
2,"""./openapi-directory/APIs/azure.com/web-Domain...",management azure com domainregistrationprovid...,management azure com domainregistrationprovide...,manag azur com domainregistrationprovid api cl...,"[domain, microsoft, domainregistr, metric, res...","[microsoft domain, domain, microsoft, microsof...","[microsoft domain domain, domain microsoft dom..."
3,"""./openapi-directory/APIs/axesso.de/1.0.0/open...",axesso de use this api to fetch information t...,axesso use api fetch information amazon produc...,axesso use api fetch inform amazon product apa...,"[keyword, product, found, amazon, int]","[product found, found product, found, product,...","[found product found, product successfully fou..."
4,"""./openapi-directory/APIs/api.gov.uk/vehicle-e...",production environment environment api team g...,production environment environment api team gd...,product environ environ api team gdsteam inter...,"[vehicl, date, registr, dvla, errorrespons]","[registration, vehicle, registration number, d...","[component schema errorresponse, response comp..."
...,...,...,...,...,...,...,...
62,"""./openapi-directory/APIs/amazonaws.com/sagema...",v amazon augmented ai runtime amazon augmente...,amazon augmented runtime amazon augmented amaz...,amazon augment runtim amazon augment amazon ad...,"[human, loop, compon, amz, amazon]","[human loop, loop, human, component, amz]","[component parameter amz, loop component schem..."
63,"""./openapi-directory/APIs/ideaconsult.net/nano...",nanoreg database database database database d...,nanoreg database database database database da...,nanoreg databas databas databas databas databa...,"[form, substanc, search, mincount, queri]","[search, substance, form, type term, integer f...","[integer form page, page integer form, page qu..."
64,"""./openapi-directory/APIs/twitter.com/labs/1.1...",twitter api developers reference labs v twitt...,twitter api developer reference lab twitter de...,twitter api develop refer lab twitter develop ...,"[tweet, compon, schema, user, rule]","[user, component, component schema, tweet, sch...","[array component schema, component schema prob..."
65,"""./openapi-directory/APIs/infermedica.com/v2/s...",api infermedica com v infermedica empower you...,api infermedica com infermedica empower health...,api infermedica com infermedica empow healthca...,"[symptom, definit, year, observ, age]","[age value, age, observation, query age, symptom]","[unit age value, query age value, age value in..."


In [None]:
from summa import keywords

# keywords.keywords(..., scores=True) returns (keyword, score) pairs;
# keep the first five keywords of each document
data['textrank'] = [
    [kw for kw, _ in keywords.keywords(text, scores=True)[:5]]
    for text in data['clean_text']
]
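
As background for the cell above, the following is a minimal sketch of the idea behind TextRank for keywords: words become nodes of a co-occurrence graph and are ranked with PageRank. It is a simplified illustration under stated assumptions (single words only, a fixed window, a fixed number of iterations) and is not summa's implementation, which differs in its details; the function name, parameters and sample tokens are all hypothetical.

```python
from collections import defaultdict

def textrank_keywords(tokens, window=2, damping=0.85, iterations=50, top=3):
    """Rank words with PageRank over an undirected co-occurrence graph."""
    # Link every pair of distinct words that appear within `window` positions
    neighbours = defaultdict(set)
    for i, word in enumerate(tokens):
        for other in tokens[i + 1:i + window + 1]:
            if other != word:
                neighbours[word].add(other)
                neighbours[other].add(word)
    # Power iteration of the PageRank update
    score = {word: 1.0 for word in neighbours}
    for _ in range(iterations):
        score = {
            word: (1 - damping)
            + damping * sum(score[v] / len(neighbours[v]) for v in neighbours[word])
            for word in neighbours
        }
    # Highest score first, ties broken alphabetically
    return sorted(score, key=lambda w: (-score[w], w))[:top]

tokens = "search radius park search radius school search postcode".split()
top_words = textrank_keywords(tokens)
print(top_words)
```

Words that co-occur with many different neighbours, such as `search` in the sample, accumulate the highest score.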

<a id="section4.9"></a> 
# <font color="#004D7F"> 4.9. RAKE algorithm</font>
<br>

Adapted from https://programmerbackpack.com/automated-python-keywords-extraction-textrank-vs-rake/

RAKE stands for Rapid Automatic Keyword Extraction. It is a fast and effective algorithm for extracting keywords and keyphrases. The algorithm may look too simple to be true, but that simplicity is precisely what makes it appealing.
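
The simplicity mentioned above can be seen in a compact sketch of the usual RAKE recipe: split the text into candidate phrases at stop words and punctuation, score each word by its degree divided by its frequency, and score each phrase as the sum of its word scores. This is an illustration under assumptions (a tiny hand-written stop-word set and a made-up sentence), not the code of any particular RAKE package.

```python
import re
from collections import defaultdict

# Tiny stop-word set for the example (an assumption; real lists are much longer)
STOP_WORDS = {"a", "the", "of", "for", "and", "to", "is", "in", "this"}

def rake(text, stop_words=STOP_WORDS):
    """Score candidate phrases with the degree/frequency ratio used by RAKE."""
    words = re.findall(r"[a-z]+|[.,;:!?]", text.lower())
    # 1. Split the word sequence into candidate phrases at stop words and punctuation
    phrases, current = [], []
    for w in words:
        if w in stop_words or not w.isalpha():
            if current:
                phrases.append(current)
            current = []
        else:
            current.append(w)
    if current:
        phrases.append(current)
    # 2. Word statistics: frequency and degree (co-occurrences inside phrases, self included)
    freq, degree = defaultdict(int), defaultdict(int)
    for phrase in phrases:
        for w in phrase:
            freq[w] += 1
            degree[w] += len(phrase)
    # 3. Phrase score = sum of degree/frequency over its words
    scores = {" ".join(p): sum(degree[w] / freq[w] for w in p) for p in phrases}
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))

text = "This API returns the vehicle registration number and the tax status of a vehicle."
ranked = rake(text)
print(ranked)
```

Because phrases are delimited by stop words and punctuation, RAKE has little to split on when the input has already had both removed, which is worth keeping in mind when it is applied to pre-cleaned text.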

<a id="section4.8"></a> 
# <font color="#004D7F"> 4.8. Multirake algorithm</font>
<br>

In [None]:
#!pip install multi_rake

In [None]:
from multi_rake import Rake

kw_extractor = Rake(max_words=3)

# Rake.apply returns (keyphrase, score) pairs; keep at most five keyphrases per document.
# The result is not stored in a variable named `keywords`, which would shadow the summa module.
data['multirake'] = [
    [kw for kw, _ in list(kw_extractor.apply(text))[:5]]
    for text in data['clean_text']
]

In [None]:
data['yake'][0]

['search radius integer',
 'ofsted report outstanding',
 'report outstanding inadequate',
 'last ofsted report',
 'search radius park']

## 4.10. KeyBERT: keyword extraction with BERT
(https://github.com/MaartenGr/KeyBERT)
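
KeyBERT's README describes the method as embedding the document and its candidate n-grams, then ranking the candidates by cosine similarity to the document. The sketch below mirrors that pipeline with the standard library only. The `embed` function is a deliberate stand-in that returns bag-of-words counts instead of BERT embeddings, so the ranking is illustrative and will not match KeyBERT's; all function names and the sample document are assumptions.

```python
import math
from collections import Counter

def ngrams(tokens, n_min=1, n_max=3):
    """All contiguous word n-grams with n_min <= n <= n_max, as strings."""
    return {
        " ".join(tokens[i:i + n])
        for n in range(n_min, n_max + 1)
        for i in range(len(tokens) - n + 1)
    }

def embed(text):
    """Stand-in embedding: a bag-of-words count vector (NOT a BERT embedding)."""
    return Counter(text.split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def extract_keywords(doc, top_n=3):
    """Rank candidate n-grams by cosine similarity to the whole document."""
    doc_vec = embed(doc)
    candidates = ngrams(doc.split())
    scored = [(c, cosine(embed(c), doc_vec)) for c in candidates]
    # Highest similarity first, ties broken alphabetically
    return sorted(scored, key=lambda cs: (-cs[1], cs[0]))[:top_n]

doc = "twitter api developer reference twitter api rule tweet"
result = extract_keywords(doc)
for phrase, score in result:
    print(f"{phrase}: {score:.3f}")
```

The `(1, 3)` range in `ngrams` plays the role of the `keyphrase_ngram_range=(1, 3)` argument used in the cells that follow.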

In [None]:
#!jupyter nbextension enable --py widgetsnbextension
#!pip install KeyBERT
#!pip install ipywidgets

In [None]:
from keybert import KeyBERT
kw_model_dist = KeyBERT('distilbert-base-nli-mean-tokens')
kw_model_all = KeyBERT(model='all-mpnet-base-v2')


## With the distilbert-base-nli-mean-tokens model

In [None]:
# Top five keyphrases (one to three words) per document with the DistilBERT model.
# extract_keywords returns (keyphrase, score) pairs; keep only the keyphrases.
data['BERT_dist'] = [
    [kw for kw, _ in kw_model_dist.extract_keywords(
        text, keyphrase_ngram_range=(1, 3), stop_words='english', top_n=5)]
    for text in data['clean_text']
]

## With the all-mpnet-base-v2 model

In [None]:
# Top five keyphrases (one to three words) per document with the MPNet model.
# extract_keywords returns (keyphrase, score) pairs; keep only the keyphrases.
data['BERT_all'] = [
    [kw for kw, _ in kw_model_all.extract_keywords(
        text, keyphrase_ngram_range=(1, 3), stop_words='english', top_n=5)]
    for text in data['clean_text']
]

In [None]:
data

Unnamed: 0,URI,description,clean_text,stemmed,tfidf_stemmed,tfidf_lemmatize,yake,textrank,multirake,BERT_dist,BERT_all
0,"""./openapi-directory/APIs/citycontext.com/1.0....",api citycontext com v city context provides s...,api citycontext com city context provides stra...,api citycontext com citi context provid straig...,"[search, park, radius, integ, school]","[radius, school, search radius, integer, park]","[search radius integer, ofsted report outstand...","[integer, query, definition, statistic school,...","[inspection report date, api citycontext, city...","[postcode search radius, radius integer search...","[city context api, api citycontext com, api ci..."
1,"""./openapi-directory/APIs/rapidapi.com/dynamic...",via rapidapi com advicement io igor rodionov ...,via rapidapi com advicement igor rodionov dyna...,via rapidapi com advic igor rodionov dynamicdo...,"[latex, advic, templat, compil, rapidapi]","[latex, compile, document, template, rapidapi]","[company pty ltd, doc url expires, numerical p...","[template latex, compile, compiler, header, ad...","[table stretching multiple, generation templat...","[optimized interactive pdfs, dynamic pdf docum...","[advicement dynamicdocs api, dynamicdocs api, ..."
2,"""./openapi-directory/APIs/azure.com/web-Domain...",management azure com domainregistrationprovid...,management azure com domainregistrationprovide...,manag azur com domainregistrationprovid api cl...,"[domain, microsoft, domainregistr, metric, res...","[microsoft domain, domain, microsoft, microsof...","[microsoft domain domain, domain microsoft dom...","[domain, microsoft, operation, apis resource p...","[display portal property, domain update existi...","[azure com domainregistrationprovider, validat...","[azure com domainregistrationprovider, com dom..."
3,"""./openapi-directory/APIs/axesso.de/1.0.0/open...",axesso de use this api to fetch information t...,axesso use api fetch information amazon produc...,axesso use api fetch inform amazon product apa...,"[keyword, product, found, amazon, int]","[product found, found product, found, product,...","[found product found, product successfully fou...","[query, request, requested, amazon product, ar...","[sort option, axesso]","[keyword search amazon, integer relevanceblend...","[swagger request amazon, api openapi swagger, ..."
4,"""./openapi-directory/APIs/api.gov.uk/vehicle-e...",production environment environment api team g...,production environment environment api team gd...,product environ environ api team gdsteam inter...,"[vehicl, date, registr, dvla, errorrespons]","[registration, vehicle, registration number, d...","[component schema errorresponse, response comp...","[date, specification dvla vehicle, error, api,...",[],"[schema errorresponse vehicle, component schem...","[schema errorresponse vehicle, api vehicle enq..."
...,...,...,...,...,...,...,...,...,...,...,...
62,"""./openapi-directory/APIs/amazonaws.com/sagema...",v amazon augmented ai runtime amazon augmente...,amazon augmented runtime amazon augmented amaz...,amazon augment runtim amazon augment amazon ad...,"[human, loop, compon, amz, amazon]","[human loop, loop, human, component, amz]","[component parameter amz, loop component schem...","[amazon, human, loop information, return resou...","[arn flow definition, interact amazon programm...","[amazon sagemaker developer, developer guide a...","[schema flowdefinitionarn amazon, workflow nee..."
63,"""./openapi-directory/APIs/ideaconsult.net/nano...",nanoreg database database database database d...,nanoreg database database database database da...,nanoreg databas databas databas databas databa...,"[form, substanc, search, mincount, queri]","[search, substance, form, type term, integer f...","[integer form page, page integer form, page qu...","[query type, form, structure substance, field,...","[publicname publicname owner, integer form, am...","[search enanomapper database, parameter ambitd...","[substance according search, substance query s..."
64,"""./openapi-directory/APIs/twitter.com/labs/1.1...",twitter api developers reference labs v twitt...,twitter api developer reference lab twitter de...,twitter api develop refer lab twitter develop ...,"[tweet, compon, schema, user, rule]","[user, component, component schema, tweet, sch...","[array component schema, component schema prob...","[component schema, tweeted, user, array, rule ...","[addordeleterules dry run, url marked sensitiv...","[twitter api developer, lab twitter developer,...","[rule tweet apis, tweet apis related, twitter ..."
65,"""./openapi-directory/APIs/infermedica.com/v2/s...",api infermedica com v infermedica empower you...,api infermedica com infermedica empower health...,api infermedica com infermedica empow healthca...,"[symptom, definit, year, observ, age]","[age value, age, observation, query age, symptom]","[unit age value, query age value, age value in...","[definition, public array, observation, observ...","[int query age, text return rationale, lab tes...","[infermedica empower healthcare, diagnostic en...","[api infermedica com, api infermedica api, ins..."


In [None]:
data.to_csv('data.csv', index=False, header=True, sep="#")

data1 = data.copy()
del data1['clean_text']
del data1['description']
del data1['stemmed']
data1.to_csv('data_Andrea.csv', index=False, sep=';')

<div style="text-align: right">
<a href="#indice"><font size=5><i class="fa fa-arrow-circle-up" aria-hidden="true" style="color:#004D7F"></i></font></a>
</div>

---

<div style="text-align: right"> <font size=6><i class="fa fa-coffee" aria-hidden="true" style="color:#004D7F"></i> </font></div>

- <a id="refe1">[ref 1]</a> https://brenocon.com/oconnor+stewart+smith.irevents.acl2013.pdf
- <a id="refe1">[ref 2]</a> https://towardsdatascience.com/evaluate-topic-model-in-python-latent-dirichlet-allocation-lda-7d57484bb5d0