# Description

First, similarity is measured with a single metric, jaccard_similarity, in two versions:
+ With the untokenized text
+ With the tokenized text

Next, several techniques for computing the similarity between phrases and terms are applied:

+ Substring matching
+ Prefix/suffix search
+ Fuzzy matching
+ Stemming-based matching
+ Regular expressions

# Downloading the data

In [1]:
# Create the data folder if it does not exist
!mkdir -p data

# Download the file and save it in the data folder
!wget -P data https://raw.githubusercontent.com/jamezahidalgo/datos/master/Concentracio%CC%81n_de_Minerales.csv


--2024-08-18 23:55:17--  https://raw.githubusercontent.com/jamezahidalgo/datos/master/Concentracio%CC%81n_de_Minerales.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.111.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.109.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 5764305 (5.5M) [text/plain]
Saving to: ‘data/Concentración_de_Minerales.csv’


2024-08-18 23:55:17 (62.8 MB/s) - ‘data/Concentración_de_Minerales.csv’ saved [5764305/5764305]



# Installing the libraries

In [2]:
!pip install nltk scikit-learn



In [3]:
!pip install fuzzywuzzy
!pip install python-Levenshtein

Collecting fuzzywuzzy
  Downloading fuzzywuzzy-0.18.0-py2.py3-none-any.whl.metadata (4.9 kB)
Downloading fuzzywuzzy-0.18.0-py2.py3-none-any.whl (18 kB)
Installing collected packages: fuzzywuzzy
Successfully installed fuzzywuzzy-0.18.0
Collecting python-Levenshtein
  Downloading python_Levenshtein-0.25.1-py3-none-any.whl.metadata (3.7 kB)
Collecting Levenshtein==0.25.1 (from python-Levenshtein)
  Downloading Levenshtein-0.25.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (3.3 kB)
Collecting rapidfuzz<4.0.0,>=3.8.0 (from Levenshtein==0.25.1->python-Levenshtein)
  Downloading rapidfuzz-3.9.6-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Downloading python_Levenshtein-0.25.1-py3-none-any.whl (9.4 kB)
Downloading Levenshtein-0.25.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (177 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m177.4/177.4 kB[0m [31m1.8 MB/s[0m eta [36m0:00:00[0m
[?25hDownloading rapi

# Loading the required libraries

In [4]:
import numpy as np
import pandas as pd
import nltk
import string
import re

from fuzzywuzzy import fuzz
from nltk.stem import PorterStemmer

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import jaccard_score

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

from collections import Counter

In [5]:
# Download the NLTK resources
nltk.download('punkt')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

# Loading the data

In [6]:
# Load the data
data = pd.read_csv("data/Concentración_de_Minerales.csv")
data.shape

(3192, 33)

# Initial preprocessing

In [7]:
data.columns = data.columns.str.lower().str.replace(" ", "_", regex=True)
data.columns

Index(['#', 'jurisdiction', 'kind', 'display_key', 'lens_id',
       'publication_date', 'publication_year', 'application_number',
       'application_date', 'priority_numbers', 'earliest_priority_date',
       'title', 'abstract', 'applicants', 'inventors', 'owners', 'url',
       'document_type', 'has_full_text', 'cites_patent_count',
       'cited_by_patent_count', 'simple_family_size', 'extended_family_size',
       'sequence_count', 'cpc_classifications', 'ipcr_classifications',
       'us_classifications', 'npl_citation_count',
       'npl_resolved_citation_count', 'npl_resolved_lens_id(s)',
       'npl_resolved_external_id(s)', 'npl_citations', 'legal_status'],
      dtype='object')

In [8]:
# Extract the variables of interest
data_target = data[['title', 'abstract','publication_date', 'cpc_classifications']]

In [9]:
data_target.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3192 entries, 0 to 3191
Data columns (total 4 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   title                3192 non-null   object
 1   abstract             3165 non-null   object
 2   publication_date     3192 non-null   object
 3   cpc_classifications  2472 non-null   object
dtypes: object(4)
memory usage: 99.9+ KB




---
**Observation**

---
Some patents have no abstract on record.




In [10]:
data_target[data_target['abstract'].isnull()]

Unnamed: 0,title,abstract,publication_date,cpc_classifications
15,A METHOD AND AN ARRANGEMENT FOR CONTROLLING OF...,,2021-04-21,B02C23/02;;B02C25/00;;B02C25/00;;B02C17/1805;;...
184,BLASTING SYSTEM AND METHOD,,2023-05-10,F42D1/00;;F42D3/04;;F42D3/04;;G06N3/08;;G06T17...
271,ORE DRESSING MULTI-PRODUCTION-INDEX OPTIMIZATI...,,2021-01-13,G05B13/04;;B03B13/00;;B07B13/18;;G05B17/02;;G0...
287,MINERAL FLOTATION SEPARATION,,2023-06-22,B03D1/02;;G06N20/00;;G06N5/01
317,"SYSTEM, METHOD AND APPARATUS FOR PLANNING DRIV...",,2024-01-03,G01C21/3407;;G01C21/3423;;G01C21/3492;;G08G1/0...
577,METHOD AND APPARATUS FOR THE CONTROL OF A FLOT...,,2021-02-24,B03D1/028;;G05B19/401;;B03D1/028;;B07C5/34;;G0...
715,SYSTEM AND METHOD FOR PLANT CONTROL BASED ON F...,,2020-10-14,G05B19/41865;;G05B2219/32252;;G05B23/0291;;G05...
804,DYNAMIC CASCADED CLUSTERING FOR DYNAMIC VNF,,2020-11-18,H04L45/00;;H04L45/76;;H04L47/783;;H04L45/00;;H...
947,CERMET OR CEMENTED CARBIDE MIXTURE AND THREE D...,,2022-02-23,C22C1/051;;C22C29/06;;B24D3/008;;B33Y10/00;;B3...
1051,AUTOMATICALLY SCANNING AND REPRESENTING AN ENV...,,2023-09-27,G05D1/0248;;E21D9/006;;E21B7/022;;E21D9/003;;G...


# General functions

In [11]:
stop_words = set(stopwords.words("english"))

def preprocesar_texto(texto : str):
  palabras = texto.split()
  palabras_filtradas = [palabra.lower() for palabra in palabras if palabra.lower() not in stop_words]
  texto_filtrado = ' '.join(palabras_filtradas)

  return texto_filtrado

# Checking for repeated records

In [12]:
def preprocess_text_basic(text : str):
	if not isinstance(text, float):
		return text.lower()
	return text

In [13]:
data_target['title'] = data['title'].apply(preprocess_text_basic)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  data_target['title'] = data['title'].apply(preprocess_text_basic)


In [14]:
cuenta_repetidos = data_target['title'].value_counts()
repetidos = cuenta_repetidos[cuenta_repetidos > 1].index.tolist()
print(repetidos)
print(f"{len(repetidos)} titles appear more than once")

['data visualization platform for event-based behavior clustering', 'systems and methods for radiant gas dynamic mining of permafrost for propellant extraction', 'system and method for adjusting leaching operations based on leach analytic data', 'mining electric locomotive power supply system', 'ore pulp ph value detection method and device and computer readable storage medium', 'system and method for determining a location of ore in a stockpile', 'leaching agent migration test system and leaching agent migration conversion test system', 'data model for mining', 'removal of amines from aqueous streams', 'underground mining methods via boreholes and multilateral blast-holes', 'embedded electromagnetic multi-vertical-ring high-gradient magnetic separator', 'mechanical automatic ore crushing device', 'evaluation and/or adaptation of industrial and/or technical process models', 'methods and systems for managing energy consumption of cryptocurrency mining', 'intelligent explosion-proof dies

In [15]:
data_target[data_target['title'] == 'data visualization platform for event-based behavior clustering']

Unnamed: 0,title,abstract,publication_date,cpc_classifications
1838,data visualization platform for event-based be...,A platform for processing event traces to gene...,2021-01-07,G06F16/358;;G06F16/26;;G06F16/2465;;G06F16/287...
1917,data visualization platform for event-based be...,A platform for processing event traces to gene...,2023-08-15,G06F16/358;;G06F16/26;;G06F16/2465;;G06F16/287...
2153,data visualization platform for event-based be...,A platform for processing event traces to gene...,2022-10-12,G16H50/70;;G06N20/00;;G16H40/20;;G06N3/126;;G0...


In [16]:
data_target[data_target['title'] == 'systems and methods for radiant gas dynamic mining of permafrost for propellant extraction']

Unnamed: 0,title,abstract,publication_date,cpc_classifications
1796,systems and methods for radiant gas dynamic mi...,Systems and methods are disclosed for mining l...,2023-08-15,E21B7/15;;E21B36/04;;E21C37/16;;E21C51/00;;E21...
1818,systems and methods for radiant gas dynamic mi...,Systems and methods are disclosed for mining l...,2022-03-17,E21B7/15;;E21B36/04;;E21C37/16;;E21C51/00;;E21...
2123,systems and methods for radiant gas dynamic mi...,Systems and methods are disclosed for mining l...,2020-02-13,E21B7/15;;E21B36/04;;E21C37/16;;E21C51/00;;E21...



---


**Observation**


---
There are repeated records: the same patents appear in different publication years.



# Defining the exclusion terms

In [17]:
excluyentes = ['data mining', 'blockchain mining', 'mining transaction', 'transaction',
               'transactional','pattern mining', 'information mining', 'word mining']

# Similarity using 'jaccard_similarity'

In [18]:
def jaccard_similarity(str1: str, str2: str):
    """
    Compute the Jaccard similarity between two phrases.

    Args:
        str1 (str): The first phrase.
        str2 (str): The second phrase.

    Returns:
        float: The Jaccard similarity between the two phrases.
    """
    # Lowercase the phrases and split them into words on whitespace
    set1 = set(str1.lower().split())
    set2 = set(str2.lower().split())

    # Compute the intersection and the union of the two word sets
    intersection = set1.intersection(set2)
    union = set1.union(set2)

    # Compute the Jaccard similarity (0.0 when both phrases are empty)
    if not union:
        return 0.0
    similarity = len(intersection) / len(union)
    return similarity

def jaccard_similarity_tfidf(str1, str2):
    """
    Compute the Jaccard similarity between two phrases using TfidfVectorizer.

    Args:
        str1 (str): The first phrase.
        str2 (str): The second phrase.

    Returns:
        float: The Jaccard similarity between the two phrases.
    """
    # Create the TF-IDF vectorizer
    vectorizer = TfidfVectorizer()

    # Transform the phrases into a TF-IDF matrix (one row per phrase)
    tfidf_matrix = vectorizer.fit_transform([str1, str2])

    # Convert the TF-IDF matrix to binary vectors (word present/absent)
    binary_matrix = (tfidf_matrix > 0).astype(int)

    # Extract the binary vectors
    dense = binary_matrix.toarray()
    vector1 = dense[0]
    vector2 = dense[1]

    # Compute the Jaccard similarity
    similarity = jaccard_score(vector1, vector2)
    return similarity
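
The two versions can return slightly different scores for the same pair because they tokenize differently: `jaccard_similarity` splits on whitespace and keeps single characters and attached punctuation, while `TfidfVectorizer` by default keeps only tokens of two or more word characters. The sketch below illustrates that effect with the standard library only. The regex is assumed to mirror the vectorizer's default `token_pattern`, and the helper names are hypothetical.

```python
import re

# Assumption: this regex mirrors TfidfVectorizer's default token_pattern
# (tokens of two or more word characters); the helper names are hypothetical.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def jaccard(set1, set2):
    """Jaccard similarity of two sets (0.0 if both are empty)."""
    union = set1 | set2
    return len(set1 & set2) / len(union) if union else 0.0

def tokens_split(text):
    # Whitespace tokenization, as in jaccard_similarity
    return set(text.lower().split())

def tokens_regex(text):
    # Regex tokenization approximating the vectorizer's default behavior
    return set(TOKEN_PATTERN.findall(text.lower()))

title = "a method for mining"
term = "data mining"

sim_split = jaccard(tokens_split(title), tokens_split(term))  # "a" counts as a token
sim_regex = jaccard(tokens_regex(title), tokens_regex(term))  # "a" is dropped
print(sim_split, sim_regex)
```

Here the one-letter token "a" enlarges the union for the whitespace version only, so the regex-based score comes out higher.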

## Experiment keeping the stopwords

In [19]:
lst_similitudes = []
for frase in data_target['title']:
  for termino in excluyentes:
    sim1 = jaccard_similarity(frase, termino)
    sim2 = jaccard_similarity_tfidf(frase, termino)
    lst_similitudes.append((frase, termino, sim1, sim2))

data_similitudes = pd.DataFrame(lst_similitudes, columns = ['frase', 'termino', 'jaccard_simple', 'jaccard_tifdVectorizer'])

In [20]:
data_similitudes[data_similitudes['jaccard_tifdVectorizer'] > 0]

Unnamed: 0,frase,termino,jaccard_simple,jaccard_tifdVectorizer
8,process for integrating the mining and process...,data mining,0.066667,0.068966
9,process for integrating the mining and process...,blockchain mining,0.032258,0.033333
10,process for integrating the mining and process...,mining transaction,0.032258,0.033333
13,process for integrating the mining and process...,pattern mining,0.032258,0.033333
14,process for integrating the mining and process...,information mining,0.032258,0.033333
...,...,...,...,...
25458,mining gas monitoring sensor,mining transaction,0.200000,0.200000
25461,mining gas monitoring sensor,pattern mining,0.200000,0.200000
25462,mining gas monitoring sensor,information mining,0.200000,0.200000
25463,mining gas monitoring sensor,word mining,0.200000,0.200000


In [21]:
data_similitudes[data_similitudes['jaccard_tifdVectorizer'] > 0.3].groupby("termino").size()

Unnamed: 0_level_0,0
termino,Unnamed: 1_level_1
blockchain mining,3
data mining,6
information mining,2
mining transaction,2
pattern mining,3
word mining,2


In [22]:
data_similitudes[data_similitudes['jaccard_tifdVectorizer'] >= 0.2].groupby("termino").size()

Unnamed: 0_level_0,0
termino,Unnamed: 1_level_1
blockchain mining,40
data mining,59
information mining,41
mining transaction,39
pattern mining,39
word mining,39


In [23]:
data_similitudes[data_similitudes['jaccard_tifdVectorizer'] >= 0.1].groupby("termino").size()

Unnamed: 0_level_0,0
termino,Unnamed: 1_level_1
blockchain mining,323
data mining,518
information mining,342
mining transaction,315
pattern mining,315
word mining,316


In [24]:
data_similitudes[(data_similitudes['jaccard_tifdVectorizer'] > 0.3) & (data_similitudes['termino'] == 'blockchain mining')]

Unnamed: 0,frase,termino,jaccard_simple,jaccard_tifdVectorizer
22761,mining robot system based on blockchain,blockchain mining,0.333333,0.333333
23073,mining robot,blockchain mining,0.333333,0.333333
24153,mining sensor,blockchain mining,0.333333,0.333333


## Experiment DISCARDING the stopwords

In [25]:
data_target['title'] = data_target['title'].apply(preprocesar_texto)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  data_target['title'] = data_target['title'].apply(preprocesar_texto)


In [26]:
lst_similitudes = []
for frase in data_target['title']:
  for termino in excluyentes:
    sim1 = jaccard_similarity(frase, termino)
    sim2 = jaccard_similarity_tfidf(frase, termino)
    lst_similitudes.append((frase, termino, sim1, sim2))
    """
    print(f" Similarity between '{termino}' and '{frase}'")
    print("\t{0:<45} : {1:.2f}".format("Simple Jaccard similarity", sim1))
    print("\t{0:<45} : {1:.2f}".format("Jaccard similarity using TfidfVectorizer", sim2))
    """
data_similitudes_sin_stopwords = pd.DataFrame(lst_similitudes, columns = ['frase', 'termino', 'jaccard_simple', 'jaccard_tifdVectorizer'])

In [27]:
data_similitudes_sin_stopwords[data_similitudes_sin_stopwords['jaccard_tifdVectorizer'] > 0]

Unnamed: 0,frase,termino,jaccard_simple,jaccard_tifdVectorizer
8,process integrating mining processing together...,data mining,0.083333,0.083333
9,process integrating mining processing together...,blockchain mining,0.040000,0.040000
10,process integrating mining processing together...,mining transaction,0.040000,0.040000
13,process integrating mining processing together...,pattern mining,0.040000,0.040000
14,process integrating mining processing together...,information mining,0.040000,0.040000
...,...,...,...,...
25458,mining gas monitoring sensor,mining transaction,0.200000,0.200000
25461,mining gas monitoring sensor,pattern mining,0.200000,0.200000
25462,mining gas monitoring sensor,information mining,0.200000,0.200000
25463,mining gas monitoring sensor,word mining,0.200000,0.200000


In [28]:
data_similitudes_sin_stopwords[data_similitudes_sin_stopwords['jaccard_tifdVectorizer'] >= 0.2].groupby("termino").size()

Unnamed: 0_level_0,0
termino,Unnamed: 1_level_1
blockchain mining,57
data mining,99
information mining,62
mining transaction,54
pattern mining,54
word mining,54


In [29]:
data_similitudes_sin_stopwords[(data_similitudes_sin_stopwords['jaccard_tifdVectorizer'] > 0.3) & (data_similitudes_sin_stopwords['termino'] == 'blockchain mining')]

Unnamed: 0,frase,termino,jaccard_simple,jaccard_tifdVectorizer
22761,mining robot system based blockchain,blockchain mining,0.4,0.4
23073,mining robot,blockchain mining,0.333333,0.333333
24153,mining sensor,blockchain mining,0.333333,0.333333


In [30]:
data_similitudes[(data_similitudes['jaccard_tifdVectorizer'] > 0.3) & (data_similitudes['termino'] == 'blockchain mining')]

Unnamed: 0,frase,termino,jaccard_simple,jaccard_tifdVectorizer
22761,mining robot system based on blockchain,blockchain mining,0.333333,0.333333
23073,mining robot,blockchain mining,0.333333,0.333333
24153,mining sensor,blockchain mining,0.333333,0.333333
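
The change from 0.333333 to 0.4 for the title at index 22761 can be checked by hand: removing the stopword "on" shrinks the union of the two word sets while the intersection stays the same. The sketch below reproduces that arithmetic. It assumes a tiny hand-picked stopword set in place of NLTK's English list, and the helper names are hypothetical.

```python
# Assumption: a tiny hand-picked stopword list stands in for NLTK's English list.
STOPWORDS = {"on", "for", "the", "and", "a", "of"}

def jaccard_words(phrase1, phrase2):
    """Jaccard similarity over whitespace-separated, lowercased words."""
    set1, set2 = set(phrase1.lower().split()), set(phrase2.lower().split())
    union = set1 | set2
    return len(set1 & set2) / len(union) if union else 0.0

def drop_stopwords(phrase):
    """Remove stopwords, mirroring what preprocesar_texto does."""
    return " ".join(w for w in phrase.lower().split() if w not in STOPWORDS)

title = "mining robot system based on blockchain"
term = "blockchain mining"

with_sw = jaccard_words(title, term)                     # 2 shared / 6 total
without_sw = jaccard_words(drop_stopwords(title), term)  # 2 shared / 5 total
print(round(with_sw, 6), round(without_sw, 6))
```

Titles without stopwords, such as "mining robot", keep the same score in both experiments.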


# Similarity techniques

In [19]:
def convierte_a_minusculas(text : str):
	if not isinstance(text, float):
		return text.lower()
	return text

In [20]:
def obtiene_descartados(filtrados, data_target : pd.DataFrame, feature = "title"):
  total_gral = Counter(data_target[feature])
  total_filtrados = Counter(filtrados)
  diferencia = total_gral - total_filtrados
  return list(diferencia.elements())
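
`obtiene_descartados` relies on `Counter` subtraction, which behaves as a multiset difference: counts are subtracted per title and only positive counts survive. This keeps the bookkeeping correct when the same title appears more than once. A minimal sketch with hypothetical toy titles:

```python
from collections import Counter

# Hypothetical toy data: one title appears twice in the full list.
all_titles = ["mining robot", "data mining model", "mining robot", "monitoring ore"]
kept_titles = ["mining robot", "mining robot", "monitoring ore"]

# Counter subtraction drops entries whose count falls to zero or below.
difference = Counter(all_titles) - Counter(kept_titles)
discarded = list(difference.elements())
print(discarded)
```

Because counts rather than set membership are compared, the number of kept titles plus the number of discarded titles always equals the total, which is the check the notebook performs later.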

## Filtering by substring matching

In [21]:
def contains_substring(abstract, terms):
    for term in terms:
        if term in abstract:
            return True
    return False

### Experiment including the stopwords

In [22]:
data_target = data[['title', 'abstract', 'publication_date']]
data_target['title'] = data['title'].apply(preprocess_text_basic)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  data_target['title'] = data['title'].apply(preprocess_text_basic)


In [23]:
# Count the titles that pass the filter
len([title for title in data_target['title'] if not contains_substring(title, excluyentes)]
)

3187

In [24]:
filtrados_usando_subcadenas = [title for title in data_target['title'] if not contains_substring(title, excluyentes)]
filtrados_usando_subcadenas

['method for operating a comminution circuit and respective comminution circuit',
 'process for integrating the mining and processing together with data collection in real time. dry pre-concentration via sensor based ore sorting (sbs), combined with dry comminution, in combination with a wet final concentration flowsheet',
 'system and method for continuous optimization of mineral processing operations',
 'method for water and moisture management for a mining operation',
 'method for training a quality prediction model for a processing device of a continuous industrial process, method for controlling of a continuous industrial process comprising a processing device, and a processing device',
 'computer-implemented methods referring to an industrial process for manufacturing a product and system for performing said methods',
 'method for optimizing mineral recovery process',
 'industrial robotic platforms',
 'monitoring ore',
 'real-time monitoring and estimation system for wear in rubb

In [25]:
descartados_subc_con_stopwords = obtiene_descartados(filtrados_usando_subcadenas, data_target)
len(descartados_subc_con_stopwords)

5

In [26]:
# Get the discarded records
data_descartados_con_subc = data_target[data_target['title'].isin(descartados_subc_con_stopwords)]
data_descartados_con_subc

Unnamed: 0,title,abstract,publication_date
145,file system integration into data mining model,Aspects of a storage device including a memory...,2023-11-09
1544,interactive sequential pattern mining,Interactive sequential pattern mining is discl...,2020-02-18
1821,"text keyword mining method and device thereof,...",The invention discloses a text keyword mining ...,2022-02-18
2020,keyword mining system and mining method,The invention discloses a keyword mining syste...,2021-03-26
2373,"text main body keyword mining method, system a...",The invention relates to the field of patent d...,2023-05-09


In [27]:
# Get the records that pass the filter
data_filtrados_con_subc = data_target[~data_target['title'].isin(descartados_subc_con_stopwords)]
data_filtrados_con_subc

Unnamed: 0,title,abstract,publication_date
0,method for operating a comminution circuit and...,A method for operating an ore comminution circ...,2020-09-03
1,process for integrating the mining and process...,The present invention belongs to the mining se...,2023-08-10
2,system and method for continuous optimization ...,A method for continuously optimizing an ore gr...,2023-12-21
3,method for water and moisture management for a...,Wet conditions pose a plurality of hazards for...,2024-03-14
4,method for training a quality prediction model...,"A method (1000, 1001) for training a quality p...",2022-09-22
...,...,...,...
3187,"systems, methods, and program products for fac...",Machine learning systems and methods are descr...,2023-06-13
3188,method and system for managing vehicle transpo...,CA 03091162 2020-08-13 (12) INTERNATIONAL APPL...,2021-11-16
3189,method for providing clock frequencies for com...,The present disclosure relates to a method for...,2023-04-20
3190,wall panels and other components for custom en...,Custom enclosures are made using various types...,2023-08-03


---
**Observation**

---

With the stopwords kept, substring matching discards 5 records and keeps 3187.
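
Three of the five discarded titles contain "keyword mining", which matches the exclusion term 'word mining' because the `in` operator compares raw substrings and ignores word boundaries. The sketch below contrasts that behavior with a stricter variant. `contains_whole_term` is a hypothetical helper, not part of the notebook, shown only to illustrate the difference.

```python
import re

def contains_substring(text, terms):
    # Same logic as the notebook's helper: plain substring test
    return any(term in text for term in terms)

def contains_whole_term(text, terms):
    # Hypothetical stricter variant: the term must start and end on word boundaries
    return any(re.search(rf"\b{re.escape(term)}\b", text) for term in terms)

title = "keyword mining system and mining method"
terms = ["word mining"]

print(contains_substring(title, terms))   # substring test
print(contains_whole_term(title, terms))  # word-boundary test
```

Whether matching "keyword mining" is desirable depends on the goal. For this dataset it happens to remove text-mining patents, so the looser test may be the intended one.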

### Excluding the stopwords

In [28]:
data_target = data[['title', 'abstract', 'publication_date']]
data_target['title'] = data_target['title'].apply(preprocesar_texto)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  data_target['title'] = data_target['title'].apply(preprocesar_texto)


## Filtering by fuzzy matching
This method relies on a similarity threshold; thresholds from 10% to 90% were tested.

By default, the function uses a threshold of 80%.

Considerations:
+ Low values (e.g., 50%): may increase the number of matches, but may also include unwanted ones.
+ High values (e.g., 90-100%): restrict matches to near-exact ones, reducing false positives.
+ The right threshold depends on how much tolerance you have for spelling differences, typos, or variations in the terms being compared.

In [29]:
def remove_numbers(text):
    return re.sub(r'\d+', '', text)

def contains_fuzzy_match(text, terms, threshold=80):
    # Clean the text
    text_sin_numeros = remove_numbers(text)
    words = text_sin_numeros.split()
    for word in words:
        for term in terms:
            if fuzz.partial_ratio(word, term) >= threshold:
                return True
    return False
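
Note that `contains_fuzzy_match` compares each single word of the title against each whole exclusion term. `fuzz.partial_ratio` scores the best alignment of the shorter string inside the longer one, so a word that appears verbatim inside a term (for example "mining" inside "data mining") reaches the maximum score regardless of the threshold. The sketch below illustrates the idea with `difflib` from the standard library. It is a simplified stand-in under that assumption, not fuzzywuzzy's actual algorithm, and `partial_ratio_sketch` is a hypothetical name.

```python
from difflib import SequenceMatcher

def partial_ratio_sketch(s1, s2):
    """Best similarity (0-100) of the shorter string against every
    same-length window of the longer one. Simplified stand-in, not
    fuzzywuzzy's actual algorithm."""
    shorter, longer = sorted((s1, s2), key=len)
    if not shorter:
        return 0
    m = len(shorter)
    best = max(
        SequenceMatcher(None, shorter, longer[i:i + m]).ratio()
        for i in range(len(longer) - m + 1)
    )
    return int(round(100 * best))

for word in ["mining", "ore"]:
    print(word, partial_ratio_sketch(word, "data mining"))
```

Under this sketch, short stopwords such as "a" or "in" also score 100 against terms that contain them, which likely contributes to the larger number of discarded titles when stopwords are kept.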

In [115]:
lst_resultados = []

### Experiment including the stopwords

In [117]:
# Re-select the data to start from a clean state
data_target = data[['title', 'abstract', 'publication_date']]
data_target['title'] = data_target['title'].apply(convierte_a_minusculas)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  data_target['title'] = data_target['title'].apply(convierte_a_minusculas)


In [54]:
# Try several thresholds
for limite in range(1,10):
  print(f"With threshold {limite*10}, {len([title for title in data_target['title'] if not contains_fuzzy_match(title, excluyentes, threshold=limite*10)])} records remain")


With threshold 10, 0 records remain
With threshold 20, 0 records remain
With threshold 30, 0 records remain
With threshold 40, 1 records remain
With threshold 50, 14 records remain
With threshold 60, 95 records remain
With threshold 70, 675 records remain
With threshold 80, 981 records remain
With threshold 90, 1022 records remain


In [118]:
filtro_fuzzy = [title for title in data_target['title'] if not contains_fuzzy_match(title, excluyentes, threshold=90)]
filtro_fuzzy

['industrial robotic platforms',
 'monitoring ore',
 'mineral recovery control',
 'low cost, size, weight, and power (cswap) geolocation capability utilizing signal characteristics passed through to backhaul network',
 'user generated tag collection system and method',
 'optimal energy storage utilization',
 'smart city disorderly-stacked material management system and method',
 'underground mine energy management system',
 'ore component analysis device and method',
 'mineral recovery control',
 'fluidized-bed flotation unit, mineral processing apparatus, and fluidized-bed flotation method',
 'everything interconnection training platform and control method and device',
 'safety interlock recommendation system',
 'smart city smart drone uass/uav/vtol smart mailbox landing pad',
 'smart city smart drone uass/uav/vtol smart mailbox landing pad',
 'augmented photo capture',
 'adaptive discovery and mixed-variable optimization of next generation synthesizable microelectronic materials',
 '

In [119]:
descartados_fuzzy_con_stopwords = obtiene_descartados(filtro_fuzzy, data_target)

In [120]:
print(len(filtro_fuzzy))
print(len(descartados_fuzzy_con_stopwords))

1022
2170


In [121]:
len(filtro_fuzzy) + len(descartados_fuzzy_con_stopwords) == data_target['title'].shape[0]

True

---
**Observation**

---

With a 90% threshold and the stopwords kept, fuzzy matching discards 2170 records and keeps 1022.

In [122]:
# Get the records that pass the filter
data_filtrados_fuzzy_sw = data_target[~data_target['title'].isin(descartados_fuzzy_con_stopwords)]
data_filtrados_fuzzy_sw

Unnamed: 0,title,abstract,publication_date
7,industrial robotic platforms,Industrial robotic platforms are described. Th...,2024-06-11
8,monitoring ore,Systems and methods for estimating the magnitu...,2020-03-12
10,mineral recovery control,A mineral recovery system for use in a mining ...,2024-01-16
11,"low cost, size, weight, and power (cswap) geol...",Aspects of the disclosure relate to a method f...,2023-12-21
14,user generated tag collection system and method,"In some embodiments, an input document is rece...",2021-09-09
...,...,...,...
3174,temperature-pressure controllable paste slurry...,The invention provides a temperature-pressure ...,2022-09-09
3175,bcma monoclonal antibody and the antibody-drug...,Provided is an antibody and an antibody-drug c...,2023-05-11
3184,apparatus and method of communication,An apparatus and a method of communication are...,2022-01-13
3185,natural gas hydrate automatic microwave stirri...,The invention relates to a natural gas hydrate...,2020-10-23


In [123]:
# Get the discarded records
data_descartados_fuzzy_sw = data_target[data_target['title'].isin(descartados_fuzzy_con_stopwords)]
data_descartados_fuzzy_sw

Unnamed: 0,title,abstract,publication_date
0,method for operating a comminution circuit and...,A method for operating an ore comminution circ...,2020-09-03
1,process for integrating the mining and process...,The present invention belongs to the mining se...,2023-08-10
2,system and method for continuous optimization ...,A method for continuously optimizing an ore gr...,2023-12-21
3,method for water and moisture management for a...,Wet conditions pose a plurality of hazards for...,2024-03-14
4,method for training a quality prediction model...,"A method (1000, 1001) for training a quality p...",2022-09-22
...,...,...,...
3186,adaptive control of longitudinal roll in syste...,FIELD: mining.SUBSTANCE: invention relates to ...,2021-09-08
3187,"systems, methods, and program products for fac...",Machine learning systems and methods are descr...,2023-06-13
3188,method and system for managing vehicle transpo...,CA 03091162 2020-08-13 (12) INTERNATIONAL APPL...,2021-11-16
3189,method for providing clock frequencies for com...,The present disclosure relates to a method for...,2023-04-20


### Experiment DISCARDING the stopwords

In [125]:
# Re-select the data to start from a clean state
data_target = data[['title', 'abstract', 'publication_date']]
data_target['title'] = data_target['title'].apply(convierte_a_minusculas)
data_target['title'] = data_target['title'].apply(preprocesar_texto)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  data_target['title'] = data_target['title'].apply(convierte_a_minusculas)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  data_target['title'] = data_target['title'].apply(preprocesar_texto)


In [None]:
# Try several thresholds
for limite in range(1,10):
  print(f"With threshold {limite*10}, {len([title for title in data_target['title'] if not contains_fuzzy_match(title, excluyentes, threshold=limite*10)])} records remain")


In [126]:
filtro_fuzzy_sin_stopwords = [title for title in data_target['title'] if not contains_fuzzy_match(title, excluyentes, threshold=90)]
filtro_fuzzy_sin_stopwords

['method operating comminution circuit respective comminution circuit',
 'system method continuous optimization mineral processing operations',
 'method training quality prediction model processing device continuous industrial process, method controlling continuous industrial process comprising processing device, processing device',
 'computer-implemented methods referring industrial process manufacturing product system performing said methods',
 'method optimizing mineral recovery process',
 'industrial robotic platforms',
 'monitoring ore',
 'real-time monitoring estimation system wear rubber liners lifter bars utilized ore grinding mills',
 'mineral recovery control',
 'low cost, size, weight, power (cswap) geolocation capability utilizing signal characteristics passed backhaul network',
 'system carrying cross-protocol communication iot (internet things) devices',
 'user generated tag collection system method',
 'method arrangement controlling comminution process grinding circuit',

In [127]:
descartados_fuzzy_sin_stopwords = obtiene_descartados(filtro_fuzzy_sin_stopwords, data_target)

In [128]:
len(descartados_fuzzy_sin_stopwords)

1203

In [129]:
len(filtro_fuzzy_sin_stopwords)

1989

In [130]:
len(filtro_fuzzy_sin_stopwords) + len(descartados_fuzzy_sin_stopwords) == data_target.shape[0]

True

In [131]:
# Get the discarded records
data_descartados_fuzzy_sin_sw = data_target[data_target['title'].isin(descartados_fuzzy_sin_stopwords)]
data_descartados_fuzzy_sin_sw

Unnamed: 0,title,abstract,publication_date
1,process integrating mining processing together...,The present invention belongs to the mining se...,2023-08-10
3,method water moisture management mining operation,Wet conditions pose a plurality of hazards for...,2024-03-14
13,big data quality inspection method customer se...,The invention discloses a big data quality ins...,2022-01-14
18,model updating method device automatically min...,The invention provides a model updating method...,2022-07-05
21,ai-based platform automated labor law complian...,An AI-based platform for automated labor law c...,2023-08-10
...,...,...,...
3177,horizontal well sand-water co-production contr...,The invention discloses a horizontal well sand...,2020-07-14
3179,stable curing treatment equipment industrial m...,The utility model relates to stable solidifica...,2020-11-17
3180,mining anti-explosion all-terrain intelligent ...,The utility model relates to a mining explosio...,2023-06-13
3182,mining gas monitoring sensor,The utility model discloses a mining gas monit...,2023-02-14


In [132]:
# Get the records that pass the filter
data_filtrados_fuzzy_sin_sw = data_target[~data_target['title'].isin(descartados_fuzzy_sin_stopwords)]
data_filtrados_fuzzy_sin_sw

Unnamed: 0,title,abstract,publication_date
0,method operating comminution circuit respectiv...,A method for operating an ore comminution circ...,2020-09-03
2,system method continuous optimization mineral ...,A method for continuously optimizing an ore gr...,2023-12-21
4,method training quality prediction model proce...,"A method (1000, 1001) for training a quality p...",2022-09-22
5,computer-implemented methods referring industr...,A computer-implemented method is provided. The...,2024-05-23
6,method optimizing mineral recovery process,Disclosed is a method for optimizing a mineral...,2023-08-24
...,...,...,...
3186,adaptive control longitudinal roll system deve...,FIELD: mining.SUBSTANCE: invention relates to ...,2021-09-08
3187,"systems, methods, program products facilitatin...",Machine learning systems and methods are descr...,2023-06-13
3188,method system managing vehicle transportation ...,CA 03091162 2020-08-13 (12) INTERNATIONAL APPL...,2021-11-16
3190,wall panels components custom enclosures shipp...,Custom enclosures are made using various types...,2023-08-03


---
**Observation**

---

With a 90% threshold and stopwords removed, the fuzzy technique discards 1203 records and keeps 1989.
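
To illustrate what a 90% threshold means, the sketch below scores a title against each exclusion term and flags the title when any score reaches the threshold. It is not the notebook's own filter: it uses `difflib.SequenceMatcher` from the standard library as a stand-in for fuzzywuzzy, and the names `similarity_pct` and `is_fuzzy_match`, as well as the sample strings, are assumptions.

```python
from difflib import SequenceMatcher


def similarity_pct(a, b):
    """Similarity between two strings on a 0-100 scale."""
    return 100.0 * SequenceMatcher(None, a, b).ratio()


def is_fuzzy_match(title, terms, threshold=90):
    """Return True if the title is at least `threshold`% similar to any term."""
    return any(similarity_pct(title, term) >= threshold for term in terms)


# Hypothetical inputs, only to show the effect of the threshold
print(is_fuzzy_match("data minning", ["data mining"]))
print(is_fuzzy_match("mining gas monitoring sensor", ["data mining"]))
```

Lowering the threshold makes the filter more permissive, so more titles count as matches and are discarded.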

In [133]:
lst_resultados.append(['fuzzy - umbral 90%',
                      data_descartados_fuzzy_sw.shape[0],
                      data_filtrados_fuzzy_sw.shape[0],
                      data_descartados_fuzzy_sin_sw.shape[0],
                      data_filtrados_fuzzy_sin_sw.shape[0]])

## Filter using stemming-based matching

In [72]:
def contains_stem_match(abstract, terms):
    stemmer = PorterStemmer()
    stemmed_terms = [stemmer.stem(term) for term in terms]
    words = abstract.split()
    for word in words:
        if stemmer.stem(word) in stemmed_terms:
            return True
    return False

### Experiment keeping stopwords

In [134]:
# Re-select the data to start from a clean state
data_target = data[['title', 'abstract', 'publication_date']].copy()
data_target['title'] = data_target['title'].apply(convierte_a_minusculas)


In [135]:
len([title for title in data_target['title'] if not contains_stem_match(title, excluyentes)])

3192

In [136]:
filtrados = [title for title in data_target['title'] if not contains_stem_match(title, excluyentes)]

In [137]:
descartados_con_stemm = obtiene_descartados(filtrados, data_target)

In [138]:
# Get the records that pass the filter
data_filtrados_stemm_sw = data_target[~data_target['title'].isin(descartados_con_stemm)]
data_filtrados_stemm_sw

Unnamed: 0,title,abstract,publication_date
0,method for operating a comminution circuit and...,A method for operating an ore comminution circ...,2020-09-03
1,process for integrating the mining and process...,The present invention belongs to the mining se...,2023-08-10
2,system and method for continuous optimization ...,A method for continuously optimizing an ore gr...,2023-12-21
3,method for water and moisture management for a...,Wet conditions pose a plurality of hazards for...,2024-03-14
4,method for training a quality prediction model...,"A method (1000, 1001) for training a quality p...",2022-09-22
...,...,...,...
3187,"systems, methods, and program products for fac...",Machine learning systems and methods are descr...,2023-06-13
3188,method and system for managing vehicle transpo...,CA 03091162 2020-08-13 (12) INTERNATIONAL APPL...,2021-11-16
3189,method for providing clock frequencies for com...,The present disclosure relates to a method for...,2023-04-20
3190,wall panels and other components for custom en...,Custom enclosures are made using various types...,2023-08-03


In [139]:
# Get the discarded records
data_descartados_stemm_sw = data_target[data_target['title'].isin(descartados_con_stemm)]
data_descartados_stemm_sw

Unnamed: 0,title,abstract,publication_date


---
**Observation**

---
Applying this technique discards no records.
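
One possible explanation, assuming `excluyentes` contains multi-word terms: `contains_stem_match` stems each term as a single string and compares it with the stem of one word at a time, so a phrase can never match. The sketch below shows a phrase-aware variant. `contains_stem_phrase_match` and `crude_stem` are hypothetical names, and `crude_stem` is only a toy stand-in so that the block runs without NLTK; in the notebook one would pass `PorterStemmer().stem` instead.

```python
def crude_stem(word):
    """Toy stand-in for a real stemmer: strips a few common English suffixes."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def contains_stem_phrase_match(text, terms, stem=crude_stem):
    """Return True if the stemmed words of any term occur consecutively in text."""
    words = [stem(w) for w in text.split()]
    for term in terms:
        target = [stem(w) for w in term.split()]
        n = len(target)
        if n == 0:
            continue
        # Slide a window of n stemmed words over the text
        for i in range(len(words) - n + 1):
            if words[i:i + n] == target:
                return True
    return False


# Hypothetical inputs
print(contains_stem_phrase_match("keyword mining system", ["keyword mining"]))
print(contains_stem_phrase_match("mining gas monitoring sensor", ["keyword mining"]))
```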




### Experiment removing stopwords

In [140]:
# Re-select the data to start from a clean state
data_target = data[['title', 'abstract', 'publication_date']].copy()
data_target['title'] = data_target['title'].apply(convierte_a_minusculas)
data_target['title'] = data_target['title'].apply(preprocesar_texto)


In [141]:
filtrados = [title for title in data_target['title'] if not contains_stem_match(title, excluyentes)]

In [142]:
descartados_con_stemm_sin_sw = obtiene_descartados(filtrados, data_target)

In [143]:
# Get the records that pass the filter
data_filtrados_stemm_sin_sw = data_target[~data_target['title'].isin(descartados_con_stemm_sin_sw)]
data_filtrados_stemm_sin_sw

Unnamed: 0,title,abstract,publication_date
0,method operating comminution circuit respectiv...,A method for operating an ore comminution circ...,2020-09-03
1,process integrating mining processing together...,The present invention belongs to the mining se...,2023-08-10
2,system method continuous optimization mineral ...,A method for continuously optimizing an ore gr...,2023-12-21
3,method water moisture management mining operation,Wet conditions pose a plurality of hazards for...,2024-03-14
4,method training quality prediction model proce...,"A method (1000, 1001) for training a quality p...",2022-09-22
...,...,...,...
3187,"systems, methods, program products facilitatin...",Machine learning systems and methods are descr...,2023-06-13
3188,method system managing vehicle transportation ...,CA 03091162 2020-08-13 (12) INTERNATIONAL APPL...,2021-11-16
3189,method providing clock frequencies computing c...,The present disclosure relates to a method for...,2023-04-20
3190,wall panels components custom enclosures shipp...,Custom enclosures are made using various types...,2023-08-03


---
**Observation**

---
Applying this technique discards no records.

In [144]:
lst_resultados.append(['stemm',
                      data_descartados_stemm_sw.shape[0],
                      data_filtrados_stemm_sw.shape[0],
                      0,
                      data_filtrados_stemm_sin_sw.shape[0]])

## Filter using regular-expression matching

In [88]:
def contains_regex_match(abstract, terms):
    # Each term is compiled as a regular expression and searched anywhere in the text
    for term in terms:
        pattern = re.compile(term)
        if pattern.search(abstract):
            return True
    return False
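
Because each term is compiled as-is, any regex metacharacter in a term is interpreted as pattern syntax, and a term can also match in the middle of a longer word. If the terms are meant as literal phrases (an assumption), a variant along the lines of the following sketch escapes them and anchors them at word boundaries. `contains_literal_match` is a hypothetical name.

```python
import re


def contains_literal_match(text, terms):
    """Return True if any term occurs in text as a whole word or phrase."""
    for term in terms:
        # re.escape neutralises metacharacters; \b anchors at word boundaries
        # (this assumes each term starts and ends with a word character)
        pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
        if pattern.search(text):
            return True
    return False


# Hypothetical inputs
print(contains_literal_match("interactive sequential pattern mining", ["pattern mining"]))
print(contains_literal_match("undermining trust", ["mining"]))
```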

### Experiment keeping stopwords

In [145]:
# Re-select the data to start from a clean state
data_target = data[['title', 'abstract', 'publication_date']].copy()
data_target['title'] = data_target['title'].apply(convierte_a_minusculas)


In [146]:
filtrados = [title for title in data_target['title'] if not contains_regex_match(title, excluyentes)]

In [147]:
descartados_con_regex_con_sw = obtiene_descartados(filtrados, data_target)

In [148]:
# Get the records that pass the filter
data_filtrados_regex_sw = data_target[~data_target['title'].isin(descartados_con_regex_con_sw)]
data_filtrados_regex_sw

Unnamed: 0,title,abstract,publication_date
0,method for operating a comminution circuit and...,A method for operating an ore comminution circ...,2020-09-03
1,process for integrating the mining and process...,The present invention belongs to the mining se...,2023-08-10
2,system and method for continuous optimization ...,A method for continuously optimizing an ore gr...,2023-12-21
3,method for water and moisture management for a...,Wet conditions pose a plurality of hazards for...,2024-03-14
4,method for training a quality prediction model...,"A method (1000, 1001) for training a quality p...",2022-09-22
...,...,...,...
3187,"systems, methods, and program products for fac...",Machine learning systems and methods are descr...,2023-06-13
3188,method and system for managing vehicle transpo...,CA 03091162 2020-08-13 (12) INTERNATIONAL APPL...,2021-11-16
3189,method for providing clock frequencies for com...,The present disclosure relates to a method for...,2023-04-20
3190,wall panels and other components for custom en...,Custom enclosures are made using various types...,2023-08-03


In [149]:
# Get the discarded records
data_descartados_regex_sw = data_target[data_target['title'].isin(descartados_con_regex_con_sw)]
data_descartados_regex_sw

Unnamed: 0,title,abstract,publication_date
145,file system integration into data mining model,Aspects of a storage device including a memory...,2023-11-09
1544,interactive sequential pattern mining,Interactive sequential pattern mining is discl...,2020-02-18
1821,"text keyword mining method and device thereof,...",The invention discloses a text keyword mining ...,2022-02-18
2020,keyword mining system and mining method,The invention discloses a keyword mining syste...,2021-03-26
2373,"text main body keyword mining method, system a...",The invention relates to the field of patent d...,2023-05-09


---
**Observation**

---

Using regular expressions and keeping stopwords, the technique discards 5 records and keeps 3187.

### Experiment removing stopwords

In [150]:
# Re-select the data to start from a clean state
data_target = data[['title', 'abstract', 'publication_date']].copy()
data_target['title'] = data_target['title'].apply(convierte_a_minusculas)
data_target['title'] = data_target['title'].apply(preprocesar_texto)


In [151]:
filtrados = [title for title in data_target['title'] if not contains_regex_match(title, excluyentes)]

In [152]:
descartados_con_regex_sin_sw = obtiene_descartados(filtrados, data_target)

In [153]:
# Get the records that pass the filter
data_filtrados_regex_sin_sw = data_target[~data_target['title'].isin(descartados_con_regex_sin_sw)]
data_filtrados_regex_sin_sw

Unnamed: 0,title,abstract,publication_date
0,method operating comminution circuit respectiv...,A method for operating an ore comminution circ...,2020-09-03
1,process integrating mining processing together...,The present invention belongs to the mining se...,2023-08-10
2,system method continuous optimization mineral ...,A method for continuously optimizing an ore gr...,2023-12-21
3,method water moisture management mining operation,Wet conditions pose a plurality of hazards for...,2024-03-14
4,method training quality prediction model proce...,"A method (1000, 1001) for training a quality p...",2022-09-22
...,...,...,...
3187,"systems, methods, program products facilitatin...",Machine learning systems and methods are descr...,2023-06-13
3188,method system managing vehicle transportation ...,CA 03091162 2020-08-13 (12) INTERNATIONAL APPL...,2021-11-16
3189,method providing clock frequencies computing c...,The present disclosure relates to a method for...,2023-04-20
3190,wall panels components custom enclosures shipp...,Custom enclosures are made using various types...,2023-08-03


In [154]:
# Get the discarded records
data_descartados_regex_sin_sw = data_target[data_target['title'].isin(descartados_con_regex_sin_sw)]
data_descartados_regex_sin_sw

Unnamed: 0,title,abstract,publication_date
145,file system integration data mining model,Aspects of a storage device including a memory...,2023-11-09
544,arrangement delivery production control inform...,A method for providing a mining work machine a...,2020-01-07
937,externally validated proof work appending bloc...,A method for establishing an externally verifi...,2024-03-21
1544,interactive sequential pattern mining,Interactive sequential pattern mining is discl...,2020-02-18
1821,"text keyword mining method device thereof, sto...",The invention discloses a text keyword mining ...,2022-02-18
2020,keyword mining system mining method,The invention discloses a keyword mining syste...,2021-03-26
2373,"text main body keyword mining method, system d...",The invention relates to the field of patent d...,2023-05-09


---
**Observation**

---

Using regular expressions and removing stopwords, the technique discards 7 records and keeps 3185.
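
The records that are discarded only after stopword removal can be listed by comparing the row indices of the two discarded-record tables. A minimal sketch with plain Python sets follows; the variable names are illustrative. With the DataFrames at hand, the same comparison could be made on `data_descartados_regex_sin_sw.index` and `data_descartados_regex_sw.index`.

```python
# Row indices copied from the two discarded-record tables above
discarded_with_sw = {145, 1544, 1821, 2020, 2373}
discarded_without_sw = {145, 544, 937, 1544, 1821, 2020, 2373}

# Records matched only once stopwords are removed
only_without_sw = sorted(discarded_without_sw - discarded_with_sw)
print(only_without_sw)  # [544, 937]

# Every record discarded with stopwords is also discarded without them
print(discarded_with_sw <= discarded_without_sw)  # True
```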

In [155]:
lst_resultados.append(['regex',
                      data_descartados_regex_sw.shape[0],
                      data_filtrados_regex_sw.shape[0],
                      data_descartados_regex_sin_sw.shape[0],
                      data_filtrados_regex_sin_sw.shape[0]])

## Filter using prefix and suffix matching

In [100]:
def contains_prefix_suffix(abstract, terms):
    words = abstract.split()
    for word in words:
        for term in terms:
            if word.startswith(term) or word.endswith(term):
                return True
    return False
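
Note that the comparison is made word by word, so only single-word terms can match: a term containing a space can never be the start or end of one whitespace-delimited word. The sketch below repeats the function so that it runs on its own and illustrates this with made-up inputs (the title and terms are assumptions, not the notebook's `excluyentes`).

```python
def contains_prefix_suffix(abstract, terms):
    words = abstract.split()
    for word in words:
        for term in terms:
            if word.startswith(term) or word.endswith(term):
                return True
    return False


title = "keyword mining system mining method"
print(contains_prefix_suffix(title, ["keyword mining"]))  # False: multi-word term
print(contains_prefix_suffix(title, ["min"]))             # True: prefix of "mining"
print(contains_prefix_suffix(title, ["word"]))            # True: suffix of "keyword"
```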

### Experiment keeping stopwords

In [156]:
# Re-select the data to start from a clean state
data_target = data[['title', 'abstract', 'publication_date']].copy()
data_target['title'] = data_target['title'].apply(convierte_a_minusculas)


In [157]:
filtrados = [title for title in data_target['title'] if not contains_prefix_suffix(title, excluyentes)]

In [158]:
descartados_con_prefix_con_sw = obtiene_descartados(filtrados, data_target)

In [159]:
# Get the records that pass the filter
data_filtrados_prefix_con_sw = data_target[~data_target['title'].isin(descartados_con_prefix_con_sw)]
data_filtrados_prefix_con_sw

Unnamed: 0,title,abstract,publication_date
0,method for operating a comminution circuit and...,A method for operating an ore comminution circ...,2020-09-03
1,process for integrating the mining and process...,The present invention belongs to the mining se...,2023-08-10
2,system and method for continuous optimization ...,A method for continuously optimizing an ore gr...,2023-12-21
3,method for water and moisture management for a...,Wet conditions pose a plurality of hazards for...,2024-03-14
4,method for training a quality prediction model...,"A method (1000, 1001) for training a quality p...",2022-09-22
...,...,...,...
3187,"systems, methods, and program products for fac...",Machine learning systems and methods are descr...,2023-06-13
3188,method and system for managing vehicle transpo...,CA 03091162 2020-08-13 (12) INTERNATIONAL APPL...,2021-11-16
3189,method for providing clock frequencies for com...,The present disclosure relates to a method for...,2023-04-20
3190,wall panels and other components for custom en...,Custom enclosures are made using various types...,2023-08-03


In [160]:
# Get the discarded records
data_descartados_prefix_con_sw = data_target[data_target['title'].isin(descartados_con_prefix_con_sw)]
data_descartados_prefix_con_sw

Unnamed: 0,title,abstract,publication_date


### Experiment removing stopwords

In [161]:
# Re-select the data to start from a clean state
data_target = data[['title', 'abstract', 'publication_date']].copy()
data_target['title'] = data_target['title'].apply(convierte_a_minusculas)
data_target['title'] = data_target['title'].apply(preprocesar_texto)


In [162]:
filtrados = [title for title in data_target['title'] if not contains_prefix_suffix(title, excluyentes)]

In [163]:
descartados_con_prefix_sin_sw = obtiene_descartados(filtrados, data_target)

In [164]:
# Get the records that pass the filter
data_filtrados_prefix_sin_sw = data_target[~data_target['title'].isin(descartados_con_prefix_sin_sw)]
data_filtrados_prefix_sin_sw

Unnamed: 0,title,abstract,publication_date
0,method operating comminution circuit respectiv...,A method for operating an ore comminution circ...,2020-09-03
1,process integrating mining processing together...,The present invention belongs to the mining se...,2023-08-10
2,system method continuous optimization mineral ...,A method for continuously optimizing an ore gr...,2023-12-21
3,method water moisture management mining operation,Wet conditions pose a plurality of hazards for...,2024-03-14
4,method training quality prediction model proce...,"A method (1000, 1001) for training a quality p...",2022-09-22
...,...,...,...
3187,"systems, methods, program products facilitatin...",Machine learning systems and methods are descr...,2023-06-13
3188,method system managing vehicle transportation ...,CA 03091162 2020-08-13 (12) INTERNATIONAL APPL...,2021-11-16
3189,method providing clock frequencies computing c...,The present disclosure relates to a method for...,2023-04-20
3190,wall panels components custom enclosures shipp...,Custom enclosures are made using various types...,2023-08-03


In [165]:
# Get the discarded records
data_descartados_prefix_sin_sw = data_target[data_target['title'].isin(descartados_con_prefix_sin_sw)]
data_descartados_prefix_sin_sw

Unnamed: 0,title,abstract,publication_date


In [166]:
lst_resultados.append(['prefix-sufix',
                      data_descartados_prefix_con_sw.shape[0],
                      data_filtrados_prefix_con_sw.shape[0],
                      data_descartados_prefix_sin_sw.shape[0],
                      data_filtrados_prefix_sin_sw.shape[0]])


# Summary

In [168]:
df_resultados = pd.DataFrame(lst_resultados, columns=['tecnica', 'descartados_con_SW', 'filtrados_con_SW',
                                      'descartados_sin_SW', 'filtrados_sin_SW'])
df_resultados

Unnamed: 0,tecnica,descartados_con_SW,filtrados_con_SW,descartados_sin_SW,filtrados_sin_SW
0,fuzzy - umbral 90%,2170,1022,1203,1989
1,stemm,0,3192,0,3192
2,regex,5,3187,7,3185
3,prefix-sufix,0,3192,0,3192


In [172]:
# 1. Apply borders to all cells
df_styled = df_resultados.style.set_table_styles(
    [{'selector': 'td', 'props': [('border', '2px solid black')]}]
)
# 2. Apply a different style to the first column ('tecnica');
#    the dictionary keys must be existing column labels
df_styled = df_styled.set_table_styles(
    {
        'tecnica': [{'selector': 'td', 'props': [('background-color', 'yellow'), ('color', 'red'), ('font-weight', 'bold')]}]
    },
    overwrite=False  # Keep the previously set styles
)
df_styled

Unnamed: 0,tecnica,descartados_con_SW,filtrados_con_SW,descartados_sin_SW,filtrados_sin_SW
0,fuzzy - umbral 90%,2170,1022,1203,1989
1,stemm,0,3192,0,3192
2,regex,5,3187,7,3185
3,prefix-sufix,0,3192,0,3192
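
To compare the techniques on a common scale, the counts can be expressed as a share of the 3192 records. A small sketch using the figures from the table follows; `resultados` and `share_discarded` are illustrative names, not part of the notebook.

```python
TOTAL = 3192  # number of records in data_target

# (technique, discarded with stopwords, discarded without stopwords)
resultados = [
    ("fuzzy - umbral 90%", 2170, 1203),
    ("stemm", 0, 0),
    ("regex", 5, 7),
    ("prefix-sufix", 0, 0),
]


def share_discarded(count, total=TOTAL):
    """Percentage of records discarded."""
    return 100.0 * count / total


for tecnica, con_sw, sin_sw in resultados:
    print(f"{tecnica:<20} {share_discarded(con_sw):6.2f}% {share_discarded(sin_sw):6.2f}%")
```

The fuzzy filter removes roughly two thirds of the records when stopwords are kept, while the regex filter removes well under 1% in both settings.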




---
# Code NOT CONSIDERED




In [95]:
def preprocess_text_v(text):
    # Define stopwords and punctuation
    stop_words = set(stopwords.words('english'))
    punctuation = set(string.punctuation)
    # Handle NaN values, which pandas stores as float
    if isinstance(text, float):
        return []
    # Tokenize the lowercased text
    tokens = word_tokenize(text.lower())
    # Filter out stopwords and punctuation tokens
    tokens = [word for word in tokens if word not in stop_words and word not in punctuation]
    return tokens

In [11]:
# Define stopwords and punctuation
stop_words = set(stopwords.words('english'))
punctuation = set(string.punctuation)

# Apply the preprocessing to the abstract column
lst_tokens = data_target['abstract'].apply(preprocess_text_v)
# assign returns a new DataFrame, which avoids writing to a slice of `data`
data_target = data_target.assign(tokens=lst_tokens)


In [12]:
data_target.tokens.head()

Unnamed: 0,tokens
0,"[method, operating, ore, comminution, circuit,..."
1,"[present, invention, belongs, mining, sector, ..."
2,"[method, continuously, optimizing, ore, grindi..."
3,"[wet, conditions, pose, plurality, hazards, mi..."
4,"[method, 1000, 1001, training, quality, predic..."
