<a href="https://colab.research.google.com/github/griisnc/NLP_Python/blob/main/SpanishCorpusProcessor.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# This code was made by Griselda Navarrete

# Description

The code uses Python libraries like NLTK, pandas, and scikit-learn to perform text analysis on a corpus of text files in Spanish.

# Purpose

The code reads text files from a specific directory, tokenizes the text into words, removes stop words (common words that carry little meaning), stems the remaining words to their root form, and builds an index that maps each stemmed word to the files it appears in. It then calculates term frequencies and TF-IDF values for the terms in the corpus, generating matrices and tables that represent the distribution of terms across the documents. This information can be used to analyze the content of the documents, find similar documents, and support other text-mining tasks.

In short, the notebook builds a pipeline for analyzing Spanish text with Python and data science libraries: it processes a collection of documents, extracts the informative words, and represents them numerically for further analysis and mining.

In [None]:
# Import necessary libraries
from IPython import get_ipython
from IPython.display import display
import os # for file and directory operations
import random # for random file selection

In [None]:
# Mount Google Drive to access files
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
# Select a random file from the specified directory
archivo = random.choice(os.listdir('/content/drive/MyDrive/Similitud/'))

In [None]:
# Read the content of the selected file
with open('/content/drive/MyDrive/Similitud/'+archivo,"r", encoding="utf8") as entrada:
    texto = entrada.read()

In [None]:
texto #Print the content of the file

'Online Distributed Proofreading Team.\n\n\n\n\n\n\n\n\n\n\nNOVELAS\n\nDE\n\nVOLTAIRE,\n\nTRADUCIDAS\n\nPOR J. MARCHENA.\n\n\nBURDEOS,\n\nIMPRENTA DE PEDRO BEAUME,\n\nALLÉES DE TOURNY, NO. 5.\n\n1819.\n\n\n\n\nZADIG,\n\nó\n\nEL DESTINO,\n\nHISTORIA ORIENTAL.\n\n\n\n\nDEDICATORIA DE ZADIG\n\nA LA SULTANA CHERAAH, POR SADI.\n\n\nA 18 del mes de Cheval, año 837 de la hegira.\n\nEmbeleso de las niñas de los ojos, tormento del corazon, luz del\nánimo, no beso yo el polvo de tus piés, porque ó no andas á pié, ó si\nandas, pisas ó rosas ó tapetes de Iran. Ofrézcote la version de un\nlibro de un sabio de la antigüedad, que siendo tan feliz que nada\ntenia que hacer, gozó la dicha mayor de divertirse con escribir la\nhistoria de Zadig, libro que dice mas de lo que parece. Ruégote que le\nleas y le aprecies en lo que valiere; pues aunque todavía está tu vida\nen su primavera, aunque te embisten de rondon los pasatiempos todos,\naunque eres hermosa, y tu talento da á tu hermosura mayor realce,\na

In [None]:
print(texto)

Online Distributed Proofreading Team.










NOVELAS

DE

VOLTAIRE,

TRADUCIDAS

POR J. MARCHENA.


BURDEOS,

IMPRENTA DE PEDRO BEAUME,

ALLÉES DE TOURNY, NO. 5.

1819.




ZADIG,

ó

EL DESTINO,

HISTORIA ORIENTAL.




DEDICATORIA DE ZADIG

A LA SULTANA CHERAAH, POR SADI.


A 18 del mes de Cheval, año 837 de la hegira.

Embeleso de las niñas de los ojos, tormento del corazon, luz del
ánimo, no beso yo el polvo de tus piés, porque ó no andas á pié, ó si
andas, pisas ó rosas ó tapetes de Iran. Ofrézcote la version de un
libro de un sabio de la antigüedad, que siendo tan feliz que nada
tenia que hacer, gozó la dicha mayor de divertirse con escribir la
historia de Zadig, libro que dice mas de lo que parece. Ruégote que le
leas y le aprecies en lo que valiere; pues aunque todavía está tu vida
en su primavera, aunque te embisten de rondon los pasatiempos todos,
aunque eres hermosa, y tu talento da á tu hermosura mayor realce,
aunque te elogian de dia y de noche, motivos concomitantes que

In [None]:
texto.split() # Split the text into words

['Online',
 'Distributed',
 'Proofreading',
 'Team.',
 'NOVELAS',
 'DE',
 'VOLTAIRE,',
 'TRADUCIDAS',
 'POR',
 'J.',
 'MARCHENA.',
 'BURDEOS,',
 'IMPRENTA',
 'DE',
 'PEDRO',
 'BEAUME,',
 'ALLÉES',
 'DE',
 'TOURNY,',
 'NO.',
 '5.',
 '1819.',
 'ZADIG,',
 'ó',
 'EL',
 'DESTINO,',
 'HISTORIA',
 'ORIENTAL.',
 'DEDICATORIA',
 'DE',
 'ZADIG',
 'A',
 'LA',
 'SULTANA',
 'CHERAAH,',
 'POR',
 'SADI.',
 'A',
 '18',
 'del',
 'mes',
 'de',
 'Cheval,',
 'año',
 '837',
 'de',
 'la',
 'hegira.',
 'Embeleso',
 'de',
 'las',
 'niñas',
 'de',
 'los',
 'ojos,',
 'tormento',
 'del',
 'corazon,',
 'luz',
 'del',
 'ánimo,',
 'no',
 'beso',
 'yo',
 'el',
 'polvo',
 'de',
 'tus',
 'piés,',
 'porque',
 'ó',
 'no',
 'andas',
 'á',
 'pié,',
 'ó',
 'si',
 'andas,',
 'pisas',
 'ó',
 'rosas',
 'ó',
 'tapetes',
 'de',
 'Iran.',
 'Ofrézcote',
 'la',
 'version',
 'de',
 'un',
 'libro',
 'de',
 'un',
 'sabio',
 'de',
 'la',
 'antigüedad,',
 'que',
 'siendo',
 'tan',
 'feliz',
 'que',
 'nada',
 'tenia',
 'que',
 'hacer,',

In [None]:
# Define a list of separators for text cleaning
separadores = ["[","]","(",")",",",".",";",":","\"","¿","?","¡","!","--","_"]

for separador in separadores:
    texto = texto.replace(separador," ")
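The separator loop above can also be written as a single regular-expression substitution. The following is a minimal sketch under that assumption; `limpiar` and the sample sentence are hypothetical names introduced only for illustration, and the separator list is copied from the cell above.

```python
import re

# Same separator list as the notebook cell above
separadores = ["[", "]", "(", ")", ",", ".", ";", ":", "\"", "¿", "?", "¡", "!", "--", "_"]

def limpiar(texto, separadores):
    """Replace every separator with a space in one pass (hypothetical helper)."""
    # Try the longest separators first so that "--" is matched as a unit
    ordenados = sorted(separadores, key=len, reverse=True)
    patron = "|".join(re.escape(s) for s in ordenados)
    return re.sub(patron, " ", texto)

ejemplo = "¿Zadig, ó el destino?--dijo."
print(limpiar(ejemplo, separadores).split())
```

For this separator list the result should match the loop version; the regex form only avoids rescanning the text once per separator.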

In [None]:
texto.split() # Split the cleaned text into words

['Online',
 'Distributed',
 'Proofreading',
 'Team',
 'NOVELAS',
 'DE',
 'VOLTAIRE',
 'TRADUCIDAS',
 'POR',
 'J',
 'MARCHENA',
 'BURDEOS',
 'IMPRENTA',
 'DE',
 'PEDRO',
 'BEAUME',
 'ALLÉES',
 'DE',
 'TOURNY',
 'NO',
 '5',
 '1819',
 'ZADIG',
 'ó',
 'EL',
 'DESTINO',
 'HISTORIA',
 'ORIENTAL',
 'DEDICATORIA',
 'DE',
 'ZADIG',
 'A',
 'LA',
 'SULTANA',
 'CHERAAH',
 'POR',
 'SADI',
 'A',
 '18',
 'del',
 'mes',
 'de',
 'Cheval',
 'año',
 '837',
 'de',
 'la',
 'hegira',
 'Embeleso',
 'de',
 'las',
 'niñas',
 'de',
 'los',
 'ojos',
 'tormento',
 'del',
 'corazon',
 'luz',
 'del',
 'ánimo',
 'no',
 'beso',
 'yo',
 'el',
 'polvo',
 'de',
 'tus',
 'piés',
 'porque',
 'ó',
 'no',
 'andas',
 'á',
 'pié',
 'ó',
 'si',
 'andas',
 'pisas',
 'ó',
 'rosas',
 'ó',
 'tapetes',
 'de',
 'Iran',
 'Ofrézcote',
 'la',
 'version',
 'de',
 'un',
 'libro',
 'de',
 'un',
 'sabio',
 'de',
 'la',
 'antigüedad',
 'que',
 'siendo',
 'tan',
 'feliz',
 'que',
 'nada',
 'tenia',
 'que',
 'hacer',
 'gozó',
 'la',
 'dicha',

In [None]:
%pip install nltk  # Install NLTK in the notebook environment



In [None]:
import nltk # Import NLTK

In [None]:
nltk.download('punkt_tab') # Download necessary NLTK data

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


True

In [None]:
tokens = nltk.word_tokenize(texto,"spanish") # Tokenize the text using NLTK
tokens

['Online',
 'Distributed',
 'Proofreading',
 'Team',
 'NOVELAS',
 'DE',
 'VOLTAIRE',
 'TRADUCIDAS',
 'POR',
 'J',
 'MARCHENA',
 'BURDEOS',
 'IMPRENTA',
 'DE',
 'PEDRO',
 'BEAUME',
 'ALLÉES',
 'DE',
 'TOURNY',
 'NO',
 '5',
 '1819',
 'ZADIG',
 'ó',
 'EL',
 'DESTINO',
 'HISTORIA',
 'ORIENTAL',
 'DEDICATORIA',
 'DE',
 'ZADIG',
 'A',
 'LA',
 'SULTANA',
 'CHERAAH',
 'POR',
 'SADI',
 'A',
 '18',
 'del',
 'mes',
 'de',
 'Cheval',
 'año',
 '837',
 'de',
 'la',
 'hegira',
 'Embeleso',
 'de',
 'las',
 'niñas',
 'de',
 'los',
 'ojos',
 'tormento',
 'del',
 'corazon',
 'luz',
 'del',
 'ánimo',
 'no',
 'beso',
 'yo',
 'el',
 'polvo',
 'de',
 'tus',
 'piés',
 'porque',
 'ó',
 'no',
 'andas',
 'á',
 'pié',
 'ó',
 'si',
 'andas',
 'pisas',
 'ó',
 'rosas',
 'ó',
 'tapetes',
 'de',
 'Iran',
 'Ofrézcote',
 'la',
 'version',
 'de',
 'un',
 'libro',
 'de',
 'un',
 'sabio',
 'de',
 'la',
 'antigüedad',
 'que',
 'siendo',
 'tan',
 'feliz',
 'que',
 'nada',
 'tenia',
 'que',
 'hacer',
 'gozó',
 'la',
 'dicha',

In [None]:
from nltk.corpus import stopwords # Import stopwords from NLTK

In [None]:
nltk.download('stopwords') # Download stopwords for Spanish

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [None]:
stop_words = set(stopwords.words('spanish')) # Create a set of Spanish stopwords

In [None]:
# It is important to review the context of the corpus, because the tokenizer's behavior
# sometimes needs adjusting. For example, here we strip underscores and long dashes
# from the start and end of each token.
for i,token in enumerate(tokens):
    if token.startswith("_") or token.startswith("—"):
        tokens[i] = tokens[i][1:]
    if token.endswith("_") or token.endswith("—"):
        tokens[i] = tokens[i][:-1]
texto = " ".join(tokens)

In [None]:
# Define a function to tokenize text and remove punctuation
def tokenizar(texto):
    puntuacion = '!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~¿¡'
    tokens = nltk.word_tokenize(texto,"spanish")
    for i,token in enumerate(tokens):
        tokens[i] = token.strip(puntuacion)
    texto = " ".join(tokens)
    tokens = nltk.word_tokenize(texto,"spanish")
    return tokens
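A short illustration of what `str.strip(puntuacion)` does inside `tokenizar` may help. This sketch uses only the standard library and a few hand-picked sample tokens (an assumption, not taken from the corpus): punctuation is removed from both ends of a token, internal punctuation is kept, and a token made only of punctuation becomes an empty string, which is presumably why the function joins the tokens and tokenizes a second time.

```python
# Same punctuation string as in tokenizar
puntuacion = '!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~¿¡'

# Hand-picked sample tokens (hypothetical)
ejemplos = ["¿Qué?", "_Zadig_", "bien-estar", "..."]
resultados = [token.strip(puntuacion) for token in ejemplos]

for original, limpio in zip(ejemplos, resultados):
    print(repr(original), "->", repr(limpio))
```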

In [None]:
tokens = tokenizar(texto) # Tokenize the text using the defined function
tokens

['Online',
 'Distributed',
 'Proofreading',
 'Team',
 'NOVELAS',
 'DE',
 'VOLTAIRE',
 'TRADUCIDAS',
 'POR',
 'J',
 'MARCHENA',
 'BURDEOS',
 'IMPRENTA',
 'DE',
 'PEDRO',
 'BEAUME',
 'ALLÉES',
 'DE',
 'TOURNY',
 'NO',
 '5',
 '1819',
 'ZADIG',
 'ó',
 'EL',
 'DESTINO',
 'HISTORIA',
 'ORIENTAL',
 'DEDICATORIA',
 'DE',
 'ZADIG',
 'A',
 'LA',
 'SULTANA',
 'CHERAAH',
 'POR',
 'SADI',
 'A',
 '18',
 'del',
 'mes',
 'de',
 'Cheval',
 'año',
 '837',
 'de',
 'la',
 'hegira',
 'Embeleso',
 'de',
 'las',
 'niñas',
 'de',
 'los',
 'ojos',
 'tormento',
 'del',
 'corazon',
 'luz',
 'del',
 'ánimo',
 'no',
 'beso',
 'yo',
 'el',
 'polvo',
 'de',
 'tus',
 'piés',
 'porque',
 'ó',
 'no',
 'andas',
 'á',
 'pié',
 'ó',
 'si',
 'andas',
 'pisas',
 'ó',
 'rosas',
 'ó',
 'tapetes',
 'de',
 'Iran',
 'Ofrézcote',
 'la',
 'version',
 'de',
 'un',
 'libro',
 'de',
 'un',
 'sabio',
 'de',
 'la',
 'antigüedad',
 'que',
 'siendo',
 'tan',
 'feliz',
 'que',
 'nada',
 'tenia',
 'que',
 'hacer',
 'gozó',
 'la',
 'dicha',

In [None]:
word_tokens = [word for word in tokens if word.isalpha()] # Filter out non-alphabetic tokens

In [None]:
filtered_sentence = [w for w in word_tokens if w.lower() not in stop_words] # Filter out stopwords from the word tokens (case-insensitive)

In [None]:
filtered_sentence = [] # Create an empty list to store filtered tokens

In [None]:
# Add non-stopword tokens to the filtered list. This comparison is case-sensitive,
# so uppercase stopwords such as 'DE' or 'POR' are kept.
for w in word_tokens:
    if w not in stop_words:
        filtered_sentence.append(w) # keep the token

print(word_tokens)

['Online', 'Distributed', 'Proofreading', 'Team', 'NOVELAS', 'DE', 'VOLTAIRE', 'TRADUCIDAS', 'POR', 'J', 'MARCHENA', 'BURDEOS', 'IMPRENTA', 'DE', 'PEDRO', 'BEAUME', 'ALLÉES', 'DE', 'TOURNY', 'NO', 'ZADIG', 'ó', 'EL', 'DESTINO', 'HISTORIA', 'ORIENTAL', 'DEDICATORIA', 'DE', 'ZADIG', 'A', 'LA', 'SULTANA', 'CHERAAH', 'POR', 'SADI', 'A', 'del', 'mes', 'de', 'Cheval', 'año', 'de', 'la', 'hegira', 'Embeleso', 'de', 'las', 'niñas', 'de', 'los', 'ojos', 'tormento', 'del', 'corazon', 'luz', 'del', 'ánimo', 'no', 'beso', 'yo', 'el', 'polvo', 'de', 'tus', 'piés', 'porque', 'ó', 'no', 'andas', 'á', 'pié', 'ó', 'si', 'andas', 'pisas', 'ó', 'rosas', 'ó', 'tapetes', 'de', 'Iran', 'Ofrézcote', 'la', 'version', 'de', 'un', 'libro', 'de', 'un', 'sabio', 'de', 'la', 'antigüedad', 'que', 'siendo', 'tan', 'feliz', 'que', 'nada', 'tenia', 'que', 'hacer', 'gozó', 'la', 'dicha', 'mayor', 'de', 'divertirse', 'con', 'escribir', 'la', 'historia', 'de', 'Zadig', 'libro', 'que', 'dice', 'mas', 'de', 'lo', 'que', 'p

In [None]:
len(word_tokens)

25823

In [None]:
print(filtered_sentence) # Print the filtered tokens

['Online', 'Distributed', 'Proofreading', 'Team', 'NOVELAS', 'DE', 'VOLTAIRE', 'TRADUCIDAS', 'POR', 'J', 'MARCHENA', 'BURDEOS', 'IMPRENTA', 'DE', 'PEDRO', 'BEAUME', 'ALLÉES', 'DE', 'TOURNY', 'NO', 'ZADIG', 'ó', 'EL', 'DESTINO', 'HISTORIA', 'ORIENTAL', 'DEDICATORIA', 'DE', 'ZADIG', 'A', 'LA', 'SULTANA', 'CHERAAH', 'POR', 'SADI', 'A', 'mes', 'Cheval', 'año', 'hegira', 'Embeleso', 'niñas', 'ojos', 'tormento', 'corazon', 'luz', 'ánimo', 'beso', 'polvo', 'piés', 'ó', 'andas', 'á', 'pié', 'ó', 'si', 'andas', 'pisas', 'ó', 'rosas', 'ó', 'tapetes', 'Iran', 'Ofrézcote', 'version', 'libro', 'sabio', 'antigüedad', 'siendo', 'tan', 'feliz', 'tenia', 'hacer', 'gozó', 'dicha', 'mayor', 'divertirse', 'escribir', 'historia', 'Zadig', 'libro', 'dice', 'mas', 'parece', 'Ruégote', 'leas', 'aprecies', 'valiere', 'pues', 'aunque', 'todavía', 'vida', 'primavera', 'aunque', 'embisten', 'rondon', 'pasatiempos', 'aunque', 'hermosa', 'talento', 'da', 'á', 'hermosura', 'mayor', 'realce', 'aunque', 'elogian', 'di

In [None]:
len(filtered_sentence)

14460
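The filtered output above still contains uppercase forms such as 'DE' and 'POR', because NLTK's Spanish stopword list is lowercase and the loop compares tokens as they are. The sketch below contrasts the two comparisons; the tiny stopword set and token list are stand-ins chosen for illustration, not the real NLTK list.

```python
# Tiny stand-in for stopwords.words('spanish') (assumption, for illustration only)
stop_words = {"de", "la", "por", "el"}
word_tokens = ["NOVELAS", "DE", "VOLTAIRE", "de", "la", "hegira"]

# Case-sensitive: only exact lowercase matches are removed
sensible = [w for w in word_tokens if w not in stop_words]

# Case-insensitive: each token is lowercased before the lookup
insensible = [w for w in word_tokens if w.lower() not in stop_words]

print(sensible)
print(insensible)
```

Lowercasing the whole text before tokenizing, as the later indexing cells do, has the same effect.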

In [None]:
# Build an index of words and files
indice = {}
corpus = os.listdir('/content/drive/MyDrive/Similitud/')
for archivo in corpus:
    with open('/content/drive/MyDrive/Similitud/'+archivo,"r", encoding="utf8") as entrada:
        texto = entrada.read()
    tokens = tokenizar(texto)
    vocabulario = set(tokens)
    for palabra in vocabulario:
        if palabra not in indice:
            indice[palabra] = set()
        indice[palabra].add(archivo)
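An index of this shape (word to set of files) supports simple Boolean queries through set operations. The following is a minimal sketch on in-memory strings; the document names, texts, and the `buscar` helper are hypothetical, and whitespace splitting stands in for `tokenizar`.

```python
from collections import defaultdict

# Hypothetical in-memory documents standing in for the files on Drive
documentos = {
    "doc_a": "el viento áspero llamaba",
    "doc_b": "el viento del laberinto",
    "doc_c": "un laberinto áspero",
}

# Inverted index: word -> set of documents that contain it
indice = defaultdict(set)
for nombre, texto in documentos.items():
    for palabra in set(texto.split()):
        indice[palabra].add(nombre)

def buscar(indice, *palabras):
    """Return the documents that contain every given word."""
    conjuntos = [indice.get(p, set()) for p in palabras]
    return set.intersection(*conjuntos) if conjuntos else set()

print(sorted(buscar(indice, "viento")))
print(sorted(buscar(indice, "laberinto", "áspero")))
```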

In [None]:
indice

{'sucedidas': {'1000', '2000'},
 'calamitosos': {'1000', '2000'},
 'bullía': {'1000', '2000'},
 'agraviaba': {'1000', '2000'},
 'alojado': {'1000', '2000'},
 'desembolsar': {'1000', '2000'},
 'Arremetía': {'1000', '2000'},
 'quebrando': {'1000', '2000'},
 'pidieron': {'1000', '2000'},
 'latina': {'1000', '2000'},
 'doblará': {'1000', '2000'},
 'PRIVILEGIO': {'1000', '2000'},
 'apeado': {'1000', '2000'},
 'vee': {'1000', '2000'},
 'escribo': {'1000', '2000'},
 'niñerías': {'1000', '2000'},
 'laberinto': {'1000', '2000'},
 'ensártalos': {'1000', '2000'},
 'llamaba': {'1000', '2000', '5985'},
 'capellán': {'1000', '2000'},
 'Capítulo': {'1000', '2000'},
 'acostar': {'1000', '2000', '5985'},
 'Cual': {'1000', '2000'},
 'militares': {'1000', '2000', '5985'},
 'ascuras': {'1000', '2000'},
 'andariega': {'1000', '2000'},
 'viento': {'1000', '2000', '5985'},
 'parecióles': {'1000', '2000'},
 'presentados': {'1000', '2000'},
 'Camoes': {'1000', '2000'},
 'áspero': {'1000', '2000', '5985'},
 'tr

In [None]:
print(indice) # Print the index

{'sucedidas': {'1000', '2000'}, 'calamitosos': {'1000', '2000'}, 'bullía': {'1000', '2000'}, 'agraviaba': {'1000', '2000'}, 'alojado': {'1000', '2000'}, 'desembolsar': {'1000', '2000'}, 'Arremetía': {'1000', '2000'}, 'quebrando': {'1000', '2000'}, 'pidieron': {'1000', '2000'}, 'latina': {'1000', '2000'}, 'doblará': {'1000', '2000'}, 'PRIVILEGIO': {'1000', '2000'}, 'apeado': {'1000', '2000'}, 'vee': {'1000', '2000'}, 'escribo': {'1000', '2000'}, 'niñerías': {'1000', '2000'}, 'laberinto': {'1000', '2000'}, 'ensártalos': {'1000', '2000'}, 'llamaba': {'5985', '1000', '2000'}, 'capellán': {'1000', '2000'}, 'Capítulo': {'1000', '2000'}, 'acostar': {'5985', '1000', '2000'}, 'Cual': {'1000', '2000'}, 'militares': {'5985', '1000', '2000'}, 'ascuras': {'1000', '2000'}, 'andariega': {'1000', '2000'}, 'viento': {'5985', '1000', '2000'}, 'parecióles': {'1000', '2000'}, 'presentados': {'1000', '2000'}, 'Camoes': {'1000', '2000'}, 'áspero': {'5985', '1000', '2000'}, 'tropiece': {'1000', '2000'}, 'sed

In [None]:
len(indice)

27201

In [None]:
stemmer = nltk.stem.SnowballStemmer("spanish") # Create a Snowball stemmer for Spanish

In [None]:
# Stem a word using the stemmer
palabra = "corriendo"
raiz = stemmer.stem(palabra)
print(raiz)

corr


In [None]:
# Define a function to stem tokens
def stemmizar(tokens):
    stemmer = nltk.stem.SnowballStemmer("spanish")
    stems = []
    for token in tokens:
        stem = stemmer.stem(token)
        stems.append(stem)
    return stems

In [None]:
# Build a new index with stemmed tokens
indice = {}
corpus = '/content/drive/MyDrive/Similitud/'
for archivo in os.listdir(corpus):
    ruta = os.path.join(corpus, archivo)
    if os.path.isfile(ruta):
        with open(ruta, "r", encoding="utf8") as entrada:
            texto = entrada.read().lower()
        tokens = tokenizar(texto)
        stems = stemmizar(tokens)
        vocabulario = set(stems)
        for stem in vocabulario:
            if stem not in indice:
                indice[stem] = set()
            indice[stem].add(archivo)

In [None]:
len(indice) # Print the length of the new index

11620

In [None]:
# Import pandas for data manipulation
import pandas as pd

In [None]:
# Build a frequency index: term -> {file: count}
indice_frecuencias = {}
corpus = '/content/drive/MyDrive/Corpus10/'
for archivo in os.listdir(corpus):
    ruta = os.path.join(corpus, archivo)
    if not os.path.isfile(ruta):
        continue  # skip anything that is not a regular file
    with open(ruta, "r", encoding="utf8") as entrada:
        texto = entrada.read().lower()
    tokens = tokenizar(texto)
    # stems = stemmizar(tokens)  # optional: count stems instead of surface forms
    for token in tokens:
        if token not in indice_frecuencias:
            indice_frecuencias[token] = {}
        if archivo not in indice_frecuencias[token]:
            indice_frecuencias[token][archivo] = 0
        indice_frecuencias[token][archivo] += 1

In [None]:
# Create a frequency table using pandas
vocabulario = list(indice_frecuencias.keys())
# corpus now holds the file names, which become the table columns
corpus = os.listdir('/content/drive/MyDrive/Corpus10/')
tabla_frecuencias = pd.DataFrame(0,index=vocabulario,columns=corpus)
for entrada in indice_frecuencias:
    for archivo in indice_frecuencias[entrada]:
        tabla_frecuencias.loc[entrada,archivo] = indice_frecuencias[entrada][archivo]

In [None]:
tabla_frecuencias # Display the frequency table

Unnamed: 0,5985,9890,2000,8870,1619,9980,7109,320,9895,5201
online,1,0,0,0,0,0,1,0,0,0
distributed,1,1,0,0,0,1,1,0,1,0
proofreading,1,0,0,0,0,0,1,0,0,0
team,1,0,0,0,0,3,1,0,0,0
novelas,1,0,6,0,0,0,3,0,2,0
...,...,...,...,...,...,...,...,...,...,...
universe,0,0,0,0,0,0,0,0,0,1
record,0,0,0,0,0,0,0,0,0,1
souls,0,0,0,0,0,0,0,0,0,1
mouths,0,0,0,0,0,0,0,0,0,1
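The same term-by-document counts can be sketched with `collections.Counter`, which handles the "start at zero, then increment" bookkeeping. This is an illustrative alternative on two hypothetical in-memory documents, not the notebook's own implementation; whitespace splitting stands in for `tokenizar`.

```python
from collections import Counter

# Hypothetical in-memory documents standing in for the corpus files
documentos = {"doc_a": "que de que la", "doc_b": "de la la la"}

# One Counter per document: term -> number of occurrences
frecuencias = {nombre: Counter(texto.split()) for nombre, texto in documentos.items()}

# Shared vocabulary, then one row of counts per term (a Counter returns 0 for absent terms)
vocabulario = sorted(set().union(*frecuencias.values()))
tabla = {palabra: [frecuencias[n][palabra] for n in documentos] for palabra in vocabulario}

for palabra, fila in tabla.items():
    print(palabra, fila)
```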


In [None]:
# Display the 20 most frequent words of one file (fixed here to "2000")
archivo = "2000"
tabla_frecuencias[archivo].sort_values(ascending=False).head(20)

Unnamed: 0,2000
que,20742
de,18409
y,18264
la,10491
a,9873
en,8283
el,8265
no,6338
los,4769
se,4752


In [None]:
nltk.download("stopwords") # Download stopwords for Spanish (again)
# Get Spanish stopwords and print them
palabras_funcionales=nltk.corpus.stopwords.words("spanish")
print(palabras_funcionales)

['de', 'la', 'que', 'el', 'en', 'y', 'a', 'los', 'del', 'se', 'las', 'por', 'un', 'para', 'con', 'no', 'una', 'su', 'al', 'lo', 'como', 'más', 'pero', 'sus', 'le', 'ya', 'o', 'este', 'sí', 'porque', 'esta', 'entre', 'cuando', 'muy', 'sin', 'sobre', 'también', 'me', 'hasta', 'hay', 'donde', 'quien', 'desde', 'todo', 'nos', 'durante', 'todos', 'uno', 'les', 'ni', 'contra', 'otros', 'ese', 'eso', 'ante', 'ellos', 'e', 'esto', 'mí', 'antes', 'algunos', 'qué', 'unos', 'yo', 'otro', 'otras', 'otra', 'él', 'tanto', 'esa', 'estos', 'mucho', 'quienes', 'nada', 'muchos', 'cual', 'poco', 'ella', 'estar', 'estas', 'algunas', 'algo', 'nosotros', 'mi', 'mis', 'tú', 'te', 'ti', 'tu', 'tus', 'ellas', 'nosotras', 'vosotros', 'vosotras', 'os', 'mío', 'mía', 'míos', 'mías', 'tuyo', 'tuya', 'tuyos', 'tuyas', 'suyo', 'suya', 'suyos', 'suyas', 'nuestro', 'nuestra', 'nuestros', 'nuestras', 'vuestro', 'vuestra', 'vuestros', 'vuestras', 'esos', 'esas', 'estoy', 'estás', 'está', 'estamos', 'estáis', 'están', 'e

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [None]:
corpus

['5985', '9890', '2000', '8870', '1619', '9980', '7109', '320', '9895', '5201']

In [None]:
# Build a document-term count matrix with scikit-learn
from sklearn.feature_extraction.text import CountVectorizer
lista_archivos = ['/content/drive/MyDrive/Corpus10/'+archivo for archivo in corpus]
vectorizador = CountVectorizer(input="filename",analyzer="word")
# Alternatives: pass the custom tokenizer and/or the Spanish stopword list
#vectorizador = CountVectorizer(input="filename",analyzer="word",tokenizer=tokenizar)
#vectorizador = CountVectorizer(input="filename",analyzer="word",tokenizer=tokenizar,stop_words=palabras_funcionales)
matriz_frecuencias = vectorizador.fit_transform(lista_archivos)

In [None]:
matriz_frecuencias.toarray() # Convert the document-term matrix to a dense array and display it


array([[ 0,  0,  0, ...,  0,  0,  1],
       [30,  0,  0, ...,  0,  0,  0],
       [ 0,  0,  0, ...,  0,  4,  1],
       ...,
       [ 0,  0,  0, ...,  0,  0,  0],
       [ 0,  0,  0, ...,  0,  1,  0],
       [ 0,  0,  0, ...,  0,  0,  0]])

In [None]:
# Get the vocabulary (feature names) from the vectorizer and build the frequency table
vocabulario = vectorizador.get_feature_names_out()
# corpus still holds the file names, which become the row labels
tabla_frecuencias = pd.DataFrame(matriz_frecuencias.toarray(),index=corpus,columns=vocabulario)
tabla_frecuencias

Unnamed: 0,000,01,05,10,100,1000,1005,101,1010,1012,...,únicamente,único,únicos,úntense,úntese,úsase,úsense,úsese,útil,útiles
5985,0,0,0,0,0,0,0,0,0,0,...,1,5,2,0,0,0,0,0,0,1
9890,30,0,0,6,1,0,0,0,0,0,...,2,3,1,0,0,0,0,0,0,0
2000,0,0,0,1,0,0,0,0,0,0,...,0,12,1,0,0,1,0,0,4,1
8870,2,1,2,7,16,0,0,0,0,0,...,0,0,0,1,4,0,0,0,1,2
1619,0,0,0,5,2,2,1,1,1,1,...,0,0,0,0,0,0,0,0,0,0
9980,0,0,0,30,0,0,0,0,0,0,...,0,5,0,0,0,0,6,2,0,0
7109,0,0,0,0,0,0,0,0,0,0,...,0,2,0,0,0,0,0,0,2,1
320,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
9895,0,0,0,0,0,0,0,0,0,0,...,0,1,0,0,0,0,0,0,1,0
5201,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [None]:
# Display the 20 most frequent words of one file (fixed here to "2000")
archivo = "2000"
tabla_frecuencias.loc[archivo].sort_values(ascending=False).head(20)

Unnamed: 0,2000
que,20769
de,18410
la,10492
en,8285
el,8265
no,6346
los,4769
se,4752
con,4275
por,3945


In [None]:
# Import TfidfVectorizer for calculating TF-IDF values
from sklearn.feature_extraction.text import TfidfVectorizer

In [None]:
# Create a list of file paths
corpus = os.listdir('/content/drive/MyDrive/Corpus10/')
lista_archivos = ['/content/drive/MyDrive/Corpus10/'+archivo for archivo in corpus]


# Create a TfidfVectorizer object with sublinear TF scaling
vectorizador = TfidfVectorizer(input="filename",analyzer="word",sublinear_tf=True)

# Create a TF-IDF matrix using TfidfVectorizer
matriz_tfidf = vectorizador.fit_transform(lista_archivos)

# Get the vocabulary (feature names) from the vectorizer
vocabulario = vectorizador.get_feature_names_out()

# Create a TF-IDF table using pandas
tabla_tfidf = pd.DataFrame(matriz_tfidf.toarray(),index=corpus,columns=vocabulario)

# Display the TF-IDF table
tabla_tfidf

Unnamed: 0,000,01,05,10,100,1000,1005,101,1010,1012,...,únicamente,único,únicos,úntense,úntese,úsase,úsense,úsese,útil,útiles
5985,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.01057,0.017418,0.015658,0.0,0.0,0.0,0.0,0.0,0.0,0.008222
9890,0.040466,0.0,0.0,0.01793,0.008044,0.0,0.0,0.0,0.0,0.0,...,0.015567,0.012185,0.008044,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2000,0.0,0.0,0.0,0.002322,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.007314,0.002908,0.0,0.0,0.00391,0.0,0.0,0.006169,0.002585
8870,0.012923,0.008979,0.015202,0.015707,0.025192,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.008979,0.021425,0.0,0.0,0.0,0.005937,0.010052
1619,0.0,0.0,0.0,0.007937,0.00645,0.008672,0.005122,0.005122,0.005122,0.005122,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9980,0.0,0.0,0.0,0.021636,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.011597,0.0,0.0,0.0,0.0,0.023111,0.014017,0.0,0.0
7109,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.010043,0.0,0.0,0.0,0.0,0.0,0.0,0.01237,0.007306
320,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9895,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.008019,0.0,0.0,0.0,0.0,0.0,0.0,0.009877,0.0
5201,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
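The `sublinear_tf=True` option used above replaces each raw term count tf with 1 + ln(tf), so very frequent words grow much more slowly than their counts. A minimal standard-library sketch of that scaling follows; `tf_sublineal` is a hypothetical helper name, and the last sample count is the frequency of "que" in file 2000 from the earlier frequency table.

```python
import math

def tf_sublineal(tf):
    """Sublinear scaling of a raw count: 1 + ln(tf) for tf > 0, otherwise 0."""
    return 1.0 + math.log(tf) if tf > 0 else 0.0

# Raw counts of increasing size; 20742 is the count of "que" in file 2000
for tf in (1, 10, 100, 20742):
    print(tf, round(tf_sublineal(tf), 3))
```

A word that appears thousands of times therefore ends up with a weight only about ten times that of a word that appears once, before the IDF factor and normalization are applied.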


In [None]:
lista_archivos # Print the list of file paths

['/content/drive/MyDrive/Corpus10/5985',
 '/content/drive/MyDrive/Corpus10/9890',
 '/content/drive/MyDrive/Corpus10/2000',
 '/content/drive/MyDrive/Corpus10/8870',
 '/content/drive/MyDrive/Corpus10/1619',
 '/content/drive/MyDrive/Corpus10/9980',
 '/content/drive/MyDrive/Corpus10/7109',
 '/content/drive/MyDrive/Corpus10/320',
 '/content/drive/MyDrive/Corpus10/9895',
 '/content/drive/MyDrive/Corpus10/5201']

In [None]:
vectorizador # Print the vectorizer object

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer # Import TfidfVectorizer
vectorizer=TfidfVectorizer() # Create a TfidfVectorizer object

In [None]:
x=vectorizer.fit_transform(corpus) # corpus holds file names here, so the vectorizer learns the names themselves, not the file contents


In [None]:
vectorizer.get_feature_names_out() # Get the feature names from the vectorizer

array(['1619', '2000', '320', '5201', '5985', '7109', '8870', '9890',
       '9895', '9980'], dtype=object)

In [None]:
matriz_tfidf # Print the TF-IDF matrix

<10x53858 sparse matrix of type '<class 'numpy.float64'>'
	with 85885 stored elements in Compressed Sparse Row format>

In [None]:
vocabulario # Print the vocabulary

array(['000', '01', '05', ..., 'úsese', 'útil', 'útiles'], dtype=object)

In [None]:
corpus

['5985', '9890', '2000', '8870', '1619', '9980', '7109', '320', '9895', '5201']

In [None]:
# Select a file and display its top 20 TF-IDF values
archivo = "2000"
tabla_tfidf.loc[archivo].sort_values(ascending=False).head(20)

Unnamed: 0,2000
quijote,0.028964
sancho,0.028863
panza,0.026834
dulcinea,0.026023
vuesa,0.024682
andante,0.024565
toboso,0.023872
andantes,0.0238
capítulo,0.023577
camila,0.023447


In [None]:
archivo

'2000'

In [None]:
# Sum the TF-IDF weights of "quijote" and "dulcinea" in each document and rank the documents
tabla_tfidf[["quijote", "dulcinea"]].sum(axis=1).sort_values(ascending=False).head(20)

Unnamed: 0,0
2000,0.054986
9980,0.011915
5985,0.0
9890,0.0
8870,0.0
1619,0.0
7109,0.0
320,0.0
9895,0.0
5201,0.0


In [None]:
# Create a TF-IDF matrix for a different corpus
corpus = os.listdir('/content/drive/MyDrive/Similitud/')
lista_archivos = ['/content/drive/MyDrive/Similitud/'+archivo for archivo in corpus]
vectorizador = TfidfVectorizer(input="filename",analyzer="word")
matriz_tfidf = vectorizador.fit_transform(lista_archivos)
vocabulario = vectorizador.get_feature_names_out()
tabla_tfidf = pd.DataFrame(matriz_tfidf.toarray(),index=corpus,columns=vocabulario)
tabla_tfidf

Unnamed: 0,10,16,1604,1614,1615,17,18,1819,23,837,...,últimas,último,últimos,única,únicamente,único,únicos,úsase,útil,útiles
2000,3.5e-05,3.5e-05,3.5e-05,3.5e-05,3.5e-05,3.5e-05,0.0,0.0,3.5e-05,0.0,...,0.00011,0.00099,0.000106,0.000192,0.0,0.000329,2.7e-05,3.5e-05,0.000141,2.7e-05
5985,0.0,0.0,0.0,0.0,0.0,0.0,0.00069,0.00069,0.0,0.00069,...,0.000408,0.0,0.0,0.000815,0.00069,0.002038,0.000815,0.0,0.0,0.000408
1000,3.5e-05,3.5e-05,3.5e-05,3.5e-05,3.5e-05,3.5e-05,0.0,0.0,3.5e-05,0.0,...,0.00011,0.00099,0.000106,0.000192,0.0,0.000329,2.7e-05,3.5e-05,0.000141,2.7e-05
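In the columns shown above, rows 2000 and 1000 carry identical values. One common way to quantify that kind of closeness between TF-IDF rows is cosine similarity. The sketch below is a standard-library illustration on toy vectors; the vectors and document names are hypothetical and are not taken from the table.

```python
import math

def coseno(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    punto = sum(a * b for a, b in zip(u, v))
    norma_u = math.sqrt(sum(a * a for a in u))
    norma_v = math.sqrt(sum(b * b for b in v))
    if norma_u == 0 or norma_v == 0:
        return 0.0
    return punto / (norma_u * norma_v)

# Toy TF-IDF rows (hypothetical values, one weight per vocabulary term)
vectores = {
    "doc_a": [0.5, 0.0, 0.5],
    "doc_b": [0.0, 1.0, 0.0],
    "doc_c": [0.5, 0.0, 0.5],
}

print(round(coseno(vectores["doc_a"], vectores["doc_c"]), 6))  # identical rows
print(round(coseno(vectores["doc_a"], vectores["doc_b"]), 6))  # no shared terms
```

Identical rows give a similarity of 1 and rows with no terms in common give 0.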


In [None]:
from sklearn.feature_extraction.text import CountVectorizer

In [None]:
import numpy as np  # Make sure to import numpy
from sklearn.feature_extraction.text import TfidfVectorizer
# Toy corpus: each "document" is only a file name, so it contains exactly one term
corpus = [
    '2000',
    '5985',
    '1000'
]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)
vectorizer.get_feature_names_out()
print(X.shape)

(3, 3)


In [None]:
# Get feature names and create a DataFrame

# Get the terms (features) that appear in the documents
features = vectorizer.get_feature_names_out()
# Convert the sparse matrix to a dense one so it can be paired with the feature names
dense_matrix = X.toarray()
# Build a DataFrame
df = pd.DataFrame(dense_matrix, columns=features)

# Insert the corresponding document as the first column
df.insert(0,"Documento",corpus)

In [None]:
df

Unnamed: 0,Documento,1000,2000,5985
0,2000,0.0,1.0,0.0
1,5985,0.0,0.0,1.0
2,1000,1.0,0.0,0.0


In [None]:
idf_values = vectorizer.idf_

# Show the IDF of each term
idf_df = pd.DataFrame({'Palabra': features, 'IDF': idf_values})
print(idf_df)

  Palabra       IDF
0    1000  1.693147
1    2000  1.693147
2    5985  1.693147
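The value 1.693147 can be checked by hand. With its default settings (`smooth_idf=True`), scikit-learn computes idf(t) = ln((1 + n) / (1 + df(t))) + 1, where n is the number of documents and df(t) the number of documents that contain the term. Here n = 3 and every term appears in exactly one document. A small sketch of that arithmetic, with `idf_suavizado` as a hypothetical helper name:

```python
import math

def idf_suavizado(n_documentos, df):
    """Smoothed IDF as used by scikit-learn's default settings."""
    return math.log((1 + n_documentos) / (1 + df)) + 1

# Three documents, each term present in exactly one of them
print(round(idf_suavizado(3, 1), 6))  # 1.693147
```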


In [None]:
# Divide the TF-IDF matrix by the IDF. The rows of X are L2-normalized,
# so the result is a normalized TF rather than raw counts.
tf_matrix = X.multiply(1 / idf_values)
tf_df = pd.DataFrame(tf_matrix.toarray(), columns=features)

# Add the "Documento" column
tf_df.insert(0, "Documento", corpus)

# Show the TF matrix
print("\nMatriz de frecuencias de término (TF):")
print(tf_df)


Matriz de frecuencias de término (TF):
  Documento      1000      2000      5985
0      2000  0.000000  0.590616  0.000000
1      5985  0.000000  0.000000  0.590616
2      1000  0.590616  0.000000  0.000000
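The entries are 0.590616 rather than 1 because `TfidfVectorizer` L2-normalizes each row by default. A one-term document has a single non-zero weight that normalizes to 1.0, and dividing that by the IDF gives 1 / 1.693147. The sketch below reproduces the arithmetic with the standard library only, assuming one term with a raw count of 1 in a three-document corpus.

```python
import math

idf = math.log((1 + 3) / (1 + 1)) + 1        # smoothed IDF: term in 1 of 3 documents
fila = [0.0, 1 * idf, 0.0]                   # raw tf * idf for a one-term document

norma = math.sqrt(sum(x * x for x in fila))  # L2 norm of the row
fila_normalizada = [x / norma for x in fila] # what the vectorizer stores: [0, 1, 0]

recuperado = [x / idf for x in fila_normalizada]
print([round(x, 6) for x in recuperado])     # [0.0, 0.590616, 0.0]
```

Recovering raw counts would require turning normalization off (`norm=None`) or using `CountVectorizer` directly.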


In [None]:
tf_df


Unnamed: 0,Documento,1000,2000,5985
0,2000,0.0,0.590616,0.0
1,5985,0.0,0.0,0.590616
2,1000,0.590616,0.0,0.0

