# Content-based recommendation system

__`Author`: Marvin Correia__

The goal of this project is to implement a content-based recommendation
system that recommends the most relevant documents to a client using the
k-nearest neighbors (KNN) algorithm.

The software receives a plain-text file with the .txt extension that
contains the set of candidate documents to recommend to the end user.
Each document occupies one line of the file.

## Package installation

In [1]:
%pip install scikit-learn nltk

Note: you may need to restart the kernel to use updated packages.


## Text preprocessing techniques used

### Stop word removal (`Stop Words`)

Stop words are common words that are usually removed during the preprocessing stage of text processing. They are considered uninformative because they appear frequently in texts from every domain and contribute little to understanding the content.

Removing `stop words` helps clean up a sentence in several ways:

- It reduces the size of the vocabulary;
- It keeps the focus on the most important words;
- It improves the efficiency of text-processing algorithms.
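
As a minimal sketch of the idea, the snippet below filters a sentence against a small stop-word set. The set `STOP_WORDS` and the helper `remove_stop_words` are hypothetical names chosen for illustration; the notebook itself relies on NLTK's much longer list.

```python
# Hypothetical, hand-picked stop-word set for illustration only;
# NLTK's English list is considerably longer.
STOP_WORDS = {"the", "is", "and", "a", "of", "with", "this", "that"}


def remove_stop_words(text: str) -> list[str]:
    """Lowercase the text, split on whitespace, and drop stop words."""
    return [token for token in text.lower().split() if token not in STOP_WORDS]


sentence = "This is a wine that is smooth and structured"
tokens = remove_stop_words(sentence)
print(tokens)  # ['wine', 'smooth', 'structured']
```

Only three of the nine words survive, which is the vocabulary reduction described in the list above.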

**Examples of stop words:**

In [2]:
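import nltk
from nltk.corpus import stopwords

# The stop-word corpus must be available locally before it can be read
nltk.download('stopwords', quiet=True)
print(stopwords.words('english'))
print(stopwords.words('spanish'))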

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

### Word stemming

`Stemming` is the process of reducing a word to its root, also known as its `stem`. The goal is to obtain the base form of a word by ignoring suffixes and prefixes, so that different variations of the same word are treated as the same root.

Stemming helps in the following ways:

- Vocabulary reduction
- Word normalization
- Better keyword matching

**For example**, suppose you have a movie recommendation system in which one user rates the movie "walk" and another rates the movie "walked". After stemming, both words reduce to the root form "walk".

In [3]:
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords
from nltk import word_tokenize
import pandas as pd

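document = "Walked Walking Walked"
# Build the stop-word set once instead of reloading the list for every token
stop_words = set(stopwords.words('english'))
tokens = [token for token in word_tokenize(document.lower()) if token.isalpha() and token not in stop_words]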

stemmer = PorterStemmer()
stemmed_tokens = [stemmer.stem(token) for token in tokens]
pd.DataFrame({'original': tokens, 'stemmed': stemmed_tokens})

Unnamed: 0,original,stemmed
0,walked,walk
1,walking,walk
2,walked,walk


## Module imports

In [4]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.neighbors import NearestNeighbors
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords
from nltk import word_tokenize
import nltk
import math
from IPython.display import display, Markdown

nltk.download('punkt')
nltk.download('stopwords')

pd.set_option('display.max_colwidth', 100)

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/macbookair/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/macbookair/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


## Main code

The code generates the Markdown content used to render, for each document, the table of terms and their frequencies,
and it also computes the cosine similarity between the documents.
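
As a hedged sanity check of the TF and IDF definitions used in the cell below, the following sketch recomputes one row by hand. It assumes the figures reported for document 1 in the output further down: 20 terms after preprocessing, `aromas` occurring once, and 4 of the 7 documents containing it. The helper names are illustrative. The last function shows the smoothed IDF that scikit-learn's `TfidfVectorizer` applies by default (`smooth_idf=True`), which is one reason the similarity scores cannot be derived directly from the hand-computed tables.

```python
import math


def tf(term_count: int, total_terms: int) -> float:
    """Term frequency: occurrences of the term divided by document length."""
    return term_count / total_terms


def idf_plain(n_docs: int, docs_with_term: int) -> float:
    """Plain IDF, log(N / df), as in the notebook's calculate_idf."""
    return math.log(n_docs / docs_with_term) if docs_with_term > 0 else 0.0


def idf_smoothed(n_docs: int, docs_with_term: int) -> float:
    """Smoothed IDF, ln((1 + N) / (1 + df)) + 1, the TfidfVectorizer default."""
    return math.log((1 + n_docs) / (1 + docs_with_term)) + 1


# "aromas" in document 1: 1 occurrence among 20 terms, present in 4 of 7 documents
aromas_tf = tf(1, 20)
aromas_idf = idf_plain(7, 4)
aromas_tfidf = aromas_tf * aromas_idf
print(round(aromas_tf, 6), round(aromas_idf, 6), round(aromas_tfidf, 6))
# 0.05 0.559616 0.027981

# The smoothed variant gives a different weight for the same counts
print(idf_smoothed(7, 4))
```

The first printed line matches the `aromas` row of the document 1 table.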

In [5]:
FILENAME = "documents.txt"
DOC_LANGUAGE = 'english'

def load_documents(filename) -> list[str]:
    with open(filename, mode='r') as file:
        return file.read().splitlines()
    

def docs_sanitize(documents, language, stem=False) -> list[str]:
    """ Remove stop words and, if stem=True, return the stemmed documents """
    san_documents = []
    stemmer = PorterStemmer()
    # Build the stop-word set once instead of reloading the list for every token
    stop_words = set(stopwords.words(language))
    for document in documents:
        # Keep lowercase alphanumeric tokens that are not stop words
        tokens = [token for token in word_tokenize(document.lower()) if token.isalnum() and token not in stop_words]
        if stem:
            stemmed_tokens = [stemmer.stem(token) for token in tokens]
            san_documents.append(' '.join(stemmed_tokens))
        else:
            san_documents.append(' '.join(tokens))
    return san_documents


def calculate_tf(term, terms) -> float:
    """ Term frequency within a document. TF = term_count / total_terms """
    return terms.count(term) / len(terms)


def calculate_idf(term, documents) -> float:
    """ IDF (Inverse Document Frequency) of a term. IDF = log(total_docs / total_docs_with_term), or 0 if no document contains the term """
    document_count = len(documents)
    # `documents` holds strings, so `term in document` is a substring test:
    # "herb" is also counted in documents that contain "herbs" or "herbal"
    term_count = sum([1 for document in documents if term in document])
    return math.log(document_count / term_count) if term_count > 0 else 0


def generate_term_freq_tables(original_docs, preprocessed_docs):
    display(Markdown("## Table Of Terms"))
    for i, document in enumerate(preprocessed_docs):
        terms = document.split()
        data = []
        columns=["Index", "Term", "TF", "IDF", "TF-IDF"]

        for term in terms:
            # Character offset of the term's first occurrence in the lowercased original text
            term_index = original_docs[i].lower().index(term)
            tf = calculate_tf(term, terms)
            idf = calculate_idf(term, preprocessed_docs)
            tf_idf = tf * idf
            data.append([term_index, term, tf, idf, tf_idf])
        
        display(Markdown(f"### Document {i + 1}"))
        display(Markdown(f"**ORIGINAL:** _{original_docs[i]}_"))
        display(Markdown(f"**PRE-PROCESSED:** _{document}_"))
        display(pd.DataFrame(data, columns=columns))
    
    print(end="\n\n")


def cos_similarity_knn(train_docs, target_doc):
    vectorizer = TfidfVectorizer()
    matrix_tfidf = vectorizer.fit_transform(train_docs)
    k = len(train_docs)
    knn_model = NearestNeighbors(n_neighbors=k, metric='cosine')
    knn_model.fit(matrix_tfidf)
    target_tfidf = vectorizer.transform([target_doc.strip()])
    distances, indexes = knn_model.kneighbors(target_tfidf)
    return (distances, indexes)


def generate_similarity_comparation(original_docs, preprocessed_docs):
    display(Markdown("## Cosine Similarity Using KNN"))
    for doc_index, document in enumerate(preprocessed_docs):
        distances, indexes = cos_similarity_knn(preprocessed_docs, document)
        results = []
        for i, index in enumerate(indexes[0]):
            resultado = {
                'Documents': original_docs[index],
                'Similarity': 1 - distances[0][i]
            }
            results.append(resultado)

        display(Markdown(f"**ORIGINAL:** _{original_docs[doc_index]}_"))
        display(Markdown(f"**PRE-PROCESSED:** _{document}_"))
        display(pd.DataFrame(results))
        print()


original_docs = load_documents(FILENAME)
preprocessed_docs = docs_sanitize(original_docs, DOC_LANGUAGE, stem=False)

generate_term_freq_tables(original_docs, preprocessed_docs)
generate_similarity_comparation(original_docs, preprocessed_docs)


## Table Of Terms

### Document 1

**ORIGINAL:** _Aromas include tropical fruit, broom, brimstone and dried herb. The palate isn't overly expressive, offering unripened apple, citrus and dried sage alongside brisk acidity._

**PRE-PROCESSED:** _aromas include tropical fruit broom brimstone dried herb palate overly expressive offering unripened apple citrus dried sage alongside brisk acidity_

Unnamed: 0,Index,Term,TF,IDF,TF-IDF
0,0,aromas,0.05,0.559616,0.027981
1,7,include,0.05,1.94591,0.097296
2,15,tropical,0.05,1.94591,0.097296
3,24,fruit,0.05,0.847298,0.042365
4,31,broom,0.05,1.94591,0.097296
5,38,brimstone,0.05,1.94591,0.097296
6,52,dried,0.1,1.94591,0.194591
7,58,herb,0.05,0.559616,0.027981
8,68,palate,0.05,0.847298,0.042365
9,81,overly,0.05,1.94591,0.097296


### Document 2

**ORIGINAL:** _This is ripe and fruity, a wine that is smooth while still structured. Firm tannins are filled out with juicy red berry fruits and freshened with acidity. It's already drinkable, although it will certainly be better from 2016._

**PRE-PROCESSED:** _ripe fruity wine smooth still structured firm tannins filled juicy red berry fruits freshened acidity already drinkable although certainly better 2016_

Unnamed: 0,Index,Term,TF,IDF,TF-IDF
0,8,ripe,0.047619,1.252763,0.059655
1,17,fruity,0.047619,1.94591,0.092662
2,27,wine,0.047619,0.847298,0.040348
3,40,smooth,0.047619,1.94591,0.092662
4,53,still,0.047619,1.94591,0.092662
5,59,structured,0.047619,1.94591,0.092662
6,71,firm,0.047619,1.94591,0.092662
7,76,tannins,0.047619,1.252763,0.059655
8,88,filled,0.047619,1.94591,0.092662
9,104,juicy,0.047619,1.94591,0.092662


### Document 3

**ORIGINAL:** _Tart and snappy, the flavors of lime flesh and rind dominate. Some green pineapple pokes through, with crisp acidity underscoring the flavors. The wine was all stainless-steel fermented._

**PRE-PROCESSED:** _tart snappy flavors lime flesh rind dominate green pineapple pokes crisp acidity underscoring flavors wine fermented_

Unnamed: 0,Index,Term,TF,IDF,TF-IDF
0,0,tart,0.0625,1.252763,0.078298
1,9,snappy,0.0625,1.94591,0.121619
2,21,flavors,0.125,1.252763,0.156595
3,32,lime,0.0625,1.94591,0.121619
4,37,flesh,0.0625,1.94591,0.121619
5,47,rind,0.0625,1.252763,0.078298
6,52,dominate,0.0625,1.94591,0.121619
7,67,green,0.0625,1.252763,0.078298
8,73,pineapple,0.0625,1.252763,0.078298
9,83,pokes,0.0625,1.94591,0.121619


### Document 4

**ORIGINAL:** _Pineapple rind, lemon pith and orange blossom start off the aromas. The palate is a bit more opulent, with notes of honey-drizzled guava and mango giving way to a slightly astringent, semidry finish._

**PRE-PROCESSED:** _pineapple rind lemon pith orange blossom start aromas palate bit opulent notes guava mango giving way slightly astringent semidry finish_

Unnamed: 0,Index,Term,TF,IDF,TF-IDF
0,0,pineapple,0.05,1.252763,0.062638
1,10,rind,0.05,1.252763,0.062638
2,16,lemon,0.05,1.94591,0.097296
3,22,pith,0.05,1.94591,0.097296
4,31,orange,0.05,1.94591,0.097296
5,38,blossom,0.05,1.94591,0.097296
6,46,start,0.05,1.94591,0.097296
7,60,aromas,0.05,0.559616,0.027981
8,72,palate,0.05,0.847298,0.042365
9,84,bit,0.05,1.94591,0.097296


### Document 5

**ORIGINAL:** _Much like the regular bottling from 2012, this comes across as rather rough and tannic, with rustic, earthy, herbal characteristics. Nonetheless, if you think of it as a pleasantly unfussy country wine, it's a good companion to a hearty winter stew._

**PRE-PROCESSED:** _much like regular bottling 2012 comes across rather rough tannic rustic earthy herbal characteristics nonetheless think pleasantly unfussy country wine good companion hearty winter stew_

Unnamed: 0,Index,Term,TF,IDF,TF-IDF
0,0,much,0.04,1.94591,0.077836
1,5,like,0.04,1.94591,0.077836
2,14,regular,0.04,1.94591,0.077836
3,22,bottling,0.04,1.94591,0.077836
4,36,2012,0.04,1.94591,0.077836
5,47,comes,0.04,1.94591,0.077836
6,53,across,0.04,1.94591,0.077836
7,63,rather,0.04,1.94591,0.077836
8,70,rough,0.04,1.94591,0.077836
9,80,tannic,0.04,1.94591,0.077836


### Document 6

**ORIGINAL:** _Blackberry and raspberry aromas show a typical Navarran whiff of green herbs and, in this case, horseradish. In the mouth, this is fairly full bodied, with tomatoes acidity. Spicy, herbal flavors complement dark plum fruit, while the finish is fresh but grabby._

**PRE-PROCESSED:** _blackberry raspberry aromas show typical navarran whiff green herbs case horseradish mouth fairly full bodied tomatoes acidity spicy herbal flavors complement dark plum fruit finish fresh grabby_

Unnamed: 0,Index,Term,TF,IDF,TF-IDF
0,0,blackberry,0.037037,1.94591,0.072071
1,15,raspberry,0.037037,1.94591,0.072071
2,25,aromas,0.037037,0.559616,0.020727
3,32,show,0.037037,1.94591,0.072071
4,39,typical,0.037037,1.94591,0.072071
5,47,navarran,0.037037,1.94591,0.072071
6,56,whiff,0.037037,1.94591,0.072071
7,65,green,0.037037,1.252763,0.046399
8,71,herbs,0.037037,1.94591,0.072071
9,90,case,0.037037,1.94591,0.072071


### Document 7

**ORIGINAL:** _Here's a bright, informal red that opens with aromas of candied berry, white pepper and savory herb that carry over to the the palate. It's balanced with fresh acidity and soft tannins._

**PRE-PROCESSED:** _bright informal red opens aromas candied berry white pepper savory herb carry palate balanced fresh acidity soft tannins_

Unnamed: 0,Index,Term,TF,IDF,TF-IDF
0,9,bright,0.055556,1.94591,0.108106
1,17,informal,0.055556,1.94591,0.108106
2,26,red,0.055556,1.252763,0.069598
3,35,opens,0.055556,1.94591,0.108106
4,46,aromas,0.055556,0.559616,0.03109
5,56,candied,0.055556,1.94591,0.108106
6,64,berry,0.055556,0.847298,0.047072
7,71,white,0.055556,1.94591,0.108106
8,77,pepper,0.055556,1.94591,0.108106
9,88,savory,0.055556,1.94591,0.108106






## Cosine Similarity Using KNN

**ORIGINAL:** _Aromas include tropical fruit, broom, brimstone and dried herb. The palate isn't overly expressive, offering unripened apple, citrus and dried sage alongside brisk acidity._

**PRE-PROCESSED:** _aromas include tropical fruit broom brimstone dried herb palate overly expressive offering unripened apple citrus dried sage alongside brisk acidity_

Unnamed: 0,Documents,Similarity
0,"Aromas include tropical fruit, broom, brimstone and dried herb. The palate isn't overly expressi...",1.0
1,"Here's a bright, informal red that opens with aromas of candied berry, white pepper and savory h...",0.110201
2,"Blackberry and raspberry aromas show a typical Navarran whiff of green herbs and, in this case, ...",0.063025
3,"Pineapple rind, lemon pith and orange blossom start off the aromas. The palate is a bit more opu...",0.047129
4,"Tart and snappy, the flavors of lime flesh and rind dominate. Some green pineapple pokes through...",0.017224
5,"This is ripe and fruity, a wine that is smooth while still structured. Firm tannins are filled o...",0.015163
6,"Much like the regular bottling from 2012, this comes across as rather rough and tannic, with rus...",0.0





**ORIGINAL:** _This is ripe and fruity, a wine that is smooth while still structured. Firm tannins are filled out with juicy red berry fruits and freshened with acidity. It's already drinkable, although it will certainly be better from 2016._

**PRE-PROCESSED:** _ripe fruity wine smooth still structured firm tannins filled juicy red berry fruits freshened acidity already drinkable although certainly better 2016_

Unnamed: 0,Documents,Similarity
0,"This is ripe and fruity, a wine that is smooth while still structured. Firm tannins are filled o...",1.0
1,"Here's a bright, informal red that opens with aromas of candied berry, white pepper and savory h...",0.142019
2,"Tart and snappy, the flavors of lime flesh and rind dominate. Some green pineapple pokes through...",0.047855
3,"Much like the regular bottling from 2012, this comes across as rather rough and tannic, with rus...",0.023567
4,"Aromas include tropical fruit, broom, brimstone and dried herb. The palate isn't overly expressi...",0.015163
5,"Blackberry and raspberry aromas show a typical Navarran whiff of green herbs and, in this case, ...",0.013742
6,"Pineapple rind, lemon pith and orange blossom start off the aromas. The palate is a bit more opu...",0.0





**ORIGINAL:** _Tart and snappy, the flavors of lime flesh and rind dominate. Some green pineapple pokes through, with crisp acidity underscoring the flavors. The wine was all stainless-steel fermented._

**PRE-PROCESSED:** _tart snappy flavors lime flesh rind dominate green pineapple pokes crisp acidity underscoring flavors wine fermented_

Unnamed: 0,Documents,Similarity
0,"Tart and snappy, the flavors of lime flesh and rind dominate. Some green pineapple pokes through...",1.0
1,"Blackberry and raspberry aromas show a typical Navarran whiff of green herbs and, in this case, ...",0.126423
2,"Pineapple rind, lemon pith and orange blossom start off the aromas. The palate is a bit more opu...",0.085075
3,"This is ripe and fruity, a wine that is smooth while still structured. Firm tannins are filled o...",0.047855
4,"Much like the regular bottling from 2012, this comes across as rather rough and tannic, with rus...",0.026771
5,"Here's a bright, informal red that opens with aromas of candied berry, white pepper and savory h...",0.019919
6,"Aromas include tropical fruit, broom, brimstone and dried herb. The palate isn't overly expressi...",0.017224





**ORIGINAL:** _Pineapple rind, lemon pith and orange blossom start off the aromas. The palate is a bit more opulent, with notes of honey-drizzled guava and mango giving way to a slightly astringent, semidry finish._

**PRE-PROCESSED:** _pineapple rind lemon pith orange blossom start aromas palate bit opulent notes guava mango giving way slightly astringent semidry finish_

Unnamed: 0,Documents,Similarity
0,"Pineapple rind, lemon pith and orange blossom start off the aromas. The palate is a bit more opu...",1.0
1,"Tart and snappy, the flavors of lime flesh and rind dominate. Some green pineapple pokes through...",0.085075
2,"Here's a bright, informal red that opens with aromas of candied berry, white pepper and savory h...",0.054503
3,"Blackberry and raspberry aromas show a typical Navarran whiff of green herbs and, in this case, ...",0.051691
4,"Aromas include tropical fruit, broom, brimstone and dried herb. The palate isn't overly expressi...",0.047129
5,"This is ripe and fruity, a wine that is smooth while still structured. Firm tannins are filled o...",0.0
6,"Much like the regular bottling from 2012, this comes across as rather rough and tannic, with rus...",0.0





**ORIGINAL:** _Much like the regular bottling from 2012, this comes across as rather rough and tannic, with rustic, earthy, herbal characteristics. Nonetheless, if you think of it as a pleasantly unfussy country wine, it's a good companion to a hearty winter stew._

**PRE-PROCESSED:** _much like regular bottling 2012 comes across rather rough tannic rustic earthy herbal characteristics nonetheless think pleasantly unfussy country wine good companion hearty winter stew_

Unnamed: 0,Documents,Similarity
0,"Much like the regular bottling from 2012, this comes across as rather rough and tannic, with rus...",1.0
1,"Blackberry and raspberry aromas show a typical Navarran whiff of green herbs and, in this case, ...",0.028713
2,"Tart and snappy, the flavors of lime flesh and rind dominate. Some green pineapple pokes through...",0.026771
3,"This is ripe and fruity, a wine that is smooth while still structured. Firm tannins are filled o...",0.023567
4,"Aromas include tropical fruit, broom, brimstone and dried herb. The palate isn't overly expressi...",0.0
5,"Pineapple rind, lemon pith and orange blossom start off the aromas. The palate is a bit more opu...",0.0
6,"Here's a bright, informal red that opens with aromas of candied berry, white pepper and savory h...",0.0





**ORIGINAL:** _Blackberry and raspberry aromas show a typical Navarran whiff of green herbs and, in this case, horseradish. In the mouth, this is fairly full bodied, with tomatoes acidity. Spicy, herbal flavors complement dark plum fruit, while the finish is fresh but grabby._

**PRE-PROCESSED:** _blackberry raspberry aromas show typical navarran whiff green herbs case horseradish mouth fairly full bodied tomatoes acidity spicy herbal flavors complement dark plum fruit finish fresh grabby_

Unnamed: 0,Documents,Similarity
0,"Blackberry and raspberry aromas show a typical Navarran whiff of green herbs and, in this case, ...",1.0
1,"Tart and snappy, the flavors of lime flesh and rind dominate. Some green pineapple pokes through...",0.126423
2,"Here's a bright, informal red that opens with aromas of candied berry, white pepper and savory h...",0.072887
3,"Aromas include tropical fruit, broom, brimstone and dried herb. The palate isn't overly expressi...",0.063025
4,"Pineapple rind, lemon pith and orange blossom start off the aromas. The palate is a bit more opu...",0.051691
5,"Much like the regular bottling from 2012, this comes across as rather rough and tannic, with rus...",0.028713
6,"This is ripe and fruity, a wine that is smooth while still structured. Firm tannins are filled o...",0.013742





**ORIGINAL:** _Here's a bright, informal red that opens with aromas of candied berry, white pepper and savory herb that carry over to the the palate. It's balanced with fresh acidity and soft tannins._

**PRE-PROCESSED:** _bright informal red opens aromas candied berry white pepper savory herb carry palate balanced fresh acidity soft tannins_

Unnamed: 0,Documents,Similarity
0,"Here's a bright, informal red that opens with aromas of candied berry, white pepper and savory h...",1.0
1,"This is ripe and fruity, a wine that is smooth while still structured. Firm tannins are filled o...",0.142019
2,"Aromas include tropical fruit, broom, brimstone and dried herb. The palate isn't overly expressi...",0.110201
3,"Blackberry and raspberry aromas show a typical Navarran whiff of green herbs and, in this case, ...",0.072887
4,"Pineapple rind, lemon pith and orange blossom start off the aromas. The palate is a bit more opu...",0.054503
5,"Tart and snappy, the flavors of lime flesh and rind dominate. Some green pineapple pokes through...",0.019919
6,"Much like the regular bottling from 2012, this comes across as rather rough and tannic, with rus...",0.0





For each document, the output above shows a table with the similarity between the target document and every document in the set. The results are sorted from highest to lowest similarity; the first is always `1` because the target document is compared with itself.
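
The ranking logic can be sketched without scikit-learn. The snippet below is a simplified illustration, not the notebook's implementation: it uses hypothetical, hand-written term weights instead of real TF-IDF vectors, computes the cosine similarity against every document, and sorts the results in descending order, which is equivalent to sorting by cosine distance (`1 - similarity`) in ascending order.

```python
import math


def cosine_similarity(a: dict[str, float], b: dict[str, float]) -> float:
    """Cosine similarity between two sparse vectors stored as term -> weight."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(weight * weight for weight in a.values()))
    norm_b = math.sqrt(sum(weight * weight for weight in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical toy weights, chosen only to illustrate the ranking
docs = {
    "doc_a": {"aromas": 0.5, "fruit": 0.5, "herb": 0.7},
    "doc_b": {"aromas": 0.4, "palate": 0.9},
    "doc_c": {"wine": 1.0},
}

target = docs["doc_a"]
ranking = sorted(
    ((name, cosine_similarity(target, vector)) for name, vector in docs.items()),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, similarity in ranking:
    print(f"{name}: {similarity:.4f}")
```

The target itself ranks first with a similarity of 1 (up to floating-point rounding), and a document that shares no terms with it, such as `doc_c`, scores 0.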

## Using stemming

In the previous code stemming was disabled; now we run it with stemming enabled:
 - To enable stemming, pass the flag `stem = True` to the sanitizing function

In [6]:
original_docs = load_documents(FILENAME)
preprocessed_docs = docs_sanitize(original_docs, DOC_LANGUAGE, stem=True)

generate_similarity_comparation(original_docs, preprocessed_docs)

## Cosine Similarity Using KNN

**ORIGINAL:** _Aromas include tropical fruit, broom, brimstone and dried herb. The palate isn't overly expressive, offering unripened apple, citrus and dried sage alongside brisk acidity._

**PRE-PROCESSED:** _aroma includ tropic fruit broom brimston dri herb palat overli express offer unripen appl citru dri sage alongsid brisk acid_

Unnamed: 0,Documents,Similarity
0,"Aromas include tropical fruit, broom, brimstone and dried herb. The palate isn't overly expressi...",1.0
1,"Here's a bright, informal red that opens with aromas of candied berry, white pepper and savory h...",0.10082
2,"Blackberry and raspberry aromas show a typical Navarran whiff of green herbs and, in this case, ...",0.079655
3,"Pineapple rind, lemon pith and orange blossom start off the aromas. The palate is a bit more opu...",0.047583
4,"This is ripe and fruity, a wine that is smooth while still structured. Firm tannins are filled o...",0.042338
5,"Tart and snappy, the flavors of lime flesh and rind dominate. Some green pineapple pokes through...",0.01739
6,"Much like the regular bottling from 2012, this comes across as rather rough and tannic, with rus...",0.0





**ORIGINAL:** _This is ripe and fruity, a wine that is smooth while still structured. Firm tannins are filled out with juicy red berry fruits and freshened with acidity. It's already drinkable, although it will certainly be better from 2016._

**PRE-PROCESSED:** _ripe fruiti wine smooth still structur firm tannin fill juici red berri fruit freshen acid alreadi drinkabl although certainli better 2016_

Unnamed: 0,Documents,Similarity
0,"This is ripe and fruity, a wine that is smooth while still structured. Firm tannins are filled o...",1.0
1,"Here's a bright, informal red that opens with aromas of candied berry, white pepper and savory h...",0.144849
2,"Tart and snappy, the flavors of lime flesh and rind dominate. Some green pineapple pokes through...",0.048497
3,"Aromas include tropical fruit, broom, brimstone and dried herb. The palate isn't overly expressi...",0.042338
4,"Blackberry and raspberry aromas show a typical Navarran whiff of green herbs and, in this case, ...",0.03856
5,"Much like the regular bottling from 2012, this comes across as rather rough and tannic, with rus...",0.023884
6,"Pineapple rind, lemon pith and orange blossom start off the aromas. The palate is a bit more opu...",0.0





**ORIGINAL:** _Tart and snappy, the flavors of lime flesh and rind dominate. Some green pineapple pokes through, with crisp acidity underscoring the flavors. The wine was all stainless-steel fermented._

**PRE-PROCESSED:** _tart snappi flavor lime flesh rind domin green pineappl poke crisp acid underscor flavor wine ferment_

Unnamed: 0,Documents,Similarity
0,"Tart and snappy, the flavors of lime flesh and rind dominate. Some green pineapple pokes through...",1.0
1,"Blackberry and raspberry aromas show a typical Navarran whiff of green herbs and, in this case, ...",0.128274
2,"Pineapple rind, lemon pith and orange blossom start off the aromas. The palate is a bit more opu...",0.085075
3,"This is ripe and fruity, a wine that is smooth while still structured. Firm tannins are filled o...",0.048497
4,"Much like the regular bottling from 2012, this comes across as rather rough and tannic, with rus...",0.026771
5,"Here's a bright, informal red that opens with aromas of candied berry, white pepper and savory h...",0.020046
6,"Aromas include tropical fruit, broom, brimstone and dried herb. The palate isn't overly expressi...",0.01739





**ORIGINAL:** _Pineapple rind, lemon pith and orange blossom start off the aromas. The palate is a bit more opulent, with notes of honey-drizzled guava and mango giving way to a slightly astringent, semidry finish._

**PRE-PROCESSED:** _pineappl rind lemon pith orang blossom start aroma palat bit opul note guava mango give way slightli astring semidri finish_

Unnamed: 0,Documents,Similarity
0,"Pineapple rind, lemon pith and orange blossom start off the aromas. The palate is a bit more opu...",1.0
1,"Tart and snappy, the flavors of lime flesh and rind dominate. Some green pineapple pokes through...",0.085075
2,"Here's a bright, informal red that opens with aromas of candied berry, white pepper and savory h...",0.054852
3,"Blackberry and raspberry aromas show a typical Navarran whiff of green herbs and, in this case, ...",0.052448
4,"Aromas include tropical fruit, broom, brimstone and dried herb. The palate isn't overly expressi...",0.047583
5,"This is ripe and fruity, a wine that is smooth while still structured. Firm tannins are filled o...",0.0
6,"Much like the regular bottling from 2012, this comes across as rather rough and tannic, with rus...",0.0





**ORIGINAL:** _Much like the regular bottling from 2012, this comes across as rather rough and tannic, with rustic, earthy, herbal characteristics. Nonetheless, if you think of it as a pleasantly unfussy country wine, it's a good companion to a hearty winter stew._

**PRE-PROCESSED:** _much like regular bottl 2012 come across rather rough tannic rustic earthi herbal characterist nonetheless think pleasantli unfussi countri wine good companion hearti winter stew_

Unnamed: 0,Documents,Similarity
0,"Much like the regular bottling from 2012, this comes across as rather rough and tannic, with rus...",1.0
1,"Blackberry and raspberry aromas show a typical Navarran whiff of green herbs and, in this case, ...",0.029133
2,"Tart and snappy, the flavors of lime flesh and rind dominate. Some green pineapple pokes through...",0.026771
3,"This is ripe and fruity, a wine that is smooth while still structured. Firm tannins are filled o...",0.023884
4,"Aromas include tropical fruit, broom, brimstone and dried herb. The palate isn't overly expressi...",0.0
5,"Pineapple rind, lemon pith and orange blossom start off the aromas. The palate is a bit more opu...",0.0
6,"Here's a bright, informal red that opens with aromas of candied berry, white pepper and savory h...",0.0





**ORIGINAL:** _Blackberry and raspberry aromas show a typical Navarran whiff of green herbs and, in this case, horseradish. In the mouth, this is fairly full bodied, with tomatoes acidity. Spicy, herbal flavors complement dark plum fruit, while the finish is fresh but grabby._

**PRE-PROCESSED:** _blackberri raspberri aroma show typic navarran whiff green herb case horseradish mouth fairli full bodi tomato acid spici herbal flavor complement dark plum fruit finish fresh grabbi_

Unnamed: 0,Documents,Similarity
0,"Blackberry and raspberry aromas show a typical Navarran whiff of green herbs and, in this case, ...",1.0
1,"Tart and snappy, the flavors of lime flesh and rind dominate. Some green pineapple pokes through...",0.128274
2,"Here's a bright, informal red that opens with aromas of candied berry, white pepper and savory h...",0.101985
3,"Aromas include tropical fruit, broom, brimstone and dried herb. The palate isn't overly expressi...",0.079655
4,"Pineapple rind, lemon pith and orange blossom start off the aromas. The palate is a bit more opu...",0.052448
5,"This is ripe and fruity, a wine that is smooth while still structured. Firm tannins are filled o...",0.03856
6,"Much like the regular bottling from 2012, this comes across as rather rough and tannic, with rus...",0.029133





**ORIGINAL:** _Here's a bright, informal red that opens with aromas of candied berry, white pepper and savory herb that carry over to the the palate. It's balanced with fresh acidity and soft tannins._

**PRE-PROCESSED:** _bright inform red open aroma candi berri white pepper savori herb carri palat balanc fresh acid soft tannin_

Unnamed: 0,Documents,Similarity
0,"Here's a bright, informal red that opens with aromas of candied berry, white pepper and savory h...",1.0
1,"This is ripe and fruity, a wine that is smooth while still structured. Firm tannins are filled o...",0.144849
2,"Blackberry and raspberry aromas show a typical Navarran whiff of green herbs and, in this case, ...",0.101985
3,"Aromas include tropical fruit, broom, brimstone and dried herb. The palate isn't overly expressi...",0.10082
4,"Pineapple rind, lemon pith and orange blossom start off the aromas. The palate is a bit more opu...",0.054852
5,"Tart and snappy, the flavors of lime flesh and rind dominate. Some green pineapple pokes through...",0.020046
6,"Much like the regular bottling from 2012, this comes across as rather rough and tannic, with rus...",0.0





Podemos observar una pequeña variación en la similitud entre documentos cuando se utiliza la técnica del `stemming`, en este conjunto de documentos no es muy relevante pero a continuación mostraré un ejemplo de cómo puede ser útil de forma general para cualquier tipo de documento.

## Advantages of Using Stemming

In this example we pluralize the first document so that some of its words differ from the original. The comparison shows how useful `stemming` can be.

We create two sanitized versions of the document, one with stemming and one without, by passing `stem=True` for the first and `stem=False` for the second:

```python
san_doc_target_stem = ' '.join(docs_sanitize([target_document], language='english', stem=True))
...
san_doc_target_no_stem = ' '.join(docs_sanitize([target_document], language='english', stem=False))
```
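
The result tables in this section report `1 - distance` as the similarity, which suggests that `cos_similarity_knn` returns cosine distances. The following is a minimal, dependency-free sketch of that ranking step, not the notebook's implementation: `nearest_documents` and the toy documents are hypothetical, and raw term counts are used where the notebook's vectorizer may weight terms differently.

```python
import math
from collections import Counter


def cosine_distance(tokens_a, tokens_b):
    """Cosine distance (1 - cosine similarity) using raw term counts."""
    counts_a, counts_b = Counter(tokens_a), Counter(tokens_b)
    dot = sum(counts_a[t] * counts_b[t] for t in counts_a.keys() & counts_b.keys())
    norm = math.sqrt(sum(c * c for c in counts_a.values())) * math.sqrt(
        sum(c * c for c in counts_b.values())
    )
    # An empty document has no direction, so treat it as maximally distant.
    return 1.0 if norm == 0 else 1.0 - dot / norm


def nearest_documents(docs, query, k):
    """Return (index, similarity) pairs for the k documents closest to the query."""
    query_tokens = query.split()
    scored = [(cosine_distance(doc.split(), query_tokens), i) for i, doc in enumerate(docs)]
    scored.sort()  # smallest distance first
    return [(i, 1.0 - distance) for distance, i in scored[:k]]


# Toy, already-preprocessed documents (illustrative only).
docs = [
    "aroma tropic fruit dri herb",
    "bright red aroma berri pepper herb",
    "tart lime pineappl rind",
]
ranking = nearest_documents(docs, "aroma tropic fruit dri herb", k=2)
for index, similarity in ranking:
    print(index, round(similarity, 3))
```

A document identical to the query has a distance of roughly 0 and therefore a similarity of roughly 1, which matches the top row of each result table.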

In [7]:
original_docs = load_documents(FILENAME)

# plural of the first document
target_document = "Aromas include tropical fruits , brooms , brimstones and dried herbs . The palates isn't overly expressive , offering unripened apples , citrus and dried sages alongside brisk acidities ."

def print_result(distances, indexes, original_doc, san_document):
    """Display the target document, its sanitized form and the ranked neighbours."""
    results = []
    for i, index in enumerate(indexes[0]):
        result = {
            'Documents': original_docs[index],
            # similarity = 1 - distance
            'Similarity': 1 - distances[0][i]
        }
        results.append(result)

    display(Markdown(f"**ORIGINAL:** _{original_doc}_"))
    display(Markdown(f"**PRE-PROCESSED:** _{san_document}_"))
    display(pd.DataFrame(results))
    print()


# with stem
display(Markdown("## With Stemming"))
preprocessed_docs = docs_sanitize(original_docs, DOC_LANGUAGE, stem=True)
san_doc_target_stem = ' '.join(docs_sanitize([target_document], language='english', stem=True))
distances, indexes = cos_similarity_knn(preprocessed_docs, san_doc_target_stem)
print_result(distances, indexes, original_doc=target_document, san_document=san_doc_target_stem)

# without stem
display(Markdown("## Without Stemming"))
preprocessed_docs = docs_sanitize(original_docs, DOC_LANGUAGE, stem=False)
san_doc_target_no_stem = ' '.join(docs_sanitize([target_document], language='english', stem=False))
distances, indexes = cos_similarity_knn(preprocessed_docs, san_doc_target_no_stem)
print_result(distances, indexes, original_doc=target_document, san_document=san_doc_target_no_stem)


## With Stemming

**ORIGINAL:** _Aromas include tropical fruits , brooms , brimstones and dried herbs . The palates isn't overly expressive , offering unripened apples , citrus and dried sages alongside brisk acidities ._

**PRE-PROCESSED:** _aroma includ tropic fruit broom brimston dri herb palat overli express offer unripen appl citru dri sage alongsid brisk acid_

Unnamed: 0,Documents,Similarity
0,"Aromas include tropical fruit, broom, brimstone and dried herb. The palate isn't overly expressi...",1.0
1,"Here's a bright, informal red that opens with aromas of candied berry, white pepper and savory h...",0.10082
2,"Blackberry and raspberry aromas show a typical Navarran whiff of green herbs and, in this case, ...",0.079655
3,"Pineapple rind, lemon pith and orange blossom start off the aromas. The palate is a bit more opu...",0.047583
4,"This is ripe and fruity, a wine that is smooth while still structured. Firm tannins are filled o...",0.042338
5,"Tart and snappy, the flavors of lime flesh and rind dominate. Some green pineapple pokes through...",0.01739
6,"Much like the regular bottling from 2012, this comes across as rather rough and tannic, with rus...",0.0





## Without Stemming

**ORIGINAL:** _Aromas include tropical fruits , brooms , brimstones and dried herbs . The palates isn't overly expressive , offering unripened apples , citrus and dried sages alongside brisk acidities ._

**PRE-PROCESSED:** _aromas include tropical fruits brooms brimstones dried herbs palates overly expressive offering unripened apples citrus dried sages alongside brisk acidities_

Unnamed: 0,Documents,Similarity
0,"Aromas include tropical fruit, broom, brimstone and dried herb. The palate isn't overly expressi...",0.771562
1,"Blackberry and raspberry aromas show a typical Navarran whiff of green herbs and, in this case, ...",0.072096
2,"This is ripe and fruity, a wine that is smooth while still structured. Firm tannins are filled o...",0.058714
3,"Here's a bright, informal red that opens with aromas of candied berry, white pepper and savory h...",0.025308
4,"Pineapple rind, lemon pith and orange blossom start off the aromas. The palate is a bit more opu...",0.022839
5,"Tart and snappy, the flavors of lime flesh and rind dominate. Some green pineapple pokes through...",0.0
6,"Much like the regular bottling from 2012, this comes across as rather rough and tannic, with rus...",0.0





In this scenario, where the first document was pluralized, stemming reduces the plural words to their root form, which yields better matches.

In the first table (with stemming) the pluralized query matches its original document with a similarity of 1.0, whereas in the second table (without stemming) that similarity drops to 0.771562. The gap comes from words that share a meaning but differ in form, such as `fruit` and `fruits`, which count as different terms unless they are reduced to a common stem.
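
The effect can be reproduced on a small scale with a minimal sketch. It assumes a deliberately naive plural stripper, `naive_stem`, as a hypothetical stand-in for a real stemmer, and uses raw term counts instead of the notebook's vectorizer, so the numbers will not match the tables above.

```python
import math
from collections import Counter


def naive_stem(token):
    """Hypothetical stand-in for a real stemmer: strip one trailing plural 's'."""
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token


def cosine_similarity(tokens_a, tokens_b):
    """Cosine similarity between two token lists using raw term counts."""
    counts_a, counts_b = Counter(tokens_a), Counter(tokens_b)
    dot = sum(counts_a[t] * counts_b[t] for t in counts_a.keys() & counts_b.keys())
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


original = "tropical fruit broom dried herb".split()
pluralized = "tropical fruits brooms dried herbs".split()

# Without stemming, 'fruit' and 'fruits' are unrelated terms.
sim_without_stem = cosine_similarity(original, pluralized)

# With stemming, both documents collapse to the same tokens.
sim_with_stem = cosine_similarity(
    [naive_stem(t) for t in original],
    [naive_stem(t) for t in pluralized],
)

print(f"without stemming: {sim_without_stem:.3f}")
print(f"with stemming:    {sim_with_stem:.3f}")
```

Only two of the five tokens match without stemming, so the similarity is about 0.4. Once the plurals are stripped the two token lists are identical and the similarity rises to about 1.0.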

## Conclusion

In this project we implemented a recommender system based on cosine similarity using the KNN model. To improve the quality of the recommendations we applied stop-word removal and stemming. In the pluralized-document experiment, stemming produced noticeably more accurate matches.

Stemming reduces words to their root or base form, which groups related terms and lowers the variability of the text. Applying it can improve the quality of the recommendations by removing redundancy and noise from the data set, although on the original, unmodified documents the effect was small.

It is worth stressing that, while the choice of algorithm is crucial to the success of a recommender system, data preprocessing is equally important. The appropriate use of techniques such as stop-word removal and stemming contributed to more accurate recommendations.