In [1]:
%pip install numpy scikit-learn pandas

Note: you may need to restart the kernel to use updated packages.


In [2]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.naive_bayes import MultinomialNB, ComplementNB
from sklearn.metrics import f1_score

# 20newsgroups is a classic NLP dataset, so it ships with sklearn
# already formatted
from sklearn.datasets import fetch_20newsgroups
import numpy as np
import random as rnd
import pandas as pd
from IPython.display import display, HTML


### Challenge 1 assignment

**1**. Vectorize the documents. Pick 5 documents at random and measure their similarity to the rest of the documents.
Study the 5 most similar documents for each one and analyze whether the similarity makes sense
given the text content and the classification label.

**2**. Build a prototype-based classifier (zero-shot style). Classify the documents of a test set by comparing each one with every training document and assigning it the label of the most similar training document.

**3**. Train Naïve Bayes classifiers to maximize classification performance
(macro F1-score) on the test set. Consider changing the instantiation parameters
of the vectorizer and of the models, and try both the Multinomial Naïve Bayes
and ComplementNB models.

**4**. Transpose the document-term matrix. This yields a
term-document matrix that can be interpreted as a collection of word vectors.
Now study similarity between words by taking 5 words and examining the 5 most similar to each. **Do not choose the words at random, to avoid terms that are hard to interpret; pick them "manually"**.
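
Task 2 amounts to a 1-nearest-neighbour search under cosine similarity. The following is a minimal sketch on a toy corpus, not the notebook's solution: the documents, labels, and variable names are all illustrative assumptions.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy corpus (hypothetical, for illustration only)
toy_train_docs = [
    "the pitcher threw a fast ball",
    "the team won the baseball game",
    "the hard disk has an IDE controller",
    "install the drive as master or slave",
]
toy_train_labels = np.array([0, 0, 1, 1])  # 0 = sports, 1 = hardware
toy_test_docs = ["a great baseball game", "which jumper sets the drive as slave"]

toy_vectorizer = TfidfVectorizer()
toy_X_train = toy_vectorizer.fit_transform(toy_train_docs)
toy_X_test = toy_vectorizer.transform(toy_test_docs)

# Rows are test documents, columns are training documents
toy_sims = cosine_similarity(toy_X_test, toy_X_train)

# Each test document takes the label of its most similar training document
nearest_train_idx = toy_sims.argmax(axis=1)
toy_predicted = toy_train_labels[nearest_train_idx]
print(toy_predicted)
```

On the full dataset the same idea would apply to the train and test matrices, although the similarity matrix is large and may need to be computed in batches.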


### Task 1

#### 20 newsgroups dataset: loading the data

In [3]:
# Load the data (already split into train and test by default)
newsgroups_train = fetch_20newsgroups(subset='train', remove=('headers', 'footers', 'quotes'))
newsgroups_test = fetch_20newsgroups(subset='test', remove=('headers', 'footers', 'quotes'))

#### Vectorization

In [4]:
# Vectorize the training set (this learns the vocabulary and IDF weights)
tfidfvect = TfidfVectorizer()
X_train = tfidfvect.fit_transform(newsgroups_train.data)
y_train = newsgroups_train.target
# Vectorize the test set with the vocabulary learned from train
X_test = tfidfvect.transform(newsgroups_test.data)
y_test = newsgroups_test.target

#### Similarity

In [5]:
rnd.seed(92)
# Pick 5 random samples from the vectorized test set.
doc_idxs = rnd.sample(range(0, X_test.shape[0]), 5)

for idx in doc_idxs:
    # For each sample, compute its similarity to every test document
    cossim = cosine_similarity(X_test[idx], X_test)[0]
    # Sort from highest to lowest and keep the 5 most similar indices, skipping the first one, which is assumed to be the document itself
    mostsim = np.argsort(cossim)[::-1][1:6]
    list_sims = []
    for sim in mostsim:
        list_sims.append({"index": sim, "similitud": cossim[sim], "target": newsgroups_test.target_names[y_test[sim]], "texto": newsgroups_test.data[sim]})
    df_sims = pd.DataFrame(list_sims)
    print(f"Similitudes para documento: {idx}\nlabel: {newsgroups_test.target_names[y_test[idx]]}\ntexto:\n{newsgroups_test.data[idx]}")
    html_table = df_sims.to_html().replace("\\n", "<br>").replace("\\t", "&#09;")
    display(HTML(html_table))
    print("*****************************************")


Similitudes para documento: 3449
label: talk.politics.guns
texto:

Funny, the medical examiner today stated that there was no
evidence ONE WAY or ANOTHER that there were bullet wounds --
not a single autopsy has been performed, so all reports are
deemed speculative.  INCLUDING reports that there were NO
bullet wounds.


Before long, I think all the kneejerk conspiracy theorists
are going to start getting pretty pissed off at how easily
they mislead themselves.  Also, pretty disappointed at
being ignored by the coutnry.



Unnamed: 0,index,similitud,target,texto
0,1121,0.461568,talk.politics.guns,"Apparently needing to clarify his comments from Thursday, Dr. Nizam Plawaby (spelling?), the Medical Examiner for Tarrant County, Texas, who has authority in the Waco deaths, stated that since no autopsies had been performed, there is no evidence for bullet wounds, or evidence against bullet wounds. Janet Reno also stated that she had never been told of bullet wounds by anyone in the Justice Department."
1,832,0.317879,talk.religion.misc,".  It should be remembered that all of the first reports came from the FBI, and that independent observers, i.e. the press, were not allowed to get close and see things for themselves. Official communiques tend to be self-serving for the agencies that issue them. People in general tend to believe first reports, as these get the most and the biggest headlines. Corrections are often overlooked. An example is the FBI report that several of the bodies found in the rubble had bullet wounds. The local coroner, who is independent of the FBI, has so far found no bullet wounds!"
2,4763,0.245326,talk.politics.guns,"In the interest of accuracy (seems a liitle late to start that, I know) the medical examiner has *not* contradicted the FBI. The FBI said they found some folks who had been shot in the head, and the medical examiner said ""we have not seen evidence of this"". At the time the medical examiner said that, they were dealing with charred bodies in the compound - this sounds like typical medical examiner not releasing details until a thorough investigation. The medical examiner saying he hasn't seen something is *not* the same thing as saying that it isn't there."
3,613,0.239701,talk.politics.misc,"Yesterday, the FBI was saying that at least three of the bodies had gunshot wounds, indicating that they were shot trying to escape the fire. Today's paper quotes the medical examiner as saying that there is no evidence of gunshot wounds in any of the recovered bodies. At the beginning of this siege, it was reported that while Koresh had a class III (machine gun) license, today's paper quotes the government as saying, no, they didn't have a license. Today's paper reports that a number of the bodies were found with shoulder weapons next to them, as if they had been using them while dying -- which doesn't sound like the sort of action I would expect from a suicide. Our government lies, as it tries to cover over its incompetence and negligence. Why should I believe the FBI's claims about anything else, when we can see that they are LYING? This system of government is beyond reform."
4,2695,0.210677,talk.politics.guns,"So far, the medical examiner (according to the news) has found NO EVIDENCE of gunshot wounds in bodies so far examined. If this continues to be the case, it will sort of shoot holes (pun intended) in the FBI story, wouldn't it? And cartridges going off outside a firearm do not launch a bullet like they do when fired from a gun. The bullet hardly moves, it is the brass casing that goes flying, and then with less than lethal force. It will hurt, yes, but not KILL you - I doubt if it wil penetrate a coat, for example. How about an INDEPENDENT investigation, with full subpoena powers, and powers to prosecute on felony charges, to investigate for any possible illegal/criminal activity on the part of both the BATF and FBI? I cannot see any reason why not - to use the phrase they like to use so often, ""if they have nothing to hide..."" they should welcome it, and vigorously support it. Note that an internal investigation by the Dept of Justice is NOT an independent investigation..."


*****************************************
Similitudes para documento: 4228
label: misc.forsale
texto:
Hi netters,

Quantum LPS 240AT harddisk forsale.

3.5" frame, 1/3 height.
IDE format, master or slave
723 cyl 13 hd 51 s/t = 234.9 real megs
Access time of 16 ms.
256K cache on the drive

Asking $300. 


Unnamed: 0,index,similitud,target,texto
0,4004,0.227706,comp.sys.ibm.pc.hardware,"Hi,  I have bought a new harddisk and want to use it with my old  TEAC SD3105 , 100Mb harddisk. Unfortunataly I do not have any documentation with this harddisk. Could someone please tell me  how I should set the jumpers for master or slave ?  Thanks in advance,  Robert Tenback.  <rhtenbac@cs.ruu.nl>"
1,2991,0.195262,comp.sys.ibm.pc.hardware,"I'm having trouble with installing a second IDE drive on a Promise IDE caching controller. The first drive is a conner 3204 and works fine. The second drive is a conner 30174, it is currently unjumpered to be the slave drive. The problem is the slave drive is recognized but is reported back as having no free space. Disabling cache has made no effect. What else should I check for?"
2,6100,0.184595,sci.electronics,"Well, if you're willing to spend a little money, you could buy one of those IDE caching controllers (assuming you have an IDE of course) and put the 256K SIMMs on them. Hardware cache!"
3,4578,0.18345,misc.forsale,"I am selling a Western Digital 212 meg IDE HD, the Caviar 2200 model. The access time is <15 ms, and it has a built in cache. It is BRAND NEW, still in the original static bag. Asking $275, obo."
4,4100,0.176129,comp.sys.ibm.pc.hardware,": : Hello : : I have problems combining two IDE hard disks : (Seagate ST3283A and Quantum LPS105A). As single hard disk both : are working fine, but connecting them together to my : controller doesn't work. : : My questions are: : : - Has anybody out there ever been succesful using such hard disks : together and if so what jumper settings and BIOS settting did he/she : use? : : - Is it possible that my controller is the reason for my troubles ? : The only thing I know about it is that it is an : IDE-harddisk-controller. How many harddisks can such a controller : control? In my case only one ? : : : Thanks in advance : : Volker : IDE drives have jumpers on them to indicate if it is a master or a slave. If it is a master, then a second jumper indicates if a slave is present. These must be set correctly according to each drive's manufacturers spec- ification. The CMOS setup is almost positively NOT the problem. It is probably not the controller - IDE controllers all support exactly two drives maximum. Check those jumpers."


*****************************************
Similitudes para documento: 4627
label: rec.sport.baseball
texto:

Bill James is, however, very closely tied to STATS.


Unnamed: 0,index,similitud,target,texto
0,1968,0.381607,rec.sport.baseball,"[Some discussion about whether Elias is money grubbing deleted] Some thoughts and facts, 1.) Bill James is a partial owner of STATS, inc. However he has almost nothing to do with the day-to-day operations of the company, although he does have significant input into the design of the books that bear his name. (The handbook, but not the scoreboard). To the best of my knowledge, the only things that Bill actually writes for STATS are the predictions section of the handbook, and the Bill James Fantasy Baseball rulebook. 2.) The debate over Elias goes way back. Bill James' early stuff was hampered by the fact that Elias would not give access to their stats at any price. Project Scoresheet, and later, STATS were founded to fill this void. You can call STATS, and ask them for a report on just about anything in their database, and they will provide it -- for a price, of course. Or you could just log into their online system and look at the data yourself. Having attempted to pry numbers from Elias in the past (football, not baseball), they just don't do that. In STATS eyes, the high ground comes from making the information available at all. 3.) That being said, I'm pretty dissapointed by Bill's book this year, too. I am given to understant that it was mostly a response to the publishers desire to have the book come out sooner than April. Hope this makes things just a little bit clearer. (Bias alert. I am a former part-time employee of STATS.)"
1,1231,0.31891,rec.sport.baseball,"Uh, Bill James doesn't sell statistics. He sells books with statistics, but he is not in the business of providing stats like Elias, STATS, Howe, Baseball workshop etc. are."
2,4968,0.294406,rec.sport.baseball,"funny, it seems to me that the stats major league and minor league handbooks, which are nothing BUT collections of statistics, are authored by ""bill james and stats inc. (and howe, for the minor league handbook)"". and i am not sure how the 1993 bill james player ratings book qualifies as a ""book with statistics"", while the elias analyst is a ""statistics book"". the analyst contains more stats, sure, but it also contains more dialogue. finally, the point was not about the word ""statistics"". it was about ""money-grubbing"". i don't see how anyone who has looked at the bill james player ratings book cannot consider him money-grubbing."
3,6174,0.256514,rec.sport.baseball,"Not totally true. For the past year or two, the AP has been getting box scores from STATS, Inc. The AP representative in the press box is actually a STATS reporter ($25 dollars a game, but free parking. And anybody can do it.) The box is downloaded to STATS in Chicago, some quick error checking is done, and then STATS sends it to the AP. I'm not sure where the appreveiations come in hear. I don't think it is at STATS's. It may just be a space correction by the AP sports editor that day. While I'm mentioning STATS reporters, they are always looking for new people. Especially if you live in Cleveland or Pittsburgh, you're road to getting into the press box may be real short. For more info, call STATS (708) 676-3322, and ask about the reporter network. It's a fun way to get paid for watching baseball games. End of public service announcement."
4,439,0.208497,rec.sport.baseball,I forgot to mention that the stats are for games through 4/20.


*****************************************
Similitudes para documento: 4506
label: talk.religion.misc
texto:
   Surely you are not equating David Koresh with Christianity? The two are
   not comparable.


Unnamed: 0,index,similitud,target,texto
0,6765,0.418703,talk.religion.misc,": : > Surely you are not equating David Koresh with Christianity? The two are : > not comparable. : : This is always an option: when the sect is causing harm, re-label : the cult to something else. : : Cheers, : Kent Good point. I would not doubt that DK could have spouted verse and debated with best. According to reports his extensive Bible knowledge was one way he sucked in the fools (followers?). Quote bible all you want. I too judge what you say be what you do and even more by if it makes sense. Sense, common that is. Doesn't seem so common after all!"
1,5363,0.275627,talk.politics.misc,dead? I saw David Koresh at a local 7-11......
2,2502,0.259255,alt.atheism,"On one of the morning shows (I think is was the Today Show), David Koresh's lawyer was interviewed. During that interview he flipped through some letters that David Koresh wrote. On one of letters was written in Hebrew (near the bottom of the page):  koresh adonai"
3,1470,0.2434,sci.crypt,They are and they have. David
4,4784,0.234732,alt.atheism,"I suppose for the same reason that you do not believe in all the gods. Why should any be any different? I use the same arguments to dismiss Koresh as I do god. Tell me, then, why do you not believe that Koresh is the son of god? By logic it is equally possible that Koresh is Jesus reborn."


*****************************************
Similitudes para documento: 4572
label: soc.religion.christian
texto:
[Someone quoted the following.  I've removed the name because it's not
clear which name goes with which level of quote.  --clh]



On the basis of these examples I would say that Joe Moore was only wrong
in claiming Augustine as a prime mover of the sin=sex view.  These quotes
clearly equate sexuality with defilement and incontinance, even within
the marriage relationship (else they would not apply to Mary after her marriage
to Joseph).

So Joe's assignment of the reasoning behind the concept of the perpetual
virginity of Mary does seem to be supported by these quotes.



Unnamed: 0,index,similitud,target,texto
0,3768,0.299574,soc.religion.christian,"[referring to Mary] I have quite a problem with the idea that Mary never committed a sin. Was Mary fully human? If it is possible for God to miraculously make a person free of original sin, and free of committing sin their whole life, then what is the purpose of the Incarnation of Jesus? Why can't God just repeat the miracle done for Mary to make all the rest of us sinless, without the need for repentance and salvation and all that? I don't particularly object to the idea of the assumption, or the perpetual virginity (both of which I regard as Catholic dogma about which I will agree to disagree with my Catholic brothers and sisters in Christ), and I even believe in the virgin birth of Jesus, but this concept of Mary's sinlessness seems to me to be at odds with the rest of Christian doctrine as I understand it."
1,3672,0.275061,soc.religion.christian,": Consequently, : this verse indicates that she was without sin. Also, as was observed at : the very top of this post, Mary had to be free from sin in order to be the : mother of Jesus, who was definitely without sin. If the mother of Jesus had to be without sin in order to give birth to God, then why didn't Mary's mother have to be without sin in order to give birth to the perfect vessel for Jesus? For that matter, why didn't Mary's grandmother have to be without sin either? Seems to me that with all the original sin flowing through each person, the need for the last one (Mary) to have none puts God in a box, where we say that He couldn't have incarnated Himself through a normal human being. My God is an all powerful God, Who can do whatever suits His purpose. This includes creating a solar system and planet earth with the appearance of great age; providing a path through the Red Sea for the children of Israel that does not depend on the existence of a ridge of high ground and a wind blowing at the right speed and direction; and the birth of Himself from a normal sinful person without being tainted by her original sin. I see far too much focus on the ""objects"" of religion and not nearly enough on the personal relationship that is available to all believers with the Author of our existence, without the necessity of having this relationship channeled through conduits to God in the form of Mary, Apostles and a Pope. : Note that the idea of Mary being conceived without Original Sin, i.e. the : Immaculate Conception, is distinct from the idea of Mary not having sinned : during her lifetime, which is a separate doctrine and, I believe, also : held by the Catholic Church. If Mary was born without original sin, and didn't sin during her lifetime, how is she any different from Jesus? This means the world has had two perfect humans: one died to take away the sins of the world; the other gave birth to Him? 
I would certainly want to see some scriptural support for this before I would start praying to anyone other than God. Everything I have ever read from the bible teaches me that Jesus was and is the only sinless Lamb of God, not His mother, grandmother........ : Hope this is useful to you. Very useful in helping me understand some of the RC beliefs. Thank you."
2,1716,0.251366,soc.religion.christian,"speaking of the Immaculate Conception of the Blessed Virgin: Yes. For examples of this in the writings of the early fathers, consider:  You alone and your Mother  are more beautiful than any others;  For there is no blemish in you,  nor any stains upon your Mother.  Who of my children  can compare in beauty to these?  -- St. Ephrem the Syrian, Nisibene Hymns, 27:8, around  A.D. 370  Lift me up not from Sara but from Mary, a Virgin not only undefiled but a Virgin whom grace has made inviolate, free of every stain of sin.  -- St. Ambrose, ""Commentary on Psalm 118"", 22:30, ca. A.D. 388 There are many others. No. We have, for instance:  Was there ever anyone of any breeding who dared to speak the name of  Holy Mary, and being questioned, did not immediately add, ""the Virgin""?  ... And to Holy Mary, Virgin is invariably added, for that Holy Woman  remains undefiled.  -- St. Epiphanus of Salamis, ""Panacea against all heresies"",  between A.D. 374-377.  We surely cannot deny that you were right in correcting the doctrine  about children of Mary ... For the Lord Jesus would not have chosen  to be born of a virgin if He had judged that she would be so incontinent  as to taint the birthplace of the Body of the Lord, home of the Eternal  King, with the seed of human intercourse. Anyone who proposes this is  merely proposing ... that Christ could not be born of a virgin.  -- Pope St. Siricius, Letter to Anysius, Bishop of Thessalonica, A.D. 392 Note that St. Augustine's conversion to Christianity was in A.D. 387. I don't know offhand when his election as bishop of Hippo was, but I'm quite sure it was after 392. The belief in Mary's perpetual virginity originated long before Augustine's time. We hold that it originated with the Apostles. Strictly speaking, however, Mary's perpetual virginity is independent of her Immaculate Conception. 
Mary could have been Immaculately Conceived and not remained a virgin; she could have remained a virgin and not been Immaculately Conceived. No. It has been held in the Church since ancient times that original sin was transmitted at conception, when a person's life begins. Biology had nothing to do with it. Prayerfully reflecting on the truth of Mary's sinlessness, and the means by which God could have achieved this, the Church arrived at the truth of the Immaculate Conception. Thus, the Immaculate Conception is not a new doctrine, but the logical result of our understanding of two old ones. The celebration of the Feast of the Immaculate Conception itself was given by Pope Sixtus IV (1471-84) and the Feast was made a precept feast of the Church by Pope Clement XI (1700-21). No. First of all, Lourdes is private revelation, and doctrine is not based on private revelation. The most that private revelation can do is enhance and deepen our understanding of existing public revelation, which ended with the death of St. John the Apostle. Second, the ""case for the doctrine"" was irreformably sealed in 1854 with the ex cathedra promulgation of the Bull ""Ineffabilis Deus"" by Pope Pius IX. This meant that the doctrine was formally recognized as a dogma; a dogma, by definition, cannot change and is required to be believed by the faithful. The apparition at Lourdes happened in 1858, four years later. The most that might be claimed is that Lourdes gave the infallible proclamation of 1854 a sort of heavenly stamp of approval, but the Church has never claimed that, nor shall she. In Christ's Peace, Brad Kaiser (bradk@isdgsm.eurpd.csg.mot.com)"
3,1051,0.244472,soc.religion.christian,"whitsebd@nextwork.rose-hulman.edu (Bryan Whitsell) sent in a list of verses which he felt condemn homosexuality. mls@panix.com (Michael Siemon) wrote in response that some of these verses ""are used against us only through incredibly perverse interpretations"" and that others ""simply do not address the issues."" [remainder of my post deleted] The moderator then made some comments I would like to address: If you are referring to the terms ""effeminate"" and ""homosexuals"" in the above passage, I agree that the accuracy of the translation has been challenged. However, I was simply commenting on the charge that it is an ""incredibly perverse"" interpretation to read this as a condemnation of homosexuality. Such a charge seems to imply that no reasonable person would ever conclude from the verse that Paul intended to condemn homosexuality; however, I think I can see how a reasonable person might very well take this view of the verse. Therefore I do not believe it is ""incredibly perverse"" to read it in this way. Actually, I wasn't thinking of the church at all. After all, a couple doesn't have to be married by a minister. A secular justice of the peace could do the job, and the two people would be married. My point was that it is easy to find a biblical basis for heterosexual marriage, but where in the Bible would one get a Christian marriage between two people of the same sex? And if you do see a biblical basis for same-sex marriages, how willing would gay Christians be to ""save themselves"" for such a marriage and to never have sexual intercourse with anyone outside of that marriage relationship? Please note that I am not trying to imply that gay Christians would not be willing to be so monogamous, I am genuinely interested in hearing opinions on the subject. 
I have heard comments from gays in the past that lead me to believe they regard promiscuity as one of the main points of being homosexual, yet I tend to doubt that gays who want to be Christian would advocate such a position. So what is the gay view? - Mark"
4,937,0.237859,soc.religion.christian,"Yes, Mary is fully human. However, that does not imply that she was just as subject to sin as we are. Catholic doctrine says that man's nature is good (Gen 1:31), but is damaged by Original Sin (Rom 5:12-16). In that case, being undamaged by Original Sin, Mary is more fully human than any of the rest of us.  You ask why God cannot ""repeat the miracle"" of Mary's preservation from Original Sin. A better way to phrase it would be ""why _did_ He not"" do it that way, but you misunderstand how Mary's salvation was obtained. Like ours, the Blessed Virgin Mary's salvation was obtained through the merits of the Sacrifice of Christ on the Cross. However, as God is not bound by time, which is His creation, God is free to apply His Sacrifice to anyone at any time, even if that person lived before Christ came to Earth, from our time-bound perspective. Therefore, Christ's Death and Resurrection still served a necessary purpose, and were necessary even for Mary's salvation."


*****************************************
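
The loop above drops the first ranked index on the assumption that it is the query document itself. That assumption can fail when the corpus contains exact duplicates, or when a post is empty after stripping headers, footers, and quotes (its vector is all zeros, so its similarity to everything, itself included, is 0). A hedged sketch of a more defensive variant follows; `top_k_similar` is a hypothetical helper, not part of the notebook.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def top_k_similar(X, idx, k=5):
    """Return (indices, scores) of the k rows of X most similar to row idx, excluding idx."""
    sims = cosine_similarity(X[idx], X)[0]
    sims[idx] = -np.inf  # exclude the query explicitly instead of relying on rank
    order = np.argsort(sims)[::-1][:k]
    return order, sims[order]

# Toy corpus with a duplicate and an empty document (illustrative only)
toy_docs = ["ide hard disk", "ide hard disk", "", "baseball stats"]
toy_X = TfidfVectorizer().fit_transform(toy_docs)

dup_order, dup_scores = top_k_similar(toy_X, 1, k=2)
empty_order, empty_scores = top_k_similar(toy_X, 2, k=3)
print(dup_order, dup_scores)
print(empty_order, empty_scores)
```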


#### Conclusions

##### Query document 3449, label talk.politics.guns  
The document reports a medical examiner's statement that, because no autopsy has been performed, there is no evidence for or against bullet wounds.

Comparison with document 1121, similarity 0.461568, label talk.politics.guns  
The labels match. The texts overlap: both mention autopsies, bullet wounds, the medical examiner, and evidence.

Comparison with document 832, similarity 0.317879, label talk.religion.misc  
The labels differ, although both belong to the same top-level hierarchy (talk).  
The text also mentions reports and bullet wounds. It does not read like a religion post; a label such as talk.politics.guns would fit it better.

Comparison with document 4763, similarity 0.245326, label talk.politics.guns  
The labels match. The text discusses gunshots and the medical examiner's findings on the bodies, so the similarity seems reasonable.

Comparison with document 613, similarity 0.239701, label talk.politics.misc  
The labels are related but not identical: both are political discussions, one about guns and the other general.
The text also mentions bodies, gunshots, and wounds, so the similarity seems reasonable.

Comparison with document 2695, similarity 0.210677, label talk.politics.guns  
The labels match. The text mentions the medical examiner, evidence, and gunshot wounds, so it is also somewhat similar.

##### Query document 4228, label misc.forsale  
The document is a for-sale ad for a hard disk.

Comparison with document 4004, similarity 0.227706, label comp.sys.ibm.pc.hardware  
The labels differ. The text is from someone who bought a hard disk and asks how to set the master/slave jumpers; it shares terms such as harddisk, master, and slave with the ad.

Comparison with document 2991, similarity 0.195262, label comp.sys.ibm.pc.hardware  
The labels differ. The text describes problems configuring a second IDE drive. It overlaps with the ad in drive characteristics: master/slave setting, cache, IDE, and so on.

Comparison with document 6100, similarity 0.184595, label sci.electronics  
The labels differ. The text recommends buying an IDE caching controller and fitting it with 256K SIMMs; the overlap comes from terms such as IDE, 256K, and cache.

Comparison with document 4578, similarity 0.183450, label misc.forsale  
The labels match. This is another for-sale ad for a hard disk, but it describes the drive in different words, which lowers the score. In my view this document is more similar to the query than the previous ones, but the metric does not reflect that.

Comparison with document 4100, similarity 0.176129, label comp.sys.ibm.pc.hardware  
The labels differ. The text describes a problem combining two hard disks in a PC, and the reply suggests checking the drives' jumper settings. It has little in common with a for-sale ad as such, but both are mainly about hard disks.

##### Query document 4627, label rec.sport.baseball  
The document is about Bill James and his ties to a company called STATS.

Comparison with document 1968, similarity 0.381607, label rec.sport.baseball  
The labels match. The text is much longer but appears to belong to the same discussion: Bill James and his part ownership of STATS. The overall content is very similar.

Comparison with document 1231, similarity 0.318910, label rec.sport.baseball  
The labels match. It also discusses Bill James and STATS, so the content is similar.

Comparison with document 4968, similarity 0.294406, label rec.sport.baseball  
The labels match. Another document about Bill James and the company; the content is similar.

Comparison with document 6174, similarity 0.256514, label rec.sport.baseball  
The labels match. This document is mainly about the company STATS and does not mention Bill James. It is somewhat similar, but less so than the previous documents.

Comparison with document 439, similarity 0.208497, label rec.sport.baseball  
The labels match. This document ranks fifth and refers to statistics in general, not to the company, so on reading the content the two do not seem related. The match most likely comes from the token "stats": TfidfVectorizer lowercases text by default, so "STATS" and "stats" map to the same term.
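
One plausible reason a company name and a common noun end up matching is case folding: `TfidfVectorizer` has `lowercase=True` by default. The toy comparison below (two sentences adapted from the posts above, fitted in isolation, so the scores are not those of the notebook) shows the effect of switching it off.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

pair = ["Bill James is tied to STATS.", "the stats are for games through 4/20"]

default_vec = TfidfVectorizer()               # lowercase=True by default
cased_vec = TfidfVectorizer(lowercase=False)  # keeps "STATS" and "stats" distinct

sim_default = cosine_similarity(default_vec.fit_transform(pair))[0, 1]
sim_cased = cosine_similarity(cased_vec.fit_transform(pair))[0, 1]

# With lowercasing the two sentences share the token "stats"; without it they share nothing
print(sim_default, sim_cased)
```

Whether disabling lowercasing helps overall is a separate question, since it also splits ordinary words such as "The" and "the".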

##### Query document 4506, label talk.religion.misc  
The document objects to equating David Koresh with Christianity, saying the two are not comparable.

Comparison with document 6765, similarity 0.418703, label talk.religion.misc  
The labels match. The text quotes the query document verbatim and then comments on the quote. Quoting the original is what makes the two so similar.

Comparison with document 5363, similarity 0.275627, label talk.politics.misc  
The labels differ, although both are in the talk hierarchy. The text mentions the same person in another context; the similarity comes from that mention alone.

Comparison with document 2502, similarity 0.259255, label alt.atheism  
The labels differ, although religion and atheism are related topics. The document mentions the same person but otherwise has little in common with the query.

Comparison with document 1470, similarity 0.243400, label sci.crypt  
The labels are unrelated. The text is very short and has no real similarity: the signature happens to share its first name (David) with the person mentioned in the query document, but the two have nothing to do with each other.

Comparison with document 4784, similarity 0.234732, label alt.atheism  
The labels differ, although, as noted above, the topics are related. The content is not similar, but it uses Koresh's name to draw a comparison.

##### Query document 4572, label soc.religion.christian  
The document discusses sin, sexuality, defilement, and marriage, and how these apply to Mary and Joseph.

Comparison with document 3768, similarity 0.299574, label soc.religion.christian  
The labels match. The text deals with sin in relation to Mary, which makes it similar to the query document.

Comparison with document 3672, similarity 0.275061, label soc.religion.christian  
The labels match. The text is much longer, but it also discusses Mary, sin, God, and so on. It is closely related to the query document.

Comparison with document 1716, similarity 0.251366, label soc.religion.christian  
The labels match. This text is longer still, but it is mainly about Mary, sin, and virginity, all topics the query document also covers.

Comparison with document 1051, similarity 0.244472, label soc.religion.christian  
The labels match. This document has almost nothing in common with the query: its main topic is different, although it touches on marriage, sexuality, and Christianity as points of comparison.

Comparison with document 937, similarity 0.237859, label soc.religion.christian  
The labels match. This document is somewhat similar: it discusses Mary, sin, and so on.