### Text vectorization and Naïve Bayes classification model with the 20 newsgroups dataset

In [1]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.naive_bayes import MultinomialNB, ComplementNB
from sklearn.metrics import f1_score

from sklearn.datasets import fetch_20newsgroups
import numpy as np

## Loading the data

In [2]:
# load the data (already split into train and test by default)
newsgroups_train = fetch_20newsgroups(subset='train', remove=('headers', 'footers', 'quotes'))
newsgroups_test = fetch_20newsgroups(subset='test', remove=('headers', 'footers', 'quotes'))

## Vectorization

In [3]:
# instantiate a vectorizer
tfidfvect = TfidfVectorizer()

In [4]:
# fit the vectorizer and transform the training data in a single step
X_train = tfidfvect.fit_transform(newsgroups_train.data)
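
The test split is not vectorized in this section. A minimal sketch of how that would usually be done, assuming the same fitted vectorizer is reused: `fit_transform` learns the vocabulary and IDF weights on the training texts, and `transform` applies them to new texts without refitting. The toy strings and the `toy_*` names are illustrative only.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpora standing in for newsgroups_train.data / newsgroups_test.data
toy_train = [
    "the car has two doors",
    "the engine of the car",
    "export license for the program",
]
toy_test = ["a car with a small engine"]

toy_vect = TfidfVectorizer()
toy_X_train = toy_vect.fit_transform(toy_train)  # learn vocabulary and IDF on train
toy_X_test = toy_vect.transform(toy_test)        # reuse them on test, no refit

# Both matrices share the same columns (one per training vocabulary term)
print(toy_X_train.shape, toy_X_test.shape)
# Test words that never appeared in training are simply dropped
print(toy_X_test.nnz)
```

Calling `fit_transform` on the test texts instead would build a different vocabulary, so the columns of the two matrices would no longer line up.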


In [5]:
# it is very useful to have the inverse dictionary, which maps indices to terms
idx2word = {v: k for k,v in tfidfvect.vocabulary_.items()}

In [6]:
idx2word 

{95844: 'was',
 97181: 'wondering',
 48754: 'if',
 18915: 'anyone',
 68847: 'out',
 88638: 'there',
 30074: 'could',
 37335: 'enlighten',
 60560: 'me',
 68080: 'on',
 88767: 'this',
 25775: 'car',
 80623: 'saw',
 88532: 'the',
 68781: 'other',
 31990: 'day',
 51326: 'it',
 34809: 'door',
 84538: 'sports',
 57390: 'looked',
 89360: 'to',
 21987: 'be',
 41715: 'from',
 55746: 'late',
 9843: '60s',
 35974: 'early',
 11174: '70s',
 25492: 'called',
 24160: 'bricklin',
 34810: 'doors',
 96247: 'were',
 76471: 'really',
 83426: 'small',
 49447: 'in',
 16809: 'addition',
 41724: 'front',
 24635: 'bumper',
 81658: 'separate',
 77878: 'rest',
 67670: 'of',
 23480: 'body',
 51136: 'is',
 17936: 'all',
 54632: 'know',
 25590: 'can',
 88143: 'tellme',
 62746: 'model',
 64931: 'name',
 37287: 'engine',
 84276: 'specs',
 99911: 'years',
 73373: 'production',
 96433: 'where',
 59079: 'made',
 46814: 'history',
 68409: 'or',
 96395: 'whatever',
 49932: 'info',
 100208: 'you',
 45885: 'have',
 41979: '

In [7]:
# store the targets in `y_train`
y_train = newsgroups_train.target
y_train[:10]


array([ 7,  4,  4,  1, 14, 16, 13,  3,  2,  4])

In [8]:
# there are 20 classes, one per newsgroup
print(f'classes {np.unique(newsgroups_test.target)}')
newsgroups_test.target_names

classes [ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19]


['alt.atheism',
 'comp.graphics',
 'comp.os.ms-windows.misc',
 'comp.sys.ibm.pc.hardware',
 'comp.sys.mac.hardware',
 'comp.windows.x',
 'misc.forsale',
 'rec.autos',
 'rec.motorcycles',
 'rec.sport.baseball',
 'rec.sport.hockey',
 'sci.crypt',
 'sci.electronics',
 'sci.med',
 'sci.space',
 'soc.religion.christian',
 'talk.politics.guns',
 'talk.politics.mideast',
 'talk.politics.misc',
 'talk.religion.misc']

# Exercise 1

In [9]:
# Select 5 documents at random from the training set, fixing the seed for reproducibility
np.random.seed(4)
idxs = np.random.choice(np.arange(X_train.shape[0]), 5, replace=False)

In [10]:
idxs

array([7676, 3746, 3074, 8936, 5917])
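
A minimal sketch of an alternative way to draw the sample, assuming NumPy's `Generator` API is acceptable here. `n_docs` is a stand-in for `X_train.shape[0]`. Note that a `Generator` seeded with 4 would not reproduce the indices shown above, because it uses a different algorithm than the legacy `np.random.seed`.

```python
import numpy as np

n_docs = 11314  # assumed value; in the notebook this would be X_train.shape[0]

# A local generator avoids mutating NumPy's global random state
rng = np.random.default_rng(4)
sample = rng.choice(n_docs, size=5, replace=False)

# Re-creating the generator with the same seed yields the same sample
sample_again = np.random.default_rng(4).choice(n_docs, size=5, replace=False)
print(sample, np.array_equal(sample, sample_again))
```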

## Document similarity for each of the selected documents

#### Document 1

In [11]:
# Document 1
idx_1 = idxs[0]
print("Document ID:", idx_1)
print()
print("Document content:")
print(newsgroups_train.data[idx_1])

Document ID: 7676

Document content:


Uh, I'm afraid that your information is slightly out of date... PKWare
has obtained a license to export their program to the whole world,
except a very limited list of countries... Draw your own conclusions
about the strength of the algorithm... :-)

Regards,
Vesselin


In [12]:
# Compute the cosine similarity against every training document
cossim = cosine_similarity(X_train[idx_1], X_train)[0]

In [13]:
# Similarity values sorted from highest to lowest
np.sort(cossim)[::-1]


array([1.        , 0.47324105, 0.22317102, ..., 0.        , 0.        ,
       0.        ])
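
The leading 1.0 is the document compared with itself. As a side note, `TfidfVectorizer` L2-normalizes each row by default (`norm='l2'`), so for these vectors cosine similarity reduces to a plain dot product. A small sketch on toy strings (the `toy_docs` texts are illustrative, not taken from the dataset):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

toy_docs = [
    "export license for the encryption algorithm",
    "the encryption algorithm was denied an export license",
    "what do I need to configure this drive",
]
X = TfidfVectorizer().fit_transform(toy_docs)

# Every non-empty row has unit Euclidean length under the default settings
row_norms = np.sqrt(np.asarray(X.multiply(X).sum(axis=1)).ravel())

cos = cosine_similarity(X[0], X)[0]  # normalizes internally, then takes dot products
dot = linear_kernel(X[0], X)[0]      # dot products only

print(row_norms)
print(np.allclose(cos, dot))
```

On a large matrix `linear_kernel` skips the redundant normalization, which is the only practical difference in this setting.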

In [14]:
# Indices of the 5 most similar documents (position 0 is the document itself, so it is skipped)
mas_similares = np.argsort(cossim)[::-1][1:6]
print(mas_similares)


[ 7822   433 10778  6015 10261]
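
The slice `[1:6]` assumes the document itself sorts first. That holds here, but an exact duplicate elsewhere in the corpus would also score 1.0 and could take its place. A sketch of a hypothetical helper, `top_k_similar`, that excludes the query row explicitly and could replace the steps repeated for each document (toy strings for illustration):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def top_k_similar(X, idx, k=5):
    """Return (indices, scores) of the k rows of X most similar to row idx, excluding idx."""
    sims = cosine_similarity(X[idx], X)[0]
    sims[idx] = -np.inf                    # never return the query document itself
    order = np.argsort(sims)[::-1][:k]     # highest similarity first
    return order, sims[order]

toy_docs = [
    "export license encryption algorithm",
    "export license denied encryption",
    "car engine doors",
    "car engine specs",
]
toy_X = TfidfVectorizer().fit_transform(toy_docs)
neighbors, scores = top_k_similar(toy_X, 0, k=2)
print(neighbors, scores)
```

In the notebook this would be called as `top_k_similar(X_train, idx_1)`.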


In [15]:
# The original document belongs to the class:
print(newsgroups_train.target_names[y_train[idx_1]])

sci.crypt


In [16]:
# Classes of the 5 most similar documents
for i in mas_similares:
    print(newsgroups_train.target_names[y_train[i]])

sci.crypt
sci.crypt
sci.crypt
sci.crypt
sci.crypt


#### Let's look at the text of the most similar documents.

In [17]:
# Most similar document 1
print("ID of most similar document 1:", mas_similares[0])
print()
print("Document content:")
print(newsgroups_train.data[mas_similares[0]])

ID of most similar document 1: 7822

Document content:

in this regard that permission to export >>> PKZIP's encryption scheme
has twice been denied by NSA.  Draw you own >>> conclusions.

PKWare >>has obtained a license to export their program to the whole
world, >>except a very limited list of countries...  Draw your own
conclusions >>about the strength of the algorithm...  :-)

Sorry if I was less than clear.  :-) I was referring to our own efforts
to receive export permission from NSA for the PKZIP encryption
algorithm, not to any effort on the part of Phil Katz or PKWare.

I should point out that the original version of this algorithm was
designed by Roger Schlafly and that WE (meaning Roger and myself) were
twice denied an export license for it.  The second go 'round was just
this past fall.

I had no knowledge of Phil's attempts in this.  I do not even *know* for
sure if he choose to implement the algorithm as it was designed by
Roger, though I *believe* that was at least

In [18]:
# Most similar document 2
print("ID of most similar document 2:", mas_similares[1])
print()
print("Document content:")
print(newsgroups_train.data[mas_similares[1]])

ID of most similar document 2: 433

Document content:



Funny, we had plenty of them in Bulgaria, regardless of the embargo...
:-) So much for export controls...

Regards,
Vesselin


In [19]:
# Most similar document 3
print("ID of most similar document 3:", mas_similares[2])
print()
print("Document content:")
print(newsgroups_train.data[mas_similares[2]])

ID of most similar document 3: 10778

Document content:


It depends on the algorithm used. 128-bit secret keys for RSA are
definitively not secure enough.

Regards,
Vesselin


In [20]:
# Most similar document 4
print("ID of most similar document 4:", mas_similares[3])
print()
print("Document content:")
print(newsgroups_train.data[mas_similares[3]])

ID of most similar document 4: 6015

Document content:


If there are many as..., er, people in the USA who reason like the
above, then it should not be surprising that the current plot has been
allowed to happen...

Regards,
Vesselin


In [21]:
# Most similar document 5
print("ID of most similar document 5:", mas_similares[4])
print()
print("Document content:")
print(newsgroups_train.data[mas_similares[4]])

ID of most similar document 5: 10261

Document content:
As promised, I spoke today with the company mentioned in a Washington
Times article about the Clipper chip announcement. The name of the company
is Secure Communicatiions Technology (Information will be given at the end
of this message on how to contact them).

   Basically they are disturbed about the announcement for many reasons that
we are. More specifically however, Mr. Bryen of Secure Communications
brought to light many points that might interest most of the readers.

   His belief is that AT&T was made known of the clipper well before the
rest of the industry. This is for several reasons, several of which are:

 - A company of AT&T's size could never be able to make a decision to use
   the new chip on the SAME DAY it was announced.

 - Months ago they proposed using their own chip for AT&T's secure telephone
   devices. AT&T basically blew them off as being not interested at all.
   This stuck them as strange, unti

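Analysis
> The original document belongs to the "sci.crypt" class and discusses the export license that PKWare obtained for its program, and what that implies about the strength of its encryption algorithm.
> The 5 most similar documents are also labeled "sci.crypt". The closest one (7822) quotes the original text, while three others (433, 10778, 6015) are short posts ending with the same "Regards, Vesselin" signature, so the shared sign-off probably accounts for a good part of their similarity.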

#### Document 2

In [22]:
# Document 2
idx_2 = idxs[1]
print("Document ID:", idx_2)
print()
print("Document content:")
print(newsgroups_train.data[idx_2])

Document ID: 3746

Document content:
THE WHITE HOUSE

                  Office of the Press Secretary
                 (Vancouver, British Columbia) 
______________________________________________________________


                       BACKGROUND BRIEFING
                               BY
                 SENIOR ADMINISTRATION OFFICIALS


                          April 4, 1993
	     
                          Canada Place
                  Vancouver, British Columbia  


9:40 A.M. PST
	     
	     
	     Folks, we're about to start the BACKGROUND BRIEFING 
on the aid package.

	     SENIOR ADMINISTRATION OFFICIAL:  Good morning.  The 
President -- President Clinton and President Yeltsin agreed 
yesterday on a series of American initiatives to support economic 
and political reform in Russia, and it's valued at $1.6 billion.  

	     Before taking your questions and running through the 
basic outlines of this package, I want to make a few points.  
First, this is the maxim

In [23]:
# Compute the cosine similarity against every training document
cossim = cosine_similarity(X_train[idx_2], X_train)[0]

In [24]:
# Similarity values sorted from highest to lowest
np.sort(cossim)[::-1]


array([1.        , 0.67557077, 0.64914676, ..., 0.        , 0.        ,
       0.        ])

In [25]:
# Indices of the 5 most similar documents (position 0 is the document itself, so it is skipped)
mas_similares = np.argsort(cossim)[::-1][1:6]
print(mas_similares)

[1191 4271 4253 3596 6635]


In [26]:
# The original document belongs to the class:
print(newsgroups_train.target_names[y_train[idx_2]])

talk.politics.misc


In [27]:
# Classes of the 5 most similar documents
for i in mas_similares:
    print(newsgroups_train.target_names[y_train[i]])

talk.politics.misc
talk.politics.misc
talk.politics.misc
talk.politics.misc
talk.politics.misc
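
Identical labels do not by themselves show that the documents discuss the same subject: text shared verbatim, such as a letterhead, can raise TF-IDF cosine similarity on its own. The sketch below illustrates this with toy strings. The header wording mimics the press-release letterhead of the document above, while the bodies and the helper `first_pair_similarity` are made up for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

header = "THE WHITE HOUSE Office of the Press Secretary For Immediate Release "
bodies = [
    "background briefing on the aid package for Russia",
    "remarks by the President at summer jobs conference",
    "what do I need to configure this drive as a slave",
]

def first_pair_similarity(texts):
    """Cosine similarity between the first two texts after TF-IDF vectorization."""
    X = TfidfVectorizer().fit_transform(texts)
    return float(cosine_similarity(X[0], X[1])[0, 0])

# Same two bodies, with and without the shared letterhead in front
with_header = first_pair_similarity([header + bodies[0], header + bodies[1], bodies[2]])
without_header = first_pair_similarity(bodies)

print(f"with header: {with_header:.3f}, without header: {without_header:.3f}")
```

The two bodies cover different events, yet the version with the letterhead scores much higher, so a high similarity between press releases is worth checking against the actual text.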


#### Let's look at the text of the most similar documents.

In [28]:
# Most similar document 1
print("ID of most similar document 1:", mas_similares[0])
print()
print("Document content:")
print(newsgroups_train.data[mas_similares[0]])

ID of most similar document 1: 1191

Document content:
THE WHITE HOUSE

                  Office of the Press Secretary
                  (Vancouver, British Columbia)
_________________________________________________________________
For Immediate Release                               April 4, 1993

	     
                PRESS CONFERENCE BY THE PRESIDENT
                        WITH RUSSIAN PRESS
	     
                           Canada Place
                   Vancouver, British Columbia



2:46 P.M. PDT

	     	  
	     Q	  I had two questions for both Presidents, so you 
could probably answer for Boris, too.  (Laughter.)
	     
	     THE PRESIDENT:  I'll give you my answer, then I'll 
give you Yeltsin's answer.  (Laughter.)
	     
	     Q	  The first is that this is the meeting of the 
Presidents, so the money that's being promised is government 
money, and naturally it's going to be distributed through the 
government.  But you've indicated that three-quarters are going 
to

In [29]:
# Most similar document 2
print("ID of most similar document 2:", mas_similares[1])
print()
print("Document content:")
print(newsgroups_train.data[mas_similares[1]])

ID of most similar document 2: 4271

Document content:
THE WHITE HOUSE

                    Office of the Press Secretary
______________________________________________________________
For Immediate Release                             April 13, 1993     

	     
                      REMARKS BY THE PRESIDENT,
               SECRETARY OF EDUCATION RICHARD RILEY AND
                   SECRETARY OF LABOR ROBERT REICH  IN 
                GOALS 2000 SATELLITE TOWN HALL MEETING
	     
                     Chamber of Commerce Building
                           Washington, D.C.   



8:30 P.M. EDT
	     
	     
	     SECRETARY RILEY:  Good evening and welcome to all of you 
in the thousands of communities around the country that are taking 
part in this satellite town meeting for the month of April.
	     
	     You know, today is April 13th.  In 1743, Thomas 
Jefferson was born, 250 years ago.  I think that's appropriate to 
mention at the beginning of this meeting because since that

In [30]:
# Most similar document 3
print("ID of most similar document 3:", mas_similares[2])
print()
print("Document content:")
print(newsgroups_train.data[mas_similares[2]])

ID of most similar document 3: 4253

Document content:
THE WHITE HOUSE

                  Office of the Press Secretary
                    (Pittsburgh, Pennsylvania)
______________________________________________________________
For Immediate Release                         April 17, 1993     

	     
                    INTERVIEW OF THE PRESIDENT
                      BY MICHAEL WHITELY OF
                    KDKA-AM RADIO, PITTSBURGH
	     
                 Pittsburgh International Airport
                     Pittsburgh, Pennsylvania    



10:40 A.M. EDT
	     
	     
	     Q	  For everyone listening on KDKA Radio, I'm Mike 
Whitely, KDKA Radio News.  We're here at the Pittsburgh 
International Airport and with me is the President of the United 
States Bill Clinton.
	     
	     And I'd like to welcome you to the area and to KDKA.
	     
	     THE PRESIDENT:  Thank you, Mike.  Glad to be here.
	     
	     Q	  There are a lot of things we'd like to talk 
about in the brief 

In [31]:
# Most similar document 4
print("ID of most similar document 4:", mas_similares[3])
print()
print("Document content:")
print(newsgroups_train.data[mas_similares[3]])

ID of most similar document 4: 3596

Document content:
THE WHITE HOUSE

                    Office of the Press Secretary
_________________________________________________________________
For Immediate Release                             April 14, 1993     

	     
                       REMARKS BY THE PRESIDENT
                      AT SUMMER JOBS CONFERENCE

	     	  
                            Hyatt Regency
                        Crystal City, Virginia  


11:22 A.M. EDT

	     
	     THE PRESIDENT:  Thank you very much.  The speech that 
Octavius gave says more than anything I will be able to say today 
about why it's important to give all of our young people a chance to 
get a work experience and to continue to learn, to merge the nature 
of learning and work; why it's important to honor the efforts of 
people like Jerry Levin and Nancye Combs and Pat Irving and all of 
those who are here.  
	     
	     I want to thank the Secretaries of Labor and Education 
and all the 

In [32]:
# Most similar document 5
print("ID of most similar document 5:", mas_similares[4])
print()
print("Document content:")
print(newsgroups_train.data[mas_similares[4]])

ID of most similar document 5: 6635

Document content:
THE WHITE HOUSE

                    Office of the Press Secretary
______________________________________________________________
For Immediate Release                             April 15, 1993     

	     
                       REMARKS BY THE PRESIDENT
                   TO LAW ENFORCEMENT ORGANIZATIONS
	     
	     
                           The Rose Garden 


2:52 P.M. EDT


	     THE PRESIDENT:  Good afternoon.  Ladies and gentlemen, 
two months ago I presented a comprehensive plan to reduce our 
national deficit and to increase our investment in the American 
people, their jobs and their economic future.  The federal budget 
plan passed Congress in record time, and created a new sense of hope 
and opportunity in the country.  
	     
	     Then, the short-term jobs plan I presented to Congress, 
which would create a half a million jobs in the next two years passed 
the House of Representatives two weeks ago.  It now 

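Analysis
> The original document belongs to the "talk.politics.misc" class and is a transcript from the White House Office of the Press Secretary: a background briefing given in Vancouver, British Columbia, on the U.S. aid package for Russia.
> The 5 most similar documents are also "talk.politics.misc" and are White House press-office transcripts from April 1993 with the same letterhead. Only the first (1191) comes from the same Vancouver meeting; the others cover different events, so the shared boilerplate appears to contribute heavily to the similarity.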

#### Document 3

In [33]:
# Document 3
idx_3 = idxs[2]
print("Document ID:", idx_3)
print()
print("Document content:")
print(newsgroups_train.data[idx_3])

Document ID: 3074

Document content:
Well, we got some responses and are doing some interviews with interesting
responders. However, just in case the other posting was overlooked by an
incredibly talented person ... Mea Culpa for posting this here for Mike,
but we're looking for someone special:

   Tandem Computers is currently looking for a software wizard to help
 us architect & implement a fault-tolerant generalized instrumentation
 subsystem as part of our proprietary operating system kernel (TNS
 Kernel). The TNS Kernel is a proprietary, loosely-coupled parallel,
 message-based operating system. The TNS Kernel has wide connectivity
 to open standards.
   In this key individual contributor role, you will work with other
 developers working on various components of the Transaction Management
 Facility.
   Your background needs to encompass some of the following 4 categories
 (3 of 4 would be excellent):
   Category 1. Math: Working knowledge of statistics, real analysis,

In [34]:
# Compute the cosine similarity against every training document
cossim = cosine_similarity(X_train[idx_3], X_train)[0]

In [35]:
# Similarity values sorted from highest to lowest
np.sort(cossim)[::-1]

array([1.        , 0.18091718, 0.17296515, ..., 0.        , 0.        ,
       0.        ])

In [36]:
# Indices of the 5 most similar documents (position 0 is the document itself, so it is skipped)
mas_similares = np.argsort(cossim)[::-1][1:6]
print(mas_similares)

[4271 5443 2350 6719 4166]


In [37]:
# The original document belongs to the class:
print(newsgroups_train.target_names[y_train[idx_3]])

sci.space


In [38]:
# Classes of the 5 most similar documents
for i in mas_similares:
    print(newsgroups_train.target_names[y_train[i]])

talk.politics.misc
sci.space
sci.crypt
comp.graphics
comp.graphics
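
With mixed neighbour labels like these, a quick summary can help: the majority label among the neighbours and the fraction that agree with the original document's class. A minimal sketch, where `neighbor_label_summary` is a hypothetical helper and the labels are copied from the output above:

```python
from collections import Counter

def neighbor_label_summary(neighbor_labels, own_label):
    """Return (majority label among neighbours, fraction sharing own_label)."""
    counts = Counter(neighbor_labels)
    majority_label, _ = counts.most_common(1)[0]
    agreement = counts[own_label] / len(neighbor_labels)
    return majority_label, agreement

# Classes of the 5 nearest neighbours of document 3074, as printed above
labels = [
    "talk.politics.misc",
    "sci.space",
    "sci.crypt",
    "comp.graphics",
    "comp.graphics",
]
summary = neighbor_label_summary(labels, own_label="sci.space")
print(summary)
```

Here only one neighbour in five shares the "sci.space" label, and a majority vote over the neighbours would pick "comp.graphics" instead.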


#### Let's look at the text of the most similar documents.

In [39]:
# Most similar document 1
print("ID of most similar document 1:", mas_similares[0])
print()
print("Document content:")
print(newsgroups_train.data[mas_similares[0]])

ID of most similar document 1: 4271

Document content:
THE WHITE HOUSE

                    Office of the Press Secretary
______________________________________________________________
For Immediate Release                             April 13, 1993     

	     
                      REMARKS BY THE PRESIDENT,
               SECRETARY OF EDUCATION RICHARD RILEY AND
                   SECRETARY OF LABOR ROBERT REICH  IN 
                GOALS 2000 SATELLITE TOWN HALL MEETING
	     
                     Chamber of Commerce Building
                           Washington, D.C.   



8:30 P.M. EDT
	     
	     
	     SECRETARY RILEY:  Good evening and welcome to all of you 
in the thousands of communities around the country that are taking 
part in this satellite town meeting for the month of April.
	     
	     You know, today is April 13th.  In 1743, Thomas 
Jefferson was born, 250 years ago.  I think that's appropriate to 
mention at the beginning of this meeting because since that

In [40]:
# Most similar document 2
print("ID of most similar document 2:", mas_similares[1])
print()
print("Document content:")
print(newsgroups_train.data[mas_similares[1]])

ID of most similar document 2: 5443

Document content:

   Nick Haines sez;
   >(given that I've heard the Shuttle software rated as Level 5 in
   >maturity, I strongly doubt that this [having lots of bugs] is the case).

   Level 5?  Out of how many?  What are the different levels?  I've never
   heard of this rating system.  Anyone care to clue me in?

This is a rating system used by ARPA and other organisations to
measure the maturity of a `software process' i.e. the entire process
by which software gets designed, written, tested, delivered, supported
etc.

See `Managing the Software Process', by Watts S. Humphrey, Addison
Wesley 1989. An excellent software engineering text. The 5 levels of
software process maturity are:

1. Initial
2. Repeatable
3. Defined
4. Managed
5. Optimizing

The levels are approximately characterized as follows:

1. no statistically software process control. Have no statistical
   basis for estimating how large software will be, how long it will
   ta

In [41]:
# Most similar document 3
print("ID of most similar document 3:", mas_similares[2])
print()
print("Document content:")
print(newsgroups_train.data[mas_similares[2]])

ID of most similar document 3: 2350

Document content:
Archive-name: net-privacy/part1
Last-modified: 1993/3/3
Version: 2.1


IDENTITY, PRIVACY, and ANONYMITY on the INTERNET

(c) 1993 L. Detweiler.  Not for commercial use except by permission
from author, otherwise may be freely copied.  Not to be altered. 
Please credit if quoted.

SUMMARY

Information on email and account privacy, anonymous mailing and 
posting, encryption, and other privacy and rights issues associated
with use of the Internet and global networks in general.

(Search for <#.#> for exact section. Search for '_' (underline) for
next section.)

PART 1

Identity
--------
<1.1> What is `identity' on the internet?
<1.2> Why is identity (un)important on the internet?
<1.3> How does my email address (not) identify me and my background?
<1.4> How can I find out more about somebody from their email address?
<1.5> Why is identification (un)stable on the internet? 
<1.6> What is the future of identification on the inter

In [42]:
# Most similar document 4
print("ID of most similar document 4:", mas_similares[3])
print()
print("Document content:")
print(newsgroups_train.data[mas_similares[3]])

ID of most similar document 4: 6719

Document content:
Archive-name: graphics/resources-list/part3
Last-modified: 1993/04/17


Computer Graphics Resource Listing : WEEKLY POSTING [ PART 3/3 ]
Last Change : 17 April 1993


11. Scene generators/geographical data/Maps/Data files

DEMs (Digital Elevation Models)
-------------------------------
  DEMs (Digital Elevation Models) as well as other cartographic data
  [huge] is available from spectrum.xerox.com [192.70.225.78], /pub/map.

  Contact:
  Lee Moore -- Webster Research Center, Xerox Corp. --
  Voice: +1 (716) 422 2496
  Arpa, Internet:  Moore.Wbst128@Xerox.Com
[ Check also on ncgia.ucsb.edu (128.111.254.105), /pub/dems -- nfotis ]

  Many of these files are also available on CD-ROM selled by USGS:
  "1:2,000,000 scale  Digital Line Graph (DLG) Data". Contains datas
  for all 50 states. Price is about $28, call to or visit in offices
  in Menlo Park, in Reston, Virginia (800-USA-MAPS).

  The Data User Services Division of the

In [43]:
# Most similar document 5
print("ID of most similar document 5:", mas_similares[4])
print()
print("Document content:")
print(newsgroups_train.data[mas_similares[4]])

ID of most similar document 5: 4166

Document content:
Archive-name: graphics/resources-list/part2
Last-modified: 1993/04/17


Computer Graphics Resource Listing : WEEKLY POSTING [ PART 2/3 ]
Last Change : 17 April 1993


14. Plotting packages

Gnuplot 3.2
-----------
  It is one of the best 2- and 3-D plotting packages, with
  online help.It's a command-line driven interactive function plotting utility
  for UNIX, MSDOS, Amiga, Archimedes, and VMS platforms (at least!).
  Freely distributed, it supports many terminals, plotters, and printers
  and is easily extensible to include new devices.
  It was posted to comp.sources.misc in version 3.0, plus 2 patches.
  You can practically find it everywhere (use Archie to find a site near you!).
  The comp.graphics.gnuplot newsgroup is devoted to discussion of Gnuplot.

Xvgr and Xmgr (ACE/gr)
-----------------------
  Xmgr is an XY-plotting tool for UNIX workstations using
  X or OpenWindows. There is an XView version called xvgr for
 

Analysis
> The document is labeled "sci.space", which looks wrong for its content: the text is a job posting from Tandem Computers seeking a software developer to work on part of its proprietary operating system kernel.
> The similarity scores are low (the best match is about 0.18) and the labels of the similar documents are mixed: only one of the five is "sci.space", and their content does not match the topic of the original text.

#### Document 4

In [44]:
# Document 4
idx_4 = idxs[3]
print("Document ID:", idx_4)
print()
print("Document content:")
print(newsgroups_train.data[idx_4])

Document ID: 8936

Document content:
*Reminder*   Plan now for the Andrew Conference.
*Date* The dates are as noted below.  (We have not changed them.)
*Submission extension*   We are still accepting papers.

*Tutorial topic*  
	_Converting Andrew source code to C++_

This tutorial will discuss the steps necessary to convert a site from C
(extended with classC) to C++.  Conversion of the source code requires
only a couple of steps:
	run the converter
	fill in missing type information
Describing this will not take long.  The remainder of the day will be
spent learning how to write objects in C++ and practicing.

------------------------------

1993 Andrew Technical Conference and Consortium Annual Meeting
June 24-25, 1993
Carnegie Mellon University
Pittsburgh, PA

The conference will be held on the last Thursday and Friday in June.  A
tutorial will be on Thursday the 24th and the conference proper on the
25th with the annual meeting at the dinner on the evening between the
tw

In [45]:
# Compute the cosine similarity against every training document
cossim = cosine_similarity(X_train[idx_4], X_train)[0]

In [46]:
# Similarity values sorted from highest to lowest
np.sort(cossim)[::-1]

array([1.        , 0.39979522, 0.31632582, ..., 0.        , 0.        ,
       0.        ])

In [47]:
# Indices of the 5 most similar documents (position 0 is the document itself, so it is skipped)
mas_similares = np.argsort(cossim)[::-1][1:6]
print(mas_similares)

[3425 9879 7967 6546 7007]


In [48]:
# The original document belongs to the class:
print(newsgroups_train.target_names[y_train[idx_4]])

comp.windows.x


In [49]:
# Classes of the 5 most similar documents
for i in mas_similares:
    print(newsgroups_train.target_names[y_train[i]])

comp.windows.x
comp.windows.x
comp.windows.x
comp.windows.x
sci.crypt


#### Let's look at the text of the most similar documents.

In [50]:
# Most similar document 1
print("ID of most similar document 1:", mas_similares[0])
print()
print("Document content:")
print(newsgroups_train.data[mas_similares[0]])

ID of most similar document 1: 3425

Document content:
1993 Andrew Tutorial 
                                  and 
                          Technical Conference 


When:  Thursday and Friday, June 24 and 25, 1993 
    (Deadline for Registration:  June 4, 1993) 

Where:  Carnegie Mellon University in Pittsburgh, Pennsylvania.   

Sponsor:  Andrew Consortium of CMU's School of Computer Science.  

Schedule:  The Tutorial  will be on Thursday, followed by dinner and the
    Annual Meeting.  The Conference proper will be on Friday.  All
    Conference attendees are welcome at the Annual Meeting.  

    Wednesday, June 23 

        Check in:  After 4:00 PM 
        Informal Reception:  7:30 PM 

    Thursday, June 24 

        Tutorial:  9:00 A.M. - 5:00 PM 
        Conference Dinner:  6:30 PM 
        Annual meeting:  8:00 PM 

    Friday, June 25 

        Technical Conference:  9:00 AM - 5:00 PM 

Cost:  
    Tutorial fee includes breaks, lunch and tutorial materials:  $100 
   

In [51]:
# Most similar document 2
print("ID of most similar document 2:", mas_similares[1])
print()
print("Document content:")
print(newsgroups_train.data[mas_similares[1]])

ID of most similar document 2: 9879

Document content:
Excerpts from netnews.comp.windows.x: 23-Apr-93 X Toolkits Paul
Prescod@undergrad.m (1132)


The Andrew User Interface System is supported, maintained, enhanced, and
distributed by the Andrew Consortium, Carnegie Mellon.  The distribution
terms are those of the X consortium, not the GNU Public License.  Thus
anyone can commercially exploit the Andrew code without restriction. 
(To encourage membership, however, we defer universal release of the
latest versions until Consortium members have had an opportunity to
explore the new capabilities.)

To se what AUIS looks like, you can try a remote demo.  You need an X
server (R5 is best) on a machine linked to the internet.  Give the
command 

	finger help@atk.itc.cmu.edu 

for instructions.

NOTE:  The demo version does not use the Motif-look-and-feel scrollbar,
but one is available.  You can use it on the demo by changing an option
in the ~/preferences file and starting a new edi

In [52]:
# Most similar document 3
print("ID of most similar document 3:", mas_similares[2])
print()
print("Document content:")
print(newsgroups_train.data[mas_similares[2]])

ID of most similar document 3: 7967

Document content:
Archive-name: x-faq/part1
Last-modified: 1993/04/04

This article and several following contain the answers to some Frequently Asked 
Questions (FAQ) often seen in comp.windows.x. It is posted to help reduce 
volume in this newsgroup and to provide hard-to-find information of general 
interest.

		Please redistribute this article!

This article includes answers to the following questions, which are loosely
grouped into categories. Questions marked with a + indicate questions new to 
this issue; those with significant changes of content since the last issue are 
marked by !:

  0)  TOPIC: BASIC INFORMATION SOURCES AND DEFINITIONS
  1)! What books and articles on X are good for beginners?
  2)! What courses on X and various X toolkits are available?
  3)! What conferences on X are coming up?
  4)  What X-related public mailing lists are available?
  5)  How can I meet other X developers? 
  6)  What related FAQs are available?

In [53]:
# Most similar document 4
print("ID of most similar document 4:", mas_similares[3])
print()
print("Document content:")
print(newsgroups_train.data[mas_similares[3]])

ID of most similar document 4: 6546

Document content:
Excerpts from netnews.comp.windows.x: 23-Apr-93 X Toolkits Paul
Prescod@undergrad.m (1132)



If you're on the internet and your site isn't sheltered from external
tcp/ip traffic, you can use the Remote Andrew Demo to see what the
Andrew Toolkit looks like:

Remote Andrew Demo Service

This network service allows you to run Andrew Toolkit applications
without the overhead of obtaining or compiling the Andrew software.  You
need a host machine on the Internet, and you need to be running the X11
window system.  A simple "finger" command will allow you to experience
ATK applications firsthand.  You'll be able to compose multimedia
documents, navigate through the interactive Andrew Tour, and use the
Andrew Message System to browse through CMU's three thousand bulletin
boards and newsgroups.

To use the Remote Andrew Demo service, simply run the following command
on your machine:

    finger help@atk.itc.cmu.edu

The service will

In [54]:
# Most similar document 5
print("ID of most similar document 5:", mas_similares[4])
print()
print("Document content:")
print(newsgroups_train.data[mas_similares[4]])

ID of most similar document 5: 7007

Document content:
............................................................................
        CRYPTO '93  -  Conference Announcement & Final Call for Papers
............................................................................

The Thirteenth Annual CRYPTO Conference, sponsored by the International 
Association for Cryptologic Research (IACR), in cooperation with 
the IEEE Computer Society Technical Committee on Security and Privacy, 
the Computer Science Department of the University of California, Santa 
Barbara, and Bell-Northern Research (a subsidiary of Northern Telecom), 
will be held on the campus of the University of California, Santa Barbara, 
on August 22-26, 1993. Original research papers and technical expository talks 
are solicited on all practical and theoretical aspects of cryptology. It is 
anticipated that some talks may also be presented by special invitation of the 
Program Committee.
------------------------

Analysis
> The document is labeled "comp.windows.x". On reading it, the text discusses a tutorial for converting code to C++.
> The labels of the similar documents are consistent with the subject of the text, tutorials.

#### Document 5

In [55]:
# Document 5
idx_5 = idxs[4]
print("Document ID:", idx_5)
print()
print("Document content:")
print(newsgroups_train.data[idx_5])

Document ID: 5917

Document content:
What do I need to do to configure this drive as a slave?
Model# CP30101G

Please reply via e-mail. Thanks!!


In [56]:
# Compute the cosine similarity against every training document
cossim = cosine_similarity(X_train[idx_5], X_train)[0]
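
As a small sanity check of what this cell computes, the sketch below repeats the same call on a hypothetical four-document toy corpus (not the newsgroups data; the names `toy_docs`, `toy_vect`, `X_toy`, `toy_cossim` and `toy_dot` are assumptions made for illustration). Because `TfidfVectorizer` L2-normalizes each row by default (`norm='l2'`), the cosine similarity between rows should coincide with their plain dot product.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical toy corpus, used only for illustration
toy_docs = [
    "configure this drive as a slave",
    "set the hard drive from master to slave",
    "the conference call for papers",
    "looking for information about this drive",
]

toy_vect = TfidfVectorizer()  # default norm='l2': every row has unit length
X_toy = toy_vect.fit_transform(toy_docs)

# Cosine similarity of document 0 against every document (itself included)
toy_cossim = cosine_similarity(X_toy[0], X_toy)[0]

# With unit-length rows, the plain dot product gives the same values
toy_dot = (X_toy[0] @ X_toy.T).toarray()[0]

print(np.round(toy_cossim, 3))
print(np.allclose(toy_cossim, toy_dot))
```

Document 0 has similarity 1 with itself and 0 with the third toy document, which shares no terms with it; this mirrors the leading `1.` and the trailing zeros in the sorted similarities of the real data.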

In [57]:
# Similarity values sorted from highest to lowest
np.sort(cossim)[::-1]

array([1.        , 0.26462774, 0.24780124, ..., 0.        , 0.        ,
       0.        ])

In [58]:
# Show the 5 most similar documents (position 0 is skipped: it is the document itself)
mas_similares = np.argsort(cossim)[::-1][1:6]
print(mas_similares)

[10510  2483 11229  6659  6386]
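
The slice `[1:6]` assumes that the query document lands in position 0 of the descending ordering. That holds when its self-similarity of 1.0 is unique, but an exact duplicate elsewhere in the corpus would tie with it and the ordering among ties is not guaranteed. A minimal sketch of a more defensive variant follows; `top_k_similar` is a hypothetical helper, not part of the original notebook, and it removes the query by index instead of by position.

```python
import numpy as np

def top_k_similar(sims, query_idx, k=5):
    """Return the indices of the k highest similarities, skipping query_idx.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    order = np.argsort(sims)[::-1]     # indices sorted by descending similarity
    order = order[order != query_idx]  # drop the query document itself
    return order[:k]

# Toy similarity vector: index 2 plays the role of the query document
toy_sims = np.array([0.10, 0.26, 1.00, 0.00, 0.24, 0.05])
print(top_k_similar(toy_sims, query_idx=2, k=3))
```

In this notebook the equivalent call would be `top_k_similar(cossim, idx_5)`.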


In [59]:
# Class of the original document:
print(newsgroups_train.target_names[y_train[idx_5]])

comp.sys.ibm.pc.hardware


In [60]:
# Classes of the 5 most similar documents
for i in mas_similares:
  print(newsgroups_train.target_names[y_train[i]])

comp.sys.ibm.pc.hardware
comp.sys.ibm.pc.hardware
comp.sys.ibm.pc.hardware
comp.sys.ibm.pc.hardware
comp.sys.ibm.pc.hardware


#### Let's analyze the text of the most similar documents.

In [61]:
# Most similar document 1
print("ID of most similar document 1:", mas_similares[0])
print()
print("Document content:")
print(newsgroups_train.data[mas_similares[0]])

ID of most similar document 1: 10510

Document content:
Since the losers that sold me the hard disk for my computer are
so generous, I need the info to set this drive from master to
slave. Any help would be greatly appreciated.  Please reply via e-mail.

Incidentally, avoid purchasing a computer from ACS in Endicott, NY.



In [62]:
# Most similar document 2
print("ID of most similar document 2:", mas_similares[1])
print()
print("Document content:")
print(newsgroups_train.data[mas_similares[1]])

ID of most similar document 2: 2483

Document content:
I am looking for information about this drive.  Switch settings, geometry..etc.

Conner CP3204F

Please reply via e-mail.  Many thanks in advance!



In [63]:
# Most similar document 3
print("ID of most similar document 3:", mas_similares[2])
print()
print("Document content:")
print(newsgroups_train.data[mas_similares[2]])

ID of most similar document 3: 11229

Document content:
Configuration of IDE Harddisks


last update:	14.4.1993

collected by Carsten Grammes (ph12hucg@rz.uni-sb.de)
and published regularly on comp.sys.ibm.pc.hardware.


!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
There is explicitly NO WARRANTY
that the given settings are correct or harmless. (I only collect, I do
not check for myself!!!). There is always the possibility that the
settings may destroy your hardware!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!


Since I hope however that only well-minded people undergo the effort of
posting their settings the chance of applicability exists. If you should
agree or disagree with some setting, let me know immediately in order
to update the list.

If you possess a HD not mentioned here of which you know BIOS and/or
jumper settings, please mail them to me for the next update of the list!

Only IDE (AT-Bus) Harddisks will be account

In [64]:
# Most similar document 4
print("ID of most similar document 4:", mas_similares[3])
print()
print("Document content:")
print(newsgroups_train.data[mas_similares[3]])

ID of most similar document 4: 6659

Document content:
I have a 486 machine with a 3.5" A: drive and a 5.25" B: drive.  I
want to swap them so 3.5" drive is A:  What do I have to do?

TIA



In [65]:
# Most similar document 5
print("ID of most similar document 5:", mas_similares[4])
print()
print("Document content:")
print(newsgroups_train.data[mas_similares[4]])

ID of most similar document 5: 6386

Document content:
I was recently loaned an older Dec 210 286 at work, and I have the option
of adding an additional Western Digital Hard-drive to the machine.  The
existing drive is currently a Western Digital as well, and is working fine,
but I do not have any documentation available for configuring the master/slave
relationship necessary for a c: d: drive setup.

The first drive is currently formatted to Tandy Dos v3.3 but I am eventually
going to upgrade both to MS Dos v 5.0

The drives themselves are both model number WD95044-A circa 5-07-1991
They are 782 cyl 4 head drives.  A note to add is that there is no exact
configuration for these in my current bios, but it seems to work at a
setting 17 (977 cyl 5 head, 300 write_pre, 977 landing zone).
There are three pairs of jumper pins on the back that I presume are
for setting up the master/slave.  Originally, the drive in the machine
had none.  Currently, I was suggested to try the far right

Analysis
> The document is labeled "comp.sys.ibm.pc.hardware". On reading it, the text is a question about how to configure a hard drive as a slave in a PC.
> The labels of the similar documents are consistent with the subject of the text, hardware: all five belong to the same class as the original.
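
The visual comparison of labels can also be summarized as a single number: the fraction of nearest neighbors that share the query's class. The sketch below is one possible way to do it; `neighbor_label_agreement` is a hypothetical helper and `toy_labels` is a made-up array standing in for `y_train`.

```python
import numpy as np

def neighbor_label_agreement(labels, query_idx, neighbor_idxs):
    """Fraction of neighbors whose label equals the label of the query.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    labels = np.asarray(labels)
    return float(np.mean(labels[neighbor_idxs] == labels[query_idx]))

# Made-up labels standing in for y_train
toy_labels = np.array([3, 3, 7, 3, 1, 3])

# Query is index 0; three of its four neighbors share label 3
print(neighbor_label_agreement(toy_labels, 0, [1, 2, 3, 5]))
```

For Document 5, `neighbor_label_agreement(y_train, idx_5, mas_similares)` would be 1.0, since the printed classes show that all five neighbors are in `comp.sys.ibm.pc.hardware`.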