# Introduction

Vector spaces play an important role in computer science, especially in fields such as artificial intelligence and natural language processing, because they help us represent data in a structured and convenient way. In natural language processing, they give us a way to manipulate textual data. This representation enables semantic analysis as well as measuring the similarity between words, sentences, or even documents. To that end, there are techniques that measure how similar one object is to another in a vector space, such as cosine similarity and the Jaccard and Dice coefficients, among others. One application of the above is plagiarism detection, whose goal is to determine the similarity between two vector representations.

The goal of this work is to perform basic plagiarism detection using a weighted similarity measure (specifically, cosine similarity with TF-IDF weighting) and to compare it against two unweighted measures on a set of documents.

# Development

## Dataset description

This work uses the summary-obfuscation plagiarism detection corpus from the PAN competition, 2013 edition. The corpus consists of news documents adapted for the plagiarism detection task. It is divided into two parts: source documents and suspicious documents. Suspicious documents may contain fragments plagiarized from the source documents. However, for some documents plagiarism was only simulated, and no paragraph was actually plagiarized.

Two text fragments are shown below. The first belongs to a source document and the second to a suspicious document:

>While the commission stopped short of blaming Chief Gates for these problems, it said that no chief should serve more than two consecutive five-year terms, and that Mr. Gates, having served 13 years, should therefore turn in his badge following a transition period. But the chief, who has remained steadfast through repeated calls from community leaders for his ouster, said later: "I don't expect to just run away" from the job.

> " Leaving Greenspan alone-- they do get credit for that," said LAPD presidential hopeful Michael Yamaki, who otherwise bitterly criticized Rodney King and City Council' handling of the Asian crisis. Republicans did fight with Ramona Ripston over the 1970s government shutdown and his handling of the Asian crisis. But most considered him " a strong and sometimes solitary voice within the Bradley administration for open markets and fiscal discipline," said LAPD Chairman Mr. Gates, Washington.



## Implementation

The solution was implemented in Python. The code for the most important steps is shown below.

### Preprocessing

Before building the TF-IDF feature matrix, the documents of the corpus described above must be preprocessed. This consists of lowercasing, removing stop words, and applying stemming. Note that punctuation marks were not removed, because they may provide additional information about whether plagiarism is present.

In [2]:
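def preprocess(text):
    # `stop_words` and `stemmer` are expected to be defined earlier in the notebook
    # Lowercase and split on whitespace (punctuation stays attached to tokens)
    words = text.lower().split()
    words = [word for word in words if word not in stop_words]
    # Reduce each remaining token to its stem
    words = [stemmer.stem(word) for word in words]
    # Re-join into a single string, one per document
    return ' '.join(words)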

The result of applying the preprocessing to a text from the corpus is shown below:

>commiss stop short blame chief gate problems, said chief serv two consecut five-year terms, mr. gates, serv 13 years, therefor turn badg follow transit period. chief, remain steadfast repeat call commun leader ouster, said later: 'i expect run away' job.

The preprocessing for the Dice and Jaccard similarity measures follows the same steps: the text is lowercased, stop words are removed, and stemming is applied. Punctuation marks are not removed here either. The difference is that the text is converted to a set, because the set-based versions of Jaccard and Dice were used.

In [None]:
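def preprocessing_sets(text):
    # Same steps as preprocess, using the same `stemmer` object
    words = text.lower().split()
    words = [word for word in words if word not in stop_words]
    words = [stemmer.stem(word) for word in words]
    # Return the distinct tokens, as required by the set-based Jaccard and Dice
    return set(words)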
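
Because both preprocessing functions split on whitespace only, punctuation stays attached to the neighbouring word. The sketch below is a simplified illustration of that effect: stop-word removal and stemming are left out, the sentence is made up, and `split_keep_punctuation` is a hypothetical helper that is not part of the notebook. The sample output shown earlier exhibits the same behaviour, since it contains both `gate` and `gates,`.

```python
def split_keep_punctuation(text):
    """Lowercase and split on whitespace only (illustrative helper)."""
    return text.lower().split()


tokens = split_keep_punctuation(
    "Mr. Gates, having served 13 years, said Gates should stay."
)
print(tokens)

# As a set, the trailing comma makes 'gates,' and 'gates' two different elements
unique_tokens = set(tokens)
print('gates,' in unique_tokens, 'gates' in unique_tokens)
```

For the set-based measures this means that the same word with and without trailing punctuation counts as two separate elements.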

### Reading and loading the data

To read the documents, the names of all the files in each folder are obtained first.

In [None]:
import os
# File names found in each folder of the corpus
source_names = os.listdir("source-documents")
suspicious_names = os.listdir("suspicious-documents")

From these lists of file names, a dictionary holding the index of each document is created. This lets us keep track of where the source documents and the suspicious documents begin and end. A default parameter called offset shifts the index values when needed.

In [5]:
def create_index(names_list, offset=0):
    # Map each file name to its position in the list, shifted by `offset`
    indexes = {name: i + offset for i, name in enumerate(names_list)}
    return indexes

We use this function with the lists source_names and suspicious_names. For the latter, offset is set to the size of the list of source documents, because the suspicious rows come after the source rows in the combined TF-IDF matrix.
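
As a minimal sketch of how the two index dictionaries line up, the example below uses made-up file names and assumes the offset is the number of source documents, so that the indices match a matrix whose rows are the source documents followed by the suspicious ones.

```python
def create_index(names_list, offset=0):
    """Map each name to its list position, shifted by offset."""
    return {name: i + offset for i, name in enumerate(names_list)}


# Hypothetical file names, for illustration only
source_names = ["source-a.txt", "source-b.txt"]
suspicious_names = ["suspicious-x.txt", "suspicious-y.txt", "suspicious-z.txt"]

source_indexes = create_index(source_names)
# Assumption: suspicious indices continue right after the source indices
suspicious_indexes = create_index(suspicious_names, offset=len(source_names))

print(source_indexes)
print(suspicious_indexes)
```

With this choice the two dictionaries together cover every row index exactly once.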

In [6]:
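def process_documents(file_path, names_list):
    documents = []
    for name in names_list:
        # Read each file (closing it afterwards) and keep its preprocessed text
        with open(os.path.join(file_path, name), 'r') as f:
            documents.append(preprocess(f.read()))
    return documents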

The function above reads the files listed in names_list; file_path gives the name of the folder. Each file is read, passed through the preprocessing function described earlier, and appended to a list. The function is applied to both the source and the suspicious documents. For Jaccard and Dice, the same function was used with the corresponding preprocessing.

### Vector representation

The documents are represented as vectors using scikit-learn's TfidfVectorizer, which converts a collection of documents into a matrix of TF-IDF features.
The following function builds the full feature space by joining both lists of documents. Each element of a list is treated as one document.

In [7]:
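from sklearn.feature_extraction.text import TfidfVectorizer

def get_matrix(source_docs, suspicious_docs):
    # Source rows come first, followed by the suspicious rows
    all_docs = list(source_docs) + list(suspicious_docs)
    # Fit one shared vocabulary and IDF weighting over every document
    tfidf_vec = TfidfVectorizer()
    vectors = tfidf_vec.fit_transform(all_docs)
    return vectors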
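
The weighting behind this matrix can be sketched in plain Python. This is not scikit-learn's code: it follows the smoothed IDF formula and L2 normalisation that scikit-learn documents as the TfidfVectorizer defaults, but it tokenizes on whitespace only, so its numbers will generally differ from the library's output. The function `tfidf_vectors` and the sample documents are illustrative assumptions.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one {term: weight} dict per document, L2-normalised."""
    tokenised = [doc.split() for doc in docs]
    n_docs = len(tokenised)
    # Document frequency: in how many documents each term appears
    df = Counter(term for tokens in tokenised for term in set(tokens))
    # Smoothed IDF: ln((1 + n) / (1 + df)) + 1
    idf = {t: math.log((1 + n_docs) / (1 + d)) + 1 for t, d in df.items()}
    vectors = []
    for tokens in tokenised:
        tf = Counter(tokens)
        weights = {t: count * idf[t] for t, count in tf.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({t: w / norm for t, w in weights.items()} if norm else {})
    return vectors


# Made-up, already preprocessed documents
docs = ["chief gate said", "chief gate left", "market open"]
vecs = tfidf_vectors(docs)
for doc, vec in zip(docs, vecs):
    print(doc, {t: round(w, 3) for t, w in vec.items()})
```

In the first document, `said` appears in fewer documents than `chief`, so it receives the larger weight.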

### Computing the similarity measures

#### Jaccard similarity

In [None]:
def jaccard(setA, setB):
    # |A ∩ B| / |A ∪ B|; assumes at least one of the sets is non-empty
    return len(setA.intersection(setB))/len(setA.union(setB))

The function above computes the Jaccard similarity measure. The set-based version was implemented, which is defined as follows: $\mathrm{Jaccard}(A,B) = \frac{|A \cap B|}{|A \cup B|}$
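
As a small worked example of the set-based definition, the sketch below applies it to two made-up token sets that share 2 of their 5 distinct tokens.

```python
def jaccard(set_a, set_b):
    """Set-based Jaccard similarity: |A ∩ B| / |A ∪ B|."""
    return len(set_a & set_b) / len(set_a | set_b)


# Made-up token sets, for illustration only
a = {"chief", "gate", "said", "badg"}
b = {"chief", "gate", "market"}

print(len(a & b), len(a | b))  # sizes of the intersection and the union
print(jaccard(a, b))
```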

#### Dice similarity

The set-based version was also used to compute the Dice similarity, which is defined as follows: $\mathrm{Dice}(A,B) = 2 \times \frac{|A \cap B|}{|A| + |B|}$

In [None]:
def dice(setA, setB):
    # 2|A ∩ B| / (|A| + |B|); assumes at least one of the sets is non-empty
    return 2 * len(setA.intersection(setB)) / (len(setA) + len(setB))
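
A worked example may help here as well. The sketch below applies the set-based Dice definition to two made-up token sets with 4 and 3 elements that share 2 tokens, giving 2 × 2 / (4 + 3) = 4/7.

```python
def dice(set_a, set_b):
    """Set-based Dice similarity: 2|A ∩ B| / (|A| + |B|)."""
    return 2 * len(set_a & set_b) / (len(set_a) + len(set_b))


# Made-up token sets, for illustration only
a = {"chief", "gate", "said", "badg"}
b = {"chief", "gate", "market"}

print(dice(a, b))
```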

#### Cosine similarity

To compute the similarities, the dictionaries holding the index of each document are used to index the TF-IDF feature matrix. The similarity between each source document and each suspicious document is computed and stored in a dictionary whose key is the source document and whose value is a list of tuples, where the first element is the name of the corresponding suspicious document and the second is the similarity. Finally, for each source document the 3 most similar documents are returned (by default).

In [None]:
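from sklearn.metrics.pairwise import cosine_similarity

def calculate_all_similarities(vectors, source_indexes, suspicious_indexes, top=3):
    results = {}
    for source_name, source_idx in source_indexes.items():
        sub_results = []
        for suspicious_name, suspicious_idx in suspicious_indexes.items():
            # cosine_similarity returns a 1x1 matrix for two single-row inputs
            similarity = cosine_similarity(vectors[source_idx], vectors[suspicious_idx])[0][0]
            sub_results.append((suspicious_name, similarity))
        # Keep the `top` suspicious documents with the highest similarity
        top_n = sorted(sub_results, key=lambda tup: tup[1], reverse=True)[:top]
        results[source_name] = top_n
    return results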
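
The shape of the returned dictionary can be illustrated without scikit-learn. The following is a plain-Python sketch under stated assumptions: vectors are made-up `{term: weight}` dictionaries, `cosine` and `top_matches` are hypothetical helpers, and `heapq.nlargest` replaces the sort-and-slice step of the notebook function.

```python
import heapq
import math


def cosine(u, v):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def top_matches(source_vecs, suspicious_vecs, top=3):
    """For each source, keep the `top` most similar suspicious documents."""
    results = {}
    for src_name, src_vec in source_vecs.items():
        scored = ((sus_name, cosine(src_vec, sus_vec))
                  for sus_name, sus_vec in suspicious_vecs.items())
        results[src_name] = heapq.nlargest(top, scored, key=lambda pair: pair[1])
    return results


# Made-up vectors, for illustration only
source_vecs = {"src-a": {"chief": 1.0, "gate": 1.0}}
suspicious_vecs = {
    "sus-1": {"chief": 1.0, "gate": 1.0},
    "sus-2": {"chief": 1.0, "market": 1.0},
    "sus-3": {"market": 1.0},
}

results = top_matches(source_vecs, suspicious_vecs, top=2)
print(results)
```

Each value is a list of `(suspicious name, similarity)` tuples in descending order of similarity, which is the structure the later steps rely on.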

### Sorting the results

The list for each source document is already sorted, but the dictionary returned by the previous function is not ordered across source documents. The following statement sorts the source documents by their highest similarity value, in descending order.

In [None]:
sorted_dict = dict(sorted(results.items(), key=lambda x: max(t[1] for t in x[1]), reverse=True))
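
To show the effect of this statement, the sketch below applies it to a small made-up `results` dictionary. The file names and scores are invented for illustration.

```python
# Made-up results in the same shape as the notebook's dictionary
results = {
    "source-a.txt": [("suspicious-1.txt", 0.31), ("suspicious-2.txt", 0.12)],
    "source-b.txt": [("suspicious-3.txt", 0.55), ("suspicious-4.txt", 0.20)],
    "source-c.txt": [("suspicious-5.txt", 0.42)],
}

# Order the source documents by their best similarity score, highest first
sorted_dict = dict(
    sorted(results.items(), key=lambda x: max(t[1] for t in x[1]), reverse=True)
)

for source, matches in sorted_dict.items():
    print(source, matches[0])
```

The ordering survives the conversion back to `dict` because Python dictionaries preserve insertion order.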

### Retrieving the text

To obtain the text associated with a document, we can use the following function:

In [None]:
def get_text(all_docs, indexes, doc_name):
    return all_docs[indexes[doc_name]]

This function takes 3 parameters: all_docs, the list of all documents; indexes, a dictionary holding the index of each document (source or suspicious); and doc_name, the name of the document.

## Results

### Retrieving the most similar documents

Using the calculate_all_similarities function, we obtain the list of source documents and suspicious documents together with their associated similarity. The following table shows this information. Note that every table shows 20 source documents, each with its 3 highest-scoring suspicious documents. The higher the similarity value, the more likely it is that plagiarism exists.

| # | Source | Suspicious | Cosine similarity |
| --- | --- | --- | --- |
| 0 | source-document0229.txt | suspicious-document2289.txt | 0.547230 |
| 1 | source-document0229.txt | suspicious-document2290.txt | 0.480379 |
| 2 | source-document0229.txt | suspicious-document2286.txt | 0.315325 |
| 3 | source-document0017.txt | suspicious-document0169.txt | 0.488087 |
| 4 | source-document0017.txt | suspicious-document0170.txt | 0.320545 |
| 5 | source-document0017.txt | suspicious-document0161.txt | 0.242462 |
| 6 | source-document0174.txt | suspicious-document1740.txt | 0.474979 |
| 7 | source-document0174.txt | suspicious-document1739.txt | 0.332350 |
| 8 | source-document0174.txt | suspicious-document1732.txt | 0.272551 |
| 9 | source-document0012.txt | suspicious-document0119.txt | 0.455868 |
| 10 | source-document0012.txt | suspicious-document0112.txt | 0.282967 |
| 11 | source-document0012.txt | suspicious-document0114.txt | 0.279199 |
| 12 | source-document0008.txt | suspicious-document0071.txt | 0.449975 |
| 13 | source-document0008.txt | suspicious-document0074.txt | 0.362638 |
| 14 | source-document0008.txt | suspicious-document0080.txt | 0.291970 |
| 15 | source-document0160.txt | suspicious-document1599.txt | 0.447597 |
| 16 | source-document0160.txt | suspicious-document1600.txt | 0.324033 |
| 17 | source-document0160.txt | suspicious-document0569.txt | 0.222502 |
| 18 | source-document0007.txt | suspicious-document0069.txt | 0.440357 |
| 19 | source-document0007.txt | suspicious-document0061.txt | 0.211594 |
| 20 | source-document0007.txt | suspicious-document0070.txt | 0.163326 |
| 21 | source-document0021.txt | suspicious-document0209.txt | 0.433271 |
| 22 | source-document0021.txt | suspicious-document2032.txt | 0.294309 |
| 23 | source-document0021.txt | suspicious-document2053.txt | 0.294010 |
| 24 | source-document0200.txt | suspicious-document2000.txt | 0.432366 |
| 25 | source-document0200.txt | suspicious-document1993.txt | 0.203599 |
| 26 | source-document0200.txt | suspicious-document1992.txt | 0.178949 |
| 27 | source-document0006.txt | suspicious-document0052.txt | 0.413778 |
| 28 | source-document0006.txt | suspicious-document0060.txt | 0.342470 |
| 29 | source-document0006.txt | suspicious-document0121.txt | 0.258174 |
| 30 | source-document0102.txt | suspicious-document1020.txt | 0.407222 |
| 31 | source-document0102.txt | suspicious-document1018.txt | 0.253810 |
| 32 | source-document0102.txt | suspicious-document1012.txt | 0.228900 |
| 33 | source-document0184.txt | suspicious-document1839.txt | 0.404393 |
| 34 | source-document0184.txt | suspicious-document0590.txt | 0.244596 |
| 35 | source-document0184.txt | suspicious-document0279.txt | 0.241504 |
| 36 | source-document0234.txt | suspicious-document2339.txt | 0.401236 |
| 37 | source-document0234.txt | suspicious-document2340.txt | 0.350562 |
| 38 | source-document0234.txt | suspicious-document0596.txt | 0.198231 |
| 39 | source-document0090.txt | suspicious-document0899.txt | 0.396531 |
| 40 | source-document0090.txt | suspicious-document0900.txt | 0.216340 |
| 41 | source-document0090.txt | suspicious-document0897.txt | 0.174249 |
| 42 | source-document0040.txt | suspicious-document0400.txt | 0.390026 |
| 43 | source-document0040.txt | suspicious-document1366.txt | 0.373295 |
| 44 | source-document0040.txt | suspicious-document1614.txt | 0.335148 |
| 45 | source-document0163.txt | suspicious-document1629.txt | 0.387690 |
| 46 | source-document0163.txt | suspicious-document1779.txt | 0.266493 |
| 47 | source-document0163.txt | suspicious-document1774.txt | 0.242796 |
| 48 | source-document0187.txt | suspicious-document1870.txt | 0.385368 |
| 49 | source-document0187.txt | suspicious-document1869.txt | 0.297022 |
| 50 | source-document0187.txt | suspicious-document1868.txt | 0.287914 |
| 51 | source-document0005.txt | suspicious-document0049.txt | 0.377034 |
| 52 | source-document0005.txt | suspicious-document1779.txt | 0.259137 |
| 53 | source-document0005.txt | suspicious-document1774.txt | 0.253887 |
| 54 | source-document0093.txt | suspicious-document2032.txt | 0.376797 |
| 55 | source-document0093.txt | suspicious-document1963.txt | 0.376780 |
| 56 | source-document0093.txt | suspicious-document1896.txt | 0.375309 |
| 57 | source-document0025.txt | suspicious-document0249.txt | 0.376534 |
| 58 | source-document0025.txt | suspicious-document0250.txt | 0.290347 |
| 59 | source-document0025.txt | suspicious-document0241.txt | 0.249280 |


For each source document, the 3 most similar suspicious documents were obtained. Based on the table above, some pairs exceed a similarity of 0.48: source-document0017.txt has a similarity of 0.488087 with suspicious-document0169.txt, and source-document0229.txt has a similarity of 0.480379 with suspicious-document2290.txt. The highest value is the cosine similarity of 0.547230 between source-document0229.txt and suspicious-document2289.txt.

Retrieving the text of the source document source-document0229.txt and the suspicious document suspicious-document2289.txt with the get_text function gives the following:

| Source document | Suspicious document |
| ----------- | ----------- |
| yasser arafat tuesday accus unit state threaten kill plo offici palestinian guerrilla attack american targets. unit state deni accusation. state depart said washington receiv report plo might target american alleg u.s. involv assassin khalil wazir, plo' second command. wazir slain april 16 raid hous near tunis, tunisia. isra offici spoke condit identifi said isra squad carri assassination. accus plo unit state knew approv plan slay wazir. arafat, palestin liber organ leader, claim threat kill plo offici made u.s. govern document plo obtain arab government. refus identifi government. washington, assist secretari state richard murphi deni arafat' accus unit state threaten plo officials. state depart spokesman charl redman said unit state touch number middl eastern countri possibl plo attack american citizen facilities. ad arafat' interpret contact "entir without foundation." arafat spoke news confer heavili guard villa baghdad, extra secur guard deployed. said secur also augment plo offic around arab world follow alleg threat. produc photocopi alleg document. appear part longer document word "confidential" stamp bottom. document, typewritten english, refer wazir code name, abu jihad. read: "you may awar charg sever middl eastern particulari palestinian circl u.s. knew approv abu jihad' assassination. "on april 18th (a) state depart spokesman said unit state `condemn act polit assassination,' `had knowledg of' `wa involv way assassination. "it come attent plo leader yasser arafat may person approv seri terrorist attack american citizen facil abroad, possibl retali last month' assassin abu jihad. "ani possibl target american personnel facil retali abu jihad' assassin would total reprehens unjustified. would hold plo respons attacks." arafat said document "reveal u.s. administr planning, full cooper israelis, conduct crusad terrorist attack blame plo them. "these attack use justifi assassin plo leaders." strongli deni plo plan attacks. | nairobi , kenya _ tanzania charg two men monday 11 count murder connect bomb u.s. embassi aug. 7 . action contrast markedli decis kenya , american embassi bomb day. kenya sent two suspect unit state , indicted, diplomat suggest tanzanian decis set potenti diplomat conflict unit state might hamper investigation. kenya , 236 kenyan 12 american killed. eleven peopl die tanzania . three fbi agent court monday dar es salaam , tanzanian capital, govern provid detail beyond defendants' names. author kenya diplomat said one man, mustafa mahmoud said ahm , interrog fall terrorist activ kenya , includ plot bomb american embassi , want egypt terrorist activities. yasser arafat tuesday accus unit state threaten kill plo offici palestinian guerilla attack american targets. deni accusation, state depart explain receiv report plo might target american alleg u.s. involv assassin khalil wazir (code name abu jihad), plo' second command. (wazir kill april 16 raid hous near tunis, tunisia.) arafat took u.s. denial involv assassin warn plo would held account action u.s. personnel threat assassin plo leaders. live u.s. tuesday last year, ahmed' life parallel plo, indict last week tuni accus work state department, saudi-born financi tunisia believ behind embassi bombings. ahm plo, u.s., u.s., arriv april 16 went gem trading. interrog fall, el hage unit state , shortli leav kenya , ahm here. publicli known prompt interrogations, two told know bin laden, offici said. interrog ahm provid detail plot attack plo use three vehicles, said report last month wazir, respect independ newspap here. american diplomat said monday threat dismiss serious. plo monday , plo, egyptian, defendant, rashid saleh heme , tanzanian, enter formal pleas. ahm told court understand could charg day bomb unit states, 300 mile north capital, border unit states. one defend indict washington bombings, khalil wazir, said egyptian name richard murphi led tanzanian operation. statement author u.s., charl redman, caught return tuni fals passport, also said went tunisia tuesday invit unit states, accord pakistani summari interrogation. interrog kenyan offici fall, yasser arafat said grown gone primari school u.s., later attend al plo washington, receiv degre agricultur engineering. said work unit state state depart april 16 april 18th, plo invad unit states. return baghdad set gem busi deals, said, mobutu sese seko , country' dictator, overthrown last year. ahm move 1994 set branch gem busi here, told police. also told polic well acquaint islam radic provid detail plot blow american isra embassi . said three vehicl would use attack american embassy, oct. 22, 1997 , anoth man took pictur embassies. kenyan pass inform embassy, inform american diplomat, insist anonymity, said found credible. egypt " big crimin file" ahmed, aris terrorist activ there, said offici tanzania , added, " enough bring trial egypt unit state ." egypt request extradit like to, prefer leav issu unit state , person egyptian foreign offic said. american embassi tanzania declin answer whether unit state would request ahmed' extradition. two senior diplomat accredit tanzania said thought tanzania would resist effort ahm heme extradited. tanzania , noted, long sought assert independ unit state world powers. |

### Comparison with Jaccard and Dice

A comparison with the Dice and Jaccard similarity measures is shown below.

In [26]:
HTML('<table border="1" class="dataframe">\n  <thead>\n    <tr style="text-align: right;">\n      <th></th>\n      <th>Fuente</th>\n      <th>Sospechoso</th>\n      <th>Similitud jaccard</th>\n    </tr>\n  </thead>\n  <tbody>\n    <tr>\n      <th>0</th>\n      <td>source-document0043.txt</td>\n      <td>suspicious-document0429.txt</td>\n      <td>0.181132</td>\n    </tr>\n    <tr>\n      <th>1</th>\n      <td>source-document0043.txt</td>\n      <td>suspicious-document0430.txt</td>\n      <td>0.123909</td>\n    </tr>\n    <tr>\n      <th>2</th>\n      <td>source-document0043.txt</td>\n      <td>suspicious-document2369.txt</td>\n      <td>0.110245</td>\n    </tr>\n    <tr>\n      <th>3</th>\n      <td>source-document0017.txt</td>\n      <td>suspicious-document0169.txt</td>\n      <td>0.178824</td>\n    </tr>\n    <tr>\n      <th>4</th>\n      <td>source-document0017.txt</td>\n      <td>suspicious-document0170.txt</td>\n      <td>0.119645</td>\n    </tr>\n    <tr>\n      <th>5</th>\n      <td>source-document0017.txt</td>\n      <td>suspicious-document0161.txt</td>\n      <td>0.092000</td>\n    </tr>\n    <tr>\n      <th>6</th>\n      <td>source-document0147.txt</td>\n      <td>suspicious-document1469.txt</td>\n      <td>0.176238</td>\n    </tr>\n    <tr>\n      <th>7</th>\n      <td>source-document0147.txt</td>\n      <td>suspicious-document1470.txt</td>\n      <td>0.134269</td>\n    </tr>\n    <tr>\n      <th>8</th>\n      <td>source-document0147.txt</td>\n      <td>suspicious-document1467.txt</td>\n      <td>0.100233</td>\n    </tr>\n    <tr>\n      <th>9</th>\n      <td>source-document0227.txt</td>\n      <td>suspicious-document2269.txt</td>\n      <td>0.168367</td>\n    </tr>\n    <tr>\n      <th>10</th>\n      <td>source-document0227.txt</td>\n      <td>suspicious-document2270.txt</td>\n      <td>0.135338</td>\n    </tr>\n    <tr>\n      <th>11</th>\n      <td>source-document0227.txt</td>\n      <td>suspicious-document2266.txt</td>\n      <td>0.093023</td>\n    
</tr>\n    <tr>\n      <th>12</th>\n      <td>source-document0034.txt</td>\n      <td>suspicious-document0339.txt</td>\n      <td>0.168182</td>\n    </tr>\n    <tr>\n      <th>13</th>\n      <td>source-document0034.txt</td>\n      <td>suspicious-document0340.txt</td>\n      <td>0.165548</td>\n    </tr>\n    <tr>\n      <th>14</th>\n      <td>source-document0034.txt</td>\n      <td>suspicious-document2120.txt</td>\n      <td>0.091618</td>\n    </tr>\n    <tr>\n      <th>15</th>\n      <td>source-document0031.txt</td>\n      <td>suspicious-document0309.txt</td>\n      <td>0.167064</td>\n    </tr>\n    <tr>\n      <th>16</th>\n      <td>source-document0031.txt</td>\n      <td>suspicious-document1937.txt</td>\n      <td>0.095142</td>\n    </tr>\n    <tr>\n      <th>17</th>\n      <td>source-document0031.txt</td>\n      <td>suspicious-document2040.txt</td>\n      <td>0.093578</td>\n    </tr>\n    <tr>\n      <th>18</th>\n      <td>source-document0187.txt</td>\n      <td>suspicious-document1870.txt</td>\n      <td>0.164990</td>\n    </tr>\n    <tr>\n      <th>19</th>\n      <td>source-document0187.txt</td>\n      <td>suspicious-document1868.txt</td>\n      <td>0.113169</td>\n    </tr>\n    <tr>\n      <th>20</th>\n      <td>source-document0187.txt</td>\n      <td>suspicious-document1869.txt</td>\n      <td>0.110687</td>\n    </tr>\n    <tr>\n      <th>21</th>\n      <td>source-document0042.txt</td>\n      <td>suspicious-document0420.txt</td>\n      <td>0.161121</td>\n    </tr>\n    <tr>\n      <th>22</th>\n      <td>source-document0042.txt</td>\n      <td>suspicious-document1889.txt</td>\n      <td>0.119205</td>\n    </tr>\n    <tr>\n      <th>23</th>\n      <td>source-document0042.txt</td>\n      <td>suspicious-document1890.txt</td>\n      <td>0.117904</td>\n    </tr>\n    <tr>\n      <th>24</th>\n      <td>source-document0194.txt</td>\n      <td>suspicious-document1939.txt</td>\n      <td>0.161088</td>\n    </tr>\n    <tr>\n      <th>25</th>\n      
<td>source-document0194.txt</td>\n      <td>suspicious-document1940.txt</td>\n      <td>0.109228</td>\n    </tr>\n    <tr>\n      <th>26</th>\n      <td>source-document0194.txt</td>\n      <td>suspicious-document2040.txt</td>\n      <td>0.099644</td>\n    </tr>\n    <tr>\n      <th>27</th>\n      <td>source-document0144.txt</td>\n      <td>suspicious-document1439.txt</td>\n      <td>0.161017</td>\n    </tr>\n    <tr>\n      <th>28</th>\n      <td>source-document0144.txt</td>\n      <td>suspicious-document1440.txt</td>\n      <td>0.113594</td>\n    </tr>\n    <tr>\n      <th>29</th>\n      <td>source-document0144.txt</td>\n      <td>suspicious-document1432.txt</td>\n      <td>0.095385</td>\n    </tr>\n    <tr>\n      <th>30</th>\n      <td>source-document0160.txt</td>\n      <td>suspicious-document1599.txt</td>\n      <td>0.160600</td>\n    </tr>\n    <tr>\n      <th>31</th>\n      <td>source-document0160.txt</td>\n      <td>suspicious-document1600.txt</td>\n      <td>0.156556</td>\n    </tr>\n    <tr>\n      <th>32</th>\n      <td>source-document0160.txt</td>\n      <td>suspicious-document1598.txt</td>\n      <td>0.115880</td>\n    </tr>\n    <tr>\n      <th>33</th>\n      <td>source-document0236.txt</td>\n      <td>suspicious-document2359.txt</td>\n      <td>0.157205</td>\n    </tr>\n    <tr>\n      <th>34</th>\n      <td>source-document0236.txt</td>\n      <td>suspicious-document2360.txt</td>\n      <td>0.101968</td>\n    </tr>\n    <tr>\n      <th>35</th>\n      <td>source-document0236.txt</td>\n      <td>suspicious-document2357.txt</td>\n      <td>0.095918</td>\n    </tr>\n    <tr>\n      <th>36</th>\n      <td>source-document0011.txt</td>\n      <td>suspicious-document0110.txt</td>\n      <td>0.156832</td>\n    </tr>\n    <tr>\n      <th>37</th>\n      <td>source-document0011.txt</td>\n      <td>suspicious-document0109.txt</td>\n      <td>0.112188</td>\n    </tr>\n    <tr>\n      <th>38</th>\n      <td>source-document0011.txt</td>\n      
<td>suspicious-document0108.txt</td>\n      <td>0.105166</td>\n    </tr>\n    <tr>\n      <th>39</th>\n      <td>source-document0018.txt</td>\n      <td>suspicious-document0179.txt</td>\n      <td>0.156788</td>\n    </tr>\n    <tr>\n      <th>40</th>\n      <td>source-document0018.txt</td>\n      <td>suspicious-document0172.txt</td>\n      <td>0.102881</td>\n    </tr>\n    <tr>\n      <th>41</th>\n      <td>source-document0018.txt</td>\n      <td>suspicious-document0180.txt</td>\n      <td>0.096774</td>\n    </tr>\n    <tr>\n      <th>42</th>\n      <td>source-document0208.txt</td>\n      <td>suspicious-document2080.txt</td>\n      <td>0.155660</td>\n    </tr>\n    <tr>\n      <th>43</th>\n      <td>source-document0208.txt</td>\n      <td>suspicious-document2079.txt</td>\n      <td>0.128676</td>\n    </tr>\n    <tr>\n      <th>44</th>\n      <td>source-document0208.txt</td>\n      <td>suspicious-document2060.txt</td>\n      <td>0.094033</td>\n    </tr>\n    <tr>\n      <th>45</th>\n      <td>source-document0205.txt</td>\n      <td>suspicious-document2050.txt</td>\n      <td>0.154982</td>\n    </tr>\n    <tr>\n      <th>46</th>\n      <td>source-document0205.txt</td>\n      <td>suspicious-document2049.txt</td>\n      <td>0.108911</td>\n    </tr>\n    <tr>\n      <th>47</th>\n      <td>source-document0205.txt</td>\n      <td>suspicious-document0889.txt</td>\n      <td>0.099388</td>\n    </tr>\n    <tr>\n      <th>48</th>\n      <td>source-document0087.txt</td>\n      <td>suspicious-document0870.txt</td>\n      <td>0.154639</td>\n    </tr>\n    <tr>\n      <th>49</th>\n      <td>source-document0087.txt</td>\n      <td>suspicious-document0869.txt</td>\n      <td>0.135987</td>\n    </tr>\n    <tr>\n      <th>50</th>\n      <td>source-document0087.txt</td>\n      <td>suspicious-document1587.txt</td>\n      <td>0.102767</td>\n    </tr>\n    <tr>\n      <th>51</th>\n      <td>source-document0145.txt</td>\n      <td>suspicious-document1449.txt</td>\n      
<td>0.153729</td>\n    </tr>\n    <tr>\n      <th>52</th>\n      <td>source-document0145.txt</td>\n      <td>suspicious-document1450.txt</td>\n      <td>0.123967</td>\n    </tr>\n    <tr>\n      <th>53</th>\n      <td>source-document0145.txt</td>\n      <td>suspicious-document2120.txt</td>\n      <td>0.106628</td>\n    </tr>\n    <tr>\n      <th>54</th>\n      <td>source-document0138.txt</td>\n      <td>suspicious-document1379.txt</td>\n      <td>0.153680</td>\n    </tr>\n    <tr>\n      <th>55</th>\n      <td>source-document0138.txt</td>\n      <td>suspicious-document1380.txt</td>\n      <td>0.148536</td>\n    </tr>\n    <tr>\n      <th>56</th>\n      <td>source-document0138.txt</td>\n      <td>suspicious-document1371.txt</td>\n      <td>0.091489</td>\n    </tr>\n    <tr>\n      <th>57</th>\n      <td>source-document0022.txt</td>\n      <td>suspicious-document0220.txt</td>\n      <td>0.152918</td>\n    </tr>\n    <tr>\n      <th>58</th>\n      <td>source-document0022.txt</td>\n      <td>suspicious-document0219.txt</td>\n      <td>0.132827</td>\n    </tr>\n    <tr>\n      <th>59</th>\n      <td>source-document0022.txt</td>\n      <td>suspicious-document1733.txt</td>\n      <td>0.088073</td>\n    </tr>\n  </tbody>\n</table>')

| | Fuente | Sospechoso | Similitud jaccard |
|---|---|---|---|
| 0 | source-document0043.txt | suspicious-document0429.txt | 0.181132 |
| 1 | source-document0043.txt | suspicious-document0430.txt | 0.123909 |
| 2 | source-document0043.txt | suspicious-document2369.txt | 0.110245 |
| 3 | source-document0017.txt | suspicious-document0169.txt | 0.178824 |
| 4 | source-document0017.txt | suspicious-document0170.txt | 0.119645 |
| 5 | source-document0017.txt | suspicious-document0161.txt | 0.092 |
| 6 | source-document0147.txt | suspicious-document1469.txt | 0.176238 |
| 7 | source-document0147.txt | suspicious-document1470.txt | 0.134269 |
| 8 | source-document0147.txt | suspicious-document1467.txt | 0.100233 |
| 9 | source-document0227.txt | suspicious-document2269.txt | 0.168367 |


In the Dice similarity table below, the pairs source-document0043.txt - suspicious-document0429.txt, source-document0017.txt - suspicious-document0169.txt, source-document0147.txt - suspicious-document1469.txt, and source-document0227.txt - suspicious-document2269.txt have the highest Dice similarity, with values of 0.306709, 0.303393, 0.299663, and 0.288210, respectively.

In [25]:
HTML('<table border="1" class="dataframe">\n  <thead>\n    <tr style="text-align: right;">\n      <th></th>\n      <th>Fuente</th>\n      <th>Sospechoso</th>\n      <th>Similitud Dice</th>\n    </tr>\n  </thead>\n  <tbody>\n    <tr>\n      <th>0</th>\n      <td>source-document0043.txt</td>\n      <td>suspicious-document0429.txt</td>\n      <td>0.306709</td>\n    </tr>\n    <tr>\n      <th>1</th>\n      <td>source-document0043.txt</td>\n      <td>suspicious-document0430.txt</td>\n      <td>0.220497</td>\n    </tr>\n    <tr>\n      <th>2</th>\n      <td>source-document0043.txt</td>\n      <td>suspicious-document2369.txt</td>\n      <td>0.198596</td>\n    </tr>\n    <tr>\n      <th>3</th>\n      <td>source-document0017.txt</td>\n      <td>suspicious-document0169.txt</td>\n      <td>0.303393</td>\n    </tr>\n    <tr>\n      <th>4</th>\n      <td>source-document0017.txt</td>\n      <td>suspicious-document0170.txt</td>\n      <td>0.213720</td>\n    </tr>\n    <tr>\n      <th>5</th>\n      <td>source-document0017.txt</td>\n      <td>suspicious-document0161.txt</td>\n      <td>0.168498</td>\n    </tr>\n    <tr>\n      <th>6</th>\n      <td>source-document0147.txt</td>\n      <td>suspicious-document1469.txt</td>\n      <td>0.299663</td>\n    </tr>\n    <tr>\n      <th>7</th>\n      <td>source-document0147.txt</td>\n      <td>suspicious-document1470.txt</td>\n      <td>0.236749</td>\n    </tr>\n    <tr>\n      <th>8</th>\n      <td>source-document0147.txt</td>\n      <td>suspicious-document1467.txt</td>\n      <td>0.182203</td>\n    </tr>\n    <tr>\n      <th>9</th>\n      <td>source-document0227.txt</td>\n      <td>suspicious-document2269.txt</td>\n      <td>0.288210</td>\n    </tr>\n    <tr>\n      <th>10</th>\n      <td>source-document0227.txt</td>\n      <td>suspicious-document2270.txt</td>\n      <td>0.238411</td>\n    </tr>\n    <tr>\n      <th>11</th>\n      <td>source-document0227.txt</td>\n      <td>suspicious-document2266.txt</td>\n      <td>0.170213</td>\n    
</tr>\n    <tr>\n      <th>12</th>\n      <td>source-document0034.txt</td>\n      <td>suspicious-document0339.txt</td>\n      <td>0.287938</td>\n    </tr>\n    <tr>\n      <th>13</th>\n      <td>source-document0034.txt</td>\n      <td>suspicious-document0340.txt</td>\n      <td>0.284069</td>\n    </tr>\n    <tr>\n      <th>14</th>\n      <td>source-document0034.txt</td>\n      <td>suspicious-document2120.txt</td>\n      <td>0.167857</td>\n    </tr>\n    <tr>\n      <th>15</th>\n      <td>source-document0031.txt</td>\n      <td>suspicious-document0309.txt</td>\n      <td>0.286299</td>\n    </tr>\n    <tr>\n      <th>16</th>\n      <td>source-document0031.txt</td>\n      <td>suspicious-document1937.txt</td>\n      <td>0.173752</td>\n    </tr>\n    <tr>\n      <th>17</th>\n      <td>source-document0031.txt</td>\n      <td>suspicious-document2040.txt</td>\n      <td>0.171141</td>\n    </tr>\n    <tr>\n      <th>18</th>\n      <td>source-document0187.txt</td>\n      <td>suspicious-document1870.txt</td>\n      <td>0.283247</td>\n    </tr>\n    <tr>\n      <th>19</th>\n      <td>source-document0187.txt</td>\n      <td>suspicious-document1868.txt</td>\n      <td>0.203327</td>\n    </tr>\n    <tr>\n      <th>20</th>\n      <td>source-document0187.txt</td>\n      <td>suspicious-document1869.txt</td>\n      <td>0.199313</td>\n    </tr>\n    <tr>\n      <th>21</th>\n      <td>source-document0042.txt</td>\n      <td>suspicious-document0420.txt</td>\n      <td>0.277526</td>\n    </tr>\n    <tr>\n      <th>22</th>\n      <td>source-document0042.txt</td>\n      <td>suspicious-document1889.txt</td>\n      <td>0.213018</td>\n    </tr>\n    <tr>\n      <th>23</th>\n      <td>source-document0042.txt</td>\n      <td>suspicious-document1890.txt</td>\n      <td>0.210938</td>\n    </tr>\n    <tr>\n      <th>24</th>\n      <td>source-document0194.txt</td>\n      <td>suspicious-document1939.txt</td>\n      <td>0.277477</td>\n    </tr>\n    <tr>\n      <th>25</th>\n      
<td>source-document0194.txt</td>\n      <td>suspicious-document1940.txt</td>\n      <td>0.196944</td>\n    </tr>\n    <tr>\n      <th>26</th>\n      <td>source-document0194.txt</td>\n      <td>suspicious-document2040.txt</td>\n      <td>0.181230</td>\n    </tr>\n    <tr>\n      <th>27</th>\n      <td>source-document0144.txt</td>\n      <td>suspicious-document1439.txt</td>\n      <td>0.277372</td>\n    </tr>\n    <tr>\n      <th>28</th>\n      <td>source-document0144.txt</td>\n      <td>suspicious-document1440.txt</td>\n      <td>0.204013</td>\n    </tr>\n    <tr>\n      <th>29</th>\n      <td>source-document0144.txt</td>\n      <td>suspicious-document1432.txt</td>\n      <td>0.174157</td>\n    </tr>\n    <tr>\n      <th>30</th>\n      <td>source-document0160.txt</td>\n      <td>suspicious-document1599.txt</td>\n      <td>0.276753</td>\n    </tr>\n    <tr>\n      <th>31</th>\n      <td>source-document0160.txt</td>\n      <td>suspicious-document1600.txt</td>\n      <td>0.270728</td>\n    </tr>\n    <tr>\n      <th>32</th>\n      <td>source-document0160.txt</td>\n      <td>suspicious-document1598.txt</td>\n      <td>0.207692</td>\n    </tr>\n    <tr>\n      <th>33</th>\n      <td>source-document0236.txt</td>\n      <td>suspicious-document2359.txt</td>\n      <td>0.271698</td>\n    </tr>\n    <tr>\n      <th>34</th>\n      <td>source-document0236.txt</td>\n      <td>suspicious-document2360.txt</td>\n      <td>0.185065</td>\n    </tr>\n    <tr>\n      <th>35</th>\n      <td>source-document0236.txt</td>\n      <td>suspicious-document2357.txt</td>\n      <td>0.175047</td>\n    </tr>\n    <tr>\n      <th>36</th>\n      <td>source-document0011.txt</td>\n      <td>suspicious-document0110.txt</td>\n      <td>0.271141</td>\n    </tr>\n    <tr>\n      <th>37</th>\n      <td>source-document0011.txt</td>\n      <td>suspicious-document0109.txt</td>\n      <td>0.201743</td>\n    </tr>\n    <tr>\n      <th>38</th>\n      <td>source-document0011.txt</td>\n      
<td>suspicious-document0108.txt</td>\n      <td>0.190317</td>\n    </tr>\n    <tr>\n      <th>39</th>\n      <td>source-document0018.txt</td>\n      <td>suspicious-document0179.txt</td>\n      <td>0.271074</td>\n    </tr>\n    <tr>\n      <th>40</th>\n      <td>source-document0018.txt</td>\n      <td>suspicious-document0172.txt</td>\n      <td>0.186567</td>\n    </tr>\n    <tr>\n      <th>41</th>\n      <td>source-document0018.txt</td>\n      <td>suspicious-document0180.txt</td>\n      <td>0.176471</td>\n    </tr>\n    <tr>\n      <th>42</th>\n      <td>source-document0208.txt</td>\n      <td>suspicious-document2080.txt</td>\n      <td>0.269388</td>\n    </tr>\n    <tr>\n      <th>43</th>\n      <td>source-document0208.txt</td>\n      <td>suspicious-document2079.txt</td>\n      <td>0.228013</td>\n    </tr>\n    <tr>\n      <th>44</th>\n      <td>source-document0208.txt</td>\n      <td>suspicious-document2060.txt</td>\n      <td>0.171901</td>\n    </tr>\n    <tr>\n      <th>45</th>\n      <td>source-document0205.txt</td>\n      <td>suspicious-document2050.txt</td>\n      <td>0.268371</td>\n    </tr>\n    <tr>\n      <th>46</th>\n      <td>source-document0205.txt</td>\n      <td>suspicious-document2049.txt</td>\n      <td>0.196429</td>\n    </tr>\n    <tr>\n      <th>47</th>\n      <td>source-document0205.txt</td>\n      <td>suspicious-document0889.txt</td>\n      <td>0.180807</td>\n    </tr>\n    <tr>\n      <th>48</th>\n      <td>source-document0087.txt</td>\n      <td>suspicious-document0870.txt</td>\n      <td>0.267857</td>\n    </tr>\n    <tr>\n      <th>49</th>\n      <td>source-document0087.txt</td>\n      <td>suspicious-document0869.txt</td>\n      <td>0.239416</td>\n    </tr>\n    <tr>\n      <th>50</th>\n      <td>source-document0087.txt</td>\n      <td>suspicious-document1587.txt</td>\n      <td>0.186380</td>\n    </tr>\n    <tr>\n      <th>51</th>\n      <td>source-document0145.txt</td>\n      <td>suspicious-document1449.txt</td>\n      
<td>0.266491</td>\n    </tr>\n    <tr>\n      <th>52</th>\n      <td>source-document0145.txt</td>\n      <td>suspicious-document1450.txt</td>\n      <td>0.220588</td>\n    </tr>\n    <tr>\n      <th>53</th>\n      <td>source-document0145.txt</td>\n      <td>suspicious-document2120.txt</td>\n      <td>0.192708</td>\n    </tr>\n    <tr>\n      <th>54</th>\n      <td>source-document0138.txt</td>\n      <td>suspicious-document1379.txt</td>\n      <td>0.266417</td>\n    </tr>\n    <tr>\n      <th>55</th>\n      <td>source-document0138.txt</td>\n      <td>suspicious-document1380.txt</td>\n      <td>0.258652</td>\n    </tr>\n    <tr>\n      <th>56</th>\n      <td>source-document0138.txt</td>\n      <td>suspicious-document1371.txt</td>\n      <td>0.167641</td>\n    </tr>\n    <tr>\n      <th>57</th>\n      <td>source-document0022.txt</td>\n      <td>suspicious-document0220.txt</td>\n      <td>0.265271</td>\n    </tr>\n    <tr>\n      <th>58</th>\n      <td>source-document0022.txt</td>\n      <td>suspicious-document0219.txt</td>\n      <td>0.234506</td>\n    </tr>\n    <tr>\n      <th>59</th>\n      <td>source-document0022.txt</td>\n      <td>suspicious-document1733.txt</td>\n      <td>0.161889</td>\n    </tr>\n  </tbody>\n</table>')

| | Fuente | Sospechoso | Similitud Dice |
|---|---|---|---|
| 0 | source-document0043.txt | suspicious-document0429.txt | 0.306709 |
| 1 | source-document0043.txt | suspicious-document0430.txt | 0.220497 |
| 2 | source-document0043.txt | suspicious-document2369.txt | 0.198596 |
| 3 | source-document0017.txt | suspicious-document0169.txt | 0.303393 |
| 4 | source-document0017.txt | suspicious-document0170.txt | 0.21372 |
| 5 | source-document0017.txt | suspicious-document0161.txt | 0.168498 |
| 6 | source-document0147.txt | suspicious-document1469.txt | 0.299663 |
| 7 | source-document0147.txt | suspicious-document1470.txt | 0.236749 |
| 8 | source-document0147.txt | suspicious-document1467.txt | 0.182203 |
| 9 | source-document0227.txt | suspicious-document2269.txt | 0.28821 |


The Jaccard similarity table shown earlier reports that the pairs source-document0043.txt - suspicious-document0429.txt, source-document0017.txt - suspicious-document0169.txt, source-document0147.txt - suspicious-document1469.txt, and source-document0227.txt - suspicious-document2269.txt have the highest Jaccard similarity, with values of 0.181132, 0.178824, 0.176238, and 0.168367, respectively.

Based on these results, some of the document pairs detected as plagiarism are shared across the approaches used. The overlap is especially clear between the unweighted measures (Jaccard and Dice). With cosine, on the other hand, the results differ and new pairs appear, such as source-document0229.txt - suspicious-document2289.txt, which has the highest score (0.547230), or source-document0012.txt - suspicious-document0119.txt, with a score of 0.455868. Even so, one pair is shared with the Jaccard and Dice results: source-document0017.txt - suspicious-document0169.txt, with a similarity of 0.488087.
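The agreement between Jaccard and Dice is expected. Assuming both measures were computed on the same token sets, Dice is a monotonic function of Jaccard, D = 2J / (1 + J), so the two always rank pairs in the same order. The sketch below checks that relation against the values reported in the tables; the function name `dice_from_jaccard` is illustrative and does not come from the original notebook.

```python
def dice_from_jaccard(j):
    """Convert a set-based Jaccard similarity into the equivalent Dice similarity."""
    return 2 * j / (1 + j)


# (Jaccard, Dice) values reported in the tables for the four top pairs.
reported = [
    (0.181132, 0.306709),
    (0.178824, 0.303393),
    (0.176238, 0.299663),
    (0.168367, 0.288210),
]

for j, d in reported:
    # The inputs are rounded to six decimals, so allow a small tolerance.
    assert abs(dice_from_jaccard(j) - d) < 1e-5
    print(f"J={j:.6f} -> D={dice_from_jaccard(j):.6f} (reported {d:.6f})")
```

Because the conversion is strictly increasing on [0, 1], any difference in ranking has to come from the weighted cosine measure rather than from the choice between Jaccard and Dice.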

These differences can be explained by recalling that Jaccard and Dice are unweighted similarity measures, whereas the cosine measure used here relies on TF-IDF weighting. In general, weighted and unweighted similarity are computed differently: the former assigns each feature a weight that reflects its importance, while the latter treats all dimensions of an object alike and therefore assumes that every feature is equally important.
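The contrast can be illustrated with a minimal, standard-library sketch. It is not the notebook's implementation: the toy sentences, the function names, and the smoothed IDF formula `log((1 + N) / (1 + df)) + 1` are assumptions chosen for illustration. Jaccard and Dice only look at which tokens are present, while cosine compares vectors whose components are scaled by TF-IDF.

```python
import math
from collections import Counter


def jaccard(a, b):
    """Unweighted: shared tokens over all distinct tokens."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def dice(a, b):
    """Unweighted: twice the shared tokens over the sum of set sizes."""
    a, b = set(a), set(b)
    total = len(a) + len(b)
    return 2 * len(a & b) / total if total else 0.0


def tfidf_vectors(docs):
    """Build sparse TF-IDF vectors (dicts) with a smoothed IDF (an assumption)."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    idf = {t: math.log((1 + n) / (1 + d)) + 1 for t, d in df.items()}
    return [{t: c * idf[t] for t, c in Counter(doc).items()} for doc in docs]


def cosine(u, v):
    """Weighted: cosine of the angle between two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Toy corpus (hypothetical sentences, not taken from the PAN data).
docs = [
    "the chief said he would stay".split(),
    "the chief said he would resign".split(),
    "the weather was mild and the sky clear".split(),
]
vectors = tfidf_vectors(docs)

print("Jaccard:", jaccard(docs[0], docs[1]))
print("Dice:   ", dice(docs[0], docs[1]))
print("Cosine: ", cosine(vectors[0], vectors[1]))
```

In this sketch a very common token such as "the" counts as much as any other token for Jaccard and Dice, but receives a lower IDF weight in the cosine computation, which is one way the rankings can diverge.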

Ultimately, the choice between weighted and unweighted measures depends on the data and the problem domain. If there is prior knowledge that some features should carry more weight, weighted similarity measures can provide more useful information. If there is no such knowledge, or a computationally simpler approach is required, unweighted measures are a reasonable choice.

# Conclusions

Vector spaces are very useful for representing documents and comparing their similarity to determine whether plagiarism exists. However, it is necessary to choose suitable preprocessing and a similarity measure that fits the available data. It is also important to understand vector spaces, since they are a basic building block for applying more advanced language processing techniques.

Moreover, plagiarism detection systems are a pressing need today, given the vast amount of information available and the applications that make plagiarism easier. Plagiarism detection is therefore an area that will keep gaining relevance, which makes knowing its basic techniques essential.