progress report

commit 23ec52084ecf2ac4f207c164bf13422c2099bf72 1 parent 03fc56c
@nwoodward nwoodward authored
Showing with 1 addition and 1 deletion.
  1. +1 −1  20111120 Final paper/blog-meme.tex
2  20111120 Final paper/blog-meme.tex
@@ -224,7 +224,7 @@ \subsubsection{Deduplication and HTML Strip}
\end{center}
\end{table}
-Once the vector generation is completed, the deduplication program scans through the vector files, finds duplicate vectors and writes a removal list, which is a list of document ids that should be deleted in the next step. If a vector is shorter than 5, the document id is added to the removal list. The program notes vector files of $\pm 2$ length of the current vector starting with the same character. Document similarity is measured by the following formula:
+Once the vector generation is complete, the deduplication program scans through the vector files, finds duplicate vectors and writes a removal list, which is a list of document ids that should be deleted in the next step. If a vector is shorter than 5, the document id is added to the removal list. The program notes vector files of $\pm 2$ length of the current vector starting with the same character. Document similarity is measured by the following formula:
\begin{displaymath}
dsim=\frac{|A \cap B|} {\min(|A|, |B|)}
\end{displaymath}
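
The formula above can be sketched in a few lines of Python. This is only an illustration, not the deduplication program from the paper: the function name `dsim` is borrowed from the formula, the vectors are assumed to be treated as sets of elements (the source does not say whether repeated elements count), and returning 0.0 for an empty vector is an assumption made to avoid division by zero.

```python
def dsim(a, b):
    """Overlap similarity: |A ∩ B| / min(|A|, |B|)."""
    # Assumption: duplicates inside a vector are ignored (set semantics).
    set_a, set_b = set(a), set(b)
    smaller = min(len(set_a), len(set_b))
    if smaller == 0:
        # Assumption: an empty vector is similar to nothing.
        return 0.0
    return len(set_a & set_b) / smaller
```

Because the denominator is the size of the smaller set, a document that is fully contained in a longer one scores 1.0, which suits duplicate detection where one copy carries extra boilerplate.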