progress report

commit 9bf583dd34ecbcaaba7ef1f4aabdcff92365313d 1 parent 9eaab44
@nwoodward nwoodward authored
Showing with 5 additions and 5 deletions.
  1. +5 −5 20111120 Final paper/blog-meme.tex
@@ -394,7 +394,7 @@ \subsection{Yahoo! LDA}
We were able to resolve the first two issues by modifying several lines of code in the script files, but we could not resolve the last problem. According to the author of the Yahoo! LDA implementation\footnote{Personal communication.}, the lost-connection problem is caused by Hadoop. The Yahoo! LDA implementation supports checkpointing and restart mechanisms, but when we attempted a restart, the environment reverted to its state after the first iteration of LDA. The primary difficulty was the large scale of our collection (4 million memes): many iterations were required to produce a meaningful topic collection. On average, each iteration ran for seven minutes on a single machine and 1.5 minutes on five machines. The default number of iterations in the Yahoo! LDA implementation is 1,000.
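The iteration timings above imply a substantial total runtime at the default iteration count. A back-of-envelope estimate, using only the figures reported in the text:

```python
# Back-of-envelope runtime estimate from the reported timings:
# 7 min/iteration on one machine, 1.5 min/iteration on five machines,
# and the Yahoo! LDA default of 1,000 iterations.
ITERATIONS = 1000

single_machine_hours = ITERATIONS * 7 / 60    # ~116.7 hours (nearly 5 days)
five_machine_hours = ITERATIONS * 1.5 / 60    # 25 hours

print(f"1 machine:  {single_machine_hours:.1f} h")
print(f"5 machines: {five_machine_hours:.1f} h")
```

This illustrates why a lost Hadoop connection partway through a run, with no working restart, was so costly at this scale.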
\subsection{Mahout Canopy Clustering}
-One of the most common and efficient clustering algorithm is $k$-means clustering. The main drawbacks of $k$-means clustering is that the number of clusters $k$ must be specified at runtime, instead of determined dynamically. As such, $k$-means clustering is very sensitive to the initial seed clusters. Unlike $k$-means, canopy clustering generates clusters based on a distance measure. Primarily for this reason we chose to estimate a suitable number of clusters and initial seed clusters using the canopy clustering approach \cite{McCallum2000}. We used the canopy implementation from the Apache Mahout libraries with the Jaccardi distance measure.
+One of the most common and efficient clustering algorithms is $k$-means clustering. The main drawback of $k$-means clustering is that the number of clusters $k$ must be specified at runtime rather than determined dynamically; as a result, $k$-means clustering is very sensitive to the initial seed clusters. Unlike $k$-means, canopy clustering generates clusters based solely on a distance measure. Primarily for this reason, we chose to estimate a suitable number of clusters and initial seed clusters using the canopy clustering approach \cite{McCallum2000}. We used the canopy implementation from the Apache Mahout libraries with the Jaccard distance measure.
Canopy clustering receives two distance parameters $t_1$ and $t_2$ with $t_1 > t_2$. It starts with an empty canopy list. For each point, if its distance to a canopy in the list is less than $t_1$, the point is added to that canopy; the distance to a canopy is the distance to the centroid of the vectors it contains. If no canopy is within distance $t_2$, meaning the point is not strongly bound to any existing canopy, a new canopy is created from that point. This is repeated for every point.
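The single-pass procedure above can be sketched as follows. This is a minimal illustration, not Mahout's implementation: for simplicity it represents each point as a token set, uses each canopy's first point as a stand-in for its centroid, and uses Jaccard distance; the function names are hypothetical.

```python
def jaccard_distance(a: set, b: set) -> float:
    """1 - |A ∩ B| / |A ∪ B|; distance 0 means identical token sets."""
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)

def canopy_cluster(points, t1, t2, distance=jaccard_distance):
    """Single-pass canopy clustering as described in the text.

    Requires t1 > t2. A point joins every canopy within distance t1
    of the canopy's center (here, its first point as a stand-in for
    the centroid); if no canopy is within t2, the point seeds a new
    canopy.
    """
    assert t1 > t2
    canopies = []  # list of (center, members) pairs
    for p in points:
        strongly_bound = False
        for center, members in canopies:
            d = distance(p, center)
            if d < t1:
                members.append(p)   # loosely bound: join this canopy
            if d < t2:
                strongly_bound = True
        if not strongly_bound:
            canopies.append((p, [p]))  # seed a new canopy
    return canopies
```

Because $t_1 > t_2$, a point can belong to several overlapping canopies; the number of canopies produced then serves as the estimate of $k$ and their centers as seeds for $k$-means.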
@@ -417,13 +417,13 @@ \subsection{Indexed MapReduce Canopy}
\begin{displaymath}
RBW(A, B) = \frac{|A \cap B|}{(|A| + |B|) / 2}
\end{displaymath}
-Table \ref{table:sim} shows that Jaccardi's coefficient is much more sensitive to the length difference between a meme and a centroid. By dampening the sensitivity to the length difference, the RBW coefficient is more sensitive to the intersection of their terms.
+Table \ref{table:sim} shows that Jaccard's coefficient is much more sensitive to the length difference between a meme and a centroid. By dampening the sensitivity to the length difference, the RBW coefficient is more sensitive to the intersection of their terms.
\begin{table}[h!t!]
\begin{center}
\begin{tabular}{p{3.0cm}|p{2.2cm}|r|r}
\hline
-\textbf{Meme}&\textbf{Centroid}&\textbf{Jaccardi}&\textbf{RBW}\\
+\textbf{Meme}&\textbf{Centroid}&\textbf{Jaccard}&\textbf{RBW}\\
\hline
weapons of mass destruction programs&&1.00&1.00\\
\hline
@@ -434,7 +434,7 @@ \subsection{Indexed MapReduce Canopy}
weapons of mass destruction and ballistic missiles&weapons mass destruction programs iraq fear&0.37&0.50\\
\hline
\end{tabular}
-\caption{Comparison between Jaccardi's coefficient and the proposed RBW coefficient. Jaccardi's coefficient drops quickly as the centroid gets longer. In contrast, the RBW coefficient is less sensitive to the length difference and more sensitive to the intersection of words.}
+\caption{Comparison between Jaccard's coefficient and the proposed RBW coefficient. Jaccard's coefficient drops quickly as the centroid gets longer. In contrast, the RBW coefficient is less sensitive to the length difference and more sensitive to the intersection of words.}
\label{table:sim}
\end{center}
\end{table}
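The two coefficients compared in the table can be sketched directly from their definitions. The example sets below are illustrative (the paper's exact tokenization and stop-word handling are not specified here), and the function names are hypothetical.

```python
def jaccard(a: set, b: set) -> float:
    """|A ∩ B| / |A ∪ B| -- penalizes length differences heavily."""
    return len(a & b) / len(a | b) if (a or b) else 1.0

def rbw(a: set, b: set) -> float:
    """|A ∩ B| / ((|A| + |B|) / 2) -- dampens the length-difference penalty."""
    return len(a & b) / ((len(a) + len(b)) / 2) if (a or b) else 1.0

meme = {"weapons", "mass", "destruction"}
centroid = {"weapons", "mass", "destruction", "programs"}
print(jaccard(meme, centroid))  # 3/4 = 0.75
print(rbw(meme, centroid))      # 3/3.5 ≈ 0.857
```

As the centroid grows while the intersection stays fixed, the Jaccard denominator grows by the full number of new terms, while the RBW denominator grows by only half of it, which is why RBW decays more slowly in the table.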
@@ -709,7 +709,7 @@ \subsubsection{Ranking with Pairwise Comparison}
\end{figure*}
\section{Conclusion}
-The issues surrounding information retrieval in the blogosphere are not likely to disappear anytime soon. With billions of blog posts each year, approaches aimed at deriving high quality information from blogs will require large-scale solutions that account for the particularities of the medium. In our project, we utilized Hadoop MapReduce, crowdsourced rankings, and machine learning to find informative memes from a large-scale blog collection. While the previous body of literature focused on extracting named entities and abstract topics from blogs, we attempted to extract meaningful chunks of information known as memes and evaluate them using machine learning technology powered by human intelligence gathered from an Amazon Mechanical Turk. In the process, we successfully completed massive text preprocessing to extract memes and devised a novel indexed approach for canopy clustering.
+The interest surrounding information retrieval in the blogosphere is not likely to disappear any time soon. With billions of blog posts each year, approaches aimed at deriving high-quality information from blogs will require large-scale solutions that account for the particularities of the medium. In our project, we utilized Hadoop MapReduce, crowdsourced rankings, and machine learning to find informative memes from a large-scale blog collection. While the previous body of literature focused on extracting named entities and abstract topics from blogs, we attempted to extract meaningful chunks of information known as memes and to evaluate them using machine learning powered by human intelligence gathered from Amazon Mechanical Turk. In the process, we successfully completed massive text preprocessing to extract memes and devised a novel indexed approach for canopy clustering.
We used the data from Mechanical Turk to create a feature set for training two types of support vector machines. The quality of this task depends crucially not only on a rigorous approach to feature engineering but also on using the appropriate machine learning framework. Our classification and regression approaches did not work as well as we had hoped. However, with better modeling, a larger training set, and improved feature engineering, we are confident that the quality of our results will improve. The first step will be evaluating memes in context with our Meme Browser to build a better training set. Additionally, time series analysis will improve our representation of memes in the blogosphere by allowing us to demonstrate how they shift over time.
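The two SVM setups mentioned above (a classifier over binarized judgments and a regressor over raw scores) could be sketched with scikit-learn; this is a hypothetical minimal pipeline on synthetic data, not the paper's actual features or framework, which are not specified here.

```python
# Hypothetical sketch of the two SVM views of crowd-derived labels,
# using scikit-learn on synthetic stand-in data.
import numpy as np
from sklearn.svm import SVC, SVR

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))         # stand-in feature vectors, one per meme
quality = X[:, 0] + 0.5 * X[:, 1]     # stand-in crowd quality score
y_class = (quality > 0).astype(int)   # binarized: informative vs. not

clf = SVC(kernel="rbf").fit(X, y_class)   # classification view
reg = SVR(kernel="rbf").fit(X, quality)   # regression view (raw scores)

print("classifier accuracy:", clf.score(X, y_class))
print("regressor R^2:", reg.score(X, quality))
```

The design choice between the two is whether Turker judgments are treated as discrete labels or as a continuous quality signal; training both lets the two framings be compared on the same feature set.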