\section{Personalized Search}
\label{sec:search}

Personalized search means adapting the results of a search engine to each individual user.
As we shall see, this field has a lot in common with recommender systems.
In both situations, we wish to predict how relevant each item will be to each user.
Before delving into the techniques of personalizing search results, we will quickly present
the basics of \emph{information retrieval} (IR).

\subsection{Information Retrieval}
\label{sec:ir}

\citet[p.1]{Manning2008} define IR as \emph{``finding material (usually documents) of
an unstructured nature (usually text) that satisfies an information need
from within large collections (usually stored on computers)''}.

How does this relate to recommender systems? There is an important distinction.
The purpose of \emph{recommendations} is twofold: (1) to show the user items
similar to another item, and (2) to allow discovery of relevant items the user did not know existed.
The purpose of \emph{search} is a bit different: to allow the user to find the location of
information he or she knows (or hopes) exists.
The defining separator is often the knowledge of existence.
However, as we shall see, the two fields employ similar methods and terminology.
In the next chapter, we will show how these can work together,
by allowing an IR method to constrain the item-space worked on by our recommender system.

\citet[p.23]{Baeza-Yates1999} present a formal definition of an IR system:
$\mathrm{IR} = (Documents, Queries, Framework, ranking(q_i, d_i))$.

As is evident from the scope of IR literature, these elements can be just about anything
that has to do with retrieving information. However, in what is often called
\emph{classic IR}, the documents contain free text with little internal structure,
and the queries are short user-initiated descriptions of an \emph{information need} \citep[p.19]{Baeza-Yates1999}.
For example, this model describes Web search engines, where the documents are web pages and
queries are short sentences or keywords input by users.

The \emph{Framework} in our quadruple refers to how documents are stored and retrieved.
Basic approaches to IR split each document into a set of terms (e.g. words),
and create an \emph{inverted index} \cite[p.22]{Manning2008}
that lists documents by term.
There are numerous extensions to this framework, including:

\begin{itemize*}
\item Positional information for phrase search \cite[p.39]{Manning2008}
\item Stop word removal (removing the most common terms) \cite[p.27]{Manning2008}
\item Stemming (reducing words to their root forms) \cite[p.32]{Manning2008}
\item Lemmatisation (contextual inflection removal) \cite[p.32]{Manning2008}
\item Query reformulation \citep[p.117]{Baeza-Yates1999}
\end{itemize*}
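To make the basic framework concrete, here is a minimal Python sketch of a positional inverted index, an illustrative toy rather than any cited system's implementation. It lowercases terms, drops a tiny hand-picked stop-word list, and records term positions so that phrase search remains possible; stemming and lemmatisation are omitted.

```python
from collections import defaultdict

# Hand-picked stop words stand in for a real stop list.
STOP_WORDS = {"the", "a", "of", "in", "and"}

def build_index(docs):
    """Build a positional inverted index: term -> {doc_id: [positions]}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            if term in STOP_WORDS:
                continue  # stop word removal
            index[term].setdefault(doc_id, []).append(pos)
    return index

# Two toy documents; "model" occurs in both and can be looked up directly.
docs = {
    1: "the vector space model",
    2: "a probabilistic model of retrieval",
}
index = build_index(docs)
```

A query term is then resolved in one dictionary lookup, and the stored positions let a phrase query check adjacency without rescanning the documents.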

All these techniques help improve (among other things)
the \emph{recall} and \emph{precision} of the retrieval engine.
Recall, precision and relevance are well defined measures for evaluating the quality of a search engine \cite[p.5]{Manning2008}:

\begin{itemize*}
\item A document is \emph{relevant} if it satisfies the user's information need.
\item \emph{Recall} is the fraction of relevant documents retrieved by the system.
\item \emph{Precision} is the fraction of retrieved documents that are relevant.
\end{itemize*}

There are many more measures, but recall and precision succinctly define what a search engine must do
to be successful --- retrieve many relevant documents and few irrelevant documents.
Failing this test is to neglect the main purpose of IR:
to prevent information overload by allowing efficient access
to relevant parts of an otherwise overwhelming collection of information.
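As a small worked example of the two measures (the document IDs below are invented for illustration):

```python
def precision_recall(retrieved, relevant):
    """Precision: fraction of retrieved docs that are relevant.
    Recall: fraction of relevant docs that were retrieved."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# 3 of the 4 retrieved documents are relevant, and the system found
# 3 of the 6 relevant documents in the collection.
p, r = precision_recall(retrieved=[1, 2, 3, 4], relevant=[2, 3, 4, 5, 6, 7])
```

Here precision is 3/4 and recall is 3/6, illustrating the usual tension: retrieving more documents tends to raise recall while lowering precision.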

In relation to this thesis, the most interesting part of any IR system is its \emph{ranking function}.
This function computes the score of each document relative to the current query.
The relation to recommender systems is quite clear. Both the ranking function and RSs
compute the relevance of items in the current context, either based on a query or the current user.
Indeed, IR systems use many of the same metrics to measure the similarity of queries and documents
as RSs use to measure the similarity of items.

A common framework for storing and ranking documents is the vector space model (VSM).
This model stores documents as term vectors. Each term represents a dimension, and documents are
vectors in this term-space. When performing a query, the query terms are also represented as a vector
in the same space. By computing the cosine similarity between the query and each document,
we get a good estimate of how well a document matches a query \citep[p.29]{Baeza-Yates1999}.
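A minimal sketch of VSM matching, assuming a toy three-term space with binary weights (term weighting schemes are discussed next):

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length term vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Query and documents as vectors in the same three-term space.
query = [1, 1, 1]
doc_a = [1, 1, 0]  # shares two terms with the query
doc_b = [1, 0, 0]  # shares only one term
```

Because cosine similarity normalizes by vector length, a long document is not rewarded merely for containing many terms; only the angle between query and document matters.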

The next question is what to store at each $(document, term)$ coordinate in the vector space
(called the document/term weights).
Storing simple 1 or 0 values representing whether or not terms are present gives a model
where a document's relevance is proportional to how
many query terms it includes.
However, this is not very precise.
For example, by this definition, a document containing every conceivable query term
would be the most relevant to any query.
A better idea is to use something like the TF-IDF weighting scheme \citep[p.29]{Baeza-Yates1999}:

\begin{eqsp}
w_{t,d} = tf_{t,d} \times idf_{t}
= \frac{ f(t,d) }{ \sum_{k \in d} f(k,d) } \times
\log \frac{N}{n_{t}}.
\end{eqsp}

The final weight is computed by multiplying the term frequency score (TF) $tf_{t,d}$ with the
inverse document frequency (IDF) $idf_{t}$. TF evaluates how well the term describes the document contents,
while IDF punishes terms that appear in many documents.
$f(t,d)$ gives the frequency of a term in a document. $N$ is the total number of documents,
and $n_{t}$ the number of documents in which $t$ appears. The effect of the IDF factor is dampened by taking its
log-value. Together, TF and IDF rank documents higher by words that discriminate well within the document corpus,
and ignore words that appear in so many documents that they have little to no predictive capacity.
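The TF-IDF formula above translates directly into code. The corpus below is invented for illustration:

```python
import math
from collections import Counter

def tfidf(term, doc, corpus):
    """w_{t,d} = tf_{t,d} * idf_t with the normalizations used above:
    tf = f(t,d) / sum_k f(k,d) and idf = log(N / n_t)."""
    counts = Counter(doc)
    tf = counts[term] / sum(counts.values())
    n_t = sum(1 for d in corpus if term in d)  # documents containing the term
    return tf * math.log(len(corpus) / n_t) if n_t else 0.0

# Toy corpus of three pre-tokenized documents.
corpus = [
    ["recommender", "systems", "rank", "items"],
    ["search", "systems", "rank", "documents"],
    ["users", "rate", "items"],
]
doc = corpus[1]
# "documents" appears in 1 of 3 documents, "rank" in 2 of 3,
# so the rarer, more discriminating term receives the higher weight.
```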

While simple, TF-IDF has proven itself resilient when compared to more complex methods,
and many more complex methods have been built on its foundations (e.g. BM25, one of the most successful
probabilistic weighting algorithms \citep{Robertson2010}).

There are as many IR models as there are domains that need search.
Even the basic VSM can be constructed in a myriad of ways. Other models include the simple
\emph{boolean search model}, where queries are based on boolean algebra. Probabilistic models
frame the similarity question as the probability that the document is relevant.
Latent Semantic Indexing (LSI) applies SVD to IR by performing a dimensionality reduction of the term-space
into a concept-space.
See \cite{Manning2008}, \cite{Robertson2010} or \cite{Baeza-Yates1999} for a more comprehensive introduction to models in IR.

The important take-away is that, while serving different use cases, RSs and IR systems
employ much of the same technology, only with different input and expected output.


\subsection{Ranking Signals}
\label{subsec:signals}

Modern Web search engines have long since moved on from simple ranking metrics such as TF-IDF.
While traditional metrics may form the foundation of modern search engines, a lot more thought goes into the final results.
Different types of ranking functions are combined to produce the final \emph{search engine results page} (SERP),
with each ranking function often being referred to as a \emph{signal}. Alternate names include
\emph{re-ranking} or \emph{re-scoring} functions.

Google, the company behind the popular online search engine, writes that \emph{``Today we use more than 200 signals, including PageRank,
to order websites, and we update these algorithms on a weekly basis.
For example, we offer personalized search results based on your web history and
location.''}\footnote{\url{google.com/corporate/tech.html} --- accessed 11.04.2011}
Bing, a Web search engine from Microsoft, uses the same terminology:
\emph{``We use over 1,000 different signals and features in our ranking
algorithm.''}\footnote{\url{bing.com/community/site_blogs/b/search/archive/2011/02/01/thoughts-on-search-quality.aspx} --- accessed 11.04.2011}

Signals are often products of the document structure of the current domain.
\citet[p.5]{Bender2005} point to the use of the proximity of query terms in matching documents.
Those where the terms appear close together are natural candidates for a higher ranking.
Other signals, still reliant on the documents themselves, are more domain oriented.
One signal they point out is how words in a larger or bold font can be weighted
higher than normally typeset words.
In this way, the design of a document is used to choose the most important terms.

Signals can also depend on the query. \citet[p.145]{Manning2008} describe a system that takes
multi-word queries, breaks them up into different permutations and runs the new queries against the same
document index and ranking function. Each query corresponds to its own ranked set of results,
which are sent to a \emph{rank aggregation function} that turns the accumulated ranking evidence
into one coherent result. We will have more to say on rank aggregation in Section \ref{sec:aggregate}.

Signals can also be external to the collection or relational within the collection.
PageRank \cite[p.4]{Bender2005} is perhaps the best known of the relational signals.
The algorithm forms a probability distribution over web pages, ranking their perceived
authority or importance according to a simple iterative estimation.
Every web site is given its rank based on how many pages link to it.
For each page that provides links, the score it contributes to a linked-to page is
its own page rank, divided by the number of outbound links the page has.
Another intuitive justification for a site's PageRank is the \emph{random surfer model} \cite[p.4]{Bender2005}.
The probability that the random surfer visits a page is its PageRank. The \emph{randomness} is introduced
by a damping parameter $d$, which is the probability that a user will stop browsing and start at a new random page:

\begin{eqsp}
\mathrm{PageRank}(x) = \frac{1 - d}{N} + d \sum_{y \in B_x} \frac{\mathrm{PageRank}(y)}{\mathrm{Links}(y)}.
\end{eqsp}

$B_x$ is the set of pages linking to page $x$, and $\mathrm{Links}(y)$ is the number of outbound links from page $y$.
The first term distributes an equal base score to every page, where $N$ is the total number of pages.
This iterative algorithm is run until convergence to within a small delta.
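A minimal sketch of this iteration on a toy three-page graph. For simplicity it assumes every page has at least one outbound link, so no dangling-node handling is needed:

```python
def pagerank(links, d=0.85, tol=1e-10):
    """Iterative PageRank. `links` maps each page to its outbound links;
    every page is assumed to have at least one outbound link."""
    pages = list(links)
    n = len(pages)
    ranks = {p: 1.0 / n for p in pages}  # start from a uniform distribution
    while True:
        new = {}
        for x in pages:
            inbound = (y for y in pages if x in links[y])  # the set B_x
            new[x] = (1 - d) / n + d * sum(ranks[y] / len(links[y]) for y in inbound)
        if max(abs(new[p] - ranks[p]) for p in pages) < tol:
            return new  # converged to within a small delta
        ranks = new

# 'a' is linked to by both other pages and should rank highest.
graph = {"a": ["b"], "b": ["a"], "c": ["a"]}
ranks = pagerank(graph)
```

Since every page here has outbound links, the scores remain a probability distribution (they sum to one) throughout the iteration, matching the random surfer interpretation.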

Let us now finally take a look at \emph{personalized search},
a field where such signals may play an important part.


\subsection{Personalizing Search Results}

Search engines, especially online search engines, face a huge challenge.
In addition to the wide range of websites, the ambiguity of language and
the restricted nature of queries, there are the wildly differing users.
Each user is unique. Even for a single user, there might be many
different use cases, for example when using the same search engine at work and at home.
Another identified problem is that people use search engines for navigation as well as pure search.
\citet{Teevan2007} found that as many as 40\% of all queries to the Yahoo! search engine were ``re-finding queries'',
i.e. attempts to find information the user had accessed before.

\emph{Personalized search} attempts to solve these problems by introducing individually adaptive search results.
These techniques are based on user modeling (as introduced in Section \ref{sec:modeling}),
and attempt to build predictive models based on mined user preferences.
Commonly, this is done through query log analysis (e.g. \cite{Liu2002, Sugiyama2004, Shen2005, Speretta2000}).
These are often model-based techniques with implicit knowledge gathering agents
that create individual, long-term user models
(these terms are described in Section \ref{sec:recommender}).

There are two leading approaches to personalizing search results \cite[p.2]{Noll2007}.
The first method is query reformulation, where the actual user query is enhanced in some way before traditional IR
retrieves and ranks documents. The second method is results re-ranking, where the IR results are sorted
based on personalized metrics. This section describes the latter approach.

To demonstrate how these methods compare to traditional recommendation systems,
we will explore a few different approaches to personalized search:
(1) \emph{personalized topic-sensitive PageRank},
(2) \emph{folksonomy-based personalization} and
(3) \emph{social network search ranking}.


\paragraph{(1) Personalized topic-sensitive PageRank}
\citet{Haveliwala2003} introduced a topic-sensitive PageRank algorithm, which they found
to \emph{``generate more accurate rankings than with a single, generic PageRank vector''}.
They show how to create topic-specific PageRank vectors for a number of pre-set topics,
creating many rankings per page, one per topic.
This new PageRank is computed based on an existing set of websites that belong to one or more topics.
\citet{Qiu2006} achieved ``significant improvements'' to this approach by adding a personally adaptive layer
to the topic-sensitive PageRank algorithm, creating a \emph{personalized PageRank algorithm}.

In addition to the topic vector, \citeauthor{Qiu2006}
create a topic-preference vector for each user. When the user has clicked on a few search results,
a learning algorithm kicks in and estimates how likely the user is to be interested
in the pre-set topics, creating the topic-preference vector $T$. When it is time to rank a
page $p$ in response to the query $q$, they compute the personalized ranking:

\begin{eqsp}
PersonalizedRanking(T,p,q) = \sum_{i=1}^{m} T(i) \times Pr(q|T(i)) \times TSPR_i(p)
\end{eqsp}

We will not deduce this equation here (see \citet[p.5]{Qiu2006}), but let us explain it.
$T$ is the user-specific topic preference vector.
$i$ is the index of a topic and $m$ the total number of topics.
$Pr(q|T(i))$ is the probability that the query belongs in topic $i$.
This can be as simple as the total number of times the query terms appear in websites under topic $i$.
$TSPR_i(p)$ is the topic-sensitive PageRank score for page $p$ in topic $i$. Basically, this is
the normal PageRank vector computed within a pre-set topic $i$.
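With hypothetical numbers for the topic-preference vector, the query-topic probabilities and the per-topic PageRank scores (all invented for illustration, not taken from the cited experiments), the ranking formula reduces to a single weighted sum:

```python
def personalized_ranking(T, pr_q_given_t, tspr_p):
    """Sum over topics i of T(i) * Pr(q|T(i)) * TSPR_i(p)."""
    return sum(t * pq * pr for t, pq, pr in zip(T, pr_q_given_t, tspr_p))

# Hypothetical values for m = 3 topics:
T = [0.7, 0.2, 0.1]             # user's topic preferences, learned from query logs
pr_q_given_t = [0.5, 0.3, 0.2]  # Pr(q|T(i)): how well the query fits each topic
tspr_p = [0.04, 0.01, 0.02]     # TSPR_i(p): topic-sensitive PageRank of page p

score = personalized_ranking(T, pr_q_given_t, tspr_p)
```

The first topic dominates the sum here because the user's preference, the query fit and the page's topic-specific authority all point the same way.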

The construction of $T(i)$, i.e. the training phase of the algorithm, is performed by mining the query logs of individual users.
By identifying how many sites the user has visited per topic, computing $T$ can be done through linear regression or
by using a maximum-likelihood estimator.
\citet[p.10]{Qiu2006} report improvements of 25\% to 33\% over the topic-sensitive PageRank approach, which
\citet{Haveliwala2003} reports outperformed the original PageRank algorithm.

%Cube svd: \cite{Sun2005}


\paragraph{(2) Folksonomy-based personalization}
Web applications often have more information about users and items (documents, sites or articles)
than simple ratings. One of these extra resources is tags, simple keywords assigned by users to items.
The collection of users, items, tags and user-based assignment of tags to resources is called a \emph{folksonomy}.

\cite{Hotho} define a folksonomy as a tuple $F = (U,T,R,Y,\prec)$.
Here, $U$, $T$ and $R$ are finite sets of users, tags and resources (items), respectively.
$Y$ is a ternary relation between users, tags and resources, called tag assignments.
$\prec$ is a user-specific tag hierarchy, applicable if the tags are organized as super- and sub-tags.
The \emph{personomy} $P_u$ is the user-specific part of $F$,
i.e. the tags, items and assignments related to one user $u$.
In our terms, this personomy would be the user model.
\citeauthor{Hotho} use folksonomies to do information retrieval based on their
\emph{FolkRank} search algorithm, a derivative of PageRank.

\cite{Bao2007} show how folksonomies can be used to personalize search.
They first create a topic-space, in which every user and document is represented.
Each tag in the system is a dimension in this topic-space, or tag-space.
Whenever a new query is issued, two things happen. First, a regular IR method
computes a standard, non-personalized ranking of documents.
Second, a personalized ranking list is computed by performing a simple
vector-space model matching in the topic-space, for example by using
cosine similarity (as previously explained). The personalized list
is thus unrelated to the actual query, and is simply a ranking of the
most relevant pages for the current user.

The two ranks are aggregated by a simple consensus-metric, the
\emph{Weighted Borda-Fuse} aggregation method \cite[p.3]{Xu2008},
which is another name for a weighted combination of the rankings:

\begin{eqsp}
\mathrm{rank}(u,q,p) = \alpha \times \mathrm{rank}_{IR}(q,p)
+ (1-\alpha) \times \mathrm{rank}_{RS}(u,p)
\end{eqsp}

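A minimal sketch of this weighted aggregation, assuming each input is a mapping from page to rank position (1 = best). The handling of pages missing from one list is our own simplifying assumption, not part of the cited method:

```python
def weighted_borda_fuse(rank_ir, rank_rs, alpha=0.6):
    """Weighted combination of two rank positions (lower score = better).
    Pages missing from one list are assigned that list's worst rank."""
    pages = set(rank_ir) | set(rank_rs)
    worst = len(pages) + 1
    score = {
        p: alpha * rank_ir.get(p, worst) + (1 - alpha) * rank_rs.get(p, worst)
        for p in pages
    }
    return sorted(pages, key=lambda p: score[p])

# 'b' is mid-ranked by the IR method but top-ranked for this user.
ir_rank = {"a": 1, "b": 2, "c": 3}   # non-personalized IR ranking
rs_rank = {"b": 1, "c": 2, "a": 3}   # personalized ranking
merged = weighted_borda_fuse(ir_rank, rs_rank)
```

With $\alpha = 0.6$ the IR evidence still dominates, yet the user's preference is enough to lift page `b` above page `a` in the merged list.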
\citeauthor{Xu2008} tried many combinations of weights,
topic selection and datasets, with the general conclusion
that folksonomy-based personalized search has great potential.
If nothing else, this example shows how easily tags can be integrated
to provide an individual searching experience.


\paragraph{(3) Social network search ranking}
\cite{Carmel2009} developed a personalized search algorithm based on a user's \emph{social network}.
By re-ranking documents according to their relation to individuals in the current user's social network,
they arrived at a document ranking that ``significantly outperformed'' non-personalized social search \cite[p.1]{Carmel2009}.
Note the qualifier ``social search''. Their project searches through social data within an enterprise,
naturally conducive to algorithmic alterations based on social concepts. However, social data is still data,
and seeing how a personalized approach improves standard IR in this domain is helpful.

Their approach works in three steps. First, documents are retrieved by a standard IR method. Second, the user's socially connected peers
are also retrieved. Third, the initial ranked list of documents is re-ranked based on how strongly the documents are connected
to the user's peers, and how strongly those peers are connected to the user. The user-user similarity is
computed based on a few signals \cite[p.2]{Carmel2009}, e.g. co-authoring of documents, the use of similar tags
(see Example 2 in this section), or leaving comments on the same content.
The user model also includes a list of terms the current user has employed in a social context (profile, tags, et cetera).
This is all done to infer implicit similarity based on social connections.

The algorithm is quite powerful, and combines three types of rankings:
the initial IR score, the social connection score, and a term score, where the terms are tags and keywords used by a user.
The user model is $U(u) = (N(u), T(u))$,
where $N(u)$ are the social connections of $u$ and $T(u)$ the user's profile terms.
The function $sim(x,y)$ measures the similarity of two elements, either users or items.

The re-scoring is performed in two steps.
First, a score based on connections and terms is computed, weighted by $\beta$, which balances the two types of evidence:

\begin{eqsp}
S_P(q,d,U(u)) = \beta \sum_{v \in N(u)} sim(u,v) \times sim(v,d) + (1-\beta) \sum_{t \in T(u)} sim(u,t) \times sim(t,d)
\end{eqsp}

Finally, the results are combined with the ranking returned by the IR method ($R_{IR}$).
A parameter $\alpha$ is used to control how much each method is weighted:

\begin{eqsp}
S(q,d,U(u)) = \alpha \times R_{IR}(q,d) + (1-\alpha) \times S_P(q,d,U(u))
\end{eqsp}

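The two equations combine into a short computation. All similarity values below are invented for illustration; the real system derives them from signals such as co-authorship and shared tags:

```python
def s_p(peer_sim, peer_doc_sim, term_sim, term_doc_sim, beta=0.5):
    """S_P(q,d,U(u)): peer evidence plus term evidence, weighted by beta."""
    peer = sum(peer_sim[v] * peer_doc_sim.get(v, 0.0) for v in peer_sim)
    term = sum(term_sim[t] * term_doc_sim.get(t, 0.0) for t in term_sim)
    return beta * peer + (1 - beta) * term

def final_score(r_ir, personal, alpha=0.7):
    """S(q,d,U(u)) = alpha * R_IR(q,d) + (1 - alpha) * S_P(q,d,U(u))."""
    return alpha * r_ir + (1 - alpha) * personal

# Hypothetical similarities for one user u and one document d:
peer_sim = {"alice": 0.8, "bob": 0.3}   # sim(u, v) for peers v in N(u)
peer_doc_sim = {"alice": 0.5}           # sim(v, d); bob has no link to d
term_sim = {"folksonomy": 0.9}          # sim(u, t) for profile terms t in T(u)
term_doc_sim = {"folksonomy": 0.6}      # sim(t, d)

personal = s_p(peer_sim, peer_doc_sim, term_sim, term_doc_sim)
score = final_score(r_ir=0.8, personal=personal)
```

A document touched by a close peer, or matching the user's own profile terms, receives a boost on top of its query-dependent IR score.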
This approach, while simple, shows how social relations and social annotations can easily be used to personalize a search experience.
However, \citet[p.10]{Carmel2009} note that the high-quality data in their enterprise setting was important
for achieving the improved results.

\clearpage
