-
Notifications
You must be signed in to change notification settings - Fork 0
paoptu023/COMS6111-proj1
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Folders and files
Name | Name | Last commit message | Last commit date | |
---|---|---|---|---|
Repository files navigation
# COMS E6111 Project 1 * Group name: Project 1 Group 23 Qi Wang (qw2197), Yongjia Huo (yh2796) * Submitted files: main.py – the main program, get relevance feedback parameters.py – parameter settings and parser query_processor.py – implement the Rocchio algorithm stopwords.txt – list of stop words transcript.txt readme.txt * Use the following command to run our program: python main.py <account key> <target precision> <initial query> For example, python main.py 'HIWkFhlcqfV0SsO9ac7smysylCtGDsuMVyqgSWPPDZI' 1.0 'musk' * Description of design: Each query round we display the URL, title and description of top-10 results returned from the Bing Search API, and prompt user to mark each page as relevant or not relevant. We parse the text from URL, title and description into a list of words split by punctuations, then we use all terms not in the stop words list to construct the vector space. We use the Rocchio algorithm and tf-idf to calculate term weights, the two highest-weighted terms not presented in last query are added in the new query, terms in new query are ordered by their weights. Each round we calculate the precision@10, if the target value is achieved or the precision is zero, the program terminates. * Description of query-modification method: For each query, we add results marked as relevant into the relevant set, and results marked as nor-relevant to non-relevant set. The URL, title and description are combined into a single string separated by white space. URL is pre-processed by the parseURL() function, we only keep the parts between the last two slashes, because the keywords usually appear here. We also remove “.html” and digits which are basically garbage information. Then we replace all punctuations with white spaces, convert the string into lower characters and split it into a bunch of words. Stop words are removed. After parsing, we calculate the term frequency of relevant documents, term frequency of non-relevant documents and document frequency respectively. Then we construct the document vector using the Rocchio algorithm, the term weight of document vector is calculated by tf-idf algorithm: W(ij) = TF(ij) * log(#documents / DF(i)) The vector is sorted decreasingly by term weights. Finally, we add weights to terms in old query, the two new terms of highest weights are considered good candidates for query expansion and appended to the old query. * Bing Search Account Key: HIWkFhlcqfV0SsO9ac7smysylCtGDsuMVyqgSWPPDZI * References: Singhal, Amit. "Modern information retrieval: A brief overview." IEEE Data Eng. Bull. 24.4 (2001): 35-43. Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. 2008. Introduction to Information Retrieval. Cambridge University Press, New York, NY, USA.
About
No description, website, or topics provided.
Resources
Stars
Watchers
Forks
Releases
No releases published
Packages 0
No packages published