From 8c1ab0ca54ddff456d5fe59b1a3b2049c606a0c6 Mon Sep 17 00:00:00 2001
From: Gideon Providence
Date: Wed, 11 Jan 2012 17:03:00 -0500
Subject: [PATCH] document_corpus

---
 notes/document_corpus.md   | 85 ++++++++++++++++++++++++++++++++++++++
 notes/text_similarity_a.md | 12 +++---
 2 files changed, 91 insertions(+), 6 deletions(-)

diff --git a/notes/document_corpus.md b/notes/document_corpus.md
index e69de29..f0df365 100644
--- a/notes/document_corpus.md
+++ b/notes/document_corpus.md
@@ -0,0 +1,85 @@

# Aether Document Corpus

## Sources

Documents in the Aether document corpus will come mainly from blogs, since most blog articles cover a single topic and are relatively
concise. All articles will be in English and at least 300 words long. Articles will cover random topics and a variety of writing styles.


## Article Retrieval

#### Overview

All articles will be obtained from blogs with some variety of XML feed (RSS/Atom). Blogs will be polled daily for new content. The initial
blog index will be seeded from a hand-selected list. As articles are parsed, all outgoing links will be checked in an attempt to discover
new feeds.

In addition, a web crawler may be used for feed discovery. Results of the crawler will be screened computationally, by a human, or both. In
practice, it seems many feeds discovered in this manner are spam, low-quality or non-English. Only feeds that pass screening will be
included in the overall index.


#### Implementation

Code for the implementation described below can be found [here](https://github.com/jprovidence/aether/blob/master/src/Entry.hs) and
[here](https://github.com/jprovidence/aether/blob/master/src/Feed.hs). If interested, many functions used in the processing of raw XML can
be found [here](https://github.com/jprovidence/aether/blob/master/src/Parse.hs). The XML processing functions make heavy use of the Haskell
[HXT](http://www.haskell.org/haskellwiki/HXT) library and the [Arrows](http://en.wikibooks.org/wiki/Haskell/Understanding_arrows) language
extension.


All blog data will be stored in a PostgreSQL database. The schema is relatively simple and consists of two tables:

A) Feeds

` id [PKEY], serial | num_entries, integer | last_updated, timestamp without time zone | url, text `

B) Entries

` id [PKEY], serial | feed_id [FKEY], integer | content, text | date, timestamp without time zone | title, text | link, text `


Within the Aether program, blog data will have the following Haskell representation:

A) Feeds
```haskell
data Feed = Feed { _url :: String
                 , _id  :: Int
                 } deriving Show
```

B) Entries
```haskell
data Entry = Entry { description :: String
                   , title       :: String
                   , link        :: String
                   , date        :: String
                   } deriving Show
```

The [HDBC](http://hackage.haskell.org/package/HDBC) library will be used to communicate with the database. Generally, boilerplate database
code will be encapsulated in the functions `transact` and `wrap`, which run operations that require a transaction (updates, inserts) and
those that do not (queries), respectively. Both are higher-order functions of type `(Connection -> IO b) -> IO b`. Utility functions are
also defined to simplify common operations, such as looking up a `Feed` by URL or id. A rough sketch of this plumbing is given after the
routine below.

New feeds are incorporated into the index according to the following routine:

1. Visit the feed URL and retrieve its XML data.
2. Submit the data to the parser, which will return `[Entry]`.
3. Commit the feed data to the database:
   `num_entries` can be initialized by computing the length of the `[Entry]` returned from the parser.
   `last_updated` corresponds to the most recent pubDate returned.
   `url` is the initial URL from which all XML data was retrieved.
4. Obtain the `id` assigned by the database in step 3.
5. Map a commit action over each item in the returned `[Entry]`:
   `content`, all non-HTML/JavaScript text between `` tags.
   `feed_id`, retrieved in step 4.
   `date`, the pubDate of the article.
   `title`, the title of the article.
   `link`, origLink if available, otherwise the feed link.
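Below is a minimal sketch of what that database plumbing might look like, assuming HDBC's PostgreSQL backend. The names `transact` and
`wrap` and their type come from the notes above; the connection string, the SQL statement and the `insertFeed` helper are illustrative
assumptions, not the actual Aether code.

```haskell
-- Sketch only: boilerplate helpers in the spirit of `transact` and `wrap`.
-- Connection string, SQL and `insertFeed` are assumptions, not Aether's code.
-- Error handling is omitted for brevity.

module CorpusDb where

import Database.HDBC
import Database.HDBC.PostgreSQL (Connection, connectPostgreSQL)

-- Hypothetical connection string; the real one is not given in the notes.
connString :: String
connString = "host=localhost dbname=aether"

-- Run a read-only action (queries); no transaction is needed.
wrap :: (Connection -> IO b) -> IO b
wrap action = do
    conn <- connectPostgreSQL connString
    res  <- action conn
    disconnect conn
    return res

-- Run an action that mutates the database (inserts, updates) in a transaction.
transact :: (Connection -> IO b) -> IO b
transact action = do
    conn <- connectPostgreSQL connString
    res  <- withTransaction conn action
    disconnect conn
    return res

-- Example of step 3 above: commit a new feed row.
insertFeed :: String -> Int -> String -> IO ()
insertFeed url numEntries lastUpdated = transact $ \conn -> do
    _ <- run conn
             "INSERT INTO feeds (num_entries, last_updated, url) VALUES (?, ?, ?)"
             [toSql numEntries, toSql lastUpdated, toSql url]
    return ()
```

`withTransaction` commits automatically when the action succeeds and rolls back on exception, which matches the split between
mutating operations and plain queries described above.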
Feeds are updated by:

1. Re-visiting the feed URL and retrieving its XML data.
2. Selecting only those entries with dates after the `last_updated` parameter.
3. Mapping a commit action over all entries returned from step 2 (if any).

diff --git a/notes/text_similarity_a.md b/notes/text_similarity_a.md
index a00f366..da9bd53 100644
--- a/notes/text_similarity_a.md
+++ b/notes/text_similarity_a.md
@@ -62,7 +62,7 @@ normalizedScore = ((score - ls) / range) * 100

#### Vector-based algorithms (VBAs)

-###### Construction
+**Construction**

These algorithms operate by converting the document into a vector derived from its individual terms. Similarity is determined by comparing
vectors.

@@ -93,7 +93,7 @@ Each 'sub-function' can be assessed relative to its competitors.
Haskell implementations can be found [here](https://github.com/jprovidence/aether/blob/master/src/Text/Sim.hs). Brief descriptions of
the sub-functions implemented are listed below.

-**Tallying Functions**
+###### Tallying Functions

- Total-Relative: `significance = term frequency / total word count`

@@ -105,12 +105,12 @@ the sub-functions implemented are listed below.

  `significance = 1 / (term frequency / highest term frequency)`

-**Dimension Equalization**
+###### Dimension Equalization

- Intersection: Only scores corresponding to words present in both documents are considered.
- Injection: Zeros are inserted into the vectors at the indices corresponding to words exclusive to the other document.

-**Comparison**
+###### Comparison

Vectors are typically compared by determining the angle or distance between them. Three functions have been selected for evaluation, each
using one of these approaches. Rather than describe them again here, I have provided links to their respective wikipedia articles, which
are more thorough than I could be.

- [Chebyshev distance](http://en.wikipedia.org/wiki/Chebyshev_distance)

-###### Document Preparation
+**Document Preparation**

The easiest way to apply these functions to a document string is to split it at whitespace, tabs, etc. The result is an array of terms
which can easily be tallied and processed. Stopwords (a, the, it, ...) can vastly skew results, as these words are numerous in completely
unrelated documents, so they must be removed. Issues caused by unclean terms (those that contain punctuation or odd capitalization) can be
remedied by downcasing and filtering out punctuation.

-###### Consideration of nouns
+**Consideration of Nouns**

At this point, most 'low hanging' points of the hypothesis can be tested. Each sub-function combination can be applied to the prepared
documents and scoring should, to some degree, be a ranking by similarity. Relative sub-function accuracies can be determined and general
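For reference, here is a rough end-to-end illustration of how the sub-functions referenced in the hunks above compose. This is not the
code in Text/Sim.hs; all module and function names are invented for the example. It combines one tallying function (total-relative), one
dimension-equalization strategy (intersection) and one angle-based comparison (cosine similarity).

```haskell
-- Illustrative sketch only: not the implementation in Text/Sim.hs.

module SimSketch where

import           Data.Char       (isAlpha, toLower)
import qualified Data.Map.Strict as M

-- Naive preparation: downcase, replace non-letters with spaces, split on
-- whitespace. Stopword removal is omitted here.
prepare :: String -> [String]
prepare = words . map (\c -> if isAlpha c then toLower c else ' ')

-- Tallying, total-relative: significance = term frequency / total word count.
totalRelative :: [String] -> M.Map String Double
totalRelative ws = M.map (/ total) counts
  where
    counts = M.fromListWith (+) [(w, 1) | w <- ws]
    total  = fromIntegral (length ws)

-- Dimension equalization by intersection: keep only terms present in both
-- documents, producing two vectors over the same shared dimensions.
intersect2 :: M.Map String Double -> M.Map String Double -> ([Double], [Double])
intersect2 a b = unzip [ (x, y) | (k, x) <- M.toList a, Just y <- [M.lookup k b] ]

-- Comparison: cosine similarity, i.e. the angle between the two vectors.
cosineSim :: [Double] -> [Double] -> Double
cosineSim xs ys
  | nx == 0 || ny == 0 = 0
  | otherwise          = sum (zipWith (*) xs ys) / (nx * ny)
  where
    nx = sqrt (sum (map (^ 2) xs))
    ny = sqrt (sum (map (^ 2) ys))

-- End-to-end similarity of two document strings.
similarity :: String -> String -> Double
similarity d1 d2 = cosineSim v1 v2
  where
    (v1, v2) = intersect2 (totalRelative (prepare d1)) (totalRelative (prepare d2))
```

Swapping in a distance-based comparison such as the Chebyshev distance linked above would only mean replacing `cosineSim` with another
function of the same type.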