document_corpus
jprovidence committed Jan 11, 2012
1 parent ab1e3b1 commit 8c1ab0c
Showing 2 changed files with 91 additions and 6 deletions.
85 changes: 85 additions & 0 deletions notes/document_corpus.md
@@ -0,0 +1,85 @@
# Aether Document Corpus

## Sources

Documents in the Aether Document Corpus will come mainly from blogs, because most blog articles address a single topic and are relatively
concise. All articles will be in English and at least 300 words long. Articles will cover random topics and a variety of writing styles.


## Article Retrieval

#### Overview

All articles will be obtained from blogs with some variety of XML feed (RSS/Atom). Blogs will be polled daily for new content. The initial
blog index will be seeded by a hand-selected list. As articles are parsed, all outgoing links will be checked in an attempt to discover new
feeds.

In addition, a web-crawler may be used for feed discovery. Results of the crawler will be screened computationally, by a human, or both. In
practice, it seems that many feeds discovered in this manner are spam, low-quality or non-English. Only if the discovered feeds pass screening
will they be included in the overall index.


#### Implementation

Code for the implementation described below can be found [here](https://github.com/jprovidence/aether/blob/master/src/Entry.hs) and
[here](https://github.com/jprovidence/aether/blob/master/src/Feed.hs). If interested, many functions used in the processing of raw XML can be
found [here](https://github.com/jprovidence/aether/blob/master/src/Parse.hs). XML processing functions make heavy use of the Haskell
[HXT](http://www.haskell.org/haskellwiki/HXT) library and the [Arrows](http://en.wikibooks.org/wiki/Haskell/Understanding_arrows) language extension.
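
As a rough illustration of that combination, the sketch below pulls the title and description text out of each `<item>` element of an RSS
document using HXT arrow combinators and `proc` notation. It is a toy version only; the real code in Parse.hs is considerably more thorough
(Atom feeds, CDATA sections, embedded HTML and so on).

```haskell
{-# LANGUAGE Arrows #-}

import Text.XML.HXT.Core

-- Collect (title, description) pairs for every <item> in an RSS document.
itemFields :: String -> IO [(String, String)]
itemFields xml =
    runX $ readString [withValidate no, withWarnings no] xml
           >>> deep (isElem >>> hasName "item")
           >>> proc item -> do
                 t <- textOf "title"       -< item
                 d <- textOf "description" -< item
                 returnA -< (t, d)
  where
    -- Text content of the named child element of the current node.
    textOf tag = getChildren >>> isElem >>> hasName tag >>> getChildren >>> getText
```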


All blog data will be stored in a PostgreSQL database. The schema is relatively simple and consists of two tables:

A) Feeds

` id [PKey], serial | num_entries, integer | last_updated, timestamp without time zone | url, text `

B) Entries

` id [PKey], serial | feed_id [FKEY], integer | content, text | date, timestamp without time zone | title, text | link, text `
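
For concreteness, the two tables could be created along the following lines. The connection string and the exact DDL are illustrative; the
real schema may well be set up outside Haskell.

```haskell
import Database.HDBC (run, commit, disconnect)
import Database.HDBC.PostgreSQL (connectPostgreSQL)

-- One-off creation of the feeds and entries tables described above.
createTables :: IO ()
createTables = do
    conn <- connectPostgreSQL "host=localhost dbname=aether user=aether"
    _ <- run conn "CREATE TABLE feeds (\
                  \ id serial PRIMARY KEY,\
                  \ num_entries integer,\
                  \ last_updated timestamp without time zone,\
                  \ url text)" []
    _ <- run conn "CREATE TABLE entries (\
                  \ id serial PRIMARY KEY,\
                  \ feed_id integer REFERENCES feeds(id),\
                  \ content text,\
                  \ date timestamp without time zone,\
                  \ title text,\
                  \ link text)" []
    commit conn
    disconnect conn
```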


Within the Aether program, blog data will have the following Haskell representation:

A) Feeds
```haskell
data Feed = Feed { _url :: String
                 , _id  :: Int
                 } deriving Show
```

B) Entries
```haskell
data Entry = Entry { description :: String
                   , title       :: String
                   , link        :: String
                   , date        :: String
                   } deriving Show
```

The [HDBC](http://hackage.haskell.org/package/HDBC) library will be used to communicate with the database. Generally, boilerplate database
code will be encapsulated in the functions `transact` and `wrap`, which run actions that require a transaction (updates, inserts) and those
that do not (queries), respectively. Both are higher-order functions of type `(Connection -> IO b) -> IO b`. Utility functions are also
defined to simplify common operations, such as looking up a Feed by its URL or id.
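
A minimal sketch of how `transact` and `wrap` might look with HDBC is shown below. The connection string and the `feedIdByUrl` helper are
placeholders for illustration, not the actual Aether code (see Feed.hs for the real versions).

```haskell
import Database.HDBC
import Database.HDBC.PostgreSQL (Connection, connectPostgreSQL)

-- Placeholder connection string; the real one is configured elsewhere.
connStr :: String
connStr = "host=localhost dbname=aether user=aether"

-- Run an action that mutates the database, committing before disconnecting.
transact :: (Connection -> IO b) -> IO b
transact f = do
    conn <- connectPostgreSQL connStr
    res  <- f conn
    commit conn
    disconnect conn
    return res

-- Run a read-only action; no commit is required.
wrap :: (Connection -> IO b) -> IO b
wrap f = do
    conn <- connectPostgreSQL connStr
    res  <- f conn
    disconnect conn
    return res

-- Example utility: look up a feed id given its URL.
feedIdByUrl :: String -> IO (Maybe Int)
feedIdByUrl url = wrap $ \conn -> do
    rows <- quickQuery' conn "SELECT id FROM feeds WHERE url = ?" [toSql url]
    return $ case rows of
        ((i:_):_) -> Just (fromSql i)
        _         -> Nothing
```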

New feeds are incorporated into the index according to the following routine:

1. Visit the feed URL and retrieve the XML data.
2. Submit the data to the parser, which will return `[Entry]`.
3. Commit feed data to the database:
   - `num_entries` can be initialized by computing the length of the `[Entry]` returned from the parser.
   - `last_updated` corresponds to the most recent pubDate returned.
   - `url` is the initial URL from which all XML data was retrieved.
4. Obtain the `id` assigned by the database after step 3.
5. Map a commit action over each item in the returned `[Entry]`:
   - `content`: all non-HTML/JavaScript text between `<description>` tags.
   - `feed_id`: the id retrieved in step 4.
   - `date`: the pubDate of the article.
   - `title`: the title of the article.
   - `link`: origLink if available, otherwise the feed link.
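
Building on the `transact` sketch above, the routine might look roughly as follows. `fetchXml`, `parseEntries` and `latestPubDate` are
hypothetical stand-ins for the real retrieval, parsing and date handling in Feed.hs and Parse.hs, and the SQL is illustrative rather than
the exact statements used.

```haskell
-- Assumed helpers (the real versions live in Feed.hs / Parse.hs):
--   fetchXml      :: String -> IO String   -- download the raw feed XML
--   parseEntries  :: String -> [Entry]     -- raw XML -> [Entry]
--   latestPubDate :: [Entry] -> String     -- most recent pubDate

indexNewFeed :: String -> IO ()
indexNewFeed feedUrl = do
    xml <- fetchXml feedUrl                                   -- step 1
    let entries = parseEntries xml                            -- step 2
    transact $ \conn -> do
        _ <- run conn                                         -- step 3
                 "INSERT INTO feeds (num_entries, last_updated, url) VALUES (?, ?, ?)"
                 [toSql (length entries), toSql (latestPubDate entries), toSql feedUrl]
        [[fid]] <- quickQuery' conn                           -- step 4
                       "SELECT id FROM feeds WHERE url = ?" [toSql feedUrl]
        mapM_ (insertEntry conn fid) entries                  -- step 5

-- Commit a single Entry, tagged with its parent feed id.
insertEntry :: Connection -> SqlValue -> Entry -> IO ()
insertEntry conn fid e = do
    _ <- run conn
             "INSERT INTO entries (feed_id, content, date, title, link) VALUES (?, ?, ?, ?, ?)"
             [fid, toSql (description e), toSql (date e), toSql (title e), toSql (link e)]
    return ()
```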


Feeds are updated by:

1. Re-visiting the feed URL and retrieving the XML data.
2. Selecting only those entries with dates after the `last_updated` parameter.
3. Mapping a commit action over all entries returned from step 2 (if any).
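
Continuing the same sketch, an update pass could be written along these lines. `newerThan` is a hypothetical date comparison; the real code
compares parsed timestamps rather than raw strings.

```haskell
-- Assumed helper: newerThan :: String -> String -> Bool  (pubDate comparison)

updateFeed :: Feed -> IO ()
updateFeed feed = do
    xml <- fetchXml (_url feed)                               -- step 1
    let entries = parseEntries xml
    transact $ \conn -> do
        [[lastUp]] <- quickQuery' conn
                          "SELECT last_updated FROM feeds WHERE id = ?"
                          [toSql (_id feed)]
        -- step 2: keep only entries newer than last_updated
        let fresh = filter (\e -> date e `newerThan` (fromSql lastUp :: String)) entries
        mapM_ (insertEntry conn (toSql (_id feed))) fresh     -- step 3
```
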
12 changes: 6 additions & 6 deletions notes/text_similarity_a.md
@@ -62,7 +62,7 @@ normalizedScore = ((score - ls) / range) * 100

#### Vector-based algorithms (VBAs)

###### Construction
**Construction**

These algorithms operate by converting the document into a vector derived from its individual terms. Similarity is determined by
comparing vectors.
@@ -93,7 +93,7 @@ Each 'sub-function' can be assessed relative to its competitors.
Haskell implementations can be found [here](https://github.com/jprovidence/aether/blob/master/src/Text/Sim.hs). Brief descriptions of
the sub-functions implemented are listed below.

**Tallying Functions**
###### Tallying Functions

- Total-Relative:
`significance = term frequency / total word count`
@@ -105,12 +105,12 @@ the sub-functions implemented are listed below.
`significance = 1 / (term frequency / highest term frequency)`
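
As a rough illustration of the total-relative variant (not the exact Sim.hs code), a tally can be built over a map of term counts:

```haskell
import qualified Data.Map as M

-- Total-relative tallying: significance = term frequency / total word count.
totalRelative :: [String] -> M.Map String Double
totalRelative ws = M.map (\c -> fromIntegral c / total) counts
  where
    counts = M.fromListWith (+) [(w, 1 :: Int) | w <- ws]
    total  = fromIntegral (length ws) :: Double
```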


**Dimension Equalization**
###### Dimension Equalization

- Intersection: Only scores corresponding to words present in both documents are considered.
- Injection: Zeros are inserted into the vectors at the indices corresponding to words exclusive to the other document.
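
Over the same map representation as the tallying sketch above, the two strategies might look like this (illustrative only):

```haskell
-- Intersection: keep only the terms present in both significance maps.
intersectDims :: M.Map String Double -> M.Map String Double
              -> (M.Map String Double, M.Map String Double)
intersectDims a b = (M.intersection a b, M.intersection b a)

-- Injection: give each map a zero entry for every term it lacks, so both
-- vectors end up with the same dimensions.
injectDims :: M.Map String Double -> M.Map String Double
           -> (M.Map String Double, M.Map String Double)
injectDims a b = (M.union a (zeros (M.difference b a)),
                  M.union b (zeros (M.difference a b)))
  where
    zeros = M.map (const 0)
```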

**Comparison**
###### Comparison

Vectors are typically compared by determining the angle or distance between them. Three functions that use one of these approaches have
been selected for evaluation. Rather than describe them again here, I have provided links to their respective Wikipedia articles,
@@ -121,15 +121,15 @@ which are more thorough than I could be.
- [Chebyshev distance](http://en.wikipedia.org/wiki/Chebyshev_distance)
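
For reference, the angle-based comparison reduces to very little code once the vectors have equal dimensions; the cosine-similarity sketch
below is illustrative, not the Sim.hs implementation:

```haskell
-- Cosine similarity: the cosine of the angle between two equal-length vectors.
cosineSim :: [Double] -> [Double] -> Double
cosineSim xs ys = dot / (norm xs * norm ys)
  where
    dot    = sum (zipWith (*) xs ys)
    norm v = sqrt (sum (map (^ (2 :: Int)) v))
```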


###### Document Preparation
**Document Preparation**

The easiest way to apply these functions to a document string is to split it at whitespace, tabulations, etc. The result is an array of
terms which can easily be tallied and processed. Stopwords (a, the, it, ...) can vastly skew results, as these words are numerous even in
completely unrelated documents, so they must be filtered out. Additionally, the problem of edge-case terms (those that contain punctuation
or strange capitalization) can be remedied by downcasing and filtering out punctuation.
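
A throwaway version of this preparation step might read as follows; the stopword list is a tiny illustrative sample, not the one Aether
actually uses.

```haskell
import Data.Char (isAlpha, toLower)

-- Split at whitespace, downcase, strip punctuation and drop stopwords.
prepare :: String -> [String]
prepare doc = filter (`notElem` stopwords) cleaned
  where
    cleaned   = filter (not . null) (map (map toLower . filter isAlpha) (words doc))
    stopwords = ["a", "an", "the", "it", "of", "and", "to", "is"]
```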


###### Consideration of nouns
**Consideration of Nouns**

At this point, most 'low-hanging' points of the hypothesis can be tested. Each sub-function combination can be applied to the prepared
documents, and scoring should, to some degree, be a ranking by similarity. Relative sub-function accuracies can be determined and general
