document_corpus
jprovidence committed Jan 11, 2012
1 parent ab1e3b1 commit 8c1ab0c
Showing 2 changed files with 91 additions and 6 deletions.
85 changes: 85 additions & 0 deletions notes/document_corpus.md
@@ -0,0 +1,85 @@
# Aether Document Corpus

## Sources

Documents in the Aether Document Corpus will come mainly from blogs, because most blog articles address a single topic and are relatively
concise. All articles will be in English and at least 300 words long. Articles will cover random topics and a variety of writing styles.


## Article Retrieval

#### Overview

All articles will be obtained from blogs with some variety of XML feed (RSS/Atom). Blogs will be polled daily for new content. The initial
blog index will be seeded by a hand-selected list. As articles are parsed, all outgoing links will be checked in an attempt to discover new
feeds.

In addition, a web-crawler may be used for feed discovery. Results of the crawler will be screened computationally, by a human, or both. In
practice, it seems that many feeds discovered in this manner are spam, low-quality or non-English. Only if the discovered feeds pass screening
will they be included in the overall index.


#### Implementation

Code for the implementation described below can be found [here](https://github.com/jprovidence/aether/blob/master/src/Entry.hs) and
[here](https://github.com/jprovidence/aether/blob/master/src/Feed.hs). If interested, many functions used in the processing of raw XML can be
found [here](https://github.com/jprovidence/aether/blob/master/src/Parse.hs). XML processing functions make heavy use of the Haskell
[HXT](http://www.haskell.org/haskellwiki/HXT) library and the [Arrows](http://en.wikibooks.org/wiki/Haskell/Understanding_arrows) language extension.
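
As a rough illustration of that combination, the sketch below pulls the title and description text out of each `<item>` element of an RSS
document using HXT arrow combinators and `proc` notation. It is a toy version only; the real code in Parse.hs is considerably more thorough
(Atom feeds, CDATA sections, embedded HTML and so on).

```haskell
{-# LANGUAGE Arrows #-}

import Text.XML.HXT.Core

-- Collect (title, description) pairs for every <item> in an RSS document.
itemFields :: String -> IO [(String, String)]
itemFields xml =
    runX $ readString [withValidate no, withWarnings no] xml
           >>> deep (isElem >>> hasName "item")
           >>> proc item -> do
                 t <- textOf "title"       -< item
                 d <- textOf "description" -< item
                 returnA -< (t, d)
  where
    -- Text content of the named child element of the current node.
    textOf tag = getChildren >>> isElem >>> hasName tag >>> getChildren >>> getText
```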


All blog data will be stored in a PostgreSQL database. The schema is relatively simple and consists of two tables:

A) Feeds

` id [PKey], serial | num_entries, integer | last_updated, timestamp without time zone | url, text `

B) Entries

` id [PKey], serial | feed_id [FKEY], integer | content, text | date, timestamp without time zone | title, text | link, text `
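
For concreteness, the two tables could be created along the following lines. The connection string and the exact DDL are illustrative; the
real schema may well be set up outside Haskell.

```haskell
import Database.HDBC (run, commit, disconnect)
import Database.HDBC.PostgreSQL (connectPostgreSQL)

-- One-off creation of the feeds and entries tables described above.
createTables :: IO ()
createTables = do
    conn <- connectPostgreSQL "host=localhost dbname=aether user=aether"
    _ <- run conn "CREATE TABLE feeds (\
                  \ id serial PRIMARY KEY,\
                  \ num_entries integer,\
                  \ last_updated timestamp without time zone,\
                  \ url text)" []
    _ <- run conn "CREATE TABLE entries (\
                  \ id serial PRIMARY KEY,\
                  \ feed_id integer REFERENCES feeds(id),\
                  \ content text,\
                  \ date timestamp without time zone,\
                  \ title text,\
                  \ link text)" []
    commit conn
    disconnect conn
```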


Within the Aether program, blog data will have the following Haskell representation:

A) Feeds
```haskell
data Feed = Feed { _url :: String
                 , _id  :: Int
                 } deriving Show
```

B) Entries
```haskell
data Entry = Entry { description :: String
                   , title       :: String
                   , link        :: String
                   , date        :: String
                   } deriving Show
```

The [HDBC](http://hackage.haskell.org/package/HDBC) library will be used to communicate with the database. Generally, boilerplate database
code will be encapsulated in the functions `transact` and `wrap`, which run actions that require a transaction (updates, inserts) and those
that do not (queries), respectively. Both are higher-order functions of type `(Connection -> IO b) -> IO b`. Utility functions are also
defined to simplify common operations, such as looking up a Feed by its URL or id.
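
A minimal sketch of how `transact` and `wrap` might look with HDBC is shown below. The connection string and the `feedIdByUrl` helper are
placeholders for illustration, not the actual Aether code (see Feed.hs for the real versions).

```haskell
import Database.HDBC
import Database.HDBC.PostgreSQL (Connection, connectPostgreSQL)

-- Placeholder connection string; the real one is configured elsewhere.
connStr :: String
connStr = "host=localhost dbname=aether user=aether"

-- Run an action that mutates the database, committing before disconnecting.
transact :: (Connection -> IO b) -> IO b
transact f = do
    conn <- connectPostgreSQL connStr
    res  <- f conn
    commit conn
    disconnect conn
    return res

-- Run a read-only action; no commit is required.
wrap :: (Connection -> IO b) -> IO b
wrap f = do
    conn <- connectPostgreSQL connStr
    res  <- f conn
    disconnect conn
    return res

-- Example utility: look up a feed id given its URL.
feedIdByUrl :: String -> IO (Maybe Int)
feedIdByUrl url = wrap $ \conn -> do
    rows <- quickQuery' conn "SELECT id FROM feeds WHERE url = ?" [toSql url]
    return $ case rows of
        ((i:_):_) -> Just (fromSql i)
        _         -> Nothing
```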

New feeds are incorporated into the index according to the following routine:

1. Visit the feed URL and retrieve the XML data.
2. Submit the data to the parser, which will return `[Entry]`.
3. Commit feed data to the database:
   - `num_entries` can be initialized by computing the length of the `[Entry]` returned from the parser.
   - `last_updated` corresponds to the most recent pubDate returned.
   - `url` is the initial URL from which all XML data was retrieved.
4. Obtain the `id` assigned by the database after step 3.
5. Map a commit action over each item in the returned `[Entry]`:
   - `content`: all non-HTML/JavaScript text between `<description>` tags.
   - `feed_id`: the id retrieved in step 4.
   - `date`: the pubDate of the article.
   - `title`: the title of the article.
   - `link`: origLink if available, otherwise the feed link.
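
Building on the `transact` sketch above, the routine might look roughly as follows. `fetchXml`, `parseEntries` and `latestPubDate` are
hypothetical stand-ins for the real retrieval, parsing and date handling in Feed.hs and Parse.hs, and the SQL is illustrative rather than
the exact statements used.

```haskell
-- Assumed helpers (the real versions live in Feed.hs / Parse.hs):
--   fetchXml      :: String -> IO String   -- download the raw feed XML
--   parseEntries  :: String -> [Entry]     -- raw XML -> [Entry]
--   latestPubDate :: [Entry] -> String     -- most recent pubDate

indexNewFeed :: String -> IO ()
indexNewFeed feedUrl = do
    xml <- fetchXml feedUrl                                   -- step 1
    let entries = parseEntries xml                            -- step 2
    transact $ \conn -> do
        _ <- run conn                                         -- step 3
                 "INSERT INTO feeds (num_entries, last_updated, url) VALUES (?, ?, ?)"
                 [toSql (length entries), toSql (latestPubDate entries), toSql feedUrl]
        [[fid]] <- quickQuery' conn                           -- step 4
                       "SELECT id FROM feeds WHERE url = ?" [toSql feedUrl]
        mapM_ (insertEntry conn fid) entries                  -- step 5

-- Commit a single Entry, tagged with its parent feed id.
insertEntry :: Connection -> SqlValue -> Entry -> IO ()
insertEntry conn fid e = do
    _ <- run conn
             "INSERT INTO entries (feed_id, content, date, title, link) VALUES (?, ?, ?, ?, ?)"
             [fid, toSql (description e), toSql (date e), toSql (title e), toSql (link e)]
    return ()
```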


Feeds are updated by:

1. Re-visiting the feed URL and retrieving the XML data.
2. Selecting only those entries with dates after the `last_updated` parameter.
3. Mapping a commit action over all entries returned from step 2 (if any).
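
Continuing the same sketch, an update pass could be written along these lines. `newerThan` is a hypothetical date comparison; the real code
compares parsed timestamps rather than raw strings.

```haskell
-- Assumed helper: newerThan :: String -> String -> Bool  (pubDate comparison)

updateFeed :: Feed -> IO ()
updateFeed feed = do
    xml <- fetchXml (_url feed)                               -- step 1
    let entries = parseEntries xml
    transact $ \conn -> do
        [[lastUp]] <- quickQuery' conn
                          "SELECT last_updated FROM feeds WHERE id = ?"
                          [toSql (_id feed)]
        -- step 2: keep only entries newer than last_updated
        let fresh = filter (\e -> date e `newerThan` (fromSql lastUp :: String)) entries
        mapM_ (insertEntry conn (toSql (_id feed))) fresh     -- step 3
```
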
12 changes: 6 additions & 6 deletions notes/text_similarity_a.md
@@ -62,7 +62,7 @@ normalizedScore = ((score - ls) / range) * 100

#### Vector-based algorithms (VBAs)

###### Construction
**Construction**

These algorithms operate by converting the document into a vector derived from its individual terms. Similarity is determined by
comparing vectors.
@@ -93,7 +93,7 @@ Each 'sub-function' can be assessed relative to its competitors.
Haskell implementations can be found [here](https://github.com/jprovidence/aether/blob/master/src/Text/Sim.hs). Brief descriptions of
the sub-functions implemented are listed below.

**Tallying Functions**
###### Tallying Functions

- Total-Relative:
`significance = term frequency / total word count`
@@ -105,12 +105,12 @@ the sub-functions implemented are listed below.
`significance = 1 / (term frequency / highest term frequency)`
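
As a rough illustration of the total-relative variant (not the exact Sim.hs code), a tally can be built over a map of term counts:

```haskell
import qualified Data.Map as M

-- Total-relative tallying: significance = term frequency / total word count.
totalRelative :: [String] -> M.Map String Double
totalRelative ws = M.map (\c -> fromIntegral c / total) counts
  where
    counts = M.fromListWith (+) [(w, 1 :: Int) | w <- ws]
    total  = fromIntegral (length ws) :: Double
```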


**Dimension Equalization**
###### Dimension Equalization

- Intersection: Only scores corresponding to words present in both documents are considered.
- Injection: Zeros are inserted into the vectors at the indices corresponding to words exclusive to the other document.
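
Over the same map representation as the tallying sketch above, the two strategies might look like this (illustrative only):

```haskell
-- Intersection: keep only the terms present in both significance maps.
intersectDims :: M.Map String Double -> M.Map String Double
              -> (M.Map String Double, M.Map String Double)
intersectDims a b = (M.intersection a b, M.intersection b a)

-- Injection: give each map a zero entry for every term it lacks, so both
-- vectors end up with the same dimensions.
injectDims :: M.Map String Double -> M.Map String Double
           -> (M.Map String Double, M.Map String Double)
injectDims a b = (M.union a (zeros (M.difference b a)),
                  M.union b (zeros (M.difference a b)))
  where
    zeros = M.map (const 0)
```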

**Comparison**
###### Comparison

Vectors are typically compared by determining the angle or distance between them. Three functions that use one of these approaches have
been selected for evaluation. Rather than describe them again here, I have provided links to their respective Wikipedia articles,
@@ -121,15 +121,15 @@ which are more thorough than I could be.
- [Chebyshev distance](http://en.wikipedia.org/wiki/Chebyshev_distance)
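
For reference, the angle-based comparison reduces to very little code once the vectors have equal dimensions; the cosine-similarity sketch
below is illustrative, not the Sim.hs implementation:

```haskell
-- Cosine similarity: the cosine of the angle between two equal-length vectors.
cosineSim :: [Double] -> [Double] -> Double
cosineSim xs ys = dot / (norm xs * norm ys)
  where
    dot    = sum (zipWith (*) xs ys)
    norm v = sqrt (sum (map (^ (2 :: Int)) v))
```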


###### Document Preparation
**Document Preparation**

The easiest way to apply these functions to a document string is to split it at whitespace, tabulations, etc. The result is an array of
terms which can easily be tallied and processed. Stopwords (a, the, it, ...) can vastly skew results, as these words are numerous even in
completely unrelated documents, so they must be filtered out. Additionally, the problem of edge-case terms (those that contain punctuation
or strange capitalization) can be remedied by downcasing and filtering out punctuation.
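
A throwaway version of this preparation step might read as follows; the stopword list is a tiny illustrative sample, not the one Aether
actually uses.

```haskell
import Data.Char (isAlpha, toLower)

-- Split at whitespace, downcase, strip punctuation and drop stopwords.
prepare :: String -> [String]
prepare doc = filter (`notElem` stopwords) cleaned
  where
    cleaned   = filter (not . null) (map (map toLower . filter isAlpha) (words doc))
    stopwords = ["a", "an", "the", "it", "of", "and", "to", "is"]
```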


###### Consideration of nouns
**Consideration of Nouns**

At this point, most 'low-hanging' points of the hypothesis can be tested. Each sub-function combination can be applied to the prepared
documents, and scoring should, to some degree, be a ranking by similarity. Relative sub-function accuracies can be determined and general
