c59aa29 @jfinkels added everything that wasn't a library JAR or a class file from the orig...
authored
1 <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
2 <html>
3 <head>
4 <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
5 <meta http-equiv="Content-Style-Type" content="text/css">
6 <title>Perseus Hopper Opensource README</title>
7 <style type="text/css">
8 body {font-size: 14px}
9 </style>
10 </head>
11 <body>
12 <h1><b>The Perseus Digital Library 4.0</b></h1>
13 <h2><b>Introduction</b></h2>
14 <p>This package represents a Java reimplementation of the much older Perl Perseus Digital Library hopper web
15 application. It does not aim to replicate every last bit of the Perl hopper&rsquo;s functionality, but instead
16 concentrates on the features that were most commonly used: displaying of text and associated resources,
17 searching, morphological analysis and statistical information, to name a few.</p>
18
19 <h2>Installation</h2>
20 <p>If you want to generate all of the data yourself, please use INSTALLEVERYTHING.html. If you want to install
21 the hopper with the provided database dumps and processed XML files, please use INSTALLWITHDATA.html. The
22 hopper can only be installed on Linux and Mac OS X. It does not currently work on any Windows operating
23 system.</p>
24
25 <h2>Design Goals</h2>
26 <p>In designing this package, the following served as guidelines:</p>
27 <ul>
28 <li>Keep everything in Java. This has various advantages. Operating system becomes more of a choice.
29 Installation is simpler &mdash; basically a matter of collecting JAR files.</li>
30 <li>Move data from flat files to RDBMS. Much work has been put into making relational databases fast and
31 useful. Our data is inherently relational, so it can be most clearly organized and retrieved in table
32 format.</li>
33 <li>Provide a clear API. The Perl hopper is an incredibly powerful tool for creating text processing
34 applications. But the amount of detailed, often confusing, knowledge required to program with it is
35 daunting. A combination of more modular object design and Java&rsquo;s self-documentation abilities should
36 make it much simpler to build novel applications from existing code.</li>
37 </ul>
38
39 <h2>Status</h2>
40 <p>The Java hopper began as a front-end replacement for the old hopper, using old files and databases to
41 power a JSP-based web interface that aimed for a newer and generally &ldquo;cleaner&rdquo; look. Since then, more and
42 more front-end and back-end features of the Perl system have been replicated in Java; as of now, the Java
43 system can function entirely on its own.</p>
44 <p>At present, almost all the texts from the old hopper behave as expected in the new hopper. The exceptions
45 are generally texts with interesting layouts and subdocument configurations, such as some of the
46 commentaries. Also, the new hopper does not support the TEI&rsquo;s step-based citation schemes (these appear in
47 only a few texts, such as Butler&rsquo;s <i>Iliad</i> translation).</p>
48 <p>There is still a great deal of work that could be done to replicate the Perl hopper&rsquo;s features, although
49 it is not certain whether such work would be worth the effort; a great many specialized features, such as
50 the TLG and DDBDP searching tools and the Perseus Atlas, have not been reimplemented.</p>
51 <p>The Java hopper does include one piece of functionality not present in the Perl hopper: the named-entity
52 browser, which provides several new mechanisms for browsing and searching places, people and dates in the
53 collection.</p>
54
55 <h2>Java Dependencies and Supporting Technologies</h2>
56 <p>The Java hopper uses a host of third-party frameworks to simplify many of its tasks. To interact with
57 the database, it initially used JDBC and raw SQL queries; most of these interactions have since been
58 rewritten, along with the corresponding model objects, in Hibernate, an object-relational mapping
59 (ORM) system. The front-end began its existence as a collection of JSP pages, and, while in large part
60 it remains as such, most newer functionality is backed by the Spring framework and uses either FreeMarker
61 or JSP templates with controller classes. Searching is handled by the Lucene library, and XML processing
62 is handled by a great many libraries including Xalan, Xerces and JDOM. Ant serves as a build tool, and
63 Tomcat (or an alternate servlet container) powers the front-end functionality.</p>
64 <p>Here follows a list of used third-party libraries, all of which are either included in the lib directory
65 or readily downloadable. Note that this is <i>not</i> a comprehensive list; in particular, third-party
66 libraries that need only be installed because another third-party library depends on them may not be listed
67 here.</p>
68 <ul>
69 <li>Commons CLI (<code>commons-cli.jar</code>) &mdash; a library for parsing command-line arguments</li>
70 <li>Commons Collections (<code>commons-collections.jar</code>) &mdash; adds some additional useful collections</li>
71 <li>Commons Configuration (<code>commons-configuration.jar</code>) &mdash; simplifies reading configuration
72 files</li>
73 <li>Commons DBCP (<code>commons-dbcp.jar</code>) &mdash; database connection/datasource libraries</li>
74 <li>Commons DBUtils (<code>commons-dbutils.jar</code>) &mdash; provides some wrappers to simplify SQL/JDBC
75 interactions</li>
76 <li>Commons Digester (<code>commons-digester.jar</code>) &mdash; simplifies reading XML files</li>
77 <li>Commons Lang (<code>commons-lang.jar</code>) &mdash; provides some additional functionality for core
78 classes</li>
79 <li>Commons Logging (<code>commons-logging.jar</code>) &mdash; a logging wrapper used by lots of frameworks</li>
80 <li>Commons Pool (<code>commons-pool.jar</code>) &mdash; database connection pooling</li>
81 <li>DOM4J (<code>dom4j-1.6.jar</code>) &mdash; DOM parsing library used by Hibernate</li>
82 <li>FreeMarker (<code>freemarker.jar</code>) &mdash; a templating system used by many of the Spring pages</li>
83 <li>Hibernate and supporting libraries (<code>hib/*.jar</code>) &mdash; Hibernate, the ORM framework, and various
84 supporting JARs</li>
85 <li>Java Mail libraries (<code>mail/*.jar</code>) &mdash; Mail libraries used for sending error reports</li>
86 <li>JDBC MySQL Connector (<code>jdbc-mysql.jar</code>) &mdash; MySQL connection adapter for JDBC</li>
87 <li>JDOM (<code>jdom.jar</code>) &mdash; DOM parsing library used by much of the Hopper code</li>
88 <li>JUnit (<code>junit.jar</code>) &mdash; Java unit testing framework</li>
89 <li>JSP/tag library files (<code>jstl.jar, standard.jar</code>) &mdash; JSP tag libraries</li>
90 <li>Log4J (<code>log4j.jar</code>) &mdash; logging library used for most of the Hopper&rsquo;s logging</li>
91 <li>Lucene (<code>lucene/*.jar</code>) &mdash; Java search framework</li>
92 <li>Spring (<code>spring*.jar</code>) &mdash; Spring web framework and various support libraries</li>
93 <li>UNC Greek Transcoder (<code>transcoder.jar</code>) &mdash; library for transcoding ancient Greek</li>
94 <li>Xalan (various) &mdash; Apache XSL transform library</li>
95 <li>Xerces (various) &mdash; Apache XML parsing library</li>
96 <li>XML Commons Resolver (<code>resolver.jar</code>) &mdash; XML catalog resolver</li>
97 </ul>
98
99 <h2>A Tour of the <code>reading</code> Package</h2>
100 <p>The following is an attempt to explain the various directories that the Java hopper contains and their
101 functionality. Note that this list ignores directories that are generated by the hopper in the course of
102 loading data (e.g., the directories containing the Lucene indices and the Java class files), as they can
103 theoretically go anywhere, depending on configuration.</p>
104 <dl>
105
106 <dt><code>reading/</code></dt>
107
108 <dd>
109 <dl>
110 <dt><code>jsp/</code> -- various display and support files</dt>
111 <dd><code>includes/</code> -- headers and other page fragments used by JSP files</dd>
112 <dd><code>index/</code> -- JSP files for the front page</dd>
113 <dd>
114 <dl>
115 <dt><code>META-INF/</code> -- additional support files</dt>
116 <dd><code>context.xml</code> -- Tomcat context configuration file</dd>
117 </dl>
118 </dd>
119 <dd>
120 <dl>
121 <dt><code>WEB-INF/</code> -- support files of various sorts</dt>
122 <dd><code>freemarker/</code> -- FreeMarker display files</dd>
123 <dd><code>taglib/</code> -- JSP tag library description files</dd>
124 <dd><code>xsl/</code> -- various display/CTS-related XSL stylesheets</dd>
125 </dl>
126 </dd>
127 </dl>
128 </dd>
129
130 <dd>
131 <dl>
132 <dt><code>lib/</code> -- supporting libraries</dt>
133 <dd><code>endorsed/</code> -- libraries that should be placed in a Java and/or Tomcat endorsed directory</dd>
134 </dl>
135 </dd>
136
137 <dd>
138 <dl>
139 <dt><code>properties/</code> -- configuration files for the hopper</dt>
140 <dd><code>abbreviations/</code> -- abbreviation files</dd>
141 </dl>
142 </dd>
143
144 <dd><code>sql/</code> -- MySQL table description files</dd>
145
146 <dd>
147 <dl>
148 <dt><code>src/</code></dt>
149 <dd>
150 <dl>
151 <dt><code>perseus/</code></dt>
152 <dd>
153 <dl>
154 <dt><code>artarch/</code> -- art/archaeology functionality</dt>
155 <dd><code>image/</code> -- image functionality</dd>
156 </dl>
157 </dd><dd><code>chunking/</code> -- classes for loading XML texts</dd>
158 <dd><code>controllers/</code> -- controllers for the views (follows same structure as src/perseus directory)</dd>
159 <dd><code>cts/</code> -- Canonical Text Services implementations</dd>
160 <dd><code>display/</code> -- display-related helper classes</dd>
161 <dd><code>document/</code> -- lots of classes that relate to documents in some way</dd>
162 <dd>
163 <dl>
164 <dt><code>eval/</code> -- classes for voting/evaluation</dt>
165 <dd><code>morph/</code> -- morphological form evaluation</dd>
166 </dl>
167 </dd>
168 <dd>
169 <dl>
170 <dt><code>ie/</code> -- information extraction code</dt>
171 <dd>
172 <dl>
173 <dt><code>entity/</code> -- named entity functionality</dt>
174 <dd><code>adapters/</code></dd>
175 <dd><code>service/</code></dd>
176 </dl>
177 </dd>
178 <dd><code>freq/</code> -- frequencies of every sort</dd>
179 </dl>
180 </dd>
181 <dd><code>language/</code> -- classes relating to languages</dd>
182 <dd><code>morph/</code> -- lots of classes relating to morphology in some way</dd>
183 <dd><code>qa/</code> -- some testing classes</dd>
184 <dd>
185 <dl>
186 <dt><code>search/</code> -- searching functionality (the current version is in the nu/ subdirectory)</dt>
187 <dd><code>greek/</code></dd>
188 <dd><code>latin/</code></dd>
189 <dd><code>nu/</code></dd>
190 </dl>
191 </dd>
192 <dd><code>servlet/</code> -- some servlets</dd>
193 <dd><code>sharing/</code> -- some code for playing nice with others</dd>
194 <dd><code>util/</code> -- lots of utility classes</dd>
195 <dd><code>visualizations/</code> -- classes relating to the different types of visualization APIs used</dd>
196 <dd><code>vocab/</code> -- vocabulary list functionality</dd>
197 <dd><code>voting/</code> -- sense/morph/entity voting functionality</dd>
198 </dl>
199 </dd>
200 </dl>
201 </dd>
202
203 <dd>
204 <dl>
205 <dt><code>static/</code> -- static files (if setting up a development server, this represents the webserver's "docroot"
206 directory)</dt>
207 <dd><code>css/</code> -- CSS files</dd>
208 <dd><code>img/</code> -- images</dd>
209 <dd><code>js/</code> -- JavaScript</dd>
210 <dd><code>xml/</code> -- static XML files</dd>
211 </dl>
212 </dd>
213 <dd>
214 <dl>
215 <dt><code>xslt/</code> -- various XSL stylesheets</dt>
216 <dd>
217 <dl>
218 <dt><code>build/</code> -- stylesheets used in loading data</dt>
219 <dd><code>document/</code></dd>
220 <dd><code>ie/</code></dd>
221 </dl>
222 </dd>
223 <dd><code>services/</code> -- miscellaneous</dd>
224 </dl>
225 </dd>
226 </dl>
227
228 <h2><b>Weaknesses</b></h2>
229 <ul>
230 <li>Although the Java rewrite was a great improvement on the Perl hopper, it has inherited some of the
231 Perl code&rsquo;s problems, thanks to its relying on old database tables for so long. In particular, the code
232 to handle texts with subdocuments is confusing and requires the relevant objects to be far more self-aware
233 than they should be.</li>
234 <li>The current codebase is not very modular; the code for the back-end and the web application, and all
235 the support files, is currently compiled to a single JAR. Ideally, it would compile into several different
236 modules that could be substituted in and out as the system&rsquo;s needs dictated. In general, the separation
237 between display and content could be greatly improved&mdash;many of the JSP pages could be broken out into
238 controllers and view templates, which would make it far easier for other digital libraries to implement
239 a different interface based on the Perseus back-end.</li>
240 <li>The Java rewrite itself required a considerable amount of rewriting as the developers gained more
241 familiarity with available third-party libraries. Much of the functionality began as raw SQL code embedded
242 within model objects and was rewritten over time in Hibernate, with Data Access Objects (DAOs) used to
243 separate the objects from the database code. This rewriting represented a great deal of time that could
244 have been used to add more features. Of course, none of the development staff could have been expected
245 to be acquainted with the full breadth of Java libraries, but more up-front planning and research would
246 have saved a great deal of time later on.</li>
247 </ul>
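<p>The DAO separation mentioned in the last item can be sketched as follows. This is a minimal illustration
and not the hopper's actual code: <code>TextDocument</code>, <code>TextDocumentDao</code> and
<code>InMemoryTextDocumentDao</code> are hypothetical names, and an in-memory map stands in for the
Hibernate-backed storage so that the example runs without any third-party libraries.</p>

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical model object: holds data only and knows nothing about storage.
final class TextDocument {
    private final String id;
    private final String title;

    TextDocument(String id, String title) {
        this.id = id;
        this.title = title;
    }

    String getId() { return id; }
    String getTitle() { return title; }
}

// Hypothetical DAO interface: the only place that would talk to the database.
interface TextDocumentDao {
    void save(TextDocument document);
    Optional<TextDocument> findById(String id);
}

// In-memory stand-in; a database-backed class could implement the same interface.
final class InMemoryTextDocumentDao implements TextDocumentDao {
    private final Map<String, TextDocument> store = new HashMap<>();

    @Override
    public void save(TextDocument document) {
        store.put(document.getId(), document);
    }

    @Override
    public Optional<TextDocument> findById(String id) {
        return Optional.ofNullable(store.get(id));
    }
}

public class DaoSketch {
    public static void main(String[] args) {
        TextDocumentDao dao = new InMemoryTextDocumentDao();
        dao.save(new TextDocument("1999.05.0163", "Michigan Papyri"));
        // Callers depend on the interface, so the storage layer can be swapped freely.
        String title = dao.findById("1999.05.0163")
                .map(TextDocument::getTitle)
                .orElse("(not found)");
        System.out.println(title);
    }
}
```

<p>Because callers see only the interface, replacing raw SQL with an ORM would, in a design like this, touch
the DAO implementation and leave the model objects and the display code alone.</p>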
248
249 <h2><b>Some Known Issues</b></h2>
250 <ul>
251 <li>When running the hopper, Tomcat tends to crash periodically with an OutOfMemoryError, at which point
252 it needs to be restarted. (It is not clear whether the fault is in the hopper, or in Spring, Hibernate or
253 the Java virtual machine; a Google search on the subject produces various forum discussions that establish
254 little for certain.)</li>
255 <li>Tomcat&rsquo;s logs, if left unchecked, will keep growing and growing (in particular, the catalina.out file)
256 and may eventually need to be deleted to reclaim disk space. Much of this can probably be solved with some
257 tweaks to Tomcat&rsquo;s logging configuration.</li>
258 <li>Some texts could not be released with the source code because they cannot be freely distributed.
259 However, these texts are always available on the <a href="http://www.perseus.tufts.edu/">Perseus website</a>.</li>
260 <li>Certain chunks of texts are known to be very large. In most cases, a truncated form of the chunk is
261 shown with the option to view the entire chunk. However, some chunks are simply too large for the system
262 to handle and currently cannot be displayed. These include:
263 <ul>
264 <li>Duke Databank of Documentary Papyri collection
265 <ul>
266 <li>Michigan Papyri (1999.05.0163): documents 223, 224, and 225</li>
267 <li>Papyrus de la Sorbonne (1999.05.0203): document 69</li>
268 </ul>
269 </li>
270 <li>American History collection
271 <ul>
272 <li>Harper's Encyclopedia of US History (2001.05.0132): Christopher Columbus and Abraham Lincoln entries</li>
273 <li>Knight's Mechanical Encyclopedia (2001.05.0138): firearm entry</li>
274 </ul>
275 </li>
276 <li>Renaissance collection
277 <ul>
278 <li>The History of England After the Conquest, An Electronic Edition (1999.03.0085): year 1585 and Queen Elizabeth regnal year 18</li>
279 </ul>
280 </li>
281 </ul>
282 </li>
283 </ul>
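<p>When investigating the OutOfMemoryError issue listed above, it can help to confirm how much heap the JVM
running Tomcat actually has. The following is a small diagnostic sketch, not part of the hopper; the class
name <code>HeapReport</code> is hypothetical. It uses only the standard <code>java.lang.Runtime</code> API.
The heap ceiling itself is set with the JVM's <code>-Xmx</code> option, which Tomcat's startup scripts pick
up from the <code>CATALINA_OPTS</code> environment variable; a suitable value depends on the installation
and is not suggested here.</p>

```java
public class HeapReport {
    // Converts a byte count to whole mebibytes.
    static long toMiB(long bytes) {
        return bytes / (1024L * 1024L);
    }

    // Summarises the JVM's current heap figures as one log-friendly line.
    static String describeHeap() {
        Runtime rt = Runtime.getRuntime();
        long max = rt.maxMemory();                      // upper bound the heap may grow to
        long used = rt.totalMemory() - rt.freeMemory(); // bytes currently in use
        return "heap used=" + toMiB(used) + " MiB, max=" + toMiB(max) + " MiB";
    }

    public static void main(String[] args) {
        System.out.println(describeHeap());
    }
}
```

<p>Logging this line periodically from inside the web application would show whether usage climbs steadily
towards the maximum, which would point to a leak and away from a limit that is simply too low.</p>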
284 </body>
285 </html>