Prepare v1.5.14 release
emmanuel-keller committed Feb 29, 2016
1 parent 08a7000 commit f375fba
Showing 10 changed files with 105 additions and 91 deletions.
16 changes: 16 additions & 0 deletions CHANGELOG.txt
@@ -8,6 +8,22 @@ https://github.com/jaeksoft/opensearchserver/issues (GH)
http://sourceforge.net/p/opensearchserve/feature-request/ (SF)
http://sourceforge.net/p/opensearchserve/bug-report/ (SF)

OpenSearchServer 1.5.14

New features:
- GH-1706: Autocompletion thread should be interruptible
- GH-1700: Tomcat version update 7.0.68
- GH-1689: Disabling link detection
- GH-1676: Support of PhantomJS in the HTMLParser
- GH-1671: Add a customizable char tokenizer
- GH-1669: An XML/JSON API for field terms extractions

Bug fixes:
- GH-1729: org.apache.cxf.interceptor.Fault, Tomcat 7.0.53, when starting crawl
- GH-1697: Ignored Number of suggestion for auto-completion
- GH-1696: Scheduler URL database error
- GH-1659: Crawler deadlocks after running while serving requests

OpenSearchServer 1.5.13

New features:
38 changes: 11 additions & 27 deletions CHANGES.txt
@@ -1,27 +1,11 @@
release date=12:00 05.05.2015,version=1.5.13,urgency=low,by=Emmanuel Keller <ekeller@open-search-server.com>,distribution=unknown
* GH-1653: REST Crawler - HTTP Header authentication method
* GH-1652: REST Crawler - an integrated sequence for paginated APIs support
* GH-1648: File event scheduler task
* GH-1635: For SMB crawling and AD login SID should also be extracted
* GH-1624: Add the field host in the field mapping of the file crawler
* GH-1617: The REST crawler should support local files indexation
* GH-1611: Support of multiple indexes
* GH-1601: Use compressed bit set for less memory consumption
* GH-1598: Facets can be limited in number
* GH-1597: Facets can be sorted by term or count
* GH-1595: Parallel sorting to better use multi-core system
* GH-1592: The pattern inclusion API (v2) failed to inject the URLs in the URL database
* GH-1582: Programmatically retrieving hosts from Web index
* GH-1562: Fault tolerancy in the database crawler
* GH-1559: Handle short date format in sitemap (yyyy-mm-dd)
* GH-1541: Ability to disable obeying robots.txt using the API
* GH-1537: Collapsing size to 0 merges the returned field
* GH-1524: Add fetch size parameter in the database crawler
* GH-1509: Hunspell native implementation is slow
* GH-1502: A scheduler task able to launch a crawl script
* GH-1494: Error HttpHostConnectException should not abort the crawl session
* GH-1480: Multithreaded OCR
* GH-1476: New index template for credentials storing
* GH-1475: Add an encryption filter
* GH-1471: Authentication based on external index
* GH-1470: Renderer: Add range to date facet
release date=12:00 04.04.2016,version=1.5.14,urgency=low,by=Emmanuel Keller <ekeller@open-search-server.com>,distribution=unknown
* GH-1729: org.apache.cxf.interceptor.Fault, Tomcat 7.0.53, when starting crawl
* GH-1706: Autocompletion thread should be interruptible
* GH-1700: Tomcat version update 7.0.68
* GH-1697: Ignored Number of suggestion for auto-completion
* GH-1696: Scheduler URL database error
* GH-1689: Disabling link detection
* GH-1676: Support of PhantomJS in the HTMLParser
* GH-1671: Add a customizable char tokenizer
* GH-1669: An XML/JSON API for field terms extractions
* GH-1659: Crawler deadlocks after running while serving requests
2 changes: 1 addition & 1 deletion NOTICE.txt
@@ -1,5 +1,5 @@
OpenSearchServer
Copyright 2008-2014 Emmanuel Keller / Jaeksoft
Copyright 2008-2016 Emmanuel Keller / Jaeksoft
http://www.open-search-server.com

OpenSearchServer is free software: you can redistribute it and/or
6 changes: 2 additions & 4 deletions README.md
@@ -2,7 +2,7 @@ OpenSearchServer
================
http://www.opensearchserver.com

Copyright Emmanuel Keller / Jaeksoft (2008-2015)
Copyright Emmanuel Keller / Jaeksoft (2008-2016)
This software is licensed under the GPL v3.

OpenSearchServer is a powerful, enterprise-class, search engine program. Using the web user interface, the crawlers (web, file, database, ...) and the REST/RESTFul API you will be able to integrate quickly and easily advanced full-text search capabilities in your application. OpenSearchServer runs on Linux/Unix/BSD/Windows.
@@ -90,8 +90,6 @@ Features
### General
- REST API (XML and JSON)
- SOAP Web Service
- Monitoring module
- Index replication
- Scheduler for management of periodic tasks
- WordPress plugin and Drupal module
- Scheduler for management of periodic tasks
32 changes: 16 additions & 16 deletions pom.xml
@@ -117,12 +117,12 @@
<dependency>
<groupId>it.unimi.dsi</groupId>
<artifactId>fastutil</artifactId>
<version>7.0.9</version>
<version>7.0.10</version>
</dependency>
<dependency>
<groupId>com.google.code.gson</groupId>
<artifactId>gson</artifactId>
<version>2.5</version>
<version>2.6.2</version>
</dependency>
<dependency>
<groupId>com.ibm.icu</groupId>
@@ -147,7 +147,7 @@
<dependency>
<groupId>args4j</groupId>
<artifactId>args4j</artifactId>
<version>2.32</version>
<version>2.33</version>
</dependency>
<dependency>
<groupId>org.quartz-scheduler</groupId>
@@ -217,7 +217,7 @@
<dependency>
<groupId>org.apache.pdfbox</groupId>
<artifactId>pdfbox-ant</artifactId>
<version>1.8.10</version>
<version>1.8.11</version>
</dependency>
<dependency>
<groupId>org.icepdf</groupId>
@@ -247,7 +247,7 @@
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-client</artifactId>
<version>2.7.1</version>
<version>2.7.2</version>
</dependency>
<dependency>
<groupId>commons-codec</groupId>
@@ -333,7 +333,7 @@
<dependency>
<groupId>org.slf4j</groupId>
<artifactId>slf4j-log4j12</artifactId>
<version>1.7.13</version>
<version>1.7.18</version>
</dependency>
<dependency>
<groupId>org.zkoss.zk</groupId>
@@ -446,7 +446,7 @@
<dependency>
<groupId>org.antlr</groupId>
<artifactId>antlr4-runtime</artifactId>
<version>4.5.1-1</version>
<version>4.5.2-1</version>
</dependency>
<dependency>
<groupId>net.sf.opencsv</groupId>
@@ -456,12 +456,12 @@
<dependency>
<groupId>org.codehaus.groovy</groupId>
<artifactId>groovy-all</artifactId>
<version>2.4.5</version>
<version>2.4.6</version>
</dependency>
<dependency>
<groupId>org.mongodb</groupId>
<artifactId>mongo-java-driver</artifactId>
<version>3.2.0</version>
<version>3.2.2</version>
</dependency>
<dependency>
<groupId>net.sf.jmimemagic</groupId>
@@ -476,7 +476,7 @@
<dependency>
<groupId>org.roaringbitmap</groupId>
<artifactId>RoaringBitmap</artifactId>
<version>0.5.15</version>
<version>0.6.3</version>
</dependency>
<dependency>
<groupId>mysql</groupId>
@@ -491,7 +491,7 @@
<dependency>
<groupId>org.postgresql</groupId>
<artifactId>postgresql</artifactId>
<version>9.4.1207</version>
<version>9.4.1208.jre7</version>
</dependency>
<dependency>
<groupId>org.hsqldb</groupId>
@@ -554,13 +554,13 @@
<project.reporting.outputEncoding>UTF-8</project.reporting.outputEncoding>
<cxf.version>2.7.18</cxf.version>
<lucene.version>3.6.2</lucene.version>
<httpclient.version>4.5.1</httpclient.version>
<httpclient.version>4.5.2</httpclient.version>
<httpcore.version>4.4.4</httpcore.version>
<poi.version>3.12</poi.version>
<selenium.version>2.48.2</selenium.version>
<poi.version>3.13</poi.version>
<selenium.version>2.52.0</selenium.version>
<zk.version>6.5.4</zk.version>
<jackson.version>2.6.4</jackson.version>
<tomcat.version>7.0.67</tomcat.version>
<jackson.version>2.6.5</jackson.version>
<tomcat.version>7.0.68</tomcat.version>
</properties>
<description>OpenSearchServer is a powerful, enterprise-class, search engine program. Using the web user interface, the crawlers (web, file, database, ...) and the REST/RESTFul API you will be able to integrate quickly and easily advanced full-text search capabilities in your application. OpenSearchServer runs on Windows and Linux/Unix/BSD.</description>
<organization>
4 changes: 2 additions & 2 deletions shell/start.bat
@@ -6,7 +6,7 @@ rem Move to the directory containing this script
cd %cd%

set LANG=en_US.UTF-8
set JAVA_OPTS=%JAVA_OPTS% -Dfile.encoding=UTF-8
set JAVA_OPTS=%JAVA_OPTS% -Dfile.encoding=UTF-8 -Djava.protocol.handler.pkgs=jcifs

rem The directory containing the indexes
set OPENSEARCHSERVER_DATA=%cd%\data
@@ -18,4 +18,4 @@ rem Any JAVA option. Often used to allocate more memory. Uncomment this line to
rem set JAVA_OPTS=%JAVA_OPTS% -Xms1G -Xmx1G

rem Starting the server
java %JAVA_OPTS% -jar opensearchserver.jar -extractDirectory server -httpPort %SERVER_PORT% -uriEncoding UTF-8 -Doss.externalparser.classpath=%cd%/lib/ext/*
java %JAVA_OPTS% -jar opensearchserver.jar -extractDirectory server -httpPort %SERVER_PORT% -uriEncoding UTF-8
4 changes: 1 addition & 3 deletions shell/start.sh
@@ -6,7 +6,7 @@ cd `dirname "$0"`
#
LANG=en_US.UTF-8
export LANG
JAVA_OPTS="$JAVA_OPTS -Dfile.encoding=UTF-8"
JAVA_OPTS="$JAVA_OPTS -Dfile.encoding=UTF-8 -Djava.protocol.handler.pkgs=jcifs"

# The directory containing the indexes (must be exported)
OPENSEARCHSERVER_DATA=data
@@ -22,8 +22,6 @@ SERVER_PORT=9090
eval java $JAVA_OPTS -jar opensearchserver.jar \
-extractDirectory server \
-httpPort ${SERVER_PORT} \
-Djava.protocol.handler.pkgs=jcifs \
-Doss.externalparser.classpath=lib/ext/* \
-uriEncoding UTF-8 \
>> "logs/oss.log" 2>&1 "&"

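A note on why the start.sh hunk moves `-Djava.protocol.handler.pkgs=jcifs` into `JAVA_OPTS`: the `java` launcher treats everything after `-jar <file>` as program arguments, so a `-D` flag placed there is handed to `main()` instead of becoming a system property (unless the application parses it itself). The sketch below is a hypothetical probe class, not part of OpenSearchServer, that makes the difference visible.

```java
public class PropProbe {

    // Reports the system property and the raw program arguments side by side.
    static String describe(String[] args) {
        return "property=" + System.getProperty("java.protocol.handler.pkgs")
                + " args=" + String.join(",", args);
    }

    public static void main(String[] args) {
        // Compare:
        //   java -Djava.protocol.handler.pkgs=jcifs PropProbe
        //   java PropProbe -Djava.protocol.handler.pkgs=jcifs
        System.out.println(describe(args));
    }
}
```

With the flag before the class name the property is set and `args` is empty. With the flag after it, the property stays unset and the flag appears in `args`.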
2 changes: 1 addition & 1 deletion src/deb/init.d/opensearchserver
@@ -24,7 +24,7 @@ OPENSEARCHSERVER_DIR=/var/lib/opensearchserver
OPENSEARCHSERVER_SHARE=/usr/share/opensearchserver
OPENSEARCHSERVER_JAR=$OPENSEARCHSERVER_SHARE/opensearchserver.jar
SERVER_DIR=$OPENSEARCHSERVER_DIR/server
SERVER_OPTS="$JAVA_OPTS -Dfile.encoding=UTF-8 -jar $OPENSEARCHSERVER_JAR -extractDirectory $SERVER_DIR -httpPort $SERVER_PORT -uriEncoding UTF-8"
SERVER_OPTS="$JAVA_OPTS -Djava.protocol.handler.pkgs=jcifs -Dfile.encoding=UTF-8 -jar $OPENSEARCHSERVER_JAR -extractDirectory $SERVER_DIR -httpPort $SERVER_PORT -uriEncoding UTF-8"
SERVER_LOG=/var/log/opensearchserver/server.out
SERVER_PID="/var/run/opensearchserver.pid"
export SERVER_USER=opensearchserver
15 changes: 15 additions & 0 deletions src/main/java/com/jaeksoft/searchlib/Server.java
@@ -16,6 +16,8 @@
import com.github.jankroken.commandline.annotations.Option;
import com.github.jankroken.commandline.annotations.ShortSwitch;
import com.github.jankroken.commandline.annotations.SingleArgument;
import com.github.jankroken.commandline.annotations.Toggle;
import com.jaeksoft.searchlib.util.FileUtils;
import com.jaeksoft.searchlib.util.ThreadUtils;
import com.jaeksoft.searchlib.web.StartStopListener;

@@ -25,9 +27,13 @@ public class Server {

private Server(Arguments arguments) {
File baseDir = new File(arguments.extractDirectory == null ? "server" : arguments.extractDirectory);
if (baseDir.exists())
if (arguments.resetExtract && baseDir.isDirectory())
FileUtils.deleteDirectoryQuietly(baseDir);
if (!baseDir.exists())
baseDir.mkdir();
tomcat = new Tomcat();
tomcat.noDefaultWebXmlPath();
tomcat.setPort(arguments.httpPort == null ? 9090 : arguments.httpPort);
tomcat.setBaseDir(baseDir.getAbsolutePath());
tomcat.getHost().setAppBase(baseDir.getAbsolutePath());
@@ -41,6 +47,7 @@ public static class Arguments {
private String extractDirectory = null;
private Integer httpPort = null;
private String uriEncoding = null;
private boolean resetExtract = false;

@Option
@LongSwitch("extractDirectory")
@@ -66,6 +73,14 @@ public void setUriEncoding(String uriEncoding) {
this.uriEncoding = uriEncoding;
}

@Option
@LongSwitch("resetExtract")
@ShortSwitch("r")
@Toggle(true)
public void setResetExtract(boolean resetExtract) {
this.resetExtract = resetExtract;
}

}

private void start(boolean await) throws IOException, URISyntaxException {
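The Server.java hunks add a `resetExtract` toggle (`-resetExtract` / `-r`) that clears the extract directory before Tomcat starts. A minimal sketch of that reset-then-create pattern using only the JDK follows. `ExtractDirSketch` and `prepareBaseDir` are hypothetical names, and the recursive delete is an assumed stand-in for the project's `FileUtils.deleteDirectoryQuietly`, whose exact behavior this page does not show.

```java
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.stream.Stream;

public class ExtractDirSketch {

    // Hypothetical helper, not part of OpenSearchServer: optionally wipes the
    // extract directory, then makes sure it exists.
    static File prepareBaseDir(File baseDir, boolean resetExtract) {
        if (resetExtract && baseDir.isDirectory()) {
            try (Stream<Path> walk = Files.walk(baseDir.toPath())) {
                // Reverse order visits children before their parent directory.
                walk.sorted(Comparator.reverseOrder()).map(Path::toFile).forEach(File::delete);
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }
        if (!baseDir.exists() && !baseDir.mkdirs())
            throw new UncheckedIOException(new IOException("Cannot create " + baseDir));
        return baseDir;
    }

    public static void main(String[] args) throws IOException {
        File base = Files.createTempDirectory("oss-extract").toFile();
        File stale = new File(base, "stale.txt");
        Files.write(stale.toPath(), "old".getBytes());
        prepareBaseDir(base, false);
        System.out.println("without reset, stale file present: " + stale.exists());
        prepareBaseDir(base, true);
        System.out.println("with reset, stale file present: " + stale.exists());
    }
}
```

Without the toggle, a previous extraction is reused as-is. With it, the directory is recreated empty, which helps after an upgrade leaves stale files behind.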
77 changes: 40 additions & 37 deletions src/main/java/com/jaeksoft/searchlib/parser/PptParser.java
@@ -25,11 +25,13 @@
package com.jaeksoft.searchlib.parser;

import java.io.IOException;
import java.util.List;

import org.apache.poi.hslf.model.Slide;
import org.apache.poi.hslf.model.TextRun;
import org.apache.poi.hslf.record.TextHeaderAtom;
import org.apache.poi.hslf.usermodel.SlideShow;
import org.apache.poi.hslf.usermodel.HSLFSlide;
import org.apache.poi.hslf.usermodel.HSLFSlideShow;
import org.apache.poi.hslf.usermodel.HSLFTextParagraph;
import org.apache.poi.hslf.usermodel.HSLFTextRun;

import com.jaeksoft.searchlib.SearchLibException;
import com.jaeksoft.searchlib.analysis.ClassPropertyEnum;
@@ -43,9 +45,8 @@ public class PptParser extends Parser {

public static final String[] DEFAULT_EXTENSIONS = { "ppt" };

private static ParserFieldEnum[] fl = { ParserFieldEnum.parser_name,
ParserFieldEnum.title, ParserFieldEnum.note, ParserFieldEnum.body,
ParserFieldEnum.other };
private static ParserFieldEnum[] fl = { ParserFieldEnum.parser_name, ParserFieldEnum.title, ParserFieldEnum.note,
ParserFieldEnum.body, ParserFieldEnum.other };

public PptParser() {
super(fl);
@@ -58,43 +59,45 @@ public void initProperties() throws SearchLibException {
}

@Override
protected void parseContent(StreamLimiter streamLimiter, LanguageEnum lang)
throws IOException {
protected void parseContent(StreamLimiter streamLimiter, LanguageEnum lang) throws IOException {

SlideShow ppt = new SlideShow(streamLimiter.getNewInputStream());
Slide[] slides = ppt.getSlides();
HSLFSlideShow ppt = new HSLFSlideShow(streamLimiter.getNewInputStream());
List<HSLFSlide> slides = ppt.getSlides();
ParserResultItem result = getNewParserResultItem();
for (Slide slide : slides) {
TextRun[] textRuns = slide.getTextRuns();
for (TextRun textRun : textRuns) {
ParserFieldEnum field;
switch (textRun.getRunType()) {
case TextHeaderAtom.TITLE_TYPE:
case TextHeaderAtom.CENTER_TITLE_TYPE:
field = ParserFieldEnum.title;
break;
case TextHeaderAtom.NOTES_TYPE:
field = ParserFieldEnum.note;
break;
case TextHeaderAtom.BODY_TYPE:
case TextHeaderAtom.CENTRE_BODY_TYPE:
case TextHeaderAtom.HALF_BODY_TYPE:
case TextHeaderAtom.QUARTER_BODY_TYPE:
field = ParserFieldEnum.body;
break;
case TextHeaderAtom.OTHER_TYPE:
default:
field = ParserFieldEnum.other;
break;
for (HSLFSlide slide : slides) {
List<List<HSLFTextParagraph>> textLevel0 = slide.getTextParagraphs();
for (List<HSLFTextParagraph> textLevel1 : textLevel0) {
for (HSLFTextParagraph textPara : textLevel1) {
ParserFieldEnum field;
switch (textPara.getRunType()) {
case TextHeaderAtom.TITLE_TYPE:
case TextHeaderAtom.CENTER_TITLE_TYPE:
field = ParserFieldEnum.title;
break;
case TextHeaderAtom.NOTES_TYPE:
field = ParserFieldEnum.note;
break;
case TextHeaderAtom.BODY_TYPE:
case TextHeaderAtom.CENTRE_BODY_TYPE:
case TextHeaderAtom.HALF_BODY_TYPE:
case TextHeaderAtom.QUARTER_BODY_TYPE:
field = ParserFieldEnum.body;
break;
case TextHeaderAtom.OTHER_TYPE:
default:
field = ParserFieldEnum.other;
break;
}
StringBuilder sb = new StringBuilder();
for (HSLFTextRun textRun : textPara.getTextRuns()) {
sb.append(textRun.getRawText());
sb.append(' ');
}
result.addField(field, StringUtils.replaceConsecutiveSpaces(sb.toString(), " "));
}
String[] frags = textRun.getText().split("\\n");
for (String frag : frags)
result.addField(field,
StringUtils.replaceConsecutiveSpaces(frag, " "));
}
}
result.langDetection(10000, ParserFieldEnum.body);

}

}
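
The rewritten PptParser loop emits one field value per paragraph by concatenating the raw text of its runs, each followed by a space, and then collapsing repeated spaces. The sketch below isolates that joining step without the POI dependency. `RunJoinSketch`, `joinRuns` and `collapseSpaces` are hypothetical names, and the regex is only an assumed stand-in for `StringUtils.replaceConsecutiveSpaces`, which may behave differently (for example with tabs or trimming).

```java
import java.util.Arrays;
import java.util.List;

public class RunJoinSketch {

    // Assumed stand-in for StringUtils.replaceConsecutiveSpaces(text, " ").
    static String collapseSpaces(String text) {
        return text.replaceAll("\\s+", " ");
    }

    // Appends a space after every run, as the new loop does, then collapses whitespace.
    static String joinRuns(List<String> rawRuns) {
        StringBuilder sb = new StringBuilder();
        for (String raw : rawRuns) {
            sb.append(raw);
            sb.append(' ');
        }
        return collapseSpaces(sb.toString());
    }

    public static void main(String[] args) {
        // Brackets make the trailing space visible.
        System.out.println("[" + joinRuns(Arrays.asList("Quarterly  results", "2016")) + "]");
    }
}
```

In this sketch the value keeps a single trailing space, because a space is appended after the last run as well. Whether the real helper trims it is not visible in this diff.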
