Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ISSUE-39 Prefer Elasticsearch TransportClient over Node #42

Merged
merged 8 commits into from Aug 26, 2016
Merged

Conversation

lewismc
Copy link
Collaborator

@lewismc lewismc commented Aug 17, 2016

@Yongyao @quintinali this PR addresses a number of issues

  • Replace ALL configuration from Map<String, String> to java.util.Properites
  • Use of Elasticsearch TransportClient to connect to an existing ES cluster instead of creating new nodes all the time.
  • General large scale code cleanup. I would have liked to separate this into a separate patch however it was necessary in many instances due to the refactoring of Map<String, String> to java.util.Properites
  • Introduction of MudrodConstants.java which enables us to use constants for referencing Mudrod configuration options defined and read from config.xml. I will submit a separate patch for widespread use of MudrodConstants throughout the codebase in a subsequent pull request.

@lewismc
Copy link
Collaborator Author

lewismc commented Aug 17, 2016

I will push another bunch of commits to stabilize this branch.

@lewismc
Copy link
Collaborator Author

lewismc commented Aug 17, 2016

@Yongyao @quintinali in addition the most recent commit upgrades Elasticsearch from 1.7.X to 2.3.X.
Please test this pull request when you can. I would appreciate some feedback.
Thank you

@Yongyao
Copy link
Collaborator

Yongyao commented Aug 18, 2016

@quintinali Can you help test this pull request tomorrow? Thanks

@lewismc
Copy link
Collaborator Author

lewismc commented Aug 18, 2016

Any comments here folks?

@quintinali
Copy link
Collaborator

quintinali commented Aug 19, 2016

@lewismc I can not run mudrod successfully with program argument "-f -logDir E:\mudrodCoreTestData"

java.lang.NullPointerException
at esiptestbed.mudrod.discoveryengine.MudrodAbstract.initMudrod(MudrodAbstract.java:60)
at esiptestbed.mudrod.discoveryengine.MudrodAbstract.(MudrodAbstract.java:55)
at esiptestbed.mudrod.discoveryengine.DiscoveryEngineAbstract.(DiscoveryEngineAbstract.java:31)
at esiptestbed.mudrod.discoveryengine.WeblogDiscoveryEngine.(WeblogDiscoveryEngine.java:44)
at esiptestbed.mudrod.main.MudrodEngine.startProcessing(MudrodEngine.java:156)
at esiptestbed.mudrod.main.MudrodEngine.main(MudrodEngine.java:231)

Through debugging I found that the value of es, spark is null.

If I add three lines in the main function as following, this bug could be solved. But I am not sure whether you want to add these initial functions there?
` MudrodEngine me = new MudrodEngine();
me.props.put(LOG_DIR, dataDir);

  me.loadConfig();
  me.es = me.startESDriver();
  me.spark = me.startSparkDriver();

  switch (processingType) {
  case LOG_INGEST:
    me.startLogIngest();
    break;
  case FULL_INGEST:
    loadFullConfig(me, dataDir);
    me.startProcessing(); 
    break;
  default:
    break;
  }`

@quintinali
Copy link
Collaborator

@lewismc We used "private transient Node node" in ESdriver before, many functions in ESdriver use this parameter. Since node are not initialized when host is not null in this latest version, functions invoking node will fail, such as refreshIndex().

@lewismc
Copy link
Collaborator Author

lewismc commented Aug 19, 2016

@quintinali thank you v much for feed back. I will fix now.

@lewismc
Copy link
Collaborator Author

lewismc commented Aug 19, 2016

OK

@codecov-io
Copy link

codecov-io commented Aug 24, 2016

Current coverage is 0.00% (diff: 0.00%)

Merging #42 into master will not change coverage

@@           master    #42   diff @@
====================================
  Files          49     49          
  Lines        2689   2704    +15   
  Methods         0      0          
  Messages        0      0          
  Branches      317    318     +1   
====================================
  Hits            0      0          
- Misses       2689   2704    +15   
  Partials        0      0          

Powered by Codecov. Last update 2b2489c...a1d1018

@lewismc lewismc modified the milestones: 09/02/2016, 08/19/2016 Aug 24, 2016
@lewismc
Copy link
Collaborator Author

lewismc commented Aug 24, 2016

@Yongyao @quintinali please can both of you try this PR out again? I've made a number of updates which further clean up the codebase.
Please ensure that you have an elasticsearch server 2.3.5 running on localhost:9200

@lewismc
Copy link
Collaborator Author

lewismc commented Aug 24, 2016

@Yongyao @quintinali I just pushed another commit. Please test it out based on the elasticsearch server v 2.3.5 running on localhost:9200 and based upon the instructions below.

When I invoke the MudrodEngine with the following parameters -l -logDir /Users/lmcgibbn/Desktop/logDir/ I am able to debug through to the following lines

https://github.com/mudrod/mudrod/blob/master/core/src/main/java/esiptestbed/mudrod/discoveryengine/WeblogDiscoveryEngine.java#L113-L114

      } else if (file.isDirectory() && file.getName().matches(".*\\d+.*") && file.getName().contains(config.get("httpPrefix"))) {
        inputList.add(file.getName().replace(config.get("httpPrefix"), ""));
      }

The files I have locally are of type .json not of file suffix <para name="httpPrefix">access_log.</para> as is present within config.xml.
Please tests this code out. It is essential that we get this code stable soon and merge it in to master.

Please tag me here once you have debug'd the code and have some issues.
Thanks

@Yongyao
Copy link
Collaborator

Yongyao commented Aug 25, 2016

@lewismc That's correct. The logDir I shared with you is the intermediate results, which means you don't need raw logs anymore, and you can run -f -logDir /Users/lmcgibbn/Desktop/logDir/ directly.

I just tested the code with -f -logDir /Users/lmcgibbn/Desktop/logDir/. There is still some ES upgrading issue.

java.lang.NullPointerException
at esiptestbed.mudrod.driver.ESDriver.refreshIndex(ESDriver.java:320)
at esiptestbed.mudrod.driver.ESDriver.destroyBulkProcessor(ESDriver.java:133)
at esiptestbed.mudrod.driver.ESDriver.deleteAllByQuery(ESDriver.java:206)
at esiptestbed.mudrod.driver.ESDriver.deleteType(ESDriver.java:210)
at esiptestbed.mudrod.utils.LinkageTriple.insertTriples(LinkageTriple.java:76)
at esiptestbed.mudrod.semantics.SemanticAnalyzer.saveToES(SemanticAnalyzer.java:85)
at esiptestbed.mudrod.weblog.process.ClickStreamAnalyzer.execute(ClickStreamAnalyzer.java:54)
at esiptestbed.mudrod.discoveryengine.WeblogDiscoveryEngine.process(WeblogDiscoveryEngine.java:139)
at esiptestbed.mudrod.main.MudrodEngine.startProcessing(MudrodEngine.java:158)
at esiptestbed.mudrod.main.MudrodEngine.main(MudrodEngine.java:235)

As you can see, most of the problem is in ESDriver. Please let me know, if you need help.

@Yongyao
Copy link
Collaborator

Yongyao commented Aug 25, 2016

I think the code below is what is causing the issue. A lot of methods are using "Node", which is null, because it always goes into the first if condition.

// Prefer TransportClient
if (hosts != null && port > 1) {
TransportClient transportClient = TransportClient.builder().settings(settings).build();
for (String host: hosts)
transportClient.addTransportAddress(new InetSocketTransportAddress(InetAddress.getByName(host), port));
client = transportClient;
} else if (clusterName != null) {
node = nodeBuilder().settings(settings).client(true).node();
client = node.client();
}

@lewismc
Copy link
Collaborator Author

lewismc commented Aug 25, 2016

Thanks I'll scope later today.

On Thursday, August 25, 2016, Yongyao Jiang notifications@github.com
wrote:

I think the code below is what is causing the issue. A lot of methods are
using "Node", which is null, because it always goes into the first if
condition.

// Prefer TransportClient
if (hosts != null && port > 1) {
TransportClient transportClient = TransportClient.builder().
settings(settings).build();
for (String host: hosts)
transportClient.addTransportAddress(new InetSocketTransportAddress(InetAddress.getByName(host),
port));
client = transportClient;
} else if (clusterName != null) {
node = nodeBuilder().settings(settings).client(true).node();
client = node.client();
}


You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
#42 (comment), or mute
the thread
https://github.com/notifications/unsubscribe-auth/ABHJlyXJSPuAcN65VD4OJH1ACSoLbS2Iks5qjb0EgaJpZM4JmywP
.

Lewis
Dr. Lewis J. McGibbney Ph.D, B.Sc
Director
Phone: +1(626)498-3090
Skype: lewis.john.mcgibbney
Email: lewis.mcgibbney@gmail.com

@lewismc
Copy link
Collaborator Author

lewismc commented Aug 25, 2016

The code is correct. We should always prefer to use a client rather than a
Node. Creating a new Node is NOT what we want to do every time. This is a
waste or resources and is too expensive when running common tasks such as
log ingest, etc.
I will run the code with the full log ingest option on a new, empty,
unmapped ES instance and fix the remainder of the code.

This patch had turned out to me a blocker for my ontology development. The
quicker we can resolve this, the better.

What error is thrown when you execute a full log ingest? Are you running
it against a brand new unmapped ESv2.3.5 instance? Please post your error
belo such that I can reproduce the error consistently. Thanks Yongyao

On Thursday, August 25, 2016, Lewis John Mcgibbney <
lewis.mcgibbney@gmail.com> wrote:

Thanks I'll scope later today.

On Thursday, August 25, 2016, Yongyao Jiang <notifications@github.com
javascript:_e(%7B%7D,'cvml','notifications@github.com');> wrote:

I think the code below is what is causing the issue. A lot of methods are
using "Node", which is null, because it always goes into the first if
condition.

// Prefer TransportClient
if (hosts != null && port > 1) {
TransportClient transportClient = TransportClient.builder().sett
ings(settings).build();
for (String host: hosts)
transportClient.addTransportAddress(new InetSocketTransportAddress(InetAddress.getByName(host),
port));
client = transportClient;
} else if (clusterName != null) {
node = nodeBuilder().settings(settings).client(true).node();
client = node.client();
}


You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
#42 (comment), or mute
the thread
https://github.com/notifications/unsubscribe-auth/ABHJlyXJSPuAcN65VD4OJH1ACSoLbS2Iks5qjb0EgaJpZM4JmywP
.

Lewis
Dr. Lewis J. McGibbney Ph.D, B.Sc
Director
Phone: +1(626)498-3090
Skype: lewis.john.mcgibbney
Email: lewis.mcgibbney@gmail.com
javascript:_e(%7B%7D,'cvml','lewis.mcgibbney@gmail.com');

Lewis
Dr. Lewis J. McGibbney Ph.D, B.Sc
Director
Phone: +1(626)498-3090
Skype: lewis.john.mcgibbney
Email: lewis.mcgibbney@gmail.com

@Yongyao
Copy link
Collaborator

Yongyao commented Aug 25, 2016

@lewismc I have already fixed the problem in ESdriver. Please see my commit
23eb5e2

I have also tested the -f command. The results is correct. I will continue to test the -l command (preprosessing) this afternoon, which very likely has some error since the ES API has changes. I will report the result here later today.

Also, some minor issues here.
The first time I downloaded this PR, there are two errors that I have to fix it by hand.

  1. The pom.xml of core and service are not up to date. Maven-update update project
  2. The type package-info is already defined package-info.java /mudrod-core/src/main/java/esiptestbed/mudrod/main line 1 Java Problem

@lewismc
Copy link
Collaborator Author

lewismc commented Aug 25, 2016

Ok please keep testing. I need you and Yun to keep testing so that we can
move on.
Really appreciate the feedback.

On Thursday, August 25, 2016, Yongyao Jiang notifications@github.com
wrote:

@lewismc https://github.com/lewismc I have already fixed the problem in
ESdriver. Please see my commit
23eb5e2
23eb5e2

I have also tested the -f command. The results is correct. I will continue
to test the -l command (preprosessing) this afternoon, which very likely
has some error since the ES API has changes. I will report the result here
later today.

Also, some minor issues here.
The first time I downloaded this PR, there are two errors that I have to
fix it by hand.

  1. The pom.xml of core and service are not up to date. Maven-update update
    project
  2. The type package-info is already defined package-info.java
    /mudrod-core/src/main/java/esiptestbed/mudrod/main line 1 Java Problem


You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
#42 (comment), or mute
the thread
https://github.com/notifications/unsubscribe-auth/ABHJl9Mk5nfVPz5gkZsN1zr9xC8bGAq9ks5qjcVfgaJpZM4JmywP
.

Lewis
Dr. Lewis J. McGibbney Ph.D, B.Sc
Director
Phone: +1(626)498-3090
Skype: lewis.john.mcgibbney
Email: lewis.mcgibbney@gmail.com

@Yongyao
Copy link
Collaborator

Yongyao commented Aug 25, 2016

@lewismc I am testing the -l command. It works for very small logs, but it takes much longer to ingest the same amount of logs. Just give you an idea of what it is like, the log file that takes about 1hr now takes more than 1hour (still running). There was a mistake in importLog.java, but I have fixed it. I will keep testing, and let you know tomorrow morning at latest.

BTW, a crawler detection problem.

 public static final String GOOGLE_BOT = "gsa-crawler (Enterprise; T4-JPDGU3TRCQAXZ; earthdata-sa@lists.nasa.gov,srinivasa.s.tummala@nasa.gov)";
  public static final String GOOGLE_BOT_21 = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)";
  public static final String BING_BOT = "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)";
  public static final String YAHOO_BOT = "Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)";
  public static final String ROGER_BOT = "rogerbot/1.0 (http://www.moz.com/dp/rogerbot, rogerbot-crawler@moz.com)";
  public static final String YACY_BOT = "yacybot (/global; amd64 Windows Server 2008 R2 6.1; java 1.8.0_31; Europe/de) http://yacy.net/bot.html";
  public static final String YANDEX_BOT = "Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)";
  public static final String GOOGLE_IMAGE = "Googlebot-Image/1.0";
  public static final String RANDOMBOT_1 = "Mozilla/5.0 (iPhone; CPU iPhone OS 6_0 like Mac OS X) AppleWebKit/536.26 (KHTML, like Gecko) Version/6.0 Mobile/10A5376e Safari/8536.25 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)";
  public static final String NO_AGENT_BOT = "-";
  public static final String PERL_BOT = "libwww-perl/";
  public static final String APACHE_HHTP = "Apache-HttpClient/";
  public static final String JAVA_CLIENT = "Java/";
  public static final String CURL = "curl/";
 if (agent.equals(GOOGLE_BOT) || agent.equals(GOOGLE_BOT_21)
        || agent.equals(BING_BOT) || agent.equals(YAHOO_BOT)
        || agent.equals(ROGER_BOT) || agent.equals(YACY_BOT)
        || agent.equals(YANDEX_BOT) || agent.equals(GOOGLE_IMAGE)
        || agent.equals(RANDOMBOT_1) || agent.equals(NO_AGENT_BOT)
        || agent.contains(PERL_BOT) || agent.contains(APACHE_HHTP)
        || agent.contains(JAVA_CLIENT) || agent.contains(CURL)) {
      return true;
    } else {
      return false;
    }

Do you think it is a good practice to do so? I have found a lot of new agents in the log of 2016. Maybe change "equal()" to "contain("crawler") or contain("bot")" ?

@lewismc
Copy link
Collaborator Author

lewismc commented Aug 25, 2016

Well ingestion should not be slower and we need to write ingestion tests to
benchmark log ingestion. Please open an issue for this in Github issues.

Regarding crawler detection, the current crawler http agent name det croon
logic is horrible. We need to change this and I have suggestions for how we
do that however that is another issue. Please also log an issue for that in
Github, we can then write the implementations and also write tests to
validate.

Thanks

On Thursday, August 25, 2016, Yongyao Jiang notifications@github.com
wrote:

@lewismc https://github.com/lewismc I am testing the -l command. It
works for very small logs, but it takes much longer to ingest the same
amount of logs. Just give you an idea of what it is like, the log file that
takes about 1hr now takes more than 1hour (still running). There was a
mistake in importLog.java, but I have fixed it. I will keep testing, and
let you know tomorrow morning at latest.

BTW, a crawler detection problem.

public static final String GOOGLE_BOT = "gsa-crawler (Enterprise; T4-JPDGU3TRCQAXZ; earthdata-sa@lists.nasa.gov javascript:_e(%7B%7D,'cvml','earthdata-sa@lists.nasa.gov');,srinivasa.s.tummala@nasa.gov javascript:_e(%7B%7D,'cvml','srinivasa.s.tummala@nasa.gov');)";
public static final String GOOGLE_BOT_21 = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)";
public static final String BING_BOT = "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)";
public static final String YAHOO_BOT = "Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)";
public static final String ROGER_BOT = "rogerbot/1.0 (http://www.moz.com/dp/rogerbot, rogerbot-crawler@moz.com javascript:_e(%7B%7D,'cvml','rogerbot-crawler@moz.com');)";
public static final String YACY_BOT = "yacybot (/global; amd64 Windows Server 2008 R2 6.1; java 1.8.0_31; Europe/de) http://yacy.net/bot.html";
public static final String YANDEX_BOT = "Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)";
public static final String GOOGLE_IMAGE = "Googlebot-Image/1.0";
public static final String RANDOMBOT_1 = "Mozilla/5.0 (iPhone; CPU iPhone OS 6_0 like Mac OS X) AppleWebKit/536.26 (KHTML, like Gecko) Version/6.0 Mobile/10A5376e Safari/8536.25 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)";
public static final String NO_AGENT_BOT = "-";
public static final String PERL_BOT = "libwww-perl/";
public static final String APACHE_HHTP = "Apache-HttpClient/";
public static final String JAVA_CLIENT = "Java/";
public static final String CURL = "curl/";

if (agent.equals(GOOGLE_BOT) || agent.equals(GOOGLE_BOT_21)
|| agent.equals(BING_BOT) || agent.equals(YAHOO_BOT)
|| agent.equals(ROGER_BOT) || agent.equals(YACY_BOT)
|| agent.equals(YANDEX_BOT) || agent.equals(GOOGLE_IMAGE)
|| agent.equals(RANDOMBOT_1) || agent.equals(NO_AGENT_BOT)
|| agent.contains(PERL_BOT) || agent.contains(APACHE_HHTP)
|| agent.contains(JAVA_CLIENT) || agent.contains(CURL)) {
return true;
} else {
return false;
}

Do you think it is a good practice to do so? I have found a lot of new
agents in the log of 2016. Maybe change "equal()" to "contain("crawler") or
contain("bot")" ?


You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
#42 (comment), or mute
the thread
https://github.com/notifications/unsubscribe-auth/ABHJlzO2PJzzw22ERxzWTa16mIxxFy60ks5qjgPogaJpZM4JmywP
.

Lewis
Dr. Lewis J. McGibbney Ph.D, B.Sc
Director
Phone: +1(626)498-3090
Skype: lewis.john.mcgibbney
Email: lewis.mcgibbney@gmail.com

@lewismc
Copy link
Collaborator Author

lewismc commented Aug 26, 2016

Hi @Yongyao @quintinali I just pushed some more commits to the ISSUE-39 branch. I am able to perform full ingest of local log files as well as other linkage information.
Unless there is some blocking issue, I would like to propose to merge this branch into master and we can progress with other issues.
You can see from my screenshot that I was able to ingest some 246K documents very quickly indeed.
screen shot 2016-08-25 at 7 06 26 pm

@Yongyao
Copy link
Collaborator

Yongyao commented Aug 26, 2016

Yes, there is no problem with -f command right now, but this is not real full ingest. In order not to block your further work, I made it skip the preprocessing step last week. The 246K hits are not logs.

I think you can continue the ontology work based on this code. In the meanwhile, I will test the real preprocessing code and try to improve it.

Sent from my iPhone

On Aug 25, 2016, at 22:00, Lewis John McGibbney notifications@github.com wrote:

Hi @Yongyao @quintinali I just pushed some more commits to the ISSUE-39 branch. I am able to perform full ingest of local log files as well as other linkage information.
Unless there is some blocking issue, I would like to propose to merge this branch into master and we can progress with other issues.
You can see from my screenshot that I was able to ingest some 246K documents very quickly indeed.


You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub, or mute the thread.

@lewismc
Copy link
Collaborator Author

lewismc commented Aug 26, 2016

Yes, there is no problem with -f command right now, but this is not real full ingest. In order not to block your further work, I made it skip the preprocessing step last week. The 246K hits are not logs.

I am not understanding this. Can you please explain what else needs to be done for a full ingest to take place? Maybe we can discuss tomorrow @Yongyao as I am clearly missing some information here.
Is it possible to merge this code into master and then work towards the solution? The patch is too large already in my honest opinion. It is getting messy and I would like to see it merged sooner rather than later.

Can we work to implement the full ingest code after this has been merged into master?

@Yongyao
Copy link
Collaborator

Yongyao commented Aug 26, 2016

I agree. It's fine to merge it to the master. I'm happy to talk tomorrow, and 11am-12pm, 1:30-4pm work for me.

Sent from my iPhone

On Aug 25, 2016, at 23:02, Lewis John McGibbney notifications@github.com wrote:

Yes, there is no problem with -f command right now, but this is not real full ingest. In order not to block your further work, I made it skip the preprocessing step last week. The 246K hits are not logs.

I am not understanding this. Can you please explain what else needs to be done for a full ingest to take place? Maybe we can discuss tomorrow @Yongyao as I am clearly missing some information here.
Is it possible to merge this code into master and then work towards the solution? The patch is too large already in my honest opinion. It is getting messy and I would like to see it merged sooner rather than later.

Can we work to implement the full ingest code after this has been merged into master?


You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub, or mute the thread.

@lewismc lewismc merged commit a1d1018 into master Aug 26, 2016
@lewismc lewismc deleted the ISSUE-39 branch August 26, 2016 03:26
@lewismc
Copy link
Collaborator Author

lewismc commented Aug 26, 2016

Cool. I will try you at the earlier time. Thanks @Yongyao and @quintinali for reviews of this code. I appreciate it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants