Use dictionary for dorks storage with regular dumps to json file #29

Closed
wants to merge 2 commits

Conversation

johnnykv
Member

While rewriting the dork_db with SQLAlchemy I realized that the dorks database does not in any way justify using a full-blown RDBMS... So here goes my take on a much simpler and more maintainable approach.

From commit message:

  • Complete rewrite of dork_db.py.
  • Dorks are now completely stored in memory.
  • Every 10th update the memory representation gets dumped to a JSON file.

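A minimal sketch of what the dictionary-based approach described above could look like (class name, dump path, and structure are illustrative, not the actual code in the two commits):

import collections
import json


class InMemoryDorkDB(object):
    """Keeps dorks in a plain dict and dumps them to a JSON file
    every DUMP_INTERVAL'th update."""

    DUMP_INTERVAL = 10

    def __init__(self, dump_file="db/dorks.json"):
        self.dump_file = dump_file
        # {table name: {dork content: hit count}}
        self.dorks = collections.defaultdict(dict)
        self.updates = 0

    def insert(self, tablename, content):
        table = self.dorks[tablename]
        table[content] = table.get(content, 0) + 1
        self.updates += 1
        if self.updates % self.DUMP_INTERVAL == 0:
            self.dump()

    def dump(self):
        with open(self.dump_file, "w") as dump_file:
            json.dump(self.dorks, dump_file)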
@ghost ghost assigned glaslos Jan 18, 2013
@glaslos
Member

glaslos commented Jan 18, 2013

I don't agree with your solution, but at the same time I agree that this has to change.
Let me explain: the SQLite dork db was implemented for set-ups without a full-blown database. For example, my sensors don't report into a local database but send the events via hpfeeds into a central database. This is also quite useful if you run Glastopf on "weak" hardware like virtual servers or Raspberry Pi-like systems, which is why I think the in-memory solution is not a good general approach.

Other options to solve this:

  • If you are logging events to a local database, the dork db is basically a smaller copy of the events db. Instead of having separate DBs, we could also leverage the data in the event database or create a linked table with just the unique request paths. This would also reduce the size of the events database.
  • If you are running Glastopf on a low-performance machine, we have to come up with something less heavy than keeping the whole thing in memory. I have had dork databases with close to 100k entries, which shouldn't be kept in memory. I'd go with keeping them in some format in a file and, when we need dorks, reading X random dorks from it (see the sketch after this list). The downside is that we lose run-time selection algorithms like "give me the 200 most attacked paths as dorks"...
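A minimal sketch of that second option, assuming the dorks live one per line in a plain text file (the file format and function name are illustrative, not part of this PR):

import random


def sample_dorks(path, x):
    """Pick x random dorks from a flat file without loading the whole
    file into memory (reservoir sampling)."""
    reservoir = []
    seen = 0
    with open(path) as dork_file:
        for line in dork_file:
            dork = line.strip()
            if not dork:
                continue
            seen += 1
            if len(reservoir) < x:
                reservoir.append(dork)
            else:
                # replace an existing pick with probability x / seen
                j = random.randrange(seen)
                if j < x:
                    reservoir[j] = dork
    return reservoir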

What do you think? (Going to look at your code after lunch...)

@johnnykv
Member Author

I see your point, I did not consider use cases involving embedded systems.
Losing the capability to do things like "200 most attacked paths..." would also be a shame. Actually I was in doubt myself, might just be me getting tired of SQL :-)

Take a look at the implementation (pretty much finished) done with SQLAlchemy. If you think that is better, no biggie - I'll just push that instead.

import datetime
import threading
import logging
from sqlalchemy import Table, Column, Integer, String, MetaData
from sqlalchemy import create_engine, select


logger = logging.getLogger(__name__)


class DorkDB(object):
    """
    Responsible for communication with the dork database.
    """

    sqlite_lock = threading.Lock()

    def __init__(self, dork_connection_string="sqlite:///db/dork.db"):
        meta = MetaData()
        self.tables = self.create(meta)
        self.engine = create_engine(dork_connection_string)
        # Create the tables if they do not exist
        meta.create_all(self.engine)
        self.conn = self.engine.connect()

    def create(self, meta):
        tables = {}
        tablenames = ["intitle", "intext", "inurl", "filetype", "ext", "allinurl"]
        for table in tablenames:
            tables[table] = Table(table, meta,
                                  Column('content', String, primary_key=True),
                                  Column('count', Integer),
                                  Column('firsttime', String),
                                  Column('lasttime', String),
                                  )
        return tables

    def insert(self, insert_list):
        if len(insert_list) == 0:
            return
        # TODO: exception handling - or fail hard?
        with DorkDB.sqlite_lock:
            conn = self.engine.connect()
            for item in insert_list:
                tablename = item['table']
                table = self.tables[tablename]
                content = item['content']

                #skip empty
                if not content:
                    continue

                dt_string = datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S")
                # check whether the content already exists - content is the primary key
                db_content = conn.execute(
                    select([table]).
                    where(table.c.content == content)).fetchone()
                if db_content is None:
                    conn.execute(
                        table.insert().values({'content': content,
                                               'count': 1,
                                               'firsttime': dt_string,
                                               'lasttime': dt_string}))
                else:
                    #update existing entry
                    conn.execute(
                        table.update().
                        where(table.c.content == content).
                        values(lasttime=dt_string,
                               count=table.c.count + 1))
        #TODO: Clean up db?


    def get_dork_list(self, tablename, starts_with=None):
        with DorkDB.sqlite_lock:
            table = self.tables[tablename]

            if starts_with is None:
                result = self.conn.execute(select([table]))
            else:
                # match entries whose content begins with starts_with
                result = self.conn.execute(
                    table.select().
                    where(table.c.content.like('{0}%'.format(starts_with))))

            # fetch the rows while still holding the lock
            rows = result.fetchall()

        return [row[0] for row in rows]
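Hypothetical usage of the class above (the dork values are made up):

dork_db = DorkDB("sqlite:///db/dork.db")
dork_db.insert([
    {'table': 'inurl', 'content': '/admin/login.php'},
    {'table': 'intitle', 'content': 'index of'},
])
admin_dorks = dork_db.get_dork_list('inurl', starts_with='/admin')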

@glaslos
Member

glaslos commented Jan 18, 2013

I think that is the way to go... But still, feel free to criticize it; I haven't put a lot of thought into it. Also, what do you think about having the events and the dorks in the same database in different tables? Instead of inserting events and dorks separately, we can insert the dorks as soon as we insert the event. Also, request_url in the events.db and content in the dork.db are the same, right? We could use the content column in the dorks.db as request_url in the events.db and save some space.
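A rough sketch of that single-database layout; the table names, the extra event columns, and the database file name are assumptions, only the shared content/request_url column comes from the discussion above:

from sqlalchemy import (MetaData, Table, Column, Integer, String,
                        ForeignKey, create_engine)

meta = MetaData()

# One table of unique request paths, shared by event logging and dork generation.
paths = Table('paths', meta,
              Column('content', String, primary_key=True),
              Column('count', Integer),
              Column('firsttime', String),
              Column('lasttime', String))

# Events reference the shared path instead of storing request_url again.
events = Table('events', meta,
               Column('id', Integer, primary_key=True),
               Column('time', String),
               Column('source', String),
               Column('request_url', String, ForeignKey('paths.content')))

engine = create_engine("sqlite:///db/glastopf.db")
meta.create_all(engine)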

@johnnykv
Member Author

I like that idea, and it's also easy to implement. The only problem would be that if your sensor only uses hpfeeds for logging, you can't use the dork stuff.

@glaslos
Member

glaslos commented Jan 18, 2013

Just start by assuming we can create a SQLite database. I'll talk to Mark during the workshop to figure out if we can get a query interface in HPFeeds. Like, my sensor sends a request for dorks to hpfeeds and a "machine-learning-uber-beast" selects the perfect dorks for me and publishes them to a channel. This could mean that instead of showing the same top 10 dorks on all sensors, we can distribute the attack surface. If you think that's too slow, we can cache the dorks for X minutes on every sensor (see the sketch below). The uber-brain also gets the events and is able to evaluate the effectiveness of the used domain, the dorks, various configuration settings, and the location of the sensor.
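The per-sensor caching could look roughly like this; fetch_func and the default TTL are assumptions:

import time


class DorkCache(object):
    """Caches dorks fetched from a central service for a fixed number of
    minutes, so a sensor does not have to query it on every request."""

    def __init__(self, fetch_func, ttl_minutes=30):
        self.fetch_func = fetch_func
        self.ttl = ttl_minutes * 60
        self.dorks = []
        self.fetched_at = 0

    def get(self):
        if not self.dorks or time.time() - self.fetched_at > self.ttl:
            self.dorks = self.fetch_func()
            self.fetched_at = time.time()
        return self.dorks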

@johnnykv
Member Author

I will give it a shot, should be pretty easy to implement.

I like the idea of being able to query hpfeeds data - actually I like it so much that I already made an API for just that :-) It would be a no-brainer to extend that to output dorks.

@glaslos
Member

glaslos commented Jan 18, 2013

Well that's very cool! We might skip the request via HPFeeds and go directly via HTTP to your API.

@glaslos
Member

glaslos commented Jan 18, 2013

What do you think about using your RAPI service as bootstrapping for Glastopf sensors? So instead of loading the same dorks from the same database for every Glastopf sensor, let them ask your service for 1k mixed dorks (or let them provide some parameters if they are interested in something specific), which they then use to create the first dork pages.
You could also provide an RAPI call that delivers a special configuration with an identifier for every sensor and an HPFeeds key, and then we can track what kind of data we get back from them.
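Such a bootstrap call could be as simple as the following sketch; the endpoint URL, the count parameter, and the JSON response format are all made up:

import json
import urllib2  # Python 2, matching the rest of the code base


def bootstrap_dorks(api_url="http://example.org/rapi/dorks", count=1000):
    """Ask a (hypothetical) dork service for a mixed batch of dorks to
    seed a new sensor's first dork pages."""
    response = urllib2.urlopen("{0}?count={1}".format(api_url, count))
    return json.load(response)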

@johnnykv
Member Author

Yeah, entirely doable. It would require some interaction with the hpfeeds auth system. I'll get started with a simple, unauthorized dork service for a tryout. OK, we're way off topic on this issue. Closing :)
