Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

refactor databases #1025

Merged
merged 25 commits into from Feb 13, 2018
Merged

refactor databases #1025

merged 25 commits into from Feb 13, 2018

Conversation

jackrobison
Copy link
Member

No description provided.

@jackrobison jackrobison force-pushed the database-refactor branch 3 times, most recently from 0df7d75 to 1defab6 Compare December 12, 2017 17:24
@lyoshenka
Copy link
Member

@jackrobison who needs to review this? i glanced at it briefly and i still think more columns should be not null

@jackrobison
Copy link
Member Author

@lyoshenka please check out fd0a7c3 , I rebased in some changes


create table stream_blob (
stream_hash char(96) not null,
blob_hash char(96),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

my note disappeared somehow. blob_hash should be not null. a stream_blob's purpose is to connect a stream and a blob. if there's no blob, there's no stream_blob

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jackrobison this still applies

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is to handle stream terminators, which don't have a blob hash but do have an iv, position, and length.

@kauffj
Copy link
Member

kauffj commented Dec 19, 2017

I think file_list should probably retain all of: outpoint, claim_id, channel_claim_id (doesn't exist as first class field currently and is instead currently buried 4 fields deep in an inconsistent way), and metadata (but this could presumably be what is currently file.metadata.stream.metadata).

I don't see how an application using LBRY can re-render to a user what they downloaded without these fields, or an efficient way to select them again.


log = logging.getLogger(__name__)


class DiskBlobManager(DHTHashSupplier):
def __init__(self, hash_announcer, blob_dir, db_dir):
def __init__(self, hash_announcer, blob_dir, storage):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is it possible to require a class or interface for arguments? i feel like storage should be an interface with functions like "addBlobs", "getBlobs", etc. Then we can implement multiple types of storage (e.g. sqlite for production, in-memory for testing) and it's clear what the contract is.

A simple way of doing this now without too much refactoring is to create an interface that has all the functions that are already in use. Then we can pare it down later.


create table stream_blob (
stream_hash char(96) not null,
blob_hash char(96),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jackrobison this still applies

Copy link
Member

@eukreign eukreign left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A bunch of my comments are structural but a lot of the other ones are code style, specifically where arguments go if they don't fit on the same line as the function call.

It would be nice to have a style guide for this.

Personally, I find putting arguments on a new line to be much more aesthetically pleasing and most of the time all of the arguments then fit on a single line, so it reads easier. With the "pan handle" approach (arguments line up with function call open parenthesis) you often end up having to overflow the arguments onto more lines, this issue compounds when the argument itself is long... prime case in this particular pull request is SQL statements which actually fit most of the time on a single line if you do the new-line-4-spaces-indent for the arguments.

So,

txid, nout = lbryfile_cursor.execute("select txid, n from lbry_file_metadata "
                                     "where lbry_file=?",
                                     (file_infos[stream_hash]['rowid'],)).fetchone()

vs.

txid, nout = lbryfile_cursor.execute(
    "select txid, n from lbry_file_metadata where lbry_file=?", (file_infos[stream_hash]['rowid'],)
).fetchone()

I think the second option above looks nicer and works well in more cases than the "pan handle" approach.

return d

d.addCallback(format_info)
return d
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could above be broken down into two functions to make it easier to test? Something like:

@defer.inlineCallbacks
def format_sd_info(stream_info, include_blobs):
    fields = {
        'stream_type': EncryptedFileStreamType,
        'stream_name': stream_info[1],
        'key': stream_info[0],
        'suggested_file_name': stream_info[2],
        'stream_hash': stream_hash
    }
    if include_blobs:
        blobs = yield storage.get_blobs_for_stream(stream_hash)
        formatted_blobs = []
        for blob_info in blobs:
            blob = {}
            if blob_info.length != 0:
                blob['blob_hash'] = str(blob_info.blob_hash)
            blob['blob_num'] = blob_info.blob_num
            blob['iv'] = str(blob_info.iv)
            blob['length'] = blob_info.length
            formatted_blobs.append(blob)
        fields['blobs'] = formatted_blobs
    defer.returnValue(fields)


@defer.inlineCallbacks
def get_sd_info(storage, stream_hash, include_blobs):
    stream_info = yield storage.get_stream_info(stream_hash)
    formatted_info = yield format_sd_info(stream_info, include_blobs)
    defer.returnValue(formatted_info)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@@ -75,7 +63,7 @@ def _immediate_announce(self, blob_hashes):
def blob_completed(self, blob, next_announce_time=None, should_announce=True):
if next_announce_time is None:
next_announce_time = self.get_next_announce_time()
yield self._add_completed_blob(blob.blob_hash, blob.length,
yield self.storage.add_completed_blob(blob.blob_hash, blob.length,
next_announce_time, should_announce)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess we don't have a style guide for this yet but I've always personally preferred the arguments-on-new-line-indented-4-spaces whenever there are too many arguments to a function, this seems more consistent (indentation wise with the rest of the soruce) and looks nicer (less jagged and you can fit more arguments on a single line vs. trying to line it up with parenthesis):

        yield self.storage.add_completed_blob(
            blob.blob_hash, blob.length, next_announce_time, should_announce
        )

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍


def set_should_announce(self, blob_hash, should_announce):
if blob_hash in self.blobs:
blob = self.blobs[blob_hash]
if blob.get_is_verified():
return self._set_should_announce(blob_hash,
return self.storage.set_should_announce(blob_hash,
self.get_next_announce_time(),
should_announce)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment as before about putting all arguments on a new line and indented 4 spaces.

return self.storage.set_should_announce(
    blob_hash, self.get_next_announce_time(), should_announce
)

@@ -300,7 +302,7 @@ def _setup_other_components(self):
else:
self.blob_manager = DiskBlobManager(self.hash_announcer,
self.blob_dir,
self.db_dir)
self.storage)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment as before about all arguments on single new line below function call indented 4 spaces.

d = self.storage.setup()
d.addCallback(lambda _: self.wallet.start())
d.addCallback(lambda _: self.blob_tracker.start())
return d
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change doesn't have negative performance side affect? Seems like before the manager setup and wallet start up happened in parallel and now they are serial. Just making sure I'm understanding this change correctly.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It shouldn't - setup for storage is just creating the sqlite tables.

def get_sd_blob_hash_for_stream(self, stream_hash):
d = self.db.runQuery("select sd_hash from stream where stream_hash=?", (stream_hash,))
d.addCallback(lambda r: None if not r else r[0][0])
return d
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use _get_one_or_none.

def get_stream_hash_for_sd_hash(self, sd_blob_hash):
d = self.db.runQuery("select stream_hash from stream where sd_hash = ?", (sd_blob_hash, ))
d.addCallback(lambda r: None if not r else r[0][0])
return d
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use _get_one_or_none.

def get_lbry_file_status(self, rowid):
d = self.db.runQuery("select status from file where rowid = ?", (rowid,))
d.addCallback(lambda r: (r[0][0] if len(r) else None))
return d
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use _get_one_or_none.

def get_rowid_for_stream_hash(self, stream_hash):
d = self.db.runQuery("select rowid from file where stream_hash=?", (stream_hash, ))
d.addCallback(lambda r: (r[0][0] if len(r) else None))
return d
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use _get_one_or_none.

blob['length'] = blob_info.length
formatted_blobs.append(blob)
sd_blob['blobs'] = formatted_blobs
return defer.succeed(sd_blob)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm having a bit of déjà vu here, is this function defined already somewhere else? lbrynet/core/StreamDescriptor.py They may do something slightly different but the function names are identical and they appear to be very similar internally.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

now the formatting is only in lbrynet/core/StreamDescriptor.py

@lbry-bot lbry-bot assigned jackrobison and unassigned jackrobison Jan 29, 2018
@jackrobison jackrobison force-pushed the database-refactor branch 6 times, most recently from 569956c to 706645a Compare February 6, 2018 03:00
Copy link
Member

@eukreign eukreign left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is such a massive pull request :-) Difficult to review in one sitting!

Anyways, I added a couple of minor new things but there are still some comments from the last time I reviewed that are valid.

@@ -290,7 +290,7 @@ spelling-store-unknown-words=no
[FORMAT]

# Maximum number of characters on a single line.
max-line-length=100
max-line-length=120
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've been thinking that maybe 120 is too much. In cases where you want to have two files side by side, even in PyCharm at slightly enlarged font (like mine), I can't comfortably fit two 120 lines side by side. 100 fits though... what do you think? Also, Python PEP8 suggests 100 as the max: https://www.python.org/dev/peps/pep-0008/#maximum-line-length

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like 120 but don't particularly prefer it to 100.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like 120 in general, but on the rare occasion that you want to have two files open at the same time, it helps a lot to have slightly more reasonable lines.

Actually, since you use the PyCharm git tools, I'd think you would prefer the shorter line length too since PyCharm shows diffs side by side.

out = yield EncryptedFileCreator.create_lbry_file(
session, manager, filename, handle, key, iv_generator())
out = yield EncryptedFileCreator.create_lbry_file(self.session, manager, filename, handle,
key, iv_generator())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like a reverse formatting operation back to the pan handle :-D

-remove CachedClaim and wallet storage classes
-remove lbrynet/lbry_file/StreamDescriptor.py
-remove stream info manager (DBEncryptedMetadataManager)
-split `add_lbry_file` into separate `add_published_file` and `add_downloaded_file` functions
-set the download path upon adding file to the db, use the source file path for publishes
-remove the lbry file manager-wide download directory, set for each file individually
-add claim `metadata`, `claim_name`, `claim_id`, `outpoint`, `txid`, `nout`, `channel_claim_id`, and `channel_name` attributes to EncryptedFileDownloader
@jackrobison jackrobison force-pushed the database-refactor branch 2 times, most recently from 969cc12 to 16fe885 Compare February 13, 2018 19:41
@jackrobison jackrobison merged commit 3da25b8 into master Feb 13, 2018
@lyoshenka lyoshenka deleted the database-refactor branch February 23, 2018 18:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants