Skip to content
This repository has been archived by the owner on Feb 8, 2018. It is now read-only.

Load up npm #4153

Merged
merged 73 commits into from Oct 26, 2016
Merged

Load up npm #4153

merged 73 commits into from Oct 26, 2016

Conversation

chadwhitacre
Copy link
Contributor

@chadwhitacre chadwhitacre commented Oct 22, 2016

Part of #4148.

We need to load npm into our database and keep it up to date, in order to enable the /on/npm/, /on/npm/foo/, and /search changes we want to make.

This was referenced Oct 22, 2016
@aandis
Copy link
Contributor

aandis commented Oct 22, 2016

This needs to have the work done so far in #4155 so I'm just gonna grab it from there.

@chadwhitacre
Copy link
Contributor Author

@aandis Yeah, and we can drop the /on/npm/ stuff from this. Probably best to rebase here so we have a cleaner history. Are you comfortable with rebasing? :-)

@aandis
Copy link
Contributor

aandis commented Oct 22, 2016

Rebased to remove /on/npm work. Previous head was 4d92a22

@aandis
Copy link
Contributor

aandis commented Oct 22, 2016

Rebased again on top of #4155 to bring in all work from there. Commits show on https://github.com/gratipay/gratipay.com/pull/4153/commits but not here. weird.

@aandis
Copy link
Contributor

aandis commented Oct 22, 2016

Actually let's scratch that rebase. I wanted to use Package.insert but we'll need a multi insert statement here.

@aandis
Copy link
Contributor

aandis commented Oct 22, 2016

@whit537 how do I run the chomp script?

@chadwhitacre
Copy link
Contributor Author

chadwhitacre commented Oct 22, 2016

how do I run the chomp script?

@aandis Once you pip install -e . (as under make env) you should have it on your virtualenv's PATH via an entry_point in setup.py. We should write tests for all the pieces that go into it, though. I tried to factor it so it's easy to swap out the HTTP request for all to test with a much smaller subset of the full JSON.

@aandis
Copy link
Contributor

aandis commented Oct 22, 2016

Got it. env/bin/chomp

@chadwhitacre
Copy link
Contributor Author

I wanted to use Package.insert but we'll need a multi insert statement here.

We may need to think differently about initial load and subsequent updates. For initial load we'll have a lot to do and it might be worth looking at COPY. Remember that for subsequent runs we'll need to update as well as insert. Sounds like we may not need to worry about delete (once a name is claimed, even if unpublished it gets a security package).

@chadwhitacre
Copy link
Contributor Author

I had figured we could pass -1 as the arg to chomp to mean "reprocess everything".

Fetching and processing readmes is going to add significant expense. It may make sense to decouple that and just load the database with all and then process readmes.

@aandis
Copy link
Contributor

aandis commented Oct 22, 2016

it might be worth looking at COPY

I thought COPY could only be used with files. looking into it.

@chadwhitacre
Copy link
Contributor Author

chadwhitacre commented Oct 22, 2016

So maybe we have these three?

  1. A script that takes all and spits out whatever COPY wants. We run that manually to initially populate the table (will we be able to use that same COPY method when adding packages from other package managers like RubyGems and PyPI? Or does that only work for populating new/empty tables?).
  2. A script that takes all and upserts any record with modified time newer than the last time the script was run (every 30 minutes?).
  3. A script that does select * from packages where readme is null order by mtime asc limit 1 and then hits the registry api and processes with markdown.marky. It runs continuously (either via cron, or with its own eternal loop).

I was thinking (1) and (2) could be variants of the same chomp script. Probably (3) is a separate script?

@aandis
Copy link
Contributor

aandis commented Oct 22, 2016

A script that takes all and spits out whatever COPY wants. We run that manually to initially populate the table

Manual work shouldn't be needed.

(will we be able to use that same COPY method when adding packages from other package managers like RubyGems and PyPI? Or does that only work for populating new/empty tables?).

COPY FROM copies data from a file to a table (appending the data to whatever is in the table already)

https://www.postgresql.org/docs/current/static/sql-copy.html

@chadwhitacre
Copy link
Contributor Author

Manual work shouldn't be needed.

Okay. I mostly meant that we can just run that script as a one-off rather than having to install it in a long-lived dyno/cronjob/whatever.

@aandis
Copy link
Contributor

aandis commented Oct 22, 2016

A script that takes all and upserts any record with modified time newer than the last time the script was run (every 30 minutes?).

Do we get modified_time for a package inside all?

@chadwhitacre
Copy link
Contributor Author

Do we get modified_time for a package inside all?

Yes!

@aandis
Copy link
Contributor

aandis commented Oct 22, 2016

Sounds good. I am gonna start working on 1. Might doze off in an hour. Will pick up tomorrow from where you leave it.

@chadwhitacre
Copy link
Contributor Author

Awesome, @aandis, sounds good. I have #4117 just about topped up. I'll jump in here in a bit ...

@aandis
Copy link
Contributor

aandis commented Oct 22, 2016

@whit537 built up a bit on 1. Should need to call copy with the data created to insert it. Gonna head to bed now. Will pick it up tomorrow. :)

( id bigserial PRIMARY KEY
, package_id bigint NOT NULL REFERENCES packages ON UPDATE CASCADE ON DELETE RESTRICT
, email text NOT NULL
, UNIQUE (package_id, email)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we want a UNIQUE here because a package can have multiple author/maintainer emails, no?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wouldn't the emails be different? This is just a way of preventing duplicate inserts of a email for a given package.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, you're right. I wasn't thinking straight. :)


CREATE TABLE packages
( id bigserial PRIMARY KEY
, package_manager_id bigint NOT NULL REFERENCES package_managers ON UPDATE CASCADE ON DELETE RESTRICT
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pretty sure we don't need the ON UPDATE ... ON DELETE since we're referencing an immutable field.


def render(markdown):
return Markup(m.html(
markdown,
extensions=m.EXT_AUTOLINK | m.EXT_STRIKETHROUGH | m.EXT_NO_INTRA_EMPHASIS,
render_flags=m.HTML_SKIP_HTML | m.HTML_TOC | m.HTML_SMARTYPANTS | m.HTML_SAFELINK
))

def marky(markdown):
return subprocess.call()
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should get rebased on master now that #4154 is in.

@chadwhitacre
Copy link
Contributor Author

chadwhitacre commented Oct 24, 2016

I'm gonna dive into this today on the plane.

@chadwhitacre
Copy link
Contributor Author

Here's a few commits. More later ...

@chadwhitacre
Copy link
Contributor Author

Just checking ...

screen shot 2016-10-26 at 8 59 40 am

@chadwhitacre
Copy link
Contributor Author

Yes! 💃

~ $ curl $URL | sync-npm serialize /dev/stdin | sync-npm upsert /dev/stdin
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   279  100   279    0     0   6578      0 --:--:-- --:--:-- --:--:--  6642
skipping None
skipping {u'description': '', u'package_manager': 'npm', u'emails': [], u'name': '_updated'}
processed 1 packages in   0 seconds
~ $

@chadwhitacre
Copy link
Contributor Author

Double yes! 💃 💃

tranquil-ocean-34844::DATABASE=> select * from packages;
┌────┬─────────────────┬─────────────────┬───────────────────────┬────────┬────────────┬─────────────┬────────
│ id │ package_manager │      name       │      description      │ readme │ readme_raw │ readme_type │        
├────┼─────────────────┼─────────────────┼───────────────────────┼────────┼────────────┼─────────────┼────────
│  2 │ npm             │ testing-package │ A package for testing │        │            │             │ {alice@
└────┴─────────────────┴─────────────────┴───────────────────────┴────────┴────────────┴─────────────┴────────
(1 row)

lines 1-7/7 (END) 

@chadwhitacre
Copy link
Contributor Author

Okay! Let's light this candle! 🕯️

This location is already in LD_LIBRARY_PATH.
@chadwhitacre
Copy link
Contributor Author

I've set up a 10-minute scheduler job on the test app. If that works then my plan is to:

  • switch to the real URL
  • merge and deploy this PR
  • schedule a daily run for now

@chadwhitacre
Copy link
Contributor Author

YEEEEEEEEEEEEEEEEEEE

screen shot 2016-10-26 at 10 20 07 am

@chadwhitacre
Copy link
Contributor Author

I'm going to land this as-is, and reticket loading up readmes as well as incrementalizing updates.

@chadwhitacre
Copy link
Contributor Author

Ready to go once Travis is green ...

@chadwhitacre
Copy link
Contributor Author

Okay! Here we go! ...

@chadwhitacre chadwhitacre merged commit c6a3bab into master Oct 26, 2016
@chadwhitacre chadwhitacre deleted the chomp branch October 26, 2016 14:33
@chadwhitacre
Copy link
Contributor Author

!m @aandis

!m *

@chadwhitacre
Copy link
Contributor Author

Coming back from #4148 (comment)

processed 371932 packages in 308 seconds
Traceback (most recent call last):
  File "/app/.heroku/python/bin/sync-npm", line 9, in <module>
    load_entry_point('gratipay', 'console_scripts', 'sync-npm')()
  File "/app/gratipay/package_managers/sync.py", line 137, in main
    globals()[args.command](args)
  File "/app/gratipay/package_managers/sync.py", line 101, in upsert
    'FROM STDIN WITH (FORMAT csv)', fp)
psycopg2.IntegrityError: null value in column "description" violates not-null constraint
DETAIL:  Failing row contains (2, npm, 0, null, , , , {zolikahan@gmail.com}).
CONTEXT:  COPY updates, line 1: "npm,0,,"{""zolikahan@gmail.com""}""

@chadwhitacre
Copy link
Contributor Author

More later!

@chadwhitacre
Copy link
Contributor Author

Local. 👍

gratipay-test=# select count(*) from packages;
┌────────┐
│ count  │
├────────┤
│ 372147 │
└────────┘
(1 row)

@chadwhitacre
Copy link
Contributor Author

Okay! Now what's up with the tests?

@chadwhitacre
Copy link
Contributor Author

Picking up in #4158 ...

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants