
Include timestamps in download data #2416

Open
purcell opened this issue Jan 17, 2015 · 10 comments
Comments

@purcell
Member

purcell commented Jan 17, 2015

I'd like to be able to show charts of download activity over time for individual packages, which would involve dumping per-package json which includes timestamps (or at least dates). Do we have all the old logfiles in order to produce this? The sqlite aggregation would need to be changed, of course.

@milkypostman
Member

We have it. There is no SQLite that I know of going on. I'd have to rewrite it to use SQLite but can do that easily. Cool idea. I'll work on the backend part.


@purcell
Member Author

purcell commented Jan 18, 2015

I was pretty sure the logs get slurped into sqlite, and then results queried from there get splooged into downloads.json.
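The pipeline described here (access logs slurped into SQLite, then aggregated counts dumped to downloads.json) can be sketched roughly as follows. This is purely illustrative: the log format, table name, and package-name parsing below are assumptions, not MELPA's actual code.

```python
# Hypothetical sketch of logs -> SQLite -> downloads.json.
# Log format, table name, and filename parsing are made up for illustration.
import json
import sqlite3

LOG_LINES = [
    "1.2.3.4 GET /packages/magit-2.1.0.tar",
    "5.6.7.8 GET /packages/magit-2.1.0.tar",
    "1.2.3.4 GET /packages/evil-1.0.8.tar",
]

def package_of(path):
    # "/packages/magit-2.1.0.tar" -> "magit" (naive split on the hyphen)
    name = path.rsplit("/", 1)[-1]
    return name.rsplit("-", 1)[0]

def aggregate(log_lines):
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE downloads (pkg TEXT, ip TEXT)")
    for line in log_lines:
        ip, _, path = line.split()
        conn.execute("INSERT INTO downloads VALUES (?, ?)", (package_of(path), ip))
    # Count distinct IPs per package, as downloads.json does per package
    return dict(conn.execute(
        "SELECT pkg, COUNT(DISTINCT ip) FROM downloads GROUP BY pkg"))

print(json.dumps(aggregate(LOG_LINES)))
```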

@milkypostman
Member

Did you make that change? 'Cause I didn't. We talked about it but then I fixed storing all dates from every IP. That fixed my naive memory-only version.


@purcell
Member Author

purcell commented Jan 18, 2015

Looks like you did it in October: 8acf077

LOL.

@milkypostman
Member

OMG MY MEMORY IS GOING!!!!! Arrrgh

Ok


@purcell
Member Author

purcell commented Jan 18, 2015

You must have been on a 3-day coding vision quest at the time, and now you have only patchy memories...

@milkypostman
Member

Yes, now I vaguely remember a conversation about how I probably didn't need to use SQLite but would just do it anyway 'cause it seemed like the better way to do it.


@purcell
Member Author

purcell commented Jan 18, 2015

And you were right!

@cpitclaudel
Contributor

Hi people,

Some thoughts about this:

  • The data doesn't currently seem to be available: the downloads database appears to only store packages and IP addresses, not download times (is that right?). If so, the schema would need to change, with an extra column added.
  • This would probably be separate from the existing download_counts file; the download data for every package would be too large to fetch in full upon visiting melpa.org (loading the website already sends every client about 1MB of data, though gzipped and thus reduced to about 200k). Given the design of the current website (no dynamic queries, IIUC), it could make sense to have one "download-stats" file per package. Each of these files could be a simple JSON dictionary of date → count.
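If the first point holds, the schema change could be a single ALTER TABLE. A minimal sketch, assuming the current table is something like pkg_ip with only (pkg, ip) columns (the real schema and names may differ); note that SQLite's ADD COLUMN disallows non-constant defaults such as CURRENT_TIMESTAMP, hence the NULL default:

```python
# Hypothetical migration: add a dl_time column to an assumed (pkg, ip) table.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pkg_ip (pkg TEXT NOT NULL, ip INT)")  # assumed current schema
# SQLite forbids CURRENT_TIMESTAMP as an ADD COLUMN default, so default to NULL;
# the insertion code would supply the timestamp for new rows.
conn.execute("ALTER TABLE pkg_ip ADD COLUMN dl_time DATETIME DEFAULT NULL")
cols = [row[1] for row in conn.execute("PRAGMA table_info(pkg_ip)")]
print(cols)  # ['pkg', 'ip', 'dl_time']
```

Existing rows would simply keep dl_time = NULL, which dovetails with allowing NULLs to import old data.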

Here is some code that shows a prototype of how this might work:

import collections
import datetime
import json
import random
import sqlite3
import string
import time

# TEXT rather than STRING: "STRING" gets NUMERIC affinity in SQLite.
CREATE_QUERY = "CREATE TABLE pkg_ip_time (pkg TEXT NOT NULL, ip INT, dl_time DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP)"
COUNT_QUERY = "SELECT pkg, DATE(dl_time), COUNT(ip) FROM pkg_ip_time GROUP BY pkg, DATE(dl_time)"

def main():
    connection = sqlite3.connect(':memory:')
    cursor = connection.cursor()

    print("0. Creating table")
    cursor.execute(CREATE_QUERY)

    print("1. Creating random records")
    # One single-letter "package" per ASCII letter; each record is a
    # (package, random IP, download time within the last 30 days) triple.
    packages = string.ascii_letters
    stats = [(random.choice(packages), random.randint(0, 2 ** 32),
              datetime.datetime.now() - datetime.timedelta(seconds=random.randint(0, 60*60*24*30)))
             for _ in range(100*1000)]

    print("2. Inserting created records")
    cursor.executemany("INSERT INTO pkg_ip_time VALUES (?, ?, ?)", stats)
    connection.commit()

    print("3. Generating statistics")
    start = time.time()
    # Pivot the per-day counts into {package: {date: downloads}} ...
    stats_per_package = collections.defaultdict(dict)
    for package, date, downloads in cursor.execute(COUNT_QUERY):
        stats_per_package[package][date] = downloads
    # ... and dump one small JSON file per package.
    for package, stats in stats_per_package.items():
        with open("melpa-{}-stats.json".format(package), mode="w") as stats_file:
            json.dump(stats, stats_file)
    print(time.time() - start)

if __name__ == "__main__":
    main()

I imagine the dl_time field would in fact allow NULLs, so as to permit importing the existing data. On the client side, I imagine we'd want to load an extra charting library and just feed it that data :)
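Allowing NULLs in dl_time means the aggregation should not silently drop legacy rows. One option (a sketch only, reusing the table name from the prototype above) is to bucket NULL dates under a sentinel key:

```python
# Sketch: bucket legacy rows (dl_time = NULL, imported from pre-timestamp
# data) under an "unknown" key so their counts are preserved.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pkg_ip_time (pkg TEXT NOT NULL, ip INT, dl_time DATETIME)")
conn.executemany("INSERT INTO pkg_ip_time VALUES (?, ?, ?)", [
    ("magit", 1, "2015-01-17 10:00:00"),
    ("magit", 2, None),  # legacy row with no timestamp
])
# DATE(NULL) is NULL, so COALESCE maps legacy rows to 'unknown'.
rows = conn.execute(
    "SELECT COALESCE(DATE(dl_time), 'unknown') AS day, COUNT(*) "
    "FROM pkg_ip_time WHERE pkg = ? GROUP BY day", ("magit",)).fetchall()
print(dict(rows))
```

The charting code could then either skip the "unknown" bucket or show it as a caption ("N downloads before date tracking began").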

Let me know if there are other things I can look into to help :)

@cpitclaudel
Contributor

Ping? I'd love to get such a feature.
