
Use non-lowercased project names #4

Closed
jayvdb opened this issue May 8, 2019 · 8 comments
@jayvdb

jayvdb commented May 8, 2019

All project names are lower-cased, not matching the name shown on pypi.org, e.g. pyyaml instead of PyYAML. I suspect that may be the data this project has, in which case the problem is upstream.

That lowercasing is not very helpful: the names of projects can (and do) change over time in all sorts of ways, not just the case.

Applying lowercasing can be done after the fact, since it is a simple transform, but it is not reversible without post-processing all entries, as suggested in the follow-up comments on #1.

My use case is that I need to match the list up with openSUSE package names, which must use the PyPI package name exactly, including casing and hyphen-vs-dash. The task is slightly more difficult and slower if I don't have the exact name to begin with.

If it can't be obtained from the source data, it is likely quicker for me to add post-processing to get the real name, rather than try to get exact results from case-insensitive openSUSE package searches.
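To illustrate the irreversibility: the PEP 503 normalization used across PyPI tooling is many-to-one, so once applied, the original display name cannot be recovered. A minimal sketch (example names are just illustrative):

```python
import re

def normalize(name):
    """PEP 503 normalization: lowercase, collapse runs of -, _, . to a single -."""
    return re.sub(r"[-_.]+", "-", name).lower()

# Many distinct display names collapse to the same normalized form,
# so the original casing and punctuation cannot be recovered afterwards:
assert normalize("PyYAML") == normalize("pyyaml") == "pyyaml"
assert normalize("zope.interface") == normalize("Zope_Interface") == "zope-interface"
```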

@hugovk
Owner

hugovk commented May 9, 2019

This repo doesn't alter the names, it dumps the result from pypinfo:

/usr/local/bin/pypinfo --json --indent 0 --limit 5000 --days 30 "" project > top-pypi-packages-30-days.json

Having a quick look in pypinfo, it's not changing the name of projects received from the Google BigQuery client.


pypinfo does have this:

import re

def normalize(name):
    """https://www.python.org/dev/peps/pep-0503/#normalized-names"""
    return re.sub(r'[-_.]+', '-', name).lower()

But that's only used for normalising the input when wanting info about a single project, and is blank in this case.

https://www.python.org/dev/peps/pep-0503/#normalized-names says:

This PEP references the concept of a "normalized" project name. As per PEP 426 the only valid characters in a name are the ASCII alphabet, ASCII numbers, ., -, and _. The name should be lowercased with all runs of the characters ., -, or _ replaced with a single - character. This can be implemented in Python with the re module:

(And then gives the same function.)


I didn't check whether Google BigQuery can also return the un-normalised name; if so, that'd need a change to pypinfo before being added here.

If that's not possible or easy, then I'd be fine adding extra data here. Rather than post-processing, I think a second JSON file would be better.


Or are the openSUSE package names identical to the PyPI names (e.g. PyYAML)?

If so, can you normalise PyYAML into pyyaml and then use the data here?

@jayvdb
Author

jayvdb commented May 9, 2019

Or are the openSUSE package names identical to the PyPI names (eg. PyYAML)?

Yes, with a python- prefix.

https://build.opensuse.org/package/show/openSUSE:Factory/python-PyYAML

I would prefer to be using this data first and looking up against openSUSE, rather than the other way around, or building a database of both and cross-referencing.

I'll see what is happening inside pypinfo.
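For reference, a rough sketch of the two matching strategies, assuming a hypothetical list of openSUSE package names (`opensuse_pkgs`) and the top-projects names from this dataset: with exact PyPI names the lookup is a direct prefix match, whereas with only normalized names both sides must be normalized first.

```python
import re

# Hypothetical inputs: top_names from this dataset, opensuse_pkgs from openSUSE:Factory.
def match_exact(top_names, opensuse_pkgs):
    """Direct lookup, possible only if the exact PyPI names are available."""
    pkgs = set(opensuse_pkgs)
    return [n for n in top_names if "python-" + n in pkgs]

def match_normalized(top_names_normalized, opensuse_pkgs):
    """Fallback when only normalized names exist: normalize the openSUSE side too."""
    norm = lambda s: re.sub(r"[-_.]+", "-", s).lower()
    by_norm = {norm(p[len("python-"):]): p
               for p in opensuse_pkgs if p.startswith("python-")}
    return [by_norm[n] for n in top_names_normalized if n in by_norm]

assert match_exact(["PyYAML"], ["python-PyYAML"]) == ["PyYAML"]
assert match_normalized(["pyyaml"], ["python-PyYAML"]) == ["python-PyYAML"]
```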

@jayvdb
Author

jayvdb commented May 10, 2019

The schema is at https://bigquery.cloud.google.com/table/the-psf:pypi.downloads20161022?tab=schema , and both url and file.filename have the proper project name; I have got them working with ad hoc queries. So now I just need to propose a PR to pypinfo to use the filename, possibly even using https://bigquery.cloud.google.com/table/the-psf:pypi.simple_requests instead. It might be slightly slower, depending on whether BigQuery supports some more advanced SQL join syntax.
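For wheels at least, the name component can be recovered from file.filename with a simple split. A sketch, assuming PEP 427 wheel naming (components are '-'-separated, and runs of non-alphanumeric characters inside the name are escaped to a single '_', so casing survives but hyphens and dots do not):

```python
# Sketch: recover the project name component from a PyPI wheel file name.
# Assumes PEP 427 naming; note the result preserves case (PyYAML) but
# a name like "typing-extensions" would come back as "typing_extensions".
def name_from_wheel_filename(filename):
    return filename.split("-", 1)[0]

assert name_from_wheel_filename(
    "PyYAML-5.1-cp37-cp37m-manylinux1_x86_64.whl") == "PyYAML"
```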

@hugovk
Owner

hugovk commented May 10, 2019

Sounds good! One concern is the amount of BigQuery quota used: we need to ensure two requests can still be made each week within the free quota. Hopefully it won't increase the amount used too much, but it would be nice to see the difference.

pypinfo reports how big each query is; you can see it in the JSON here.

@hugovk
Owner

hugovk commented May 10, 2019

Good list! (I need to make a list of things using this data, too.)

Of those, https://github.com/psincraian/pepy and https://github.com/crflynn/pypistats.org are websites which essentially cache BigQuery data.

The latter is especially good and has an API, for which I've written a CLI client:

https://pypistats.org/api/
https://github.com/hugovk/pypistats

The data is limited to 6 months, and neither pepy nor pypistats.org has this specific mapping we're talking about. But maybe they could?

@jayvdb
Author

jayvdb commented May 10, 2019

One concern is the amount of BigQuery quota used, to ensure two requests can be made each week with the free quota.

It shouldn't be extra queries, just slightly slower queries, assuming the SQL engine is halfway decent.

Based on your recommendation, I've created issues in both of those projects to see which, if any, have an interest.

You'll be interested to learn that pepy is growing an API psincraian/pepy@b3cf4ee

@jayvdb
Author

jayvdb commented May 13, 2019

Now that I have the SQL changes needed (see queries at psincraian/pepy#128 (comment)), I've also created an issue at ofek/pypinfo#73 before doing the change there.
