Non-normalised package name #128
I would really like to be able to search with both the original name and the normalized one. I checked, and in BigQuery there isn't any column that holds only the original project name; as you mentioned in the other issue, it can be extracted from the url or filename. Do you have a query to see the output and to test it manually? Maybe you could also normalize the original names that are in setup.py, I think that would work.

---
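For reference, the normalisation PyPI applies to project names follows PEP 503, and recovering the original name from an sdist filename can be sketched as below. These are hypothetical helpers, not code from pepy: wheel filenames escape the name differently and would need separate handling.

```python
import re

def normalize(name: str) -> str:
    """PEP 503 name normalisation, which is what the BigQuery
    `file.project` column appears to contain."""
    return re.sub(r"[-_.]+", "-", name).lower()

def name_from_sdist_filename(filename: str) -> str:
    """Best-effort recovery of the as-published name from an sdist
    filename like 'PyYAML-5.1.tar.gz' (illustrative only)."""
    stem = re.sub(r"\.(tar\.gz|tar\.bz2|zip)$", "", filename)
    return stem.rsplit("-", 1)[0]

print(normalize("PyYAML"))                            # pyyaml
print(name_from_sdist_filename("PyYAML-5.1.tar.gz"))  # PyYAML
```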
I've read up on bq, and it supports the SQL features I need to achieve this in a single query. I'm on mobile atm, so will test the theory tomorrow. However, your project may benefit from a simpler approach, which will be slightly more processing. Instead of grouping by project name, group by project name and filename. The counts will be downloads of each file. Then select the project name and filename and store both in your schema. Then add up the counts for each project in your batch processing to get the totals as before. Now you can also show version stats, or wheel or sdist stats, if those interest you.

---
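The batch-processing step described above (per-file counts rolled up into project totals) could be sketched in Python like this, with made-up example rows standing in for the query output:

```python
from collections import defaultdict

# Made-up rows standing in for the (project, filename, count) output
# of the per-file GROUP BY query; real counts would come from BigQuery.
rows = [
    ("pyyaml", "PyYAML-5.1.tar.gz", 120),
    ("pyyaml", "PyYAML-5.1-cp37-cp37m-manylinux1_x86_64.whl", 300),
]

# Store per-file counts in the schema, then roll them up per project
# in the batch job to recover the old per-project totals.
totals = defaultdict(int)
for project, filename, count in rows:
    totals[project] += count

print(totals["pyyaml"])  # 420
```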
The minimal query change for this project would be something like

```sql
#standardSQL
SELECT substr(max(file.filename),1,LENGTH(file.project)) as unnormalised_name, count(*) AS downloads, file.project as normalised_name
FROM `the-psf.pypi.downloads20190505`
WHERE file.project = 'pyyaml'
GROUP BY file.project
```

That query assumes that the 'max' filename is the best filename, but really it is mostly relying on the 'fact' that all filenames per project have the same prefix (which isn't quite correct - often there are packages which used …).

Another approach, which seems to 'cost' more time/money/etc, is collecting all filenames into a cell in each row.

```sql
#standardSQL
SELECT file_project_name as project_name, sum(file_download_count) AS downloads, ARRAY_AGG(filename) as filenames
FROM
(
  SELECT file.project as file_project_name, file.filename as filename, count(*) AS file_download_count
  FROM `the-psf.pypi.downloads20190505`
  WHERE file.project = 'pyyaml'
  GROUP BY file.project, file.filename
) as file_downloads
GROUP BY file_project_name
```

Then the application server can at least run diagnostics on all filenames per project to ensure they are all appropriate, and can do some basic analytics on the types of files being downloaded, but it lacks the ability to understand the importance of any deductions as it doesn't have the download counts per file.

Now that I see the results, I am even more convinced that tools like pepy really shouldn't be letting BigQuery aggregate by project name, but should instead gather per-file stats and then use app logic to aggregate the file stats smartly. e.g. here are the results for pyyaml on 20190505:

```sql
#standardSQL
SELECT file.project as file_project_name, file.filename as filename, count(*) AS file_download_count
FROM `the-psf.pypi.downloads20190505`
WHERE file.project = 'pyyaml'
GROUP BY file.project, file.filename
ORDER BY file_download_count desc
```

There are lots of oddities in that dataset, and it is worth collecting the raw data and finding ways to show it. But that is a bigger schema change, and requires a migration to handle backwards compatibility for any retained historical data which isn't available to be re-fetched.

---
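The kind of app-side diagnostics suggested above — checking that all filenames for a project share one un-normalised prefix, and classifying wheel vs sdist downloads — might look like this (illustrative helpers, not pepy's actual code):

```python
from collections import Counter

def classify(filename: str) -> str:
    """Rough file-type bucketing by extension (not exhaustive)."""
    if filename.endswith(".whl"):
        return "wheel"
    if filename.endswith((".tar.gz", ".zip", ".tar.bz2")):
        return "sdist"
    return "other"

def distinct_prefixes(filenames, project):
    """The leading name-sized slice of each filename; more than one
    distinct value suggests the project's files don't all share a
    single un-normalised name."""
    return {f[:len(project)] for f in filenames}

files = [
    "PyYAML-5.1.tar.gz",
    "PyYAML-5.1-cp37-cp37m-manylinux1_x86_64.whl",
]
print(distinct_prefixes(files, "pyyaml"))   # {'PyYAML'}
print(Counter(classify(f) for f in files))
```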
Hi @jayvdb, thanks for the clarification and the details. I really like the idea of having the downloads grouped by filename and then showing more relevant stats to the user. Regarding the historical data, I am not worried; we can delete it. All this data is in BigQuery and I only need the downloads of the last 30 days. The lean approach is to have the project name in pepy as another column; this should be straightforward to implement and to show the results. If you want to have it quickly you can create a pull request 😃 I will open another issue to discuss having the download stats per version 😉

---
I would like to use the data to correlate with openSUSE package names, which use the 'real' name supplied in setup.py, i.e. not normalised.
I've been doing a bit of research at hugovk/top-pypi-packages#4, and the raw data from bigquery can include this, with a very small perf hit, or it can be added afterwards by doing lookups against PyPI directly, which would be a significant extra effort to build and more work for the batch jobs.
I'm putting in the effort - I want to know where I should put the effort, here or somewhere else. If here, would you prefer it coming from BigQuery, or added afterwards from PyPI? The latter makes sense if there are other details from PyPI that you believe are needed in this project which can't be obtained from the BigQuery dataset.
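If the name were added afterwards from PyPI, the lookup would hit the JSON API at `https://pypi.org/pypi/<project>/json` (e.g. via `urllib.request.urlopen`) and read the `info.name` field, which holds the as-published name. A minimal offline sketch of just the parsing step, using a trimmed illustrative payload in that shape:

```python
import json

# Trimmed, illustrative payload in the shape returned by
# https://pypi.org/pypi/<project>/json; only the field needed here is kept.
payload = json.loads('{"info": {"name": "PyYAML"}}')

# "info" -> "name" holds the name as published by the project,
# i.e. the non-normalised form.
original = payload["info"]["name"]
print(original)  # PyYAML
```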