Try to filter out mirror and bot downloads by default #4

Closed
runfalk opened this issue Jun 15, 2017 · 14 comments

Comments

@runfalk

runfalk commented Jun 15, 2017

Hi,

I tried your project and think it's really cool. I maintain a small project with few downloads. When I queried the dataset for the last year it said 20k downloads. This most likely includes spiders and PyPI mirrors. For this to be useful for smaller projects it would be great to try and filter these things out by default.

I for instance have downloads from Python version 1.17, which does not seem right. The majority of my downloads come from the system None (which I assume happens when downloaded over HTTP by clicking the link on the website). Maybe a baseline could be computed based on the minimum number of downloads for any project during the same time interval?

@ofek
Owner

ofek commented Jun 15, 2017

@runfalk Thanks for your interest! I don't think mirrors are considered; see https://pypi.python.org/mirrors. As for incorrect or unknown data, I think that might come from outdated versions of pip.

@dstufft Would you mind weighing in?

@dstufft

dstufft commented Jun 15, 2017

It can be mirroring clients, or it can be OS level clients, or it can be outdated versions of pip. details.installer.name will tell you what we think it is, given what information is available (bandersnatch being the most common mirroring tool).
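
For anyone who wants to group or filter on that field directly, here is a minimal sketch of how such a query could be assembled. It assumes the public `bigquery-public-data.pypi.file_downloads` table and its `file.project`, `timestamp` and `details.installer.name` columns, which may differ from the table pypinfo queried when this thread was written. `build_installer_query` is a hypothetical helper, not a pypinfo function, and it only builds the SQL text without contacting BigQuery.

```python
def build_installer_query(exclude_mirrors=False):
    """Return BigQuery standard SQL that counts downloads per installer.

    Hypothetical helper. The query expects the named parameters @project
    (STRING), @days (INT64) and, when exclude_mirrors is True,
    @excluded (ARRAY<STRING>) to be supplied by the caller.
    """
    conditions = [
        "file.project = @project",
        "timestamp >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL @days DAY)",
    ]
    if exclude_mirrors:
        # Keep rows with an unknown (NULL) installer: `NULL NOT IN (...)`
        # evaluates to NULL, which WHERE would otherwise drop.
        conditions.append(
            "(details.installer.name IS NULL"
            " OR details.installer.name NOT IN UNNEST(@excluded))"
        )
    where = "\n  AND ".join(conditions)
    return (
        "SELECT details.installer.name AS installer_name,\n"
        "       COUNT(*) AS download_count\n"
        "FROM `bigquery-public-data.pypi.file_downloads`\n"
        f"WHERE {where}\n"
        "GROUP BY installer_name\n"
        "ORDER BY download_count DESC"
    )


print(build_installer_query(exclude_mirrors=True))
```

The explicit `IS NULL` branch matters because many rows in the tables in this thread have no detected installer, and a plain `NOT IN` would silently discard them along with the mirrors.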

@runfalk
Author

runfalk commented Jun 15, 2017

Okay, so here's the case for my library:

$ pypinfo -sd -365 Spans installer
installer_name download_count 
-------------- -------------- 
bandersnatch   18134          
pip            1908           
requests       369            
Browser        241            
None           66             
z3c.pypimirror 63             
OS             16             
devpi          12             
setuptools     8              
conda          4

At least bandersnatch and z3c.pypimirror are mirrors here. Looking at Flask, there do not seem to be many other obvious ones:

$ pypinfo -sd -365 Flask installer
installer_name download_count 
-------------- -------------- 
pip            19318831       
setuptools     871291         
None           195617         
Browser        39987          
bandersnatch   31654          
pex            24459          
devpi          18639          
distribute     17042          
requests       15057          
Artifactory    3887           
OS             2769           
Homebrew       265            
conda          196            
z3c.pypimirror 135            
pep381client   26

@ofek
Owner

ofek commented Jun 15, 2017

You can also use --json to manipulate the data as you wish.

@runfalk
Author

runfalk commented Jun 16, 2017

Oh, I know I can manipulate it. I just figured that it could be a good thing to maintain a list of known mirrors to avoid them affecting the result. When I tried it for my library I got a false sense of "Wow! That many are using this?". It turned out that only ~2000 of the 20000 were possibly real downloads.

There could be an option --include-mirrors to bring them back. Would you be interested in a PR for this?

@dstufft Is there any more information on this data, like how far back it goes? Is it 2016-01-22?

@ofek
Owner

ofek commented Jun 16, 2017

Is a download from a mirror not 'real'?

@runfalk
Author

runfalk commented Jun 17, 2017

It's real in the sense that it's a download. However, it doesn't provide any insight into your user demographics. Knowing that my package was mirrored 18000 times in a year doesn't mean my package was even tried out by anyone. It's most likely someone making a full mirror of PyPI.

I can't think of a use case where mirror downloads provide useful information about a project.

@rahiel
Contributor

rahiel commented Jun 24, 2017

Filtering out mirror/bot downloads shouldn't be done by default. Pypinfo is a tool to get download statistics from BigQuery. It shouldn't judge the data by filtering things out itself; that should be left to the user.

The filtering is still useful if you want to know how many actual user downloads a package gets, so an optional flag makes sense. I'm filtering out these clients:

non_user_installers = [
    "bandersnatch",
    "pep381client",
    "z3c.pypimirror"
]
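
As a rough sketch of what such a filter could look like, the snippet below applies that list to the Spans installer counts posted earlier in this thread. The function name `user_downloads` is made up for illustration and is not part of pypinfo.

```python
# Installer counts copied from the Spans table earlier in this thread.
spans_installers = {
    "bandersnatch": 18134,
    "pip": 1908,
    "requests": 369,
    "Browser": 241,
    "None": 66,
    "z3c.pypimirror": 63,
    "OS": 16,
    "devpi": 12,
    "setuptools": 8,
    "conda": 4,
}

non_user_installers = ["bandersnatch", "pep381client", "z3c.pypimirror"]


def user_downloads(counts, excluded=non_user_installers):
    """Sum download counts, skipping installers listed in `excluded`.

    Hypothetical helper for illustration; not part of pypinfo.
    """
    excluded = set(excluded)
    return sum(n for name, n in counts.items() if name not in excluded)


total = sum(spans_installers.values())
filtered = user_downloads(spans_installers)
print(f"all installers: {total}")      # 20821
print(f"without mirrors: {filtered}")  # 2624
```

Note that the filtered figure still includes clients such as requests, Browser and devpi, so it is an upper bound on user-driven installs rather than an exact count.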

@ofek
Owner

ofek commented Jun 24, 2017

If one does pip install ... is there any chance the installer would show up as something other than pip?

@rahiel
Contributor

rahiel commented Jun 24, 2017

Pip properly sets its user agent, and I guess PyPI parses it. So I think no.

@ofek
Owner

ofek commented Jun 24, 2017

@rahiel Ok ty.

If anyone wants to submit a PR I'll gladly accept a --pip pip-only flag or a --real flag that ignores certain installers.
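
To make the two proposals concrete, here is one way the flags could behave. This is only a sketch using argparse: the flag names come from the comment above, while the excluded-installer list, the choice to make the flags mutually exclusive, and the names `make_parser` and `keep_installer` are assumptions for illustration. It does not describe how pypinfo implements anything.

```python
import argparse

# Mirroring clients named in this thread; an assumed default for --real.
MIRROR_INSTALLERS = {"bandersnatch", "pep381client", "z3c.pypimirror"}


def make_parser():
    """Build a hypothetical parser exposing the two proposed flags."""
    parser = argparse.ArgumentParser(prog="pypinfo-sketch")
    group = parser.add_mutually_exclusive_group()
    group.add_argument("--pip", action="store_true",
                       help="count only downloads made by pip")
    group.add_argument("--real", action="store_true",
                       help="ignore known mirroring clients")
    return parser


def keep_installer(name, args):
    """Decide whether downloads from installer `name` should be counted."""
    if args.pip:
        return name == "pip"
    if args.real:
        return name not in MIRROR_INSTALLERS
    return True  # default: report everything, unfiltered


args = make_parser().parse_args(["--real"])
print(keep_installer("bandersnatch", args))  # False
print(keep_installer("pip", args))           # True
```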

@dstufft

dstufft commented Jun 24, 2017

If you're using a sufficiently old pip yea I think so, like.. <1.4 I think? Whatever it is, it's several years old by now.

@hugovk
Collaborator

hugovk commented Oct 16, 2017

Here's a PR for a --pip flag: #15.

@ofek
Owner

ofek commented Oct 16, 2017

@hugovk Thank you!

ofek closed this as completed Oct 16, 2017
suv27 pushed a commit to suv27/pypinfo that referenced this issue Oct 5, 2019
fix ofek#1, refactor class to comply with PEP standards