Shorten multiple URLs at once #195
What does the API for this look like? Why is it useful? |
For the former I was thinking either supporting something like how ix.io supports multiple files, or having the URLs encoded in some way that doesn't use a specific delimiter and then splitting on that. Mostly I want support for it for my script that renders HTML emails to plain text and then shortens URLs longer than what is reasonable for CLI use, because doing them one by one can be quite slow for larger emails. On the other hand, I could potentially rewrite it and have the shortener run async.
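Roughly, the one-by-one version amounts to this (a sketch; assuming pb's single-URL shortener at `POST /u` with a `c` form field):

```sh
# serial: one full HTTP round trip per URL
grep -oE 'https?://[^[:space:]]+' message.txt | while read -r url; do
  curl -s -F "c=$url" https://ptpb.pw/u
done
```
|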
Absolutely; this makes sense. Because the current API response isn't a collection/list, it seems inconsistent/bad to conditionally return a collection (robust callers would then need to implement the inverse logic). Could this be a different endpoint instead? Something like:
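A sketch of the shape I mean (the `/mu` path is hypothetical):

```sh
# hypothetical batch endpoint: repeat the c field once per URL
curl -F c=https://example.com/first/long/url \
     -F c=https://example.com/second/long/url \
     https://ptpb.pw/mu
```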
I've also been tossing around the idea of versioning the API, and/or making a completely separate DebtFree™ service. Usage would be something like:
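(Also entirely hypothetical:)

```sh
# hypothetical: a versioned batch route on the existing host...
curl -F c=https://example.com/one -F c=https://example.com/two https://ptpb.pw/v2/u
# ...or the same thing on a dedicated shortener service
curl -F c=https://example.com/one -F c=https://example.com/two https://shortener.example/u
```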
The response might look like:

```
[
  {
    # paste attributes
  },
  {
    # paste attributes
  }
]
```

Does that look like what you expected? I suppose one potential problem might be matching up the response items with the request--ix.io for example is not only inconsistent about re-ordering the response, but also has a very buggy multipart parser: https://ptpb.pw/bQrV (see the trailing truncated multipart boundaries included in the paste responses, and the inconsistent re-ordering with
Is this code on GitHub? I wouldn't be surprised if multiple parallel requests ended up being faster anyway, and helping you with this is definitely going to be easier than trying to make pb not suck.
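Something like this, untested, would already parallelize against the current single-URL endpoint (same `/u`+`c` assumption as above; `-P8` caps concurrency):

```sh
# fire off up to 8 shorten requests at a time
grep -oE 'https?://[^[:space:]]+' message.txt \
  | xargs -P8 -I{} curl -s -F 'c={}' https://ptpb.pw/u
```
|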
Hm, that might make sense. Or something like a single request carrying several named form fields, and then getting back a mapping that uses the ids passed to -F as the keys (sketched below). The keys could also potentially be arbitrary strings, I guess, hm.
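A sketch with made-up ids:

```sh
# hypothetical: caller-chosen ids as the form field names
curl -F first=https://example.com/a/very/long/url \
     -F second=https://example.com/another/long/url \
     https://ptpb.pw/u
# hypothetical response, keyed by those ids:
# {"first": "https://ptpb.pw/AAAA", "second": "https://ptpb.pw/BBBB"}
```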
It's essentially a slightly modified version of the tiny.pl script available here, but I'm planning on rewriting it from scratch anyway; I've just been too lazy to do so as of yet. |
`($line =~ /(\w+:\/\/\S+)/)` -- https://regex101.com/r/DwZzBf/1

Is that the behavior you actually want? I feel like limiting replacement to just the over-long URLs makes more sense. So it looks like the expected behavior is: replace each matching URL in-place, leaving the rest of the line untouched.
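Note that `\S+` also swallows trailing punctuation; GNU grep's `-P` mode uses the same character classes:

```sh
printf 'see (https://example.com/foo), ok?\n' | grep -oP '\w+://\S+'
# prints: https://example.com/foo),
```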
Doing multiple concurrent replacements means we can't just write each line out the moment we read it. I think something slightly more memory-optimized than buffering the entire message is doable; the simplest way to handle the inner loop with concurrency might look like:
```python
import asyncio
import collections
import sys

out_queue = collections.deque()

async def pump_queue(wait=False):
    # Drain the queue front-to-back, so output order matches input order.
    while True:
        try:
            line = out_queue.popleft()
        except IndexError:
            break
        if isinstance(line, str):
            # plain line; ready immediately
            yield line
        elif line.done():
            # a replacement future that already finished
            yield line.result()
        elif wait:
            # final drain: block until the replacement completes
            yield await line
        else:
            # head of the queue isn't ready yet; put it back and stop
            out_queue.appendleft(line)
            break

async def shorten(file):
    for line in file:
        if the_line_has_spooky_urls(line):
            # schedule the replacement; the queue holds a future, not a str
            line = asyncio.ensure_future(function_that_fixes_the_line_in_future(line))
        out_queue.append(line)
        # write lines that are ready to go now; while maintaining line ordering
        async for ready in pump_queue():
            sys.stdout.write(ready)
    # write remaining lines; waiting for each line to become available
    async for ready in pump_queue(wait=True):
        sys.stdout.write(ready)
```

In the best case (pb is infinitely fast, and the email is infinitely long--or, the email has no URLs), memory utilization should be™ constant. In the probable case, requests will overlap, and memory utilization will fluctuate depending on the size of the text between the queued URLs. In the worst case, memory utilization will be: |
That's pretty fucking hilarious, because, according to perldoc:

So the child process must exit before the result of that expression can be evaluated, and you can't read from broken pipes, so |
I benchmarked a hacked-up version of this; the currently-unit-tested parts of the code are on github. This currently just includes a regex stream-parser context thing.

To compare performance, I trivially modified tiny.pl to replace URLs in the same way microscopic does (gist), then ran them both over a random spam email from my old maildir. The raw benchmark data: https://gist.github.com/buhman/009179b09d996b55fc1565bd9bac2710; below I discard the outlier 5-second result.

Results (lower is better):
Predictions/trends: tiny.pl's execution time is very obviously ~linear in the number of URLs in the input. If I did a super fancy multiple-line graph (do it?) of time vs. number of URLs, microscopic's performance would, for a sufficiently large number of URLs, also approach linear, because, in contrast to microscopic, pb's concurrency capabilities are non-infinite. This is probably fine for real emails that don't contain >thousands of URLs, though.

Additional optimizations not represented here:
|
I should have expected you to go slightly overboard with this, shouldn't I, heh.
Pretty much. I've just used the current regex as-is because it's been "good enough" in practice, and I've been too lazy to actually replace it as long as it mostly works. |
I would describe this more as "inspiration" instead. I present a pre-release version of microscopic. Usage:

```sh
pip install microscopic==0.1.1.dev3
curl -L https://ptpb.pw/gSpk > ./example-pipeline.yaml
# copy arbitrary url-tastic HTML to ./message.html
microscopic ./example-pipeline.yaml
```

Currently, the only supported pipeline is the one in that example. As suggested in the name, the pipeline is assembled using arbitrary combinations of components registered via pkg_resources; hopefully it should be obvious how the mapping is supposed to work by reading the example. Let me know what you think. |
That does look rather nice indeed. |