This repository has been archived by the owner on Jan 31, 2021. It is now read-only.

Suggestion: Comment limiting + Don't discard on fail. #35

Open · HT-7 opened this issue Nov 14, 2019 · 12 comments
@HT-7 commented Nov 14, 2019

Because videos with more than 20,000 comments tend to overwhelm the online comment scraper and cause it to fail, I have two suggestions:

  • The ability to optionally limit the number of captured comments to _____.
  • If the comment capture fails, let the user download the comments scraped so far instead of discarding them.

Don't forget: the CSV and JSON files should record how many comments the video has in total.

This lets the user see the total number of comments on a video even from a comment file that does not contain every comment (due to manual limiting or a failed capture).

More metadata (a rough sketch of what such fields might look like follows below):

  • Has the comment ever been edited?
  • Comment rank (the top comment is rank 1). Same for comment replies.
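
A rough sketch of what one exported record could look like if these fields were added. Every field name and value here is hypothetical, purely for illustration; none of this exists in the scraper's current output:

// hypothetical record shape -- field names and values are illustrative only
const exampleRecord = {
  videoTotalComments: 20000, // total comments on the video, even if fewer were captured
  text: '...',
  edited: false,             // has this comment ever been edited?
  rank: 1                    // top comment is rank 1; replies would carry a reply rank the same way
};
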
@spiralofhope commented Nov 15, 2019

Perhaps dump the comments every x comments by default, and have a parameter to customize that.

For example, generate:

comment-foo-1.csv
comment-foo-2.csv
...

This could also trivially be tied to a simple pause mechanic between dumps to throttle scraper use.
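
A minimal sketch of that chunk-and-pause idea, assuming a scraper that hands comments to a callback one at a time. The chunk size, pause length, file naming, and the toCsv helper are all illustrative, not part of youtube-comment-scraper:

// chunked-dump sketch (illustrative only)
const fs = require('fs');

const CHUNK_SIZE = 1000; // "every x comments"; would be a configurable parameter
const PAUSE_MS = 5000;   // simple pause between dumps to throttle scraper use

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));
// naive CSV (quoting via JSON.stringify); a real implementation would use a proper CSV writer
const toCsv = (rows) =>
  rows.map((c) => [c.author, c.text].map((v) => JSON.stringify(v || '')).join(',')).join('\n');

let buffer = [];
let fileIndex = 1;

// call once per scraped comment
async function onComment(comment) {
  buffer.push(comment);
  if (buffer.length >= CHUNK_SIZE) {
    fs.writeFileSync(`comment-foo-${fileIndex}.csv`, toCsv(buffer));
    fileIndex += 1;
    buffer = [];
    await sleep(PAUSE_MS);
  }
}

// on exit or failure, flush whatever is buffered so partial results are not discarded
function flush() {
  if (buffer.length > 0) {
    fs.writeFileSync(`comment-foo-${fileIndex}.csv`, toCsv(buffer));
  }
}

The final flush would also cover the second suggestion above: if the capture fails partway through, the already-scraped comments are written out rather than discarded.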


Regarding more metadata, that would be best mentioned in its own ticket, but it occurs to me that it would be nigh incredible if it were made easy to scrape multiple times and essentially generate an "edit history". I suppose that's best left for another tool though.

@ftomasin

Is there a way to limit the number of captured comments yet?

> Perhaps dump the comments every x comments by default, and have a parameter to customize that.
> This could also trivially be tied to a simple pause mechanic between dumps to throttle scraper use.

Is there any chance of a follow-up on this?

@spiralofhope

I've downloaded large comment sets without seeing this problem myself. Perhaps it's me (my net connection?) or YouTube has improved/changed in some way.

I can help with testing if anyone can give an example or two (even with inconsistent failures).

@ftomasin

I was trying to get the comments for this video https://www.youtube.com/watch?v=koPmuEyP3a0 in order to extract information for a research project.

After realizing on multiple occasions that I was not able to run the scraper from beginning to end, I tried to gather as many comments as possible by using --stream, but the results are very unstructured.

The scraper crashes with an unknown error. I suspect the number of comments is simply too large, which is why I was looking for a way to limit the comments scraped so I get results I can interpret.

Thank you for your help!

@spiralofhope

I have it running on that URL and so far it hasn't crashed.

I wonder if there is a limitation with the brute-force nature of the youtube-comment-scraper code running with node, like some sort of memory use limit. I don't think so, because my memory usage for that process seems fairly stable; I can see that it's trying though.

I'll let the process continue, and report back later.

For reference, my command is:

node /usr/local/bin/youtube-comment-scraper --format csv --outputFile foo.csv -- koPmuEyP3a0

Maybe my choice to use csv matters; I'm not sure.

@spiralofhope

Wouldn't you know it, my hunch was right. It did end up crashing, and there is a mention of memory garbage collection.

I don't know that this will help the youtube-comment-scraper author directly, but it does give me a hint as to how I might fumble around on my own. Maybe node has memory options.

Crash output:
<--- Last few GCs --->

[29965:0xf2b9f0]  3958876 ms: Mark-sweep 1272.2 (1455.5) -> 1272.3 (1426.5) MB, 427.2 / 0.0 ms  (average mu = 0.180, current mu = 0.003) last resort GC in old space requested
[29965:0xf2b9f0]  3959309 ms: Mark-sweep 1272.3 (1426.5) -> 1272.1 (1425.5) MB, 432.8 / 0.0 ms  (average mu = 0.100, current mu = 0.000) last resort GC in old space requested


<--- JS stacktrace --->

==== JS stack trace =========================================

Security context: 0x376cb3eaee11 <JSObject>
    0: builtin exit frame: parse(this=0x376cb3ebc8a9 <Object map = 0x35f42ce0cac9>,0x2cb24a4025d9 <undefined>,0x10de1cb02201 <Very long string[161650]>,0x376cb3ebc8a9 <Object map = 0x35f42ce0cac9>)

    1: /* anonymous */ [0x324445377089] [/usr/local/lib/node_modules/youtube-comment-scraper-cli/node_modules/request/request.js:1152] [bytecode=0x1ac57ffeb19 offset=396](this=0x3fe0e05657e1 <Request...

FATAL ERROR: CALL_AND_RETRY_LAST Allocation failed - JavaScript heap out of memory
 1: 0x7f1e762bd2a8 node::Abort() [/lib/x86_64-linux-gnu/libnode.so.64]
 2: 0x7f1e762bd2f1  [/lib/x86_64-linux-gnu/libnode.so.64]
 3: 0x7f1e7649def2 v8::Utils::ReportOOMFailure(v8::internal::Isolate*, char const*, bool) [/lib/x86_64-linux-gnu/libnode.so.64]
 4: 0x7f1e7649e148 v8::internal::V8::FatalProcessOutOfMemory(v8::internal::Isolate*, char const*, bool) [/lib/x86_64-linux-gnu/libnode.so.64]
 5: 0x7f1e7682cdc2  [/lib/x86_64-linux-gnu/libnode.so.64]
 6: 0x7f1e76840967 v8::internal::Heap::AllocateRawWithRetryOrFail(int, v8::internal::AllocationSpace, v8::internal::AllocationAlignment) [/lib/x86_64-linux-gnu/libnode.so.64]
 7: 0x7f1e7680eed9 v8::internal::Factory::AllocateRawWithImmortalMap(int, v8::internal::PretenureFlag, v8::internal::Map*, v8::internal::AllocationAlignment) [/lib/x86_64-linux-gnu/libnode.so.64]
 8: 0x7f1e768162d1 v8::internal::Factory::NewRawTwoByteString(int, v8::internal::PretenureFlag) [/lib/x86_64-linux-gnu/libnode.so.64]
 9: 0x7f1e768e99b6 v8::internal::Handle<v8::internal::String> v8::internal::JsonParser<false>::SlowScanJsonString<v8::internal::SeqTwoByteString, unsigned short>(v8::internal::Handle<v8::internal::String>, int, int) [/lib/x86_64-linux-gnu/libnode.so.64]
10: 0x7f1e768ec038 v8::internal::JsonParser<false>::ParseJsonValue() [/lib/x86_64-linux-gnu/libnode.so.64]
11: 0x7f1e768eb1c5 v8::internal::JsonParser<false>::ParseJsonObject() [/lib/x86_64-linux-gnu/libnode.so.64]
12: 0x7f1e768ecbdd v8::internal::JsonParser<false>::ParseJson() [/lib/x86_64-linux-gnu/libnode.so.64]
13: 0x7f1e76566a70  [/lib/x86_64-linux-gnu/libnode.so.64]
14: 0x19b0d62d464b 
Aborted

@spiralofhope

I experimented with some parameters, to no avail:

node --max-old-space-size=10000 /usr/local/bin/youtube-comment-scraper --format csv --outputFile output.csv -- koPmuEyP3a0      
✕ unknown error
node --experimental-modules --experimental-repl-await --experimental-vm-modules --experimental-worker  --max-old-space-size=10000 /usr/local/bin/youtube-comment-scraper --format csv --outputFile output.csv -- koPmuEyP3a0
(node:20921) ExperimentalWarning: The ESM module loader is experimental.
✕ unknown error

@ftomasin

Thank you for your effort!

I was getting similar results while increasing the max-old-space-size.

@spiralofhope

I checked the node.js issues tracker for "mark-sweep" and there might be some related items in there; I don't know.

I do want to see this issue worked around, but at this point I'm in way over my head. Hopefully @philbot9 will understand things better.

@philbot9 (Owner)

Using the --stream flag is your best option. That way the program won't be storing everything in memory.
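
For example, a sketch of such an invocation, assuming --stream writes comments to standard output as they are fetched, so redirecting it keeps whatever was captured before any crash:

node /usr/local/bin/youtube-comment-scraper --stream --format json -- koPmuEyP3a0 > koPmuEyP3a0.json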

If there are further issues regarding the youtube-comment-scraper-cli tool, please post them over on that project's repo.

@spiralofhope commented May 27, 2020

I have created a new issue, Investigate "unknown error" on koPmuEyP3a0, for continued efforts using --stream.

@spiralofhope

@ftomasin I was able to use another download tool for your video of interest.

https://github.com/egbertbouman/youtube-comment-downloader

./downloader.py --youtubeid=koPmuEyP3a0 --output=koPmuEyP3a0.json

https://mega.nz/file/PVQVmSwD#sjIg_cPIBBZHeb6_FOOCVyOrJGvncm5B5fQql5kyfz4
