Improve/merge scripts, avoid RegEx, format when crawling #945

Merged
merged 34 commits into from Apr 8, 2022

Conversation

Hans5958
Member

@Hans5958 Hans5958 commented Apr 7, 2022

Inspired by #908*, I made this. I had wanted to merge these for a while, but I kept them separate so you could check the changed files more easily. Now here we are.

I also migrated it to be JSON-based instead of RegEx-based, and would you look at that speedup! The crawler is also simplified to just use JSON.

And also more additions such as converting and on subreddits, and resolving #707, so that's something.

* This is a different implementation of it.

Fix #707
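
As a rough illustration of the JSON-based approach mentioned above, here is a minimal Python sketch. It is not the code from this pull request: `format_entries` and `strip_fields` are hypothetical names, and the only fix applied is trimming whitespace. The idea is to parse the whole file once and work on fields, rather than matching patterns in raw text.

```python
import json


def format_entries(raw_text, fix_entry):
    """Parse the atlas text as JSON, apply fix_entry to each entry, serialize back."""
    entries = json.loads(raw_text)
    # Copy each entry so the fixer never mutates the parsed input in place.
    fixed = [fix_entry(dict(entry)) for entry in entries]
    return json.dumps(fixed, ensure_ascii=False)


def strip_fields(entry):
    """Hypothetical fixer: trim whitespace around every string field."""
    return {key: value.strip() if isinstance(value, str) else value
            for key, value in entry.items()}
```

Calling `format_entries(text, strip_fields)` returns the reformatted JSON text; other per-field fixes could be passed in the same way.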

@netlify

netlify bot commented Apr 7, 2022

Deploy Preview for place-atlas ready!

Name Link
🔨 Latest commit ef0f1fd
🔍 Latest deploy log https://app.netlify.com/sites/place-atlas/deploys/624fc0690a7d45000801df50
😎 Deploy Preview https://deploy-preview-945--place-atlas.netlify.app

@Hans5958 Hans5958 changed the base branch from master to cleanup April 7, 2022 05:07
@Hans5958
Member Author

Hans5958 commented Apr 7, 2022

I actually just saw that #908 also went JSON-based (I think)? Oh, well.

@Hans5958
Member Author

Hans5958 commented Apr 7, 2022

Crawler is untested, but in theory it should work. If someone can test this then I would appreciate it.

@Hans5958 Hans5958 force-pushed the regex-is-old-school-af branch 2 times, most recently from 82e87d8 to e72dd87 Compare April 7, 2022 09:03
@ab-gh ab-gh added the tooling label Apr 7, 2022
@Hans5958
Member Author

Hans5958 commented Apr 7, 2022

Hmm, what should be done with the user links?

@Hans5958 Hans5958 force-pushed the regex-is-old-school-af branch 3 times, most recently from 2f4bc97 to 143d2d5 Compare April 7, 2022 13:08
@AnonymousRandomPerson
Contributor

Is there ambiguity about user links? The main two cases are handled, being Reddit's Markdown format and URLs without protocol.

@Hans5958
Member Author

Hans5958 commented Apr 7, 2022

Is there ambiguity about user links? The main two cases are handled, being Reddit's Markdown format and URLs without protocol.

I mean the ones that people put in the subreddit field.

@AnonymousRandomPerson
Contributor

If the Reddit link is just a full URL to the subreddit, it should be simple to detect where the /r/<sub_name> starts and strip the link down to that. If for some reason they put a non-subreddit link, you can move that to website or print out if there's a conflict.
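
The suggestion above could be sketched roughly as follows in Python. This is only an illustration under stated assumptions, not the script from this pull request: `split_reddit_link` is a hypothetical helper, it only deals with full URLs (with or without a protocol), and the warning wording is made up.

```python
from urllib.parse import urlsplit


def split_reddit_link(value):
    """Return (subreddit, website, warning) for one submitted link."""
    text = value.strip()
    try:
        # Add a scheme so urlsplit can parse URLs written without a protocol.
        parsed = urlsplit(text if "://" in text else "https://" + text)
    except ValueError:
        return None, text, "unparseable link: " + text
    host = parsed.netloc.lower()
    parts = [part for part in parsed.path.split("/") if part]
    is_reddit = host == "reddit.com" or host.endswith(".reddit.com")
    if is_reddit and len(parts) >= 2 and parts[0].lower() == "r":
        # Keep only the /r/<sub_name> part and drop the rest of the URL.
        return "/r/" + parts[1], None, None
    # Not a subreddit link: hand it back as a website candidate with a warning.
    return None, text, "not a subreddit link: " + text
```

A caller would decide what to do with the second and third values, for example moving the link to `website` only when that field is empty and printing the warning otherwise.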

@Hans5958
Member Author

Hans5958 commented Apr 7, 2022

If the Reddit link is just a full URL to the subreddit, it should be simple to detect where the /r/<sub_name> starts and strip the link down to that. If for some reason they put a non-subreddit link, you can move that to website or print out if there's a conflict.

I see. I have implemented that on CSTW. To clarify again, what I mean is this. I thought it could be supported on the Atlas, but who knows?

[
	{"id": "txs8wu", "submitted_by": "_Neroxis", "name": "Neroxis", "description": "The profile picture of the redditor _Neroxis.", "website": "http://neroxis.net", "subreddit": "/u/_neroxis", "center": [1530.5, 163.5], "path": [[1522.5, 159.5], [1527.5, 154.5], [1534.5, 154.5], [1539.5, 160.5], [1539.5, 166.5], [1534.5, 171.5], [1527.5, 171.5], [1522.5, 166.5]]}
]

@Hans5958
Member Author

Hans5958 commented Apr 7, 2022

Either my connection is drunk or something else is going on, but I have pushed 3 commits and they haven't appeared. Really classic.

dbbeb436
ff11bd8e
e8cd8d04

@AnonymousRandomPerson
Contributor

Oh interesting, a user link. I don't think that's supported right now; we could probably handle that manually by moving it to the description.

@Hans5958
Member Author

Hans5958 commented Apr 7, 2022

Oh interesting, a user link. I don't think that's supported right now; we could probably handle that manually by moving it to the description.

So, that would be like this?

The profile picture of the redditor _Neroxis.

/u/_Neroxis

I mean, I think it would be better (or worse) to support it directly on the JS, but IDK.

@AnonymousRandomPerson
Contributor

In this case,

The profile picture of the redditor /u/_Neroxis.

would work.

I don't think you need to script that, since what looks best has to be decided case by case. Maybe at most print out that the issue is there so someone can manually fix it.

@Hans5958
Member Author

Hans5958 commented Apr 7, 2022

Ah, case-by-case basis. Gotcha.

One question: is the description parsed as Markdown, and how do newlines work? Are two newlines a paragraph split, or just one?

@AnonymousRandomPerson
Contributor

The description is not parsed as Markdown, so Markdown syntax like links and bold/italics will show up raw and incorrect.

I think two newlines is a paragraph split.

@Hans5958
Member Author

Hans5958 commented Apr 7, 2022

The description is not parsed as Markdown, so Markdown syntax like links and bold/italics will show up raw and incorrect.

I think two newlines is a paragraph split.

Ah, okay, so I can just leave it like that.

By the way...

Maybe at most print out that the issue is there so someone can manually fix it.

I just did that. It's also called "validation," pretty weird. Maybe sometime we can merge it with the REAL validation script. Anyway, there have been a lot of errors with it. You can't wait to see those.

@AnonymousRandomPerson
Contributor

I think we can keep it separate. The current validation script serves its purpose of making sure the JSON isn't invalid, which is a showstopper for Atlas running. These other validations are things for us to manually fix, good to have but not blocking the operation of the app.
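
The split between blocking and non-blocking checks described above might look something like this Python sketch. It is an assumption-laden illustration, not either of the project's scripts: `check_entries` is a hypothetical name, and the only soft check shown is the user-link case discussed earlier in this thread.

```python
import json


def check_entries(raw_text):
    """Return (entries, warnings). Raises ValueError if the JSON is invalid."""
    # Blocking check: json.JSONDecodeError is a subclass of ValueError,
    # so invalid JSON stops the run right here.
    entries = json.loads(raw_text)
    warnings = []
    for entry in entries:
        subreddit = entry.get("subreddit") or ""
        # Non-blocking check: report user links so someone can fix them by hand.
        if subreddit.lower().startswith(("/u/", "u/")):
            warnings.append(
                f"subreddit of entry {entry.get('id')} is a user link: {subreddit}"
            )
    return entries, warnings
```

Warnings are returned rather than raised, so a caller can print them and still write the file.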

@nico-abram
Member

Also, actually, ([^\"n]) won't escape [ and ].

We're trying to unescape them, it seems to work for me?
[image]
Not sure what you mean by that.

But you're right about reddit escaping the backslash. Your current regex should work on submissions not using a code block or the markdown editor.
[image]

Also adapted from Nick

Co-authored-by: Nicolas Abram <abramlujan@gmail.com>
@Hans5958
Member Author

Hans5958 commented Apr 8, 2022

*unescape, my apologies.

Oh, I misread. My apologies again for that.

@nico-abram
Member

Could you keep this as either 0, 1 or 2 for now?
[image]

We wanna avoid re-accepting entries that were explicitly removed via edits

@Hans5958
Member Author

Hans5958 commented Apr 8, 2022

Oh shoot forgot to port that one. I saw you got 0 so I'll do 0.

@Hans5958 Hans5958 marked this pull request as ready for review April 8, 2022 04:16
@nico-abram
Member

Looks good to me, one final question: Does this properly handle entries with multiple subreddits? It looks like the js code splits them by commas: var subreddits = entry.subreddit.split(",");

@Hans5958
Member Author

Hans5958 commented Apr 8, 2022

Yes, and it actually also trims the spaces, so both /r/sub1, /r/sub2 and /r/sub1,/r/sub2 are valid. The validator also reflects this.

https://github.com/placeAtlas/atlas/blob/808c9ba4fe112094e0e622ca4bd36da02660cbdb/web/_js/infoblock.js#L47-L61

Fun fact: The JS script also includes those that have no r/, but I made the validator warn about them so future contributors can confirm and change them into a proper format, or delete them.
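
For readers who want to see the comma handling spelled out, here is a minimal Python sketch. It is not the validator from this pull request: `check_subreddit_field` is a hypothetical name, and the rule for a well-formed name (`/r/` followed by ASCII letters, digits, or underscores) is an assumption made for illustration.

```python
def check_subreddit_field(value):
    """Split a comma-separated subreddit field into (accepted, flagged) names."""
    accepted, flagged = [], []
    for part in value.split(","):
        name = part.strip()  # "/r/sub1, /r/sub2" and "/r/sub1,/r/sub2" behave the same
        rest = name[3:]
        # Assumed rule: "/r/" followed by ASCII letters, digits, or underscores.
        well_formed = (
            name.startswith("/r/")
            and rest != ""
            and all(ch.isascii() and (ch.isalnum() or ch == "_") for ch in rest)
        )
        (accepted if well_formed else flagged).append(name)
    return accepted, flagged
```

Anything in the second list would be printed as a warning for a contributor to confirm, reformat, or delete by hand.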
@Hans5958 Hans5958 force-pushed the regex-is-old-school-af branch 2 times, most recently from 43897eb to 5b4ca08 Compare April 8, 2022 04:54
as in, new entries will be formatted automatically, but full format may be needed when there is a change on the formatter itself
@nico-abram nico-abram merged commit 6f792d9 into placeAtlas:cleanup Apr 8, 2022
@nico-abram nico-abram mentioned this pull request Apr 8, 2022
@Hans5958
Member Author

Hans5958 commented Apr 8, 2022

That was a panic in #1072. But, alas! 🎉

I'd like to thank everyone who has helped me on this pull request. May this be a useful addition for everyone who uses the site. :)

Here are the errors from today's full format run. If anyone wants to fix them, go ahead.

Formatting ../web/atlas.json...
0 checked.
500 checked.
subreddit of entry twmyah is still invalid! V̵̝̆̊͘ͅͅO̵̒̎͐͝ͅỊ̸̙͎̌̕͝D̸̰͎̀̑̏́͜
subreddit of entry twm4ix is still invalid! /r/Touhou, r/, Hatsune
subreddit of entry twm1va is still invalid! /r/Aphex, twin
website of entry twle0e is still invalid! https://reddit.com/r/theowlhouse https://reddit.com/amphibia https://reddit.com/r/princessesofpower https://reddit.com/ghostandmollymcgee
subreddit of entry twlcgh is still invalid! /r\/scottthewoz
subreddit of entry twl3ve is still invalid! /b/, (it's, not, on, reddit)
1000 checked.
subreddit of entry twrmuj is still invalid! /r/Re:Zero
website of entry twre3d is still invalid! https://www.willhirsch.gay, https://www.squidindustries.co, https://www.nrbknives.com, https://www.glidr.co, https://www.zippybalisong.com, https://youtube.com/c/camaroEE
website of entry twr947 is still invalid! https://www.twitch.tv/elraenn, https://www.youtube.com/channel/UCUpMmEDtYEoZxYYRa_gh5eQ/videos, https://www.reddit.com/r/place/comments/twcbut/were_trying_to_recreate_one_of_the_first_art/
subreddit of entry twqkpr is still invalid! /r/duck_place, /r/stardewvalley,
website of entry twpx41 is still invalid! https://The Legend Of Korra: https://avatar.fandom.com/wiki/The_Legend_of_Korra
1500 checked.
website of entry twsy2c is still invalid! https://Watch CaptainSparklez with me on Twitch! https://www.twitch.tv/captainsparklez?sr=a
subreddit of entry twsweb is still invalid! /r/portugal;, /r/PortugalCaralho
2000 checked.
website of entry twuivz is still invalid! https://www.twitch.tv/moose_taffy https://www.twitch.tv/jayaitch
subreddit of entry twuak2 is still invalid! /r/Adventure, Time
subreddit of entry twu909 is still invalid! /r/Adventure, Time
website of entry twwalf is still invalid! https://Moon's Sub Discord
2500 checked.
subreddit of entry twvoxr is still invalid! /r/Parahumans/, /r/tokipona
subreddit of entry twvfp5 is still invalid! /r/Mass, Effect
subreddit of entry twv7fm is still invalid! /r/HFY/comments/bfrj07/retreat_hell
subreddit of entry twv4r8 is still invalid! /r/Place_the_wave, /r/Senzawa, /r/Lost, /r/Tunisia, /r/Re_zero, /r/Vivy,
subreddit of entry twv3ns is still invalid! /r/PlaceStart, /r/osuplace,
subreddit of entry twxp29 is still invalid! /r/Logan, ce, bg, de, dieu
subreddit of entry twxon1 is still invalid! /r/Seaofthieves/comments/tw1h45/this_is_awesome_guys_good_job
3000 checked.
subreddit of entry twxdpo is still invalid! /r/uofu/, /r/fnatic
website of entry twxahx is still invalid! https://www.mobiusdigitalgames.com/outer-wilds.html https://rainworldgame.com/
website of entry tx3dku is still invalid! https;//www.mlb.com/dodgers
website of entry tx36m4 is still invalid! https://coasterbot.com/ https://seekingthethrill.com/
subreddit of entry tx32d9 is still invalid! Solo, Project
subreddit of entry tx31lz is still invalid! /u/Dinklebean_
3500 checked.
website of entry tx2c8w is still invalid! https://private discord server
subreddit of entry tx1gud is still invalid! [/r/Splatoon](/r/Splatoon
subreddit of entry tx17tu is still invalid! /r/Nepal, /u/Horsefur
subreddit of entry tx0pb0 is still invalid! /u/Horsefur
subreddit of entry tx0ltv is still invalid! /u/Horsefur
4000 checked.
website of entry tx6f9r is still invalid! https://discord.gg/kbDYYPnP https://discord.gg/nYF5NNYU
subreddit of entry tx5zai is still invalid! https://dm./r/mexico
subreddit of entry tx5jlk is still invalid! https://preview.redd.it/zn615bk03mr81.png?width=165&format=png&auto=webp&s=0be4a49aa04bd311f2bd0a1a2b649eac75fd9f01
subreddit of entry tx59s0 is still invalid! https://osu.ppy.sh/wiki/en/Play_style
website of entry tx54tz is still invalid! https://nopixel.fandom.com/wiki/Leanbois and https://nopixel.fandom.com/wiki/Cleanbois
4500 checked.
website of entry tx4isr is still invalid! https://www.nijisanji.jp/en https://en.hololive.tv/
subreddit of entry tx42p4 is still invalid! not, affiliated, with, /r/azudaioh, made, this, on, my, own
subreddit of entry tx3zts is still invalid! https://www.reddit.com/use/r/Onutrem
subreddit of entry tx3x3x is still invalid! /u/Manipendeh, /u/altermace
subreddit of entry tx9yzg is still invalid! BeamNG
subreddit of entry tx8o6w is still invalid! r\/Vinesauce
subreddit of entry tx8jrj is still invalid! /r/Grateful, Dead
5000 checked.
subreddit of entry tx8akp is still invalid! /r/Québec
subreddit of entry tx7ntj is still invalid! /r/Québec
website of entry txbtjt is still invalid! https:///r/bolivia
website of entry txbdc8 is still invalid! https://www.google.com/search?q=mr incredible meme&sxsrf=APq-WBuOv-7aonp24Ts-k4avZ55ONDusZg%3A1649210934822&ei=NvZMYqDnMeKJxc8PldSH2Ao&ved=0ahUKEwig2re_rf72AhXiRPEDHRXqAasQ4dUDCA4&uact=5&oq=mr incredible meme&gs_lcp=Cgdnd3Mtd2l6EAMyBAgAEEMyBQgAEMsBMgUIABDLATIFCAAQywEyBQgAEMsBMgUIABDLATIFCAAQywEyBQgAEMsBMgUIABDLATIFCAAQywE6BwgAEEcQsAM6BwgAELADEEM6BwguELADEEM6CggAEOQCELADGAE6DAguEMgDELADEEMYAjoPCC4Q1AIQyAMQsAMQQxgCOgUILhDLAUoECEEYAEoECEYYAVDwAVj-BWDbBmgBcAF4AIABUIgBjQOSAQE1mAEAoAEByAETwAEB2gEGCAEQARgJ2gEGCAIQARgI&sclient=gws-wiz
website of entry txbcq3 is still invalid! https:///r/bolivia
subreddit of entry txba3b is still invalid! /quebec
website of entry txb8m7 is still invalid! https:///r/bolivia
subreddit of entry txb271 is still invalid! /quebec
subreddit of entry txb083 is still invalid! whitemesa
5500 checked.
subreddit of entry txgyc2 is still invalid! , /r/funnymansgg
subreddit of entry txfqgt is still invalid! [/r/Gamecocks
subreddit of entry txfdp1 is still invalid! [/r/Vulfpeck
subreddit of entry txfalv is still invalid! [/r/RobloxCountlessWorlds
subreddit of entry txf0cz is still invalid! [/r/Moomins
subreddit of entry txe6s5 is still invalid! /r/México
6000 checked.
subreddit of entry txdzf9 is still invalid! [/r/deqiuv
subreddit of entry txdndu is still invalid! /r/México
subreddit of entry txd8wt is still invalid! [/r/place_CentralAlliance](/r/place_CentralAlliance
subreddit of entry txhzen is still invalid! [/r/le_gamer_club
subreddit of entry txkfqa is still invalid! /r/Switzerland/,
website of entry txjen7 is still invalid! https://Former link: [arca.live/b/minbokworld](https://arca.live/b/minbokworld) / New link: [https://cafe.naver.com/minbok2](https://cafe.naver.com/minbok2)
website of entry txiv4x is still invalid! https://the r/Israel Discord channel
website of entry txisb0 is still invalid! https://the r/Israel Discord server
subreddit of entry txnwh1 is still invalid! [/r/Totless
6500 checked.
subreddit of entry txlslu is still invalid! ,
subreddit of entry txlgtl is still invalid! [/r/epita
subreddit of entry txs8wu is still invalid! /u/_neroxis
subreddit of entry txrcns is still invalid! [/r/GirlsLastTour
subreddit of entry txr62x is still invalid! /r/miamidolphins,
website of entry txq9w0 is still invalid! https://twitter.com/Juanma_M_05 in twitter and https://www.instagram.com/diegor0n/?utm_medium=copy_link in instagram
subreddit of entry txq49k is still invalid! [/r/czech](/r/czech)
subreddit of entry txp76h is still invalid! we, dont, have, an, official, subreddit
subreddit of entry txp6kt is still invalid! JLMafiaa
subreddit of entry txun4q is still invalid! [/r/Argentina](/r/Argentina
subreddit of entry txw0yd is still invalid! [/r/CaptainPuffy
subreddit of entry ty0qe3 is still invalid! /u/blackdragon6547
7000 checked.
subreddit of entry ty30rl is still invalid! /r/Minecraft,
subreddit of entry ty2yac is still invalid! /u/TheJosiahTurner
subreddit of entry tycehc is still invalid! /r/loserfruitofficial,
subreddit of entry ty76wz is still invalid! [/r/dreamsmp](/r/dreamsmp
7500 checked.
subreddit of entry tyinbg is still invalid! several
subreddit of entry typqdc is still invalid! channel/UCDeQ02bwQsKCOly9FzvdWHA
website of entry typq3n is still invalid! https://www.polymtl.ca/ & https://www.umontreal.ca/
7624 checked.
Writing completed. All done.
Formatting ../web/atlas-before-ids-migration.json...
0 checked.
subreddit of entry 1 is still invalid! ccKufiPrFaShleWoli0
500 checked.
website of entry twn9c1 is still invalid! https://Indian Space Research Organisation
subreddit of entry twn05d is still invalid! /r/PERU, with, the, help, of, other, random, redditords
subreddit of entry twmyah is still invalid! V̵̝̆̊͘ͅͅO̵̒̎͐͝ͅỊ̸̙͎̌̕͝D̸̰͎̀̑̏́͜
website of entry twmqtd is still invalid! https:/omori-game.com
subreddit of entry twm4ix is still invalid! /r/Touhou, r/, Hatsune
1000 checked.
subreddit of entry twm2l0 is still invalid! [/r/Senzawa
subreddit of entry twm1va is still invalid! /r/Aphex, twin
website of entry twlfjl is still invalid! https://www.twitch.tv/vinesauce https://www.twitch.tv/jerma985
website of entry twle0e is still invalid! https://[reddit.com/r/theowlhouse](https://reddit.com/r/theowlhouse) [reddit.com/amphibia](https://reddit.com/amphibia) [reddit.com/r/princessesofpower](https://reddit.com/r/princessesofpower) [reddit.com/ghostandmollymcgee](https://reddit.com/ghostandmollymcgee)
subreddit of entry twl3ve is still invalid! /b/, (it's, not, on, reddit)
1247 checked.
Writing completed. All done.

4 participants