use index dicts to make trope list to compare to list from Main directory #9
Comments
also: make a list from https://github.com/jwzimmer/tv-tropes/blob/main/tvtropes.org/index.html |
(overarching goal: satisfy ourselves everything we care about from the website is now on GH, re #1) |
After filtering out any files with 'Index' or 'Trope' in their filename, there are still 4936 files to work with, although it's very likely that some of these still aren't individual trope pages. I've stored the names of the pages with 'Index' or 'Trope' in them as separate JSON files in case we need them. It's unclear to me how many files/data objects we need for a network analysis, but even with ~4000 individual trope pages we can probably find a large number of interesting connections between them. As for parsing the HTML itself, it looks like the main content of individual tropes is structured within a How to filter non-individual trope pages out? |
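For reference, the filename filter described here could be sketched roughly as below. This is only an illustration: the function name `split_by_keywords` and the sample names are made up, and the match is assumed to be a case-sensitive substring check on the filename.

```python
def split_by_keywords(filenames, keywords=("Index", "Trope")):
    """Split filenames into (kept, excluded).

    A name is excluded if it contains any keyword as a
    case-sensitive substring (an assumption about the filter).
    """
    kept, excluded = [], []
    for name in filenames:
        if any(keyword in name for keyword in keywords):
            excluded.append(name)
        else:
            kept.append(name)
    return kept, excluded


# Hypothetical sample names, not taken from the repository.
sample = ["AlwaysMale", "NarrativeTropes", "IndexIndex", "ChekhovsGun"]
kept, excluded = split_by_keywords(sample)
```

The excluded names are returned rather than dropped, so they can still be written out to their own JSON file for later inspection.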
I made a list of all the items inside any file starting with 'txt_dict_from_', as these are the indices we want to check the contents of Main against. These are saved as json files as per the last commit. If we remove the files in Main that are not in this 'txt_dict_from_' list, we are left with 3689 items. These items look like tropes though, so we might actually not want to remove them at all, or at least we'll have to go through with our eyes and see what else needs to be filtered. Seems like this will be easier once we determine more clearly what we're looking for, which may depend on the structure of the individual tropes themselves and how they link to other tropes. This might be easy though once we have a list of individual tropes, from Main or some combination of the various files we haven't checked yet (though I think Main is our best bet) since we can just filter links out that don't link to any other item in our individual trope list. |
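The link-filtering idea at the end of this comment might look something like the sketch below. It assumes we already have, for each page, a list of the page names it links to; the name `keep_internal_links` and the input shape are hypothetical, not something that exists in the repo.

```python
def keep_internal_links(links_by_page, trope_names):
    """Keep only pages and link targets that are in the trope list.

    links_by_page: dict mapping a page name to the list of page
                   names it links to (assumed input shape).
    trope_names:   iterable of names we treat as individual tropes.
    """
    tropes = set(trope_names)
    return {
        page: [target for target in targets if target in tropes]
        for page, targets in links_by_page.items()
        if page in tropes
    }
```

The result is an adjacency list restricted to the trope list, which is roughly what a network analysis would need as input.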
The 'txt_dict_from_' items that are not in Main ALSO look like tropes, so these are definitely files not in Main that we may be interested in. That makes me wonder though... where are they if not in Main? |
OH ! So it looks like the folder Will need to look into the actual contents of the other folders inside |
@nguyenhphilip this is great, thanks! I think that plan totally makes sense. Let's look (manually) at what kind of trope pages are in the folders besides Main in pmwiki.php and verify that they'll be reasonable to exclude. I think if we have some way to categorize those pages it will almost certainly be okay to look just at what's in Main, but let's have some handle on what we're excluding, you know? I'll do that now. |
I think we might just want to use the tropes listed in https://tvtropes.org/pmwiki/pagelist_having_pagetype_in_namespace.php?n=Main&t=trope. This is their manually categorized list of what counts as a trope, which might be both a justifiable and concise way to decide which articles to consider and which to exclude. |
Using the above trope list, we were able to extract 27485 files for individual tropes!! This looks like all of them. Super exciting because we can investigate lots of research Qs! Step 1: extract the actual contents of trope pages and explore? |
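A rough sketch of how files could be matched against the trope list, in case it helps to reproduce the count. It assumes that each name in the list matches a file in the Main folder once any extension is stripped; that mapping and the name `select_trope_files` are assumptions, not necessarily how the extraction in the commit was done.

```python
from pathlib import Path


def select_trope_files(main_dir, master_names):
    """Return (found, missing) for a list of trope names.

    found:   dict mapping trope name -> Path of the matching file
    missing: sorted list of names with no matching file
    Matching compares each file's stem (name without extension)
    to the names in the list, which is an assumption.
    """
    wanted = set(master_names)
    found = {}
    for path in Path(main_dir).iterdir():
        if path.is_file() and path.stem in wanted:
            found[path.stem] = path
    missing = sorted(wanted - set(found))
    return found, missing
```

Checking `missing` is a quick way to see whether any names in the list have no file in the download.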
Resolved by @nguyenhphilip in 61388ba! : ) |
Per discussion with Phil, reopening briefly for one last hurrah of trying to decide what counts as a trope. We need to decide between: using strictly what they've labeled as tropes as the master list OR using that PLUS some tropes we identify as tropes that are in Main but not the master list. If there are not too many "obvious tropes" like Always Male, we should just append them to the master list |
a180ca8 (am i only looking at the first part of Main or something? or are there really that many things in the masterlist compared to main? well, either way, 2000 is a lot - so let's go with masterlist as definition of "trope", although indexes and metatropes are still useful, but...) for the sake of having a stable definition let's go with masterlist - Phil agrees |
we could make a set of all the trope titles from all the indices pages (using the dicts) and check that against the list of tropes from the main folder on gh? the idea being that when i downloaded the site if i missed something from one of those i got it in the other?
@nguyenhphilip has a list from the main directory we can compare to
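a minimal sketch of that comparison, assuming each index dict is keyed by trope title and the Main list is a plain list of titles (both are assumptions about the data; `compare_title_sets` is a made-up name):

```python
def compare_title_sets(index_dicts, main_titles):
    """Compare titles gathered from index dicts with the Main list.

    index_dicts: iterable of dicts keyed by trope title (assumed)
    main_titles: iterable of titles from the Main directory
    """
    index_titles = set()
    for index_dict in index_dicts:
        index_titles.update(index_dict)  # iterating a dict yields its keys
    main = set(main_titles)
    return {
        "only_in_indices": sorted(index_titles - main),
        "only_in_main": sorted(main - index_titles),
        "in_both": sorted(index_titles & main),
    }
```

anything in `only_in_indices` would be a candidate for a page that was missed in the download, and `only_in_main` would be pages that no index lists.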
there seem to be just a few tropes and indices that are listed on https://tvtropes.org/pmwiki/index_report.php but are not in https://github.com/jwzimmer/tv-tropes/tree/main/tvtropes.org/pmwiki/pmwiki.php/Main (phil has another list for those few pages too)... that could be because the page is just not a complete record of the contents of the folder and vice versa, or it could be because i missed some of the pages when i was downloading the site. so we should do our best to verify that it isn't the latter (because obviously we don't want to miss any tropes in our analysis).
what about: https://github.com/jwzimmer/tv-tropes/blob/main/tvtropes.org/index.html