use index dicts to make trope list to compare to list from Main directory #9

Closed
jwzimmer-zz opened this issue Oct 22, 2020 · 12 comments

@jwzimmer-zz
Owner

we could make a set of all the trope titles from all the indices pages (using the dicts) and check that against the list of tropes from the main folder on gh? the idea being that when i downloaded the site if i missed something from one of those i got it in the other?
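A minimal sketch of that comparison, assuming each index dict maps a trope title to its URL (the actual dict layout isn't shown in this thread, and the titles below are toy data):

```python
from typing import Dict, Iterable, Set


def titles_from_index_dicts(index_dicts: Iterable[Dict[str, str]]) -> Set[str]:
    """Union the trope titles (assumed to be the dict keys) across all index dicts."""
    titles: Set[str] = set()
    for index in index_dicts:
        titles.update(index.keys())
    return titles


def missing_from_main(index_titles: Set[str], main_titles: Iterable[str]) -> Set[str]:
    """Titles that appear on some index page but not in the Main folder list."""
    return index_titles - set(main_titles)


# Hypothetical toy data, for illustration only.
index_a = {"AlwaysMale": "Main/AlwaysMale", "ChekhovsGun": "Main/ChekhovsGun"}
index_b = {"ChekhovsGun": "Main/ChekhovsGun", "RedHerring": "Main/RedHerring"}
main_list = ["AlwaysMale", "ChekhovsGun"]

index_titles = titles_from_index_dicts([index_a, index_b])
print(sorted(missing_from_main(index_titles, main_list)))
```

Anything this reports is a candidate for "missed during download" and would still need a manual look.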

@nguyenhphilip has a list from the main directory we can compare to

there seem to be just a few tropes and indices that are listed on https://tvtropes.org/pmwiki/index_report.php but are not in https://github.com/jwzimmer/tv-tropes/tree/main/tvtropes.org/pmwiki/pmwiki.php/Main (phil has another list for those few pages too)... that could be because the index report page is just not a complete record of the contents of the folder and vice versa, or it could be because i missed some of the pages when i was downloading the site. so we should do our best to verify that it isn't the latter (because obviously we don't want to miss any tropes in our analysis).

what about: https://github.com/jwzimmer/tv-tropes/blob/main/tvtropes.org/index.html

@jwzimmer-zz jwzimmer-zz self-assigned this Oct 22, 2020
@jwzimmer-zz
Owner Author

(overarching goal: satisfy ourselves everything we care about from the website is now on GH, re #1)

@nguyenhphilip
Collaborator

nguyenhphilip commented Oct 23, 2020

after filtering out any files with 'Index' or 'Trope' in their filename, there are still 4936 files to work with, although it's very likely that some of the files in here still aren't only individual trope pages. I've stored the names of the pages with 'Index' or 'Trope' in them as separate JSON files in case we need to use them. It's unclear to me how many files/data objects we need to do a network analysis, but it seems like even with ~4000 individual trope pages we can probably find a large number of interesting connections between them.
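The filename filter described above could look roughly like this. It is a sketch only: the match is assumed to be a case-sensitive substring test, and the filenames are made up.

```python
import json
from typing import Iterable, List, Tuple

MARKERS = ("Index", "Trope")  # substrings named in the comment above


def split_by_marker(filenames: Iterable[str]) -> Tuple[List[str], List[str]]:
    """Return (kept, set_aside): names without / with any marker substring."""
    kept: List[str] = []
    set_aside: List[str] = []
    for name in filenames:
        if any(marker in name for marker in MARKERS):
            set_aside.append(name)
        else:
            kept.append(name)
    return kept, set_aside


# Hypothetical filenames, for illustration only.
names = ["ChekhovsGun", "NarrativeTropes", "AlwaysMale", "PlotIndex"]
kept, set_aside = split_by_marker(names)

# The set-aside names can be stored as JSON in case they are needed later.
set_aside_json = json.dumps(set_aside, indent=2)
print(kept)
print(set_aside_json)
```

Note that a plain substring test also catches names like "NarrativeTropes", which is probably intended here but worth keeping in mind.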

As for parsing the HTML itself, it looks like the main content of individual tropes is structured within a <div> with attribute id = 'main_article'. Links within main_article are nested in paragraphs <p> and seem to link only to other individual tropes, indexes, or trope category pages.
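A rough sketch of pulling those links out with only the standard library. It assumes the div id is exactly `main_article` as described above; the class name and sample HTML are hypothetical, and real pages may need a more forgiving parser.

```python
from html.parser import HTMLParser
from typing import List


class MainArticleLinks(HTMLParser):
    """Collect href values of <a> tags that sit inside a <p> within div#main_article."""

    def __init__(self) -> None:
        super().__init__()
        self.div_depth = 0  # <div> nesting depth, counted from the target div
        self.p_depth = 0    # > 0 while inside a <p> within the target div
        self.links: List[str] = []

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "div":
            # Start counting at the target div, then track nested divs.
            if self.div_depth > 0 or attr.get("id") == "main_article":
                self.div_depth += 1
        elif self.div_depth > 0 and tag == "p":
            self.p_depth += 1
        elif self.div_depth > 0 and self.p_depth > 0 and tag == "a":
            href = attr.get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "div" and self.div_depth > 0:
            self.div_depth -= 1
            if self.div_depth == 0:
                self.p_depth = 0  # left the target div entirely
        elif tag == "p" and self.p_depth > 0:
            self.p_depth -= 1


# Hypothetical sample page, for illustration only.
sample = """
<div id="header"><p><a href="/ignored">nav</a></p></div>
<div id="main_article">
  <p>See <a href="/pmwiki/pmwiki.php/Main/ChekhovsGun">Chekhov's Gun</a>.</p>
  <div class="note"><p><a href="/pmwiki/pmwiki.php/Main/RedHerring">Red Herring</a></p></div>
  <a href="/outside-paragraph">not in a paragraph</a>
</div>
<p><a href="/footer">footer</a></p>
"""

parser = MainArticleLinks()
parser.feed(sample)
print(parser.links)
```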

How to filter non-individual trope pages out?

@nguyenhphilip
Collaborator

I made a list of all the items inside any file starting with 'txt_dict_from_', as these are the indices we want to check the contents of Main against. These are saved as json files as per the last commit.
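Roughly, collecting those items might look like the sketch below. It assumes every `txt_dict_from_` file holds a JSON object whose keys are the item names; the real file format isn't shown in this thread, and the demo files are throwaway examples.

```python
import json
import tempfile
from pathlib import Path
from typing import Set

PREFIX = "txt_dict_from_"


def collect_index_items(folder: Path, prefix: str = PREFIX) -> Set[str]:
    """Union the keys of every JSON dict file in `folder` whose name starts with `prefix`."""
    items: Set[str] = set()
    for path in sorted(folder.glob(prefix + "*")):
        if not path.is_file():
            continue
        with path.open(encoding="utf-8") as handle:
            data = json.load(handle)
        items.update(data.keys())
    return items


# Demo on throwaway files with hypothetical names and contents.
with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    (folder / "txt_dict_from_PlotIndex.json").write_text(
        json.dumps({"ChekhovsGun": "x", "RedHerring": "y"}), encoding="utf-8"
    )
    (folder / "txt_dict_from_GenderIndex.json").write_text(
        json.dumps({"AlwaysMale": "z"}), encoding="utf-8"
    )
    (folder / "unrelated.json").write_text(
        json.dumps({"NotAnIndexItem": "w"}), encoding="utf-8"
    )
    index_items = collect_index_items(folder)

# Keep only the Main files that also appear in the index items.
main_files = {"AlwaysMale", "ChekhovsGun", "SomeOtherPage"}
in_both = main_files & index_items
print(len(index_items), sorted(in_both))
```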

If we remove the files in Main that are not in this 'txt_dict_from_' list, we are left with 3689 items. These items look like tropes though, so we might actually not want to remove them at all, or at least we'll have to go through with our eyes and see what else needs to be filtered. Seems like this will be easier once we determine more clearly what we're looking for, which may depend on the structure of the individual tropes themselves and how they link to other tropes. This might be easy though once we have a list of individual tropes, from Main or some combination of the various files we haven't checked yet (though I think Main is our best bet) since we can just filter links out that don't link to any other item in our individual trope list.
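The link filtering mentioned at the end could be sketched like this, assuming links have already been reduced to page names (the function name and toy data are hypothetical):

```python
from typing import Dict, Iterable, List, Set, Tuple


def trope_edges(
    links_by_page: Dict[str, Iterable[str]], tropes: Set[str]
) -> List[Tuple[str, str]]:
    """Keep only (source, target) links where both ends are in the trope list.

    Self-links are dropped, since the goal is links to *other* tropes.
    """
    edges: List[Tuple[str, str]] = []
    for source, targets in links_by_page.items():
        if source not in tropes:
            continue
        for target in targets:
            if target in tropes and target != source:
                edges.append((source, target))
    return edges


# Hypothetical toy data, for illustration only.
links = {
    "ChekhovsGun": ["RedHerring", "PlotIndex", "ChekhovsGun"],
    "RedHerring": ["ChekhovsGun"],
    "PlotIndex": ["RedHerring"],
}
tropes = {"ChekhovsGun", "RedHerring"}

edges = trope_edges(links, tropes)
print(edges)
```

The resulting pairs are the kind of edge list a network analysis would start from.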

@nguyenhphilip
Collaborator

The 'txt_dict_from_' items that are not in Main ALSO look like tropes, so these are definitely files not in Main that we may be interested in. That makes me wonder though... where are they if not in Main?

@nguyenhphilip
Collaborator

nguyenhphilip commented Oct 24, 2020

OH ! So it looks like the folder pmwiki.php, which holds Main, holds other folders with other tropes in it as well... this makes the search a bit harder though since there appears to be random stuff in here as well, not just individual tropes. Which is to say, with this giant list, we need a way to QC. Might be more feasible to focus on using Main since the things that make it into Main are probably the primary files, which are presumably better maintained/more active pages?

Will need to look into the actual contents of the other folders inside pmwiki.php. Still betting on Main as our main data source.
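One quick way to get a handle on the other folders is to count files per folder under pmwiki.php. This is a sketch only; the demo builds a throwaway directory with hypothetical folder names rather than reading the real repo.

```python
import tempfile
from collections import Counter
from pathlib import Path


def files_per_namespace(pmwiki_dir: Path) -> Counter:
    """Count the files directly inside each subfolder of `pmwiki_dir`."""
    counts: Counter = Counter()
    for child in pmwiki_dir.iterdir():
        if child.is_dir():
            counts[child.name] = sum(1 for p in child.iterdir() if p.is_file())
    return counts


# Demo on a throwaway tree; in the repo the root would be something like
# tvtropes.org/pmwiki/pmwiki.php (path taken from the links in this thread).
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for namespace, pages in {"Main": ["A", "B", "C"], "Film": ["X"]}.items():
        (root / namespace).mkdir()
        for page in pages:
            (root / namespace / page).write_text("", encoding="utf-8")
    (root / "stray_file").write_text("", encoding="utf-8")
    counts = files_per_namespace(root)

for namespace, n in counts.most_common():
    print(namespace, n)
```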

@jwzimmer-zz
Owner Author

@nguyenhphilip this is great, thanks! I think that plan totally makes sense. Let's look (manually) at what kind of trope pages are in the folders besides Main in pmwiki.php and verify that they'll be reasonable to exclude. I think if we have some way to categorize those pages it will almost certainly be okay to look just at what's in Main, but let's have some handle on what we're excluding, you know? I'll do that now.

@jwzimmer-zz
Owner Author

jwzimmer-zz commented Oct 24, 2020

I think we might just want to use the tropes listed in https://tvtropes.org/pmwiki/pagelist_having_pagetype_in_namespace.php?n=Main&t=trope.

This is their manually categorized list of what counts as a trope, which might be both a justifiable and concise way to decide which articles to consider and which to exclude.

@nguyenhphilip
Copy link
Collaborator

Using above trope list, we were able to extract 27485 files for individual tropes !! This looks like all of them. Super exciting because we can investigate lots of research Qs! Step 1 extract actual contents of trope pages and explore?

@jwzimmer-zz
Owner Author

Resolved by @nguyenhphilip in 61388ba! : )

@jwzimmer-zz
Owner Author

Per discussion with Phil, reopening briefly for one last hurrah of trying to decide what counts as a trope -

we need to decide between: using strictly what they've labeled as tropes as the master list OR using that PLUS some tropes we identify as tropes that are in Main but not the masterlist.
But the issue with that is it pulls in many non-trope pages which we'll have to filter out

if there are not too many "obvious tropes" like Always Male, we should just append them to the master list
But if there are a lot of files, and it isn't obvious what is a trope, a trope of tropes, an index, etc., then we should stick with their master list strictly as the definition of "trope"
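The decision rule above can be sketched as a small function. The cutoff `max_extras` is a hypothetical parameter (no number was agreed in this thread), and in practice any extras would still be checked by hand before being appended.

```python
from typing import Set, Tuple


def choose_trope_set(
    main_pages: Set[str], masterlist: Set[str], max_extras: int
) -> Tuple[Set[str], Set[str]]:
    """Return (trope_set, extras), where extras are Main pages not in the masterlist.

    If there are few extras, append them to the masterlist; otherwise stick
    strictly with the masterlist as the definition of "trope".
    """
    extras = main_pages - masterlist
    if len(extras) <= max_extras:
        return masterlist | extras, extras
    return set(masterlist), extras


# Hypothetical toy data, for illustration only.
main_pages = {"AlwaysMale", "ChekhovsGun", "PlotIndex", "SomeMetaPage"}
masterlist = {"ChekhovsGun", "RedHerring"}

few_ok, extras = choose_trope_set(main_pages, masterlist, max_extras=5)
strict, _ = choose_trope_set(main_pages, masterlist, max_extras=1)
print(len(extras), sorted(few_ok), sorted(strict))
```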

@jwzimmer-zz jwzimmer-zz reopened this Oct 28, 2020
@jwzimmer-zz
Owner Author

a180ca8
results:
Pages in Main but not in masterlist 2124
Pages in masterlist but not in Main 23842

(am i only looking at the first part of Main or something? or are there really that many things in the masterlist compared to main? well, either way, 2000 is a lot - so let's go with masterlist as definition of "trope", although indexes and metatropes are still useful, but...)

for the sake of having a stable definition let's go with masterlist - Phil agrees
