use index dicts to make trope list to compare to list from Main directory #9

jwzimmer-zz · 2020-10-22T02:39:11Z

we could make a set of all the trope titles from all the index pages (using the dicts) and check that against the list of tropes from the Main folder on gh? the idea being that if i missed something in one of those when i downloaded the site, i got it in the other.

@nguyenhphilip has a list from the main directory we can compare to

there seem to be just a few tropes and indices that are listed on https://tvtropes.org/pmwiki/index_report.php but are not in https://github.com/jwzimmer/tv-tropes/tree/main/tvtropes.org/pmwiki/pmwiki.php/Main (phil has another list for those few pages too)... that could be because the index report page is just not a complete record of the contents of the folder (and vice versa), or it could be because i missed some of the pages when i was downloading the site. so we should do our best to verify that it isn't the latter (because obviously we don't want to miss any tropes in our analysis).

what about: https://github.com/jwzimmer/tv-tropes/blob/main/tvtropes.org/index.html
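
The comparison proposed above could be sketched roughly as follows. This is only an illustration: it assumes each index dict has trope titles as its keys and that the Main folder listing is available as a list of page names. The function name `compare_trope_lists` is made up for the example and is not something in the repo.

```python
def compare_trope_lists(index_dicts, main_pages):
    """Compare titles collected from index dicts against the Main folder listing.

    index_dicts: iterable of dicts whose keys are assumed to be trope titles.
    main_pages:  iterable of page names found in the Main folder.
    """
    from_indices = set()
    for index_dict in index_dicts:
        from_indices.update(index_dict)  # updating from a dict adds its keys

    main = set(main_pages)
    return {
        "only_in_indices": sorted(from_indices - main),  # possibly missed in the download
        "only_in_main": sorted(main - from_indices),     # not listed on any index page
        "in_both": sorted(from_indices & main),
    }
```

Anything that ends up in `only_in_indices` would be a candidate for a page that was missed in the download, which is the case we want to rule out.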

jwzimmer-zz · 2020-10-22T02:43:56Z

also: make a list from https://github.com/jwzimmer/tv-tropes/blob/main/tvtropes.org/index.html

jwzimmer-zz · 2020-10-22T14:42:50Z

(overarching goal: satisfy ourselves everything we care about from the website is now on GH, re #1)

nguyenhphilip · 2020-10-23T03:15:55Z

after filtering out any files with 'Index' or 'Trope' in their filename, there are still 4936 files to work with, although it's very likely that some of the files in here still aren't only individual trope pages. I've stored the names of the pages with 'Index' or 'Trope' in them as separate JSON files in case we need to use them. It's unclear to me how many files/data objects we need to do a network analysis, but it seems like even with ~4000 individual trope pages we can probably find a large number of interesting connections between them.
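
For reference, a minimal sketch of this kind of filename filter, assuming the filenames are already in a plain list. The function names and the output path are hypothetical, and this is not necessarily how the filtering was done in the actual commit.

```python
import json


def split_by_keywords(filenames, keywords=("Index", "Trope")):
    """Separate filenames containing any keyword from the rest (case-sensitive)."""
    flagged = {keyword: [] for keyword in keywords}
    remaining = []
    for name in filenames:
        hits = [keyword for keyword in keywords if keyword in name]
        if hits:
            for keyword in hits:
                flagged[keyword].append(name)
        else:
            remaining.append(name)
    return flagged, remaining


def save_json(obj, path):
    """Write obj to path as JSON so the flagged names can be reused later."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(obj, handle, indent=2)
```

Note that a plain substring match on 'Trope' also catches names like 'GenderTropes', so the flagged lists are a superset of the true index pages.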

As for parsing the HTML itself, it looks like the main content of individual tropes is structured within a <div> with attribute id = 'main_article'. Links within main_article are nested in paragraphs <p> and seem to link only to other individual tropes, indexes, or trope category pages.
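
A rough sketch of pulling those links out, using only the Python standard library. It assumes the structure described above (a div with id main_article, with links inside <p> tags) and has not been run against the real pages. The class and function names are invented for the example.

```python
from html.parser import HTMLParser


class MainArticleLinks(HTMLParser):
    """Collect href values of <a> tags inside <p> tags within the main_article div."""

    def __init__(self):
        super().__init__()
        self.div_depth = 0  # 0 = outside main_article; counts nested <div>s inside it
        self.in_p = False   # True while inside a <p> within main_article
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div":
            if self.div_depth > 0:
                self.div_depth += 1
            elif attrs.get("id") == "main_article":
                self.div_depth = 1
        elif self.div_depth > 0:
            if tag == "p":
                self.in_p = True
            elif tag == "a" and self.in_p and attrs.get("href"):
                self.links.append(attrs["href"])

    def handle_endtag(self, tag):
        if tag == "div" and self.div_depth > 0:
            self.div_depth -= 1
            if self.div_depth == 0:
                self.in_p = False
        elif tag == "p":
            self.in_p = False


def extract_links(html):
    """Return the list of hrefs found in paragraphs of the main_article div."""
    parser = MainArticleLinks()
    parser.feed(html)
    parser.close()
    return parser.links
```

Tracking the div depth is what lets the parser tell the closing tag of main_article apart from closing tags of divs nested inside it.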

How to filter non-individual trope pages out?

nguyenhphilip · 2020-10-24T17:34:40Z

I made a list of all the items inside any file starting with 'txt_dict_from_', as these are the indices we want to check the contents of Main against. These are saved as json files as per the last commit.

If we remove the files in Main that are not in this 'txt_dict_from_' list, we are left with 3689 items. These items look like tropes though, so we might actually not want to remove them at all, or at least we'll have to go through with our eyes and see what else needs to be filtered. Seems like this will be easier once we determine more clearly what we're looking for, which may depend on the structure of the individual tropes themselves and how they link to other tropes. This might be easy though once we have a list of individual tropes, from Main or some combination of the various files we haven't checked yet (though I think Main is our best bet) since we can just filter links out that don't link to any other item in our individual trope list.
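
To make the last idea concrete, here is a minimal sketch of dropping links that don't point at anything in the trope list. It assumes the page name is the last path segment of each link, and that links have already been collected per page. `page_name`, `trope_edges`, and the example names in the usage note are hypothetical.

```python
from urllib.parse import urlparse


def page_name(href):
    """Return the last path segment of a link, e.g. '.../Main/AlwaysMale' -> 'AlwaysMale'."""
    path = urlparse(href).path
    return path.rstrip("/").rsplit("/", 1)[-1]


def trope_edges(links_by_page, tropes):
    """Keep only (source, target) pairs where both ends are in the trope list.

    links_by_page: dict mapping a page name to the list of hrefs found on that page.
    tropes:        set of page names we have decided to treat as individual tropes.
    """
    edges = []
    for source, hrefs in links_by_page.items():
        if source not in tropes:
            continue
        for href in hrefs:
            target = page_name(href)
            if target in tropes and target != source:  # skip self-links
                edges.append((source, target))
    return edges
```

For example, with tropes `{"AlwaysMale", "ExampleTropeB"}`, a link from AlwaysMale to a page named ExampleIndex would be dropped, while a link to ExampleTropeB would be kept as an edge.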

nguyenhphilip · 2020-10-24T18:16:42Z

The 'txt_dict_from_' items that are not in Main ALSO look like tropes, so these are definitely files not in Main that we may be interested in. That makes me wonder though... where are they if not in Main?

nguyenhphilip · 2020-10-24T18:17:36Z

OH ! So it looks like the folder pmwiki.php, which holds Main, holds other folders with other tropes in it as well... this makes the search a bit harder though since there appears to be random stuff in here as well, not just individual tropes. Which is to say, with this giant list, we need a way to QC. Might be more feasible to focus on using Main since the things that make it into Main are probably the primary files, which are presumably better maintained/more active pages?

Will need to look into the actual contents of the other folders inside pmwiki.php. Still betting on Main as our main data source.

jwzimmer-zz · 2020-10-24T18:50:11Z

@nguyenhphilip this is great, thanks! I think that plan totally makes sense. Let's look (manually) at what kind of trope pages are in the folders besides Main in pmwiki.php and verify that they'll be reasonable to exclude. I think if we have some way to categorize those pages it will almost certainly be okay to look just at what's in Main, but let's have some handle on what we're excluding, you know? I'll do that now.

jwzimmer-zz · 2020-10-24T19:45:12Z

I think we might just want to use the tropes listed in https://tvtropes.org/pmwiki/pagelist_having_pagetype_in_namespace.php?n=Main&t=trope.

This is their manually categorized list of what counts as a trope, which might be both a justifiable and concise way to decide which articles to consider and which to exclude.

nguyenhphilip · 2020-10-25T03:43:03Z

Using the trope list above, we were able to extract 27485 files for individual tropes!! This looks like all of them. Super exciting because we can investigate lots of research Qs! Step 1: extract the actual contents of the trope pages and explore?
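
A sketch of what selecting files by that list could look like, assuming each downloaded file in Main is named after its page, with or without an extension. That naming is an assumption, and `files_for_tropes` is a made-up name; the actual extraction is in the commit referenced in the next comment.

```python
from pathlib import Path


def files_for_tropes(main_dir, master_list):
    """Return files in main_dir whose name (with or without extension) is in master_list."""
    wanted = set(master_list)
    return sorted(
        path
        for path in Path(main_dir).iterdir()
        if path.is_file() and (path.name in wanted or path.stem in wanted)
    )
```

Checking both `path.name` and `path.stem` is a hedge against not knowing whether the downloaded files carry an extension.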

jwzimmer-zz · 2020-10-25T18:29:03Z

Resolved by @nguyenhphilip in 61388ba! : )

jwzimmer-zz · 2020-10-28T16:40:56Z

Per discussion with Phil, reopening briefly for one last hurrah of trying to decide what counts as a trope -

we need to decide between: using strictly what they've labeled as tropes as the master list OR using that PLUS some tropes we identify as tropes that are in Main but not the masterlist.
But the issue with that is it pulls in many non-trope pages which we'll have to filter out

if there are not too many "obvious tropes" like Always Male, we should just append them to the master list
But if there are a lot of files, and it isn't obvious what is a trope, a trope of tropes, an index, etc., then we should stick with their master list strictly as the definition of "trope"

jwzimmer-zz · 2020-10-28T17:48:29Z

a180ca8
results:
Pages in Main but not in masterlist 2124
Pages in masterlist but not in Main 23842

(am i only looking at the first part of Main or something? or are there really that many things in the masterlist compared to main? well, either way, 2000 is a lot - so let's go with masterlist as definition of "trope", although indexes and metatropes are still useful, but...)

for the sake of having a stable definition let's go with masterlist - Phil agrees

jwzimmer-zz self-assigned this Oct 22, 2020

jwzimmer-zz assigned nguyenhphilip Oct 22, 2020

jwzimmer-zz mentioned this issue Oct 24, 2020

what kinds of things are in the non-main folders in pmwiki? #10

Closed

jwzimmer-zz closed this as completed Oct 25, 2020

jwzimmer-zz reopened this Oct 28, 2020

jwzimmer-zz closed this as completed Oct 28, 2020

jwzimmer-zz mentioned this issue Jan 22, 2021

ethics/ transparency audit jwzimmer-zz/tv-tropening#3

Open
