A World Map of Kinkiness. Classifying Adult Entertainment on the Web

hannesmuehleisen edited this page Sep 22, 2014 · 35 revisions

Hannes Mühleisen (hannes@cwi.nl), Database Architectures Group, Centrum Wiskunde & Informatica, Amsterdam

Introduction

The World Wide Web has (like Socrates) long been accused of corrupting the youth. Some of these accusations go so far as to claim that the success of the Internet as a whole is due to easy access to adult entertainment material. Estimates of the fraction of Web pages offering this material vary widely: alarmists go up to 80%, while more serious research puts it at around 4%. Keep in mind that the fraction of pages says nothing about the traffic these pages draw, but unless one is a large three-letter agency, traffic numbers are hard to obtain. Still, the cited 4% comes from an analysis of the one million most popular pages. The real answer might still lie in the uncharted depths of the Web, a place we have set out to classify and chart.

Given the vast differences in how adult content is dealt with around the world, another question concerns the sources of this kind of content: what fraction of a nation's total online presence consists of adult content?

The research question of this work is: How great is the percentage of adult entertainment pages on the public WWW, and where do these pages come from?

Methodology

The availability of datasets such as the Common Crawl allows us to use a significant amount of Web data to answer our research question. This is fortunate, as it would be very difficult to create a representative sample of Web content ourselves. We used the 2014 Common Crawl dataset, which contains 2.8 billion URLs. While the total size of the Web has been found to exceed 1 trillion URLs (in 2008!), the number of indexed URLs is estimated to be in the tens of billions. Given these figures, we regard the sample as adequate.

While it might be enticing for some people, manually classifying 2.8 billion URLs is out of the question. Parts of the software industry have long tried to sell software solutions that would reliably filter out questionable content for child/employee protection. Essentially, we need to do the same: first, classify all URLs into two groups, "clean" or "dirty"; second, detect where adult content comes from by associating each URL with a country.

Clean/Dirty Classification

The “Restricted to Adults” (RTA) initiative has proposed a magic tag (RTA-5042-1996-1400-1577-RTA) to be put into web pages that contain adult content. We can therefore take the presence of this tag as an indicator.

After the massive scandal regarding underage adult film actress Traci Lords, the US enacted the Child Protection and Obscenity Enforcement Act, which forces commercial adult entertainment providers to add a notice to their website, for example:

"The records required by Section 2257 of Title 18 of the United States Code with respect to visual depictions of actual sexually explicit conduct are kept by the custodian of records [...]".

Therefore, the presence of the character string 2257 on a page is also a strong indicator that we are dealing with adult entertainment.
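
The two in-page indicators described above can be sketched roughly as follows. This is a minimal illustration in Python, not the UDF code from the repository. The function names are hypothetical, and the rule that 2257 must not be part of a longer digit run is an assumption, since the text only says that the character string is looked for.

```python
import re

# The RTA label quoted in the text above.
RTA_LABEL = "RTA-5042-1996-1400-1577-RTA"

# Assumption: "2257" only counts when it is not embedded in a longer run of
# digits (e.g. a phone number). The original matching rule is not documented.
SECTION_2257 = re.compile(r"(?<!\d)2257(?!\d)")


def page_indicators(content: str) -> dict:
    """Report which of the two in-page indicators occur in the page content."""
    return {
        "rta": RTA_LABEL in content,
        "usc2257": SECTION_2257.search(content) is not None,
    }


def is_dirty(content: str) -> bool:
    """Treat a page as 'dirty' as soon as one indicator is present."""
    return any(page_indicators(content).values())
```

A plain substring test for 2257 would be closer to the wording above, but it would also fire on unrelated numbers that happen to contain those four digits.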

We also make use of a public blacklist of domains known to contain this material. The list contains 527,846 domains, which we expand to also include each entry's pay-level domain (google.com is the pay-level domain of www.google.com). A domain is flagged if it ends with any entry on the list.
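
The suffix match against the blacklist might look like the sketch below. The function name and the example.com entries are placeholders. Matching on whole domain labels is an assumption: a literal "ends with" on the raw string would also flag notexample.com for the entry example.com. The blacklist is assumed to already contain the expanded pay-level domains.

```python
def host_matches_blacklist(host: str, blacklist: set) -> bool:
    """True if `host` equals a blacklist entry or is a subdomain of one."""
    labels = host.lower().rstrip(".").split(".")
    # Try every suffix made of whole labels:
    # a.b.example.com, b.example.com, example.com, com
    for i in range(len(labels)):
        if ".".join(labels[i:]) in blacklist:
            return True
    return False


# Placeholder entries, not taken from the real list.
blacklist = {"example.com", "adult.example.org"}
print(host_matches_blacklist("www.example.com", blacklist))   # True
print(host_matches_blacklist("notexample.com", blacklist))    # False
```

Keeping the entries in a set makes each lookup constant-time, so the cost per URL depends only on the number of labels in the host name and not on the size of the list.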

URL - Country Association

In order to display the results on a map, we also determined the country a specific Web page is targeting. However, finding this information reliably proved difficult. We implemented a “voting” system in which several pieces of information were combined to give a more or less robust result. Three features were included, and a page was only assigned to a country if at least two votes concurred.

Domain names might end with a country-code top-level domain (ccTLD), for example .de. In this case, there is an indication that the page might be aimed at German people.

We used an IP geolocation database to determine the country in which the Web server is likely hosted, based on the IP address from which a page was retrieved.

A language detection library together with a list of spoken languages per country was used to generate an additional vote.

Finally, we could have used domain registration data, since it contains information on the country of the registering entity. However, bulk access to this information could not be obtained at reasonable cost for the approx. 270M domains that are currently registered.
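
The two-out-of-three vote can be illustrated as follows. This is a sketch under assumptions: the function name and country codes are hypothetical, a language is allowed to vote for every country in which it is spoken, and a tie between two countries with two votes each yields no assignment. The original tie-breaking rule is not documented.

```python
from collections import Counter
from typing import Iterable, Optional


def vote_country(
    cctld_country: Optional[str],
    geoip_country: Optional[str],
    language_countries: Iterable[str],
) -> Optional[str]:
    """Return the country backed by at least two features, else None."""
    votes = Counter()
    if cctld_country:
        votes[cctld_country] += 1
    if geoip_country:
        votes[geoip_country] += 1
    # Assumption: the detected language votes once for each country
    # in which it is spoken.
    for country in set(language_countries):
        votes[country] += 1
    if not votes:
        return None
    ranked = votes.most_common()
    best, count = ranked[0]
    if count < 2:
        return None  # no two features agree
    if len(ranked) > 1 and ranked[1][1] == count:
        return None  # assumption: ambiguous ties stay unassigned
    return best
```

For example, a .de domain served from a German IP address resolves to Germany even though the language vote is shared with other German-speaking countries, while a page with three disagreeing features stays unassigned.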

Results

We implemented the analysis above as an Apache Pig script enhanced with several User-Defined Functions (UDFs). The script and the UDFs may be found in this repository. The analysis was performed in a single pass over the data. For each URL, we extracted the plain text, looked for the features described above, guessed the country, and aggregated all this to counts for each country.
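
The aggregation step of such a single pass can be pictured as below. This is a Python sketch rather than Pig, and it assumes that the earlier steps have already reduced each URL to a hypothetical (country, dirty) pair, with None standing for pages that could not be assigned to a country.

```python
from collections import defaultdict


def aggregate(records):
    """Count clean and dirty pages per country in one pass over the records."""
    counts = defaultdict(lambda: {"clean": 0, "dirty": 0})
    for country, dirty in records:
        counts[country]["dirty" if dirty else "clean"] += 1
    return dict(counts)


# Hypothetical per-URL results, not real data.
records = [("BR", True), ("BR", False), ("US", False), (None, False)]
print(aggregate(records))
```

Because only two counters per country are kept, the memory footprint stays small no matter how many URLs are streamed through.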

In total, 2,302,456,253 URLs (2.3B) were analyzed. Of those, 2,220,552,547 (2.2B) were deemed 'clean' and 81,903,706 (81M) were not. This gives us an overall "smut ratio" of 3.56%. Regarding the mapping of URLs to countries, our voting process outlined above was able to determine a country for 83% of the URLs (~2.1B).
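
The overall ratio follows directly from the counts above, as this short check shows. The variable names are only illustrative.

```python
total = 2_302_456_253
clean = 2_220_552_547
dirty = 81_903_706

# The two groups add up to the total number of analyzed URLs.
assert clean + dirty == total

smut_ratio = dirty / total
print(f"{smut_ratio:.2%}")  # 3.56%
```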

The following table shows the top ten countries by ratio:

| Rank | Country | Clean URLs | Dirty URLs | Ratio (dirty / total) |
|---|---|---|---|---|
| 1 | Brazil | 4,592,603 | 473,648 | 9.35% |
| 2 | Russia | 1,750,870 | 87,249 | 4.75% |
| 3 | Canada | 32,577,491 | 1,232,110 | 3.64% |
| 4 | United States | 1,785,327,594 | 65,136,307 | 3.52% |
| 5 | France | 9,902,491 | 270,838 | 2.66% |
| 6 | Sweden | 2,906,508 | 75,235 | 2.52% |
| 7 | Japan | 2,398,185 | 55,475 | 2.26% |
| 8 | Spain | 5,827,703 | 112,612 | 1.90% |
| 9 | Netherlands | 3,938,242 | 76,088 | 1.90% |
| 10 | Italy | 5,364,918 | 92,755 | 1.70% |

The following map shows the ratio of dirty pages to all pages per country. A deeper shade of pink corresponds to a higher ratio.

Map of fraction of adult entertainment webpages per country

For completeness, we also include a table of countries by total count of dirty URLs:

| Rank | Country | Clean URLs | Dirty URLs | Ratio (dirty / total) |
|---|---|---|---|---|
| 1 | United States | 1,785,327,594 | 65,136,307 | 3.52% |
| 2 | Canada | 32,577,491 | 1,232,110 | 3.64% |
| 3 | Brazil | 4,592,603 | 473,648 | 9.35% |
| 4 | United Kingdom | 72,237,401 | 379,215 | 0.52% |
| 5 | France | 9,902,491 | 270,838 | 2.66% |
| 6 | Germany | 17,791,668 | 157,163 | 0.88% |
| 7 | Costa Rica | 7,862,666 | 130,620 | 1.63% |
| 8 | Spain | 5,827,703 | 112,612 | 1.90% |
| 9 | Italy | 5,364,918 | 92,755 | 1.70% |
| 10 | Russia | 1,750,870 | 87,249 | 4.75% |
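
Both rankings can be reproduced from the published counts. The sketch below uses only the thirteen countries that appear in the two tables, so it checks their relative order and says nothing about countries that are not listed. The ratio is taken as dirty URLs divided by clean plus dirty URLs, which matches the percentages shown.

```python
# (clean URLs, dirty URLs) as listed in the two tables above.
ROWS = {
    "Brazil": (4_592_603, 473_648),
    "Russia": (1_750_870, 87_249),
    "Canada": (32_577_491, 1_232_110),
    "United States": (1_785_327_594, 65_136_307),
    "France": (9_902_491, 270_838),
    "Sweden": (2_906_508, 75_235),
    "Japan": (2_398_185, 55_475),
    "Spain": (5_827_703, 112_612),
    "Netherlands": (3_938_242, 76_088),
    "Italy": (5_364_918, 92_755),
    "United Kingdom": (72_237_401, 379_215),
    "Germany": (17_791_668, 157_163),
    "Costa Rica": (7_862_666, 130_620),
}


def dirty_ratio(clean: int, dirty: int) -> float:
    """Fraction of a country's pages that were classified as dirty."""
    return dirty / (clean + dirty)


by_ratio = sorted(ROWS, key=lambda c: dirty_ratio(*ROWS[c]), reverse=True)
by_count = sorted(ROWS, key=lambda c: ROWS[c][1], reverse=True)

for country in by_ratio[:10]:
    print(f"{country}: {dirty_ratio(*ROWS[country]):.2%}")
```

Sorting by ratio favors countries with few pages overall, such as Brazil, while sorting by count is dominated by the sheer size of the US sample.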

Discussion

It has been claimed that the Common Crawl does not contain adult content at all; we can clearly refute this. We found more than 81 million URLs in the crawl that would likely be considered "not safe for work". The overall percentage of 3.56% fits well with the observations from previous work. Regarding the geographic distribution, we were surprised by the clear high score for Brazil. Some consolation for the World Cup defeat at last...

There are several possible sources of error. First, the selection of pages that go into the Common Crawl corpora seems slightly arbitrary. There is little documentation on how exactly pages are included in or excluded from the crawl. Also, there might be "cultural" differences in labelling adult entertainment sites as off-limits for search engines, for example using the robots.txt technique. Hence, there are likely to be geographical and class-based imbalances in the data.

The second source of imbalances is the relatively low amount of non-English content. Since the majority of online content is published in English, the US, as the largest English-speaking country, is more likely to be chosen as the country association. The same bias is likely to be present in the other indicators: the content tag, the US Code number and the domain blacklist. While the language detection has reasonable accuracy, it cannot distinguish between the US and the UK or between Portugal and Brazil.

The placement of a website on the globe based on its IP address is imprecise, to say the least. Many countries lack a commodity hosting industry or have legal restrictions that prohibit hosting adult content there. Also, many web pages nowadays are served using content delivery networks. These usually serve content from servers that are geographically close to the user to improve page load latency. Since the Common Crawl was generated by EC2 nodes within the US, the server IP address that we used for geolocation would have a higher tendency to be in the US, too.

Conclusion

We were finally able to answer this question: Google Groups Question

A Word of Caution on Mahout

Originally, we planned to use a Naive Bayesian classifier trained on the static features outlined above to detect non-labelled porn pages. We chose the Mahout collection of supposedly scalable machine learning algorithms for this. Mahout achieves scalability by relying on Hadoop or Spark as back-end systems. In our tests on smaller datasets, its Naive Bayesian classifier correctly marked 98% of pages, which we considered acceptable. However, training the classifier on larger amounts of web pages turned out to be a complete disaster. Not only did the preprocessing of the data require a huge amount of time, but the training itself also failed on every one of our numerous tries. We can therefore only advise against using Mahout's Naive Bayesian classifier implementation on Web data.
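
For readers unfamiliar with the technique, the sketch below shows a small multinomial Naive Bayes classifier in plain Python. It is not Mahout's implementation and not the setup used in our experiments. The class name, the word-count features, the add-one smoothing and the toy training sentences are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Multinomial Naive Bayes over word counts with add-one smoothing."""

    def __init__(self):
        self.doc_counts = Counter()               # documents seen per label
        self.word_counts = defaultdict(Counter)   # word frequencies per label
        self.vocab = set()

    def train(self, docs):
        """`docs` is an iterable of (text, label) pairs."""
        for text, label in docs:
            words = text.lower().split()
            self.doc_counts[label] += 1
            self.word_counts[label].update(words)
            self.vocab.update(words)

    def predict(self, text):
        """Return the label with the highest log-posterior score."""
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.doc_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)  # log prior
            for word in text.lower().split():
                score += math.log((counts[word] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy training data, invented for this example.
nb = NaiveBayes()
nb.train([
    ("free adult videos xxx", "dirty"),
    ("adult content warning xxx", "dirty"),
    ("garden tools and plants", "clean"),
    ("weather report for amsterdam", "clean"),
])
print(nb.predict("xxx adult"))        # dirty
print(nb.predict("garden weather"))   # clean
```

Working in log space avoids numeric underflow when many word probabilities are multiplied, which matters for long Web pages.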

Acknowledgements

The author would like to thank SURFsara for the opportunity to run this analysis on their computing infrastructure.