Babel 2012 Web Language Connections
by Hannes Mühleisen, Database Architectures, CWI <hannes@cwi.nl>
According to Genesis, people once spoke a single language, so everyone could understand each other. If this was ever the case, it has since changed profoundly. Today, our civilization makes use of a large number of languages, in both spoken and written form. This diversity is of course also reflected on the Web, where a multitude of languages is used to express information on Web pages. The most widely used languages on Web pages today include English, German, Russian, Japanese and Spanish, with a huge bias towards English, which is used on over 50% of websites [1].
However, many people speak more than one language. In fact, it has been estimated that over half of all people are bilingual [2]. On the Web, this multilingualism may be reflected in a very particular way. People publish Web pages and set links to other pages. In most cases, the page being linked to will be written in the same language as the linking page. However, in many other cases it will not be. For example, many resources such as the English Wikipedia are more comprehensive than their counterparts in other languages. Hence, authors are often inclined to set a link to a page in a foreign language.
The presence of such an inter-language link on the Web can indicate that the author of the page at least saw the possibility that their audience is able to read and understand the other language. By determining the language Web pages have been written in, finding the pages they link to, determining the language of those pages as well, and counting these occurrences, we can quantify the inter-language links on the Web by language pair. Ultimately, this can give an indication as to which languages are commonly understood together in written form.
The Common Crawl dataset contains a considerable part of the public Web. Starting from the most popular pages according to Alexa [3], a crawler program collects Web pages by following links to other pages. The decision whether a page is included in the crawl is taken using the well-known PageRank algorithm. For the Norvig Award, around one billion Web pages were made available for analysis. This dataset can be used to determine the frequency and type of inter-language links, since the sheer size and popularity-based selection of the dataset make it likely that the page being linked to is also included in the crawl.
However, Web pages differ greatly in the number of links they set, in their internal structure and so on. Therefore, using URLs linking to each other to determine the inter-language link counts would be difficult. Instead, one can count these links according to so-called pay-level domains (PLDs) linking to each other on any of their URLs. This results in a more balanced view, independent of inner structure and crawler preferences. Furthermore, links within a single domain are less interesting, since they are unlikely to link to pages in other languages.
To collect our data, we have run an Apache Pig [4] script on the Common Crawl data. Pig is a high-level data analysis language with a declarative syntax that is compiled into a series of Map/Reduce jobs. In addition, it allows the easy integration of user-defined functions (UDFs) into its execution. For our analysis, we have created four UDFs:
To determine the language a Web page is written in, we have integrated a high-precision language detection library [5] into Pig. This library builds language profiles using Wikipedia abstracts and a Bayesian filter, and is reported to reach a detection precision of over 99%. Since this library operates on plain text rather than HTML pages, we first strip the pages of their markup and then feed the remaining text into the library. While the library is able to determine probabilities for multiple languages, we select the most likely one and return it to Pig.
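The two steps around the detector (stripping markup, then picking the most likely language) can be sketched in Python. This is an illustration only, not the project's UDF: `TextExtractor`, `strip_markup` and `most_likely_language` are hypothetical names, and the probability dictionary stands in for whatever the detection library returns.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only visible, non-empty text.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def strip_markup(html):
    """Return the visible text of an HTML document as a single string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def most_likely_language(probabilities):
    """Pick the language code with the highest probability.

    `probabilities` is an assumed stand-in for the detector output,
    e.g. {"de": 0.92, "nl": 0.07}.
    """
    return max(probabilities, key=probabilities.get)
```

Only the text returned by `strip_markup` would be handed to the detector, so that tags and scripts do not skew the language profile.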
To find the URLs a specific Web page links to, we have also extended Pig with a UDF that extracts the list of linked URLs from the content of an HTML page. For this, a regular expression is evaluated on the page content.
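A regular-expression-based link extractor might look like the following Python sketch. The pattern is an assumption made for illustration and is not the expression used in the project's UDF; it only captures absolute `http`/`https` links in `<a>` tags and ignores relative links.

```python
import re

# Assumed pattern: the href value of <a> tags, absolute http(s) URLs only.
HREF_PATTERN = re.compile(
    r"""<a\s[^>]*?href\s*=\s*["']?(https?://[^"'\s>]+)""",
    re.IGNORECASE,
)


def extract_links(html):
    """Return all absolute link targets found in the page, in document order."""
    return HREF_PATTERN.findall(html)
```

Skipping relative links costs little in this setting, since they point to the same domain and local links are filtered out later anyway.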
As mentioned, counting inter-language links by the number of domains they occur on is more likely to reduce biases introduced by the crawling strategy and by the properties of pages under a single domain. Therefore, the Domain UDF extracts the domain name from a URL for comparison and grouping purposes. Furthermore, to exclude links within a domain, the PayLevelDomain UDF calculates the pay-level domain for a URL; for example, www.google.com and translate.google.com have the same pay-level domain, google.com. Since detecting the pay-level domain is non-trivial, a third-party library is used here as well [6].
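To show why the pay-level domain is not simply "the last two labels", here is a minimal Python sketch. It replaces the third-party library with a tiny hard-coded suffix set, which is an assumption for illustration; a real implementation needs a complete public suffix list. The function names are hypothetical.

```python
from urllib.parse import urlsplit

# Illustrative subset only; a complete public suffix list is needed in practice.
PUBLIC_SUFFIXES = {"com", "org", "net", "nl", "de", "uk", "co.uk"}


def domain(url):
    """Return the lower-cased host name of a URL, or None if there is none."""
    return urlsplit(url).hostname


def pay_level_domain(url):
    """Return the public suffix plus one more label, or None if unknown."""
    host = domain(url)
    if host is None:
        return None
    labels = host.split(".")
    # Try the longest candidate suffix first, so "co.uk" wins over "uk".
    for i in range(len(labels)):
        if ".".join(labels[i:]) in PUBLIC_SUFFIXES:
            if i == 0:
                return None  # the host itself is a public suffix
            return ".".join(labels[i - 1:])
    return None
```

With this sketch, `www.google.com` and `translate.google.com` both map to `google.com`, while `www.bbc.co.uk` maps to `bbc.co.uk` rather than `co.uk`.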
Using the custom UDFs as well as Pig's built in capabilities, we have all the components to run our analysis of inter-language links between Web pages. The analysis progresses as follows:
- For all HTML pages and the corresponding URL in the Common Crawl dataset, the text language, the list of links and the domain name are extracted.
- For each domain, the most prevalent language is calculated by counting the number of URLs per language from this domain. The most frequent language becomes the main language of the domain, and we store a domain-language mapping.
- Local links are filtered out.
- The list of URLs is joined with the domain-language mapping such that a table with the columns (url,urlDomain,urlLanguage,link,linkDomain,linkLanguage) is created. This raw result list (51GB) is available for further research on request.
- The raw result list is aggregated by language pair, counting the number of domains that contain a specific inter-language link. The result is a table with the columns (language1,language2,domains).
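The steps above can be sketched end to end in plain Python on a handful of made-up pages. This is not the Pig script: the `pages` records are toy data, `pld` is a simplified stand-in for the PayLevelDomain UDF (it just keeps the last two host labels), and using the domain's main language for both ends of a link is an assumption about how the join is meant.

```python
from collections import Counter, defaultdict
from urllib.parse import urlsplit


def pld(url):
    """Simplified stand-in for a pay-level-domain lookup: last two host labels."""
    host = urlsplit(url).hostname or ""
    return ".".join(host.split(".")[-2:])


# Step 1 output: (url, detected language, outgoing links). Toy data, not crawl results.
pages = [
    ("http://www.a.nl/1", "nl", ["http://b.de/x", "http://www.a.nl/2"]),
    ("http://www.a.nl/2", "nl", ["http://b.de/y"]),
    ("http://www.a.nl/3", "en", []),
    ("http://b.de/x", "de", ["http://shop.a.nl/1"]),
    ("http://c.nl/1", "nl", ["http://b.de/x"]),
]

# Step 2: the main language of a domain is the language with the most URLs.
per_domain = defaultdict(Counter)
for url, lang, _ in pages:
    per_domain[pld(url)][lang] += 1
domain_language = {d: c.most_common(1)[0][0] for d, c in per_domain.items()}

# Steps 3 and 4: drop local links, then join both ends with the mapping.
rows = []
for url, _, links in pages:
    src = pld(url)
    for link in links:
        dst = pld(link)
        if dst != src and dst in domain_language:
            rows.append((url, src, domain_language[src], link, dst, domain_language[dst]))

# Step 5: count distinct source domains per (language1, language2) pair.
domains_per_pair = defaultdict(set)
for _, src, src_lang, _, _, dst_lang in rows:
    domains_per_pair[(src_lang, dst_lang)].add(src)
result = {pair: len(domains) for pair, domains in domains_per_pair.items()}
```

Counting distinct source domains rather than individual links means that a single domain with thousands of links to one foreign site contributes only once to its language pair.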
The full Pig script is available in the project GitHub repository at https://github.com/norvigaward/naward25 together with the source code of the described UDFs.
The results generated using Pig were refined using the GNU R environment. As described, the Pig script will generate a table of languages and the number of domains that contain inter-language links between the two languages. For example:
Outgoing Language | Incoming Language | Number of Domains |
---|---|---|
Afrikaans | English | 179778 |
Afrikaans | German | 12851 |
... |
However, these counts depend on the number of domains containing a particular language in the crawl data. In order to be able to compare the languages, we have to normalize the data. To this end, we have calculated the total number of domains that contain text in each of the languages, and then calculated the percentage for each target language. This results in the following table:
Outgoing Language | Incoming Language | Percentage in Language |
---|---|---|
Afrikaans | English | 67% |
Afrikaans | German | 4% |
... |
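The normalization amounts to dividing each pair count by the total number of domains in the outgoing language. The original refinement was done in R; the Python sketch below only illustrates the arithmetic. The function name and all numbers are made up for the example and are not results from the crawl.

```python
def to_percentages(pair_counts, domains_per_language):
    """Express each (outgoing, incoming) domain count as a percentage of all
    domains whose main language is the outgoing language."""
    return {
        (src, dst): 100.0 * count / domains_per_language[src]
        for (src, dst), count in pair_counts.items()
    }


# Toy input: 670 of 1000 hypothetical Afrikaans domains link to English pages.
pair_counts = {("af", "en"): 670, ("af", "de"): 40, ("nl", "de"): 70}
domains_per_language = {"af": 1000, "nl": 1000}
percentages = to_percentages(pair_counts, domains_per_language)
```

Because one domain can link to several languages, the percentages for a single outgoing language need not add up to 100.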
English is by far the most widely used language on the Web, with over 50% of content written in this language [1], and our analysis has shown that all of the languages link to English, with very few links in return. Hence, English is removed. Furthermore, as we are only interested in inter-language links, we remove all entries from this table where the two languages are identical. Then, to reduce noise, we filter the table to only include entries where the relative percentage is greater than 1%. Finally, we reshape the data into a two-dimensional matrix, where rows and columns are languages and the fields are the relative percentages of inter-language links between domains. The result is the following matrix:
Language | Czech | Danish | German | ... | Vietnamese |
---|---|---|---|---|---|
Afrikaans | - | - | 4% | ... | - |
Arabic | - | 3% | - | ... | - |
Bulgarian | - | - | - | ... | 2% |
... | ... | ... | ... | ... | ... |
Simplified Chinese | - | 3% | - | ... | - |
This table is to be read as follows: each row header names an outgoing language, and each value gives the percentage of domains in that language that link to a page published in the language given in the corresponding column header. For example, of all domains that publish content in Afrikaans, 4% contain links to pages written in German.
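The filtering and reshaping that produce this matrix can be sketched as follows. Again, this is Python for illustration rather than the R script used in the project; `build_matrix` is a hypothetical name, the input values are toy data, and removing English on both the outgoing and the incoming side is an assumption.

```python
def build_matrix(percentages, excluded=("en",), threshold=1.0):
    """Drop excluded languages, same-language pairs and pairs at or below the
    threshold, then pivot the rest into rows (outgoing) x columns (incoming)."""
    kept = {
        (src, dst): pct
        for (src, dst), pct in percentages.items()
        if src != dst
        and src not in excluded
        and dst not in excluded
        and pct > threshold
    }
    rows = sorted({src for src, _ in kept})
    cols = sorted({dst for _, dst in kept})
    # Missing combinations become None, shown as "-" in the table above.
    matrix = {src: {dst: kept.get((src, dst)) for dst in cols} for src in rows}
    return rows, cols, matrix


# Toy percentages keyed by (outgoing, incoming) language code.
percentages = {
    ("af", "en"): 67.0, ("af", "de"): 4.0, ("af", "af"): 20.0,
    ("ar", "da"): 3.0, ("ar", "de"): 0.5, ("en", "de"): 2.0,
}
rows, cols, matrix = build_matrix(percentages)
```

Here `matrix["af"]["de"]` holds 4.0, mirroring the Afrikaans/German cell, while pairs that were filtered out or never observed remain empty.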
Both the raw and the refined result files as well as the R script are available in the project GitHub repository at https://github.com/norvigaward/naward25 in the results subfolder. The language code-name mappings for the language codes used in the raw files can be found online [7].
The results from our analysis can be seen as a directed and weighted graph. To visualize the relationships between languages, we have chosen a so-called chord diagram, where the languages are arranged in a circle, and the relationships as well as their weights are represented using lines of different widths between them. Two of these circular visualizations are produced, one for outgoing and one for incoming links. To produce these graphs, we have used the excellent Circos software package for visualizing data and information in a circular layout [8]. The visualization result can be seen in the following graph; the language code mappings can be found in the references [7]. In this graph, lines of different widths connect the languages. The direction is indicated by a gap between the line and the incoming language. For example, Ukrainian (uk) links to Russian (ru) with a fairly broad line denoting 17% inter-language linking.
(A larger version is available in the results/circos-tableviewer-tractsm/results/ folder in the GitHub repo and here)
Now that we have our results visualized, we can discuss them. At first glance, judging by their share of the circle, many languages are well connected. These are (in decreasing order) Dutch, Portuguese, Vietnamese, Danish, Russian, French, Japanese and German. However, if we consider the balance between their incoming and outgoing connections, a different conclusion arises: only Vietnamese and Japanese have a roughly equal share of incoming and outgoing connections, while Dutch, Portuguese, Russian, Danish, French and German are mainly being linked to. Keeping in mind that we have removed English from the data, we can see that these languages are next in line as a "lingua franca" connecting different cultures.
However, our goal was to see which languages the speakers of a particular language are likely to understand in written form. To accomplish this, we filter our data further to include only results over 5% and with more than 1000 domains supporting that result. These thresholds are admittedly arbitrary, but they keep only the most relevant results without overwhelming the reader:
Outgoing Language | Incoming Language | Percentage in Language |
---|---|---|
Bulgarian | Russian | 6% |
Estonian | Russian | 11% |
Ukrainian | Russian | 17% |
German | Dutch | 6% |
Dutch | German | 7% |
Farsi | Danish | 53% |
Japanese | Portuguese | 15% |
Portuguese | Japanese | 9% |
Czech | Slovak | 15% |
Albanian | Portuguese | 6% |
Vietnamese | Portuguese | 15% |
Vietnamese | French | 9% |
Vietnamese | Russian | 8% |
Vietnamese | Japanese | 7% |
Looking at this table, we can see immediately that speakers of Vietnamese are the "winners": they frequently publish links on Web pages in Vietnamese to Portuguese, French, Russian and Japanese content. This could be a result of each of those cultures at times having a large impact on people speaking Vietnamese, or just a sign of widespread education in these languages in Vietnam. There are unsurprising bidirectional pairs of highly similar languages, such as Dutch/German and Czech/Slovak, as well as the surprising bidirectional pair of Portuguese/Japanese (see [9] for one possibility). Also, the languages of former Soviet satellite states, such as Bulgarian, Estonian and Ukrainian, seem to keep their affinity to Russian. Finally, there are also surprising unidirectional pairs such as Farsi/Danish and Albanian/Portuguese.
We set out to determine the degree of language confusion. As we could see through our large-scale analysis of content languages on Web pages and the corresponding visualization, this confusion still exists apart from the ubiquitous English language. In the particulars, however, this analysis yielded some surprising results, for example the Vietnamese affinity to Portuguese and other languages. We hope to have shed some light on this topic, and invite everyone to perform further analysis on our data, in particular the large "raw" dataset.
The author would like to thank the initiators of the Norvig Award for the opportunity to run this analysis on the Common Crawl dataset free of charge. Furthermore, we thank Jeroen Schot from SARA for his support in getting our Pig job to finish on their Hadoop cluster, and Emmanuelle Beauxis-Aussalet from CWI for her hints towards the circular visualization of the inter-language links. Finally, we thank Christian Czekay for his contribution to the assumptions that sparked this analysis.
1. Usage of content languages for websites, http://w3techs.com/technologies/overview/content_language/all
2. François Grosjean, Bilingual: Life and Reality (Harvard University Press, 2010)
3. Alexa Top 500 Global Sites, http://www.alexa.com/topsites
4. Apache Pig, http://pig.apache.org
5. Language Detection Library for Java, http://code.google.com/p/language-detection/
6. Guava: Google Core Libraries for Java 1.6+, http://code.google.com/p/guava-libraries/
7. Language code-name mappings, https://code.google.com/p/language-detection/wiki/LanguageList
8. Circos, http://circos.ca
9. Japanese words of Portuguese origin, http://en.wikipedia.org/wiki/Japanese_words_of_Portuguese_origin