Song mentioning detection on the Common Crawl dataset
We investigated the possibility of automatically developing lists of popular songs for different countries by detecting song mentionings in the Common Crawl dataset. We found that that it is possible to efficiently determine the country of a web page and that is is possible to efficiently find mentionings of a large number of songs titles (5.7 million) within these web pages. We find however that our current implementation is not sufficient as the amount of false positive song mention detections is too high, but show that on a smaller set of songs, particular the Dutch top2000, the approach produces interesting reshuffeling of this list for different countries.
Making ordered lists of songs is a popular activity performed by both commercial institutes and individuals alike. In the Netherlands there is an annual event called the Top 2000 were people are asked to submit their favorite songs that when combined created a ranked list of the top 2000 greatest songs of all time.
It was this particular event that sparked our interest into how such a list of the "greatest songs of all time" would differentiate when created by people from different countries. As we are not aware of comparable lists in different countries, we wanted to investigate if it is possible to automatically build an equivalent list for a country by counting the mentionings of songs on web pages originating from a specific countries.
For our research, we developed algorithm that is capable of detecting song mentionings within a given web page and an algorithm that finds the country of origin of a web page, so that by combining these results we gain an ranked list of songs for a number of different countries.
The research performed for the Norvig Award has resulted in a paper that will be published as part of the Study Tour USB14.0 preliminary report. (todo, study tour, link naar paper).
To determine the country of origin of a web page we discussed several methods. A possible method would be language detection, but because this is processing intensive and languages don't map one-to-one to countries we decided to use another approach.
In the end we settled for detection based on the TLD (top-level domain) of the web page and the origin of the IP as fallback. We argue the TLD is the most reliable, but some TLDs are not linked to a (single) country, like
.eu. In those cases we use the country of origin of the IP address of the server.
Song mentioning detection
We first decided that the dataset of songs that we should detect should be as complete as possible. We therefore used the music metadata encyclopedia MusicBrainz that contained around 13.5 million recordings at the time. For the mentioning detection, we decided to use the LingPipe toolkit. This provided us with an implementation of the Aho-Corasick string matching algorithm that allowed us to find all song mentionings in a text in a linear amount of time. To reduce the expected amount of false positives by only detecting song titles, we added the requirement that the artist of the song should be mentioned in the text surrounding the song mentioning.
Unfortunatly, after initial testing we found that this set of songs contains a high number of songs with names and artists that are also often occuring in regular text, e.g. songs named "product" or "contact" that when metioned are not always references to the respective songs. Due to time limitations, we decided to reduce the amount of false positives by only using the top 2000 of 2013 as input for the algorithm.
Combining the detectors
To make our approach flexible we have used seperate jobs for country and song mentioning detection. The results are then combined using the UUID of the CommonCrawl records. This way we could use file input formats for the different detectors (
.wat for country detection and
.wet for song mentions) and already run our country detection before the song detection was finished.
Although not the main focus of our research, the country detection on the entire CommonCrawl dataset already showed some interesting results.
Only 7% of the countries could be determined by the TLD and by far most of counties are determined using the country of origin of the IP address (93%).
Another interesting observation is that webpages from the United States (80%) are overrepresented and non-latin pages like from Russia are underreprsented in the CommonCrawl data.
Song mentioning detection
The result of the main experiment is a generated list of songs, ordered on the amount of mentionings, for a large number of countries. As it is not feasible to show the entire result on this page, we show the top 15 most mentioned songs for small number of countries below. The complete result set can be found in this .csv file.
The Netherlands (NL)
|Don't Dream It's Over||Crowded House||1330|
|Get Lucky||Daft Punk||334|
|Let Her Go||Passenger||203|
|Wake Me Up||Avicii||202|
|All Of Me||John Legend||137|
|Story Of My Life||One Direction||137|
|Losing My Religion||REM||135|
United Kingdom (GB)
|Hey Jude||The Beatles||134|
|We Will Rock You||Queen||134|
|Get Lucky||Daft Punk||131|
|She Loves You||The Beatles||125|
United States (US)
|Blurred Lines||Robin Thicke||5840|
|Ho Hey||The Lumineers||5735|
|Locked Out Of Heaven||Bruno Mars||5377|
|I Won't Give Up||Jason Mraz||5371|
|The Unforgettable Fire||U2||4388|
Combined list of all countries
For added context, we also show the combined top list of top 15 most mentioned songs and the top 15 of the original Top2000.
|Blurred Lines||Robin Thicke||6272|
|Ho Hey||The Lumineers||5845|
|Locked Out Of Heaven||Bruno Mars||5503|
|I Won't Give Up||Jason Mraz||5439|
|The Unforgettable Fire||U2||4463|
|Someone Like You||Adele||4394|
|Stairway To Heaven||Led Zeppelin|
|Child In Time||Deep Purple|
|Avond||Boudewijn de Groot|
|Wish You Were Here||Pink Floyd|
|The River||Bruce Springsteen|
|Comfortably Numb||Pink Floyd|
|Nothing Else Matters||Metallica|
|Shine On You Crazy Diamond||Pink Floyd|
|Brothers In Arms||Dire Straits|
|November Rain||Guns N' Roses|
|Viva La Vida||Coldplay|
Because a large majority of the webpages (~80%) is from the US, and also the most occurences are found in pages from the US. The US list has a large impact on the combined list of songs.
We believe that it is interesting to see that The Beatles scores very high in the GB list, while american artists such as Bruno Mars, Robin Thicke and Pink score high in the US list.