# Exploring linguistic choice in Barcelona through Twitter

### Introduction

In this post we explore the linguistic behavior of Barcelona residents who are also active on Twitter. Catalan and Spanish have both been regularly heard on the streets of Barcelona since at least the end of the [16th century](http://blogs.sapiens.cat/socialsenxarxa/2012/06/29/catala-i-castella-a-la-catalunya-moderna/). Whereas Catalan developed locally from Latin, Spanish was initially imported through political and cultural influence, later reinforced through several periods of repression against Catalan by central authorities, and more recently through massive immigration from Spanish regions in the 20th century. However, calling a city bilingual often fails to convey its actual linguistic reality: individual bilingualism is rarely perfect, many residents may still be monolingual, and one of the two languages may clearly dominate the other in everyday life. It is a matter of constant political debate whether Catalan (a local language still [spoken by a few million people](https://en.wikipedia.org/wiki/Catalan_language)) will be able to coexist with Spanish (a global language spoken by hundreds of millions), and for how long. The purpose of this post is to quantify how much each language is used in spontaneous, informal written communication. To do so, we focus on the activity of Twitter users who are also residents of the city of Barcelona. Even though [language surveys](https://www.idescat.cat/dades/eulp/?lang=en) that quantify the linguistic reality are regularly carried out in Catalonia, they are limited by having to rely on self-reporting by the people sampled (and humans are notoriously biased when [reporting statistics on their own activity](https://en.wikipedia.org/wiki/List_of_cognitive_biases#Social_biases)). Instead, we believe that analyzing randomly sampled linguistic data from Twitter should allow for much less biased estimates.
Unfortunately, this study can only provide a snapshot of the current linguistic reality: since Twitter is a very recent social networking service, identifying trends would require repeating the same study every few years.

### Identification of sources of bias in Twitter random samples

The first step in a language survey is to ensure unbiased sampling and, where bias cannot be avoided, to at least describe and quantify it. Since we sample from Twitter users only, our sample will clearly not be entirely random. It is therefore important to reason about whether this may introduce bias in favor of one of the two languages. We identify three main sources of bias inherent to Twitter-based samples:

- __Age bias:__ Assuming Twitter demographics do not vary much across the globe, comparing [available statistics on Twitter users in the US](https://www.statista.com/statistics/192703/age-distribution-of-users-on-twitter-in-the-united-states/) with [US demographic data](https://en.wikipedia.org/wiki/Demography_of_the_United_States) suggests a few conclusions:
     - Speakers younger than 15 will not be represented 
     - Speakers older than 65 will be underrepresented
     - Speakers between 18 and 30 will be overrepresented
     
 To understand how this bias can affect our conclusions, consider that the vast majority of Barcelona residents educated under the dictatorship of General Franco (1939 to 1975) do not feel at ease writing in Catalan, even when Catalan is their mother tongue. The language was completely banned from the regular school system until 1978, when a new Constitution was adopted. Therefore, only people younger than 40 have had the chance to learn to write in Catalan at school. In a way, the overrepresentation of people younger than 40 in the sample is balanced by the fact that these are precisely the people with the least bias when it comes to deciding which language to communicate in. The main conclusion is that our sample, by the very fact of being drawn from Twitter users only, will be biased in favor of Catalan as compared to the linguistic reality of Barcelona's population older than 18.

- __Education bias:__ [Twitter users are more likely to be in college or to have a college degree](https://blog.hootsuite.com/twitter-statistics/) than people randomly sampled from the general population. Since in Barcelona Catalan is more widely spoken among more educated people than among people from less educated backgrounds, the education bias will result in an overrepresentation of Catalan speakers with respect to a random sample of inhabitants.

- __Income bias:__ This bias, which is correlated with the previous one, results from the fact that [Twitter users are likely to have above-average income](https://blog.hootsuite.com/twitter-statistics/). Since Catalan is much less spoken among low-income inhabitants than among middle-income ones, many people who prefer to communicate mostly in Spanish will again not be adequately represented in our samples. We expect the income bias of Twitter samples to overrepresent Barcelona inhabitants who speak and write fluent Catalan.

Because of all the biases listed above, the conclusions of this study will not apply to the population of Barcelona as a whole, but rather to the part of the population aged between 18 and 50 with higher-than-average income and educational levels.

### Methodology

We selected 10 Twitter accounts that we deem representative of the cultural spectrum of the city of Barcelona. The selection criteria were the following:
 - Accounts must be relevant ( hundreds of thousands of followers )
 - Accounts must be related to the city of Barcelona as much as possible
 - At least half of the accounts must be as _politically_ neutral as possible
 - Accounts must cover the cultural spectrum of the city: they must range from a strong commitment towards the minority language to a very weak one, if any
 - Selection must include accounts written in each of the two languages at study
 
The following are the selected root accounts (in parentheses, the handle and language of the account):
 - __ARA__ (@diariARA, CAT) : the most widely distributed Catalan-language newspaper. Moderate nationalist views.
 - __BCN_Ajuntament__ (@bcn_ajuntament, CAT) : the official Twitter account of Barcelona's City Council.
 - __Hola__ (@hola, SPA) : weekly magazine specialising in celebrity news. Currently published in Madrid but founded during the dictatorship in Barcelona in 1944. In Spanish only.
 - __El Periódico__ (@elperiodico, SPA) : the second newspaper by sales in Catalonia, edited in both Spanish and Catalan (automated translation). Twitter account is only in Spanish. 
 - __La Vanguardia__ (@LaVanguardia, SPA) : the most widely circulated newspaper in Catalonia. Available in both Spanish and Catalan (automated translation), but official Twitter account is only in Spanish. 
 - __Mossos__ (@mossos, mostly CAT) : the official Twitter account of the Catalan regional police
 - __Meteocat__ (@meteocat, CAT) : the official Twitter account of the Meteorological Service of Catalonia
 - __Sport__ (@sport, SPA) : one of the two main sports newspapers published in Barcelona, in Spanish only
 - __TMB_Barcelona__ (@TMB_Barcelona, CAT) : the official Twitter account of Barcelona public transport service
 - __Vilaweb__ (@VilaWeb, CAT) : an online newspaper with a strong focus on Catalan culture and language
    

Of these accounts, we consider __@meteocat, @TMB_Barcelona, @bcn_ajuntament, @sport__ and __@hola__ to be politically neutral. Political neutrality does not imply cultural neutrality.

In order to quantify the aggregate linguistic behavior of the followers of each account, we first need to characterize the linguistic choice of each follower. The information of interest is the distribution resulting from all the sampled linguistic choices. The process used to identify the distribution of linguistic choice among the followers of a given root account is described below (code details are available in [this GitHub repo](https://github.com/pgervila/Explore_bilingualism_in_cities)).

1. Select a root account

2. For the selected account, obtain a random sample of _at least_ 100 city-resident followers with a sufficient number of (re)tweets and followers. It is important to ensure that the sample is as random as possible. Since selecting tweets from users forces us to apply several filters to ensure data quality, we discuss the potential sources of bias one by one.
   - Selected root-account followers must have their location activated so that they can be identified as city residents. Unfortunately, the proportion of Twitter users that activate the geolocation field is very low (approximately 10%). This makes the retrieval process quite costly, since the Twitter API currently does not allow streaming more than 3000 followers every 15 minutes. We assume no linguistic bias is associated with activating geolocation.
   - A limitation of the Twitter API is that follower data can only be streamed in chronological order, from most recent to oldest, which potentially introduces another source of bias. However, we verified that selecting the most recent followers is approximately equivalent to taking a random sample of the entire list of followers (time does not introduce significant bias). This makes retrieving relevant followers much less expensive than downloading all followers and then taking a proper random sample.
   - Selected users need to have tweeted or retweeted at least 60 times (a compromise between statistical significance and computational cost).
   - In addition, in order to avoid dummy or irrelevant accounts as much as possible, users need to have a minimum number of followers (>= 50).
   - We assume that filtering accounts by a minimum number of followers and (re)tweets does not introduce any significant linguistic bias.
    
3. Stream 60 (re)tweets from each relevant follower

4. Detect how many of these tweets can be linguistically identified. Linguistic identification is implemented by comparing the result of Twitter's own language detection with that of [langdetect](https://pypi.org/project/langdetect/), a Python library for text language detection. Only tweets for which both language detection algorithms give the same result are kept; tweets on which the two algorithms disagree are discarded, since their language cannot be reliably identified.

5. Select only tweets written in local languages plus English

6. Filter followers further by keeping only those with at least 40 filtered tweets (a compromise between statistical significance and computational feasibility); all followers with fewer than 40 remaining tweets are dropped. In order to give every follower the same weight, keep only 40 tweets per follower.

7. Each tweet from a given user is treated as a Bernoulli random variable (with identical expectation across that user's tweets). Each follower's list of $n = 40$ tweets is therefore a sample whose number of successes follows a binomial distribution, where a _success_ is a tweet written in a particular language $L$ and a _failure_ is a tweet written in any other language. For each relevant follower $i$ and each applicable language, compute the sample mean $\bar Y_{i,L}$ (an estimate of the success probability $p_{i,L}$). We call this sample mean the follower's linguistic mean. Each follower has as many linguistic means as there are languages under consideration.

8. For each language, consider the distribution formed by the linguistic means of all followers. Since we only have a random sample of followers, what we obtain is a sample of the desired distribution.

9. Repeat for all root accounts
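
Steps 4 to 7 can be sketched for a single follower as follows. This is a minimal illustration, not the code from the repository: the function name `linguistic_means`, the `(text, twitter_lang)` input format, and the choice of keeping the first 40 surviving tweets are all assumptions made for the example. The detector is passed in as a callable so that the sketch does not depend on any particular library.

```python
from collections import Counter

LANGS = ("ca", "es", "en")  # local languages plus English (step 5)
N_TWEETS = 40               # tweets kept per follower (step 6)


def linguistic_means(tweets, detect, langs=LANGS, n=N_TWEETS):
    """Return {language: share of the n kept tweets} for one follower.

    tweets: list of (text, twitter_lang) pairs (hypothetical input format),
            where twitter_lang is the language code reported by Twitter.
    detect: callable mapping a text to a language code.
    Returns None when fewer than n tweets survive the filters.
    """
    kept = []
    for text, twitter_lang in tweets:
        try:
            detected = detect(text)
        except Exception:
            # The detector could not classify this text: discard the tweet.
            continue
        # Step 4: keep the tweet only if both detectors agree.
        # Step 5: keep only the languages under study.
        if detected == twitter_lang and detected in langs:
            kept.append(detected)
    if len(kept) < n:
        return None  # step 6: follower dropped
    # Step 6: same weight for every follower (assumption: first n tweets).
    counts = Counter(kept[:n])
    # Step 7: sample mean of the Bernoulli outcomes, one per language.
    return {lang: counts[lang] / n for lang in langs}
```

With the real library, `detect` could be `langdetect.detect`, which returns ISO 639-1 codes such as `ca` or `es` and raises an exception when a text has no usable features. Followers for which the function returns `None` are simply left out of the account's sample.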

### Population samples per account

### Twitter language settings per root account

A first indication of the relative relevance of each language is provided by the language setting of each follower's account. In most cases, the default language (worked out by algorithms based on computer settings and installations) will be accepted by the user unless he or she strongly prefers another language. There is therefore a strong bias towards the dominant official language: Spanish is still the default language for virtually all commercial computers and smartphones sold in Barcelona. Nevertheless, it is possible (and simple) to change the language setting of a Twitter account to Catalan.

<img src="figures/lang_settings_in_Barcelona.png" width="600" height="400" />

Results show that Spanish is the default language for most users. Even in the case of __@Vilaweb__, only 3 users out of 10 on average have their language settings in Catalan. In some cases, we assume users are not even aware they can switch the setting. It's the language inertia so common in diglossic environments, where it is assumed that some activities or services are provided in one language only.

### Percentage of tweets in each language

To get an idea of the relative weight of each language per account, we pool all the tweets from the account's followers and compute the percentage written in each language. Since every follower contributes the same number of tweets to the account's sample, no single follower dominates the result.
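
As a rough sketch of this computation, assuming each follower is represented by a list of language codes of equal length (the function name and the input format are hypothetical, chosen only for this example):

```python
from collections import Counter


def language_percentages(followers):
    """Percentage of pooled tweets in each language for one root account.

    followers: list of per-follower lists of language codes, e.g.
               [["es", "ca", ...], ["ca", "ca", ...]] (hypothetical format).
    """
    # Pool the tweets of all followers and count them per language.
    counts = Counter(lang for tweets in followers for lang in tweets)
    total = sum(counts.values())
    return {lang: 100 * count / total for lang, count in counts.items()}
```

Because every follower contributes the same number of tweets, the pooled percentage for a language equals the average of the followers' linguistic means for that language, expressed as a percentage.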

<img src="figures/percentage_of_each_lang_per_account_in_Barcelona.png" width="600" height="400" />

We observe that the proportion of tweets in Catalan is higher than the percentage of accounts with language settings in Catalan. The conclusion is that some users who regularly tweet in Catalan as well keep their settings in Spanish (a default option they feel comfortable enough with). One of the most telling results from this graph is that the majority of (re)tweets from followers of __@VilaWeb__ (widely considered a Catalan nationalist publication) are in Spanish.



### Language choice distribution of followers per account

Language percentages give us a general idea of language preference, but we have not yet explored how this preference is distributed among users. Do most followers (re)tweet in both languages for all accounts? For each account, what proportion of users tweet only or mostly in one language?

Since we have 40 classified tweets per selected follower, we can compute the ratio of tweets in each language per follower. After sorting all user ratios per language and per account, we can visualize the data as box plots. [Box plots](https://www.wellbeingatschool.org.nz/information-sheet/understanding-and-interpreting-box-plots) are a good way to quickly visualize distributions. In a box plot, the data is divided into 4 groups by its 3 quartiles. The line inside each box marks the median of the sorted data (its middle value, which is also the second quartile). The box itself contains the central 50% of the data, bounded by the first and third quartiles. The interval defined by the upper and lower whiskers contains almost all of the data, except for values considered statistical outliers. We plot two box plots per account: one for Catalan and one for Spanish (English is not considered here).
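
The quantities drawn in a box plot can be computed directly from the per-follower ratios. The sketch below is only illustrative: it assumes the common convention that whiskers extend to the most extreme values within 1.5 times the interquartile range of the box, which the plots in this post may or may not follow, and the function name `box_plot_summary` is made up for the example.

```python
from statistics import quantiles


def box_plot_summary(ratios, whisker=1.5):
    """Quartiles, whisker ends and outliers for a list of per-follower ratios."""
    # Three cut points split the sorted data into four groups.
    q1, median, q3 = quantiles(ratios, n=4, method="inclusive")
    iqr = q3 - q1
    # Assumed convention: points beyond 1.5 * IQR from the box are outliers.
    low_fence = q1 - whisker * iqr
    high_fence = q3 + whisker * iqr
    inside = [r for r in ratios if low_fence <= r <= high_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "whisker_low": min(inside),
        "whisker_high": max(inside),
        "outliers": sorted(r for r in ratios if r < low_fence or r > high_fence),
    }
```

For example, for the ratios `[0.0, 0.0, 0.05, 0.1, 0.9]` the follower at 0.9 lies far above the upper fence and is reported as an outlier, while the upper whisker stops at 0.1.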


<img src="figures/lang_distribs_per_acc_in_Barcelona.png" width="600" height="400" />

The first thing to point out is that the median values are far from a balanced result (0.5) for most accounts, especially for those that tweet in Spanish (the leftmost four). For these accounts the distributions are very skewed, showing a relatively large number of users who tweet only or almost only in Spanish. Only for __@VilaWeb__ followers do we see a more even distribution for both languages. As a consequence, while the median follower of __@hola__ does not tweet in Catalan at all, and that of __@sport__ does so only 5% of the time, the median follower of __@VilaWeb__ tweets in Catalan almost 40% of the time. Interestingly, while users who tweet only or almost only in Catalan are nonexistent or count as outliers for the accounts that tweet in Spanish, users who tweet only or almost only in Spanish are not outliers for any of the accounts under consideration. In any case, all median followers (re)tweet more often in Spanish than in Catalan. Moreover, they do so consistently: every median follower (re)tweets in Spanish more than half of the time, even in the case of __@VilaWeb__.

In the case of each of the four accounts whose tweets are in Spanish, 75% of their followers write more than 50% of their tweets in Spanish. On the other hand, for these very accounts, 75% of their followers write less than 30% of their tweets in Catalan.

### Linguistic choice of public transport users

The __@TMB_Barcelona__ account is probably the most informative of all the accounts from a linguistic viewpoint. Even though the account tweets in Catalan, public transport is typically one of the few places in a city where large numbers of people from different backgrounds and places come together. Since we expect its followers to be, if certainly not a random sample, at least a representative sample of public transport users, this account should provide a good linguistic summary of the city.

<img src="figures/hist_lang_choice_@TMB_Barcelona_followers.png" width="600" height="400" />

From the Spanish language perspective, the interval with the highest frequency is the one with more than 95% of tweets written in Spanish. From the Catalan perspective, the interval with the highest frequency is the one with less than 5% of tweets written in Catalan (with a frequency 3 times higher than that of the second-highest interval, which is the adjacent one).
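
A histogram of this kind can be built by grouping followers into 5% intervals. The following sketch assumes 20 equal-width bins and 40 tweets per follower; the exact binning used for the figure above is not stated in this post, and the function name is hypothetical. Working with integer tweet counts rather than floating-point ratios avoids followers landing in the wrong bin because of rounding.

```python
from collections import Counter


def histogram_5pct(counts, n=40, n_bins=20):
    """Number of followers per 5% interval of tweets in a given language.

    counts: for each follower, how many of the n kept tweets are in the language.
    Bin k covers [5k%, 5(k+1)%), except the last bin, which also includes 100%.
    """
    bins = Counter(min(c * n_bins // n, n_bins - 1) for c in counts)
    return [bins[k] for k in range(n_bins)]
```

With this binning, followers with 0 or 1 tweets out of 40 in a language fall into the first interval (below 5%), and followers with 38 to 40 tweets fall into the last one (95% and above).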

### Conclusions

A sample of accounts mostly written in Catalan reveals a clear preference for Spanish in the tweets written by their Barcelona-resident followers. Barcelona is officially a bilingual city but, at least among its Twitter users, this bilingualism is strongly asymmetrical: Spanish is consistently the most frequently used language, whether out of preference, inertia, or a wish to maximize interaction. From a sociolinguistic viewpoint, this strong imbalance in frequency of use may have important effects on the quality and creativity of the minority language in the long term, and eventually on its survival as a language.
An important conclusion is that a linguistically balanced environment is only found among followers of accounts that on paper are skewed towards the minority language (because of the language they are written in), whereas accounts that tweet in Spanish, the majority language, have followers that are linguistically very unbalanced. This is evidence that if the goal is to achieve a truly bilingual society, with minimal inequalities in language use and preference, policies must be skewed towards the minority language. In any case, it will be interesting to repeat the same analysis 10 years from now in order to detect potential trends.

