# Check code-switching
Here we check whether the tweets labelled as Catalan or Spanish contain any code-switching. This is a legitimate concern, because each tweet carries only a single language label.

In [4]:
import pandas as pd
import json
import codecs

## Load data
We'll use the data from the independence referendum dataset (IT) rather than the controls (XT).

In [13]:
# read one JSON object per line; the context manager closes the file handle
with codecs.open('../../data/tweets/archive_Jan-01-17_Oct-31-17_ref_hashtags_filtered.json', 'r', encoding='utf-8') as tweet_file:
    ref_tweets = [json.loads(l.strip()) for l in tweet_file]
# one row per tweet
ref_tweet_data = pd.concat([pd.Series(t) for t in ref_tweets], axis=1).transpose()
# language ID output: predicted language ('lang') and confidence ('conf') per tweet
lang_id_data = pd.read_csv('../../data/tweets/archive_Jan-01-17_Oct-31-17_ref_hashtags_filtered_langid.csv',
                           index_col=False, encoding='utf-8')
# join on tweet ID
ref_tweet_data = pd.merge(ref_tweet_data, lang_id_data, on='id', how='inner')
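
An inner join silently drops any tweet that has no language ID row, and vice versa. As a rough sanity check, the join can be rerun as an outer join with `indicator=True` to count unmatched rows on each side. The sketch below uses small made-up frames (`toy_tweets`, `toy_lang_ids`) rather than the real data, and it assumes tweet IDs are unique in both tables, which is what `validate='one_to_one'` enforces.

```python
import pandas as pd

# toy stand-ins for ref_tweet_data and lang_id_data (made-up rows)
toy_tweets = pd.DataFrame({'id': [1, 2, 3], 'text': ['a', 'b', 'c']})
toy_lang_ids = pd.DataFrame({'id': [1, 2, 4],
                             'lang': ['ca', 'es', 'en'],
                             'conf': [0.99, 0.95, 1.0]})

# outer join keeps unmatched rows; '_merge' records which side each row came from
merged = pd.merge(toy_tweets, toy_lang_ids, on='id', how='outer',
                  indicator=True, validate='one_to_one')
merge_counts = merged['_merge'].value_counts()
print(merge_counts)

# rows present on both sides are exactly what the inner join would return
joined = merged[merged['_merge'] == 'both'].drop(columns='_merge')
```

If `left_only` or `right_only` is nonzero on the real data, some tweets were lost in the join and the language counts below describe only the matched subset.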

In [14]:
ref_tweet_data.head()

Unnamed: 0,date,id,location.country,location.name,text,user,lang,conf
0,Sat Apr 08 16:27:26 +0000 2017,850746920700174337,,,Renunciar al creixement econòmic per un fi po...,ramonjmagi,ca,0.995823
1,Sun Apr 09 00:41:55 +0000 2017,850871361476202496,,,very few people know what I've been through......,TavarezRenaldy,en,1.0
2,Sun Apr 09 05:50:36 +0000 2017,850949044184772608,,,AŞK bedenin bedeni sevmesi degil RUHUN RUHU SE...,cemilesakin88,tr,0.97192
3,Sun Apr 09 12:26:28 +0000 2017,851048667309944832,,,밥 먹으면서 그냥 해볼게요*_ _) #1orpg시작시기_2좋아하는세션타입_3잘안맞는...,NOAH_11110101,ko,1.0
4,Sun Apr 09 15:28:23 +0000 2017,851094448142417920,,,Most of the media totally ignored the election...,fauxcroft,en,1.0


In [15]:
ca_tweets = ref_tweet_data[ref_tweet_data.loc[:, 'lang'] == 'ca']
es_tweets = ref_tweet_data[ref_tweet_data.loc[:, 'lang'] == 'es']
print('%d Catalan tweets'%(ca_tweets.shape[0]))
print('%d Spanish tweets'%(es_tweets.shape[0]))

2549 Catalan tweets
4320 Spanish tweets


Let's get a sample of 50 Catalan and 50 Spanish tweets.

In [17]:
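import numpy as np
# fix the seed so the sample is reproducible
np.random.seed(123)
N = 100
# integer division: np.random.choice needs an integer sample size
sample_size = N // 2
# draw row labels without replacement, separately for each language
ca_sample = ca_tweets.loc[np.random.choice(ca_tweets.index, sample_size, replace=False), :]
es_sample = es_tweets.loc[np.random.choice(es_tweets.index, sample_size, replace=False), :]
combined_sample = pd.concat([ca_sample, es_sample])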
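
The balanced sampling above can also be written with pandas' own `DataFrame.sample`, which takes the seed as `random_state` instead of relying on global state. This is only a sketch: `sample_per_lang` is a hypothetical helper, the `toy` frame is made up, and the rows it picks will not match the ones drawn by the cell above.

```python
import pandas as pd

def sample_per_lang(data, langs, n_per_lang, seed=123):
    """Draw n_per_lang rows without replacement for each language code in langs."""
    samples = [
        data[data['lang'] == lang].sample(n=n_per_lang, replace=False, random_state=seed)
        for lang in langs
    ]
    return pd.concat(samples)

# toy stand-in for ref_tweet_data (made-up rows)
toy = pd.DataFrame({
    'id': range(8),
    'lang': ['ca', 'ca', 'ca', 'es', 'es', 'es', 'en', 'en'],
    'text': ['t%d' % i for i in range(8)],
})
toy_sample = sample_per_lang(toy, ['ca', 'es'], n_per_lang=2)
print(toy_sample['lang'].value_counts())
```

On the real data the call would look like `sample_per_lang(ref_tweet_data, ['ca', 'es'], n_per_lang=50)`.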

In [20]:
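# clean up text for printing: put each tweet on one line,
# replacing newlines with a space so adjacent words are not glued together
combined_sample.loc[:, 'text'] = combined_sample.loc[:, 'text'].apply(lambda x: x.replace('\n', ' '))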

In [26]:
print('\n'.join(combined_sample.apply(lambda x: 'lang=%s:\t%s'%(x.loc['lang'], x.loc['text']), axis=1).values))

lang=ca:	Sense llum per a que puguin fer sessió de cinema cada nit. Però no pelis de la WB... #FreePiolin https://t.co/23yzWNrqh7
lang=ca:	🗳️ He votat "Sí" per raons democràtiques, socials,  econòmiques i de justícia  #1Oct  #Democracia #CatalanReferendum https://t.co/4An13LO1E9
lang=ca:	#Salomó per la #democràcia, demà us hi esperem a tots! #Catalonia #Catalunya #1Oct https://t.co/VfeiVd8Qxp
lang=ca:	#CatalanReferendum Apliquen el cens universal. La gent podrà anar a votar a qualsevol col.legi habilitat per votar del territori català
lang=ca:	Podem canviar ja el crit de #votarem per #hemvotat?
lang=ca:	Fantàstic! #seguim #votarem https://t.co/eS2vgQrPZC
lang=ca:	Pels que ja no hi són i pels que vindran, demà tothom a les urnes. #votarem https://t.co/sQ0ztS9wKj
lang=ca:	A Gelida l'#1O volem votar #comsempre. @lluis_valls, obri els col·legis! https://t.co/pWc61Q5Cdi
lang=ca:	Ens podem veure el cava, oi? 🤔#1OoctTV3
lang=ca:	El P-LIB català condemna absolutament la repressió policial del 

Write the sample to a TSV file for manual annotation; the annotated file will be loaded back in later.

In [27]:
combined_sample.to_csv('../../data/tweets/CA_ES_sample_tweets.tsv', sep='\t', index=False, encoding='utf-8')
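
Once the file has been annotated, loading it back and summarising the labels could look roughly like the sketch below. The annotation scheme is not fixed here, so the `code_switch` column (1 if the annotator marked the tweet as code-switched, 0 otherwise) is an assumption, and the rows are made up to keep the example self-contained.

```python
import io
import pandas as pd

# toy stand-in for the annotated TSV; 'code_switch' is a hypothetical column
annotated_tsv = io.StringIO(
    "id\tlang\ttext\tcode_switch\n"
    "1\tca\ttweet a\t0\n"
    "2\tca\ttweet b\t1\n"
    "3\tes\ttweet c\t0\n"
    "4\tes\ttweet d\t0\n"
)
annotated = pd.read_csv(annotated_tsv, sep='\t', index_col=False)

# share of tweets marked as code-switched, per predicted language
code_switch_rate = annotated.groupby('lang')['code_switch'].mean()
print(code_switch_rate)
```

With the real file, the `io.StringIO` object would be replaced by the path to the annotated TSV.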