# Experiment 2
In which we attempt to implement Experiment 2 from the original study.

$H_{0}$: People are equally likely to use Catalan in non-referendum tweets as in referendum-specific tweets.

$H_{1}$: People are more likely to use Catalan in non-referendum tweets than in referendum-specific tweets.

In [4]:
import pandas as pd
import matplotlib
%matplotlib inline
import matplotlib.pyplot as plt

## Load data

In [5]:
data = pd.read_csv('../../data/tweets/extra_user_tweets/Jan-01-17_Oct-31-17_user_tweets.tsv', sep='\t', index_col=False)
print(data.head())
print('%d total tweets'%(data.shape[0]))
data.loc[:, 'hashtags'] = data.loc[:, 'hashtags'].fillna('', inplace=False)

                   id             user  \
0  850626678380535808     CampsCliment   
1  850628981040848896         Dom70Bcn   
2  850631111747276800  Estrellas_Siete   
3  850632361649860608    pasionxespana   
4  850632420365873152        MH17files   

                                                text  \
0  RT @_Gafas_y_reloj_: El nulo interés de PP y C...   
1  RT @MTudela: ‘Té gràcia estudiar a una univers...   
2  Las empresas valencianas pueden solicitar #sub...   
3  #españaesuna #stopUE #stopOTAN #stopLGTB #stop...   
4  RT @pthefigg: @EliotHiggins @benimmo @DFRLab @...   

                                            hashtags  contains_ref_hashtag  \
0                                                NaN                     0   
1                                                NaN                     0   
2         subvenciones,contratacion,jovenes,Valencia                     0   
3  españaesuna,stopUE,stopOTAN,stopLGTB,stopgloba...                     0   
4                   

In [6]:
# compute hashtag counts for later filtering
data.loc[:, 'hashtag_count'] = data.loc[:, 'hashtags'].apply(lambda x: 0 if x=='' else len(x.split(',')))

In [7]:
print(data.loc[:, 'contains_ref_hashtag'].value_counts())

0    208927
1      3549
Name: contains_ref_hashtag, dtype: int64


In [8]:
# keep only original tweets (drop retweets)
data_original = data[data.loc[:, 'retweeted'] == 0]
print('%d original tweets'%(data_original.shape[0]))
# language cutoff
lang_conf_cutoff = 0.90
allowed_langs = set(['es', 'ca'])
data_original_high_conf = data_original[(data_original.loc[:, 'lang_conf'] >= lang_conf_cutoff) &
                                        (data_original.loc[:, 'lang'].isin(allowed_langs))]
print('%d relevant tweets'%(data_original_high_conf.shape[0]))

81458 original tweets
52364 relevant tweets


In [9]:
# restrict to users who have tweeted at least once with a referendum hashtag (contains_ref_hashtag==1)
# and at least once without a referendum hashtag (contains_ref_hashtag==0)
relevant_users = data_original_high_conf.groupby('user').apply(lambda x: (x.loc[:, 'contains_ref_hashtag'].max()==1 and 
                                                                          x.loc[:, 'contains_ref_hashtag'].min()==0))
relevant_users = relevant_users[relevant_users].index.tolist()
print('%d relevant users'%(len(relevant_users)))
data_relevant = data_original_high_conf[data_original_high_conf.loc[:, 'user'].isin(relevant_users)]
print('%d relevant tweets'%(data_relevant.shape[0]))

775 relevant users
32044 relevant tweets


Only 775 users remain after filtering. That is a small sample and will probably limit the power of the test.
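To put a rough number on the power concern, the sketch below approximates the power of a two-sided one-sample test on per-user differences with a normal approximation. The function name `approx_power` and the standardized effect sizes are hypothetical choices for illustration. Only the user count (775) comes from the filtering step above. It assumes Python 3.8+ for `statistics.NormalDist`.

```python
from statistics import NormalDist

def approx_power(effect_size, n, alpha=0.05):
    """Approximate power of a two-sided one-sample z-test.

    effect_size is the standardized mean difference (mean / std).
    It is an assumed value here, not an estimate from the tweet data.
    """
    norm = NormalDist()
    z_crit = norm.inv_cdf(1 - alpha / 2)
    shift = effect_size * n ** 0.5
    # probability of landing in either rejection region under the alternative
    return norm.cdf(shift - z_crit) + norm.cdf(-shift - z_crit)

n_users = 775  # relevant users reported above
for d in (0.05, 0.1, 0.2):  # hypothetical standardized effect sizes
    print('d=%.2f power=%.2f' % (d, approx_power(d, n_users)))
```

With an effect size of zero the function returns `alpha`, which is a quick sanity check on the approximation.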

## All controls
This test uses all tweets without a referendum hashtag as the control.

In [10]:
data_ref = data_relevant[data_relevant.loc[:, 'contains_ref_hashtag'] == 1]
data_control = data_relevant[data_relevant.loc[:, 'contains_ref_hashtag'] == 0]

In [11]:
print('%d referendum tweets'%(data_ref.shape[0]))
print('%d non-referendum tweets'%(data_control.shape[0]))
print('%d users'%(data_relevant.loc[:, 'user'].nunique()))

890 referendum tweets
31154 non-referendum tweets
775 users


In [12]:
print(data_ref.loc[:, 'lang'].value_counts())
print(data_control.loc[:, 'lang'].value_counts())

es    679
ca    211
Name: lang, dtype: int64
es    29347
ca     1807
Name: lang, dtype: int64


In [13]:
# compute probability of choosing Catalan in ref and control
from __future__ import division
lang = 'ca'
compute_prob_lang = lambda x: x[x.loc[:, 'lang'] == lang].shape[0] / x.shape[0]
cat_prob_ref = data_ref.groupby('user').apply(compute_prob_lang)
cat_prob_control = data_control.groupby('user').apply(compute_prob_lang)
print(cat_prob_ref.head())
print(cat_prob_control.head())

user
12decima12         0.0
19722791es         0.0
24clm              0.0
3OejCDcfvFi0M1B    0.0
4G_RED             0.0
dtype: float64
user
12decima12         0.0
19722791es         0.0
24clm              0.0
3OejCDcfvFi0M1B    0.0
4G_RED             0.0
dtype: float64


In [14]:
all_control_d_u = cat_prob_ref - cat_prob_control
all_control_d_u_mean = all_control_d_u.mean()

In [15]:
all_control_d_u_stderr = all_control_d_u.std() / len(all_control_d_u)**.5

In [16]:
print('d_u for all control tweets is %.3f +/- %.3f'%(all_control_d_u_mean, all_control_d_u_stderr))

d_u for all control tweets is 0.031 +/- 0.011


In [17]:
from scipy.stats import ttest_1samp
d_u_null = 0.
t_stat, p_val = ttest_1samp(all_control_d_u, d_u_null)
print('significance: t=%.3f p=%.3E'%(t_stat, p_val))

significance: t=2.839 p=4.640E-03


**Conclusion 1**:

Users are more likely to write in Catalan in tweets with a referendum hashtag than in tweets without one (mean $d_u$ = 0.031, t = 2.839, p = 0.0046). Note that this effect runs in the opposite direction to $H_{1}$ as stated above.
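The per-user statistic behind this test can be written out by hand. The sketch below uses made-up rows (the user names and language labels are hypothetical, not taken from the data) to show that $d_u$ is the per-user difference between the Catalan share in referendum tweets and in control tweets, and that the t statistic is the mean of those differences divided by their standard error. It assumes Python 3. `statistics.stdev` uses the sample standard deviation, which matches the default of pandas `Series.std()`.

```python
from collections import defaultdict
from statistics import mean, stdev

# hypothetical (user, contains_ref_hashtag, lang) rows, not taken from the data
rows = [
    ('user_a', 1, 'ca'), ('user_a', 1, 'es'), ('user_a', 0, 'es'), ('user_a', 0, 'es'),
    ('user_b', 1, 'ca'), ('user_b', 0, 'ca'), ('user_b', 0, 'es'),
    ('user_c', 1, 'es'), ('user_c', 0, 'es'),
]

# counts[user][ref_flag] = [Catalan tweets, all tweets]
counts = defaultdict(lambda: {0: [0, 0], 1: [0, 0]})
for user, ref, lang in rows:
    counts[user][ref][0] += int(lang == 'ca')
    counts[user][ref][1] += 1

# d_u = P(ca | referendum tweet) - P(ca | control tweet), one value per user
d_u = {u: c[1][0] / c[1][1] - c[0][0] / c[0][1] for u, c in counts.items()}

values = list(d_u.values())
d_mean = mean(values)
d_stderr = stdev(values) / len(values) ** 0.5
t_stat = d_mean / d_stderr  # one-sample t statistic against a null of 0
print(d_u)
print('mean=%.3f stderr=%.3f t=%.3f' % (d_mean, d_stderr, t_stat))
```

Every user contributes exactly one difference regardless of how many tweets they wrote, so prolific users do not dominate the mean.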

## Hashtag control
The same test, restricted to tweets that contain at least one hashtag, so that referendum-hashtag tweets are compared against tweets with some other hashtag.

In [18]:
data_with_hashtags = data_original_high_conf[data_original_high_conf.loc[:, 'hashtag_count'] > 0]
# recompute relevant users
relevant_users = data_with_hashtags.groupby('user').apply(lambda x: (x.loc[:, 'contains_ref_hashtag'].max()==1 and 
                                                                     x.loc[:, 'contains_ref_hashtag'].min()==0))
relevant_users = relevant_users[relevant_users].index.tolist()
# recompute relevant data
data_relevant_with_hashtags = data_with_hashtags[data_with_hashtags.loc[:, 'user'].isin(relevant_users)]
data_ref = data_relevant_with_hashtags[data_relevant_with_hashtags.loc[:, 'contains_ref_hashtag'] == 1]
data_control = data_relevant_with_hashtags[data_relevant_with_hashtags.loc[:, 'contains_ref_hashtag'] == 0]
print('%d referendum tweets'%(data_ref.shape[0]))
print('%d non-referendum tweets'%(data_control.shape[0]))
print('%d users'%(data_relevant_with_hashtags.loc[:, 'user'].nunique()))

656 referendum tweets
13956 non-referendum tweets
550 users


In [19]:
cat_prob_ref = data_ref.groupby('user').apply(compute_prob_lang)
cat_prob_control = data_control.groupby('user').apply(compute_prob_lang)
hashtag_control_d_u = cat_prob_ref - cat_prob_control
hashtag_control_d_u_mean = hashtag_control_d_u.mean()
hashtag_control_d_u_stderr = hashtag_control_d_u.std() / len(hashtag_control_d_u)**.5
print('d_u for all control hashtag tweets is %.3f +/- %.3f'%(hashtag_control_d_u_mean, hashtag_control_d_u_stderr))
d_u_null = 0.
t_stat, p_val = ttest_1samp(hashtag_control_d_u, d_u_null)
print('significance: t=%.3f p=%.3E'%(t_stat, p_val))

d_u for all control hashtag tweets is 0.014 +/- 0.011
significance: t=1.230 p=2.192E-01


**Conclusion 2**:

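When the control is restricted to tweets with some other hashtag, we cannot reject $H_{0}$: the difference in Catalan use between referendum-hashtag tweets and other hashtag tweets is not significant (mean $d_u$ = 0.014, t = 1.230, p = 0.219).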
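A non-significant result is not the same as evidence of no effect, so it can help to look at an interval rather than only the p-value. The sketch below builds approximate 95% intervals from the rounded means and standard errors printed by the two tests above. The helper `normal_ci` is a hypothetical name, and the normal approximation is an assumption that seems reasonable with several hundred users. Because the inputs are rounded to three decimals, the intervals are approximate.

```python
from statistics import NormalDist

def normal_ci(mean, stderr, level=0.95):
    """Normal-approximation confidence interval for a mean."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return mean - z * stderr, mean + z * stderr

# rounded values printed by the two tests above
results = [('all controls', 0.031, 0.011), ('hashtag controls', 0.014, 0.011)]
for label, m, se in results:
    lo, hi = normal_ci(m, se)
    print('%s: %.3f to %.3f' % (label, lo, hi))
```

The interval for the hashtag control includes zero while the interval for all controls does not, which is consistent with the two p-values reported above.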

## Example referendum/non-referendum tweets
Sample tweets from users who wrote in Catalan with a referendum hashtag and in Spanish without one, to illustrate the contrast.

In [20]:
# we want ref_hashtag = 1, lang = ca
# and ref_hashtag = 0, lang = es
sample_data_ca = data_relevant[(data_relevant.loc[:, 'contains_ref_hashtag'] == 1) & 
                               (data_relevant.loc[:, 'lang'] == 'ca')]
sample_data_es = data_relevant[(data_relevant.loc[:, 'contains_ref_hashtag'] == 0) & 
                               (data_relevant.loc[:, 'lang'] == 'es')]
sample_users = list(set(sample_data_ca.loc[:, 'user'].unique()) & set(sample_data_es.loc[:, 'user'].unique()))

In [22]:
import numpy as np  # pd.np is deprecated; import numpy directly
np.random.seed(123)
sample_size = 10
test_users = np.random.choice(sample_users, size=sample_size, replace=False)
for u in test_users:
    u_ca_data = sample_data_ca[sample_data_ca.loc[:, 'user'] == u]
    u_es_data = sample_data_es[sample_data_es.loc[:, 'user'] == u]
#     if(u_ca_data.shape[0] > 0 and u_es_data.shape[0] > 0):
    print('user %s CA text:\n %s'%(u, '\n'.join(u_ca_data.loc[:, 'text'].values)))
    print('user %s ES text:\n %s'%(u, '\n'.join(u_es_data.loc[:, 'text'].values)))

user dmontserratnono CA text:
 Si l'#1Oct2017 guanya el "No" els Manel passaran a dir-se Manuel. Jo votaria "Sí". #AlguHoHaviaDeDir
Ara només ens falta que en Trump faci treure la web en castellà de la UE. #CatalanRef2017 #alguhohaviadedir
El conseller @quimforn i el Major Trapero estan administrant molt bé el tempo que requereix aquesta setmana. Postura intel·ligent #1Oct2017
https://t.co/5r3OPySf0h #freepiolin no l'oblidem que la memòria històrica és molt curta
user dmontserratnono ES text:
 .@AliciaSCamacho puso en marcha la 'Operación Cataluña' con ayuda de Moragas | Diario Público https://t.co/KBCIeyew5z
user RoigNegre CA text:
 @SiviLaCanya Així s'estimaran més. Tricorni contra tricorni #freepiolin
user RoigNegre ES text:
 Los fascistas no nos quitarán la alegria.Ni en Barcelona ni en Caracas ni en Damascohttps://m.youtube.com/watch?v=1tP1umpp4M4#NoTincPor
@Ritmari @M3DuSsa @LeswinJPerez Los quemados vivos no son verdad ? Los linchamientos? Los ataques a escuelas y hosp… https://