# A semantic breadth analysis of Twitter users diagnosed with bipolar disorder

## Introduction

The Broaden and Build Theory (BBT) (Fredrickson, 2001) states that certain positive emotions (such as joy, interest, pride and happiness) broaden our mind, while negative emotions (such as sadness, anger and fear) narrow our thoughts. When we are happy, our thoughts leap from topic to topic. When we are sad, our thoughts are more narrowly clustered and we tend to be unable to let go of negative thoughts.

According to the BBT, the narrowing effect of negative emotions evolved to focus attention, cognition and action on immediate threats, whereas positive emotions support exploration and experimentation, which serve to enhance survival skills (Fredrickson, 2013; Fredrickson et al., 2004). In this context, cognitive breadth denotes the variety of the content of cognitive processes (Schweighofer, 2018). Postulating that affect impacts cognitive breadth, we might observe a corresponding impact on semantic breadth.

Several studies have already found characteristic semantic features in the online communication of individuals with depression (Hswen et al., 2018). However, there are currently no studies on the semantic features of the online communication of individuals with bipolar disorder.

## Method

Twitter timelines of users who posted "I was diagnosed with Bipolar disorder" ... or a similar phrase (TODO: mention all combinations) were scraped via the Twitter API.

The /scripts/ folder contains all the scripts used to extract the data used in this notebook. 

The steps were: 
1. Extract the json (data_extraction.py)
2. Extract the document vectors (vectors.py) 
3. Reduce the dimensionality with PCA (clustering.py) 
4. Resample the data with a monthly timeframe using cosine similarity (resample.py) 

A pretrained model was used to extract the word vectors from the texts: https://github.com/loretoparisi/word2vec-twitter


## Exploratory Data Analysis

In [79]:
# read full data 
import pandas as pd 

# read data 
bipolar_data = pd.read_pickle("../data/processed/bipolar_data.pkl")
control_data = pd.read_pickle("../data/processed/control_data.pkl")

# check that both frames have the same columns before merging 
assert control_data.columns.equals(bipolar_data.columns)

# merge data 
df = pd.concat([control_data, bipolar_data],ignore_index=True)

In [80]:
df.head()

Unnamed: 0,id,is_control,created_at,full_text
0,185501402,1,2016-05-18 11:52:16,new balance derby in final #YNWA
1,185501402,1,2016-03-20 15:12:03,yook liverpool :(
2,185501402,1,2015-12-20 00:40:05,Watch Over You by Alter Bridge — https://t.co/...
3,185501402,1,2015-11-21 16:36:32,waah firmino false 9
4,185501402,1,2015-10-01 21:01:26,this is shit for origi tonight #rodgetsout


In [81]:
# set color for bipolar and control group and for the models 
cc = '#5bd1d7' # control 
bc = '#ff502f' # bipolar 
cm = '#3A868A' # control model 
bm = '#A3331E' # bipolar model

In [85]:
# imports and settings for plotly 
# only offline plotting is used, so the online plotly.plotly module is not needed 
import plotly.graph_objs as go
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot

init_notebook_mode(connected=True)

In [86]:
# mean number of tweets per user in each group 
df_count = df.groupby(['id','is_control']).count().groupby('is_control').mean()

fig = [go.Bar(
    x=['Bipolar Group', 'Control Group'],
    # is_control == 0 is the bipolar group, is_control == 1 the control group
    y=[df_count.loc[0, 'full_text'], df_count.loc[1, 'full_text']],
    marker=dict(color=[bc, cc]),
)]

iplot(fig)

In [83]:
import scipy.stats as stats
user_count = df.groupby(['id','is_control'], as_index=False).count()
bipolar_count = user_count[user_count['is_control'] == 0]['full_text']
control_count = user_count[user_count['is_control'] == 1]['full_text']

t2, p2 = stats.ttest_ind(bipolar_count,control_count)
print("t = " + str(t2))
print("p = " + str(p2))

t = -2.2615091466156936
p = 0.023768531458239865


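### There is a small but statistically significant difference in the average number of tweets per user between the groups (t = -2.26, p = 0.024). The negative sign means that users in the bipolar group posted fewer tweets on average.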
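
A p-value alone does not say how large a difference is. The following is a minimal sketch of Cohen's d, a standardized effect size, as one way to quantify it. The function name `cohens_d` and the example counts are hypothetical and not part of the original notebook; in the notebook the function would be applied to `bipolar_count` and `control_count`.

```python
import numpy as np

def cohens_d(a, b):
    """Standardized mean difference between two samples (pooled SD)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    n_a, n_b = len(a), len(b)
    # pooled variance built from the unbiased (ddof=1) sample variances
    pooled_var = ((n_a - 1) * a.var(ddof=1) + (n_b - 1) * b.var(ddof=1)) / (n_a + n_b - 2)
    return (a.mean() - b.mean()) / np.sqrt(pooled_var)

# hypothetical tweet counts per user, for illustration only
example_bipolar = [10, 12, 14, 16, 18]
example_control = [12, 14, 16, 18, 20]
d = cohens_d(example_bipolar, example_control)
```

A negative value of `d` means the first group has the lower mean, matching the sign convention of `stats.ttest_ind`.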

In [87]:
df = df.reset_index().set_index('created_at')
grouper = df.groupby([pd.Grouper(freq='M'),'is_control'])

df_monthly = grouper['full_text'].count().unstack('is_control').fillna(0)

init_notebook_mode(connected=True)

trace_control = go.Scatter(
                x=df_monthly.index,
                y=df_monthly[1],
                name = "Control",
                line = dict(color = cc),
                opacity = 0.8)

trace_bipolar = go.Scatter(
                x=df_monthly.index,
                y=df_monthly[0],
                name = "Bipolar",
                line = dict(color = bc),
                opacity = 0.8)

data = [trace_control,trace_bipolar]

layout = dict(
    title = "Monthly amount of Tweets per Group",
    xaxis = dict(
        range = ['2009-07-01','2019-03-31'])
)

fig = dict(data=data, layout=layout)
iplot(fig)

### Most users in the bipolar group only started using Twitter recently. This is expected, since sampling of the bipolar group started only recently.

## From Word to Document Embeddings

In [None]:
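# for more information see vectors.py 
# assumed imports (not shown in the original cell)
import numpy as np
from nltk.tokenize import TweetTokenizer
from tqdm import tqdm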

# calculate the document vector as average of all the words 
def avg_feature_vector(tweet, model, num_features,index2word_set,tokenizer):
    '''
    calculates the average vector 
    '''
    feature_vec = np.zeros((num_features, ), dtype='float32')
    n_words = 0
    words = tokenizer.tokenize(tweet)
    for word in words:
        #print(word) # sanity check
        if word in index2word_set:
            n_words += 1
            feature_vec = np.add(feature_vec, model[word])
    if (n_words > 0):
        feature_vec = np.divide(feature_vec, n_words)
    return(feature_vec)

def get_avg_doc2vec(tweets, model, num_features): 
    '''
    calculates the average vector as a representation for a document 
    num_features: number of dimensions of vector 
    '''
    print('Averaging Word Vectors ...')
    tokenizer = TweetTokenizer()
    
    index2word_set = set(model.index2word)
    doc_vecs = []

    for tweet in tqdm(tweets): 
        doc_vec = avg_feature_vector(tweet, model, num_features,index2word_set,tokenizer)
        doc_vecs.append(doc_vec)
    print('... Done')
    
    return(doc_vecs)

### After averaging the word vectors of each tweet we get a vector of length 400 (pretrained model). This vector can be seen as a representation of the tweet with respect to 400 features. 
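
To make the averaging step concrete, here is a small sketch with a toy three-dimensional vocabulary instead of the 400-dimensional pretrained model. The names `toy_model` and `average_vector` and all vector values are made up for illustration.

```python
import numpy as np

# toy 3-dimensional "model"; the real pretrained model has 400 dimensions
toy_model = {
    "good": np.array([1.0, 0.0, 2.0]),
    "day": np.array([3.0, 2.0, 0.0]),
}

def average_vector(tokens, model, num_features):
    """Mean of the vectors of all in-vocabulary tokens (zeros if none)."""
    known = [model[t] for t in tokens if t in model]
    if not known:
        return np.zeros(num_features, dtype="float32")
    return np.mean(known, axis=0)

# "#YNWA" is not in the toy vocabulary and is skipped
doc_vec = average_vector(["good", "day", "#YNWA"], toy_model, num_features=3)
empty_vec = average_vector(["#YNWA"], toy_model, num_features=3)
```

A tweet without any in-vocabulary word ends up as the all-zero vector, for which cosine similarity to any other vector is undefined, so such tweets may need separate handling.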

## Cluster Analysis - PCA 

### The aim of the cluster analysis was to reduce the dimensionality of the document representation from 400 dimensions to 2. If differences between the groups are visible even in such a low-dimensional representation of the tweets, this would indicate a major difference in content or in the frequency of content. 
#### Only a sample of 10,000 tweets was used for the visualization in order to save computing time.
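
The reduction itself happens in clustering.py, which is not shown here. As a rough sketch of the idea only (not the script's actual implementation), PCA to two dimensions can be written with a singular value decomposition. The function `pca_2d` and the random input are assumptions for illustration.

```python
import numpy as np

def pca_2d(X):
    """Project the rows of X onto their first two principal components."""
    X_centered = X - X.mean(axis=0)
    # rows of Vt are the principal directions, ordered by explained variance
    _, singular_values, Vt = np.linalg.svd(X_centered, full_matrices=False)
    explained = singular_values**2 / np.sum(singular_values**2)
    return X_centered @ Vt[:2].T, explained[:2]

rng = np.random.default_rng(0)
doc_vecs = rng.normal(size=(500, 400))  # synthetic stand-in for document vectors
coords, explained_ratio = pca_2d(doc_vecs)
```

The explained variance ratio of the two retained components indicates how much of the original structure the 2-D scatter plot can show at all.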


In [88]:
# read resampled data 
df = pd.read_pickle("../data/processed/pca_transformed.pkl")

In [89]:
df.head()

Unnamed: 0,0,1,is_control,full_text
0,0.14836,-0.339964,1,new balance derby in final #YNWA
1,0.09792,0.259978,1,yook liverpool :(
2,0.54371,-0.273045,1,Watch Over You by Alter Bridge — https://t.co/...
3,0.118905,0.069363,1,waah firmino false 9
4,-0.122559,-0.248111,1,this is shit for origi tonight #rodgetsout


In [92]:
# take random sample 
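import random 
random.seed(42) # let's set it to the answer to life (used by random.sample later on) 

# DataFrame.sample draws from NumPy's generator, so it needs its own seed 
df = df.sample(10000, random_state=42)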

trace_control = go.Scatter(
    x = df[0][df['is_control'] == 1],
    y = df[1][df['is_control'] == 1],
    name = 'Control',
    mode = 'markers',
    marker = dict(
        size = 10,
        color = cc,
        line = dict(
            width = 2,
            color = 'rgb(0, 0, 0)'
        )
    ),
    text= df['full_text'][df['is_control'] == 1]) # change to full text 


trace_bipolar = go.Scatter(
    x = df[0][df['is_control'] == 0],
    y = df[1][df['is_control'] == 0],
    name = 'Bipolar',
    mode = 'markers',
    marker = dict(
        size = 10,
        color = bc,
        line = dict(
            width = 2,
        )
    ),
    text= df['full_text'][df['is_control'] == 0]) # change to full text 


data = [trace_control, trace_bipolar]

layout= go.Layout(
    title= 'PCA of Document Vectors',
    hovermode= 'closest',
    xaxis= dict(
        title= 'Dimension 1',
        ticklen= 5,
        zeroline= False,
        gridwidth= 2,
    ),
    yaxis=dict(
        title= 'Dimension 2',
        ticklen= 5,
        gridwidth= 2,
    ),
    showlegend= False
)
fig= dict(data=data, layout=layout)
iplot(fig)

### There seems to be a small difference in document representations for the groups. The tweets by the people in the bipolar group seem to cluster a little bit more. 

## Cosine Similarity comparison 

### The cosine similarity for each user was calculated from a random sample of at most 100 tweets per user. This was done to save computing costs, since the number of pairwise comparisons grows quadratically with the number of tweets.
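
To make the cost concrete: comparing every tweet of a user with every other tweet once requires n(n-1)/2 comparisons for n tweets. A short sketch (the helper name `n_pairs` is made up for illustration):

```python
from math import comb

def n_pairs(n_tweets):
    """Number of unordered tweet pairs compared for one user."""
    return comb(n_tweets, 2)

for n in (10, 100, 1000):
    print(n, n_pairs(n))
# 10 45
# 100 4950
# 1000 499500
```

Capping the sample at 100 tweets therefore bounds the work at 4950 comparisons per user.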

In [None]:
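# for more information see resample.py
# assumed imports (not shown in the original cell)
import itertools
import numpy as np
from scipy.spatial import distance
from tqdm import tqdm

# calculate cosine similarity    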
def avg_cosine_similarity(vectors):
    similarities = []
    # calculate angle 
    for word_vec1, word_vec2 in tqdm(itertools.combinations(vectors, 2)):
        sim = 1 - distance.cosine(word_vec1, word_vec2)
        similarities.append(sim)
    return(np.mean(similarities))
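
The same average can also be computed without a Python loop. The following is a sketch of a vectorized variant, not the code used in resample.py; the function name is hypothetical and it assumes that no vector is all zeros.

```python
import numpy as np

def avg_cosine_similarity_vectorized(vectors):
    """Mean cosine similarity over all unordered pairs of rows."""
    X = np.asarray(vectors, dtype=float)
    n = len(X)
    if n < 2:
        return np.nan  # a single vector has no pair to compare
    # assumption: no all-zero rows, so every row can be scaled to unit length
    X = X / np.linalg.norm(X, axis=1, keepdims=True)
    sims = X @ X.T  # matrix of all pairwise cosine similarities
    upper = sims[np.triu_indices(n, k=1)]  # keep each unordered pair once
    return upper.mean()

example_vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
example_result = avg_cosine_similarity_vectorized(example_vecs)
```

For users with a single tweet the function returns NaN, which mirrors the NaN values that are dropped later in the notebook.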

In [95]:
import pandas as pd
from datetime import datetime

# read resampled data 
df = pd.read_pickle("../data/processed/df_user_cs.pkl")

In [96]:
df = df.dropna() # let's drop all NaNs -> users with only one tweet 
df.head()

Unnamed: 0,id,cosine_similarity,is_control
0,185501402,0.422217,1
2,623745823,0.566762,1
3,93473104,0.591954,1
4,259995699,0.732886,1
5,82018253,0.837146,1


In [97]:
# only offline plotting is used, so the online plotly.plotly module is not needed 
import plotly.graph_objs as go
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot

init_notebook_mode(connected=True)

fig = {
    "data": [{
        "type": 'violin',
        "y": df['cosine_similarity'][df['is_control'] == 1],
        "name": 'Control',

        "box": {
            "visible": True
        },
        "line": {
            "color": 'black'
        },
        "meanline": {
            "visible": True
        },
        "fillcolor": cc,
        "opacity": 0.6,
        "x0": 'Control Group'
    },
    {
        "type": 'violin',
        "y": df['cosine_similarity'][df['is_control'] == 0],
        "name": 'Bipolar',

        "box": {
            "visible": True
        },
        "line": {
            "color": 'black'
        },
        "meanline": {
            "visible": True
        },
        "fillcolor": bc,
        "opacity": 0.6,
        "x0": 'Bipolar Group'
    }],
    "layout" : {
        "title": "",
        "yaxis": {
            "zeroline": False,
        }
    }
}

iplot(fig)

### The bipolar group has a slightly higher cosine similarity and fewer outliers.

## Cosine Similarity over Time 

### Now let's have a look at cosine similarity over time. I first looked at a weekly timeframe, but this yielded a lot of NaN values. NaN values occur if there is only one tweet in a given timeframe, so that no comparison can be made. After resampling to a monthly timeframe, there were fewer NaN values and a more pronounced distribution of cosine similarity.

In [100]:
import pandas as pd
from datetime import datetime

# read resampled data 
df = pd.read_pickle("../data/processed/df_resampled_w.pkl")

# how many missing 
print('Out of {} there are {} missing observations ... '.format(len(df), df['cosine_similarity_week'].isnull().sum()))
print('Out of {} persons there are  {} with fewer than 24 weeks of activity  ... '.format(df['id'].nunique(), sum(df['id'].value_counts() < 24)))

Out of 478010 there are 132136 missing observations ... 
Out of 5246 persons there are  1208 with fewer than 24 weeks of activity  ... 


Since there are a lot of missing observations (weeks in which a user posted only one tweet), I will use a monthly timeframe.
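
The effect of the bin width can be illustrated on a tiny synthetic example. This is only a sketch: the data frame `tweets` is made up, and unlike the real resampling it only looks at bins that contain at least one tweet.

```python
import pandas as pd

# made-up timeline: user 1 has three tweets, user 2 has one
tweets = pd.DataFrame({
    "id": [1, 1, 1, 2],
    "created_at": pd.to_datetime(
        ["2019-01-03", "2019-01-20", "2019-02-11", "2019-01-05"]),
})

# number of tweets per user and calendar month / calendar week
per_month = tweets.groupby(["id", tweets["created_at"].dt.to_period("M")]).size()
per_week = tweets.groupby(["id", tweets["created_at"].dt.to_period("W")]).size()

# a bin with a single tweet has no pair to compare, so its similarity is NaN
share_nan_month = (per_month < 2).mean()
share_nan_week = (per_week < 2).mean()
```

In this toy case every weekly bin holds a single tweet, while the monthly bins already contain one usable pair.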

In [101]:
import pandas as pd
from datetime import datetime

# read resampled data 
df = pd.read_pickle("../data/processed/df_resampled_m.pkl")

# how many missing 
print('Out of {} there are {} missing observations ... '.format(len(df), df['cosine_similarity_monthly'].isnull().sum()))
print('Out of {} persons there are  {} with fewer than 3 months of activity  ... '.format(df['id'].nunique(), sum(df['id'].value_counts() < 3)))

Out of 132100 there are 22732 missing observations ... 
Out of 5224 persons there are  269 with fewer than 3 months of activity  ... 


In [102]:
# excluding the values 
df = df.dropna()

# averaging the cosine similarity 
df = df.groupby(['date', 'is_control'],as_index=False).mean()

In [105]:
df.head()

Unnamed: 0,date,is_control,cosine_similarity_monthly
0,2014-01-31,0,0.686163
1,2014-01-31,1,0.659029
2,2014-02-28,0,0.691126
3,2014-02-28,1,0.664648
4,2014-03-31,0,0.675157


In [103]:
# let's compare the cosine similarity of each user averaged by group 
trace_control = go.Scatter(
                x=df['date'][df['is_control'] == 1],
                y=df['cosine_similarity_monthly'][df['is_control'] == 1],
                name = "Control",
                line = dict(color = cc),
                opacity = 0.8)

trace_bipolar = go.Scatter(
                x=df['date'][df['is_control'] == 0],
                y=df['cosine_similarity_monthly'][df['is_control'] == 0],
                name = "Bipolar",
                line = dict(color = bc),
                opacity = 0.8)

data = [trace_control,trace_bipolar]

layout = dict(
    title = "Cosine Similarity over time",
    xaxis = dict(
        range = ['2014-01-01','2019-03-31']),
    yaxis = dict(
        range = [0.6,0.8])
)

fig = dict(data=data, layout=layout)
iplot(fig)

### Monthly cosine similarity is higher in the bipolar group throughout the observed period.

### Now I will take a closer look at the monthly cosine similarity distribution and look for outliers

In [106]:
# read resampled data 
df = pd.read_pickle("../data/processed/df_resampled_m.pkl")

In [107]:
# number of samples 
n = 100
# random.sample needs a sequence (sets are rejected from Python 3.11 on) 
control = random.sample(list(df['id'][df['is_control'] == 1].unique()),n)
bipolar = random.sample(list(df['id'][df['is_control'] == 0].unique()),n)

data = []
for user in control:
    trace_control = go.Scatter(
                x= df['date'][df['id'] == user],
                y=df['cosine_similarity_monthly'][df['id'] == user],
                name = "Control",
                line = dict(color = cc),
                opacity = 0.2)
    data.append(trace_control)

    
for user in bipolar:
   
    trace_bipolar = go.Scatter(
                x= df['date'][df['id'] == user],
                y=df['cosine_similarity_monthly'][df['id'] == user],
                name = "Bipolar",
                line = dict(color = bc),
                opacity = 0.2)
    
    data.append(trace_bipolar)

layout = dict(
    title = "Individual cosine Similarity over time",
    xaxis = dict(
        range = ['2014-01-01','2019-03-31']),
    showlegend = False
)

fig = dict(data=data, layout=layout)
iplot(fig)

### Most tweets in the bipolar group occur in the last few months. Furthermore, there are fewer outliers in the bipolar group.

## Time-Series Decomposition

### The aim of this part of the notebook was to capture trends and seasonality effects in both groups and compare them 
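
As background, an additive decomposition splits a series into trend + seasonal + residual. The following is a simplified sketch of the classical procedure (centred moving average for the trend, per-month means for the seasonality). It is an illustration written for this notebook, not the statsmodels implementation; the function `additive_decompose` and the synthetic series are assumptions.

```python
import numpy as np

def additive_decompose(y, period=12):
    """Simplified additive decomposition: y = trend + seasonal + resid."""
    y = np.asarray(y, dtype=float)
    # centred moving average; for an even period both end points get half weight
    if period % 2 == 0:
        weights = np.r_[0.5, np.ones(period - 1), 0.5] / period
    else:
        weights = np.ones(period) / period
    half = len(weights) // 2
    trend = np.full(len(y), np.nan)  # the edges stay NaN
    trend[half:len(y) - half] = np.convolve(y, weights, mode="valid")
    detrended = y - trend
    # average the detrended values per position in the cycle, then centre them
    means = np.array([np.nanmean(detrended[i::period]) for i in range(period)])
    means -= means.mean()
    seasonal = np.tile(means, len(y) // period + 1)[:len(y)]
    return trend, seasonal, y - trend - seasonal

# synthetic monthly series: linear trend plus a yearly cycle
t = np.arange(48)
true_trend = 0.65 + 0.001 * t
true_seasonal = 0.01 * np.sin(2 * np.pi * t / 12)
trend, seasonal, resid = additive_decompose(true_trend + true_seasonal, period=12)
```

Because the moving average needs half a period of data on each side, the trend and the residuals are NaN at both ends of the series, which is why residuals have to be passed through `dropna()` before they are summarized.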

In [130]:
import pandas as pd 
from statsmodels.tsa.seasonal import seasonal_decompose
from matplotlib import pyplot
df = pd.read_pickle("../data/processed/df_resampled_m.pkl")
df = df.dropna()

In [131]:
# let's have a look at the bipolar series and prepare it 
df['datetime'] = pd.to_datetime(df['date'])
df = df.set_index('datetime')
df_mean = df.groupby(['date', 'is_control'],as_index=False).mean()

In [132]:
# extract bipolar series 
bipolar = df_mean[df_mean['is_control'] == 0]
series_bipolar = pd.Series(bipolar['cosine_similarity_monthly'].values , index=bipolar['date'])

# creating an additive model for the bipolar group 
# (period=12 months; older statsmodels releases call this argument freq) 
model_bipolar = seasonal_decompose(series_bipolar, model='additive', period=12)
model_fit_bipolar = model_bipolar.trend + model_bipolar.seasonal

# extract control series 
control = df_mean[df_mean['is_control'] == 1]
series_control = pd.Series(control['cosine_similarity_monthly'].values , index=control['date'])

# creating an additive model for the control group 
model_control = seasonal_decompose(series_control, model='additive', period=12)
model_fit_control = model_control.trend + model_control.seasonal

In [133]:
# list of all my series 
all_series = [series_bipolar, model_fit_bipolar, series_control, model_fit_control]
all_colors = [bc, bm, cc, cm]

In [134]:
# only offline plotting is used, so the online plotly.plotly module is not needed 
import plotly.graph_objs as go
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot

init_notebook_mode(connected=True)

data = []

# rewrite this later 
trace_bipolar = go.Scatter(
                x = series_bipolar.index,
                y = series_bipolar.values,
                name = "Bipolar",
                line = dict(color = bc),
                opacity = 0.8)

trace_model_bipolar = go.Scatter(
                x = model_fit_bipolar.index,
                y = model_fit_bipolar.values,
                name = "Bipolar Model",
                line = dict(color = bm),
                opacity = 0.8)

trace_control = go.Scatter(
                x = series_control.index,
                y = series_control.values,
                name = "Control",
                line = dict(color = cc),
                opacity = 0.8)

trace_model_control = go.Scatter(
                x = model_fit_control.index,
                y = model_fit_control.values,
                name = "Control Model",
                line = dict(color = cm),
                opacity = 0.8)

data = [trace_bipolar,trace_model_bipolar,trace_control,trace_model_control]

layout = dict(
    title = "Additive Cosine Similarity Model for Bipolar vs Control Group",
    xaxis = dict(
        range = ['2014-01-01','2019-03-31']),
    yaxis = dict(
        range = [0.6,0.8])
)

fig = dict(data=data, layout=layout)
iplot(fig)

In [118]:
# compare the residual distribution for both groups 
trace_bipolar_model = go.Histogram(
    x=model_bipolar.resid,  # residuals of the bipolar model
    opacity=0.75,
    name = "Bipolar Model",
    marker=dict(
        color= bm,
    )
)
trace_control_model = go.Histogram(
    x=model_control.resid,  # residuals of the control model
    opacity=0.75,
    name = "Control Model",
    marker=dict(
        color= cm,
    )
)

data = [trace_bipolar_model, trace_control_model]
layout = go.Layout(barmode='overlay')

fig = dict(data=data, layout=layout)
iplot(fig)


In [136]:
# calculate the sum squared residuals of our models 
def sse(resid):
    sse = sum(resid**2)
    return sse

print('The sum squared error of the bipolar model is {}'.format(sse(model_bipolar.resid.dropna())))
print('The sum squared error of the control model is {}'.format(sse(model_control.resid.dropna())))

The sum squared error of the bipolar model is 0.0009388215515125074
The sum squared error of the control model is 0.00046416350625885663


In [None]:
# calculate the aic and bic for our models 
#-> probably nonsense since there is only one variable (values over time)
def aic(): 
    pass 

def bic(): 
    pass

### The model achieves a better fit for the control group: its sum of squared residuals (about 0.00046) is roughly half that of the bipolar group (about 0.00094).

In [154]:
# let's have a look at the trend component 
trace_bipolar = go.Scatter(
                x=model_bipolar.trend.index,
                y=model_bipolar.trend.values,
                name = "Bipolar Trend",
                line = dict(color = bm),
                opacity = 0.8)

trace_control = go.Scatter(
                x=model_control.trend.index,
                y=model_control.trend.values,
                name = "Control Trend",
                line = dict(color = cm),
                opacity = 0.8)


data = [trace_bipolar, trace_control]

layout = dict(
    title = "Trend",
    xaxis = dict(
        range = ['2014-01-01','2019-03-31']),
    yaxis = dict(
        range = [0.65,0.75])
)

fig = dict(data=data, layout=layout)
iplot(fig)

### The slope of the trend component seems to be similar for both control and bipolar group
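
One way to back up this visual impression is to fit a straight line to each trend component and compare the slopes. The sketch below assumes evenly spaced monthly values; the helper `trend_slope` and the example series are hypothetical, and in the notebook the function would be applied to `model_bipolar.trend` and `model_control.trend`.

```python
import numpy as np

def trend_slope(trend):
    """Least-squares slope per time step, ignoring NaN edge values."""
    trend = np.asarray(trend, dtype=float)
    x = np.arange(len(trend))
    mask = ~np.isnan(trend)  # the decomposition leaves NaNs at both ends
    slope, _intercept = np.polyfit(x[mask], trend[mask], deg=1)
    return slope

# made-up trend: rises by 0.002 per month, with NaN at both ends
example_trend = np.r_[np.nan, 0.65 + 0.002 * np.arange(1, 11), np.nan]
example_slope = trend_slope(example_trend)
```

The slope is expressed as change in cosine similarity per month, so the two groups can be compared directly.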

In [156]:
# let's have a look at the seasonality component 
trace_bipolar = go.Scatter(
                x=model_bipolar.seasonal.index,
                y=model_bipolar.seasonal.values,
                name = "Bipolar Seasonality",
                line = dict(color = bm),
                opacity = 0.8)

trace_control = go.Scatter(
                x=model_control.seasonal.index,
                y=model_control.seasonal.values,
                name = "Control Seasonality",
                line = dict(color = cm),
                opacity = 0.8)


data = [trace_bipolar, trace_control]

layout = dict(
    title = "Seasonality",
    xaxis = dict(
        range = ['2014-01-01','2019-03-31']),

)

fig = dict(data=data, layout=layout)
iplot(fig)

### Although similar, the seasonality component seems to be more pronounced in the bipolar group in October, December, January and February.

## fin