# Generalists and Specialists: How Language Barriers Affect Activity Diversity on Reddit

We study how the language of user/subreddit affects GS-score. To adhere the conventions from [Waller and Anderson 2019](https://dl.acm.org/doi/10.1145/3308558.3313729),  we will use "community" and "subreddit" interchangeably. Similarly, we will use "user" to mean a Redditor. 

In [3]:
import numpy as np
import pandas as pd
import plotly.express as px
from pandas import Series, DataFrame

## Data Wrangling

In this section, we will compute the necessary data for analysis. Namely, we will compute the following:

- Compute GS-score for each user and community
- Detect language for each comment
- Based on comment language, compute language frequency for each user and community

### Common Communities

We have two source of data: community embeddings from https://github.com/CSSLab/social-dimensions and provided comments/submissions data on Reddit. Since our analysis combines the embeddings with the provided dataset, we will only consider the communities that are in both sources.

To do so, we will first load the communities from the embedding, and filter the `text_comments.csv` dataset to only include communities that are in the embeddings.

In [5]:
# get communities from embedding
import os
SOCIAL_DIMENSIONS = '../social-dimensions'
comms_embd = pd.read_csv(os.path.join(SOCIAL_DIMENSIONS, 'data/embedding-metadata.tsv'), sep='\t', usecols=['community'])['community']
comms_embd.values

array(['keto', 'AskReddit', 'funny', ..., 'barkour', 'wc2010_crests',
       'PigJargon'], dtype=object)

### Language Detection

In this section, we will detect the language of the comments in `text_comments.csv`. For language detection, we use [lingua](https://github.com/pemistahl/lingua-py).

In [6]:
from lingua import Language
languages: dict[Language, str]  = { l: l.iso_code_639_3.name.lower() for l in Language.all() }

The following cell will read `text_comments.csv`, exclude comments whose subreedit is not in the embedding, detect the language based on the comment's body, and finally put the results to `lang_comments.csv`.

We only include comments whose language are detectable. Here are some examples of undetectable comments:

- Removed comments (i.e. the body is `[removed]` or `[deleted]`)
- Emoji only
- URL only

The result will be `lang_comments.csv` with the following columns

<!--
- `id`: unique id
- `score`: score of comment based on upvote and downvotes
- `created_utc`: datetime when the comment waw posted
- `link_id`: id of submission to which this comment belongs
For this part, we take a conservative approach to include a lot of columns. But in later analysis, the most important ones are `author`, `subreddit`, and `lang`.
-->

- `author`: username of comment
- `subreddit`: name of the subreddit the comment was posted in
- `lang`: language of comment, in ISO 639-3 code

see [lang_comments.log](./lang_comments.log) for progross log.

In [None]:
import os

from lingua import LanguageDetectorBuilder

def detect(src_csv: str, dest_csv: str):
    # remove existing file
    if os.path.exists(dest_csv):
        os.remove(dest_csv)
    detector = LanguageDetectorBuilder.from_all_languages().build()
    # placeholder for removed content/username
    unwanted = ["[removed]", "[deleted]"]
    # columns to keep
    columns = ['author', 'subreddit', 'lang']
    comms = set(comms_embd)
    with pd.read_csv(src_csv, chunksize=10 ** 6, lineterminator='\n') as reader:
        for i, chunk in enumerate(reader):
            # only keep comments from communities in embedding
            chunk = chunk[chunk['subreddit'].isin(comms)]
            chunk = chunk.replace(unwanted, None)
            # the detector will return None if text is empty
            chunk['body'] = chunk['body'].fillna('')
            chunk['lang'] = detector.detect_languages_in_parallel_of(chunk['body'])
            chunk = chunk[columns].dropna()
            # map language to ISO 639-3 code
            chunk['lang'] = chunk['lang'].map(languages)
            print(f'Chunk {i:02d}: {len(chunk)} comments')
            chunk.to_csv('lang_comments.csv', mode='a', index=False, header=(i == 0))
detect('text_comments.csv', 'lang_comments.csv')

This will load the `lang_comments.csv` as precomputed data

In [1]:
import pandas as pd
# use Category dtype to save memory
lang_comments = pd.read_csv('lang_comments.csv', dtype={'lang':'category'})
lang_comments

Unnamed: 0,author,subreddit,lang
0,mega_trex,BeautyGuruChatter,eng
1,divadream,BeautyGuruChatter,eng
2,Ziegenkoennenfliegen,BeautyGuruChatter,eng
3,meowrottenralph,BeautyGuruChatter,eng
4,somethingelse19,BeautyGuruChatter,eng
...,...,...,...
29681125,AJTK,SquaredCircle,eng
29681126,imdelirious3,AskReddit,eng
29681127,NocapNightingale,TheStrokes,eng
29681128,jag-engr,ChoosingBeggars,eng


Notice that we do not need the `text_submissions.csv` dataset. This is because our ultimate goal for this data wrangling section is to combine GS-scores with our current Reddit dataset. Since the [paper](http://csslab.cs.toronto.edu/gs/actdiv-www2019.pdf) defines GS-score based on contributions of a user, where contribution means *commenting* in a subreddit, it only makes sense to use only the comment dataset.

### GS Scores

In this section, we will compute the GS-scores for users and communities as per [Waller and Anderson 2019](https://dl.acm.org/doi/10.1145/3308558.3313729)

First, we will filter the embedding since the only useful communities that are the ones that we have in our comments dataset.

In [None]:
embedding = pd.read_csv(os.path.join(SOCIAL_DIMENSIONS, 'data/embedding-vectors.tsv'), sep='\t', header=None)
embedding = embedding.set_index(comms_embd)
# get common communites between embedding and comments
embedding = embedding.loc[list(set(lang_comments['subreddit']) & set(comms_embd))]
embedding.to_csv('embedding.csv')

In [7]:
embedding = pd.read_csv('embedding.csv', index_col=['community'])
embedding

Unnamed: 0_level_0,0,1,2,3,4,5,6,7,8,9,...,140,141,142,143,144,145,146,147,148,149
community,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
whitesox,-0.210966,-0.033527,0.068185,-0.091237,0.077390,-0.023480,-0.634326,0.297105,0.078728,0.184559,...,0.327551,0.098173,-0.051080,-0.252188,0.573846,-0.695682,-0.049100,-0.315903,-0.350931,-0.308789
needforspeed,0.117809,-0.045149,0.026157,-0.046551,0.037208,0.129973,0.128956,0.198981,0.053289,0.090528,...,0.189227,0.163141,0.120499,-0.109910,-0.002825,-0.295848,-0.182379,0.087979,-0.196882,0.123743
singapore,-0.086955,0.042368,-0.458273,0.141884,0.145896,-0.136140,-0.209931,0.287035,0.125812,0.279292,...,0.329905,-0.232483,0.044648,-0.110947,-0.017498,-0.143914,-0.180745,0.101154,0.093565,-0.057554
China,0.027609,0.003552,-0.191009,0.142138,0.106009,-0.014841,0.030333,0.723382,-0.396016,0.032850,...,0.400575,-0.182233,0.073708,-0.167503,0.188529,-0.123957,-0.230072,0.358343,-0.223186,0.108320
CanadaPolitics,-0.044821,0.020423,-0.162500,-0.283616,0.339133,0.006719,-0.260493,0.224549,0.112861,0.097335,...,0.445444,-0.065959,0.350684,0.080559,0.158537,-0.070104,-0.124536,0.446964,-0.072867,0.355092
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
funhaus,0.184801,0.142184,0.005854,0.120801,0.051308,-0.104341,0.256501,0.008896,-0.119677,0.005178,...,0.317293,0.130474,0.381371,-0.579089,0.149820,-0.266531,0.411645,0.294616,-0.137921,0.067997
speedrun,0.279569,0.074087,-0.155210,0.199809,0.063970,-0.084391,0.279174,0.255730,0.060109,0.241046,...,0.248152,0.248942,0.128581,-0.117363,0.102765,-0.205994,-0.158799,0.275427,-0.132525,0.180295
MakingaMurderer,0.088003,0.046263,-0.183474,0.021420,-0.204858,-0.159328,-0.149026,0.559824,-0.110523,0.390340,...,-0.001780,-0.114099,0.154348,-0.198905,0.286602,-0.059701,-0.017783,0.081613,-0.447207,0.356112
GRE,0.050482,-0.045961,-0.345226,0.161670,0.059878,0.016453,0.424253,0.580363,0.034056,-0.046038,...,0.138814,0.006454,-0.043733,-0.325318,0.116789,-0.192167,0.046609,0.336882,-0.176348,0.341897


Given the community embeddings and user contributions, we can calculate the GS-score of a user. According to the paper (3.1), the GS-score of a user $u_i$ is defined as

$$
GS(u_i)=\frac1J\sum_jw_j\frac{\vec c_j\cdot\vec\mu_i}{||\mu_i||}
$$

where

- $\vec c_j$ is the embedding vector of community $c_j$
- $w_j$ is the number of contribitions/comments that user $u_i$ makes in community $c_j$
- $\vec\mu_i$ is the *center of mass* of user $u_i$, that is, $\vec\mu_i=\sum_jw_j\vec c_j$

In other words, the GS-score is the average cosine similarity between $u_i$'s communities and $u_i$'s center of mass.

In [None]:
from sklearn.metrics.pairwise import cosine_similarity

total = lang_comments['author'].nunique()
i = 0

def gs_user(group: DataFrame):
    """
    returns GS-score of a user, given that group is grouped by author
    """
    comms = group['subreddit']
    w: Series = comms.value_counts()
    c = embedding.loc[comms.unique()]
    μ = c.mul(w, axis=0).sum().values
    global i
    i += 1
    if i % 1000 == 0:
        print(f'{i}/{total}: {i/total:.2%}')
    return cosine_similarity(c, μ.reshape(1, -1)).mean()

uscores = lang_comments[['author','subreddit']].groupby('author').apply(gs_user)
uscores.name = 'uscore'
uscores.to_csv('uscores.csv')

In [2]:
uscores = pd.read_csv('uscores.csv', index_col='author')['uscore']
uscores

author
------------------GL    1.000000
------------------O     1.000000
------------------f     0.740005
--------------Emkay     0.686527
-------------0          1.000000
                          ...   
zzzzzzzzzzzzccccccgg    0.687017
zzzzzzzzzzzzvzzzzvzz    0.813356
zzzzzzzzzzzzzs          1.000000
zzzzzzzzzzzzzzzzspaf    0.674736
zzzzzzzzzzzzzzzzzu      1.000000
Name: uscore, Length: 6911138, dtype: float64

With user scores, we can then calculate the community score (3.2). The GS-score of a community $c_i$ is the weight average over its users.

$$
GS(c_i)=\frac1N\sum_jw_jGS(u_j)
$$

In [None]:
total = lang_comments['subreddit'].nunique()
i = 0

def gs_comm(data: DataFrame):
    """
    return GS-score of a community, given that data is grouped by subreddit
    """
    w = data.groupby('author').apply('count')['subreddit']
    gs = uscores.loc[data['author'].unique()]
    global i
    i += 1
    if i % 100 == 0:
        print(f'{i}/{total}: {i/total:.2%}')
    return np.average(gs, weights=w)

cscores = lang_comments[['author','subreddit']].groupby('subreddit').apply(gs_comm)
cscores.name = 'cscore'
cscores.to_csv('cscores.csv')

In [3]:
cscores = pd.read_csv('cscores.csv', index_col='subreddit')['cscore']
cscores

subreddit
1200isplenty    0.806710
13ReasonsWhy    0.802825
13or30          0.719325
195             0.741930
2007scape       0.821024
                  ...   
youtubers       0.838359
yugioh          0.819044
zelda           0.767611
zen             0.812143
zerocarb        0.771824
Name: cscore, Length: 3183, dtype: float64

### Language Frequency

Lastly, we will calculate the frequency for each user and community.

In [None]:
user_langs = lang_comments.pivot_table(index='author', columns='lang', fill_value=0, aggfunc=len).astype(np.float32)
user_langs = user_langs.div(user_langs.sum(axis=1), axis=0)
user_langs.to_csv('user_langs.csv')

In [8]:
user_langs = pd.read_csv('user_langs.csv', index_col='author')
user_langs

Unnamed: 0_level_0,afr,ara,aze,bel,bos,bul,cat,ces,cym,dan,...,hin,kat,urd,ben,guj,pan,hye,tam,mar,tel
author,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
------------------GL,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
------------------O,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
------------------f,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
--------------Emkay,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
-------------0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
zzzzzzzzzzzzccccccgg,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
zzzzzzzzzzzzvzzzzvzz,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
zzzzzzzzzzzzzs,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
zzzzzzzzzzzzzzzzspaf,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [None]:
comm_langs = lang_comments[['subreddit', 'lang']].pivot_table(index='subreddit', columns='lang', fill_value=0, aggfunc=len).astype(np.float32)
comm_langs = comm_langs.div(comm_langs.sum(axis=1), axis=0)
comm_langs.to_csv('comm_langs.csv')

In [9]:
comm_langs = pd.read_csv('comm_langs.csv', index_col='subreddit')
comm_langs

Unnamed: 0_level_0,afr,ara,aze,bel,bos,bul,cat,ces,cym,dan,...,hin,kat,urd,ben,guj,pan,hye,tam,mar,tel
subreddit,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1200isplenty,0.002959,0.00000,0.000000,0.0,0.000423,0.000000,0.001479,0.000634,0.001479,0.000634,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
13ReasonsWhy,0.001788,0.00000,0.000000,0.0,0.001788,0.000000,0.000000,0.000000,0.002384,0.000596,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
13or30,0.004069,0.00000,0.000000,0.0,0.000904,0.000000,0.000904,0.002260,0.009946,0.004069,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
195,0.009174,0.00000,0.000917,0.0,0.002752,0.000000,0.002752,0.000917,0.016514,0.008257,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2007scape,0.002756,0.00002,0.000200,0.0,0.000479,0.000000,0.001757,0.000519,0.005312,0.002336,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
youtubers,0.000422,0.00000,0.000000,0.0,0.000845,0.000422,0.000422,0.000000,0.000845,0.002111,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
yugioh,0.001566,0.00000,0.000149,0.0,0.000522,0.000000,0.001268,0.000373,0.003356,0.002088,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
zelda,0.003019,0.00000,0.000483,0.0,0.000604,0.000000,0.000725,0.000604,0.004106,0.002053,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
zen,0.003234,0.00000,0.000000,0.0,0.000606,0.000000,0.001011,0.000809,0.003436,0.002021,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


This is all the data wrangling we need. For the analysis, see [index.ipynb](./index.ipynb).