## Question 4

Is gendered toxic speech more prevalent on some subreddits than on others?


#### Operationalized Questions

Do some subreddits have a higher frequency of toxic speech mentioning females?

Which subreddits have the most toxic speech with female identifiers, and which have the most toxic speech with male identifiers?

Consider examining by comparing hate, offensive, and neither classifications as well as exploring the Detoxify scores.


#### Hypothesis

Reddit has a wide variety of subreddits (micro communities). Toxic speech that includes female identifiers is more prevalent on some subreddits than on others.


#### Results
Our results show that toxic speech including female identifiers is more prevalent across subreddits than toxic speech including male identifiers. When we kept only subreddits with high toxicity scores (mean score at or above the overall median, quantile=0.5, on every Detoxify metric), the subreddit count was 2099 for comments with female identifiers, versus 31 for comments with male identifiers. The top ten subreddits by total number of comments with female identifiers also contained much higher proportions of toxic speech than those with male identifiers. The top ten list contains subreddits with adult themes (rapefantasies, MisogynisticLife), but also sports (Cricket, soccer) and those intended for a more general audience (AskReddit, antiwork). Welch's t-test comparing the high-toxicity subreddits for female and male identifiers gave (t-statistic=20.2661, p-value=2.5911e-19), leading us to reject the null hypothesis that there is no difference between the means of the two groups.

When comparing subreddits by class label instead of Detoxify scores, we found far more subreddits containing comments with female identifiers than with male identifiers. With female identifiers, 5016 subreddits had comments classified as “hate” and 6381 had comments classified as “offensive”, whereas with male identifiers the counts were 103 and 212 respectively. We ran Welch's t-test comparing the subreddits with female or male comments classified as hate or offensive. The hate comparison gave (t-statistic=-3.4180, p-value=0.0009) and the offensive comparison gave (t-statistic=-6.7197, p-value=1.4771e-10), leading us to reject the null hypothesis that there is no difference between the means of the two groups in each case.





In [None]:
import pandas as pd
import numpy as np
from scipy.stats import ttest_ind
import altair as alt

In [None]:
c = pd.read_csv('../data/comments.csv')

#### Initiate T-test function 

This will help us look for any potential relationships between gender identifiers and higher toxicity in specific subreddits.

In [None]:
def collect_test_statistics(dataset_1, dataset_2, alpha=0.05):
    # Compare the 'proportion' column of two per-subreddit DataFrames with a t-test
    ttest_df = pd.DataFrame(columns=['variable','t-statistic','p-value','H0','variance'])

    variables = ['proportion']
    ttest_df['variable'] = variables

    for variable in variables:
        ttest_x = dataset_1[variable]
        ttest_y = dataset_2[variable]
        row = ttest_df['variable'] == variable

        # Welch's t-test (equal_var=False) unless the sample variances match exactly
        equal_var = np.var(ttest_x) == np.var(ttest_y)
        ttest = ttest_ind(ttest_x, ttest_y, nan_policy='omit', equal_var=equal_var)

        ttest_df.loc[row, 'variance'] = 'equal' if equal_var else 'unequal'
        ttest_df.loc[row, 't-statistic'] = ttest[0]
        ttest_df.loc[row, 'p-value'] = ttest[1]
        # A p-value above alpha means we fail to reject H0; H0 is never "accepted"
        ttest_df.loc[row, 'H0'] = 'fail to reject' if ttest[1] > alpha else 'reject'

    return ttest_df
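
As a minimal sketch of what the function above delegates to SciPy, the snippet below runs Welch's t-test on two synthetic samples. The samples, their sizes, and the variable names are made up for illustration and are not drawn from the comments data. It also shows that the sign of the t-statistic depends on the order in which the two groups are passed.

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)

# Hypothetical per-subreddit proportions for two groups of unequal size and spread
group_a = rng.beta(8, 2, size=200)  # proportions centred near 0.8
group_b = rng.beta(2, 8, size=30)   # proportions centred near 0.2

# equal_var=False selects Welch's t-test, which does not assume equal variances
result = ttest_ind(group_a, group_b, equal_var=False, nan_policy='omit')
t_stat, p_value = result.statistic, result.pvalue

# Swapping the arguments flips the sign of the statistic but not the p-value
swapped = ttest_ind(group_b, group_a, equal_var=False, nan_policy='omit')

print(f"t = {t_stat:.4f}, p = {p_value:.4e}")
print(f"swapped t = {swapped.statistic:.4f}")
```

A positive statistic means the first sample has the larger mean, so the order of the two arguments matters when reading the sign of a reported t-statistic.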


#### Initiate find_toxic_subreddits_detoxify_gender function

The purpose of this function is to find the subreddits mentioning gender identifiers with a higher proportion of toxic language, based on the Detoxify metrics. Because what constitutes "high toxicity" is relative, the quantile parameter lets the user set a minimum threshold for the toxicity metrics. Subreddits beginning with "u_" are user profiles where only the user can make submissions, and have therefore been removed from the analysis.
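
The thresholding idea can be sketched on a toy DataFrame. The column names mirror the notebook, but the rows, the two-metric subset, and the names `toy`, `metrics`, `thresholds`, `means` and `high` are assumptions made for illustration only.

```python
import pandas as pd

# Toy comments table; values are invented for illustration
toy = pd.DataFrame({
    'subreddit': ['u_someone', 'a', 'a', 'b', 'b', 'c'],
    'toxicity':  [0.9, 0.8, 0.6, 0.1, 0.2, 0.5],
    'insult':    [0.9, 0.7, 0.5, 0.1, 0.1, 0.05],
})
metrics = ['toxicity', 'insult']

# One threshold per metric: the chosen quantile over all comments
thresholds = toy[metrics].quantile(0.5)

# Drop user-profile subreddits (names starting with "u_")
filtered = toy[~toy['subreddit'].str.startswith('u_')]

# Mean score per subreddit, then keep subreddits at or above the
# threshold on every metric
means = filtered.groupby('subreddit')[metrics].mean()
high = means[means.ge(thresholds, axis='columns').all(axis=1)]

print(high)
```

With `quantile=0.0` the threshold is the minimum score, so every subreddit passes; raising it towards 1.0 keeps only the subreddits with the highest mean scores.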

In [None]:
def find_toxic_subreddits_detoxify_gender(gender='female', quantile=0.0, print_subreddit_count=True):
    
    gender_df = c[c.gender == gender]
    gender_df = gender_df.drop(gender_df[gender_df.subreddit.str.contains('^u_', regex=True)==True].index)
    gender_df = gender_df[['subreddit', 'toxicity','severe_toxicity','obscene','identity_attack', 'insult', 'threat', 'sexual_explicit']]
    gender_grouped_on_subreddit = gender_df.groupby('subreddit').mean()
    
    most_toxic_df = gender_grouped_on_subreddit[(gender_grouped_on_subreddit.toxicity >= c.toxicity.quantile(q=quantile)) & (
        gender_grouped_on_subreddit.severe_toxicity >= c.severe_toxicity.quantile(q=quantile)) & (
        gender_grouped_on_subreddit.obscene >= c.obscene.quantile(q=quantile)) & (
        gender_grouped_on_subreddit.identity_attack >= c.identity_attack.quantile(q=quantile)) & (
        gender_grouped_on_subreddit.insult >= c.insult.quantile(q=quantile)) & (
        gender_grouped_on_subreddit.threat >= c.threat.quantile(q=quantile)) & (
        gender_grouped_on_subreddit.sexual_explicit >= c.sexual_explicit.quantile(q=quantile))]

    toxic_subreddit_list = list(most_toxic_df.index)
    if print_subreddit_count == True:
        print('Toxic '+gender+' subreddit count: '+str(len(toxic_subreddit_list)))

    gen_df = c[c.gender == gender]
    test_df = pd.DataFrame(columns=['toxic_count'])
    test_df['toxic_count'] = gen_df[gen_df.subreddit.isin(toxic_subreddit_list)].groupby('subreddit').count().body
    test_df['total_count'] = c[c.subreddit.isin(toxic_subreddit_list)].groupby('subreddit').count().body
    test_df['proportion'] = gen_df[gen_df.subreddit.isin(toxic_subreddit_list)].groupby('subreddit').count().body/c[c.subreddit.isin(toxic_subreddit_list)].groupby('subreddit').count().body
    
    return test_df


In [None]:
find_toxic_subreddits_detoxify_gender(gender='female', quantile=0.5, print_subreddit_count=True).head(3)
# Uncomment to return the full list of Subreddits
#list(find_toxic_subreddits_detoxify_gender(gender='female', quantile=0.5, print_subreddit_count=False).index)

Toxic female subreddit count: 2099


| subreddit | toxic_count | total_count | proportion |
|---|---|---|---|
| 2000sGirls | 12 | 13 | 0.923077 |
| 420 | 3 | 4 | 0.75 |
| 502GoneWild | 1 | 1 | 1.0 |


In [None]:
female_toxicity_sorted = find_toxic_subreddits_detoxify_gender(gender='female', quantile=0.0, print_subreddit_count=False).sort_values('total_count', ascending=False)[:10].sort_values('proportion', ascending=False).reset_index()
alt.Chart(female_toxicity_sorted).mark_bar(color='#bb99e0').encode(
    x = alt.X('proportion', title='Proportion of Toxic Language',scale = alt.Scale(domain=[0,1])),
    y=alt.Y('subreddit',sort=['proportion'], title='')
).properties(
    title={
      'text': ['Top Ten Subreddits with a High Frequency of', 'Toxic Speech including Female Identifiers'],
               
      'subtitle': ['Sorted by Total Number of Comments and', 'Proportions of Toxic Language' ], 
      'color': 'black',
      'anchor': 'middle',
      'fontSize': 15
    }
)


In [None]:
find_toxic_subreddits_detoxify_gender(gender='male', quantile=0.5, print_subreddit_count=True).head(3)
# Uncomment to return the full list of Subreddits
#list(find_toxic_subreddits_detoxify_gender(gender='male', quantile=0.5, print_subreddit_count=False).index)

Toxic male subreddit count: 31


| subreddit | toxic_count | total_count | proportion |
|---|---|---|---|
| BlackTwinks | 1 | 2 | 0.5 |
| DadsAndBoys | 1 | 39 | 0.025641 |
| Episode | 1 | 19 | 0.052632 |


In [None]:
male_toxicity_sorted = find_toxic_subreddits_detoxify_gender(gender='male', quantile=0.0, print_subreddit_count=False).sort_values('total_count', ascending=False)[:10].sort_values('proportion', ascending=False).reset_index()
alt.Chart(male_toxicity_sorted).mark_bar(color='#8edc7c').encode(
    x = alt.X('proportion', title='Proportion of Toxic Language',scale = alt.Scale(domain=[0,1])),
    y=alt.Y('subreddit',sort=['proportion'], title='')
).properties(
    title={
      'text': ['Top Ten Subreddits with a High Frequency of', 'Toxic Speech including Male Identifiers'],
               
      'subtitle': ['Sorted by Total Number of Comments and', 'Proportions of Toxic Language' ], 
      'color': 'black',
      'anchor': 'middle',
      'fontSize': 15
    }
)

In [None]:
collect_test_statistics(
    find_toxic_subreddits_detoxify_gender(gender='female', quantile=0.5, print_subreddit_count=True),
    find_toxic_subreddits_detoxify_gender(gender='male', quantile=0.5, print_subreddit_count=True))

Toxic female subreddit count: 2099
Toxic male subreddit count: 31


| variable | t-statistic | p-value | H0 | variance |
|---|---|---|---|---|
| proportion | 20.266096 | 0.0 | reject | unequal |


#### Initiate find_toxic_subreddits_class_label_gender function

Similar to the above function, this function extracts the subreddits mentioning gender identifiers, but based on their classification label as 'hate' or 'offensive' instead of toxicity scores. Unlike the function above, it takes no quantile parameter: it returns every subreddit that has at least one comment with the chosen label and gender, together with the proportion of that gender's comments carrying the label. Subreddits beginning with "u_" are user profiles where only the user can make submissions, and have therefore been removed from the analysis.
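
A minimal sketch of the per-subreddit proportion calculation, using a made-up table. The `davidson_label` coding (0 = hate) follows the notebook; the rows and the names `toy`, `summary` and `with_label` are hypothetical.

```python
import pandas as pd

# Toy comments table; 0 = hate, 1 = offensive, 2 = neither
toy = pd.DataFrame({
    'subreddit': ['a', 'a', 'a', 'b', 'b', 'c'],
    'davidson_label': [0, 1, 0, 2, 0, 1],
})

# Comments per subreddit, and hate-labelled comments per subreddit
total = toy.groupby('subreddit').size()
hate = toy[toy['davidson_label'] == 0].groupby('subreddit').size()

# Building the frame from both Series aligns them on the subreddit index;
# subreddits with no hate comments get NaN in the count column
summary = pd.DataFrame({'hate_comment_count': hate,
                        'total_sampled_comments': total})
summary['proportion'] = summary['hate_comment_count'] / summary['total_sampled_comments']

# Keeping only subreddits with at least one labelled comment appears to match
# what the notebook's function returns
with_label = summary.dropna(subset=['hate_comment_count'])

print(with_label)
```

Because subreddits without any comment of the chosen label are dropped, the proportions being compared are never zero, which is worth keeping in mind when reading the group means.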

In [None]:
def find_toxic_subreddits_class_label_gender(class_label=0, gender='female', print_subreddit_count=True):
    
    filtered_c = c.drop(c[c.subreddit.str.contains('^u_', regex=True)==True].index)
    toxic_subreddit_count = filtered_c[(filtered_c.davidson_label == class_label) & (filtered_c.gender == gender)]
    toxic_subreddit_count = toxic_subreddit_count.groupby('subreddit').count().body
    subreddits_count = filtered_c[filtered_c.gender == gender].groupby('subreddit').count().body
    
    if class_label==0:
        label='hate'
    elif class_label==1:
        label='offensive'
    else:
        label='neither_hate_nor_offensive'
        
    df = pd.DataFrame(columns=[label+'_comment_count'])
    df[label+'_comment_count']=toxic_subreddit_count
    df['total_sampled_comments']=subreddits_count
    df['proportion']=toxic_subreddit_count/subreddits_count
    
    if print_subreddit_count == True:
        print(label.capitalize()+' subreddit count with '+ gender+' identifiers: '+str(len(df)))

    return df


In [None]:
find_toxic_subreddits_class_label_gender(class_label=0, gender='female', print_subreddit_count=True).head(3)
# Uncomment to return the full list of Subreddits
#list(find_toxic_subreddits_class_label_gender(class_label=0, gender='female', print_subreddit_count=False).index)

Hate subreddit count with female identifiers: 5016


| subreddit | hate_comment_count | total_sampled_comments | proportion |
|---|---|---|---|
| 1000lbsisters | 5 | 27 | 0.185185 |
| 18nsfw | 2 | 16 | 0.125 |
| 196 | 51 | 244 | 0.209016 |


In [None]:
female_hate_sorted = find_toxic_subreddits_class_label_gender(class_label=0, gender='female', print_subreddit_count=False).sort_values('total_sampled_comments', ascending=False)[:10].sort_values('proportion', ascending=False).reset_index()
alt.Chart(female_hate_sorted).mark_bar(color='#bb99e0').encode(
    x = alt.X('proportion', title='Proportion of Toxic Language',scale = alt.Scale(domain=[0,1])),
    y=alt.Y('subreddit',sort=['proportion'], title='')
).properties(
    title={
      'text': ['Top Ten Subreddits with a High Frequency of', 'Hate Speech including Female Identifiers'],
      'subtitle': ['Sorted by Highest Total Number of Sampled ', 'Comments and Proportions of Toxic Language' ],
      'color': 'black',
      'anchor': 'middle',
      'fontSize': 15
    }
)

In [None]:
find_toxic_subreddits_class_label_gender(class_label=1, gender='female', print_subreddit_count=True).head(3)
# Uncomment to return the full list of Subreddits
#list(find_toxic_subreddits_class_label_gender(class_label=1, gender='female', print_subreddit_count=False).index)

Offensive subreddit count with female identifiers: 6381


| subreddit | offensive_comment_count | total_sampled_comments | proportion |
|---|---|---|---|
| 1000lbsisters | 9 | 27 | 0.333333 |
| 13ReasonsWhy | 1 | 1 | 1.0 |
| 1819club | 3 | 5 | 0.6 |


In [None]:
female_offensive_sorted = find_toxic_subreddits_class_label_gender(class_label=1, gender='female', print_subreddit_count=False).sort_values('total_sampled_comments', ascending=False)[:10].sort_values('proportion', ascending=False).reset_index()
alt.Chart(female_offensive_sorted).mark_bar(color='#bb99e0').encode(
    x = alt.X('proportion', title='Proportion of Toxic Language',scale = alt.Scale(domain=[0,1])),
    y=alt.Y('subreddit',sort=['proportion'], title='')
).properties(
    title={
      'text': ['Top Ten Subreddits with a High Frequency of', 'Offensive Speech including Female Identifiers'],
      'subtitle': ['Sorted by Highest Total Number of Sampled ', 'Comments and Proportions of Toxic Language' ],
      'color': 'black',
      'anchor': 'middle',
      'fontSize': 15
    }
)

In [None]:
find_toxic_subreddits_class_label_gender(class_label=0, gender='male',  print_subreddit_count=True).head(3)
# Uncomment to return the full list of Subreddits
#list(find_toxic_subreddits_class_label_gender(class_label=0, gender='male', print_subreddit_count=False).index)

Hate subreddit count with male identifiers: 103


| subreddit | hate_comment_count | total_sampled_comments | proportion |
|---|---|---|---|
| AreTheStraightsOK | 1 | 2 | 0.5 |
| AskMen | 1 | 7 | 0.142857 |
| AskReddit | 3 | 41 | 0.073171 |


In [None]:
male_hate_sorted = find_toxic_subreddits_class_label_gender(class_label=0, gender='male', print_subreddit_count=False).sort_values('total_sampled_comments', ascending=False)[:10].sort_values('proportion', ascending=False).reset_index()
alt.Chart(male_hate_sorted).mark_bar(color='#8edc7c').encode(
    x = alt.X('proportion', title='Proportion of Toxic Language',scale = alt.Scale(domain=[0,1])),
    y=alt.Y('subreddit',sort=['proportion'], title='')
).properties(
    title={
      'text': ['Top Ten Subreddits with a High Frequency of', 'Hate Speech including Male Identifiers'],
      'subtitle': ['Sorted by Highest Total Number of Sampled ', 'Comments and Proportions of Toxic Language' ],
      'color': 'black',
      'anchor': 'middle',
      'fontSize': 15
    }
)

In [None]:
find_toxic_subreddits_class_label_gender(class_label=1, gender='male', print_subreddit_count=True).head(3)

Offensive subreddit count with male identifiers: 212


| subreddit | offensive_comment_count | total_sampled_comments | proportion |
|---|---|---|---|
| 196 | 1 | 2 | 0.5 |
| 49ers | 1 | 1 | 1.0 |
| ANRime | 1 | 1 | 1.0 |


In [None]:
male_offensive_sorted = find_toxic_subreddits_class_label_gender(class_label=1, gender='male', print_subreddit_count=False).sort_values('total_sampled_comments', ascending=False)[:10].sort_values('proportion', ascending=False).reset_index()
alt.Chart(male_offensive_sorted).mark_bar(color='#8edc7c').encode(
    x = alt.X('proportion', title='Proportion of Toxic Language',scale = alt.Scale(domain=[0,1])),
    y=alt.Y('subreddit',sort=['proportion'], title='')
).properties(
    title={
      'text': ['Top Ten Subreddits with a High Frequency of', 'Offensive Speech including Male Identifiers'],
      'subtitle': ['Sorted by Highest Total Number of Sampled ', 'Comments and Proportions of Toxic Language' ], 
      'color': 'black',
      'anchor': 'middle',
      'fontSize': 15
    }
)

In [None]:
collect_test_statistics(
    find_toxic_subreddits_class_label_gender(class_label=0, gender='female', print_subreddit_count=True),
    find_toxic_subreddits_class_label_gender(class_label=0, gender='male', print_subreddit_count=True))


Hate subreddit count with female identifiers: 5016
Hate subreddit count with male identifiers: 103


| variable | t-statistic | p-value | H0 | variance |
|---|---|---|---|---|
| proportion | -3.417982 | 0.000898 | reject | unequal |


In [None]:
collect_test_statistics(
    find_toxic_subreddits_class_label_gender(class_label=1, gender='female', print_subreddit_count=True),
    find_toxic_subreddits_class_label_gender(class_label=1, gender='male', print_subreddit_count=True))

Offensive subreddit count with female identifiers: 6381
Offensive subreddit count with male identifiers: 212


| variable | t-statistic | p-value | H0 | variance |
|---|---|---|---|---|
| proportion | -6.719712 | 0.0 | reject | unequal |


## Question 7

Is toxic speech more prevalent on some subreddits than on others?


#### Operationalized Questions

Do some subreddits have a higher frequency of
* hate or offensive speech?
* toxic language (specific measures from Detoxify)?

note: compare frequency of hate, offensive, and neutral speech in the subreddits with the most hate speech.

Also compare subreddits with the most hate speech with the subreddits that have the most offensive speech.


#### Hypothesis

Reddit has a wide variety of subreddits (micro communities). Toxic speech is more prevalent on some subreddits than on others.


#### Results

When comparing subreddits by their high toxicity scores (mean score at or above the overall median for each of the Detoxify metrics), the subreddit counts were as follows: hate speech: 1687, offensive speech: 1359, neither hate nor offensive speech: 1100. Running Welch's t-test on each pair of these groups, we got the following results:

* Hate versus offensive speech: (t-statistic=-7.9508, p-value=2.6402e-15)
* Hate speech versus neither: (t-statistic=-32.9205, p-value=3.9065e-200)
* Offensive speech versus neither: (t-statistic=-22.3680, p-value=5.6182e-101)

In each case, we can reject the null hypothesis that there is no difference between the means of the two groups.
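
The three pairwise comparisons can be run in a loop rather than as separate calls. The sketch below assumes three synthetic samples standing in for the per-subreddit proportions of each class; the data and the names `groups` and `results` are hypothetical.

```python
from itertools import combinations

import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(1)

# Synthetic per-subreddit proportions for each class label (illustrative only)
groups = {
    'hate': rng.beta(2, 8, size=150),
    'offensive': rng.beta(3, 6, size=120),
    'neither': rng.beta(6, 3, size=100),
}

# Welch's t-test for every unordered pair of groups
results = {}
for name_a, name_b in combinations(groups, 2):
    res = ttest_ind(groups[name_a], groups[name_b], equal_var=False)
    results[(name_a, name_b)] = (res.statistic, res.pvalue)

for pair, (t_stat, p_value) in results.items():
    print(f"{pair[0]} vs {pair[1]}: t = {t_stat:.4f}, p = {p_value:.4e}")
```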


When comparing subreddits based on comment classification alone, without taking into account the Detoxify metrics, the subreddit counts were as follows: hate speech: 5826, offensive speech: 7588, neither hate nor offensive speech: 10,227. Again running Welch's t-test on each pair of these groups, we got the following results:

* Hate versus offensive speech: (t-statistic=-21.1521, p-value=1.6750e-97)
* Hate speech versus neither: (t-statistic=-54.1445, p-value=0.0)
* Offensive speech versus neither: (t-statistic=-34.4237, p-value=1.7924e-250)

In each case, we can again reject the null hypothesis that there is no difference between the means of the two groups.
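
The result tables in this notebook show several p-values as 0.0, while the text reports them in scientific notation. This is presumably a display-rounding effect rather than an exact zero. The sketch below uses two p-values quoted in the text above; the six-decimal rounding is an assumption about how the tables were rendered.

```python
# Two p-values quoted in the results text above
p_values = {
    'hate vs offensive': 2.6402e-15,
    'hate vs neither': 3.9065e-200,
}

# Rounding to six decimal places (assumed display precision) gives 0.0
rounded = {name: round(p, 6) for name, p in p_values.items()}

# Scientific notation keeps the magnitude visible
scientific = {name: f"{p:.4e}" for name, p in p_values.items()}

for name in p_values:
    print(f"{name}: rounded = {rounded[name]}, scientific = {scientific[name]}")
```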




#### Initiate find_toxic_subreddits_detoxify function

This function finds the subreddits with a higher proportion of toxic language, based on the Detoxify metrics, but without taking into account gender identifiers. As in the gendered Detoxify function above, the quantile parameter lets the user set a minimum threshold for the toxicity metrics. Subreddits beginning with "u_" are user profiles where only the user can make submissions, and have therefore been removed from the analysis.

In [None]:
def find_toxic_subreddits_detoxify(class_label=0, quantile=0.0, print_subreddit_count=True):
    
    filtered_c = c.drop(c[c.subreddit.str.contains('^u_', regex=True)==True].index)
    filtered_c = filtered_c[filtered_c.davidson_label == class_label]
    toxic_subreddits = filtered_c[['subreddit', 'toxicity','severe_toxicity','obscene','identity_attack', 'insult', 'threat', 'sexual_explicit']]

    toxic_subreddits = toxic_subreddits.groupby('subreddit').mean()

    most_toxic_df = toxic_subreddits[(toxic_subreddits.toxicity >= c.toxicity.quantile(q=quantile)) & (
        toxic_subreddits.severe_toxicity >= c.severe_toxicity.quantile(q=quantile)) & (
        toxic_subreddits.obscene >= c.obscene.quantile(q=quantile)) & (
        toxic_subreddits.identity_attack >= c.identity_attack.quantile(q=quantile)) & (
        toxic_subreddits.insult >= c.insult.quantile(q=quantile)) & (
        toxic_subreddits.threat >= c.threat.quantile(q=quantile)) & (
        toxic_subreddits.sexual_explicit >= c.sexual_explicit.quantile(q=quantile))]
    
    toxic_subreddit_list = list(most_toxic_df.index)
    
    if class_label==0:
        label='hate'
    elif class_label==1:
        label='offensive'
    else:
        label='neither_hate_nor_offensive'

        
    df = pd.DataFrame(columns=[label+'_comment_count'])

    df[label+'_comment_count']= filtered_c[filtered_c.subreddit.isin(toxic_subreddit_list)].groupby('subreddit').count().body
    df['total_sampled_comments']=c[c.subreddit.isin(toxic_subreddit_list)].groupby('subreddit').count().body
    df['proportion']=filtered_c[filtered_c.subreddit.isin(toxic_subreddit_list)].groupby('subreddit').count().body/c[c.subreddit.isin(toxic_subreddit_list)].groupby('subreddit').count().body
    
    if print_subreddit_count == True:
        print(label.capitalize()+' subreddit count based on Detoxify scores: '+str(len(df)))
    
    return df


In [None]:
find_toxic_subreddits_detoxify(class_label=0, quantile=0.5, print_subreddit_count=True).head(3)

Hate subreddit count based on Detoxify scores: 1687


| subreddit | hate_comment_count | total_sampled_comments | proportion |
|---|---|---|---|
| 1819club | 1 | 7 | 0.142857 |
| 18nsfw | 2 | 17 | 0.117647 |
| 2000sGirls | 1 | 13 | 0.076923 |


In [None]:
label_detoxify_sorted_h = find_toxic_subreddits_detoxify(class_label=0, quantile=0.5, print_subreddit_count=False).sort_values('total_sampled_comments', ascending=False)[:10].sort_values('proportion', ascending=False).reset_index()
alt.Chart(label_detoxify_sorted_h).mark_bar(color='#f2a3b6').encode(
    x = alt.X('proportion', title='Proportion of Toxic Language',scale = alt.Scale(domain=[0,1])),
    y=alt.Y('subreddit',sort=['proportion'], title='')
).properties(
    title={
      'text': ['Top Ten Subreddits with a High ', 'Frequency of Hate Speech'],
      'subtitle': ['Sorted by Proportion of Toxic Language', 'with Above Average Toxicity Scores' ], 
      'color': 'black',
      'anchor': 'middle',
      'fontSize': 15
    }
)

In [None]:
find_toxic_subreddits_detoxify(class_label=1, quantile=0.5, print_subreddit_count=True).head(3)

Offensive subreddit count based on Detoxify scores: 1359


| subreddit | offensive_comment_count | total_sampled_comments | proportion |
|---|---|---|---|
| 2000sGirls | 4 | 13 | 0.307692 |
| 20SomethingsAustin | 1 | 4 | 0.25 |
| 2MiddleEast4you | 7 | 45 | 0.155556 |


In [None]:
label_detoxify_sorted_o = find_toxic_subreddits_detoxify(class_label=1, quantile=0.5, print_subreddit_count=False).sort_values('total_sampled_comments', ascending=False)[:10].sort_values('proportion', ascending=False).reset_index()
alt.Chart(label_detoxify_sorted_o).mark_bar(color='#f7cd70').encode(
    x = alt.X('proportion', title='Proportion of Toxic Language',scale = alt.Scale(domain=[0,1])),
    y=alt.Y('subreddit',sort=['proportion'], title='')
).properties(
    title={
      'text': ['Top Ten Subreddits with a High ', 'Frequency of Offensive Speech'],
      'subtitle': ['Sorted by Proportion of Toxic Language', 'with Above Average Toxicity Scores' ], 
      'color': 'black',
      'anchor': 'middle',
      'fontSize': 15
    }
)

In [None]:
find_toxic_subreddits_detoxify(class_label=2, quantile=0.5, print_subreddit_count=True).head(3)

Neither_hate_nor_offensive subreddit count based on Detoxify scores: 1100


| subreddit | neither_hate_nor_offensive_comment_count | total_sampled_comments | proportion |
|---|---|---|---|
| 2000sGirls | 8 | 13 | 0.615385 |
| 5050pics | 1 | 1 | 1.0 |
| 757sexywives | 1 | 1 | 1.0 |


In [None]:
label_detoxify_sorted_n = find_toxic_subreddits_detoxify(class_label=2, quantile=0.5, print_subreddit_count=False).sort_values('total_sampled_comments', ascending=False)[:10].sort_values('proportion', ascending=False).reset_index()
alt.Chart(label_detoxify_sorted_n).mark_bar(color='#63a671').encode(
    x = alt.X('proportion', title='Proportion of Toxic Language',scale = alt.Scale(domain=[0,1])),
    y=alt.Y('subreddit',sort=['proportion'], title='')
).properties(
    title={
      'text': ['Top Ten Subreddits with a High ', 'Frequency of Neither Hate nor Offensive Speech'],
      'subtitle': ['Sorted by Proportion of Toxic Language', 'with Above Average Toxicity Scores' ], 
      'color': 'black',
      'anchor': 'middle',
      'fontSize': 15
    }
)

In [None]:
collect_test_statistics(
    find_toxic_subreddits_detoxify(class_label=0, quantile=0.5, print_subreddit_count=True),
    find_toxic_subreddits_detoxify(class_label=1, quantile=0.5, print_subreddit_count=True))

Hate subreddit count based on Detoxify scores: 1687
Offensive subreddit count based on Detoxify scores: 1359


| variable | t-statistic | p-value | H0 | variance |
|---|---|---|---|---|
| proportion | -7.950846 | 0.0 | reject | unequal |


In [None]:
collect_test_statistics(
    find_toxic_subreddits_detoxify(class_label=0, quantile=0.5, print_subreddit_count=True),
    find_toxic_subreddits_detoxify(class_label=2, quantile=0.5, print_subreddit_count=True))

Hate subreddit count based on Detoxify scores: 1687
Neither_hate_nor_offensive subreddit count based on Detoxify scores: 1100


| variable | t-statistic | p-value | H0 | variance |
|---|---|---|---|---|
| proportion | -32.920467 | 0.0 | reject | unequal |


In [None]:
collect_test_statistics(
    find_toxic_subreddits_detoxify(class_label=1, quantile=0.5, print_subreddit_count=True),
    find_toxic_subreddits_detoxify(class_label=2, quantile=0.5, print_subreddit_count=True))

Offensive subreddit count based on Detoxify scores: 1359
Neither_hate_nor_offensive subreddit count based on Detoxify scores: 1100


| variable | t-statistic | p-value | H0 | variance |
|---|---|---|---|---|
| proportion | -22.368015 | 0.0 | reject | unequal |


#### Initiate find_toxic_subreddits_class_label function

Similar to the above function, this function extracts subreddits based on their classification label as 'hate' or 'offensive' instead of toxicity scores, and without taking into account gendered language. It takes no quantile parameter: it returns every subreddit that has at least one comment with the chosen label, together with the proportion of its sampled comments carrying that label. Subreddits beginning with "u_" are user profiles where only the user can make submissions, and have therefore been removed from the analysis.

In [None]:
def find_toxic_subreddits_class_label(class_label=0, print_subreddit_count=True):
    
    filtered_c = c.drop(c[c.subreddit.str.contains('^u_', regex=True)==True].index)
    toxic_subreddit_count = filtered_c[filtered_c.davidson_label == class_label]
    
    toxic_subreddit_count = toxic_subreddit_count.groupby('subreddit').count().body
    
    subreddits_count = filtered_c.groupby('subreddit').count().body
    
    if class_label==0:
        label='hate'
    elif class_label==1:
        label='offensive'
    else:
        label='neither_hate_nor_offensive'
        
    df = pd.DataFrame(columns=[label+'_comment_count'])
    
    df[label+'_comment_count']=toxic_subreddit_count
    
    df['total_sampled_comments']=subreddits_count
    
    df['proportion']=toxic_subreddit_count/subreddits_count

    if print_subreddit_count == True:
        print(label.capitalize()+' subreddit count without Detoxify scores: '+str(len(df)))
    
    return df



In [None]:
find_toxic_subreddits_class_label(class_label=0, print_subreddit_count=True).head(3)

Hate subreddit count without Detoxify scores: 3284


| subreddit | hate_comment_count | total_sampled_comments | proportion |
|---|---|---|---|
| 1200Australia | 1 | 2 | 0.5 |
| 21stCenturyHumour | 1 | 1 | 1.0 |
| 2ALiberals | 2 | 5 | 0.4 |


In [None]:
label_sorted_h = find_toxic_subreddits_class_label(class_label=0, print_subreddit_count=False).sort_values('total_sampled_comments', ascending=False)[:10].sort_values('proportion', ascending=False).reset_index()
alt.Chart(label_sorted_h).mark_bar(color='#f2a3b6').encode(
    x = alt.X('proportion', title='Proportion of Toxic Language',scale = alt.Scale(domain=[0,1])),
    y=alt.Y('subreddit',sort=['proportion'], title='')
).properties(
    title={
      'text': ['Top Ten Largest Sampled Subreddits (Most Comments) ', 'with Highest Proportions of Hate Speech'],
      'color': 'black',
      'anchor': 'middle',
      'fontSize': 15
    }
)

In [None]:
find_toxic_subreddits_class_label(class_label=1, print_subreddit_count=True).head(3)

Offensive subreddit count without Detoxify scores: 7588


| subreddit | offensive_comment_count | total_sampled_comments | proportion |
|---|---|---|---|
| 1000lbsisters | 11 | 31 | 0.354839 |
| 1200Australia | 1 | 2 | 0.5 |
| 13ReasonsWhy | 1 | 2 | 0.5 |


In [None]:
label_sorted_o = find_toxic_subreddits_class_label(class_label=1, print_subreddit_count=False).sort_values('total_sampled_comments', ascending=False)[:10].sort_values('proportion', ascending=False).reset_index()
alt.Chart(label_sorted_o).mark_bar(color='#f7cd70').encode(
    x = alt.X('proportion', title='Proportion of Toxic Language',scale = alt.Scale(domain=[0,1])),
    y=alt.Y('subreddit',sort=['proportion'], title='')
).properties(
    title={
      'text': ['Top Ten Largest Sampled Subreddits (Most Comments) ', 'with Highest Proportions of Offensive Speech'],
      'color': 'black',
      'anchor': 'middle',
      'fontSize': 15
    }
)

In [None]:
find_toxic_subreddits_class_label(class_label=2, print_subreddit_count=True).head(3)

Neither_hate_nor_offensive subreddit count without Detoxify scores: 5122


| subreddit | neither_hate_nor_offensive_comment_count | total_sampled_comments | proportion |
|---|---|---|---|
| 1001MentalNotes | 1 | 1 | 1.0 |
| 100movies365days | 1 | 1 | 1.0 |
| 1200isjerky | 1 | 1 | 1.0 |


In [None]:
label_sorted_n = find_toxic_subreddits_class_label(class_label=2, print_subreddit_count=False).sort_values('total_sampled_comments', ascending=False)[:10].sort_values('proportion', ascending=False).reset_index()
alt.Chart(label_sorted_n).mark_bar(color='#63a671').encode(
    x = alt.X('proportion', title='Proportion of Toxic Language',scale = alt.Scale(domain=[0,1])),
    y=alt.Y('subreddit',sort=['proportion'], title='')
).properties(
    title={
      'text': ['Top Ten Largest Sampled Subreddits (Most Comments) ', 'with Highest Proportions of Neither Hate nor Offensive Speech'],
      'color': 'black',
      'anchor': 'middle',
      'fontSize': 15
    }
)

In [None]:
collect_test_statistics(
    find_toxic_subreddits_class_label(class_label=0, print_subreddit_count=True),
    find_toxic_subreddits_class_label(class_label=1, print_subreddit_count=True))

Hate subreddit count without Detoxify scores: 5826
Offensive subreddit count without Detoxify scores: 7588


| variable | t-statistic | p-value | H0 | variance |
|---|---|---|---|---|
| proportion | -21.152133 | 0.0 | reject | unequal |


In [None]:
collect_test_statistics(
    find_toxic_subreddits_class_label(class_label=0, print_subreddit_count=True),
    find_toxic_subreddits_class_label(class_label=2, print_subreddit_count=True))

Hate subreddit count without Detoxify scores: 5826
Neither_hate_nor_offensive subreddit count without Detoxify scores: 10227


| variable | t-statistic | p-value | H0 | variance |
|---|---|---|---|---|
| proportion | -54.144535 | 0.0 | reject | unequal |


In [None]:
collect_test_statistics(
    find_toxic_subreddits_class_label(class_label=1, print_subreddit_count=True),
    find_toxic_subreddits_class_label(class_label=2, print_subreddit_count=True))

Offensive subreddit count without Detoxify scores: 7588
Neither_hate_nor_offensive subreddit count without Detoxify scores: 10227


| variable | t-statistic | p-value | H0 | variance |
|---|---|---|---|---|
| proportion | -34.423684 | 0.0 | reject | unequal |

