## Question 6

How likely is it that gendered speech is related to violence?

#### Operationalized Question:
What percentage of Reddit comments with high scores on threat (or on combined measures, e.g., sexual_explicit + threat) also have:
• male pronouns or identifiers (e.g., "guy")?
• female pronouns or identifiers (e.g., "gal")?
• both genders?
• no gendered speech?
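
The four categories above could be assigned with a simple word-list match. The sketch below is only an illustration: the term lists and the `gender_category` helper are hypothetical, since this notebook reads an already-computed `gender` column and does not show how it was derived.

```python
import re

# Hypothetical identifier lists; the lists actually used for the dataset are not shown here.
MALE_TERMS = {"he", "him", "his", "guy", "man", "boy"}
FEMALE_TERMS = {"she", "her", "hers", "gal", "woman", "girl"}


def gender_category(text):
    """Return 'male', 'female', 'both', or 'none' for one comment."""
    # Match whole words so that "the" or "weather" does not count as "he"
    tokens = set(re.findall(r"[a-z']+", text.lower()))
    has_male = bool(tokens & MALE_TERMS)
    has_female = bool(tokens & FEMALE_TERMS)
    if has_male and has_female:
        return "both"
    if has_male:
        return "male"
    if has_female:
        return "female"
    return "none"


for comment in ["That guy is loud", "She told her friend", "He and she left", "Nice weather today"]:
    print(comment, "->", gender_category(comment))
```

Matching on whole tokens rather than substrings is the main design choice here; a substring search would misclassify many neutral comments.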

#### Hypothesis:
A higher percentage of comments with female identifiers will include terms related to violence than will comments with male identifiers.

#### Results:
An initial exploration of comments with an above-average threat score shows that around 72% have only female identifiers, whereas less than 1% have only male identifiers. We expected more threatening comments to include female-only identifiers, but were surprised by such a stark difference. Around 26% of above-average threat comments included both male and female identifiers, and less than 2% had no gender identifiers.

Comparing the four gender categories with Welch's t-test produced mixed results. The male vs. female, female vs. both, and male vs. none comparisons gave p-values below the 5% significance level, leading us to reject the null hypothesis that the mean threat scores of the two categories are equal. The female vs. none, male vs. both, and both vs. none comparisons gave p-values above the 5% significance level, so we fail to reject the null hypothesis for those pairs.

We then restricted the above-average threat comments to those that also have above-average sexually explicit scores and found similar proportions: around 76% of these comments had only female identifiers and less than 1% had only male identifiers, again an unexpectedly stark difference. Around 23% included both gender identifiers, while around 1% had none.

The t-tests on this subset were similarly mixed. The male vs. female, female vs. both, and female vs. none comparisons gave p-values below the 5% significance level, leading us to reject the null hypothesis that the mean threat scores of the two categories are equal, although the male vs. female p-value (0.0498) only narrowly clears the threshold. The male vs. both, male vs. none, and both vs. none comparisons gave p-values above the 5% significance level, so we fail to reject the null hypothesis for those pairs. Given these results, it would be interesting to explore the relationships between gender identifiers further, but when comparing male to female identifiers specifically, there does appear to be a relationship between those identifiers and more threatening and sexually explicit speech.



In [None]:
import pandas as pd
import numpy as np
from scipy.stats import ttest_ind

#### Read in the Comments data

In [None]:
c = pd.read_csv('../data/processed/comments.csv')

#### Filter to keep only the above average Threat comments

In [None]:
threat_comments = c[c.threat >= c.threat.mean()].copy()

#### Split the data based on gender identifiers

In [None]:
male_threat = threat_comments[threat_comments.gender == 'male']
female_threat = threat_comments[threat_comments.gender == 'female']
both_threat = threat_comments[threat_comments.gender == 'both']
none_threat = threat_comments[threat_comments.gender == 'none']

#### Find the percentages for each gender category

In [None]:
# Share of above average threat comments in each gender category, in percent
percentage_total = pd.DataFrame(
    {
        'Male': round(len(male_threat) / len(threat_comments) * 100, 2),
        'Female': round(len(female_threat) / len(threat_comments) * 100, 2),
        'Both': round(len(both_threat) / len(threat_comments) * 100, 2),
        'None': round(len(none_threat) / len(threat_comments) * 100, 2),
    },
    index=['Percentage of Total'],
)
percentage_total

| | Male | Female | Both | None |
|---|---|---|---|---|
| Percentage of Total | 0.41 | 71.61 | 26.27 | 1.71 |
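
As a cross-check on percentages like these, the same shares can be computed in one step with `value_counts(normalize=True)`. This is a minimal sketch on a made-up toy frame (the `toy` data and the `order` list are assumptions for illustration, not the notebook's comments):

```python
import pandas as pd

# Toy stand-in for the filtered comments; only the `gender` column matters here
toy = pd.DataFrame(
    {"gender": ["female", "female", "both", "male", "none", "female", "both", "female"]}
)

# Fix the column order and fill in 0 for any category that never occurs
order = ["male", "female", "both", "none"]
shares = (
    toy["gender"]
    .value_counts(normalize=True)    # fraction of rows per category
    .reindex(order, fill_value=0)
    .mul(100)
    .round(2)
)
print(shares)
```

Because the shares come from a single count over the same frame, they are guaranteed to use one denominator, which avoids mismatches between separately filtered subsets.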


#### Define the collect_test_statistics function

In [None]:
def collect_test_statistics(dataset_1, dataset_2, alpha=0.05):
    """Run a two-sample t-test per variable and record whether H0 is rejected at `alpha`."""
    variables = ['threat']
    rows = []

    for variable in variables:
        ttest_x = dataset_1[variable]
        ttest_y = dataset_2[variable]

        # Use Welch's t-test unless the two sample variances are exactly equal
        equal_var = bool(np.var(ttest_x) == np.var(ttest_y))
        ttest = ttest_ind(ttest_x, ttest_y, nan_policy='omit', equal_var=equal_var)

        rows.append({
            'variable': variable,
            't-statistic': ttest[0],
            'p-value': ttest[1],
            # H0: the two groups have the same mean for this variable
            'H0': 'fail to reject' if ttest[1] > alpha else 'reject',
            'variance': 'equal' if equal_var else 'unequal',
        })

    # Label each row with the pair of gender categories being compared
    label = dataset_1.gender.iloc[0] + ' / ' + dataset_2.gender.iloc[0]
    return pd.DataFrame(rows, index=[label] * len(rows))
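
When the two sample variances differ, the function above calls `ttest_ind` with `equal_var=False`, which is Welch's t-test. The sketch below recomputes Welch's statistic by hand on synthetic samples (the arrays `x` and `y` are made up, not notebook data) to show what that call measures: the difference in means divided by the unpooled standard error.

```python
import numpy as np
from scipy.stats import ttest_ind

# Synthetic samples with different sizes and spreads (illustrative only)
rng = np.random.default_rng(0)
x = rng.normal(loc=0.30, scale=0.10, size=200)
y = rng.normal(loc=0.25, scale=0.20, size=50)

# Welch's t: each group contributes its own sample variance over its own size
standard_error = np.sqrt(x.var(ddof=1) / x.size + y.var(ddof=1) / y.size)
t_manual = (x.mean() - y.mean()) / standard_error

t_scipy, p_value = ttest_ind(x, y, equal_var=False)
print(t_manual, t_scipy, p_value)
```

Because the variances are not pooled, a very small group (such as the male-only category here) can yield a large standard error and a weak test even when the means look different.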


#### Combine the T-test results into a single dataframe

In [None]:
# Stack the six pairwise comparisons into one table
threat_df = pd.concat([
    collect_test_statistics(female_threat, male_threat, alpha=0.05),
    collect_test_statistics(female_threat, both_threat, alpha=0.05),
    collect_test_statistics(female_threat, none_threat, alpha=0.05),
    collect_test_statistics(male_threat, both_threat, alpha=0.05),
    collect_test_statistics(male_threat, none_threat, alpha=0.05),
    collect_test_statistics(both_threat, none_threat, alpha=0.05),
])
threat_df

| | variable | t-statistic | p-value | H0 | variance |
|---|---|---|---|---|---|
| female / male | threat | 3.442013 | 0.00096 | reject | unequal |
| female / both | threat | 7.395001 | 0.0 | reject | unequal |
| female / none | threat | 0.914012 | 0.361398 | fail to reject | unequal |
| male / both | threat | -1.919833 | 0.058674 | fail to reject | unequal |
| male / none | threat | -2.409275 | 0.017316 | reject | unequal |
| both / none | threat | -1.378584 | 0.168918 | fail to reject | unequal |
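
The six comparisons above are every unordered pair of the four categories, so they could also be generated with `itertools.combinations` instead of being listed by hand. This is a sketch on synthetic groups (the `groups` dictionary and its values are assumptions, not the notebook's data), using the same reject rule as the function above (reject when the p-value is at or below 0.05):

```python
from itertools import combinations

import numpy as np
import pandas as pd
from scipy.stats import ttest_ind

# Synthetic stand-ins for the four gender groups (illustrative only)
rng = np.random.default_rng(1)
groups = {
    "female": rng.normal(0.30, 0.10, 300),
    "male": rng.normal(0.25, 0.10, 20),
    "both": rng.normal(0.27, 0.12, 120),
    "none": rng.normal(0.29, 0.15, 30),
}

rows = []
# combinations() yields each unordered pair exactly once, in dictionary order
for (name_a, a), (name_b, b) in combinations(groups.items(), 2):
    t_stat, p_value = ttest_ind(a, b, equal_var=False)
    rows.append({
        "pair": f"{name_a} / {name_b}",
        "t-statistic": t_stat,
        "p-value": p_value,
        "H0": "reject" if p_value <= 0.05 else "fail to reject",
    })

pairwise = pd.DataFrame(rows).set_index("pair")
print(pairwise)
```

Generating the pairs programmatically keeps the list complete if a category is added or renamed.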


#### Keep only the Threat comments that also have above average Sexually Explicit content

In [None]:
threat_sex_explicit = c[(c.threat >= c.threat.mean())&(c.sexual_explicit >= c.sexual_explicit.mean())]
male_th_se = male_threat[male_threat.sexual_explicit >= c.sexual_explicit.mean()]
female_th_se = female_threat[female_threat.sexual_explicit >= c.sexual_explicit.mean()]
both_th_se = both_threat[both_threat.sexual_explicit >= c.sexual_explicit.mean()]
none_th_se = none_threat[none_threat.sexual_explicit >= c.sexual_explicit.mean()]

In [None]:
# Share of above average threat and sexually explicit comments in each gender category, in percent
percentage_total_th_se = pd.DataFrame(
    {
        'Male': round(len(male_th_se) / len(threat_sex_explicit) * 100, 2),
        'Female': round(len(female_th_se) / len(threat_sex_explicit) * 100, 2),
        'Both': round(len(both_th_se) / len(threat_sex_explicit) * 100, 2),
        'None': round(len(none_th_se) / len(threat_sex_explicit) * 100, 2),
    },
    index=['Percentage of Total'],
)
percentage_total_th_se

| | Male | Female | Both | None |
|---|---|---|---|---|
| Percentage of Total | 0.12 | 75.71 | 23.07 | 1.1 |


#### Run the T-tests again for each comparison with the updated threat comments

In [None]:
# Stack the six pairwise comparisons into one table
threat_sex_explicit_df = pd.concat([
    collect_test_statistics(female_th_se, male_th_se, alpha=0.05),
    collect_test_statistics(female_th_se, both_th_se, alpha=0.05),
    collect_test_statistics(female_th_se, none_th_se, alpha=0.05),
    collect_test_statistics(male_th_se, both_th_se, alpha=0.05),
    collect_test_statistics(male_th_se, none_th_se, alpha=0.05),
    collect_test_statistics(both_th_se, none_th_se, alpha=0.05),
])
threat_sex_explicit_df

| | variable | t-statistic | p-value | H0 | variance |
|---|---|---|---|---|---|
| female / male | threat | 2.158725 | 0.04983 | reject | unequal |
| female / both | threat | 6.87704 | 0.0 | reject | unequal |
| female / none | threat | 2.47205 | 0.014628 | reject | unequal |
| male / both | threat | -0.976268 | 0.346025 | fail to reject | unequal |
| male / none | threat | -0.80305 | 0.431229 | fail to reject | unequal |
| both / none | threat | 0.165587 | 0.868698 | fail to reject | unequal |

