## Treatment and control groups for selected channels

The examined intervention targeted specific content: "videos that are perceived to be hateful or inflammatory" and "content that is harassing or attacking people based on their race, religion, gender or similar categories".
To estimate the effect on creators' content supply, we categorised channels by the topic of the videos they published before the moderation update was implemented. The categorisation relied on keyword matching: a total of 96 words and expressions reflecting the targeted content was generated.
The keywords were matched against the title, description and tags of videos published before the March 2017 moderation update (n videos = 2,742,395).
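
The matching step can be pictured on a toy table. The sketch below is an illustration only: the three rows and the two keywords are made up, and `na=False` is an assumption added here so that a missing field counts as "no match".

```python
import re
import pandas as pd

# Toy stand-in for the video table (hypothetical rows).
videos = pd.DataFrame({
    "title": ["a gun review", "cooking pasta", "gun"],
    "description": ["unboxing", None, "short clip"],
    "tags": ["review,gear", "food", None],
})

# Space-padded keywords, in the style of the lists used in this notebook.
keywords = [" gun ", " weapon "]
pattern = re.compile("|".join(re.escape(k) for k in keywords))

# A video is flagged when any of the three text fields matches.
flag = pd.Series(False, index=videos.index)
for col in ["title", "description", "tags"]:
    flag |= videos[col].str.contains(pattern, na=False)

videos["flag"] = flag
print(videos[["title", "flag"]])
```

Only the first row is flagged. The last row is not, because a padded keyword such as `' gun '` matches only when the word has a space on both sides, so a word at the very start or end of a field is missed.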

In [1]:
import pandas as pd
import numpy as np
import re

In [2]:
db_video=pd.read_csv('video_database_per_period_selected.csv', sep=';')
db_video.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11845978 entries, 0 to 11845977
Data columns (total 13 columns):
 #   Column          Dtype  
---  ------          -----  
 0   Unnamed: 0.1    int64  
 1   Unnamed: 0      int64  
 2   category        object 
 3   channel         object 
 4   date_crawled    object 
 5   description     object 
 6   id              object 
 7   duration        int64  
 8   tags            object 
 9   title           object 
 10  upload          object 
 11  week            object 
 12  before_updates  float64
dtypes: float64(1), int64(3), object(9)
memory usage: 1.1+ GB


## Selecting channels

In [3]:
db_chan=pd.read_csv('weekly_uploads_cat_completed.csv', sep=';')
chan=db_chan['channel'].unique().tolist()
len(chan)

53535

In [4]:
db_video=db_video[db_video['channel'].isin(chan)]
db_video.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 7807776 entries, 141 to 11845922
Data columns (total 13 columns):
 #   Column          Dtype  
---  ------          -----  
 0   Unnamed: 0.1    int64  
 1   Unnamed: 0      int64  
 2   category        object 
 3   channel         object 
 4   date_crawled    object 
 5   description     object 
 6   id              object 
 7   duration        int64  
 8   tags            object 
 9   title           object 
 10  upload          object 
 11  week            object 
 12  before_updates  float64
dtypes: float64(1), int64(3), object(9)
memory usage: 834.0+ MB


In [5]:
out=pd.read_csv('outliers_995.csv', sep=';')
chan=out['channel'].unique().tolist()
len(chan)

1676

In [6]:
db_video=db_video[~db_video['channel'].isin(chan)]
db_video.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 5596628 entries, 141 to 11845922
Data columns (total 13 columns):
 #   Column          Dtype  
---  ------          -----  
 0   Unnamed: 0.1    int64  
 1   Unnamed: 0      int64  
 2   category        object 
 3   channel         object 
 4   date_crawled    object 
 5   description     object 
 6   id              object 
 7   duration        int64  
 8   tags            object 
 9   title           object 
 10  upload          object 
 11  week            object 
 12  before_updates  float64
dtypes: float64(1), int64(3), object(9)
memory usage: 597.8+ MB


In [7]:
max(db_video['week'])

'2017-43'

In [8]:
db_video['channel'].nunique()

51859
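
The two filters above keep the channels listed in the weekly panel and then drop the outlier channels; 53,535 minus 1,676 gives 51,859, consistent with the count in cell 8. A minimal sketch of the same `isin` / `~isin` pattern, using made-up channel labels:

```python
import pandas as pd

# Hypothetical video table with one row per video.
videos = pd.DataFrame({
    "channel": ["A", "A", "B", "C", "D"],
    "id": ["v1", "v2", "v3", "v4", "v5"],
})
keep = ["A", "B", "C"]   # hypothetical: channels present in the weekly panel
outliers = ["B"]         # hypothetical: channels flagged as outliers

# Keep only panel channels, then remove the outliers.
kept = videos[videos["channel"].isin(keep)]
kept = kept[~kept["channel"].isin(outliers)]
print(kept["channel"].nunique())
```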

## Selecting videos published before the update

In [9]:
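# .copy() makes db_before an independent DataFrame, so new columns can be added without a SettingWithCopyWarning
db_before=db_video[db_video['before_updates']==1].copy()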

In [10]:
db_before.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2742395 entries, 363 to 11845922
Data columns (total 13 columns):
 #   Column          Dtype  
---  ------          -----  
 0   Unnamed: 0.1    int64  
 1   Unnamed: 0      int64  
 2   category        object 
 3   channel         object 
 4   date_crawled    object 
 5   description     object 
 6   id              object 
 7   duration        int64  
 8   tags            object 
 9   title           object 
 10  upload          object 
 11  week            object 
 12  before_updates  float64
dtypes: float64(1), int64(3), object(9)
memory usage: 292.9+ MB


In [11]:
db_before['channel'].nunique()

51859

In [12]:
db_before['description'].isna().sum()

78607

In [13]:
db_before['tags'].isna().sum()

235058

In [14]:
db_before[['description','tags']].isna().all(axis=1).sum()

45071

In [15]:
db_before['id'].count()

2742395
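
These counts show how much text is available for matching: a video with neither a description nor tags can only be matched on its title. A small sketch of the three counts on made-up data:

```python
import numpy as np
import pandas as pd

# Hypothetical metadata with gaps.
meta = pd.DataFrame({
    "description": ["text", np.nan, np.nan, "text"],
    "tags": [np.nan, "a,b", np.nan, "c"],
})

missing_description = int(meta["description"].isna().sum())
missing_tags = int(meta["tags"].isna().sum())
# all(axis=1) is True only for rows where both fields are missing.
missing_both = int(meta[["description", "tags"]].isna().all(axis=1).sum())
print(missing_description, missing_tags, missing_both)
```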

## Title, description and tags analysis

In [16]:
# Keywords are padded with spaces so that only whole, space-delimited words match
pattern_sex=[' ass ', ' sex ', ' dick ', ' vagina ', ' penis ', ' anal ', ' anus ', ' erotic ', 
             ' boobs ', ' butt ', ' cum ', ' nudity ', ' sexual ', ' porn ', ' xxx ', ' fuck ', ' condom ']

# Combine the keywords into a single alternation pattern
pattern_sex = re.compile('|'.join(pattern_sex))
# Flag a video when its title, description or tags contain any keyword (case-sensitive)
mask = ((db_before['title'].str.contains(pattern_sex))|(db_before['description'].str.contains(pattern_sex))|(db_before['tags'].str.contains(pattern_sex)))
db_before['sex']=mask
db_before['sex']=db_before['sex'].fillna('None')


In [17]:
db_before['sex'].value_counts()

False    2727133
True       15262
Name: sex, dtype: int64

In [18]:
db_before[db_before['sex']==True]

Unnamed: 0.2,Unnamed: 0.1,Unnamed: 0,category,channel,date_crawled,description,id,duration,tags,title,upload,week,before_updates,sex
746,1204,2691,Science & Technology,UCzWQYUVCpZqtN93H8RR44Qw,2019-11-03 22:22:23.728131,The U.S. legal abortion rate is at an all-time...,3jWjaHWQ-00,276,"current events,Science,abortion,sex education,...","The Abortion Rate Is At An All-Time Low, Why?",2017-01-26,2017-04,1.0,True
749,1207,2694,Science & Technology,UCzWQYUVCpZqtN93H8RR44Qw,2019-11-03 22:22:25.524810,A female shark spontaneously gave birth after ...,1mS10pRK8rE,275,"current events,Science,asexual,parthenogenesis...",This Shark Reproduced Without A Mate! Could Hu...,2017-01-25,2017-04,1.0,True
780,1238,2725,Science & Technology,UCzWQYUVCpZqtN93H8RR44Qw,2019-11-03 22:21:45.177946,Do illnesses really affect men worse than wome...,5axqfvcnlMQ,203,"current events,Science,man flu,illness,men vs ...",Men vs. Women: Who Really Gets Sicker?,2017-01-03,2017-01,1.0,True
781,1239,2726,Science & Technology,UCzWQYUVCpZqtN93H8RR44Qw,2019-11-03 22:21:45.754262,Scientists discovered a millipede that has 414...,MhNDsJNcy3g,202,"current events,Science,penis,genitals,milliped...",This Animal Has Penises For Legs... WHY?!,2017-01-02,2017-01,1.0,True
804,1262,2749,Science & Technology,UCzWQYUVCpZqtN93H8RR44Qw,2019-11-03 22:21:59.405248,Why does the honeymoon phase at the beginning ...,JWc3Ql0vsvc,192,"current events,Science,love,relationships,pupp...",The Scientific Reason The ‘Honeymoon Phase’ Go...,2016-12-14,2016-50,1.0,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
11842202,18693022,2359306,Howto & Style,UCs8moRYU6xw11rTqlIQ6KRw,2019-11-19 14:57:28.948340,HAPPY NEW YEAR!!! YAAASSS!! What a better way ...,Pgfe336MUn4,2050,"WORKOUT,fitness over 40,makeup over 40,mature ...",NEW YEARS EVE WORKOUT 2017! ALL QUADS AND PECS!!,2016-12-31,2016-52,1.0,True
11842210,18693030,2359314,Howto & Style,UCs8moRYU6xw11rTqlIQ6KRw,2019-11-19 14:57:33.846920,Get ready to HATE ME today!! LOL! We love lung...,i93tzqOl0LY,1856,"workout,fitness,mature beauty,fitness over 40,...",Killer 30 minute quad and glute at home workou...,2016-11-12,2016-45,1.0,True
11842212,18693032,2359316,Howto & Style,UCs8moRYU6xw11rTqlIQ6KRw,2019-11-19 14:57:35.066883,Good morning!! Kettle bell workout today! I m...,ZyfQP82L6So,1557,"YouTube Editor,Hiit,workout,womens workouts,fa...",FITNESS OVER 40---25 minute GIRLS RULE kettle ...,2016-10-29,2016-43,1.0,True
11842220,18693040,2359324,Howto & Style,UCs8moRYU6xw11rTqlIQ6KRw,2019-11-19 14:57:40.348296,Good Weekend to you all! Today I am doing ano...,qsj4Jecl_mA,1738,"YouTube Editor,workout,hiit,fitness,fitover40,...",25 minute low impact silent injury workout # 2...,2016-09-03,2016-35,1.0,True
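
Space-padded keywords and word-boundary patterns behave differently on comma-separated tags. The sketch below contrasts the two on made-up strings modelled on the tags shown above. The `\b` pattern with `re.IGNORECASE` is an alternative shown for comparison only; it is not what this notebook ran, and the counts reported here come from the padded patterns.

```python
import re
import pandas as pd

keywords = ["sex", "porn"]

# Padded style: the keyword must have a space on both sides.
padded = re.compile("|".join(" " + k + " " for k in keywords))

# Alternative (assumption): word boundaries, case-insensitive.
bounded = re.compile(
    r"\b(?:" + "|".join(re.escape(k) for k in keywords) + r")\b",
    flags=re.IGNORECASE,
)

tags = pd.Series(["science,sex education", "asexual,parthenogenesis", "Sex ed"])

padded_hits = tags.str.contains(padded, na=False).tolist()
bounded_hits = tags.str.contains(bounded, na=False).tolist()
print(padded_hits)
print(bounded_hits)
```

The padded pattern matches none of the three strings, because a comma or the start of the string precedes the keyword. The boundary pattern matches the first and third strings and still skips `asexual`.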


In [19]:
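# Note: ' gun' and ' bomb' have no trailing space, so they also match longer words (e.g. ' guns', ' bombing').
# ' warefare ' is kept as run; it looks like a misspelling of 'warfare' and does not match that word.
pattern_violence=[' blood ', ' torture ', ' murder ', ' abuse ', ' rape ', ' kill ', ' dead ', ' mass shooting ', 
                  ' execution ', ' kidnapping ', ' slaughter ', ' suicide ', ' victim ', 
                  ' violence ', ' violent ', ' weapon ', ' warefare ', ' gun', ' bomb', ' terrorism ', " extremism "]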

pattern_violence= re.compile('|'.join(pattern_violence))   
mask_violence = ((db_before['title'].str.contains(pattern_violence))|(db_before['description'].str.contains(pattern_violence))|(db_before['tags'].str.contains(pattern_violence)))
db_before['violence']=mask_violence
db_before['violence']=db_before['violence'].fillna('No tags')


In [20]:
db_before['violence'].value_counts()

False    2655331
True       87064
Name: violence, dtype: int64

In [21]:
pattern_insults=[' bastard ', ' pussy ', ' dumbass ', ' goddamn ', ' bitch ', ' nigger ', ' shit ', ' idiot ', 
                 ' dammit ', ' cunt ']

pattern_insults= re.compile('|'.join(pattern_insults))   
mask_insults = ((db_before['title'].str.contains(pattern_insults))|(db_before['description'].str.contains(pattern_insults))|(db_before['tags'].str.contains(pattern_insults)))
db_before['insults']=mask_insults
db_before['insults']=db_before['insults'].fillna('No tags')


In [22]:
db_before['insults'].value_counts()

False    2735784
True        6611
Name: insults, dtype: int64

In [23]:
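# Note: ' heroin' has no trailing space, so it also matches longer words (e.g. ' heroine').
# ' boong ' and ' cocain ' are kept as run; they look like misspellings of 'bong' and 'cocaine'.
pattern_drug=[" acid ", ' drug ', ' weed ', ' boong ', ' cannabis ', ' cbd ', ' cocain ', ' crack ',
              ' dealer ', ' joint ', ' junky ', ' lsd ', ' marijuana ', ' rehab ', 
               ' stoned ',  ' thc ', ' heroin']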

pattern_drug= re.compile('|'.join(pattern_drug))   
mask_drug = ((db_before['title'].str.contains(pattern_drug))|(db_before['description'].str.contains(pattern_drug))|(db_before['tags'].str.contains(pattern_drug)))
db_before['drug']=mask_drug
db_before['drug']=db_before['drug'].fillna('No tags')


In [24]:
db_before['drug'].value_counts()

False    2731158
True       11237
Name: drug, dtype: int64

In [25]:
pattern_sensitive=[" war ", " abortion ", " accident ", " Hitler ",  " fascism ",  " AIDS ", " Al Qaeda ", " alt right ", 
                   " genocide ", ' assassination ',
                   ' attack ', " concentration camp ",  " incel ",  " holocaust ",  " homophobia ", " illegal ", " incest ", " ISIS ", 
                   " Israel ", " Palestine ", " jewish ",  " Ku Klux Klan ", " LGBT ",   " nazi ", 
                   ' racism ', ' slavery ', ' supremacist ', ' supremacy ', ' transphobia ',  ' nuclear weapon ', ' climate change ', ' 9/11 ', ' Twin Towers ']

pattern_sensitive= re.compile('|'.join(pattern_sensitive))   
mask_sensitive = ((db_before['title'].str.contains(pattern_sensitive))|(db_before['description'].str.contains(pattern_sensitive))|(db_before['tags'].str.contains(pattern_sensitive)))
db_before['sensitive']=mask_sensitive
db_before['sensitive']=db_before['sensitive'].fillna('No tags')


In [26]:
db_before['sensitive'].value_counts()

False    2700339
True       42056
Name: sensitive, dtype: int64
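
The five cells above repeat the same steps with a different keyword list each time. As a sketch only, the steps could be wrapped in one helper. The function name `flag_keywords`, the toy rows and the use of `na=False` are assumptions made for this illustration.

```python
import re
import pandas as pd

def flag_keywords(df, keywords, columns=("title", "description", "tags")):
    """Return a boolean Series that is True where any column matches any keyword."""
    pattern = re.compile("|".join(re.escape(k) for k in keywords))
    flag = pd.Series(False, index=df.index)
    for col in columns:
        # na=False treats a missing field as "no match".
        flag |= df[col].str.contains(pattern, na=False)
    return flag

# Hypothetical rows and shortened keyword lists.
toy = pd.DataFrame({
    "title": ["growing weed indoors", "daily vlog", "chess opening"],
    "description": [None, "what an idiot move", "no keywords here"],
    "tags": ["garden", None, "chess"],
})
categories = {"drug": [" weed ", " lsd "], "insults": [" idiot "]}

for name, words in categories.items():
    toy[name] = flag_keywords(toy, words)

print(toy[list(categories)].sum())
```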

## Selecting channels based on keyword analysis

In [27]:
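db_channels=db_before.groupby(['channel','sex'], as_index=False).size()
# Per-channel counts of flagged (True) and unflagged (False) videos; margins=True adds an 'All' total row and column
db_cha_sex=pd.crosstab(db_before['channel'],db_before['sex'], margins=True)
# Share of flagged videos per channel, rounded to a whole percent
db_cha_sex['percentage_true']=round((db_cha_sex[True]/db_cha_sex['All'])*100)

# A rounded share of 0 puts the channel in Control; anything above 0 puts it in Treatment
conditions=[(db_cha_sex['percentage_true']==0),
    (db_cha_sex['percentage_true']>0)]
values=['Control','Treatment']

db_cha_sex['type_channel_sex']=np.select(conditions, values)

# Number of rows per group
db_cha_sex.groupby(['type_channel_sex'], as_index=False).size()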

Unnamed: 0,type_channel_sex,size
0,Control,46945
1,Treatment,4915


In [28]:
db_cha_violence=pd.crosstab(db_before['channel'],db_before['violence'], margins=True)
db_cha_violence['percentage_true']=round((db_cha_violence[True]/db_cha_violence['All'])*100)

conditions=[(db_cha_violence['percentage_true']==0),
    (db_cha_violence['percentage_true']>0)]
values=['Control','Treatment']

db_cha_violence['type_channel_violence']=np.select(conditions, values)

db_cha_violence.groupby(['type_channel_violence'], as_index=False).size()

Unnamed: 0,type_channel_violence,size
0,Control,37505
1,Treatment,14355


In [29]:
db_cha_insults=pd.crosstab(db_before['channel'],db_before['insults'], margins=True)
db_cha_insults['percentage_true']=round((db_cha_insults[True]/db_cha_insults['All'])*100)

conditions=[(db_cha_insults['percentage_true']==0),
    (db_cha_insults['percentage_true']>0)]
values=['Control','Treatment']

db_cha_insults['type_channel_insults']=np.select(conditions, values)

db_cha_insults.groupby(['type_channel_insults'], as_index=False).size()

Unnamed: 0,type_channel_insults,size
0,Control,49441
1,Treatment,2419


In [30]:
db_cha_drugs=pd.crosstab(db_before['channel'],db_before['drug'], margins=True)
db_cha_drugs['percentage_true']=round((db_cha_drugs[True]/db_cha_drugs['All'])*100)

conditions=[(db_cha_drugs['percentage_true']==0),
    (db_cha_drugs['percentage_true']>0)]
values=['Control','Treatment']

db_cha_drugs['type_channel_drugs']=np.select(conditions, values)

db_cha_drugs.groupby(['type_channel_drugs'], as_index=False).size()

Unnamed: 0,type_channel_drugs,size
0,Control,48141
1,Treatment,3719


In [31]:
db_cha_sens=pd.crosstab(db_before['channel'],db_before['sensitive'], margins=True)
db_cha_sens['percentage_true']=round((db_cha_sens[True]/db_cha_sens['All'])*100)

conditions=[(db_cha_sens['percentage_true']==0),
    (db_cha_sens['percentage_true']>0)]
values=['Control','Treatment']

db_cha_sens['type_channel_sens']=np.select(conditions, values)

db_cha_sens.groupby(['type_channel_sens'], as_index=False).size()

Unnamed: 0,type_channel_sens,size
0,Control,44556
1,Treatment,7304
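
Because `percentage_true` is rounded to a whole percent before it is compared with zero, a channel whose flagged share is well below 0.5% is placed in Control even though it has flagged videos. The sketch below uses made-up counts to contrast this rule with an "any flagged video" rule. The second rule is shown for comparison only and is not what the cells above apply.

```python
import numpy as np
import pandas as pd

# Hypothetical per-channel counts.
shares = pd.DataFrame(
    {"flagged": [0, 1, 1, 3], "total": [50, 400, 150, 10]},
    index=["A", "B", "C", "D"],
)
pct = shares["flagged"] / shares["total"] * 100   # 0.0, 0.25, about 0.67, 30.0

# Rule used above: round to a whole percent, then compare with zero.
rounded_rule = np.where(pct.round() > 0, "Treatment", "Control")
# Alternative (assumption): any flagged video makes a treatment channel.
any_match_rule = np.where(shares["flagged"] > 0, "Treatment", "Control")

print(list(rounded_rule))
print(list(any_match_rule))
```

Channel B, with 1 flagged video out of 400, is Control under the rounded rule and Treatment under the alternative.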


In [32]:
db_cha_sex=db_cha_sex[db_cha_sex['type_channel_sex']=='Treatment']
db_cha_violence=db_cha_violence[db_cha_violence['type_channel_violence']=='Treatment']
db_cha_insults=db_cha_insults[db_cha_insults['type_channel_insults']=='Treatment']
db_cha_drugs=db_cha_drugs[db_cha_drugs['type_channel_drugs']=='Treatment']
db_cha_sens=db_cha_sens[db_cha_sens['type_channel_sens']=='Treatment']

In [33]:
db_cha=list(db_cha_sex.index)

db_violence=list(db_cha_violence.index)
db_insults=list(db_cha_insults.index)
db_drugs=list(db_cha_drugs.index)
db_sens=list(db_cha_sens.index)

In [34]:
# Merge the treatment channels of the other four categories into db_cha,
# keeping first-seen order and skipping channels that are already listed.
seen=set(db_cha)
for group in (db_violence, db_insults, db_drugs, db_sens):
    for ele in group:
        if ele not in seen:
            seen.add(ele)
            db_cha.append(ele)

print(len(db_cha))

20324
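
The five per-category Treatment sizes shown above add up to 32,712, while the merged list has 20,324 entries, so many channels are flagged in more than one category. A sketch of the same order-preserving merge on made-up lists, which also discards the `All` label that a crosstab built with `margins=True` carries in its index (the filter is an assumption of this sketch):

```python
# Hypothetical treatment lists per category.
groups = {
    "sex": ["A", "B", "All"],
    "violence": ["B", "C", "All"],
    "insults": [],
    "drugs": ["C", "D"],
    "sensitive": ["A", "E"],
}

# dict.fromkeys keeps first-seen order and drops duplicates.
merged = list(dict.fromkeys(ch for chans in groups.values() for ch in chans))
# Remove the crosstab margin label so only channel IDs remain.
treatment = [ch for ch in merged if ch != "All"]

total_entries = sum(len(chans) for chans in groups.values())
print(total_entries, len(treatment), treatment)
```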


In [35]:
data={'id_channel':db_cha}
db=pd.DataFrame(data)

db['type']='treatment'
db.to_csv('id_channels_treatment_2024.csv', sep=';')
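
Writing a DataFrame with `to_csv` saves the index as an unnamed first column by default, which `read_csv` then loads as `Unnamed: 0`. Columns of that kind appear in the input file at the top of this notebook. The sketch below shows the effect and the `index=False` option on an in-memory buffer; the two-row table is made up.

```python
import io
import pandas as pd

# Hypothetical treatment table.
db = pd.DataFrame({"id_channel": ["A", "B"], "type": "treatment"})

# Default behaviour: the index is written as an extra column.
with_index = io.StringIO()
db.to_csv(with_index, sep=";")
with_index.seek(0)
cols_with = pd.read_csv(with_index, sep=";").columns.tolist()

# index=False writes only the data columns.
without_index = io.StringIO()
db.to_csv(without_index, sep=";", index=False)
without_index.seek(0)
cols_without = pd.read_csv(without_index, sep=";").columns.tolist()

print(cols_with)
print(cols_without)
```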