# Hate speech on Twitter: Scraping the Twitter API
## Team 8: 
 - Meera Whitson whitson.m@northeastern.edu
 - Anthony Bernardi bernardi.an@northeastern.edu
 
### CONTENT WARNING: offensive language

## Data

### Overview
We are making calls to the Twitter API, which is run and maintained by Twitter. We have received developer licenses from Twitter, the keys for which are shown below. In our research into how to use the Twitter API, we found [this Python library called TwitterAPI](https://github.com/geduldig/TwitterAPI) (no space; "Twitter API" is the actual API, "TwitterAPI" is the name of the library), and we will be using it to make requests.

In [4]:
pip install TwitterAPI

Collecting TwitterAPI
  Downloading TwitterAPI-2.6.9.1.tar.gz (11 kB)
Collecting requests_oauthlib
  Using cached requests_oauthlib-1.3.0-py2.py3-none-any.whl (23 kB)
Collecting oauthlib>=3.0.0
  Using cached oauthlib-3.1.0-py2.py3-none-any.whl (147 kB)
Building wheels for collected packages: TwitterAPI
  Building wheel for TwitterAPI (setup.py) ... [?25ldone
[?25h  Created wheel for TwitterAPI: filename=TwitterAPI-2.6.9.1-py3-none-any.whl size=13040 sha256=5368d91d32804d98f8dc777ff336d91c774678170cccf0d8ceccf847182d428f
  Stored in directory: /Users/meerawhitson/Library/Caches/pip/wheels/99/60/09/74d03ca2fdb4ef0891358249711f8939161265deba78a747cc
Successfully built TwitterAPI
Installing collected packages: oauthlib, requests-oauthlib, TwitterAPI
Successfully installed TwitterAPI-2.6.9.1 oauthlib-3.1.0 requests-oauthlib-1.3.0
Note: you may need to restart the kernel to use updated packages.


In [4]:
from TwitterAPI import TwitterAPI

## API Keys
API keys and authorization objects. We're going to run it using the oAuth2, but we have the code for both if we want to replace it with oAuth1.

In [5]:
# my (Meera) API keys. I have a Twitter developer license which gives me access to the full archive (everything posted on Twitter)

API_KEY = 'OKpfVrIxI2Eo266NTmr1wkHz6'
API_KEY_SECRET = 'txzAyLMDCWEptv6OJyHei82m0s57O22nc6ZWtzwRhetiS7Wwsh'
ACCESS_TOKEN = '1033180687250153472-8nyEZ3RIsVWzQ0IYZ17o4P1I9IFnzh'
ACCESS_TOKEN_SECRET = '6QWRdu6y3H6POnRcPUFDPrJuEXt2csGX5mSk8ECKZuTd5'

In [6]:
## oAuth1
api = TwitterAPI(API_KEY, 
                 API_KEY_SECRET, 
                 ACCESS_TOKEN, 
                 ACCESS_TOKEN_SECRET)

api.auth

<requests_oauthlib.oauth1_auth.OAuth1 at 0x115eed430>

In [7]:
## oAuth2 
api = TwitterAPI(API_KEY,
                 API_KEY_SECRET,
                 auth_type='oAuth2')

api.auth

<TwitterAPI.BearerAuth.BearerAuth at 0x115eeda00>

In [196]:
import json

The Twitter API will return a JSON object for each tweet, which contains a remarkable amount of information, over fifty fields. In the interest of saving space and not collecting more data than necesssary, we were selective in which fields we wanted to save, which will be listed below.

### Information to be collected:

We've decided on the following fields to be saved for a given Tweet:
- 'tweet id' 
- 'text'
- 'time'
- 'tweet location'
- 'hashtags'
- 'likes'
- 'retweets'
- 'possibly sensitive'
    - this is a notable field because it's interesting to see what Twitter classfies as possibly sensitive. The Tweets returned by the queries are pretty recent, so this category is likely not assigned due to manual review.
- 'name'
- 'screen name'
- 'user location'
- 'user bio'
- 'flagged term' 
    - the term that we were looking for. See below for how we pick these terms.



### How are we going to find these Tweets with hateful speech?
When doing research on previous work on hate speech online, we found [this csv](https://github.com/t-davidson/hate-speech-and-offensive-language/blob/master/lexicons/readme.md) containing a dictionary of words considered hateful. Some of these we don't necessarily agree with as being hate speech, such as "Yankee". However, it contains a lot of flagged phrases, some of which are slurs, others are more veiled, dog-whistle phrases such as "border jumper". 

note: we are considering using a different list. The source from which we got this file said it lead to a lot of false positives due to the presence of words like "yellow" and "bird" which have plenty of uses that are not hateful at all. If we think data analysis could benefit from a different list and we could find a good one online, we might use that, but we don't want to create our own list of words to search for by typing out slurs ourselves.

The list needed some cleaning (shown below) due to extraneuous commas and quotation marks.

In [1]:
import pandas as pd

df_slurs = pd.read_csv('hatebase_dict.csv', index_col=0, header=None)

slurs_list = []

for item in df_slurs.iterrows():
    slur = item[0]
    # removing quote marks and commas
    slur = slur.replace("'", "")
    slur = slur.replace(",", "")
    slurs_list.append(slur)

Some of the words in the Hatebase dictionary are not inherently hateful but are characteristic of a hateful attitude, such as "uncivilised". However, some are vulgar and incredibly offensive slurs, so we will not be printing the full contents of the dictionary here, but have included a few examples from it to provide a sense of the contents of the dictionary.

In [21]:
slurs_list[0], slurs_list[100], slurs_list[250], slurs_list[500], slurs_list[900]

('uncivilised', 'jungle bunny', 'paleface', 'mud persons', 'stovepipes')

In [2]:
len(slurs_list)

1034

### Pipeline Overview
We will accomplish this task in two parts. The first is the function `get_tweets_with_term`. This will return a dataframe the first 100 tweets that the a query to the API with the given term contains. The information listed above will be the columns of the dataframe.

The second part of the pipeline is the function `get_tweets_containing_terms`, which takes in a list of terms and repeatedly calls `get_tweets_with_term`, building a dataframe as it goes. Because of concerns about hitting the maximum number of requests we can make to the API, we are currently not using this function and instead are using a script so that even if the API shuts us out, we will still have a partially completed dataframe. We are going to be collecting up to 100 tweets for 1,034 terms in the csv from Hatebase, so the dataframe could have up to 100,000 items.

In [190]:
# the product I want to query (as opposed to the 7-day or 30-day archives)
PRODUCT = 'fullarchive'
# the label for the dev environment I set up on the Twitter developer portal
LABEL = 'mydev'

from TwitterAPI import TwitterPager

def get_tweets_with_term(term):
    """
    Finds up to 100 tweets in the full archive containing the given term.
    Args:
        term (str) : the string to search for in the archive
    Returns:
        df (dataframe) : a dataframe containing information on Tweets containing the given term
    """
    
    df = pd.DataFrame()
    
    count = 0
    # the maximum number of Tweets to be collected from this search
    max_count = 100
    
    pager = TwitterPager(api, 'search/tweets', {'q': term, 'count': max_count})
    # TwitterPager returns a JSON object
    
    # Twitter limits us to 300 requests per 15 minutes, so we have to wait three seconds
    for item in pager.get_iterator(wait=5):
        #print(json.dumps(item, indent=3, sort_keys=True))
        
        info_dict = dict()
        info_dict['tweet id'] = item['id']
        info_dict['text'] = item['text']
        info_dict['time'] = item['created_at']
        info_dict['tweet location'] = item['geo']
        info_dict['hashtags'] = item['entities']['hashtags']
        info_dict['likes'] = item['favorite_count']
        info_dict['retweets'] = item['retweet_count'] 
        try:
            # we ran into some KeyErrors here, so we hypothesize that this key
            # is not present in every Tweet
            info_dict['possibly sensitive'] = item['possibly_sensitive']
        except:
            info_dict['possibly sensitive'] = None
        info_dict['name'] = item['user']['name']
        info_dict['screen name'] = item['user']['screen_name']
        info_dict['user location'] = item['user']['location']
        info_dict['user bio'] = item['user']['description']
        info_dict['flagged term'] = term
        
        count = count + 1
        if count >= max_count:
            break
        
        df = df.append(pd.Series(info_dict), ignore_index=True)
    
    return df

In [185]:
def get_tweets_containing_terms(df_flagged_tweets, terms):
    """
    Collects tweets containing any of the terms in the given list
    Args:
        df_flagged_tweets (dataframe) : the initial datafr
        terms (list) : all the terms (strings) that we want to search for in the Twitter archives
    Returns:
        df_flagged_tweets (dataframe) : a dataframe of any tweet containing the any of the given terms
    """
    
    for term in terms:
        
        df_this_term = get_tweets_with_term(term)
        
        df_flagged_tweets = df_flagged_tweets.append(df_this_term)
        
    return df_flagged_tweets
    

Below is a small test case we created to ensure our pipeline would work on a small list (names of three political figures) before we ran it on a list of over a thousand query items.

In [188]:
df_test = pd.DataFrame()
people = ['harris', 'biden', 'fauci']

for term in people:
        
    df_this_term = get_tweets_with_term(term)
    
    df_test = df_test.append(df_this_term)

df_test

Unnamed: 0,flagged term,hashtags,likes,name,possibly sensitive,retweets,screen name,text,time,tweet id,tweet location,user bio,user location
0,harris,[],0.0,Rising serpent 🇺🇸,0,28.0,rising_serpent,RT @TrumpJew2: THREAD: Kamala Harris doesn’t s...,Tue Mar 23 00:28:14 +0000 2021,1.374156e+18,,You can only fight the way you practice,United States
1,harris,[],0.0,Press Secretary’s Ushanka,0,28.0,BeenLucky7,RT @TrumpJew2: THREAD: Kamala Harris doesn’t s...,Tue Mar 23 00:28:13 +0000 2021,1.374156e+18,,“Truth is treason in the empire of lies” RP,
2,harris,[],0.0,Dave Revcar,0,1525.0,DRevcar,RT @DailyCaller: “Do you have plans to visit t...,Tue Mar 23 00:28:13 +0000 2021,1.374156e+18,,"Liberalism, pray for the cure.\nI got a like f...",
3,harris,[],0.0,Anne Tirone,,2760.0,anne_tirone,RT @DearAuntCrabby: Republicans didn't give a ...,Tue Mar 23 00:28:10 +0000 2021,1.374156e+18,,,
4,harris,[],0.0,"Kelly Allistone, CHRM ✍🏻",,111.0,kellykba,RT @flywithkamala: This adorable set of triple...,Tue Mar 23 00:28:10 +0000 2021,1.374156e+18,,Freelance copy editor and comma lover. Your qu...,"Ontario, Canada"
...,...,...,...,...,...,...,...,...,...,...,...,...,...
94,fauci,[],0.0,🤡🌎,,147.0,ClownWorld2020,RT @nedryun: Some of us back in April 2020 wer...,Tue Mar 23 00:24:05 +0000 2021,1.374155e+18,,,
95,fauci,[],0.0,Radio TFI (Home of The Taxi Stand Hour),False,0.0,TheRadioTFI,Dr. Fauci On The Johnson &amp; Johnson Vaccine...,Tue Mar 23 00:24:04 +0000 2021,1.374155e+18,,"Home of music, memories and more. As well as t...","Queens, NY - Worldwide"
96,fauci,[],0.0,curt (⧖),False,524.0,audiblevideo,RT @TeaPainUSA: This one arrogant decision kil...,Tue Mar 23 00:23:59 +0000 2021,1.374155e+18,,Follow me through the Anthropocene,los angeles
97,fauci,[],0.0,Itakepictures,,5218.0,COskidiva,"RT @kaitlancollins: ""I listened to him, but I ...",Tue Mar 23 00:23:57 +0000 2021,1.374155e+18,,"Photographer, skier, wife and mother! Trump is...","Boulder, CO"


This dataframe looks as expected, so we'll proceed with getting the data we're really interested in.

## Data collection
Here is where we run the pipeline on the whole list of slurs / hateful speech. This is an incredibly large amount of data. We are also saving it to a csv because it will take about an hour to make all the necessary queries to the Twitter API without breaking the limit, and we would rather not run it more times than necessary. We're not using a `_final.csv` structure because the only orignal csv is `hatebase_dict.csv`, which we are not modifying.

The current csv is 20.3 MB, and it took about an hour to build.

### CONTENT WARNING: 
### We have not censored Tweets that were scraped from the API at all, so this csv will contain many examples of harmful slurs.

In [193]:
df_flagged_tweets = pd.DataFrame()

In [195]:
for term in slurs_list:
        
    df_this_term = get_tweets_with_term(term)
    
    df_flagged_tweets = df_flagged_tweets.append(df_this_term)

## Saving the dataframe as a csv

In [197]:
df_flagged_tweets.to_csv('flagged_tweets.csv')

In [199]:
df_flagged_tweets.head()

Unnamed: 0,flagged term,hashtags,likes,name,possibly sensitive,retweets,screen name,text,time,tweet id,tweet location,user bio,user location
0,uncivilised,[],1.0,Quincunx,,0.0,Quincun36705461,@dcvandiver @nancycruise1 uncivilised is how i...,Tue Mar 23 00:34:49 +0000 2021,1.374158e+18,,One of the silent majority who will not be sil...,
1,uncivilised,[],5.0,British border in Ireland -Irish did’nt make it.,,0.0,border_ireland,@conormlally @ChrisMcNultyDgl Did you consider...,Tue Mar 23 00:14:01 +0000 2021,1.374152e+18,,"It’s not an Irish border, Irish didn’t put it ...",The Border
2,uncivilised,[],1.0,evil_gnome 🍄,,0.0,gnome_196883,@frontwheelskid They are supposed to be diplom...,Mon Mar 22 23:51:01 +0000 2021,1.374147e+18,,When I'm not sitting in front of a computer I ...,🇬🇧 🇲🇰 🇪🇪
3,uncivilised,[],6.0,Peter 😷,,0.0,pbmosligo,@Andrewm05562037 @JoeBrolly1993 What has no pl...,Mon Mar 22 23:30:44 +0000 2021,1.374142e+18,,- politics - current affairs - well being - bu...,Ireland
4,uncivilised,[],1.0,Loop PNG,False,0.0,looppng,Southern Command’s Assistant Commissioner of P...,Mon Mar 22 23:14:06 +0000 2021,1.374137e+18,,Loop is PNG's #1 digital news source for local...,Papua New Guinea
