# Data Preprocessing
In this Jupyter Notebook, the unique tweets dataset (`unique-labelled-tweets-01-06-2020.csv`) is preprocessed. This includes:
* Importing the data
* Preprocessing the data
* Exporting the data to a CSV file and a pickle file

## Import the data
First, the data containing all the tweets is imported.
This dataset contains the unique labelled Dutch-language tweets from 01-06-2020 that contain the word "demonstratie" (Dutch for "demonstration").

In [5]:
# Import necessary packages
import pandas as pd

# Import the data
data = pd.read_csv("~/Documents/Github Repository/early-warning-twitter/Original datasets/unique-labelled-tweets-01-06-2020.csv", index_col=0)

## Preprocessing of the data

In [8]:
# Import necessary packages
import re
import demoji
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from urllib.parse import urlparse
import requests
from bs4 import BeautifulSoup

# Download demoji codes
demoji.download_codes()

# Function to clean the coordinates columns
def clean_coordinates(row, variable):
    import ast
    
    # Get the coordinates of the row
    coordinate = row[variable]
    
    # A filled-in value is the string representation of a dictionary; anything else is NaN
    if isinstance(coordinate, str):
        coordinate = ast.literal_eval(coordinate)  # Change the string to a dictionary
        return coordinate.get('coordinates')       # Return only the coordinates (None if the key is missing)
    else:
        return None
    
# Function that checks if the tweet is a retweet (based on the text)
def is_retweet(tweet):
    # A retweet starts with "RT" followed by a whitespace character
    return re.search(r"^RT\s", tweet) is not None

# Function that cleans the hashtags, user_mentions and urls fields
def clean_field(row, variable, type_field):
    field = row[variable]
    
    if (field != "[]") and (field is not None):         # Check if the field has any value
        field_list = []
        for element in field:
            if type_field == 'hashtag':                 # Check if the field is a hashtag field
                field_list.append(element["text"])
            elif type_field == 'user_mentions':         # Check if the field is a user_mentions field
                field_list.append(element["screen_name"])
            elif type_field == 'urls':
                
                # Check if the URL is from Twitter 
                # If it is from Twitter, do not add it to the list
                url = element["expanded_url"]
                domain = urlparse(url).netloc
                if domain != "twitter.com":
                    field_list.append(url)
        
        # Return None instead of an empty list
        if field_list != []:
            return field_list
        else:
            return None
    else:
        return None
    
# Function that collects the media of a tweet and the thumbnails of shared web pages
# and stores those values in the variable "cleaned_media"
def clean_media(data):
    media_df = pd.DataFrame(columns=["url", "thumbnail_url"])
    o = 0
    
    for i in range(len(data)):
        media = data.loc[i, 'media']
        urls = data.loc[i, 'urls']
        media_list = []
        
        # If there is media in the tweet, append the media_url to media_list
        if (media != "[]") and (media is not None):
            for element in media:
                media_list.append(element["media_url"])
                
        if urls is not None:
            for url in urls:
                
                # Start without a thumbnail, so a failed request cannot reuse the previous one
                thumbnail = None
        
                # If the URL is already handled, get the thumbnail_url from the DataFrame
                if (media_df["url"]==url).any():
                    index = media_df[media_df["url"]==url].index.values.astype(int)[0]
                    thumbnail = media_df.loc[index, 'thumbnail_url']

                # Otherwise make a request to the URL
                else:

                    try:

                        # Make a request to the URL
                        reqs = requests.get(url)

                        # Get the HTML of the URL
                        soup = BeautifulSoup(reqs.text, 'html.parser')

                        # Get the thumbnail of the web page
                        thumbnail = soup.find('meta', property="og:image")

                        # Check if there is a thumbnail. If there is, grab the thumbnail URL
                        if thumbnail is not None:
                            thumbnail = thumbnail['content']
                            
                            # Store the URL and thumbnail url in the DataFrame
                            media_df.loc[o, 'url'] = url
                            media_df.loc[o, 'thumbnail_url'] = thumbnail

                            o = o+1
                    except Exception:
                        # Skip URLs that cannot be requested or parsed
                        thumbnail = None
                
                # If there is a thumbnail, add it to the media list
                if thumbnail:
                    media_list.append(thumbnail)
        
        # If we have media in the media_list, use it. Otherwise set it to None
        # .at avoids the chained assignment data["cleaned_media"][i]
        if media_list != []:
            data.at[i, "cleaned_media"] = media_list
        else:
            data.at[i, "cleaned_media"] = None
    
# Function to get the number of words in a tweet
def get_n_words(row, variable):
    tweet = row[variable]
    return len(re.findall(r'\w+', tweet)) 

# Function to get the number of characters in a tweet
def get_n_characters(row,variable):
    tweet = row[variable]
    return len(tweet)

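# Function to look up the type of a user in the labelled_users DataFrame
# (labelled_users must be loaded before this function is applied)
def get_user_type(row, variable):
    screen_name = row[variable]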
    
    # Check if tweet has screen_name 
    if (screen_name != None) and (type(screen_name) != float):
            
        # Remove single quotes from string
        screen_name = screen_name.replace("'", "")
        
        # Make screen_name lowercase
        screen_name = screen_name.lower()
                        
        # Check if we have information on this user            
        if (labelled_users['screen_name'] == screen_name).any() == True:
                
            index = labelled_users[labelled_users['screen_name'] == screen_name].index[0]
            mentioned_type = labelled_users.loc[index, 'type']
                
            # Check if the mentioned_type is not nan
            if type(mentioned_type) != float:
                mentioned_type = mentioned_type.lower()
                return mentioned_type  
            else:
                return None
        else:
            return None
    
        
# A function that takes the full_text variable and the title of a web page (if a web page is shared)
# Stores those text data in one variable
# And then cleans that variable using the clean_text() function
def clean_full_text(data):
    urls_df_text = pd.DataFrame(columns=["url", "title_url"])
    o = 0

    for i in range(len(data)):
        text = data.loc[i, 'full_text']
        urls = data.loc[i, 'urls']

        if(urls != None):
            for url in urls:

                # If the URL is already handled, get the title from the DataFrame
                if (urls_df_text["url"]==url).any() == True:
                    index = urls_df_text[urls_df_text["url"]==url].index.values.astype(int)[0]
                    title = urls_df_text.loc[index, 'title_url']

                # This URL has a cookie gate
                elif url == "https://www.ad.nl/den-haag/me-grijpt-in-bij-lockdownprotest-den-haag-37-arrestaties~a4e031d0/":
                    title = "ME grijpt in bij lockdownprotest Den Haag: 37 arrestaties"

                # This URL has a cookie gate
                elif url == "https://www.ad.nl/amsterdam/maandag-demonstratie-op-de-dam-tegen-politiegeweld-in-amerika~a3c63f8f/":
                    title = "Maandag demonstratie op de Dam tegen politiegeweld in Amerika"

                # This URL has a cookie gate
                elif (url == "http://ad.nl") or (url == "http://AD.nl"):
                    title = "AD.nl, het laatste nieuws uit binnen- en buitenland, sport en show"

                # This URL has a cookie gate
                elif (url == "https://www.ad.nl/dossier-instagram/dam-in-amsterdam-bomvol-met-duizenden-demonstranten-tegen-politiegeweld~aae7b3e9/?utm_source=twitter&utm_medium=social&utm_campaign=socialsharing_web") or (url == "https://www.ad.nl/dossier-instagram/duizenden-demonstranten-tegen-politiegeweld-dam-in-amsterdam-bomvol~aae7b3e9/?utm_source=twitter&utm_medium=social&utm_campaign=socialsharing_web") or (url == "https://www.gelderlander.nl/binnenland/duizenden-demonstranten-tegen-politiegeweld-dam-in-amsterdam-bomvol~aae7b3e9/") or (url=="https://www.ad.nl/binnenland/duizenden-demonstranten-tegen-politiegeweld-dam-in-amsterdam-bomvol~aae7b3e9/") or (url=="https://www.ad.nl/binnenland/dam-in-amsterdam-bomvol-met-duizenden-demonstranten-tegen-politiegeweld~aae7b3e9/") or (url=="https://www.pzc.nl/binnenland/dam-in-amsterdam-bomvol-met-duizenden-demonstranten-tegen-politiegeweld~aae7b3e9/") or (url=="https://www.ad.nl/binnenland/dam-in-amsterdam-bomvol-met-demonstranten-tegen-politiegeweld~aae7b3e9/"):
                    title = "Duizenden demonstranten tegen politiegeweld: Dam in Amsterdam bomvol"

                # This URL has a cookie gate
                elif (url == "https://www.ad.nl/binnenland/onbegrip-in-tweede-kamer-over-drukte-bij-racismebetoging-klap-in-gezicht-zorgverleners~addcbdf2/") or (url=="https://www.ad.nl/binnenland/massademonstratie-wekt-woede-dit-is-klap-in-gezicht~addcbdf2/?utm_source=twitter&utm_medium=social&utm_campaign=socialsharing_web"):
                    title = "Massademonstratie wekt woede: ‘Dit is klap in gezicht’"

                # This URL has a cookie gate
                elif (url == "https://www.ad.nl/dossier-instagram-ad-haagsche-courant/tachtig-anti-lockdownactivisten-aangehouden-bij-demonstratie-in-den-haag~aa966a30/?utm_source=twitter&utm_medium=social&utm_campaign=socialsharing_web"):
                    title = "Tachtig anti-lockdownactivisten aangehouden bij demonstratie in Den Haag"

                # This URL has a cookie gate
                elif (url=="https://www.geenstijl.nl/5153730/positie-halsema-onder-vuur-na-negeren-regels/"):
                    title = "Positie Halsema onder vuur na negeren regels"

                # This URL has a cookie gate
                elif (url=="https://youtu.be/R2JSFO0HeX8"):
                    title = "Demonstratie tegen de Lock Down en de anderhalve meter maatschappij!"

                # This URL has a cookie gate
                elif url == "http://ad.nl/rotterdam/vvd-amsterdamse-toestanden-voorkomen-bij-demonstratie-kozp-op-schouwburgplein~ab72c30a/":
                    title = "VVD: Amsterdamse toestanden voorkomen bij demonstratie KOZP op Schouwburgplein"

                # This URL has a cookie gate
                elif url == "https://www.destentor.nl/dossier-coronavirus/geen-laatste-afscheid-voor-echtpaar-door-getouwtrek-rond-coronaregels-bij-woonzorgcentrum~a68908c9/?utm_source=twitter&utm_medium=social&utm_campaign=socialsharing_web":
                    title = "Geen laatste afscheid voor echtpaar door getouwtrek rond coronaregels bij woonzorgcentrum"

                # This URL has a cookie gate
                elif url == "https://www.ad.nl/binnenland/burgemeester-halsema-vindt-een-politiek-standpunt-belangrijker-dan-mensenlevens~ae5f8e1f6/":
                    title = "Burgemeester Halsema vindt een politiek standpunt belangrijker dan mensenlevens"

                # These URLs no longer have a video
                elif (url == "https://youtube.com/watch?v=WFI0WNHHImo&feature=share&app=desktop") or (url == "https://youtu.be/rZ_DSCKwLFA") or (url == "https://youtu.be/Ww2YzlM1_Dk") or (url == "https://www.youtube.com/watch?v=KGEekP1102g") or (url=="https://youtu.be/w8D40uv9Wy0"):
                    title = None

                # URL with 500 error
                elif (url == "https://bit.ly/2zMG2EM"):
                    title = None

                # Facebook doesn't provide titles for posts
                elif (url == "https://www.facebook.com/1576224044/posts/10220303674493439/?app=fbl") or (url=="https://www.facebook.com/KOZPDenHaag/photos/a.103964200996568/292254242167562/?type=3&theater"):
                    title = None

                # These URLs are not external URLs
                elif (url == "https://Twitter.com/kareldoorman3/") or (url == "https://t.co/VX1a8WaIUM") or (url=="https://t.co/lQND3Q2SPk") or (url=="https://t.co/QgcHFjZmdA") or (url=="https://mobile.twitter.com/fvdemocratie/status/1267407150172831745") or (url=="https://t.co/HtWh2JswA3") or (url=="https://bit.ly/2zHG5SA"):
                    title = None

                # These are not Dutch websites, so they are not taken into account
                elif (url == "https://www.npr.org/2020/05/31/866428272/george-floyd-reverberates-globally-thousands-protest-in-germany-u-k-canada") or (url == "https://popculture.com/trending/news/man-beaten-after-chasing-protesters-sword-twitter-weighs-in/") or (url == "https://support.twitter.com/articles/20169199") or (url == "https://www.knowyourrightscamp.com/") or (url == "https://www.independent.co.uk/sport/football/european/coronavirus-news-latest-atalanta-valencia-champions-league-italy-crisis-bergamo-a9448541.html") or (url == "http://Halsema.De") or (url=="https://www.facebook.com/events/283690002806623/") or (url=="https://facebook.com/events/283690002806623/") or (url=="https://docs.google.com/document/d/1BRlF2_zhNe86SGgHa6-VlBO-QgirITwCTugSfKie5Fs/mobilebasic") or (url=="https://wapo.st/2AqMiCn") or (url=="http://g1.globo.com/globo-news/protestos-no-brasil/videos/t/todos-os-videos/v/policial-aponta-fuzil-para-manifestante-no-rio-de-janeiro/8592982") or (url=="https://www.google.com/amp/s/amp.theguardian.com/world/2020/may/20/black-americans-death-rate-covid-19-coronavirus"):
                    title = None

                # Otherwise make a request to the URL
                else:
                    # Start without a title, so a failed request cannot reuse the previous one
                    title = None

                    try:
                        # Make a request to the URL
                        reqs = requests.get(url)

                        # Get the HTML of the URL
                        soup = BeautifulSoup(reqs.text, 'html.parser')

                        # Get the title of the web page as a string (if there is one)
                        title_tag = soup.find('title')
                        if title_tag is not None:
                            title = title_tag.get_text()

                        # Ignore the titles of error pages and login pages
                        if title and (("Pagina niet gevonden" in title) or ("Aanmelden bij Facebook" in title) or ("Not Found" in title) or ("Pagina bestaat niet meer" in title)):
                            title = None

                        # Store the URL and title in the DataFrame
                        urls_df_text.loc[o, 'url'] = url
                        urls_df_text.loc[o, 'title_url'] = title

                        o = o+1
                    except Exception:
                        # Skip URLs that cannot be requested or parsed
                        title = None

                # Append the title (if any) to the text, keeping the titles of earlier URLs
                # A cached URL without a title can come back as NaN, hence the type check
                if isinstance(title, str) and title:
                    text = text + " " + title

        # .at avoids the chained assignment data[column][i]
        data.at[i, "preprocessed_text"] = clean_text(text, 'keep', 'string')
        data.at[i, "preprocessed_text_no_hashtag"] = clean_text(text, 'lose', 'string')
        data.at[i, "preprocessed_text_tokenized"] = clean_text(text, 'keep', 'list')
        data.at[i, "preprocessed_text_tokenized_no_hashtag"] = clean_text(text, 'lose', 'list')
        
# A function that takes the full_text variable and the title of a web page (if a web page is shared)
# Stores those text data in one variable
# And then cleans that variable using the clean_text_partly() function
def clean_full_text_partly(data):
    urls_df_text = pd.DataFrame(columns=["url", "title_url"])
    o = 0

    for i in range(len(data)):
        text = data.loc[i, 'full_text']
        urls = data.loc[i, 'urls']

        if(urls != None):
            for url in urls:

                # If the URL is already handled, get the title from the DataFrame
                if (urls_df_text["url"]==url).any() == True:
                    index = urls_df_text[urls_df_text["url"]==url].index.values.astype(int)[0]
                    title = urls_df_text.loc[index, 'title_url']

                # This URL has a cookie gate
                elif url == "https://www.ad.nl/den-haag/me-grijpt-in-bij-lockdownprotest-den-haag-37-arrestaties~a4e031d0/":
                    title = "ME grijpt in bij lockdownprotest Den Haag: 37 arrestaties"

                # This URL has a cookie gate
                elif url == "https://www.ad.nl/amsterdam/maandag-demonstratie-op-de-dam-tegen-politiegeweld-in-amerika~a3c63f8f/":
                    title = "Maandag demonstratie op de Dam tegen politiegeweld in Amerika"

                # This URL has a cookie gate
                elif (url == "http://ad.nl") or (url == "http://AD.nl"):
                    title = "AD.nl, het laatste nieuws uit binnen- en buitenland, sport en show"

                # This URL has a cookie gate
                elif (url == "https://www.ad.nl/dossier-instagram/dam-in-amsterdam-bomvol-met-duizenden-demonstranten-tegen-politiegeweld~aae7b3e9/?utm_source=twitter&utm_medium=social&utm_campaign=socialsharing_web") or (url == "https://www.ad.nl/dossier-instagram/duizenden-demonstranten-tegen-politiegeweld-dam-in-amsterdam-bomvol~aae7b3e9/?utm_source=twitter&utm_medium=social&utm_campaign=socialsharing_web") or (url == "https://www.gelderlander.nl/binnenland/duizenden-demonstranten-tegen-politiegeweld-dam-in-amsterdam-bomvol~aae7b3e9/") or (url=="https://www.ad.nl/binnenland/duizenden-demonstranten-tegen-politiegeweld-dam-in-amsterdam-bomvol~aae7b3e9/") or (url=="https://www.ad.nl/binnenland/dam-in-amsterdam-bomvol-met-duizenden-demonstranten-tegen-politiegeweld~aae7b3e9/") or (url=="https://www.pzc.nl/binnenland/dam-in-amsterdam-bomvol-met-duizenden-demonstranten-tegen-politiegeweld~aae7b3e9/") or (url=="https://www.ad.nl/binnenland/dam-in-amsterdam-bomvol-met-demonstranten-tegen-politiegeweld~aae7b3e9/"):
                    title = "Duizenden demonstranten tegen politiegeweld: Dam in Amsterdam bomvol"

                # This URL has a cookie gate
                elif (url == "https://www.ad.nl/binnenland/onbegrip-in-tweede-kamer-over-drukte-bij-racismebetoging-klap-in-gezicht-zorgverleners~addcbdf2/") or (url=="https://www.ad.nl/binnenland/massademonstratie-wekt-woede-dit-is-klap-in-gezicht~addcbdf2/?utm_source=twitter&utm_medium=social&utm_campaign=socialsharing_web"):
                    title = "Massademonstratie wekt woede: ‘Dit is klap in gezicht’"

                # This URL has a cookie gate
                elif (url == "https://www.ad.nl/dossier-instagram-ad-haagsche-courant/tachtig-anti-lockdownactivisten-aangehouden-bij-demonstratie-in-den-haag~aa966a30/?utm_source=twitter&utm_medium=social&utm_campaign=socialsharing_web"):
                    title = "Tachtig anti-lockdownactivisten aangehouden bij demonstratie in Den Haag"

                # This URL has a cookie gate
                elif (url=="https://www.geenstijl.nl/5153730/positie-halsema-onder-vuur-na-negeren-regels/"):
                    title = "Positie Halsema onder vuur na negeren regels"

                # This URL has a cookie gate
                elif (url=="https://youtu.be/R2JSFO0HeX8"):
                    title = "Demonstratie tegen de Lock Down en de anderhalve meter maatschappij!"

                # This URL has a cookie gate
                elif url == "http://ad.nl/rotterdam/vvd-amsterdamse-toestanden-voorkomen-bij-demonstratie-kozp-op-schouwburgplein~ab72c30a/":
                    title = "VVD: Amsterdamse toestanden voorkomen bij demonstratie KOZP op Schouwburgplein"

                # This URL has a cookie gate
                elif url == "https://www.destentor.nl/dossier-coronavirus/geen-laatste-afscheid-voor-echtpaar-door-getouwtrek-rond-coronaregels-bij-woonzorgcentrum~a68908c9/?utm_source=twitter&utm_medium=social&utm_campaign=socialsharing_web":
                    title = "Geen laatste afscheid voor echtpaar door getouwtrek rond coronaregels bij woonzorgcentrum"

                # This URL has a cookie gate
                elif url == "https://www.ad.nl/binnenland/burgemeester-halsema-vindt-een-politiek-standpunt-belangrijker-dan-mensenlevens~ae5f8e1f6/":
                    title = "Burgemeester Halsema vindt een politiek standpunt belangrijker dan mensenlevens"

                # These URLs no longer have a video
                elif (url == "https://youtube.com/watch?v=WFI0WNHHImo&feature=share&app=desktop") or (url == "https://youtu.be/rZ_DSCKwLFA") or (url == "https://youtu.be/Ww2YzlM1_Dk") or (url == "https://www.youtube.com/watch?v=KGEekP1102g") or (url=="https://youtu.be/w8D40uv9Wy0"):
                    title = None

                # URL with 500 error
                elif (url == "https://bit.ly/2zMG2EM"):
                    title = None

                # Facebook doesn't provide titles for posts
                elif (url == "https://www.facebook.com/1576224044/posts/10220303674493439/?app=fbl") or (url=="https://www.facebook.com/KOZPDenHaag/photos/a.103964200996568/292254242167562/?type=3&theater"):
                    title = None

                # These URLs are not external URLs
                elif (url == "https://Twitter.com/kareldoorman3/") or (url == "https://t.co/VX1a8WaIUM") or (url=="https://t.co/lQND3Q2SPk") or (url=="https://t.co/QgcHFjZmdA") or (url=="https://mobile.twitter.com/fvdemocratie/status/1267407150172831745") or (url=="https://t.co/HtWh2JswA3") or (url=="https://bit.ly/2zHG5SA"):
                    title = None

                # These are not Dutch websites, so they are not taken into account
                elif (url == "https://www.npr.org/2020/05/31/866428272/george-floyd-reverberates-globally-thousands-protest-in-germany-u-k-canada") or (url == "https://popculture.com/trending/news/man-beaten-after-chasing-protesters-sword-twitter-weighs-in/") or (url == "https://support.twitter.com/articles/20169199") or (url == "https://www.knowyourrightscamp.com/") or (url == "https://www.independent.co.uk/sport/football/european/coronavirus-news-latest-atalanta-valencia-champions-league-italy-crisis-bergamo-a9448541.html") or (url == "http://Halsema.De") or (url=="https://www.facebook.com/events/283690002806623/") or (url=="https://facebook.com/events/283690002806623/") or (url=="https://docs.google.com/document/d/1BRlF2_zhNe86SGgHa6-VlBO-QgirITwCTugSfKie5Fs/mobilebasic") or (url=="https://wapo.st/2AqMiCn") or (url=="http://g1.globo.com/globo-news/protestos-no-brasil/videos/t/todos-os-videos/v/policial-aponta-fuzil-para-manifestante-no-rio-de-janeiro/8592982") or (url=="https://www.google.com/amp/s/amp.theguardian.com/world/2020/may/20/black-americans-death-rate-covid-19-coronavirus"):
                    title = None

                # Otherwise make a request to the URL
                else:
                    # Start without a title, so a failed request cannot reuse the previous one
                    title = None

                    try:
                        # Make a request to the URL
                        reqs = requests.get(url)

                        # Get the HTML of the URL
                        soup = BeautifulSoup(reqs.text, 'html.parser')

                        # Get the title of the web page as a string (if there is one)
                        title_tag = soup.find('title')
                        if title_tag is not None:
                            title = title_tag.get_text()

                        # Ignore the titles of error pages and login pages
                        if title and (("Pagina niet gevonden" in title) or ("Aanmelden bij Facebook" in title) or ("Not Found" in title) or ("Pagina bestaat niet meer" in title)):
                            title = None

                        # Store the URL and title in the DataFrame
                        urls_df_text.loc[o, 'url'] = url
                        urls_df_text.loc[o, 'title_url'] = title

                        o = o+1
                    except Exception:
                        # Skip URLs that cannot be requested or parsed
                        title = None

                # Append the title (if any) to the text, keeping the titles of earlier URLs
                # A cached URL without a title can come back as NaN, hence the type check
                if isinstance(title, str) and title:
                    text = text + " " + title

        # .at avoids the chained assignment data[column][i]
        data.at[i, "preprocessed_text_partly"] = clean_text_partly(text)
        
# Clean the text of the tweet
def clean_text(text, hashtag_text='keep', representation = 'string'):
    
    # Parameters
    # hashtag_text, default = 'keep'
        # 'keep' - keeps the hashtag text and only removes the '#' in the text
        # 'lose' - both removes the hashtag text and the '#' in the text
    # representation, default = 'string'
        # 'list' - returns a list of words
        # 'string' - returns a sentence in string format
    
    tweet = text
    
    # Make the tweet lowercase
    tweet = tweet.lower()
    
    # Remove words of one or two characters
    tweet = re.sub(r'\b\w{1,2}\b', '', tweet)
    
    # Remove URLs
    tweet = remove_url(tweet)
    
    # Remove punctuations unless they are part of a digit (such as "5.000")
    tweet = re.sub(r'(?:(?<!\d)[.,;:…‘]|[.,;:…‘](?!\d))', '', tweet)
    
    # Remove emojis
    tweet = demoji.replace(tweet, "")
    
    if hashtag_text == 'keep':
        tweet = tweet.replace("#", "")
        # Remove mentions (also the text after the @; \w also covers underscores in user names)
        tweet = ' '.join(re.sub(r"(@\w+)", "", tweet).split())
    else:
        # Remove hashtags and mentions (also the text after the # and @)
        tweet = ' '.join(re.sub(r"(@\w+)|(#\w+)", "", tweet).split())
    
    # Remove non-alphanumeric characters, line breaks and tabs
    tweet = ' '.join(re.sub(r"([:/\!@#$%^&*()_+{}[\];\"”\'|?<>~`\-\n\t’])", "", tweet).split())
    
    # Tokenize the tweet
    tweet = word_tokenize(tweet)
    
    # Use Dutch stop words, extended with retweet and missing-value markers
    stop_words = set(stopwords.words('dutch') + ["rt", "nan", "NaN"])
    
    # Remove stopwords
    tweet = [w for w in tweet if w not in stop_words]
    
    if representation == 'list':
        return tweet
    else:
        return listToString(tweet)

# Function that partly cleans the text of a tweet
def clean_text_partly(text):
    
    tweet = text
    
    # Make the tweet lowercase
    tweet = tweet.lower()
    
    # Remove URLs
    tweet = remove_url(tweet)
    
    # Remove emojis
    tweet = demoji.replace(tweet, "")
    
    # Remove mentions (also the text after the @; \w also covers underscores in user names)
    tweet = ' '.join(re.sub(r"(@\w+)", "", tweet).split())
    
    # Remove non-alphanumeric characters (except '#' and '-'), line breaks and tabs
    tweet = ' '.join(re.sub(r"([:/\!@$%^&*()_+{}[\];\"”\'|?<>~`\n\t’])", "", tweet).split())
    
    return tweet
    
# Function to convert a list to a string
def listToString(s):  
    
    # initialize an empty string 
    str1 = " " 
    
    # return string   
    return (str1.join(s)) 
    
def remove_url(tweet_text):
    if has_url_regex(tweet_text): 
        url_regex_list = regex_url_extractor(tweet_text)
        for url in url_regex_list:
            tweet_text = tweet_text.replace(url, "")
    return tweet_text

def has_url_regex(tweet_text):
    return regex_url_extractor(tweet_text)

def regex_url_extractor(tweet_text):
    # Raw string, so the backslashes reach the regex engine unchanged
    return re.findall(r'https?://(?:[-\w/.]|(?:%[\da-fA-F]{2}))+', tweet_text)

Downloading emoji data ...
... OK (Got response in 0.38 seconds)
Writing emoji data to /Users/jorenwouters/.demoji/codes.json ...
... OK


In [10]:
import json5

# Remove duplicate tweets
data = data.drop_duplicates('id')

# Change columns to the right data types
data["created_at"] = pd.to_datetime(data["created_at"])
data["org_tweet_created_at"] = pd.to_datetime(data["org_tweet_created_at"])

# Add 2 hours to the datetime columns
# The original datetimes are in UTC, while the Netherlands was on summer time (UTC+2) on 1 June 2020
data["created_at"] = data["created_at"] + pd.Timedelta(hours=2)
data["org_tweet_created_at"] = data["org_tweet_created_at"] + pd.Timedelta(hours=2)

# Get the month, day, hour and minute of the datetime separately
data["month"] = data["created_at"].dt.month
data["day"] = data["created_at"].dt.day
data["hour"] = data["created_at"].dt.hour
data["minute"] = data["created_at"].dt.minute

# Only select the data that is on 1 June 2020
# .copy() makes the selection an independent DataFrame, so new columns can be added safely
data = data[(data["day"]==1)&(data["month"]==6)].copy()

# Apply clean_coordinates() to every row in tweet_coordinates column 
data["coordinates"] = data.apply(clean_coordinates, args=(["coordinates"]), axis=1)

# Create separate columns for hashtags, user_mentions, urls, media and text variables
data["hashtags"] = ""
data["user_mentions"] = ""
data["urls"] = ""
data["media"] = ""
data["cleaned_media"] = ""
data["has_media"] = False
data["preprocessed_text"] = ""
data["preprocessed_text_no_hashtag"] = ""
data["preprocessed_text_tokenized"] = ""
data["preprocessed_text_tokenized_no_hashtag"] = ""
data["preprocessed_text_partly"] = ""

data.reset_index(inplace=True, drop=True)

# Convert entities column to four separate columns (hashtags, user_mentions, urls and media)
for i in range(len(data)):
    entities_string = data.loc[i, 'entities']
    
    # Convert string to dictionary
    entities_dict = json5.loads(entities_string)

    # Convert it to multiple variables
    # .at sets a single cell and avoids the chained assignment data[column][i]
    data.at[i, "hashtags"] = entities_dict.get("hashtags")
    data.at[i, "user_mentions"] = entities_dict.get("user_mentions")
    data.at[i, "urls"] = entities_dict.get("urls")
    data.at[i, "media"] = entities_dict.get("media")

# Clean hashtags, user_mentions, urls and media fields
data["hashtags"] = data.apply(clean_field, args=(["hashtags", "hashtag"]), axis=1)
data["user_mentions"] = data.apply(clean_field, args=(["user_mentions", "user_mentions"]), axis=1)
data["urls"] = data.apply(clean_field, args=(["urls", "urls"]), axis=1)
clean_media(data)

# Flag the tweets that have media (cleaned_media is None when there is none)
data["has_media"] = data["cleaned_media"].notna()
    
# We need a variable to count the number of cases in a certain time window
data['count'] = 1

# Get the number of words and the number of characters in a tweet
data["n_words"] = data.apply(get_n_words, args=(["full_text"]), axis=1)
data["n_characters"] = data.apply(get_n_characters, args=(["full_text"]), axis=1)
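
Filtering a DataFrame and then assigning values cell by cell is the situation in which pandas can emit a `SettingWithCopyWarning`. The sketch below shows two common remedies on a hypothetical toy DataFrame (`df`, `subset` and their columns are made up for illustration): take an explicit `.copy()` after filtering, and write single cells with `.at` rather than `frame[column][i]`.

```python
import pandas as pd

# Hypothetical toy data standing in for the tweets DataFrame
df = pd.DataFrame({"day": [31, 1, 1], "entities": ["a", "b", "c"]})

# Take an explicit copy after filtering, so later writes target an independent DataFrame
subset = df[df["day"] == 1].copy()
subset.reset_index(inplace=True, drop=True)

# An object column can hold arbitrary Python objects, such as lists
subset["hashtags"] = None

for i in range(len(subset)):
    # .at addresses exactly one cell by row label and column name
    subset.at[i, "hashtags"] = [subset.at[i, "entities"].upper()]

print(subset)
```

The original `df` is left untouched, and each cell of `subset["hashtags"]` holds its own list.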


### What type of user tweeted?

In [4]:
# Import dataset with labelled users
# This must be a dataframe with two columns (screen_name and type)
labelled_users = pd.read_pickle("~/Documents/Github Repository/early-warning-twitter/Processed datasets/Users/01-06-2020-amsterdam-demonstration-all-interesting-users-labelled.pkl")

data["type_user"] = data.apply(get_user_type, args=(["user_screen_name"]), axis=1)
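
`get_user_type()` searches `labelled_users` once for every tweet. As an alternative sketch, the same lookup can be written as a dictionary-based `Series.map`. The toy DataFrames and the names `users`, `tweets` and `type_by_name` are hypothetical, the comparison is made case-insensitive on both sides (an assumption), and unknown users come back as NaN rather than None.

```python
import pandas as pd

# Hypothetical stand-ins for labelled_users and the tweets DataFrame
users = pd.DataFrame({
    "screen_name": ["nos", "politie"],
    "type": ["Media", None],
})
tweets = pd.DataFrame({"user_screen_name": ["'NOS'", "Politie", "someone", None]})

# Build the lookup once: lowercased screen name -> lowercased type
# Users without a type are left out, so they map to a missing value
known = users.dropna(subset=["type"])
type_by_name = dict(zip(known["screen_name"].str.lower(), known["type"].str.lower()))

# Normalise the screen names (strip single quotes, lowercase), then map them
normalised = tweets["user_screen_name"].str.replace("'", "", regex=False).str.lower()
tweets["type_user"] = normalised.map(type_by_name)

print(tweets)
```

Building the dictionary once keeps the per-tweet work to a single lookup, which matters more as the number of tweets grows.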

## Preprocessing for Text Mining

In addition, the text of the tweets has to be preprocessed before it can be used for text mining.

In [5]:
# Cleans the text in a tweet
clean_full_text(data)

# Partly cleans the text in a tweet
# Necessary for back translation of the tweets
clean_full_text_partly(data)
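
Both functions remember which URLs were already requested, so that a page shared in many tweets is fetched only once. The same idea can be sketched with a plain dictionary as cache. `TitleCache` and the injected `fetch` callable are hypothetical, and `fake_fetch` stands in for the `requests`/`BeautifulSoup` lookup so that the example runs offline. Unlike the notebook's functions, this sketch also remembers failed lookups.

```python
from typing import Callable, Dict, Optional


class TitleCache:
    """Remembers the title found for each URL, including failed lookups."""

    def __init__(self, fetch: Callable[[str], Optional[str]]):
        self._fetch = fetch
        self._titles: Dict[str, Optional[str]] = {}

    def get(self, url: str) -> Optional[str]:
        if url not in self._titles:
            try:
                self._titles[url] = self._fetch(url)
            except Exception:
                # Treat any failure as "no title" and do not retry this URL
                self._titles[url] = None
        return self._titles[url]


calls = []


def fake_fetch(url: str) -> Optional[str]:
    """Offline stand-in for an HTTP request; records every call it receives."""
    calls.append(url)
    if "broken" in url:
        raise ConnectionError("simulated network failure")
    return "Title of " + url


cache = TitleCache(fake_fetch)
titles = [cache.get(u) for u in ["https://example.org/a",
                                 "https://example.org/a",
                                 "https://example.org/broken"]]
print(titles)
print(len(calls))  # 2: the repeated URL was served from the cache
```

Passing the fetch function in as an argument also makes the caching logic testable without network access.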


## Export datasets

In [13]:
# Reset the index
data.reset_index(inplace=True, drop=True)

# Export the dataset to a CSV file and a pickle file
data.to_csv("~/Documents/Github Repository/early-warning-twitter/Processed datasets/Tweets/01-06-2020-unique-tweets-amsterdam-demonstration.csv")
data.to_pickle("~/Documents/Github Repository/early-warning-twitter/Processed datasets/Tweets/01-06-2020-unique-tweets-amsterdam-demonstration.pkl")
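
The two formats are not interchangeable for this kind of data: columns that hold Python lists, such as tokenized text, come back from CSV as plain strings, while pickle restores them as lists. A small round-trip sketch on a hypothetical one-row DataFrame, written to a temporary directory rather than the notebook's paths:

```python
import ast
import os
import tempfile

import pandas as pd

# Hypothetical one-row frame with a list-valued column
df = pd.DataFrame({"id": [1], "tokens": [["demonstratie", "dam"]]})

with tempfile.TemporaryDirectory() as tmp:
    csv_path = os.path.join(tmp, "tweets.csv")
    pkl_path = os.path.join(tmp, "tweets.pkl")

    df.to_csv(csv_path)
    df.to_pickle(pkl_path)

    from_csv = pd.read_csv(csv_path, index_col=0)
    from_pkl = pd.read_pickle(pkl_path)

# CSV stores the list as its text representation
print(type(from_csv.loc[0, "tokens"]))

# Pickle restores the original Python list
print(type(from_pkl.loc[0, "tokens"]))

# The CSV text can be parsed back into a list when needed
restored = ast.literal_eval(from_csv.loc[0, "tokens"])
print(restored)
```

The CSV file remains useful for inspection in other tools, while the pickle file is the more convenient input for follow-up notebooks that need the list columns.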