# Data Preprocessing
In this Jupyter Notebook, the initial dataset (tweets-31-05--06-06-2020.csv) will be preprocessed. This includes:
* Importing the data
* Preprocessing the data
* Exporting the data to CSV and Pickle files

## Import the data
First, the data containing all the tweets will be imported.
This dataset contains Dutch-language tweets posted between 31-05-2020 and 06-06-2020 that contain the word "demonstratie" (Dutch for "demonstration").
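
The hashtags are stored in the CSV as the text form of a Python list, so they have to be parsed back into a list on import. The sketch below shows one way to do that with `ast.literal_eval`, which only accepts literals and never executes code. The helper name `parse_hashtags`, the treatment of empty cells and the sample cell are assumptions made for illustration, not part of the original notebook.

```python
import ast

def parse_hashtags(cell):
    """Parse a CSV cell holding the text form of a list of hashtag objects."""
    # Empty cells are treated as "no hashtags" (an assumption of this sketch)
    if cell is None or cell.strip() == "":
        return []
    # literal_eval only accepts Python literals, so no arbitrary code is run
    return ast.literal_eval(cell)

# Hypothetical cell value, shaped like the hashtag objects used later on
sample_cell = "[{'text': 'demonstratie', 'indices': [0, 13]}]"

parsed = parse_hashtags(sample_cell)
print(parsed[0]["text"])
print(parse_hashtags(""))
```

A function like this could be passed to `pd.read_csv` through the `converters` argument, for example `converters={'hashtags': parse_hashtags}`.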

In [1]:
# Import necessary packages
import pandas as pd
import ast

# Import the data and make sure that
# the 'hashtags' column is parsed into a list (ast.literal_eval only accepts literals, unlike eval)
data = pd.read_csv("~/Documents/Github Repository/early-warning-twitter/Original datasets/tweets-31-05--06-06-2020.csv", converters={'hashtags': ast.literal_eval}, index_col=0)

## Preprocess the data
Now, we will preprocess the data so that the final dataset can be used for our analysis.

The preprocessing of our data consists of four steps:
* Converting the variables to the right data types
* Automatically getting the mentioned users from tweets (the mentioned users are labelled in another Notebook)
* Adding a variable that describes the type of user that was mentioned
* Preprocessing of text in tweets for text mining

### Converting the variables to the right data types

This will include:
- Changing variables to the right data types
- Transforming the hashtag objects to lists
- Transforming the coordinate objects to lists
- Transforming the user-determined place (plain text) to a place name
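
As a rough illustration of the hashtag and coordinate transformations listed above, the sketch below works on single values instead of DataFrame rows. The helper names `hashtag_texts` and `coordinate_pair` and the sample values are hypothetical; the notebook's own functions follow in the next cell.

```python
import ast

def hashtag_texts(hashtags):
    """Reduce a list of hashtag objects to a list of hashtag strings."""
    # Anything that is not a non-empty list counts as "no hashtags"
    if not isinstance(hashtags, list) or not hashtags:
        return None
    return [tag["text"] for tag in hashtags]

def coordinate_pair(raw):
    """Extract the coordinate list from the text form of a coordinates object."""
    # Missing values arrive as NaN (a float), so only strings are parsed
    if not isinstance(raw, str):
        return None
    return ast.literal_eval(raw).get("coordinates")

# Hypothetical sample values
print(hashtag_texts([{"text": "demonstratie"}, {"text": "amsterdam"}]))
print(hashtag_texts([]))
print(coordinate_pair("{'type': 'Point', 'coordinates': [4.9, 52.37]}"))
print(coordinate_pair(float("nan")))
```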

In [8]:
# Import necessary packages
import re
import demoji
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

# Download demoji codes (only needed for demoji versions before 1.0; newer versions bundle the emoji data)
demoji.download_codes()

# Create all the necessary functions that we need in this Jupyter Notebook

# Function to clean the 'hashtags' column
def clean_hashtag(row, variable):
    # Get the hashtags of the row
    hashtag = row[variable]
    
    if isinstance(hashtag, list) and hashtag:  # Check if the tweet contains any hashtags
        hashtag_list = []
        for element in hashtag:
            hashtag_list.append(element["text"])  # Append every hashtag to a list
        return hashtag_list
    else:
        return None

# Function to clean the coordinates columns
def clean_coordinates(row, variable):
    # Get the coordinates of the row
    coordinate = row[variable]
    
    # Check if the coordinates field has a value
    if isinstance(coordinate, str):
        # Change the string to a dictionary (so we can get the necessary elements)
        coordinate = ast.literal_eval(coordinate)
        return coordinate.get('coordinates')  # Return only the coordinates
    else:
        # If the row is not a string it is always a NaN, so we can set this to None
        return None

# Function to get only the place name if tweet.place was stored as a whole
def clean_place(row, variable):
    # Get the place of the row
    place = row[variable]

    if not (pd.isnull(place)):
        # Remove first unnecessary characters of the string
        place = place[54:-1]

        # Change = to :
        place = place.replace('=', ':')

        # Convert string to list, so we can delete elements
        place = list(place.split(",")) 

        # Get the place name
        place = place[3]
        place = place.replace(':', ',')
        place = list(place.split(","))[1][1:-1]
    
        return place

# Function to get user mentions from a tweet
def get_user_mentions(row, variable):
    text = row[variable]
    
    # Regex to get the user mentions in a tweet
    user_mentions = re.findall(r"(?<![@\w])@(\w{1,25})", text)
    
    # Return the lowercased user mentions, or None if there are none
    if user_mentions:
        return [mention.lower() for mention in user_mentions]
    else:
        return None
    
# Function to make the value of a text column lowercase
def var_to_lower(row, variable):
    return row[variable].lower()

# Clean the text of the tweet
def clean_text(row, variable, hashtag_text='keep', representation = 'string'):
    
    # Parameters
    # hashtag_text, default = 'keep'
        # 'keep' - keeps the hashtag text and only removes the '#' in the text
        # 'lose' - both removes the hashtag text and the '#' in the text
    # representation, default = 'string'
        # 'list' - returns a list of words
        # 'string' - returns a sentence in string format
    
    tweet = row[variable]
    
    # Make the tweet lowercase
    tweet = tweet.lower()
    
    # Remove words of one or two characters
    tweet = re.sub(r'\b\w{1,2}\b', '', tweet)
    
    # Remove URLs
    tweet = remove_url(tweet)
    
    # Remove punctuations unless they are part of a digit (such as "5.000")
    tweet = re.sub(r'(?:(?<!\d)[.,;:…‘]|[.,;:…‘](?!\d))', '', tweet)
    
    # Remove emojis
    tweet = demoji.replace(tweet, "")
    
    if hashtag_text == 'keep':
        tweet = tweet.replace("#", "")
        tweet = ' '.join(re.sub("(@[A-Za-z0-9]+)", "", tweet).split())
    else:
        # Remove hashtags and mentions
        tweet = ' '.join(re.sub("(@[A-Za-z0-9]+)|(#[A-Za-z0-9]+)", "", tweet).split())
    
    # Remove non-alphanumeric characters, line breaks and tabs
    tweet = ' '.join(re.sub(r"[:/!@#$%^&*()_+{}\[\];\"”'|?<>~`\-\n\t’]", "", tweet).split())
    
    # Tokenize the tweet
    tweet = word_tokenize(tweet)
    
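    # Use Dutch stop words, plus a few tokens that carry no meaning here
    # (the tweet is already lowercase, so only lowercase tokens are needed)
    stop_words = set(stopwords.words('dutch') + ["rt", "nan"])
    
    # Remove stopwords
    tweet = [w for w in tweet if w not in stop_words]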
    
    if representation == 'list':
        return tweet
    else:
        return listToString(tweet)

# Function to join a list of words into a single space-separated string
def listToString(s):
    return " ".join(s)

# Function to remove all URLs from the text of a tweet
def remove_url(tweet_text):
    for url in regex_url_extractor(tweet_text):
        tweet_text = tweet_text.replace(url, "")
    return tweet_text

# Function to check whether the text contains at least one URL
def has_url_regex(tweet_text):
    return bool(regex_url_extractor(tweet_text))

# Function to extract all URLs from the text
def regex_url_extractor(tweet_text):
    return re.findall(r'https?://(?:[-\w/.]|(?:%[\da-fA-F]{2}))+', tweet_text)

def get_mentioned(row, variable):
    user_mentions = row[variable]
    list_mentioned = []
    
    # Check if tweet has mentioned_user
    if (user_mentions != None) and (type(user_mentions) != float):
        
        # Make a list of user mentions (if it is a string)
        if type(user_mentions) == str:
            user_mentions = user_mentions.strip('][').split(', ') 
        
        # For every user in user_mentions
        for screen_name in user_mentions:
            
            # Remove single quotes from string
            screen_name = screen_name.replace("'", "")
                        
            # Check if we have information on this user            
            if (labelled_users['screen_name'] == screen_name).any() == True:
                
                index = labelled_users[labelled_users['screen_name'] == screen_name].index[0]
                mentioned_type = labelled_users.loc[index, 'type']
                
                # Check if the mentioned_type is not nan
                if (type(mentioned_type) != float) and (mentioned_type != 'no type'):
                    mentioned_type = mentioned_type.lower()
                    list_mentioned.append(mentioned_type)
                # If we have information on this user, determine what type of user it is
                # Store type of user in the variable "mentioned"
                #list_mentioned.append(   The type of user still has to go here (we need to look it up somewhere)   )
                    
    if list_mentioned != []:        
        return list_mentioned
    else:
        return None

def get_user_type(row, variable):
    screen_name = row[variable]
    list_mentioned = []
    
    # Check if tweet has screen_name 
    if (screen_name != None) and (type(screen_name) != float):
            
        # Remove single quotes from string
        screen_name = screen_name.replace("'", "")
        
        # Make screen_name lowercase
        screen_name = screen_name.lower()
                        
        # Check if we have information on this user            
        if (labelled_users['screen_name'] == screen_name).any() == True:
                
            index = labelled_users[labelled_users['screen_name'] == screen_name].index[0]
            mentioned_type = labelled_users.loc[index, 'type']
                
            # Check if the mentioned_type is not nan
            if (type(mentioned_type) != float) and (mentioned_type != 'no type'):
                mentioned_type = mentioned_type.lower()
                return mentioned_type  
            else:
                return None
        else:
            return None

# Function that checks if the tweet is a retweet (if this hasn't already been extracted from the Twitter API)
def is_retweet(row, variable):
    tweet = row[variable]
    # A retweet starts with "RT" followed by a whitespace character
    return re.search(r"^RT\s", tweet) is not None

Downloading emoji data ...
... OK (Got response in 0.37 seconds)
Writing emoji data to /Users/jorenwouters/.demoji/codes.json ...
... OK


In [9]:
# Remove duplicate tweets
data = data.drop_duplicates('id')

# Change columns to the right data types
data["created_at"] = pd.to_datetime(data["created_at"])
data["org_tweet_created_at"] = pd.to_datetime(data["org_tweet_created_at"])

# Add 2 hours to the datetime columns
# Original datetime is in UTC, but The Netherlands is in UTC+2 during summer time (CEST),
# which holds for late May and early June 2020
data["created_at"] = data["created_at"] + pd.Timedelta(hours=2)
data["org_tweet_created_at"] = data["org_tweet_created_at"] + pd.Timedelta(hours=2)

# Get the month, day, hour and minute of the datetime separately
data["month"] = data["created_at"].dt.month
data["day"] = data["created_at"].dt.day
data["hour"] = data["created_at"].dt.hour
data["minute"] = data["created_at"].dt.minute

# Only select the data from 1 June 2020 (local time); year and month are checked as well,
# so that the first day of any other month can never slip through
data = data[(data["created_at"].dt.year == 2020) & (data["month"] == 6) & (data["day"] == 1)]

# Apply clean_coordinates() to every row in the coordinates column
data["coordinates"] = data.apply(clean_coordinates, args=("coordinates",), axis=1)

# Apply clean_hashtag() to every row in the hashtags column
data["hashtags"] = data.apply(clean_hashtag, args=("hashtags",), axis=1)

# Apply clean_place() to every row in the place column (only necessary if tweet.place is used as a whole)
data["place"] = data["place"].astype('string')
data["place"] = data.apply(clean_place, args=("place",), axis=1)

# Check if tweet is a retweet
data["retweeted"] = data.apply(is_retweet, args=(["text"]), axis=1)

# We need a variable to count the number of cases in a certain time window
data['count'] = 1

### Automatically getting the mentioned users from tweets
Now, we will:
* automatically get all the users that are mentioned in the tweets, and
* determine the user type (based on an additional dataset)
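
To make the mention extraction concrete, here is a minimal sketch of the regular expression approach on a single string. The pattern is the one used in `get_user_mentions`; the helper name `extract_mentions` and the sample tweets are made up for illustration.

```python
import re

# "@" followed by 1 to 25 word characters, but only when the "@" is not
# preceded by another "@" or a word character (this skips e-mail addresses)
MENTION_PATTERN = re.compile(r"(?<![@\w])@(\w{1,25})")

def extract_mentions(text):
    """Return the lowercased screen names mentioned in a tweet, or None."""
    mentions = [name.lower() for name in MENTION_PATTERN.findall(text)]
    return mentions or None

# Hypothetical sample tweets
print(extract_mentions("RT @Politie_Adam: demonstratie op de Dam, mail info@example.com"))
print(extract_mentions("geen mentions hier"))
```

The lookbehind is what keeps `info@example.com` from being read as a mention of `example`.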

In [14]:
# Get user mentions from tweet
data["user_mentions"] = data.apply(get_user_mentions, args=(["text"]), axis=1)

### What type of mentioned user?
Next, we need to determine the type of each mentioned Twitter user.

We do this with a labelled user dataset that was created in another Jupyter Notebook (Get user information from tweets) and then labelled manually.
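
The lookup itself can be sketched with a plain dictionary that maps each screen name to its type. This is an illustrative alternative, not the notebook's implementation, which searches the `labelled_users` DataFrame for every mention. The names `build_type_lookup` and `mentioned_types` and the sample rows are hypothetical, and lowercasing the labelled screen names is an extra assumption.

```python
def build_type_lookup(labelled_rows):
    """Map lowercased screen names to lowercased user types."""
    lookup = {}
    for screen_name, user_type in labelled_rows:
        # Skip users without a usable label (NaN or the placeholder 'no type')
        if isinstance(user_type, str) and user_type != "no type":
            # setdefault keeps the first label if a screen name occurs twice
            lookup.setdefault(screen_name.lower(), user_type.lower())
    return lookup

def mentioned_types(user_mentions, lookup):
    """Return the types of the mentioned users we have a label for, or None."""
    if not user_mentions:
        return None
    types = [lookup[name] for name in user_mentions if name in lookup]
    return types or None

# Hypothetical labelled users as (screen_name, type) pairs
labelled_rows = [
    ("politie", "Police"),
    ("nos", "Media"),
    ("someuser", "no type"),
    ("other", float("nan")),
]
lookup = build_type_lookup(labelled_rows)

print(mentioned_types(["politie", "someuser", "nos"], lookup))
print(mentioned_types(["unknown"], lookup))
print(mentioned_types(None, lookup))
```

A dictionary lookup avoids scanning the whole labelled dataset for every mention, which may matter when the tweet set is large.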

In [15]:
# Import dataset with labelled users
# This must be a dataframe with two columns (screen_name and type)
labelled_users = pd.read_pickle("~/Documents/Github Repository/early-warning-twitter/Processed datasets/Users/01-06-2020-amsterdam-demonstration-all-interesting-users-labelled.pkl")

data["user_mentions_types"] = data.apply(get_mentioned, args=(["user_mentions"]), axis=1)

### What type of user tweeted?

In [17]:
# Import dataset with labelled users
# This must be a dataframe with two columns (screen_name and type)
labelled_users = pd.read_pickle("~/Documents/Github Repository/early-warning-twitter/Processed datasets/Users/01-06-2020-amsterdam-demonstration-all-interesting-users-labelled.pkl")

data["type_user"] = data.apply(get_user_type, args=(["user_screen_name"]), axis=1)

## Preprocessing for Text Mining

Additionally, the text of the tweets needs to be preprocessed so that it can be used for text mining.
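
The sketch below is a simplified, standard-library-only version of this kind of cleaning, meant to show the effect of keeping or dropping the hashtag text. It is not the notebook's `clean_text`: it skips emoji removal and NLTK tokenization, and the stop word set is a tiny hypothetical stand-in for NLTK's Dutch list.

```python
import re

# Tiny illustrative stop word set; the notebook uses NLTK's Dutch list instead
STOP_WORDS = {"de", "het", "een", "en", "van", "op", "rt"}

def simple_clean(text, keep_hashtag_text=True):
    """Lowercase a tweet and strip URLs, mentions, punctuation and short words."""
    text = text.lower()
    text = re.sub(r"https?://\S+", "", text)   # remove URLs
    text = re.sub(r"@\w+", "", text)           # remove mentions
    if keep_hashtag_text:
        text = text.replace("#", "")           # keep the word, drop the '#'
    else:
        text = re.sub(r"#\w+", "", text)       # drop the '#' and the word
    text = re.sub(r"[^\w\s]", "", text)        # remove remaining punctuation
    # Keep words of at least three characters that are not stop words
    return [w for w in text.split() if len(w) > 2 and w not in STOP_WORDS]

# Hypothetical sample tweet
tweet = "RT @nos: Grote #demonstratie op de Dam! https://t.co/abc123"
print(simple_clean(tweet))
print(simple_clean(tweet, keep_hashtag_text=False))
```

With the hashtag text kept, the word "demonstratie" stays in the result; with it dropped, only the surrounding words remain.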

In [18]:
data["preprocessed_text"] = data.apply(clean_text, args=(["text", 'keep', 'string']), axis=1)
data["preprocessed_text_no_hashtag"] = data.apply(clean_text, args=(["text", 'lose', 'string']), axis=1)
data["preprocessed_text_tokenized"] = data.apply(clean_text, args=(["text", 'keep', 'list']), axis=1)
data["preprocessed_text_tokenized_no_hashtag"] = data.apply(clean_text, args=(["text", 'lose', 'list']), axis=1)

## Export datasets
Now, we will export the different datasets.
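
One practical difference between the two formats is how list-valued columns (such as the hashtags or the tokenized text) survive a round trip. The sketch below illustrates this with the standard library's `csv` and `pickle` modules instead of pandas; the sample row is hypothetical. The same effect applies to files written with `to_csv` and `to_pickle`.

```python
import csv
import io
import pickle

# Hypothetical row with a list-valued column
rows = [{"id": 1, "hashtags": ["demonstratie", "amsterdam"]}]

# CSV: every value is written as text, so the list comes back as a string
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["id", "hashtags"])
writer.writeheader()
writer.writerows(rows)
buffer.seek(0)
csv_rows = list(csv.DictReader(buffer))

# Pickle: Python objects are restored as they were, so the list stays a list
pickle_rows = pickle.loads(pickle.dumps(rows))

print(type(csv_rows[0]["hashtags"]).__name__)
print(type(pickle_rows[0]["hashtags"]).__name__)
```

When the CSV is read back, such columns need to be parsed again, whereas the pickle file can be loaded and used directly.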

In [19]:
# Reset the index
data.reset_index(inplace=True, drop=True)

# Export datasets in CSV and pickle file
data.to_csv("~/Documents/Github Repository/early-warning-twitter/Processed datasets/Tweets/01-06-2020-amsterdam-demonstration.csv")
data.to_pickle("~/Documents/Github Repository/early-warning-twitter/Processed datasets/Tweets/01-06-2020-amsterdam-demonstration.pkl")