# Create users dataset to label

The purpose of this Jupyter Notebook is to build a list of users that we can label manually.

In this dataset, we will focus on four specific types of users:
1. Users that tweeted at least 5 times
2. Users that were mentioned at least 5 times
3. Verified users that tweeted at least once
4. Verified users that were mentioned at least once

After we have extracted those users, we will automatically label part of the dataset.

Lastly, we will import the completely labelled dataset, compute some summary statistics and export it to a PKL file.

## Import all the needed data

In [4]:
import pandas as pd

# Import tweet data
data = pd.read_pickle("~/Documents/Github Repository/early-warning-twitter/Processed datasets/Tweets/01-06-2020-amsterdam-demonstration.pkl")

# Import user data from users that tweeted
tweeted_users = pd.read_pickle("~/Documents/Github Repository/early-warning-twitter/Processed datasets/Users/01-06-2020-amsterdam-demonstration-all-users-that-tweeted.pkl")

# Import user data from users that were mentioned in tweets
mentioned_users = pd.read_pickle("~/Documents/Github Repository/early-warning-twitter/Processed datasets/Users/01-06-2020-amsterdam-demonstration-all-users-that-mentioned.pkl")

## Get users that tweeted at least 5 times

In [5]:
# Function to count how many times a string value exists in a list
def count_values_list(df, variable):
    # Set counter (necessary for for-loop) and create empty dataframe
    i = 0
    values_df = pd.DataFrame(columns = ['value', 'count'])
    
    # Loop through each values element
    # (Series.iteritems() was removed in pandas 2.0; items() is the equivalent)
    for index, values in df[variable].items():
        
        # Check if values is a string (then convert to list)
        if isinstance(values, str):
            values = values.strip('][').split(', ') 
        
        # Check if there are any values (if list --> values available)
        if isinstance(values, list):
            
            # Loop through all values
            for value in values:
                
                value = value.lower()
                
                # If value is already in dataframe, add +1 to count
                if (values_df['value'] == value).any():
                    index_row = values_df[values_df['value']==value].index.values.astype(int)[0]
                    values_df.loc[index_row, 'count'] = values_df.loc[index_row, 'count'] + 1
                
                # Create new value in dataframe
                else:
                    values_df.loc[i,'value'] = value
                    values_df.loc[i,'count'] = 1
                    i = i+1
    return values_df
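
The loop above rescans the growing DataFrame for every value, which becomes slow on large tweet sets. As a minimal sketch of an alternative (not the method used in this notebook), the same counts can be obtained with `explode` and `value_counts`. The function name `count_values_vectorized` and the toy screen names are assumptions made for illustration.

```python
import pandas as pd

def count_values_vectorized(df, variable):
    """Count lowercased values in a column holding lists or list-like strings."""
    def to_list(values):
        # Strings such as "[a, b]" are split the same way as in the loop version
        if isinstance(values, str):
            return values.strip('][').split(', ')
        # Lists pass through; anything else (e.g. NaN) contributes no values
        return values if isinstance(values, list) else []

    # One row per individual value; empty lists become NaN and are dropped
    exploded = df[variable].map(to_list).explode().dropna()
    counts = exploded.str.lower().value_counts()
    return counts.rename_axis('value').reset_index(name='count')

# Toy data with hypothetical screen names
toy = pd.DataFrame({"user_mentions": [["Alice", "bob"], "[alice, Carol]", float("nan"), "alice"]})
toy_counts = count_values_vectorized(toy, "user_mentions")
print(toy_counts)
```

Unlike the loop version, the result is already sorted by descending count and the `count` column has an integer dtype.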

In [6]:
# Count how many times each user has tweeted
tweets_per_user = count_values_list(data, "user_screen_name")
tweets_per_user.sort_values('count', ascending=False, inplace=True)
tweets_per_user.rename(columns={"value":"screen_name"}, inplace=True)
tweets_per_user_5 = tweets_per_user[tweets_per_user["count"]>=5]

# Export the datasets
tweets_per_user_5.to_csv("~/Documents/Github Repository/early-warning-twitter/Processed datasets/Stats/01-06-2020-amsterdam-demonstration-tweets-per-user-5.csv")
tweets_per_user_5.to_pickle("~/Documents/Github Repository/early-warning-twitter/Processed datasets/Stats/01-06-2020-amsterdam-demonstration-tweets-per-user-5.pkl")  

In [35]:
# Get information of users that tweeted at least 5 times

# Get only the screen names of users that tweeted at least 5 times
tweets_per_user_5_screen_name = tweets_per_user_5[["screen_name"]]

# Merge tweets_per_user_5 with tweeted_users
tweets_per_user_5_user_info = pd.merge(left = tweets_per_user_5_screen_name, right=tweeted_users, left_on="screen_name", right_on="screen_name")
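
`pd.merge` defaults to an inner join, so any screen name without an exact match in the user table is dropped silently. Because `count_values_list` lowercases the names, this could matter if the user table stores screen names in their original case; whether it does is an assumption here, not something the notebook shows. The sketch below uses toy stand-ins (`active`, `profiles`) to show how `how="left"` with `indicator=True` exposes unmatched rows.

```python
import pandas as pd

# Toy stand-ins for tweets_per_user_5 and tweeted_users (hypothetical values)
active = pd.DataFrame({"screen_name": ["alice", "bob", "carol"]})
profiles = pd.DataFrame({
    "screen_name": ["Alice", "bob"],
    "description": ["Journalist", "Student"],
})

# how="left" keeps every active user; indicator=True adds a "_merge" column
checked = pd.merge(active, profiles, on="screen_name", how="left", indicator=True)

# Rows marked "left_only" have no matching profile and would vanish in an inner merge
unmatched = checked.loc[checked["_merge"] == "left_only", "screen_name"].tolist()
print(unmatched)
```

If the list of unmatched names is not empty, lowercasing the key column on both sides before merging is one possible remedy.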

## Get users that were mentioned at least 5 times

In [37]:
mentions_count = count_values_list(data, "user_mentions")
mentions_count.sort_values('count', ascending=False, inplace=True)
mentions_count.rename(columns={"value":"screen_name"}, inplace=True)
mentions_count_5 = mentions_count[mentions_count["count"]>=5]

# Export the datasets
mentions_count_5.to_csv("~/Documents/Github Repository/early-warning-twitter/Processed datasets/Stats/01-06-2020-amsterdam-demonstration-mentioned-users-5.csv")
mentions_count_5.to_pickle("~/Documents/Github Repository/early-warning-twitter/Processed datasets/Stats/01-06-2020-amsterdam-demonstration-mentioned-users-5.pkl")  

In [38]:
# Get information of users that were mentioned at least 5 times

# Get only the screen names of users that were mentioned at least 5 times
mentions_count_5_screen_name = mentions_count_5[["screen_name"]]

# Merge the screen names with the mentioned_users dataset
mentions_count_5_user_info = pd.merge(left = mentions_count_5_screen_name, right=mentioned_users, left_on="screen_name", right_on="screen_name")

## Get verified users that tweeted at least once

In [8]:
tweeted_verified_users = tweeted_users[tweeted_users["verified"]==True][["screen_name", "description"]]

In [9]:
len(tweeted_verified_users)

73

## Get verified users that were mentioned at least once

In [11]:
mentioned_verified_users = mentioned_users[mentioned_users["verified"]==True][["screen_name", "description"]]

In [12]:
len(mentioned_verified_users)

197

## Unite all four user groups in one dataset

In [46]:
# Concatenate all the users
all_interesting_users = pd.concat([tweets_per_user_5_user_info, mentions_count_5_user_info, tweeted_verified_users, mentioned_verified_users], axis=0)

# Drop duplicates based on the screen name
all_interesting_users.drop_duplicates(subset="screen_name", inplace=True)
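
As a small illustration of what this step does, the sketch below concatenates two toy frames with different columns. `pd.concat` fills columns that are missing from one frame with NaN, and `drop_duplicates` keeps the first occurrence of each screen name by default, so the order of the frames in the list decides which row survives. The frame names and values are hypothetical.

```python
import pandas as pd

# Toy frames (hypothetical values): the second one lacks "description_lower"
merged_info = pd.DataFrame({
    "screen_name": ["alice"],
    "description": ["Journalist"],
    "description_lower": ["journalist"],
})
verified_only = pd.DataFrame({
    "screen_name": ["alice", "dave"],
    "description": ["Journalist", "Kamerlid"],
})

combined = pd.concat([merged_info, verified_only], axis=0)

# keep="first" is the default: the row from merged_info wins for "alice"
combined = combined.drop_duplicates(subset="screen_name").reset_index(drop=True)
print(combined)
```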

In [None]:
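# Keep only the screen_name and description_lower columns
# Note: rows that came only from the verified-user subsets (which keep
# "description" but not "description_lower") have NaN in description_lower
all_interesting_users = all_interesting_users[["screen_name", "description_lower"]]
all_interesting_users.reset_index(inplace=True, drop=True)

# This results in a DataFrame with 1475 rows that need to be labelled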
all_interesting_users

## Automatically label part of the user dataset
In this section, we will automatically label part of the user dataset by using the description of the user.

In [63]:
# Add type column to DataFrame
all_interesting_users["type"] = ""

# Keywords that are matched against the lowercased description
parliament_keywords = ['kamerlid', 'lid tweede kamer', 'member of european parliament', 'member of the european parliament', 'europarlementariër', 'lid europees parlement']
media_keywords = ['journalist', 'nieuwschef', 'verslaggever', 'redacteur', 'columnist']

# Start at 0 so that the first row is checked as well
for i in range(len(all_interesting_users)):
    description = all_interesting_users.loc[i, 'description_lower']
    if isinstance(description, str):   # Skip rows without a description (NaN)
        if any(keyword in description for keyword in parliament_keywords):
            all_interesting_users.loc[i, "type"] = "Member of parliament"
        elif any(keyword in description for keyword in media_keywords):
            all_interesting_users.loc[i, "type"] = "Media people"
        elif 'politie' in all_interesting_users.loc[i, 'screen_name'].lower():
            all_interesting_users.loc[i, "type"] = "Police"
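
The same rule-based labelling can also be written without a row loop. The following is a minimal sketch on toy data, assuming `numpy.select` for the first-match-wins logic and a shortened keyword list; the helper `keyword_mask` and all screen names are hypothetical. As in the loop, the police rule is only applied to rows that have a description.

```python
import re
import numpy as np
import pandas as pd

# Shortened keyword lists, for illustration only
parliament_keywords = ["kamerlid", "lid tweede kamer", "europarlementariër"]
media_keywords = ["journalist", "verslaggever", "redacteur"]

def keyword_mask(series, keywords):
    """True where the text contains any keyword; NaN descriptions give False."""
    pattern = "|".join(re.escape(keyword) for keyword in keywords)
    return series.str.contains(pattern, na=False)

users = pd.DataFrame({
    "screen_name": ["Politie_Example", "some_mp", "reporter1", "nobody", "no_bio"],
    "description_lower": ["officieel account", "kamerlid voor een partij",
                          "verslaggever en columnist", "likes football", float("nan")],
})

has_description = users["description_lower"].notna()
conditions = [
    keyword_mask(users["description_lower"], parliament_keywords),
    keyword_mask(users["description_lower"], media_keywords),
    has_description & users["screen_name"].str.lower().str.contains("politie"),
]
labels = ["Member of parliament", "Media people", "Police"]

# np.select picks the label of the first condition that is True for each row
users["type"] = np.select(conditions, labels, default="")
print(users[["screen_name", "type"]])
```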

In [64]:
all_interesting_users["type"].value_counts()

                        1394
Media people              67
Member of parliament      11
Police                     3
Name: type, dtype: int64

In [65]:
# Export the dataset in CSV and PKL format
all_interesting_users.to_csv("~/Documents/Github Repository/early-warning-twitter/Processed datasets/Users/01-06-2020-amsterdam-demonstration-all-interesting-users-partly-labelled.csv")
all_interesting_users.to_pickle("~/Documents/Github Repository/early-warning-twitter/Processed datasets/Users/01-06-2020-amsterdam-demonstration-all-interesting-users-partly-labelled.pkl")

## Manually label rest of the dataset
The rest of the user dataset was labelled manually in Excel according to the labelling protocol described in the thesis.
Here we import the labelled dataset and export it to a PKL file.

In [9]:
import pandas as pd

# Import complete labelled dataset
labelled_users = pd.read_csv('~/Documents/Github Repository/early-warning-twitter/Processed datasets/Users/01-06-2020-amsterdam-demonstration-all-interesting-users-labelled.csv', sep=";")

In [11]:
# Convert 'member of parliament' to 'politician'
labelled_users['type'] = labelled_users['type'].replace('member of parliament', 'politician')

In [13]:
labelled_users["type"].value_counts()

no type                            1126
media people                        123
politician                           95
mass media                           52
political party                       9
political organization                8
government organization               8
police                                7
soccer club                           7
musician                              7
writer                                6
political activist                    6
comedian                              6
actress                               3
municipality                          3
mayor                                 2
part of government organization       2
social network                        2
virologist                            1
parlement                             1
province                              1
Name: type, dtype: int64

In [14]:
# Export the dataset
labelled_users.to_pickle('~/Documents/Github Repository/early-warning-twitter/Processed datasets/Users/01-06-2020-amsterdam-demonstration-all-interesting-users-labelled.pkl')