# Gathering Twitter Data - Tweets to raw text

*Goal: download tweets from relevant accounts for training and evaluating the model*

This notebook is structured into the following sections:
- 1.  collection of the command-line commands used to access the Twitter API through twarc2 (for selected politicians whose entire Twitter timeline was gathered)
- 2.  reloading the gathered tweets and cleaning them for further analysis
- 3.  loading the rehydrated politicians' tweets from van Vliet's dataset
- 4.  filtering for US, UK and Australian politicians
- 5.  exporting the final dataframes for inference by the climate classifier and the FossilBERT classifier

In [1]:
# import relevant libraries

import tika
import numpy as np
import pandas as pd

#twarc process
from twarc import Twarc2, expansions

import json
from datetime import datetime  # the datetime class, used below via datetime.strptime
import re

# general libraries for cleaning and dealing with .csv files
import os
import glob
import wget

from textblob import TextBlob
import nltk
nltk.download('punkt')
nltk.download('brown')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\lucas\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package brown to
[nltk_data]     C:\Users\lucas\AppData\Roaming\nltk_data...
[nltk_data]   Package brown is already up-to-date!


True

In [2]:
#research account
client = Twarc2(bearer_token="add your research account bearer token")

## Relevant Twarc2 commands
### to be used within the command prompt (in Visual Studio Code)

In [3]:
#twarc is a command line tool, below are the commands to access the data:

# ----------------------------------------------------------------------------------------------------

# twitter accounts for "downplaying role of fossil fuels in climate change" and command used:
### - EnergyInDepth
# twarc2 timeline --use-search EnergyInDepth EnergyInDepth_09012023_full_timeline.jsonl

### - energycitizens
# twarc2 timeline --use-search energycitizens energycitizens_09012023_full_timeline.jsonl

### - APIenergy
# twarc2 timeline --use-search APIenergy APIenergy_09012023_full_timeline.jsonl


# ----------------------------------------------------------------------------------------------------

# twitter accounts for "underscoring the role of fossil fuels in climate change" and command used:
### - IPCC_CH
# twarc2 timeline --use-search IPCC_CH IPCC_CH_09012023_full_timeline.jsonl

### - UNFCCC
# twarc2 timeline --use-search UNFCCC UNFCCC_09012023_full_timeline.jsonl

### - WBG_Climate
# twarc2 timeline --use-search WBG_Climate WBG_Climate_09012023_full_timeline.jsonl

### - greenpeaceusa
# twarc2 timeline --use-search greenpeaceusa greenpeaceusa_22022023_full_timeline.jsonl

# ----------------------------------------------------------------------------------------------------

# twitter accounts for "political analysis" and command used:
### - algore
# twarc2 timeline --use-search algore algore_09012023_full_timeline.jsonl

### - BarackObama
# twarc2 timeline --use-search BarackObama BarackObama_09012023_full_timeline.jsonl

### - realDonaldTrump (NOT YET WORKING; MAYBE REFER TO https://www.thetrumparchive.com/ to download the .csv manually)
# twarc2 timeline --use-search realDonaldTrump realDonaldTrump_09012023_full_timeline.jsonl

### - JunkScience (Steve Milloy - lawyer, lobbyist, author and Fox News commentator)
# twarc2 timeline --use-search JunkScience JunkScience_09012023_full_timeline.jsonl






# twarc2 timeline --use-search BjornLomborg bjornlomborg_08042023_full_timeline.jsonl

# twarc2 timeline --use-search MikeHudema mikehudema_08042023_full_timeline.jsonl    

# twarc2 timeline --use-search TonyClimate tonyheller_08042023_full_timeline.jsonl

# twarc2 timeline --use-search GeorgeMonbiot georgemonbiot_08042023_full_timeline.jsonl




# twarc2 timeline --use-search AOC AOC_09042023_full_timeline.jsonl

# twarc2 timeline --use-search tedcruz tedcruz_09042023_full_timeline.jsonl
 
# twarc2 timeline --use-search RepDonBeyer RepDonBeyer_09042023_full_timeline.jsonl 

# twarc2 timeline --use-search SenJoniErnst SenJoniErnst_09042023_full_timeline.jsonl 

# twarc2 timeline --use-search SenSanders SenSanders_09042023_full_timeline.jsonl

# twarc2 timeline --use-search RepGosar RepGosar_09042023_full_timeline.jsonl

# twarc2 timeline --use-search SenJohnHoeven SenJohnHoeven_09042023_full_timeline.jsonl

# twarc2 timeline --use-search MikeCrapo MikeCrapo_09042023_full_timeline.jsonl

# twarc2 timeline --use-search MartinHeinrich MartinHeinrich_09042023_full_timeline.jsonl

# twarc2 timeline --use-search RepScottPeters RepScottPeters_09042023_full_timeline.jsonl

# twarc2 timeline --use-search SenKevinCramer SenKevinCramer_09042023_full_timeline.jsonl

# twarc2 timeline --use-search WestermanAR WestermanAR_09042023_full_timeline.jsonl

# twarc2 timeline --use-search RepMikeQuigley RepMikeQuigley_09042023_full_timeline.jsonl

# twarc2 timeline --use-search RepLizCheney RepLizCheney_09042023_full_timeline.jsonl

# twarc2 timeline --use-search Sen_JoeManchin Sen_JoeManchin_09042023_full_timeline.jsonl

# twarc2 timeline --use-search SpeakerMcCarthy SpeakerMcCarthy_09042023_full_timeline.jsonl

# twarc2 timeline --use-search RepCuellar RepCuellar_09042023_full_timeline.jsonl

# twarc2 timeline --use-search RepPfluger RepPfluger_09042023_full_timeline.jsonl

# twarc2 timeline --use-search SenatorLankford SenatorLankford_09042023_full_timeline.jsonl

# twarc2 timeline --use-search WesleyHuntTX WesleyHuntTX_09042023_full_timeline.jsonl

# twarc2 timeline --use-search RepFletcher RepFletcher_09042023_full_timeline.jsonl

# twarc2 timeline --use-search RepDanCrenshaw RepDanCrenshaw_09042023_full_timeline.jsonl

# twarc2 timeline --use-search SteveScalise SteveScalise__09042023_full_timeline.jsonl

# twarc2 timeline --use-search RepPaulTonko RepPaulTonko__09042023_full_timeline.jsonl

# twarc2 timeline --use-search BetoORourke BetoORourke__09042023_full_timeline.jsonl




# twarc2 timeline --use-search davidcicilline davidcicilline_09042023_full_timeline.jsonl
# twarc2 csv davidcicilline_09042023_full_timeline.jsonl davidcicilline_09042023.csv

# twarc2 timeline --use-search Malinowski Malinowski_09042023_full_timeline.jsonl
# twarc2 csv Malinowski_09042023_full_timeline.jsonl Malinowski_09042023.csv

# twarc2 timeline --use-search RepAndyHarrisMD RepAndyHarrisMD_09042023_full_timeline.jsonl
# twarc2 csv RepAndyHarrisMD_09042023_full_timeline.jsonl RepAndyHarrisMD_09042023.csv

# twarc2 timeline --use-search EliseStefanik EliseStefanik_09042023_full_timeline.jsonl
# twarc2 csv EliseStefanik_09042023_full_timeline.jsonl EliseStefanik_09042023.csv

# twarc2 timeline --use-search RepJamesClyburn RepJamesClyburn_09042023_full_timeline.jsonl
# twarc2 csv RepJamesClyburn_09042023_full_timeline.jsonl RepJamesClyburn_09042023.csv

# twarc2 timeline --use-search RepMoolenaar RepMoolenaar_09042023_full_timeline.jsonl
# twarc2 csv RepMoolenaar_09042023_full_timeline.jsonl RepMoolenaar_09042023.csv

# twarc2 timeline --use-search RepSmucker RepSmucker_09042023_full_timeline.jsonl
# twarc2 csv RepSmucker_09042023_full_timeline.jsonl RepSmucker_09042023.csv

# twarc2 timeline --use-search RepRodBlum RepRodBlum_09042023_full_timeline.jsonl
# twarc2 csv RepRodBlum_09042023_full_timeline.jsonl RepRodBlum_09042023.csv

# twarc2 timeline --use-search StaceyPlaskett StaceyPlaskett_09042023_full_timeline.jsonl
# twarc2 csv StaceyPlaskett_09042023_full_timeline.jsonl StaceyPlaskett_09042023.csv

# twarc2 timeline --use-search RepHuizenga RepHuizenga_09042023_full_timeline.jsonl
# twarc2 csv RepHuizenga_09042023_full_timeline.jsonl RepHuizenga_09042023.csv

# twarc2 timeline --use-search RepAndyBarr RepAndyBarr_09042023_full_timeline.jsonl
# twarc2 csv RepAndyBarr_09042023_full_timeline.jsonl RepAndyBarr_09042023.csv

# twarc2 timeline --use-search RepCloudTX RepCloudTX_09042023_full_timeline.jsonl
# twarc2 csv RepCloudTX_09042023_full_timeline.jsonl RepCloudTX_09042023.csv

# twarc2 timeline --use-search RepBera RepBera_09042023_full_timeline.jsonl
# twarc2 csv RepBera_09042023_full_timeline.jsonl RepBera_09042023.csv

# twarc2 timeline --use-search RepBrendanBoyle RepBrendanBoyle_09042023_full_timeline.jsonl
# twarc2 csv RepBrendanBoyle_09042023_full_timeline.jsonl RepBrendanBoyle_09042023.csv

# twarc2 timeline --use-search RepDebDingell RepDebDingell_09042023_full_timeline.jsonl
# twarc2 csv RepDebDingell_09042023_full_timeline.jsonl RepDebDingell_09042023.csv

# twarc2 timeline --use-search RepWilson RepWilson_09042023_full_timeline.jsonl
# twarc2 csv RepWilson_09042023_full_timeline.jsonl RepWilson_09042023.csv

# twarc2 timeline --use-search RepJackBergman RepJackBergman_09042023_full_timeline.jsonl
# twarc2 csv RepJackBergman_09042023_full_timeline.jsonl RepJackBergman_09042023.csv

# twarc2 timeline --use-search RepSarbanes RepSarbanes_09042023_full_timeline.jsonl
# twarc2 csv RepSarbanes_09042023_full_timeline.jsonl RepSarbanes_09042023.csv

# twarc2 timeline --use-search RepJudyChu RepJudyChu_09042023_full_timeline.jsonl
# twarc2 csv RepJudyChu_09042023_full_timeline.jsonl RepJudyChu_09042023.csv

# twarc2 timeline --use-search SenatorSinema SenatorSinema_09042023_full_timeline.jsonl
# twarc2 csv SenatorSinema_09042023_full_timeline.jsonl SenatorSinema_09042023.csv

# twarc2 timeline --use-search SecFudge SecFudge_09042023_full_timeline.jsonl
# twarc2 csv SecFudge_09042023_full_timeline.jsonl SecFudge_09042023.csv

# twarc2 timeline --use-search RepCartwright RepCartwright_09042023_full_timeline.jsonl
# twarc2 csv RepCartwright_09042023_full_timeline.jsonl RepCartwright_09042023.csv

# twarc2 timeline --use-search PeteSessions PeteSessions_09042023_full_timeline.jsonl
# twarc2 csv PeteSessions_09042023_full_timeline.jsonl PeteSessions_09042023.csv

# twarc2 timeline --use-search RepRickAllen RepRickAllen_09042023_full_timeline.jsonl
# twarc2 csv RepRickAllen_09042023_full_timeline.jsonl RepRickAllen_09042023.csv

# twarc2 timeline --use-search RogerMarshallMD RogerMarshallMD_09042023_full_timeline.jsonl
# twarc2 csv RogerMarshallMD_09042023_full_timeline.jsonl RogerMarshallMD_09042023.csv

# twarc2 timeline --use-search sethmoulton sethmoulton_09042023_full_timeline.jsonl
# twarc2 csv sethmoulton_09042023_full_timeline.jsonl sethmoulton_09042023.csv

# twarc2 timeline --use-search RepTedDeutch RepTedDeutch_09042023_full_timeline.jsonl
# twarc2 csv RepTedDeutch_09042023_full_timeline.jsonl RepTedDeutch_09042023.csv

# twarc2 timeline --use-search RepTrentKelly RepTrentKelly_09042023_full_timeline.jsonl
# twarc2 csv RepTrentKelly_09042023_full_timeline.jsonl RepTrentKelly_09042023.csv

# twarc2 timeline --use-search USRepKeating USRepKeating_09042023_full_timeline.jsonl
# twarc2 csv USRepKeating_09042023_full_timeline.jsonl USRepKeating_09042023.csv

# twarc2 timeline --use-search RepRobinKelly RepRobinKelly_09042023_full_timeline.jsonl
# twarc2 csv RepRobinKelly_09042023_full_timeline.jsonl RepRobinKelly_09042023.csv

# twarc2 timeline --use-search CongressmanHice CongressmanHice_09042023_full_timeline.jsonl
# twarc2 csv CongressmanHice_09042023_full_timeline.jsonl CongressmanHice_09042023.csv

# twarc2 timeline --use-search RepFilemonVela RepFilemonVela_09042023_full_timeline.jsonl
# twarc2 csv RepFilemonVela_09042023_full_timeline.jsonl RepFilemonVela_09042023.csv

# twarc2 timeline --use-search USRepMikeDoyle USRepMikeDoyle_09042023_full_timeline.jsonl
# twarc2 csv USRepMikeDoyle_09042023_full_timeline.jsonl USRepMikeDoyle_09042023.csv

# twarc2 timeline --use-search WhipKClark WhipKClark_09042023_full_timeline.jsonl
# twarc2 csv WhipKClark_09042023_full_timeline.jsonl WhipKClark_09042023.csv

# twarc2 timeline --use-search JeffFortenberry JeffFortenberry_09042023_full_timeline.jsonl
# twarc2 csv JeffFortenberry_09042023_full_timeline.jsonl JeffFortenberry_09042023.csv

# twarc2 timeline --use-search RepSpeier RepSpeier_09042023_full_timeline.jsonl
# twarc2 csv RepSpeier_09042023_full_timeline.jsonl RepSpeier_09042023.csv


# Jim Banks: Republican, 1979
# twarc2 timeline --use-search RepJimBanks RepJimBanks_09042023_full_timeline.jsonl
# twarc2 csv RepJimBanks_09042023_full_timeline.jsonl RepJimBanks_09042023.csv


# Rashida Tlaib: Democrat, 1976
# twarc2 timeline --use-search RepRashida RepRashida_09042023_full_timeline.jsonl
# twarc2 csv RepRashida_09042023_full_timeline.jsonl RepRashida_09042023.csv


# alex mooney: Republican, 1971
# twarc2 timeline --use-search RepAlexMooney RepAlexMooney_09042023_full_timeline.jsonl
# twarc2 csv RepAlexMooney_09042023_full_timeline.jsonl RepAlexMooney_09042023.csv


# cheri bustos: Democrat, 1961
# twarc2 timeline --use-search RepCheri RepCheri_09042023_full_timeline.jsonl
# twarc2 csv RepCheri_09042023_full_timeline.jsonl RepCheri_09042023.csv


# hakeem jeffries: Democrat, 1970
# twarc2 timeline --use-search RepJeffries RepJeffries_09042023_full_timeline.jsonl
# twarc2 csv RepJeffries_09042023_full_timeline.jsonl RepJeffries_09042023.csv

# RepAnnieKuster: Democrat, 1956
# twarc2 timeline --use-search RepAnnieKuster RepAnnieKuster_09042023_full_timeline.jsonl
# twarc2 csv RepAnnieKuster_09042023_full_timeline.jsonl RepAnnieKuster_09042023.csv

# RepBillJohnson: Republican, 1954
# twarc2 timeline --use-search RepBillJohnson RepBillJohnson_09042023_full_timeline.jsonl
# twarc2 csv RepBillJohnson_09042023_full_timeline.jsonl RepBillJohnson_09042023.csv

# RepBrianFitz: Republican, 1973
# twarc2 timeline --use-search RepBrianFitz RepBrianFitz_09042023_full_timeline.jsonl
# twarc2 csv RepBrianFitz_09042023_full_timeline.jsonl RepBrianFitz_09042023.csv

# chelliepingree: Democrat, 1955
# twarc2 timeline --use-search chelliepingree chelliepingree_09042023_full_timeline.jsonl
# twarc2 csv chelliepingree_09042023_full_timeline.jsonl chelliepingree_09042023.csv

# LtGovDennyHeck: Democrat, 1952
# twarc2 timeline --use-search LtGovDennyHeck LtGovDennyHeck_09042023_full_timeline.jsonl
# twarc2 csv LtGovDennyHeck_09042023_full_timeline.jsonl LtGovDennyHeck_09042023.csv




# ----------------------------------------------------------------------------------------------------

# conversion of .jsonl format to .csv

## twarc2 csv tweets.jsonl tweets.csv 

# requires: pip3 install --upgrade twarc-csv


# "downplaying" ---------------------

### - EnergyInDepth
# twarc2 csv EnergyInDepth_09012023_full_timeline.jsonl EnergyInDepth_09012023.csv

### - energycitizens
# twarc2 csv energycitizens_09012023_full_timeline.jsonl energycitizens_09012023.csv

### - APIenergy
# twarc2 csv APIenergy_09012023_full_timeline.jsonl APIenergy_09012023.csv


# "underscoring" ---------------------

### - IPCC_CH
# twarc2 csv IPCC_CH_09012023_full_timeline.jsonl IPCC_CH_09012023.csv

### - UNFCCC
# twarc2 csv UNFCCC_09012023_full_timeline.jsonl UNFCCC_09012023.csv

### - WBG_Climate
# twarc2 csv WBG_Climate_09012023_full_timeline.jsonl WBG_Climate_09012023.csv

### - greenpeaceusa
# twarc2 csv greenpeaceusa_22022023_full_timeline.jsonl greenpeaceusa_22022023.csv



# "political analysis" ---------------------

### - algore
# twarc2 csv algore_09012023_full_timeline.jsonl algore_09012023.csv

### - BarackObama
# twarc2 csv BarackObama_09012023_full_timeline.jsonl BarackObama_09012023.csv

### - realDonaldTrump
# twarc2 csv realDonaldTrump_09012023_full_timeline.jsonl realDonaldTrump_09012023.csv

### - JunkScience
# twarc2 csv JunkScience_09012023_full_timeline.jsonl JunkScience_09012023.csv


## look at this: https://figshare.com/articles/dataset/The_Twitter_Parliamentarian_Database/10120685
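
The timeline and conversion commands above all follow one naming pattern (`<handle>_<DDMMYYYY>_full_timeline.jsonl` and `<handle>_<DDMMYYYY>.csv`). As a rough sketch, the command strings could be generated instead of typed by hand. The helper name `twarc_commands` and the two-handle list are illustrative assumptions, and the snippet only prints the commands; it does not run them.

```python
from datetime import date


def twarc_commands(handle, day):
    """Return the timeline download and CSV conversion commands for one account."""
    stamp = day.strftime("%d%m%Y")  # e.g. 09042023, matching the file names above
    jsonl_name = f"{handle}_{stamp}_full_timeline.jsonl"
    csv_name = f"{handle}_{stamp}.csv"
    return [
        f"twarc2 timeline --use-search {handle} {jsonl_name}",
        f"twarc2 csv {jsonl_name} {csv_name}",
    ]


# illustrative subset of the accounts listed above
handles = ["AOC", "tedcruz"]
for handle in handles:
    for command in twarc_commands(handle, date(2023, 4, 9)):
        print(command)
```

The printed lines can be pasted into the command prompt one by one, which keeps the file names consistent across accounts.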

## Cleaning of the Twitter Dataset

### Functions

In [6]:
# function to clean the tweet strings
def tweet_cleaner(tweet_input):
    # tweet_input = tweet_input.lower() # lowercase everything
    tweet_input = tweet_input.encode('ascii', 'ignore').decode()  # remove unicode characters
    tweet_input = re.sub(r'https?\S+', ' ', tweet_input) # remove links (http or https)
    
    # cleaning up text
    tweet_input = re.sub(r'(\\n\\n)',' ',tweet_input) # catches double newlines
    tweet_input = re.sub(r'\\n',' ',tweet_input)
    tweet_input = re.sub(r':\shttps?\S+','.',tweet_input) # remove links that are quoted at the end of tweet
    tweet_input = re.sub(r'\s?https\S+','',tweet_input)
    tweet_input = re.sub(r'\s?http\S+','',tweet_input)
    tweet_input = re.sub(r'(Via)\s@\S+','', tweet_input)
    tweet_input = re.sub(r'\s(via)\s@\S+','', tweet_input)
    tweet_input = re.sub(r'@','',tweet_input)
    tweet_input = re.sub(r'&amp;','and', tweet_input)
    tweet_input = re.sub(r'#', '',tweet_input)
    tweet_input = re.sub(r'w/', 'with ', tweet_input)
    tweet_input = re.sub(r'\xa0',' ',tweet_input)
    #tweet_input = re.sub(r'\'','999', tweet_input) # maybe delete
    tweet_input = re.sub(r':[!\s]',': ',tweet_input) # maybe delete

    # shortened words in "twitter language"
    tweet_input = re.sub(r'DidYouKnow', 'Did you know',tweet_input)
    tweet_input = re.sub(r'DYK', 'Did you know',tweet_input)
    tweet_input = re.sub(r'ICYMI', 'In case you missed it',tweet_input)
    tweet_input = re.sub(r'FYI', 'For your information',tweet_input)

    # heavier cleaning
    tweet_input = re.sub(r'oilandgas', 'oil and gas', tweet_input)
    tweet_input = re.sub(r'natgas', 'natural gas', tweet_input)

    
    #text = re.sub(r'\w*\d+\w*', '', text)
    tweet_input = re.sub(r'\s{2,}', ' ', tweet_input)
    #text = re.sub(r'\'\w+', '', text) 
    #text = re.sub(r'\s[^\w\s]\s', '', text)

    return tweet_input


# function to remove specific hashtags within the tweet
def hashtag_cleaner(tweet_input, hashtag):
    tweet_input = re.sub(hashtag, '', tweet_input)
    return tweet_input



# function to transform the date format of "created_at" into datetime object
def tweet_date_to_datetime(tweet_date_input):
    datetime_object = datetime.strptime(tweet_date_input, '%Y-%m-%dT%H:%M:%S.000Z')
    return datetime_object



# special case for "The Parliamentarian Twitter Dataset": function to transform the date format of "created_at" into datetime object
# format: "2021-01-03 22:19:37"
def tweet_date_to_datetime_alternative(tweet_date_input):
    datetime_object = datetime.strptime(tweet_date_input, '%Y-%m-%d %H:%M:%S')
    return datetime_object


# another special case for "The Parliamentarian Twitter Dataset": function to transform the date format of "created_at" into datetime object
# here the days are written as abbreviations of the word (e.g. Tuesday = Tue)
# format to parse: "Wed Nov 08 15:48:37 +0000 2017"
def tweet_date_to_datetime_words(tweet_date_input):
    datetime_object = datetime.strptime(tweet_date_input, '%a %b %d %H:%M:%S +0000 %Y')
    return datetime_object



# function to count the number of words in the input cleaned text

def word_count(sentence):
    n_words = len(sentence.split())
    return n_words
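
The three date helpers above each handle exactly one timestamp layout. A minimal sketch of a single parser that tries the same three format strings in turn is shown here; the names `KNOWN_FORMATS` and `parse_tweet_date` are hypothetical and not part of the original notebook.

```python
from datetime import datetime

# The three timestamp layouts handled by the functions above.
KNOWN_FORMATS = (
    "%Y-%m-%dT%H:%M:%S.000Z",      # twarc2 csv export
    "%Y-%m-%d %H:%M:%S",           # e.g. "2021-01-03 22:19:37"
    "%a %b %d %H:%M:%S +0000 %Y",  # e.g. "Wed Nov 08 15:48:37 +0000 2017"
)


def parse_tweet_date(raw):
    """Try each known layout and return the first successful parse."""
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(raw, fmt)
        except ValueError:
            continue  # layout did not match, try the next one
    raise ValueError(f"unrecognised timestamp: {raw!r}")


print(parse_tweet_date("2021-01-03 22:19:37"))
print(parse_tweet_date("Wed Nov 08 15:48:37 +0000 2017"))
```

Raising on an unknown layout, rather than returning `None`, makes a changed export format visible immediately instead of producing silent gaps in the `date` column.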



In [7]:
# opening the .csv

# define directory and extract the .csv files
dir_path_downplaying = "tweets/downplaying/"
dir_path_underscoring = "tweets/underscoring/"
dir_path_politicians = "tweets/politicians/"
dir_path_greenpeace_API = "tweets/greenpeace_API/"
dir_path_selected_polit = "tweets/selected_polit/"


filenames_only_csv_downplaying = glob.glob(dir_path_downplaying+"*.csv")
filenames_only_csv_underscoring = glob.glob(dir_path_underscoring+"*.csv")
filenames_only_csv_politicians = glob.glob(dir_path_politicians+"*.csv")
filenames_only_csv_greenpeace_API = glob.glob(dir_path_greenpeace_API+"*.csv")
filenames_only_csv_selected_polit = glob.glob(dir_path_selected_polit+"*.csv")
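
On Windows, joining a directory string ending in `/` with a glob pattern yields paths with mixed separators, and the order in which `glob` returns files is not guaranteed. A possible alternative, assuming `pathlib` is acceptable here, is sketched below; `list_csv_files` is a hypothetical helper and the file names are made up for the demonstration.

```python
import tempfile
from pathlib import Path


def list_csv_files(directory):
    """Return the .csv files in `directory`, sorted, as forward-slash path strings."""
    return sorted(path.as_posix() for path in Path(directory).glob("*.csv"))


# Demonstration in a throwaway directory with made-up file names.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("b_09042023.csv", "a_09042023.csv", "notes.txt"):
        (Path(tmp) / name).touch()
    found = list_csv_files(tmp)

print([Path(p).name for p in found])
```

Sorting makes the order of rows in the combined dataframes reproducible between runs and machines.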

In [8]:
filenames_only_csv_selected_polit

['tweets/selected_polit\\algore_09012023.csv',
 'tweets/selected_polit\\AOC_09042023.csv',
 'tweets/selected_polit\\BarackObama_09012023.csv',
 'tweets/selected_polit\\BetoORourke__09042023.csv',
 'tweets/selected_polit\\chelliepingree_09042023.csv',
 'tweets/selected_polit\\CongressmanHice_09042023.csv',
 'tweets/selected_polit\\davidcicilline_09042023.csv',
 'tweets/selected_polit\\EliseStefanik_09042023.csv',
 'tweets/selected_polit\\JeffFortenberry_09042023.csv',
 'tweets/selected_polit\\LtGovDennyHeck_09042023.csv',
 'tweets/selected_polit\\Malinowski_09042023.csv',
 'tweets/selected_polit\\MartinHeinrich_09042023.csv',
 'tweets/selected_polit\\MikeCrapo_09042023.csv',
 'tweets/selected_polit\\PeteSessions_09042023.csv',
 'tweets/selected_polit\\RepAlexMooney_09042023.csv',
 'tweets/selected_polit\\RepAndyBarr_09042023.csv',
 'tweets/selected_polit\\RepAndyHarrisMD_09042023.csv',
 'tweets/selected_polit\\RepAnnieKuster_09042023.csv',
 'tweets/selected_polit\\RepBera_09042023.csv',

In [9]:

# tweet_database = pd.read_csv(filenames_only_csv[0], encoding='UTF-8', low_memory=False)
# 
# 
# 
# 
# column_selection_tweets = [ 'id',
#                             'text',
#                             'author_id',
#                             #'author_username',
#                             'author.name',
#                             'created_at',
#                             'entities.hashtags',
#                             'entities.mentions',
#                             'entities.urls',
#                             'lang',
#                             'in_reply_to_user_id',
#                             '__twarc.retrieved_at']
# 
# tweet_database_col_selec = tweet_database[column_selection_tweets]


## combining multiple tweets to databases

In [10]:
# load the current .csv file as a dataframe

# twarc data dictionary: 
# https://developer.twitter.com/en/docs/twitter-api/data-dictionary/introduction
# https://developer.twitter.com/en/docs/twitter-api/data-dictionary/object-model/tweet

### what columns to keep:

# id:                   unique identifier of the requested tweet
# text:                 actual UTF-8 text of the Tweet
# author_id:            unique identifier of the User who posted tweet
# author.name:          Name of the User
# created_at:           creation time (for timeseries analysis)
# entities.hashtags:    could be helpful for cleaning the tweets
# entities.mentions:    could be helpful for cleaning the tweets
# entities.urls:        could be helpful for cleaning the tweets
# lang:                 filter to english text
# in_reply_to_user_id:  use this to detect threads
# __twarc.retrieved_at: date when the tweet was scraped through twarc 


# filter from the 83 columns
column_selection_tweets = [ 'id',
                            'text',
                            'author_id',
                            #'author_username',
                            'author.name',
                            'created_at',
                            'entities.hashtags',
                            'entities.mentions',
                            'entities.urls',
                            'lang',
                            'in_reply_to_user_id',
                            '__twarc.retrieved_at',
                            'author.public_metrics.followers_count',
                            ]

# rearranged for clarity
column_rearranged =       [ 'id',
                            'text',
                            'cleaned_text',
                            'climate_related',
                            'downplaying',
                            'type',
                            'author_id',
                            'author.name',
                            'created_at',
                            'date',
                            'year',
                            'entities.hashtags',
                            'entities.mentions',
                            'entities.urls',
                            'lang',
                            'in_reply_to_user_id',
                            '__twarc.retrieved_at',
                            'author.public_metrics.followers_count',
                            ]




# pre-allocation of combined dataframes
combined_df_downplaying = pd.DataFrame(columns=column_selection_tweets)
combined_df_underscoring = pd.DataFrame(columns=column_selection_tweets)
combined_df_politicians = pd.DataFrame(columns=column_selection_tweets)
combined_df_greenpeace_API = pd.DataFrame(columns=column_selection_tweets)
combined_df_selected_polit = pd.DataFrame(columns=column_selection_tweets)


# combining the individual tweets

# downplaying
for name in range(len(filenames_only_csv_downplaying)):
    current_tweet_database = pd.read_csv(filenames_only_csv_downplaying[name], encoding='UTF-8', low_memory=False)
    current_tweet_database_col_selec = current_tweet_database[column_selection_tweets]

    combined_df_downplaying = pd.concat([combined_df_downplaying,
                            current_tweet_database_col_selec])

    combined_df_downplaying.reset_index(drop=True, inplace=True)

# underscoring
for name in range(len(filenames_only_csv_underscoring)):
    current_tweet_database = pd.read_csv(filenames_only_csv_underscoring[name], encoding='UTF-8', low_memory=False)
    current_tweet_database_col_selec = current_tweet_database[column_selection_tweets]

    combined_df_underscoring = pd.concat([combined_df_underscoring,
                            current_tweet_database_col_selec])

    combined_df_underscoring.reset_index(drop=True, inplace=True)

# politicians
for name in range(len(filenames_only_csv_politicians)):
    current_tweet_database = pd.read_csv(filenames_only_csv_politicians[name], encoding='UTF-8', low_memory=False)
    current_tweet_database_col_selec = current_tweet_database[column_selection_tweets]

    combined_df_politicians = pd.concat([combined_df_politicians,
                            current_tweet_database_col_selec])

    combined_df_politicians.reset_index(drop=True, inplace=True)

# Greenpeace & API
for name in range(len(filenames_only_csv_greenpeace_API)):
    current_tweet_database = pd.read_csv(filenames_only_csv_greenpeace_API[name], encoding='UTF-8', low_memory=False)
    current_tweet_database_col_selec = current_tweet_database[column_selection_tweets]

    combined_df_greenpeace_API = pd.concat([combined_df_greenpeace_API,
                            current_tweet_database_col_selec])

    combined_df_greenpeace_API.reset_index(drop=True, inplace=True)


# selected politicians
for name in range(len(filenames_only_csv_selected_polit)):
    current_tweet_database = pd.read_csv(filenames_only_csv_selected_polit[name], encoding='UTF-8', low_memory=False)
    current_tweet_database_col_selec = current_tweet_database[column_selection_tweets]

    combined_df_selected_polit = pd.concat([combined_df_selected_polit,
                            current_tweet_database_col_selec])

    combined_df_selected_polit.reset_index(drop=True, inplace=True)






combined_df_downplaying['cleaned_text'] = combined_df_downplaying['text'].apply(tweet_cleaner) # cleaning the tweet
combined_df_downplaying['date'] = combined_df_downplaying['created_at'].apply(tweet_date_to_datetime) # parsing the Twitter timestamp string into a datetime object
combined_df_downplaying['year'] = combined_df_downplaying['date'].dt.year # extracting the year for timeseries analysis
combined_df_downplaying['climate_related'] = 1 # pre-labelled climate related (will be checked with climateBERT)
combined_df_downplaying['downplaying'] = 1 # pre-labelled downplaying based on source of tweets
combined_df_downplaying['type'] = 'tweet' # type of the text
combined_df_downplaying = combined_df_downplaying.loc[combined_df_downplaying['lang'] == 'en'] # filtering for English tweets

 
combined_df_underscoring['cleaned_text'] = combined_df_underscoring['text'].apply(tweet_cleaner)
combined_df_underscoring['date'] = combined_df_underscoring['created_at'].apply(tweet_date_to_datetime)
combined_df_underscoring['year'] = combined_df_underscoring['date'].dt.year
combined_df_underscoring['climate_related'] = 1
combined_df_underscoring['downplaying'] = 0
combined_df_underscoring['type'] = 'tweet'
combined_df_underscoring = combined_df_underscoring.loc[combined_df_underscoring['lang'] == 'en']


 
combined_df_politicians['cleaned_text'] = combined_df_politicians['text'].apply(tweet_cleaner)
combined_df_politicians['date'] = combined_df_politicians['created_at'].apply(tweet_date_to_datetime)
combined_df_politicians['year'] = combined_df_politicians['date'].dt.year
combined_df_politicians['climate_related'] = 1
combined_df_politicians['downplaying'] = ''
combined_df_politicians['type'] = 'tweet'
combined_df_politicians = combined_df_politicians.loc[combined_df_politicians['lang'] == 'en']



combined_df_greenpeace_API['cleaned_text'] = combined_df_greenpeace_API['text'].apply(tweet_cleaner)
combined_df_greenpeace_API['date'] = combined_df_greenpeace_API['created_at'].apply(tweet_date_to_datetime)
combined_df_greenpeace_API['year'] = combined_df_greenpeace_API['date'].dt.year
combined_df_greenpeace_API['climate_related'] = 1
combined_df_greenpeace_API['downplaying'] = ''
combined_df_greenpeace_API['type'] = 'tweet'
combined_df_greenpeace_API = combined_df_greenpeace_API.loc[combined_df_greenpeace_API['lang'] == 'en']



combined_df_selected_polit['cleaned_text'] = combined_df_selected_polit['text'].apply(tweet_cleaner)
combined_df_selected_polit['date'] = combined_df_selected_polit['created_at'].apply(tweet_date_to_datetime)
combined_df_selected_polit['year'] = combined_df_selected_polit['date'].dt.year
combined_df_selected_polit['climate_related'] = 1
combined_df_selected_polit['downplaying'] = ''
combined_df_selected_polit['type'] = 'tweet'
combined_df_selected_polit = combined_df_selected_polit.loc[combined_df_selected_polit['lang'] == 'en']






combined_df_downplaying = combined_df_downplaying[column_rearranged]
combined_df_underscoring = combined_df_underscoring[column_rearranged]
combined_df_politicians = combined_df_politicians[column_rearranged]
combined_df_greenpeace_API = combined_df_greenpeace_API[column_rearranged]
combined_df_selected_polit = combined_df_selected_polit[column_rearranged]
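
The five loading loops in the cell above share the same body: read a file, select the columns, and append to a running dataframe. A sketch of how this could be factored into one function is given below. The name `combine_csv_files` and the demo files are assumptions for illustration; the sketch collects all frames first and concatenates once, which is a different (but equivalent in result) approach to concatenating inside the loop.

```python
import tempfile
from pathlib import Path

import pandas as pd


def combine_csv_files(filenames, columns):
    """Read each CSV, keep only `columns`, and stack the results into one dataframe."""
    frames = [
        pd.read_csv(name, encoding="UTF-8", low_memory=False)[columns]
        for name in filenames
    ]
    if not frames:
        # no input files: return an empty frame with the expected columns
        return pd.DataFrame(columns=columns)
    return pd.concat(frames, ignore_index=True)


# Tiny demonstration with two made-up files.
with tempfile.TemporaryDirectory() as tmp:
    first = Path(tmp) / "first.csv"
    second = Path(tmp) / "second.csv"
    pd.DataFrame({"id": [1, 2], "text": ["a", "b"], "extra": [0, 0]}).to_csv(first, index=False)
    pd.DataFrame({"id": [3], "text": ["c"], "extra": [0]}).to_csv(second, index=False)
    demo = combine_csv_files([first, second], ["id", "text"])

print(demo)
```

With such a helper, each group would reduce to one call, for example `combine_csv_files(filenames_only_csv_downplaying, column_selection_tweets)`.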

Special case for Donald Trump: the tweets are loaded from a manually downloaded .csv (source: https://www.thetrumparchive.com/faq)

In [11]:
tweets_donald_trump = pd.read_csv("tweets/politicians/trump/trump_tweets_01-08-2021.csv",usecols=['id','text','isRetweet','date'])

In [12]:
tweets_donald_trump['cleaned_text'] = tweets_donald_trump['text'].apply(tweet_cleaner)
tweets_donald_trump['date'] = tweets_donald_trump['date'].apply(tweet_date_to_datetime_alternative)
tweets_donald_trump['year'] = tweets_donald_trump['date'].dt.year
tweets_donald_trump['climate_related'] = 1
tweets_donald_trump['downplaying'] = ''
tweets_donald_trump['type'] = 'tweet'
tweets_donald_trump['author.name'] = 'Donald Trump'

In [13]:
tweets_donald_trump['word_count'] = tweets_donald_trump['cleaned_text'].apply(word_count).astype('int')
tweets_donald_trump_filtered = tweets_donald_trump[(tweets_donald_trump['word_count'] >= 6) & (tweets_donald_trump['isRetweet'] == 'f')]

In [14]:
combined_df_politicians_with_trump = pd.concat([combined_df_politicians,tweets_donald_trump_filtered])

In [15]:
combined_df_selected_politicians = pd.concat([combined_df_selected_polit,tweets_donald_trump_filtered]).reset_index()

In [16]:
combined_df_greenpeace_API.reset_index(drop = True, inplace = True)
combined_df_greenpeace_API['downplaying'] = [1 if name == 'American Petroleum Institute' else 0 for name in combined_df_greenpeace_API['author.name']]

In [18]:
# saving the combined files as csv

combined_df_downplaying.to_csv('downplaying_tweets_out.csv')
combined_df_underscoring.to_csv('underscoring_tweets_out.csv')
combined_df_politicians.to_csv('politicians_tweets_out.csv')
combined_df_greenpeace_API.to_csv('greenpeace_API_tweets_out.csv')

In [49]:
combined_df_selected_politicians.to_parquet('selected_politicians_tweets.parquet')

In [None]:
combined_df_politicians_with_trump.to_csv('politicians_tweets_with_trump_out.csv')

In [25]:
selected_climate_influencers = combined_df_politicians[combined_df_politicians['author.name'].isin(['George Monbiot', 'Steve Milloy', 'Tony Heller', 'Bjorn Lomborg', 'Mike Hudema'])].reset_index()

In [26]:
selected_climate_influencers.to_csv('climate_influencers_tweets_out.csv')

## Preparing the Politician Dataset 
(rehydrated from https://figshare.com/articles/dataset/The_Twitter_Parliamentarian_Database/10120685?file=18238628)

member info

In [9]:
member_info = pd.read_csv("tweets/politicians/rehydrated/2020_member_info.csv", encoding = 'utf16', engine = 'python')

# member info includes additional interesting information for analysis later on (e.g. function)
member_info_col_selected = member_info[['country',
                                        #'region',
                                        'name',
                                        'party', 
                                        'uid',
                                        ]].copy()

# only look at US politicians
member_info_only_us = member_info_col_selected[member_info_col_selected['country']=='United States']

# filter duplicates
member_info_only_us_no_duplicates = member_info_only_us.drop_duplicates(subset=['uid'])

In [10]:
# only look at UK politicians
member_info_only_uk = member_info_col_selected[member_info_col_selected['country']=='United Kingdom']

# filter duplicates
member_info_only_uk_no_duplicates = member_info_only_uk.drop_duplicates(subset=['uid'])

In [11]:
# only look at Australian politicians
member_info_only_australia = member_info_col_selected[member_info_col_selected['country']=='Australia']

# filter duplicates
member_info_only_australia_no_duplicates = member_info_only_australia.drop_duplicates(subset=['uid'])
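
The three country filters above follow the same pattern, so they could be wrapped in one small helper. The function name `members_for_country` and the toy member table are assumptions made for this sketch and are not part of the original notebook.

```python
import pandas as pd


def members_for_country(member_info: pd.DataFrame, country: str) -> pd.DataFrame:
    """Select one country's members and keep one row per uid (hypothetical helper)."""
    columns = ["country", "name", "party", "uid"]
    selected = member_info.loc[member_info["country"] == country, columns]
    return selected.drop_duplicates(subset=["uid"])


# toy stand-in for 2020_member_info.csv
toy_member_info = pd.DataFrame({
    "country": ["United States", "United States", "United Kingdom", "Australia"],
    "name": ["member a", "member a", "member b", "member c"],
    "party": ["party 1", "party 1", "party 2", "party 3"],
    "uid": [1, 1, 2, 3],
})

members_by_country = {
    country: members_for_country(toy_member_info, country)
    for country in ["United States", "United Kingdom", "Australia"]
}
```

The duplicated US row (same `uid`) is dropped, so every country ends up with one member in the toy data.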

In [10]:
#member_info_only_us_no_duplicates[['name','region']].to_csv("region_mapping.csv", sep = ";")

2021

US

In [12]:
# filtering for tweet ids from the United States to avoid exceeding the Twitter API request limit

data_politicians_2021 = pd.read_csv('tweets/politicians/rehydrated/2021.csv')

data_politicians_2021.columns = ['country', 'party', 'name', 'uid', 'district','created_at', 'id']

data_politicians_2021_only_us = data_politicians_2021[data_politicians_2021['country']=='United States']

data_politicians_2021_only_us_no_duplicates = data_politicians_2021_only_us.drop_duplicates(subset=['id'])



# exporting the tweet ids to a txt file as input for hydrator

#  tweet_ids_politicians_2021_only_us_no_duplicates = data_politicians_2021_only_us_no_duplicates['id']
#  
#  with open('tweets/politicians/rehydrated/tweet_ids_politicians_2021_only_us.txt', 'w') as f:
#      for id in tweet_ids_politicians_2021_only_us_no_duplicates:
#          f.write(str(id))
#          f.write('\n')


# reloading the rehydrated 2021 data:

rehydrated_tweets_politicians_2021 = pd.read_csv('tweets/politicians/rehydrated/parliamentarian_dataset_2021_only_us.csv')
rehydrated_tweets_politicians_2021_col_selected = rehydrated_tweets_politicians_2021[['text','id', 'retweet_screen_name','lang','user_followers_count']].copy()

joined_tweet_and_info_2021 = pd.merge(rehydrated_tweets_politicians_2021_col_selected, data_politicians_2021_only_us_no_duplicates, how = 'inner', on = 'id')
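
An inner join silently drops every tweet id that could not be rehydrated (for example deleted tweets). A possible check, sketched on toy data, is a left join with `indicator=True` to count how many ids came back; all dataframe contents below are invented.

```python
import pandas as pd

# toy stand-ins: the id list sent to the hydrator and the rows that came back
tweet_ids = pd.DataFrame({"id": [1, 2, 3], "name": ["a", "b", "c"]})
hydrated = pd.DataFrame({"id": [1, 3], "text": ["first tweet", "third tweet"]})

check = pd.merge(
    tweet_ids,
    hydrated,
    how="left",
    on="id",
    indicator=True,         # adds a _merge column: 'both' or 'left_only'
    validate="one_to_one",  # raises if an id appears more than once on either side
)

missing_ids = check.loc[check["_merge"] == "left_only", "id"].tolist()
hydration_rate = (check["_merge"] == "both").mean()
```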



same for UK

In [42]:
data_politicians_2021_only_uk = data_politicians_2021[data_politicians_2021['country']=='United Kingdom']

data_politicians_2021_only_uk_no_duplicates = data_politicians_2021_only_uk.drop_duplicates(subset=['id'])



# exporting the tweet ids to a txt file as input for hydrator

# tweet_ids_politicians_2021_only_uk_no_duplicates = data_politicians_2021_only_uk_no_duplicates['id']
# 
# with open('tweets/politicians/rehydrated/tweet_ids_politicians_2021_only_uk.txt', 'w') as f:
#     for id in tweet_ids_politicians_2021_only_uk_no_duplicates:
#         f.write(str(id))
#         f.write('\n')


# reloading the rehydrated 2021 data:

rehydrated_tweets_politicians_2021_uk = pd.read_csv('tweets/politicians/rehydrated/parliamentarian_dataset_2021_only_uk.csv')
rehydrated_tweets_politicians_2021_uk_col_selected = rehydrated_tweets_politicians_2021_uk[['text','id', 'retweet_screen_name','lang','user_followers_count']].copy()

joined_tweet_and_info_2021_uk = pd.merge(rehydrated_tweets_politicians_2021_uk_col_selected, data_politicians_2021_only_uk_no_duplicates, how = 'inner', on = 'id')



same for Australia

In [43]:
data_politicians_2021_only_australia = data_politicians_2021[data_politicians_2021['country']=='Australia']

data_politicians_2021_only_australia_no_duplicates = data_politicians_2021_only_australia.drop_duplicates(subset=['id'])



# exporting the tweet ids to a txt file as input for hydrator

# tweet_ids_politicians_2021_only_australia_no_duplicates = data_politicians_2021_only_australia_no_duplicates['id']
# 
# with open('tweets/politicians/rehydrated/tweet_ids_politicians_2021_only_australia.txt', 'w') as f:
#     for id in tweet_ids_politicians_2021_only_australia_no_duplicates:
#         f.write(str(id))
#         f.write('\n')

# reloading the rehydrated 2021 data:

rehydrated_tweets_politicians_2021_australia = pd.read_csv('tweets/politicians/rehydrated/parliamentarian_dataset_2021_only_australia.csv')
rehydrated_tweets_politicians_2021_australia_col_selected = rehydrated_tweets_politicians_2021_australia[['text','id', 'retweet_screen_name','lang', 'user_followers_count']].copy()

joined_tweet_and_info_2021_australia = pd.merge(rehydrated_tweets_politicians_2021_australia_col_selected, data_politicians_2021_only_australia_no_duplicates, how = 'inner', on = 'id')




2020 and older (from the all_tweet_ids file)

In [13]:
# creating batches of the original dataset to rehydrate (batches of 1 to 1.5 million tweet ids)

# tweet_ids_politicians_2020_second_batch = pd.read_csv("C:/Users/lucas/Downloads/all_tweet_ids.csv", skiprows= 500000, index_col=0, chunksize= 1500000).get_chunk(1500000)
# tweet_ids_politicians_2020_third_batch = pd.read_csv("C:/Users/lucas/Downloads/all_tweet_ids.csv", skiprows= 2000000, index_col=0, chunksize= 1500000).get_chunk(1500000)
# tweet_ids_politicians_2020_fourth_batch = pd.read_csv("C:/Users/lucas/Downloads/all_tweet_ids.csv", skiprows= 5500000, index_col=0, chunksize= 1500000).get_chunk(1500000)
# tweet_ids_politicians_2020_fifth_batch = pd.read_csv("C:/Users/lucas/Downloads/all_tweet_ids.csv", skiprows= 8000000, index_col=0, chunksize= 1000000).get_chunk(1000000)
# tweet_ids_politicians_2020_sixth_batch = pd.read_csv("C:/Users/lucas/Downloads/all_tweet_ids.csv", skiprows= 10000000, index_col=0, chunksize= 1000000).get_chunk(1000000)

In [24]:
# converting the tweet id batches to .csv files to feed into hydrator

# tweet_ids_politicians_2020_second_batch.to_csv('tweets/politicians/rehydrated/tweet_ids_politicians_2020_second_batch.csv')
# tweet_ids_politicians_2020_third_batch.to_csv('tweets/politicians/rehydrated/tweet_ids_politicians_2020_third_batch.csv')
# tweet_ids_politicians_2020_fourth_batch.to_csv('tweets/politicians/rehydrated/tweet_ids_politicians_2020_fourth_batch.csv')
# tweet_ids_politicians_2020_fifth_batch.to_csv('tweets/politicians/rehydrated/tweet_ids_politicians_2020_fifth_batch.csv')
# tweet_ids_politicians_2020_sixth_batch.to_csv('tweets/politicians/rehydrated/tweet_ids_politicians_2020_sixth_batch_test.csv')
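
As an alternative to hand-computed `skiprows` offsets, the `chunksize` iterator of `pd.read_csv` can produce consecutive batches directly, which rules out gaps or overlaps between batches. The sketch below uses a small in-memory CSV instead of the real file, and the output file name in the comment is hypothetical.

```python
import io

import pandas as pd

# toy stand-in for all_tweet_ids.csv: an index column and an id column
toy_csv = io.StringIO(
    "idx,id\n" + "\n".join(f"{i},{1000 + i}" for i in range(10))
)

batches = []
for batch_number, chunk in enumerate(
    pd.read_csv(toy_csv, index_col=0, chunksize=4), start=1
):
    batches.append(chunk)
    # in the real workflow each chunk would be written out for the hydrator, e.g.
    # chunk.to_csv(f"tweet_ids_batch_{batch_number}.csv")  # hypothetical file name
```

With ten rows and a chunk size of four, the iterator yields batches of four, four and two rows.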

reloading the hydrated 2020 and older data

US

In [15]:
# the first batch contains only around 350'000 of the roughly 11'000'000 tweet ids, because the Twitter API request limit was reached:

rehydrated_tweets_politicians_2020_and_older_first_batch = pd.read_csv('tweets/politicians/rehydrated/parliamentarian_dataset_2020_first_batch.csv', usecols = ['text','id','user_id','created_at', 'retweet_screen_name','lang','user_followers_count'])

rehydrated_tweets_politicians_2020_and_older_second_batch = pd.read_csv('tweets/politicians/rehydrated/parliamentarian_dataset_2020_second_batch.csv', usecols = ['text','id','user_id','created_at', 'retweet_screen_name','lang','user_followers_count'])

rehydrated_tweets_politicians_2020_and_older_third_batch = pd.read_csv('tweets/politicians/rehydrated/parliamentarian_dataset_2020_third_batch.csv', usecols = ['text','id','user_id','created_at', 'retweet_screen_name','lang','user_followers_count'])

rehydrated_tweets_politicians_2020_and_older_fourth_batch = pd.read_csv('tweets/politicians/rehydrated/parliamentarian_dataset_2020_fourth_batch.csv', usecols = ['text','id','user_id','created_at', 'retweet_screen_name','lang','user_followers_count'])

rehydrated_tweets_politicians_2020_and_older_fifth_batch = pd.read_csv('tweets/politicians/rehydrated/parliamentarian_dataset_2020_fifth_batch.csv', usecols = ['text','id','user_id','created_at', 'retweet_screen_name','lang','user_followers_count'])

rehydrated_tweets_politicians_2020_and_older_sixth_batch = pd.read_csv('tweets/politicians/rehydrated/parliamentarian_dataset_2020_sixth_batch.csv', usecols = ['text','id','user_id','created_at', 'retweet_screen_name','lang','user_followers_count'])


# recombining to one large dataset
rehydrated_combined_tweets_politicians_2020 = pd.concat([   rehydrated_tweets_politicians_2020_and_older_first_batch,
                                                            rehydrated_tweets_politicians_2020_and_older_second_batch,
                                                            rehydrated_tweets_politicians_2020_and_older_third_batch,
                                                            rehydrated_tweets_politicians_2020_and_older_fourth_batch,
                                                            rehydrated_tweets_politicians_2020_and_older_fifth_batch,
                                                            rehydrated_tweets_politicians_2020_and_older_sixth_batch])

# renaming the user_id column for consistency with the other datasets
rehydrated_combined_tweets_politicians_2020.rename(columns={'user_id' : 'uid'}, inplace = True)

# filtering for english tweets
rehydrated_combined_tweets_politicians_2020_only_english = rehydrated_combined_tweets_politicians_2020.loc[rehydrated_combined_tweets_politicians_2020['lang'] == 'en']


joined_tweet_and_info_2020_and_older = pd.merge(rehydrated_combined_tweets_politicians_2020_only_english, member_info_only_us_no_duplicates, how = 'inner', on = 'uid')
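
The six `read_csv` calls above differ only in the file name, so the batches could also be loaded in a loop. The sketch writes two toy batch files to a temporary directory to stay self-contained; the file names, the shortened column list and the toy rows are assumptions for illustration.

```python
import tempfile
from pathlib import Path

import pandas as pd

COLUMNS = ["text", "id", "user_id", "lang"]  # shortened column list for the toy example

with tempfile.TemporaryDirectory() as tmp:
    # write two toy batch files (stand-ins for the rehydrated batch csv files)
    for offset, batch in enumerate(["first", "second"]):
        pd.DataFrame({
            "text": [f"hello from the {batch} batch", "hallo"],
            "id": [10 * offset + 1, 10 * offset + 2],
            "user_id": [7, 8],
            "lang": ["en", "de"],
            "extra": [0, 0],  # a column that usecols leaves out
        }).to_csv(Path(tmp) / f"toy_batch_{batch}.csv", index=False)

    batch_files = sorted(Path(tmp).glob("toy_batch_*.csv"))
    combined = pd.concat(
        (pd.read_csv(path, usecols=COLUMNS) for path in batch_files),
        ignore_index=True,
    )

# same follow-up steps as in the cell above: rename the key, keep English tweets
combined = combined.rename(columns={"user_id": "uid"})
only_english = combined.loc[combined["lang"] == "en"]
```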

UK

In [45]:
joined_tweet_and_info_2020_and_older_uk = pd.merge(rehydrated_combined_tweets_politicians_2020_only_english, member_info_only_uk_no_duplicates, how = 'inner', on = 'uid')

Australia

In [46]:
joined_tweet_and_info_2020_and_older_australia = pd.merge(rehydrated_combined_tweets_politicians_2020_only_english, member_info_only_australia_no_duplicates, how = 'inner', on = 'uid')

applying the data cleaning functions and combining the results into one joined dataframe

US

In [47]:
joined_tweet_and_info_2021['cleaned_text'] = joined_tweet_and_info_2021['text'].apply(tweet_cleaner) # cleaning the tweet
joined_tweet_and_info_2021['date'] = joined_tweet_and_info_2021['created_at'].apply(tweet_date_to_datetime_alternative) # converting the Twitter date string to a datetime
joined_tweet_and_info_2021['year'] = joined_tweet_and_info_2021['date'].dt.year # extracting the year for timeseries analysis


joined_tweet_and_info_2020_and_older['cleaned_text'] = joined_tweet_and_info_2020_and_older['text'].apply(tweet_cleaner)
joined_tweet_and_info_2020_and_older['date'] = joined_tweet_and_info_2020_and_older['created_at'].apply(tweet_date_to_datetime_words)
joined_tweet_and_info_2020_and_older['year'] = joined_tweet_and_info_2020_and_older['date'].dt.year 


# creating one large dataset from 2020 and 2021:

joined_tweet_and_info_2020_and_older_col_selcted = joined_tweet_and_info_2020_and_older[['text','cleaned_text','name','party','country','date','year','id','uid','lang','retweet_screen_name','user_followers_count']].copy()

joined_tweet_and_info_2021_col_selcted = joined_tweet_and_info_2021[['text','cleaned_text','name','party','country','date','year','id','uid','lang','retweet_screen_name','user_followers_count']].copy()

full_tweet_dataset_politicians = pd.concat([joined_tweet_and_info_2020_and_older_col_selcted,
                                            joined_tweet_and_info_2021_col_selcted],
                                            ignore_index= True)

# adding climate_related column that will later be overwritten by classifier predictions
full_tweet_dataset_politicians['climate_related'] = 1


# only select tweets longer than 3 words:
full_tweet_dataset_politicians['cleaned_text_word_count'] = full_tweet_dataset_politicians['cleaned_text'].apply(word_count).astype('int')

full_tweet_dataset_politicians = full_tweet_dataset_politicians[full_tweet_dataset_politicians['cleaned_text_word_count'] >= 4]
full_tweet_dataset_politicians.reset_index(inplace = True, drop = True)
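
The cleaning, date parsing and word-count filter are repeated for each country below, so they could be collected in one function. This is a sketch under several assumptions: `prepare_tweets` is a hypothetical name, `tweet_cleaner` and `word_count` are simplified stand-ins for the helpers defined earlier in this notebook, and the date format string is the classic Twitter `created_at` layout, which may differ from what the notebook's own date helpers expect.

```python
import re

import pandas as pd


def tweet_cleaner(text: str) -> str:
    # simplified stand-in: only strips '#' and '@' characters
    return re.sub(r"[#@]", "", str(text)).strip()


def word_count(text: str) -> int:
    # simplified stand-in for the notebook's word_count helper
    return len(str(text).split())


def prepare_tweets(df: pd.DataFrame, date_format: str, min_words: int = 4) -> pd.DataFrame:
    """Clean text, parse dates, add the year and drop very short tweets (hypothetical helper)."""
    out = df.copy()
    out["cleaned_text"] = out["text"].apply(tweet_cleaner)
    out["date"] = pd.to_datetime(out["created_at"], format=date_format, utc=True)
    out["year"] = out["date"].dt.year
    out["cleaned_text_word_count"] = out["cleaned_text"].apply(word_count).astype("int")
    out = out[out["cleaned_text_word_count"] >= min_words]
    return out.reset_index(drop=True)


toy = pd.DataFrame({
    "text": ["#FosterYouthVoices must be included today", "Thank you @someone"],
    "created_at": ["Mon May 22 19:07:07 +0000 2017", "Mon May 22 21:17:22 +0000 2017"],
})

prepared = prepare_tweets(toy, date_format="%a %b %d %H:%M:%S %z %Y")
```

The second toy tweet has only three words after cleaning and is dropped by the `min_words` filter.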

In [50]:
full_tweet_dataset_politicians.head(5)

Unnamed: 0,text,cleaned_text,name,party,country,date,year,id,uid,lang,retweet_screen_name,user_followers_count,climate_related,cleaned_text_word_count
0,#FosterYouthVoices must be included in the chi...,FosterYouthVoices must be included in the chil...,bobby l. rush,Democrat,United States,2017-05-22 19:07:07,2017,866732173252059136,305216911,en,,41606,1,19
1,"During Nat’l #DrugCourtMonth, I’d like to reco...","During Natl DrugCourtMonth, Id like to recogni...",bobby l. rush,Democrat,United States,2017-05-22 21:17:22,2017,866764952052355072,305216911,en,,41606,1,22
2,The #TrumpCuts budget is an assault on working...,The TrumpCuts budget is an assault on working ...,bobby l. rush,Democrat,United States,2017-05-23 16:02:57,2017,867048214029099009,305216911,en,JohnYarmuth,41606,1,22
3,"Between #TrumpCare and #TrumpCuts, @realDonald...","Between TrumpCare and TrumpCuts, realDonaldTru...",bobby l. rush,Democrat,United States,2017-05-24 13:38:41,2017,867374292694175746,305216911,en,RepCheri,41606,1,21
4,Thank you @RepBobbyRush for cosponsoring the A...,Thank you RepBobbyRush for cosponsoring the AD...,bobby l. rush,Democrat,United States,2017-05-24 16:17:06,2017,867414162800095232,305216911,en,AAHOA,41606,1,20


In [15]:
# check for duplicated tweets (probably due to retweets)

sum(full_tweet_dataset_politicians['cleaned_text'].duplicated())

21098
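
To check how many of these duplicates are actually retweets, the duplicate flag can be combined with the `retweet_screen_name` column. The sketch assumes that this column is empty (NaN) for original tweets, as the sample output above suggests; the toy rows are invented.

```python
import pandas as pd

toy = pd.DataFrame({
    "cleaned_text": ["a b c d", "a b c d", "e f g h", "e f g h", "i j k l"],
    "retweet_screen_name": ["SomeAccount", "SomeAccount", None, None, None],
})

is_duplicate = toy["cleaned_text"].duplicated()  # True for the 2nd and later occurrences
is_retweet = toy["retweet_screen_name"].notna()  # assumption: NaN means original tweet

n_duplicate_retweets = int((is_duplicate & is_retweet).sum())
n_duplicate_originals = int((is_duplicate & ~is_retweet).sum())
```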

In [10]:
# total politicians in the sampled dataset
len(full_tweet_dataset_politicians['name'].unique())

415

UK

In [48]:
joined_tweet_and_info_2021_uk['cleaned_text'] = joined_tweet_and_info_2021_uk['text'].apply(tweet_cleaner) # cleaning the tweet
joined_tweet_and_info_2021_uk['date'] = joined_tweet_and_info_2021_uk['created_at'].apply(tweet_date_to_datetime_alternative) # converting the Twitter date string to a datetime
joined_tweet_and_info_2021_uk['year'] = joined_tweet_and_info_2021_uk['date'].dt.year # extracting the year for timeseries analysis


joined_tweet_and_info_2020_and_older_uk['cleaned_text'] = joined_tweet_and_info_2020_and_older_uk['text'].apply(tweet_cleaner)
joined_tweet_and_info_2020_and_older_uk['date'] = joined_tweet_and_info_2020_and_older_uk['created_at'].apply(tweet_date_to_datetime_words)
joined_tweet_and_info_2020_and_older_uk['year'] = joined_tweet_and_info_2020_and_older_uk['date'].dt.year 


# creating one large dataset from 2020 and 2021:

joined_tweet_and_info_2020_and_older_uk_col_selcted = joined_tweet_and_info_2020_and_older_uk[['text','cleaned_text','name','party','country','date','year','id','uid','lang','retweet_screen_name','user_followers_count']].copy()

joined_tweet_and_info_2021_uk_col_selcted = joined_tweet_and_info_2021_uk[['text','cleaned_text','name','party','country','date','year','id','uid','lang','retweet_screen_name','user_followers_count']].copy()

full_tweet_dataset_politicians_uk = pd.concat([joined_tweet_and_info_2020_and_older_uk_col_selcted,
                                            joined_tweet_and_info_2021_uk_col_selcted],
                                            ignore_index= True)

# adding climate_related column that will later be overwritten by classifier predictions
full_tweet_dataset_politicians_uk['climate_related'] = 1


# only select tweets longer than 3 words:
full_tweet_dataset_politicians_uk['cleaned_text_word_count'] = full_tweet_dataset_politicians_uk['cleaned_text'].apply(word_count).astype('int')

full_tweet_dataset_politicians_uk = full_tweet_dataset_politicians_uk[full_tweet_dataset_politicians_uk['cleaned_text_word_count'] >= 4]
full_tweet_dataset_politicians_uk.reset_index(inplace = True, drop = True)

In [41]:
full_tweet_dataset_politicians_uk['party'].unique()

array(['Labour', 'Conservative', 'Labour Co-op',
       'Scottish National Party', 'Liberal Democrat',
       'Democratic Unionist Party', 'Plaid Cymru', 'Sinn Féin',
       'Independent', 'Green Party', 'Alliance',
       'Social Democratic & Labour Party'], dtype=object)

Australia

In [49]:
joined_tweet_and_info_2021_australia['cleaned_text'] = joined_tweet_and_info_2021_australia['text'].apply(tweet_cleaner) # cleaning the tweet
joined_tweet_and_info_2021_australia['date'] = joined_tweet_and_info_2021_australia['created_at'].apply(tweet_date_to_datetime_alternative) # converting the Twitter date string to a datetime
joined_tweet_and_info_2021_australia['year'] = joined_tweet_and_info_2021_australia['date'].dt.year # extracting the year for timeseries analysis


joined_tweet_and_info_2020_and_older_australia['cleaned_text'] = joined_tweet_and_info_2020_and_older_australia['text'].apply(tweet_cleaner)
joined_tweet_and_info_2020_and_older_australia['date'] = joined_tweet_and_info_2020_and_older_australia['created_at'].apply(tweet_date_to_datetime_words)
joined_tweet_and_info_2020_and_older_australia['year'] = joined_tweet_and_info_2020_and_older_australia['date'].dt.year 


# creating one large dataset from 2020 and 2021:

joined_tweet_and_info_2020_and_older_australia_col_selcted = joined_tweet_and_info_2020_and_older_australia[['text','cleaned_text','name','party','country','date','year','id','uid','lang','retweet_screen_name','user_followers_count']].copy()

joined_tweet_and_info_2021_australia_col_selcted = joined_tweet_and_info_2021_australia[['text','cleaned_text','name','party','country','date','year','id','uid','lang','retweet_screen_name','user_followers_count']].copy()

full_tweet_dataset_politicians_australia = pd.concat([joined_tweet_and_info_2020_and_older_australia_col_selcted,
                                            joined_tweet_and_info_2021_australia_col_selcted],
                                            ignore_index= True)

# adding climate_related column that will later be overwritten by classifier predictions
full_tweet_dataset_politicians_australia['climate_related'] = 1


# only select tweets longer than 3 words:
full_tweet_dataset_politicians_australia['cleaned_text_word_count'] = full_tweet_dataset_politicians_australia['cleaned_text'].apply(word_count).astype('int')

full_tweet_dataset_politicians_australia = full_tweet_dataset_politicians_australia[full_tweet_dataset_politicians_australia['cleaned_text_word_count'] >= 4]
full_tweet_dataset_politicians_australia.reset_index(inplace = True, drop = True)

In [60]:
full_tweet_dataset_politicians_australia['party'].unique()

array(['Australian Labor Party', 'The Nationals',
       'Liberal Party of Australia', 'Independent', 'Nick Xenophon Team',
       'Australian Greens', "Katter's Australian Party",
       "Pauline Hanson's One Nation", "Derryn Hinch's Justice Party",
       'Liberal National Party of Queensland'], dtype=object)

follower count time series (long-format datasets)

In [17]:
follower_count_selected_politicians = combined_df_selected_polit[['author.name','author.public_metrics.followers_count']].groupby([combined_df_selected_polit.date.dt.to_period("M"),'author.name']).mean().reset_index()


In [18]:
follower_count_selected_politicians.to_csv("./follower_timeseries/followers_timeseries_selected_pol.csv", sep = ";",index=False)


In [None]:


us_politicians_follower_count = full_tweet_dataset_politicians[['name','user_followers_count']].groupby([full_tweet_dataset_politicians.date.dt.to_period("M"),'name']).mean().reset_index()
uk_politicians_follower_count = full_tweet_dataset_politicians_uk[['name','user_followers_count']].groupby([full_tweet_dataset_politicians_uk.date.dt.to_period("M"),'name']).mean().reset_index()
australia_politicians_follower_count = full_tweet_dataset_politicians_australia[['name','user_followers_count']].groupby([full_tweet_dataset_politicians_australia.date.dt.to_period("M"),'name']).mean().reset_index()
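
The grouping above averages the follower count per politician and calendar month. A minimal sketch of what this produces, using invented follower numbers:

```python
import pandas as pd

toy = pd.DataFrame({
    "name": ["a", "a", "a", "b"],
    "user_followers_count": [100, 200, 400, 50],
    "date": pd.to_datetime(["2020-01-05", "2020-01-20", "2020-02-01", "2020-01-10"]),
})

# one row per (month, politician) with the mean follower count in that month
monthly = (
    toy.groupby([toy["date"].dt.to_period("M"), "name"])["user_followers_count"]
    .mean()
    .reset_index()
)
```

Politician `a` tweeted twice in January 2020, so the two follower counts are averaged into one monthly value.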

In [None]:

us_politicians_follower_count.to_csv("./follower_timeseries/followers_timeseries_us_pol.csv", index=False)
uk_politicians_follower_count.to_csv("./follower_timeseries/followers_timeseries_uk_pol.csv", index=False)
australia_politicians_follower_count.to_csv("./follower_timeseries/followers_timeseries_australia_pol.csv", index=False)

## Creating and Exporting the golden label climate classifier dataset

creating a golden label dataset (sampling 2000 tweets and manually labelling them in Label Studio)

In [32]:
golden_labels_dataset_politicians = full_tweet_dataset_politicians.sample(2000)
golden_labels_dataset_politicians['climate_related_human_labelled'] = '-'
golden_labels_dataset_politicians.to_csv('tweets/politicians/rehydrated/golden_labels_dataset_politicians.csv')
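
The sample above is drawn without a seed, so rerunning the cell gives a different set of tweets. If a reproducible sample is wanted, `random_state` can be passed to `sample`. This is a sketch on toy data; the seed value 42 is an arbitrary choice and was not used for the original golden label dataset.

```python
import pandas as pd

toy = pd.DataFrame({"cleaned_text": [f"tweet number {i}" for i in range(100)]})

# the same seed returns the same rows on every run
sample_a = toy.sample(10, random_state=42)
sample_b = toy.sample(10, random_state=42)

golden = sample_a.copy()
golden["climate_related_human_labelled"] = "-"  # placeholder until manually labelled
```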