<a href="https://colab.research.google.com/github/nurulnadira/COP26-Tweets/blob/main/Analysing_COP26_Tweets.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

For this assignment, we put ourselves in the shoes of two data analysts working at the United Nations. Our task is to examine the impact of the COP26 conference on Twitter users and to find out what attracted their attention at the conference.

We collected tweets about the COP26 conference into a list and tokenised the words to inspect them more closely. After cleaning the tokens, we looked at their frequency distribution, drew a word cloud and plotted the data.
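
The steps above can be sketched on a couple of hard-coded example sentences, without calling the Twitter API. This is only a minimal illustration: the sample texts, the tiny stopword set and the regex tokeniser are assumptions standing in for the collected tweets, NLTK's stopword list and `word_tokenize`, and the sketch skips the part-of-speech filtering used later in the notebook.

```python
import re
from collections import Counter

# Hypothetical sample texts standing in for collected tweets
sample_tweets = [
    "World leaders talk about energy at the summit",
    "The future of energy is the future of the world",
]

# Tiny illustrative stopword set (the notebook uses NLTK's English list)
stop_words = {"the", "of", "is", "at", "about"}

def tokenise(text):
    # Simple regex tokeniser: lowercase the text and keep alphabetic runs only
    return re.findall(r"[a-z]+", text.lower())

# Tokenise every tweet, drop stopwords, then count what is left
tokens = [tok for tweet in sample_tweets for tok in tokenise(tweet)]
filtered = [tok for tok in tokens if tok not in stop_words]
frequencies = Counter(filtered)

print(frequencies.most_common(3))
```

A `Counter` is enough for the frequency table here; the notebook wraps it in NLTK's `FreqDist` only to get the built-in plot.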

The words “world”, “energy”, “future” and “leader” are among the most used, as most world leaders attended the conference and energy was the main topic they focused on. The word "patriarchy", used to criticise the fact that almost all of the countries' leaders are male, is also among the most common words. The words “car” and “jets” were heavily used, as many criticised the irony of leaders flying to a climate change conference by jet rather than travelling by car. Considering other commonly used words such as "crisis" and "lies", the conference appears to have left people sceptical and feeling that action is overdue, rather than reassured.

Since this code pulls new data every time it runs, the output may change over time. Because we prepared this assignment while the conference was taking place, we believe the results reflect the discussion at that time.
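
One way to keep a run reproducible despite the changing search results is to save the collected tweet texts to disk and reload them later. The sketch below is an assumption rather than part of the original notebook: the helper names `save_snapshot` and `load_snapshot` and the file name are hypothetical, and the example writes to a temporary directory.

```python
import json
import os
import tempfile
from datetime import datetime, timezone

def save_snapshot(texts, path):
    # Store the tweet texts together with the collection time (UTC)
    snapshot = {
        "collected_at": datetime.now(timezone.utc).isoformat(),
        "tweets": list(texts),
    }
    with open(path, "w", encoding="utf-8") as f:
        json.dump(snapshot, f, ensure_ascii=False, indent=2)

def load_snapshot(path):
    # Read back the tweet texts written by save_snapshot
    with open(path, encoding="utf-8") as f:
        return json.load(f)["tweets"]

# Usage with placeholder texts and a hypothetical file name
example_path = os.path.join(tempfile.mkdtemp(), "cop26_snapshot.json")
save_snapshot(["example tweet one", "example tweet two"], example_path)
restored = load_snapshot(example_path)
print(restored)
```

Running the analysis cells on the reloaded list instead of a fresh search would give the same frequencies each time.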

In [None]:
#Import Libraries

!pip install tweepy
import tweepy        # https://github.com/tweepy/tweepy
import json
import nltk
import string
import matplotlib.pyplot as plt
import pandas as pd
import re
from nltk.tokenize import word_tokenize
from nltk.tag import pos_tag
from nltk.corpus import inaugural
from nltk.corpus import stopwords
from collections import Counter
from nltk.probability import FreqDist
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('averaged_perceptron_tagger')
nltk.download('inaugural')

In [None]:
#Add keys and secrets

access_key = '****'
access_secret = '****'
api_key = '****'
api_secret = '****'

In [None]:
auth = tweepy.OAuthHandler(api_key, api_secret)
auth.set_access_token(access_key, access_secret)
api = tweepy.API(auth)

In [None]:
#Set up list to hold tweets to append as we iterate through the pages

cop26_tweets = []

for page in tweepy.Cursor(api.search_tweets, 
                          q='COP26 OR cop26 OR COP OR COP-26 OR #COP26 OR #cop26', #the OR operator must be uppercase
                          lang='en').pages(100):

    cop26_tweets.append(page)

In [None]:
#Count tweets
tweet_count = sum(len(search_result) for search_result in cop26_tweets) #each page is a list of tweets
print(tweet_count)

In [None]:
#Get list of all tweets and exclude URLs

def get_tweets(twitter_pages):
    cop26_tweet_deets = []
    for search_result in twitter_pages:
        for status in search_result:
            #Skip retweets; check before mentions are stripped, otherwise 'RT @' never matches
            if status.text.startswith('RT @'):
                continue
            text = re.sub(r'\w+:\/{2}[\d\w-]+(\.[\d\w-]+)*(?:(?:\/[^\s/]*))*', '', status.text) #remove URL
            text = re.sub(r'@\w+', '', text)  #remove mention (usernames may contain underscores)
            cop26_tweet_deets.append(text) #Only take the status text, exclude the account username
    return cop26_tweet_deets
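
The order of the retweet check and the mention removal matters in a function like this one. The sketch below illustrates the point on a made-up tweet text. The sample string, the helper names and the simplified URL pattern are assumptions for illustration, not the notebook's exact code.

```python
import re

URL_PATTERN = re.compile(r"\w+://\S+")   # simplified URL pattern (assumption)
MENTION_PATTERN = re.compile(r"@\w+")    # '@' followed by letters, digits or underscores

def strip_urls_and_mentions(text):
    # Remove URLs first, then @mentions
    text = URL_PATTERN.sub("", text)
    return MENTION_PATTERN.sub("", text)

def is_retweet(text):
    # Classic retweets start with 'RT @username'
    return text.startswith("RT @")

# Made-up example text
raw = "RT @some_user: Leaders arrive at #COP26 https://example.com/story"
stripped = strip_urls_and_mentions(raw)

print(is_retweet(raw))       # checked on the raw text
print(is_retweet(stripped))  # checked after the mention has been removed
print(repr(stripped))
```

Once the mention is stripped, the text no longer begins with `RT @`, so a retweet test applied afterwards cannot detect it; testing the raw text first avoids this.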

In [None]:
def clean_tweets(tokens):
    
    #Only take nouns
    token_pos_tag = nltk.pos_tag(tokens)
    token_nouns=[]
    token_nouns=[word for (word,pos) in token_pos_tag if (pos=='NN' or pos=='NNS' or pos=='NNP' or pos=='NNPS')]
    
    #Convert token to lowercase
    lowercase_tokens = [token.lower() for token in token_nouns]
    
    #Define additional words and collection words to be removed (lowercase, because tokens are lowercased above)
    additional_words=['@',':','rt',',','https',"'","’","cop26","cop","cop-26","#cop26"]
    
    #Define stopwords, punctuations, digits
    remove_these = set(stopwords.words('english') + list(string.punctuation) + list(string.digits)+list(additional_words))
    
    #Remove additional words, stopwords, punctuations, digits
    filtered_text = [token 
                 for token in lowercase_tokens 
                 if not token in remove_these]
    
    return filtered_text

In [None]:
#Get list of all tweets
cop26_tweet_deets=get_tweets(cop26_tweets) 

#Tokenize tweets
tweets_string = " ".join(cop26_tweet_deets)
tokens = word_tokenize(tweets_string)

#Clean tweets
filtered_text=clean_tweets(tokens)

#Calculate frequencies of tokens
simple_frequencies_dict = Counter(filtered_text)

In [None]:
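#Tabulate the 30 most common tokens (use the Counter, since fdist is only defined in a later cell)
clean_tweets_no_urls = pd.DataFrame(simple_frequencies_dict.most_common(30),
                             columns=['words', 'count'])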

clean_tweets_no_urls.head()

In [None]:
#Illustrate wordcloud
from wordcloud import WordCloud
cloud = WordCloud(width=800, height=260, max_font_size=160, 
                  colormap="viridis", 
                  background_color='white',).generate_from_frequencies(simple_frequencies_dict)
plt.figure(figsize=(16,12))
plt.imshow(cloud, interpolation='bilinear')
plt.axis('off')
plt.show()

In [None]:
#Frequency Distribution
fdist = FreqDist(simple_frequencies_dict)

# Frequency Distribution Plot
fdist.plot(30,cumulative=False)
plt.show()