# Introduction

Social network data needs some extra cleaning before it can be analyzed. We cleaned our tweet data in the following steps:
1. <b>Remove URLs (links)</b> - Tweets contain links/URLs that are not needed for our topic modeling and sentiment analysis. We therefore removed them from the tweets with regular expressions, using Python's `re` package.

2. <b>Clean Case Issues in Tweets</b> - Capitalisation can be a problem when analyzing tweets, because the upper-case and lower-case forms of the same word are treated as two separate words. To account for this, we converted all words to lowercase using the `.lower()` string method.

3. <b>Remove unnecessary words</b> - Stop words carry little meaning on their own, so we removed them from the tweets. The tweets were scraped on the topic of climate change, so the collection words ("climatechange", "climate", "change") are expected to appear in each tweet and were removed as well. We also lemmatized the remaining words, which is generally more informative than simple stemming: rather than only reducing words, lemmatization draws on a language's full vocabulary to apply a morphological analysis to each word. Finally, digits were removed from the words.
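
The three steps above can be sketched on a single tweet using only the standard library. This is a minimal illustration, not the notebook's implementation: the sample tweet, the tiny `STOP_WORDS` set, and the `preprocess` helper are assumptions made for the example (the notebook itself uses NLTK's English stop-word list), and lemmatization is left out because it needs the WordNet data.

```python
import re

# Hypothetical tweet and a deliberately tiny stop-word set, for illustration only
SAMPLE_TWEET = "Climate change is REAL, read more at https://example.com/report"
STOP_WORDS = {"is", "at", "more", "the", "a"}
COLLECTION_WORDS = {"climatechange", "climate", "change"}

def preprocess(tweet):
    # Step 1: drop URLs, then any remaining non-alphanumeric characters
    no_url = re.sub(r"\w+://\S+", "", tweet)
    no_symbols = re.sub(r"[^0-9A-Za-z \t]", "", no_url)
    # Step 2: lowercase so "REAL" and "real" count as the same word
    lowered = no_symbols.lower()
    # Step 3: drop stop words and collection words
    return [w for w in lowered.split()
            if w not in STOP_WORDS and w not in COLLECTION_WORDS]

tokens = preprocess(SAMPLE_TWEET)
print(tokens)  # ['real', 'read']
```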

*References* 
* https://towardsdatascience.com/a-guide-to-cleaning-text-in-python-943356ac86ca
* https://www.analyticsvidhya.com/blog/2020/11/text-cleaning-nltk-library/

#### Import libraries

In [1]:
import pandas as pd
import re, nltk, string
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')
nltk.download('stopwords')
from nltk.tokenize.treebank import TreebankWordDetokenizer

[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/sanyaanand/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/sanyaanand/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


#### Class created to clean the tweets dataset

In [2]:
# cleaned data
class clean_data():
    def __init__(self, data):
        self.data = data
    
    # remove URLs and every character that is not alphanumeric or whitespace,
    # then collapse repeated whitespace
    def remove_url(self,txt):
        return " ".join(re.sub(r"([^0-9A-Za-z \t])|(\w+://\S+)", "", txt).split())
    
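    # lowercase the text, then strip bracketed text, punctuation, and words containing digits
    def clean_text(self,text):
        text = text.lower()
        # drop anything enclosed in square brackets
        text = re.sub(r'\[.*?\]', '', text)
        # drop punctuation characters
        text = re.sub(r'[%s]' % re.escape(string.punctuation), '', text)
        # drop any word that contains a digit
        text = re.sub(r'\w*\d\w*', '', text)
        return text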

    
    def clean(self):
        df = self.data.iloc[:,1:4]
        # rename the columns 
        df = df.rename(columns={'0':'date','1':'id','2':'tweets_text'})
        df = df.dropna(axis = 0)
        # apply remove url function
        df['clean_txt'] = [self.remove_url(i) for i in df['tweets_text']]
        # apply clean text function
        df['clean_txt'] = [self.clean_text(tweet) for tweet in df['clean_txt']]
        # remove stop words
        stop_words = set(stopwords.words('english'))
        words_in_tweet = [tweet.split() for tweet in df['clean_txt']]
        df['clean_words'] = [[word for word in tweet_words if word not in stop_words] for tweet_words in words_in_tweet]
        # remove collection words
        collection_words = ['climatechange', 'climate', 'change']
        df['clean_words'] = [[w for w in words if w not in collection_words] for words in df['clean_words']]
        # lemmatize each word
        wordnet_lemmatizer = WordNetLemmatizer()
        df['clean_words'] = [[wordnet_lemmatizer.lemmatize(word) for word in words] for words in df['clean_words']]
        # remove digits from words (regex substitution)
        df['clean_words'] = [[re.sub(r'\d+', '', w) for w in words] for words in df['clean_words']]
        # detokenise the clean words to form a sentence
        df['clean_words_text'] = [TreebankWordDetokenizer().detokenize(words) for words in df['clean_words']]
        # length of the cleaned tweet text, in characters
        df['tweet_length'] = [len(text) for text in df['clean_txt']]
        return df
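
To make the effect of the regular expressions in `remove_url` and `clean_text` easier to follow, here is a small standalone sketch that applies the same patterns to one string. The sample tweet is hypothetical, and the variable names `step1`, `step2`, `literal`, and `pattern` exist only for this illustration.

```python
import re
import string

# Hypothetical tweet, for illustration only
raw = "#30x30 is an AMBITIOUS goal! https://t.co/abc123"

# Same pattern as remove_url: drop symbols and URLs, then collapse whitespace
step1 = " ".join(re.sub(r"([^0-9A-Za-z \t])|(\w+://\S+)", "", raw).split())

# Same substitutions as clean_text
step2 = step1.lower()
step2 = re.sub(r"\[.*?\]", "", step2)
step2 = re.sub(r"[%s]" % re.escape(string.punctuation), "", step2)
step2 = re.sub(r"\w*\d\w*", "", step2)  # "30x30" contains digits, so the whole word is dropped

print(repr(step1))  # '30x30 is an AMBITIOUS goal'
print(repr(step2))  # ' is an ambitious goal'

# str.replace matches its first argument literally; re.sub treats it as a pattern
literal = "abc123".replace(r"\d+", "")
pattern = re.sub(r"\d+", "", "abc123")
print(literal, pattern)  # abc123 abc
```

Because the digit-bearing word is removed after whitespace has been collapsed, the cleaned text can start with a space, and that space is counted in a character-based length.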
    

#### Apply the class above to clean the tweets dataset and remove unnecessary columns

In [3]:
# load tweets data set
tweets_data = pd.read_csv('./tweets.csv')

In [4]:
# clean the data
tweets = clean_data(tweets_data)
data = tweets.clean()
# drop the unnecessary tokenized column
data = data.drop('clean_words',axis = 1)

#### Overview of cleaned data set

In [5]:
data.sample(5)

| | date | id | tweets_text | clean_txt | clean_words_text | tweet_length |
|---|---|---|---|---|---|---|
| 8458 | 2021-05-06 02:45:01 | 1.390135e+18 | The GOP is in climate denial. https://t.co/5rQ... | the gop is in climate denial | gop denial | 28 |
| 2713 | 2021-05-06 17:02:48 | 1.390351e+18 | This was passed along from a friend at RBC. He... | this was passed along from a friend at rbc hea... | passed along friend rbc head cftc talking two ... | 174 |
| 149 | 2021-05-06 21:45:29 | 1.390422e+18 | I don’t know if it’s exist but, could you anim... | i dont know if its exist but could you animato... | dont know exist could animator video trophy an... | 258 |
| 1856 | 2021-05-06 18:34:44 | 1.390374e+18 | #30x30 is an ambitious and necessary endeavor ... | is an ambitious and necessary endeavor to con... | ambitious necessary endeavor conserve nature b... | 150 |
| 9550 | 2021-05-05 23:27:21 | 1.390086e+18 | *adds elevated maternal death rate for black w... | adds elevated maternal death rate for black wo... | add elevated maternal death rate black woman p... | 113 |


#### Save dataset to a csv file to be directly used for analysis

In [6]:
data.to_csv('cleaned_data.csv',index = False)
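
One thing to keep in mind when the saved file is read back for analysis: a tweet whose words were all removed during cleaning has an empty `clean_words_text`, and by default pandas reads an empty CSV field as `NaN`. The sketch below shows this round trip on a hypothetical two-row frame written to an in-memory buffer rather than to `cleaned_data.csv`; the frame and the names `demo`, `buffer`, `reloaded`, and `n_missing` are assumptions made for the example.

```python
import io
import pandas as pd

# Hypothetical frame standing in for the cleaned data; the second tweet
# lost every word during cleaning, so its text is an empty string
demo = pd.DataFrame({"id": [1, 2], "clean_words_text": ["gop denial", ""]})

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
demo.to_csv(buffer, index=False)
buffer.seek(0)

# Empty fields come back as NaN with the default read_csv settings
reloaded = pd.read_csv(buffer)
n_missing = int(reloaded["clean_words_text"].isna().sum())
print(n_missing)  # 1

# Replace NaN with an empty string so string operations keep working downstream
reloaded["clean_words_text"] = reloaded["clean_words_text"].fillna("")
```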