# NLP Data Cleaning

Background reading on text cleaning: https://monkeylearn.com/blog/text-cleaning/

In [1]:
import pandas as pd
import re
import pathlib
import os

To work with spaCy, we first have to download the small English pipeline, en_core_web_sm.

In [2]:
! python -m spacy download en_core_web_sm 

import spacy
nlp = spacy.load('en_core_web_sm')
from spacy.lang.en import STOP_WORDS

Looking in indexes: https://metoffice.jfrog.io/metoffice/api/pypi/pypi/simple
Collecting en-core-web-sm==3.4.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.4.0/en_core_web_sm-3.4.0-py3-none-any.whl (12.8 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


In [3]:
data_dir = pathlib.Path('/project/informatics_lab/pip_nlp_data/')
data_fn = 'twitter_data_202207260000_202208010900.csv'
tweet_data = pd.read_csv(data_dir / data_fn)
tweet_data.head()

Unnamed: 0,tweet_id,created_at,tweet,like_count,quote_count,reply_count,retweet_count
0,1551734038204923904,2022-07-26 00:59:59+00:00,$2.7 billion for climate change (slashing carb...,15,1,0,6
1,1551734021591269377,2022-07-26 00:59:55+00:00,@nathaliejacoby1 Climate change. The rise in t...,2,0,0,0
2,1551734013815029761,2022-07-26 00:59:53+00:00,@JacobsVegasLife @LasVegasLocally This is a ch...,8,0,1,0
3,1551733993740980224,2022-07-26 00:59:48+00:00,Climate Change and Energy Minister Chris Bowen...,18,0,8,5
4,1551733979316887554,2022-07-26 00:59:45+00:00,"@Thebs15800518 At 5:30, @SecGranHolm tries to ...",0,0,0,0


In [4]:
tweet_data.shape

(167946, 7)

In [5]:
tweet_data = tweet_data.drop_duplicates()
tweet_data.shape

(146069, 7)

### Clean the data

The first step is to create a new column in the pandas DataFrame to hold the cleaned tweet text.

In [6]:
tweet_data['clean'] = tweet_data.tweet

Convert all text to lowercase.

In [7]:
tweet_data.clean = tweet_data.clean.str.lower()

Before we remove punctuation from the text, we identify any hashtags within the tweets and put them in a separate column.

In [8]:
tweet_data['hashtags'] = tweet_data.clean.apply(lambda x: [word for word in x.split(' ') if word.startswith('#')])
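
Splitting on single spaces misses a hashtag that directly follows a newline and keeps any punctuation attached to the tag. As a minimal sketch of an alternative, assuming a hashtag is `#` followed by letters, digits or underscores, a regular expression can pull the tags out directly. The helper name `extract_hashtags` and the sample text are hypothetical and not part of the notebook.

```python
import re

# Assumption: a hashtag is '#' followed by one or more word characters.
HASHTAG_RE = re.compile(r"#\w+")

def extract_hashtags(text):
    """Return every hashtag in the text, in order of appearance."""
    return HASHTAG_RE.findall(text)

sample = "fight #climatechange!\n#oil and #gas, now"  # made-up tweet
print(extract_hashtags(sample))
```

On the DataFrame this could be applied with `tweet_data.clean.apply(extract_hashtags)`.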

Next we remove punctuation, URLs, hashtags, mentions of other Twitter users and any HTML-escaped ampersands (`&amp;`).

In [9]:
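# Drop HTML-escaped ampersands ('&amp;') together with any characters attached to them
tweet_data.clean = tweet_data.clean.apply(lambda x: re.sub(r'&amp\S+', '', x))
# Drop mentions, hashtags, URLs, a leading 'rt', and every character that is not an ASCII letter, digit, space or tab (this includes newlines and emojis)
tweet_data.clean = tweet_data.clean.apply(lambda x: re.sub(r"(@\S+)|(#\S+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)|^rt|http.+?", "", x))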
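
One side effect of the character class `[^0-9A-Za-z \t]` is worth seeing on a small example: it also deletes newlines, so the words on either side of a line break are joined together. The sketch below reuses the pattern from the cell above on an invented string, then shows one possible workaround, which is to turn all whitespace into single spaces before stripping characters. The sample text and variable names are assumptions for illustration.

```python
import re

# Pattern copied from the cell above.
PATTERN = r"(@\S+)|(#\S+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)|^rt|http.+?"

sample = "@user climate change\nis real: https://t.co/abc #climate"  # made-up tweet

# Applying the pattern directly glues 'change' and 'is' together,
# because the newline between them is deleted rather than replaced.
direct = re.sub(PATTERN, "", sample)

# Possible workaround: normalise all whitespace to single spaces first,
# then strip characters and collapse the leftover gaps.
normalised = re.sub(r"\s+", " ", sample)
spaced = " ".join(re.sub(PATTERN, "", normalised).split())

print(repr(direct))
print(repr(spaced))
```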

Remove any excess whitespace.

In [10]:
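# Replace newlines with spaces (a no-op at this point: the character class in the previous cell has already deleted them)
tweet_data.clean = tweet_data.clean.apply(lambda x: re.sub(r'\n', ' ', x))
# Trim leading and trailing whitespace, then collapse runs of whitespace into single spaces
tweet_data.clean = tweet_data.clean.apply(lambda x: x.strip())
tweet_data.clean = tweet_data.clean.apply(lambda x: ' '.join(x.split()))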

Finally, we remove any emojis in the tweets.

In [11]:
import emoji
tweet_data.clean = tweet_data.clean.apply(lambda x: emoji.replace_emoji(x, replace=''))

In [12]:
print('Before cleaning: ', tweet_data.tweet.iloc[1])
print('\nAfter cleaning: ', tweet_data.clean.iloc[1])

Before cleaning:  @nathaliejacoby1 Climate change. The rise in temperature will be bad enough, but the secondary consequences - famine, disease, war, global political and economic instability - are terrifying on an epic scale.

After cleaning:  climate change the rise in temperature will be bad enough but the secondary consequences famine disease war global political and economic instability are terrifying on an epic scale


Note that for sentiment analysis, some of the content that we have cleaned out here, such as punctuation and emojis, could be useful.

### Remove stop words and lemmatize

Lemmatize

The goal of both stemming and lemmatization is to reduce inflectional forms, and sometimes derivationally related forms, of a word to a common base form.
Lemmatization is closely related to stemming. The difference is that a stemmer operates on a single word without knowledge of its context, and therefore cannot discriminate between words that have different meanings depending on their part of speech. A lemmatizer takes that context into account. However, stemmers are typically easier to implement and run faster, and the reduced accuracy may not matter for some applications.
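
To make the distinction concrete, here is a toy sketch. `toy_stem` strips suffixes from a word in isolation, while `toy_lemmatize` looks the word up together with its part of speech. Both functions and the tiny lookup table are invented for illustration; real stemmers and lemmatizers, including spaCy's, are far more elaborate.

```python
def toy_stem(word):
    """Strip a few common suffixes without looking at any context."""
    for suffix in ("ing", "ed", "s"):
        # Only strip when a reasonably long stem would remain.
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

# Hand-written lookup keyed on (word, part of speech); purely illustrative.
TOY_LEMMAS = {
    ("meeting", "NOUN"): "meeting",
    ("meeting", "VERB"): "meet",
    ("terrifying", "VERB"): "terrify",
}

def toy_lemmatize(word, pos):
    """Return the base form for this word given its part of speech."""
    return TOY_LEMMAS.get((word, pos), word)

# The stemmer gives one answer regardless of how the word is used ...
print(toy_stem("meeting"))
# ... while the lemmatizer can tell the noun from the verb.
print(toy_lemmatize("meeting", "NOUN"))
print(toy_lemmatize("meeting", "VERB"))
```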

In [13]:
print('Before lemmatization: ', tweet_data.clean.iloc[1])
# Lemmatize every token with spaCy, skipping whitespace tokens
tweet_data['clean'] = tweet_data.clean.apply(lambda x: ' '.join([token.lemma_ for token in nlp(x) if not token.is_space]))
# Lowercase again in case the lemmatizer returned any capitalized forms
tweet_data.clean = tweet_data.clean.str.lower()
print('\nAfter lemmatization: ', tweet_data.clean.iloc[1])

Before lemmatization:  climate change the rise in temperature will be bad enough but the secondary consequences famine disease war global political and economic instability are terrifying on an epic scale

After lemmatization:  climate change the rise in temperature will be bad enough but the secondary consequence famine disease war global political and economic instability be terrify on an epic scale


Remove stopwords

The spaCy Python package provides a set of stop words (`STOP_WORDS`), such as 'the', 'be' and 'a'. These words help text flow but add little information to a sentence. By removing them, we can focus on the words that carry the most information.

In [14]:
print('Before removing stopwords: ', tweet_data.clean.iloc[1])
tweet_data.clean = tweet_data.clean.apply(lambda x: ' '.join([word for word in x.split(' ') if word not in STOP_WORDS]))
print('\nAfter removing stopwords: ', tweet_data.clean.iloc[1])

Before removing stopwords:  climate change the rise in temperature will be bad enough but the secondary consequence famine disease war global political and economic instability be terrify on an epic scale

After removing stopwords:  climate change rise temperature bad secondary consequence famine disease war global political economic instability terrify epic scale
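
The filtering step is a plain set-membership test, sketched below with a small hand-picked stop word set standing in for spaCy's `STOP_WORDS`. The subset and the helper name `remove_stop_words` are assumptions for illustration. Because the comparison is case-sensitive and the toy set is lowercase, capitalized words slip through, which is why the text is lowercased before this step.

```python
# Hypothetical subset of stop words, for illustration only.
TOY_STOP_WORDS = {"the", "in", "will", "be", "but", "and", "on", "an"}

def remove_stop_words(text, stop_words):
    """Keep only the words that are not in the stop word set."""
    return " ".join(word for word in text.split() if word not in stop_words)

print(remove_stop_words("the rise in temperature will be bad", TOY_STOP_WORDS))

# 'The' is not filtered because the membership test is case-sensitive.
print(remove_stop_words("The rise", TOY_STOP_WORDS))
```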


### Frequently used words

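If we check the most frequently used words, we can see that most of them are useful terms that we would expect to appear in relation to climate change. The stray token 's' is the exception; it appears to be a fragment left over from contractions and possessives once the apostrophes were stripped.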

In [15]:
from collections import defaultdict
word_freq = defaultdict(int)
for sent in tweet_data.clean:
    sent = sent.split(' ')
    for i in sent:
        word_freq[i] += 1
        
for word in sorted(word_freq, key=word_freq.get, reverse=True)[:20]:
    print(word, word_freq[word])

climate 136999
change 130934
people 12751
like 9753
s 9100
year 8879
need 8398
world 8370
think 7363
global 7254
cause 6945
know 6698
time 6198
fight 6097
bill 5927
real 5761
use 5474
help 5465
want 5451
new 5294
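
The same tally can be written more compactly with `collections.Counter`. This is a sketch of an equivalent approach, not the code that produced the output above; the helper name `top_words` and the sample texts are made up. It uses `split()` without an argument, which skips empty strings, whereas `split(' ')` on an empty tweet yields a single empty 'word' that would be counted.

```python
from collections import Counter

def top_words(texts, n):
    """Return the n most common whitespace-separated words across all texts."""
    counts = Counter(word for text in texts for word in text.split())
    return counts.most_common(n)

sample_texts = ["climate change rise", "climate change", "climate", ""]  # made-up data
for word, count in top_words(sample_texts, 2):
    print(word, count)
```

With the DataFrame from this notebook, the call would be `top_words(tweet_data.clean, 20)`.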


In [16]:
tweet_data.head()

Unnamed: 0,tweet_id,created_at,tweet,like_count,quote_count,reply_count,retweet_count,clean,hashtags
0,1551734038204923904,2022-07-26 00:59:59+00:00,$2.7 billion for climate change (slashing carb...,15,1,0,6,27 billion climate change slash carbon emissio...,[]
1,1551734021591269377,2022-07-26 00:59:55+00:00,@nathaliejacoby1 Climate change. The rise in t...,2,0,0,0,climate change rise temperature bad secondary ...,[]
2,1551734013815029761,2022-07-26 00:59:53+00:00,@JacobsVegasLife @LasVegasLocally This is a ch...,8,0,1,0,chill podcast happen salt lake city great salt...,[]
3,1551733993740980224,2022-07-26 00:59:48+00:00,Climate Change and Energy Minister Chris Bowen...,18,0,8,5,climate change energy minister chris bowen hit...,[]
4,1551733979316887554,2022-07-26 00:59:45+00:00,"@Thebs15800518 At 5:30, @SecGranHolm tries to ...",0,0,0,0,530 try hide fact begin sign legislation shut ...,"[#biden, #oil, #buildbackbetter]"


### Save clean dataset

In [17]:
out_path = pathlib.Path(os.environ['SCRATCH']) / (data_fn.split('.')[0] + '_clean.csv')
tweet_data.to_csv(out_path, index=False)

In [18]:
(data_fn.split('.')[0] + '_clean.csv')

'twitter_data_202207260000_202208010900_clean.csv'
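
`data_fn.split('.')[0]` keeps everything before the first dot. That works for this file name but would truncate a name that contains additional dots. A minimal sketch of an alternative using `pathlib` follows; the helper `clean_name` and the second file name are hypothetical.

```python
from pathlib import Path

def clean_name(filename):
    """Insert '_clean' between the file stem and its final extension."""
    path = Path(filename)
    return f"{path.stem}_clean{path.suffix}"

print(clean_name("twitter_data_202207260000_202208010900.csv"))
# A name with an extra dot keeps everything before the final extension.
print(clean_name("tweets.v2.csv"))
```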