## Import Libraries

In [35]:
## pandas is required to load the data into a user-friendly data structure (the DataFrame)
import pandas as pd

## nltk is a leading Python library for Natural Language Processing (NLP) tasks
import nltk

## TextBlob is used for sentiment analysis
from textblob import TextBlob

## stopwords is an nltk corpus that provides a list of commonly used, low-value words (e.g. 'the', 'a')
from nltk.corpus import stopwords

## re provides regular expressions, used to identify specified patterns within a string
import re

## string provides a ready-made collection of punctuation characters (string.punctuation)
import string

## word_tokenize is an nltk function that splits a string into a list of words
from nltk import word_tokenize

## WordNetLemmatizer is an nltk class that converts a word to its base form (e.g. feet to foot)
from nltk.stem.wordnet import WordNetLemmatizer

## Pickle allows Python objects to be saved for later use, and retrieved
import pickle

## Set Pandas Display Options

In [36]:
## Set width of pandas dataframe to ensure entire Tweet is displayed
pd.set_option('display.max_colwidth', 3000)

## Import Data

In [37]:
## Import data from Excel file (created in Step 1)
df = pd.read_excel('labelled_data.xlsx', index_col=0, usecols='A:E')
df.columns = ['network','datetime','original_tweet','subject']

In [38]:
## Display first five rows of the dataframe
df.head()

Unnamed: 0,network,datetime,original_tweet,subject
0,@VodafoneUK,2019-12-04 08:05:14,@VodafoneUK Plus £2.28 package &amp; posting ! ! !,Device
1,@VodafoneUK,2019-12-04 08:04:05,I have repeatedly asked how to get a refund so I can use another provider. I have also asked how to escalate my complaint. @VodafoneIN refuses to give me this information. @VodafoneUK @VodafoneGroup @rmstakkar @Nairkavita,Customer Service
2,@VodafoneUK,2019-12-04 08:01:19,"I have supplied visa details twice, I have been subjected to horrendously rude staff instore, and now Vodafone are stealing my money by removing services I have paid for. Tourists should not use Vodafone. @VodafoneIn @VodafoneUK @VodafoneGroup @rmstakkar @Nairkavita",Customer Service
3,@VodafoneUK,2019-12-04 07:57:42,@VodafoneIN promised yesterday I’d receive no more calls and would get an email in 30 mins. No email received. Today I received yet another call. Vodaphone incompetence means I’ll be losing the data I’ve paid for from midnight @VodafoneUK @VodafoneGroup @rmstakkar @Nairkavita,Customer Service
4,@VodafoneUK,2019-12-04 07:57:16,@VodafoneUK you send texts about rewards - this morning Lindt. It takes me to my app but they are never there. Doesn’t matter how quickly I look. It actually becomes annoying.,Promotion


## Missing Values

Identify any missing values within the dataframe.  Handling them now prevents issues with data processing further down the line.

In [39]:
## Use isna().sum() to identify any missing values within the dataframe
df.isna().sum()

network           0
datetime          0
original_tweet    1
subject           0
dtype: int64

As can be seen above, there is only one missing value.  Since it is not possible to predict the contents of a blank tweet, this missing record will be dropped.

In [40]:
## Drop missing values
df.dropna(inplace=True)

In [41]:
## Reset index to ensure sequential index (drop=True discards the old index rather than keeping it as a column)
df = df.reset_index(drop=True)

## Drop All Retweets from Dataset

To avoid duplicated content, all retweets will be dropped from the dataframe.  Retweets can be identified as those starting with the string "RT".  

In [42]:
## Keep only the tweets that do not start with 'RT' (i.e. drop all retweets)
df = df[~df['original_tweet'].str.startswith('RT')]

In [43]:
## Remaining number of tweets
len(df)

12562
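
Note that a plain `startswith('RT')` test also matches ordinary tweets whose first word happens to begin with those two letters. The sketch below compares it with a stricter check. It assumes retweets follow the conventional `RT @user` prefix; the `is_retweet` helper and the sample tweets are illustrative and not part of the original notebook.

```python
import re

# Assumption: a retweet starts with "RT", whitespace, then an @mention
RETWEET_PATTERN = re.compile(r"^RT\s+@\w+")


def is_retweet(text):
    """Return True if the text looks like a conventional 'RT @user ...' retweet."""
    return bool(RETWEET_PATTERN.match(text))


# Hypothetical sample tweets
samples = [
    "RT @VodafoneUK: Our network is back up and running",
    "RTE player keeps buffering on my phone",
    "@O2 thanks for the RT!",
]

for text in samples:
    # Compare the loose prefix test with the stricter pattern
    print(text.startswith("RT"), is_retweet(text), "|", text)
```

The second sample is flagged by the prefix test but not by the stricter pattern, so the two approaches can give different row counts on real data.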

In [44]:
## Reset index to ensure sequential index (drop=True discards the old index rather than keeping it as a column)
df = df.reset_index(drop=True)

## Make Subject Categories Lower Case

The subject will act as a target variable.  To ensure consistent labelling, the different values that this field assumes are checked.

In [46]:
## value_counts() identifies the different values taken on by this field
df['subject'].value_counts(normalize=True)

                    0.651568
Other               0.134851
Customer Service    0.073714
Network             0.043146
Contract            0.041952
Promotion           0.025553
Device              0.021971
Broadband           0.006687
CUstomer Service    0.000318
COntract            0.000080
CuStomer Service    0.000080
other               0.000080
Name: subject, dtype: float64

As can be seen, 'Other' and 'other' are considered different subjects.  So are 'COntract' and 'Contract'.  To prevent this causing issues later on, all values in this field will be made lower case.

In [47]:
## Make all values within the subject field lower case
df['subject'] = df['subject'].str.lower()

To confirm that this has worked, re-run the value_counts() function.

In [49]:
## value_counts() identifies the different values taken on by this field
df['subject'].value_counts(normalize=True)

                    0.651568
other               0.134931
customer service    0.074112
network             0.043146
contract            0.042032
promotion           0.025553
device              0.021971
broadband           0.006687
Name: subject, dtype: float64
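
Label variants that differ only by case can also be flagged programmatically rather than spotted by eye. The following is a minimal sketch using only the standard library; the `find_label_variants` helper and the sample labels are hypothetical and not part of the original notebook.

```python
from collections import defaultdict


def find_label_variants(labels):
    """Group labels by their lower-case form and return the groups with more than one spelling."""
    variants = defaultdict(set)
    for label in labels:
        variants[label.lower()].add(label)
    return {key: sorted(forms) for key, forms in variants.items() if len(forms) > 1}


# Hypothetical sample of raw subject labels
raw_labels = ["Other", "other", "Contract", "COntract", "Network"]
print(find_label_variants(raw_labels))
```

An empty result after lower-casing indicates that no case-only duplicates remain.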

## Use TextBlob to calculate Sentiment Polarity

One of the objectives of this project is to provide an 'on-the-pulse' measure of customer satisfaction.  To support this aim, TextBlob can be used to calculate a sentiment score for each tweet.  The sentiment score ranges from -1 (very negative) through 0 (neutral) to 1 (very positive).  

In [50]:
## Create a new field that indicates sentiment of a tweet using TextBlob
df['sentiment'] = df['original_tweet'].map(lambda text: TextBlob(text).sentiment.polarity)

To check that these sentiment scores are reasonable, five randomly selected tweets are inspected at each extreme (polarity of 1 and -1) and at the neutral point (polarity of 0).

In [51]:
## Print 5 random reviews with the highest positive sentiment (1)
print('5 random reviews with the highest positive sentiment polarity: \n')
positive = df.loc[df.sentiment == 1, ['original_tweet']].sample(5).values
for tweet in positive:
    print(tweet[0])

5 random reviews with the highest positive sentiment polarity: 

@DCass71 @ThreeUK I'm happy with Vodafone tbh!
The VideoTech Innovation Awards are tonight and we've been nominated for our collaborative project with @@MusionEventsLtd and @VodafoneUK! Best of luck to all of the nominees! 

@digitaltveurope
#VIAWARDS19 #DigitalTVAwards #LiveStreaming #5G https://t.co/8by1vtEckf
@bradfreeman123 @ThreeUK @O2 will always be the best 👌🏻
@threeuk fraud support 👍 Top-notch service https://t.co/BTTQWLKikK https://t.co/H3upvVngGs
Thank you @O2 I won some tickets to #Hotboozapalooza. It was my bday on 26th Nov. It made a perfect gift. Can’t wait to go 🙌🏽


In [54]:
## Print 5 random reviews with neutral sentiment (0)
print('5 random reviews with neutral sentiment polarity: \n')
neutral = df.loc[df.sentiment == 0, ['original_tweet']].sample(5).values
for tweet in neutral:
    print(tweet[0])

5 random reviews with neutral sentiment polarity: 

@VodafoneUK I look forward to hearing from you.
@O2 May I add @O2 I have taken this further to the Ombudsman. So yes, thank you, I did follow your link.
@VodafoneUK Last Christmas #verymerewards
@O2 Oops actually it was in my junk mail 🤦🏻‍♀️
@philsturgeon @ThreeUK @sprint Every byte counts, it sounds like


In [55]:
## Print 5 random reviews with the most negative sentiment (-1)
print('5 random reviews with the most negative sentiment polarity: \n')
negative = df.loc[df.sentiment == -1, ['original_tweet']].sample(5).values
for tweet in negative:
    print(tweet[0])

5 random reviews with the most negative sentiment polarity: 

@ThreeUK @ThreeUKSupport This is crazy! I have been trying since Friday morning!!!!
@VodafoneUK is THE WORST phone provider. I called to pay my bill, it says it hasn’t gone through. I check my balance and it’s come out. It’s now in “pending” limbo. And Vodafone are saying it’s not there problem. DO NOT GET A VODAFONE CONTRACT. #Vodafone
@O2 Just tweeting to let the rest of us know of this nasty technique. Eternal vigilance is the price of freedom.
@ThreeUK I’ve just swapped from EE to ThreeUK and the download speeds are terrible!! I’ve gone from 65mos to 5mps!!! #DOWNLOAD #downloadspeed ##three
@paolopescatore @VodafoneUK @VodafoneGroup Problem is 5G coverage is terrible


As can be seen from the above, TextBlob has provided a reasonable (but not perfect) score of sentiment.  
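
For a pulse-style summary, it can help to bucket the continuous polarity score into coarse classes. The sketch below is illustrative only: the `polarity_to_label` function and the threshold of 0.1 are assumptions, not part of the original analysis, and the threshold would need tuning against real tweets.

```python
def polarity_to_label(polarity, threshold=0.1):
    """Map a polarity score in [-1, 1] to 'negative', 'neutral' or 'positive'."""
    if not -1.0 <= polarity <= 1.0:
        raise ValueError("polarity must lie between -1 and 1")
    if polarity > threshold:
        return "positive"
    if polarity < -threshold:
        return "negative"
    return "neutral"


# Example scores of the kind produced by TextBlob's sentiment.polarity
scores = [1.0, 0.0, -0.3, -0.155556, 0.05]
labels = [polarity_to_label(score) for score in scores]
print(labels)
```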

## Create List of Stopwords

Commonly used words, such as 'a', 'you', 'the' etc. add little value to an NLP task because they do not help a model to distinguish between different tweets.  For this reason, it is important that they are removed.  To do this, all of these low-value words (known as stopwords) must be identified.  

As a starting point, NLTK provides a comprehensive list of stopwords.  This can then be augmented with punctuation, which also provides no value.  The names of the major phone networks will also be removed.  

In [56]:
## Call the NLTK list of english stopwords
stopwords_list = stopwords.words('english')

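## Add each common punctuation character to this list
stopwords_list += list(string.punctuation)

## Add other low-value tokens left behind by tokenization (e.g. quote marks, ellipses and 'amp' from '&amp;')
stopwords_list += ["/n","''", '""', '...', '``',"'",'’','amp']

## Add the names of the major phone networks
stopwords_list += ['vodafone', 'three','ee','o2']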

Finally, the stopwords list can be previewed.

In [57]:
## Preview the first 20 items on the stopwords list
stopwords_list[0:20]

['i',
 'me',
 'my',
 'myself',
 'we',
 'our',
 'ours',
 'ourselves',
 'you',
 "you're",
 "you've",
 "you'll",
 "you'd",
 'your',
 'yours',
 'yourself',
 'yourselves',
 'he',
 'him',
 'his']
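
Two details of the stopword list are worth seeing in isolation: using `+=` with a string adds each character as a separate item, and converting the finished list to a set makes membership checks faster. The sketch below uses a tiny stand-in for the NLTK list so that it runs without any downloads; the `remove_stopwords` helper and the sample tokens are hypothetical.

```python
import string

# Tiny stand-in for stopwords.words('english'); the real NLTK list is much longer
stopwords_list = ["i", "me", "my", "the", "a"]

# += with a string extends the list one character at a time
stopwords_list += string.punctuation
stopwords_list += ["vodafone", "three", "ee", "o2"]

# A set gives constant-time lookups when filtering many tokens
stopwords_set = set(stopwords_list)


def remove_stopwords(tokens, stopword_set):
    """Lower-case each token and keep only those not in the stopword set."""
    return [token.lower() for token in tokens if token.lower() not in stopword_set]


tokens = ["My", "Vodafone", "signal", "is", "down", "!"]
filtered = remove_stopwords(tokens, stopwords_set)
print(filtered)
```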

## Processing Tweets: Remove Stopwords, Make Lower Case and Tokenize

To pre-process each tweet, the following steps must be completed:

1. Stopwords and any other low value text patterns must be removed
2. All tweets must be made lower case (to ensure any model understands that DOG is the same word as Dog and dog)
3. All tweets must be tokenized.  That is, each string must be split into a list of words.

To make this process efficient and repeatable, a single function will be created to complete each of these tasks for a given tweet.

In [58]:
## Function to process tweet
def process_tweet(tweet):
    
    ## Remove "@username" mentions from each tweet (raw strings avoid invalid-escape warnings)
    pattern = r'(\w*@\w*)'
    p = re.compile(pattern)
    tweet = p.sub('',tweet)
    
    ## Remove links from each tweet
    pattern2 = r'((http|https)\:\/\/)?[a-zA-Z0-9\.\/\?\:@\-_=#]+\.([a-zA-Z]){2,6}([a-zA-Z0-9\.\&\/\?\:@\-_=#])*'
    p = re.compile(pattern2)
    tweet = p.sub('',tweet)
    
    ## Remove non-ASCII characters (the range ends at 'z', so '{', '|', '}' and '~' are removed as well)
    pattern3 = r'([^\x00-\x7A])+'
    p = re.compile(pattern3)
    tweet = p.sub('',tweet)

    ## Tokenize tweet
    tokens = nltk.word_tokenize(tweet)
    
    ## Retain only words that are not in the Stopwords list
    stopwords_removed = [token.lower() for token in tokens if token.lower() not in stopwords_list]
    return stopwords_removed
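
The character range in the third pattern stops at `\x7A` (the letter 'z'), so it strips a few ASCII symbols in addition to non-ASCII characters. The sketch below compares that range with one covering the full ASCII table (`\x00-\x7F`). The sample string and variable names are made up for illustration, and the full-range pattern is an alternative, not what the function above uses.

```python
import re

# Range used in process_tweet: everything above 'z' is removed
original_range = re.compile(r"[^\x00-\x7A]+")

# Alternative (assumption): keep every ASCII character, remove only non-ASCII
full_ascii_range = re.compile(r"[^\x00-\x7F]+")

sample = "I’d pay £5 ~ {now}"

stripped_original = original_range.sub("", sample)
stripped_full = full_ascii_range.sub("", sample)

print(repr(stripped_original))
print(repr(stripped_full))
```

Both versions delete the curly apostrophe, which is why contractions such as "I’d" end up as tokens like "id" in the processed tweets.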

Having defined this function, each tweet can be processed. The processed tweets are added to an empty list.

In [59]:
## Empty list to contain processed tweets
tokenized_tweets = []

## For loop that iterates through each tweet and processes it, before adding to the list tokenized_tweets
for tweet in list(df['original_tweet']):
    
    tokenized_tweets.append(process_tweet(tweet))
        

To understand the effect of processing a tweet, look at a given tweet before and after processing.

In [60]:
## Tweet before processing
df.loc[15]['original_tweet']

'Thanks @VodafoneUK for an early Christmas Present. https://t.co/0cYvktGGqv'

In [61]:
## Tweet after processing
tokenized_tweets[15]

['thanks', 'early', 'christmas', 'present']

As can be gauged from the above, the processed tweet retains its core meaning, without retaining any surplus-to-requirement words.

## Lemmatization

The next stage of preprocessing is to lemmatize the tweets.  Lemmatization is the process of grouping alternative variations of a word, so that they can be interpreted as having a single meaning.  For example, goats becomes goat, and feet becomes foot.  

NLTK provides a class to assist with lemmatization called WordNetLemmatizer.  This is used below.  Note that its lemmatize() method treats every word as a noun unless a part of speech is supplied, so verb forms such as 'asked' or 'received' are left unchanged.

In [62]:
## Create an instance of the WordNetLemmatizer class
lemmatizer = WordNetLemmatizer()

In [63]:
## Create an empty list to contain lemmatized words
lemmatized_tweets = []

## Iterate through each tokenized word to convert to its lemmatized form
for tweet in tokenized_tweets:
    
    lemmatized = []
    
    for word in tweet:
        
        lemmatized.append(lemmatizer.lemmatize(word))
    
    lemmatized_tweets.append(lemmatized)
        
        

## Save Tokenized/Lemmatized Tweets to Modified Dataframe

The tokenized/lemmatized tweets can now be added to the existing dataframe.  They will be added both as tokens and as a string to ensure maximum flexibility during exploratory data analysis and modelling.

In [64]:
## Create a new column in the existing dataframe to contain lemmatized tweets as tokens
df['lemmatized_tweets_tokens'] = lemmatized_tweets

In [65]:
## Create a new column in the existing dataframe to contain the lemmatized tweets as a string
df['lemmatized_tweets_string'] = str()

In [66]:
## Join each list of lemmatized tokens back into a single space-separated string
df['lemmatized_tweets_string'] = df['lemmatized_tweets_tokens'].map(" ".join)

In [67]:
## Preview the first five rows
df.head()

Unnamed: 0,network,datetime,original_tweet,subject,sentiment,lemmatized_tweets_tokens,lemmatized_tweets_string
0,@VodafoneUK,2019-12-04 08:05:14,@VodafoneUK Plus £2.28 package &amp; posting ! ! !,device,0.0,"[plus, 2.28, package, posting]",plus 2.28 package posting
1,@VodafoneUK,2019-12-04 08:04:05,I have repeatedly asked how to get a refund so I can use another provider. I have also asked how to escalate my complaint. @VodafoneIN refuses to give me this information. @VodafoneUK @VodafoneGroup @rmstakkar @Nairkavita,customer service,-0.3,"[repeatedly, asked, get, refund, use, another, provider, also, asked, escalate, complaint, refuse, give, information]",repeatedly asked get refund use another provider also asked escalate complaint refuse give information
2,@VodafoneUK,2019-12-04 08:01:19,"I have supplied visa details twice, I have been subjected to horrendously rude staff instore, and now Vodafone are stealing my money by removing services I have paid for. Tourists should not use Vodafone. @VodafoneIn @VodafoneUK @VodafoneGroup @rmstakkar @Nairkavita",customer service,-0.3,"[supplied, visa, detail, twice, subjected, horrendously, rude, staff, instore, stealing, money, removing, service, paid, tourist, use]",supplied visa detail twice subjected horrendously rude staff instore stealing money removing service paid tourist use
3,@VodafoneUK,2019-12-04 07:57:42,@VodafoneIN promised yesterday I’d receive no more calls and would get an email in 30 mins. No email received. Today I received yet another call. Vodaphone incompetence means I’ll be losing the data I’ve paid for from midnight @VodafoneUK @VodafoneGroup @rmstakkar @Nairkavita,customer service,-0.25,"[promised, yesterday, id, receive, call, would, get, email, 30, min, email, received, today, received, yet, another, call, vodaphone, incompetence, mean, ill, losing, data, ive, paid, midnight]",promised yesterday id receive call would get email 30 min email received today received yet another call vodaphone incompetence mean ill losing data ive paid midnight
4,@VodafoneUK,2019-12-04 07:57:16,@VodafoneUK you send texts about rewards - this morning Lindt. It takes me to my app but they are never there. Doesn’t matter how quickly I look. It actually becomes annoying.,promotion,-0.155556,"[send, text, reward, morning, lindt, take, app, never, doesnt, matter, quickly, look, actually, becomes, annoying]",send text reward morning lindt take app never doesnt matter quickly look actually becomes annoying


## Split DataFrame into Labelled and Unlabelled Observations

Finally, the dataframe will be split into observations with a subject label, and observations without a subject label.  These will be used as follows in a semi-supervised learning approach:

- The labelled dataframe will be used to build the initial set of models
- A set of labels will then be generated for the unlabelled dataset using this model
- The entire dataset will then be used to rebuild the models, and identify performance improvement.

In [68]:
## Create the labelled dataframe (rows with a non-blank subject) and save it as a pickle file
df_labelled = df.loc[df['subject']!= " "]
df_labelled.to_pickle('cleaned_labelled_tweets')

In [69]:
## Create the unlabelled dataframe (rows with a blank subject) and save it as a pickle file
df_unlabelled = df.loc[df['subject']== " "]
df_unlabelled.to_pickle('cleaned_unlabelled_tweets')
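
The split above relies on unlabelled rows containing exactly one space in the subject field. A slightly more defensive variant strips whitespace first, so that empty strings or multiple spaces are also treated as unlabelled. This is a sketch under that assumption; the mini-dataframe and variable names are hypothetical stand-ins for the cleaned tweets.

```python
import pandas as pd

# Hypothetical mini-dataframe standing in for the cleaned tweets
df_example = pd.DataFrame({
    "original_tweet": ["tweet one", "tweet two", "tweet three", "tweet four"],
    "subject": ["network", " ", "", "device"],
})

# A row is unlabelled if its subject is empty once whitespace is stripped
is_unlabelled = df_example["subject"].str.strip() == ""

labelled = df_example.loc[~is_unlabelled]
unlabelled = df_example.loc[is_unlabelled]

# Every row should land in exactly one of the two dataframes
print(len(labelled), len(unlabelled), len(df_example))
```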