# Task 2 - Twitter post classifier (using company labels)

In this task, we predict the name of the company from the inbound text sent by customers. I treat this as a multi-class text classification problem and solve it with tensorflow.keras.

Importing most of the libraries required for the code.

In [2]:
import itertools
import os

%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import tensorflow as tf

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Activation, Dropout
from tensorflow.keras.preprocessing import text, sequence
from tensorflow.keras import utils

from sklearn.preprocessing import LabelBinarizer, LabelEncoder
from sklearn.metrics import confusion_matrix


Loading the dataset using pandas and displaying the properties of the dataset.

In [3]:
df1 = pd.read_csv('twcs.csv')
print(df1.info()) 

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2811774 entries, 0 to 2811773
Data columns (total 7 columns):
 #   Column                   Dtype  
---  ------                   -----  
 0   tweet_id                 int64  
 1   author_id                object 
 2   inbound                  bool   
 3   created_at               object 
 4   text                     object 
 5   response_tweet_id        object 
 6   in_response_to_tweet_id  float64
dtypes: bool(1), float64(1), int64(1), object(4)
memory usage: 131.4+ MB
None


Pulling the author and text columns out of the dataframe.

In [4]:
author_list = df1['author_id']
text_list = df1['text']

Counting the unique company names in the dataset. Company handles are alphabetic while customer IDs are numeric, so filtering on isalpha() keeps only the companies. There are 91 of them.

In [5]:
authors = df1[df1.author_id.apply(lambda x: x.isalpha())]
unique_authors_list = authors['author_id'].unique()


In [6]:
print(unique_authors_list, len(unique_authors_list))

['sprintcare' 'VerizonSupport' 'ChipotleTweets' 'AskPlayStation'
 'marksandspencer' 'MicrosoftHelps' 'ATVIAssist' 'AdobeCare' 'AmazonHelp'
 'XboxSupport' 'AirbnbHelp' 'nationalrailenq' 'AirAsiaSupport' 'Morrisons'
 'NikeSupport' 'AskAmex' 'McDonalds' 'YahooCare' 'AskLyft' 'UPSHelp'
 'Delta' 'AppleSupport' 'Tesco' 'SpotifyCares' 'comcastcares'
 'AmericanAir' 'TMobileHelp' 'VirginTrains' 'SouthwestAir' 'AskeBay'
 'GWRHelp' 'sainsburys' 'AskPayPal' 'HPSupport' 'ChaseSupport' 'CoxHelp'
 'DropboxSupport' 'VirginAtlantic' 'AzureSupport' 'AlaskaAir'
 'ArgosHelpers' 'AskTarget' 'GoDaddyHelp' 'CenturyLinkHelp' 'AskPapaJohns'
 'askpanera' 'Walmart' 'USCellularCares' 'AsurionCares' 'GloCare'
 'NeweggService' 'VirginAmerica' 'DunkinDonuts' 'TfL' 'asksalesforce'
 'Kimpton' 'AskCiti' 'IHGService' 'LondonMidland' 'JetBlue' 'BoostCare'
 'JackBox' 'AldiUK' 'HiltonHelp' 'GooglePlayMusic' 'OfficeSupport'
 'DellCares' 'TwitterSupport' 'GreggsOfficial' 'ATT' 'TacoBellTeam'
 'AskRBC' 'ArbysCares' 'NortonSup

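Rearranging the data: each customer tweet (numeric author ID) that directly follows a company row is paired with that company's name as its label. The total number of rows in the rearranged data is printed at the end.

In [7]:
author_newlist, text_newlist = [], []

# Stop one row early because each step looks at the following row.
for authorIndex in range(len(author_list) - 1):
    try:
        # Keep the next row's text when a company row is followed by a customer row.
        if author_list[authorIndex].isalpha() and author_list[authorIndex + 1].isnumeric():
            text_newlist.append(text_list[authorIndex + 1])
            author_newlist.append(author_list[authorIndex])
    except AttributeError:
        # Skip rows whose author_id is not a string.
        continue
print(len(text_newlist))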

957278
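
The pairing rule used in the loop above can be illustrated on a few rows. This is a minimal sketch on made-up data: the `pair_inbound` helper and the sample `author_ids` and `texts` lists are hypothetical and not part of the notebook. A row is kept when the previous row's author is a company handle (alphabetic) and the row's own author is a numeric customer ID.

```python
def pair_inbound(author_ids, texts):
    """Label each customer tweet with the company handle on the preceding row."""
    labels, posts = [], []
    for i in range(len(author_ids) - 1):
        current, following = author_ids[i], author_ids[i + 1]
        # Company handles are alphabetic; customer IDs are numeric strings.
        if current.isalpha() and following.isnumeric():
            labels.append(current)
            posts.append(texts[i + 1])
    return labels, posts

# Hypothetical sample rows, in file order.
author_ids = ["sprintcare", "115712", "sprintcare", "VerizonSupport", "115719"]
texts = ["reply A", "@sprintcare I did.", "reply B", "reply C", "@VerizonSupport thanks!"]

labels, posts = pair_inbound(author_ids, texts)
for label, post in zip(labels, posts):
    print(label, ":", post)
```

Rows where a company row is followed by another company row are dropped, which is why the rearranged data is smaller than the original file.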


Checking the first rows of the new data for garbage values.

In [8]:
for i in range(10):   
    print(author_newlist[i], ":", text_newlist[i])

sprintcare : @sprintcare and how do you propose we do that
sprintcare : @sprintcare I did.
sprintcare : @sprintcare is the worst customer service
sprintcare : @sprintcare You gonna magically change your connectivity for me and my whole family ? 🤥 💯
sprintcare : @sprintcare Since I signed up with you....Since day 1
sprintcare : @115714 y’all lie about your “great” connection. 5 bars LTE, still won’t load something. Smh.
sprintcare : @115714 whenever I contact customer support, they tell me I have shortcode enabled on my account, but I have never in the 4 years I've tried https://t.co/0G98RtNxPK
VerizonSupport : @VerizonSupport I finally got someone that helped me, thanks!
VerizonSupport : somebody from @VerizonSupport please help meeeeee 😩😩😩😩 I'm having the worst luck with your customer service
VerizonSupport : @VerizonSupport My friend is without internet we need to play videogames together please our skills diminish every moment without internetz


Creating a dataframe from the new lists and verifying it.

In [9]:
data = {
  "author" : author_newlist,
  "text": text_newlist
}

df = pd.DataFrame(data)
print(df.head(10)) 

           author                                               text
0      sprintcare      @sprintcare and how do you propose we do that
1      sprintcare                                 @sprintcare I did.
2      sprintcare          @sprintcare is the worst customer service
3      sprintcare  @sprintcare You gonna magically change your co...
4      sprintcare  @sprintcare Since I signed up with you....Sinc...
5      sprintcare  @115714 y’all lie about your “great” connectio...
6      sprintcare  @115714 whenever I contact customer support, t...
7  VerizonSupport  @VerizonSupport I finally got someone that hel...
8  VerizonSupport  somebody from @VerizonSupport please help meee...
9  VerizonSupport  @VerizonSupport My friend is without internet ...


Helper functions and imports for cleaning the text.

In [10]:
import re
import string

import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')
stop = set(stopwords.words("english"))

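def remove_emoji(data):
    emoj = re.compile("["
        u"\U0001F600-\U0001F64F"  # emoticons
        u"\U0001F300-\U0001F5FF"  # symbols & pictographs
        u"\U0001F680-\U0001F6FF"  # transport & map symbols
        u"\U0001F1E0-\U0001F1FF"  # flags (regional indicators)
        u"\U00002500-\U00002BEF"  # box drawing through miscellaneous symbols and arrows
        u"\U00002702-\U000027B0"  # dingbats
        u"\U000024C2-\U0001F251"  # enclosed characters; this wide range also spans CJK and other scripts
        u"\U0001f926-\U0001f937"  # additional face and people emoji
        u"\U00010000-\U0010ffff"  # all supplementary planes
        u"\u2640-\u2642"  # gender signs
        u"\u2600-\u2B55"  # miscellaneous symbols
        u"\u200d"  # zero-width joiner
        u"\u23cf"  # eject symbol
        u"\u23e9"  # fast-forward symbol
        u"\u231a"  # watch
        u"\ufe0f"  # variation selector-16
        u"\u3030"  # wavy dash
                      "]+", re.UNICODE)
    return emoj.sub('', data)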

def remove_URL(text):
    url = re.compile(r"https?://\S+|www\.\S+")
    return url.sub(r"", text)

def remove_html(text):
    html = re.compile(r"<.*?>")
    return html.sub(r"", text)

def remove_punct(text):
    table = str.maketrans("", "", string.punctuation)
    return text.translate(table)

def remove_stopwords(text):
    text = [word.lower() for word in text.split() if word.lower() not in stop]
    return " ".join(text)

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Likhi\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
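
To see what these helpers do when chained, here is a small self-contained sketch. It re-implements the URL, HTML, punctuation, and stop-word steps inline and leaves out emoji removal for brevity. The `clean` function and the tiny `STOP` set are hypothetical stand-ins (the notebook uses NLTK's English list), and the URL in the sample tweet is made up.

```python
import re
import string

# Hypothetical mini stop-word list; the notebook uses NLTK's English list instead.
STOP = {"i", "that", "me", "the", "a"}

def clean(tweet):
    tweet = re.sub(r"https?://\S+|www\.\S+", "", tweet)  # strip URLs
    tweet = re.sub(r"<.*?>", "", tweet)                   # strip HTML tags
    # Strip ASCII punctuation (curly quotes and other Unicode marks survive).
    tweet = tweet.translate(str.maketrans("", "", string.punctuation))
    words = [w.lower() for w in tweet.split() if w.lower() not in STOP]
    return " ".join(words)

sample = "@VerizonSupport I finally got someone that helped me, thanks! https://t.co/abc"
print(clean(sample))
```

Note that removing punctuation also strips the `@` sign, so the company handle stays in the text as an ordinary lowercase word.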


Cleaning the text and removing uninformative content from the data.

In [11]:
df['text'] = df.text.map(remove_URL)  # remove URLs
df['text'] = df.text.map(remove_html)  # remove HTML elements
df['text'] = df.text.map(remove_emoji)  # remove emojis
df['text'] = df.text.map(remove_punct)  # remove punctuation
df['text'] = df.text.map(remove_stopwords)  # lowercase and remove stopwords
df['text'] = df['text'].str.replace(r'\d+', '', regex=True)  # remove digits



Because of hardware limits, I divide the data into six bins of nearly equal size.

In [12]:
bins = 6
frames = np.array_split(df, bins)
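
As a rough check on what this split produces, the sketch below computes the bin sizes by hand. It assumes the documented `np.array_split` rule that the first `n % k` pieces get one extra row; the `split_sizes` helper is hypothetical and only mirrors that rule.

```python
def split_sizes(n_rows, n_bins):
    """Sizes produced by an array_split-style division of n_rows into n_bins."""
    base, extra = divmod(n_rows, n_bins)
    # The first `extra` bins get one additional row each.
    return [base + 1 if i < extra else base for i in range(n_bins)]

sizes = split_sizes(957278, 6)
print(sizes)
print(sum(sizes))
```

With 957278 rows and six bins, the first two bins hold 159547 rows and the remaining four hold 159546.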

Checking the number of rows and the number of labels in each bin.

In [14]:
bin_index = 0
for i in range(len(frames)):
    print("Bin ", i, ": ", "number of rows -> ", len(frames[i]), "and number of labels/classes -> ", frames[i]["author"].nunique())

Bin  0 :  number of rows ->  159547 and number of labels/classes ->  91
Bin  1 :  number of rows ->  159547 and number of labels/classes ->  91
Bin  2 :  number of rows ->  159546 and number of labels/classes ->  91
Bin  3 :  number of rows ->  159546 and number of labels/classes ->  91
Bin  4 :  number of rows ->  159546 and number of labels/classes ->  91
Bin  5 :  number of rows ->  159546 and number of labels/classes ->  91


Counting the number of distinct words (the vocabulary size) in the selected bin and printing it.

In [15]:
from collections import Counter

# Count how often each word occurs across the text column
def counter_word(text):
    count = Counter()
    for i in text.values:
        for word in i.split():
            count[word] += 1
    return count

max_words = len(counter_word(frames[bin_index]["text"]))
print(max_words)

78027
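
The printed value is the number of distinct words, not the number of word occurrences. A minimal sketch on made-up tweets (the `tweets` list is hypothetical) shows the difference:

```python
from collections import Counter

# Hypothetical cleaned tweets, used only for illustration.
tweets = ["worst customer service", "customer service helped", "phone crashes"]

counts = Counter(word for tweet in tweets for word in tweet.split())
total_tokens = sum(counts.values())  # every occurrence
vocabulary_size = len(counts)        # distinct words only
print(total_tokens, vocabulary_size)
```

Here `len(counts)` plays the role of `max_words` in the notebook.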


Setting the sizes of the train and test sets (a 70/30 split).

In [16]:
train_size = int(len(frames[bin_index]) * .7)
print("Train size: %d" % train_size)
print("Test size: %d" % (len(frames[bin_index]) - train_size))

Train size: 111682
Test size: 47865


Splitting the cleaned data into train and test sets. The split is positional: the first 70% of the rows go to train and the rest to test.

In [17]:
train_posts = frames[bin_index]['text'][:train_size]
train_tags = frames[bin_index]['author'][:train_size]

test_posts = frames[bin_index]['text'][train_size:]
test_tags = frames[bin_index]['author'][train_size:]

Creating a tokenizer and converting the train and test text into bag-of-words matrices.

In [18]:
tokenize = text.Tokenizer(num_words=max_words, char_level=False)

In [19]:
tokenize.fit_on_texts(train_posts) # only fit on train
x_train = tokenize.texts_to_matrix(train_posts)
x_test = tokenize.texts_to_matrix(test_posts)
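
The idea behind `texts_to_matrix` in its default binary mode can be sketched in plain Python. This is a simplified illustration, not Keras's implementation: the `fit_vocab` and `to_binary_matrix` helpers are hypothetical, and breaking frequency ties alphabetically is an assumption of this sketch. Each tweet becomes a row with a 1 in the column of every known word it contains.

```python
from collections import Counter

def fit_vocab(texts):
    """Map each word to an index, most frequent first; index 0 stays unused."""
    counts = Counter(word for t in texts for word in t.split())
    ranked = sorted(counts, key=lambda w: (-counts[w], w))
    return {word: i + 1 for i, word in enumerate(ranked)}

def to_binary_matrix(texts, vocab, num_words):
    rows = []
    for t in texts:
        row = [0.0] * num_words
        for word in t.split():
            idx = vocab.get(word)
            # Unknown words and indices beyond num_words are ignored.
            if idx is not None and idx < num_words:
                row[idx] = 1.0
        rows.append(row)
    return rows

train = ["phone crashes", "phone update"]
test = ["update crashes tablet"]  # "tablet" never appears in train
vocab = fit_vocab(train)
x_test = to_binary_matrix(test, vocab, num_words=4)
print(vocab)
print(x_test)
```

Because the vocabulary is fitted on the training text only, words that appear only in the test set contribute nothing to the test matrix.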

Encoding the labels (classes) as integers and then as one-hot vectors.

In [20]:
encoder = LabelEncoder()
encoder.fit(train_tags)
y_train = encoder.transform(train_tags)
y_test = encoder.transform(test_tags)

In [21]:
num_classes = np.max(y_train) + 1
y_train = utils.to_categorical(y_train, num_classes)
y_test = utils.to_categorical(y_test, num_classes)
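
The two encoding steps can be sketched without scikit-learn or Keras. This is a minimal illustration on made-up labels: the `encode_labels` and `one_hot` helpers are hypothetical stand-ins for `LabelEncoder` and `utils.to_categorical`. It assumes classes are indexed in sorted order.

```python
def encode_labels(train_labels):
    """Sorted class list plus a label -> integer lookup."""
    classes = sorted(set(train_labels))
    return classes, {label: i for i, label in enumerate(classes)}

def one_hot(indices, num_classes):
    """One row per index, with a single 1.0 in the column of that index."""
    return [[1.0 if j == i else 0.0 for j in range(num_classes)] for i in indices]

train_tags = ["sprintcare", "AmazonHelp", "AppleSupport", "AmazonHelp"]
classes, lookup = encode_labels(train_tags)
y = one_hot([lookup[t] for t in train_tags], len(classes))
print(classes)
print(y)
```

Since the lookup is built from the training labels only, a company that appears only in the test set would have no index, which is worth keeping in mind with a positional split.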

Verifying.

In [22]:
print('x_train shape:', x_train.shape)
print('x_test shape:', x_test.shape)
print('y_train shape:', y_train.shape)
print('y_test shape:', y_test.shape)

x_train shape: (111682, 78027)
x_test shape: (47865, 78027)
y_train shape: (111682, 91)
y_test shape: (47865, 91)


Setting the batch size (an arbitrary choice) and the number of epochs (kept low because the model overfits).

In [23]:
batch_size = 32
epochs = 2

Defining and compiling the model.

In [24]:
model = Sequential()
model.add(Dense(512, input_shape=(max_words,)))
model.add(Activation('relu'))
model.add(Dropout(0.5))
model.add(Dense(num_classes))
model.add(Activation('softmax'))

model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
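
For a sense of scale, the number of trainable parameters in this network can be worked out by hand from the shapes printed earlier (78027 input features, 91 classes). The `dense_params` helper below is hypothetical; the Activation and Dropout layers add no parameters.

```python
def dense_params(n_inputs, n_units):
    """Weights plus biases of one fully connected layer."""
    return n_inputs * n_units + n_units

hidden = dense_params(78027, 512)  # vocabulary size -> 512 units
output = dense_params(512, 91)     # 512 units -> 91 classes
print(hidden, output, hidden + output)
```

Almost all of the roughly 40 million parameters sit in the first layer, which may partly explain why the model overfits quickly.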

Fitting the model with the parameters and variables defined above. The highest accuracy I got is 95.58%.

In [25]:
Tweet_model = model.fit(x_train, y_train, batch_size=batch_size, epochs=epochs, verbose=1, validation_split=0.1)

Epoch 1/2
Epoch 2/2


Evaluating the model on the test set: accuracy is 92.046%.

In [26]:
score = model.evaluate(x_test, y_test, batch_size=batch_size, verbose=1)
print('Test accuracy:', score[1])

Test accuracy: 0.9204638004302979
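
The accuracy reported here is the share of tweets whose highest-probability class matches the true class. A minimal sketch of that calculation follows, with hypothetical `argmax` and `accuracy` helpers and made-up targets and probabilities:

```python
def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=lambda i: values[i])

def accuracy(y_true, y_prob):
    """Fraction of rows where the predicted class equals the true class."""
    hits = sum(argmax(t) == argmax(p) for t, p in zip(y_true, y_prob))
    return hits / len(y_true)

# Hypothetical one-hot targets and predicted probabilities: three tweets, three classes.
y_true = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
y_prob = [[0.7, 0.2, 0.1], [0.5, 0.4, 0.1], [0.1, 0.1, 0.8]]
print(accuracy(y_true, y_prob))
```

In this toy case two of the three predictions are correct; the second tweet is assigned to the first class instead of the second.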


Sample predictions on the first ten test tweets.

In [27]:
text_labels = encoder.classes_ 

for i in range(10):
    prediction = model.predict(np.array([x_test[i]]))
    predicted_label = text_labels[np.argmax(prediction)]
    print(test_posts.iloc[i][:50], "...")
    print('Actual label:' + test_tags.iloc[i])
    print("Predicted label: " + predicted_label + "\n")

amazonhelp metal credit card impacting qi charging ...
Actual label:AmazonHelp
Predicted label: AmazonHelp

 hola received item missing part contact seller th ...
Actual label:AmazonHelp
Predicted label: AskeBay

 policy price changes purchase ...
Actual label:AmazonHelp
Predicted label: AmazonHelp

applesupport contact  know specifics ...
Actual label:AppleSupport
Predicted label: AppleSupport

 solid customer service  ...
Actual label:AppleSupport
Predicted label: AmazonHelp

 applesupport yes update twitter apps lagging also ...
Actual label:AppleSupport
Predicted label: AppleSupport

phone crashes every mins banging love new update  ...
Actual label:AppleSupport
Predicted label: AppleSupport

applesupport won’t videos stream devices onto appl ...
Actual label:AppleSupport
Predicted label: AppleSupport

applesupport  international dialing code brazil io ...
Actual label:AppleSupport
Predicted label: AppleSupport

applesupport home screen look locked time datelate ...
Actual label:Ap

Saving the model

In [28]:
model.save('saved_model/my_model2')

INFO:tensorflow:Assets written to: saved_model/my_model2\assets


Loading the model (commented out since it is not needed here)

In [None]:
#new_model = tf.keras.models.load_model('saved_model/my_model2')

Model summary (commented out since it is not needed here)

In [None]:
#new_model.summary()