## Data Cleaning and Preprocessing
**Author** - Kushal 

#### Understanding data
* The next few cells explore the data: the email text, the languages it is written in, and the other parts that make up each email.
* The data contains duplicates, i.e. some emails appear more than once.
* Many emails also mix more than one language.

In [11]:
import pandas as pd
import numpy as np

In [12]:
data = pd.read_excel('sampledata.xlsx')
data.head()

Unnamed: 0,Information
0,"Ciao Maurizio,\n\nGrazie mille per aver trovat..."
1,"Ciao Maurizio,\n\nGrazie mille per aver trovat..."
2,こんにちは、 のユーザー様。\n\nこの問題についてご連絡いただきありがとうございます。 申...
3,こんにちは、 のユーザー様。\n\nこの問題についてご連絡いただきありがとうございます。 申...
4,"Hi Stephane,\n\nThank you for keeping me updat..."


In [13]:
# Understanding the structure of emails

x = data['Information'][4]
x.split('\n')

['Hi Stephane,',
 '',
 "Thank you for keeping me updated on this issue. I'm happy to hear that the issue got resolved after all and you can now use   in its full functionality again. ",
 'Also many thanks for your suggestions. For cards that are mainly using (for example) QR codes, the scanning example will also adapt to this format.',
 'We hope to improve this feature for all cards in the future. ',
 '',
 "In case you experience any further problems with your   app, please don't hesitate to contact me again.",
 '',
 'Best regards,',
 '',
 '',
 'Solveig Miriam Brandt',
 'Customer Support',
 '',
 '  GmbH',
 'C-HUB / Hafenstraße 25-27',
 '68159 Mannheim']

In [14]:
# There are duplicate emails in the dataset.

print('Total Emails: ', len(data))
print('Unique Emails: ', data['Information'].nunique())

Total Emails:  596
Unique Emails:  492
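
The duplicate check above can be illustrated without the spreadsheet. The following is a minimal sketch on a made-up list of emails; the helper name `summarise_duplicates` and the sample strings are assumptions for illustration only. In pandas, `data.drop_duplicates(subset='Information')` would remove the repeated rows in a similar way.

```python
from collections import Counter

def summarise_duplicates(emails):
    """Return the unique emails (first-seen order) and the ones that repeat."""
    counts = Counter(emails)
    # dict.fromkeys keeps the first occurrence of each email and preserves order.
    unique = list(dict.fromkeys(emails))
    repeated = {text: n for text, n in counts.items() if n > 1}
    return unique, repeated

# Toy data, not taken from the real dataset.
sample = ["Ciao Maurizio", "Ciao Maurizio", "Hi Stephane", "Hallo Anna", "Hi Stephane"]
unique, repeated = summarise_duplicates(sample)
print("Total emails: ", len(sample))
print("Unique emails:", len(unique))
print("Repeated:     ", repeated)
```

Keeping the first occurrence and the original order makes it easy to line the deduplicated emails up with the source rows later.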



#### Removing Salutations
* The following method treats every email the same way, regardless of the language it is written in. Since the salutation, address and other email metadata are unlikely to help with summarization, lines are kept or omitted based purely on their length.
* Short lines (50 characters or fewer) correspond to the salutation at the beginning and to the sign-off and contact details at the end. Longer lines are assumed to belong to the email body or payload.

In [15]:
def remove_salutation():
    '''
       Removes salutation and other information not required for summarization by simply
       considering the length of each line.

       Returns two lists aligned with the rows of `data`: the body text of each email and
       the short lines that were left out, joined with '---'.
    '''

    emails = list(data['Information'])
    other_info = []
    email_text = []

    for sample in emails:
        lines = sample.split('\n')
        text = []
        info = []
        # The first and the last line of every email are skipped entirely.
        for line in lines[1:-1]:
            if len(line) > 50:
                text.append(line)   # long line: part of the email body
            elif line != '':
                info.append(line)   # short, non-empty line: sign-off, name, address

        # Body lines are concatenated without a separator; short lines are joined with '---'.
        other_info.append('---'.join(info))
        email_text.append(''.join(text))

    return email_text, other_info

In [16]:
emails,other_info = remove_salutation()
email_set = list(set(emails))
info_set = list(set(other_info))
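
To see what the length heuristic does to a single email, here is a small self-contained sketch. It is not the notebook's function: the name `split_email`, the `min_len` parameter and the shortened sample email are assumptions, and body lines are joined with a space (the function above joins them with no separator, which fuses neighbouring sentences).

```python
def split_email(email, min_len=50):
    """Split one email into body text and short 'metadata' lines by line length."""
    lines = email.split("\n")
    body, meta = [], []
    # Like the notebook function, ignore the first and the last line.
    for line in lines[1:-1]:
        if len(line) > min_len:
            body.append(line.strip())
        elif line != "":
            meta.append(line)
    return " ".join(body), meta

# Shortened version of the sample email shown earlier.
email = "\n".join([
    "Hi Stephane,",
    "",
    "Thank you for keeping me updated on this issue. I'm happy to hear that it got resolved.",
    "We hope to improve this feature for all cards in the future. ",
    "",
    "Best regards,",
    "Solveig Miriam Brandt",
    "Customer Support",
    "68159 Mannheim",
])

body, meta = split_email(email)
print(body)
print(meta)
```

The greeting on the first line and the last address line never reach either list, which mirrors the `range(1, n-1)` loop in the notebook.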

#### Detecting Languages 

* The following cells use two libraries: langdetect to detect the language of each text, and iso639 to map the ISO language codes to language names.
* langdetect's algorithm is non-deterministic, so results can differ between runs, especially for short or mixed-language text. Fixing `DetectorFactory.seed` makes them reproducible.
* 9 languages are detected in the whole dataset, with English by far the most common. For the remaining data cleaning tasks, only the top 6 languages are considered for convenience.

In [17]:

from langdetect import detect,detect_langs
from iso639 import languages
from langdetect import DetectorFactory

DetectorFactory.seed = 0 # to deal with non-determinism

def get_all_languages():
    '''Gets all the languages used in the dataset.'''
    
    langs = []
    lang_names = []
    for email in email_set:
        lang = detect(email)
        if lang not in langs:
            langs.append(lang)
            name = languages.get(alpha2=lang).name
            lang_names.append(name)

  
    return lang_names,langs

In [18]:
lang_names,langs = get_all_languages()
print(langs)
print(lang_names)

['en', 'de', 'it', 'nl', 'fr', 'es', 'ja', 'ru', 'pl']
['English', 'German', 'Italian', 'Dutch', 'French', 'Spanish', 'Japanese', 'Russian', 'Polish']


In [19]:
email_df = pd.DataFrame({'Cleaned Emails':emails,'Other Info':other_info})
email_df.head()

Unnamed: 0,Cleaned Emails,Other Info
0,Grazie mille per aver trovato il tempo per met...,"Tanti saluti,---Isabelle van Capelleveen---Cus..."
1,Grazie mille per aver trovato il tempo per met...,"Tanti saluti,---Isabelle van Capelleveen---Cus..."
2,この問題についてご連絡いただきありがとうございます。 申し訳ありませんが私は日本語語が話せま...,その他にもご質問や改善の提案、一般的なご意見などございましたら、お気軽にお問い合わせください...
3,この問題についてご連絡いただきありがとうございます。 申し訳ありませんが私は日本語語が話せま...,その他にもご質問や改善の提案、一般的なご意見などございましたら、お気軽にお問い合わせください...
4,Thank you for keeping me updated on this issue...,"Best regards,---Solveig Miriam Brandt---Custom..."


In [20]:
def get_lang(row):
    text = row['Cleaned Emails']
    lang = detect(text)
    return lang    

In [21]:
# Most of the emails are in English

email_df['Language'] = email_df.apply(get_lang,axis=1)
email_df['Language'].value_counts()

en    474
de     33
it     31
nl     22
fr     20
es     11
ru      3
pl      1
ja      1
Name: Language, dtype: int64
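
The choice of the six most frequent languages can be reproduced from the counts above. This is a minimal sketch using only the standard library; the variable names are illustrative, and the counts are copied from the `value_counts()` output.

```python
from collections import Counter

# Language counts copied from the value_counts() output above.
language_counts = Counter({
    "en": 474, "de": 33, "it": 31, "nl": 22, "fr": 20,
    "es": 11, "ru": 3, "pl": 1, "ja": 1,
})

# most_common(6) returns the six (code, count) pairs with the highest counts.
top_codes = [code for code, _ in language_counts.most_common(6)]
dropped_codes = sorted(set(language_counts) - set(top_codes))

print("Kept:   ", top_codes)
print("Dropped:", dropped_codes)
print("Emails in dropped languages:", sum(language_counts[c] for c in dropped_codes))
```

Only 5 of the 596 emails fall outside the top six languages, so dropping them loses very little data.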

In [100]:
eng_df = email_df[email_df['Language'] == 'en']
#email_df.to_pickle('./new_df')
#eng_df.to_pickle('./eng_df')

### Stemming
* Only the top 6 languages are considered, so the Russian, Japanese and Polish rows are dropped from the dataframe.
* Stemming is a crude, rule-based method of reducing the different forms of a word to a common root or base form. The Snowball stemmer from NLTK supports many languages.
* The input text is first converted to lower case and tokenized, and each token is then stemmed.
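
To make the idea of rule-based stemming concrete, the sketch below runs the same three steps (lower-case, tokenize, stem) with a deliberately tiny suffix-stripping stemmer. This is not the Snowball algorithm: `toy_stem`, `stem_sentence` and the suffix list are assumptions made up for illustration, and the real stemmers apply far more rules.

```python
import re

# A toy rule set; Snowball uses many more, language-specific rules.
SUFFIXES = ("ing", "ed", "es", "s")

def toy_stem(token):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def stem_sentence(text):
    """Lower-case the text, split it into word and punctuation tokens, stem each token."""
    tokens = re.findall(r"\w+|[^\w\s]", text.lower())
    return " ".join(toy_stem(token) for token in tokens)

stemmed = stem_sentence("Scanning cards helped the developers")
print(stemmed)
```

As with the real stemmer, the result ("scann", "card", "help") is not always a dictionary word; it only needs to be the same for related word forms.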

In [22]:
# Drop the rows whose detected language is Russian, Japanese or Polish.
indices = email_df[email_df['Language'].isin(['ru', 'ja', 'pl'])].index
email_df.drop(indices, inplace=True)

In [23]:
import nltk
from nltk.stem.snowball import SnowballStemmer

en_stemmer = SnowballStemmer('english')
fr_stemmer = SnowballStemmer('french')
de_stemmer = SnowballStemmer('german')
it_stemmer = SnowballStemmer('italian')
es_stemmer = SnowballStemmer('spanish')
nl_stemmer = SnowballStemmer('dutch')
stemmers = {'en':en_stemmer,'fr':fr_stemmer,'de':de_stemmer,'it':it_stemmer,'es':es_stemmer,'nl':nl_stemmer}

def stem_text(row):
    '''Stems the text with the Snowball stemmer that matches the row's language.'''

    lang = row['Language']
    text = row['Cleaned Emails'].lower()
    # Note: word_tokenize uses NLTK's default (English) tokenizer for every language.
    tokens = nltk.word_tokenize(text)
    stemmer = stemmers[lang]
    stemmed_text = ' '.join([stemmer.stem(token) for token in tokens])

    return stemmed_text
    
    

In [24]:
email_df['Cleaned Emails'] = email_df.apply(stem_text,axis=1)

In [25]:
email_df['Cleaned Emails'][0]

"grazi mill per aver trovato il tempo per mettervi in contatto con noi per questo problema ! purtroppo non parlo l'italiano , quindi spero vada bene lo stesso se rispondo in ingles : thank you so much for reach out and take the time to contact us about this issu ! usual , the notif alert on the app icon indic that there are new flyer or catalogu avail in your `` offer '' section in and should clear as soon as all notif that are tag as `` new '' have been open onc . unfortun , the notif alert on the app icon not disappear even though all new offer in have alreadi been open seem to be a bug on veri few devic at the moment . i can assur you that our develop are alreadi awar of the problem and tri to solv it as soon as possibl . altern , you can also disabl the notif badg complet in your general system set under `` notif '' - > `` `` - > `` badg app icon '' . in the meantim , we sincer apolog for the inconveni this caus and hope that you can use in it full function again soon . se dovest a

#### Text Normalisation
* Cucco is used for normalising the text since it supports multiple languages.
* Text is normalised by applying a list of rules that we provide, such as removing stop words, replacing punctuation and removing extra whitespace.
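
The three kinds of rule mentioned above can be sketched with the standard library alone. This is not Cucco's implementation: the function `normalise_text` and the very short stop-word set are assumptions for illustration, and a real stop-word list is much longer and language-specific.

```python
import re
import string

# Tiny illustrative stop-word list (an assumption, not Cucco's list).
STOP_WORDS = {"the", "a", "to", "for", "and", "you"}

def normalise_text(text, stop_words=STOP_WORDS):
    """Remove stop words, replace punctuation with spaces, collapse whitespace."""
    # 1. Remove stop words (whole tokens, case-insensitive).
    kept = [word for word in text.split() if word.lower() not in stop_words]
    text = " ".join(kept)
    # 2. Replace every ASCII punctuation character with a space.
    text = text.translate(str.maketrans({char: " " for char in string.punctuation}))
    # 3. Collapse runs of whitespace and trim the ends.
    return re.sub(r"\s+", " ", text).strip()

cleaned = normalise_text("thank you for the feedback , and sorry !")
print(cleaned)
```

Applying the rules in this order means the punctuation step can leave gaps behind, which is why whitespace is collapsed last.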

In [34]:
import cucco
from cucco import Cucco

norm_en = Cucco(language='en')
norm_es = Cucco(language='es')
norm_fr = Cucco(language='fr')
norm_it = Cucco(language='it')
norm_de = Cucco(language='de')
norm_nl = Cucco(language='nl')

normalisers = {'en':norm_en,'es':norm_es,'fr':norm_fr,'it':norm_it,'nl':norm_nl,'de':norm_de}

def normalise(row):
    ''' Performs text normalisation for multiple languages. Removes stopwords,punctuation etc.'''
    
    lang = row['Language']
    text = row['Cleaned Emails']
    sents = nltk.sent_tokenize(text)
    normaliser = normalisers[lang]
    rules = ['remove_stop_words', 'replace_punctuation', 'remove_extra_whitespaces']
    norm_text = ' '.join([normaliser.normalize(sent,rules) for sent in sents])
    
    return norm_text
    

In [35]:
email_df['Cleaned Emails'] = email_df.apply(normalise,axis=1)

In [42]:
email_df[email_df.Language=='fr']['Cleaned Emails'][10]

'merc votr messag feedback somm tres heureux ’ entendr plaît travaillon jour jour amélior développ notr appliqu utilis afin ’ ultim rendr portefeuill physiqu obsolèten ’ hésit partag ’ appliqu amis'

In [52]:
email_df[email_df.Language=='de']['Cleaned Emails'][17]

'dank fur nachricht dafur dass zeit genomm ruckmeld geb durft frag zugang alt handy kart geratewechsel neu handy ubertrag konn zuvor backup kart erstell welch neu handy wiederherstell konn dafur gibt backup funktion innerhalb app erlaubt kart cloud weit mobil gerat ubertrag funktioniert folgendermassen1 bitt geh einstell wahl backup android backup ios 2 konn entwed facebook googl privat email adress einlogg 3 kart gespeichert sobald erfolgreich angemeldet ios konn danach backup erstell geh android 1 bitt geh wied uber einstell backup neu handy2 meld denselb dat alt gerat beispielsweis googl benutzt bitt wied uber googl anmeld uber email adress 3 sobald angemeldet werd kart automat wiederhergestellt offnet fenst option entwed neu backup anzuleg alt backup wiederherzustell bitt wahl zweit option backup wiederherstell danach sollt all kart automat neu handy verfug stehenbitt beacht dass nutz problem hatt kart uber facebook wiederherzustell empfehl dah entwed googl email adress verwend pro

### LDA
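* Latent Dirichlet Allocation (LDA) is a topic model that describes each document as a mixture of topics and each topic as a distribution over words.
* Implemented using scikit-learn and performed only on the English data.
* 5 topics have been considered.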

In [1]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

In [3]:
import pandas as pd
eng_df = pd.read_pickle('./eng_df')

In [4]:
cv = CountVectorizer(max_df=0.95,min_df=2,stop_words='english')
term_matrix = cv.fit_transform(eng_df['Cleaned Emails'])
term_matrix

<342x973 sparse matrix of type '<class 'numpy.int64'>'
	with 17793 stored elements in Compressed Sparse Row format>
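
The `min_df=2` and `max_df=0.95` arguments prune the vocabulary by document frequency: a term is dropped if it occurs in fewer than 2 documents or in more than 95% of them. The sketch below imitates that filter in plain Python on made-up documents; `filter_vocabulary` is a hypothetical helper, and it ignores the tokenization details and stop-word removal that `CountVectorizer` also performs.

```python
from collections import Counter

def filter_vocabulary(docs, min_df=2, max_df=0.95):
    """Keep terms whose document frequency lies within [min_df, max_df * n_docs]."""
    n_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        # Count each term once per document, however often it occurs in it.
        doc_freq.update(set(doc.lower().split()))
    return sorted(
        term for term, df in doc_freq.items()
        if df >= min_df and df <= max_df * n_docs
    )

# Toy documents, not taken from the dataset.
docs = [
    "card backup google",
    "card backup restore",
    "card scanning issue",
    "card scanning feedback",
]
vocabulary = filter_vocabulary(docs)
print(vocabulary)
```

Here "card" is removed because it appears in every document, and the words that appear only once are removed as too rare.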

In [5]:
lda = LatentDirichletAllocation(n_components=5)
lda.fit(term_matrix)



LatentDirichletAllocation(batch_size=128, doc_topic_prior=None,
             evaluate_every=-1, learning_decay=0.7, learning_method=None,
             learning_offset=10.0, max_doc_update_iter=100, max_iter=10,
             mean_change_tol=0.001, n_components=5, n_jobs=1,
             n_topics=None, perp_tol=0.1, random_state=None,
             topic_word_prior=None, total_samples=1000000.0, verbose=0)

In [6]:
len(lda.components_)
lda.components_.shape

(5, 973)

In [58]:
lda.components_

array([[0.22163034, 1.55830537, 0.22256588, ..., 0.22248275, 0.21900539,
        0.22310291],
       [0.22022823, 0.22307812, 2.99329219, ..., 0.22211049, 0.21790258,
        0.21554685],
       [0.22203806, 0.2209804 , 0.22090522, ..., 0.22775048, 0.22432041,
        0.21888648],
       [0.27438229, 0.27474994, 0.27122582, ..., 1.89427068, 1.88765363,
        1.89585393],
       [3.56592575, 1.08621551, 1.37921427, ..., 0.2245659 , 0.2186602 ,
        0.22401345]])

In [59]:
len(lda.components_[0])

973

In [62]:
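# Print the 10 highest-weighted words of topic 0, in ascending order of weight.
# On scikit-learn >= 1.0 use cv.get_feature_names_out(); get_feature_names() was removed in 1.2.
feature_names = cv.get_feature_names()
topic = lda.components_[0]
top_words_indices = topic.argsort()[-10:]
for index in top_words_indices:
    print(feature_names[index])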

feedback
issue
don
time
taking
scanners
thank
card
scanning
contact
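
Because `argsort()` sorts in ascending order, the list above ends with the highest-weighted word. The following sketch shows the same "top words per topic" lookup in plain Python on a made-up topic-word matrix; `top_words`, the vocabulary and the weights are illustrative assumptions, and here the words are returned with the strongest first.

```python
def top_words(weights, vocabulary, n=3):
    """Return the n words with the largest weights, strongest first."""
    ranked = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
    return [vocabulary[i] for i in ranked[:n]]

# Toy vocabulary and topic-word weights (one row per topic), invented for the example.
vocabulary = ["backup", "card", "google", "scanning"]
topic_word = [
    [0.2, 3.1, 0.3, 2.4],
    [2.9, 0.4, 1.8, 0.2],
]

topic_top_words = {k: top_words(row, vocabulary, n=2) for k, row in enumerate(topic_word)}
for k, words in topic_top_words.items():
    print(f"Top words for topic {k}: {words}")
```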


In [79]:
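# Map each topic index to its 10 highest-weighted words (ascending order of weight).
feature_names = cv.get_feature_names()  # cv.get_feature_names_out() on scikit-learn >= 1.0
topic_word_dict = {}
for index, topic in enumerate(lda.components_):
    words = [feature_names[i] for i in topic.argsort()[-10:]]
    topic_word_dict[index] = words
    print('Top words for topic {}'.format(index))
    print(words)
    print('-'*120)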

Top words for topic 0
['feedback', 'issue', 'don', 'time', 'taking', 'scanners', 'thank', 'card', 'scanning', 'contact']
------------------------------------------------------------------------------------------------------------------------
Top words for topic 1
['restore', 'contact', 'facebook', 'device', 'mail', 'address', 'cards', 'google', 'account', 'backup']
------------------------------------------------------------------------------------------------------------------------
Top words for topic 2
['attention', 'caused', 'stores', 'tesco', 'loyalty', 'thank', 'cards', 'digital', 'information', 'acceptance']
------------------------------------------------------------------------------------------------------------------------
Top words for topic 3
['suggestions', 'don', 'questions', 'time', 'thank', 'contact', 'app', 'feedback', 'cards', 'card']
------------------------------------------------------------------------------------------------------------------------
Top words for

In [83]:
topics = lda.transform(term_matrix)
eng_df['topic'] = topics.argmax(axis=1)

def assign_topics(row):
    topic = row['topic']
    words = topic_word_dict[topic]
    
    return words

In [84]:
eng_df['topic words'] = eng_df.apply(assign_topics,axis=1)
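
The assignment step picks, for every email, the topic with the highest probability and then looks up that topic's words. A minimal plain-Python sketch of this step follows; the probabilities are invented for illustration, the word lists are shortened to the last three words printed for topics 0 to 2 above, and `dominant_topic` is a hypothetical helper.

```python
def dominant_topic(topic_probs):
    """Index of the largest probability; the first index wins on ties."""
    return max(range(len(topic_probs)), key=lambda k: topic_probs[k])

# One row per document, one column per topic (made-up numbers).
doc_topic = [
    [0.05, 0.80, 0.15],
    [0.60, 0.30, 0.10],
    [0.20, 0.20, 0.60],
]

# Shortened word lists for the first three topics.
topic_words = {
    0: ["card", "scanning", "contact"],
    1: ["google", "account", "backup"],
    2: ["digital", "information", "acceptance"],
}

assigned = [dominant_topic(row) for row in doc_topic]
assigned_words = [topic_words[k] for k in assigned]
print(assigned)
print(assigned_words[0])
```

Taking only the arg-max discards how confident the model is, so emails that sit between two topics are still forced into one of them.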

In [85]:
eng_df.head()

Unnamed: 0,Cleaned Emails,lang,topic,topic words
0,"Merci pour votre message! Malheureusement, mon...",en,1,"[restore, contact, facebook, device, mail, add..."
2,この問題についてご連絡いただきありがとうございます。 申し訳ありませんが私は日本語語が話せま...,en,3,"[suggestions, don, questions, time, thank, con..."
3,Grazie mille per aver trovato il tempo per met...,en,4,"[code, pin, notifications, access, app, lock, ..."
4,Thank you so much for reaching out and taking ...,en,3,"[suggestions, don, questions, time, thank, con..."
5,Thank you so much for reaching out and taking ...,en,4,"[code, pin, notifications, access, app, lock, ..."


In [86]:
print(eng_df['Cleaned Emails'][4])
print('-'*120)
print(eng_df['topic'][4])
print('-'*120)
print(topic_word_dict[eng_df['topic'][4]])
print('-'*120)

Thank you so much for reaching out and taking the time to contact us about this issue! Please excuse the delayed response. I'm happy to inform you that you can already enlarge the front and back pictures of your cards simply by tapping on it once. Your card pictures will then get enlarged as well as rotated. However, I will also suggest to our developers to make zooming already in the "Notes" tab possible for future versions of  . I hope I was able to help you. If you have any further questions, suggestions for improvements or general feedback, please don't hesitate to contact me again.
------------------------------------------------------------------------------------------------------------------------
3
------------------------------------------------------------------------------------------------------------------------
['suggestions', 'don', 'questions', 'time', 'thank', 'contact', 'app', 'feedback', 'cards', 'card']
--------------------------------------------------------------

In [87]:
print(eng_df['Cleaned Emails'][0])
print('-'*120)
print(eng_df['topic'][0])
print('-'*120)
print(topic_word_dict[eng_df['topic'][0]])
print('-'*120)

Merci pour votre message! Malheureusement, mon français n'est pas si bon. J'espère que ça ne vous dérange pas, mais je vais devoir poursuivre en anglais :Thank you so much for reaching out and taking the time to contact us about this issue - I'm happy to help you with this!There is a feature in   that allows you to save your cards and transfer them to a second mobile device - the   Backup. You can create or restore a backup of your cards the following way:1. Go to the "Settings" tab in   and choose "  Backup" (Android) or "Backup" (iOS).2. Sign in via Facebook, Google or sign up using a private mail address.3. As soon as you are logged in you have successfully created a backup (iOS) or you can click on "Backup Now" to create your backup (Android).1. Go to the "Settings" tab in   and choose "Backup" again on your new device.2. Sign in with the same details you used when you created the account (e.g. when you used Google, you have to log in via Google again and not via mail address).3. A

#### Other information
* The lines omitted while removing salutations can still yield some information, such as the people involved in the email. This is only a crude attempt to extract information from minimal data.
* Named-Entity Recognition can be performed on this text to find the proper nouns used in the salutations and signatures.

In [92]:
info = list(email_df['Other Info'])

In [94]:
info[0].split('---')

['Tanti saluti,',
 'Isabelle van Capelleveen',
 'Customer Support',
 '  GmbH',
 'C-HUB / Hafenstraße 25-27']

In [96]:
import spacy
nlp = spacy.load('en_core_web_sm')

In [133]:
text = info[10].split('---')
text = ' ,'.join([x for x in text])
text

'Merci encore et une bonne fin de journée. ,Isabelle van Capelleveen ,Customer Support ,  GmbH ,C-HUB / Hafenstraße 25-27'

In [134]:
doc = nlp(text)

# Collect the entity texts and print each entity with its label and the label's description.
ents = []
if doc.ents:
    for ent in doc.ents:
        ents.append(ent.text)
        print(f'{ent.text:{30}}{ent.label_:{30}}{spacy.explain(ent.label_):{60}}')
else:
    print('No entities')

Merci                         ORG                           Companies, agencies, institutions, etc.                     
Isabelle van Capelleveen      PERSON                        People, including fictional                                 
Customer Support              PERSON                        People, including fictional                                 
C-HUB / Hafenstraße           ORG                           Companies, agencies, institutions, etc.                     
25-27                         CARDINAL                      Numerals that do not fall under another type                
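
One possible next step is to group the recognised entities by label, for example to pull out candidate person names. The sketch below works on the (text, label) pairs copied from the output above rather than on a spaCy `Doc`; `group_by_label` is a hypothetical helper, not part of spaCy.

```python
from collections import defaultdict

# (text, label) pairs copied from the NER output above.
entities = [
    ("Merci", "ORG"),
    ("Isabelle van Capelleveen", "PERSON"),
    ("Customer Support", "PERSON"),
    ("C-HUB / Hafenstraße", "ORG"),
    ("25-27", "CARDINAL"),
]

def group_by_label(pairs):
    """Group entity texts by their label, keeping the original order."""
    grouped = defaultdict(list)
    for text, label in pairs:
        grouped[label].append(text)
    return dict(grouped)

by_label = group_by_label(entities)
print(by_label.get("PERSON", []))
```

The PERSON group still contains "Customer Support", so any list built this way needs a manual or rule-based check before it is used.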


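* The results are only partly accurate: for example, "Merci" is labelled ORG and "Customer Support" is labelled PERSON, possibly because an English model is applied to text that is partly French. They still provide some basic information, such as the sender's name.
* Summarization is performed in a separate notebook.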