<div style="float: left;">![lefebvre-147x30.png](attachment:lefebvre-147x30.png)</div>



#  <font color=#001978>Used machine learning to classify emails and turn them into insights</font>

### <font color=#001978>COMPREHEND - Ability to read and understand incoming emails</font>

In [1]:
# Alberto Valverde Escribano.
# Software engineer and R&D Robotics.
# Code used machine learning to classify emails and turn them into insights.

from sklearn.feature_extraction.text import TfidfVectorizer, ENGLISH_STOP_WORDS
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans, MiniBatchKMeans
from helpers import *

emails = pd.read_csv('alberto-valverde-gmail.csv')
emails = emails.head(10)

# Lets create a new frame with the data we need.
email_df = pd.DataFrame(parse_into_emails(emails.message))

# Drop emails with empty body, to or from_ columns.
email_df.drop(email_df.query("body == '' | to == '' | from_ == ''").index, inplace=True)
stopwords = ENGLISH_STOP_WORDS.union(['ect', 'hou', 'com', 'recipient'])
vect = TfidfVectorizer(analyzer='word', stop_words=stopwords, max_df=0.3, min_df=2)

X = vect.fit_transform(email_df.body)
features = vect.get_feature_names()

# Now we print the top terms across all documents.
print (top_mean_feats(X, features, None, 0.1, 10))

# As clustering algorithm KMeams is a perfect fit.
n_clusters = 3
clf = KMeans(n_clusters=n_clusters,
            max_iter=100,
            init='k-means++',
            n_init=1)
labels = clf.fit_predict(X)

# For larger datasets use mini-batch KMeans, so we dont have to read all data into memory.
# batch_size = 500
# clf = MiniBatchKMeans(n_clusters=n_clusters, init_size=1000, batch_size=batch_size, max_iter=100)
# clf.fit(X)

# Let's plot this with matplotlib to visualize it.
# First we need to make 2D coordinates from the sparse matrix.
X_dense = X.todense()
pca = PCA(n_components=2).fit(X_dense)
coords = pca.transform(X_dense)

# Lets plot it again, but this time we add some color to it.
# This array needs to be at least the length of the n_clusters.
label_colors = ["#2AB0E9", "#2BAF74", "#D7665E", "#CCCCCC",
                "#D2CA0D", "#522A64", "#A3DB05", "#FC6514"]
colors = [label_colors[i] for i in labels]

plt.scatter(coords[:, 0], coords[:, 1], c=colors)
# Plot the cluster centers
centroids = clf.cluster_centers_
centroid_coords = pca.transform(centroids)
plt.scatter(centroid_coords[:, 0], centroid_coords[:, 1], marker='X', s=200, linewidths=2, c='#444d60')
plt.show()

#Use this to print the top terms per cluster with matplotlib.
plot_tfidf_classfeats_h(top_feats_per_cluster(X, labels, features, 0.1, 25))



          features     score
0               di  0.122152
1            robot  0.094398
2               19  0.079114
3  oregonstateuniv  0.076505
4               22  0.074687
5    3dgnbpdpelg4w  0.066675
6           forbes  0.066675
7               06  0.066483
8        raspberry  0.064786
9               pi  0.064786


<Figure size 640x480 with 1 Axes>

<Figure size 1200x900 with 3 Axes>

### <font color=#001978>Comprehing single messages from Inbox</font>
Returning the top terms out of a specific email.

In [2]:
# Alberto Valverde Escribano.
# Software engineer and R&D Robotics.
# Code used machine learning to classify specific email and turn them into insights.

import numpy as np
import pandas as pd
#import matplotlib.pyplot as plt

from sklearn.feature_extraction.text import TfidfVectorizer, ENGLISH_STOP_WORDS
from sklearn.pipeline import make_pipeline
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans, MiniBatchKMeans
from sklearn.manifold import TSNE
from sklearn.decomposition import TruncatedSVD 
from sklearn.preprocessing import normalize 

from helpers import * 

emails = pd.read_csv('split_emails.csv')
emails = emails.head(4000)
# Lets create a new frame with the data we need.
email_df = pd.DataFrame(parse_into_emails(emails.message))

# Drop emails with empty body, to or from_ columns. 
email_df.drop(email_df.query("body == '' | to == '' | from_ == ''").index, inplace=True)
print (email_df.head(9))

# After running this function, I created a new dataframe that looks like this:


                                                 body  \
0   Buenos días, llamé el 24 de abril a Intrum por...   
1   Hola buenas resulta que el día 27/07/2016 Iba ...   
2   Hola, me llamo Ramón. Estoy divorciado desde f...   
3   Randy,Can you send me a schedule of the salary...   
5   Greg,How about either next Tuesday or Thursday...   
6   Phillip Allen (pallen@enron.com)Mike Grigsby (...   
8   I don't think these are required by the ISP2. ...   
9   ---------------------- Forwarded by Phillip K ...   
10  Mr. Buckner,For delivered gas behind San Diego...   

                                                  to                    from_  
0                               tim.belden@enron.com  phillip.allen@enron.com  
1                            john.lavorato@enron.com  phillip.allen@enron.com  
2                             leah.arsdall@enron.com  phillip.allen@enron.com  
3                              randall.gay@enron.com  phillip.allen@enron.com  
5                            

### <font color=#001978>MACHINE LEARNING</font>
Transform de dataframes (from emails raw message) into workable text and extract the features

In [3]:
# At this point we are going to tokenize the bodies and convert them
# into a document-term matrix.

# Some note on min_df and max_df
# max_df=0.5 means "ignore all terms that appear in more then 50% of the documents"
# min_df=2 means "ignore all terms that appear in less then 2 documents"
stopwords = ENGLISH_STOP_WORDS.union(['que', 'el', 'una', 'ha' , "la", 'por', "lo" , "es", "los" , "al", "en", "mi", "es", "del",'para', 'hacer', 'tengo' , "pero"])
vect = TfidfVectorizer(analyzer='word', stop_words=stopwords, max_df=0.3, min_df=1)

X = vect.fit_transform(email_df.body)
features = vect.get_feature_names()

# The procesing message
msg_num = 0

# Saving Feature date for future Natural Language Processing
FeatureDataFramesForAI = top_feats_in_doc(X, features, msg_num, 10)

# Let's print the top 10 terms in document 1
print (FeatureDataFramesForAI)

# After running this function on a document, it came up with the following result.

        features     score
0          dicho  0.286853
1          deuda  0.286853
2           pago  0.286853
3           años  0.191235
4  documentación  0.191235
5          ahora  0.191235
6           días  0.191235
7          debía  0.095618
8            dni  0.095618
9          estos  0.095618


### <font color=#001978> All making sense if you look into the corresponding email:</font>

In [4]:
print (emails.message[msg_num])

Message-ID: <18782981.1075855378110.JavaMail.evans@thyme>
Date: Mon, 14 May 2001 16:39:00 -0700 (PDT)
From: phillip.allen@enron.com
To: tim.belden@enron.com
Subject: 
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
X-From: Phillip K Allen
X-To: Tim Belden <Tim Belden/Enron@EnronXGate>
X-cc: 
X-bcc: 
X-Folder: \Phillip_Allen_Jan2002_1\Allen, Phillip K.\'Sent Mail
X-Origin: Allen-P
X-FileName: pallen (Non-Privileged).pst

Buenos días, llamé el 24 de abril a Intrum por una deuda que ya cancelé por un asunto con la Caixa 
y al darles mi DNI resulta que me dice que tengo una deuda de 180 euros con Orange desde el 2007 
de la que yo no tenía constancia, no tengo ni idea si debía o no porque han pasado muchos años y 
no recuerdo pero no he recibido en todos estos años documentación de dicha deuda de ningún tipo ya 
que de ser así la hubiera pagado, me ha dicho que mandaron documentación a una dirección donde 
vivía en 2007, hoy me ha vuelto a llama

In [5]:
!pip install --upgrade azure-cognitiveservices-language-textanalytics

Requirement already up-to-date: azure-cognitiveservices-language-textanalytics in /home/nbuser/anaconda3_501/lib/python3.6/site-packages (0.2.0)
You should consider upgrading via the 'pip install --upgrade pip' command.[0m


In [6]:
from azure.cognitiveservices.language.textanalytics import TextAnalyticsClient
from msrest.authentication import CognitiveServicesCredentials

import os

key_var_name = '855b09a467dd4d328e0fb84aaa15c8f3'

subscription_key = key_var_name

endpoint_var_name = 'https://analisislefebvre.cognitiveservices.azure.com/'

endpoint = endpoint_var_name

credentials = CognitiveServicesCredentials(subscription_key)

text_analytics = TextAnalyticsClient(endpoint_var_name, credentials=credentials)

documents = [
    {
        "id": "1",
        "language": "es",
        "text": emails.message[msg_num]
    }
]
response = text_analytics.sentiment(documents=documents)
for document in response.documents:
    print("Document Id: ", document.id, ", Sentiment Score: ",
          "{:.2f}".format(document.score))
    empaticValue = "{:.2f}".format(document.score)
    if float(empaticValue) > 0.5:
       empaticValueStr= "Satisfied"
    if float(empaticValue) < 0.5:
       empaticValueStr= "unsatisfied"
    if float(empaticValue) >= 1 :
       empaticValueStr= "happy"
print  (empaticValueStr)     
    

Document Id:  1 , Sentiment Score:  0.45
unsatisfied



### <font color=#001978>(NPL) NATURAL LANGUAGE PROCESSING</font>
####  Required follow-up actions based on its understanding of the email/s

In [7]:
!pip install pyspellchecker

You should consider upgrading via the 'pip install --upgrade pip' command.[0m


In [8]:
!pip install dialogflow

Collecting dialogflow
[?25l  Downloading https://files.pythonhosted.org/packages/cc/db/1e71dd7c5c748b1a03e679b8890550ddeff8b21b58a76b92eb5a7beff02a/dialogflow-0.7.2-py2.py3-none-any.whl (310kB)
[K     |████████████████████████████████| 317kB 299kB/s eta 0:00:01     |████████████▋                   | 122kB 299kB/s eta 0:00:01
[?25hCollecting google-api-core[grpc]<2.0.0dev,>=1.14.0 (from dialogflow)
[?25l  Downloading https://files.pythonhosted.org/packages/29/3a/c528ef37f48d6ffba16f0f3c0426456ba21e0dd32be9c61a2ade93e07faa/google_api_core-1.14.3-py2.py3-none-any.whl (68kB)
[K     |████████████████████████████████| 71kB 2.5MB/s eta 0:00:011
[?25hCollecting google-auth<2.0dev,>=0.4.0 (from google-api-core[grpc]<2.0.0dev,>=1.14.0->dialogflow)
[?25l  Downloading https://files.pythonhosted.org/packages/c5/9b/ed0516cc1f7609fb0217e3057ff4f0f9f3e3ce79a369c6af4a6c5ca25664/google_auth-1.6.3-py2.py3-none-any.whl (73kB)
[K     |████████████████████████████████| 81kB 3.4MB/s eta 0:00:011
Coll

In [9]:
# Alberto Valverde Escribano.
# Software engineer and R&D Robotics.
# Code used Natural Langueage Processing to gives the ability to read, understand and derive meaning from email´s features result.

import dialogflow
from google.api_core.exceptions import InvalidArgument
from SpellCorrector import *

import os
from google.oauth2 import service_account

# Attributes
DIALOGFLOW_PROJECT_ID = 'lefebvre-ivivrs'
DIALOGFLOW_LANGUAGE_CODE = 'en-ES'
GOOGLE_APPLICATION_CREDENTIALS = 'lefebvre-ivivrs-c18aa8f99956.json'
SESSION_ID = '115760287604807540329'

credentials = service_account.Credentials.from_service_account_file("lefebvre-ivivrs-c18aa8f99956.json")
scoped_credentials = credentials.with_scopes(['https://www.googleapis.com/auth/cloud-platform'])

# Correcting the spelling
# correctedPhrase = getSpellCorrectedPhrase(phrase)

    
values= ""
for index, row in FeatureDataFramesForAI.iterrows():
    #print(row['features'])
    values= values + " " + row['features']   
    
correctedPhrase = values

# Initializing a client
session_client = dialogflow.SessionsClient(credentials=credentials)

session = session_client.session_path(DIALOGFLOW_PROJECT_ID, SESSION_ID)

text_input = dialogflow.types.TextInput(text=correctedPhrase, language_code=DIALOGFLOW_LANGUAGE_CODE)

query_input = dialogflow.types.QueryInput(text=text_input)

try:
    response = session_client.detect_intent(session=session, query_input=query_input)
except InvalidArgument:
    raise

print("Query text:", response.query_result.query_text)
print("Detected intent:", response.query_result.intent.display_name)
print("Detected intent confidence:", response.query_result.intent_detection_confidence)
print("Response:", response.query_result.fulfillment_text)
    
#saving value for next RabbitMQ event
MyRabbitmqMessage = "Se detecta un evento relacionado con " + response.query_result.fulfillment_text + "." + " Entidad " + response.query_result.intent.display_name + "el cliente parece sentirse: " + empaticValueStr 
print ("Empatic: " + empaticValueStr) 

Query text: dicho deuda pago años documentación ahora días debía dni estos
Detected intent: Abogados de derecho mercantil y derecho concursal
Detected intent confidence: 0.8566814661026001
Response: Derecho Concursal, Insolvencias y Concurso de Acreedores
Empatic: unsatisfied


In [10]:
!pip install googletrans

Collecting googletrans
  Downloading https://files.pythonhosted.org/packages/fd/f0/a22d41d3846d1f46a4f20086141e0428ccc9c6d644aacbfd30990cf46886/googletrans-2.4.0.tar.gz
Building wheels for collected packages: googletrans
  Building wheel for googletrans (setup.py) ... [?25ldone
[?25h  Created wheel for googletrans: filename=googletrans-2.4.0-cp36-none-any.whl size=20593 sha256=75d73a0b5ec4632d70995a32a5823844aff78c428593d350a137141093ed79a5
  Stored in directory: /home/nbuser/.cache/pip/wheels/50/d6/e7/a8efd5f2427d5eb258070048718fa56ee5ac57fd6f53505f95
Successfully built googletrans
Installing collected packages: googletrans
Successfully installed googletrans-2.4.0
You should consider upgrading via the 'pip install --upgrade pip' command.[0m


In [11]:
# Alberto Valverde Escribano.
# Software engineer and R&D Robotics.
# Code used googletrans to translate the result to english.

from googletrans import Translator

T = Translator()

try:
  MyRabbitmqMessage = T.translate(MyRabbitmqMessage,src='es',dest='en').text
  print(MyRabbitmqMessage)
except:
  print ('translation Error!!!')

an event related to Bankruptcy Law, Insolvency and Creditors Contest is detected. Lawyers entity client concursalel commercial law and law seems to feel: Unsatisfied


In [12]:
!pip install pika

Collecting pika
[?25l  Downloading https://files.pythonhosted.org/packages/a1/ae/8bedf0e9f1c0c5d046db3a7428a4227fe36ec1b8e25607f3c38ac9bf513c/pika-1.1.0-py2.py3-none-any.whl (148kB)
[K     |████████████████████████████████| 153kB 3.2MB/s eta 0:00:01
[?25hInstalling collected packages: pika
Successfully installed pika-1.1.0
You should consider upgrading via the 'pip install --upgrade pip' command.[0m


In [13]:
# Alberto Valverde Escribano.
# Software engineer and R&D Robotics.
# Code used pika to sends messages to the "email-clustering" queue.

import pika

credentials = pika.PlainCredentials('test', 'test')
connection = pika.BlockingConnection(
    pika.ConnectionParameters(host='lefebvre.westeurope.cloudapp.azure.com', credentials=credentials))
channel = connection.channel()

channel.queue_declare(queue='email-clustering')

channel.basic_publish(exchange='', routing_key='email-clustering', body=MyRabbitmqMessage)
print(" [x] Sent " + MyRabbitmqMessage)
print("Event was sended successful")
connection.close()

 [x] Sent an event related to Bankruptcy Law, Insolvency and Creditors Contest is detected. Lawyers entity client concursalel commercial law and law seems to feel: Unsatisfied
Event was sended successful
