# Text classification

Using the dataset `dataset_emails.csv` (or another dataset of your choice), create three text classifiers:
* Using a rule-based approach (regex)
* Using Naive Bayes
* Using spaCy 3

Finally, compare the results and explain which approach performs better and why.

In [13]:
import re
from collections import defaultdict
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report

df = pd.read_csv('dataset_emails.csv')

for index, row in df.iterrows():
    print(row['prompt'], row['label'])


Can I send an email, please? send
I'd like to compose an email. send
I need to send an email. send
Could you help me write an email? send
Is it possible to send an email with you? send
Let's write an email. send
Time to send an email. send
I want to email someone. send
Open email for writing. send
Compose a new message. send
I have an email to send. send
There's someone I need to email. send
I want to get in touch with [someone] through email. send
Could you draft an email for me? send
I need to send an email about [topic]. send
Time to send a quick email. send
Let's shoot someone an email. send
I have an important email to write. send
Is there a way to email [someone]? send
Can you help me send an email regarding [topic]? send
I need to drop someone a line. send
Let's ping someone with an email. send
Time to fire off an email. send
Is it cool if I send an email? send
Can you whip up an email for me? send
I could use an email assistant right now. send
Let's get in touch electronically.

In [14]:
import nltk
from nltk.tokenize import word_tokenize 
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

# Run these once if the NLTK resources have not been downloaded yet
# nltk.download('punkt')
# nltk.download('stopwords')
# nltk.download('wordnet')

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))

In [15]:
def preprocess_text(text):
    # Lowercase and tokenize the prompt
    tokens = word_tokenize(text.lower())

    # Drop English stop words (tokens are already lowercase)
    remove_stop = [word for word in tokens if word not in stop_words]

    # Lemmatize each remaining token (WordNet default: noun)
    clean_words = [lemmatizer.lemmatize(word) for word in remove_stop]

    return " ".join(clean_words)


df['clean_prompt'] = df['prompt'].apply(preprocess_text)

df[['prompt', 'clean_prompt']].head()

Unnamed: 0,prompt,clean_prompt
0,"Can I send an email, please?","send email , please ?"
1,I'd like to compose an email.,'d like compose email .
2,I need to send an email.,need send email .
3,Could you help me write an email?,could help write email ?
4,Is it possible to send an email with you?,possible send email ?
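
The output above shows that punctuation tokens (`,`, `?`, `.`) and the clitic `'d` survive preprocessing. A minimal sketch of one way to drop them follows. It assumes a regex tokenizer and a tiny illustrative stop-word set, `DEMO_STOP_WORDS`; both are hypothetical stand-ins for `word_tokenize` and NLTK's English stop-word list so that the block runs without any downloads.

```python
import re

# Hypothetical, deliberately tiny stop-word set for illustration only;
# the notebook itself uses NLTK's full English list.
DEMO_STOP_WORDS = {"can", "i", "an", "a", "to", "the", "me", "you", "is", "it", "with"}

def preprocess_text_alpha(text, stop_words=DEMO_STOP_WORDS):
    """Lowercase, keep purely alphabetic tokens, and drop stop words."""
    # Apostrophes stay inside tokens so that "i'd" is one token,
    # which the isalpha() filter below then discards as a whole.
    tokens = re.findall(r"[a-z']+", text.lower())
    kept = [tok for tok in tokens if tok.isalpha() and tok not in stop_words]
    return " ".join(kept)

print(preprocess_text_alpha("Can I send an email, please?"))  # send email please
```

In the notebook's own `preprocess_text`, a comparable change would be to keep a token only when `word.isalpha()` is true, in addition to the stop-word check.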


In [16]:
X_train, X_test, y_train, y_test = train_test_split(df['clean_prompt'], df['label'], test_size=0.2, random_state=42)

In [20]:
def classify_email_regex(text):
    # Keywords used to classify the emails
    send_keywords = ['send', 'compose', 'draft', 'write', 'respond', 'reply']
    list_keywords = ['list', 'show', 'display', 'view', 'inbox']
    trash_keywords = ['delete', 'trash', 'remove', 'erase']
    untrash_keywords = ['recover', 'restore', 'undelete', 'untrash']
    forward_keywords = ['forward', 'share', 'send on']
    star_keywords = ['star', 'important', 'highlight', 'prioritize']
    read_keywords = ['read', 'hear', 'narrate', 'tell']
    trash_list_keywords = ['trash list', 'deleted emails', 'trash folder']

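    # Return the first category with a keyword found anywhere in the text
    # (substring match), so the order of the branches matters
    if any(re.search(keyword, text) for keyword in send_keywords):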
        return 'send'
    elif any(re.search(keyword, text) for keyword in list_keywords):
        return 'list'
    elif any(re.search(keyword, text) for keyword in trash_keywords):
        return 'trash'
    elif any(re.search(keyword, text) for keyword in untrash_keywords):
        return 'untrash'
    elif any(re.search(keyword, text) for keyword in forward_keywords):
        return 'forward'
    elif any(re.search(keyword, text) for keyword in star_keywords):
        return 'star'
    elif any(re.search(keyword, text) for keyword in read_keywords):
        return 'read'
    elif any(re.search(keyword, text) for keyword in trash_list_keywords):
        return 'trash_list'
    else:
        return 'unknown'
    
df['predict'] = df['clean_prompt'].apply(classify_email_regex)

df[['prompt','label','predict']].head()


Unnamed: 0,prompt,label,predict
0,"Can I send an email, please?",send,send
1,I'd like to compose an email.,send,send
2,I need to send an email.,send,send
3,Could you help me write an email?,send,send
4,Is it possible to send an email with you?,send,send
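
Because the `if`/`elif` chain returns the first category with any keyword that occurs as a substring, the order of the branches decides overlapping cases: `send` is tested before `send on`, `trash` matches inside `untrash` and `trash list`, and `star` also matches `start`. A minimal sketch of an alternative is an ordered rule table with word boundaries. The rule order, the separate `reply` rule, and the names `RULES`, `COMPILED`, and `classify_email_rules` are assumptions made for this illustration; they are not part of the notebook and have not been evaluated on the dataset.

```python
import re

# Ordered from more specific to more general; the first matching rule wins.
# The ordering and the separate 'reply' rule are assumptions of this sketch.
RULES = [
    ("trash_list", ["trash list", "deleted emails", "trash folder"]),
    ("untrash", ["recover", "restore", "undelete", "untrash"]),
    ("trash", ["delete", "trash", "remove", "erase"]),
    ("forward", ["forward", "share", "send on"]),
    ("reply", ["respond", "reply"]),
    ("send", ["send", "compose", "draft", "write"]),
    ("list", ["list", "show", "display", "view", "inbox"]),
    ("star", ["star", "important", "highlight", "prioritize"]),
    ("read", ["read", "hear", "narrate", "tell"]),
]

# One compiled pattern per label; \b keeps 'star' from matching 'start'.
COMPILED = [
    (label, re.compile(r"\b(?:" + "|".join(re.escape(k) for k in keywords) + r")\b"))
    for label, keywords in RULES
]

def classify_email_rules(text):
    """Return the label of the first rule whose pattern matches, else 'unknown'."""
    text = text.lower()
    for label, pattern in COMPILED:
        if pattern.search(text):
            return label
    return "unknown"

print(classify_email_rules("show trash list"))  # trash_list
```

Word boundaries also stop inflected forms such as `deleted` from matching `delete`, so this sketch trades some recall for precision unless the keyword lists include those forms.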


In [21]:
accuracy = accuracy_score(df['label'], df['predict'])
print(f"Accuracy of the model:{accuracy:.2f}")

print("Classification Report:\n", classification_report(df['label'], df['predict']))

Accuracy of the model:0.44
Classification Report:
               precision    recall  f1-score   support

     forward       0.94      0.65      0.77       100
        list       0.44      0.63      0.52       100
        read       0.76      0.55      0.64       100
       reply       0.00      0.00      0.00       100
        send       0.29      0.56      0.38       100
        star       0.88      0.74      0.80       100
       trash       0.32      0.57      0.41       100
  trash_list       0.00      0.00      0.00       100
     unknown       0.27      0.71      0.40       100
     untrash       1.00      0.03      0.06       100

    accuracy                           0.44      1000
   macro avg       0.49      0.44      0.40      1000
weighted avg       0.49      0.44      0.40      1000
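
The report shows zero precision and recall for `reply` and `trash_list`, and a recall of 0.03 for `untrash`. This is consistent with the classifier code: no branch returns `reply` (its keywords sit in `send_keywords`), every `trash_list` keyword contains a substring tested earlier (`list`, `delete`, or `trash`), and `trash` is tested before `untrash`. To see where the misclassified rows end up, one option is to count the (true, predicted) pairs. The sketch below uses only the standard library; the helper name `top_confusions` and the toy labels are hypothetical and are not taken from the dataset.

```python
from collections import Counter

def top_confusions(y_true, y_pred, n=5):
    """Return the n most frequent (true, predicted) pairs that disagree."""
    pairs = Counter((t, p) for t, p in zip(y_true, y_pred) if t != p)
    return pairs.most_common(n)

# Toy labels for illustration only, not taken from the dataset.
y_true = ["reply", "reply", "untrash", "trash_list", "send"]
y_pred = ["send", "send", "trash", "trash", "send"]

for (true_label, predicted), count in top_confusions(y_true, y_pred):
    print(f"{true_label} -> {predicted}: {count}")
```

In the notebook this could be called as `top_confusions(df['label'], df['predict'])`; `pd.crosstab(df['label'], df['predict'])` gives the full confusion table in one call.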



