# 01-email_preprocessing [enron]

* __Status__ : OK
* __Dataset__ : /thetidinbox_1004/notebooks/Marine/enron1.tar.gz [as an example]
* __Source__ : Enron Email Dataset [https://www.cs.cmu.edu/~enron/]
* __Labeled__ : Yes. Spam/Ham


## 📚 **Import libraries**

In [47]:
import re
import tarfile
import numpy as np
import pandas as pd
from string import punctuation
from nltk.stem.snowball import SnowballStemmer
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from sklearn.metrics import confusion_matrix
from sklearn.naive_bayes import MultinomialNB

In [48]:
# enron1.tar.gz contains all the emails
enron_files = 'enron1.tar.gz'

## Methodology :

### Preparing the text data:

1. **extract_emails(fname)** : This function reads the tar archive and collects each email into a list of records with two keys, 'message' and 'class'. 'message' holds the raw email contents and 'class' holds the category, either 'ham' or 'spam'. The list is then converted into a dataframe and returned. The argument is the path to the tar archive.

2. **populate_df()** : This function loads all the emails into a dataframe, drops rows with NA values and duplicate rows, and returns the resulting dataframe.

3. **clean_email(email)** : This function removes @-mentions, URLs, digits, and punctuation, replaces newlines with spaces, and converts the text to lower case. The argument is the message text of one email; the cleaned text is returned.

4. **preproces_email(email)** : This function splits the text into individual words, stems each word, and joins the stemmed words with a single space after each one. The argument is the message text of one email; the stemmed text is returned.

5. **stopword_removal(email)** : This function removes the English stop words remaining in the email text passed as argument and returns the text without them.

In [49]:
def extract_emails(fname):
    """ Extract the emails from the tar.gz archive and load them into a pandas df """

    rows = []
    # tarfile reads and writes tar archives; 'r:gz' opens a gzip-compressed one
    with tarfile.open(fname, 'r:gz') as originalfile:
        for member in originalfile.getmembers():
            # The label is taken from the member path, which contains 'ham' or 'spam'
            if 'spam' in member.name:
                label = 'spam'
            elif 'ham' in member.name:
                label = 'ham'
            else:
                continue
            f = originalfile.extractfile(member)
            # extractfile returns None for members that are not regular files
            if f is not None:
                rows.append({'message': f.read(), 'class': label})

    return pd.DataFrame(rows)
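
The labelling logic can be tried without the real dataset. The sketch below builds a tiny gzip-compressed tar archive in memory and labels its members by directory name. The file names, the message contents, and the helper names `build_sample_archive` and `label_members` are hypothetical, and the `enron1/ham/...` and `enron1/spam/...` layout is an assumption about how the archive is organised.

```python
import io
import tarfile


def build_sample_archive():
    """Build a tiny .tar.gz in memory (hypothetical names and contents)."""
    buffer = io.BytesIO()
    samples = {
        'enron1/ham/0001.txt': b'Subject: meeting notes',
        'enron1/spam/0002.txt': b'Subject: win a prize',
    }
    with tarfile.open(fileobj=buffer, mode='w:gz') as archive:
        for name, payload in samples.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)
            archive.addfile(info, io.BytesIO(payload))
    buffer.seek(0)
    return buffer


def label_members(fileobj):
    """Read every regular file and label it from its directory name."""
    rows = []
    with tarfile.open(fileobj=fileobj, mode='r:gz') as archive:
        for member in archive.getmembers():
            if not member.isfile():
                continue  # skip directories and other special members
            parts = member.name.split('/')
            if 'spam' in parts:
                label = 'spam'
            elif 'ham' in parts:
                label = 'ham'
            else:
                continue  # not inside a labelled directory
            payload = archive.extractfile(member).read()
            rows.append({'message': payload, 'class': label})
    return rows


rows = label_members(build_sample_archive())
for row in rows:
    print(row['class'], row['message'])
```

Matching on whole path components rather than on substrings is a design choice in this sketch: it avoids mislabelling a file whose name merely contains the text "ham" or "spam".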

In [50]:
def populate_df():
    """ Populate the dataframe with all the emails """

    # DataFrame.append was removed in pandas 2.0, so use the extracted frame directly
    emails_df = extract_emails(enron_files)
    print("The dimensions of Dataframe created after extracting the tarfile:",emails_df.shape)
    
    # Dropping all the rows with NA values (dropna returns a new frame, so assign it)
    emails_df = emails_df.dropna()
    # Dropping all the duplicates but keeping the first instance
    emails_df = emails_df.drop_duplicates(keep='first')
    print("The dimensions of Dataframe after droping rows containing NA values and duplicates: ",emails_df.shape)
    
    return emails_df

In [51]:
def clean_email(email):
    
    # Remove mentions
    email = re.sub(r'@\w+', '', email)
    # Remove urls
    email = re.sub(r'http\S+', ' ', email)
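    # Remove digits (raw string avoids the invalid-escape warning for \d)
    email = re.sub(r"\d+", " ", email)
    # Replace newlines with spaces (carriage returns are left in place)
    email = email.replace('\n', ' ')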
    # Remove punctuations
    email = email.translate(str.maketrans("", "", punctuation))
    email = email.lower()
    
    return email
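
As a quick check of the cleaning steps, the sketch below applies a self-contained copy of `clean_email` to the sample message that appears in this notebook's output further down. It only illustrates the steps. The token list is obtained with `str.split()`, which also discards the leftover `\r` characters.

```python
import re
from string import punctuation


def clean_email(email):
    """Copy of the cleaning steps above, repeated so this block runs on its own."""
    email = re.sub(r'@\w+', '', email)       # remove @-mentions
    email = re.sub(r'http\S+', ' ', email)   # remove URLs
    email = re.sub(r'\d+', ' ', email)       # remove digits
    email = email.replace('\n', ' ')         # newlines become spaces
    email = email.translate(str.maketrans('', '', punctuation))
    return email.lower()


raw = ('Subject: txu fuel co . nom . s for 2 / 20 / 01\r\n'
       '( see attached file : hplno 220 . xls )\r\n'
       '- hplno 220 . xls')

cleaned = clean_email(raw)
tokens = cleaned.split()
print(tokens)
```

The cleaned string still contains runs of spaces and `\r` characters. Those disappear only once the text is split on whitespace, which the later steps do.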

In [52]:
def preproces_email(email):

    words = ""
    # Create the stemmer.
    stemmer = SnowballStemmer("english")
    # Split text into words.
    email = email.split()
    for word in email:
        # Stem the word and append it, followed by a single space.
        words = words + stemmer.stem(word) + " "
    
    return words

In [53]:
def stopword_removal(email):

    stop_words = set(stopwords.words('english')) 

    email = email.split()
    filtered_sentence = ""

    for w in email: 
        if w not in stop_words: 
            filtered_sentence = filtered_sentence + w +" "

    return filtered_sentence
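
The filtering step can be sketched without NLTK. The block below assumes a small hand-written stop-word set, `SAMPLE_STOP_WORDS`, which is hypothetical and much shorter than the NLTK English list. The helper name `remove_stop_words` is also illustrative. It joins the kept words with `' '.join(...)`, so unlike `stopword_removal` it leaves no trailing space.

```python
# Hypothetical stand-in for stopwords.words('english'); the real list is longer.
SAMPLE_STOP_WORDS = frozenset({'s', 'for', 'the', 'a', 'of'})


def remove_stop_words(text, stop_words=SAMPLE_STOP_WORDS):
    """Keep only the words that are not in stop_words, separated by one space."""
    return ' '.join(word for word in text.split() if word not in stop_words)


cleaned = ('subject txu fuel co  nom  s for        \r  see attached file  '
           'hplno    xls \r  hplno    xls')
filtered = remove_stop_words(cleaned)
print(filtered)
```

Using a set (here a `frozenset`) keeps each membership test at constant time on average, which matters when the same lookup runs for every word of several thousand emails.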

### Main Function :

This block calls the functions defined above to clean the data and prints the result after each step. The final text displayed is the cleaned text that will be used for modelling.

In [54]:
if __name__ == '__main__':
    emails_df = populate_df()
    print("Downloaded and unziped the files")
    print("==================================================================================================================")
    # Translate bytes objects into strings.
    emails_df['message'] = emails_df['message'].apply(lambda x: x.decode('latin-1'))

    # Reset pandas df index.
    emails_df = emails_df.reset_index(drop=True)

    # Map 'spam' to 1 and 'ham' to 0
    emails_df['class'] = emails_df['class'].map({'spam': 1, 'ham': 0})

    print("After mapping spam and ham emails to 1 and 0 respectively : \n",emails_df.iloc[2500].values)
    print("==================================================================================================================")
    
    emails_df['message'] = emails_df['message'].apply(clean_email)

    print("After cleaning email : \n",emails_df.iloc[2500].values)
    print("==================================================================================================================")
    
    #emails_df['message'] = emails_df['message'].apply(preproces_email)
    
    #print("After preprocessing the texts : \n",emails_df.iloc[2500].values)
    #print("==================================================================================================================")
    
    emails_df['message'] = emails_df['message'].apply(stopword_removal)
    
    print("Removing the stopwords : \n",emails_df.iloc[2500].values)




The dimensions of Dataframe created after extracting the tarfile: (5172, 2)
The dimensions of Dataframe after droping rows containing NA values and duplicates:  (4994, 2)
Downloaded and unziped the files
After mapping spam and ham emails to 1 and 0 respectively : 
 ['Subject: txu fuel co . nom . s for 2 / 20 / 01\r\n( see attached file : hplno 220 . xls )\r\n- hplno 220 . xls'
 0]
After cleaning email : 
 ['subject txu fuel co  nom  s for        \r  see attached file  hplno    xls \r  hplno    xls'
 0]
Removing the stopwords : 
 ['subject txu fuel co nom see attached file hplno xls hplno xls ' 0]


In [55]:
emails_df

| | message | class |
|---|---|---|
| 0 | subject christmas tree farm pictures | 0 |
| 1 | subject vastar resources inc gary production h... | 0 |
| 2 | subject calpine daily gas nomination calpine d... | 0 |
| 3 | subject issue fyi see note already done stella... | 0 |
| 4 | subject meter nov allocation fyi forwarded lau... | 0 |
| ... | ... | ... |
| 4989 | subject pro forma invoice attached divide cove... | 1 |
| 4990 | subject str rndlen extra time word bodyhtml | 1 |
| 4991 | subject check bb hey derm bbbbb check paris ma... | 1 |
| 4992 | subject hot jobs global marketing specialties ... | 1 |


In [56]:
emails_df.to_csv('email_df.csv', index=False)

Yay! Email Preprocessing functions done ✅


➡️ Let's jump to the Spam/Ham classifier notebook [/thetidinbox_1004/notebooks/Marine/02-email_spam_classification_[enron].ipynb]

🚨 **To do:** transfer these functions to a .py file at the end of the project.