Our data comes from two source archives, trec07p and 1998. They contain unstructured text files, each holding the headers, subject, and body of one email. In this part, we parse the text files, convert them into a structured pandas DataFrame, and save it as "structured.xlsx".

### If on Google Colab

Execute the following cells only if running on Google Colab. They install the needed packages and download the datasets from Google Drive.

In [1]:
!pip install PyDrive
!pip install XlsxWriter

Collecting PyDrive
  Downloading https://files.pythonhosted.org/packages/52/e0/0e64788e5dd58ce2d6934549676243dc69d982f198524be9b99e9c2a4fd5/PyDrive-1.3.1.tar.gz (987kB)
Building wheels for collected packages: PyDrive
  Running setup.py bdist_wheel for PyDrive ... done
  Stored in directory: /content/.cache/pip/wheels/fa/d2/9a/d3b6b506c2da98289e5d417215ce34b696db856643bad779f4
Successfully built PyDrive
Installing collected packages: PyDrive
Successfully installed PyDrive-1.3.1
Collecting XlsxWriter
  Downloading https://files.pythonhosted.org/packages/33/50/136b801d106fcebb2428a764e5c599e020d8227a3623db078e05eb4793a5/XlsxWriter-1.0.5-py2.py3-none-any.whl (142kB)
Installing collected packages: XlsxWriter
Successfully installed XlsxWriter-1.0.5


In [0]:
import os
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials

Authenticate the PyDrive API to access Google Drive.

In [0]:
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)

Now, we download and untar the two datasets, 1998 and trec07p, that we are going to work with.

In [0]:
download = drive.CreateFile({'id': '1QtoxpJmd1lys7c7LaYXiOjbzMdMOpeVX'})
download.GetContentFile('1998.tar')

In [0]:
download2 = drive.CreateFile({'id': '1xaJL1eoccrCyS45xgF23dVY_KCER-oAD'})
download2.GetContentFile('trec07p.tar')

In [0]:
!tar xf 1998.tar
!tar xf trec07p.tar

### If not on Google Colab

If running locally, make sure to have the 1998 dataset and trec07p dataset in the same directory as this notebook.
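
A quick way to confirm the layout before running the remaining cells is sketched below. The directory names are taken from the paths used later in this notebook; the helper name `missing_dataset_dirs` is hypothetical.

```python
import os

# Directories that later cells read from, relative to the notebook's folder.
EXPECTED_DIRS = ["1998/1998/03", "trec07p/trec07p/data"]

def missing_dataset_dirs(base="."):
    """Return the expected dataset directories that do not exist under base."""
    return [d for d in EXPECTED_DIRS if not os.path.isdir(os.path.join(base, d))]

missing = missing_dataset_dirs()
if missing:
    print("Missing dataset directories:", ", ".join(missing))
else:
    print("Both datasets found.")
```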

In [11]:
import os
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup

`notHeaders` is a list of tokens that can appear before a colon at the start of a line but that we do not treat as email headers.

In [12]:
notHeaders = ['A',
 'A.normal2{color',
 'A.normal{color',
 'A.over{color',
 'ATT',
 'Action',
 'Added',
 'Address',
 'An',
 'Asunto',
 'Author',
 'Auto-Submitted',
 'BATCH',
 'Betreff',
 'Beurteilung',
 'Bookmark',
 'Brma',
 'BrmaSmtpAuthUser',
 'C',
 'Call',
 'Cancel-Lock',
 'Carlson',
 'Caveats',
 'Changeset',
 'Cia1iis',
 'Cited',
 'Classification',
 'Comment',
 'Company',
 'Complaints',
 'Congratulations',
 'Contact',
 'Copyright',
 'Corp.<http',
 'Corporation<http',
 'Credibility',
 'Date-warning',
 'Datum',
 'De',
 'Details',
 'De\xa0',
 'Diagnostic-Code',
 'Dinah',
 'Envoyé\xa0',
 'FATAL',
 'FDA',
 'FONT-SIZE',
 'Featuring',
 'Fixes',
 'Gesendet',
 'Hash',
 'Hinweis',
 'ISIN',
 'Importance',
 'Inc.<http',
 'Index',
 'Jabber-ID',
 'Jim',
 'John',
 'Kopia',
 'LINE-HEIGHT',
 'List-Help',
 'Log',
 'London',
 'Lookup',
 'MOTD',
 'Mail-Followup-To',
 'Managed-by',
 'Market',
 'Metze',
 'Modified',
 'NOTE',
 'Name',
 'Napster<http',
 'Newshawk',
 'Notice',
 'Number',
 'OTC',
 'Objet\xa0',
 'Old-Return-Path',
 'OpenPGP',
 'Organisation',
 'Organization',
 'PADDING-BOTTOM',
 'PHONE',
 'Package',
 'Page',
 'Para',
 'Phone',
 'Posted',
 'Precedence',
 'President',
 'Priority',
 'Products',
 'Pubdate',
 'Publicitate',
 'REF',
 'RT-Ticket',
 'Rangel',
 'Received-SPF',
 'References',
 'Reminder',
 'Removed',
 'Reporting-MTA',
 'S.umbol',
 'Sent',
 'Severity',
 'Site',
 'Skickat',
 'Source',
 'Spoken',
 'Sym8oL',
 'Symbol',
 'Sys.putenv("http_proxy"="http',
 'TELEPHONE',
 'Talk',
 'Tel',
 'Teste',
 'Ticker',
 'Till',
 'Timing',
 'Tname',
 'Try',
 'Type',
 'UEI',
 'URL',
 'Visit',
 'Von',
 'WASHINGTON',
 'WKN',
 'Webpage',
 'Website',
 'A.RVTS2',
 'ASTIG',
 'AXIS',
 'Ambieen',
 'Anova',
 'Below',
 'CARGO',
 'CNN',
 'CNNMoney',
 'City',
 'Clinton',
 'Collins',
 'Given',
 'HEADLINES',
 'IRAN',
 'Iterations',
 'Juego!<o',
 'Kernel',
 'L<Parrot',
 'Parrot',
 'Price']
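
To show how this list is used, here is a minimal sketch of the header test applied later in `getDFFromEmail`: a line counts as a header only if it starts with an upper-case letter, has a colon within the first 30 characters with no space before it, and its name is not excluded. The function name `looks_like_header` and the three-entry `excluded` set are illustrative stand-ins, not part of the notebook.

```python
def looks_like_header(line, excluded):
    """Return True if line has the shape 'Name: value' and Name is not excluded."""
    colon = line.find(":")
    if colon < 0 or colon >= 30:
        return False                      # no colon, or colon too far into the line
    name = line[:colon]
    return (line[:1].isupper()            # header names start with an upper-case letter
            and " " not in name           # no spaces before the colon
            and name not in excluded)     # not a known false positive

excluded = {"Phone", "URL", "Visit"}      # small stand-in for notHeaders

print(looks_like_header("Subject: hello", excluded))          # True
print(looks_like_header("Phone: 555-0100", excluded))         # False, excluded name
print(looks_like_header("Dear customer: read on", excluded))  # False, space before colon
```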

Since many email bodies contain HTML markup, we define a function that extracts the visible text from HTML content.

In [13]:
def parseTextFromHTML(body):
    '''
    eliminate HTML markup tags and return only the visible text content, lower-cased
    '''
    soup = BeautifulSoup(body.lower(), 'html.parser')
    # drop <style> and <script> elements, whose contents are not visible text
    for s in soup(['style', 'script']):
        s.extract()
    return soup.get_text().strip()
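
For readers without BeautifulSoup installed, the same idea can be sketched with the standard library's `html.parser`. This is a swapped-in technique for illustration only, not the notebook's method; the names `VisibleTextParser` and `visible_text` are hypothetical.

```python
from html.parser import HTMLParser

class VisibleTextParser(HTMLParser):
    """Collect text nodes, ignoring anything inside <style> or <script>."""
    def __init__(self):
        super().__init__()
        self.parts = []
        self.skip_depth = 0   # > 0 while inside a style/script element

    def handle_starttag(self, tag, attrs):
        if tag in ("style", "script"):
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("style", "script") and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.parts.append(data)

def visible_text(body):
    parser = VisibleTextParser()
    parser.feed(body.lower())   # the notebook's function also lower-cases the body
    parser.close()              # flush any text still buffered by the parser
    return "".join(parser.parts).strip()

sample = "<html><style>p{color:red}</style><body><p>Buy <b>NOW</b></p></body></html>"
print(visible_text(sample))   # buy now
```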

In [14]:
"""# parse the email file(spam or ham) to create a pandas dataframe
def getDFFromEmail(path, spam):
    l = []
    precurrkv = []
    sufcurrkv = []
    count = 0
    body = False
    try:
        stream = open(path, errors='strict', encoding='UTF-8')
        stream.readlines()
        stream.seek(0)
        print("Using UTF-8 encoding")
    except (Exception):
        print("Using ANSI encoding")
        stream = open(path, errors='strict', encoding='ANSI')
        stream.readlines()
        stream.seek(0)
        
    for line in stream:
        if(line.startswith('<DOCTYPE')):
            continue
            
        colonIndex = line.find(":")
        
        if(count > 15 and not body):
            body = True
            l.append(precurrkv)
            
        if(line[0].isupper() and line.find(' ',0,colonIndex)<0 and colonIndex >= 0 and colonIndex < 30 and not body 
           and line[:colonIndex] not in notHeaders):
            count = 0
            if(len(sufcurrkv) > 0 and len(precurrkv) > 0):
                precurrkv[1] = "".join((precurrkv[1], "".join(sufcurrkv)))
                l.append(precurrkv)
            elif(len(precurrkv) > 0):
                l.append(precurrkv)
            precurrkv = line.split(sep=":", maxsplit=1)
            sufcurrkv.clear()

        elif(str(line[0:2]).isspace() and not body):
            count += 1
            sufcurrkv.append(line)
        elif(str(line).startswith('>') and not body):
            body = True
            sufcurrkv = [''.join((':'.join(precurrkv), "".join(sufcurrkv)))]
        else:
            #count += 1
            sufcurrkv.append(line)

    l.append(["Body", parseTextFromHTML("".join(sufcurrkv))])
    l.append(["Spam", 'Spam' if spam else 'Ham'])
    l.append(["Tname", path])

    d = pd.DataFrame(np.array(l)).drop_duplicates(subset=0)
    return d    """
""""""

''

We define a method that parses an email text file represented by the path parameter and returns a pandas dataframe.

In [15]:
# parse the email file (spam or ham) to create a pandas dataframe
from collections import deque

def getDFFromEmail(path, spam):
    '''
    path is the path to the email text file
    spam is a boolean indicating spam (True) or ham (False, not spam)
    returns a dataframe of email fields and their values for the given file,
    or None if the file cannot be read
    '''
    l = dict()           # maps each field name to a one-element list holding its value
    precurrkv = []       # [key, value] of the current header, e.g. in "Sender : Vighnesh", Sender is the key and Vighnesh the value
    sufcurrkv = deque()  # current header name followed by its value lines; after the headers, the body lines
    count = 0            # number of continuation lines seen since the last header
    body = False         # indicates that the body has started

    # figure out the correct encoding: try UTF-8 first, then fall back to ISO-8859-1
    lines = None
    for encoding in ('utf-8', 'iso-8859-1'):
        try:
            with open(path, encoding=encoding, errors='strict') as stream:
                lines = stream.readlines()
            print("Using " + encoding.upper() + " encoding")
            break
        except (UnicodeDecodeError, OSError):
            continue
    if lines is None:
        return None

    for line in lines:
        # skip doctype statements
        if line.startswith(('<!DOCTYPE', '<DOCTYPE')):
            continue

        colonIndex = line.find(":")

        # too many continuation lines in a row: treat the rest as body
        if count > 15 and not body:
            body = True

        # check if a new email header has been read:
        # first letter is upper case, no spaces before the colon, the colon is
        # within the first 30 characters, and the name is not in notHeaders
        if (line[0].isupper() and line.find(' ', 0, colonIndex) < 0 and 0 <= colonIndex < 30
                and not body and line[:colonIndex] not in notHeaders):
            count = 0  # reset the continuation-line counter
            # store the previous header: the first element is its name, the rest is its value
            if sufcurrkv and len(precurrkv) > 0:
                fieldname = sufcurrkv.popleft()
                l[fieldname] = [''.join(sufcurrkv)]

            precurrkv = line.split(sep=":", maxsplit=1)
            sufcurrkv.clear()
            sufcurrkv.extend(precurrkv)

        # line starts with whitespace: it continues the previous header
        elif line[0:2].isspace() and not body:
            count += 1
            sufcurrkv.append(line)

        # '>' marks quoted text from a previous email, thus the body has started
        elif line.startswith('>') and not body:
            body = True
            sufcurrkv.append(line)

        # anything else is collected as body text
        else:
            sufcurrkv.append(line)

    # the deque now holds the last header name, its value, and then the body lines
    if precurrkv:
        fieldname = sufcurrkv.popleft()
        l[fieldname] = [sufcurrkv.popleft()]
    l["Body"] = [parseTextFromHTML("".join(sufcurrkv))]
    l["Spam"] = ['Spam'] if spam else ['Ham']
    l["Tname"] = [path]

    d = pd.DataFrame(l)
    return d.T.reset_index()
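
The encoding fallback used above can be isolated into a small helper. This is a minimal sketch, assuming UTF-8 is tried first and ISO-8859-1 second as in the function; `read_lines_with_fallback` is a hypothetical name.

```python
def read_lines_with_fallback(path, encodings=("utf-8", "iso-8859-1")):
    """Return (lines, encoding) using the first encoding that decodes the whole file."""
    for encoding in encodings:
        try:
            with open(path, encoding=encoding, errors="strict") as stream:
                return stream.readlines(), encoding
        except UnicodeDecodeError:
            continue      # this encoding cannot decode the file, try the next one
    return None, None     # none of the encodings worked
```

Because ISO-8859-1 maps every byte value to a character, the second attempt never fails on undecodable bytes. It acts as a catch-all rather than as a true detection step.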

We define a generator for the 1998 dataset that iterates over all email text files in a directory and yields resulting dataframes.

In [16]:
# iterate over all files in the directory to create the dataframes
def getAllDFFromDirectory(directorypath):
    '''
    directorypath is the path to a 1998 dataset directory
    yields the result of getDFFromEmail for each file in it
    '''
    for filename in os.listdir(directorypath):
        path = os.path.join(directorypath, filename)
        print("Getting DF for " + path)
        # names containing "msg" are ham unless they start with "spm"; everything else is spam
        spam = filename.startswith("spm") or "msg" not in filename
        yield getDFFromEmail(path, spam)
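
The labelling rule in this generator depends only on the file name. Below is a minimal sketch of that rule on its own; `label_from_filename` is a hypothetical helper, and the first two sample names are made up.

```python
def label_from_filename(filename):
    """Return True for spam, False for ham, following the generator's rule."""
    if filename.startswith("spm"):
        return True            # explicit spam prefix
    if "msg" in filename:
        return False           # "msg" marks a ham message
    return True                # everything else is treated as spam

for name in ["spmsg01.txt", "msg42.txt", "890929468.24864.txt"]:
    print(name, "->", "Spam" if label_from_filename(name) else "Ham")
```

File names such as 890929468.24864.txt contain neither marker, so they fall through to the spam default.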

In [17]:
from functools import reduce

# create dataframes from all directories that contain spam or ham email text files

In [18]:
directorypath = "1998/1998/03/"
dfs = [i for i in getAllDFFromDirectory(directorypath) if(isinstance(i,pd.DataFrame))]

Getting DF for 1998/1998/03/890929468.24864.txt
Using UTF-8 encoding
Getting DF for 1998/1998/03/890929472.24865.txt
Using UTF-8 encoding
Getting DF for 1998/1998/03/890929475.24866.txt
Using UTF-8 encoding
Getting DF for 1998/1998/03/890929479.24867.txt
Using UTF-8 encoding
...

We define a generator for the trec07p dataset that iterates over all email text files in a directory and yields dataframes.

In [19]:
# iterate over all files in the directory to create the dataframes
def getAllDFFromDirectory2(directorypath, target):
    '''
    directorypath is the path to the trec07p data directory
    target is the list of labels ('spam' or 'ham') read from the index file
    yields the result of getDFFromEmail (explained above) for the first 5000 listed files
    '''
    for filename in os.listdir(directorypath)[:5000]:
        path = os.path.join(directorypath, filename)
        print("Getting DF for " + path)

        # os.listdir does not return files in index order, so look up the label by the
        # number in the file name; this assumes line n of the index describes inmail.n
        label = target[int(filename.rsplit('.', 1)[1]) - 1]
        yield getDFFromEmail(path, label == 'spam')
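
Labels for trec07p come from an index file rather than from file names. Assuming each index line has the form `spam ../data/inmail.1` (a label, a space, then a relative path), a lookup keyed by file name keeps each label attached to the right email regardless of the order in which `os.listdir` returns files. The helper name `labels_by_filename` and the sample lines are hypothetical.

```python
def labels_by_filename(index_lines):
    """Map each file name in the index (e.g. 'inmail.1') to its label."""
    labels = {}
    for line in index_lines:
        parts = line.split()
        if len(parts) != 2:
            continue                                  # skip blank or malformed lines
        label, relpath = parts
        labels[relpath.rsplit("/", 1)[-1]] = label    # keep only the file name
    return labels

index_lines = ["spam ../data/inmail.1\n", "ham ../data/inmail.2\n", "spam ../data/inmail.10\n"]
labels = labels_by_filename(index_lines)
print(labels["inmail.10"])   # spam
print(labels["inmail.2"])    # ham
```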

In [20]:
def getTrec07Target():
    '''
    the index file has one line per email: the target (spam or ham) and the path to the data file
    yields the target of each line
    '''
    with open("trec07p/trec07p/full/index") as index:
        for line in index:
            yield line.split(" ")[0]

target = list(getTrec07Target())
print(len(target))

directorypath = "trec07p/trec07p/data/"
dfs2 = [i for i in getAllDFFromDirectory2(directorypath, target) if(isinstance(i,pd.DataFrame))]

75419
Getting DF for trec07p/trec07p/data/inmail.1
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.10
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.100
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.1000
Using UTF-8 encoding
...
Getting DF for trec07p/trec07p/data/inmail.12164
Using UTF-8 encoding


Getting DF for trec07p/trec07p/data/inmail.12258
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.12259
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.1226
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.12260
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.12261
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.12262
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.12263
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.12264
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.12265
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.12266
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.12267
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.12268
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.12269
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.1227
Using UTF-8 encoding
Getting DF for trec07p

Getting DF for trec07p/trec07p/data/inmail.12369
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.1237
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.12370
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.12371
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.12372
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.12373
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.12374
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.12375
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.12376
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.12377
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.12378
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.12379
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.1238
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.12380
Using UTF-8 encoding
Getting DF for trec07p

Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.12480
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.12481
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.12482
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.12483
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.12484
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.12485
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.12486
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.12487
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.12488
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.12489
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.1249
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.12490
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.12491
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.12492
Using UTF-8 encoding


Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.12589
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.1259
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.12590
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.12591
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.12592
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.12593
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.12594
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.12595
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.12596
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.12597
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.12598
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.12599
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.126
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.1260
Using UTF-8 encoding
Get

Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.12696
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.12697
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.12698
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.12699
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.127
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.1270
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.12700
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.12701
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.12702
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.12703
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.12704
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.12705
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.12706
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.12707
Using UTF-8 encoding
Ge

Getting DF for trec07p/trec07p/data/inmail.12804
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.12805
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.12806
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.12807
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.12808
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.12809
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.1281
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.12810
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.12811
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.12812
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.12813
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.12814
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.12815
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.12816
Using UTF-8 encoding
Getting DF for trec07

Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.12912
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.12913
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.12914
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.12915
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.12916
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.12917
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.12918
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.12919
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.1292
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.12920
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.12921
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.12922
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.12923
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.12924
Using UTF-8 encoding


Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.13021
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.13022
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.13023
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.13024
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.13025
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.13026
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.13027
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.13028
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.13029
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.1303
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.13030
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.13031
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.13032
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.13033
Using UTF-8 encoding


Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.13132
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.13133
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.13134
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.13135
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.13136
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.13137
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.13138
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.13139
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.1314
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.13140
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.13141
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.13142
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.13143
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.13144
Using UTF-8 encoding


Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.1324
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.13240
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.13241
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.13242
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.13243
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.13244
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.13245
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.13246
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.13247
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.13248
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.13249
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.1325
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.13250
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.13251
Using UTF-8 encoding
G

Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.1335
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.13350
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.13351
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.13352
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.13353
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.13354
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.13355
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.13356
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.13357
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.13358
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.13359
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.1336
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.13360
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.13361
Using UTF-8 encoding
G

Getting DF for trec07p/trec07p/data/inmail.13456
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.13457
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.13458
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.13459
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.1346
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.13460
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.13461
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.13462
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.13463
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.13464
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.13465
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.13466
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.13467
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.13468
Using UTF-8 encoding
Getting DF for trec07

Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.13568
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.13569
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.1357
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.13570
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.13571
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.13572
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.13573
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.13574
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.13575
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.13576
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.13577
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.13578
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.13579
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.1358
Using UTF-8 encoding
G

Getting DF for trec07p/trec07p/data/inmail.13679
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.1368
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.13680
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.13681
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.13682
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.13683
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.13684
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.13685
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.13686
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.13687
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.13688
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.13689
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.1369
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.13690
Using UTF-8 encoding
Getting DF for trec07p

Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.13790
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.13791
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.13792
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.13793
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.13794
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.13795
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.13796
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.13797
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.13798
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.13799
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.138
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.1380
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.13800
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.13801
Using UTF-8 encoding
Ge

Getting DF for trec07p/trec07p/data/inmail.139
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.1390
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.13900
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.13901
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.13902
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.13903
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.13904
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.13905
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.13906
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.13907
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.13908
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.13909
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.1391
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.13910
Using UTF-8 encoding
Getting DF for trec07p/t

Getting DF for trec07p/trec07p/data/inmail.14006
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.14007
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.14008
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.14009
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.1401
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.14010
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.14011
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.14012
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.14013
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.14014
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.14015
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.14016
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.14017
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.14018
Using UTF-8 encoding
Getting DF for trec07

Getting DF for trec07p/trec07p/data/inmail.14117
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.14118
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.14119
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.1412
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.14120
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.14121
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.14122
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.14123
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.14124
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.14125
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.14126
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.14127
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.14128
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.14129
Using UTF-8 encoding
Getting DF for trec07

Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.14223
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.14224
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.14225
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.14226
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.14227
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.14228
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.14229
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.1423
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.14230
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.14231
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.14232
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.14233
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.14234
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.14235
Using UTF-8 encoding


Getting DF for trec07p/trec07p/data/inmail.14332
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.14333
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.14334
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.14335
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.14336
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.14337
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.14338
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.14339
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.1434
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.14340
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.14341
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.14342
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.14343
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.14344
Using UTF-8 encoding
Getting DF for trec07

Getting DF for trec07p/trec07p/data/inmail.14442
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.14443
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.14444
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.14445
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.14446
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.14447
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.14448
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.14449
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.1445
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.14450
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.14451
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.14452
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.14453
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.14454
Using UTF-8 encoding
Getting DF for trec07

In [21]:
# Test for number of rows (header fields) in each dataframe
#list(map(lambda df: df.shape[0], dfs2))
#[i for i in zip(map(lambda df: df.shape[0],dfs2), range(len(dfs2))) if(i[0]>40)]


Use functools.reduce to merge all resulting dataframes into a single dataframe representing all emails.

In [22]:
df_final = reduce(lambda left,right: pd.merge(left,right,how='outer',on='index'), dfs+dfs2)
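As a minimal sketch of what this merge does, assume each per-email dataframe has an 'index' column of header names plus one value column. The toy frames and the `mail_N` column names below are made up for illustration and are not taken from the datasets.

```python
from functools import reduce

import pandas as pd

# Hypothetical per-email frames: 'index' holds header names, the other column the values
email_a = pd.DataFrame({'index': ['From', 'Subject'], 'mail_0': ['a@x.com', 'hello']})
email_b = pd.DataFrame({'index': ['From', 'To'], 'mail_1': ['b@y.com', 'c@z.com']})
email_c = pd.DataFrame({'index': ['Subject', 'Body'], 'mail_2': ['offer', 'buy now']})

# An outer merge keeps every header seen in any email; missing values become NaN
merged = reduce(
    lambda left, right: pd.merge(left, right, how='outer', on='index'),
    [email_a, email_b, email_c],
)

# After transposing: one row per email, one column per header
emails = merged.set_index('index').T
print(emails)
```

Each merge step widens the accumulated frame by one email, so the final frame has one row per distinct header until it is transposed.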

In [23]:
df_final = df_final.set_index('index').T

In [24]:
df_final['index'] = [i for i in range(df_final.shape[0])]
df_final.set_index('index', inplace=True)

In [25]:
df_final.head()

index,Return-Path,Delivered-To,Received,From,To,Message-Id,Reply-To,Subject,Mime-Version,Content-Type,...,X-PMX-Version-Mac,X-PerlMx-Spam,X-Sagator-Scanner,X-Sagator-ID,Nos,X-AuditID,Content-Language,SINGAPORE,X-imss-approveListMatch,X-twelveapples.com-MsgID
0,<aj881c@ix.netcom.com>\n,bait@mikhail.qcc.sk.ca\n,from local.nethost.org(really [24553/24554]) ...,aj881c <aj881c@ix.netcom.com>\n,<bagpipes@acadia.net>\n,<19943672.886214@relay.comanche.denmark.eu> M...,aj881c@ix.netcom.com\nAuthenticated sender is...,2-1\n,1.0\n,"text/plain; charset=""us-ascii""\n",...,,,,,,,,,,
1,<iwbp@mailcity.com>\n,bait@mikhail.qcc.sk.ca\n,from mail.hic.net (1Cust113.tnt8.lax3.da.uu.n...,iwbp@mailcity.com\n,members@your.net\n,,,"Exclusive Internet Business, 1st Time Offered...",,,...,,,,,,,,,,
2,<am74rt@worldnet.att.net>\n,bait@mikhail.qcc.sk.ca\n,from local.nethost.org(really [24553/24554]) ...,am74rt <am74rt@worldnet.att.net>\n,<badams@eastky.com>\n,<19943672.886214@relay.comanche.denmark.eu> T...,am74rt@worldnet.att.net\n\nAuthenticated send...,2-17\n,1.0\n,"text/plain; charset=""us-ascii""\n",...,,,,,,,,,,
3,<subwiz1@friendlyserver.com>\n,bait@mikhail.qcc.sk.ca\n,from subwiz1@friendlyserver.com œby net1.aoci...,"""D.Reynolds"" <subwiz1@friendlyserver.com>\n",,<199802161222.EAA24869@net1.aoci.com>\n,subwiz1@friendlyserver.com\n,ADV: FREE DOWNLOAD:Register your web site to ...,,,...,,,,,,,,,,
4,<carlover@goplay.com>\n,bait@mikhail.qcc.sk.ca\n,from mail.anet-chi.com (1Cust245.tnt13.atl2.d...,carlover@goplay.com\n,carlovers@america.com\n,,,AUTOMOBILE OPPORTUNITY\n,,,...,,,,,,,,,,


Notice that some columns are duplicates that differ only in capitalization, for example 'Message-Id', 'Message-ID' and 'Message-id'. We define a method to combine such duplicate columns into one.

In [26]:
def combineDuplicateColumns(df_final, colNames):
    '''
    Fills missing values in the first column of colNames from the remaining
    columns (in order), then drops those remaining columns in place.
    '''
    originalCol = colNames[0]
    for duplicateCol in colNames[1:]:
        # Only rows where originalCol is still NaN take the duplicate's value
        df_final[originalCol] = df_final[originalCol].fillna(df_final[duplicateCol])

    df_final.drop(labels=colNames[1:], axis=1, inplace=True)

In [27]:
# Combine duplicate columns
combineDuplicateColumns(df_final, ['Message-Id', 'Message-ID', 'Message-id'])
combineDuplicateColumns(df_final, ['Reply-To', 'Reply-to'])
combineDuplicateColumns(df_final, ['Mime-Version', 'MIME-version', 'MIME-Version'])
combineDuplicateColumns(df_final, ['Content-Type', 'Content-type'])
combineDuplicateColumns(df_final, ['Content-Transfer-Encoding', 'Content-transfer-encoding'])
combineDuplicateColumns(df_final, ['Error-To', 'Errors-To', 'Errors-to'])
combineDuplicateColumns(df_final, ['Content-Length','Content-length'])
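To see the effect on a small example, here is a self-contained sketch with a made-up dataframe. It coalesces the columns with `Series.fillna`, which fills only the rows where the first column is missing; the values are invented for illustration.

```python
import numpy as np
import pandas as pd

# Made-up data: the same header spelled three ways across different emails
toy = pd.DataFrame({
    'Message-Id': ['<1@a>', np.nan, np.nan],
    'Message-ID': [np.nan, '<2@b>', np.nan],
    'Message-id': [np.nan, np.nan, '<3@c>'],
})

names = ['Message-Id', 'Message-ID', 'Message-id']
for duplicate in names[1:]:
    # Fill gaps in the first column from each alternate spelling in turn
    toy[names[0]] = toy[names[0]].fillna(toy[duplicate])
toy = toy.drop(columns=names[1:])

print(toy)
```

If two spellings both carry a value for the same email, the one listed first wins, so the order of the names matters.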

Now that the duplicate columns listed above are combined, let's display all columns of the final dataframe and check for any remaining duplicates.

In [28]:
# Check for more duplicate columns
np.sort(df_final.columns.values)

array(['Account', 'Addressee', 'Amount', 'Approved-By', 'Architecture',
       'Army', 'Arrival-Date', 'Association', 'Attn',
       'Authentication-Results', 'Availability', 'Bcc', 'Body', 'Boxer',
       'BroadcastJobID', 'BroadcastRecipientID', 'CC', 'COMM', 'COMMENTS',
       'Cc', 'Cialis', 'Coefficients', 'Comments', 'Conf', 'Content-Base',
       'Content-Class', 'Content-Description', 'Content-Disposition',
       'Content-ID', 'Content-Language', 'Content-Length',
       'Content-Location', 'Content-MD5', 'Content-Transfer-Encoding',
       'Content-Type', 'Content-class', 'Content-description',
       'Content-disposition', 'Content-language', 'Copied',
       'DKIM-Signature', 'DSN', 'Date', 'Delivered-To', 'Design',
       'Development', 'Disposition-Notification-To', 'Dobbs',
       'DomainKey-Signature', 'DomainKey-Status', 'E-Mail', 'E-mail',
       'EMAIL', 'ERROR', 'EXTRAS.</span></font><o', 'Email', 'Error',
       'Error-To', 'FACT', 'FROM', 'Fax', 'Final-Recipient',
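Scanning the sorted list by eye works, but names that differ only in capitalization (such as 'Content-Class' and 'Content-class' in the output above) can also be grouped programmatically. The helper below, `find_case_duplicates`, is a hypothetical addition and not part of the original notebook.

```python
from collections import defaultdict


def find_case_duplicates(columns):
    """Group column names that are identical when compared case-insensitively."""
    groups = defaultdict(list)
    for name in columns:
        groups[str(name).lower()].append(name)
    # Keep only groups that contain more than one spelling
    return [names for names in groups.values() if len(names) > 1]


# Example with a few of the names from the output above
sample = ['Content-Class', 'Content-class', 'Date', 'E-Mail', 'E-mail', 'EMAIL', 'Email']
duplicates = find_case_duplicates(sample)
print(duplicates)
```

Each returned group could then be passed to `combineDuplicateColumns`. Names that differ by more than case, such as 'E-Mail' and 'Email', still need a manual check.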

Let's keep the columns that we need and eliminate the rest.

In [29]:
# Columns to keep: From, To, Body, Subject, Message-Id, X-UIDL, Sender, Spam
c1 = df_final.columns
c2 = pd.Index(['Spam', 'Body', 'Subject', 'From', 'To', 'Message-Id', 'X-UIDL', 'Sender'])
c1 = c1.difference(c2)  # every column that is not in the keep list

df_new = df_final.drop(columns=c1)
df_new.columns

Index(['From', 'To', 'Message-Id', 'Subject', 'Body', 'Spam', 'X-UIDL',
       'Sender'],
      dtype='object', name='index')
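An equivalent way to express this selection is to pick the wanted columns directly instead of dropping the rest. The sketch below uses a made-up frame in place of `df_final` and silently skips wanted names that are absent.

```python
import pandas as pd

# Made-up frame standing in for df_final
toy = pd.DataFrame({
    'From': ['a@x.com'],
    'X-Mailer': ['foo'],
    'Subject': ['hi'],
    'Date': ['Mon'],
})
wanted = {'Spam', 'Body', 'Subject', 'From', 'To'}

# Keep the wanted columns that actually exist, preserving their original order
keep = [col for col in toy.columns if col in wanted]
subset = toy[keep]
print(list(subset.columns))
```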

Here is our final dataframe. The data is now in a structured format, and we save it as "structured.xlsx".

In [30]:
df_new.head()

index,From,To,Message-Id,Subject,Body,Spam,X-UIDL,Sender
0,aj881c <aj881c@ix.netcom.com>\n,<bagpipes@acadia.net>\n,<19943672.886214@relay.comanche.denmark.eu> M...,2-1\n,email marketing works!!\n\nbull's eye gold is ...,Spam,,
1,iwbp@mailcity.com\n,members@your.net\n,<>\n,"Exclusive Internet Business, 1st Time Offered...",>>>this is the most exciting breakthrough ever...,Spam,,
2,am74rt <am74rt@worldnet.att.net>\n,<badams@eastky.com>\n,<19943672.886214@relay.comanche.denmark.eu> T...,2-17\n,email marketing works!!\n\nbull's eye gold is ...,Spam,,
3,"""D.Reynolds"" <subwiz1@friendlyserver.com>\n",,<199802161222.EAA24869@net1.aoci.com>\n,ADV: FREE DOWNLOAD:Register your web site to ...,free download.register your web site to over 7...,Spam,,
4,carlover@goplay.com\n,carlovers@america.com\n,<>\n,AUTOMOBILE OPPORTUNITY\n,do you love cars?\n\nwant your own business?\n...,Spam,,


In [31]:
# Save to Excel. The context manager writes and closes the file on exit
# (ExcelWriter.save() was removed in pandas 2.0).
with pd.ExcelWriter('structured.xlsx', engine='xlsxwriter') as excelwriter:
    df_new.to_excel(excelwriter, index=False)

If running on Google Colab, download the file to your local machine.

In [None]:
from google.colab import files
files.download('structured.xlsx')