# Gmail Cleaning

This notebook develops the functions required to clean and prepare the raw Gmail data for analysis and labeling.

Remember that the goal of this project is to teach computers how to read email subject lines. We therefore need clean subject lines, plus sender names, which could help contextualize each subject line. The other fields are not currently important. I will also keep the date for filtering.

## Packages

In [175]:
import pandas as pd
from email.header import decode_header

## Completed Cleaning Functions

In [None]:
# Change string date to datetime object
def datetimeify(df):
    df['date'] = pd.to_datetime(df['date'])
    return df


# Extract and clean name, email, and domain fields --> 'from'
def name_email_domain(df):
    # extract name and email --> 'from' and create new columns
    df[['name','email']] = df['from'].str.rsplit(' ', n=1, expand=True)
    
    # make sure every email observation has an email address
    for i in df.index:
        if " " in df.name[i]:
            pass
        elif "@" in df.name[i]:
            df.at[i, 'email'] = df.name[i]
        else:
            pass
    
    # remove brackets from around email addresses
    df.email = df.email.str.replace(r'[<>]+', '', regex=True)
    
    # extract the domain --> 'email' and create new column
    df['domain'] = df['email'].str.split('@').str[1]
    
    # remove the 'from' column (drop returns a new frame, so reassign it)
    df = df.drop(columns=['from'])
    
    return df


def column_decode(df, column):
    
    # satisfy dependencies
    from email.header import decode_header
    
    # change column observations to string type so we can decode
    df[column] = df[column].astype(str)
    
    # iterate over each observation to decode MIME-encoded (e.g. UTF-8) values
    # keep only the first decoded fragment, as a string
    for i in df.index:
        text, charset = decode_header(df.at[i, column])[0]
        # encoded fragments come back as bytes; decode with the declared charset
        if isinstance(text, bytes):
            text = text.decode(charset or 'utf-8', errors='replace')
        df.at[i, column] = text
    
    # remove extraneous quote marks from each observation in the column
    df[column] = df[column].apply(lambda x: x.replace('"', ''))

    return df
    

# Master function combining all other functions
def gmail_clean(df):
    df = datetimeify(df)
    df = name_email_domain(df)
    df = column_decode(df, 'subject')
    df = column_decode(df, 'name')
    return df


## Data Import

In [None]:
df = pd.read_pickle(r'mfflavell_emails.pkl')
df = df.drop(columns=['status', 'content_type'])

In [None]:
df = gmail_clean(df)

## Data Import

In our first notebook, we saved the raw Gmail data to a pickle file.

In [176]:
df = pd.read_pickle(r'mfflavell_emails.pkl')

Let's take a look at the file and start listing the ways it needs to be cleaned.

In [177]:
df.head(50)

Unnamed: 0,date,from,subject,status,content_type
0,"Tue, 05 Nov 2019 12:52:10 -0800",Google Community Team <googlecommunityteam-nor...,Finish setting up your new Google Account,,"multipart/alternative; boundary=""0000000000002..."
1,"Wed, 06 Nov 2019 15:51:29 +0000","""Frank Flavell (via Google Drive)"" <drive-shar...",Job Search - Invitation to collaborate,,"multipart/alternative; boundary=""000000000000a..."
2,8 Nov 2019 18:09:46 -0800,noreply@csod.com,U.S. Census Bureau Prospective Candidate Confi...,,text/html; charset=us-ascii
3,8 Nov 2019 18:12:30 -0800,noreply@csod.com,U.S. Census Bureau Prospective Candidate Confi...,,text/html; charset=us-ascii
4,"Tue, 12 Nov 2019 22:25:02 +0000 (UTC)","""Akins Team"" <info@akinsc.com>",Notification from Akins HR Team,,"multipart/alternative; boundary=""----------=_1..."
5,"Thu, 14 Nov 2019 01:08:17 +0000",SRNotice <SRNotice@customercare.nyc.gov>,SR Submitted # 311-01139022,,"multipart/alternative;\r\n\tboundary=""_000_BC3..."
6,"Fri, 21 Feb 2020 11:20:21 -0800",Google <noreply-utos@google.com>,Learn more about our updated Terms of Service,,"multipart/alternative; boundary=""000000000000a..."
7,24 Apr 2020 10:02:50 -0400,"""AdvantageCare Physicians"" <noreply@patients.p...",AdvantageCare Physicians would love your feedb...,,multipart/alternative;\r\n boundary=--boundary...
8,29 Apr 2020 00:28:21 -0400,"""AdvantageCare Physicians"" <noreply@patients.p...",AdvantageCare Physicians would love your feedb...,,multipart/alternative;\r\n boundary=--boundary...
9,"Mon, 24 Aug 2020 16:02:59 +0000",=?utf-8?Q?Hatch?= <hatch@hatchbaby.com>,=?utf-8?Q?Hatch=20Rest=3A=20Tap=20on=2C=20tap=...,,"multipart/alternative; boundary=""_----------=_..."


In [178]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17013 entries, 0 to 17012
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   date          17013 non-null  object
 1   from          17013 non-null  object
 2   subject       17012 non-null  object
 3   status        0 non-null      object
 4   content_type  17013 non-null  object
dtypes: object(5)
memory usage: 664.7+ KB


In [179]:
print("date is a " + str(type(df.loc[0, "date"])) + " data type.")
print("from is a " + str(type(df.loc[0, "from"])) + " data type.")
print("subject is a " + str(type(df.loc[0, "subject"])) + " data type.")

date is a <class 'str'> data type.
from is a <class 'str'> data type.
subject is a <class 'str'> data type.


## Cleaning Checklist

* [ ] Remove the "status" and "content_type" features
* [ ] Change string date to datetime datatype
* [ ] Split email address and sender name from the "from" field
* [ ] Decode UTF-8 encoded subject lines
* [ ] Render emojis from UTF-8 strings

### Remove Unnecessary Features

In [180]:
df = df.drop(columns=['status', 'content_type'])

In [181]:
df.head()

Unnamed: 0,date,from,subject
0,"Tue, 05 Nov 2019 12:52:10 -0800",Google Community Team <googlecommunityteam-nor...,Finish setting up your new Google Account
1,"Wed, 06 Nov 2019 15:51:29 +0000","""Frank Flavell (via Google Drive)"" <drive-shar...",Job Search - Invitation to collaborate
2,8 Nov 2019 18:09:46 -0800,noreply@csod.com,U.S. Census Bureau Prospective Candidate Confi...
3,8 Nov 2019 18:12:30 -0800,noreply@csod.com,U.S. Census Bureau Prospective Candidate Confi...
4,"Tue, 12 Nov 2019 22:25:02 +0000 (UTC)","""Akins Team"" <info@akinsc.com>",Notification from Akins HR Team


### Change Date to Datetime

In [182]:
#Preview
pd.to_datetime(df['date']).head()

0    2019-11-05 12:52:10-08:00
1    2019-11-06 15:51:29+00:00
2    2019-11-08 18:09:46-08:00
3    2019-11-08 18:12:30-08:00
4    2019-11-12 22:25:02+00:00
Name: date, dtype: object

In [183]:
df['date'] = pd.to_datetime(df['date'])

In [184]:
type(df.date[0])

datetime.datetime

In [212]:
def datetimeify(df):
    df['date'] = pd.to_datetime(df['date'])
    return df
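
The preview above reports `dtype: object` because the parsed values carry different UTC offsets. As a minimal sketch, assuming UTC-normalised timestamps are acceptable for filtering, the standard library's `email.utils.parsedate_to_datetime` can parse each header individually. The helper name `parse_email_date` and the choice to treat offset-less headers as UTC are assumptions made for illustration, not part of the original notebook.

```python
from datetime import timezone
from email.utils import parsedate_to_datetime


def parse_email_date(raw):
    """Parse an email Date header string and normalise it to UTC."""
    parsed = parsedate_to_datetime(raw)
    if parsed.tzinfo is None:
        # assumption: treat headers without a usable offset as UTC
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed.astimezone(timezone.utc)


# sample values copied from the raw 'date' column shown earlier
samples = [
    "Tue, 05 Nov 2019 12:52:10 -0800",
    "8 Nov 2019 18:09:46 -0800",
    "Tue, 12 Nov 2019 22:25:02 +0000 (UTC)",
]
for raw in samples:
    print(parse_email_date(raw).isoformat())
```

If this route were taken, something like `df['date'] = pd.to_datetime(df['date'].map(parse_email_date), utc=True)` should give a single timezone-aware datetime column, though that has not been run against the full dataset here.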

### Split Name, Email, and Domain into New Features

- [x] create name and email features
- [x] ensure all email features contain an address
- [ ] remove brackets from around email addresses
- [ ] create domain feature (only the address after the @ symbol)


Thanks to [EdChum](https://stackoverflow.com/questions/32643649/splitting-a-pandas-dataframe-of-email-from-field-into-senders-name-email-add) for this simple approach.

In [185]:
df[['name','email']] = df['from'].str.rsplit(' ', n=1, expand=True)

In [186]:
df.head(50)

Unnamed: 0,date,from,subject,name,email
0,2019-11-05 12:52:10-08:00,Google Community Team <googlecommunityteam-nor...,Finish setting up your new Google Account,Google Community Team,<googlecommunityteam-noreply@google.com>
1,2019-11-06 15:51:29+00:00,"""Frank Flavell (via Google Drive)"" <drive-shar...",Job Search - Invitation to collaborate,"""Frank Flavell (via Google Drive)""",<drive-shares-noreply@google.com>
2,2019-11-08 18:09:46-08:00,noreply@csod.com,U.S. Census Bureau Prospective Candidate Confi...,noreply@csod.com,
3,2019-11-08 18:12:30-08:00,noreply@csod.com,U.S. Census Bureau Prospective Candidate Confi...,noreply@csod.com,
4,2019-11-12 22:25:02+00:00,"""Akins Team"" <info@akinsc.com>",Notification from Akins HR Team,"""Akins Team""",<info@akinsc.com>
5,2019-11-14 01:08:17+00:00,SRNotice <SRNotice@customercare.nyc.gov>,SR Submitted # 311-01139022,SRNotice,<SRNotice@customercare.nyc.gov>
6,2020-02-21 11:20:21-08:00,Google <noreply-utos@google.com>,Learn more about our updated Terms of Service,Google,<noreply-utos@google.com>
7,2020-04-24 10:02:50-04:00,"""AdvantageCare Physicians"" <noreply@patients.p...",AdvantageCare Physicians would love your feedb...,"""AdvantageCare Physicians""",<noreply@patients.pgsurveying.com>
8,2020-04-29 00:28:21-04:00,"""AdvantageCare Physicians"" <noreply@patients.p...",AdvantageCare Physicians would love your feedb...,"""AdvantageCare Physicians""",<noreply@patients.pgsurveying.com>
9,2020-08-24 16:02:59+00:00,=?utf-8?Q?Hatch?= <hatch@hatchbaby.com>,=?utf-8?Q?Hatch=20Rest=3A=20Tap=20on=2C=20tap=...,=?utf-8?Q?Hatch?=,<hatch@hatchbaby.com>


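However, not every from field contains a name, although every one contains an email address. When the from field is a bare address, the split leaves the address in name and nothing in email, so we copy the address into email to make sure every row has one. Thanks to [piRSquared](https://stackoverflow.com/questions/23330654/update-a-dataframe-in-pandas-while-iterating-row-by-row) for the row-by-row update approach.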

In [187]:
for i in df.index:
    if " " in df.name[i]:
        pass
    elif "@" in df.name[i]:
        df.at[i, 'email'] = df.name[i]
    else:
        pass

In [188]:
df[df.email.str.contains(" ")]

Unnamed: 0,date,from,subject,name,email
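
As an alternative to splitting on the last space and patching rows afterwards, the standard library's `email.utils.parseaddr` handles quoted names, angle brackets, and bare addresses in one call. The sketch below is illustrative only; `split_sender` is a hypothetical helper name, and unlike the loop above it returns an empty name for bare addresses instead of repeating the address.

```python
from email.utils import parseaddr


def split_sender(raw):
    """Return (name, email, domain) for a raw From header value."""
    name, address = parseaddr(raw)
    # everything after the last "@" is taken as the domain
    domain = address.rpartition("@")[2] if "@" in address else ""
    return name, address, domain


# sample values copied from the 'from' column shown earlier
for raw in [
    '"Akins Team" <info@akinsc.com>',
    "noreply@csod.com",
    "SRNotice <SRNotice@customercare.nyc.gov>",
]:
    print(split_sender(raw))
```

Applied to the dataframe, the tuples from `df['from'].map(split_sender)` could be expanded into the name, email, and domain columns, which would also make the separate bracket-removal and quote-removal steps unnecessary.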


In [189]:
df.head(50)

Unnamed: 0,date,from,subject,name,email
0,2019-11-05 12:52:10-08:00,Google Community Team <googlecommunityteam-nor...,Finish setting up your new Google Account,Google Community Team,<googlecommunityteam-noreply@google.com>
1,2019-11-06 15:51:29+00:00,"""Frank Flavell (via Google Drive)"" <drive-shar...",Job Search - Invitation to collaborate,"""Frank Flavell (via Google Drive)""",<drive-shares-noreply@google.com>
2,2019-11-08 18:09:46-08:00,noreply@csod.com,U.S. Census Bureau Prospective Candidate Confi...,noreply@csod.com,noreply@csod.com
3,2019-11-08 18:12:30-08:00,noreply@csod.com,U.S. Census Bureau Prospective Candidate Confi...,noreply@csod.com,noreply@csod.com
4,2019-11-12 22:25:02+00:00,"""Akins Team"" <info@akinsc.com>",Notification from Akins HR Team,"""Akins Team""",<info@akinsc.com>
5,2019-11-14 01:08:17+00:00,SRNotice <SRNotice@customercare.nyc.gov>,SR Submitted # 311-01139022,SRNotice,<SRNotice@customercare.nyc.gov>
6,2020-02-21 11:20:21-08:00,Google <noreply-utos@google.com>,Learn more about our updated Terms of Service,Google,<noreply-utos@google.com>
7,2020-04-24 10:02:50-04:00,"""AdvantageCare Physicians"" <noreply@patients.p...",AdvantageCare Physicians would love your feedb...,"""AdvantageCare Physicians""",<noreply@patients.pgsurveying.com>
8,2020-04-29 00:28:21-04:00,"""AdvantageCare Physicians"" <noreply@patients.p...",AdvantageCare Physicians would love your feedb...,"""AdvantageCare Physicians""",<noreply@patients.pgsurveying.com>
9,2020-08-24 16:02:59+00:00,=?utf-8?Q?Hatch?= <hatch@hatchbaby.com>,=?utf-8?Q?Hatch=20Rest=3A=20Tap=20on=2C=20tap=...,=?utf-8?Q?Hatch?=,<hatch@hatchbaby.com>


Let's also remove all those brackets around the email address.

In [190]:
df.email = df.email.str.replace(r'[<>]+', '', regex=True)

In [191]:
df.head(50)

Unnamed: 0,date,from,subject,name,email
0,2019-11-05 12:52:10-08:00,Google Community Team <googlecommunityteam-nor...,Finish setting up your new Google Account,Google Community Team,googlecommunityteam-noreply@google.com
1,2019-11-06 15:51:29+00:00,"""Frank Flavell (via Google Drive)"" <drive-shar...",Job Search - Invitation to collaborate,"""Frank Flavell (via Google Drive)""",drive-shares-noreply@google.com
2,2019-11-08 18:09:46-08:00,noreply@csod.com,U.S. Census Bureau Prospective Candidate Confi...,noreply@csod.com,noreply@csod.com
3,2019-11-08 18:12:30-08:00,noreply@csod.com,U.S. Census Bureau Prospective Candidate Confi...,noreply@csod.com,noreply@csod.com
4,2019-11-12 22:25:02+00:00,"""Akins Team"" <info@akinsc.com>",Notification from Akins HR Team,"""Akins Team""",info@akinsc.com
5,2019-11-14 01:08:17+00:00,SRNotice <SRNotice@customercare.nyc.gov>,SR Submitted # 311-01139022,SRNotice,SRNotice@customercare.nyc.gov
6,2020-02-21 11:20:21-08:00,Google <noreply-utos@google.com>,Learn more about our updated Terms of Service,Google,noreply-utos@google.com
7,2020-04-24 10:02:50-04:00,"""AdvantageCare Physicians"" <noreply@patients.p...",AdvantageCare Physicians would love your feedb...,"""AdvantageCare Physicians""",noreply@patients.pgsurveying.com
8,2020-04-29 00:28:21-04:00,"""AdvantageCare Physicians"" <noreply@patients.p...",AdvantageCare Physicians would love your feedb...,"""AdvantageCare Physicians""",noreply@patients.pgsurveying.com
9,2020-08-24 16:02:59+00:00,=?utf-8?Q?Hatch?= <hatch@hatchbaby.com>,=?utf-8?Q?Hatch=20Rest=3A=20Tap=20on=2C=20tap=...,=?utf-8?Q?Hatch?=,hatch@hatchbaby.com


In [214]:
df['domain'] = df['email'].str.split('@').str[1]

In [216]:
df.head(50)

Unnamed: 0,date,from,subject,name,email,domain
0,2019-11-05 12:52:10-08:00,Google Community Team <googlecommunityteam-nor...,Finish setting up your new Google Account,Google Community Team,googlecommunityteam-noreply@google.com,google.com
1,2019-11-06 15:51:29+00:00,"""Frank Flavell (via Google Drive)"" <drive-shar...",Job Search - Invitation to collaborate,"""Frank Flavell (via Google Drive)""",drive-shares-noreply@google.com,google.com
2,2019-11-08 18:09:46-08:00,noreply@csod.com,U.S. Census Bureau Prospective Candidate Confi...,noreply@csod.com,noreply@csod.com,csod.com
3,2019-11-08 18:12:30-08:00,noreply@csod.com,U.S. Census Bureau Prospective Candidate Confi...,noreply@csod.com,noreply@csod.com,csod.com
4,2019-11-12 22:25:02+00:00,"""Akins Team"" <info@akinsc.com>",Notification from Akins HR Team,"""Akins Team""",info@akinsc.com,akinsc.com
5,2019-11-14 01:08:17+00:00,SRNotice <SRNotice@customercare.nyc.gov>,SR Submitted # 311-01139022,SRNotice,SRNotice@customercare.nyc.gov,customercare.nyc.gov
6,2020-02-21 11:20:21-08:00,Google <noreply-utos@google.com>,Learn more about our updated Terms of Service,Google,noreply-utos@google.com,google.com
7,2020-04-24 10:02:50-04:00,"""AdvantageCare Physicians"" <noreply@patients.p...",AdvantageCare Physicians would love your feedb...,"""AdvantageCare Physicians""",noreply@patients.pgsurveying.com,patients.pgsurveying.com
8,2020-04-29 00:28:21-04:00,"""AdvantageCare Physicians"" <noreply@patients.p...",AdvantageCare Physicians would love your feedb...,"""AdvantageCare Physicians""",noreply@patients.pgsurveying.com,patients.pgsurveying.com
9,2020-08-24 16:02:59+00:00,=?utf-8?Q?Hatch?= <hatch@hatchbaby.com>,"b'Hatch Rest: Tap on, tap off'",=?utf-8?Q?Hatch?=,hatch@hatchbaby.com,hatchbaby.com


In [221]:
df['name'] = df['name'].apply(lambda x: x.replace('"', ''))

In [222]:
df.head(50)

Unnamed: 0,date,from,subject,name,email,domain
0,2019-11-05 12:52:10-08:00,Google Community Team <googlecommunityteam-nor...,Finish setting up your new Google Account,Google Community Team,googlecommunityteam-noreply@google.com,google.com
1,2019-11-06 15:51:29+00:00,"""Frank Flavell (via Google Drive)"" <drive-shar...",Job Search - Invitation to collaborate,Frank Flavell (via Google Drive),drive-shares-noreply@google.com,google.com
2,2019-11-08 18:09:46-08:00,noreply@csod.com,U.S. Census Bureau Prospective Candidate Confi...,noreply@csod.com,noreply@csod.com,csod.com
3,2019-11-08 18:12:30-08:00,noreply@csod.com,U.S. Census Bureau Prospective Candidate Confi...,noreply@csod.com,noreply@csod.com,csod.com
4,2019-11-12 22:25:02+00:00,"""Akins Team"" <info@akinsc.com>",Notification from Akins HR Team,Akins Team,info@akinsc.com,akinsc.com
5,2019-11-14 01:08:17+00:00,SRNotice <SRNotice@customercare.nyc.gov>,SR Submitted # 311-01139022,SRNotice,SRNotice@customercare.nyc.gov,customercare.nyc.gov
6,2020-02-21 11:20:21-08:00,Google <noreply-utos@google.com>,Learn more about our updated Terms of Service,Google,noreply-utos@google.com,google.com
7,2020-04-24 10:02:50-04:00,"""AdvantageCare Physicians"" <noreply@patients.p...",AdvantageCare Physicians would love your feedb...,AdvantageCare Physicians,noreply@patients.pgsurveying.com,patients.pgsurveying.com
8,2020-04-29 00:28:21-04:00,"""AdvantageCare Physicians"" <noreply@patients.p...",AdvantageCare Physicians would love your feedb...,AdvantageCare Physicians,noreply@patients.pgsurveying.com,patients.pgsurveying.com
9,2020-08-24 16:02:59+00:00,=?utf-8?Q?Hatch?= <hatch@hatchbaby.com>,"b'Hatch Rest: Tap on, tap off'",=?utf-8?Q?Hatch?=,hatch@hatchbaby.com,hatchbaby.com


In [None]:
def name_email_domain(df):
    # extract name and email --> 'from' and create new columns
    df[['name','email']] = df['from'].str.rsplit(' ', n=1, expand=True)
    
    # make sure every email observation has an email address
    for i in df.index:
        if " " in df.name[i]:
            pass
        elif "@" in df.name[i]:
            df.at[i, 'email'] = df.name[i]
        else:
            pass
    
    # remove brackets from around email addresses
    df.email = df.email.str.replace(r'[<>]+', '', regex=True)
    
    # extract the domain --> 'email' and create new column
    df['domain'] = df['email'].str.split('@').str[1]
    
    # remove the 'from' column (drop returns a new frame, so reassign it)
    df = df.drop(columns=['from'])
    
    return df
    

### Decode UTF-8 Subject Lines

In [193]:
type(df.subject[0])

str

Convert every observation into string datatype so we can decode.

In [194]:
df.subject = df.subject.astype(str)

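Iterate over each observation to decode UTF-8 encoded subject lines. decode_header returns a list of (value, charset) tuples, so each subject becomes a list after this step.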

In [195]:
for i in df.index:
    df.at[i, 'subject'] = decode_header(df.at[i, 'subject'])

In [196]:
type(df.subject[0])

list

In [200]:
type(df.subject[0][0])

tuple

Iterate over each subject to replace the tuple with the string subject line.

In [205]:
for i in df.index:
    df.at[i, 'subject'] = df.subject[i][0][0]

In [208]:
type(df.subject[1245])

str
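
Taking only the first element of the first tuple leaves encoded subjects as bytes and discards any later fragments. A minimal sketch of an alternative, assuming every fragment should be kept: decode each bytes fragment with its declared charset and join the pieces. The helper name `decode_mime_header` and the emoji test string are hypothetical and were built for illustration; only `=?utf-8?Q?Hatch?=` is taken from the data above.

```python
from email.header import decode_header


def decode_mime_header(value):
    """Decode a possibly MIME-encoded header value into a plain string."""
    parts = []
    for chunk, charset in decode_header(value):
        if isinstance(chunk, bytes):
            # encoded fragments arrive as bytes plus a charset (None if unencoded)
            parts.append(chunk.decode(charset or "ascii", errors="replace"))
        else:
            # headers without any encoded-word come back as a plain str
            parts.append(chunk)
    return "".join(parts)


print(decode_mime_header("=?utf-8?Q?Hatch?="))
print(decode_mime_header("Notification from Akins HR Team"))
# hypothetical encoded-word containing a UTF-8 emoji
print(decode_mime_header("=?utf-8?Q?=F0=9F=91=8BWelcome_to_Kano_World?="))
```

Because the bytes are decoded directly, this avoids the later `astype(str)` step and the `b'...'` wrapper it introduces, and emoji encoded as UTF-8 come out as real characters. An unrecognised charset label would still raise a `LookupError`, which this sketch does not handle.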

In [227]:
df.head(50)

Unnamed: 0,date,from,subject,name,email,domain
0,2019-11-05 12:52:10-08:00,Google Community Team <googlecommunityteam-nor...,Finish setting up your new Google Account,Google Community Team,googlecommunityteam-noreply@google.com,google.com
1,2019-11-06 15:51:29+00:00,"""Frank Flavell (via Google Drive)"" <drive-shar...",Job Search - Invitation to collaborate,Frank Flavell (via Google Drive),drive-shares-noreply@google.com,google.com
2,2019-11-08 18:09:46-08:00,noreply@csod.com,U.S. Census Bureau Prospective Candidate Confi...,noreply@csod.com,noreply@csod.com,csod.com
3,2019-11-08 18:12:30-08:00,noreply@csod.com,U.S. Census Bureau Prospective Candidate Confi...,noreply@csod.com,noreply@csod.com,csod.com
4,2019-11-12 22:25:02+00:00,"""Akins Team"" <info@akinsc.com>",Notification from Akins HR Team,Akins Team,info@akinsc.com,akinsc.com
5,2019-11-14 01:08:17+00:00,SRNotice <SRNotice@customercare.nyc.gov>,SR Submitted # 311-01139022,SRNotice,SRNotice@customercare.nyc.gov,customercare.nyc.gov
6,2020-02-21 11:20:21-08:00,Google <noreply-utos@google.com>,Learn more about our updated Terms of Service,Google,noreply-utos@google.com,google.com
7,2020-04-24 10:02:50-04:00,"""AdvantageCare Physicians"" <noreply@patients.p...",AdvantageCare Physicians would love your feedb...,AdvantageCare Physicians,noreply@patients.pgsurveying.com,patients.pgsurveying.com
8,2020-04-29 00:28:21-04:00,"""AdvantageCare Physicians"" <noreply@patients.p...",AdvantageCare Physicians would love your feedb...,AdvantageCare Physicians,noreply@patients.pgsurveying.com,patients.pgsurveying.com
9,2020-08-24 16:02:59+00:00,=?utf-8?Q?Hatch?= <hatch@hatchbaby.com>,"b'Hatch Rest: Tap on, tap off'",=?utf-8?Q?Hatch?=,hatch@hatchbaby.com,hatchbaby.com


In [229]:
df.subject = df.subject.astype(str)

In [230]:
df['subject'] = df['subject'].apply(lambda x: x.replace("b'", ''))

In [232]:
df['subject'] = df['subject'].apply(lambda x: x.rstrip("'"))

In [236]:
df.head(50)

Unnamed: 0,date,from,subject,name,email,domain
0,2019-11-05 12:52:10-08:00,Google Community Team <googlecommunityteam-nor...,Finish setting up your new Google Account,Google Community Team,googlecommunityteam-noreply@google.com,google.com
1,2019-11-06 15:51:29+00:00,"""Frank Flavell (via Google Drive)"" <drive-shar...",Job Search - Invitation to collaborate,Frank Flavell (via Google Drive),drive-shares-noreply@google.com,google.com
2,2019-11-08 18:09:46-08:00,noreply@csod.com,U.S. Census Bureau Prospective Candidate Confi...,noreply@csod.com,noreply@csod.com,csod.com
3,2019-11-08 18:12:30-08:00,noreply@csod.com,U.S. Census Bureau Prospective Candidate Confi...,noreply@csod.com,noreply@csod.com,csod.com
4,2019-11-12 22:25:02+00:00,"""Akins Team"" <info@akinsc.com>",Notification from Akins HR Team,Akins Team,info@akinsc.com,akinsc.com
5,2019-11-14 01:08:17+00:00,SRNotice <SRNotice@customercare.nyc.gov>,SR Submitted # 311-01139022,SRNotice,SRNotice@customercare.nyc.gov,customercare.nyc.gov
6,2020-02-21 11:20:21-08:00,Google <noreply-utos@google.com>,Learn more about our updated Terms of Service,Google,noreply-utos@google.com,google.com
7,2020-04-24 10:02:50-04:00,"""AdvantageCare Physicians"" <noreply@patients.p...",AdvantageCare Physicians would love your feedb...,AdvantageCare Physicians,noreply@patients.pgsurveying.com,patients.pgsurveying.com
8,2020-04-29 00:28:21-04:00,"""AdvantageCare Physicians"" <noreply@patients.p...",AdvantageCare Physicians would love your feedb...,AdvantageCare Physicians,noreply@patients.pgsurveying.com,patients.pgsurveying.com
9,2020-08-24 16:02:59+00:00,=?utf-8?Q?Hatch?= <hatch@hatchbaby.com>,"Hatch Rest: Tap on, tap off",=?utf-8?Q?Hatch?=,hatch@hatchbaby.com,hatchbaby.com


In [234]:
def column_decode(df, column):
    
    # satisfy dependencies
    from email.header import decode_header
    
    # change column observations to string type so we can decode
    df[column] = df[column].astype(str)
    
    # iterate over each observation to decode MIME-encoded (e.g. UTF-8) values
    # keep only the first decoded fragment, as a string
    for i in df.index:
        text, charset = decode_header(df.at[i, column])[0]
        # encoded fragments come back as bytes; decode them with their declared
        # charset instead of stripping the "b'...'" wrapper left by str(bytes)
        if isinstance(text, bytes):
            text = text.decode(charset or 'utf-8', errors='replace')
        df.at[i, column] = text
    
    # remove extraneous quote marks from each observation in the column
    df[column] = df[column].apply(lambda x: x.replace('"', ''))
    
    return df

### Render Emojis & Create Feature of Emoji Name

In [223]:
emoji = pd.read_csv('emoji_library.csv')

In [225]:
emoji.head()

Unnamed: 0,type,native,android,symbol,unicode,utf-8,description
0,emoticon,😁,😁,😁,U+1F601,\xF0\x9F\x98\x81,grinning face with smiling eyes
1,emoticon,😂,😂,😂,U+1F602,\xF0\x9F\x98\x82,face with tears of joy
2,emoticon,😃,😃,😃,U+1F603,\xF0\x9F\x98\x83,smiling face with open mouth
3,emoticon,😄,😄,😄,U+1F604,\xF0\x9F\x98\x84,smiling face with open mouth and smiling eyes
4,emoticon,😅,😅,😅,U+1F605,\xF0\x9F\x98\x85,smiling face with open mouth and cold sweat


In [237]:
df.subject[13]

'\\xf0\\x9f\\x91\\x8bWelcome to Kano World'
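
The subject above still holds the emoji as escaped bytes, a side effect of converting a bytes value with `astype(str)`. A rough sketch of one way to recover the character from such a string, and to look up its Unicode name for the emoji-name feature, is below. Both helper names are hypothetical. The round trip assumes the string is ASCII text produced by `str(bytes)`, and it would also convert other backslash escapes (such as a literal `\n`), so it should be treated as a starting point only.

```python
import unicodedata


def restore_escaped_utf8(text):
    """Turn backslash-escaped UTF-8 bytes inside a string back into characters."""
    try:
        raw = text.encode("latin-1").decode("unicode_escape").encode("latin-1")
        return raw.decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # leave anything that does not round-trip cleanly untouched
        return text


def emoji_names(text):
    """Return Unicode names of 'Symbol, other' characters, which cover most emoji."""
    return [
        unicodedata.name(ch, "UNKNOWN")
        for ch in text
        if unicodedata.category(ch) == "So"
    ]


escaped = "\\xf0\\x9f\\x91\\x8bWelcome to Kano World"  # value of df.subject[13]
restored = restore_escaped_utf8(escaped)
print(restored)
print(emoji_names(restored))
```

Using `unicodedata` avoids depending on the description column of emoji_library.csv, although the two naming schemes will not necessarily match.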