# Gmail Cleaning: Name, Email & Domain

This notebook develops the functions needed to transform the 'from' column into three cleaned features:
* **Name:** The sender's name
* **Email:** The sender's email address
* **Domain:** The domain the sender is emailing from

## Packages

In [2]:
import pandas as pd
from email.header import decode_header

## Completed Cleaning Functions

In [3]:
# Change string date to datetime object
def datetimeify(df):
    df['date'] = pd.to_datetime(df['date'])
    return df

# Drop any rows without a subject line, which is the most important field we need.
def drop_empty_subs(df):
    df.dropna(subset=['subject'], inplace=True)
    return df

# Extract and clean name, email, and domain fields --> 'from'
def name_email_domain(df):
    
    # remove extraneous quote marks from each observation in the from column
    df['from'] = df['from'].apply(lambda x: x.replace('"', '')) 
    
    # extract name and email --> 'from' and create new columns
    df[['name','email']] = df['from'].str.rsplit('<', n=1, expand=True)
    
    # make sure every email observation has an email address
    for i in df.index:
        if " " in df.name[i]:
            pass
        elif "@" in df.name[i]:
            df.at[i, 'email'] = df.name[i]
        else:
            pass
    
    # remove brackets from around email addresses
    df.email = df.email.str.replace(r'[<>]+', '', regex=True)
    
    # extract the domain --> 'email' and create new column
    df['domain'] = df['email'].str.split('@').str[1]
    
    # remove extraneous quote marks from each observation in the column
    df['name'] = df['name'].apply(lambda x: x.replace('"', ''))
    
    # remove the 'from' column
    df = df.drop(columns=['from'])
    
    return df

# Master function combining all other functions
def gmail_NED(df):
    df_dated = datetimeify(df)
    df_dropped = drop_empty_subs(df_dated)
    df_NED = name_email_domain(df_dropped)
    return df_NED


## Data Import

In [4]:
df = pd.read_pickle(r'mfflavell_emails.pkl')
df = df.drop(columns=['status', 'content_type'])

In [5]:
df = datetimeify(df)

In [6]:
df = drop_empty_subs(df)

In [7]:
df = name_email_domain(df)

In [8]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 17012 entries, 0 to 17012
Data columns (total 5 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   date     17012 non-null  object
 1   subject  17012 non-null  object
 2   name     17012 non-null  object
 3   email    17012 non-null  object
 4   domain   17012 non-null  object
dtypes: object(5)
memory usage: 1.3+ MB


In [9]:
df.head()

Unnamed: 0,date,subject,name,email,domain
0,2019-11-05 12:52:10-08:00,Finish setting up your new Google Account,Google Community Team,googlecommunityteam-noreply@google.com,google.com
1,2019-11-06 15:51:29+00:00,Job Search - Invitation to collaborate,Frank Flavell (via Google Drive),drive-shares-noreply@google.com,google.com
2,2019-11-08 18:09:46-08:00,U.S. Census Bureau Prospective Candidate Confi...,noreply@csod.com,noreply@csod.com,csod.com
3,2019-11-08 18:12:30-08:00,U.S. Census Bureau Prospective Candidate Confi...,noreply@csod.com,noreply@csod.com,csod.com
4,2019-11-12 22:25:02+00:00,Notification from Akins HR Team,Akins Team,info@akinsc.com,akinsc.com


In [10]:
df.tail()

Unnamed: 0,date,subject,name,email,domain
17008,2022-09-23 21:33:42+00:00,Up to $25 off tastes so sweet. Can we just say...,Uber Eats,uber@uber.com,uber.com
17009,2022-09-23 22:25:40+00:00,"Matthew Flavell, will you rate your transactio...",Amazon Marketplace,marketplace-messages@amazon.com,amazon.com
17010,2022-09-23 23:04:11+00:00,Re: Flatiron DS Full Curriculum,rajeev panwar,panwar_rajeev@hotmail.com,hotmail.com
17011,2022-09-23 23:11:38+00:00,Your Weekend Watch Guide Is Here,HBO Max,HBOMax@mail.hbomax.com,mail.hbomax.com
17012,2022-09-24 00:00:20+00:00,=?utf-8?B?8J+OuSBIb3cgbXVjaCBwaWFubyBoYXZlIHlv...,Levi from Simply Piano,play@piano.hellosimply.com,piano.hellosimply.com


In [11]:
df.to_pickle("mfflavell_emails_NED.pkl")

## Function Development

Record of how I developed each cleaning function.

In [332]:
df = pd.read_pickle(r'mfflavell_emails.pkl')

Let's take a look at the file and start listing the ways it needs to be cleaned.

In [334]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17013 entries, 0 to 17012
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   date          17013 non-null  object
 1   from          17013 non-null  object
 2   subject       17012 non-null  object
 3   status        0 non-null      object
 4   content_type  17013 non-null  object
dtypes: object(5)
memory usage: 664.7+ KB


In [335]:
print("date is a " + str(type(df.loc[0, "date"])) + " data type.")
print("from is a " + str(type(df.loc[0, "from"])) + " data type.")
print("subject is a " + str(type(df.loc[0, "subject"])) + " data type.")

date is a <class 'str'> data type.
from is a <class 'str'> data type.
subject is a <class 'str'> data type.


## Cleaning Checklist

* [X] Remove the "status" and "content_type" features
* [X] Remove subject nulls since subject is the primary feature of interest
* [X] Change string date to datetime datatype
* [X] Split email address and sender name from the "from" field
* [X] Decode UTF-8 encoded subject lines
* [X] Render emojis from UTF-8 strings
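
For the two decoding items, one option is `decode_header`, which is imported under Packages. The following is a minimal sketch, not the notebook's own implementation: the helper name `decode_mime_header` is an assumption, and the base64 sample is a constructed string rather than a value from the dataset.

```python
from email.header import decode_header

def decode_mime_header(value):
    """Decode RFC 2047 encoded-words (=?charset?Q-or-B?...?=) into plain text."""
    pieces = []
    for text, charset in decode_header(value):
        if isinstance(text, bytes):
            # Fall back to ASCII when the header part names no charset
            pieces.append(text.decode(charset or "ascii", errors="replace"))
        else:
            # Headers without encoded-words come back as str already
            pieces.append(text)
    return "".join(pieces)

print(decode_mime_header("=?utf-8?Q?Hatch?="))     # sender name seen in the 'from' column
print(decode_mime_header("=?utf-8?B?8J+Riw==?="))  # constructed sample: one emoji, base64-encoded
```

Applied to a column, this could be used as `df['subject'].apply(decode_mime_header)`, provided null subjects have already been dropped.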

## Remove Unnecessary Features

In [324]:
df = df.drop(columns=['status', 'content_type'])

In [325]:
df.head()

Unnamed: 0,date,from,subject
0,"Tue, 05 Nov 2019 12:52:10 -0800",Google Community Team <googlecommunityteam-nor...,Finish setting up your new Google Account
1,"Wed, 06 Nov 2019 15:51:29 +0000","""Frank Flavell (via Google Drive)"" <drive-shar...",Job Search - Invitation to collaborate
2,8 Nov 2019 18:09:46 -0800,noreply@csod.com,U.S. Census Bureau Prospective Candidate Confi...
3,8 Nov 2019 18:12:30 -0800,noreply@csod.com,U.S. Census Bureau Prospective Candidate Confi...
4,"Tue, 12 Nov 2019 22:25:02 +0000 (UTC)","""Akins Team"" <info@akinsc.com>",Notification from Akins HR Team


## Remove Null Values for Subject

In [360]:
sub_strings = []
sub_bytes = []
sub_other = []
for i in df.index:
    if type(df.subject[i]) == str:
        sub_strings.append(i)
    elif type(df.subject[i]) == bytes:
        sub_bytes.append(i)
    else:
        sub_other.append(i)
        
print(len(sub_strings))
print(len(sub_bytes))
print(len(sub_other))

17012
0
1


In [361]:
print(sub_other)

[16829]


In [362]:
type(sub_other[0])

int

In [363]:
type(df.subject[16829])

NoneType

In [371]:
df = df.drop([16829], axis=0)

In [372]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 17012 entries, 0 to 17012
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   date     17012 non-null  object
 1   from     17012 non-null  object
 2   subject  17012 non-null  object
dtypes: object(3)
memory usage: 531.6+ KB


## Change Date to Datetime

In [326]:
#Preview
pd.to_datetime(df['date']).head()

0    2019-11-05 12:52:10-08:00
1    2019-11-06 15:51:29+00:00
2    2019-11-08 18:09:46-08:00
3    2019-11-08 18:12:30-08:00
4    2019-11-12 22:25:02+00:00
Name: date, dtype: object

In [327]:
df['date'] = pd.to_datetime(df['date'])

In [328]:
type(df.date[0])

datetime.datetime

In [329]:
def datetimeify(df):
    df['date'] = pd.to_datetime(df['date'])
    return df
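
The `dtype: object` in the preview above most likely reflects the fact that the parsed timestamps carry different UTC offsets (`-08:00` and `+00:00`), so pandas keeps them as Python `datetime` objects. If a single timezone-aware representation is preferred, one option is to parse each date string with the standard library and convert it to UTC. This is a minimal sketch: the helper name `to_utc` is an assumption, and the sample strings are copied from the raw `date` column shown earlier.

```python
from datetime import timezone
from email.utils import parsedate_to_datetime

def to_utc(date_string):
    """Parse an email Date header and return an aware datetime in UTC."""
    parsed = parsedate_to_datetime(date_string)
    return parsed.astimezone(timezone.utc)

samples = [
    "Tue, 05 Nov 2019 12:52:10 -0800",
    "8 Nov 2019 18:09:46 -0800",
    "Tue, 12 Nov 2019 22:25:02 +0000 (UTC)",
]
for raw in samples:
    print(to_utc(raw).isoformat())
```

The converted values can then be handed to `pd.to_datetime(..., utc=True)` to obtain a proper timezone-aware datetime column.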

## Split Name, Email, and Domain into New Features

- [x] create name and email features
- [x] ensure all email features contain an address
- [X] remove brackets from around email addresses
- [X] create domain feature (only the address after the @ symbol)


Thanks to [EdChum](https://stackoverflow.com/questions/32643649/splitting-a-pandas-dataframe-of-email-from-field-into-senders-name-email-add) for this simple approach. We'll adapt it slightly to split on the "<" bracket that opens the email address.
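
As an alternative to splitting on a delimiter, the standard library's `email.utils.parseaddr` separates the display name from the address and also copes with a bare address that has no name. This is a minimal sketch, offered as an option rather than the approach this notebook takes; the helper name `split_sender` is an assumption.

```python
from email.utils import parseaddr

def split_sender(from_value):
    """Return (name, email, domain) for one 'from' header value."""
    name, address = parseaddr(from_value)
    # rpartition yields an empty separator when there is no "@",
    # so only report a domain when a separator was actually found
    _, sep, domain = address.rpartition("@")
    return name, address, (domain if sep else "")

print(split_sender('"Akins Team" <info@akinsc.com>'))
print(split_sender("noreply@csod.com"))
```

Unlike the split used in this notebook, a bare address produces an empty name here instead of repeating the address in the name field.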

In [330]:
df[['name','email']] = df['from'].str.rsplit('<', n=1, expand=True)

In [331]:
df.head(50)

Unnamed: 0,date,from,subject,name,email
0,2019-11-05 12:52:10-08:00,Google Community Team <googlecommunityteam-nor...,Finish setting up your new Google Account,Google Community Team,<googlecommunityteam-noreply@google.com>
1,2019-11-06 15:51:29+00:00,"""Frank Flavell (via Google Drive)"" <drive-shar...",Job Search - Invitation to collaborate,"""Frank Flavell (via Google Drive)""",<drive-shares-noreply@google.com>
2,2019-11-08 18:09:46-08:00,noreply@csod.com,U.S. Census Bureau Prospective Candidate Confi...,noreply@csod.com,
3,2019-11-08 18:12:30-08:00,noreply@csod.com,U.S. Census Bureau Prospective Candidate Confi...,noreply@csod.com,
4,2019-11-12 22:25:02+00:00,"""Akins Team"" <info@akinsc.com>",Notification from Akins HR Team,"""Akins Team""",<info@akinsc.com>
5,2019-11-14 01:08:17+00:00,SRNotice <SRNotice@customercare.nyc.gov>,SR Submitted # 311-01139022,SRNotice,<SRNotice@customercare.nyc.gov>
6,2020-02-21 11:20:21-08:00,Google <noreply-utos@google.com>,Learn more about our updated Terms of Service,Google,<noreply-utos@google.com>
7,2020-04-24 10:02:50-04:00,"""AdvantageCare Physicians"" <noreply@patients.p...",AdvantageCare Physicians would love your feedb...,"""AdvantageCare Physicians""",<noreply@patients.pgsurveying.com>
8,2020-04-29 00:28:21-04:00,"""AdvantageCare Physicians"" <noreply@patients.p...",AdvantageCare Physicians would love your feedb...,"""AdvantageCare Physicians""",<noreply@patients.pgsurveying.com>
9,2020-08-24 16:02:59+00:00,=?utf-8?Q?Hatch?= <hatch@hatchbaby.com>,=?utf-8?Q?Hatch=20Rest=3A=20Tap=20on=2C=20tap=...,=?utf-8?Q?Hatch?=,<hatch@hatchbaby.com>


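The problem is that not every 'from' field contains a name, but every one does contain an email address. When there is no name, the split leaves the address in the name column and the email column empty, so we need to make sure every email field contains an address. Thanks to [piRSquared](https://stackoverflow.com/questions/23330654/update-a-dataframe-in-pandas-while-iterating-row-by-row) for the row-by-row update approach.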
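
The same fallback can also be written without a row loop. This is a minimal sketch on a few sample values from the 'from' column: the variable names `senders`, `parts`, and `bare` are assumptions, and the mask tests for a missing email part instead of reproducing the space check used in the loop.

```python
import pandas as pd

senders = pd.Series([
    "Google <noreply-utos@google.com>",
    "SRNotice <SRNotice@customercare.nyc.gov>",
    "noreply@csod.com",
])

# Split once on the last "<": left part is the name, right part the address
parts = senders.str.rsplit("<", n=1, expand=True)
parts.columns = ["name", "email"]
parts["name"] = parts["name"].str.strip()

# Rows with no "<" have no email part; if the name part is itself an address, reuse it
bare = parts["email"].isna() & parts["name"].str.contains("@", regex=False)
parts.loc[bare, "email"] = parts.loc[bare, "name"]

# Drop the trailing ">" left over from the split, then take the text after "@"
parts["email"] = parts["email"].str.rstrip(">")
parts["domain"] = parts["email"].str.split("@").str[1]
print(parts)
```

Boolean masks like `bare` avoid per-row `df.at` assignments, which usually matters more as the number of rows grows.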

In [271]:
for i in df.index:
    if " " in df.name[i]:
        pass
    elif "@" in df.name[i]:
        df.at[i, 'email'] = df.name[i]
    else:
        pass

In [272]:
df[df.email.str.contains(" ")]

Unnamed: 0,date,from,subject,name,email


In [273]:
df.head(50)

Unnamed: 0,date,from,subject,name,email
0,2019-11-05 12:52:10-08:00,Google Community Team <googlecommunityteam-nor...,Finish setting up your new Google Account,Google Community Team,<googlecommunityteam-noreply@google.com>
1,2019-11-06 15:51:29+00:00,"""Frank Flavell (via Google Drive)"" <drive-shar...",Job Search - Invitation to collaborate,"""Frank Flavell (via Google Drive)""",<drive-shares-noreply@google.com>
2,2019-11-08 18:09:46-08:00,noreply@csod.com,U.S. Census Bureau Prospective Candidate Confi...,noreply@csod.com,noreply@csod.com
3,2019-11-08 18:12:30-08:00,noreply@csod.com,U.S. Census Bureau Prospective Candidate Confi...,noreply@csod.com,noreply@csod.com
4,2019-11-12 22:25:02+00:00,"""Akins Team"" <info@akinsc.com>",Notification from Akins HR Team,"""Akins Team""",<info@akinsc.com>
5,2019-11-14 01:08:17+00:00,SRNotice <SRNotice@customercare.nyc.gov>,SR Submitted # 311-01139022,SRNotice,<SRNotice@customercare.nyc.gov>
6,2020-02-21 11:20:21-08:00,Google <noreply-utos@google.com>,Learn more about our updated Terms of Service,Google,<noreply-utos@google.com>
7,2020-04-24 10:02:50-04:00,"""AdvantageCare Physicians"" <noreply@patients.p...",AdvantageCare Physicians would love your feedb...,"""AdvantageCare Physicians""",<noreply@patients.pgsurveying.com>
8,2020-04-29 00:28:21-04:00,"""AdvantageCare Physicians"" <noreply@patients.p...",AdvantageCare Physicians would love your feedb...,"""AdvantageCare Physicians""",<noreply@patients.pgsurveying.com>
9,2020-08-24 16:02:59+00:00,=?utf-8?Q?Hatch?= <hatch@hatchbaby.com>,=?utf-8?Q?Hatch=20Rest=3A=20Tap=20on=2C=20tap=...,=?utf-8?Q?Hatch?=,<hatch@hatchbaby.com>


Let's also remove all those brackets around the email address.

In [274]:
df.email = df.email.str.replace(r'[<>]+', '', regex=True)

In [275]:
df.head(50)

Unnamed: 0,date,from,subject,name,email
0,2019-11-05 12:52:10-08:00,Google Community Team <googlecommunityteam-nor...,Finish setting up your new Google Account,Google Community Team,googlecommunityteam-noreply@google.com
1,2019-11-06 15:51:29+00:00,"""Frank Flavell (via Google Drive)"" <drive-shar...",Job Search - Invitation to collaborate,"""Frank Flavell (via Google Drive)""",drive-shares-noreply@google.com
2,2019-11-08 18:09:46-08:00,noreply@csod.com,U.S. Census Bureau Prospective Candidate Confi...,noreply@csod.com,noreply@csod.com
3,2019-11-08 18:12:30-08:00,noreply@csod.com,U.S. Census Bureau Prospective Candidate Confi...,noreply@csod.com,noreply@csod.com
4,2019-11-12 22:25:02+00:00,"""Akins Team"" <info@akinsc.com>",Notification from Akins HR Team,"""Akins Team""",info@akinsc.com
5,2019-11-14 01:08:17+00:00,SRNotice <SRNotice@customercare.nyc.gov>,SR Submitted # 311-01139022,SRNotice,SRNotice@customercare.nyc.gov
6,2020-02-21 11:20:21-08:00,Google <noreply-utos@google.com>,Learn more about our updated Terms of Service,Google,noreply-utos@google.com
7,2020-04-24 10:02:50-04:00,"""AdvantageCare Physicians"" <noreply@patients.p...",AdvantageCare Physicians would love your feedb...,"""AdvantageCare Physicians""",noreply@patients.pgsurveying.com
8,2020-04-29 00:28:21-04:00,"""AdvantageCare Physicians"" <noreply@patients.p...",AdvantageCare Physicians would love your feedb...,"""AdvantageCare Physicians""",noreply@patients.pgsurveying.com
9,2020-08-24 16:02:59+00:00,=?utf-8?Q?Hatch?= <hatch@hatchbaby.com>,=?utf-8?Q?Hatch=20Rest=3A=20Tap=20on=2C=20tap=...,=?utf-8?Q?Hatch?=,hatch@hatchbaby.com


In [276]:
df['domain'] = df['email'].str.split('@').str[1]

In [277]:
df.head(50)

Unnamed: 0,date,from,subject,name,email,domain
0,2019-11-05 12:52:10-08:00,Google Community Team <googlecommunityteam-nor...,Finish setting up your new Google Account,Google Community Team,googlecommunityteam-noreply@google.com,google.com
1,2019-11-06 15:51:29+00:00,"""Frank Flavell (via Google Drive)"" <drive-shar...",Job Search - Invitation to collaborate,"""Frank Flavell (via Google Drive)""",drive-shares-noreply@google.com,google.com
2,2019-11-08 18:09:46-08:00,noreply@csod.com,U.S. Census Bureau Prospective Candidate Confi...,noreply@csod.com,noreply@csod.com,csod.com
3,2019-11-08 18:12:30-08:00,noreply@csod.com,U.S. Census Bureau Prospective Candidate Confi...,noreply@csod.com,noreply@csod.com,csod.com
4,2019-11-12 22:25:02+00:00,"""Akins Team"" <info@akinsc.com>",Notification from Akins HR Team,"""Akins Team""",info@akinsc.com,akinsc.com
5,2019-11-14 01:08:17+00:00,SRNotice <SRNotice@customercare.nyc.gov>,SR Submitted # 311-01139022,SRNotice,SRNotice@customercare.nyc.gov,customercare.nyc.gov
6,2020-02-21 11:20:21-08:00,Google <noreply-utos@google.com>,Learn more about our updated Terms of Service,Google,noreply-utos@google.com,google.com
7,2020-04-24 10:02:50-04:00,"""AdvantageCare Physicians"" <noreply@patients.p...",AdvantageCare Physicians would love your feedb...,"""AdvantageCare Physicians""",noreply@patients.pgsurveying.com,patients.pgsurveying.com
8,2020-04-29 00:28:21-04:00,"""AdvantageCare Physicians"" <noreply@patients.p...",AdvantageCare Physicians would love your feedb...,"""AdvantageCare Physicians""",noreply@patients.pgsurveying.com,patients.pgsurveying.com
9,2020-08-24 16:02:59+00:00,=?utf-8?Q?Hatch?= <hatch@hatchbaby.com>,=?utf-8?Q?Hatch=20Rest=3A=20Tap=20on=2C=20tap=...,=?utf-8?Q?Hatch?=,hatch@hatchbaby.com,hatchbaby.com


In [279]:
df['name'] = df['name'].apply(lambda x: x.replace('"', ''))

In [280]:
df.head(50)

Unnamed: 0,date,from,subject,name,email,domain
0,2019-11-05 12:52:10-08:00,Google Community Team <googlecommunityteam-nor...,Finish setting up your new Google Account,Google Community Team,googlecommunityteam-noreply@google.com,google.com
1,2019-11-06 15:51:29+00:00,"""Frank Flavell (via Google Drive)"" <drive-shar...",Job Search - Invitation to collaborate,Frank Flavell (via Google Drive),drive-shares-noreply@google.com,google.com
2,2019-11-08 18:09:46-08:00,noreply@csod.com,U.S. Census Bureau Prospective Candidate Confi...,noreply@csod.com,noreply@csod.com,csod.com
3,2019-11-08 18:12:30-08:00,noreply@csod.com,U.S. Census Bureau Prospective Candidate Confi...,noreply@csod.com,noreply@csod.com,csod.com
4,2019-11-12 22:25:02+00:00,"""Akins Team"" <info@akinsc.com>",Notification from Akins HR Team,Akins Team,info@akinsc.com,akinsc.com
5,2019-11-14 01:08:17+00:00,SRNotice <SRNotice@customercare.nyc.gov>,SR Submitted # 311-01139022,SRNotice,SRNotice@customercare.nyc.gov,customercare.nyc.gov
6,2020-02-21 11:20:21-08:00,Google <noreply-utos@google.com>,Learn more about our updated Terms of Service,Google,noreply-utos@google.com,google.com
7,2020-04-24 10:02:50-04:00,"""AdvantageCare Physicians"" <noreply@patients.p...",AdvantageCare Physicians would love your feedb...,AdvantageCare Physicians,noreply@patients.pgsurveying.com,patients.pgsurveying.com
8,2020-04-29 00:28:21-04:00,"""AdvantageCare Physicians"" <noreply@patients.p...",AdvantageCare Physicians would love your feedb...,AdvantageCare Physicians,noreply@patients.pgsurveying.com,patients.pgsurveying.com
9,2020-08-24 16:02:59+00:00,=?utf-8?Q?Hatch?= <hatch@hatchbaby.com>,=?utf-8?Q?Hatch=20Rest=3A=20Tap=20on=2C=20tap=...,=?utf-8?Q?Hatch?=,hatch@hatchbaby.com,hatchbaby.com


In [None]:
def name_email_domain(df):
    # extract name and email --> 'from' and create new columns
    # (earlier draft: splits on the last space; the completed function splits on '<')
    df[['name','email']] = df['from'].str.rsplit(' ', n=1, expand=True)
    
    # make sure every email observation has an email address
    for i in df.index:
        if " " in df.name[i]:
            pass
        elif "@" in df.name[i]:
            df.at[i, 'email'] = df.name[i]
        else:
            pass
    
    # remove brackets from around email addresses
    df.email = df.email.str.replace(r'[<>]+', '', regex=True)
    
    # extract the domain --> 'email' and create new column
    df['domain'] = df['email'].str.split('@').str[1]
    
    # remove the 'from' column (drop returns a new DataFrame, so reassign it)
    df = df.drop(columns=['from'])
    
    return df
    

In [241]:
import emoji

In [250]:
emoji.distinct_emoji_list(df.subject[13])

[]

In [243]:
emoji.emoji_list(test_emo)

[]

In [244]:
import regex

In [245]:
regex.search('\p{Emoji=Yes}', test_e.decode('utf8'))

AttributeError: 'str' object has no attribute 'decode'

In [254]:
def extract_emojis(s):
  return ''.join(c for c in s if c in emoji_lib['utf-8'])

In [287]:
print([c for c in test_emo])

['\\', 'x', 'f', '0', '\\', 'x', '9', 'f', '\\', 'x', '9', '1', '\\', 'x', '8', 'b', 'W', 'e', 'l', 'c', 'o', 'm', 'e', ' ', 't', 'o', ' ', 'K', 'a', 'n', 'o', ' ', 'W', 'o', 'r', 'l', 'd']


In [285]:
test_print(test_emo)

<generator object test_print.<locals>.<genexpr> at 0x7feee4d67c80>


In [255]:
extract_emojis(test_emo)

''

In [257]:
from django.utils.encoding import smart_str,smart_unicode

cleaned_up_text=smart_str(test_emo)

ImportError: cannot import name 'smart_unicode' from 'django.utils.encoding' (/opt/anaconda3/lib/python3.9/site-packages/django/utils/encoding.py)

In [258]:
test_emo.decode("unicode_escape")

AttributeError: 'str' object has no attribute 'decode'

In [298]:
def emojified(s):
    data = pd.DataFrame(columns=['symbol', 'utf-8', 'description'])
    # note: iterating over a str yields single characters, not words
    for word in s:
        row = emoji_lib[emoji_lib['utf-8'] == word]
        data = pd.concat([data, row])
    return data

In [305]:
from nltk.tokenize import RegexpTokenizer
tokenizer = RegexpTokenizer(r'[a-zA-Z0-9]+')

In [306]:
test_tokenized = tokenizer.tokenize(test_emo)

In [307]:
test_tokenized

['xf0', 'x9f', 'x91', 'x8bWelcome', 'to', 'Kano', 'World']

In [308]:
for word in test_tokenized:
    if word in emoji_lib['utf-8']:
        print('YES')
    else:
        print('nope')

nope
nope
nope
nope
nope
nope
nope
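
One thing to keep in mind when reading this result: if `emoji_lib['utf-8']` is a pandas Series (an assumption, since `emoji_lib` is not defined in this section), the `in` operator checks the Series index labels, not its values. Here is a minimal sketch with a hypothetical stand-in Series named `codes`:

```python
import pandas as pd

# Hypothetical stand-in for emoji_lib['utf-8']
codes = pd.Series(["xf0", "x9f", "x91"])

print("xf0" in codes)             # False: "in" looks at the index labels 0, 1, 2
print(0 in codes)                 # True: 0 is an index label
print("xf0" in codes.tolist())    # True: membership among the values
print(codes.isin(["xf0"]).any())  # True: vectorised membership test
```

So a loop of this shape would print `nope` even for a token that does appear among the values.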


In [299]:
emojified(test_emo)

Unnamed: 0,symbol,utf-8,description,type,native,android,unicode


In [309]:
import demoji

ModuleNotFoundError: No module named 'demoji'

In [311]:

import demoji

In [312]:
demoji.findall(test_emo)

{}

In [317]:
type(test_emo.encode('unicode_escape'))

bytes
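
The character dump of `test_emo` earlier in this section suggests that the string holds literal escape text (a backslash, an `x`, and two hex digits) rather than an actual emoji, which would explain why the emoji libraries find nothing. Below is a minimal sketch of one way to turn that text back into the character it spells out, assuming the escapes represent UTF-8 bytes. `test_emo` is rebuilt from the printed characters, and `unescape_utf8` is a hypothetical helper name.

```python
def unescape_utf8(text):
    """Turn literal backslash-x escape text into the UTF-8 characters it spells out."""
    # 1. Interpret the backslash escapes: each escape becomes one code point below 256
    escaped = text.encode("ascii").decode("unicode_escape")
    # 2. Map those code points back to raw bytes, then decode the bytes as UTF-8
    return escaped.encode("latin-1").decode("utf-8")

# Raw string so the backslashes stay literal, matching the printed character list
test_emo = r"\xf0\x9f\x91\x8bWelcome to Kano World"
print(unescape_utf8(test_emo))
```

This sketch assumes the input is pure ASCII; a string that already contains real non-ASCII characters would need different handling.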