# WITI Email Bounce List Cleaning

This notebook takes bounced email addresses exported from SendGrid, an email delivery service, and extracts the error code from each bounce message so we can tell hard bounces from soft bounces. A hard bounce means the email address is invalid; a soft bounce is a temporary delivery failure. This tells us which emails we will want to remove from our database.

In [1]:
#Load the libraries needed

import pandas as pd
import re
import string

In [5]:
#Load all the SendGrid error codes so we can later separate
#hard bounces from soft bounces

codes = pd.read_csv('C:\\Users\\jlucz\\Desktop\\Projects\\WITI Internship\\SendGrid_ErrorCodes.csv', encoding='latin1')
codes.sort_values(by='Error', inplace=True)
codes = codes['Error']

In [6]:
#Resetting the index so the sorted codes are numbered in order

codes.reset_index(drop=True, inplace=True)
codes

0     250
1     415
2     421
3     429
4     450
5     451
6     452
7     500
8     503
9     550
10    551
11    552
12    553
13    554
14    5xx
Name: Error, dtype: object

In [8]:
#Read in the email list that contains the error message. We need to pull out
#the error code from the reason column

data = pd.read_csv('C:\\Users\\jlucz\\Desktop\\Projects\\WITI Internship\\WITI_SendGrid_BounceList.csv')
data['reason'].head()

0           550 5.1.1 <ee1806@att.com>... User unknown
1    550 5.1.1 <natalie.blake@att.com>... User unknown
2           550 5.1.1 <bc3683@att.com>... User unknown
3                              550 5.1.1 User Unknown 
4                              550 5.1.1 User Unknown 
Name: reason, dtype: object

In [9]:
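#Before iterating through each row in the reason column, we need to make
#sure there are no null values. Filtering to the rows where reason is null
#and counting them shows 21 such rows. count() only counts non-null values,
#which is why reason itself shows 0 while the other columns show 21. We
#need to fix these nulls or the string operations below will raise errors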

data[data['reason'].isna()].count()

status     21
reason      0
email      21
created    21
dtype: int64
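
The same check can be done more directly by summing the null mask for the one column. A minimal sketch on a small made-up frame (the sample rows and addresses are hypothetical, not taken from the WITI bounce list):

```python
import pandas as pd

# Hypothetical sample standing in for the bounce list
sample = pd.DataFrame({
    "reason": ["550 5.1.1 User Unknown", None, None],
    "email": ["a@example.com", "b@example.com", "c@example.com"],
})

# isna() marks missing values as True; summing the mask counts them
missing_reasons = int(sample["reason"].isna().sum())
print(missing_reasons)
```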

In [10]:
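#To fix these null values, we are going to fill the nulls with a string
#value. I chose abc for simplicity: it contains no error code, so these
#rows will simply end up with a blank code. Note that fillna on the whole
#dataframe fills nulls in every column, not only reason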

data.fillna(value='abc', inplace=True)
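
If only the reason column should receive the placeholder, the fill can be limited to that column. This is a sketch of that alternative on hypothetical sample data, not a change to what the cell above does:

```python
import pandas as pd

# Hypothetical sample with a missing value in two different columns
sample = pd.DataFrame({
    "reason": ["550 5.1.1 User Unknown", None],
    "email": ["a@example.com", None],
})

# Fill only the 'reason' column; other columns keep their missing values
sample["reason"] = sample["reason"].fillna("abc")
```

This keeps genuine gaps in other columns visible instead of turning them into the string abc.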

In [11]:
#After running different variations of code, I noticed that hyphens in the
#messages made it difficult to pull the error code out. Using regex, I
#replace each hyphen with a space. This allows us to run a loop to
#retrieve the error codes

regex_2 = re.compile(r'-')  #compile the pattern once, outside the loop
for i in data.index:
    oldstr = data.at[i, 'reason']
    newstr = regex_2.sub(' ', oldstr)
    #write straight into the dataframe; assigning through
    #data['reason'].loc[i] is chained assignment and may not update data
    data.at[i, 'reason'] = newstr
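
The row-by-row loop above can also be written as a single vectorized call with pandas string methods. A minimal sketch, using hypothetical bounce messages:

```python
import pandas as pd

# Hypothetical messages where the code is joined to the next token by a hyphen
sample = pd.DataFrame({
    "reason": ["550-5.1.1 User unknown", "421-4.7.0 Try again later"],
})

# Replace every literal hyphen with a space across the whole column at once
sample["reason"] = sample["reason"].str.replace("-", " ", regex=False)
```

Because `regex=False` treats the hyphen as plain text, no pattern needs to be compiled.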

In [12]:
#Create a new column called code that will house the error code value.
#For each error code listed in "codes", the loop below checks every row of
#the data dataframe for that code with a space directly before or after it.
#If no code is found, the value is left blank. If a message contains more
#than one code, the match that comes last in the sorted codes list is kept

data['code'] = ''
for i in codes:
    for n in data.index:
        x = data.at[n, 'reason']
        #" i " always contains "i ", so two checks cover all three cases
        if (i + " ") in x or (" " + i) in x:
            data.at[n, 'code'] = i
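
As an alternative to the nested loop, the code could be pulled out with one regular expression. This is only a sketch under a few assumptions: the sample messages are hypothetical, the '5xx' entry is dropped because it is a wildcard rather than a literal code, and the first code found in the text is kept, whereas the loop above keeps the last match in sorted order.

```python
import re
import pandas as pd

# Codes from the list above, minus the '5xx' wildcard entry
known_codes = ["250", "415", "421", "429", "450", "451", "452",
               "500", "503", "550", "551", "552", "553", "554"]

# Match a known code only when it is not part of a longer run of digits
pattern = r"(?<!\d)(" + "|".join(map(re.escape, known_codes)) + r")(?!\d)"

# Hypothetical bounce messages, including the 'abc' placeholder
sample = pd.DataFrame({"reason": [
    "550 5.1.1 User Unknown",
    "smtp; 421 4.7.0 Try again later",
    "abc",
]})

# str.extract returns the first match per row, or NaN when nothing matches
sample["code"] = sample["reason"].str.extract(pattern, expand=False).fillna("")
```

The digit lookarounds mean a code does not need a space next to it, so this version would also work on messages where hyphens were not replaced first.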

In [13]:
data['code'].head()

0    550
1    550
2    550
3    550
4    550
Name: code, dtype: object
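
With the code column filled in, the hard vs soft split described at the top can be sketched from the first digit of the code. This assumes the codes follow the SMTP convention, where a 5yz reply is a permanent failure and a 4yz reply is a temporary one; the helper name `classify_bounce` and the `bounce_type` column are hypothetical, and which codes should actually lead to removal is a policy decision for the team.

```python
import pandas as pd

def classify_bounce(code: str) -> str:
    """Label a code by its first digit: 5 = hard, 4 = soft, anything else = other."""
    if code.startswith("5"):
        return "hard"
    if code.startswith("4"):
        return "soft"
    return "other"

# Hypothetical codes, including a blank (no code found) and the '5xx' entry
sample = pd.DataFrame({"code": ["550", "421", "250", "", "5xx"]})
sample["bounce_type"] = sample["code"].map(classify_bounce)

# Rows that are candidates for removal from the database
hard_bounces = sample[sample["bounce_type"] == "hard"]
```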

In [None]:
#Export the data to a CSV for sharing purposes

data.to_csv('Cleaned_BounceList.csv')