# How to clean your mailbox automatically and periodically to reduce gas emissions (thanks to Python)

## Introduction

***The huge amount of mail we receive every day is stored on servers. Storing email is thus responsible for unnecessary gas emissions. A great action we can take is to delete a lot of it. Nonetheless, if we do it by hand, we cannot even select all the mails of a particular kind at once.***

Email marketing and newsletters are great ways for companies to reach their clients. Nonetheless, once these emails become outdated, they can be deleted.

### With this notebook, you will be able to:

1. Store variables in your system environment so that they are not readable in your code (such as your email & password),
2. Access your mailbox (Gmail, Yahoo, Outlook) with Python,
3. Store all your mails in a DataFrame (to get statistics, for example to know which sender is the most frequent),
4. Delete particular types of email according to different criteria,
5. Get familiar with the time library.

### To complete this tutorial, you will need:

1. Internet ;),
2. To enable 2-step verification on your mailbox,
3. **To get an App Password**: you will not be able to access your mail with your raw password alone.

Google's tutorial : https://support.google.com/mail/answer/185833

**NB: The full code and references are at the end. We will go through each step together.**

I recommend running everything in a .py file, as a notebook only accesses the environment variables set in the current working session. You can still run it in an .ipynb, but your password will be written explicitly.

I am not an expert (yet ;)), just a student who loves data and is concerned by the 876389 mails he has in his mailbox. Some steps can be optimized: feel free to suggest anything!

## Step 1 : Set your password & username as environment variable

First: get your App Password in your mail settings (it takes 2 minutes). It looks like: oiueronvkncdiuo.

Google's tutorial : https://support.google.com/mail/answer/185833

### Explicitly: accessible in a notebook.

It is possible to set a variable in the current environment like this:

In [11]:
import os

os.environ['google_mail']='mymail@gmail.com'
os.environ['google_psw']='my_google_app_key_password'

And to get it:

In [10]:
myMail=os.environ.get('google_mail')
print(myMail)

mymail@gmail.com


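And to access all the variables, just inspect `os.environ` (for example with `print(dict(os.environ))`; be aware that this also displays your password).

Nonetheless, a variable set this way is only accessible in this environment during this session, not at any time. Thus, it can be better to set it in bash.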

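### Implicitly: in bash

On a Mac, in your terminal, write the following:

1. cd
2. nano .bash_profile (you can also write 'open .bash_profile', which will open the file in another window; if your terminal uses zsh, the default on recent macOS versions, edit .zshrc instead)
3. write at the end:

export google_mail=mygmail@gmail.com

export google_psw=app_key_password (such as 'ldkfjdslmkfjsddisfj')

...

4. ctrl+X to close the editor
5. Y to confirm that you want to save the changes
6. Enter to validate the file name and exit.

You can add as many variables as you like! The names must match the ones read in the Python code (google_mail and google_psw here). The new variables are read when a new terminal session starts (or after running 'source .bash_profile'). Let's check that it worked.

**If you run the code in a notebook, you will only access the environment variables of the current notebook session: email & password will have to be written explicitly at the beginning. In a .py file, no worries!**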
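To check that the variables are visible from Python, a small helper can read them and fail loudly when one is missing. This is only a sketch: the function name `get_credentials` is hypothetical, and it assumes the variable names `google_mail` and `google_psw` used in the rest of this tutorial.

```python
import os

def get_credentials(mail_var="google_mail", psw_var="google_psw"):
    """Return (email, app_password) read from the environment (hypothetical helper)."""
    mail = os.environ.get(mail_var)
    psw = os.environ.get(psw_var)
    # Collect the names of the variables that are missing or empty.
    missing = [name for name, value in ((mail_var, mail), (psw_var, psw)) if not value]
    if missing:
        # Fail early with a readable message instead of a cryptic login error later.
        raise RuntimeError("Missing environment variable(s): " + ", ".join(missing))
    return mail, psw
```

In a .py file, `my_email, password = get_credentials()` could then replace the two `os.environ.get` calls, and a `RuntimeError` would tell you which variable was not exported.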

## Step 2 : Let's delete our unnecessary emails!

Let's dive into the Python code.

### 1. Libraries

In [4]:
import smtplib, sys, imaplib, os, email
from pandas import DataFrame, Series
from email.header import Header, decode_header, make_header
import matplotlib.pyplot as plt
import datetime
import pandas as pd
import time
import numpy as np

### 2. Get the environment variables.

In [200]:
os.environ["google_mail"]='mygoogle@gmail.com' #On a private notebook, you can write it explicitly. 
os.environ["google_psw"]='efoizefoziufz'

In [205]:
my_email=os.environ.get('google_mail')
password=os.environ.get('google_psw')

### 3. Select the right IMAP address.

**You can find your IMAP address right here:**

https://www.systoolsgroup.com/imap/

Google's is imap.gmail.com, Yahoo's is imap.mail.yahoo.com...
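If you clean several mailboxes, the host can be picked from the address itself. The sketch below only knows the two hosts quoted above; the names `IMAP_HOSTS` and `imap_host_for` are hypothetical, and any other provider has to be added by hand.

```python
# Hosts quoted in this tutorial; any other provider must be added by hand.
IMAP_HOSTS = {
    "gmail.com": "imap.gmail.com",
    "yahoo.com": "imap.mail.yahoo.com",
}

def imap_host_for(address):
    """Return the IMAP host for an email address (hypothetical helper)."""
    domain = address.rsplit("@", 1)[-1].lower()
    try:
        return IMAP_HOSTS[domain]
    except KeyError:
        raise ValueError("No IMAP host known for domain: " + domain)

print(imap_host_for("mymail@gmail.com"))
```

The returned string can be passed as the first argument of `imaplib.IMAP4_SSL`.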

In [206]:
imap = imaplib.IMAP4_SSL("imap.gmail.com", port=993)

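### 4. Try your connection!

In [207]:
def test_connection():
    """ Log in to the IMAP server and report whether it worked. """
    try :
        rv,data=imap.login(my_email, password)
        print(rv,data)
        print('Successfully completed mail.login')
    except (imaplib.IMAP4.error, OSError) as error: #wrong credentials, already logged in, or network problem
        print('login FAILED!', error)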
    

In [208]:
test_connection()

login FAILED!


### 5. Get familiar with the folder names of your mailbox.

It is necessary to know the exact names. My folders are sometimes in French, sometimes in English!
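The folder names can be read from the server with `imap.list()`, which returns one raw line per folder. Here is a minimal sketch to extract the names, assuming the usual `(flags) "delimiter" "name"` layout of each line. The helper names and the sample lines in the comments are illustrative only, and a server that answers with another layout (for instance an unquoted `NIL` delimiter) would need an extra case.

```python
import re

# One LIST response line usually looks like: (\HasNoChildren) "/" "INBOX"
LIST_LINE = re.compile(r'\((?P<flags>[^)]*)\) "(?P<delimiter>[^"]*)" (?P<name>.*)')

def parse_folder_line(line):
    """Split one raw line returned by imap.list() into (flags, delimiter, name)."""
    match = LIST_LINE.match(line.decode("utf-8"))
    if match is None:
        raise ValueError("Unexpected LIST line: %r" % line)
    flags, delimiter, name = match.group("flags", "delimiter", "name")
    return flags.split(), delimiter, name.strip('"')

def folder_names(imap):
    """Return the folder names of an already logged-in imaplib connection."""
    status, lines = imap.list()
    if status != "OK":
        return []
    return [parse_folder_line(line)[2] for line in lines if isinstance(line, bytes)]
```

After a successful login, `print(folder_names(imap))` shows the names exactly as they must be written in `mailbox_selection`. Folder names with accents may appear encoded (IMAP uses a modified UTF-7 for them), so copy them as they are displayed.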

### 6. Select the mailbox you want to clean. 

In [12]:
mailbox_selection='Inbox'

### 7. Try the connection to the specific mailbox.

In [13]:
def mailbox_connection():

    """ The function returns the list of the UIDs of all emails in the selected mailbox. """

    msgList=[] #stays empty if the mailbox cannot be selected

    status,msgs=imap.select(mailbox_selection) #not read-only: the same selection is used later to move mails to the trash. 
    print('The status for the mailbox selection is :' + status)

    if status=='OK':
        typ, data=imap.uid('search',None,'ALL')
    
        #Count the number of messages
        msgList=data[0].split()
        numMsg=len(msgList)
        print('Total %s messages in mailbox' %str(numMsg))
    return msgList

In [14]:
mailbox_connection()

The status for the mailbox selection is :OK
Total 3085 messages in mailbox


[b'1',
 b'3',
 b'4',
 b'5',
 b'7',
 b'8',
 b'9',
 b'12',
 ...]

There were 4,000 mails stored in my personal mailbox, even though I cleaned it a few years ago and delete a lot of them every day!

### 7.bis : if the server logs out automatically.

The server logs out every 30 minutes. Run the following function to connect again.

In [192]:
def reconnect():
    """ Open a new connection, log in and select the mailbox again. """
    global imap #replace the connection that has been closed by the server
    imap = imaplib.IMAP4_SSL("imap.gmail.com", port=993)
    test_connection()
    mailbox_connection()

In [193]:
reconnect()

OK [b'LOGIN completed']
Successfully completed mail.login
The status for the mailbox selection is :OK
Total 2990 messages in mailbox


### 8. Generate the dataframe of your mails.

Let's dive into the technical aspects now. The following function creates a dataframe with the message id, the date, the subject and the sender (name and address).

**The server logs out automatically after 30 minutes (in some cases 14!). This is why we have to exit the while loop before 29 minutes.**

"percFirstMail" is the fraction of the mailbox at which the storage begins. For example: percFirstMail=0 will store mails in the dataframe from the first one until the 29 minutes are over, and percFirstMail=0.4 will begin at 40% of your mailbox.

You can generate a new df once the first one has helped you to clean your mailbox. In my case, I had:

Total 3960 messages in mailbox
43.63636363636363 % of the mailbox has been stored in "msgData" for a duration of 29.02558348576228 minutes.
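To make these numbers concrete, here is a small worked example. It assumes that the run above used the default percFirstMail=0.4, which the output does not state; `start_index` is a hypothetical helper that mirrors the index computation of the function below.

```python
import math

def start_index(perc_first_mail, len_mailbox):
    """Index of the first mail to fetch, for a given fraction of the mailbox."""
    return int(math.floor(perc_first_mail * len_mailbox))

len_mailbox = 3960                  # "Total 3960 messages in mailbox"
first = start_index(0.4, len_mailbox)                    # assumed percFirstMail=0.4
stored = round(len_mailbox * 43.63636363636363 / 100)    # mails stored during the run
minutes = 29.02558348576228

print("first index fetched:", first)                # 1584
print("mails stored:", stored)                      # 1728
print("last index fetched:", first + stored - 1)    # 3311
print("mails per minute: %.1f" % (stored / minutes))
```

Under that assumption, the run covered the mails from index 1584 to index 3311, so the first 40% of the mailbox and the last few hundred mails were not stored yet.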

In [16]:
## Generate the df.
def mails_to_df(percFirstMail=0.4): #fraction of the mailbox at which we start
    #Set up dictionaries to hold the data, keyed by message UID
    subjectDict={}
    fromDict={}
    dateDict={}
    
    imap.noop() #keeps the connection alive. The server still logs out automatically after about 30 minutes. 
    
    time_duration = 29*60 # the loop to store mails will finish before the 30 minutes. Otherwise, we lose the connection. 
    time_start = time.time() #current time, in seconds
  
    messages=mailbox_connection() #list of mail UIDs. 
    lenMailbox=len(messages)

    i=int(np.floor(percFirstMail*lenMailbox)) # the index of the mail we begin with (an int, so that it can index the list).

    while (time.time() < time_start + time_duration) and (i<lenMailbox): #before 29 minutes and until we have stored the last email.

        num=messages[i]
        i+=1 #always move on to the next mail, even if this one fails. Otherwise the loop would never end.

        rv, data = imap.uid('fetch',num,'(RFC822)')

        if rv != 'OK' or data[0] is None:
            print("ERROR getting message: ", num)
            continue

        msg = email.message_from_bytes(data[0][1])
        fromStr=str(make_header(decode_header(msg['From'] or ''))) #'' if the header is missing
        fromDict[num]=fromStr

        dateDict[num]=msg['Date']

        if msg['Subject']: #if the message has a subject. 

            subjectStr=str(make_header(decode_header(msg['Subject'])))
            print ('Message: %s, Subject: %s Date: %s' %(num, subjectStr, str(msg['Date'])))

        else: 

            subjectStr=''
            print ('From:', fromStr) 
            
        subjectDict[num]=subjectStr 
        
    imap.noop()    

    msgDict={'Subject':subjectDict,'From':fromDict, 'Date':dateDict}
    df=DataFrame(msgDict)
    
    print('\n')
    if lenMailbox: #avoid a division by zero on an empty mailbox
        print(len(df)/lenMailbox*100, '% of the mailbox has been stored in "msgData" for a duration of', (time.time()-time_start)/60, "minutes.")

    return df

The next call can last up to 29 minutes.

I personally recommend saving this df: we don't want to lose a 30-minute-long run!

In [19]:
msgData=mails_to_df() #builds the dataframe used in the rest of the notebook
msgData.to_csv('firstMsgData.csv')

### 9. Clean the msgData dataframe

As I have read on a Stack Overflow thread: 'welcome to hell'.

Indeed, we have to deal with... date formats! Exciting!

I've listed the types of date format I had to deal with. You can always add "if" clauses.

In [178]:
#Get a clean df with the message id, the subject, the sender, the email address and the right date format. 
def cleanDf(dataframe):

    """ It allows us to separate the name from the address in the df and to extract the dates. """

    address_list=[]
    from_list=[]
    date_list=[]

    df=dataframe.copy()
    df.dropna(how='any',inplace=True)
    df.reset_index(inplace=True)
    

    len1=len("Mon, 4 Aug 2008 21:09:52 +0000 (GMT)")
    len2=len("Mon, 21 Aug 2008 21:09:52 +0000 (GMT)")
    len3=len("Wed, 26 Jul 2017 12:11:31 +0200 (CEST)")
    len8=len("Thu, 07 Nov 2019 14:58:13 GMT")

    len9=len("Tue,  2 Jun 2020 18:17:24 +0000 (UTC)")

    len4=len("Mon, 4 Aug 2008 21:09:52 +0000")
    len5=len("Mon, 21 Aug 2008 21:09:52 +0000")

    len6=len("4 Sep 2019 10:42:52 -0400") 
    len7=len("11 Sep 2019 10:42:52 -0400")

    for i in df.From: #for the address : we extract the name of the sender and the email address. 
        
        if '<' in i:
            pos1=i.find('<')
            pos2=i.find('>')

            expeditor=i[:pos1]
            address=i[pos1+1:pos2]

            address_list.append(address)
            from_list.append(expeditor)

        else:
            address_list.append(i)
            from_list.append(i)

    for i in df.Date: #to get the date. Each header ends with ±XXXX, (GMT) or (PDT), but the layout can differ. 
        
        if len(i) in [len1,len2,len4,len5,len8]:
            pos1=i.find(',')
            pos2=i.find(':')
            
            dateFormat=i[pos1+2:pos2-3].strip() #strip() handles a padded day such as ",  2 Jun"
            date_list.append(dateFormat)

        elif len(i) in [len6,len7]:
            pos=i.find(':')
            
            dateFormat=i[:pos-3].strip() #no weekday here: the date starts at the beginning of the string
            date_list.append(dateFormat)

        elif len(i) in [len3,len9]:
            pos1=i.find(',')
            pos2=i.find(':')

            dateFormat=i[pos1+1:pos2-3].strip() #works with one or two spaces after the comma
            date_list.append(dateFormat)

        else:
            print(i) #unknown format: print it so that a new clause can be added
            date_list.append(None) #keeps date_list aligned with the rows of df

    df['Address']=address_list
    df['From']=from_list
    df['Date']=date_list
    df['Date']= pd.to_datetime(df['Date'],errors='coerce', format='%d %b %Y')
    
    return df

In [180]:
cleanMsgData=cleanDf(msgData)
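The standard library can also parse these headers directly, which may spare some of the "if" clauses. The sketch below relies on `email.utils.parsedate_to_datetime`; the helper name `parse_mail_date` is hypothetical, and the sample strings are the formats listed in `cleanDf` plus one deliberately invalid value.

```python
from email.utils import parsedate_to_datetime

def parse_mail_date(raw):
    """Return a datetime.date for a 'Date' header, or None if it cannot be parsed."""
    try:
        return parsedate_to_datetime(raw).date()
    except (TypeError, ValueError):
        # Depending on the Python version, an unparsable value raises one or the other.
        return None

samples = [
    "Mon, 4 Aug 2008 21:09:52 +0000 (GMT)",
    "Tue,  2 Jun 2020 18:17:24 +0000 (UTC)",
    "Thu, 07 Nov 2019 14:58:13 GMT",
    "4 Sep 2019 10:42:52 -0400",
    "not a date",
]
for raw in samples:
    print(repr(raw), "->", parse_mail_date(raw))
```

Applied to the dataframe, something like `msgData['Date'].map(parse_mail_date)` would give one date (or None) per mail without slicing the strings by hand.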

### 10. Know your most frequent sender...
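Counting the senders only takes a few lines. Here is a minimal sketch with the standard library; `top_senders` is a hypothetical helper and the addresses are made up for the illustration.

```python
from collections import Counter

def top_senders(addresses, n=10):
    """Return the n most frequent addresses as (address, count) pairs."""
    return Counter(addresses).most_common(n)

# Made-up addresses, for illustration only.
sample = [
    "news@shop.com", "friend@mail.com", "news@shop.com",
    "bank@mail.com", "news@shop.com", "friend@mail.com",
]
print(top_senders(sample, n=2))
# [('news@shop.com', 3), ('friend@mail.com', 2)]
```

With the cleaned dataframe, `top_senders(cleanMsgData['Address'])` should give the same kind of ranking as `cleanMsgData['Address'].value_counts()`.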

### 11. ...and visualize it. 

In [183]:
##Plot of the n top senders
def plot_frequency(dataframe,top_n=10, addressOrFrom='Address'):
    fig, ax = plt.subplots(figsize=(7,4))
    df=DataFrame(dataframe[addressOrFrom].value_counts()[:top_n])
    df.plot( kind='barh',legend = False, ax=ax)
    ax.set_xlabel('Number of mails')
    ax.set_ylabel('Sender')
    plt.show()

plot_frequency(cleanMsgData,top_n=20,addressOrFrom='From')

### 12. Build the function which will transfer selected emails to the trash folder

Putting mails in the trash folder is not the same as deleting them forever. With Google, the mails stored in the trash are kept for 30 days and only then deleted permanently.

You thus have a backup if you deleted the wrong mail.

In [185]:
trashFolderName="Trash"

In [186]:
def transferFiles(deleteSeries):

    """ Copy every mail whose UID is in the index of deleteSeries to the trash, then remove it from the mailbox. """

    count=0

    for uid in deleteSeries.index:

        rv, data = imap.uid('COPY',uid,trashFolderName) # we copy the mail to the trash folder
        print ('The status for copying', uid,' to the trash is :' + rv)

        if rv != 'OK':
            print("ERROR copying message: ", uid)

        else: 
            print('moving msg %s' %uid)
            count+=1
            mov, data = imap.uid('STORE', uid , '+FLAGS', '(\\Deleted)') #This flags the mail for deletion in the current mailbox. The copy stays in the trash. 
            print ('The status for deleting', uid,' is :' + mov)
            print('\n')
            imap.expunge() #This permanently removes the flagged mail from the current mailbox.
            
    print( count, "emails have been removed.")
    print('\n')
    return count

### 13. Put the top N senders in the trash

**Attention! Only run the following function if you want the mails of the top N senders in the trash.**

If your beloved partner is in the top and you don't want to erase everything, let's go to the next function ;-) .

In [187]:
def transferTopN(n=10):

    allCounts=0
    fromSeries=msgData['From']
    topFrom=fromSeries.value_counts()

    for i in range(min(n,len(topFrom))): #no more than the number of distinct senders
        maskFrom=topFrom.index[i] #name of the sender (string)
        mask=fromSeries.isin([maskFrom]) #Series of booleans
        seriesToDelete=fromSeries[mask] #Series indexed by the UIDs of the mails of this sender
        allCounts+=transferFiles(seriesToDelete)

    print("In total,", allCounts, 'mails have been removed.')

### 14. Transfer mails from the senders you chose. 

List the addresses of the senders whose mails you want to remove, for example picked from the top senders above. 

In [209]:
listToSuppress=['firstMail@mail.com',
'secondMail@mail.com']

This function will transfer to the trash all mails which are dated before the time delta and come from the specific list defined above.

In [190]:
## transfer mail according to specific criteria. 
def transferCriteria(timeDelta=30): #All mails older than 30 days. 

    """ This function transfers to the trash all mails which are dated before the time delta and come from the senders of listToSuppress. It returns the number of mails removed. """

    today = datetime.date.today()
    lastMonth = today - datetime.timedelta(days=timeDelta)
    date=lastMonth.strftime("%d-%b-%Y") #the date 30 days ago, because timeDelta=30. 

    count=0 #number of mails deleted. 

    for exped in listToSuppress:
        
        print(exped)
        print('\n')

        try: 

            status, messages_id_list = imap.uid('search', None, 'BEFORE', date ,'FROM' , exped) 
            print ('The status for these search criteria is :' + status)
            print('\n')

            messages = messages_id_list[0].split() #convert the string of ids to a list of email ids
            print(messages)
            print('\n')

            for uid in messages:
                rv, data = imap.uid('COPY',uid,trashFolderName)
                print ('The status for copying', uid,'to the trash is :' + rv)
                if rv != 'OK':
                    print("ERROR copying message: ", uid)
                else: 
                    print('moving msg %s' %uid)
                    mov, data = imap.uid('STORE', uid , '+FLAGS', '(\\Deleted)')
                    print ('The status for deleting', uid,' is :' + mov) #status of the STORE command, not of the search
                    count+=1
                    print('\n')
                    imap.expunge()

        except imaplib.IMAP4.error as error: #any error reported by the IMAP server or by imaplib
            print('The procedure did not succeed.', error)
    print("In total,", count, 'mails have been removed. It is about', count*10, 'grams of CO2 avoided per year.')
    return count

In total, 874 mails have been removed in 12 minutes...
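One detail of the search is worth a note: the IMAP SEARCH command expects dates written as DD-Mon-YYYY with English month abbreviations, while `strftime("%b")` follows the locale of the machine if it has been changed. The sketch below builds the criteria without depending on the locale; `imap_date` and `search_criteria` are hypothetical helpers, and the fixed `today` is only there to make the example reproducible.

```python
import datetime

# IMAP expects English month abbreviations, whatever the locale of the machine.
MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]

def imap_date(day):
    """Format a datetime.date as DD-Mon-YYYY, the date syntax of IMAP SEARCH."""
    return "%02d-%s-%04d" % (day.day, MONTHS[day.month - 1], day.year)

def search_criteria(sender, time_delta=30, today=None):
    """Return the arguments to pass after imap.uid('search', None, ...)."""
    today = today or datetime.date.today()
    limit = today - datetime.timedelta(days=time_delta)
    # Quote the address so that the server reads it as one string.
    return ["BEFORE", imap_date(limit), "FROM", '"%s"' % sender]

print(search_criteria("firstMail@mail.com", today=datetime.date(2024, 3, 1)))
# ['BEFORE', '31-Jan-2024', 'FROM', '"firstMail@mail.com"']
```

The result can be unpacked into the call used in `transferCriteria`, for example `imap.uid('search', None, *search_criteria(exped))`.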

**DO NOT FORGET that you might have to run it again if you did not store all your mails in msgData. If only 40% of your mailbox has been stored in msgData, change the parameter percFirstMail of mails_to_df() to 0.6.**

### 15. Close connection

In [210]:
def closeConnection():
    imap.close() #Close currently selected mailbox
    imap.logout()

### 16. Run the notebook every month with Kaggle. 

Kaggle has a feature to run a notebook you wrote at the frequency you choose. Thus, you can copy/paste (or upload) this notebook, and set the frequency. 

Moreover, I planned to send an e-mail every time the notebook is run:

(basically, IMAP is a protocol to read email, and SMTP a protocol to send it.)

In [None]:
from smtplib import SMTP_SSL, SMTP_SSL_PORT


SMTP_HOST = 'smtp.gmail.com' #for Google
SMTP_USER=os.environ['google_mail']
SMTP_PASS=os.environ['google_psw'] #your App Password (2-step verification)


# Craft the email by hand
# "count" must hold the number of mails removed (set it before running this cell)
from_email = SMTP_USER  
to_emails = [SMTP_USER] 
body = 'You just erased ' + str(count) + ' mails from your '+ my_email + ' mailbox, which represents ' + str(10*count) + ' grams of CO2 avoided per year. The top senders are ' +  str(cleanMsgData['Address'].value_counts()[:25])
headers = f"From: {from_email}\r\n"
headers += f"To: {', '.join(to_emails)}\r\n" 
headers += f"Subject: Hello\r\n"
email_message = headers + "\r\n" + body  # Blank line needed between headers and body

# Connect, authenticate, and send mail
smtp_server = SMTP_SSL(SMTP_HOST, port=SMTP_SSL_PORT)
smtp_server.set_debuglevel(1)  # Show SMTP server interactions
smtp_server.login(SMTP_USER, SMTP_PASS)
smtp_server.sendmail(from_email, to_emails, email_message)

# Disconnect
smtp_server.quit()

Do not forget to erase this mail after reading it ;-).
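Instead of crafting the headers by hand, the message can also be built with `email.message.EmailMessage`, which takes care of the blank line and of the encoding of accented characters. This is only a sketch: `build_report` is a hypothetical helper, the subject is made up, and the 10 grams per mail is the estimate used in this tutorial.

```python
from email.message import EmailMessage

def build_report(sender, recipients, count, grams_per_mail=10):
    """Build the report mail; grams_per_mail=10 is the estimate used in this tutorial."""
    message = EmailMessage()
    message["From"] = sender
    message["To"] = ", ".join(recipients)
    message["Subject"] = "Mailbox cleaning report"
    message.set_content(
        "You just erased %d mails, which represents about %d grams of CO2 "
        "avoided per year." % (count, count * grams_per_mail)
    )
    return message

report = build_report("mymail@gmail.com", ["mymail@gmail.com"], count=874)
print(report["Subject"])
print(report.get_content())
```

Once logged in, `smtp_server.send_message(report)` can replace the `sendmail` call.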

# Conclusion

In this notebook, we learned:

1. how to set up the App Password (2-step verification),
2. how to connect to your mailbox and choose the folder,
3. how to create a cleaned df with the email contents #weLoveTimeFormats,
4. how to know who sends us the most emails,
5. how to transfer to the trash the mails of the top N senders, and of the chosen ones in a list we created,
6. how to send an email with Python.


The quantity of gas emissions avoided might seem low, but it is not. I encourage you to share this notebook or run it for the people you know: you only need their email and an App Password!

# Credits

All the links that helped me to build the code, apart from the official documentation, which you should always refer to! ;-)

https://www.techgeekbuzz.com/how-to-delete-emails-in-python/

https://www.systoolsgroup.com/imap/

https://www.linkedin.com/pulse/reduce-your-email-inbox-30-2-hours-simple-data-dr-darren-obrigkeit/

https://www.tiger-222.fr/?d=2016/01/21/16/35/09-python-et-imap-exemple-concret

https://www.rfc-editor.org/rfc/rfc3501#section-6.4.4

https://carbonliteracy.com/the-carbon-cost-of-an-email/