# Spam Emails Parsing & Pattern Analysis


## Introduction

Spam email remains a common attack vector. Most email services include spam filters that block the bulk of messages with commercial, fraudulent, or malicious content. This analysis explores how the features of commercial spam differ from those of malicious spam.

## Get Email Header Data

This part of the script collects email headers for further analysis.
The script uses the ```imaplib``` library to fetch email headers from the Spam folder of a Gmail account. For the script to work, IMAP access must be enabled in Gmail; to enable it, see these [instructions](https://support.google.com/mail/answer/7126229). You also need to either turn off 2-Step Verification for your Google account or set up an app password [here](https://myaccount.google.com/u/1/security).

### Set up connection

1. Import the required libraries.

In [None]:
import getpass
import imaplib
import email
from email.parser import HeaderParser
from email.header import decode_header
import re
import csv
import pandas as pd
from bs4 import BeautifulSoup
#from ipywidgets import interact, interactive, fixed, interact_manual
#import ipywidgets as widgets
import requests
import json

2. Log in to Gmail.

In [None]:
# Prompt for the account name and password (the password is not echoed).
un = input('Email: ')
pw = getpass.getpass('Password: ')
# Gmail serves IMAP over SSL on port 993.
conn = imaplib.IMAP4_SSL(host='imap.gmail.com', port=993)
conn.login(un,pw)
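
The next step selects the mailbox by the fixed name `[Gmail]/Spam`. Gmail can expose its system folders under localized names, so that name may not exist on every account. The sketch below is one possible way to locate the spam folder from the output of `conn.list()` by looking for the `\Junk` special-use flag. The helper name `find_junk_mailbox` is hypothetical and not part of the original script, and the sketch assumes the server reports that flag.

```python
import re

# One IMAP LIST response line has the shape: (flags) "delimiter" name
_LIST_LINE = re.compile(
    r'\((?P<flags>[^)]*)\)\s+(?:"(?P<delim>[^"]*)"|NIL)\s+(?P<name>.+)'
)

def find_junk_mailbox(list_lines):
    """Return the first mailbox name flagged \\Junk, or None if there is none.

    The name is returned exactly as the server sent it (quotes included),
    so it can be passed straight to conn.select().
    """
    for raw in list_lines:
        line = raw.decode('utf-8', 'ignore') if isinstance(raw, bytes) else raw
        match = _LIST_LINE.match(line)
        if match and '\\junk' in match.group('flags').lower().split():
            return match.group('name').strip()
    return None
```

A possible use, falling back to the fixed name when no flag is found: `status, boxes = conn.list()` followed by `conn.select(find_junk_mailbox(boxes) or '[Gmail]/Spam')`.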

3. Get the email headers and add derived attributes.

In [None]:
#conn.select('[Gmail]/Index')
conn.select('[Gmail]/Spam')
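# Use a name other than `type` so the built-in is not shadowed.
status, emaildata = conn.search(None, 'ALL')
emaillist=emaildata[0].split()
parser = HeaderParser()
header_list=[]
msg_text=''
# key.txt holds the ipstack access key on its first line; strip the trailing newline.
with open("key.txt","r") as key_f:
    key=key_f.readline().strip()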
for a in emaillist:
    # Fetch the full raw message (headers and body) for this message number.
    status, emaildata2 = conn.fetch(a, '(RFC822)')
    h = parser.parsestr(emaildata2[0][1].decode('utf-8','ignore'))
    for txt in h.walk():
        if not txt.is_multipart():
            msg_text = txt.get_payload(decode=True).decode('utf-8','ignore')
    soup = BeautifulSoup(msg_text, "lxml")
    msg_text= soup.get_text(strip=True)
    header = {}
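    # decode_header returns (value, charset) pairs; the value may be bytes.
    subject, charset = decode_header(h['Subject'] or '')[0]
    if isinstance(subject, bytes):
        subject = subject.decode(charset or 'utf-8', 'ignore')
    header['Subject']=subject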
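    # h[...] returns None for a missing header, so fall back to '' before stripping.
    header['ARC-Authentication-Results']=(h['ARC-Authentication-Results'] or '').strip()
    header['Return-Path']=(h['Return-Path'] or '').strip()
    # An empty Return-Path ('<>') has no domain part.
    rp_domain=re.findall(r'\b@\S*\b',header['Return-Path'])
    header['Return-Path Address']=rp_domain[0] if rp_domain else ''
    header['Received']=(h['Received'] or '').strip()
    header['Received-SPF']=(h['Received-SPF'] or '').strip()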
    header['Date']=pd.to_datetime(h['Date'].strip())
    if 'Reply-To' in h:
        header['Reply-To']=h['Reply-To'].strip()
        header['Reply-To Address']=re.findall(r'\b@\S*\b',str(h['Reply-To'].strip()))[0]
    else:
        header['Reply-To']=''
        header['Reply-To Address']=''
    header['Content-Type']=h['Content-Type'].strip().split(';')[0]
    header['From']=h['From'].strip()
    if re.findall(r'\b@\S*\b',str(h['From'].strip())):
        header['From Address']=re.findall(r'\b@\S*\b',str(h['From'].strip()))[0]
    else:
        header['From Address']=''
    #header['Sender-ip']= re.findall(r'(?:(?:25[0-5]|2[0-4]\d|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4]\d|[01]?\d\d?)',str(header))
    header['Message']= len(msg_text)
    if re.findall(r'\bip=\S*\b',str(header)):
        header['IP']=re.findall(r'\bip=\S*\b',str(header))[0].split('=')[1]
    else:
        header['IP']=''
    if re.findall(r'\bspf=\S*\b',str(header)):
        header['SPF']=re.findall(r'\bspf=\S*\b',str(header))[0].split('=')[1]
    else:
        header['SPF']=''
    if re.findall(r'\bdmarc=\S*\b',str(header)):
        header['DMARC']=re.findall(r'\bdmarc=\S*\b',str(header))[0].split('=')[1]
    else:
        header['DMARC']=''
    if re.findall(r'\bdkim=\S*\b',str(header)):
        header['DKIM']=re.findall(r'\bdkim=\S*\b',str(header))[0].split('=')[1]
    else:
        header['DKIM']=''
    
    # Look up the sender IP's location with the ipstack API.
    address = "http://api.ipstack.com/"+header['IP']+"?access_key="+key.strip()
    response = requests.get(address)
    iplist = json.loads(response.text)
    header['Country']= iplist.get('country_name')
    header['Region']= iplist.get('region_name')
    header['City']= iplist.get('city')
    # 1 for IPv6, 0 for IPv4, empty when the lookup returned no type,
    # so every record carries the same set of keys.
    if iplist.get('type')== 'ipv6' :
        header['IPv6 Indicator']= 1
    elif iplist.get('type')== 'ipv4' :
        header['IPv6 Indicator']= 0
    else:
        header['IPv6 Indicator']= ''
    header_list.append(header)
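
The loop above repeats the same two patterns several times: pulling the `@domain` part out of an address header, and reading the verdict after `spf=`, `dkim=`, or `dmarc=`. As a minimal sketch, those patterns could be factored into two small helpers. The names `address_domain` and `auth_verdict` are hypothetical, and the use of `email.utils.parseaddr` in place of the original regular expression is a substitution, not the script's own method.

```python
import re
from email.utils import parseaddr

def address_domain(header_value):
    """Return the '@domain' part of an address header, or '' if there is none."""
    if not header_value:
        return ''
    # parseaddr splits 'Display Name <user@host>' into (name, address).
    _, addr = parseaddr(header_value)
    return '@' + addr.rsplit('@', 1)[1] if '@' in addr else ''

def auth_verdict(text, mechanism):
    """Return the word following 'mechanism=' (for example 'pass'), or ''."""
    pattern = r'\b' + re.escape(mechanism) + r'=(\w+)'
    match = re.search(pattern, text or '', re.IGNORECASE)
    return match.group(1).lower() if match else ''
```

With these helpers, a line such as `header['From Address'] = address_domain(h['From'])` would cover both the present and the missing-header case without a separate `if`/`else`.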

4. Write the data to a CSV file.

In [None]:
keys = header_list[0].keys()
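fn = input('File Name: ')
# newline='' keeps the csv module from writing blank rows on Windows;
# utf-8 keeps non-ASCII subjects intact.
with open(fn+'.csv', 'w', newline='', encoding='utf-8') as output_file: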
    dict_writer = csv.DictWriter(output_file, keys)
    dict_writer.writeheader()
    dict_writer.writerows(header_list)
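
The cell above takes its column names from the first record only. If a later record ever carries a key that the first one lacks, `csv.DictWriter` raises a `ValueError` with its default settings. The sketch below shows one way to guard against that by collecting the union of keys across all records and filling gaps with an empty string. The function name `write_rows` is hypothetical and not part of the original script.

```python
import csv
import io

def write_rows(rows, output_file):
    """Write dict records to output_file as CSV and return the column order."""
    # Collect every key seen in any record, preserving first-seen order.
    fieldnames = []
    for row in rows:
        for key in row:
            if key not in fieldnames:
                fieldnames.append(key)
    # restval fills in columns that a given record does not have.
    writer = csv.DictWriter(output_file, fieldnames=fieldnames, restval='')
    writer.writeheader()
    writer.writerows(rows)
    return fieldnames
```

It accepts any writable text file object, so it could be called as `write_rows(header_list, output_file)` inside the `with` block, or with an `io.StringIO()` buffer for a quick check.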