The standard-library `email` package looks like a good way to read the emails:

https://docs.python.org/ja/3/library/email.html

In [1]:
from email import message_from_file
from email.header import decode_header

In [2]:
some_file = "./samples/easy_ham/2170.78c282a5e417d6d231dc75aa8588ebb7"

In [3]:
message = None
with open(some_file, mode="r") as file:
  message = message_from_file(file)

In [4]:
type(message)

email.message.Message
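
The cell above opens the file in text mode, which depends on the default encoding and can fail on bytes that are not valid in it. As a hedged alternative, the sketch below reads the file in binary mode with `message_from_binary_file` and leaves decoding to the `email` package. `load_message` and the temporary sample file are hypothetical names used only for illustration; they are not part of the notebook.

```python
import tempfile
from email import message_from_binary_file
from pathlib import Path


def load_message(path):
    """Parse one message from disk without assuming a text encoding."""
    with open(path, mode="rb") as f:
        return message_from_binary_file(f)


# Illustration with a throwaway file that contains a non-UTF-8 byte in the body.
with tempfile.TemporaryDirectory() as tmp:
    sample_path = Path(tmp) / "sample.eml"
    sample_path.write_bytes(b"Subject: hello\n\nbody \xe9\n")
    loaded = load_message(sample_path)

subject = loaded["Subject"]
raw_body = loaded.get_payload(decode=True)  # bytes, not yet decoded to str
```

The raw bytes survive parsing, so the charset can be chosen later per message.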

## Processing headers

In [5]:
decode_header(message["Content-Type"])

[('text/plain; encoding=utf-8', None)]

In [8]:
decode_header(message["X-Spam-Level"])

[('', None)]
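
`decode_header` returns a list of `(value, charset)` pairs. For plain headers like the two above the value is a `str` and the charset is `None`, but for RFC 2047 encoded words the value is `bytes`. A minimal sketch of joining the pairs back into one string follows; `header_to_str` is a hypothetical helper, and falling back to ASCII with replacement characters is an assumption, not something the notebook specifies.

```python
from email.header import decode_header


def header_to_str(value):
    """Join the (value, charset) pairs from decode_header into one str."""
    parts = []
    for chunk, charset in decode_header(value):
        if isinstance(chunk, bytes):
            try:
                parts.append(chunk.decode(charset or "ascii", errors="replace"))
            except LookupError:
                # Charset name not known to Python: fall back to ASCII.
                parts.append(chunk.decode("ascii", errors="replace"))
        else:
            parts.append(chunk)
    return "".join(parts)


plain = header_to_str("text/plain; encoding=utf-8")
encoded = header_to_str("=?iso-8859-1?q?caf=E9?=")
```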

If the headers are going to be used as features, we need to iterate over them.

~~HeaderParser can return the header keys, so use that.~~
https://docs.python.org/3/library/email.parser.html

`Message.items()` already returns the headers, so that is enough.
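
As one possible way to turn `items()` into features, the sketch below counts how often each header name occurs, so repeated headers such as `Received` become a number. `header_features` and the inline sample message are hypothetical; which header-derived features are actually useful is left open here.

```python
from collections import Counter
from email import message_from_string


def header_features(message):
    """Count occurrences of each header name (lower-cased)."""
    return dict(Counter(name.lower() for name, _ in message.items()))


raw = (
    "Received: from a\n"
    "Received: from b\n"
    "Subject: hi\n"
    "X-Spam-Level: \n"
    "\n"
    "body\n"
)
features = header_features(message_from_string(raw))
```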

In [6]:
from email.parser import HeaderParser

headers = None
with open(some_file, mode="r") as file:
  parser = HeaderParser()
  headers = parser.parse(file)

In [7]:
for h in headers:
  print(h)

Return-Path
Delivered-To
Received
Received
Received
Message-Id
To
From
Subject
Date
Content-Type
Lines
X-Spam-Status
X-Spam-Level


In [13]:
message.items()

[('Return-Path', '<rssfeeds@example.com>'),
 ('Delivered-To', 'yyyy@localhost.example.com'),
 ('Received',
  'from localhost (jalapeno [127.0.0.1])\n\tby jmason.org (Postfix) with ESMTP id AE79816F16\n\tfor <jm@localhost>; Mon, 30 Sep 2002 13:43:46 +0100 (IST)'),
 ('Received',
  'from jalapeno [127.0.0.1]\n\tby localhost with IMAP (fetchmail-5.9.0)\n\tfor jm@localhost (single-drop); Mon, 30 Sep 2002 13:43:46 +0100 (IST)'),
 ('Received',
  'from dogma.slashnull.org (localhost [127.0.0.1]) by\n    dogma.slashnull.org (8.11.6/8.11.6) with ESMTP id g8U81fg21359 for\n    <jm@jmason.org>; Mon, 30 Sep 2002 09:01:41 +0100'),
 ('Message-Id', '<200209300801.g8U81fg21359@dogma.slashnull.org>'),
 ('To', 'yyyy@example.com'),
 ('From', 'gamasutra <rssfeeds@example.com>'),
 ('Subject', 'Priceless Rubens works stolen in raid on mansion'),
 ('Date', 'Mon, 30 Sep 2002 08:01:41 -0000'),
 ('Content-Type', 'text/plain; encoding=utf-8'),
 ('Lines', '6'),
 ('X-Spam-Status',
  'No, hits=-527.4 required=5.0\n\

## Body

In [12]:
message.as_string()

"Return-Path: <rssfeeds@example.com>\nDelivered-To: yyyy@localhost.example.com\nReceived: from localhost (jalapeno [127.0.0.1])\n\tby jmason.org (Postfix) with ESMTP id AE79816F16\n\tfor <jm@localhost>; Mon, 30 Sep 2002 13:43:46 +0100 (IST)\nReceived: from jalapeno [127.0.0.1]\n\tby localhost with IMAP (fetchmail-5.9.0)\n\tfor jm@localhost (single-drop); Mon, 30 Sep 2002 13:43:46 +0100 (IST)\nReceived: from dogma.slashnull.org (localhost [127.0.0.1]) by\n    dogma.slashnull.org (8.11.6/8.11.6) with ESMTP id g8U81fg21359 for\n    <jm@jmason.org>; Mon, 30 Sep 2002 09:01:41 +0100\nMessage-Id: <200209300801.g8U81fg21359@dogma.slashnull.org>\nTo: yyyy@example.com\nFrom: gamasutra <rssfeeds@example.com>\nSubject: Priceless Rubens works stolen in raid on mansion\nDate: Mon, 30 Sep 2002 08:01:41 -0000\nContent-Type: text/plain; encoding=utf-8\nLines: 6\nX-Spam-Status: No, hits=-527.4 required=5.0\n\ttests=AWL,DATE_IN_PAST_03_06,T_URI_COUNT_0_1\n\tversion=2.50-cvs\nX-Spam-Level: \n\nURL: http:/

In [14]:
message.get_payload()

"URL: http://www.newsisfree.com/click/-1,8381145,215/\nDate: 2002-09-30T03:04:58+01:00\n\n*Arts:* Fourth art raid on philanthropist's home once targeted by the IRA and \nDublin gangster Martin Cahill.\n\n\n"

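The body can be retrieved with `get_payload`.

However, removing X-Spam-Level has left an empty line behind, which puts the start of the body in the wrong place. Preprocessing should delete the blank line that follows X-Spam-Level.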
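
The parser treats the first blank line as the boundary between headers and body, which is why the payload above starts at `URL:`. Below is a minimal sketch of the preprocessing step, assuming the `URL:` and `Date:` lines are really meant to be headers. `drop_blank_after_header` and the sample text are hypothetical and only illustrate the idea.

```python
from email import message_from_string


def drop_blank_after_header(text, header="X-Spam-Level"):
    """Remove the first empty line that directly follows the given header."""
    prefix = header.lower() + ":"
    cleaned = []
    removed = False
    for line in text.split("\n"):
        previous = cleaned[-1] if cleaned else ""
        if not removed and line == "" and previous.lower().startswith(prefix):
            removed = True  # only the first such blank line is dropped
            continue
        cleaned.append(line)
    return "\n".join(cleaned)


raw = (
    "Subject: example\n"
    "X-Spam-Level: \n"
    "\n"
    "URL: http://www.example.com/\n"
    "Date: 2002-09-30T03:04:58+01:00\n"
    "\n"
    "Body text.\n"
)
before = message_from_string(raw)
after = message_from_string(drop_blank_after_header(raw))
```

After the blank line is dropped, `URL` and `Date` are parsed as headers and the payload is only the text after the next blank line. If those lines are in fact part of the body in this dataset, this step should be skipped.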

In [22]:
!rg "charset"

[0m[35msamples/hard_ham/0002.2fe846db6e3249836abdbfcae459bf2a[0m
[0m[32m12[0m:Content-Type: text/html; [0m[1m[31mcharset[0m=ISO-8859-1

[0m[35msamples/easy_ham/1079.3d222257b98d7d58a0f970d101be3ad7[0m
[0m[32m84[0m:Content-Type: text/plain; [0m[1m[31mcharset[0m=us-ascii

[0m[35msamples/spam/0275.0404a07cd99e27d569958716f392082b[0m
[0m[32m33[0m:	[0m[1m[31mcharset[0m="Windows-1252"
[0m[32m89[0m:	[0m[1m[31mcharset[0m="iso-8859-1"
[0m[32m96[0m:[0m[1m[31mcharset[0m=3Diso-8859-1">

[0m[35msamples/spam/0199.955edee89f34960c033c4d1072841356[0m
[0m[32m21[0m:Content-Type: text/html; [0m[1m[31mcharset[0m="iso-8859-1"

[0m[35msamples/spam/0207.3adcb1a14977a49cac8f6e10f64ac6f7[0m
[0m[32m37[0m:Content-Type: text/html; [0m[1m[31mcharset[0m="iso-8859-1"

[0m[35msamples/spam/0047.376bd7728ee94b32bc23429d9c51bae5[0m
[0m[32m20[0m:Content-Type: text/html; [0m[1m[31mcharset[0m="ISO-8859-1"
[0m[32m26[0m:<META http-equiv=Content-Type

In [3]:
_file = "./samples/easy_ham/0296.42216a75e0256510b216eaba6893d40d"

_message = None
with open(_file, mode="r") as file:
  _message = message_from_file(file)

Some files specify a charset, so that needs to be handled as well.

In [4]:
_message.get_payload()

'Of course, everyone knows that Owlman is a work of fuggin` genius\n\nJ\n\n\n\n\n> Hey, I met the wizard bloke from Owlman, who wants to touch me!!\n> \n> Dave\n\n\n\n------------------------ Yahoo! Groups Sponsor ---------------------~-->\n4 DVDs Free +s&p Join Now\nhttp://us.click.yahoo.com/pt6YBB/NXiEAA/MVfIAA/7gSolB/TM\n---------------------------------------------------------------------~->\n\nTo unsubscribe from this group, send an email to:\nforteana-unsubscribe@egroups.com\n\n \n\nYour use of Yahoo! Groups is subject to http://docs.yahoo.com/info/terms/ \n\n\n\n'

In [5]:
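# get_charsets() returns one entry per part; an entry is None when no charset is declared
charset = _message.get_charsets()[0] or "us-ascii"
_message.get_payload(decode=True).decode(charset, errors="replace")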

'Of course, everyone knows that Owlman is a work of fuggin` genius\n\nJ\n\n\n\n\n> Hey, I met the wizard bloke from Owlman, who wants to touch me!!\n> \n> Dave\n\n\n\n------------------------ Yahoo! Groups Sponsor ---------------------~-->\n4 DVDs Free +s&p Join Now\nhttp://us.click.yahoo.com/pt6YBB/NXiEAA/MVfIAA/7gSolB/TM\n---------------------------------------------------------------------~->\n\nTo unsubscribe from this group, send an email to:\nforteana-unsubscribe@egroups.com\n\n \n\nYour use of Yahoo! Groups is subject to http://docs.yahoo.com/info/terms/ \n\n\n\n'
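
The cell above handles a single-part message. The sketch below extends the same idea to multipart messages by walking every `text/*` part and decoding each with its own declared charset. `decode_text_parts`, the `latin-1` fallback, and the inline sample are assumptions made for illustration, not choices taken from the notebook.

```python
from email import message_from_bytes


def decode_text_parts(message, fallback="latin-1"):
    """Return the decoded text of every text/* part, joined by newlines."""
    texts = []
    for part in message.walk():
        if part.get_content_maintype() != "text":
            continue  # skips multipart containers and attachments
        raw = part.get_payload(decode=True)  # bytes, transfer encoding undone
        if raw is None:
            continue
        charset = part.get_content_charset() or fallback
        try:
            texts.append(raw.decode(charset, errors="replace"))
        except LookupError:
            # Charset name not known to Python: use the fallback instead.
            texts.append(raw.decode(fallback, errors="replace"))
    return "\n".join(texts)


sample = (
    b'Content-Type: text/plain; charset="iso-8859-1"\n'
    b"Content-Transfer-Encoding: quoted-printable\n"
    b"\n"
    b"caf=E9\n"
)
text = decode_text_parts(message_from_bytes(sample))
```

Using `errors="replace"` keeps one badly encoded message from stopping a batch run, at the cost of some replacement characters in the output.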