Skip to content
This repository has been archived by the owner on Nov 19, 2019. It is now read-only.

[WIP] Modernise ISBG #63

Merged
merged 23 commits into from Jan 23, 2017
Merged

[WIP] Modernise ISBG #63

merged 23 commits into from Jan 23, 2017

Conversation

duk3luk3
Copy link
Contributor

@duk3luk3 duk3luk3 commented Dec 28, 2016

Following up from #61, this is my first stab at modernising ISBG. It's very rough and probably buggy, I have only gotten it to a state where it doesn't syntax error when running. It is intended to have identical behaviour to current ISBG version plus my #61. This needs to be ensured by doing a lot of testing.

It then needs to be cleaned up.

TODO:

  • Turn ISBG into a class
  • [ ] Use argparse instead of docopt
  • Turn print output into programmatically usable return values / attributes
  • Use logging module for debug output
  • Fix up setup.py
  • Tune up sa-unwrap.py (Turn into proper module / add to setup.py / add docs) or remove it

Rationale for changes:

  • argparse is part of stdlib, so we could get rid of the docopt dependency. I've also found it perfectly usable in the past and don't see the advantage of docopt.
  • ISBG should be usable as a library, that include return information
  • logging module gives us flexible options for debugging / logging output

Feedback very welcome!

Works analogously to `--learnthendestroy`
Due to introduction of --learnunflagged number of messages in learning
boxes determined from return of mailbox select command is no longer
accurate. So this was changed to be determined from the result of the
IMAP SEARCH command.
Logging is good, but whole mail bodies are frequently unwanted.

This commit introduces a new --verbose-mails argument and changes the
behaviour of `assertok` to only print the first 100 chars of any imap
results if not set.
@baldurmen
Copy link
Contributor

baldurmen commented Dec 29, 2016

Hi! I'll take a look at this soon.

In the meanwhile, please don't work on changing docopt to argparse. I actually did the argparse to doctopt conversion a while ago.

I really like docopt. It's very clean (around 500 lines of code) and works really well. My main argument for using it is that it manages cli interface, help menus and tons of stuff for you automatically. All these things have to be done manually with argparse and it's a hassle I prefer not to deal with.

I understand it's not part of the standard python libs, but all modern linux distros should have docopt packages. For the corner cases, there is always the option of adding docopt.py manually somewhere.

@duk3luk3
Copy link
Contributor Author

OK - that's fine, to discuss things like this is why I opened the PR in this state.

As you'll see I'm sanitizing the options that come out of docopts quite heavily in the new parse_args function to have an easier time working with them in the do_isbg function. What's the preferred way to specify default argument values? I really wanted to get rid of all the if opts["foo"] is not None switches everywhere in the code.

@duk3luk3
Copy link
Contributor Author

duk3luk3 commented Jan 2, 2017

With commit 8a93157, the following call should now work:

./isbg.py --dryrun --verbose --imaphost <host> --imapuser <user> --spaminbox <spaminbox>

I've added --dryrun option to disable / mock all IMAP write operations and Spamassassin calls, as well as limit the number of messages processed to 5 (with 1 "detected" as spam) for testing purposes.

Add --dryrun switches to learning modes (for --learnspambox / --learnhambox)
and fixup
@duk3luk3
Copy link
Contributor Author

duk3luk3 commented Jan 5, 2017

I fixed up --learnspambox / --learnhambox in 55839e3, so the following now works:

./isbg.py --dryrun --verbose --imaphost <host> --imapuser <user> --teachonly --learnunflagged --learnhambox "INBOX.Spam.Learn-Ham"

My next step will be installing SA locally and do a less "dry" test.

Script for unwrapping spamassassin reports with the original mail
embedded.
Patches up everything with python2 equivalents to new things in python3.
Mainly string/bytes handling.
@duk3luk3
Copy link
Contributor Author

Now contains sa-unwrap.py that extracts email parts with Content-Type: message/rfc822; x-spam-type=original from an e-mail, i.e. the original e-mails from spamassassin spam reports.

I made this script because I need it for my usecase where I need to be able to safely handle spam mails encapsulated in spamassassin spam reports. I've experimented with spamassassin (and spamc/sa-learn) a bit and have not found any conclusive evidence that it treats these emails properly.

If you don't like it I'll happily remove it from the PR, but for now it's in because my ongoing work is on this branch.

This updates the architecture and layout of the python scripts to create
proper modules and updates setup.py to make everything installable.
@duk3luk3 duk3luk3 force-pushed the modernise branch 2 times, most recently from 40baf02 to 1fd43cd Compare January 12, 2017 09:52
Pulls out spamlearning and hamlearning into a single function with
unified logic.
Pulls out the main spamassassin processing into a member function.
@duk3luk3
Copy link
Contributor Author

More work:

  • 1fd43cd reorganizes the repo to create a proper distribution.
  • e39e685 and 06cad66 perform some major refactoring, pulling out the main sa-learn and spamassassin processing steps into member functions

Forgot a `self` referring to an exitcode
`isbg.py` now tries to import and use `sa_unwrap.py` to unwrap mails that
are SA reports with the original spam mail embedded.

This only works if the `isbg` module is loaded properly. To stay
compatible with users calling `isbg.py` directly, `unwrap` is replaced
by a dummy no-op function if imprting `sa_unwrap` fails.
@baldurmen
Copy link
Contributor

i'll be doing a first batch of merge+review this Thursday (promised :D)

@duk3luk3
Copy link
Contributor Author

Great! Please note that this is definitely not in a release-ready state. But I'll test and fixup the most recent changes I made today.

@duk3luk3
Copy link
Contributor Author

duk3luk3 commented Jan 17, 2017

I pushed a more complete version of the option setter functions in 4f44457.

For now I've kept everything so that just grabbing isbg.py and running it should work. Is that a desired feature? It complicates some things.

(EDIT: Had to push a new commit to fix some typos)

Option setter functions for using isbg programmatically.

Includes fixes to some option parsing that were spotted while working on
the setter functions.

WIP - Untested
@baldurmen
Copy link
Contributor

No need to make anything working atm. I'm going to create a separate dev branch. I'm just going to review your code a little and comment on it.

* Set clean default settings in __init__
* Do not use sys.exit on successful run if exitcodes option not set
* Changes for py3 compat, especially string encoding - mostly untested
@duk3luk3
Copy link
Contributor Author

8b814f4 is the result of trying to import isbg and use it programmatically from a python 3 script. It contains clean setting of default options, removes the sys.exit from the end of the script if exitcodes isn't set, and contains some changes to string handling and other things for python 3. It also fixes some typos from the last commit.

It's pretty ugly and most likely fails in all code paths except the ones I've tried, which is mainly the script posted below and one or two manual calls like noted in comments above.

This is right about the point where I should invest some effort in some form of automated testing...

Here is the script I use for calling isbg programmatically:
(commented out is the old popen-based usage)

#!/usr/bin/python3

"""
Uses ISBG https://github.com/isbg/isbg (gepatcht: https://github.com/duk3luk3/isbg/tree/learnthenflag )
to learn spam from mailstores.

Fetches unflagged mails from imap servers, pushes them into spamc for learning, and then optionally flags them.

Parameters:
  * 'doflag': Flag mails after processing

Prerequisites: 
  * python(2)-docopt
  * python(3)-yaml
  * settings file in /var/lib/spamassassin/imaplearn.yaml
  * log dir in /var/log/spamlearning that is writeable for the user running this script

A configuration dict must be created for every IMAP Server to be used.

Template:

IMAP: [
    {
    # IMAP Server
    'server': 'mailstore.example.com',
    # IMAP Username Admin Postfix
    'username': '*admin',
    # IMAP Admin Password
    'password': 'XXXXX',
    # Default spambox / hambox mailbox
    'default-spambox': 'INBOX.Spam.Learn-Spam',
    'default-hambox': 'INBOX.Spam.Learn-Ham',
    'users': [
        # users tuple: Name, Spam Mbox, Ham Mbox (null => use default, false => disabled)
        ['joe', null, null],
        ['hank', 'INBOX.spam.learn_spam', 'INBOX.spam.learn_ham'],
    ]
    },
    {
    'server': 'mailstore2.example.com',
    'username': '*admin',
    'password': 'XXXXXX',
    'default-spambox': 'Spam/Learn-Spam',
    'default-hambox': 'Spam/Learn-Ham',
    'users': [
        ['joe', 'Spamlearning/Learn-Spam', False],
    ]
    },
]
  
"""

import os
import sys
import subprocess
import logging

from isbg.isbg import ISBG

import yaml
with open('/var/lib/spamassassin/imaplearn.yaml') as settfile:
    IMAP = yaml.load(settfile)['IMAP']

logger = logging.getLogger('spamlearn_mailstore')
logger.setLevel(logging.DEBUG)

fh = logging.FileHandler('/var/log/spamlearning/spamlearn_mailstore.log')
fh.setLevel(logging.DEBUG)

ch = logging.StreamHandler()
ch.setLevel(logging.INFO)

formatter = logging.Formatter('%(asctime)s - %(name)s - %(levelname)s - %(message)s')
fh.setFormatter(formatter)

logger.addHandler(ch)
logger.addHandler(fh)

#ISBG='/opt/hostrepo/maildmz/isbg.py'

DOFLAG = 'doflag' in sys.argv



for imap in IMAP:
    server = imap['server']
    username = imap['username']
    password = imap['password']
    default_spambox = imap['default-spambox']
    default_hambox = imap['default-hambox']
    users = imap['users']
    for user, spambox, hambox in users:
        imap_username = user + username
        imap_spambox = spambox if spambox is not None else default_spambox
        imap_hambox = hambox if hambox is not None else default_hambox
        try:
            boxparams = []
            if imap_spambox:
                boxparams = boxparams + [ '--learnspambox', imap_spambox ]
            if imap_hambox:
                boxparams = boxparams + [ '--learnhambox', imap_hambox ]

            if len(boxparams) > 0:
                logger.info('Running ISBG spamlearning for user {} on mailstore {}'.format(user, server))

                isbg = ISBG()
                isbg.set_imap_opts(
                    imaphost=server,
                    imapport=993,
                    imapuser=imap_username,
                    imappasswd=password,
                    nossl=False
                )
                isbg.set_reporting_opts(
                    imaplist=False,
                    nostats=False,
                    noreport=False,
                    exitcodes=False,
                    verbose=True,
                    verbose_mails=False
                )
                isbg.set_mailboxes(
                    inbox="INBOX",
                    spaminbox="INBOX.spam",
                    learnspambox=imap_spambox,
                    learnhambox=imap_hambox
                )
                isbg.set_processing_opts(
                    dryrun=False,
                    maxsize=isbg.maxsize,
                    teachonly=True,
                    spamc=True,
                    gmail=False
                )
                isbg.set_learning_opts(
                    learnunflagged=True,
                    learnthendestroy=False,
                    learnthenflag=DOFLAG
                )

                isbg.do_isbg()
                
#                isbg_args = [
#                    ISBG,
#                    '--imaphost', server,
#                    '--imapuser', imap_username,
#                    '--teachonly', '--spamc', '--verbose', '--learnunflagged',
#                    '--noninteractive'
#                    ] + boxparams
#
#                if DOFLAG:
#                    isbg_args.append('--learnthenflag')
#
#                logger.debug('ISBG params (minus password): {}'.format(isbg_args))
#
#                isbg_args = isbg_args + [
#                    '--imappasswd', password,
#                ]
#                isbg = subprocess.Popen(
#                    isbg_args,
#    #                stdout = sys.stderr,
#    #                stderr = sys.stderr,
#                    stdout = subprocess.PIPE,
#                    stderr = subprocess.STDOUT,
#                    stdin = subprocess.PIPE,
#                    )
##                isbg_out, isbg_err = isbg.communicate((password + '\n').encode())
##                logger.info('ISBG output:\n>>>\n{}\n<<<'.format(isbg_out.decode().strip()))
#                isbg_out = isbg.stdout.read().decode()
#                logger.debug('ISBG output:\n===\n{}\n==='.format(isbg_out.strip()))
#                for line in isbg_out.split('\n'):
#                    if line.endswith('learnt'):
#                        logger.info(line)
            else:
                logger.warn('Skipped ISBG spamlearning for user {} on mailstore {} because no mailboxes'.format(user, server))
        except:
            logger.error('Error running isbg!', exc_info=True)

Replace all print calls by logging statements.
A log handler prints all log messages to stderr.
Fixes two small issues.
Setting the log handler that logs everything to stderr only in the
`isbg_run` method meant for when isbg is run as command / top level
script.

When isbg is run as a library, the caller is responsible for setting up
log output. If we add a log handler this can lead to messages being
duplicated.
@duk3luk3 duk3luk3 force-pushed the modernise branch 4 times, most recently from 4437d4c to 728a296 Compare January 20, 2017 07:35
* Add separate pastuid files for learning folders
* Clean up functionality
Add UIDVALIDITY to uid trackfile to make sure we use pastuids properly
@duk3luk3 duk3luk3 force-pushed the modernise branch 2 times, most recently from 99bfd2b to ba2514b Compare January 20, 2017 08:08
@duk3luk3
Copy link
Contributor Author

I added lots of features because I can't help myself.

@baldurmen baldurmen mentioned this pull request Jan 21, 2017
@baldurmen baldurmen changed the base branch from master to v.2.0 January 21, 2017 22:37
* Use sa_unwrap properly
* Log all exceptions
@baldurmen
Copy link
Contributor

I thought I'd have more time to look at this in depth, but it looks like I don't :(

So here I merge, and I'll look at this eventually. Don't stop! Please open more specific PR on v.2.0 and I'll try to look and merge quickly.

Oh, and please join our IRC channel and our new mailing list!

@baldurmen baldurmen merged commit 373963e into isbg:v.2.0 Jan 23, 2017
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants