# Logistic Regression: Spam Detection

Using a **real** labeled dataset (*2007 TREC Public Spam Corpus*), build a logistic regression (classification) model that can determine whether a given email is spam or not.

## 1. Helper functions

In [1]:
# This class helps preprocess emails that contain HTML code
from html.parser import HTMLParser

class MLStripper(HTMLParser):
    def __init__(self):
        # Let HTMLParser set up its own state; convert_charrefs=True decodes
        # character references such as &amp; before handle_data is called
        super().__init__(convert_charrefs=True)
        self.fed = []

    def handle_data(self, d):
        self.fed.append(d)

    def get_data(self):
        return ''.join(self.fed)

In [2]:
# This function removes any HTML tags found in the text of the email
def strip_tags(html):
    s = MLStripper()
    s.feed(html)
    s.close()  # flush any text the parser is still buffering
    return s.get_data()

In [3]:
# Example: stripping the HTML tags from a text
t = '<tr><td align="left"><a href="../../issues/51/16.html#article">Phrack World News</a></td>'
strip_tags(t)

'Phrack World News'
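
The same idea can be checked on two further inputs. The sketch below is self-contained: `TextExtractor` and `html_to_text` are hypothetical names that mirror `MLStripper` and `strip_tags`, and the sample strings are made up for illustration. It shows that character references are decoded and that a word split across several tags is joined back together.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect only the text nodes of an HTML fragment (hypothetical helper)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        # Called once per run of text found between tags
        self.chunks.append(data)


def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.chunks)


# Made-up inputs: one with a character reference, one with a word split by tags
entity_text = html_to_text("<p>Fish &amp; <b>Chips</b></p>")
split_text = html_to_text("<a>Try <span>V</span><span>ia</span>gra</a>")
print(entity_text)
print(split_text)
```

Joining text that was split across tags is relevant here because the sample spam message shown later in this notebook breaks a keyword apart with `<span>` elements.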

In [5]:
# Class that removes unneeded content such as punctuation marks and
# different inflections of the same word within an email (via stemming), etc.
import email
import string
import nltk
nltk.download('stopwords')

class Parser:

    def __init__(self):
        self.stemmer = nltk.PorterStemmer()
        self.stopwords = set(nltk.corpus.stopwords.words('english'))
        self.punctuation = list(string.punctuation)

    def parse(self, email_path):
        """Parse an email."""
        with open(email_path, errors='ignore') as e:
            msg = email.message_from_file(e)
        return None if not msg else self.get_email_content(msg)

    def get_email_content(self, msg):
        """Extract the email content."""
        subject = self.tokenize(msg['Subject']) if msg['Subject'] else []
        body = self.get_email_body(msg.get_payload(),
                                   msg.get_content_type())
        content_type = msg.get_content_type()
        # Returning the content of the email
        return {"subject": subject,
                "body": body,
                "content_type": content_type}

    def get_email_body(self, payload, content_type):
        """Extract the body of the email."""
        body = []
        if isinstance(payload, str) and content_type == 'text/plain':
            return self.tokenize(payload)
        elif isinstance(payload, str) and content_type == 'text/html':
            return self.tokenize(strip_tags(payload))
        elif isinstance(payload, list):
            # Multipart message: recurse into every part and concatenate tokens
            for p in payload:
                body += self.get_email_body(p.get_payload(),
                                            p.get_content_type())
        return body

    def tokenize(self, text):
        """Transform a text string into tokens. Performs three main actions:
        removes punctuation symbols, drops stopwords, and stems the rest."""
        for c in self.punctuation:
            text = text.replace(c, "")
        text = text.replace("\t", " ")
        text = text.replace("\n", " ")
        tokens = list(filter(None, text.split(" ")))
        # Stem the tokens. The stopword check is case-sensitive, so a
        # capitalized stopword such as "Do" is kept and stemmed to 'do'.
        return [self.stemmer.stem(w) for w in tokens if w not in self.stopwords]

[nltk_data] Downloading package stopwords to /home/ljmor/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
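
To make the steps of `tokenize` concrete without requiring NLTK, here is a minimal sketch of the first two stages only: punctuation removal and stopword filtering. `TINY_STOPWORDS` is a made-up subset standing in for NLTK's English list, `simple_tokenize` is a hypothetical name, and stemming is deliberately omitted, so the output differs from what `Parser.tokenize` returns.

```python
import string

# Assumption: a tiny hand-written stopword set, NOT the NLTK English list
TINY_STOPWORDS = {"the", "to", "and", "you", "your", "not"}


def simple_tokenize(text, stopwords=TINY_STOPWORDS):
    # Delete every ASCII punctuation character in a single pass
    table = str.maketrans("", "", string.punctuation)
    text = text.translate(table)
    # split() with no argument splits on any run of whitespace
    tokens = text.split()
    # Case-sensitive membership test, as in Parser.tokenize
    return [t for t in tokens if t not in stopwords]


tokens = simple_tokenize("Do you feel the pressure to perform?")
print(tokens)
```

Because the membership test is case-sensitive, the capitalized "Do" survives the filter here as well; lowercasing the text before filtering would be one way to change that, at the cost of altering the tokens the notebook produces.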


### What an email looks like in the original dataset

This is the format of an email exactly as it comes in the dataset:

```

From RickyAmes@aol.com  Sun Apr  8 13:07:32 2007
Return-Path: <RickyAmes@aol.com>
Received: from 129.97.78.23 ([211.202.101.74])
	by speedy.uwaterloo.ca (8.12.8/8.12.5) with SMTP id l38H7G0I003017;
	Sun, 8 Apr 2007 13:07:21 -0400
Received: from 0.144.152.6 by 211.202.101.74; Sun, 08 Apr 2007 19:04:48 +0100
Message-ID: <WYADCKPDFWWTWTXNFVUE@yahoo.com>
From: "Tomas Jacobs" <RickyAmes@aol.com>
Reply-To: "Tomas Jacobs" <RickyAmes@aol.com>
To: the00@speedy.uwaterloo.ca
Subject: Generic Cialis, branded quality@ 
Date: Sun, 08 Apr 2007 21:00:48 +0300
X-Mailer: Microsoft Outlook Express 6.00.2600.0000
MIME-Version: 1.0
Content-Type: multipart/alternative;
	boundary="--8896484051606557286"
X-Priority: 3
X-MSMail-Priority: Normal
Status: RO
Content-Length: 988
Lines: 24

----8896484051606557286
Content-Type: text/html;
Content-Transfer-Encoding: 7Bit

<html>
<body bgcolor="#ffffff">
<div style="border-color: #00FFFF; border-right-width: 0px; border-bottom-width: 0px; margin-bottom: 0px;" align="center">
<table style="border: 1px; border-style: solid; border-color:#000000;" cellpadding="5" cellspacing="0" bgcolor="#CCFFAA">
<tr>
<td style="border: 0px; border-bottom: 1px; border-style: solid; border-color:#000000;">
<center>
Do you feel the pressure to perform and not rising to the occasion??<br>
</center>
</td></tr><tr>
<td bgcolor=#FFFF33 style="border: 0px; border-bottom: 1px; border-style: solid; border-color:#000000;">
<center>

<b><a href='http://excoriationtuh.com/?lzmfnrdkleks'>Try <span>V</span><span>ia<span></span>gr<span>a</span>.....</a></b></center>
</td></tr><td><center>your anxiety will be a thing of the past and you will<br>
be back to your old self.
</center></td></tr></table></div></body></html>


----8896484051606557286--

```

### After running our parsing class

```python3
p = Parser()
p.parse("datasets/trec07p/data/inmail.1")
```

The parsed email looks roughly like this, with the words of the subject and the words of the body:

```
{'subject': ['gener', 'ciali', 'brand', 'qualiti'],
 'body': ['do',
  'feel',
  'pressur',
  'perform',
  'rise',
  'occas',
  'tri',
  'viagra',
  'anxieti',
  'thing',
  'past',
  'back',
  'old',
  'self'],
 'content_type': 'multipart/alternative'}
```
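
The `content_type` above is `multipart/alternative`, yet the body tokens come from the nested `text/html` part. A minimal sketch of that recursion, using only the standard `email` module, is shown below; `RAW` is a made-up two-part message and `leaf_parts` is a hypothetical helper, not part of the notebook.

```python
import email

# Made-up multipart message used only for illustration
RAW = (
    "From: sender@example.com\n"
    "Subject: Hello\n"
    "MIME-Version: 1.0\n"
    'Content-Type: multipart/alternative; boundary="XYZ"\n'
    "\n"
    "--XYZ\n"
    "Content-Type: text/plain\n"
    "\n"
    "plain body\n"
    "--XYZ\n"
    "Content-Type: text/html\n"
    "\n"
    "<p>html body</p>\n"
    "--XYZ--\n"
)


def leaf_parts(msg):
    """Return (content_type, payload) for every non-multipart part."""
    payload = msg.get_payload()
    if isinstance(payload, list):
        # A multipart container holds a list of sub-messages: recurse into each
        parts = []
        for part in payload:
            parts += leaf_parts(part)
        return parts
    return [(msg.get_content_type(), payload)]


msg = email.message_from_string(RAW)
parts = leaf_parts(msg)
print(msg.get_content_type())
print([content_type for content_type, _ in parts])
```

`Parser.get_email_body` follows the same pattern, except that it tokenizes each `text/plain` or `text/html` leaf instead of returning it.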

### Reading the labels the dataset assigns to the emails (spam or not)

```python3
with open("datasets/trec07p/full/index") as f:
    index = f.readlines()
index
```
Which looks roughly like this:

```
'spam ../data/inmail.1\n',
 'ham ../data/inmail.2\n',
 'spam ../data/inmail.3\n',
 'spam ../data/inmail.4\n',
 'spam ../data/inmail.5\n',
 'spam ../data/inmail.6\n',
 'spam ../data/inmail.7\n',
 'spam ../data/inmail.8\n',
 'spam ../data/inmail.9\n',
 'ham ../data/inmail.10\n',
 'spam ../data/inmail.11\n',
 'spam ../data/inmail.12\n',....
```

In [8]:
# Function that parses the list of labeled emails and builds a list of
# dictionaries holding each label (spam or ham) and the path of its email
import os

DATASET_PATH = os.path.join("datasets", "trec07p")

def parse_index(path_to_index, n_elements):
    ret_indexes = []
    # Read the index with a context manager so the file is always closed
    with open(path_to_index) as f:
        index = f.readlines()
    # Slicing avoids an IndexError when n_elements exceeds the number of lines
    for line in index[:n_elements]:
        # Each line looks like "spam ../data/inmail.1"
        label, path = line.strip().split(" ../")
        path_mail = path.split("/")[-1]
        ret_indexes.append({"label": label,
                            "email_path": os.path.join(DATASET_PATH, "data", path_mail)})
    return ret_indexes

### What the parsed label list looks like

The label index becomes a list of dictionaries like this:

```python3
indexes = parse_index("datasets/trec07p/full/index", 10)
indexes
```

```
[{'label': 'spam', 'email_path': 'datasets\\trec07p\\data\\inmail.1'},
 {'label': 'ham', 'email_path': 'datasets\\trec07p\\data\\inmail.2'},
 {'label': 'spam', 'email_path': 'datasets\\trec07p\\data\\inmail.3'},
 {'label': 'spam', 'email_path': 'datasets\\trec07p\\data\\inmail.4'},
 {'label': 'spam', 'email_path': 'datasets\\trec07p\\data\\inmail.5'},
 {'label': 'spam', 'email_path': 'datasets\\trec07p\\data\\inmail.6'},
 {'label': 'spam', 'email_path': 'datasets\\trec07p\\data\\inmail.7'},
 {'label': 'spam', 'email_path': 'datasets\\trec07p\\data\\inmail.8'},
 {'label': 'spam', 'email_path': 'datasets\\trec07p\\data\\inmail.9'},
 {'label': 'ham', 'email_path': 'datasets\\trec07p\\data\\inmail.10'}]
```
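
As a quick sanity check on the parsed index, the labels can be counted. The sketch below hard-codes the ten labels from the output above rather than reading the dataset, so it runs on its own; `sample_indexes` is a hypothetical stand-in for `indexes`, and its `email_path` values are shortened to the file name.

```python
from collections import Counter

# Labels copied from the ten entries shown above, in order
labels = ["spam", "ham", "spam", "spam", "spam",
          "spam", "spam", "spam", "spam", "ham"]

# Same shape as the output of parse_index, with simplified paths
sample_indexes = [
    {"label": label, "email_path": f"inmail.{i}"}
    for i, label in enumerate(labels, start=1)
]

# Count how many entries carry each label
label_counts = Counter(entry["label"] for entry in sample_indexes)
print(label_counts)
```

For these ten entries the count is 8 spam and 2 ham; the same call on the full index would show how balanced the two classes are.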

## Preprocessing the dataset

In [10]:
# Load the index and labels into memory
index = parse_index('datasets/trec07p/full/index', 1)

FileNotFoundError: [Errno 2] No such file or directory: 'datasets/trec07p/full/index'
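
The `FileNotFoundError` above means the relative path did not resolve from the notebook's working directory. A minimal sketch of a pre-flight check follows, assuming the corpus is expected under `datasets/trec07p` with the index at `full/index`, as in the calls above; `index_file_status` is a hypothetical helper.

```python
from pathlib import Path


def index_file_status(dataset_root):
    """Return the resolved index path and whether it exists as a file."""
    index_path = Path(dataset_root) / "full" / "index"
    return index_path.resolve(), index_path.is_file()


resolved, exists = index_file_status("datasets/trec07p")
if not exists:
    # Report both the path that was tried and the current working directory
    print(f"Index not found at {resolved}; check the working directory "
          f"({Path.cwd()}) or download the corpus first.")
```

Running this before `parse_index` makes it clearer whether the dataset is missing or the notebook was simply started from a different directory.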