# Logistic Regression: SPAM Detection
This exercise introduces the fundamentals of logistic regression through one of the first problems to be solved with machine learning techniques.

## Problem statement
The goal is to build a machine learning system capable of predicting whether a given email is SPAM or not. The following dataset is used:

###### [2007 TREC Public SPAM Corpus](https://plg.uwaterloo.ca/~gvcormac/treccorpus07/)

The corpus trec07p contains 75,419 messages:

    25,220 HAM
    50,199 SPAM
These messages constitute all the messages delivered to a particular server between these dates:

    Sun, 8 Apr 2007 13:07:21 -0400
    Fri, 6 Jul 2007 07:04:53 -0400

# 1.- Helper Functions
In this practical case on SPAM email detection, the available dataset consists of emails with their headers and additional fields. They therefore require preprocessing before being ingested by the Machine Learning (ML) algorithm.

In [1]:
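# This class makes it easier to preprocess emails that contain HTML code.
from html.parser import HTMLParser

class MLStripper(HTMLParser):
    def __init__(self):
        # Let HTMLParser set up its internal state; convert_charrefs=True
        # decodes character references such as &amp; before handle_data runs.
        super().__init__(convert_charrefs=True)
        self.fed = []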

    def handle_data(self,d):
        self.fed.append(d)

    def get_data(self):
        return ''.join(self.fed)

In [2]:
# This function removes the HTML tags found in the text of the email.
def strip_tags(html):
    s = MLStripper()
    s.feed(html)
    return s.get_data()

In [3]:
# Removing the HTML tags from a text.
t = '<tr><td aling="left"><a href"../..issues/51/16.html#article">Phrack world news</a></td>'
strip_tags(t)

'Phrack world news'

Besides removing any HTML tags found in the email, other preprocessing steps are needed to keep the messages free of unnecessary noise. These include removing punctuation marks, discarding email fields that are not relevant, and stripping the affixes of each word to keep only its stem (stemming). The class shown below performs these transformations.

In [4]:
import email
import string
import nltk

# The stop-word list must be downloaded once, e.g. with nltk.download('stopwords').

class Parser:

    def __init__(self):
        self.stemmer = nltk.PorterStemmer()
        self.stopwords = set(nltk.corpus.stopwords.words('english'))
        self.punctuation = list(string.punctuation)

    def parse(self, email_path):
        """Parse an email."""
        with open(email_path, errors = 'ignore') as e:
            msg = email.message_from_file(e)
        return None if not msg else self.get_email_content(msg)
    
    def get_email_content(self, msg):
        """Extract the email content."""
        subject = self.tokenize(msg['Subject']) if msg['Subject'] else []
        body = self.get_email_body(msg.get_payload(),
                                  msg.get_content_type())
        content_type = msg.get_content_type()
        # Returning the content of the email
        return{"subject": subject,
              "body": body,
              "content_type": content_type}
    
    def get_email_body(self, payload, content_type):
        """Extract the body of the email"""
        body = []
        if type(payload) is str and content_type == 'text/plain':
            return self.tokenize(payload)
        elif type(payload) is str and content_type == 'text/html':
            return self.tokenize(strip_tags(payload))
        elif type(payload) is list:
            for p in payload:
                body += self.get_email_body(p.get_payload(),
                                   p.get_content_type())
        return body
        
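    def tokenize(self, text):
        """Transform a text string into tokens. Performs two main actions: removes punctuation symbols and stems the text."""
        for c in self.punctuation:
            text = text.replace(c, "")
        # Note: tabs and newlines are deleted rather than replaced by a space,
        # so words separated only by a line break are merged into one token
        # (e.g. 'occasiontri' in the output below).
        text = text.replace("\t", "")
        text = text.replace("\n", "")
        tokens = list(filter(None, text.split(" ")))
        # Stemming of the tokens; stop words are compared case-sensitively
        return [self.stemmer.stem(w) for w in tokens if w not in self.stopwords]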
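
The `tokenize` method above deletes tabs and newlines instead of treating them as separators, which is why merged tokens such as 'occasiontri' show up in the parsed output. As a hedged illustration only (it is not part of the original class), the sketch below uses a hypothetical helper, `split_clean`, that strips punctuation with `str.translate` and splits on any whitespace. Stop-word removal and stemming are left out so that the example depends only on the standard library.

```python
import string

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def split_clean(text):
    """Hypothetical helper: remove punctuation, then split on any whitespace."""
    # str.split() with no argument treats spaces, tabs and newlines alike,
    # so words on consecutive lines remain separate tokens.
    return text.translate(_PUNCT_TABLE).split()

print(split_clean("rise to the occasion.\nTry Viagra!"))
# ['rise', 'to', 'the', 'occasion', 'Try', 'Viagra']
```

Adopting a variant like this in `Parser.tokenize` would change the tokens, the vocabulary size and possibly the accuracy figures reported in this notebook.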

###### Reading an email in raw format

In [5]:
inmail = open("datasets/datasets/trec07p/data/inmail.1").read()
print(inmail)

From RickyAmes@aol.com  Sun Apr  8 13:07:32 2007
Return-Path: <RickyAmes@aol.com>
Received: from 129.97.78.23 ([211.202.101.74])
	by speedy.uwaterloo.ca (8.12.8/8.12.5) with SMTP id l38H7G0I003017;
	Sun, 8 Apr 2007 13:07:21 -0400
Received: from 0.144.152.6 by 211.202.101.74; Sun, 08 Apr 2007 19:04:48 +0100
Message-ID: <WYADCKPDFWWTWTXNFVUE@yahoo.com>
From: "Tomas Jacobs" <RickyAmes@aol.com>
Reply-To: "Tomas Jacobs" <RickyAmes@aol.com>
To: the00@speedy.uwaterloo.ca
Subject: Generic Cialis, branded quality@ 
Date: Sun, 08 Apr 2007 21:00:48 +0300
X-Mailer: Microsoft Outlook Express 6.00.2600.0000
MIME-Version: 1.0
Content-Type: multipart/alternative;
	boundary="--8896484051606557286"
X-Priority: 3
X-MSMail-Priority: Normal
Status: RO
Content-Length: 988
Lines: 24

----8896484051606557286
Content-Type: text/html;
Content-Transfer-Encoding: 7Bit

<html>
<body bgcolor="#ffffff">
<div style="border-color: #00FFFF; border-right-width: 0px; border-bottom-width: 0px; margin-bottom: 0px;" align="

In [6]:
p = Parser()
p.parse("datasets/datasets/trec07p/data/inmail.1")

{'subject': ['gener', 'ciali', 'brand', 'qualiti'],
 'body': ['do',
  'feel',
  'pressur',
  'perform',
  'rise',
  'occasiontri',
  'viagrayour',
  'anxieti',
  'thing',
  'past',
  'willb',
  'back',
  'old',
  'self'],
 'content_type': 'multipart/alternative'}

##### Reading the index

These helper functions load into memory the path of each email and its corresponding label, which is one of {SPAM, HAM}.

In [7]:
index = open("datasets/datasets/trec07p/full/index").readlines()
index

['spam ../data/inmail.1\n',
 'ham ../data/inmail.2\n',
 'spam ../data/inmail.3\n',
 'spam ../data/inmail.4\n',
 'spam ../data/inmail.5\n',
 'spam ../data/inmail.6\n',
 'spam ../data/inmail.7\n',
 'spam ../data/inmail.8\n',
 'spam ../data/inmail.9\n',
 'ham ../data/inmail.10\n',
 'spam ../data/inmail.11\n',
 'spam ../data/inmail.12\n',
 'spam ../data/inmail.13\n',
 'spam ../data/inmail.14\n',
 'spam ../data/inmail.15\n',
 'spam ../data/inmail.16\n',
 'spam ../data/inmail.17\n',
 'spam ../data/inmail.18\n',
 'spam ../data/inmail.19\n',
 'ham ../data/inmail.20\n',
 'ham ../data/inmail.21\n',
 'spam ../data/inmail.22\n',
 'spam ../data/inmail.23\n',
 'spam ../data/inmail.24\n',
 'spam ../data/inmail.25\n',
 'spam ../data/inmail.26\n',
 'spam ../data/inmail.27\n',
 'spam ../data/inmail.28\n',
 'ham ../data/inmail.29\n',
 'spam ../data/inmail.30\n',
 'ham ../data/inmail.31\n',
 'spam ../data/inmail.32\n',
 'spam ../data/inmail.33\n',
 'ham ../data/inmail.34\n',
 'spam ../data/inmail.35\n',
 

In [8]:
import os
DATASET_PATH = "datasets/datasets/trec07p"

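def parse_index(path_to_index, n_elements):
    """Return the label and file path of the first n_elements index entries."""
    ret_indexes = []
    with open(path_to_index) as f:
        index = f.readlines()
    for i in range(n_elements):
        # Each line looks like 'spam ../data/inmail.1\n'
        mail = index[i].split(" ../")
        label = mail[0]
        path = mail[1][:-1]  # drop the trailing newline
        ret_indexes.append({"label": label, "email_path": os.path.join(DATASET_PATH, path)})
    return ret_indexes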

In [9]:
def parse_email(index):
    p = Parser()
    pmail = p.parse(index["email_path"])
    return pmail, index["label"]

In [10]:
indexes = parse_index("datasets/datasets/trec07p/full/index", 10)
indexes

[{'label': 'spam', 'email_path': 'datasets/datasets/trec07p/data/inmail.1'},
 {'label': 'ham', 'email_path': 'datasets/datasets/trec07p/data/inmail.2'},
 {'label': 'spam', 'email_path': 'datasets/datasets/trec07p/data/inmail.3'},
 {'label': 'spam', 'email_path': 'datasets/datasets/trec07p/data/inmail.4'},
 {'label': 'spam', 'email_path': 'datasets/datasets/trec07p/data/inmail.5'},
 {'label': 'spam', 'email_path': 'datasets/datasets/trec07p/data/inmail.6'},
 {'label': 'spam', 'email_path': 'datasets/datasets/trec07p/data/inmail.7'},
 {'label': 'spam', 'email_path': 'datasets/datasets/trec07p/data/inmail.8'},
 {'label': 'spam', 'email_path': 'datasets/datasets/trec07p/data/inmail.9'},
 {'label': 'ham', 'email_path': 'datasets/datasets/trec07p/data/inmail.10'}]
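
Each index line holds a label and a path relative to the directory that contains the index file. The sketch below shows one way to parse a single line without depending on the exact `" ../"` separator. The helper name `parse_index_line` is hypothetical, and the sketch assumes the index sits in the `full` subdirectory, as the paths used in this notebook suggest.

```python
import os

DATASET_PATH = "datasets/datasets/trec07p"  # same base path as in the notebook

def parse_index_line(line, dataset_path=DATASET_PATH):
    """Hypothetical helper: turn one index line into a label/path dictionary."""
    # 'spam ../data/inmail.1\n' -> label 'spam', relative path '../data/inmail.1\n'
    label, rel_path = line.split(maxsplit=1)
    # Assumption: the index lives in <dataset>/full/, so '../data/...'
    # is resolved against that directory.
    full_dir = os.path.join(dataset_path, "full")
    email_path = os.path.normpath(os.path.join(full_dir, rel_path.strip()))
    return {"label": label, "email_path": email_path}

entry = parse_index_line("spam ../data/inmail.1\n")
print(entry["label"], entry["email_path"])
```

Using `strip()` instead of slicing off the last character also copes with a final line that has no trailing newline.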

# 2.- Preprocessing the dataset

The functions presented above make it possible to read the emails programmatically and preprocess them to remove the components that are not useful for detecting SPAM. However, each email is still represented by a Python dictionary containing a series of words.

In [11]:
# Load the index and the labels into memory.
index = parse_index("datasets/datasets/trec07p/full/index", 1)

In [12]:
import os

open(index[0]["email_path"]).read()

'From RickyAmes@aol.com  Sun Apr  8 13:07:32 2007\nReturn-Path: <RickyAmes@aol.com>\nReceived: from 129.97.78.23 ([211.202.101.74])\n\tby speedy.uwaterloo.ca (8.12.8/8.12.5) with SMTP id l38H7G0I003017;\n\tSun, 8 Apr 2007 13:07:21 -0400\nReceived: from 0.144.152.6 by 211.202.101.74; Sun, 08 Apr 2007 19:04:48 +0100\nMessage-ID: <WYADCKPDFWWTWTXNFVUE@yahoo.com>\nFrom: "Tomas Jacobs" <RickyAmes@aol.com>\nReply-To: "Tomas Jacobs" <RickyAmes@aol.com>\nTo: the00@speedy.uwaterloo.ca\nSubject: Generic Cialis, branded quality@ \nDate: Sun, 08 Apr 2007 21:00:48 +0300\nX-Mailer: Microsoft Outlook Express 6.00.2600.0000\nMIME-Version: 1.0\nContent-Type: multipart/alternative;\n\tboundary="--8896484051606557286"\nX-Priority: 3\nX-MSMail-Priority: Normal\nStatus: RO\nContent-Length: 988\nLines: 24\n\n----8896484051606557286\nContent-Type: text/html;\nContent-Transfer-Encoding: 7Bit\n\n<html>\n<body bgcolor="#ffffff">\n<div style="border-color: #00FFFF; border-right-width: 0px; border-bottom-width: 0

In [13]:
# Parse the first email.

mail, label = parse_email(index[0])
print("The email is: ", label, "\n")
print(mail)

The email is:  spam 

{'subject': ['gener', 'ciali', 'brand', 'qualiti'], 'body': ['do', 'feel', 'pressur', 'perform', 'rise', 'occasiontri', 'viagrayour', 'anxieti', 'thing', 'past', 'willb', 'back', 'old', 'self'], 'content_type': 'multipart/alternative'}


The logistic regression algorithm cannot ingest text as part of the dataset. A series of additional functions must therefore be applied to transform the text of the parsed emails into a numerical representation.
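
To make the idea of a numerical representation concrete, here is a minimal bag-of-words sketch in plain Python. The function name `bag_of_words` and the two toy documents are assumptions made for illustration. scikit-learn's `CountVectorizer` follows the same principle but applies its own tokenization rules and returns a sparse matrix.

```python
from collections import Counter

def bag_of_words(documents):
    """Return a sorted vocabulary and one row of word counts per document."""
    # The vocabulary is the sorted set of all words seen in any document.
    vocabulary = sorted({word for doc in documents for word in doc.split()})
    rows = []
    for doc in documents:
        counts = Counter(doc.split())
        # A Counter returns 0 for words that do not occur in this document.
        rows.append([counts[word] for word in vocabulary])
    return vocabulary, rows

vocabulary, rows = bag_of_words(["cheap pills cheap", "meeting at noon"])
print(vocabulary)  # ['at', 'cheap', 'meeting', 'noon', 'pills']
print(rows)        # [[0, 2, 0, 0, 1], [1, 0, 1, 1, 0]]
```

Each column corresponds to one word of the vocabulary, and each row is the numeric vector that a classifier can consume.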

##### Applying CountVectorizer

In [14]:
from sklearn.feature_extraction.text import CountVectorizer

# Prepare the email as a single text string.
# Note: subject and body are concatenated without a separating space, so the
# last subject token and the first body token merge ('qualitido' below).
prep_email = [" ".join(mail['subject']) + " ".join(mail['body'])]

vectorizer = CountVectorizer()
vectorizer.fit(prep_email)

print("e-mail", prep_email, "\n")
print("Input features", vectorizer.get_feature_names_out())

e-mail ['gener ciali brand qualitido feel pressur perform rise occasiontri viagrayour anxieti thing past willb back old self'] 

Input features ['anxieti' 'back' 'brand' 'ciali' 'feel' 'gener' 'occasiontri' 'old'
 'past' 'perform' 'pressur' 'qualitido' 'rise' 'self' 'thing' 'viagrayour'
 'willb']


In [15]:
X = vectorizer.transform(prep_email)
print("\nValues:\n", X.toarray())


Values:
 [[1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]]


In [16]:
from sklearn.preprocessing import OneHotEncoder

prep_email = [[w] for w in mail['subject'] + mail['body']]
enc = OneHotEncoder(handle_unknown = 'ignore') 
X= enc.fit_transform(prep_email)

print("Feature: \n", enc.get_feature_names_out())
print("Values: \n", X.toarray())

Feature: 
 ['x0_anxieti' 'x0_back' 'x0_brand' 'x0_ciali' 'x0_do' 'x0_feel' 'x0_gener'
 'x0_occasiontri' 'x0_old' 'x0_past' 'x0_perform' 'x0_pressur'
 'x0_qualiti' 'x0_rise' 'x0_self' 'x0_thing' 'x0_viagrayour' 'x0_willb']
Values: 
 [[0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0.]
 [1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0

#### Helper functions for preprocessing the dataset

In [17]:
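def create_prep_dataset(index_path, n_elements):
    """Parse the first n_elements emails and return their texts (X) and labels (y)."""
    X = []
    y = []

    indexes = parse_index(index_path, n_elements)
    for i in range(n_elements):
        print("\nParsing email: {0}".format(i+1), end = "")
        mail, label = parse_email(indexes[i])
        # One string per email: stemmed subject tokens followed by body tokens
        X.append(" ".join(mail['subject']) + " ".join(mail['body']))
        y.append(label)
    return X, y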
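
The expression `" ".join(subject) + " ".join(body)` used above puts no space between the two parts, so the last subject token and the first body token become a single word. The sketch below contrasts that behaviour with a variant that joins the combined token list. Both helper names are hypothetical and the sample dictionary is a shortened version of the parsed email shown earlier.

```python
def join_as_in_notebook(mail):
    """Reproduces the concatenation used in the notebook."""
    return " ".join(mail["subject"]) + " ".join(mail["body"])

def join_with_separator(mail):
    """Hypothetical variant: join all tokens so every pair is space-separated."""
    return " ".join(mail["subject"] + mail["body"])

mail = {"subject": ["gener", "ciali", "brand", "qualiti"],
        "body": ["do", "feel"]}

print(join_as_in_notebook(mail))   # gener ciali brand qualitido feel
print(join_with_separator(mail))   # gener ciali brand qualiti do feel
```

Switching to the second form would alter the vocabulary and therefore the feature count and accuracy values reported in this notebook.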

# 3.- Training the algorithm

In [18]:
# Read only a subset of 1000 emails
X_train, y_train = create_prep_dataset("datasets/datasets/trec07p/full/index", 1000)
X_train


Parsing email: 1
Parsing email: 2
Parsing email: 3
Parsing email: 4
Parsing email: 5
Parsing email: 6
Parsing email: 7
Parsing email: 8
Parsing email: 9
Parsing email: 10
Parsing email: 11
Parsing email: 12
Parsing email: 13
Parsing email: 14
Parsing email: 15
Parsing email: 16
Parsing email: 17
Parsing email: 18
Parsing email: 19
Parsing email: 20
Parsing email: 21
Parsing email: 22
Parsing email: 23
Parsing email: 24
Parsing email: 25
Parsing email: 26
Parsing email: 27
Parsing email: 28
Parsing email: 29
Parsing email: 30
Parsing email: 31
Parsing email: 32
Parsing email: 33
Parsing email: 34
Parsing email: 35
Parsing email: 36
Parsing email: 37
Parsing email: 38
Parsing email: 39
Parsing email: 40
Parsing email: 41
Parsing email: 42
Parsing email: 43
Parsing email: 44
Parsing email: 45
Parsing email: 46
Parsing email: 47
Parsing email: 48
Parsing email: 49
Parsing email: 50
Parsing email: 51
Parsing email: 52
Parsing email: 53
Parsing email: 54
Parsing email: 55
Parsing email: 56


['gener ciali brand qualitido feel pressur perform rise occasiontri viagrayour anxieti thing past willb back old self',
 'typo debianreadmhi ive updat gulu i check mirrorsit seem littl typo debianreadm fileexamplehttpgulususherbrookecadebianreadmeftpftpfrdebianorgdebianreadmetest lenni access releas diststest thecurr test develop snapshot name etch packag whichhav test unstabl pass autom test propog tothi releaseetch replac lenni like readmehtml yan morinconsult en logiciel libreyanmorinsavoirfairelinuxcom5149941556 to unsubscrib email debianmirrorsrequestlistsdebianorgwith subject unsubscrib troubl contact listmasterlistsdebianorg',
 'authent viagramega authenticv i a g r a discount pricec i a l i s discount pricedo miss it click herehttpwwwmoujsjkhchumcom authent viagramega authenticv i a g r a discount pricec i a l i s discount pricedo miss it click',
 'nice talk yahey billi realli fun go night talk said feltinsecur manhood i notic toiletsy quit small area worri websit i tell secret

##### Vectorizing the data

In [19]:
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(X_train)

In [20]:
print(X_train.toarray())
print("\nFeatures", len(vectorizer.get_feature_names_out()))

[[0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]

Features 25417


In [21]:
import pandas as pd

In [22]:
pd.DataFrame(X_train.toarray(), columns=vectorizer.get_feature_names_out())

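|  | 00 | 000 | 0000 | 000000 | 000000categori | 000000smallitalictext | 000000storylink20fontfamili | 000002 | 000048000000000 | 000099alinkhov | ... | 淶ҵƿϊһ淶ĵƶȶҳ | 绰tel | 绰۹ϵͳctsƽe | 肾ǝvă | 鏗ėvłq | 饻jwkݤ | 鵵χ2ʶ3ҵ൵νռࡢ鵵õȹҫդãˡҵĵԭ뷽ҵĵĸصҵĵԭҵĵļšҵ1ҵߵ2ҵߵ | 뵭袵 | 뼰ʱϵ | 쵼ã |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 995 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 996 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 997 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 998 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |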


In [23]:
y_train

['spam',
 'ham',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'ham',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'ham',
 'ham',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'ham',
 'spam',
 'ham',
 'spam',
 'spam',
 'ham',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'ham',
 'ham',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'ham',
 'spam',
 'ham',
 'spam',
 'spam',
 'ham',
 'spam',
 'spam',
 'spam',
 'ham',
 'ham',
 'spam',
 'spam',
 'spam',
 'ham',
 'ham',
 'ham',
 'spam',
 'ham',
 'spam',
 'ham',
 'spam',
 'spam',
 'spam',
 'ham',
 'spam',
 'spam',
 'ham',
 'spam',
 'ham',
 'spam',
 'spam',
 'ham',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'ham',
 'spam',
 'spam',
 'spam'

##### Training the logistic regression algorithm with the preprocessed dataset

In [24]:
from sklearn.linear_model import LogisticRegression
clf = LogisticRegression()
clf.fit(X_train, y_train)
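
For intuition about what the fitted model computes, binary logistic regression passes a weighted sum of the input features through the sigmoid function to obtain a probability. The sketch below is a hand-written illustration under stated assumptions: the weights, bias and word counts are invented, and it does not reproduce scikit-learn's training procedure or its regularization.

```python
import math

def sigmoid(z):
    """Map any real number to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(counts, weights, bias):
    """Probability of the positive class for one vector of word counts."""
    z = bias + sum(w * x for w, x in zip(weights, counts))
    return sigmoid(z)

# Hypothetical parameters for a three-word vocabulary (illustration only).
weights = [1.5, -2.0, 0.5]
bias = -0.5
counts = [2, 0, 1]

p = predict_proba(counts, weights, bias)
print(round(p, 3))  # 0.953
```

A probability above 0.5 corresponds to predicting the positive class, which would be 'spam' if the labels are ordered as ham, spam.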

# 4.- Prediction

In [26]:
# Read 1500 emails from the dataset and keep only the last 500.
# These 500 emails were not used to train the algorithm.
X, y = create_prep_dataset("datasets/datasets/trec07p/full/index", 1500)
X_test = X[1000:]
y_test= y[1000:]


Parsing email: 1
Parsing email: 2
Parsing email: 3
Parsing email: 4
Parsing email: 5
Parsing email: 6
Parsing email: 7
Parsing email: 8
Parsing email: 9
Parsing email: 10
Parsing email: 11
Parsing email: 12
Parsing email: 13
Parsing email: 14
Parsing email: 15
Parsing email: 16
Parsing email: 17
Parsing email: 18
Parsing email: 19
Parsing email: 20
Parsing email: 21
Parsing email: 22
Parsing email: 23
Parsing email: 24
Parsing email: 25
Parsing email: 26
Parsing email: 27
Parsing email: 28
Parsing email: 29
Parsing email: 30
Parsing email: 31
Parsing email: 32
Parsing email: 33
Parsing email: 34
Parsing email: 35
Parsing email: 36
Parsing email: 37
Parsing email: 38
Parsing email: 39
Parsing email: 40
Parsing email: 41
Parsing email: 42
Parsing email: 43
Parsing email: 44
Parsing email: 45
Parsing email: 46
Parsing email: 47
Parsing email: 48
Parsing email: 49
Parsing email: 50
Parsing email: 51
Parsing email: 52
Parsing email: 53
Parsing email: 54
Parsing email: 55
Parsing email: 56


##### Preprocessing the emails with the vectorizer created earlier

In [27]:
X_test = vectorizer.transform(X_test)

In [28]:
y_pred = clf.predict(X_test)
y_pred

array(['spam', 'ham', 'ham', 'spam', 'spam', 'ham', 'spam', 'spam',
       'spam', 'spam', 'ham', 'spam', 'spam', 'spam', 'spam', 'ham',
       'spam', 'spam', 'spam', 'spam', 'spam', 'ham', 'ham', 'ham',
       'spam', 'spam', 'spam', 'spam', 'ham', 'spam', 'ham', 'spam',
       'spam', 'spam', 'spam', 'spam', 'spam', 'spam', 'ham', 'spam',
       'spam', 'ham', 'ham', 'ham', 'spam', 'spam', 'spam', 'spam', 'ham',
       'spam', 'spam', 'spam', 'spam', 'ham', 'spam', 'ham', 'spam',
       'spam', 'spam', 'spam', 'spam', 'spam', 'spam', 'spam', 'ham',
       'spam', 'ham', 'spam', 'spam', 'spam', 'spam', 'spam', 'ham',
       'spam', 'spam', 'spam', 'spam', 'ham', 'spam', 'spam', 'spam',
       'spam', 'spam', 'spam', 'spam', 'spam', 'ham', 'ham', 'spam',
       'spam', 'spam', 'spam', 'spam', 'ham', 'ham', 'ham', 'ham', 'spam',
       'spam', 'spam', 'ham', 'spam', 'spam', 'spam', 'spam', 'spam',
       'ham', 'ham', 'spam', 'ham', 'spam', 'spam', 'spam', 'spam',
       'spam', 'spam'

In [29]:
print("Prediction: \n", y_pred)
print("\nTrue labels: \n", y_test)

Prediction: 
 ['spam' 'ham' 'ham' 'spam' 'spam' 'ham' 'spam' 'spam' 'spam' 'spam' 'ham'
 'spam' 'spam' 'spam' 'spam' 'ham' 'spam' 'spam' 'spam' 'spam' 'spam'
 'ham' 'ham' 'ham' 'spam' 'spam' 'spam' 'spam' 'ham' 'spam' 'ham' 'spam'
 'spam' 'spam' 'spam' 'spam' 'spam' 'spam' 'ham' 'spam' 'spam' 'ham' 'ham'
 'ham' 'spam' 'spam' 'spam' 'spam' 'ham' 'spam' 'spam' 'spam' 'spam' 'ham'
 'spam' 'ham' 'spam' 'spam' 'spam' 'spam' 'spam' 'spam' 'spam' 'spam'
 'ham' 'spam' 'ham' 'spam' 'spam' 'spam' 'spam' 'spam' 'ham' 'spam' 'spam'
 'spam' 'spam' 'ham' 'spam' 'spam' 'spam' 'spam' 'spam' 'spam' 'spam'
 'spam' 'ham' 'ham' 'spam' 'spam' 'spam' 'spam' 'spam' 'ham' 'ham' 'ham'
 'ham' 'spam' 'spam' 'spam' 'ham' 'spam' 'spam' 'spam' 'spam' 'spam' 'ham'
 'ham' 'spam' 'ham' 'spam' 'spam' 'spam' 'spam' 'spam' 'spam' 'spam'
 'spam' 'spam' 'ham' 'ham' 'ham' 'ham' 'spam' 'spam' 'spam' 'ham' 'spam'
 'ham' 'ham' 'spam' 'spam' 'spam' 'ham' 'spam' 'spam' 'spam' 'ham' 'spam'
 'ham' 'spam' 'spam' 'spam' 'spam' 'spam

Evaluating the results.

In [30]:
from sklearn.metrics import accuracy_score
print("Accuracy: {:.3f}".format(accuracy_score(y_test, y_pred)))

Accuracy: 0.934
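
Accuracy is the fraction of predictions that match the true labels. Because a spam filter makes two kinds of mistakes with different costs, precision and recall for the 'spam' class are often inspected as well. The sketch below computes all three by hand on invented toy labels; the function names are hypothetical and the numbers are unrelated to the results of this notebook.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions equal to the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

def precision_recall(y_true, y_pred, positive="spam"):
    """Precision and recall for the class given in `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Toy labels, invented for this example.
y_true = ["spam", "spam", "ham", "spam", "ham"]
y_pred = ["spam", "ham", "ham", "spam", "spam"]

print(accuracy(y_true, y_pred))  # 0.6
precision, recall = precision_recall(y_true, y_pred)
print(round(precision, 3), round(recall, 3))  # 0.667 0.667
```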


# 5.- Enlarging the dataset

In [32]:
# Read 12,000 emails
X, y = create_prep_dataset("datasets/datasets/trec07p/full/index", 12000)


Parsing email: 1
Parsing email: 2
Parsing email: 3
Parsing email: 4
Parsing email: 5
Parsing email: 6
Parsing email: 7
Parsing email: 8
Parsing email: 9
Parsing email: 10
Parsing email: 11
Parsing email: 12
Parsing email: 13
Parsing email: 14
Parsing email: 15
Parsing email: 16
Parsing email: 17
Parsing email: 18
Parsing email: 19
Parsing email: 20
Parsing email: 21
Parsing email: 22
Parsing email: 23
Parsing email: 24
Parsing email: 25
Parsing email: 26
Parsing email: 27
Parsing email: 28
Parsing email: 29
Parsing email: 30
Parsing email: 31
Parsing email: 32
Parsing email: 33
Parsing email: 34
Parsing email: 35
Parsing email: 36
Parsing email: 37
Parsing email: 38
Parsing email: 39
Parsing email: 40
Parsing email: 41
Parsing email: 42
Parsing email: 43
Parsing email: 44
Parsing email: 45
Parsing email: 46
Parsing email: 47
Parsing email: 48
Parsing email: 49
Parsing email: 50
Parsing email: 51
Parsing email: 52
Parsing email: 53
Parsing email: 54
Parsing email: 55
Parsing email: 56


In [33]:
# Use 10,000 emails to train the algorithm and 2,000 for testing
X_train, y_train = X[:10000], y[:10000]
X_test, y_test = X[10000:], y[10000:]

In [34]:
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(X_train)

In [35]:
clf = LogisticRegression()
clf.fit(X_train, y_train)

In [36]:
X_test = vectorizer.transform(X_test)

In [37]:
y_pred = clf.predict(X_test)

In [38]:
print("Accuracy: {:.3f}".format(accuracy_score(y_test, y_pred)))

Accuracy: 0.983
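
The label listing shown earlier contains many more 'spam' than 'ham' entries, so an accuracy value is easier to interpret next to the score of a trivial classifier that always predicts the most frequent label. The sketch below computes that baseline on invented toy labels; the helper name is hypothetical, and the same function could be applied to `y_test`.

```python
from collections import Counter

def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most frequent label."""
    most_common_label, count = Counter(labels).most_common(1)[0]
    return most_common_label, count / len(labels)

# Toy labels, invented for this example.
toy_labels = ["spam", "spam", "ham", "spam", "ham", "spam"]
label, baseline = majority_baseline_accuracy(toy_labels)
print(label, round(baseline, 3))  # spam 0.667
```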
