# Text segmentation using a Hidden Markov Model

In [3]:
import numpy as np

### Question 1 : Give the value of the π vector of the initial probabilities

π0 = [1, 0]: by assumption, a mail always starts in the header.

### Question 2 : What is the probability to move from state 1 to state 2 ? What is the probability to remain in state 2 ? What is the lower/higher probability ? Try to explain why

The probability of moving from the header (state 1) to the body (state 2) is 0.000781921964187974. The probability of remaining in state 2 is 1, which is the highest probability in A; the lowest is the probability of going back from the body to the header, which is 0. In a mail, the header always comes before the body, so there is exactly one transition from header to body and none in the other direction. While in the header, we are very likely to stay there (probability 0.999218078035812), but moving to the body remains possible.
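Since the header is left with a constant probability at each character, the time spent in the header follows a geometric distribution. As a rough illustration (the variable names below are illustrative and not part of the notebook), the expected header length implied by A can be computed directly from the two probabilities:

```python
# Transition probabilities taken from the matrix A used in this notebook
p_stay = 0.999218078035812       # header -> header
p_leave = 0.000781921964187974   # header -> body

# Each row of a transition matrix must sum to 1
row_sum = p_stay + p_leave

# Mean of a geometric distribution: expected number of characters
# spent in the header before switching to the body
expected_header_length = 1.0 / p_leave

print(row_sum)
print(round(expected_header_length))  # about 1279 characters
```

This is consistent with headers that are typically on the order of a thousand characters long.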

### Question 3 : What is the size of B ?

In [389]:
P = np.loadtxt('PerlScriptAndModel/PerlScriptAndModel/P.dat')

In [391]:
P.shape

(256, 2)

B (loaded here as P) has size 256 × 2: 256 rows (number of characters) and 2 columns (number of states).

### Question 4 : Print the track, then present and discuss the results obtained on mail11.txt to mail30.txt

In [364]:
# First, build a list of all the mails, each represented by its ASCII character codes

mails = []
for i in range(1,31):
    mails.append(np.loadtxt('dat/dat/mail'+str(i)+'.dat'))
# The mails have different lengths, so they are kept in a plain Python list
# (np.array on ragged sequences raises an error in recent NumPy versions)
print("There are",len(mails),"mails.")

There are 30 mails.


In [368]:
# Load the matrix containing the emission probability distributions for the 2 states

P = np.loadtxt('PerlScriptAndModel/PerlScriptAndModel/P.dat')

# Define the transition matrix A

A = np.array([[0.999218078035812,0.000781921964187974],[0.,1.]])

# Finally, define the vector representing the initial distribution

Pi0 = np.array([1,0]) # we assume that every mail has a header,
                      # and that we necessarily start in the header (state 0)


In [370]:
def viterbi(X,Pi0,A,P):
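    """
        Viterbi algorithm implementation (log domain)

        Arguments:
            - X: sequence of observations (character codes)
            - Pi0: vector of the initial state probabilities, shape (N,)
            - A: transition matrix, shape (N, N)
            - P: emission probability matrix, shape (number of symbols, N)
        Returns:
            - deltas: log-score of the best path ending in each state, shape (T, N)
            - path: most likely sequence of states, shape (T,)
    """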
    # Add the smallest positive double to avoid taking log(0)
    realmin = np.finfo(np.double).tiny
    A = np.log(A+realmin)
    Pi0 = np.log(Pi0+realmin)
    P = np.log(P+realmin)
    T = np.shape(X)[0] # number of observations
    N = Pi0.shape[0] # number of states in the model
    
    # Initialization
    deltas = np.zeros((T,N))
    bcktr = np.zeros((T,N))
    
    # Initialize deltas with the first observation
    for i in range(N):
        deltas[0][i] = Pi0[i]+P[int(X[0]),i]
    
    # Recursion: compute the scores at time t from those at time t-1
    
    for t in range(1,T):
        # for each observation
        for j in range(N):
            # for each state j, score of reaching j from every previous state
            scores = A[:,j]+deltas[t-1]
            deltas[t][j] = np.max(scores)+P[int(X[t]),j]
            # bcktr[t-1][j]: best state at time t-1 given state j at time t
            bcktr[t-1][j] = np.argmax(scores)
    
    # bcktr[T-1] is never read during backtracking, so it is left at zero
    
    # Backtracking: recover the best path from bcktr and deltas
    path = np.zeros(T)
    path[T-1] = int(np.argmax(deltas[T-1]))
    for t in range(T-2,-1,-1):
        path[t]=int(bcktr[t][int(path[t+1])])
        
    return deltas , path
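
A quick way to gain confidence in a Viterbi implementation is to compare it with an exhaustive search on a tiny model. The sketch below is self-contained and uses made-up numbers: `viterbi_log`, `brute_force` and the toy matrices are hypothetical and not part of the assignment. Its backpointer table is indexed slightly differently from `bcktr` above (it stores the predecessor at time t rather than at t-1).

```python
import itertools
import numpy as np

def viterbi_log(obs, pi, A, B):
    """Compact log-domain Viterbi; B has shape (n_symbols, n_states)."""
    tiny = np.finfo(np.double).tiny          # avoids log(0)
    log_pi, log_A, log_B = (np.log(m + tiny) for m in (pi, A, B))
    T, N = len(obs), len(pi)
    delta = np.zeros((T, N))
    back = np.zeros((T, N), dtype=int)
    delta[0] = log_pi + log_B[obs[0]]
    for t in range(1, T):
        # scores[i, j]: best score of being in i at t-1, then moving to j
        scores = delta[t - 1][:, None] + log_A
        back[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + log_B[obs[t]]
    path = np.zeros(T, dtype=int)
    path[-1] = delta[-1].argmax()
    for t in range(T - 2, -1, -1):
        path[t] = back[t + 1][path[t + 1]]
    return path

def brute_force(obs, pi, A, B):
    """Enumerate every state sequence and keep the most probable one."""
    best, best_p = None, -1.0
    for seq in itertools.product(range(len(pi)), repeat=len(obs)):
        p = pi[seq[0]] * B[obs[0], seq[0]]
        for t in range(1, len(obs)):
            p *= A[seq[t - 1], seq[t]] * B[obs[t], seq[t]]
        if p > best_p:
            best, best_p = seq, p
    return np.array(best)

# Toy left-to-right model (made-up numbers): state 0 = "header", state 1 = "body"
pi = np.array([1.0, 0.0])
A = np.array([[0.8, 0.2], [0.0, 1.0]])
B = np.array([[0.7, 0.1],    # symbol 0: typical of the header
              [0.2, 0.3],    # symbol 1: ambiguous
              [0.1, 0.6]])   # symbol 2: typical of the body
obs = [0, 0, 1, 2, 2, 1, 2]

toy_path = viterbi_log(obs, pi, A, B)
print(toy_path)                                               # [0 0 1 1 1 1 1]
print(np.array_equal(toy_path, brute_force(obs, pi, A, B)))   # True
```

The exhaustive search is exponential in the sequence length, so it is only usable as a check on very short sequences.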


In [371]:
# For each mail, print the mean of its path vector
# This gives the proportion of the mail taken up by the body

# Note: for now, header = state 0 and body = state 1

for i in range(len(mails)):
    print(i+1)
    print(np.mean(viterbi(mails[i],Pi0,A,P)[1]))

1
0.27223926380368096
2
0.27577014218009477
3
0.4247585155058465
4
0.32710280373831774
5
0.3201417601890136
6
0.24135783245094986
7
0.336018711018711
8
0.35216413715570544
9
0.7055072463768116
10
0.23391655450874832
11
0.179568345323741
12
0.2642123716503882
13
0.3076923076923077
14
0.2680961070559611
15
0.6793478260869565
16
0.24971450323563
17
0.3337226277372263
18
0.23074423139421515
19
0.19809160305343512
20
0.24404272801972063
21
0.21058558558558557
22
0.3867691463079879
23
0.42186666666666667
24
0.308295055390435
25
0.28381717109326743
26
0.546227893440788
27
0.437420584498094
28
0.12436048799685163
29
0.18892733564013842
30
0.578875968992248


As expected, mails 9 and 15 have the highest means (about 0.71 and 0.68): their body is long relative to their header.
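
To make the link between the mean and the body proportion explicit, here is a minimal sketch on a hypothetical decoded path (the array below is invented for illustration, it is not one of the 30 mails). It also shows one possible way to read the cut position off the path.

```python
import numpy as np

# Hypothetical decoded path: 0 = header, 1 = body (6 header chars, 4 body chars)
path = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])

# Because states are coded 0/1, the mean is the fraction of body characters
body_fraction = path.mean()

# In a left-to-right model the path switches only once, so the index of the
# first 1 is the position of the cut (argmax returns the first maximum)
cut_index = int(np.argmax(path == 1))

print(body_fraction)  # 0.4
print(cut_index)      # 6
```

Note that `np.argmax` returns 0 when the path contains no 1 at all, so a path that never leaves the header would need a separate check.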

In [374]:
# For each mail, save the path to a text file path{mail number}.txt

# NOTE: this time header = state 1 and body = state 2 (hence the str(int(x)+1))

n = len(mails)
for i in range(n):
    path = viterbi(mails[i],Pi0,A,P)[1]
    # the with block closes the file, so the path is fully written to disk
    with open('path'+str(i+1)+'.txt', "w") as fichier:
        for x in path:
            fichier.write(str(int(x)+1))

In [375]:
path = viterbi(mails[29],Pi0,A,P)[1]
with open('path30.txt', "w") as fichier:
    for x in path:
        fichier.write(str(int(x)+1))

### Output of perl segment.pl mail11.txt path11.txt

From spamassassin-devel-admin@lists.sourceforge.net  Thu Aug 22 15:25:29 2002
Return-Path: <spamassassin-devel-admin@example.sourceforge.net>
Delivered-To: zzzz@localhost.netnoteinc.com
Received: from localhost (localhost [127.0.0.1])
	by phobos.labs.netnoteinc.com (Postfix) with ESMTP id AE2D043F9B
	for <zzzz@localhost>; Thu, 22 Aug 2002 10:25:29 -0400 (EDT)
Received: from phobos [127.0.0.1]
	by localhost with IMAP (fetchmail-5.9.0)
	for zzzz@localhost (single-drop); Thu, 22 Aug 2002 15:25:29 +0100 (IST)
Received: from usw-sf-list2.sourceforge.net (usw-sf-fw2.sourceforge.net
    [216.136.171.252]) by dogma.slashnull.org (8.11.6/8.11.6) with ESMTP id
    g7MENlZ09984 for <zzzz@spamassassin.taint.org>; Thu, 22 Aug 2002 15:23:47 +0100
Received: from usw-sf-list1-b.sourceforge.net ([10.3.1.13]
    helo=usw-sf-list1.sourceforge.net) by usw-sf-list2.sourceforge.net with
    esmtp (Exim 3.31-VA-mm2 #1 (Debian)) id 17hsof-00042r-00; Thu,
    22 Aug 2002 07:20:05 -0700
Received: from vivi.uptime.at ([62.116.87.11] helo=mail.uptime.at) by
    usw-sf-list1.sourceforge.net with esmtp (Exim 3.31-VA-mm2 #1 (Debian)) id
    17hsoM-0000Ge-00 for <spamassassin-devel@lists.sourceforge.net>;
    Thu, 22 Aug 2002 07:19:47 -0700
Received: from [192.168.0.4] (chello062178142216.4.14.vie.surfer.at
    [62.178.142.216]) (authenticated bits=0) by mail.uptime.at (8.12.5/8.12.5)
    with ESMTP id g7MEI7Vp022036 for
    <spamassassin-devel@lists.sourceforge.net>; Thu, 22 Aug 2002 16:18:07
    +0200
User-Agent: Microsoft-Entourage/10.0.0.1309
From: David H=?ISO-8859-1?B?9g==?=hn <dh@uptime.at>
To: <spamassassin-devel@example.sourceforge.net>
Message-Id: <B98ABFA4.1F87%dh@uptime.at>
MIME-Version: 1.0
X-Trusted: YES
X-From-Laptop: YES
Content-Type: text/plain; charset="US-ASCII"
Content-Transfer-Encoding: 7bit
X-Mailscanner: Nothing found, baby
Subject: [SAdev] Interesting approach to Spam handling..
Sender: spamassassin-devel-admin@example.sourceforge.net
Errors-To: spamassassin-devel-admin@example.sourceforge.net
X-Beenthere: spamassassin-devel@example.sourceforge.net
X-Mailman-Version: 2.0.9-sf.net
Precedence: bulk
List-Help: <mailto:spamassassin-devel-request@example.sourceforge.net?subject=help>
List-Post: <mailto:spamassassin-devel@example.sourceforge.net>
List-Subscribe: <https://example.sourceforge.net/lists/listinfo/spamassassin-devel>,
    <mailto:spamassassin-devel-request@lists.sourceforge.net?subject=subscribe>
List-Id: SpamAssassin Developers <spamassassin-devel.example.sourceforge.net>
List-Unsubscribe: <https://example.sourceforge.net/lists/listinfo/spamassassin-devel>,
    <mailto:spamassassin-devel-request@lists.sourceforge.net?subject=unsubscribe>
List-Archive: <http://www.geocrawler.com/redir-sf.php3?list=spamassassin-devel>
X-Original-Date: Thu, 22 Aug 2002 16:19:48 +0200
Date: Thu, 22 Aug 2002 16:19:48 +0200

========================== coupez ici ==========================


Hello, have you seen and discussed this article and his approach?

Thank you

http://www.paulgraham.com/spam.html
-- "Hell, there are no rules here-- we're trying to accomplish something."
-- Thomas Alva Edison




-------------------------------------------------------
This sf.net email is sponsored by: OSDN - Tired of that same old
cell phone?  Get a new here for FREE!
https://www.inphonic.com/r.asp?r=sourceforge1&refcode1=vs3390
_______________________________________________
Spamassassin-devel mailing list
Spamassassin-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/spamassassin-devel


### Output of perl segment.pl mail30.txt path30.txt

From ilug-admin@linux.ie  Fri Aug 23 11:07:51 2002
Return-Path: <ilug-admin@linux.ie>
Delivered-To: zzzz@localhost.netnoteinc.com
Received: from localhost (localhost [127.0.0.1])
	by phobos.labs.netnoteinc.com (Postfix) with ESMTP id 7419C4416C
	for <zzzz@localhost>; Fri, 23 Aug 2002 06:06:33 -0400 (EDT)
Received: from phobos [127.0.0.1]
	by localhost with IMAP (fetchmail-5.9.0)
	for zzzz@localhost (single-drop); Fri, 23 Aug 2002 11:06:33 +0100 (IST)
Received: from lugh.tuatha.org (root@lugh.tuatha.org [194.125.145.45]) by
    dogma.slashnull.org (8.11.6/8.11.6) with ESMTP id g7MJtgZ22471 for
    <zzzz-ilug@spamassassin.taint.org>; Thu, 22 Aug 2002 20:55:42 +0100
Received: from lugh (root@localhost [127.0.0.1]) by lugh.tuatha.org
    (8.9.3/8.9.3) with ESMTP id UAA19436; Thu, 22 Aug 2002 20:53:00 +0100
X-Authentication-Warning: lugh.tuatha.org: Host root@localhost [127.0.0.1]
    claimed to be lugh
Received: from mail02.svc.cra.dublin.eircom.net
    (mail02.svc.cra.dublin.eircom.net [159.134.118.18]) by lugh.tuatha.org
    (8.9.3/8.9.3) with SMTP id UAA19403 for <ilug@linux.ie>; Thu,
    22 Aug 2002 20:52:53 +0100
Received: (qmail 50842 messnum 34651 invoked from
    network[159.134.205.176/p432.as1.athlone1.eircom.net]); 22 Aug 2002
    19:52:16 -0000
Received: from p432.as1.athlone1.eircom.net (HELO darkstar)
    (159.134.205.176) by mail02.svc.cra.dublin.eircom.net (qp 50842) with SMTP;
    22 Aug 2002 19:52:16 -0000
Content-Type: text/plain; charset="iso-8859-15"
From: Ciaran Johnston <cj@nologic.org>
Organization: nologic.org
To: <ilug@linux.ie>
Subject: Re: [ILUG] Formatting a windows partition from Linux
Date: Thu, 22 Aug 2002 20:58:07 +0100
User-Agent: KMail/1.4.1
References: <1029944325.29456.28.camel@dubrhlnx1>
    <26030.194.237.142.30.1029943301.squirrel@mail.nologic.org>
In-Reply-To: <26030.194.237.142.30.1029943301.squirrel@mail.nologic.org>
MIME-Version: 1.0
Message-Id: <200208222058.07760.cj@nologic.org>
Content-Transfer-Encoding: 8bit
X-MIME-Autoconverted: from quoted-printable to 8bit by lugh.tuatha.org id
    UAA19403
Sender: ilug-admin@linux.ie
Errors-To: ilug-admin@linux.ie
X-Mailman-Version: 1.1
Precedence: bulk
L

========================== coupez ici ==========================


ist-Id: Irish Linux Users' Group <ilug.linux.ie>
X-Beenthere: ilug@linux.ie

Update on this for anyone that's interested, and because I like closed 
threads... nothing worse than an infinite while loop, is there?

I ended up formatting a floppy on my flatmate's (un-networked) P100 running 
FAT16 Win95, and mcopied the contents of the bootdisk across. Now I have a 
FAT16 Win98 install running alongside Slackware, and can play Metal Gear 
Solid when the mood takes me ;)

/Ciaran.

On Wednesday 21 August 2002 16:21, Ciaran Johnston wrote:
> Dublin said:
> > If you copy the files from your disk to the c: partition and mark it as
> > active it should work ...
>
> Yeah, I figured that, but it doesn't seem to ... well, if that's the case
> I'll give it another go tonight, maybe come back with some error messages.
>
> Just to clarify for those who didn't understand me initially - I have a
> floppy drive installed, but it doesn't physically work. There's nowhere
> handy to pick one up where I am, and I don't fancy waiting a few days for
> one to arrive from Peats.
>
> Thanks for the answers,
> Ciaran.
>
> > You especially need io.sys, command.com and msdos.sys
> >
> > your cd driver .sys and read the autoexec.bat and config.sys files for
> > hints on what you did with your boot floppy <g>
> >
> > P
> >
> > On Wed, 2002-08-21 at 14:07, Ciaran Johnston wrote:
> >> Hi folks,
> >> The situation is this: at home, I have a PC with 2 10Gig HDDs, and no
> >> (working) floppy drive. I have been running Linux solely for the last
> >> year, but recently got the urge to, among other things, play some of
> >> my Windoze games. I normally install the windows partition using a
> >> boot floppy which I have conveniently zipped up, but I haven't any way
> >> of writing or reading a floppy.
> >> So, how do I go about:
> >> 1. formatting a C: drive with system files (normally I would use
> >> format /s c: from the floppy).
> >> 2. Installing the CDROM drivers (my bootdisk (I wrote it many years
> >> ago) does this normally).
> >> 3. Booting from the partition?
> >>
> >> I wiped all my linux partitions from the first drive and created
> >> partitions for Windows (HDA1) Slackware and RedHat. I used cfdisk for
> >> this. I made the first drive (hda) bootable. I then installed the
> >> windows partition in LILO and reran lilo (installed in MBR). I copied
> >> the contents of boot.zip to my new windows partition and tried to boot
> >> it - all I get is a garbled line of squiggles.
> >>
> >> Anyone any ideas? I can't think of anywhere in Athlone to get a new
> >> floppy drive this evening...
> >>
> >> Thanks,
> >> Ciaran.
> >>
> >>
> >>
> >> --
> >> Irish Linux Users' Group: ilug@linux.ie
> >> http://www.linux.ie/mailman/listinfo/ilug for (un)subscription
> >> information. List maintainer: listmaster@linux.ie


-- 
Irish Linux Users' Group: ilug@linux.ie
http://www.linux.ie/mailman/listinfo/ilug for (un)subscription information.
List maintainer: listmaster@linux.ie



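For mail 11, the cut (the "coupez ici" marker, i.e. "cut here") falls at the right place. The transition happens at the character H of Hello, which is more likely to belong to the body than to the header.

For mail 30, on the other hand, the cut comes slightly too early: it falls right after the L of List-Id, so the last two header lines (List-Id and X-Beenthere) end up in the body.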

In [388]:
# Look at the distribution of the character H (ASCII code 72)

print("Distribution of H",P[72])

# Look at the distribution of the character = (ASCII code 61)

print("Distribution of =",P[61])

# Look at the distribution of the character 2 (ASCII code 50)

print("Distribution of 2",P[50])

Distribution of H [0.00043041 0.00128733]
Distribution of = [0.00109497 0.00012316]
Distribution of 2 [0.03010139 0.00263533]


This makes sense: the equals sign and the digit 2 are more frequent in the header than in the body, and the opposite holds for the character H.
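
One way to quantify this is the log-ratio of the two emission probabilities, which is the amount each character adds to the body score relative to the header score in the log-domain Viterbi recursion. The sketch below simply reuses the three pairs of values printed above; the dictionary names are illustrative.

```python
import numpy as np

# Emission probabilities [header, body] copied from the output above
emissions = {
    "H": np.array([0.00043041, 0.00128733]),
    "=": np.array([0.00109497, 0.00012316]),
    "2": np.array([0.03010139, 0.00263533]),
}

# log(P(char | body) / P(char | header)):
# positive favours the body, negative favours the header
log_ratios = {c: float(np.log(p[1] / p[0])) for c, p in emissions.items()}

for c, r in log_ratios.items():
    print(c, round(r, 2))
```

The ratio is roughly +1.1 for H and roughly -2.2 and -2.4 for = and 2, so a single H only weakly favours the body while = and 2 favour the header more strongly.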

### Question 5 : How would you model the problem if you had to segment the mails in more than two parts (for example : header, body, signature) ?

We would need a transition matrix A of size 3 × 3 and an emission matrix B of size (256, 3).
These matrices A and B could be learned in the same way as here, from about ten annotated mails.
We would have Pi0 = [1,0,0], and in the matrix A the only allowed moves (besides staying in the current state) would be from state 1 to state 2 and from state 2 to state 3.
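
A minimal sketch of what such a three-state model could look like is given below. All numerical values are placeholders chosen for illustration, and the names `Pi0_3`, `A_3` and `B_3` are hypothetical; real values would have to be estimated from annotated mails.

```python
import numpy as np

# Placeholder probabilities, for illustration only: in practice they would be
# estimated from annotated mails, like the two-state model
p_header_to_body = 0.001       # hypothetical
p_body_to_signature = 0.002    # hypothetical

Pi0_3 = np.array([1.0, 0.0, 0.0])   # a mail always starts in the header

# Left-to-right transition matrix: stay, or move to the next state only
A_3 = np.array([
    [1 - p_header_to_body, p_header_to_body, 0.0],
    [0.0, 1 - p_body_to_signature, p_body_to_signature],
    [0.0, 0.0, 1.0],
])

# Emission matrix: one column of 256 character probabilities per state
# (uniform placeholder values here)
B_3 = np.full((256, 3), 1.0 / 256)

print(A_3.shape, B_3.shape)
print(A_3.sum(axis=1))   # every row sums to 1
```

The zeros below the diagonal encode the fact that a mail never goes back to an earlier part, and the zero in the top-right corner forbids jumping from the header straight to the signature.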

### Question 6 : How would you model the problem of separating the portions of mail included, knowing that they always start with the character ">" ?

We would need to increase, in the matrix B, the probability of the character > for state 2.
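
As a rough sketch of that idea, the snippet below boosts the entry of > in one column of an emission matrix and renormalizes the column so that it remains a probability distribution. The uniform matrix, the boost factor and the choice of column index 1 for state 2 are all assumptions made for illustration.

```python
import numpy as np

# Placeholder emission matrix (uniform), standing in for the learned B
B = np.full((256, 2), 1.0 / 256)

quote_code = ord(">")    # ASCII code 62
state = 1                # column of state 2 (assumption: 0-based columns)
boost = 10.0             # hypothetical factor

B_quoted = B.copy()
B_quoted[quote_code, state] *= boost
# Renormalize the column so that it remains a probability distribution
B_quoted[:, state] /= B_quoted[:, state].sum()

print(B_quoted[quote_code, state] > B[quote_code, state])
print(B_quoted.sum(axis=0))
```

Renormalizing slightly lowers the probability of every other character in that state, which is the price of keeping the column summing to 1.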