In [1]:
import pandas as pd

In [2]:
df = pd.read_csv('../../data/spam-train.csv')

In [3]:
df.head()

Unnamed: 0,input,target
0,Ok lar... Joking wif u oni...,ham
1,Free entry in 2 a wkly comp to win FA Cup fina...,spam
2,U dun say so early hor... U c already then say...,ham
3,"Nah I don't think he goes to usf, he lives aro...",ham
4,FreeMsg Hey there darling it's been 3 week's n...,spam


## Text classification

**Task:** Create a model that predicts the class (spam/ham) of a given message.

We will do this by creating a Pipeline that consists of 3 steps:

- Vectorize the data (represent each input message as a vector of token counts).
- Transform the data (re-weight the counts with TF-IDF).
- Create a classification model on top of the new representation of the data.

In [4]:
from sklearn.pipeline import Pipeline

In [5]:
# Vectorizer
from sklearn.feature_extraction.text import CountVectorizer

# Transformer
from sklearn.feature_extraction.text import TfidfTransformer  # term frequency * inverse document frequency

# Classifier
from sklearn.neural_network import MLPClassifier

In [6]:
pipe = Pipeline([
    ('vec', CountVectorizer(stop_words='english')),
    ('tfidf', TfidfTransformer()),
    ('nn', MLPClassifier(hidden_layer_sizes=(100, 50), activation='tanh'))
])

In [7]:
X = df['input']

In [8]:
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
y = le.fit_transform(df['target'])

In [9]:
pipe.fit(X,y)

In [10]:
# Accuracy on the training data itself, not on held-out messages
pipe.score(X, y)

1.0
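
A score of 1.0 here is measured on the same messages the model was fit on, so it says little about how the model handles unseen messages. A minimal sketch of a held-out estimate follows. It assumes a hypothetical toy dataset and a much smaller network than the one above so that it runs quickly; `toy_pipe`, `spam_msgs` and `ham_msgs` are illustrative names, and the resulting numbers are not meaningful for the real data.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline

# Hypothetical toy messages, not taken from spam-train.csv
spam_msgs = [
    "win a free prize now",
    "free entry claim your cash today",
    "urgent call to claim your prize",
    "you are a cash prize winner",
    "free ringtone text win to claim",
    "claim your free holiday voucher",
    "winner you won free cash",
    "text now to win free entry",
]
ham_msgs = [
    "see you at lunch",
    "call me when you get home",
    "are we still meeting today",
    "thanks for the lift home",
    "i will be late for dinner",
    "did you finish the report",
    "lunch at the usual place",
    "meeting moved to friday morning",
]
texts = spam_msgs + ham_msgs
labels = [1] * len(spam_msgs) + [0] * len(ham_msgs)  # 1 = spam, 0 = ham

# Hold out a quarter of the messages; stratify keeps both classes in each split
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=0
)

toy_pipe = Pipeline([
    ('vec', CountVectorizer(stop_words='english')),
    ('tfidf', TfidfTransformer()),
    ('nn', MLPClassifier(hidden_layer_sizes=(10,), activation='tanh',
                         max_iter=1000, random_state=0)),
])
toy_pipe.fit(X_train, y_train)

train_acc = toy_pipe.score(X_train, y_train)  # accuracy on seen messages
test_acc = toy_pipe.score(X_test, y_test)     # accuracy on unseen messages
print(train_acc, test_acc)
```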

In [11]:
out = pipe.predict(['Die... I accidentally deleted e msg i suppose 2 put in e sim archive. Haiz... I so sad...'])

In [12]:
le.inverse_transform(out)

array(['ham'], dtype=object)

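We can use `predict_proba` to estimate class probabilities and choose a decision threshold that fits the context of the problem.

For example, to lower the risk of letting spam through, we can label a message as spam as soon as its estimated spam probability exceeds a small threshold.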

In [13]:
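probas = pipe.predict_proba(["you have received your package"])
# Columns follow the encoded classes: LabelEncoder sorts labels, so 0 = ham, 1 = spam
ham_proba, spam_proba = probas[0]

# Label as spam whenever the spam probability exceeds 2%
if spam_proba > 0.02:
    out = 'spam'
else:
    out = 'ham'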

In [14]:
probas

array([[0.98950656, 0.01049344]])

In [15]:
out

'ham'
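
The thresholding step can be wrapped in a small helper that handles several messages at once and looks up the spam column instead of relying on its position. This is a sketch under stated assumptions: `classify`, `toy_pipe` and the toy messages are hypothetical names and data introduced for illustration, and the network is deliberately tiny.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy data standing in for spam-train.csv
texts = ["win a free prize now", "free cash claim today",
         "see you at lunch", "call me when you get home"]
targets = ["spam", "spam", "ham", "ham"]

le = LabelEncoder()
y = le.fit_transform(targets)

toy_pipe = Pipeline([
    ('vec', CountVectorizer(stop_words='english')),
    ('tfidf', TfidfTransformer()),
    ('nn', MLPClassifier(hidden_layer_sizes=(10,), activation='tanh',
                         max_iter=1000, random_state=0)),
])
toy_pipe.fit(texts, y)


def classify(pipe, encoder, messages, spam_threshold=0.02):
    """Label a message 'spam' when its spam probability exceeds the threshold."""
    probas = pipe.predict_proba(messages)
    # Find the probability column that corresponds to the 'spam' label
    spam_code = encoder.transform(["spam"])[0]
    spam_col = list(pipe.classes_).index(spam_code)
    return np.where(probas[:, spam_col] > spam_threshold, "spam", "ham")


labels = classify(toy_pipe, le,
                  ["you have received your package", "claim your free prize"])
print(labels)
```

Raising `spam_threshold` makes the filter more permissive; a threshold of 1.0 can never be exceeded, so every message would be labelled ham.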

## Model persistence

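scikit-learn models can be saved as pickled binary files and loaded later from a separate application. That application does not need the training code or data, but it does need scikit-learn installed (ideally the same version) to unpickle the object.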

In [16]:
import joblib
joblib.dump(pipe, 'pipeline.pkl')

['pipeline.pkl']
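
Loading the file back is the mirror operation, `joblib.load`. The sketch below is a self-contained round trip on a hypothetical toy pipeline (`toy_pipe`, `toy_pipeline.pkl` and the messages are illustrative, not part of the notebook's data) that checks the restored object predicts the same classes as the original.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline

# Hypothetical toy data (1 = spam, 0 = ham)
texts = ["win a free prize now", "free cash claim today",
         "see you at lunch", "call me when you get home"]
labels = [1, 1, 0, 0]

toy_pipe = Pipeline([
    ('vec', CountVectorizer(stop_words='english')),
    ('tfidf', TfidfTransformer()),
    ('nn', MLPClassifier(hidden_layer_sizes=(10,), activation='tanh',
                         max_iter=1000, random_state=0)),
])
toy_pipe.fit(texts, labels)

messages = ["claim your free prize", "lunch at home"]

# Save to a temporary directory, then load the pipeline back from disk
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'toy_pipeline.pkl')
    joblib.dump(toy_pipe, path)
    restored = joblib.load(path)

# The restored pipeline should reproduce the original predictions
same = bool((restored.predict(messages) == toy_pipe.predict(messages)).all())
print(same)
```

Only unpickle files from a source you trust, since loading a pickle can execute arbitrary code.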

## Weights

The weights learned during training are stored in the fitted classifier's `coefs_` attribute, a list with one weight matrix per layer.

In [17]:
pipe.named_steps['nn'].coefs_

[array([[ 0.02711572, -0.01964854, -0.03004793, ...,  0.03123645,
         -0.06429743, -0.03802805],
        [ 0.03211688, -0.02816248, -0.06198131, ...,  0.05819027,
         -0.03617192, -0.03875066],
        [-0.00654301,  0.0291064 ,  0.0073223 , ...,  0.01709385,
         -0.00374363, -0.00234654],
        ...,
        [-0.03195658,  0.01027423,  0.02840294, ...,  0.00050925,
          0.00443496,  0.00325789],
        [ 0.01121356,  0.0297211 ,  0.02353754, ...,  0.00605635,
         -0.00218721, -0.00561321],
        [ 0.00675875, -0.00741731, -0.02265729, ...,  0.04275281,
         -0.02086842, -0.02156378]]),
 array([[ 0.09485287,  0.01031972,  0.00823737, ...,  0.05219171,
          0.06747428, -0.20471939],
        [-0.02223352,  0.1065239 ,  0.04584068, ...,  0.15227551,
          0.16409351,  0.06842835],
        [ 0.09750196,  0.31661198,  0.05643015, ..., -0.00338447,
          0.21299589,  0.27178575],
        ...,
        [-0.19722557,  0.05344896, -0.24207183, ..., -

In [18]:
pipe.named_steps['nn'].coefs_[0].shape

(7505, 100)
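
Each matrix in `coefs_` has shape (units in the previous layer, units in the next layer). In the shape above, 7505 is the number of vocabulary terms the vectorizer produced and 100 is the width of the first hidden layer. The sketch below checks the same pattern on a hypothetical toy pipeline with hidden layers of 8 and 4 units; `toy_pipe` and the messages are illustrative only.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline

# Hypothetical toy data (1 = spam, 0 = ham)
texts = ["win a free prize now", "free cash claim today",
         "see you at lunch", "call me when you get home"]
labels = [1, 1, 0, 0]

toy_pipe = Pipeline([
    ('vec', CountVectorizer(stop_words='english')),
    ('tfidf', TfidfTransformer()),
    ('nn', MLPClassifier(hidden_layer_sizes=(8, 4), activation='tanh',
                         max_iter=1000, random_state=0)),
])
toy_pipe.fit(texts, labels)

# Number of input features = size of the fitted vocabulary
n_terms = len(toy_pipe.named_steps['vec'].vocabulary_)

# One weight matrix per layer: input -> 8, 8 -> 4, 4 -> 1 output unit
shapes = [w.shape for w in toy_pipe.named_steps['nn'].coefs_]
print(n_terms, shapes)
```

With two classes the network ends in a single output unit, so by the same logic the remaining matrices in the notebook's model would be expected to have shapes (100, 50) and (50, 1).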

In [19]:
pipe.named_steps['nn'].coefs_[0][0]

array([ 0.02711572, -0.01964854, -0.03004793,  0.06888016, -0.06958473,
        0.03319043, -0.04258098, -0.0244341 ,  0.01848258,  0.05877085,
        0.06115306, -0.02448434, -0.06916235,  0.03893557,  0.06129614,
        0.04779475, -0.02524209, -0.01784851, -0.0672942 ,  0.02178241,
        0.0271579 , -0.05759197,  0.03517576,  0.05532286,  0.03568352,
       -0.05735865, -0.03027907,  0.05480035,  0.05995709,  0.05908954,
        0.07019727,  0.02764915, -0.05471611,  0.06132288, -0.06099722,
        0.06089851, -0.05026024,  0.02838628, -0.05154651,  0.05573839,
       -0.06222068,  0.04227527,  0.0603626 , -0.07492042, -0.06119767,
       -0.04185345,  0.04066058,  0.04066194,  0.05877274,  0.04190873,
       -0.05177647, -0.04225024, -0.04583939, -0.02406191,  0.02930503,
        0.03933523,  0.05416791,  0.02401508, -0.04815017, -0.04266433,
        0.05766058, -0.05976618, -0.068397  , -0.04411259,  0.02619356,
        0.05085765,  0.05142763, -0.05038693,  0.04951947,  0.05