import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder

# Vectorizer
from sklearn.feature_extraction.text import CountVectorizer

# Transformer
from sklearn.feature_extraction.text import TfidfTransformer

# Classifier
from sklearn.neural_network import MLPClassifier

In [5]:
df = pd.read_csv('../../data/spam-train.csv')
df.head()

| Unnamed: 0 | input | target |
|---|---|---|
| 0 | Ok lar... Joking wif u oni... | ham |
| 1 | Free entry in 2 a wkly comp to win FA Cup fina... | spam |
| 2 | U dun say so early hor... U c already then say... | ham |
| 3 | Nah I don't think he goes to usf, he lives aro... | ham |
| 4 | FreeMsg Hey there darling it's been 3 week's n... | spam |


### Text classification

**Task**: Create a model that predicts the class (spam/ham) of a given message.

We will create a Pipeline of 3 steps:

- Vectorize the data (represent each message as a vector of token counts)
- Transform the counts into TF-IDF weights
- Train a classification model on top of the new representation of the data
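To make the first two steps concrete, here is a minimal sketch that runs them on three made-up messages (not taken from the spam dataset). The variable names are illustrative only; the default `CountVectorizer` settings are used, which drop single-character tokens such as "a".

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Toy messages, invented for illustration
messages = [
    "win a free prize now",
    "are you free for lunch",
    "call now to claim your prize",
]

# Step 1: map each message to a vector of token counts
vec = CountVectorizer()
counts = vec.fit_transform(messages)

# Step 2: reweight the counts with TF-IDF (rows are L2-normalized by default)
tfidf = TfidfTransformer()
weights = tfidf.fit_transform(counts)

# Columns follow the vocabulary in alphabetical order
print(sorted(vec.vocabulary_))
print(counts.toarray())
print(weights.toarray().round(2))
```

Each row is one message and each column is one vocabulary term. The classifier in the third step only ever sees the TF-IDF matrix, never the raw text.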

pipe = Pipeline([
    ('vec', CountVectorizer(stop_words='english')),
    ('tfidf', TfidfTransformer()),
    ('nn', MLPClassifier(hidden_layer_sizes=(100,50,), activation='tanh'))
])

X = df['input']
le = LabelEncoder()
y = le.fit_transform(df['target'])

In [46]:
pipe.fit(X, y)

In [47]:
pipe.score(X, y)

1.0

In [48]:
df_valid = pd.read_csv('../../data/spam-validation.csv')
df_valid.head()

| Unnamed: 0 | input | target |
|---|---|---|
| 0 | Die... I accidentally deleted e msg i suppose ... | ham |
| 1 | Welcome to UK-mobile-date this msg is FREE giv... | spam |
| 2 | This is wishing you a great day. Moji told me ... | ham |
| 3 | Thanks again for your reply today. When is ur ... | ham |
| 4 | Sorry I flaked last night, shit's seriously go... | ham |
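The score of 1.0 above was computed on the same data the model was trained on, so it says little about how the model handles unseen messages. A held-out set such as the validation file can be scored the same way. The sketch below uses small made-up lists as stand-ins for `df['input']`, `df['target']` and the corresponding validation columns, and a much smaller network than the one above, so the numbers it prints are not meaningful in themselves.

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.neural_network import MLPClassifier

# Made-up stand-ins for the training and validation columns
train_texts = [
    'win a free prize now', 'claim your free cash prize', 'free entry win cash',
    'are you coming for lunch', 'see you at home tonight', 'call me when you land',
]
train_targets = ['spam', 'spam', 'spam', 'ham', 'ham', 'ham']
valid_texts = ['free cash prize waiting', 'lunch at home tonight']
valid_targets = ['spam', 'ham']

le = LabelEncoder()
y_train = le.fit_transform(train_targets)
# Reuse the fitted encoder so labels map to the same integers
y_valid = le.transform(valid_targets)

pipe = Pipeline([
    ('vec', CountVectorizer(stop_words='english')),
    ('tfidf', TfidfTransformer()),
    ('nn', MLPClassifier(hidden_layer_sizes=(10,), max_iter=300, random_state=0)),
])
pipe.fit(train_texts, y_train)

# Accuracy on seen vs. unseen messages
train_acc = pipe.score(train_texts, y_train)
valid_acc = pipe.score(valid_texts, y_valid)
print('train accuracy:', train_acc)
print('validation accuracy:', valid_acc)
```

A large gap between the two numbers is a sign of overfitting.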


In [22]:
out = pipe.predict(['Die... I accidentally deleted ae msg I suppose 2 put in e sim archive. Haiz... I so sad...'])
le.inverse_transform(out)

array(['ham'], dtype=object)

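We can use `predict_proba` to estimate class probabilities and set the decision threshold to suit the problem.

For example, to make the filter less tolerant of spam, we can flag a message as soon as its estimated spam probability exceeds a small value such as 0.02, at the cost of flagging more legitimate messages.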

In [36]:
[(ham_proba, spam_proba)] = pipe.predict_proba(['you have received your package'])

if spam_proba > 0.02:
    print('Spam')
else:
    print('Not spam')


Spam
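The tuple unpacking above relies on the columns of `predict_proba` being ordered as (ham, spam). That holds because `LabelEncoder` assigns integers in sorted label order, but the column can also be looked up explicitly. The sketch below does so on made-up training messages; `flag_spam` is a hypothetical helper, not part of scikit-learn.

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.neural_network import MLPClassifier

# Made-up training messages, for illustration only
texts = [
    'win a free prize now', 'claim your free cash prize', 'free entry win cash',
    'are you coming for lunch', 'see you at home tonight', 'call me when you land',
]
targets = ['spam', 'spam', 'spam', 'ham', 'ham', 'ham']

le = LabelEncoder()
y = le.fit_transform(targets)

pipe = Pipeline([
    ('vec', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('nn', MLPClassifier(hidden_layer_sizes=(10,), max_iter=300, random_state=0)),
])
pipe.fit(texts, y)

# Find which predict_proba column corresponds to 'spam'
spam_label = le.transform(['spam'])[0]
spam_col = list(pipe.named_steps['nn'].classes_).index(spam_label)

def flag_spam(messages, threshold=0.5):
    """Hypothetical helper: True where P(spam) exceeds the threshold."""
    proba = pipe.predict_proba(messages)
    return proba[:, spam_col] > threshold

msgs = ['you have received your package', 'free prize']
strict = flag_spam(msgs, threshold=0.02)   # low tolerance for spam
default = flag_spam(msgs, threshold=0.5)   # equivalent to a plain argmax
print(strict, default)
```

Lowering the threshold can only add messages to the flagged set, never remove them, so the choice is a trade-off between missed spam and false alarms.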


### Model persistence

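scikit-learn models can be serialized (pickled) to a binary file with `joblib` and loaded later by a separate application. The loading application does not need the training code or data, but it does need scikit-learn installed, ideally the same version used for training.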

In [27]:
import joblib
joblib.dump(pipe, 'pipeline.pkl')

['pipeline.pkl']
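Loading the file back is the mirror operation, `joblib.load`. The sketch below trains a small pipeline on made-up messages, writes it to a temporary directory (rather than `pipeline.pkl` in the working directory), reloads it, and checks that both objects give the same predictions.

```python
import os
import tempfile

import joblib
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.neural_network import MLPClassifier

# Made-up training data: 1 = spam, 0 = ham
texts = [
    'win a free prize now', 'claim your free cash prize', 'free entry win cash',
    'are you coming for lunch', 'see you at home tonight', 'call me when you land',
]
labels = [1, 1, 1, 0, 0, 0]

pipe = Pipeline([
    ('vec', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('nn', MLPClassifier(hidden_layer_sizes=(10,), max_iter=300, random_state=0)),
])
pipe.fit(texts, labels)

# Save and reload the whole fitted pipeline
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'pipeline.pkl')
    joblib.dump(pipe, path)
    restored = joblib.load(path)

# The restored pipeline accepts raw text, just like the original
new_messages = ['free prize waiting', 'see you at lunch']
same = bool((pipe.predict(new_messages) == restored.predict(new_messages)).all())
print(same)
```

Because the vectorizer and the TF-IDF step are saved together with the classifier, the consuming application can pass raw strings straight to `predict`. Pickle files can execute arbitrary code when loaded, so only load files from a trusted source.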

### Weights

The weights learned during training are stored on the fitted classifier in its `coefs_` attribute, a list with one weight matrix per layer.

In [28]:
pipe.named_steps['nn'].coefs_

[array([[ 1.69560378e-01,  2.04753138e-01,  3.03446027e-01, ...,
          1.33966276e-01, -1.76222088e-01, -8.17749561e-06],
        [ 1.75080097e-01,  2.36324052e-01,  3.53938553e-01, ...,
          2.94460783e-01, -8.26072040e-02,  3.64576038e-02],
        [-4.50033537e-02, -2.45727740e-02, -5.93326363e-02, ...,
          7.05517135e-02,  1.02080292e-01,  4.31836019e-02],
        ...,
        [-2.90963774e-02, -3.17074703e-02, -5.94680058e-02, ...,
          6.94561095e-02,  9.86346869e-02,  7.21197345e-02],
        [-3.48694535e-02, -1.88900990e-02, -5.09388836e-02, ...,
          5.29343427e-02,  8.03436145e-02,  3.21004694e-02],
        [ 1.09463907e-01,  1.10879618e-01,  1.66696850e-01, ...,
         -5.28133736e-02, -1.20244084e-01, -7.02956401e-02]]),
 array([[-4.60673417e-01, -9.48707837e-01,  8.20000097e-04,
          7.67099645e-01, -2.44696772e-04],
        [-9.75439006e-01, -8.25821259e-01, -1.27835990e-22,
          6.47584545e-01, -7.24498458e-04],
        [-5.06351746e

In [40]:
pipe.named_steps['nn'].coefs_[0].shape

(7505, 10)
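The first weight matrix has one row per vocabulary term and one column per unit in the first hidden layer; each later matrix connects one layer to the next. The sketch below lists all the shapes for a small network trained on made-up messages. The layer sizes `(4, 3)` are chosen only to keep the output readable and are not the ones used above.

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.neural_network import MLPClassifier

# Made-up training data: 1 = spam, 0 = ham
texts = [
    'win a free prize now', 'claim your free cash prize', 'free entry win cash',
    'are you coming for lunch', 'see you at home tonight', 'call me when you land',
]
labels = [1, 1, 1, 0, 0, 0]

pipe = Pipeline([
    ('vec', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('nn', MLPClassifier(hidden_layer_sizes=(4, 3), max_iter=300, random_state=0)),
])
pipe.fit(texts, labels)

# Number of input features = size of the learned vocabulary
n_terms = len(pipe.named_steps['vec'].vocabulary_)

# One weight matrix per layer: input->hidden1, hidden1->hidden2, hidden2->output
shapes = [w.shape for w in pipe.named_steps['nn'].coefs_]
print(n_terms, shapes)
```

For a two-class problem like this one, the output layer has a single unit, so the last matrix has one column.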