In [1]:
import pandas as pd

In [2]:
df = pd.read_csv('./data/spam.csv')

In [3]:
df.head()

| Unnamed: 0 | input | target |
|---|---|---|
| 0 | Ok lar... Joking wif u oni... | ham |
| 1 | Free entry in 2 a wkly comp to win FA Cup fina... | spam |
| 2 | U dun say so early hor... U c already then say... | ham |
| 3 | Nah I don't think he goes to usf, he lives aro... | ham |
| 4 | FreeMsg Hey there darling it's been 3 week's n... | spam |


## Text classification

**Task:** Create a model that predicts the class (spam/ham) of a given message.

We will do this by creating a Pipeline that will consist of 3 steps:

- Vectorize the data (representing input and target as vectors).
- Transform the data (reweight the raw token counts with TF-IDF).
- Create a classification model on top of the new representation of the data.
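
To make the first two steps concrete, here is a minimal sketch that runs the vectorizer and the TF-IDF transformer by hand on a toy corpus. The three messages are hypothetical and are not taken from `spam.csv`; the variable names are chosen only for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Toy corpus (hypothetical messages, not from spam.csv)
docs = ["free entry win cup", "ok see you later", "win free prize now"]

# Step 1: turn each message into a row of raw token counts
vec = CountVectorizer()
counts = vec.fit_transform(docs)  # sparse matrix, one column per distinct token

# Step 2: reweight the counts with TF-IDF (rows are L2-normalised by default)
tfidf = TfidfTransformer().fit_transform(counts)

print(sorted(vec.vocabulary_))      # the learned vocabulary
print(counts.toarray())             # raw counts
print(tfidf.toarray().round(2))     # TF-IDF weights

row_norms = np.linalg.norm(tfidf.toarray(), axis=1)
```

Inside a `Pipeline`, these two calls happen automatically before the classifier sees the data.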

In [4]:
from sklearn.pipeline import Pipeline



In [5]:
# Vectorizer
from sklearn.feature_extraction.text import CountVectorizer

# Transformer
from sklearn.feature_extraction.text import TfidfTransformer  # term frequency * inverse document frequency

# Classifier
from sklearn.neural_network import MLPClassifier

In [6]:
pipe = Pipeline([
    ('vec', CountVectorizer(stop_words='english')),
    ('tfidf', TfidfTransformer()), 
    ('nn', MLPClassifier(hidden_layer_sizes=(100, 50), activation='tanh'))
])

In [7]:
X = df['input']

In [8]:
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
y = le.fit_transform(df['target'])
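
It helps to know which integer each label was assigned, since the column order of `predict_proba` later follows the same ordering. The sketch below uses a short hypothetical label list in place of `df['target']`; `mapping` is an illustrative name.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical labels with the same values as the `target` column
labels = ["ham", "spam", "ham", "ham", "spam"]

le = LabelEncoder()
y = le.fit_transform(labels)

# classes_ is sorted, and each label's position is its encoded integer
mapping = {label: code for code, label in enumerate(le.classes_)}

print(le.classes_)
print(y)
print(mapping)
```

With labels `ham` and `spam`, `ham` is encoded as 0 and `spam` as 1.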

In [9]:
pipe.fit(X,y)

In [10]:
pipe.score(X,y)

1.0
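
The score of 1.0 is measured on the same messages the model was trained on, so it says little about how the model handles unseen messages. A minimal sketch of a held-out evaluation follows. It assumes synthetic stand-in data (`X_demo`, `y_demo`) rather than `spam.csv`, and the `random_state` and `max_iter` values are arbitrary choices for reproducibility.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline

# Hypothetical stand-in data; in the notebook this would be X and y
spam = [f"win free prize now call {i}" for i in range(20)]
ham = [f"see you at lunch later {i}" for i in range(20)]
X_demo = spam + ham
y_demo = [1] * 20 + [0] * 20

# Hold out 25% of the messages, keeping the class balance in both parts
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=0
)

demo_pipe = Pipeline([
    ("vec", CountVectorizer(stop_words="english")),
    ("tfidf", TfidfTransformer()),
    ("nn", MLPClassifier(hidden_layer_sizes=(100, 50), activation="tanh",
                         max_iter=500, random_state=0)),
])
demo_pipe.fit(X_train, y_train)

train_acc = demo_pipe.score(X_train, y_train)  # accuracy on seen messages
test_acc = demo_pipe.score(X_test, y_test)     # accuracy on held-out messages
print(train_acc, test_acc)
```

The held-out accuracy is the figure to report when judging how well the classifier generalizes.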

In [11]:
out = pipe.predict(['Die... I accidentally deleted e msg i suppose 2 put in e sim archive. Haiz... I so sad...'])

In [12]:
le.inverse_transform(out)

array(['ham'], dtype=object)

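We can use `predict_proba` to estimate class probabilities and adjust the decision threshold to the context of the problem.

For example, to lower the "risk tolerance" of a prediction, we can label a message as spam whenever its estimated spam probability exceeds 2%, instead of simply picking the most probable class.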

In [13]:
probas = pipe.predict_proba(["you have received your package"])
ham_proba, spam_proba = probas[0]

if spam_proba > 0.02:
    out = 'spam'
else:
    out = 'ham'

In [14]:
probas

array([[0.98634358, 0.01365642]])

In [15]:
out

'ham'
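
The thresholding logic can be wrapped in a small helper that looks up the spam column by name instead of relying on its position. This is only a sketch: `classify_with_threshold` is a hypothetical function, not part of scikit-learn, and it assumes a two-class problem. In the notebook, `classes` would be `le.classes_`, whose order matches the columns of `predict_proba`.

```python
import numpy as np

def classify_with_threshold(probas, classes, positive="spam", threshold=0.02):
    """Label each row `positive` when its probability exceeds `threshold`."""
    classes = list(classes)
    pos_idx = classes.index(positive)                    # column of the positive class
    negative = [c for c in classes if c != positive][0]  # the other class (binary case)
    probas = np.asarray(probas)
    return np.where(probas[:, pos_idx] > threshold, positive, negative)

# Probabilities copied from the output shown above
probas = np.array([[0.98634358, 0.01365642]])

labels = classify_with_threshold(probas, ["ham", "spam"])
stricter = classify_with_threshold(probas, ["ham", "spam"], threshold=0.01)
print(labels, stricter)
```

Lowering the threshold catches more spam at the cost of flagging more legitimate messages, so the right value depends on which mistake is more expensive.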

## Model persistence

scikit-learn lets you save a trained model object as a binary (pickled) file that a separate application can load later. That application still needs scikit-learn installed in order to unpickle the object.

In [16]:
import joblib
joblib.dump(pipe, 'pipeline.pkl')

['pipeline.pkl']
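
Loading the file back is the mirror operation, `joblib.load`. The sketch below does a full round trip with a small stand-in pipeline. As labeled assumptions, it swaps in `LogisticRegression` to keep the example fast, uses hypothetical training messages, and writes to a temporary directory instead of the working directory.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Small stand-in pipeline (LogisticRegression replaces the MLP for speed)
demo = Pipeline([("vec", CountVectorizer()), ("clf", LogisticRegression())])
demo.fit(
    ["win free prize", "see you at lunch", "free prize call now", "lunch later today"],
    [1, 0, 1, 0],
)

# Save to a temporary location, then load it back as a new object
path = os.path.join(tempfile.mkdtemp(), "pipeline.pkl")
joblib.dump(demo, path)
restored = joblib.load(path)

# The restored pipeline should predict exactly like the original
msgs = ["free prize inside", "lunch today"]
same = bool((restored.predict(msgs) == demo.predict(msgs)).all())
print(same)
```

Pickle files can execute arbitrary code when loaded, so only load files from a trusted source, ideally with the same scikit-learn version that created them.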

## Weights

The weights learned during training are stored in the fitted classifier's `coefs_` attribute, a list with one weight matrix for each pair of consecutive layers.

In [17]:
pipe.named_steps['nn'].coefs_

[array([[ 0.05350618, -0.03040398, -0.04394492, ..., -0.02129842,
          0.02469654,  0.07387793],
        [ 0.06529762, -0.0414888 , -0.05769114, ..., -0.04278381,
          0.04022994,  0.03124056],
        [ 0.00989029, -0.0040283 ,  0.00796734, ...,  0.02597673,
         -0.00776022, -0.01977846],
        ...,
        [-0.00391779,  0.0302778 ,  0.01106432, ...,  0.00484698,
          0.01020368, -0.03297961],
        [-0.00164602,  0.00280126,  0.01385728, ...,  0.03091237,
         -0.00292893,  0.00396123],
        [ 0.04538837, -0.04525029,  0.00228118, ..., -0.00958166,
          0.01245021,  0.02773307]]),
 array([[ 0.03499236,  0.22202564, -0.0718144 , ..., -0.08923861,
         -0.18124863,  0.26047121],
        [-0.09028339, -0.12862633,  0.26702516, ...,  0.25358383,
          0.15024501, -0.16811977],
        [ 0.26465371,  0.07904608,  0.09876234, ...,  0.04896578,
          0.13148824, -0.1442469 ],
        ...,
        [ 0.25330312, -0.06370079,  0.17792987, ...,  

In [18]:
pipe.named_steps['nn'].coefs_[0].shape

(7505, 100)
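
The first dimension of this matrix is the number of input features (here, the vocabulary size) and the second is the size of the first hidden layer. A minimal sketch of how the shapes follow from `hidden_layer_sizes`, assuming random stand-in data with 8 hypothetical features instead of the text features:

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

# Random stand-in data: 30 samples, 8 hypothetical features, 2 classes
rng = np.random.RandomState(0)
X_demo = rng.rand(30, 8)
y_demo = np.array([0, 1] * 15)

# Same architecture as the notebook; few iterations, since only shapes matter here
# (scikit-learn may emit a ConvergenceWarning, which is harmless for this purpose)
mlp = MLPClassifier(hidden_layer_sizes=(100, 50), activation="tanh",
                    max_iter=50, random_state=0)
mlp.fit(X_demo, y_demo)

shapes = [w.shape for w in mlp.coefs_]            # one weight matrix per layer transition
bias_shapes = [b.shape for b in mlp.intercepts_]  # one bias vector per non-input layer
print(shapes)
print(bias_shapes)
```

For a binary problem the output layer has a single unit, so the last weight matrix has one column.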

In [19]:
pipe.named_steps['nn'].coefs_[0][0]

array([ 0.05350618, -0.03040398, -0.04394492, -0.07246585,  0.05877407,
        0.06283825, -0.02474915, -0.03631948, -0.07692276, -0.06138606,
        0.04581669, -0.03009157,  0.05827165, -0.08901121,  0.035561  ,
       -0.06073579,  0.06348444, -0.06565675,  0.0486393 ,  0.06387082,
        0.04514307,  0.06663691, -0.06452388, -0.02535899,  0.04882097,
       -0.05427195,  0.03529098, -0.03593724, -0.03134151,  0.03807798,
       -0.0862536 ,  0.0553752 , -0.03048642, -0.04405498,  0.07199328,
        0.07061589,  0.03110469, -0.02836107,  0.02774807, -0.04109428,
        0.05765969, -0.02373062,  0.06921763, -0.07741337,  0.0416306 ,
       -0.02584814,  0.06064964, -0.03679758, -0.02177264,  0.05453975,
        0.05445935, -0.0624127 , -0.04123801, -0.0429893 ,  0.06570579,
       -0.07130807,  0.02527219, -0.03043807, -0.03741513,  0.03626732,
        0.05267795, -0.0382423 , -0.05657447,  0.05475551,  0.07054788,
       -0.07283638, -0.06590839, -0.05957507,  0.03502571,  0.03