In [None]:
import pandas as pd

In [None]:
df = pd.read_csv('./data/spam.csv')

In [None]:
df.head()

## Text classification

**Task:** Build a model that predicts the class (spam/ham) of a given message.

We will do this by creating a `Pipeline` with three steps:

- Vectorize the data (represent each input message as a vector of token counts).
- Transform the data (reweight the counts with TF-IDF).
- Train a classification model on top of the new representation of the data.
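To make the first two steps concrete, here is a minimal pure-Python sketch of what they compute. It assumes the default settings of `TfidfTransformer` (smoothed IDF and L2 normalization). The toy corpus and the whitespace tokenizer are illustrative assumptions; `CountVectorizer` uses its own tokenization rules and, in this notebook, also removes English stop words.

```python
import math
from collections import Counter

# Hypothetical toy corpus, not taken from the spam dataset
docs = ["free prize now", "see you at lunch", "free free lunch"]

# Step 1: vectorize. Build a sorted vocabulary and count tokens per document.
tokenized = [doc.lower().split() for doc in docs]
vocab = sorted({tok for toks in tokenized for tok in toks})
counts = [[Counter(toks)[term] for term in vocab] for toks in tokenized]

# Step 2: transform. Reweight counts with a smoothed inverse document frequency,
# idf(t) = ln((1 + n) / (1 + df(t))) + 1, then scale each row to unit L2 norm.
n_docs = len(docs)
doc_freq = [sum(1 for row in counts if row[j] > 0) for j in range(len(vocab))]
idf = [math.log((1 + n_docs) / (1 + df)) + 1 for df in doc_freq]


def l2_normalize(row):
    """Scale a row to unit Euclidean length (all-zero rows are left unchanged)."""
    norm = math.sqrt(sum(v * v for v in row))
    return [v / norm for v in row] if norm else row


tfidf = [l2_normalize([c * w for c, w in zip(row, idf)]) for row in counts]

for doc, row in zip(docs, tfidf):
    print(doc, [round(v, 2) for v in row])
```

Terms that appear in fewer documents get a larger IDF weight, so within the first message "prize" ends up weighted higher than "free".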

In [None]:
from sklearn.pipeline import Pipeline

In [None]:
# Vectorizer
from sklearn.feature_extraction.text import CountVectorizer

# Transformer
from sklearn.feature_extraction.text import TfidfTransformer  # term frequency * inverse document frequency

# Classifier
from sklearn.neural_network import MLPClassifier

In [None]:
pipe = Pipeline([
    ('vec', CountVectorizer(stop_words='english')),
    ('tfidf', TfidfTransformer()), 
    ('nn', MLPClassifier(hidden_layer_sizes=(100,50,), activation='tanh'))
])

In [None]:
X = df['input']

In [None]:
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
y = le.fit_transform(df['target'])

In [None]:
pipe.fit(X,y)

In [None]:
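# Mean accuracy on the training data itself; no held-out set is used here,
# so this is an optimistic estimate of performance on new messages
pipe.score(X, y)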

In [None]:
out = pipe.predict(['Die... I accidentally deleted e msg i suppose 2 put in e sim archive. Haiz... I so sad...'])

In [None]:
le.inverse_transform(out)

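We can use `predict_proba` to estimate class probabilities and adjust the decision threshold to the context of the problem.

For example, to lower the tolerance for spam, we can label a message as spam as soon as its estimated spam probability exceeds a small threshold, instead of simply picking the most probable class.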

In [None]:
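probas = pipe.predict_proba(["you have received your package"])
# Columns follow the encoded labels 0 and 1. LabelEncoder sorts the labels,
# so (assuming they are 'ham' and 'spam') column 0 is ham and column 1 is spam.
ham_proba, spam_proba = probas[0]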

if spam_proba > 0.02:
    out = 'spam'
else:
    out = 'ham'

In [None]:
probas

In [None]:
out
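The thresholding idea above can be wrapped in a small helper that looks up the spam column by name instead of relying on column order. This is only a sketch: `label_with_threshold` and the example probabilities are hypothetical and not part of scikit-learn or the dataset.

```python
def label_with_threshold(probas, classes, positive="spam", threshold=0.5):
    """Label each row `positive` when its probability exceeds `threshold`.

    `probas` has one row per message and one column per class, in the same
    order as `classes`. Only the two-class case is handled.
    """
    classes = list(classes)
    if len(classes) != 2:
        raise ValueError("this sketch only supports two classes")
    pos_idx = classes.index(positive)
    negative = classes[1 - pos_idx]
    return [positive if row[pos_idx] > threshold else negative for row in probas]


# Made-up probabilities in the layout returned by predict_proba
example_probas = [[0.99, 0.01], [0.95, 0.05], [0.30, 0.70]]
classes = ["ham", "spam"]

default_labels = label_with_threshold(example_probas, classes, threshold=0.5)
strict_labels = label_with_threshold(example_probas, classes, threshold=0.02)
print(default_labels)  # ['ham', 'ham', 'spam']
print(strict_labels)   # ['ham', 'spam', 'spam']
```

With the objects from this notebook, the helper could be called as `label_with_threshold(pipe.predict_proba(messages), le.classes_, threshold=0.02)`, where `messages` is a list of strings. A lower threshold flags more messages as spam, at the cost of more ham being misclassified.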

## Model persistence

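scikit-learn lets us save trained model objects as binary (pickled) files that can be loaded later from another application. The loading application still needs scikit-learn installed, ideally the same version that was used for training.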

In [None]:
import joblib
joblib.dump(pipe, 'pipeline.pkl')
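A saved pipeline can be read back with `joblib.load`. The sketch below shows a full save/load round trip on a hypothetical toy dataset. `MultinomialNB` stands in for the neural network only to keep the example fast, and the file is written to a temporary directory rather than to `pipeline.pkl`.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy data standing in for the spam dataset
messages = ["win a free prize now", "free cash offer", "lunch at noon?", "see you tomorrow"]
labels = ["spam", "spam", "ham", "ham"]

# Small stand-in pipeline (assumption: any fitted pipeline persists the same way)
toy_pipe = Pipeline([
    ("vec", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("clf", MultinomialNB()),
])
toy_pipe.fit(messages, labels)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_pipeline.pkl")
    joblib.dump(toy_pipe, path)   # write the fitted pipeline to disk
    restored = joblib.load(path)  # read it back, e.g. in another process

# The restored pipeline should behave like the original one
new_messages = ["free prize", "lunch tomorrow"]
original_preds = list(toy_pipe.predict(new_messages))
restored_preds = list(restored.predict(new_messages))
print(restored_preds)
```

Pickled files should only be loaded from trusted sources, since unpickling can execute arbitrary code.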

## Weights

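The weights learned during training are stored in the fitted model. For `MLPClassifier`, the `coefs_` attribute holds one weight matrix per layer, and `intercepts_` holds the matching bias vectors.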

In [None]:
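nn = pipe.named_steps['nn']  # same object as pipe.steps[2][1]
nn.coefs_

In [None]:
# (number of input features, 100): weights from the inputs to the first hidden layer
nn.coefs_[0].shape

In [None]:
# Weights from the first input feature to each unit of the first hidden layer
nn.coefs_[0][0]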
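The layout of `coefs_` is easier to see on a small network. The following sketch trains a stand-in `MLPClassifier` on random data (the data, layer sizes, and iteration count are assumptions chosen for speed) and prints the shape of each weight matrix and bias vector.

```python
import warnings

import numpy as np
from sklearn.neural_network import MLPClassifier

# Hypothetical toy data: 40 samples, 6 features, binary target
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(40, 6))
y_toy = (X_toy[:, 0] > 0).astype(int)

net = MLPClassifier(hidden_layer_sizes=(4, 3), activation="tanh",
                    max_iter=50, random_state=0)
with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # the short training run may not converge
    net.fit(X_toy, y_toy)

# One weight matrix per layer transition: input->hidden1, hidden1->hidden2, hidden2->output
weight_shapes = [w.shape for w in net.coefs_]
# One bias vector per non-input layer
bias_shapes = [b.shape for b in net.intercepts_]
print(weight_shapes)  # [(6, 4), (4, 3), (3, 1)]
print(bias_shapes)    # [(4,), (3,), (1,)]
```

By the same logic, the first matrix in this notebook's network has one row per vocabulary term and 100 columns, one for each unit of the first hidden layer.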