In [1]:
import pandas as pd

In [2]:
df = pd.read_csv('./data/spam.csv')

In [3]:
df.head()

Unnamed: 0,input,target
0,Ok lar... Joking wif u oni...,ham
1,Free entry in 2 a wkly comp to win FA Cup fina...,spam
2,U dun say so early hor... U c already then say...,ham
3,"Nah I don't think he goes to usf, he lives aro...",ham
4,FreeMsg Hey there darling it's been 3 week's n...,spam


## Text classification

**Task:** Create a model that predicts the class (spam/ham) of a given message.

We will do this with a `Pipeline` made up of 3 steps:

- Vectorize the data (represent each input message as a vector of token counts).
- Transform the data (re-weight the counts with TF-IDF).
- Create a classification model on top of the new representation of the data.
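
To make the three steps concrete, here is a minimal sketch that runs each one by hand on a tiny corpus. The corpus and labels are hypothetical (not taken from `spam.csv`), and `LogisticRegression` is a lightweight stand-in classifier chosen only to keep the example fast.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression

# Hypothetical toy corpus, not taken from spam.csv.
texts = [
    "win a free prize now",
    "free entry to win cash",
    "are we meeting for lunch",
    "see you at lunch tomorrow",
]
labels = ["spam", "spam", "ham", "ham"]

# Step 1: vectorize. Each message becomes a row of token counts.
vec = CountVectorizer()
counts = vec.fit_transform(texts)

# Step 2: transform. Re-weight the raw counts with TF-IDF.
tfidf = TfidfTransformer()
weighted = tfidf.fit_transform(counts)

# Step 3: classify on top of the new representation.
clf = LogisticRegression()
clf.fit(weighted, labels)

# A new message must pass through the same fitted steps, in the same order.
new_message = ["free cash prize"]
prediction = clf.predict(tfidf.transform(vec.transform(new_message)))
print(counts.shape, weighted.shape, prediction)
```

A `Pipeline` chains these calls so that `fit` and `predict` apply every step in order, which avoids forgetting a transformation at prediction time.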

In [4]:
from sklearn.pipeline import Pipeline

In [5]:
# Vectorizer
from sklearn.feature_extraction.text import CountVectorizer

# Transformer
from sklearn.feature_extraction.text import TfidfTransformer  # TF-IDF: term frequency * inverse document frequency

# Classifier
from sklearn.neural_network import MLPClassifier

In [6]:
pipe = Pipeline([
    ('vec', CountVectorizer(stop_words='english')),
    ('tfidf', TfidfTransformer()), 
    ('nn', MLPClassifier(hidden_layer_sizes=(100,50,), activation='tanh'))
])

In [7]:
X = df['input']

In [8]:
y = df['target']

In [10]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8, random_state=42)

In [11]:
pipe.fit(X_train,y_train)

In [12]:
pipe.score(X_test,y_test)

0.9899103139013453

In [13]:
pipe.steps[0][1].transform(['Die... I accidentally deleted e msg i suppose 2 put in e sim archive. Haiz... I so sad...'])

<1x6675 sparse matrix of type '<class 'numpy.int64'>'
	with 7 stored elements in Compressed Sparse Row format>

In [14]:
#pipe.steps[0][1].vocabulary_
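
The sparse row returned by the vectorizer can be read back into tokens by inverting `vocabulary_`. The sketch below assumes a hypothetical three-message corpus in place of the training data. Words that are stop words or that were never seen during fitting contribute nothing, which is why a long message can end up with only a few stored elements.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus standing in for the training messages.
corpus = [
    "free entry to win a prize",
    "i deleted the message by accident",
    "so sad about the sim archive",
]
vec = CountVectorizer(stop_words="english")
vec.fit(corpus)

row = vec.transform(["I accidentally deleted the msg in the sim archive"])

# vocabulary_ maps token -> column index; invert it to read columns back.
index_to_token = {idx: tok for tok, idx in vec.vocabulary_.items()}

# row.indices holds the non-zero columns, row.data the matching counts.
stored = {index_to_token[j]: int(c) for j, c in zip(row.indices, row.data)}
print(stored)
```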

In [15]:
out = pipe.predict(['Hello friend I have a business opportunity for you'])

In [16]:
out

array(['ham'], dtype='<U4')

We can use `predict_proba` to estimate class probabilities and adjust the decision threshold to the context of the problem.

For example, we can decrease our "risk tolerance" by labelling a message as spam even when its estimated spam probability is small.

In [17]:
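probas = pipe.predict_proba(["you have received your package"])
# Columns follow the order of pipe.classes_, here ['ham', 'spam'].
ham_proba, spam_proba = probas[0]

# Label as spam as soon as the spam probability exceeds 2%.
if spam_proba > 0.02:
    out = 'spam'
else:
    out = 'ham'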

In [18]:
probas

array([[0.97047268, 0.02952732]])

In [19]:
out

'spam'
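
The same thresholding idea can be wrapped in a small helper that looks up the spam column through `classes_` instead of relying on column order. This is only a sketch: `predict_with_threshold` is a hypothetical helper, and the toy data and the `MultinomialNB` classifier are stand-ins so the example runs quickly.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline


def predict_with_threshold(model, messages, threshold, positive="spam", negative="ham"):
    """Label a message `positive` when its probability exceeds `threshold`."""
    # Find the probability column by label instead of assuming an order.
    col = list(model.classes_).index(positive)
    probas = model.predict_proba(messages)[:, col]
    return [positive if p > threshold else negative for p in probas]


# Hypothetical toy data and a lightweight stand-in classifier.
toy = Pipeline([("tfidf", TfidfVectorizer()), ("nb", MultinomialNB())])
toy.fit(
    ["win free cash now", "free prize entry", "lunch at noon", "see you tomorrow"],
    ["spam", "spam", "ham", "ham"],
)

msgs = ["you have received your package", "free cash prize"]
strict = predict_with_threshold(toy, msgs, threshold=0.02)   # low risk tolerance
default = predict_with_threshold(toy, msgs, threshold=0.5)   # usual cut-off
print(strict, default)
```

Lowering the threshold catches more spam at the cost of more legitimate messages being flagged, so the value should be chosen from the costs of each kind of error.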

## Model persistence

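scikit-learn lets us save trained model objects as binary (pickled) files, which a separate application can load later. The loading environment still needs scikit-learn installed, because unpickling recreates the original estimator objects.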

In [20]:
import joblib
joblib.dump(pipe, 'pipeline.joblib')

['pipeline.joblib']
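
Loading the file back is the mirror operation, `joblib.load`. The sketch below does a full round trip on a hypothetical toy pipeline (with `MultinomialNB` as a stand-in classifier) and a temporary file, then checks that the restored object predicts the same labels.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy pipeline standing in for the trained `pipe`.
toy = Pipeline([("tfidf", TfidfVectorizer()), ("nb", MultinomialNB())])
toy.fit(
    ["win free cash now", "free prize entry", "lunch at noon", "see you tomorrow"],
    ["spam", "spam", "ham", "ham"],
)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_pipeline.joblib")
    joblib.dump(toy, path)        # save the fitted pipeline
    restored = joblib.load(path)  # load it back as a new object

sample = ["free cash prize", "lunch tomorrow"]
same = list(restored.predict(sample)) == list(toy.predict(sample))
print(same)
```

Pickled files should only be loaded from trusted sources, and preferably with the same scikit-learn version that produced them.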

## Weights

The weights learned during training are stored in the trained classifier, in its `coefs_` attribute (one weight matrix per layer).

In [21]:
pipe.steps[2][1].coefs_

[array([[-0.0884637 , -0.07070647, -0.06250891, ...,  0.02013007,
          0.03045472, -0.03072784],
        [-0.0448542 , -0.05897063, -0.01947329, ...,  0.07140766,
          0.07477373, -0.07887229],
        [-0.01773429,  0.00717264, -0.03101202, ...,  0.031767  ,
          0.03357763, -0.01458764],
        ...,
        [ 0.03076577,  0.01197231, -0.01569317, ...,  0.0110255 ,
         -0.01095331,  0.01848178],
        [ 0.03328595,  0.0279206 ,  0.0389511 , ..., -0.0279006 ,
         -0.00358692,  0.00501672],
        [-0.0394933 , -0.02360851,  0.00284352, ...,  0.03985124,
          0.02924802, -0.02196685]]),
 array([[ 0.17830968,  0.0270272 ,  0.04922544, ..., -0.06990876,
          0.16411163, -0.22742161],
        [ 0.00432477, -0.05728062,  0.2341059 , ...,  0.16238924,
          0.19677373,  0.1296201 ],
        [ 0.21754405,  0.17922119,  0.09621939, ..., -0.11161983,
          0.25206542, -0.16606721],
        ...,
        [-0.11247266, -0.06645987, -0.18661159, ...,  

In [22]:
pipe.steps[2][1].coefs_[0].shape

(6675, 100)
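
The shape `(6675, 100)` connects the 6675 vocabulary columns produced by the vectorizer to the 100 units of the first hidden layer. As a rough sketch of how the remaining layers line up, the example below trains the same architecture on hypothetical random data with 20 input features and lists every weight and bias shape. Training is cut short with `max_iter=20`, so the convergence warning is silenced on purpose.

```python
import warnings

import numpy as np
from sklearn.exceptions import ConvergenceWarning
from sklearn.neural_network import MLPClassifier

# Hypothetical random data: 40 samples, 20 features, two classes.
rng = np.random.default_rng(0)
n_features = 20
X_toy = rng.normal(size=(40, n_features))
y_toy = np.array(["ham", "spam"] * 20)

mlp = MLPClassifier(
    hidden_layer_sizes=(100, 50), activation="tanh", max_iter=20, random_state=0
)
with warnings.catch_warnings():
    # Only the shapes matter here, not the quality of the fit.
    warnings.simplefilter("ignore", ConvergenceWarning)
    mlp.fit(X_toy, y_toy)

# coefs_[i] has shape (units in layer i, units in layer i + 1).
weight_shapes = [w.shape for w in mlp.coefs_]
# intercepts_[i] holds one bias per unit in layer i + 1.
bias_shapes = [b.shape for b in mlp.intercepts_]
print(weight_shapes)
print(bias_shapes)
```

For a two-class problem the output layer has a single unit, so the last weight matrix has one column.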

In [23]:
pipe.steps[2][1].coefs_[0][0]

array([-0.0884637 , -0.07070647, -0.06250891,  0.02966097,  0.05142674,
       -0.04800198,  0.051602  , -0.01568438, -0.03685121, -0.02354842,
        0.0457416 ,  0.04662104,  0.04337081,  0.05345496,  0.03690122,
       -0.03225592, -0.08347124,  0.06539286,  0.05711338,  0.07705857,
       -0.06295594,  0.03799249, -0.0408633 , -0.06774585, -0.07409065,
       -0.07244382,  0.04403613,  0.07214189,  0.05514062,  0.07345335,
       -0.02409938,  0.04364261, -0.04050851, -0.06017675,  0.08194027,
        0.04623038,  0.07867552,  0.07345716, -0.04072053,  0.06031723,
       -0.03687323,  0.05147947, -0.02088904, -0.03409547, -0.05529798,
        0.04403192,  0.04813701,  0.02424484,  0.02536899,  0.05655566,
       -0.06273225,  0.07106833,  0.05083229, -0.0634876 , -0.05235338,
       -0.03550785, -0.04875777, -0.06613855,  0.06371321,  0.03983269,
        0.02310205, -0.02175337,  0.05924569,  0.06322394, -0.05418533,
        0.03302509, -0.04918743, -0.04844036, -0.06466188, -0.06