# A simple textual classifier for Dutch citizen reports
In many cities around the world, local governments offer a service through which citizens can submit requests, for example to complain about garbage on the street or to report a nuisance. These reports are made by phone or through a web form, by writing a text and selecting a category. The category can also be selected automatically by applying supervised machine learning to historical service requests. The city of Amsterdam uses this method to detect the class of a report and route it to the correct department. This repository contains the Python code that can be used to create such a classifier.

The classification uses TF-IDF (term frequency-inverse document frequency) as the representation of the text and logistic regression as the classifier. Optimal hyperparameters for the dataset are found using a grid search.
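To make the TF-IDF representation concrete, here is a minimal sketch in plain Python using the textbook definition (term frequency times the logarithm of the inverse document frequency). The three toy reports are hypothetical and not taken from the example dataset. scikit-learn's `TfidfVectorizer` applies a smoothed idf and normalizes each document vector by default, so its numbers will differ from the ones printed here.

```python
import math
from collections import Counter

# Hypothetical toy reports, not taken from the example dataset.
docs = [
    "afval op straat",
    "afval naast container",
    "lawaai op straat",
]

tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents does each term occur?
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))


def tf_idf(tokens):
    """Return {term: tf * idf} for one tokenized document."""
    counts = Counter(tokens)
    return {
        term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
        for term, count in counts.items()
    }


weights = [tf_idf(tokens) for tokens in tokenized]
for doc, doc_weights in zip(docs, weights):
    print(doc, {term: round(value, 3) for term, value in doc_weights.items()})
```

Terms that occur in only one report ("naast", "container", "lawaai") receive a higher weight than terms shared by several reports, which is what makes them useful for telling categories apart.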

An example subset of Dutch citizen reports is included for demonstration purposes. The original data is not publicly available due to privacy concerns.

A live demo of textual classification of Dutch service requests can be seen at http://ec2-54-171-141-211.eu-west-1.compute.amazonaws.com/

In [1]:
import pandas as pd

df = pd.read_csv('voorbeeld_meldingen.csv')
# The example dataset is not large enough to train a good classification model
print(len(df), 'rows loaded')

texts = df['Tekst']
labels = df['Label']

# Split the data in file order (no shuffling): first half for training, second half for testing
split = 0.5
splitpoint = int(split * len(texts))


# train data
train_texts = texts[:splitpoint]
train_labels = labels[:splitpoint]

# test data
test_texts = texts[splitpoint:]
test_labels = labels[splitpoint:]

97 rows loaded
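The cell above splits the data in file order, so a label that happens to appear only in the second half of the file is never seen during training. As a possible alternative, the sketch below shuffles the rows within each label before splitting. The helper `stratified_split` and the `demo_` variables are hypothetical names introduced for illustration and are not part of this repository. scikit-learn's `train_test_split` with its `stratify` argument offers similar behaviour.

```python
import random
from collections import Counter, defaultdict


def stratified_split(texts, labels, test_fraction=0.5, seed=42):
    """Shuffle within each label, then split each label's rows by test_fraction."""
    rng = random.Random(seed)

    # Group row positions by label.
    by_label = defaultdict(list)
    for index, label in enumerate(labels):
        by_label[label].append(index)

    train_idx, test_idx = [], []
    for indices in by_label.values():
        rng.shuffle(indices)
        cut = int(round(len(indices) * (1 - test_fraction)))
        cut = max(cut, 1)  # keep at least one row of each label for training
        train_idx.extend(indices[:cut])
        test_idx.extend(indices[cut:])

    return (
        [texts[i] for i in train_idx],
        [labels[i] for i in train_idx],
        [texts[i] for i in test_idx],
        [labels[i] for i in test_idx],
    )


# Hypothetical toy data: four reports of one label, two of another.
demo_texts = ['a', 'b', 'c', 'd', 'e', 'f']
demo_labels = ['afval'] * 4 + ['overlast'] * 2

demo_train_texts, demo_train_labels, demo_test_texts, demo_test_labels = (
    stratified_split(demo_texts, demo_labels)
)
print(Counter(demo_train_labels), Counter(demo_test_labels))
```

To try this on the example data, the columns could be passed as plain lists, for instance `list(df['Tekst'])` and `list(df['Label'])`.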


In [2]:
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

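# Pipeline: TF-IDF features followed by a logistic regression classifier
pipeline = Pipeline([
    ('tfidf', TfidfVectorizer()),
    # 'saga' supports both the 'l1' and 'l2' penalties searched below;
    # the default 'lbfgs' solver in recent scikit-learn releases only supports 'l2'
    ('clf', LogisticRegression(solver='saga')),
])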

# Parameter values to try in the grid search.
# More parameters are documented at:
# TF-IDF vectorizer: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html
# Logistic regression: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
parameters = {
    'tfidf__max_df': (0.5, 0.75, 1.0),
    'tfidf__max_features': (None, 5000),
    'tfidf__ngram_range': ((1, 1), (1, 2)),  # unigrams only, or unigrams and bigrams
    'tfidf__use_idf': (True, False),
    'tfidf__norm': ('l1', 'l2'),
    'clf__penalty': ('l2', 'l1'),
    'clf__max_iter': (10, 50, 80, 100, 150),
}

grid_search = GridSearchCV(pipeline, parameters)
grid_search.fit(train_texts, train_labels)

print('Best parameters: ')
print(grid_search.best_params_)
print('')

print('Best score: ')
print(grid_search.best_score_)
print('')



Best parameters: 
{'clf__max_iter': 10, 'clf__penalty': 'l2', 'tfidf__max_df': 0.5, 'tfidf__max_features': None, 'tfidf__ngram_range': (1, 1), 'tfidf__norm': 'l1', 'tfidf__use_idf': True}

Best score: 
0.6041666666666666
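A grid search fits one pipeline for every combination of parameter values, so its cost grows quickly with the size of the grid. The sketch below counts the combinations for the grid used above, using only the standard library. The number of cross-validation folds is an assumption: 5 is the default in recent scikit-learn releases, while older releases used 3.

```python
from itertools import product

# Same grid as in the cell above, copied here so the snippet runs on its own.
parameters = {
    'tfidf__max_df': (0.5, 0.75, 1.0),
    'tfidf__max_features': (None, 5000),
    'tfidf__ngram_range': ((1, 1), (1, 2)),
    'tfidf__use_idf': (True, False),
    'tfidf__norm': ('l1', 'l2'),
    'clf__penalty': ('l2', 'l1'),
    'clf__max_iter': (10, 50, 80, 100, 150),
}

# Every combination of one value per parameter is one candidate pipeline.
candidates = [
    dict(zip(parameters, values)) for values in product(*parameters.values())
]
n_candidates = len(candidates)

# Assumption: 5 cross-validation folds (the default in recent scikit-learn releases).
n_folds = 5
print(n_candidates, 'candidates,', n_candidates * n_folds, 'fits')
```

This prints `480 candidates, 2400 fits`. On the small example dataset that is still quick, but on a full dataset it may be worth trimming the grid or passing `n_jobs=-1` to `GridSearchCV` to run the fits in parallel.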





# Model persistence
http://scikit-learn.org/stable/modules/model_persistence.html

The fitted model is saved to disk so that it can be loaded later to make predictions.

In [3]:
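# sklearn.externals.joblib was removed from scikit-learn; joblib is a separate package
import joblib

# Save the fitted grid search (which wraps the best pipeline) to disk
joblib.dump(grid_search, 'model.pkl')

# Load it again, as a later prediction script would
model = joblib.load('model.pkl')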

# Evaluation

The model is evaluated on the test data using macro-averaged precision, macro-averaged recall, and accuracy.

In [4]:
from sklearn.metrics import accuracy_score, precision_score, recall_score

test_predict = model.predict(test_texts)

# Macro averaging gives every class equal weight, regardless of how often it occurs
precision = round(precision_score(test_labels, test_predict, average='macro'), 2)
recall = round(recall_score(test_labels, test_predict, average='macro'), 2)
accuracy = round(accuracy_score(test_labels, test_predict), 2)

print('Precision', precision)
print('Recall', recall)
print('Accuracy', accuracy)

Precision 0.12
Recall 0.17
Accuracy 0.73
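The gap between the accuracy and the macro-averaged precision and recall above may look surprising. One plausible explanation, not verified against the data here, is class imbalance: a model that mostly predicts the most frequent label can be right often while scoring zero on every rare label. The sketch below reproduces that effect in plain Python on hypothetical labels, treating precision for a never-predicted class as 0.

```python
from collections import Counter

# Hypothetical labels: one dominant class and two rare ones.
y_true = ['afval'] * 8 + ['overlast', 'verkeer']
# A model that always predicts the dominant class.
y_pred = ['afval'] * 10

classes = sorted(set(y_true))
true_pos = Counter(t for t, p in zip(y_true, y_pred) if t == p)
predicted = Counter(y_pred)
actual = Counter(y_true)


def safe_div(numerator, denominator):
    """Return 0.0 when the denominator is zero (class never predicted or absent)."""
    return numerator / denominator if denominator else 0.0


precision_per_class = [safe_div(true_pos[c], predicted[c]) for c in classes]
recall_per_class = [safe_div(true_pos[c], actual[c]) for c in classes]

accuracy = sum(true_pos.values()) / len(y_true)
macro_precision = sum(precision_per_class) / len(classes)
macro_recall = sum(recall_per_class) / len(classes)

print('Accuracy', round(accuracy, 2))
print('Macro precision', round(macro_precision, 2))
print('Macro recall', round(macro_recall, 2))
```

Here the accuracy is 0.8 while macro precision is 0.27 and macro recall is 0.33, because two of the three classes contribute a score of zero to the average.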


