### Preprocessing step
We are going to build a binary classifier checking user's input whether it's profanity or not. 

In [1]:
#Preprocessing step
#Let's make a corpus out of (roughly) 5000 profane words
#and equal amount the most frequent russian words

import pandas as pd
import numpy as np

In [2]:
#loading and (preliminary) cleaning set of profane words

outfile = []
alphabet = 'йцукенгшщзхъфывапролджэячсмить.*_@" '
with open('bad_words_corpus.txt', 'r', encoding='utf-8') as infile:
    for line in infile:
        if len(line) > 30:
            for w in line.split():
                news = ''
                for l in range(len(w)):    #normalize each word
                    if w[l] in alphabet:
                        news += w[l]
                if len(news) == 1:
                    continue
                outfile.append(news)
        news = ''
        for l in range(len(w)):    #normalize each word
            if w[l] in alphabet:
                news += w[l]
        if len(news) == 1:
            continue
        outfile.append(line.strip())

out = pd.Series(outfile)

In [3]:
# loading frequent words dict

freq_words = pd.read_excel('freq_words_rus.xlsx', squeeze=True)

#let's make a corpus

cor = pd.concat((out,freq_words), ignore_index=True).values

In [4]:
def get_bootstrap_samples(data, n_samples, l_sample):
    '''
    Function returns bootstrapped samples.
    data - must be a numpy array
    n_samples - number of samples, an integer
    l_sample - max length of a sample
    
    '''
    indices = np.random.randint(0, len(data), (n_samples, l_sample))
    samples = data[indices]
    return samples

Building a corpus out of a combined array consisting of 5000 most frequent russian words and roughly the same amount adult language. Then taking an array of random samples lasting from 1 word till 19 inclusively.
Key problem is there's not a single open labeled dataset consisting out of adult and common language in russian. To circumvent it I've decided to make a synthetic one taking first 50 000 one-word samples from our combined array, then two-word samples and so on up to 19. Thus I get not so small dataset. 

In [5]:
%%time
sample_size = 50000
for epoch in range(1, 20):
    if epoch < 19:
        if epoch == 1:
            corp = get_bootstrap_samples(cor, sample_size, epoch).reshape(sample_size, 1)
            length = 19 - epoch
            mid = np.hstack((np.zeros((sample_size, length)), corp))
            corpt = mid                                         # initializing final corpus 
    
        corp = get_bootstrap_samples(cor, sample_size, epoch)
        length = 19 - epoch
        mid = np.hstack((np.zeros((sample_size, length)), corp))
        corpt = np.vstack((mid, corpt))                         #stacking samples

    else:
        corp = get_bootstrap_samples(cor, sample_size, epoch)
        corpt = np.vstack((corp, corpt))
    

Wall time: 3.1 s


### Building the model.
Making a training text array and target labels.

In [6]:
data = pd.DataFrame(corpt)
data_truth = data.isin(outfile)  # check each str whether it has profane word in it
data_truth['sum'] = data_truth.sum(axis=1)
data_truth['label'] = data_truth['sum'].apply(lambda x: 0 if x < 1 else 1)
y = data_truth['label']

In [7]:
texts = data.to_csv(header=None, index=False).strip('\n').split('\n')

In [8]:
from sklearn.feature_extraction.text import TfidfTransformer, CountVectorizer, TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.pipeline import FeatureUnion
from sklearn.decomposition import NMF, TruncatedSVD

import joblib
import sklearn

print('The joblib version is {}.'.format(joblib.__version__))
print('The scikit-learn version is {}.'.format(sklearn.__version__))

The joblib version is 0.16.0.
The scikit-learn version is 0.23.1.


In [9]:
#preprocessing our final data

estimators = [('tfidf', TfidfTransformer()), ('svd', TruncatedSVD(1))]
combined = FeatureUnion(estimators)

Checking accuracy of our model on the training data

In [10]:
%%time
print(cross_val_score(
    Pipeline([
            ("vectorizer", CountVectorizer()),
            ("transformer", combined),
            ("classifier", LinearSVC())
        ]),
    texts,
    y
    ))

[0.995345 0.999975 0.99997  0.99979  0.96129 ]
Wall time: 2min 45s


In [11]:
# building final model

clf_pipeline2 = Pipeline([
            ("vectorizer", CountVectorizer()),
            ("transformer", combined),
            ("classifier", LinearSVC())
        ])


clf_pipeline2.fit(texts, y)

Pipeline(steps=[('vectorizer', CountVectorizer()),
                ('transformer',
                 FeatureUnion(transformer_list=[('tfidf', TfidfTransformer()),
                                                ('svd',
                                                 TruncatedSVD(n_components=1))])),
                ('classifier', LinearSVC())])

Trying some simple tests and checking perfomance.

In [13]:
%%time
print(clf_pipeline2.predict(['пошел нахуй', 'Все хорошо!', 'Потому что у мужчин подогрев хуевый, а у женщин пиздатый', '...Я не такой дурак, как ты выглядишь...',
            'моя фантазия заканчивается, ну допустим мудак']))

[1 0 1 1 1]
Wall time: 8 ms


In [14]:
# Save the model for deployment

joblib.dump(clf_pipeline2, 'model.joblib')

['model.joblib']