# Classifying text by whether it contains a Dodo promo code

This can be used as a quick check before parsing a text for potential promo codes.

## Reading training data from a file
I collected the training data from many different sites. The file is about 40 KB in size.

In [14]:
import re
import yaml
from textblob import TextBlob
from textblob.classifiers import basic_extractor, NaiveBayesClassifier

# The training texts are in Russian, so read the file explicitly as UTF-8.
with open("promocode_classifier/train.yaml", encoding="utf-8") as file:
    train = yaml.safe_load(file)["train_data"]
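
Before training, it can help to check the shape of the loaded data. The sketch below assumes that each entry is a `[text, label]` pair and that the labels are `"+"` and `"-"` (the two labels that appear in the classifier output further down). The helper name `check_train_data` and the sample texts are hypothetical, not taken from `train.yaml`.

```python
from collections import Counter


def check_train_data(train):
    """Validate (text, label) pairs and count the examples per label."""
    counts = Counter()
    for index, entry in enumerate(train):
        if len(entry) != 2:
            raise ValueError("entry {} is not a (text, label) pair".format(index))
        text, label = entry
        if not isinstance(text, str) or label not in ("+", "-"):
            raise ValueError("entry {} has an unexpected text or label".format(index))
        counts[label] += 1
    return counts


# Hypothetical sample in the assumed format.
sample = [
    ["Используй промокод DODO5 и получи скидку", "+"],
    ["Новая пицца уже в меню", "-"],
    ["Промокод PIZZA30 на все пиццы 30 см", "+"],
]
print(check_train_data(sample))  # Counter({'+': 2, '-': 1})
```

A heavily unbalanced count would be worth knowing about, since a Naive Bayes classifier also learns the prior probability of each label from the training set.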


## Adding a custom feature extractor and training a Naive Bayes classifier
The custom extractor finds anything in the text that looks like a promo code. It adds one feature
for the length of each candidate and another that marks whether the text contains any candidate at all.

In [15]:
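def promocode_extractor(document: TextBlob, train_set):
    # Look for 3-7 uppercase Latin letters or digits between delimiters.
    # The hyphen is escaped so that "(-:" is not read as a character range.
    matches = re.findall(r"[(\-:,.\s][A-Z0-9]{3,7}[-:,.\s)]", str(document))
    features = {}
    # re.findall returns an empty list (never None) when nothing matches.
    if matches:
        features["contains_code"] = True
        for match in matches:
            # Each match includes its delimiters; strip() removes only whitespace.
            features["contains_code(len:{})".format(len(match.strip()))] = True
    return features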


# Extension point: every function listed here is called for each document,
# and the feature dicts they return are merged into the final result.
extractors = [promocode_extractor]


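def dodo_extractor(document: TextBlob, train_set):
    # Start from TextBlob's default "contains(word)" features,
    # then merge in the features from every custom extractor.
    result = basic_extractor(document, train_set)
    for extractor in extractors:
        result = {**result, **extractor(document, train_set)}
    return result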


tr = NaiveBayesClassifier(train, feature_extractor=dodo_extractor)
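
To see what the promo code features look like for individual texts, here is a standalone sketch that needs only `re`. It is a variant of the extractor, not the exact one used for training: the hyphen in the first character class is escaped, and the code is captured in a group so that delimiters do not count towards the length. The name `code_features` and the sample texts are hypothetical.

```python
import re

# 3-7 uppercase Latin letters or digits, with a delimiter on each side.
# Only the part in parentheses (the candidate code) is returned by findall.
CODE_RE = re.compile(r"[(\-:,.\s]([A-Z0-9]{3,7})[\-:,.\s)]")


def code_features(text):
    """Return promo-code features for a plain string."""
    codes = CODE_RE.findall(text)
    features = {}
    if codes:
        features["contains_code"] = True
        for code in codes:
            features["contains_code(len:{})".format(len(code))] = True
    return features


print(code_features("Используй промокод DODO5 и получи скидку"))
# {'contains_code': True, 'contains_code(len:5)': True}
print(code_features("Новая пицца уже в меню"))
# {}
```

Two properties of this pattern are worth keeping in mind. A candidate at the very start or end of the text is not matched, because a delimiter is required on both sides. Plain numbers of three or more digits, such as the price in "Пицца за 399 рублей", also count as candidates.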

## Getting some more info about our model


In [16]:
tr.show_informative_features()

Most Informative Features
    contains_code(len:5) = True                + : -      =      8.0 : 1.0
           contains(руб) = True                - : +      =      4.4 : 1.0
            contains(30) = True                + : -      =      4.4 : 1.0
      contains(Промокод) = True                + : -      =      4.3 : 1.0
    contains_code(len:4) = True                + : -      =      4.1 : 1.0
     contains(Используй) = True                + : -      =      3.4 : 1.0
          contains(всех) = True                + : -      =      3.4 : 1.0
       contains(Додстер) = True                - : +      =      3.2 : 1.0
        contains(рублей) = True                + : -      =      2.8 : 1.0
       contains(Условия) = True                + : -      =      2.7 : 1.0


Texts with promo codes tend to contain:

  - a promo-code-like token that is 5 characters long (or, less strongly, 4 characters)
  - the number 30 (I guess because most promo codes only apply to 30 cm pizzas)
  - the word "Промокод" ("promo code")
  - the word "Используй" ("use")
  - the word "всех" ("all")
  - the word "рублей" ("rubles"; not sure why it's positive when the abbreviation "руб" is negative)
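
One way to read the ratios in the table: each one is roughly how many times more often a feature appears in texts of one label than in texts of the other. The sketch below computes such a ratio from counts, using additive smoothing of 0.5 as a simplifying assumption. The counts and the helper name `likelihood_ratio` are made up for illustration, and the library's exact estimator may differ.

```python
def likelihood_ratio(count_pos, total_pos, count_neg, total_neg, smoothing=0.5):
    """Estimate P(feature | "+") / P(feature | "-") with additive smoothing.

    The smoothing term is added for two outcomes: feature present or absent.
    """
    p_pos = (count_pos + smoothing) / (total_pos + 2 * smoothing)
    p_neg = (count_neg + smoothing) / (total_neg + 2 * smoothing)
    return p_pos / p_neg


# Hypothetical counts: the feature occurs in 40 of 100 "+" texts
# and in 5 of 100 "-" texts.
ratio = likelihood_ratio(40, 100, 5, 100)
print("+ : - = {:.1f} : 1.0".format(ratio))  # + : - = 7.4 : 1.0
```

With small training sets these ratios are noisy, which may be part of why closely related words such as "рублей" and "руб" can end up on opposite sides.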

**TODO:** add more stuff