# SCC.413 Applied Data Mining
# Week 19
# Classification with scikit-learn 

## Contents
* [Introduction](#intro)
* [Preamble](#preamble)
* [Dataset](#dataset)
* [Train / Test split](#split)
* [CountVectorizer](#vectorizer)
* [Fitting a classifier](#fitting)
* [Making predictions](#predictions)
* [Evaluation](#eval)
* [Pipelines](#pipelines)
* [Cross validation](#cv)
* [Scaling and Normalization](#scaling)
* [Feature selection](#selection)
* [Optimisation with GridSearch](#gridsearch)
* [Customising preprocessing and feature extraction](#custom)
* [Exercise](#ex)

<a name="intro"></a>
## Introduction

In this lab you will be using [scikit-learn (sklearn)](https://scikit-learn.org/stable/index.html) to classify texts. scikit-learn provides numerous classification algorithms, as well as tools for preprocessing, feature selection, and evaluation.

We will be using authorship analysis as an example of classification, but the same techniques can be used for various NLP tasks involving classification, e.g. classifying topics, sentiment classification, etc. The features and documents used should be dictated by the task.

<a name="preamble"></a>
## Preamble

You should upload all of the provided files to a Google Drive folder; you can then access these files from your Python code. See also the files tab.

In [None]:
from google.colab import drive
drive.mount('/content/gdrive/')

We save the folder we are working from as a variable for easy access. You may need to edit the path to match your own.

In [None]:
working_folder = '/content/gdrive/MyDrive/413/wk19/'

The below code adds the working folder to the system path, so you can import Python files from this folder.

In [None]:
import sys
sys.path.append(working_folder)

<a name="dataset"></a>
## Dataset (Twitter GB celebs)

We will use a small dataset of 70 British Twitter "celebrities" for demonstration. A larger dataset of US Twitter celebrities is also available, but the increased size will mean that extracting features and fitting models will take longer.

A simple function for extracting filenames from a provided folder is given below. We pass filenames to sklearn to load the file and extract text, to avoid loading the whole corpus into memory. We can also pass in text directly (the default).

In [None]:
from os import listdir
from os.path import isfile, join, splitext, split

def list_files(folder):
    txtfiles = [join(folder, f) for f in listdir(folder) if isfile(join(folder, f)) and f.endswith(".txt")]
    return txtfiles

We will start by performing binary classification of gender (gender can be non-binary, but we focus on two genders here to simplify matters). The celeb data has been split into female and male folders, which are read in below and saved to the variable X (the standard "features" dataset variable). We store labels (classes) separately in the variable y.

In [None]:
f_files = list_files(working_folder + "celebs-gb-txt/female")
m_files = list_files(working_folder + "celebs-gb-txt/male")
X = f_files + m_files #X is usually used to denote the dataset to be trained and tested on, i.e. the features (or where features are extracted from)
y = ["female"] * len(f_files) + ["male"] * len(m_files) #y is usually used to store the labels/classes. Here we simply repeat female for how many female users we have, and then the same for male. Obviously X and y must be in same order.

print(len(X), len(y))

X and y should be the same length, with elements in the lists corresponding to each other.

In [None]:
print(y.count("female"), y.count("male"))

We can see the dataset is biased towards males. The US dataset has the same bias. This will generally mean that males are predicted more accurately (more training data), and we should bear in mind that a classifier which predicts everybody as male would achieve a raw accuracy of 64.2%. One could undersample the data (e.g. choose a random subset of 25 males), but this would reduce the amount of training data available. Alternatively, one could oversample the minority class. Neither is covered here, but feel free to experiment.
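
The majority-class baseline described above can be computed directly from a list of labels. The sketch below is illustrative only: `majority_baseline` is a hypothetical helper, and `toy_y` is made-up data rather than the real celebrity counts.

```python
from collections import Counter

def majority_baseline(labels):
    """Return (majority_class, accuracy) for always predicting the most common label."""
    counts = Counter(labels)
    majority_class, majority_count = counts.most_common(1)[0]
    return majority_class, majority_count / len(labels)

# Hypothetical toy labels, NOT the real dataset counts.
toy_y = ["female"] * 3 + ["male"] * 5
baseline_class, baseline_acc = majority_baseline(toy_y)
print(baseline_class, baseline_acc)
```

Running the same helper on the real `y` gives the accuracy that any trained classifier should beat.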

<a name="split"></a>
## Train / Test split

We need to train our model on some data, and test/evaluate on a separate set (later we will see evaluation through cross-validation). sklearn provides functionality for this with [`train_test_split`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html).

We provide the full dataset/corpus (X), the labels (y), and how large the test portion should be. Here we split the data 4 parts training, 1 part test. We provide a random_state so the split is consistent on each run. We stratify the splits (with the labels list (y)) to maintain the same proportions/bias. Returned are X and y split into train and test lists (of the same type as the input).

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.2, random_state = 0, stratify=y)
print(len(X_train), len(X_test))
print(len(y_train), len(y_test))

X_train and y_train are the same length (and corresponding order is maintained), likewise for X_test and y_test.

In [None]:
print(y_train.count("female"), y_train.count("male"))
print(y_test.count("female"), y_test.count("male"))

The same proportions are maintained with stratify. Try setting stratify to None, and see the impact.
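
To see what `stratify` does without touching the real files, the sketch below splits a made-up, imbalanced label list with and without stratification. The names `toy_X` and `toy_y` and the 10/30 class sizes are assumptions chosen purely for illustration.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 "female" and 30 "male" placeholder documents.
toy_X = [f"doc{i}" for i in range(40)]
toy_y = ["female"] * 10 + ["male"] * 30

# Stratified: class proportions are preserved in both portions.
_, _, strat_train, strat_test = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=0, stratify=toy_y)

# Unstratified: proportions in the test set depend on the random draw.
_, _, plain_train, plain_test = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=0, stratify=None)

print("stratified test:  ", Counter(strat_test))
print("unstratified test:", Counter(plain_test))
```

With stratification the 1:3 ratio carries over to both the training and test portions; without it, the test set may contain more or fewer of the minority class.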

<a name="vectorizer"></a>
## CountVectorizer

Now we have some training and test data, we can start training a model. We have files of texts, from which we need to extract features (the subject of previous labs) to form vectors that can be used in a classifier. sklearn provides some functionality for extracting text features (vectorization), in the form of [`CountVectorizer`](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html).

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(input='filename', analyzer='word')

We create an instance of the `CountVectorizer` with default settings, except for specifying the input type as `filename` meaning the files are read in and the text extracted. The default is to extract "words" from the text, meaning the text is tokenised and the words found counted (raw frequencies are returned).

You can try the `char` `analyzer`, which counts characters (codepoints actually, see previous lab), or `char_wb`, which counts characters within words. You can also use ngrams for both words and chars with `ngram_range`. You can also remove stopwords, and several other options.

To actually extract features (vectorize), we call `fit_transform` which both creates the vocabulary from the training data (fit), and creates a vector for each training instance (document), which in this case will be the counts for each word in the vocabulary. Because no restriction has been set on the vocabulary, every word type found in the training set will be included.

In [None]:
X_train_vectorized = vectorizer.fit_transform(X_train)

If we examine this vector, we can see it is large, with 56 rows (for the 56 training instances/documents), and over 180,000 features (the vocabulary).

In [None]:
X_train_vectorized.shape

You can see the features (words) in the vocabulary with `get_feature_names_out` from the vectorizer (every 5000th feature is shown below). You can see some of the noise in the words.

In [None]:
vectorizer.get_feature_names_out()[::5000]

<a name="fitting"></a>
## Fitting a classifier

Now we have features, we can train a classifier. We use standard [Logistic Regression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) with default settings (except a `random_state` to get consistent results each run). Any classifier could be used here, and you are encouraged to try different classifiers, [many are available](https://scikit-learn.org/stable/supervised_learning.html).

Once we have created the classifier, we run `fit`, providing the training vectors and training labels.

In [None]:
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression(solver='liblinear', random_state=0)
clf.fit(X_train_vectorized, y_train)

<a name="predictions"></a>
## Making predictions

We now have a trained model which we can use to make predictions, i.e. on our test set.

First though, we need to convert our test data into the same form as the train set (same features). To do this, we use the vectorizer and call `transform`. What would happen if we instead used `fit_transform`?

In [None]:
X_test_vectorized = vectorizer.transform(X_test)
X_test_vectorized.shape

We get a list of predictions by passing the vectorized test set to the classifier's predict method.

In [None]:
predictions = clf.predict(X_test_vectorized)
print(predictions)

<a name="eval"></a>
## Evaluation

To evaluate how well the classifier is predicting, we compare to the actual labels (y_test). There are [various metrics available for evaluation](https://scikit-learn.org/stable/modules/model_evaluation.html#classification-metrics), a few are shown below.

In [None]:
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

print("Accuracy: ", accuracy_score(y_test, predictions))
print(classification_report(y_test, predictions))
print(confusion_matrix(y_test, predictions))

The confusion matrix shows actual against predicted, with actuals as rows, and predicted as columns. Order is the order of the classes in the classifier:

In [None]:
clf.classes_

So:
- top left is the number of actual females predicted as female
- top right is the number of actual females predicted as male
- bottom left is the number of actual males predicted as female
- bottom right is the number of actual males predicted as male.

For binary classification, these are actually true positives (tp), false positives (fp), true negatives (tn), and false negatives (fn):

`[[tn fp]
  [fn tp]]`
  
The classes (only due to ordering) are 0 (negative): 'female' and 1 (positive): 'male'. Hence a male predicted as male is considered a 'true positive', and a female predicted as female is considered a 'true negative'. But these labels are meaningless when we are not classifying successes or hits (e.g. for information retrieval).
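
As a small worked example of this layout, the sketch below builds a confusion matrix from made-up labels and unpacks the four cells. The lists `toy_true` and `toy_pred` are invented for illustration, and 'male' is treated as the positive class only because it sorts second.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical actual and predicted labels for seven users.
toy_true = ["female", "female", "female", "male", "male", "male", "male"]
toy_pred = ["female", "male", "female", "male", "male", "female", "male"]

# Rows are actual classes, columns are predicted classes, in the order given.
cm = confusion_matrix(toy_true, toy_pred, labels=["female", "male"])
tn, fp, fn, tp = cm.ravel()

precision_male = tp / (tp + fp)  # of those predicted male, how many were male
recall_male = tp / (tp + fn)     # of the actual males, how many were found
print(cm)
print(precision_male, recall_male)
```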

A nice way to view a confusion matrix is with a heatmap:

In [None]:
import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

def confusion_matrix_heatmap(cm, index):
    cmdf = pd.DataFrame(cm, index = index, columns=index)
    dims = (5, 5)
    fig, ax = plt.subplots(figsize=dims)
    sns.heatmap(cmdf, annot=True, cmap="coolwarm", center=0)
    ax.set_ylabel('Actual')    
    ax.set_xlabel('Predicted')

In [None]:
confusion_matrix_heatmap(confusion_matrix(y_test,predictions), clf.classes_)

We can view which instances were predicted correctly and incorrectly as follows ([`zip`](https://docs.python.org/3.7/library/functions.html#zip) aggregates lists):

In [None]:
shortened = [f.replace(working_folder, "") for f in X_test]
pd.DataFrame(list(zip(shortened,y_test,predictions,y_test==predictions)), columns=["Instance", "Actual", "Predicted","Correct"])

<a name="pipelines"></a>
## Pipelines

Steps for vectorization, preprocessing, normalising, feature selection, classification, and more can be combined with an sklearn [Pipeline](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html). A pipeline conveniently combines the steps needed to transform and fit training data, and the same pipeline can then be used to transform and predict test data. Pipelines also make cross-validation much simpler, because the whole pipeline is fit and evaluated in the same way on each fold.

In [None]:
from sklearn.pipeline import Pipeline

model = Pipeline([
    ('vectorizer', CountVectorizer(input='filename', analyzer='word')),
    ('clf', LogisticRegression(solver='liblinear', random_state=0)),
])

Above, we create a pipeline with exactly the same steps as we used earlier. To perform all transforms (the steps before the classifier) and do the final training, we simply call `fit`, providing the training data as below. `fit_transform` is called on every step before the last, and then `fit` is called on the final step.

In [None]:
model.fit(X_train, y_train)

We now have a trained model. It is important to note that the Pipeline provides all the functions of its last step, which in this case is a classifier. We can get predictions by simply calling `predict`. This applies the transforms of the (already fitted) steps before the classifier, and then calls `predict` on the last step.

In [None]:
predictions = model.predict(X_test)

In [None]:
print("Accuracy: ", accuracy_score(y_test, predictions))
print(classification_report(y_test, predictions))
print(confusion_matrix(y_test, predictions))

The results are exactly the same as above, as we've used the same steps. We can edit the pipeline parameters, or even replace whole steps. To reference a step, we use `step__param`. For example, to add a `max_features` to the vectorizer (to limit the number of features to the top n):

In [None]:
model.set_params(vectorizer__max_features=1000)
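
The `step__param` naming can be checked on a small in-memory pipeline. This is a minimal sketch with made-up documents and labels (`toy_docs`, `toy_labels`); it only shows that the parameter reaches the named step, not anything about the real model.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus, passed as raw text.
toy_docs = ["lovely day out", "great match today",
            "lovely lovely evening", "match day today"]
toy_labels = ["a", "b", "a", "b"]

toy_model = Pipeline([
    ("vectorizer", CountVectorizer()),
    ("clf", LogisticRegression(solver="liblinear", random_state=0)),
])

# "<step name>__<parameter name>" addresses a parameter of one step.
toy_model.set_params(vectorizer__max_features=3)
toy_model.fit(toy_docs, toy_labels)

# The fitted vectorizer step keeps only the 3 most frequent words.
fitted_vocab = toy_model.named_steps["vectorizer"].vocabulary_
print(sorted(fitted_vocab))
```

Note that the change only takes effect once the pipeline is fitted again, which is why the model is refitted in the next cell.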

In [None]:
model.fit(X_train, y_train)
predictions = model.predict(X_test)
print("Accuracy: ", accuracy_score(y_test, predictions))
print(classification_report(y_test, predictions))
print(confusion_matrix(y_test, predictions))

To change the classifier we use the same process, just referencing the named step for the classifier (`clf` here). Below we change to a [Multinomial Naive Bayes classifier](https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html).

In [None]:
from sklearn.naive_bayes import MultinomialNB

model.set_params(clf=MultinomialNB())

In [None]:
model.fit(X_train, y_train)
predictions = model.predict(X_test)
print("Accuracy: ", accuracy_score(y_test, predictions))
print(classification_report(y_test, predictions))
print(confusion_matrix(y_test, predictions))

<a name="cv"></a>
## Cross validation

So far we have been using a single train/test split. What if our test set happens to be particularly easy or hard to predict? What if our training set contains noisier data than average? As discussed in the lecture, cross validation helps here by performing multiple evaluations, splitting the data into different train/test splits.

With pipelines and sklearn, performing cross validation is straightforward, and we can be assured that the entire pipeline is fit and transformed for each fold individually. [Various methods are available for cross validation](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.model_selection); we use [`cross_validate`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_validate.html#sklearn.model_selection.cross_validate) (to return multiple metrics) with [`StratifiedKFold`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedKFold.html#sklearn.model_selection.StratifiedKFold), with 5 folds.

In [None]:
from sklearn.model_selection import cross_validate, StratifiedKFold

cv_scores = cross_validate(model, X, y, 
                           cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0), 
                           return_train_score=False, 
                           scoring=['accuracy', 'precision_weighted', 'recall_weighted', 'f1_weighted']
                          )

In [None]:
print(cv_scores)

In [None]:
def print_cv_scores_summary(name, scores):
    # All four values are percentages, so each gets a % sign.
    print("{}: mean = {:.2f}%, sd = {:.2f}%, min = {:.2f}%, max = {:.2f}%".format(name, scores.mean()*100, scores.std()*100, scores.min()*100, scores.max()*100))

In [None]:
print_cv_scores_summary("Accuracy", cv_scores['test_accuracy'])
print_cv_scores_summary("Precision", cv_scores['test_precision_weighted'])
print_cv_scores_summary("Recall", cv_scores['test_recall_weighted'])
print_cv_scores_summary("F1", cv_scores['test_f1_weighted'])

Results are returned for all folds, and by looking at the mean, standard deviation, and range of scores, one can assess the classifier's performance more thoroughly. There appears to be some variance between folds, and the original train/test split appears to have been quite lucky, achieving a higher accuracy than is typical under cross validation.
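
To make the fold structure concrete, the sketch below inspects the test portion of each fold produced by `StratifiedKFold` on a made-up label list. The 10/15 class sizes and the names `toy_X` and `toy_y` are assumptions for illustration.

```python
from collections import Counter
from sklearn.model_selection import StratifiedKFold

# Hypothetical toy labels: 10 "female" and 15 "male".
toy_y = ["female"] * 10 + ["male"] * 15
toy_X = list(range(len(toy_y)))  # placeholder instances

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

fold_counts = []
for train_idx, test_idx in skf.split(toy_X, toy_y):
    # Count the classes in the held-out part of this fold.
    fold_counts.append(Counter(toy_y[i] for i in test_idx))

for i, counts in enumerate(fold_counts):
    print("fold", i, dict(counts))
```

Every instance appears in exactly one test fold, and each test fold keeps the same class ratio as the full label list.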

<a name="scaling"></a>
## Scaling and Normalization
So far we have been using raw frequencies in our model. Often we want to standardize, scale or normalize features. [Various options are available](https://scikit-learn.org/stable/modules/preprocessing.html#preprocessing). A few options are demonstrated below.

For 'standardization', i.e. to zero mean and unit variance, you can use [`StandardScaler`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html#sklearn.preprocessing.StandardScaler).

For 'normalization', i.e. setting each instance's feature vector's norm (l1 or l2) to 1, you can use [`Normalizer`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.Normalizer.html#sklearn.preprocessing.Normalizer). Using l1 norm, it is possible to get something equivalent to relative frequencies of words (i.e. dividing by total frequency), but this is only the case if every word is included as a feature (which isn't normally the case).
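
As a small illustration of the l1 case, the sketch below normalizes a made-up count matrix in which every word is assumed to be a feature, so each row becomes relative frequencies. `toy_counts` is invented data.

```python
import numpy as np
from sklearn.preprocessing import Normalizer

# Hypothetical raw counts: 2 documents, 3 words (the whole vocabulary).
toy_counts = np.array([[2, 1, 1],
                       [0, 3, 1]], dtype=float)

# l1 normalization divides each row by the sum of its absolute values.
relative = Normalizer(norm="l1").fit_transform(toy_counts)
print(relative)
```

Each row now sums to 1, which matches relative frequency only because no words were dropped from the vocabulary here.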

A more common normalization for text data is the already discussed TF-IDF. sklearn's [`TfidfTransformer`](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html#sklearn.feature_extraction.text.TfidfTransformer) can be utilised to calculate TF-IDF scores, though the implementation is different to that normally used, and includes l2 normalization after TF-IDF is calculated by default.
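
To see how sklearn's variant behaves, the sketch below recomputes it by hand on a made-up count matrix and compares the result with `TfidfTransformer(norm=None)`. It assumes the default `smooth_idf=True`, for which the scikit-learn documentation gives the IDF as ln((1 + n) / (1 + df)) + 1, where n is the number of documents and df the document frequency of the term.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfTransformer

# Hypothetical raw counts: 3 documents, 3 terms.
toy_counts = np.array([[3, 0, 1],
                       [2, 0, 0],
                       [3, 1, 0]])

n_docs = toy_counts.shape[0]
doc_freq = (toy_counts > 0).sum(axis=0)  # documents containing each term

# Smoothed IDF as documented for the sklearn defaults (assumption: smooth_idf=True).
manual_idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1
manual_tfidf = toy_counts * manual_idf   # no l2 normalization applied

sk_tfidf = TfidfTransformer(norm=None).fit_transform(toy_counts).toarray()
print(np.allclose(manual_tfidf, sk_tfidf))
```

A term that occurs in every document receives an IDF of 1 rather than 0, so it is down-weighted relative to rarer terms but never removed.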

In [None]:
from sklearn.feature_extraction.text import TfidfTransformer

In [None]:
model = Pipeline([
    ('vectorizer', CountVectorizer(input='filename', analyzer='word', max_features=1000)),
    ('norm', TfidfTransformer(norm=None)),
    ('clf', LogisticRegression(solver='liblinear', random_state=0)),
])

In [None]:
cv_scores = cross_validate(model, X, y, cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0), return_train_score=False, scoring=['accuracy', 'precision_weighted', 'recall_weighted', 'f1_weighted'])
print_cv_scores_summary("Accuracy", cv_scores['test_accuracy'])
print_cv_scores_summary("Precision", cv_scores['test_precision_weighted'])
print_cv_scores_summary("Recall", cv_scores['test_recall_weighted'])
print_cv_scores_summary("F1", cv_scores['test_f1_weighted'])

Finally, another common normalization technique is to binarize features (i.e. scale to 0 for not present or 1 for present at any frequency). This is useful when documents are short (e.g. individual Tweets). It is performed with [`Binarizer`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.Binarizer.html#sklearn.preprocessing.Binarizer). Here we perform binarization before TF-IDF.

In [None]:
from sklearn.preprocessing import Binarizer

model = Pipeline([
    ('vectorizer', CountVectorizer(input='filename', analyzer='word', max_features=1000)),
    ('norm', Binarizer()),
    ('norm2', TfidfTransformer(norm=None)),
    ('clf', LogisticRegression(solver='liblinear', random_state=0)),
])

In [None]:
cv_scores = cross_validate(model, X, y, cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0), return_train_score=False, scoring=['accuracy', 'precision_weighted', 'recall_weighted', 'f1_weighted'])
print_cv_scores_summary("Accuracy", cv_scores['test_accuracy'])
print_cv_scores_summary("Precision", cv_scores['test_precision_weighted'])
print_cv_scores_summary("Recall", cv_scores['test_recall_weighted'])
print_cv_scores_summary("F1", cv_scores['test_f1_weighted'])

<a name="selection"></a>
## Feature selection
When the number of features is high, one option available is some form of dimensionality reduction or feature selection, e.g. to remove features with low variance across instances/documents, to select features with significant differences between classes, or features that have high weights/coefficients in other models. [Various methods are available for this](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.feature_selection). An example below selects 1000 features with the highest values for the [chi-squared test](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.chi2.html#sklearn.feature_selection.chi2).

In [None]:
from sklearn.feature_selection import SelectKBest, chi2

model = Pipeline([
    ('vectorizer', CountVectorizer(input='filename', analyzer='word')),
    ('selector', SelectKBest(chi2, k=1000)),
    ('clf', LogisticRegression(solver='liblinear', random_state=0)),
])

In [None]:
cv_scores = cross_validate(model, X, y, cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0), return_train_score=False, scoring=['accuracy', 'precision_weighted', 'recall_weighted', 'f1_weighted'])
print_cv_scores_summary("Accuracy", cv_scores['test_accuracy'])
print_cv_scores_summary("Precision", cv_scores['test_precision_weighted'])
print_cv_scores_summary("Recall", cv_scores['test_recall_weighted'])
print_cv_scores_summary("F1", cv_scores['test_f1_weighted'])

If you would like to view the selected features, the code below will help.

In [None]:
vectorizer = CountVectorizer(input='filename', analyzer='word')
selector = SelectKBest(chi2, k=100)
feats = vectorizer.fit_transform(X_train)
filtered = selector.fit_transform(feats, y_train)

In [None]:
from itertools import compress
cols = selector.get_support()
names = vectorizer.get_feature_names_out()

print(list(compress(names,cols)))
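
The same idea can be followed on a matrix small enough to check by hand. In the sketch below the feature names and counts are made up: "common" occurs about equally in both classes, while "she" and "he" each occur in only one class.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

# Hypothetical features and counts for four documents.
toy_names = np.array(["common", "she", "he"])
toy_X = np.array([[5, 3, 0],
                  [4, 2, 0],
                  [5, 0, 3],
                  [6, 0, 2]])
toy_y = ["female", "female", "male", "male"]

toy_selector = SelectKBest(chi2, k=2).fit(toy_X, toy_y)

# get_support() returns a boolean mask over the columns.
selected = toy_names[toy_selector.get_support()].tolist()
print(selected)
print(toy_selector.scores_)
```

The class-specific words score far higher than the evenly distributed one, so they are the two features kept.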

<a name="gridsearch"></a>
## Optimisation with GridSearch

We have played around with different options available in sklearn, but for actual experiments you will want to be more structured and scientific. [Various optimizers are available](https://scikit-learn.org/stable/modules/classes.html#hyper-parameter-optimizers). [`GridSearchCV`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html#sklearn.model_selection.GridSearchCV) allows for an exhaustive search across a range of listed parameter values, and then runs cross validation for every model built (for every combination of parameters). With a large number of parameters, the search will take some time to run due to the number of models that need to be fit and tested.

An example is given below with `k` (the number of top features kept by chi-squared feature selection) set to different values, and two different classifiers (Naive Bayes and Logistic Regression). This could easily be extended to introduce new parameters and steps in the pipeline.

Note that we are performing our cross-validation over just the training data (`X_train`). `X_test` is reserved for testing the best model. By passing the name of a scoring metric to `refit`, the parameters that perform best on that metric are selected from the grid search, and a model refitted on all of the provided instances (here `X_train`) is produced and returned.

In [None]:
from sklearn.model_selection import GridSearchCV

model = Pipeline([
    ('vectorizer', CountVectorizer(input='filename', analyzer='word')),
    ('selector', SelectKBest(score_func = chi2)),
    ('clf', None), #clf set in param_grid.
])

search = GridSearchCV(model, cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0), 
                      return_train_score=False, 
                      scoring=['accuracy', 'precision_weighted', 'recall_weighted', 'f1_weighted'],
                      refit = 'f1_weighted',
                      param_grid={
                          'selector__k': [10, 50, 100, 500],
                          'clf': [MultinomialNB(), LogisticRegression(solver='liblinear', random_state=0)],
                      },
                      n_jobs=-1) #run folds in parallel on multiple processors if available.
search.fit(X_train, y_train)

We can see the results of all classifiers. Note which is ranked highest.

In [None]:
pd.DataFrame(search.cv_results_)

`search` contains the highest ranked model, which is already fit, and can be used to make predictions and evaluate on the test set.

In [None]:
predictions = search.predict(X_test)

print("Accuracy: ", accuracy_score(y_test, predictions))
print(classification_report(y_test, predictions))
print(confusion_matrix(y_test, predictions))

confusion_matrix_heatmap(confusion_matrix(y_test,predictions), search.classes_)
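
For experimenting with the mechanics away from the real corpus, here is a minimal sketch of the same kind of search on a made-up in-memory corpus. The documents, the labels "a" and "b", the small `k` values, and the 3-fold setting are all assumptions chosen so that the example runs quickly.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus: six short documents per class.
a_docs = ["lovely day with friends", "lovely evening with family",
          "such a lovely weekend", "lovely walk this morning",
          "lovely dinner with friends", "a lovely day out"]
b_docs = ["great match tonight", "great win for the team",
          "what a great goal", "great game this weekend",
          "great result today", "a great match today"]
toy_docs = a_docs + b_docs
toy_labels = ["a"] * len(a_docs) + ["b"] * len(b_docs)

toy_pipeline = Pipeline([
    ("vectorizer", CountVectorizer()),
    ("selector", SelectKBest(score_func=chi2)),
    ("clf", LogisticRegression(solver="liblinear", random_state=0)),  # replaced by the grid
])

toy_search = GridSearchCV(
    toy_pipeline,
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=0),
    scoring="f1_weighted",
    param_grid={
        "selector__k": [1, 2],
        "clf": [MultinomialNB(),
                LogisticRegression(solver="liblinear", random_state=0)],
    },
)
toy_search.fit(toy_docs, toy_labels)

print(toy_search.best_params_)
toy_preds = toy_search.predict(["lovely lovely day", "great great match"])
print(toy_preds)
```

With a single scoring metric, `refit` defaults to `True`, so `toy_search` can be used for prediction straight after `fit`.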

<a name="custom"></a>
## Customising preprocessing and feature extraction
So far we have been using sklearn's own tokenisation and basic bag-of-words. What about all that you've learnt for preprocessing, tokenisation, and feature extraction in previous labs? Surely we can do better than BoW with broken tokenisation!

Included below are some basic methods for preprocessing and tokenisation from last week's lab.

In [None]:
!pip install ftfy

In [None]:
import ftfy
import nltk
import re

hashtag_re = re.compile(r"#\w+")
mention_re = re.compile(r"@\w+")
url_re = re.compile(r"(?:https?://)?(?:[-\w]+\.)+[a-zA-Z]{2,9}[-\w/#~:;.?+=&%@~]*")


def preprocess(text):
    p_text = hashtag_re.sub("[hashtag]",text)
    p_text = mention_re.sub("[mention]",p_text)
    p_text = url_re.sub("[url]",p_text)
    p_text = ftfy.fix_text(p_text)
    return p_text

tokenise_re = re.compile(r"(\[[^\]]+\]|[-'\w]+|[^\s\w\[']+)") #([]|words|other non-space)
def custom_tokenise(text):
    return tokenise_re.findall(text.lower())

def nltk_twitter_tokenise(text):
    twtok = nltk.tokenize.TweetTokenizer()
    return twtok.tokenize(text.lower())
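
To see what the preprocessing and custom tokeniser produce, the sketch below applies the same regular expressions to one made-up Tweet. The `ftfy` step is left out so that the example needs only the standard library, and `demo_preprocess`, `demo_tokenise` and the sample text are hypothetical names and data.

```python
import re

# Patterns copied from the cell above.
hashtag_re = re.compile(r"#\w+")
mention_re = re.compile(r"@\w+")
url_re = re.compile(r"(?:https?://)?(?:[-\w]+\.)+[a-zA-Z]{2,9}[-\w/#~:;.?+=&%@~]*")
tokenise_re = re.compile(r"(\[[^\]]+\]|[-'\w]+|[^\s\w\[']+)")

def demo_preprocess(text):
    """Replace hashtags, mentions and URLs with placeholders (ftfy step omitted)."""
    text = hashtag_re.sub("[hashtag]", text)
    text = mention_re.sub("[mention]", text)
    return url_re.sub("[url]", text)

def demo_tokenise(text):
    """Split into bracketed placeholders, words, and runs of other symbols."""
    return tokenise_re.findall(text.lower())

sample = "@bbc Can't wait for #glastonbury!! https://t.co/abc"
cleaned = demo_preprocess(sample)
tokens = demo_tokenise(cleaned)
print(cleaned)
print(tokens)
```

Every distinct username, hashtag and link collapses to one of three placeholder tokens, while contractions such as "can't" stay intact.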

We can utilise these methods with `CountVectorizer` by simply passing the callable functions for `preprocessor` and `tokenizer`:

In [None]:
vectorizer = CountVectorizer(input='filename', analyzer='word', tokenizer=custom_tokenise, preprocessor=preprocess)
feats = vectorizer.fit_transform(X_train)

In [None]:
feats.shape

Notice that the vocabulary has reduced by more than half. Why is this?

In [None]:
vectorizer.get_feature_names_out()[::2500]

A quick look at the features also indicates that they are much more sensible as "words".

In [None]:
model = Pipeline([
    ('vectorizer', CountVectorizer(input='filename', analyzer='word',tokenizer=custom_tokenise, preprocessor=preprocess)),
    ('selector', SelectKBest(chi2, k=1000)),
    ('clf', LogisticRegression(solver='liblinear', random_state=0)),
])

In [None]:
cv_scores = cross_validate(model, X, y, cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0), return_train_score=False, scoring=['accuracy', 'precision_weighted', 'recall_weighted', 'f1_weighted'])
print_cv_scores_summary("Accuracy", cv_scores['test_accuracy'])
print_cv_scores_summary("Precision", cv_scores['test_precision_weighted'])
print_cv_scores_summary("Recall", cv_scores['test_recall_weighted'])
print_cv_scores_summary("F1", cv_scores['test_f1_weighted'])

<a name="ex"></a>
## Exercise

This week's exercise is to experiment and build the best gender classifier you can, utilising what has been shown above. Using the GB data is fine, but you can try the US data if you wish. Use `GridSearchCV` to evaluate different models, and fit the best performing model and parameters. Report the final results on the test set. Example parameters you could change:

- Classifiers
- Classifier hyper-parameters
- Feature selections
- Normalization / standardization / binarization
- Feature size cut-offs
- Tokenisers
- Preprocessing steps
- Word / Characters
- Ngram range