<a href="https://colab.research.google.com/github/lauradang/FYOUZE/blob/master/calendar_model/agenda_classifier_svm_models.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Training Agenda Type/Template SVM
For the agenda type classifiers, we need a way to identify items that do not belong in any category. Adding a garbage class to the classifier makes it hard for the neural network to generalize and therefore lowers the model's overall validation accuracy.

Training a separate one-class SVM to detect outliers lets us identify items that do not fit into any category while keeping the classifier's accuracy high.


**IMPORTANT DEPENDENCY NOTE**:

If the `scikit-learn` version used by the API differs from the version used in this notebook, the saved model will most likely not work in the API. In that case, train and save the model again with the desired `scikit-learn` version, using the same hyperparameters shown in this notebook.
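
One way to make a version mismatch fail loudly is to store the library version next to the pickled model and compare it at load time. The sketch below is an assumption, not part of this notebook: `save_with_version` and `load_checked` are hypothetical helpers, and the version string is passed in explicitly (in practice it would likely come from `sklearn.__version__`).

```python
import io
import pickle


def save_with_version(model, version, fileobj):
    """Pickle the model together with the library version that produced it."""
    pickle.dump({"model": model, "sklearn_version": version}, fileobj)


def load_checked(fileobj, current_version):
    """Load a bundle and refuse to use it if the versions differ."""
    bundle = pickle.load(fileobj)
    saved = bundle["sklearn_version"]
    if saved != current_version:
        raise RuntimeError(
            f"Model was pickled with scikit-learn {saved}, "
            f"but {current_version} is installed; retrain and save again."
        )
    return bundle["model"]


# Usage with an in-memory buffer and a stand-in object instead of a real model
buffer = io.BytesIO()
save_with_version({"stand_in": "model"}, "0.20.3", buffer)

buffer.seek(0)
restored = load_checked(buffer, "0.20.3")  # versions match, so this loads
```

Raising an error is a deliberate choice here: a hard failure at load time is easier to diagnose than a model that loads but predicts differently.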


In [5]:
!pip install scikit-learn==0.20.3

Collecting scikit-learn==0.20.3
  Downloading https://files.pythonhosted.org/packages/5e/82/c0de5839d613b82bddd088599ac0bbfbbbcbd8ca470680658352d2c435bd/scikit_learn-0.20.3-cp36-cp36m-manylinux1_x86_64.whl (5.4MB)
Installing collected packages: scikit-learn
  Found existing installation: scikit-learn 0.22.2.post1
    Uninstalling scikit-learn-0.22.2.post1:
      Successfully uninstalled scikit-learn-0.22.2.post1
Successfully installed scikit-learn-0.20.3


In [5]:
!pip list

Package                  Version        
------------------------ ---------------
absl-py                  0.9.0          
alabaster                0.7.12         
albumentations           0.1.12         
altair                   4.1.0          
asgiref                  3.2.7          
astor                    0.8.1          
astropy                  4.0.1.post1    
astunparse               1.6.3          
atari-py                 0.2.6          
atomicwrites             1.3.0          
attrs                    19.3.0         
audioread                2.1.8          
autograd                 1.3            
Babel                    2.8.0          
backcall                 0.1.0          
beautifulsoup4           4.6.3          
bleach                   3.1.4          
blis                     0.4.1          
bokeh                    1.4.0          
boto                     2.49.0         
boto3                    1.12.39        
botocore                 1.15.39        
Bottleneck      

In [1]:
from google.colab import drive
drive.mount("/content/gdrive", force_remount=True)

Mounted at /content/gdrive


In [0]:
import re
import pandas as pd
import pickle
import numpy as np
import itertools
from tqdm.notebook import tqdm

from sklearn import svm, preprocessing
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.utils import column_or_1d
from sklearn.svm import OneClassSVM

## Glance at Dataset

In [0]:
path = "/content/gdrive/My Drive/Colab Notebooks/agenda_classifiers/agenda_data_generation/balanced_agenda_type_dataset.csv"
df = pd.read_csv(path, index_col=0)

In [6]:
df.head()

Unnamed: 0,channel_name,channel_type,channel_description,channel_size
0,mark.hughes@intercom.io,one-on-one,,2.0
1,Djiboutian franc,none,,10.0
2,John Stacy Lagasca,one-on-one,,2.0
3,Jason & Raymond,one-on-one,,2.0
4,goodwin,none,,2.0


In [7]:
df.channel_type.value_counts()

none          31578
group         15789
one-on-one    15789
Name: channel_type, dtype: int64

## Constants and Helper Functions

In [0]:
SVM_MODEL_DIR = "/content/gdrive/My Drive/models/svm_models"

In [0]:
def preprocess_text(sentence):
    # Non-string values (e.g. NaN) are treated as empty text
    if not isinstance(sentence, str):
        return ""

    sentence = sentence.lower().replace("1", "one").replace("&", "and")

    # Replace every character except letters, "@" and ":" with a space
    # (raw string avoids invalid escape sequences such as "\@")
    sentence = re.sub(r"[^a-zA-Z@:]", " ", sentence)

    # Single character removal
    sentence = re.sub(r"\s+[a-zA-Z]\s+", " ", sentence)

    # Removing multiple spaces
    sentence = re.sub(r"\s+", " ", sentence)

    return sentence

In [0]:
def build_dataset(X_train, root_path, pkl_file):
  tfidf_vect = TfidfVectorizer(max_features=5000)
  tfidf_vect.fit(X_train)

  # Save the fitted vectorizer; the context manager closes the file handle
  with open(f"{root_path}/{pkl_file}", "wb") as vec_file:
    pickle.dump(tfidf_vect, vec_file)

  X_tfidf_train = tfidf_vect.transform(X_train)

  return X_tfidf_train

In [0]:
def train_model(X_train, kernel, nu, gamma):
  model = OneClassSVM(kernel=kernel, nu=nu, gamma=gamma)
  model.fit(X_train)

  return model

In [0]:
def optimize_one_class_svm(X_train, X_test, y_test, vectorizer):
  kernels = ["linear", "poly", "rbf", "sigmoid"]
  gammas = [0.01, 0.001, 0.0001, "auto", "scale"]
  nus = [0.15, 0.25, 0.5, 0.75, 0.95]

  # Every (kernel, gamma, nu) combination in the grid
  combinations = list(itertools.product(kernels, gammas, nus))
  curr_acc = 0
  model_info = {}

  for kernel, gamma, nu in tqdm(combinations):
    # Named `model` so the imported `svm` module is not shadowed
    model = train_model(X_train, kernel, nu, gamma)
    preds = get_predictions(X_test, model, vectorizer)
    acc = calculate_accuracy(preds, y_test)

    if acc > curr_acc:
      curr_acc = acc

      model_info["model"] = model
      model_info["acc"] = curr_acc

      # Persist the best model found so far
      with open(f"{SVM_MODEL_DIR}/optimal_svm.pkl", "wb") as svm_pkl:
        pickle.dump(model_info, svm_pkl)

  # Return the best model, not the last one trained
  return model_info.get("model"), curr_acc

In [0]:
def get_predictions(X_test, model, vectorizer):
  items = vectorizer.transform(X_test)
  predictions = model.predict(items)

  return predictions

In [0]:
def calculate_accuracy(predictions, y_test):
  return sum(
      [pred == actual for pred, actual in zip(predictions, y_test)]
  ) / len(predictions)

## Preparing Dataset
Our `OneClassSVM` is an unsupervised algorithm, so there is no need to have labels for the training set. We also only want to feed the model "positive" data (data that is not considered an outlier).
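
As a rough illustration of training on positive data only, the sketch below fits a `OneClassSVM` on a handful of made-up "relevant" names and then scores new text. The names, `nu`, and `gamma` are assumptions chosen for illustration; they are not the dataset or the tuned values from this notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import OneClassSVM

# Toy "relevant" names only; no labels and no outliers are passed to fit()
inlier_names = [
    "weekly team meeting",
    "team sync",
    "one on one with manager",
    "project planning meeting",
    "design team standup",
]

vectorizer = TfidfVectorizer()
X_inliers = vectorizer.fit_transform(inlier_names)

# nu and gamma are illustrative values, not the tuned ones
detector = OneClassSVM(kernel="rbf", nu=0.25, gamma=0.1)
detector.fit(X_inliers)

# predict() returns +1 for inliers and -1 for outliers
new_names = ["team meeting", "djiboutian franc"]
preds = detector.predict(vectorizer.transform(new_names))
print(list(zip(new_names, preds)))
```

With so little data the individual predictions are not meaningful; the point is only the shape of the workflow: fit on inliers, predict on anything.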

For the purposes of determining outliers, `channel_description` and `channel_size` are not very helpful features, so they are dropped.

In [0]:
df = df.drop(columns=["channel_description", "channel_size"])

In [0]:
df["channel_type"] = df["channel_type"].apply(
    lambda x: "relevant" if x == "group" or x == "one-on-one" else x
)

In [17]:
df.channel_type.value_counts()

none        31578
relevant    31578
Name: channel_type, dtype: int64

In [0]:
df["channel_name"] = df["channel_name"].apply(preprocess_text)

In [0]:
df = df.drop_duplicates()

In [0]:
df = df.sample(frac=1)
X = df.drop(columns=["channel_type"])
y = df.drop(columns=["channel_name"])

In [0]:
# y_train will not be used since OneClassSVM is unsupervised
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1
)

Even though our algorithm is unsupervised, the **test** set should still have labels since it is used for evaluation purposes, not training. The test set should also be a balanced dataset to ensure that our accuracy metrics are unbiased.
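
To see why balance matters, consider a trivial detector that labels every item as relevant. The sketch below is illustrative only: `always_inlier_accuracy` is a hypothetical helper, and the counts are the test-set class counts reported in this notebook before and after balancing.

```python
def always_inlier_accuracy(n_relevant, n_none):
    """Accuracy of a detector that predicts 1 (relevant) for every item."""
    return n_relevant / (n_relevant + n_none)


# Class counts of the test split before and after dropping surplus "relevant" rows
unbalanced = always_inlier_accuracy(n_relevant=2947, n_none=1248)
balanced = always_inlier_accuracy(n_relevant=1248, n_none=1248)

print(f"unbalanced: {unbalanced:.2f}")  # roughly 0.70
print(f"balanced:   {balanced:.2f}")    # 0.50
```

On the unbalanced split, a detector that never flags an outlier would already score about 70%, so accuracy alone would overstate its usefulness. On the balanced split the same detector scores 50%.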

In [22]:
y_test.channel_type.value_counts()

relevant    2947
none        1248
Name: channel_type, dtype: int64

In [0]:
diff = y_test.channel_type.value_counts()["relevant"] - y_test.channel_type.value_counts()["none"]

In [0]:
idxs_to_drop = y_test[y_test.channel_type == "relevant"][:diff].index
X_test = X_test.drop(idxs_to_drop)
y_test = y_test.drop(idxs_to_drop)

In [25]:
y_test.channel_type.value_counts()

none        1248
relevant    1248
Name: channel_type, dtype: int64

In [0]:
X_train = X_train.drop(y_train[y_train.channel_type == "none"].index)

In [0]:
y_test["channel_type"] = y_test["channel_type"].apply(
    lambda x: -1 if x == "none" else 1
)
y_test = np.array(y_test["channel_type"])

We vectorize the `channel_name` text so that the model can process it. The fitted vectorizer is saved and then loaded back, since the same vectorizer is needed both for training and for making predictions.
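
The reason the same fitted vectorizer must be reused can be shown with a small sketch. The example names below are made up for illustration and are not taken from the dataset.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

train_names = ["team meeting", "weekly sync", "project kickoff"]

# Fit on the training text only, so the vocabulary comes from the training set
vect = TfidfVectorizer(max_features=5000)
X_train_vec = vect.fit_transform(train_names)

# transform() reuses the fitted vocabulary; words never seen in training are ignored
X_known = vect.transform(["team kickoff"])
X_unseen = vect.transform(["djiboutian franc"])

print(X_train_vec.shape)           # (number of names, vocabulary size)
print(X_known.nnz, X_unseen.nnz)   # stored non-zero entries per input
```

Because `transform()` maps text onto the columns learned during `fit()`, a vectorizer fitted on different data would produce columns the trained model does not understand. Text made up entirely of unseen words becomes an all-zero row.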

In [0]:
svm_dataset = build_dataset(
    X_train["channel_name"], 
    SVM_MODEL_DIR, 
    "svm_vectorizer.pickle"
)

In [0]:
with open(f"{SVM_MODEL_DIR}/svm_vectorizer.pickle", "rb") as vec_file:
  svm_vectorizer = pickle.load(vec_file)

## Building the Model
To find the most accurate model, we run an automated grid search over the hyperparameters and keep the combination that produces the highest test set accuracy.

The `OneClassSVM` hyperparameters that are tuned are:
- `kernel`
- `nu`
- `gamma`

Details on the model and its parameters can be found in the [sklearn documentation](https://scikit-learn.org/stable/modules/generated/sklearn.svm.OneClassSVM.html#sklearn.svm.OneClassSVM). 
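
The size of the search space can be checked with a short sketch. The value lists are the ones used in `optimize_one_class_svm`; the deduplication step is a suggestion rather than something this notebook does, and relies on the fact that `gamma` is only a kernel coefficient for the `rbf`, `poly`, and `sigmoid` kernels.

```python
import itertools

kernels = ["linear", "poly", "rbf", "sigmoid"]
gammas = [0.01, 0.001, 0.0001, "auto", "scale"]
nus = [0.15, 0.25, 0.5, 0.75, 0.95]

# Full Cartesian product: every (kernel, gamma, nu) triple
full_grid = list(itertools.product(kernels, gammas, nus))

# The linear kernel has no gamma term, so collapse gamma to None for it
distinct_grid = {
    (kernel, None if kernel == "linear" else gamma, nu)
    for kernel, gamma, nu in full_grid
}

print(len(full_grid))      # 100
print(len(distinct_grid))  # 80
```

Skipping the redundant linear-kernel entries would save 20 of the 100 fits without changing which model is selected.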

In [36]:
opt_model, opt_acc = optimize_one_class_svm(
    svm_dataset, 
    X_test["channel_name"], 
    y_test, 
    svm_vectorizer
)


In [0]:
import pickle 

with open(f"{SVM_MODEL_DIR}/optimal_svm.pkl", "rb") as svm_file:
  loaded_model = pickle.load(svm_file)

In [38]:
loaded_model

{'acc': 0.8241185897435898,
 'model': OneClassSVM(cache_size=200, coef0=0.0, degree=3, gamma=0.001,
       kernel='sigmoid', max_iter=-1, nu=0.25, random_state=None,
       shrinking=True, tol=0.001, verbose=False)}

## Predictions
Here are some example inputs, with predictions from the most accurate model found by the hyperparameter tuner.

`-1` represents an outlier, while `1` represents text that can be classified as either a group or one-on-one event name.
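
In an application, these two values would typically gate the downstream type classifier. The sketch below shows one possible wiring, under assumptions: `route_item`, `fake_detector`, and `fake_classifier` are hypothetical names, and the stand-in callables only mimic the interface of real models.

```python
def route_item(text, outlier_detector, type_classifier):
    """Return "none" for outliers, otherwise defer to the type classifier."""
    if outlier_detector(text) == -1:
        return "none"
    return type_classifier(text)


# Stand-in callables so the sketch runs without trained models (assumptions)
def fake_detector(text):
    return 1 if "meeting" in text or " and " in text else -1


def fake_classifier(text):
    return "one-on-one" if " and " in text else "group"


for name in ["meeting with the team", "amy and john catch up", "i am human"]:
    print(name, "->", route_item(name, fake_detector, fake_classifier))
```

Keeping the outlier check in front of the classifier means the classifier itself never needs a garbage class.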



In [39]:
item1 = preprocess_text("meeting with the team")
item2 = preprocess_text("one on one with harry every week")
item3 = preprocess_text("is this an outlier?")
item4 = preprocess_text("project kickoff")
item5 = preprocess_text("Sales / marketing")
item6 = preprocess_text("amy & john catch-up")
item7 = preprocess_text("i am a human")

test_item = pd.Series([item1, item2, item3, item4, item5, item6, item7])
preds = get_predictions(test_item, loaded_model["model"], svm_vectorizer)

for i in range(len(preds)):
  print(f"Text Input: {test_item[i]}")
  print(f"Prediction: {preds[i]}")
  print("-----------------------------")

Text Input: meeting with the team
Prediction: 1
-----------------------------
Text Input: one on one with harry every week
Prediction: 1
-----------------------------
Text Input: is this an outlier 
Prediction: -1
-----------------------------
Text Input: project kickoff
Prediction: 1
-----------------------------
Text Input: sales marketing
Prediction: 1
-----------------------------
Text Input: amy and john catch up
Prediction: 1
-----------------------------
Text Input: i am human
Prediction: -1
-----------------------------
