<h1>Support Ticket Classification</h1>
Please download the data from the source below:

<a href="https://www.kaggle.com/code/aniketg11/support-tickets-classification/input">Dataset</a>


In [None]:
import pandas as pd # Library for data manipulation and analysis
import numpy as np # A fundamental package for scientific computing
import re # Support for regular expressions
import string # Support operations on strings
import matplotlib.pyplot as plt # A plotting library that provides a MATLAB-like plotting framework
import seaborn as sns # A Python data visualization library based on matplotlib
from sklearn.model_selection import train_test_split # A function from the scikit-learn library used to split data arrays into two subsets: for training data and for testing data
from sklearn.feature_extraction.text import CountVectorizer # Part of scikit-learn's text processing module. Converts a collection of text documents to a matrix of token counts
from sklearn.ensemble import RandomForestClassifier # A class that fits a number of decision tree classifiers on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control over-fitting.
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix, ConfusionMatrixDisplay, roc_auc_score, roc_curve # This module includes score functions, performance metrics, and pairwise metrics and distance computations
from nltk.corpus import stopwords # NLTK function to access lists of stop words for several languages.
from nltk.tokenize import word_tokenize # NLTK function to split strings into tokens.
from nltk.util import ngrams # NLTK function to generate n-grams from sequences of items.
import nltk # Natural Language Toolkit (NLTK) is a leading platform for building programs to work with human language data.

# Download NLTK data files
nltk.download('punkt')
nltk.download('punkt_tab')  # assumption: newer NLTK releases load this resource for word_tokenize
nltk.download('stopwords')

EXERCISE 1: Fill in the code to load and display the dataset.

In [None]:
# LOAD DATASET
# TODO: Read the downloaded dataset CSV file to pandas data frame using pandas read_csv function
df = 

In [None]:
# DATA EXPLORATION
# Let's print the first rows of the table. The 'body' column holds the content of the message and the 'ticket_type'
# column holds the label: 0 = incident, 1 = no_incident. The other columns are not used in this experiment.
# TODO: Display the dataset
df.head()
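
As a minimal sketch of what loading and inspecting the data looks like, the snippet below reads a CSV with `pd.read_csv` from an in-memory string. The two rows and the names `sample_csv` and `sample_df` are made up for illustration; the real file has more columns and many more rows.

```python
import io
import pandas as pd

# Hypothetical two-row sample standing in for the downloaded CSV file
sample_csv = io.StringIO(
    "body,ticket_type\n"
    "printer is not working after update,0\n"
    "please grant access to shared folder,1\n"
)

# read_csv accepts a file path or any file-like object
sample_df = pd.read_csv(sample_csv)

print(sample_df.shape)                          # (rows, columns)
print(sample_df["ticket_type"].value_counts())  # how many tickets per label
```

With the real dataset you would pass the path of the downloaded file instead of `sample_csv`.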

### PREPROCESSING - DATA CLEAN UP
In the preprocessing phase we prepare the data so that the model can consume it. In this experiment we apply the following common techniques:
1. Tokenization and lowercasing - split the text into words and convert them all to lowercase
2. Removing stopwords - remove common words that contribute little to the meaning
3. Removing punctuation and special characters - clean the tokens by stripping punctuation and special characters
4. Bag of words / character bigram vectorization - convert the cleaned tokens into character bigrams and count them
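
To make steps 1 to 3 concrete, here is a simplified sketch that uses only the standard library. It is not the notebook's solution: it splits on whitespace instead of calling NLTK's `word_tokenize`, and `STOP_WORDS` is a tiny hand-picked set standing in for NLTK's English stop word list. The name `simple_preprocess` is hypothetical.

```python
import re

# Assumption: a tiny hand-picked stop word set; the notebook uses NLTK's English list
STOP_WORDS = {"the", "is", "to", "a", "and", "my"}

def simple_preprocess(text):
    """Lowercase, split on whitespace, drop stop words, strip punctuation."""
    tokens = text.lower().split()                          # steps 1: lowercase + naive tokenization
    tokens = [t for t in tokens if t not in STOP_WORDS]    # step 2: remove stop words
    tokens = [re.sub(r"[^\w\s]", "", t) for t in tokens]   # step 3: strip punctuation
    return [t for t in tokens if t]                        # drop tokens that became empty

print(simple_preprocess("The printer is NOT working, please help!"))
# ['printer', 'not', 'working', 'please', 'help']
```

A real tokenizer separates punctuation into its own tokens, so results on messier text will differ from this whitespace-based version.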

EXERCISE 2: Fill in the code for preprocessing

In [None]:
# Define preprocessing function
def preprocess_text(text):
    # TODO: transform the text to lower case
    text_lower_case = 

    # TODO: tokenize the lowercased text using the tokenizer function imported from the nltk package
    tokens = 

    stop_words = set(stopwords.words('english'))
    # TODO: remove stopwords from tokens
    tokens = 
    
    # Remove punctuation and special characters
    tokens = [re.sub(r'[^\w\s]', '', word) for word in tokens]
    # Remove empty strings
    tokens = [word for word in tokens if word]

    return tokens

# Apply preprocessing
# TODO: apply the preprocessing function to the 'body' column and assign result to new processed_text column
df['processed_text'] = 

df['processed_text']

EXERCISE 3: Fill in the code to vectorize the text data.

In [None]:
def tokens2bigram(tokens):
    # Transform tokens into character bigrams (pairs of adjacent characters, including the spaces between tokens)
    bigrams = list(ngrams(' '.join(tokens), 2))
    return [''.join(bigram) for bigram in bigrams]

# TODO: apply the tokens2bigram function to the processed_text column and save the result to a new bigrams column
df['bigrams'] = 

# Convert the list of bigrams to a string for CountVectorizer
df['bigrams_str'] = df['bigrams'].apply(lambda x: ' '.join(x))

df['bigrams_str']
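
To see what a character bigram list looks like, the sketch below reproduces the same idea with plain string slicing instead of `nltk.util.ngrams`. The function name `char_bigrams` is hypothetical and exists only for this illustration.

```python
def char_bigrams(tokens):
    """Join tokens with spaces and return every pair of adjacent characters."""
    text = " ".join(tokens)
    return [text[i:i + 2] for i in range(len(text) - 1)]

print(char_bigrams(["hi", "all"]))
# ['hi', 'i ', ' a', 'al', 'll']
```

Note that some bigrams span the space between two tokens (`'i '` and `' a'`), so a text of `n` characters always yields `n - 1` bigrams.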

### VECTORIZATION

Convert the bigrams into vectors.

<a href="https://www.kaggle.com/code/samuelcortinhas/nlp3-bag-of-words-and-similarity">Useful article explaining the 'Bag of Words' method</a>

In [None]:
# Vectorize (create feature vectors from) the bigram strings using CountVectorizer from sklearn
vectorizer = CountVectorizer()
# TODO: use fit_transform method to produce feature matrix from data in the bigrams_str column
X = 

# TODO: display the shape of the resulting feature matrix


### DATASET SPLIT

Split the dataset into a training set and a test set. In this experiment we use an 80/20 ratio.

In [None]:
# Label
y = df['ticket_type']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
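
A small sketch of how the 80/20 split behaves on ten toy samples. The data and variable names are made up. The `stratify` argument is an optional addition that the notebook itself does not use; it keeps the class proportions the same in both subsets, which can help when the labels are imbalanced.

```python
from sklearn.model_selection import train_test_split

# Ten toy samples with perfectly balanced labels
X_demo = [[i] for i in range(10)]
y_demo = [0, 1] * 5

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo,
    test_size=0.2,      # 20% of the samples go to the test set
    random_state=42,    # fixed seed so the split is reproducible
    stratify=y_demo,    # optional: preserve the label ratio in both subsets
)

print(len(X_tr), len(X_te))  # 8 2
print(sorted(y_te))          # [0, 1]
```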

### TRAINING THE CLASSIFIER MODEL

In this experiment we will use a Random Forest classifier.

EXERCISE 4: Selecting Training Model

In [None]:
## Alternative models
# from sklearn.naive_bayes import MultinomialNB
# model = MultinomialNB()
# from sklearn.linear_model import LogisticRegression
# model = LogisticRegression(max_iter=200)

# Training the classifier
model = RandomForestClassifier(n_estimators=20, random_state=42)

# TODO: Fit the model on the training data
model.

In [None]:
# Predictions for the test set
y_pred = model.predict(X_test)
# Predicted probability of class 1 (no_incident), used later for the ROC curve
y_pred_proba = model.predict_proba(X_test)[:, 1]
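
The sketch below illustrates the shapes involved in `predict` and `predict_proba`, using a toy dataset of two well-separated clusters. All data and names (`X_toy`, `y_toy`, `clf`) are assumptions made for the illustration and are unrelated to the ticket dataset.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Toy data: class 0 near the origin, class 1 around (5, 5)
X_toy = np.array([[0, 0], [0, 1], [1, 0], [1, 1],
                  [5, 5], [5, 6], [6, 5], [6, 6]])
y_toy = np.array([0, 0, 0, 0, 1, 1, 1, 1])

clf = RandomForestClassifier(n_estimators=20, random_state=42)
clf.fit(X_toy, y_toy)

new_points = np.array([[0, 0], [6, 6]])
labels = clf.predict(new_points)        # one hard label per row
proba = clf.predict_proba(new_points)   # one column per class, in the order of clf.classes_

print(clf.classes_)   # [0 1]
print(proba.shape)    # (2, 2)
print(proba[:, 1])    # probability of class 1 for each row
```

Because the columns of `predict_proba` follow `clf.classes_`, the slice `[:, 1]` selects the probability of the label `1`.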

### EVALUATION

EXERCISE 5: Evaluate the model

Evaluation metrics

In [None]:
# Print evaluation metrics
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Classification Report:\n", classification_report(y_test, y_pred))
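
To check your reading of these metrics, here is a hand-sized example with eight made-up labels. The lists `y_true_demo` and `y_hat_demo` are hypothetical; `output_dict=True` makes `classification_report` return a dictionary instead of a formatted string.

```python
from sklearn.metrics import accuracy_score, classification_report

y_true_demo = [0, 0, 0, 1, 1, 1, 1, 0]
y_hat_demo  = [0, 0, 1, 1, 1, 0, 1, 0]

# 6 of the 8 predictions match the true labels
acc = accuracy_score(y_true_demo, y_hat_demo)
print(acc)  # 0.75

# For class 1: 3 true positives, 1 false positive, 1 false negative
report = classification_report(y_true_demo, y_hat_demo, output_dict=True)
print(report["1"]["precision"])  # 3 / (3 + 1)
print(report["1"]["recall"])     # 3 / (3 + 1)
```

Precision answers "of everything predicted as this class, how much was right", while recall answers "of everything that truly is this class, how much was found".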

Confusion Matrix

In [None]:
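# Confusion Matrix
cm = confusion_matrix(y_test, y_pred)
# Create the axes first and pass them to the display so that no extra empty figure is produced
fig, ax = plt.subplots(figsize=(10, 7))
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=['Incident', 'No Incident'])
disp.plot(ax=ax)
plt.show()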
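
The layout of the matrix is easy to misread, so here is a small sketch with six made-up labels. In scikit-learn, rows are the true classes and columns are the predicted classes, both in sorted label order. The names `y_true_cm`, `y_hat_cm`, and `cm_demo` are hypothetical.

```python
from sklearn.metrics import confusion_matrix

y_true_cm = [0, 0, 0, 0, 1, 1]
y_hat_cm  = [0, 0, 0, 1, 1, 0]

cm_demo = confusion_matrix(y_true_cm, y_hat_cm)
print(cm_demo)
# [[3 1]
#  [1 1]]

# For binary labels, ravel() returns the cells in the order tn, fp, fn, tp,
# where "positive" means label 1 (no_incident in this notebook's encoding)
tn, fp, fn, tp = cm_demo.ravel()
print(tn, fp, fn, tp)  # 3 1 1 1
```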

AUROC evaluation 

In [None]:
# AUROC evaluation
fpr, tpr, _ = roc_curve(y_test, y_pred_proba)
roc_auc = roc_auc_score(y_test, y_pred_proba)

# Plotting the ROC curve - Receiver Operating Characteristic curve
plt.figure(figsize=(10, 7))
plt.plot(fpr, tpr, color='blue', lw=2, label='ROC curve (area = {:.2f})'.format(roc_auc))
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC) Curve')
plt.legend(loc="lower right")
plt.show()
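
As a sanity check on what the area under the ROC curve means, the sketch below compares `roc_auc_score` with a direct pairwise count on four made-up scores. The AUROC equals the fraction of (positive, negative) pairs in which the positive sample receives the higher score, with ties counted as half. All names and values here are illustrative.

```python
from sklearn.metrics import roc_auc_score

y_true_demo = [0, 0, 1, 1]
scores_demo = [0.1, 0.4, 0.35, 0.8]

auc_sklearn = roc_auc_score(y_true_demo, scores_demo)

# Pairwise interpretation: compare every positive score with every negative score
pos = [s for s, y in zip(scores_demo, y_true_demo) if y == 1]
neg = [s for s, y in zip(scores_demo, y_true_demo) if y == 0]
wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
auc_pairs = wins / (len(pos) * len(neg))

print(auc_sklearn, auc_pairs)  # 0.75 0.75
```

A value of 0.5 corresponds to the dashed diagonal in the plot (random ranking), and 1.0 means every positive sample is ranked above every negative one.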