### Ingest Data
Describe the data ingestion strategy.

- What subset of the available corpus is used?
- What are the criteria defined for measuring performance?
- Is filtering on date, version, tags employed?
- Size of the data being used in this demo

The tickets used to train the model come from "sample_data.json".
No filtering was employed. Only the ticket text (title and body) was used.
Number of tickets used to train the model: 31794.

In [37]:
import pandas as pd
import json
import re

import nltk
from nltk.corpus import stopwords

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

In [27]:
"""Get data for training"""

with open("data/sample_data1.json") as f:
  data = json.load(f)

tickets = data['tickets']

with open('data/ticket_data.json', 'w') as json_file:
  json.dump(tickets, json_file)


"""Get first 500 tickets from all of the data for the keyword extraction"""
sample_tickets = tickets[0:500]

with open('data/sample_ticket_data.json', 'w') as json_file:
  json.dump(sample_tickets, json_file)


### Preprocessing and Feature Engineering steps

Include the series of data cleaning steps in your workflow

- stopword removal strategy
- tokenization/stemming strategy
- normalization

#### Data for training

In [28]:
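def pre_process(text):
    """Lowercase the text, strip HTML tags, and drop digits and punctuation."""

    # lowercase
    text = text.lower()

    # replace HTML tags with a placeholder that the next step removes
    text = re.sub(r"</?.*?>", " <> ", text)

    # collapse runs of digits and non-word characters into a single space
    text = re.sub(r"(\d|\W)+", " ", text)

    return text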
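
As a quick check of what the cleaning function produces, the sketch below repeats `pre_process` so that it runs on its own and applies it to a made-up ticket string. The sample text is an assumption for illustration and is not taken from the data set.

```python
import re

def pre_process(text):
    # same three steps as the cell above: lowercase, strip tags, drop digits/punctuation
    text = text.lower()
    text = re.sub(r"</?.*?>", " <> ", text)
    text = re.sub(r"(\d|\W)+", " ", text)
    return text

# hypothetical ticket text, not from the data set
sample = "<p>Firefox 15.0.1 crashes!</p>"
cleaned = pre_process(sample)
print(repr(cleaned))  # ' firefox crashes '
```

Digits are removed together with punctuation, so a version number such as 15.0.1 does not survive as a token. The leading and trailing spaces are harmless because the vectorizer tokenizes the text again.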

In [29]:
# read json into a dataframe
df_idf = pd.read_json("data/ticket_data.json")

# print schema
print("Schema:\n\n",df_idf.dtypes)
print("Number of questions,columns=",df_idf.shape)

# join title and body with a space so the last title word cannot fuse with the first body word
df_idf['text'] = df_idf['title'] + ' ' + df_idf['content']
df_idf['text'] = df_idf['text'].apply(pre_process)

#show the first 'text'
#df_idf['text'][0]

Schema:

 ticket_id             int64
title                object
content              object
timestamp    datetime64[ns]
tags                 object
dtype: object
Number of questions,columns= (31794, 5)


In [30]:
#get the text column 
docs=df_idf['text'].tolist()

#create a vocabulary of words,
#ignore words that appear in more than 90% of documents,
#eliminate stop words
cv=CountVectorizer(max_df=0.90,stop_words=stopwords.words('english'), min_df=1)
word_count_vector=cv.fit_transform(docs)
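
To illustrate what the `max_df` cut-off and the stop-word list do to the vocabulary, here is a minimal pure-Python sketch on a three-ticket toy corpus. The helper `limit_vocabulary`, the toy tickets, and the two-word stop list are hypothetical. The sketch splits on whitespace and only approximates the document-frequency filter; it is not how `CountVectorizer` is implemented.

```python
from collections import Counter

def limit_vocabulary(docs, max_df=0.90, stop_words=()):
    """Keep terms whose document frequency is at most max_df * len(docs)."""
    doc_freq = Counter()
    for doc in docs:
        # count each term once per document
        doc_freq.update(set(doc.split()))
    max_count = max_df * len(docs)
    return sorted(
        term for term, count in doc_freq.items()
        if count <= max_count and term not in stop_words
    )

# hypothetical, already pre-processed tickets
toy_docs = [
    "firefox crashes on startup",
    "firefox keeps going back",
    "firefox update broke the toolbar",
]
vocab = limit_vocabulary(toy_docs, max_df=0.90, stop_words={"on", "the"})
print(vocab)
```

In this toy corpus "firefox" occurs in all three tickets, which is above the 90% threshold, so it is dropped from the vocabulary along with the stop words.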

#### Sample Data

In [31]:
# read test docs into a dataframe and concatenate title and body
df_test=pd.read_json("data/sample_ticket_data.json")
# join title and body with a space so the last title word cannot fuse with the first body word
df_test['text'] = df_test['title'] + ' ' + df_test['content']
df_test['text'] = df_test['text'].apply(pre_process)

# get test docs into a list
docs_test=df_test['text'].tolist()
docs_title=df_test['title'].tolist()
docs_body=df_test['content'].tolist()

### Tag lexicon definition description

Describe the set of tags defined for the task

#### Compute Inverse Document Frequency (IDF)

In [32]:
tfidf_transformer=TfidfTransformer(smooth_idf=True,use_idf=True)
tfidf_transformer.fit(word_count_vector)

tfidf_transformer.idf_

array([ 9.57530485,  9.57530485, 10.26845204, ..., 10.67391714,
       10.67391714, 10.67391714])
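
For reference, with `smooth_idf=True` scikit-learn documents the IDF weight of a term $t$ as shown below, where $n$ is the number of documents and $\mathrm{df}(t)$ is the number of documents that contain $t$. Assuming $n = 31794$ (the number of tickets loaded above), the largest value in the array corresponds to a term that appears in a single ticket:

$$
\mathrm{idf}(t) = \ln\frac{1 + n}{1 + \mathrm{df}(t)} + 1,
\qquad
\mathrm{idf}\big|_{\mathrm{df}=1} = \ln\frac{31795}{2} + 1 \approx 10.674
$$

By the same formula, 10.268 corresponds to a term found in two tickets and 9.575 to a term found in five.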

In [33]:
def sort_coo(coo_matrix):
    tuples = zip(coo_matrix.col, coo_matrix.data)
    return sorted(tuples, key=lambda x: (x[1], x[0]), reverse=True)

def extract_topn_from_vector(feature_names, sorted_items, topn=10):
    """Return a dict mapping the top-n feature names to their tf-idf scores."""

    #use only topn items from vector
    sorted_items = sorted_items[:topn]

    #map each feature name to its rounded score, highest score first
    results = {}
    for idx, score in sorted_items:
        results[feature_names[idx]] = round(score, 3)

    return results
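
The two helpers can be tried without fitting a model. The sketch below is self-contained: it repeats `sort_coo`, uses a condensed equivalent of `extract_topn_from_vector`, and feeds them a stand-in object that exposes only the `col` and `data` attributes that the code reads from a SciPy COO matrix. The feature names and scores are hand-picked for illustration.

```python
from types import SimpleNamespace

def sort_coo(coo_matrix):
    # pair each column index with its score, highest score first
    tuples = zip(coo_matrix.col, coo_matrix.data)
    return sorted(tuples, key=lambda x: (x[1], x[0]), reverse=True)

def extract_topn_from_vector(feature_names, sorted_items, topn=10):
    # condensed equivalent of the helper above
    return {feature_names[idx]: round(score, 3) for idx, score in sorted_items[:topn]}

# hypothetical vocabulary and a stand-in for one row of a tf-idf matrix
feature_names = ["crash", "firefox", "owa", "report"]
row = SimpleNamespace(col=[0, 2, 3], data=[0.258, 0.539, 0.299])

top2 = extract_topn_from_vector(feature_names, sort_coo(row), topn=2)
print(top2)  # {'owa': 0.539, 'report': 0.299}
```

Because Python dicts preserve insertion order, iterating over the result yields the keywords from highest to lowest score.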

In [21]:
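# you only need to do this once
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() is available from 1.0
feature_names=cv.get_feature_names_out()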

# get the document that we want to extract keywords from
doc=docs_test[0]

#generate tf-idf for the given document
tf_idf_vector=tfidf_transformer.transform(cv.transform([doc]))

#sort the tf-idf vectors by descending order of scores
sorted_items=sort_coo(tf_idf_vector.tocoo())

#extract only the top n; n here is 10
keywords=extract_topn_from_vector(feature_names,sorted_items,10)

# now print the results
print("\n=====Title=====")
print(docs_title[0])
print("\n=====Body=====")
print(docs_body[0])
print("\n===Keywords===")
for k in keywords:
    print(k,keywords[k])


=====Title=====
firefox keeps going back to previous page randomly , but i have not changed anything

=====Body=====
<p>i have the same OS and all else is same, but recently firefox keeps going back to the previous page randomly , and i have to click that forward arrow to get it back. a real pain as it reloads and i have to start over from what i was reading/copy-pasting etc.
Perhaps it's a bug with a recent firefox update?
</p>

===Keywords===
randomly 0.367
back 0.287
keeps 0.269
previous 0.268
going 0.259
reloads 0.237
pasting 0.233
pain 0.209
perhaps 0.195
reading 0.185


In [34]:
# put the common code into several methods
def get_keywords(idx):

    #generate tf-idf for the given document
    tf_idf_vector=tfidf_transformer.transform(cv.transform([docs_test[idx]]))

    #sort the tf-idf vectors by descending order of scores
    sorted_items=sort_coo(tf_idf_vector.tocoo())

    #extract only the top n; n here is 10
    keywords=extract_topn_from_vector(feature_names,sorted_items,10)
    
    return keywords

def print_results(idx,keywords):
    # now print the results
    print("\n=====Title=====")
    print(docs_title[idx])
    print("\n=====Body=====")
    print(docs_body[idx])
    print("\n===Keywords===")
    for k in keywords:
        print(k,keywords[k])

In [36]:
idx=220
keywords=get_keywords(idx)
print_results(idx,keywords)


=====Title=====
Firefox crashes

=====Body=====
<p>I am having crashes in FF 15.0.1 when I use Outlook Web Access. It usually happens when I change folders or have an item open and go back to the main OWA screen. This is happening many times a day and I always submit a crash report. I have put the latest crash report ID below. Please help as I rely on OWA for my work email.
</p>

===Keywords===
owa 0.539
report 0.299
crash 0.258
rely 0.23
crashes 0.229
submit 0.186
outlook 0.182
item 0.171
main 0.157
folders 0.157


### Classifier/Annotator Training step

Describe the model and the model training step

- Include a description of the feature space used
- Include a description of the selected classification or annotation model
- Describe the training process and expected runtime for training

In [8]:
"""code that executes model training step"""

'code that executes model training step'

### Classifier/Annotator Testing step

Describe the testing of the trained model's performance against a defined test set.

- Include the raw performance
- Include the source of ground truth for the evaluation
- Include figures for FP/FN/ROC type metrics describing the model performance.

In [9]:
"""code that executes the model testing step"""

'code that executes the model testing step'

### Interpretation
Summarize the model performance and the findings related to specific misclassified items, and briefly describe how those findings bear on generalizability.