## Overview of TF-IDF

TF-IDF (short for term frequency–inverse document frequency) is a numerical statistic that is intended to reflect how important a word is to a document in a corpus.

The tf–idf value increases proportionally to the number of times a word appears in the document and is offset by the number of documents in the corpus that contain the word.

So, words that are common in every document, such as "this", "what", and "if", rank low even though they may appear many times, since they don’t mean much to that document in particular. However, if the word "Bug" appears many times in a document, while not appearing many times in others, it probably means that it’s very relevant.

## Ingest Data

* The tickets that were used to train model (created during the Code Sprint): "sample_data.json".
* The number of tickets used to train the model: 31794 tickets.
* No filtering was employed. Only the ticket text (tittle and body) were used.

**Data used for training:** all tickets (title and body text data) from "sample_data.json"
**Data used for keyword extraction:** first 500 tickets (only title and body text data) from "sample_data.json"

In [4]:
import pandas as pd
import json
import csv
import re

import nltk
from nltk.corpus import stopwords

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

In [5]:
"""Get data for training"""

csv_file_path = 'data/tickets_clean.csv'
json_file_path = 'data/ticket_data.json'

data = []
with open(csv_file_path, encoding="utf8") as csv_file:
    csv_reader = csv.DictReader(csv_file)
    
    for rows in csv_reader:
        ticket = {}
        
        ticket["id"] = rows['id']
        ticket["title"] = rows['title']
        ticket["content"] = rows['content']
        
        data.append(ticket)
        
with open(json_file_path, 'w') as json_file:
    json_file.write(json.dumps(data))

## Preprocessing and Feature Engineering steps

Inlcude the series of data cleaning steps that are included in your workflow

- stopword removal strategy
- tokenization/stemming strategy
- normalization

Steps:
1. read json into a dataframe
2. make all text lowercase
3. Remove all tags
4. Remove special characters and digits
5. Ignore words that appear in 85% of documents
6. Eliminate stopwords


### Data for training

In [6]:
# read json into a dataframe
df_idf = pd.read_json("data/ticket_data.json")

df_idf['text'] = df_idf['title'] + df_idf['content']


"""
# print schema
print("Schema:\n\n",df_idf.dtypes)
print("Number of questions,columns=",df_idf.shape)

#show the first 'text'
df_idf['text'][0]
"""

'\n# print schema\nprint("Schema:\n\n",df_idf.dtypes)\nprint("Number of questions,columns=",df_idf.shape)\n\n#show the first \'text\'\ndf_idf[\'text\'][0]\n'

### Sample Data

In [7]:
# read test docs into a dataframe and concatenate title and body
df_test=pd.read_json("data/ticket_data.json")
df_test['text'] = df_test['title'] + df_test['content']

# get test docs into a list
docs_test=df_test['text'].tolist()
docs_title=df_test['title'].tolist()
docs_body=df_test['content'].tolist()

## Tag lexicon definition description

The next step is to compute the tf-idf value for a given document in our test set by invoking tfidf_transformer.transform(...). This generates a vector of tf-idf scores. Next, we sort the words in the vector in descending order of tf-idf values and then iterate over to extract the top-n items with the corresponding feature names, In the example below, we are extracting keywords for the first document in our test set.

The sort_coo(...) method essentially sorts the values in the vector while preserving the column index. Once you have the column index then its really easy to look-up the corresponding word value as you would see in extract_topn_from_vector(...) where we do feature_vals.append(feature_names[idx]).

In [8]:
#get the text column 
docs=df_idf['text'].tolist()

#create a vocabulary of words, 
#ignore words that appear in 85% of documents
#eliminate stop words
cv=CountVectorizer(max_df=0.50,stop_words=stopwords.words('english'), min_df=1)
word_count_vector=cv.fit_transform(docs)

We can see that we have 31794 tickets with 45336 unique words in our dataset minus stopwords:

In [9]:
word_count_vector.shape

(319303, 473041)

#### Compute Inverse Document Frequency (IDF)

In [10]:
tfidf_transformer=TfidfTransformer(smooth_idf=True,use_idf=True)
tfidf_transformer.fit(word_count_vector)

tfidf_transformer.idf_

array([ 6.89853282,  8.74664522,  9.07877906, ..., 12.98075173,
       12.98075173, 12.98075173])

In [11]:
def sort_coo(coo_matrix):
    tuples = zip(coo_matrix.col, coo_matrix.data)
    return sorted(tuples, key=lambda x: (x[1], x[0]), reverse=True)

def extract_topn_from_vector(feature_names, sorted_items, topn=10):
    """get the feature names and tf-idf score of top n items"""
    
    #use only topn items from vector
    sorted_items = sorted_items[:topn]

    score_vals = []
    feature_vals = []

    for idx, score in sorted_items:
        fname = feature_names[idx]
        
        #keep track of feature name and its corresponding score
        score_vals.append(round(score, 3))
        feature_vals.append(feature_names[idx])

    #create a tuples of feature,score
    #results = zip(feature_vals,score_vals)
    results= {}
    for idx in range(len(feature_vals)):
        results[feature_vals[idx]]=score_vals[idx]
    
    return results

In [12]:
# you only needs to do this once
feature_names=cv.get_feature_names()

In [13]:
# put the common code into several methods
def get_keywords(idx):

    #generate tf-idf for the given document
    tf_idf_vector=tfidf_transformer.transform(cv.transform([docs_test[idx]]))

    #sort the tf-idf vectors by descending order of scores
    sorted_items=sort_coo(tf_idf_vector.tocoo())

    #extract only the top n; n here is 10
    keywords=extract_topn_from_vector(feature_names,sorted_items,10)
    
    return keywords

def print_results(idx,keywords):
    # now print the results
    print("\n=====Title=====")
    print(docs_title[idx])
    print("\n=====Body=====")
    print(docs_body[idx])
    print("\n===Keywords===")
    for k in keywords:
        print(k,keywords[k])

In [26]:
idx=999

keywords=get_keywords(idx)
print_results(idx,keywords)


=====Title=====
tell secure site lock feature anymore

=====Body=====
credit union firefox page tell website secure enter website

===Keywords===
anymorecredit 0.542
secure 0.451
union 0.385
tell 0.343
website 0.279
lock 0.233
enter 0.198
feature 0.194
site 0.129
page 0.11


In [15]:
# go through all tickets one by one 
# 31794
tags_dict = {}
num_tags = 0

for idx in range(30000):
    keywords=get_keywords(idx)
    
    for k in keywords:
        if keywords[k] >= 0.2:            
            #check if this tag is already in the dictionary
            if k in tags_dict:
                tags_dict[k] += 1
            else:
                tags_dict[k] = 1
                num_tags += 1

print(len(tags_dict))

with open('tf-idf_ALL_tags.txt', 'w') as file:
    file.write(json.dumps(tags_dict))
    
usefull_tag_dict = {}

for key, value in tags_dict.items():
    if value > 10:
        usefull_tag_dict[key] = value

print(len(usefull_tag_dict))
        
with open('tf-idf_tags.txt', 'w') as file:
    file.write(json.dumps(usefull_tag_dict))


# create tags for them
# add their tags that are >= 0.2 to the distionary
# if there is already a key with tha same tag increase value by 1



45602
2190


In [16]:
import operator

sorted_list = sorted(usefull_tag_dict.items(), key=operator.itemgetter(1), reverse = True)

print("{:<20} {:<7}".format('tag', 'count'))

for k, v in sorted_list:
    print("{:<20} {:<7}".format(k, v))

tag                  count  
tab                  1673   
open                 1516   
download             1339   
page                 1284   
bookmark             1219   
version              1086   
new                  920    
update               903    
search               851    
toolbar              842    
bar                  777    
button               746    
google               709    
file                 692    
save                 682    
yahoo                680    
email                669    
home                 653    
window               650    
load                 648    
close                637    
site                 630    
password             626    
mail                 607    
ff4                  592    
link                 586    
icon                 567    
upgrade              564    
work                 548    
print                546    
ff                   545    
click                520    
screen               475    
old           

firefoxopen          18     
firefoxinstall       18     
magnify              18     
untitled             18     
thx                  18     
friendly             18     
cable                18     
course               18     
hope                 18     
clue                 18     
sender               18     
obvious              18     
technical            18     
runtime              18     
wall                 17     
extend               17     
fireftp              17     
lately               17     
pageopen             17     
combination          17     
inactive             17     
age                  17     
2003                 17     
frustrating          17     
cool                 17     
tray                 17     
tiscali              17     
core                 17     
designate            17     
os10                 17     
inconvenient         17     
critical             17     
troubleshooting      17     
suitable             17     
approx        

In [17]:
import matplotlib.pyplot as plt

dict_graph = usefull_tag_dict
#dict_graph = sorted(usefull_tag_dict.items(), key=lambda x: x[1], reverse=True)

plt.bar(range(len(dict_graph)), list(dict_graph.values()), align='center')
plt.xticks(range(len(dict_graph)), list(dict_graph.keys()))
plt.savefig('tags.png')

### Classifier/Annontator Training step

Describe the model and the model training step

- Include a description the feature space used
- Include a description of the selected classification or annotation model
- Describe the training process and expected runtime for training

In [18]:
"""code that executes model training step"""

'code that executes model training step'

### Classifier/Annotator Testing step

Describe the testing of the trained model's performace against a defined test set.

- Include the raw performance
- Include the source of ground truth for the evaluation
- Include figures for FP/FN/ROC type metrics describing the model performance.

In [19]:
"""code that executes the model testing step"""

'code that executes the model testing step'

### Interpretation
Sumamrize the model performance and findings related to specific misclassified items also a breif description of the findings as they correpsonde to generalizability.