*** This script constructs a Naive Bayes classifier for each permit subtype. It uses the permit subtype taxonomy derived from the prior unsupervised learning step to label each training sample. ***

Briefly, this involves the following steps:

1) Read the taxonomy of permit subtypes and their associated 'synonym' words. 

2) Determine the subtypes of each building permit based on the taxonomy and its keywords (i.e. map each permit to the class labels used in the Naive Bayes step). This produces the y_train class labels. Note that this mapping is not one-to-one: one permit may belong to several subtypes.

3) Vectorize the building permits and construct a tf/idf matrix from the building permit training set. This produces the x_train features.

4) Train a Naive Bayes classifier on the training set for each permit subtype (we assume a multinomial distribution for the words). We use x_train as the features and y_train as the training labels. This means we will have as many classifiers as there are subtypes.

5) Save each of the permit subtype classifiers.
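
For reference, the per-subtype model trained in step 4 can be written out as below. This is the standard multinomial Naive Bayes formulation with additive smoothing, not something stated in the script itself; the symbols (k for the subtype, V for the vocabulary size, alpha for the smoothing constant) are notation introduced here. scikit-learn's MultinomialNB uses alpha = 1 by default.

```latex
% Binary decision for subtype k, given the tf/idf row x = (x_1, ..., x_V)
\hat{y}_k = \arg\max_{c \in \{0, 1\}}
    \left[ \log P(y_k = c) + \sum_{j=1}^{V} x_j \log \theta^{(k)}_{cj} \right],
\qquad
\theta^{(k)}_{cj} = \frac{N^{(k)}_{cj} + \alpha}{N^{(k)}_{c} + \alpha V}

% N^{(k)}_{cj}: sum of feature j over training permits with y_k = c
% N^{(k)}_{c} = \sum_{j=1}^{V} N^{(k)}_{cj}
```

Because each subtype gets its own binary decision, a permit can be assigned to several subtypes at once, or to none.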

General declarations, including libraries, support and plotting packages.
Python 2.7 must be installed (the code relies on Python 2 constructs such as unicode() and a list-returning map()), along with the following packages:
os (built-in), numpy, pandas, nltk (Natural Language Toolkit), codecs, sklearn (statistical learning package), pystruct, mpld3, MySQLdb, pylab, csv, string, matplotlib (plotting), wordcloud (only if word cloud plotting is desired), seaborn (plotting), pickle (to save clusters and other data objects).

In [1]:
import numpy as np
import pandas as pd
import nltk
import re
import os
import codecs
from sklearn import feature_extraction
import mpld3
import MySQLdb
import pandas.io.sql as sql
import matplotlib.pyplot as plt
import pylab as py
import csv
import string
from nltk.stem.snowball import SnowballStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.cluster import KMeans
from sklearn.externals import joblib
from sklearn import metrics
from sklearn.cross_validation import train_test_split
from pystruct.learners import NSlackSSVM
from pystruct.models import MultiLabelClf
from collections import defaultdict

import seaborn as sns
import pickle

sns.set(style="white", color_codes=True)

import warnings
warnings.filterwarnings("ignore")

%matplotlib inline

stopwords=set(unicode("john american james hbcode thomas david michael robert subtype richard permit building none america rochester william needs brother jessica give like send estimate im youre chat details regarding hi available email call please interested contact looking project need job phone work so some somebody somehow someone something sometime sometimes somewhat somewhere soon sorry specified specify specifying still sub such sup sure t's take taken tell tends th than thank thanks thanx that that's thats the their theirs them themselves then thence there there's thereafter thereby therefore therein theres thereupon these they they'd they'll they're they've think third this thorough thoroughly those though three through throughout thru thus to together too took toward towards tried tries truly try trying twice two un under unfortunately unless unlikely until unto up upon us use used useful uses using usually value various very via viz vs want wants was wasn't way we we'd we'll we're we've welcome well went were weren't what what's whatever when whence whenever where where's whereafter whereas whereby wherein whereupon wherever whether which while whither who who's whoever whole whom whose why will willing wish with within without won't wonder would wouldn't yes yet you you'd you'll you're you've your yours yourself yourselves zero a about above after again against all am an and any are aren't as at be because been before being below between both but by can can't cannot could couldn't did didn't do does doesn't doing don't down during each few for from further had hadn't has hasn't have haven't having he he'd he'll he's her here here's hers herself him himself his how how's i i'd i'll i'm i've if in into is isn't it it's its itself let's me more most mustn't my myself no nor not of off on one once only or other ought our ours ourselves out over own same shan't she she'd she'll she's should shouldn't so some such than that that's the their theirs them 
themselves then there there's these they they'd they'll they're they've this those through to too under until up very was wasn't we we'd we'll we're we've were weren't what what's when when's where where's which while who who's whom why why's with won't would wouldn't you you'd you'll you're you've your yours yourself yourselves").split())


General functions for reading data, doing cluster to keyword mapping and text pre-processing.

In [2]:
# Function to read words associated to permit subtypes
# Input: file_name
# Output: list of identifier for each trade/permit type (label), list of 'synonym' words for each permit subtype. 
def read_words(input_file):
    label=[]
    words=[]
    with open(input_file, 'rU') as f:
        reader = csv.reader(f,delimiter='\n')
        # For each row in the file, read the subtype label and its associated keywords.
        for row in reader:
            row=row[0].split(',')
            label.append(row.pop(0))
            temp=[]
            for x in row:
                x=x.lower()
                x=x.replace('"','')
                x=x.replace(' ','')
                if x != '':
                    temp.append(x)
            words.append(temp)
    return label,words

# Function to read BuildZoom building permits from a saved file.
# Input: name of the file that holds the data
# Output: list of job values, list of text fields for each building permit
def read_permit(input_file):
    # List that will hold the job_values (val) and the text from the data (a)
    a=[]
    val=[]
    with open(input_file, 'rU') as f:
        reader = csv.reader(f,delimiter='\t')
        # For each row in the file, read the job_values and the associated strings. 
        for row in reader:
            # Ignore rows that do not have the id(from pandas) + 6 queried features (can be NULL too)
            if len(row) == 7:
                # Remove the pandas id (just trash from the pandas dataframe)
                row.pop(0)
                try:
                    # Save the job_value for each building permit if it is a valid number
                    val.append(float(row.pop(0)))
                except ValueError:
                    # Otherwise ignore the value and store a zero instead
                    val.append(0)
                # Save building permit text
                a.append(row)
    return val,a

# Function to join text for a given list. 
# Inputs: list with strings
def join_txt(x):
    content = ' '.join(filter(None,x))
    return content

# Text pre-processing function that also applies the Snowball stemmer to each word.
# This function's only difference compared to content2tokens is the use of the stemmer.
# Removes punctuation, digits and unnecessary whitespace.
# Any stop-word is also removed.
# Additionally, this function removes any word shorter than 4 characters.
# Input: string with the text of one building permit
# Output: list of stemmed tokens
def content2stems(s):
    # Initialize Snowball stemmer
    stemmer = SnowballStemmer("english")
    # Transform all letters to lower-case
    content = s.lower()
    # Remove periods
    content = content.replace('.', ' ')    
    # Collapse runs of whitespace into single spaces
    content = ' '.join(content.split())
    # Replace punctuation and digits with spaces
    for c in content:
        if c in string.punctuation or c.isdigit():
            content=content.replace(c," ")
    
    # Split words into a list
    content = content.split() 
    # Filter words from the stoplist
    content = filter(lambda w: w not in stopwords, content)  
    # Filter out words shorter than 4 characters
    content = filter(lambda w: len(w) > 3, content)
    # Apply Snowball stemmer to words
    content = [stemmer.stem(t) for t in content]    
    return content

# Function to assign permit subtypes to individual samples using the cluster assignments
# Input: permit raw text (permits), list of keywords associated to each subtype and label for each subtype
# Output: DataFrame where rows are samples and columns are subtype assignments (1: positive, 0: negative)
def sample_mapping(permits,keyword_map,labels):
    sample_map=[]
    for s in permits:
        # Stem the building permit
        permit_stem= content2stems(s)
        
        # Determine the subtypes of the stemmed permit
        intersect_list = [filter(lambda x: x in permit_stem, sublist) for sublist in keyword_map]
        sample_map.append([1 if len(x) > 0 else 0 for x in intersect_list])
        
    return pd.DataFrame(sample_map,columns=labels)

# Function to stem the keywords associated to building permit subtypes
# Input: list of building subtypes in which each list has a list of keywords associated to that subtype
# Output: list of building subtypes in which each list has the de-duplicated stemmed keywords of that subtype
def stem_keywords(keyword_list):
    stemmer = SnowballStemmer("english")
    keyword_stems=[]
    # For every list of keywords associated to a building permit, stem all keywords
    for l in keyword_list:
        l=map(lambda x: stemmer.stem(x),l)
        # Eliminate potential duplicates
        l=list(set(l))
        keyword_stems.append(l)
    
    return keyword_stems
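
As an illustration of the pre-processing performed by content2stems, here is a minimal sketch of the same filtering steps without the stemmer. The function name tokenize_permit, the tiny stop list and the sample permit text are hypothetical. Unlike the original, this sketch drops every non a-z character with a regular expression, so it is a simplification rather than an exact equivalent.

```python
import re

# Hypothetical mini stop list; the script uses a much larger one.
STOPWORDS = {"permit", "building", "with", "from", "this"}


def tokenize_permit(text, stopwords=STOPWORDS, min_len=4):
    """Lower-case, strip punctuation/digits, drop stop words and short words."""
    text = text.lower()
    # Replace every run of non-letter characters with a single space
    text = re.sub(r"[^a-z]+", " ", text)
    # Keep words that are not stop words and have at least min_len characters
    return [w for w in text.split() if w not in stopwords and len(w) >= min_len]


tokens = tokenize_permit("Permit #1234: Re-roof garage; install 2 new HVAC units.")
print(tokens)  # ['roof', 'garage', 'install', 'hvac', 'units']
```

Note that short but meaningful words (for example "ac") are discarded by the length filter, which is worth keeping in mind when choosing taxonomy keywords.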

Read the keywords associated to each building permit subtype and then stem them.

In [3]:
# Read building permit subtypes keywords and stem them.
keyword_label,keywords=read_words('taxonomy_cluster_keywords.csv')
keywords_stems=stem_keywords(keywords)
# Turn all label subtypes to lower_case
keyword_label=map(lambda x: x.lower(),keyword_label)
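
Judging from read_words, the taxonomy file is expected to hold one subtype per line: the label first, followed by its comma-separated keywords. The sketch below parses such content from an in-memory string. The sample labels and keywords are made up, and parse_taxonomy is a hypothetical helper that relies on csv.reader's own comma and quote handling rather than the manual splitting used in the script.

```python
import csv
import io

# Hypothetical file content: one subtype per line, label first, keywords after.
sample = u'Roofing,"roof",shingle, reroof\nElectrical,wiring,"panel",\n'


def parse_taxonomy(handle):
    """Return (labels, keywords) with lower-cased, space-free entries."""
    labels, words = [], []
    for row in csv.reader(handle):
        if not row:
            continue
        labels.append(row[0].strip().lower())
        # Lower-case, remove spaces, and skip empty cells (e.g. trailing commas)
        words.append([w.replace(" ", "").lower() for w in row[1:] if w.strip()])
    return labels, words


labels, words = parse_taxonomy(io.StringIO(sample))
print(labels)  # ['roofing', 'electrical']
print(words)   # [['roof', 'shingle', 'reroof'], ['wiring', 'panel']]
```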

Read all building permits to be used for training, stem them and relate them to the building permit subtypes.

In [4]:
# Read input files and format original data
job_values, build_permits=read_permit('./Data/BPermit_Sample_Dke.tsv')
build_permits=map(join_txt,build_permits)
build_permits=map(lambda x : unicode(x, errors='ignore'),build_permits)
# Stem the raw permit text and map each permit to its subtypes (one 0/1 column per subtype)
map_train=sample_mapping(build_permits,keywords_stems,keyword_label)
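
The keyword-to-label mapping can be sketched on toy data as follows. The labels, stemmed keywords and stemmed permits are all hypothetical; the point is only to show the shape of the resulting 0/1 matrix and that a permit may receive several labels or none.

```python
# Hypothetical stemmed keywords per subtype (labels and words are illustrative).
labels = ["roofing", "electrical", "plumbing"]
keyword_sets = [{"roof", "shingl"}, {"wire", "panel"}, {"pipe", "drain"}]

# Hypothetical permits, already tokenized and stemmed.
permits = [
    ["replac", "roof", "shingl"],
    ["roof", "solar", "panel", "wire"],
    ["paint", "fenc"],
]

# One row per permit, one 0/1 column per subtype:
# 1 if the permit shares at least one stem with the subtype's keywords.
label_matrix = [
    [1 if kw & set(tokens) else 0 for kw in keyword_sets]
    for tokens in permits
]

# Number of positive permits per subtype, and permits with no subtype at all
per_subtype = dict(zip(labels, (sum(col) for col in zip(*label_matrix))))
unlabeled = sum(1 for row in label_matrix if not any(row))

print(label_matrix)  # [[1, 0, 0], [1, 1, 0], [0, 0, 0]]
print(per_subtype)   # {'roofing': 2, 'electrical': 1, 'plumbing': 0}
print(unlabeled)     # 1
```

Checking these counts before training is a cheap sanity test: a subtype with no positive permits gives its classifier nothing to learn from.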

Build tf/idf matrix for the training set data

In [5]:
# Build tf/idf matrix for the classifiers
tfidf_vectorizer = TfidfVectorizer(max_df=0.8, max_features=300,
                                   min_df=50, stop_words=stopwords,
                                   use_idf=True, tokenizer=content2stems, 
                                   decode_error='ignore',ngram_range=[1,1])
tfidf_matrix_train = tfidf_vectorizer.fit_transform(build_permits)
tfidf_matrix_train=tfidf_matrix_train.toarray()
terms = tfidf_vectorizer.get_feature_names() # Save the terms that remain from the vectorization

# Save building permit terms used when training (binary mode is the safe choice for pickle files)
with open('./NB_Classifiers/permit_terms.pickle', 'wb') as f:
    pickle.dump([terms], f)

# Also convert the permit labels (y, for the predictions) into a numpy array
map_train=np.array(map_train)
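
To make the tf/idf weighting concrete, the sketch below recomputes it by hand on three hypothetical stemmed permits. It follows TfidfVectorizer's default settings as documented by scikit-learn (raw term counts, smoothed idf = ln((1 + n) / (1 + df)) + 1, then L2 normalization of each row). It leaves out the max_df, min_df and max_features pruning used in the script.

```python
import math
from collections import Counter

# Hypothetical permits, already tokenized and stemmed.
docs = [["roof", "shingl", "roof"], ["roof", "panel"], ["pipe", "drain", "pipe"]]

n_docs = len(docs)
vocab = sorted({t for d in docs for t in d})
# Document frequency: in how many permits each term appears
df = Counter(t for d in docs for t in set(d))
# Smoothed inverse document frequency
idf = {t: math.log((1.0 + n_docs) / (1.0 + df[t])) + 1.0 for t in vocab}


def tfidf_row(tokens):
    """Return the L2-normalized tf/idf vector of one permit over vocab."""
    counts = Counter(tokens)
    raw = [counts[t] * idf[t] for t in vocab]
    norm = math.sqrt(sum(v * v for v in raw)) or 1.0
    return [v / norm for v in raw]


matrix = [tfidf_row(d) for d in docs]
print(vocab)  # ['drain', 'panel', 'pipe', 'roof', 'shingl']
```

A term that appears in many permits ("roof" here) gets a lower idf than a rarer one ("pipe"), so frequent generic words contribute less to each row.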

Construct a Naive Bayes classifier for each permit subtype, and save them to disk. 

In [6]:
# Build a Naive Bayes classifier for each subtype
scores_indiv=[]
predict_test=[]
for i, key in enumerate(keyword_label):
    # Grab the class labels for each specific subtype
    sample=map_train[:,i]

    # Build Naive Bayes classifier, given tfidf-matrix and class labels.
    cls=MultinomialNB()
    cls.fit(tfidf_matrix_train,sample)
    
    # Save NB model to file for future use
    joblib.dump(cls,'./NB_Classifiers/NB_'+str(key)+'.pkl')
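
To show how the saved classifiers might be used later, here is a minimal end-to-end sketch on toy data: train one classifier per subtype, save them, then reload them to label a new permit. All texts, labels and file names are hypothetical. It assumes a recent scikit-learn, where joblib is imported directly rather than from sklearn.externals. It also saves the fitted vectorizer, which the script above does not do (it only pickles the term list); this is an assumption made here so that new text is transformed with the same vocabulary and idf weights.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical training permits and their 0/1 subtype labels.
texts = [
    "replace roof shingles",
    "install solar panel on roof and rewire",
    "upgrade electrical panel wiring",
    "interior paint and drywall",
]
labels = ["roofing", "electrical"]
y = np.array([[1, 0], [1, 1], [0, 1], [0, 0]])

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts)

# Train and save one binary classifier per subtype (temporary directory).
out_dir = tempfile.mkdtemp()
joblib.dump(vectorizer, os.path.join(out_dir, "vectorizer.pkl"))
for i, name in enumerate(labels):
    clf = MultinomialNB().fit(X, y[:, i])
    joblib.dump(clf, os.path.join(out_dir, "NB_%s.pkl" % name))

# Later: reload everything and collect the subtypes predicted for a new permit.
vec = joblib.load(os.path.join(out_dir, "vectorizer.pkl"))
x_new = vec.transform(["replace roof shingles on garage"])
predicted = []
for name in labels:
    clf = joblib.load(os.path.join(out_dir, "NB_%s.pkl" % name))
    if clf.predict(x_new)[0] == 1:
        predicted.append(name)
print(predicted)
```

Since every classifier votes independently, the result is a list that may contain zero, one or several subtypes.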