# Intro to Text Analysis with Python
### Roadmap
1. Working with data in pandas dataframes
2. Preparing unstructured text data for analysis (tokenizing, stemming, and stop words)
3. Document term matrix
4. Two topic modeling algorithms (Non-negative Matrix Factorization & Latent Dirichlet Allocation)
5. Predictive text classifier (Naive-Bayes)

### 1. Pandas Dataframe

In [33]:
#Read the data files into a pandas dataframe
import pandas as pd
import numpy as np

#Force the free-text columns to a single (object/string) dtype instead of mixed types
#This spares pandas from inferring a type for these columns when the files are read
dtypes = {
    'Description' : object,
    'Summary*' : object,
    'Memo 2' : object,
    'Contact Organization' : object,
    'Contact Department' : object,
    'Contact Desk Location' : object,
    'Contact E-mail' : object, 
    'Memo 1' : object, 
    'Resolution Categorization Tier 1' : object
}

#The data is split across three files, so use the pandas concat function to union them together
#Python 3 and pandas are preferred when working with text data because they handle Unicode better
df = pd.concat([
    pd.read_csv("../tickets.csv", dtype=dtypes, encoding="ISO-8859-1"),
    pd.read_csv("../ticketsarc.csv", dtype=dtypes, encoding="ISO-8859-1"),
    pd.read_csv("../ticketsarc2.csv", dtype=dtypes, encoding="ISO-8859-1")
])

In [34]:
#Check data (rows, columns)
df.shape

(51069, 79)

In [35]:
#For analysis we will only work with a subset of rows and columns
#We create a separate dataframe to hold this data

#Select rows based on a criterion, i.e. the "where" clause
dftxt = df[df['Operational Categorization Tier 3'].isin(['Cat A','Cat B'])]

#Select only 4 of the 79 columns, i.e. the "select" clause
dftxt = dftxt[['Incident ID*+','Operational Categorization Tier 3','Description','Summary*']]

#Replace all nulls with empty strings so the string operations below do not fail on missing values
dftxt = dftxt.fillna('')

#Concatenate the two text columns into one single column
dftxt['text'] = dftxt['Summary*'].map(str) + u'. ' + dftxt['Description'].map(str)

#Remove the original two text columns
dftxt = dftxt.drop(['Description','Summary*'], axis=1)

#Reset the row counter of the new dataframe to start at 0
dftxt.reset_index(drop=True, inplace=True)


In [36]:
#Check data subset (rows, columns)
dftxt.shape

(7628, 3)

In [37]:
#Take a peek at the data
dftxt.head(2)

| | Incident ID*+ | Operational Categorization Tier 3 | text |
|---|---|---|---|
| 0 | ID000011111111 | Cat A | ERS Supplier. Request a change to information ... |
| 1 | ID000011111112 | Cat B | 85555 Fisk, Wen - confirming address. 55555 Fa... |


### 2. Tokenizing, Stemming, and Stop Words
#### Sample text:
>*I changed my email address to jzhang28@stanford.edu.*

#### Tokens:
>*i, changed, my, email, address, to, jzhang28@stanford.edu*

#### Stems:
>*i, <span style="color:red">chang</span>, my, email, address, to, jzhang28@stanford.edu*

#### Stop words:
>*<strike style="color:red">i</strike>, chang, <strike style="color:red">my</strike>, email, address, <strike style="color:red">to</strike>, jzhang28@stanford.edu*

#### N-grams (bi-grams):
>*chang-email, email-address, address-jzhang28@stanford.edu*
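
The four steps above can be sketched with the standard library alone. This is a minimal illustration, not the pipeline used in the cells below: `re.findall` stands in for the NLTK tokenizer, and `TOY_STEMS` and `TOY_STOP_WORDS` are hypothetical lookups made up for this one sentence in place of the Snowball stemmer and the NLTK stop word list.

```python
import re

# The email pattern is listed first so it takes precedence over the plain word pattern
TOKEN_REGEX = re.compile(r'\w[a-z0-9._-]+\w@[a-z0-9._-]+\w|\w+')

# Hypothetical stand-ins for a real stemmer and a real stop word list
TOY_STEMS = {'changed': 'chang'}
TOY_STOP_WORDS = {'i', 'my', 'to'}

def bigrams(tokens):
    # Pair each token with its right-hand neighbor
    return [a + '-' + b for a, b in zip(tokens, tokens[1:])]

text = 'I changed my email address to jzhang28@stanford.edu.'
tokens = TOKEN_REGEX.findall(text.lower())
stems = [TOY_STEMS.get(t, t) for t in tokens]
cleaned = [s for s in stems if s not in TOY_STOP_WORDS]

print(tokens)
print(stems)
print(cleaned)
print(bigrams(cleaned))
```

Because the email pattern comes before `\w+` in the alternation, the address survives as a single token instead of being split at `@` and `.`.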

In [38]:
import re
from nltk import tokenize, stem, corpus

#Create regular expressions used to tokenize a string
#Raw strings (r'...') keep backslashes such as \w and \d intact
email_regex_str = r'\w[a-z0-9._-]+\w@[a-z0-9._-]+\w' #email pattern
currency_regex_str = r'[$£€¥][\d,\.]+' #currency pattern
ssn_regex_str = r'\d{3}-\d{2}-\d{4}' #ssn pattern
word_regex_str = r'\w+' #any word
#The order sets precedence: earlier patterns are tried first
token_regex = '|'.join([email_regex_str, currency_regex_str, ssn_regex_str, word_regex_str])

#Precompile the regular expressions so they are not re-parsed on every call
email_regex = re.compile(email_regex_str)
currency_regex = re.compile(currency_regex_str)
ssn_regex = re.compile(ssn_regex_str)

In [39]:
#Tokenize function
def get_tokens(text):
    text = text.lower()
    return tokenize.regexp_tokenize(text, token_regex, gaps=False)

In [40]:
#Stemmer function
#Iterate through the tokens and stem the plain words
#Keep email addresses as they are
#Replace currency amounts and SSNs with entity tags
#Drop everything else (numbers, single characters, tokens with non-letters)
stemmer = stem.snowball.EnglishStemmer()
def get_stems(tokens):
    stems = []
    for t in tokens:
        if email_regex.match(t):
            stems.append(t)
        elif currency_regex.match(t):
            stems.append('_currency_amt_')
        elif ssn_regex.match(t):
            stems.append('_ssn_')
        elif t.isalpha() and len(t) > 1:
            stems.append(stemmer.stem(t)) 
    return stems

In [41]:
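#Stop word removal function
#The list is applied to stems, so the custom entries are written in stemmed form (e.g. 'pleas')
#NLTK stop words whose stem differs from the word itself (e.g. 'any' -> 'ani') are not removed
stopwords = set(corpus.stopwords.words('english'))
addl_stopwords = {'pleas', 'thank', 'edu'}
custom_stopwords = stopwords | addl_stopwords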
def remove_stop_words(tokens):
    return [t for t in tokens if t not in custom_stopwords]

In [42]:
#Apply functions to dataset
dftxt['tokens'] = dftxt['text'].map(get_tokens)
dftxt['stems'] = dftxt['tokens'].map(get_stems)
dftxt['cleaned_stems'] = dftxt['stems'].map(remove_stop_words)

In [43]:
#Check text processing on a record
ind=0
print('''
Text:\n%s\n\n
Tokens:\n%s\n
Stems:\n%s\n
Cleaned Stems:\n%s\n
''' % (dftxt.iloc[ind]['text'], dftxt.iloc[ind]['tokens'],dftxt.iloc[ind]['stems'],dftxt.iloc[ind]['cleaned_stems']))


Text:
ERS Supplier. Request a change to information for an existing payee/supplier.  Identify payee/supplier as displayed in Supplier Query and Request module. Payee/Supplier Name: SMITH, JOHN Payee/Supplier Number: 	123456  Use this form to designate these types of changes. Please indicate the item(s) that are changing.  __ Contact (name, phone, email, fax)   Provide detailed descriptions of the changes (also indicate deletion, inactivation, or changes to existing):  Please update the email address to: johnsmith@gmail.com


Tokens:
['ers', 'supplier', 'request', 'a', 'change', 'to', 'information', 'for', 'an', 'existing', 'payee', 'supplier', 'identify', 'payee', 'supplier', 'as', 'displayed', 'in', 'supplier', 'query', 'and', 'request', 'module', 'payee', 'supplier', 'name', 'smith', 'john', 'payee', 'supplier', 'number', '123456', 'use', 'this', 'form', 'to', 'designate', 'these', 'types', 'of', 'changes', 'please', 'indicate', 'the', 'item', 's', 'that', 'are', 'changing', '__', '

### 3. Document Term Matrix

<img src="img/dtm.JPG" style="float: left"/>

$$
\begin{aligned}
\text{Term Count} &= \text{number of times term } t \text{ occurs in document } D\\
\text{Term Frequency (TF)} &= \frac{\text{Term Count}}{\text{Total term count in document } D}\\
\text{TF.IDF} &= \text{TF} \times \text{Inverse Document Frequency (IDF)}
\end{aligned}
$$



### TF.IDF

* If a term appears in almost every document, it does little to distinguish one document from another
* Conversely, if a term appears in only a small subset of documents, it is a signal we want to extract
* Therefore, we want to weight the latter more heavily than the former
* IDF is that weight

$IDF(t) = \log\left(1+\frac{\text{Total number of documents}}{\text{Number of documents where term } t \text{ appears}}\right)$
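
To make the formulas concrete, here is a minimal standard-library sketch that computes TF, IDF, and TF.IDF exactly as written above. The three-document corpus and the function names are made up for illustration. scikit-learn's `TfidfTransformer` uses a smoothed IDF variant and normalizes each row by default, so its numbers will not match this sketch exactly.

```python
import math
from collections import Counter

# Toy corpus of already-cleaned stems (made up for illustration)
docs = [
    ['supplier', 'address', 'updat'],
    ['supplier', 'ssn'],
    ['supplier', 'address'],
]

def term_frequency(doc):
    # Term count divided by the total number of terms in the document
    counts = Counter(doc)
    return {term: n / len(doc) for term, n in counts.items()}

def inverse_document_frequency(docs):
    # IDF(t) = log(1 + N / number of documents where t appears)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    return {term: math.log(1 + len(docs) / df) for term, df in doc_freq.items()}

idf = inverse_document_frequency(docs)
tfidf = [{t: tf * idf[t] for t, tf in term_frequency(d).items()} for d in docs]

for term in ('supplier', 'address', 'ssn'):
    print(term, round(idf[term], 3))
```

'supplier' appears in every document and gets the lowest weight, while 'ssn' appears in only one and gets the highest.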

In [44]:
#Generate Term Count and TFIDF matrices
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer, TfidfTransformer

#sklearn vectorizers take each document as a string
# and allow us to define a function that turns that string into tokens.
# Define that function below
def process_text(text):
    tokens = get_tokens(text)
    stems = get_stems(tokens)
    cleaned_stems = remove_stop_words(stems)
    return cleaned_stems

#Create Term Count matrix
#max_df: Exclude terms that appear in over 95% of the documents
#min_df: Exclude terms that appear in under 1% of the documents
#ngram_range: Single words and bi-grams
cnt_vect = CountVectorizer(tokenizer=process_text, max_df=0.95, min_df=0.01, ngram_range=(1,2))
cnt_matrix = cnt_vect.fit_transform(dftxt['text'])

#Extract list of terms to be used as reference later
#(get_feature_names_out replaces get_feature_names, which newer scikit-learn releases removed)
terms = cnt_vect.get_feature_names_out()

#Using the Term Count matrix, we can create the TFIDF matrix
tfidf_matrix = TfidfTransformer().fit_transform(cnt_matrix)



In [45]:
print("Term count matrix rows and columns:", cnt_matrix.shape)
print("TFIDF matrix rows and columns:", tfidf_matrix.shape)
print("Number of terms:", len(terms))

Term count matrix rows and columns: (7628, 794)
TFIDF matrix rows and columns: (7628, 794)
Number of terms: 794


### 4. Topic Modeling
<img src="img/topics_matrix.jpg"/>
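
Both models below describe the corpus with two smaller matrices: a document-topic matrix (one row per document, one column per topic) and a topic-term matrix (one row per topic, one column per term). The following is a minimal sketch with hand-picked numbers, made up for illustration and not fitted by any model. It shows how the two matrices multiply back into an approximate document-term matrix and how the top terms of a topic are read off a row.

```python
vocab = ['address', 'updat', 'ssn', 'need']

# Topic-term matrix: 2 topics x 4 terms (hypothetical weights)
topic_term = [
    [0.9, 0.8, 0.0, 0.1],   # topic 0: address updates
    [0.0, 0.1, 0.9, 0.7],   # topic 1: SSN requests
]

# Document-topic matrix: 3 documents x 2 topics (hypothetical weights)
doc_topic = [
    [1.0, 0.0],
    [0.0, 1.0],
    [0.5, 0.5],
]

def matmul(a, b):
    # Plain-Python matrix product of a (n x k) and b (k x m)
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def top_terms(weights, n):
    # Terms with the n largest weights, largest first
    order = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
    return [vocab[i] for i in order[:n]]

doc_term = matmul(doc_topic, topic_term)   # 3 documents x 4 terms
for idx, row in enumerate(topic_term):
    print('Topic #%d: %s' % (idx, ', '.join(top_terms(row, 2))))
```

The third document is an even mix of both topics, so its reconstructed row is the average of the two topic rows.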

In [46]:
#Topic modeling with Non-negative Matrix Factorization
from sklearn.decomposition import NMF

#Specify number of topics to find
k_topics = 10

#Create NMF model object
nmf = NMF(init='nndsvd', n_components=k_topics, random_state=1)

#Fit the model on the TFIDF matrix and return the document-topic matrix
nmf_topic_matrix = nmf.fit_transform(tfidf_matrix)

In [47]:
#Check matrix rows and columns
nmf_topic_matrix.shape

(7628, 10)

In [48]:
#Define convenience function to print the top words of each topic in the topic-term matrix
def print_top_words(model, feature_names, n_top_words):
    for topic_idx, topic in enumerate(model.components_): #model.components_ attribute contains the topic term matrix
        words = [feature_names[i].replace(' ','-') for i in topic.argsort()[:-n_top_words - 1:-1]]
        print("Topic #%d: %s" % (topic_idx, ", ".join(words)))

In [49]:
#NMF topics and terms
print_top_words(nmf, terms, 8)

Topic #0: chang, paye, paye-supplier, supplier, name, author-agent, author, agent
Topic #1: stanford, univers, stanford-univers, payment, inform, messag, univers-procur, paye
Topic #2: supplier-request, request, supplier, request-supplier, cancel, info, status, request-request
Topic #3: joni, fakename@stanford.edu, png, joni-fakenam, fakenam, supplier-set, procur-financi, manag-servic
Topic #4: supplier-enabl, enabl-team, team, enabl, email, sent, supplier, helpsu
Topic #5: reactiv, iprocur, inc, supplier, er, llc, enabl, paye
Topic #6: ssn, need-ssn, provid, need, ssn-supplier, call, _ssn_, setup
Topic #7: portal, portal-request, resend, secur-portal, secur, send, request, resent
Topic #8: address, inc, site, updat, po, address-updat, need, activ
Topic #9: req, supplier-req, supplier, iprocur, resent, status, set, see


In [50]:
print("Text of the first document:\n", dftxt.iloc[0]['text'])
print()
print("NMF matrix first row:")
for idx, value in enumerate(nmf_topic_matrix[0]):
    print("f%s: %s" % (idx,round(value,4)) )

Text of the first document:
 ERS Supplier. Request a change to information for an existing payee/supplier.  Identify payee/supplier as displayed in Supplier Query and Request module. Payee/Supplier Name: SMITH, JOHN Payee/Supplier Number: 	123456  Use this form to designate these types of changes. Please indicate the item(s) that are changing.  __ Contact (name, phone, email, fax)   Provide detailed descriptions of the changes (also indicate deletion, inactivation, or changes to existing):  Please update the email address to: johnsmith@gmail.com

NMF matrix first row:
f0: 0.1217
f1: 0.0
f2: 0.021
f3: 0.0
f4: 0.0
f5: 0.0009
f6: 0.0
f7: 0.0
f8: 0.0172
f9: 0.0036


In [51]:
#Topic modeling with Latent Dirichlet Allocation
from sklearn.decomposition import LatentDirichletAllocation

#Create LDA model object
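#n_components is the current name of the parameter formerly called n_topics
lda = LatentDirichletAllocation(n_components=k_topics, max_iter=5, learning_method='online', learning_offset=50., random_state=1)

#Run model on the term count matrix
#LDA models word counts, so it is fit on raw counts rather than TFIDF weights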
lda_topic_matrix = lda.fit_transform(cnt_matrix)

In [52]:
#Check matrix rows and columns
lda_topic_matrix.shape

(7628, 10)

In [53]:
#LDA topic and terms
print_top_words(lda, terms, 8)

Topic #0: stanford, subject, sent, payment, email, univers, help123@stanford.edu, inform
Topic #1: address, updat, bank, supplier, email, inc, wire, po
Topic #2: supplier, email, enabl, secur, request, sent, inform, team
Topic #3: need, paye, reimburs, set, complet, check, status, transact
Topic #4: request, supplier, supplier-request, portal, site, reactiv, inc, activ
Topic #5: supplier, paye, paye-supplier, chang, name, email, request, inform
Topic #6: supplier, ssn, req, supplier-req, call, er, ssn-supplier, su
Topic #7: email, messag, financi, ani, support, financi-support, receiv, center
Topic #8: stanford, univers, stanford-univers, payment, inform, request, supplier, paye
Topic #9: request, joni, supplier, number, fakename@stanford.edu, alto, palo, palo-alto


In [54]:
print("LDA matrix first row:")
for idx, value in enumerate(lda_topic_matrix[0]):
    print("f%s: %s" % (idx,round(value,4)))

LDA matrix first row:
f0: 0.0011
f1: 0.0011
f2: 0.0011
f3: 0.0011
f4: 0.0011
f5: 0.9776
f6: 0.0138
f7: 0.0011
f8: 0.0011
f9: 0.0011


In [55]:
print('Most likely topic for the first document:')
print('NMF topic: %s\nLDA topic: %s' % (nmf_topic_matrix[0].argsort()[:-2:-1],lda_topic_matrix[0].argsort()[:-2:-1]))

Most likely topic for the first document:
NMF topic: [0]
LDA topic: [5]
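
The slice `argsort()[:-2:-1]` keeps only the last index of an ascending sort, which is the position of the largest value. A minimal sketch of the same lookup in plain Python, using the LDA row printed above:

```python
# First-document topic weights copied from the LDA output above
lda_row = [0.0011, 0.0011, 0.0011, 0.0011, 0.0011,
           0.9776, 0.0138, 0.0011, 0.0011, 0.0011]

# Ascending sort of the indices by weight, as argsort does
ascending = sorted(range(len(lda_row)), key=lambda i: lda_row[i])

# [:-2:-1] walks backwards from the end and stops after one element
most_likely = ascending[:-2:-1]
print(most_likely)  # [5]
```

When only the single best topic is needed, NumPy's `argmax()` returns the same index as a scalar.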


### 5. Predictive Text Classifier
<img src="img/ml.jpg" />

In [56]:
#Use this feature as the target variable y
dftxt['Operational Categorization Tier 3'].value_counts(normalize=True)

Cat B    0.555585
Cat A    0.444415
Name: Operational Categorization Tier 3, dtype: float64

In [57]:
#Prep data for predictive model
from sklearn.model_selection import train_test_split

#Transform target variable into an array of binary outcomes
target_values = np.where(dftxt['Operational Categorization Tier 3'] == 'Cat B', 1, 0)

#Randomly split features and target together into training and testing sets
#train_test_split keeps rows aligned across the arrays, so there is no need to stack them first
X_train, X_test, y_train, y_test = train_test_split(tfidf_matrix, target_values)

In [58]:
print("Training set features:", X_train.shape)
print("Training set target variable:", len(y_train))
print()
print("Test set features:", X_test.shape)
print("Test set target variable:", len(y_test))

Training set features: (5721, 794)
Training set target variable: 5721

Test set features: (1907, 794)
Test set target variable: 1907
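
The sizes above are what a 75/25 split of 7,628 rows produces. A minimal sketch of such a split on row indices, assuming a 25% test fraction; `split_indices` is a hypothetical helper, not scikit-learn's implementation:

```python
import math
import random

def split_indices(n_samples, test_fraction=0.25, seed=0):
    # Shuffle the row indices, then cut off the test share
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = math.ceil(test_fraction * n_samples)
    return indices[n_test:], indices[:n_test]

train_idx, test_idx = split_indices(7628)
print(len(train_idx), len(test_idx))  # 5721 1907
```

Every row lands in exactly one of the two sets, so the model is scored on documents it never saw during training.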


In [59]:
#Establish baseline model
#The simplest model is to predict that everything is Cat B, i.e. 1
from sklearn.metrics import roc_auc_score

#Generate my baseline predictions of all 1's
baseline_pred_values = [1] * len(y_test)

#Score my predictions against the test set y values
auc_baseline = roc_auc_score(y_test, baseline_pred_values)
auc_baseline

0.5
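
Any constant prediction scores an AUC of 0.5. AUC is the probability that a randomly chosen positive is ranked above a randomly chosen negative, with ties counting as half, and a constant prediction ties every pair. A minimal pairwise sketch of that definition follows; `pairwise_auc` is a hypothetical helper rather than scikit-learn's implementation, and the labels and scores are made up.

```python
def pairwise_auc(y_true, scores):
    # Compare every positive with every negative; ties earn half credit
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

y_true = [1, 0, 1, 1, 0]
constant_auc = pairwise_auc(y_true, [1] * len(y_true))
ranked_auc = pairwise_auc(y_true, [0.9, 0.2, 0.8, 0.4, 0.6])
print(constant_auc)  # 0.5
print(round(ranked_auc, 3))
```

A useful model therefore has to rank most positives above most negatives, not merely guess the majority class.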

In [60]:
#Run Naive-Bayes classifier
from sklearn.naive_bayes import MultinomialNB

#Generate the model using the training set
nb_model = MultinomialNB().fit(X_train, y_train)

#Predict the probability of class 1 (Cat B) for each test document; AUC is computed from scores, not hard labels
nb_pred_values = nb_model.predict_proba(X_test)[:,1]

#Score the model predictions against the test set y values
auc_nb = roc_auc_score(y_test, nb_pred_values)
auc_nb

0.73245618955729574
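
For intuition about what a multinomial Naive-Bayes model estimates, here is a minimal standard-library sketch on raw term counts with add-one smoothing. It is an illustration under simplifying assumptions, not scikit-learn's code; the tiny count matrix, the labels, and the function names are made up.

```python
import math

def fit_nb(X, y, alpha=1.0):
    # Estimate a log prior and smoothed log term probabilities for each class
    model = {}
    for c in sorted(set(y)):
        rows = [x for x, label in zip(X, y) if label == c]
        term_totals = [sum(col) + alpha for col in zip(*rows)]
        denom = sum(term_totals)
        model[c] = (
            math.log(len(rows) / len(X)),
            [math.log(t / denom) for t in term_totals],
        )
    return model

def predict_proba(model, x):
    # Unnormalized log posterior per class, then normalize to probabilities
    log_scores = {
        c: prior + sum(count * lp for count, lp in zip(x, log_probs))
        for c, (prior, log_probs) in model.items()
    }
    top = max(log_scores.values())
    exp_scores = {c: math.exp(s - top) for c, s in log_scores.items()}
    total = sum(exp_scores.values())
    return {c: v / total for c, v in exp_scores.items()}

# Columns: counts of 'address' and 'ssn'; label 1 = Cat B, 0 = Cat A (made up)
X = [[2, 0], [1, 0], [0, 2], [0, 1]]
y = [1, 1, 0, 0]
model = fit_nb(X, y)
proba = predict_proba(model, [1, 0])
print(proba)
```

The "naive" part is the sum inside `predict_proba`: each term contributes to the class score independently of every other term in the document.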

### Resources

<a href="http://scikit-learn.org">scikit-learn</a>: Documentation for the scikit-learn Python package, along with information on the models used<br/>
<a href="https://www.youtube.com/watch?v=UzxYlbK2c7E&list=PLA89DCFA6ADACE599">Machine learning lectures by Prof. Andrew Ng</a>: Lectures recorded at Stanford. A bit math heavy, but they cover the fundamental concepts behind machine learning<br/>
<a href="http://www.quuxlabs.com/blog/2010/09/matrix-factorization-a-simple-tutorial-and-implementation-in-python/">Matrix factorization</a>: Math and intuition behind matrix factorization, the basis of the NMF model<br/>
<a href="http://videolectures.net/mlss09uk_blei_tm/">Topic modeling lecture</a>: A good explanation of how LDA works<br/>
<a href="https://www.youtube.com/watch?v=TpgiFIGXcT4">Bayesian statistics video</a>: Video lecture from PyCon 2016.<br/>
<a href="http://brandonrose.org/clustering">More examples of document clustering using Python</a><br />