# PROOF OF CONCEPT - LDA TOPIC MODELING - EVENTS DATA


This Jupyter notebook sketches a possible implementation of latent Dirichlet allocation (LDA) topic modeling for events data.  A dictionary of words is built from the event descriptions, with each description treated as one document.  The topic model factorizes the overall document/word matrix into a document/topic matrix and a topic/word matrix.  The number of topics and the number of words per topic were validated extensively so that a previous click matches enough topics, but not too many.  The intended use is event matching within a single city or across cities.
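
As a sketch of the factorization mentioned above (the notation is an assumption introduced here, not taken from the notebook): for a document $d$, a word $w$, and $K$ topics, LDA models the word distribution of each document as a mixture over topics.

```latex
P(w \mid d) \;=\; \sum_{k=1}^{K} P(w \mid z = k)\, P(z = k \mid d)
```

The values $P(z = k \mid d)$ form the rows of the document/topic matrix and the values $P(w \mid z = k)$ form the rows of the topic/word matrix.  The `num_topics` argument in the model cell further down sets $K$.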

In [157]:
import numpy as np
import pandas as pd
from nltk.corpus import stopwords 
from nltk.stem.wordnet import WordNetLemmatizer
import string
import re

## Data cleaning
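Column headings are renamed so they are easier to identify.  Missing descriptions, which load as floats, are replaced with a placeholder.  Punctuation and digits are stripped, and stop words are removed.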

In [287]:
events_move = pd.read_csv('events.csv')

In [288]:
events_move.columns = ['ID', 'Name', 'City', 'Description', 'Date', 'Added_date']


In [289]:
# Missing descriptions load as float NaN; replace them with a placeholder string.
events_move['Description'] = events_move['Description'].fillna('blank')

In [290]:
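# Replace every non-word character with a space (assigned back rather than modified in place).
events_move['Description'] = events_move['Description'].replace(to_replace=r'\W', value=' ', regex=True)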

In [291]:
# Replace every digit with a space (assigned back rather than modified in place).
events_move['Description'] = events_move['Description'].replace(to_replace=r'\d', value=' ', regex=True)

In [292]:
document_set = events_move.Description
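
The cleaning cells above can be checked on a single value with the standard library alone.  This is a minimal sketch: `scrub` is a hypothetical helper name and the sample string is made up, but the two regular expressions are the ones used in the notebook.

```python
import re

def scrub(description):
    """Apply the notebook's cleaning steps to one value (hypothetical helper)."""
    # Missing descriptions arrive from pandas as float NaN; the notebook
    # substitutes the placeholder string 'blank' for them.
    if isinstance(description, float):
        return "blank"
    # Replace every non-word character, then every digit, with a space.
    no_punct = re.sub(r"\W", " ", description)
    return re.sub(r"\d", " ", no_punct)

# Made-up sample text; splitting shows which tokens survive the scrub.
print(scrub("Pier 39: the city's best!").split())
print(scrub(float("nan")))
```

Note that the apostrophe in "city's" becomes a space, which is why the cleaned descriptions later in the notebook contain fragments such as `city   s`.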

Each description is lowercased, stripped of stop words and punctuation, lemmatized, and split into a list of words.

In [293]:

stop = set(stopwords.words('english'))
exclude = set(string.punctuation) 
lemma = WordNetLemmatizer()
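def clean(doc):
    # Lowercase the text and drop English stop words.
    stop_free = " ".join([i for i in doc.lower().split() if i not in stop])
    # Remove any remaining punctuation characters.
    punc_free = ''.join(ch for ch in stop_free if ch not in exclude)
    # Reduce each word to its WordNet lemma.
    normalized = " ".join(lemma.lemmatize(word) for word in punc_free.split())
    return normalized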

doc_clean = [clean(doc).split() for doc in document_set]   


In [295]:
events_move['cleaned'] = doc_clean

### Dictionary creation
A gensim dictionary is built from the split words.  Each description is then converted to a bag-of-words vector to form the document-term matrix.

In [296]:
# Importing Gensim
import gensim
from gensim import corpora

# Creating the term dictionary of our corpus, where every unique term is assigned an index.

dictionary = corpora.Dictionary(doc_clean)


# Converting list of documents (corpus) into Document Term Matrix using dictionary prepared above.
doc_term_matrix = [dictionary.doc2bow(doc) for doc in doc_clean]
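
To make the two structures above concrete, here is a pure-Python sketch of a term dictionary and a bag-of-words conversion.  It is an illustration under assumptions, not gensim's implementation: `build_vocab` and `to_bow` are hypothetical names, the sample documents are made up, and IDs are assigned in first-seen order, which may differ from the IDs gensim assigns.

```python
from collections import Counter

def build_vocab(docs):
    """Assign each unique token an integer ID in first-seen order."""
    vocab = {}
    for doc in docs:
        for token in doc:
            vocab.setdefault(token, len(vocab))
    return vocab

def to_bow(doc, vocab):
    """Return sorted (token_id, count) pairs; unknown tokens are skipped."""
    counts = Counter(vocab[t] for t in doc if t in vocab)
    return sorted(counts.items())

# Made-up tokenized documents standing in for doc_clean.
docs = [["craft", "beer", "festival", "beer"], ["agile", "training", "course"]]
vocab = build_vocab(docs)
bow = [to_bow(doc, vocab) for doc in docs]
print(vocab)
print(bow)
```

Each document becomes a sparse list of (ID, count) pairs, so a repeated word such as "beer" is stored once with a count of 2.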

### Topic Modelling
The LDA model is trained here.  The number of topics and the number of passes were chosen after extensive validation.

In [None]:
# Creating the object for LDA model using gensim library
Lda = gensim.models.ldamodel.LdaModel

# Running and training the LDA model on the document-term matrix.
ldamodel = Lda(doc_term_matrix, num_topics=100, id2word = dictionary, passes=30)

In [361]:
topics = ldamodel.print_topics(num_topics=100, num_words=3)

In [362]:
topics

[(0, u'0.034*"class" + 0.032*"bring" + 0.023*"learn"'),
 (1, u'0.035*"balancing" + 0.035*"job" + 0.018*"well"'),
 (2, u'0.032*"agile" + 0.018*"leader" + 0.016*"workshop"'),
 (3, u'0.033*"attribution" + 0.022*"networking" + 0.022*"city"'),
 (4, u'0.045*"training" + 0.044*"pmp" + 0.036*"course"'),
 (5, u'0.035*"boc" + 0.028*"training" + 0.023*"course"'),
 (6, u'0.068*"risk" + 0.045*"management" + 0.037*"course"'),
 (7, u'0.000*"management" + 0.000*"service" + 0.000*"project"'),
 (8, u'0.023*"law" + 0.015*"thursday" + 0.015*"enforcement"'),
 (9, u'0.044*"professional" + 0.044*"transportation" + 0.030*"bay"'),
 (10, u'0.017*"barefoot" + 0.011*"training" + 0.011*"delivery"'),
 (11, u'0.054*"agile" + 0.041*"class" + 0.041*"leadership"'),
 (12, u'0.064*"redfin" + 0.056*"home" + 0.039*"buying"'),
 (13, u'0.037*"york" + 0.028*"dating" + 0.023*"new"'),
 (14, u'0.036*"agile" + 0.033*"coaching" + 0.033*"team"'),
 (15, u'0.046*"emg" + 0.039*"ncv" + 0.031*"medical"'),
 (16, u'0.035*"coaching" + 0.02
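
Each entry in the output above pairs a topic ID with a string of `weight*"word"` terms.  If the weights are needed as numbers, the strings can be parsed.  The sketch below assumes exactly the format shown in the output; `parse_topic` is a hypothetical helper name.

```python
import re

# Matches one term such as 0.034*"class" and captures the weight and the word.
TERM = re.compile(r'(\d+\.\d+)\*"([^"]+)"')

def parse_topic(topic_string):
    """Turn '0.034*"class" + 0.032*"bring"' into [(word, weight), ...]."""
    return [(word, float(weight)) for weight, word in TERM.findall(topic_string)]

print(parse_topic('0.034*"class" + 0.032*"bring" + 0.023*"learn"'))
```

gensim's `show_topics` also accepts `formatted=False`, which should return the word and weight pairs directly and avoid the parsing step.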

### Running of model

The trained model assigns topics to each event.  They are shown in the final column of the table below.

In [371]:
topic_holder = []

for j in range(len(events_move)):
    # Column 6 holds the cleaned word list for event j.
    doc_new = [events_move.iloc[j, 6]]

    # Convert the word list to a bag-of-words vector using the dictionary prepared above.
    doc_term_matrix2 = [dictionary.doc2bow(doc) for doc in doc_new]

    # Look up the (topic, probability) pairs for this event.
    output = ldamodel[doc_term_matrix2]

    topic_holder.append(output)

In [394]:
# Store the (topic, probability) pairs for each event.  Assigning the whole
# column at once avoids the chained assignment that triggers pandas'
# SettingWithCopyWarning.
events_move['topic_probs'] = [holder[0] for holder in topic_holder]


In [395]:
events_move.head()

Unnamed: 0,ID,Name,City,Description,Date,Added_date,cleaned,topic_probs
0,0,"NEW YORK, NY - FHA FIRST Design & Construction...","New+York,+NY",Free Fair Housing Design and Construction Trai...,2017-06-20,2017-04-20 11:35:48.481872,"[free, fair, housing, design, construction, tr...","[(84, 0.982)]"
1,1,Verizon Innovative Learning Lab: Music Mixing ...,"New+York,+NY",Saturday April Verizon Innovative Le...,2017-04-22,2017-04-20 11:35:48.481872,"[saturday, april, verizon, innovative, learnin...","[(57, 0.984769230769)]"
2,2,Confronting the Tragedy,"New+York,+NY",Confronting the Tragedy Law Enforcement Unio...,2017-04-28,2017-04-20 11:35:48.481872,"[confronting, tragedy, law, enforcement, union...","[(8, 0.602637470206), (28, 0.377820172179), (3..."
3,3,Leading SAFe® 4.0 Training - New York,"New+York,+NY",In this two day course attendees will gain th...,2017-06-14,2017-04-20 11:35:48.481872,"[two, day, course, attendee, gain, knowledge, ...","[(10, 0.399428220479), (52, 0.596488446188)]"
4,4,"Botox Training - New York, NY","New+York,+NY",Learn to perform Botox injections and other ae...,2017-04-29,2017-04-20 11:35:48.481872,"[learn, perform, botox, injection, aesthetic, ...","[(98, 0.995330188679)]"


In [417]:
# Reduce each event's (topic, probability) pairs to a list of topic IDs.
events_move['topic_probs'] = [
    [topic_id for topic_id, _ in pairs] for pairs in events_move['topic_probs']
]


### Data Scrubbing
Each event now lists only the IDs of its assigned topics.

In [418]:
events_move.head()

Unnamed: 0,ID,Name,City,Description,Date,Added_date,cleaned,topic_probs
0,0,"NEW YORK, NY - FHA FIRST Design & Construction...","New+York,+NY",Free Fair Housing Design and Construction Trai...,2017-06-20,2017-04-20 11:35:48.481872,"[free, fair, housing, design, construction, tr...",[84]
1,1,Verizon Innovative Learning Lab: Music Mixing ...,"New+York,+NY",Saturday April Verizon Innovative Le...,2017-04-22,2017-04-20 11:35:48.481872,"[saturday, april, verizon, innovative, learnin...",[57]
2,2,Confronting the Tragedy,"New+York,+NY",Confronting the Tragedy Law Enforcement Unio...,2017-04-28,2017-04-20 11:35:48.481872,"[confronting, tragedy, law, enforcement, union...","[8, 28, 31]"
3,3,Leading SAFe® 4.0 Training - New York,"New+York,+NY",In this two day course attendees will gain th...,2017-06-14,2017-04-20 11:35:48.481872,"[two, day, course, attendee, gain, knowledge, ...","[10, 52]"
4,4,"Botox Training - New York, NY","New+York,+NY",Learn to perform Botox injections and other ae...,2017-04-29,2017-04-20 11:35:48.481872,"[learn, perform, botox, injection, aesthetic, ...",[98]
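
The reduction from (topic, probability) pairs to bare topic IDs shown in the tables above can be sketched in isolation.  The `min_prob` cutoff is a hypothetical addition that the notebook does not use; it is included only to show where weak topic assignments could be filtered out.

```python
def topic_ids(topic_probs, min_prob=0.0):
    """Keep the IDs of topics whose probability is at least min_prob."""
    return [topic_id for topic_id, prob in topic_probs if prob >= min_prob]

# Two of the pairs shown for event 2 in the earlier table.
row = [(8, 0.602637470206), (28, 0.377820172179)]
print(topic_ids(row))                # every assigned topic
print(topic_ids(row, min_prob=0.5))  # only the dominant topic
```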


## Proof of concept displayed
A random event is chosen: in this case, a beer camp in Austin, TX.  Suggested events are then found by selecting events that share at least one derived topic ID with it.  Here the resulting matches are beer camps in other cities, which demonstrates the proof of concept.

In [448]:
test_index = np.random.choice(range(len(events_move)))

In [449]:
test_index

65

In [452]:
events_move.Name[65]

'Beer Camp on Tour: Austin, TX'

In [453]:
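# Collect every other event that shares at least one topic ID with the test event.
similar_topics = []
for p in events_move.topic_probs[test_index]:
    for q in range(len(events_move)):
        if q != test_index:
            if p in events_move.topic_probs[q]:
                similar_topics.append(q)

# Drop duplicates from events that matched on more than one topic.
similar_topics = list(set(similar_topics))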

In [454]:
similar_topics

[497, 107, 399, 357, 207]
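
The matching step above treats any shared topic ID as a match.  As a possible variation, not something the notebook does, the matches could also be ranked by how many topic IDs they share.  The sketch below uses set intersection on made-up topic lists; `shared_topic_matches` is a hypothetical name.

```python
def shared_topic_matches(topics_by_event, query_index):
    """Rank other events by how many topic IDs they share with the query."""
    query = set(topics_by_event[query_index])
    scored = []
    for index, topics in enumerate(topics_by_event):
        overlap = len(query & set(topics))
        if index != query_index and overlap > 0:
            scored.append((overlap, index))
    # Most shared topics first; ties broken by lower event index.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [index for _, index in scored]

# Made-up topic ID lists, one per event.
topics_by_event = [[84], [8, 28, 31], [10, 52], [28, 31], [8], [98]]
print(shared_topic_matches(topics_by_event, 1))
```

With these lists, event 3 shares two topics with event 1 and event 4 shares one, so event 3 is ranked first.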

In [455]:
events_move.Description[497]

'America   s largest craft beer festival is coming to Chicago  This summer  Beer Camp on Tour takes over Navy Pier with hundreds of craft beers  the city   s best food trucks  live music and good times  Every craft brewer in the nation is invited and you are  too  The Midwest was the nexus of the beer industry for a century  but today   s craft brewers are shifting the focus from industrial to artisan with some of the most exciting beers in the nation  Join us in Chicago to celebrate the craft beer community and the importance of collaboration nationally and internationally  Check the website for all the information you   re looking for as it   s revealed   Beer Camp On Tour This event is for attendees     and older '

In [456]:
events_move.Description[107]

'America   s largest craft beer festival is coming to San Francisco  This summer  Beer Camp on Tour takes over Pier    with hundreds of craft beers  the city   s best food trucks  live music and good times  Every craft brewer in the nation is invited and you are  too  Northern California is the birthplace of craft beer  and don   t you forget it  Fortunately for beer lovers  brewers take that legacy seriously by always pushing the boundaries of new and innovative beers  Join us in San Francisco to celebrate the craft beer community and the importance of collaboration nationally and internationally  Check the website for all the information you   re looking for as it   s revealed   Beer Camp On Tour This event is for attendees     and older '

In [457]:
events_move.Description[399]

'America   s largest craft beer festival is coming to San Francisco  This summer  Beer Camp on Tour takes over Pier    with hundreds of craft beers  the city   s best food trucks  live music and good times  Every craft brewer in the nation is invited and you are  too  Northern California is the birthplace of craft beer  and don   t you forget it  Fortunately for beer lovers  brewers take that legacy seriously by always pushing the boundaries of new and innovative beers  Join us in San Francisco to celebrate the craft beer community and the importance of collaboration nationally and internationally  Check the website for all the information you   re looking for as it   s revealed   Beer Camp On Tour This event is for attendees     and older '

In [459]:
events_move.Description[207]

'America   s largest craft beer festival is coming to Chicago  This summer  Beer Camp on Tour takes over Navy Pier with hundreds of craft beers  the city   s best food trucks  live music and good times  Every craft brewer in the nation is invited and you are  too  The Midwest was the nexus of the beer industry for a century  but today   s craft brewers are shifting the focus from industrial to artisan with some of the most exciting beers in the nation  Join us in Chicago to celebrate the craft beer community and the importance of collaboration nationally and internationally  Check the website for all the information you   re looking for as it   s revealed   Beer Camp On Tour This event is for attendees     and older '