# Load data

In [6]:
data('AssociatedPress', package='topicmodels') # example dataset shipped with the topicmodels package
AssociatedPress

<<DocumentTermMatrix (documents: 2246, terms: 10473)>>
Non-/sparse entries: 302031/23220327
Sparsity           : 99%
Maximal term length: 18
Weighting          : term frequency (tf)

In [23]:
data_fpath <- '../av_survey_data/bikepgh_av_survey.csv' # fill the path to this file here
text_col_classes <- c(
    'interaction_details'='character',
    'positive_av_interaction'='character',
    'negative_av_interaction'='character',
    'other_av_regulations'='character',
    'elaborate_bikepgh_position'='character',
    'other_comments'='character'
)
survey_data <- read.csv(data_fpath, colClasses=text_col_classes, na.strings=c(''))
print(nrow(survey_data))
print(sapply(survey_data, class))
survey_data

[1] 1608
               participant_id                           age 
                    "integer"                      "factor" 
      av_disclose_performance            av_reduce_injuries 
                     "factor"                      "factor" 
             av_report_safety                av_school_zone 
                     "factor"                      "factor" 
                av_share_data                av_speed_limit 
                     "factor"                      "factor" 
               av_two_drivers                bikegph_member 
                     "factor"                      "factor" 
             bikepgh_position       bikepgh_should_advocate 
                     "factor"                      "factor" 
   elaborate_bikepgh_position                      end_date 
                  "character"                      "factor" 
             familiar_av_tech                 feel_safe_avs 
                     "factor"                     "numeric" 
               

participant_id,age,av_disclose_performance,av_reduce_injuries,av_report_safety,av_school_zone,av_share_data,av_speed_limit,av_two_drivers,bikegph_member,⋯,other_comments,own_car,own_smartphone,paying_attention_av_news,positive_av_interaction,regulate_av_testing,start_date,thoughts_pgh_av_testing,year,zipcode
0,,,Yes,,No,Yes,No,,,⋯,,,,To a moderate extent,,Not sure,02/22/2017 10:11:47 AM PST,Approve,2017,15216
1,,,Yes,,Not sure,Yes,Yes,,,⋯,"I would really like them to share their data with City Planning so we aren't just being guinea pigs for no payoff, but otherwise I am pretty in favor of them.",,,To some extent,,Yes,02/22/2017 10:15:25 AM PST,Neutral,2017,15224
2,,,Yes,,No,Yes,No,,,⋯,,,,To a moderate extent,,Yes,02/22/2017 10:17:08 AM PST,Approve,2017,15206
3,,,Maybe,,Yes,Yes,Not sure,,,⋯,i have spent a lot of time conversing with people who work at uber and third parties about these technologies and the responsibility that's carried along with using our city as a testing ground. feel free to call me at 5129448987 anna bieberdorf,,,To a large extent,,Yes,02/22/2017 10:19:36 AM PST,Disapprove,2017,15201
4,,,Maybe,,Yes,Yes,Yes,,,⋯,,,,To a moderate extent,,Yes,02/22/2017 10:27:29 AM PST,Somewhat Approve,2017,15224
5,,,Yes,,Not sure,Yes,Yes,,,⋯,,,,To a moderate extent,,Yes,02/22/2017 10:29:01 AM PST,Approve,2017,15201-1720
6,,,Yes,,No,Yes,Yes,,,⋯,"I don't think that pedestrians and cyclists should fear this technology. However, we should be cautious of our personal safety around unmanned vehicles as we would with any vehicle!",,,To a large extent,,Yes,02/22/2017 10:30:07 AM PST,Approve,2017,15215
7,,,Yes,,Not sure,Yes,Not sure,,,⋯,"I think eventually they'll be safer for bicyclists and pedestrians to interact with than human-operated vehicles. I'm not sure we're at that point yet, though.",,,To a moderate extent,,Yes,02/22/2017 10:32:10 AM PST,Somewhat Approve,2017,15227
8,,,,,,,,,,⋯,,,,,,,02/22/2017 10:37:00 AM PST,,2017,
9,,,Yes,,No,Yes,No,,,⋯,,,,To a large extent,,Yes,02/22/2017 10:38:05 AM PST,Approve,2017,15217


# Choose one of the text fields

In [24]:
colname <- 'interaction_details'
colname

In [25]:
# Get non-empty rows from that column
filtered_data <- subset(survey_data, !is.na(survey_data[[colname]]))
nrow(filtered_data)

# Tokenize (split text into words)
This may seem trivial, but you'll want to detach punctuation from words, since "person" and "person," should count as the same word. And what about contractions such as "I'm"? Will you want to lowercase everything, or is there some distinction, such as between "polish" and "Polish", that you'd want to preserve?

You'll also want to think about "stopwords": function words such as "the", "and", "or", and "that". Counts for these words often distract machine learning models, so they are usually removed unless variation in stopword usage may itself be meaningful.
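
To make these choices concrete, here is a minimal sketch on a made-up two-row data frame (the `toy` data and its sentences are invented for illustration and are not from the survey). It assumes the `to_lower` argument of `unnest_tokens` and the default English list returned by `get_stopwords()`.

```r
library(tidytext)
library(dplyr)

# Hypothetical toy data, not taken from the survey
toy <- data.frame(
    id = 1:2,
    text = c("I'm a person, and I like Polish food.", "Please polish the bike."),
    stringsAsFactors = FALSE
)

# Default tokenization: punctuation is detached and words are lowercased
tokens_lower <- unnest_tokens(toy, word, text)

# Keep the original case to preserve "Polish" vs. "polish"
tokens_cased <- unnest_tokens(toy, word, text, to_lower = FALSE)

# Drop stopwords from the lowercased tokens
tokens_no_stop <- anti_join(tokens_lower, get_stopwords(), by = 'word')

tokens_lower$word
tokens_cased$word
tokens_no_stop$word
```

Comparing the three vectors shows what each decision costs: lowercasing merges "Polish" and "polish", and stopword removal shortens each document considerably.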

In [36]:
library(tidytext)
library(dplyr)


Attaching package: ‘dplyr’

The following objects are masked from ‘package:stats’:

    filter, lag

The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union



In [26]:
class(filtered_data[2, colname])

In [28]:
sapply(filtered_data, class)

In [31]:
colname

In [33]:
tokenized_data <- unnest_tokens(filtered_data, word, !!colname)
nrow(tokenized_data)

In [37]:
# Remove stopwords
tokenized_data <- anti_join(tokenized_data, get_stopwords())
nrow(tokenized_data)

Joining, by = "word"


# Extract features (words to numbers)
One of the simplest ways to get documents into a numeric format for machine learning is to count each unique word and treat each document as a collection of these counts. For example, "the dog barked loudly at the hat" would become {the: 2, dog: 1, barked: 1, loudly: 1, at: 1, hat: 1}. Each unique word in the vocabulary is usually given an ID. Because word order is lost, this is referred to as the "bag-of-words" model of documents.
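
The example sentence can be counted by hand in base R. This is only a sketch of the idea: splitting on single spaces is a simplifying assumption, and the variable names are illustrative.

```r
sentence <- "the dog barked loudly at the hat"

# Split on spaces (a deliberately naive tokenizer for this clean sentence)
words <- strsplit(sentence, " ", fixed = TRUE)[[1]]

# Bag-of-words counts: one entry per unique word, order discarded
counts <- table(words)

# Give each unique word in the vocabulary an integer ID
vocab <- sort(unique(words))
word_ids <- setNames(seq_along(vocab), vocab)

counts
word_ids
```

The cells below do the same thing at scale, with one row of counts per survey response.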

In [39]:
# Make word counts
word_counts <- tokenized_data %>% count(participant_id, word, sort=TRUE)
word_counts

participant_id,word,n
94,vehicles,6
387,vehicle,6
1319,way,6
229,follow,5
364,car,5
3,time,4
43,lane,4
77,stop,4
229,bicyclists,4
229,rules,4


In [40]:
# Make document-term matrix

dtm <- word_counts %>% cast_dtm(participant_id, word, n)
dtm

<<DocumentTermMatrix (documents: 919, terms: 1978)>>
Non-/sparse entries: 9916/1807866
Sparsity           : 99%
Maximal term length: 17
Weighting          : term frequency (tf)

# Run LDA
Now let LDA find topics. You'll want to vary the number of topics and compare the results during interpretation later. Start with 5 or 10 and go up to as many as you feel comfortable trying to interpret.
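
One way to try several topic counts is to fit a model per value of `k` and keep a summary number for each. The sketch below is an illustration under assumptions: it uses the first 20 documents of the `AssociatedPress` example data so that it runs quickly, the candidate values 2 to 4 are arbitrary, and perplexity is only a rough guide that does not replace reading the topics yourself.

```r
library(topicmodels)

# Small subset of the packaged example data, used here only for speed
data('AssociatedPress', package = 'topicmodels')
ap_small <- AssociatedPress[1:20, ]

# Arbitrary candidate topic counts for illustration
candidate_ks <- c(2, 3, 4)

# Fit one LDA model per k and record its perplexity on the training data
perplexities <- sapply(candidate_ks, function(k) {
    model <- LDA(ap_small, k = k, control = list(seed = 9))
    perplexity(model)
})
names(perplexities) <- candidate_ks

perplexities
```

To apply this to the survey, replace `ap_small` with `dtm` and use larger values of `k`, such as 5, 10, and 20.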

In [41]:
library(topicmodels)

In [42]:
lda <- LDA(dtm, k=10, control=list(seed=9))
lda

A LDA_VEM topic model with 10 topics.

# Interpretation
This is one of the tougher parts. You'll examine the words and documents given the highest probability for each topic and see whether they make any sense (they might not). If they don't, go back and change the number of topics, change the preprocessing (tokenization, etc.), or throw up your hands and tell me how terrible topic modeling is :)

## Top words/topic

In [45]:
lda_topics <- tidy(lda, matrix='beta')

top_topic_terms <- lda_topics %>% 
    group_by(topic) %>%
    top_n(5, beta) %>%
    ungroup() %>%
    arrange(topic, -beta)

top_topic_terms

topic,term,beta
1,stop,0.07957208
1,street,0.04693404
1,traffic,0.02519528
1,cross,0.02253984
1,crossing,0.02155735
2,drivers,0.04620747
2,like,0.03973692
2,cars,0.03890876
2,human,0.03725183
2,car,0.034488


## Top documents/topic

In [51]:
# Per-document topic proportions (named separately so the beta table above is not overwritten)
lda_docs <- tidy(lda, matrix='gamma')

top_topic_docs <- lda_docs %>% 
    group_by(topic) %>%
    top_n(5, gamma) %>%
    ungroup() %>%
    arrange(topic, -gamma)

top_topic_docs

document,topic,gamma
229,1,0.9882656
77,1,0.9787195
378,1,0.9752705
224,1,0.9462052
1394,1,0.9296758
141,2,0.9745831
1492,2,0.9428475
1604,2,0.9390428
442,2,0.9388686
1406,2,0.9285627


In [57]:
# Look up the text of each top document; match() finds the row for every document ID at once
top_topic_docs_test <- mutate(top_topic_docs, text = filtered_data[match(document, filtered_data$participant_id), colname])
top_topic_docs_test


In [56]:
filtered_data[which(filtered_data$participant_id=='94'), colname]

## See how distribution of other fields varies across topics
Here, you can "assign" each document to its highest-ranking topic and see how other fields vary across topics.
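
A minimal sketch of that assignment, using made-up numbers rather than the fitted model: `doc_topics` stands in for the output of `tidy(lda, matrix='gamma')`, and `survey` stands in for the survey data with its `thoughts_pgh_av_testing` column. All of the values are invented for illustration.

```r
library(dplyr)

# Toy stand-in for tidy(lda, matrix='gamma'): one row per document-topic pair
doc_topics <- data.frame(
    document = rep(c('1', '2', '3', '4'), each = 2),
    topic    = rep(1:2, times = 4),
    gamma    = c(0.9, 0.1, 0.2, 0.8, 0.7, 0.3, 0.4, 0.6),
    stringsAsFactors = FALSE
)

# Toy stand-in for the survey responses
survey <- data.frame(
    participant_id = 1:4,
    thoughts_pgh_av_testing = c('Approve', 'Disapprove', 'Approve', 'Neutral'),
    stringsAsFactors = FALSE
)

# Keep each document's highest-probability topic, then attach the survey fields
assigned <- doc_topics %>%
    group_by(document) %>%
    filter(gamma == max(gamma)) %>%
    ungroup() %>%
    mutate(participant_id = as.integer(document)) %>%
    inner_join(survey, by = 'participant_id')

# Cross-tabulate assigned topic against another survey field
topic_by_opinion <- table(assigned$topic, assigned$thoughts_pgh_av_testing)
topic_by_opinion
```

With the real data, `doc_topics` would be the gamma table from the fitted model and `survey` would be `filtered_data`. Documents whose top topic has a low gamma are weak assignments, so it may be worth filtering on a minimum gamma before comparing fields.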