
LE-CAT is a Lexicon-based Categorization and Analysis Tool developed by the Centre for Interdisciplinary Methodologies in collaboration with the Media of Cooperation Group at the University of Siegen.

The tool allows you to apply a set of word queries associated with a category (a lexicon) to a data set of textual sources (the corpus). LE-CAT determines the frequency of occurrence for each query and category in the corpus, as well as the relations between categories (co-occurrence) by source.

The tool also allows you to quickly generate the data for lexicon-based analysis by extracting descriptions from the YouTube API for URLs provided by the user.

The purpose of this technique is to automate and scale up user-led data analysis as it allows the application of a custom-built Lexicon to large data sets. The quick iteration of analysis allows the user to refine a corpus and deeply analyse a given phenomenon.

LE-CAT was coded by James Tripp. It has been used to support the workshop Youtube as Test Society (University of Siegen) and the Digital Test of the News (University of Warwick) and will soon be tested by students on the MA Module Digital Objects, Digital Methods.

Academic correspondence should be sent to Noortje Marres.


You can install the released version of lecat from GitHub with:


Interfaces to the package

The package includes functions to run the analysis and also a Shiny interface to the functions. The Shiny interface can be started by entering the command.


The example below assumes you are not running the Shiny app.


Downloading descriptions from YouTube

You may wish to create a corpus of video descriptions from YouTube. A recent workshop took this approach.

You should have a list of YouTube URLs and a YouTube Data API key. You can generate an API key by creating a Google account and going to the developers console. Your YouTube URLs should be loaded as a character vector as below:

youtube_urls <- c(

youtube_key <- 'YOURKEY'

The extract_video_ids function attempts to extract the video ids from your URLs. It tries to resolve each URL and discards any that cannot be resolved to a recognised format (

youtube_ids <- extract_video_ids(youtube_urls)
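To make the idea of "recognised formats" concrete, here is a minimal base R sketch of pulling ids out of a few common URL shapes. It is not how extract_video_ids is implemented: it does not resolve URLs over the network, the URLs and ids are made-up placeholders, and the pattern assumes ids are 11 characters drawn from letters, digits, `_` and `-`.

```r
# Placeholder URLs (hypothetical ids, not real videos).
urls <- c(
  "https://www.youtube.com/watch?v=AAAAAAAAAAA",
  "https://youtu.be/BBBBBBBBBBB",
  "https://www.youtube.com/embed/CCCCCCCCCCC?start=10",
  "https://example.com/not-a-video"
)

# Assumed id shape: 11 characters following "v=", "youtu.be/" or "/embed/".
id_pattern <- "(?:[?&]v=|youtu\\.be/|/embed/)([A-Za-z0-9_-]{11})"

# regexec() returns the full match plus the captured id for each URL.
matches <- regmatches(urls, regexec(id_pattern, urls, perl = TRUE))

# Keep the captured group; URLs that did not match become NA and are dropped.
ids <- vapply(
  matches,
  function(m) if (length(m) == 2) m[2] else NA_character_,
  character(1)
)
ids <- ids[!is.na(ids)]
```

The last URL matches none of the assumed shapes, so it is discarded, which mirrors the behaviour described above for unrecognised URLs.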

These ids can be passed to the download_youtube_video_descriptions function. The function returns a corpus data frame with one video per row and the id, publishedAt, title and description of each video as columns.

corpus <- download_youtube_video_descriptions(video_ids = youtube_ids, api_key = youtube_key)

Running a LE-CAT analysis


The LE-CAT lexicon should contain the queries you wish to look for, the categories associated with the queries and the type of category. The lexicon can be in a wide form such as

Type        Category  Query     Query1     Query2
Technology  Software  Windows   MacOS      Linux
Technology  Hardware  Mac mini  Alienware  iMac
Tree        Hardwood  Oak       Pine       Maple

which is then converted to a long format using the function parse_lexicon

lexicon <- parse_lexicon(wide_lexicon = lexicon, query_column = 'Query')

where Query is the first column containing your queries and there is no other type of data to the right. The parse_lexicon function can handle a wide lexicon containing differing numbers of queries per category, such as:

Type        Category  Query    Query1  Query2
Technology  Software  Windows  MacOS   Linux
Technology  Hardware  iMac
Tree        Hardwood  Oak      Pine    Maple

The parse_lexicon function returns a long form lexicon, which is used in the lecat analysis. The long form lexicon looks like this

Type        Category  Query
Technology  Software  Windows
Technology  Software  MacOS
Technology  Software  Linux
Technology  Hardware  iMac

and, alternatively, one can use a lexicon already in a long form.
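As an illustration of what the wide-to-long conversion involves, the following base R sketch stacks every query column into a single Query column. It is not the parse_lexicon implementation: the helper name `wide_to_long` is hypothetical, and the sketch assumes missing queries are stored as NA or empty strings.

```r
# Wide lexicon mirroring the second table above; NA marks a missing query.
wide <- data.frame(
  Type     = c("Technology", "Technology", "Tree"),
  Category = c("Software", "Hardware", "Hardwood"),
  Query    = c("Windows", "iMac", "Oak"),
  Query1   = c("MacOS", NA, "Pine"),
  Query2   = c("Linux", NA, "Maple"),
  stringsAsFactors = FALSE
)

# Hypothetical helper (not part of lecat): every column from `query_column`
# rightwards is treated as holding queries.
wide_to_long <- function(wide, query_column = "Query") {
  first <- match(query_column, names(wide))
  query_cols <- names(wide)[first:ncol(wide)]
  rows <- lapply(seq_len(nrow(wide)), function(i) {
    queries <- unlist(wide[i, query_cols], use.names = FALSE)
    # Drop empty cells so categories may have differing numbers of queries.
    queries <- queries[!is.na(queries) & queries != ""]
    data.frame(
      Type     = rep(wide$Type[i], length(queries)),
      Category = rep(wide$Category[i], length(queries)),
      Query    = queries,
      stringsAsFactors = FALSE
    )
  })
  do.call(rbind, rows)
}

long <- wide_to_long(wide)
```

Each row of `long` holds one Type, Category and Query combination, which is the shape the analysis step expects.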

Query searching

LE-CAT searches for terms in the corpus text. You may wish to specify a different search column for each Type. The search column for each type is specified in a search data frame detailing each Type and the corresponding search column. For example,

searches <- data.frame(
  Type = c('Technology', 'Tree'),
  Column = c('description', 'title'),
  stringsAsFactors = FALSE
)

where Technology queries should be searched for in the description column of the corpus and Tree queries should be searched for in the title column. You need to create this data frame to define the search column for each Type.

To carry out your search of the corpus, pass the long form lexicon, the search data frame and corpus to the run_lecat_analysis function.

lecat_result <- run_lecat_analysis(lexicon = lexicon,
                                   corpus = corpus,
                                   searches = searches,
                                   id = 'id',
                                   regex_expression = '\\Wquery\\W')

Note that you can pass your own regular expression. The function replaces the text ‘query’ with the relevant search query. The expression above matches cases where a non-word character (e.g., a space or period) sits immediately before and after the search query.
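The substitution step can be sketched in base R as follows. This is an illustration of the idea rather than lecat's internal code: the helper `count_query` and the sample texts are hypothetical.

```r
# Template in the same style as the regex_expression argument.
template <- "\\Wquery\\W"

# Hypothetical helper (not lecat's implementation): put the query into the
# template, then count the matches in each text.
count_query <- function(query, texts, template) {
  pattern <- sub("query", query, template, fixed = TRUE)
  vapply(
    gregexpr(pattern, texts, perl = TRUE),
    function(m) sum(m > 0),  # gregexpr() reports -1 when there is no match
    integer(1)
  )
}

texts <- c("I run Linux at home.", "Linux is free", "GNU/Linux, Linux!")
counts <- count_query("Linux", texts, template)
```

In this sketch the second text yields no match, because a query at the very start or end of a text has no non-word character on one side. If that matters for your corpus, a word-boundary template such as '\\bquery\\b' is one alternative to try.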

The query result is a table like the below

Type        Category  Query    Column_examined  id1  id2
Technology  Software  Windows  description      1    4
Technology  Software  MacOS    description      0    2
Technology  Software  Linux    description      3    1
Technology  Hardware  iMac     description      4    2

where id1 and id2 are the ids specified above for each entry in the corpus.


The raw LE-CAT count data shown above can be hard to read. It can be summarised into a diagnostic file using the create_unique_total_diagnostics function

diagnostics <- create_unique_total_diagnostics(lecat_result)

to create a table like so

Type             Category                      Queries                                          Column_examined
Technology(2,2)  Software(2,2), Hardware(2,2)  Windows(2,2), MacOS(2,2), Linux(2,2), iMac(2,2)  description

where the first number in the brackets is the total number of occurrences and the second is the number of corpus elements (e.g., individual YouTube videos) in which each Type, Category and Query occurs.
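The two numbers can be reproduced by hand from a count table. The sketch below is not the create_unique_total_diagnostics implementation; it assumes "total" means the sum of counts across sources and that a source is counted once if it contains at least one match. The counts are copied from the example result table.

```r
# Counts copied from the example result table.
result <- data.frame(
  Type     = "Technology",
  Category = c("Software", "Software", "Software", "Hardware"),
  Query    = c("Windows", "MacOS", "Linux", "iMac"),
  id1      = c(1, 0, 3, 4),
  id2      = c(4, 2, 1, 2),
  stringsAsFactors = FALSE
)
counts <- as.matrix(result[, c("id1", "id2")])

# Per query: total occurrences, and number of sources with at least one hit.
query_total   <- rowSums(counts)
query_sources <- rowSums(counts > 0)

# Per category: add up the queries within each category, then summarise.
category_counts  <- rowsum(counts, group = result$Category)
category_total   <- rowSums(category_counts)
category_sources <- rowSums(category_counts > 0)
```

With these counts the Software category totals 11 occurrences across 2 sources, while MacOS occurs twice in a single source.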


You may calculate the co-occurrence of queries (irrespective of the corresponding Type and Category) using the create_cooccurrence_graph function

cooccurrence <- create_cooccurrence_graph(lecat_result = lecat_result,
                                          level = 'Query')

Note: creating the co-occurrence graph can take a long time, depending on the size of the corpus and lexicon.

You can calculate co-occurrence based on the categories by changing the level argument to ‘Category’.
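For intuition, co-occurrence by source can be computed directly from a count matrix. The sketch below is not the create_cooccurrence_graph implementation: it assumes two queries co-occur once for every source in which both appear at least once, and lecat may weight edges differently. The counts are copied from the example result table.

```r
# Query-by-source counts copied from the example result table.
counts <- matrix(
  c(1, 4,
    0, 2,
    3, 1,
    4, 2),
  nrow = 4, byrow = TRUE,
  dimnames = list(c("Windows", "MacOS", "Linux", "iMac"), c("id1", "id2"))
)

# 1 if the query appears in the source at all, 0 otherwise.
present <- (counts > 0) * 1

# Entry [i, j] is the number of sources that contain both query i and query j.
cooc <- tcrossprod(present)
diag(cooc) <- 0  # a query co-occurring with itself is not of interest

# Edge list using the upper triangle only, so each pair is listed once.
idx <- which(upper.tri(cooc) & cooc > 0, arr.ind = TRUE)
edges <- data.frame(
  from   = rownames(cooc)[idx[, "row"]],
  to     = colnames(cooc)[idx[, "col"]],
  weight = cooc[idx],
  stringsAsFactors = FALSE
)
```

The same calculation works at the Category level if the counts are first summed within each category. The matrix grows with the square of the number of queries, which is one reason a large lexicon can be slow to process.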

By default the files ‘result.graphml’ and ‘cotable.csv’ are created in the working directory.

The result.graphml file contains a network of the co-occurrences. The Type, Category and Column_examined are attributes of the nodes, and the weight of each edge is the co-occurrence of the two queries it connects. You may view graphml network graphs using the excellent Gephi program. Note: if you have duplicate edges (edges for both node1-node2 and node2-node1), then select merge first within Gephi.

The cotable.csv file is a table of the co-occurrences. This file can be loaded into programs such as Excel, LibreOffice or R.

The create_cooccurrence_graph function returns a list containing the co-occurrence table as a data frame and the co-occurrence network as an igraph network. The igraph object may be useful, as the igraph package contains many functions for calculating statistics on the generated network. You may also plot the object in R using the following command:


and view the co-occurrence table like so


Bugs or feature requests

Please report any bugs or feature requests via GitHub.

Dr James Tripp, Academic Technologist, CIM