This course is an introduction to corpus linguistics. We will start with a brief introduction to textual corpora, including linguistic annotation and representation schemas. We will then turn to extracting relevant information from corpora, such as collocations and keywords, using statistical and distributional techniques. Finally, we will learn the XML markup language. Throughout the module we will introduce several corpora in various languages (English, Spanish, Basque, etc.).
- Introduction to Corpus Linguistics
  - Introduction
  - Corpus Linguistics
  - Uses of corpora
  - Corpus types
  - Corpus annotation and standards for linguistic representation
- XML
  - XML introduction
  - XML schemas and validation
  - XPath
- Laboratories
  - Linux commands
  - Word frequencies and Zipf's law
  - Collocations
  - Keyword extraction
  - XML and XPath
- Assignments
  - Brown collocations
  - Hyperpartisan log-odds ratios
- Python (version 3 or higher)
- lxml library (http://lxml.de/)
- R (https://www.r-project.org/)
- AntConc (http://www.laurenceanthony.net/software.html)
- Attendance and participation: 10%
- Class assignments: 55%
- Assignments: three choices
  - Regular assignment: 20%
  - ‘Hard’ assignment: 35%
  - Propose a subject for the final project: 35%