Information Retrieval - Escuela Politécnica Nacional - Computación 2024-A

Welcome to the Information Retrieval course for the Bachelor of Science in Computer Science. This repository contains the assignments and hands-on exercises that will help you understand and apply the concepts of Information Retrieval (IR).

Course Description

This course introduces the fundamental concepts of Information Retrieval systems, including the processing, indexing, and retrieval of textual data. Students will learn about various algorithms and techniques used in search engines, document and query processing, and relevance feedback. The course will also cover recent advancements in IR such as web search, semantic analysis, and information retrieval in big data.

Prerequisites

  • Basic programming skills in Python
  • Familiarity with data structures
  • Basic knowledge of algorithms

Course Description

This course addresses advanced aspects of information access and retrieval, focusing on retrieval models (probabilistic, vector space, and logical), multimedia indexing, web information retrieval, and their connections with machine learning. It also introduces students to processing large volumes of semi-structured data. The theoretical content is accompanied by examples drawn from different applications.

Learning Outcomes

By the end of this course, students will be able to:

  1. Understand the architecture of information retrieval systems.
  2. Implement basic algorithms for indexing and searching text.
  3. Evaluate the effectiveness of IR systems.
  4. Apply IR techniques to real-world text processing tasks.
  5. Analyze the models related to information retrieval and their application in specific environments.
  6. Discriminate the most appropriate retrieval, classification, and clustering strategies for different application cases.
  7. Act with ethical and professional responsibility in managing information retrieval systems.
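
As a rough illustration of outcome 2 (indexing and searching text), the sketch below builds a tiny in-memory inverted index and answers boolean AND queries. It is not taken from the course materials: the function names `build_inverted_index` and `search_and`, the whitespace tokenizer, and the sample documents are all assumptions made for this example.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each lowercased term to the sorted list of ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        # Assumption: naive whitespace tokenization, no stemming or stop-word removal.
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


def search_and(index, query):
    """Return the ids of documents that contain every term of the query."""
    postings = [set(index.get(term, [])) for term in query.lower().split()]
    if not postings:
        return []
    return sorted(set.intersection(*postings))


# Hypothetical sample collection, used only for illustration.
docs = [
    "Information retrieval systems index text",
    "Search engines index web pages",
    "Retrieval models rank text documents",
]
index = build_inverted_index(docs)
print(search_and(index, "index text"))  # [0]
print(search_and(index, "retrieval"))   # [0, 2]
```

A real system would add tokenization, normalization, and compressed postings lists; the point here is only the term-to-documents mapping.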

Course Structure

  • Lectures: Weekly lectures covering theory and practical applications.
  • Assignments: Two assignments per week to be completed individually.
  • Hands-On Exercises: Practical sessions during labs to implement and test IR concepts.
  • Project: A term-end project focusing on a specific IR challenge.

Assignments

This section will be regularly updated with assignment descriptions and deadlines. Assignments are intended to be completed using Python and will require the use of specific libraries detailed in each assignment's instructions.

Hands-On Exercises

Hands-on exercises will be conducted during lab sessions. These exercises are designed to reinforce the theoretical concepts covered in lectures through practical implementation.

Getting Started

  1. Clone this repository to your local machine using git clone https://github.com/ivan-carrera/ir24a.git.
  2. Install any required software and libraries as specified in the setup guide for each assignment.
  3. Follow the instructions provided in each directory for assignments and exercises.

Resources

Books:

  • Baeza-Yates R., Ribeiro-Neto B. (2010). Modern Information Retrieval, The Concepts and Technology Behind Search. Addison Wesley
  • Manning C., Raghavan P., Schütze H. (2008). Introduction to Information Retrieval. Cambridge University Press
  • Konchady M. (2008). Building Search Applications: Lucene, LingPipe, and Gate. Mustru Publishing
  • Büttcher S., Clarke C., Cormack G. (2016). Information Retrieval: Implementing and Evaluating Search Engines. The MIT Press

Papers:

Foundational Papers

  • Sparck Jones, K. (1972). A statistical interpretation of term specificity and its application in retrieval. Journal of documentation, 28(1), 11-21. Introduces the concept of Inverse Document Frequency (IDF), which is a fundamental part of the TF-IDF weighting scheme used in document scoring.
  • Rocchio Jr, J. J. (1971). Relevance feedback in information retrieval. The SMART retrieval system: experiments in automatic document processing. Describes the Rocchio algorithm for relevance feedback in vector space models, which is crucial for improving search accuracy based on user feedback.
  • Brin, S., & Page, L. (1998). The anatomy of a large-scale hypertextual web search engine. Computer networks and ISDN systems, 30(1-7), 107-117. This is the original paper on Google's search engine design, detailing the architecture and algorithms that revolutionized how search engines operate.
  • Bates, M. J. (1989). The design of browsing and berrypicking techniques for the online search interface. Online review, 13(5), 407-424. Introduces the concept of berrypicking, a model of browsing in which the user refines their search query iteratively based on the information retrieved.
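
To make the IDF idea from the Sparck Jones entry concrete, here is a minimal sketch that uses the common textbook form idf(t) = log(N / df(t)) and multiplies it by raw term frequency. This particular variant, the function names, and the toy corpus are assumptions for illustration; the paper and later systems use several different weightings.

```python
import math
from collections import Counter


def inverse_document_frequency(docs):
    """Compute idf(t) = log(N / df(t)) for every term in a list of token lists."""
    n_docs = len(docs)
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # count each term at most once per document
    return {term: math.log(n_docs / count) for term, count in df.items()}


def tf_idf(tokens, idf):
    """Weight each term of one document by raw term frequency times IDF."""
    tf = Counter(tokens)
    return {term: freq * idf.get(term, 0.0) for term, freq in tf.items()}


# Hypothetical toy corpus, already tokenized.
corpus = [
    ["search", "engine", "index"],
    ["search", "query", "query"],
]
idf = inverse_document_frequency(corpus)
weights = tf_idf(corpus[1], idf)
# "search" occurs in every document, so its IDF (and weight) is 0.
# "query" occurs in one of two documents, so its weight is 2 * log(2).
print(weights)
```

Terms that appear in every document receive zero weight, which is the intuition behind term specificity: common terms discriminate poorly between documents.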

Algorithms and Models

  • Salton, G., & Buckley, C. (1990). Improving retrieval performance by relevance feedback. Journal of the American society for information science, 41(4), 288-297. Explores and provides algorithms for relevance feedback, a key technique in improving the effectiveness of IR systems.
  • Manning, C., & Klein, D. (2003, May). Optimization, maxent models, and conditional estimation without magic. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology: Tutorials-Volume 5 (pp. 8-8). Discusses maximum entropy models, which have been influential in the development of probabilistic approaches to IR.

Modern Techniques and Machine Learning

  • Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2018). BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805. Introduces BERT, a Transformer-based method that has set new standards for how natural language processing tasks are tackled, including in IR.
  • Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3(Jan), 993-1022. Describes LDA, a generative statistical model that explains sets of observations through unobserved groups, and shows how these groups can be used to learn the themes running through a text corpus.

Evaluation and Metrics

  • Robertson, S., & Zaragoza, H. (2009). The probabilistic relevance framework: BM25 and beyond. Foundations and Trends® in Information Retrieval, 3(4), 333-389. A comprehensive review of the BM25 ranking function and the probabilistic relevance framework, which are widely used in document retrieval.
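
As a hedged sketch of the BM25 ranking function reviewed in the entry above, the code below scores tokenized documents against a query. The parameter defaults (`k1=1.5`, `b=0.75`), the non-negative IDF variant with `+ 1.0` inside the logarithm, the function name `bm25_scores`, and the toy documents are assumptions chosen for illustration, not values prescribed by the paper or the course.

```python
import math
from collections import Counter


def bm25_scores(query_tokens, docs, k1=1.5, b=0.75):
    """Return one BM25 score per document for a tokenized query.

    docs is a non-empty list of token lists; k1 controls term-frequency
    saturation and b controls document-length normalization.
    """
    n_docs = len(docs)
    avgdl = sum(len(d) for d in docs) / n_docs
    df = Counter()
    for d in docs:
        df.update(set(d))  # document frequency of each term

    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query_tokens:
            if term not in tf:
                continue
            # Assumption: smoothed IDF variant that never goes negative.
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            norm = tf[term] + k1 * (1 - b + b * len(d) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Hypothetical toy collection, already tokenized.
bm25_docs = [
    ["cat", "sat"],
    ["dog", "barked"],
    ["cat", "cat", "purred"],
]
bm25_result = bm25_scores(["cat"], bm25_docs)
print(bm25_result)
```

In this toy collection the second document scores zero because it does not contain the query term, and the third outranks the first because the extra occurrence of "cat" outweighs its length penalty.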

Ethics and Bias

  • Hajian, S., Bonchi, F., & Castillo, C. (2016, August). Algorithmic bias: From discrimination discovery to fairness-aware data mining. In Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining (pp. 2125-2126). Discusses the impact of bias in algorithmic decision-making, including retrieval systems, and approaches for creating fairness-aware algorithms.

Contributing

Students are encouraged to contribute to the course by sharing resources, discussing topics, and submitting pull requests with improvements to code or documentation.

Contact Information

For any inquiries regarding the course, please contact Prof. Iván Carrera, PhD.

This repository is maintained by Prof. Iván Carrera and is intended for students enrolled in the Information Retrieval course at Escuela Politécnica Nacional.
