# Information Retrieval

Information retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers).

![TREC logo](img/treclogo-c.gif)

## Contact Details

Boris Velichkov

Dimitar Dimitrov:
mitko.bg.ss@gmail.com /
ilijanovd@fmi.uni-sofia.bg 

Aleksis Datseris

## Information Retrieval - what is it about & what do you gain from it?
* Introductory course about finding relevant information within large collections of *unstructured* data.
* The course explores the core concepts behind search engines.
* It builds the foundation for advanced fields like NLP, recommender systems, and AI-driven information systems.
* *Specific to our course* -> You will learn Python basics which will be used in future courses. *Recommendation:* Enroll in the Python Basics course to enhance your proficiency in Python.


## What is the agenda for the practical lessons?

* Course Intro
* Intro to Python (Basic)
* Text processing (Basic)
* Indexing - Inverted Index 
* TF-IDF & Ranking 
* Metrics - Evaluation, Analysis 
* Web crawling/scraping 
* Spell-checking & Keyword extraction 
* Summarization - Extractive & Abstractive 
* Classification (Features) - NB & Logistic Regression 
* Language models (Extra)
* Course project idea discussion
    * Every practical lesson is potentially a project idea discussion!!!
* Course project idea presentation
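
To give a flavour of the "Indexing" and "TF-IDF & Ranking" topics above, here is a minimal sketch using only the Python standard library. It is not the course's reference implementation: the toy corpus `DOCS`, the helper names (`tokenize`, `build_inverted_index`, `rank`), and the particular weighting (raw term frequency times `log(N / df)`) are assumptions chosen for illustration.

```python
import math
from collections import Counter, defaultdict

# Toy corpus (hypothetical documents, for illustration only).
DOCS = {
    1: "search engines index documents",
    2: "an inverted index maps terms to documents",
    3: "ranking orders documents by relevance",
}


def tokenize(text):
    """Lowercase the text and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def build_inverted_index(docs):
    """Map each term to a dict of {doc_id: term frequency in that document}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, count in Counter(tokenize(text)).items():
            index[term][doc_id] = count
    return index


def rank(query, index, n_docs):
    """Score documents by the sum of tf * idf over the query terms."""
    scores = defaultdict(float)
    for term in tokenize(query):
        postings = index.get(term, {})
        if not postings:
            continue  # the term does not occur in the corpus
        idf = math.log(n_docs / len(postings))
        for doc_id, tf in postings.items():
            scores[doc_id] += tf * idf
    # Highest score first; ties are broken by document id.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


INDEX = build_inverted_index(DOCS)
RESULTS = rank("inverted index", INDEX, len(DOCS))
print(RESULTS)
```

With this corpus, document 2 ranks first for the query "inverted index" because it is the only document containing the rarer term "inverted". A term that occurs in every document, such as "documents", gets an IDF of zero and contributes nothing to the score.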

## Course dynamics - why does it matter?
* No more chasing, now it is your turn.
* Questions and discussions are always welcome.
* Try to attend the practical lessons at the very least. We always share personal experiences that ChatGPT cannot tell you about!

## **Start thinking about your project early. It will save you a lot of time.**
* Choosing your project topic early always yields better results. Why? Because there is time to discuss!
* Aim to do something you have not done before. 
* Set high goals and complete what you can. We do not penalize effort, regardless of the end results.
* You can choose your own course project topic...
* ... Or you can come to us for suggestions. 

# Conferences

Special Interest Group on Information Retrieval: <a href='https://sigir.org'>SIGIR</a> <br>
Text REtrieval Conference: <a href='https://trec.nist.gov/proceedings/proceedings.html'>TREC</a>  <br>
European Conference on Information Retrieval: <a href='http://www.ecir2018.org/'>ECIR</a>  <br>
Conference and Labs of the Evaluation Forum: <a href='https://www.clef-initiative.eu/'>CLEF</a>  <br>


# Example papers:
<ul>
<li><a href='http://nrl.northumbria.ac.uk/30863/1/SIGIR2017_Elsweiler.pdf'>Exploiting Food Choice Biases for Healthier <b>Recipe Recommendation</b></a> -> dataset of food recipes with nutrition information crawled from Allrecipes.com
<li>CitySearcher: A <b>City Search</b> Engine For Interests 
<li>A Test Collection for Evaluating <b>Legal Case Law Search</b>
<li>Multihop Attention Networks for <b>Question Answer Matching</b>
<li>Semantic Location in <b>Email Query Suggestion</b>
<li>Online <b>Job Search</b>: Study of Users’ Search Behavior using Search Engine Query Logs	
</ul>

# Most Valued Projects:

- Something Useful for Sofia University, the Master's Degree, etc. (contact Prof. Koychev)
- Participating in Shared Tasks (contact us or Prof. Koychev)

## Some project ideas
- Grammarly or [Hemingway](http://www.hemingwayapp.com/) for Bulgarian
- !!!Collect/crawl questions and answers from exams after 4th/12th grade (there are a lot of online resources!). This will serve as a good starting point for building a Machine Reading/Question Answering model for Bulgarian!

# Some Shared Tasks

## <a href='http://alt.qcri.org/semeval2019/index.php?id=tasks'>SemEval</a>
- Fact Checking in Community Question Answering Forums
- Suggestion Mining from Online Reviews and Forums
- RumourEval 2019: Determining Rumour Veracity and Support for Rumours
- many more

## CheckThat!

# Basic (pre-)requisites

## Python basics:
- http://nbviewer.jupyter.org/github/justmarkham/python-reference/blob/master/reference.ipynb
- https://www.cs.put.poznan.pl/csobaniec/software/python/py-qrc.html
- https://www.stavros.io/tutorials/python/

## Jupyter Notebooks
- https://www.dataquest.io/blog/jupyter-notebook-tutorial/

## Text Processing Libraries
- NLTK - collection of libraries and tools for text processing, created by academics (not production ready)
- spaCy - Industrial-Strength Natural Language Processing
- scikit-learn - Machine Learning in Python
- pandas - Data structures and data analysis tools for Python
- NumPy - scientific computing with Python
- Keras, TensorFlow, PyTorch - deep learning libraries for Python

# Books:
## Information Retrieval 
- Book for the course: https://nlp.stanford.edu/IR-book/information-retrieval-book.html
## NLP
- Foundations of Statistical Natural Language Processing https://nlp.stanford.edu/fsnlp/
- Speech and Language Processing https://web.stanford.edu/~jurafsky/slp3/ (you can also find video lectures on YouTube)

# Online sources

## Search for papers to find relevant work and existing approaches
- https://scholar.google.com/
- https://www.researchgate.net/

## Corpora
- https://toolbox.google.com/datasetsearch
- https://www.kaggle.com/
- https://archive.ics.uci.edu/ml/datasets.html

## Facebook groups
- https://www.facebook.com/groups/1034542806576291/
- https://www.facebook.com/groups/829586007120477/
- https://www.facebook.com/groups/machine.learning.bg/
- https://www.facebook.com/datasciencesoc/

## Misc
- https://www.kdnuggets.com/
- https://machinelearningmastery.com/start-here/
- https://nlpprogress.com/ - latest research in NLP
- https://paperswithcode.com/ - implementations of papers