snlp-final-project

Final Project for Statistical Natural Language Processing Course

Baseline Document Retrieval Model
- Extract text from corpus
- Preprocess the texts from corpus and apply tokenisation
- Compute idf
- Compute tf
- Represent the query as a list of terms, each weighted by the product of its idf and tf values
- Score relevance by cosine similarity
- Sort similarity scores and output top 50 most relevant documents
- Function to evaluate retrieval performance using precision at r, with r = 50
- Test on test_questions.txt
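
The baseline steps above can be sketched roughly as follows. This is a minimal illustration and not the code in model.py: the function names (`tokenize`, `compute_idf`, `tfidf_vector`, `cosine`, `retrieve`, `precision_at_r`), the regex tokeniser, raw term counts for tf, and `log(N / df)` for idf are all assumptions made for the example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Lowercase and keep alphanumeric runs (assumed stand-in for the real preprocessing).
    return re.findall(r"[a-z0-9]+", text.lower())


def compute_idf(tokenized_docs):
    # idf(t) = log(N / df(t)), where df(t) counts the documents containing t.
    n_docs = len(tokenized_docs)
    df = Counter(term for doc in tokenized_docs for term in set(doc))
    return {term: math.log(n_docs / count) for term, count in df.items()}


def tfidf_vector(tokens, idf):
    # Weight each term by raw term frequency times idf; terms unseen in the corpus are dropped.
    tf = Counter(tokens)
    return {t: tf[t] * idf[t] for t in tf if t in idf}


def cosine(u, v):
    # Cosine similarity between two sparse vectors stored as dicts.
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def retrieve(query, docs, top_k=50):
    # docs maps a document id to its text; returns (doc_id, score) pairs, best first.
    tokenized = {doc_id: tokenize(text) for doc_id, text in docs.items()}
    idf = compute_idf(list(tokenized.values()))
    q_vec = tfidf_vector(tokenize(query), idf)
    scores = {d: cosine(q_vec, tfidf_vector(toks, idf)) for d, toks in tokenized.items()}
    ranked = sorted(scores.items(), key=lambda pair: pair[1], reverse=True)
    return ranked[:top_k]


def precision_at_r(ranked_ids, relevant_ids, r=50):
    # Fraction of the top r results that are relevant; always divides by r.
    top = ranked_ids[:r]
    return sum(1 for d in top if d in relevant_ids) / r
```

With this sketch, `retrieve(question, docs)` yields the top 50 documents for one question, and `precision_at_r` can then be averaged over all test questions. Note that a term occurring in every document gets an idf of zero under this formula, so it contributes nothing to the score.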
Advanced Document Retriever with Re-Ranking
- Use the baseline model and return the top 1000 documents
- Re-rank the top 1000 documents with a more advanced approach
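
The list above does not say which re-ranking approach is used. As one possible illustration, the sketch below re-scores the baseline candidates with Okapi BM25. BM25 itself, the function name `bm25_rerank`, the parameter values `k1=1.5` and `b=0.75`, and computing document frequencies over the candidate set only are all assumptions made for this example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Lowercase and keep alphanumeric runs (assumed stand-in for the real preprocessing).
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_rerank(query, candidates, k1=1.5, b=0.75, top_k=50):
    # candidates maps doc id -> text, e.g. the top 1000 documents from the baseline.
    tokenized = {d: tokenize(text) for d, text in candidates.items()}
    n_docs = len(tokenized)
    if n_docs == 0:
        return []
    avg_len = sum(len(toks) for toks in tokenized.values()) / n_docs
    # Document frequency of each term within the candidate set.
    df = Counter(term for toks in tokenized.values() for term in set(toks))
    q_terms = set(tokenize(query))  # each distinct query term is counted once

    scores = {}
    for d, toks in tokenized.items():
        tf = Counter(toks)
        score = 0.0
        for term in q_terms:
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Length normalisation: longer-than-average documents are penalised.
            denom = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / denom
        scores[d] = score
    return sorted(scores.items(), key=lambda pair: pair[1], reverse=True)[:top_k]
```

Compared with plain tf-idf cosine scoring, BM25 saturates the effect of repeated terms and normalises by document length, which is why it is a common choice for a second-stage ranker.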
Sentence Ranker
- Split the top 50 documents into sentences (sent_tokenize)
- Treat the sentences like documents, rank them, and return the top 50 sentences (same approach as above)
- Evaluate performance using Mean Reciprocal Rank
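
The evaluation step above can be sketched as follows. The example assumes that a sentence counts as correct when it matches a regular-expression answer pattern, which is an assumption about how answers are judged. The helper names are hypothetical, and the naive regex splitter only stands in for `sent_tokenize` so that the example runs without NLTK.

```python
import re


def split_sentences(text):
    # Naive split after '.', '!' or '?' followed by whitespace (stand-in for sent_tokenize).
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def reciprocal_rank(ranked_sentences, answer_pattern):
    # 1 / rank of the first sentence matching the pattern, or 0 if none matches.
    for rank, sentence in enumerate(ranked_sentences, start=1):
        if re.search(answer_pattern, sentence):
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(rankings, patterns):
    # rankings[i] is the ranked sentence list for question i; patterns[i] is its answer pattern.
    if not rankings:
        return 0.0
    total = sum(reciprocal_rank(r, p) for r, p in zip(rankings, patterns))
    return total / len(rankings)
```

For example, if the first correct sentence appears at rank 2 for one question and no correct sentence is returned for another, the Mean Reciprocal Rank over the two questions is (1/2 + 0) / 2 = 0.25.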
