GitHub - nohaelprince/MiniSearchEngine: Text mining using big data

Introduction

In information retrieval or text mining, the term frequency – inverse document frequency also called tf-idf, is a well know method to evaluate how important is a word in a document. tf-idf are also a very interesting way to convert the textual representation of information into a Vector Space Model (VSM), or into sparse features. VSM, is a space where text is represented as a vector of numbers instead of its original string textual representation; the VSM represents the features extracted from the document.

Program Goal

Given a big corpus of book data (about million records): bookid, Author and Title, this program implements a tf*idf index in memory which does the following: ∙ Read English text data into the index ∙ For a given query, output the top 10 results ranked by their tf-idf scores

In order to run this program:

Store the corpus file under the name “title_author.tab.txt” in the folder called “data”.
open a terminal and change directory to BookIndexing then type

** $ python3 main.py **

Perquisites:

The following programs should be installed inorder to run the program:

Python v. 3.4
scipy v. 0.14.0
numpy v.1.9.1

Name		Name	Last commit message	Last commit date
Latest commit History 2 Commits
Library.py		Library.py
README.md		README.md
book.py		book.py
main.py		main.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Introduction

Program Goal

In order to run this program:

Perquisites:

Reference:

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Introduction

Program Goal

In order to run this program:

Perquisites:

Reference:

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages