jpmcResearchBot

This tool extracts suggested tags from a given document or from a given string.

The suggested tags consist partly of keywords extracted from the content using the tf-idf algorithm and partly of named entities extracted using NER (named entity recognition).

I. File List

II. Design

In the tf-idf algorithm, term frequency (tf) is the raw frequency of a term in a document. Inverse document frequency (idf) measures how much information a word provides. It is the logarithmically scaled inverse fraction of the documents that contain the word, obtained by dividing the total number of documents by the number of documents containing the term and then taking the logarithm of that quotient.
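The definitions above can be sketched in a few lines of Python. This is a minimal illustration, not the code shipped in ./TFIDF/: the function names are hypothetical, the input is assumed to be already tokenized, and terms missing from the idf data are assumed to score 0.

```python
import math
from collections import Counter


def term_frequencies(tokens):
    """Raw count of each term in one tokenized document."""
    return Counter(tokens)


def inverse_document_frequencies(corpus):
    """idf(t) = log(N / df(t)), with df(t) the number of documents containing t."""
    total_docs = len(corpus)
    doc_freq = Counter()
    for tokens in corpus:
        doc_freq.update(set(tokens))  # count each term once per document
    return {term: math.log(total_docs / df) for term, df in doc_freq.items()}


def top_keywords(tokens, idf, max_n):
    """Rank the terms of one document by tf * idf and keep the best max_n."""
    tf = term_frequencies(tokens)
    # Assumption: a term with no idf entry contributes a score of 0.
    scores = {term: count * idf.get(term, 0.0) for term, count in tf.items()}
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [term for term, _ in ranked[:max_n]]


corpus = [
    ["power", "prices", "rise", "power"],
    ["nuclear", "power", "liabilities"],
    ["capacity", "payments", "rise"],
]
idf = inverse_document_frequencies(corpus)
keywords = top_keywords(corpus[0], idf, max_n=2)
print(keywords)
```

In this toy corpus "prices" outranks "power" even though "power" occurs twice, because "prices" appears in only one of the three documents and so has a higher idf.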

The NER algorithm extracts the named entities from the document.

A MySQL database stores the data needed for idf. The table named idf and the table named doc_num contain all the information necessary to calculate idf. These tables are created and initialized once at the beginning and are updated only when new documents are imported.
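To make the role of the two tables concrete, here is a rough sketch of how a per-term table and a document-count table could be kept up to date and queried. It uses Python's built-in sqlite3 module in place of MySQL so that it runs without a server, and the column names (term, doc_count, total) are hypothetical; the real schema created by createTablesInDB.py may differ.

```python
import math
import sqlite3

# In-memory SQLite stands in for MySQL so the sketch runs without a server.
conn = sqlite3.connect(":memory:")
# Hypothetical columns: the real schema of idf and doc_num may differ.
conn.execute("CREATE TABLE idf (term TEXT PRIMARY KEY, doc_count INTEGER NOT NULL)")
conn.execute("CREATE TABLE doc_num (total INTEGER NOT NULL)")
conn.execute("INSERT INTO doc_num (total) VALUES (0)")


def import_document(conn, tokens):
    """Update both tables when one new tokenized document is imported."""
    conn.execute("UPDATE doc_num SET total = total + 1")
    for term in set(tokens):  # each term counts once per document
        updated = conn.execute(
            "UPDATE idf SET doc_count = doc_count + 1 WHERE term = ?", (term,)
        )
        if updated.rowcount == 0:
            conn.execute("INSERT INTO idf (term, doc_count) VALUES (?, 1)", (term,))


def lookup_idf(conn, term):
    """Return log(total / doc_count), or None if the term was never seen."""
    total = conn.execute("SELECT total FROM doc_num").fetchone()[0]
    row = conn.execute("SELECT doc_count FROM idf WHERE term = ?", (term,)).fetchone()
    if row is None:
        return None
    return math.log(total / row[0])


import_document(conn, ["power", "prices", "power"])
import_document(conn, ["nuclear", "power"])
print(lookup_idf(conn, "prices"))
```

Storing counts rather than precomputed logarithms means that importing a new document only increments integers; the idf value is derived at lookup time.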

An inclusion table and an exclusion table are also maintained in the MySQL database, named T_inclusion and T_exclusion respectively.

III. Configuration

Python Dependencies:

Flask
Numpy
NLTK (see http://www.nltk.org/data.html; additional data needs to be downloaded to /usr/share for the module to work correctly)
MySQL-python (if you decide to use mysql as the backend)

Python and Flask need to sit behind a web server for the API to serve external HTTP calls. The test server is hosted on an Ubuntu Amazon cloud instance with Apache. Hosting the API on other web servers is possible. See the Flask deployment documentation for examples and configuration:

http://flask.pocoo.org/docs/0.10/deploying/

For pure testing on localhost, you can run __init__.py directly from your favorite Python IDE and reach your server via localhost:5000.

Configuration parameters for the API are saved in the file config.py

MySQL Configuration

To enable MySQL, four parameters must be set in config.py: MYSQL_HOST, MYSQL_USER, MYSQL_PASSWD and MYSQL_DB_NAME.
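As an illustration, the MySQL part of config.py could look like the fragment below. Only the four parameter names come from this README; every value is a placeholder or hypothetical and must be replaced with your own settings.

```python
# config.py (sketch): placeholder values only, replace with your own settings.
MYSQL_HOST = "localhost"         # placeholder host name
MYSQL_USER = "tagger"            # hypothetical user name
MYSQL_PASSWD = "change-me"       # placeholder; never commit a real password
MYSQL_DB_NAME = "research_tags"  # hypothetical database name
```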

Documents configuration

All the JPMC research documents should be kept in ./TFIDF/txtAssets/

The text files with the extension '.txt' in that directory are used to generate the idf data in the first step.

IV. Initialization

Create the tables (idf table, doc_sum table, inclusion and exclusion tables)

After setting up config.py, run 'python createTablesInDB.py' to create all the tables automatically.

Create or update the idf data table

Move all the documents to the directory ./TFIDF/txtAssets/

Go to the directory ./TFIDF/

Run 'python initialIDF.py'

V. Examples

Once your server is running, three types of HTTP calls are available. They all require the POST method:

http://:5000/getKeywords

Uses a half-and-half approach: half of the tags come from Named Entities, half from TFIDF.

http://:5000/getTFIDF

Uses TFIDF. It returns unigrams (single words only).

http://:5000/getNER

Uses Named Entity Recognition (NER). It returns significant name-like nouns (which may be longer than a single word).
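The half-and-half approach of getKeywords can be pictured with the sketch below. It is one possible reading, not the project's actual merging code: the function name combine_tags is hypothetical, and the case-insensitive deduplication and the rule for filling leftover slots are assumptions.

```python
def combine_tags(ner_tags, tfidf_tags, max_n):
    """Take about half of max_n from NER and fill the rest from TFIDF."""
    ner_quota = max_n // 2
    combined = []
    seen = set()

    def add(tag):
        key = tag.lower()  # assumption: tags are deduplicated case-insensitively
        if key not in seen and len(combined) < max_n:
            seen.add(key)
            combined.append(tag)

    for tag in ner_tags[:ner_quota]:
        add(tag)
    for tag in tfidf_tags:
        add(tag)
    # Assumption: if TFIDF runs short, leftover NER tags fill the gap.
    for tag in ner_tags[ner_quota:]:
        add(tag)
    return combined


tags = combine_tags(
    ["E.ON", "RWE", "Power"],
    ["power", "nuclear", "prices", "capacity"],
    max_n=4,
)
print(tags)
```

Both input lists are assumed to be ordered from most to least significant, so slicing keeps the strongest candidates from each source.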

They all take the same text input. In the Header section of the HTTP request, you can specify this as raw text or as JSON input:

{ "text":"", "max_n": }

Example:

{ "text":" Near term, however, we would expect investors to continue reducing underweight positions indiscriminately in both E.ON and RWE given the perceived potential upside from capacity payments, higher power prices, further cost cutting and a burden sharing deal with the government on long-term nuclear liabilities. ", "max_n":30 }
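A request like the one above could be prepared from Python with the standard library, as sketched below. The sketch assumes the API is running locally on port 5000 (the localhost test setup) and that the JSON is sent as the request body with a Content-Type of application/json; the helper name build_request is hypothetical. The request is only built here; the commented lines show how it would be sent to a running server.

```python
import json
import urllib.request

# Assumption: the API runs locally on port 5000, as in the localhost test setup.
BASE_URL = "http://localhost:5000"


def build_request(endpoint, text, max_n):
    """Build (but do not send) a POST request carrying the JSON input."""
    payload = json.dumps({"text": text, "max_n": max_n}).encode("utf-8")
    return urllib.request.Request(
        BASE_URL + "/" + endpoint,
        data=payload,
        headers={"Content-Type": "application/json"},  # assumption, see lead-in
        method="POST",
    )


request = build_request(
    "getKeywords",
    "E.ON and RWE may benefit from higher power prices.",
    30,
)

# To send it against a running server:
# with urllib.request.urlopen(request) as response:
#     print(response.read().decode("utf-8"))
```

Swapping "getKeywords" for "getTFIDF" or "getNER" targets the other two calls with the same input.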

