Commit

Update README for Identification of themes component
albamoralest committed Jun 27, 2022
1 parent 8c2e045 commit 0889947
Showing 5 changed files with 151 additions and 70 deletions.
109 changes: 56 additions & 53 deletions 03_Identify_TimeE.ipynb
@@ -9,68 +9,71 @@
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "d9f40d95-e1cc-4c98-8df9-c9770887d7fa",
"cell_type": "markdown",
"id": "334df2da-86a7-482e-b153-5277c0ab603d",
"metadata": {},
"outputs": [],
"source": [
"### This notebook processes text from music personalities' biographies and extract historical meetups information \n",
"#### Pre-requirements:\n",
"#### Text organised in sentences\n",
"This notebook processes text from music personalities' biographies and extract historical meetups information \n",
" Pre-requirements:\n",
" Text organised in sentences\n",
"\n",
"The implementation of the algorithm is based in the work presented by Zhong et al.\n",
"\n",
"#### The implementation of the algorithm is based in the work presented by Zhong et al.\n",
"#### @inproceedings{Zhong_Sun_Cambria_2017, address={Vancouver, Canada}, \n",
"#### title={Time Expression Analysis and Recognition Using Syntactic Token Types and General Heuristic Rules}, \n",
"#### url={http://aclweb.org/anthology/P17-1039}, DOI={10.18653/v1/P17-1039}, booktitle={Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)}, publisher={Association for Computational Linguistics}, \n",
"#### author={Zhong, Xiaoshi and Sun, Aixin and Cambria, Erik}, year={2017}, pages={420–429}, language={en} }\n",
"#### The authors use HEURISTIC rules to identify time tokens, and POST tags to filter out ambiguos time tokens\n",
"#### Implementation in JAVA https://github.com/zhongxiaoshi/syntime\n",
" @inproceedings{Zhong_Sun_Cambria_2017, address={Vancouver, Canada}, \n",
" title={Time Expression Analysis and Recognition Using Syntactic Token Types and General Heuristic Rules}, \n",
" url={http://aclweb.org/anthology/P17-1039}, DOI={10.18653/v1/P17-1039}, booktitle={Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)}, publisher={Association for Computational Linguistics}, \n",
" author={Zhong, Xiaoshi and Sun, Aixin and Cambria, Erik}, year={2017}, pages={420–429}, language={en} }\n",
"\n",
"#### Implementation of the approach, from Zhong et al. analysis:\n",
"#### 1) identify time tokents, 2) identify time segments, 3) identify time expressions\n",
"#### In our implementation:\n",
"#### To identify time tokens, we use three types of tokens: TIME, MODIFIER, NUMERAL.\n",
"#### Each type have more specific types:\n",
"#### MODIFIER = [\"PREFIX\",\"SUFFIX\",\"LINKAGE\",\"COMMA\",\"PARENTHESIS\",\"INARTICLE\"]\n",
"#### NUMERAL = [\"BASIC\",\"DIGIT\",\"ORDINAL\"]\n",
"#### TIME = [\"DECADE\", \"YEAR\", \"SEASON\", \"MONTH\", \"WEEK\", \"DATE\", \"TIME\", \"DAY_TIME\", \"TIMELINE\", \"HOLIDAY\", \"PERIOD\", \"DURATION\", \"TIME_UNIT\",\"TIME_ZONE\", \"ERA\",\"MID\",\"TIME_ZONE\",\"DAY\",\"HALFDAY\"]\n",
"The authors use HEURISTIC rules to identify time tokens, and POST tags to filter out ambiguos time tokens\n",
"Implementation in JAVA https://github.com/zhongxiaoshi/syntime\n",
"\n",
"Implementation of the approach, from Zhong et al. analysis:\n",
" 1) identify time tokents, 2) identify time segments, 3) identify time expressions\n",
" \n",
"Our implementation:\n",
"\n",
" - Identify time tokens, we use three types of tokens: TIME, MODIFIER, NUMERAL.\n",
" Each type have more specific types:\n",
" MODIFIER = [\"PREFIX\",\"SUFFIX\",\"LINKAGE\",\"COMMA\",\"PARENTHESIS\",\"INARTICLE\"]\n",
" NUMERAL = [\"BASIC\",\"DIGIT\",\"ORDINAL\"]\n",
" TIME = [\"DECADE\", \"YEAR\", \"SEASON\", \"MONTH\", \"WEEK\", \"DATE\", \"TIME\", \"DAY_TIME\", \"TIMELINE\", \"HOLIDAY\", \"PERIOD\", \"DURATION\", \"TIME_UNIT\",\"TIME_ZONE\", \"ERA\",\"MID\",\"TIME_ZONE\",\"DAY\",\"HALFDAY\"]\n",
" Added PARENTHESIS and improving regular expressions\n",
" - Initialize regular expressions:\n",
" Read regular expressions stored in:\n",
" - timeRegex.txt\n",
" - Build additional regular expressions using the base expressions in timeRegex.txt\n",
" - Compile regex objects just once for better performance\n",
" \n",
"For each sentence in the biography:\n",
"\n",
"#### Added PARENTHESIS and improving regular expressions\n",
"#### Initialize regular expressions:\n",
"#### Read regular expressions stored in:\n",
"#### - timeRegex.txt\n",
"#### Build additional regular expressions using the base expressions in timeRegex.txt\n",
"#### Compile regex objects just once for better performance\n",
" a) Identify token types. Function \"def get_time_tokens(text):\"\n",
" - Tokenize \n",
" - Obtain POS tags \n",
" - Use regular expressions to identify type of token: time, modifier, numeral\n",
" - Filter out ambiguous words by matching POS tags and type of token \n",
" - Output: A list of all the tokenized words, POS tags, type of token and token\n",
"\n",
"#### For each sentence in the biography:\n",
"#### a) Identify token types. Function \"def get_time_tokens(text):\"\n",
"#### - Tokenize \n",
"#### - Obtain POS tags \n",
"#### - Use regular expressions to identify type of token: time, modifier, numeral\n",
"#### - Filter out ambiguous words by matching POS tags and type of token \n",
"#### - Output: A list of all the tokenized words, POS tags, type of token and token\n",
" b) Identify time segments. Function \"def get_time_segments(time_token_list):\"\n",
" - A time segment has one time token and one or zero modifiers or numerals\n",
" - Search for a time token, once found search the surroundings:\n",
" - Search tokens on the left \n",
" If PREFIX or NUMERAL or IN_ARTICLE continue searching\n",
" - Search tokens on the right \n",
" If SUFIX or NUMERAL continue searching\n",
" For right and left search, if token is COMMA or LINKAGE then stop\n",
" If time segments overlap, then apply heuristic rules and merge segments\n",
" - Output: A list of time segments, each time segment has the word's index in the sentence\n",
"\n",
"#### b) Identify time segments. Function \"def get_time_segments(time_token_list):\"\n",
"#### - A time segment has one time token and one or zero modifiers or numerals\n",
"#### - Search for a time token, once found search the surroundings:\n",
"#### - Search tokens on the left \n",
"#### If PREFIX or NUMERAL or IN_ARTICLE continue searching\n",
"#### - Search tokens on the right \n",
"#### If SUFIX or NUMERAL continue searching\n",
"#### For right and left search, if token is COMMA or LINKAGE then stop\n",
"#### If time segments overlap, then apply heuristic rules and merge segments\n",
"#### - Output: A list of time segments, each time segment has the word's index in the sentence\n",
" c) Identify time expressions. Function \"classify_type_time_expression(time_expression_list):\"\n",
" - Three types of time expressions: time point, time reference and time range\n",
" - Apply heuristic rules to classify the type of time expression\n",
" - Output: A dataframe with the sentence, the type of time expression, the time expression and indexes\n",
" - Store each biography as a CSV file in extractedTimeExpressions/\n",
"\n",
"#### c) Identify time expressions. Function \"classify_type_time_expression(time_expression_list):\"\n",
"#### - Three types of time expressions: time point, time reference and time range\n",
"#### - Apply heuristic rules to classify the type of time expression\n",
"#### - Output: A dataframe with the sentence, the type of time expression, the time expression and indexes\n",
"#### - Store each biography as a CSV file in extractedTimeExpressions/\n",
"Directories information:\n",
"\n",
"#### Directories information:\n",
"#### indexedSentences/ : collection of biographies in CSV format. Each row of the file represents a sentence. Each row has a section name and paragraph index, and sentence index\n",
"#### extractedTimeExpressions/ : collection of annotated time expressions grouped by biography"
" - indexedSentences/ : collection of biographies in CSV format. Each row of the file represents a sentence. Each row has a section name and paragraph index, and sentence index\n",
" - extractedTimeExpressions/ : collection of annotated time expressions grouped by biography"
]
},
{
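
For orientation, the following is a minimal, self-contained Python sketch of steps (a) and (b) described in the notebook cell above. The token patterns, the "may" disambiguation rule, the merge heuristic and the example sentence are illustrative stand-ins — the notebook builds its actual patterns from timeRegex.txt — so treat this as a sketch of the idea, not the notebook's implementation:

```python
import re
import nltk  # assumes the 'punkt' and 'averaged_perceptron_tagger' data packages are installed

# Simplified, hypothetical token-type patterns; the notebook derives its
# patterns from timeRegex.txt and compiles them once. These are stand-ins.
PATTERNS = [
    ("TIME", re.compile(
        r"^(?:\d{4}s?|january|february|march|april|may|june|july|august|"
        r"september|october|november|december|spring|summer|autumn|winter|"
        r"century|decade|year|month|week|day|today)$", re.I)),
    ("MODIFIER", re.compile(r"^(?:early|late|mid|in|on|at|during|around|of|the|an?|,|[()])$", re.I)),
    ("NUMERAL", re.compile(r"^(?:\d+(?:st|nd|rd|th)?|first|second|third|one|two|three)$", re.I)),
]

def get_time_tokens(text):
    """Step (a): tag each word with its POS tag and token type (or None)."""
    words = nltk.word_tokenize(text)
    tokens = []
    for i, (word, pos) in enumerate(nltk.pos_tag(words)):
        ttype = next((name for name, rx in PATTERNS if rx.match(word)), None)
        # Example of POS-based disambiguation: 'may' is a month only as a noun.
        if word.lower() == "may" and not pos.startswith("NN"):
            ttype = None
        tokens.append((i, word, pos, ttype))
    return tokens

def get_time_segments(tokens):
    """Step (b): expand every TIME token over neighbouring modifiers/numerals."""
    segments = []
    for i, _, _, ttype in tokens:
        if ttype != "TIME":
            continue
        left = i
        while left > 0 and tokens[left - 1][3] in ("MODIFIER", "NUMERAL"):
            left -= 1
        right = i
        while right + 1 < len(tokens) and tokens[right + 1][3] == "NUMERAL":
            right += 1
        # Merge overlapping or adjacent segments -- a crude stand-in for the
        # heuristic merge rules described in the cell above.
        if segments and left <= segments[-1][1] + 1:
            segments[-1] = (segments[-1][0], right)
        else:
            segments.append((left, right))
    return segments

sentence = "In the early 1920s he moved to Vienna, and in May 1925 he met Berg."
tokens = get_time_tokens(sentence)
for start, end in get_time_segments(tokens):
    print(" ".join(w for _, w, _, _ in tokens[start:end + 1]))
# -> "In the early 1920s" and "in May 1925" (output depends on the POS tagger)
```
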
12 changes: 6 additions & 6 deletions README_data_cleaning.md
@@ -39,16 +39,16 @@ MEETUPS data cleaning is a tool developed using Python and Jupyter Notebook. Thi

|_ 01_CleaningText.ipynb

Raw corpus location
Data output:
|_ text_dataset/

Clean text location
Data input:
|_ cleanText/

Index data location
Data output:
|_ indexedParagraphs/
|_ indexedSentences/

80 changes: 80 additions & 0 deletions README_identification_themes.md
@@ -0,0 +1,80 @@
---
id:
name: MEETUPS - Identification of themes
brief-description: This tool is part of the MEETUPS pilot and processes text from music personalities' biographies to find encounter types. It uses "sklearn" and a set of Machine Learning algorithms to classify sentences according to the established types of events. The tool extracts information for one of the four elements defining a meetup: the type of encounter (what). The encounter type, along with data on the people involved (who), the place (where) and the time it took place (when), completes the historical meetup information.
type: Software
release-date: 20/06/2022
release-number: v1.0
work-package:
- WP3
pilot: MEETUPS
keywords:
- Wikipedia
- Music
- Text classification
- Encounter type
licence: GPLv3
release link:
-
credits:
- https://github.com/enridaga
---

# MEETUPS - Identification of themes

MEETUPS identification of themes is a tool developed using Python and Jupyter Notebook. It uses sklearn and a set of Machine Learning algorithms to classify sentences according to the established types of events. The tool allows the extraction of one (the type of encounter) of the four elements that define a historical meetup.
The encounter types are music-making, business meetings, personal life, social life, coincidence, public celebration, and education.

This implementation is divided into three main tasks:
a) Generation of the training dataset
To identify and classify sentences according to the encounter type, we first need to build a dataset of sentences that describe the different encounter types.
Approach:
- Manually prepare seed terms for each meetup type
- Randomly select sentences with those words from the corpus
- Assign the relevant meetup type to each one of those sentences

b) Training the classifier
Approach:
- Build a balanced training set by selecting sentences from under-represented classes first
- Train and test an MLPClassifier

c) Applying the classifier
Use the model tested in b) and infer the type of encounter for all the data in the corpus
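
To make tasks (b) and (c) concrete, here is a minimal scikit-learn sketch. The CSV paths and column names are hypothetical, class balancing is simplified to a stratified split, and the TF-IDF-into-MLP pipeline is an assumption about how the notebooks wire features to the classifier:

```python
import joblib
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline

# (b) Train and test the classifier on the labelled sentences from task (a).
labelled = pd.read_csv("meetupType/trainingSentences.csv")  # hypothetical path
X_train, X_test, y_train, y_test = train_test_split(
    labelled["sentence"], labelled["meetup_type"],
    test_size=0.2, stratify=labelled["meetup_type"], random_state=0)

model = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2), min_df=2)),
    ("mlp", MLPClassifier(hidden_layer_sizes=(100,), max_iter=500, random_state=0)),
])
model.fit(X_train, y_train)
print(classification_report(y_test, model.predict(X_test)))
joblib.dump(model, "meetupType/models/MLPClassifier_2.clf")

# (c) Apply the trained model to the whole corpus.
corpus = pd.read_csv("indexedSentences/some_biography.csv")  # hypothetical file
corpus["meetup_type"] = model.predict(corpus["sentence"])
corpus.to_csv("extractedMeetupTypes/some_biography.csv", index=False)
```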

### Information on installation and setup

- Jupyter Notebook:
MeetupType_applyClassifier.ipynb

### Details of the data

Running the Themes classifier:
|_ MeetupType_applyClassifier.ipynb

Training the Themes classifier:
|_ MeetupType_trainClassifier.ipynb

Generating the training dataset:
|_ MeetupType_prototypeSentences.ipynb

Data location

Data input:
|_ indexedSentences/

Data output:
|_ extractedMeetupTypes/

Classifier:
|_ meetupType/models/MLPClassifier_2.clf

Prototype sentences:
|_ meetupType/prototypeSentences_*.csv


Documentation:
|_ README_identification_themes.md


### DOI:

TODO
10 changes: 4 additions & 6 deletions README_people_places_identification.md
@@ -51,20 +51,18 @@ The second notebook uses the responses from DBpedia Spotlight to capture data of
#### Details of the data

Code location:
|_ 02_queryDbpedia.ipynb
|_ 02_Identify_PP.ipynb

Index data location
Data input:
|_ indexedSentences/

DBpedia Spotlight annotations:
|_ cacheSpotlightResponse/

People and places annotation
Data output:
|_ extractedEntitiesPersonPlaceOnly/
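
As an illustration of the Spotlight round trip, here is a minimal Python sketch that annotates a sentence and keeps only Person and Place resources. The public endpoint, parameters and response keys follow standard DBpedia Spotlight REST usage, but the notebook's exact calls, confidence threshold and caching layout for cacheSpotlightResponse/ may differ:

```python
import requests

SPOTLIGHT_URL = "https://api.dbpedia-spotlight.org/en/annotate"

def annotate(text, confidence=0.5):
    """Query DBpedia Spotlight and return its list of annotated resources."""
    resp = requests.get(
        SPOTLIGHT_URL,
        params={"text": text, "confidence": confidence},
        headers={"Accept": "application/json"},
        timeout=30,
    )
    resp.raise_for_status()
    # Responses could be cached (e.g. to cacheSpotlightResponse/) to avoid re-querying.
    return resp.json().get("Resources", [])

for res in annotate("Gustav Mahler conducted in Vienna and New York."):
    types = res.get("@types", "")
    if "DBpedia:Person" in types or "DBpedia:Place" in types:
        print(res["@surfaceForm"], "->", res["@URI"])
```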


10 changes: 5 additions & 5 deletions README_time_expressions.md
@@ -22,7 +22,7 @@ credits:

# MEETUPS

MEETUPS identification of temporal knowledge is a tool developed using Python and Jupyter Notebook. This software uses the NLTK Toolkit and heuristic rules to identify and annotate time expressions from input text. The tool allows the extraction of one (when a historical meetup happened) of the four elements that define a historical meetup.

This implementation is a rule-based Time Expression recognition tagger based on research by Zhong et al. and SynTime software (https://github.com/zhongxiaoshi/syntime). Their work was originally tested using three datasets: TimeBank, WikiWars and Tweets.
The authors implement a three-layer system that recognises time expressions using syntactic token types and general heuristic rules.
@@ -68,18 +68,18 @@ Finally the tool stores the results as a CSV file in extractedTimeExpressions/
### Details of the data

Code location:
|_ 03_Identify_TimeE.ipynb

Regular expressions:
|_ timeRegex.txt

Data location

Data input:
|_ indexedSentences/

Time expressions annotations
Data output:
|_ extractedTimeExpressions/


