Commit

Update README for Identification of themes component
albamoralest committed Jun 27, 2022
1 parent 8c2e045 commit 0889947
Showing 5 changed files with 151 additions and 70 deletions.
109 changes: 56 additions & 53 deletions 03_Identify_TimeE.ipynb
@@ -9,68 +9,71 @@
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "d9f40d95-e1cc-4c98-8df9-c9770887d7fa",
"cell_type": "markdown",
"id": "334df2da-86a7-482e-b153-5277c0ab603d",
"metadata": {},
"outputs": [],
"source": [
"### This notebook processes text from music personalities' biographies and extract historical meetups information \n",
"#### Pre-requirements:\n",
"#### Text organised in sentences\n",
"This notebook processes text from music personalities' biographies and extract historical meetups information \n",
" Pre-requirements:\n",
" Text organised in sentences\n",
"\n",
"The implementation of the algorithm is based in the work presented by Zhong et al.\n",
"\n",
"#### The implementation of the algorithm is based in the work presented by Zhong et al.\n",
"#### @inproceedings{Zhong_Sun_Cambria_2017, address={Vancouver, Canada}, \n",
"#### title={Time Expression Analysis and Recognition Using Syntactic Token Types and General Heuristic Rules}, \n",
"#### url={http://aclweb.org/anthology/P17-1039}, DOI={10.18653/v1/P17-1039}, booktitle={Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)}, publisher={Association for Computational Linguistics}, \n",
"#### author={Zhong, Xiaoshi and Sun, Aixin and Cambria, Erik}, year={2017}, pages={420–429}, language={en} }\n",
"#### The authors use HEURISTIC rules to identify time tokens, and POST tags to filter out ambiguos time tokens\n",
"#### Implementation in JAVA https://github.com/zhongxiaoshi/syntime\n",
" @inproceedings{Zhong_Sun_Cambria_2017, address={Vancouver, Canada}, \n",
" title={Time Expression Analysis and Recognition Using Syntactic Token Types and General Heuristic Rules}, \n",
" url={http://aclweb.org/anthology/P17-1039}, DOI={10.18653/v1/P17-1039}, booktitle={Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)}, publisher={Association for Computational Linguistics}, \n",
" author={Zhong, Xiaoshi and Sun, Aixin and Cambria, Erik}, year={2017}, pages={420–429}, language={en} }\n",
"\n",
"#### Implementation of the approach, from Zhong et al. analysis:\n",
"#### 1) identify time tokents, 2) identify time segments, 3) identify time expressions\n",
"#### In our implementation:\n",
"#### To identify time tokens, we use three types of tokens: TIME, MODIFIER, NUMERAL.\n",
"#### Each type have more specific types:\n",
"#### MODIFIER = [\"PREFIX\",\"SUFFIX\",\"LINKAGE\",\"COMMA\",\"PARENTHESIS\",\"INARTICLE\"]\n",
"#### NUMERAL = [\"BASIC\",\"DIGIT\",\"ORDINAL\"]\n",
"#### TIME = [\"DECADE\", \"YEAR\", \"SEASON\", \"MONTH\", \"WEEK\", \"DATE\", \"TIME\", \"DAY_TIME\", \"TIMELINE\", \"HOLIDAY\", \"PERIOD\", \"DURATION\", \"TIME_UNIT\",\"TIME_ZONE\", \"ERA\",\"MID\",\"TIME_ZONE\",\"DAY\",\"HALFDAY\"]\n",
"The authors use HEURISTIC rules to identify time tokens, and POST tags to filter out ambiguos time tokens\n",
"Implementation in JAVA https://github.com/zhongxiaoshi/syntime\n",
"\n",
"Implementation of the approach, from Zhong et al. analysis:\n",
" 1) identify time tokents, 2) identify time segments, 3) identify time expressions\n",
" \n",
"Our implementation:\n",
"\n",
" - Identify time tokens, we use three types of tokens: TIME, MODIFIER, NUMERAL.\n",
" Each type have more specific types:\n",
" MODIFIER = [\"PREFIX\",\"SUFFIX\",\"LINKAGE\",\"COMMA\",\"PARENTHESIS\",\"INARTICLE\"]\n",
" NUMERAL = [\"BASIC\",\"DIGIT\",\"ORDINAL\"]\n",
" TIME = [\"DECADE\", \"YEAR\", \"SEASON\", \"MONTH\", \"WEEK\", \"DATE\", \"TIME\", \"DAY_TIME\", \"TIMELINE\", \"HOLIDAY\", \"PERIOD\", \"DURATION\", \"TIME_UNIT\",\"TIME_ZONE\", \"ERA\",\"MID\",\"TIME_ZONE\",\"DAY\",\"HALFDAY\"]\n",
" Added PARENTHESIS and improving regular expressions\n",
" - Initialize regular expressions:\n",
" Read regular expressions stored in:\n",
" - timeRegex.txt\n",
" - Build additional regular expressions using the base expressions in timeRegex.txt\n",
" - Compile regex objects just once for better performance\n",
" \n",
"For each sentence in the biography:\n",
"\n",
"#### Added PARENTHESIS and improving regular expressions\n",
"#### Initialize regular expressions:\n",
"#### Read regular expressions stored in:\n",
"#### - timeRegex.txt\n",
"#### Build additional regular expressions using the base expressions in timeRegex.txt\n",
"#### Compile regex objects just once for better performance\n",
" a) Identify token types. Function \"def get_time_tokens(text):\"\n",
" - Tokenize \n",
" - Obtain POS tags \n",
" - Use regular expressions to identify type of token: time, modifier, numeral\n",
" - Filter out ambiguous words by matching POS tags and type of token \n",
" - Output: A list of all the tokenized words, POS tags, type of token and token\n",
"\n",
"#### For each sentence in the biography:\n",
"#### a) Identify token types. Function \"def get_time_tokens(text):\"\n",
"#### - Tokenize \n",
"#### - Obtain POS tags \n",
"#### - Use regular expressions to identify type of token: time, modifier, numeral\n",
"#### - Filter out ambiguous words by matching POS tags and type of token \n",
"#### - Output: A list of all the tokenized words, POS tags, type of token and token\n",
" b) Identify time segments. Function \"def get_time_segments(time_token_list):\"\n",
" - A time segment has one time token and one or zero modifiers or numerals\n",
" - Search for a time token, once found search the surroundings:\n",
" - Search tokens on the left \n",
" If PREFIX or NUMERAL or IN_ARTICLE continue searching\n",
" - Search tokens on the right \n",
" If SUFIX or NUMERAL continue searching\n",
" For right and left search, if token is COMMA or LINKAGE then stop\n",
" If time segments overlap, then apply heuristic rules and merge segments\n",
" - Output: A list of time segments, each time segment has the word's index in the sentence\n",
"\n",
"#### b) Identify time segments. Function \"def get_time_segments(time_token_list):\"\n",
"#### - A time segment has one time token and one or zero modifiers or numerals\n",
"#### - Search for a time token, once found search the surroundings:\n",
"#### - Search tokens on the left \n",
"#### If PREFIX or NUMERAL or IN_ARTICLE continue searching\n",
"#### - Search tokens on the right \n",
"#### If SUFIX or NUMERAL continue searching\n",
"#### For right and left search, if token is COMMA or LINKAGE then stop\n",
"#### If time segments overlap, then apply heuristic rules and merge segments\n",
"#### - Output: A list of time segments, each time segment has the word's index in the sentence\n",
" c) Identify time expressions. Function \"classify_type_time_expression(time_expression_list):\"\n",
" - Three types of time expressions: time point, time reference and time range\n",
" - Apply heuristic rules to classify the type of time expression\n",
" - Output: A dataframe with the sentence, the type of time expression, the time expression and indexes\n",
" - Store each biography as a CSV file in extractedTimeExpressions/\n",
"\n",
"#### c) Identify time expressions. Function \"classify_type_time_expression(time_expression_list):\"\n",
"#### - Three types of time expressions: time point, time reference and time range\n",
"#### - Apply heuristic rules to classify the type of time expression\n",
"#### - Output: A dataframe with the sentence, the type of time expression, the time expression and indexes\n",
"#### - Store each biography as a CSV file in extractedTimeExpressions/\n",
"Directories information:\n",
"\n",
"#### Directories information:\n",
"#### indexedSentences/ : collection of biographies in CSV format. Each row of the file represents a sentence. Each row has a section name and paragraph index, and sentence index\n",
"#### extractedTimeExpressions/ : collection of annotated time expressions grouped by biography"
" - indexedSentences/ : collection of biographies in CSV format. Each row of the file represents a sentence. Each row has a section name and paragraph index, and sentence index\n",
" - extractedTimeExpressions/ : collection of annotated time expressions grouped by biography"
]
},
{
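
For orientation, the following is a minimal, self-contained Python sketch of steps (a) and (b) described in the notebook cell above. The token patterns, the "may" disambiguation rule, the merge heuristic and the example sentence are illustrative stand-ins — the notebook builds its actual patterns from timeRegex.txt — so treat this as a sketch of the idea, not the notebook's implementation:

```python
import re
import nltk  # assumes the 'punkt' and 'averaged_perceptron_tagger' data packages are installed

# Simplified, hypothetical token-type patterns; the notebook derives its
# patterns from timeRegex.txt and compiles them once. These are stand-ins.
PATTERNS = [
    ("TIME", re.compile(
        r"^(?:\d{4}s?|january|february|march|april|may|june|july|august|"
        r"september|october|november|december|spring|summer|autumn|winter|"
        r"century|decade|year|month|week|day|today)$", re.I)),
    ("MODIFIER", re.compile(r"^(?:early|late|mid|in|on|at|during|around|of|the|an?|,|[()])$", re.I)),
    ("NUMERAL", re.compile(r"^(?:\d+(?:st|nd|rd|th)?|first|second|third|one|two|three)$", re.I)),
]

def get_time_tokens(text):
    """Step (a): tag each word with its POS tag and token type (or None)."""
    words = nltk.word_tokenize(text)
    tokens = []
    for i, (word, pos) in enumerate(nltk.pos_tag(words)):
        ttype = next((name for name, rx in PATTERNS if rx.match(word)), None)
        # Example of POS-based disambiguation: 'may' is a month only as a noun.
        if word.lower() == "may" and not pos.startswith("NN"):
            ttype = None
        tokens.append((i, word, pos, ttype))
    return tokens

def get_time_segments(tokens):
    """Step (b): expand every TIME token over neighbouring modifiers/numerals."""
    segments = []
    for i, _, _, ttype in tokens:
        if ttype != "TIME":
            continue
        left = i
        while left > 0 and tokens[left - 1][3] in ("MODIFIER", "NUMERAL"):
            left -= 1
        right = i
        while right + 1 < len(tokens) and tokens[right + 1][3] == "NUMERAL":
            right += 1
        # Merge overlapping or adjacent segments -- a crude stand-in for the
        # heuristic merge rules described in the cell above.
        if segments and left <= segments[-1][1] + 1:
            segments[-1] = (segments[-1][0], right)
        else:
            segments.append((left, right))
    return segments

sentence = "In the early 1920s he moved to Vienna, and in May 1925 he met Berg."
tokens = get_time_tokens(sentence)
for start, end in get_time_segments(tokens):
    print(" ".join(w for _, w, _, _ in tokens[start:end + 1]))
# -> "In the early 1920s" and "in May 1925" (output depends on the POS tagger)
```
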
12 changes: 6 additions & 6 deletions README_data_cleaning.md
@@ -39,16 +39,16 @@ MEETUPS data cleaning is a tool developed using Python and Jupyter Notebook. Thi

|_ 01_CleaningText.ipynb

Raw corpus location
Data output:
|_ text_dataset/

Clean text location
Data input:
|_ cleanText/

Index data location
Data output:
|_ indexedParagraphs/
|_ indexedSentences/

80 changes: 80 additions & 0 deletions README_identification_themes.md
@@ -0,0 +1,80 @@
---
id:
name: MEETUPS - Identification of themes
brief-description: This tool is part of the MEETUPS pilot and processes text from music personalities' biographies to find encounter types. It uses "sklearn" and a set of Machine Learning algorithms to classify sentences according to the established types of events. The tool extracts information for one of the four elements defining a meetup: the type of encounter (what). The encounter type, along with data on the people involved (who), the place (where) and the time it took place (when), completes the historical meetup information.
type: Software
release-date: 20/06/2022
release-number: v1.0
work-package:
- WP3
pilot: MEETUPS
keywords:
- Wikipedia
- Music
- Text classification
- Encounter type
licence: GPLv3
release link:
-
credits:
- https://github.com/enridaga
---

# MEETUPS - Identification of themes

MEETUPS identification of themes is a tool developed using Python and Jupyter Notebook. It uses sklearn and a set of Machine Learning algorithms to classify sentences according to the established types of events. The tool allows the extraction of one (the type of encounter) of the four elements that define a historical meetup.
The encounter types are music-making, business meetings, personal life, social life, coincidence, public celebration, and education.

This implementation is divided into three main tasks:
a) Generation of the training dataset
To identify and classify sentences according to the encounter type, we first need to build a dataset of sentences that describe the different encounter types.
Approach:
- Manually prepare seed terms for each meetup type
- Randomly select sentences with those words from the corpus
- Assign the relevant meetup type to each one of those sentences

b) Training the classifier
Approach:
- Build a balanced training set by selecting sentences from under-represented classes first
- Train and test an MLPClassifier

c) Applying the classifier
Use the model tested in b) and infer the type of encounter for all the data in the corpus
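
To make tasks (b) and (c) concrete, here is a minimal scikit-learn sketch. The CSV paths and column names are hypothetical, class balancing is simplified to a stratified split, and the TF-IDF-into-MLP pipeline is an assumption about how the notebooks wire features to the classifier:

```python
import joblib
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline

# (b) Train and test the classifier on the labelled sentences from task (a).
labelled = pd.read_csv("meetupType/trainingSentences.csv")  # hypothetical path
X_train, X_test, y_train, y_test = train_test_split(
    labelled["sentence"], labelled["meetup_type"],
    test_size=0.2, stratify=labelled["meetup_type"], random_state=0)

model = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2), min_df=2)),
    ("mlp", MLPClassifier(hidden_layer_sizes=(100,), max_iter=500, random_state=0)),
])
model.fit(X_train, y_train)
print(classification_report(y_test, model.predict(X_test)))
joblib.dump(model, "meetupType/models/MLPClassifier_2.clf")

# (c) Apply the trained model to the whole corpus.
corpus = pd.read_csv("indexedSentences/some_biography.csv")  # hypothetical file
corpus["meetup_type"] = model.predict(corpus["sentence"])
corpus.to_csv("extractedMeetupTypes/some_biography.csv", index=False)
```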

### Information on installation and setup

- Jupyter Notebook:
MeetupType_applyClassifier.ipynb

### Details of the data

Running the Themes classifier:
|_ MeetupType_applyClassifier.ipynb

Training the Themes classifier:
|_ MeetupType_trainClassifier.ipynb

Generating the training dataset:
|_ MeetupType_prototypeSentences.ipynb

Data location

Data input:
|_ indexedSentences/

Data output:
|_ extractedMeetupTypes/

Classifier:
|_ meetupType/models/MLPClassifier_2.clf

Prototype sentences:
|_ meetupType/prototypeSentences_*.csv


Documentation:
|_ README_identification_themes.md


### DOI:

TODO
10 changes: 4 additions & 6 deletions README_people_places_identification.md
@@ -51,20 +51,18 @@ The second notebook uses the responses from DBpedia Spotlight to capture data of
#### Details of the data

Code location:
|_ 02_queryDbpedia.ipynb
|_ 02_Identify_PP.ipynb

Index data location
Data input:
|_ indexedSentences/

DBpedia Spotlight annotations:
|_ cacheSpotlightResponse/

People and places annotation
Data output:
|_ extractedEntitiesPersonPlaceOnly/
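
As an illustration of the Spotlight round trip, here is a minimal Python sketch that annotates a sentence and keeps only Person and Place resources. The public endpoint, parameters and response keys follow standard DBpedia Spotlight REST usage, but the notebook's exact calls, confidence threshold and caching layout for cacheSpotlightResponse/ may differ:

```python
import requests

SPOTLIGHT_URL = "https://api.dbpedia-spotlight.org/en/annotate"

def annotate(text, confidence=0.5):
    """Query DBpedia Spotlight and return its list of annotated resources."""
    resp = requests.get(
        SPOTLIGHT_URL,
        params={"text": text, "confidence": confidence},
        headers={"Accept": "application/json"},
        timeout=30,
    )
    resp.raise_for_status()
    # Responses could be cached (e.g. to cacheSpotlightResponse/) to avoid re-querying.
    return resp.json().get("Resources", [])

for res in annotate("Gustav Mahler conducted in Vienna and New York."):
    types = res.get("@types", "")
    if "DBpedia:Person" in types or "DBpedia:Place" in types:
        print(res["@surfaceForm"], "->", res["@URI"])
```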


10 changes: 5 additions & 5 deletions README_time_expressions.md
@@ -22,7 +22,7 @@ credits:

# MEETUPS

MEETUPS identification of temporal knowledge is a tool developed using Python and Jupyter Notebook. This software uses the NLTK Toolkit and heuristic rules to identify and annotate time expressions from input text. The tool allows the extraction of one (when a historical meetup happened) of the four elements that define a historical meetup.

This implementation is a rule-based Time Expression recognition tagger based on research by Zhong et al. and SynTime software (https://github.com/zhongxiaoshi/syntime). Their work was originally tested using three datasets: TimeBank, WikiWars and Tweets.
The authors implement a three-layer system that recognises time expressions using syntactic token types and general heuristic rules.
@@ -68,18 +68,18 @@ Finally the tool stores the results as a CSV file in extractedTimeExpressions/
### Details of the data

Code location:
|_ 03_Identify_TimeE.ipynb

Regular expressions:
|_ timeRegex.txt

Data location

Data input:
|_ indexedSentences/

Time expressions annotations
Data output:
|_ extractedTimeExpressions/


