## Text and Data Mining Fundamentals


Before you proceed with the practical examples, you first need to authenticate your Google account by running the code given below. When the code stops running, it will produce a long URL and display a text box (see the screenshot below). Click on the URL and you will be directed to another page, which will contain an ID string. Copy this ID, come back to this page and paste it into the text box. Then press "Enter" or "Return" on your keyboard.

Here is a screenshot showing the long URL and the ID string as I have pasted it into the text box.

![google_authentication](https://lh3.google.com/u/0/d/1JBA2JvXdd19P5GAimUPFf9JByZCCuaX1=w1439-h780-iv2)

You will need to repeat this procedure every time you return to this page after leaving it for some time.

In the block below, place your cursor at `[ ]` and click on the "Play" button. This may take some time, depending on your connection, so please be patient.

You may receive the following warning message. You can choose "Run Anyway". 

![error](https://lh3.google.com/u/0/d/1trZibFecCs1zFPUy7FBK5Uvj1hdUo5CV=w1300-h500-p-k-nu-iv1)

In [0]:
import pandas


#
# Replace the assignment below with your file ID
# to download a different file.
#
# A file ID looks like: 1uBtlaggVyWshwcyP6kEI-y_W3P8D26sz
file_id = 'target_file_id'

import io
from io import StringIO
from googleapiclient.http import MediaIoBaseDownload
from google.colab import auth
from googleapiclient.discovery import build

auth.authenticate_user()
drive_service = build('drive', 'v3')

def load_from_gdrive(file_id):
  # Request the raw content of the Drive file with the given ID.
  request = drive_service.files().get_media(fileId=file_id)
  downloaded = io.BytesIO()
  downloader = MediaIoBaseDownload(downloaded, request)
  done = False
  while not done:
    # _ is a placeholder for a progress object that we ignore.
    # (Our file is small, so we skip reporting progress.)
    _, done = downloader.next_chunk()

  # Only after every chunk has arrived, rewind the buffer and return its bytes.
  downloaded.seek(0)
  return downloaded.read()



Let's move on with some TDM fundamentals! 


On this page you will see how data from the CORE API can be used to apply some text and data mining (TDM) pre-processing steps. The purpose of this page is to demonstrate some TDM examples and introduce you to TDM practices. It was designed for anyone interested in TDM who has no technical skills, so the examples are for demonstration only and do not provide guidance or instructions on how to implement things.

The examples presented here relate to:
* Sentence segmentation
* Tokenization
* Part of Speech (POS) tagging
* Named Entity Recognition (NER)
* Stemming



To create the examples we used the paper [Linked Data - the story so far](https://core.ac.uk/display/1511033) by Christian Bizer, Tom Heath and Tim Berners-Lee.

In the following block we show the Python code we used to extract the title of the paper. The title itself is shown as output (marked `[->` in the notebook) directly below the code.

In [2]:

import json
data=load_from_gdrive("1ZbTXJheXICd97MmJ55yPRd3t2gftWUOF")
article=json.loads(data)
article["title"]

'Linked Data - the story so far'

Now we will focus on the last line of the block above, which contains the code `article["title"]`. We replace the word `"title"` with the word `"fullText"`, which gives us the whole text, as shown in the next block.


(Note: We only changed the last line of code and did not repeat the whole block. This works because the blocks in the notebook are connected and share the same information and data.)

To make the full text easier to read in the notebook, we use the `print()` function.

In [3]:
print(article["fullText"])

Linked Data - The Story So Far
Christian Bizer, Freie Universität Berlin, Germany
Tom Heath, Talis Information Ltd, United Kingdom
Tim Berners-Lee, Massachusetts Institute of Technology, USA
This is a preprint of a paper to appear in: Heath, T., Hepp, M., and Bizer, C. (eds.). Special
Issue on Linked Data, International Journal on Semantic Web and Information Systems
(IJSWIS). http://linkeddata.org/docs/ijswis-special-issue
Abstract
The term Linked Data refers to a set of best practices for publishing and connecting
structured data on the Web. These best practices have been adopted by an increasing
number of data providers over the last three years, leading to the creation of a global data
space containing billions of assertions - the Web of Data. In this article we present the
concept and technical principles of Linked Data, and situate these within the broader context
of related technological developments. We describe progress to date in publishing Linked
Data on the Web, review appl

**Practical exercises:** 
 - **Easy:** Access the [full text PDF](https://core.ac.uk/download/pdf/1511033.pdf) and check that the first and last words of the PDF match the first and last words of the full text as it appears in the notebook.
 - **Easy:** Remove the `print()` from the block above. What is the difference between the two visual layouts? (*tip:* the line breaks in the text are identifiable by `\n`)
 - **Advanced:** Change the code in the cell above to extract all the authors of this paper. (*tip:* the article's content and metadata in [JSON file format](https://en.wikipedia.org/wiki/JSON) are accessible [here](https://drive.google.com/file/d/1ZbTXJheXICd97MmJ55yPRd3t2gftWUOF/view?usp=sharing). For an easy-to-read version, you might want to use [this link](https://gist.github.com/mcancellieri/4762b5d87b81333daecc65aad5d370d2).)
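
For the exercises above, it may help to see how a parsed JSON record can be inspected. The following is only a sketch: the record is a made-up miniature, and its field names (apart from `title` and `fullText`, which appear in this notebook) are assumptions rather than the fields of the real CORE file.

```python
import json

# Hypothetical miniature record; the real CORE file has more fields,
# and its field names may differ from the ones invented here.
raw = '{"title": "A sample paper", "year": 2009, "fullText": "First line\\nSecond line"}'
record = json.loads(raw)

# List the available fields before deciding which one to read.
field_names = sorted(record.keys())
print(field_names)

# print() renders the line break, while repr() shows it as the characters \n.
print(record["fullText"])
print(repr(record["fullText"]))
```

Running the same `sorted(article.keys())` call on the real `article` object should show which field is likely to hold the authors.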

We will now start with some cool TDM stuff. 

Before we do that, we need to download the Natural Language Toolkit (NLTK) library. This library is like a Swiss Army knife for TDM, as it offers many of the functions on which TDM is built.

To make this library work we also need to download something else, which I will now explain.

Because the full text we are using is written in English, we need to download models and corpora that capture English grammar and vocabulary. If we had to analyse full text in more than one language, we would have to download the models for the other languages as well. These models will help us with all the aforementioned TDM techniques. (*Note:* in the following code, read the lines marked in red. These are notes that explain what we did and are not part of the code.)



In [4]:
'''
The next line installs the NLTK package
'''
!pip install nltk

'''
This line imports the NLTK library into the notebook
'''
import nltk

'''
The following lines download the required models for this example. If you want to, you can run:
nltk.download()
which will prompt you to choose which specific packages to download
'''
nltk.download("punkt")
nltk.download('averaged_perceptron_tagger')
nltk.download('maxent_ne_chunker')
nltk.download('words')

[nltk_data] Downloading package punkt to /content/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /content/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     /content/nltk_data...
[nltk_data]   Unzipping chunkers/maxent_ne_chunker.zip.
[nltk_data] Downloading package words to /content/nltk_data...
[nltk_data]   Unzipping corpora/words.zip.


True

##### Sentence segmentation

> *Text segmentation* is the process of dividing written text into meaningful units, such as words, sentences, or topics. ([Wikipedia](https://en.wikipedia.org/wiki/Text_segmentation))

We use sentence segmentation to extract each sentence in the full text separately. This method works well for English, because a period (.) usually indicates the end of a sentence, but it can be more challenging or unhelpful in other languages.

In the following example we run the code to split the text into sentences.

In [5]:
'''
Let's split the text into sentences:
'''
sentences = nltk.sent_tokenize(article["fullText"])
sentences

['Linked Data - The Story So Far\nChristian Bizer, Freie Universität Berlin, Germany\nTom Heath, Talis Information Ltd, United Kingdom\nTim Berners-Lee, Massachusetts Institute of Technology, USA\nThis is a preprint of a paper to appear in: Heath, T., Hepp, M., and Bizer, C.',
 '(eds.).',
 'Special\nIssue on Linked Data, International Journal on Semantic Web and Information Systems\n(IJSWIS).',
 'http://linkeddata.org/docs/ijswis-special-issue\nAbstract\nThe term Linked Data refers to a set of best practices for publishing and connecting\nstructured data on the Web.',
 'These best practices have been adopted by an increasing\nnumber of data providers over the last three years, leading to the creation of a global data\nspace containing billions of assertions - the Web of Data.',
 'In this article we present the\nconcept and technical principles of Linked Data, and situate these within the broader context\nof related technological developments.',
 'We describe progress to date in publish

**Practical exercises:** 
 - **Easy:** Locate the following sentence in the [full text PDF](https://core.ac.uk/download/pdf/1511033.pdf): 
*"We describe progress to date in publishing Linked Data on the Web, review applications that have been developed to exploit the Web of Data, and map out a research agenda for the Linked Data community as it moves forward."* 
Then try to locate the same sentence in the output above. Notice how the start of a new sentence and a new line (`\n`) are marked. 
 - **Advanced:** Go to the code block, find the line 
 
 `sentences = nltk.sent_tokenize(article["fullText"])` 
 
 and replace it with  
 
 `sentences = article["fullText"].split(".")` 
 
 (you can just copy and paste it). 
 
 Observe how the text is divided this time. What are the main differences? 
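
To illustrate why splitting on every period is fragile, here is a small sketch that uses only the Python standard library. The sample text is invented for this illustration, and the regular expression is a simple hand-written rule, not the method NLTK uses internally.

```python
import re

# Invented sample with an abbreviation ("Dr."), "et al." and a decimal number.
sample = "Dr. Smith reviewed the paper. It cites Heath et al. in section 2.1 of the text."

# Naive approach: every period is treated as a sentence boundary.
naive = [piece.strip() for piece in sample.split(".") if piece.strip()]

# Slightly better rule: a period only ends a sentence when it is followed
# by whitespace and then a capital letter.
rule_based = re.split(r"(?<=\.)\s+(?=[A-Z])", sample)

print(len(naive), naive)
print(len(rule_based), rule_based)
```

The sample contains two sentences. The naive split produces five pieces, and the hand-written rule still produces three because it breaks after "Dr.". Cases like this are the reason a trained sentence tokenizer is usually preferred.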

##### Tokenization
> *Tokenization* is the process of demarcating and possibly classifying sections of a string of input characters. ([Wikipedia](https://en.wikipedia.org/wiki/Lexical_analysis#Tokenization)).

We will now show another technique, called tokenization, which separates each word in the full text. A sentence, i.e. a text string, can be divided into its components, i.e. words or tokens.

You might think, "so what?". Well, on its own tokenization may not look very impressive, but it reduces some of the complexity of language and it is the starting point for other TDM processes that create meaningful and useful results.
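
As a rough illustration of the idea, the sketch below compares splitting on whitespace with a simple regular expression that also separates punctuation. It uses only the standard library and a shortened phrase from the paper. The pattern is a stand-in chosen for this example, not what NLTK's `word_tokenize` does internally.

```python
import re

sentence = "Special Issue on Linked Data (IJSWIS)."

# Splitting on whitespace leaves punctuation glued to the words.
by_whitespace = sentence.split()

# Runs of word characters become tokens, and so does each punctuation mark.
tokens = re.findall(r"\w+|[^\w\s]", sentence)

print(by_whitespace)
print(tokens)
```

With whitespace splitting the last item is `'(IJSWIS).'`, while the regular expression yields `'('`, `'IJSWIS'`, `')'` and `'.'` as separate tokens.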


In [6]:
tokenized_sentences = [nltk.word_tokenize(sentence) for sentence in sentences]
tokenized_sentences[1:50]

[['(', 'eds', '.', ')', '.'],
 ['Special',
  'Issue',
  'on',
  'Linked',
  'Data',
  ',',
  'International',
  'Journal',
  'on',
  'Semantic',
  'Web',
  'and',
  'Information',
  'Systems',
  '(',
  'IJSWIS',
  ')',
  '.'],
 ['http',
  ':',
  '//linkeddata.org/docs/ijswis-special-issue',
  'Abstract',
  'The',
  'term',
  'Linked',
  'Data',
  'refers',
  'to',
  'a',
  'set',
  'of',
  'best',
  'practices',
  'for',
  'publishing',
  'and',
  'connecting',
  'structured',
  'data',
  'on',
  'the',
  'Web',
  '.'],
 ['These',
  'best',
  'practices',
  'have',
  'been',
  'adopted',
  'by',
  'an',
  'increasing',
  'number',
  'of',
  'data',
  'providers',
  'over',
  'the',
  'last',
  'three',
  'years',
  ',',
  'leading',
  'to',
  'the',
  'creation',
  'of',
  'a',
  'global',
  'data',
  'space',
  'containing',
  'billions',
  'of',
  'assertions',
  '-',
  'the',
  'Web',
  'of',
  'Data',
  '.'],
 ['In',
  'this',
  'article',
  'we',
  'present',
  'the',
  'concept',

##### Part of Speech (POS) tagging

> *POS tagging is the process of marking up a word in a text (corpus) as corresponding to a particular part of speech, based on both its definition and its context—i.e., its relationship with adjacent and related words in a phrase, sentence, or paragraph.* ([Wikipedia](https://en.wikipedia.org/wiki/Part-of-speech_tagging)).

Each extracted word is given a tag, which helps us understand its grammatical role in the sentence. A list of the tag abbreviations displayed next to the words can be found [here](https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html). For example:

- NN -> noun (singular or mass)
- NNS -> noun (plural)
- NNP -> proper noun (singular)
- VB -> verb (base form)
- CC -> coordinating conjunction
... 

In [7]:
import pprint
pos_tagged_0 = nltk.pos_tag(tokenized_sentences[10])
pprint.pprint(pos_tagged_0)

[('This', 'DT'),
 ('functionality', 'NN'),
 ('has', 'VBZ'),
 ('been', 'VBN'),
 ('enabled', 'VBN'),
 ('by', 'IN'),
 ('the', 'DT'),
 ('generic', 'NN'),
 (',', ','),
 ('open', 'JJ'),
 ('and', 'CC'),
 ('extensible', 'JJ'),
 ('nature', 'NN'),
 ('of', 'IN'),
 ('the', 'DT'),
 ('Web', 'NNP'),
 ('(', '('),
 ('Jacobs', 'NNP'),
 ('&', 'CC'),
 ('Walsh', 'NNP'),
 (',', ','),
 ('2004', 'CD'),
 (')', ')'),
 (',', ','),
 ('which', 'WDT'),
 ('is', 'VBZ'),
 ('also', 'RB'),
 ('seen', 'VBN'),
 ('as', 'IN'),
 ('a', 'DT'),
 ('key', 'JJ'),
 ('feature', 'NN'),
 ('in', 'IN'),
 ('the', 'DT'),
 ('Web', 'NNP'),
 ("'s", 'POS'),
 ('unconstrained', 'JJ'),
 ('growth', 'NN'),
 ('.', '.')]


**Practical exercises:** 
 - **Easy:** The list above contains the tagged words of one sentence. Use this [guide](https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html) to explain what part of speech each word is. 
 - **Advanced:** In this example we chose to run POS tagging on the sentence at position 10 (the number marked in green in the code). Go to the code block and try choosing another sentence. Observe the words of each sentence and the tags assigned to each word. In which cases is the tool less useful, e.g. does it fail to tag a word correctly?
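
Once a sentence has been tagged, the result is an ordinary Python list of `(word, tag)` pairs, so it can be summarised with standard tools. The sketch below copies the first eight pairs from the output above by hand, so it runs without NLTK. The variable names are chosen for this illustration only.

```python
from collections import Counter

# First eight (word, tag) pairs, copied from the tagger output shown above.
tagged = [
    ("This", "DT"), ("functionality", "NN"), ("has", "VBZ"), ("been", "VBN"),
    ("enabled", "VBN"), ("by", "IN"), ("the", "DT"), ("generic", "NN"),
]

# Count how often each tag occurs.
tag_counts = Counter(tag for _, tag in tagged)

# Keep only the words whose tag starts with "NN" (the noun tags).
nouns = [word for word, tag in tagged if tag.startswith("NN")]

print(tag_counts)
print(nouns)
```

Note that "generic" ends up in the noun list although it acts as an adjective in this sentence, which is one example of the tagger making a mistake.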

##### Named Entity Recognition (NER)
>*... locate and classify named entities in text into pre-defined categories such as the names of persons, organizations, locations, expressions of times, quantities, monetary values, percentages, etc. ([Wikipedia](https://en.wikipedia.org/wiki/Named-entity_recognition)).*

Unlike POS tagging, Named Entity Recognition does not merely identify the grammatical role of a word in a sentence; it also recognises when words form a specific entity or concept and carry a semantic meaning.

In [8]:
sentence_number=141
parse_tree = nltk.ne_chunk(nltk.tag.pos_tag(sentences[sentence_number].split()), binary=True)
named_entities = []

for t in parse_tree.subtrees():
    if t.label() == 'NE':
        named_entities.append(list(t))  
(sentences[sentence_number], named_entities)

('The Open Provenance Model\n(Moreau et al., 2008) provides terms for describing data transformation workflows.',
 [[('Open', 'JJ'), ('Provenance', 'NNP'), ('Model', 'NNP')]])

**Practical exercises:** 
 - **Easy:** In this example we are looking at sentence number 141. Change the value of `sentence_number` in the code to look at another sentence. 
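
Each named entity in the output above is a list of `(word, tag)` pairs. A small sketch of how such lists could be turned back into readable phrases is shown below. The entity is copied by hand from the output above, and the variable names are chosen for this illustration only.

```python
# Named entities as returned above: one list of (word, tag) pairs per entity.
named_entities = [[("Open", "JJ"), ("Provenance", "NNP"), ("Model", "NNP")]]

# Join the words of each entity and drop the tags.
entity_phrases = [" ".join(word for word, _ in entity) for entity in named_entities]

print(entity_phrases)
```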
 

##### Stemming
> *Stemming is the process of reducing inflected (or sometimes derived) words to their word stem, base or root form—generally a written word form ([Wikipedia](https://en.wikipedia.org/wiki/Stemming)).*

Stemming takes a word from the text and extracts its stem. Read the first red line to understand what the code does.

In [0]:
'''
We import the stemming functionality from the NLTK library
'''
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

We picked the word "Management" and extracted its stem.

In [10]:
stemmer.stem("Management")


'manag'

**Practical exercises:**
 - **Easy:** Change the word we pass to the stemmer and try different words, for example "manager" or "managerial". 
 - **Really advanced:** In the example above we used the `PorterStemmer` class to find the stem. Another implementation is the `SnowballStemmer`. Go to this [page](http://www.nltk.org/howto/stem.html), locate the section "Unit tests for Snowball stemmer" and run it. (Tip: replace the `PorterStemmer` with the `SnowballStemmer`, and don't forget to import the correct class!) 
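
To give a feel for what a stemmer does, here is a deliberately crude sketch that strips a few suffixes. It is not the Porter algorithm, which applies a much larger and more careful set of rules. The suffix list and the `crude_stem` function are invented for this illustration.

```python
# A tiny, hand-picked suffix list (an assumption made for this example only).
SUFFIXES = ("ement", "ial", "ing", "er", "s")

def crude_stem(word, min_stem_length=3):
    """Repeatedly strip a known suffix while enough of the word remains."""
    stem = word.lower()
    changed = True
    while changed:
        changed = False
        for suffix in SUFFIXES:
            if stem.endswith(suffix) and len(stem) - len(suffix) >= min_stem_length:
                stem = stem[: -len(suffix)]
                changed = True
                break
    return stem

for word in ["Management", "manager", "managerial", "managing"]:
    print(word, "->", crude_stem(word))
```

All four words reduce to `manag` here, but a list this short will fail on many other words, which is why established stemmers such as Porter or Snowball are used in practice.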

Would you like to try more examples? Are you ready for something more advanced? Try the practical activities in the "Examples using the CORE API" section by clicking [here](https://drive.google.com/file/d/1lIsEfw6DroDYiKNrmb4bZLadxD4CsTuM/view?usp=sharing).