<center>
    <h1> Natural Language Processing and Large Language Models for Research Data Exploration and Analysis
 </h1> </center>

<center> <h1> Day-1: Basic Text Processing  </h1> </center>

<center> <h2> Sample code </h2> </center>

<center> <h4> Raghava Mukkamala (rrm.digi@cbs.dk)  </h4> </center>


### Instructions

#### If you are working in a Jupyter Notebook, you will most likely need to install the libraries first, e.g. with `!pip install`. On Google Colab, most libraries should come preinstalled. In either case, remember to import the libraries!

#### Here we show some basic text processing steps using the libraries TextBlob and nltk. Note that for most NLP tasks, there are many alternative libraries to choose from.
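
Before running the cells below, it can help to check which libraries are already available in the current environment. The following is a minimal sketch using only the Python standard library; the helper name `missing_packages` is made up for this illustration and is not part of any library.

```python
from importlib.util import find_spec

def missing_packages(names):
    """Return the top-level package names that cannot be found for import."""
    return [name for name in names if find_spec(name) is None]

# Packages imported in this notebook.
required = ["nltk", "textblob", "prettytable"]

missing = missing_packages(required)
if missing:
    print("Please install:", ", ".join(missing))
else:
    print("All required libraries are available.")
```

If a package is reported as missing, uncomment the corresponding `!pip install` line in the next cell.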


In [1]:
# !pip install --upgrade numpy
# !pip install prettytable
# !pip install spacy
# !pip install nltk
# !pip install textblob

import nltk
from textblob import TextBlob
from prettytable import PrettyTable

In [17]:
# Downloading required resources
nltk.download('punkt_tab')
nltk.download('averaged_perceptron_tagger_eng')
nltk.download('maxent_ne_chunker_tab')
nltk.download('words')
nltk.download('brown')

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger_eng is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package maxent_ne_chunker_tab to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package maxent_ne_chunker_tab is already up-to-date!
[nltk_data] Downloading package words to /root/nltk_data...
[nltk_data]   Package words is already up-to-date!
[nltk_data] Downloading package brown to /root/nltk_data...
[nltk_data]   Unzipping corpora/brown.zip.


True

## Tokenization: splitting text into words



In [12]:
# Using only TextBlob

text = ("Natural language processing (NLP) is a field " +
       "of computer science, artificial intelligence " +
       "and computational linguistics concerned with " +
       "the interactions between computers and human " +
       "(natural) languages, and, in particular, " +
       "concerned with programming computers to " +
       "fruitfully process large natural language " +
       "corpora. Challenges in natural language " +
       "processing frequently involve natural " +
       "language understanding, natural language " +
       "generation (frequently from formal, machine" +
       "-readable logical forms), connecting language " +
       "and machine perception, managing human-" +
       "computer dialog systems, or some combination " +
       "thereof.")

# If you want to use a shorter text, comment out the one above and use the one below instead.

#text = "Apple is looking at buying U.K. startup for $1 billion"

# create a TextBlob object
tb = TextBlob(text)

# tokenize the text into words.
print("Words :\n", tb.words)


Words :
 ['Natural', 'language', 'processing', 'NLP', 'is', 'a', 'field', 'of', 'computer', 'science', 'artificial', 'intelligence', 'and', 'computational', 'linguistics', 'concerned', 'with', 'the', 'interactions', 'between', 'computers', 'and', 'human', 'natural', 'languages', 'and', 'in', 'particular', 'concerned', 'with', 'programming', 'computers', 'to', 'fruitfully', 'process', 'large', 'natural', 'language', 'corpora', 'Challenges', 'in', 'natural', 'language', 'processing', 'frequently', 'involve', 'natural', 'language', 'understanding', 'natural', 'language', 'generation', 'frequently', 'from', 'formal', 'machine-readable', 'logical', 'forms', 'connecting', 'language', 'and', 'machine', 'perception', 'managing', 'human-computer', 'dialog', 'systems', 'or', 'some', 'combination', 'thereof']


In [13]:
# Using TextBlob and PrettyTable

tb = TextBlob(text)
print('Raw Document: ', tb, "\n")

tab = PrettyTable(['index','word'])

# enumerate() yields (index, word) pairs, so no manual counter is needed
for index, word in enumerate(tb.words):
    tab.add_row([index, word])

print(tab)

Raw Document:  Natural language processing (NLP) is a field of computer science, artificial intelligence and computational linguistics concerned with the interactions between computers and human (natural) languages, and, in particular, concerned with programming computers to fruitfully process large natural language corpora. Challenges in natural language processing frequently involve natural language understanding, natural language generation (frequently from formal, machine-readable logical forms), connecting language and machine perception, managing human-computer dialog systems, or some combination thereof. 

+-------+------------------+
| index |       word       |
+-------+------------------+
|   0   |     Natural      |
|   1   |     language     |
|   2   |    processing    |
|   3   |       NLP        |
|   4   |        is        |
|   5   |        a         |
|   6   |      field       |
|   7   |        of        |
|   8   |     computer     |
|   9   |
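
To make concrete what a tokenizer does, here is a minimal sketch of a regex-based word tokenizer written with the standard library only. It is not how TextBlob or nltk tokenize internally; the pattern and the name `simple_tokenize` are assumptions chosen for this illustration.

```python
import re

# Assumed token pattern: runs of letters/digits, optionally joined by
# internal hyphens or apostrophes (so "machine-readable" stays one token).
TOKEN_PATTERN = re.compile(r"[A-Za-z0-9]+(?:[-'][A-Za-z0-9]+)*")

def simple_tokenize(text):
    """Split text into word tokens, dropping punctuation and symbols."""
    return TOKEN_PATTERN.findall(text)

tokens = simple_tokenize("Apple is looking at buying U.K. startup for $1 billion")
for index, token in enumerate(tokens):
    print(index, token)
```

With this naive pattern the abbreviation "U.K." is split into the two tokens `U` and `K`, and the currency symbol is dropped. Edge cases like these are one reason to prefer a library tokenizer over a hand-written regular expression.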

## Part-of-Speech Tagging and Entity Recognition

### POS tagging with nltk

In [14]:
from nltk import word_tokenize, pos_tag, ne_chunk

sentence = 'they lay back on the San Francisco grass in U.S.A. and looked at the stars and their'

print(ne_chunk(pos_tag(word_tokenize(sentence))))

(S
  they/PRP
  lay/VBD
  back/RB
  on/IN
  the/DT
  (ORGANIZATION San/NNP Francisco/NNP)
  grass/NN
  in/IN
  (GPE U.S.A./NNP)
  and/CC
  looked/VBD
  at/IN
  the/DT
  stars/NNS
  and/CC
  their/PRP$)
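
The tree above shows that named-entity chunks are built on top of the POS tags. As a rough illustration of that idea, the sketch below groups consecutive `NNP` (proper noun) tokens into candidate names. This is a simplified heuristic, not what `ne_chunk` does: `ne_chunk` uses a trained model and also assigns entity types such as `GPE`. The function name `proper_noun_chunks` is made up for this example.

```python
from itertools import groupby

# (word, tag) pairs copied from the tagged output shown above.
tagged = [
    ("they", "PRP"), ("lay", "VBD"), ("back", "RB"), ("on", "IN"),
    ("the", "DT"), ("San", "NNP"), ("Francisco", "NNP"), ("grass", "NN"),
    ("in", "IN"), ("U.S.A.", "NNP"), ("and", "CC"), ("looked", "VBD"),
    ("at", "IN"), ("the", "DT"), ("stars", "NNS"), ("and", "CC"),
    ("their", "PRP$"),
]

def proper_noun_chunks(tagged_tokens):
    """Join each run of consecutive NNP tokens into one candidate name."""
    chunks = []
    for is_proper, group in groupby(tagged_tokens, key=lambda pair: pair[1] == "NNP"):
        if is_proper:
            chunks.append(" ".join(word for word, _ in group))
    return chunks

print(proper_noun_chunks(tagged))
```

The heuristic finds the same two spans as the chunker in this sentence, but it cannot tell a location from an organization.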


### POS tagging and Entity extraction with TextBlob

In [15]:
def pos_tagging_with_textblob(text):
    # Create a TextBlob object
    blob = TextBlob(text)

    # Perform POS tagging
    pos_tags = blob.tags

    return pos_tags

# Example text
text = """
Elon Musk, the CEO of Tesla and SpaceX, is known for his groundbreaking innovations in technology.
"""

# Get POS tags
tags = pos_tagging_with_textblob(text)

# Display POS tags
print("Part-of-Speech Tags:")
for word, tag in tags:
    print(f" - {word}: {tag}")


Part-of-Speech Tags:
 - Elon: NNP
 - Musk: NNP
 - the: DT
 - CEO: NNP
 - of: IN
 - Tesla: NNP
 - and: CC
 - SpaceX: NNP
 - is: VBZ
 - known: VBN
 - for: IN
 - his: PRP$
 - groundbreaking: NN
 - innovations: NNS
 - in: IN
 - technology: NN
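
The tags follow the Penn Treebank tag set. As a reading aid, the sketch below spells out the tags that occur in the output above and counts how often each one appears. The (word, tag) pairs are copied from that output; the variable names are chosen for this illustration only.

```python
from collections import Counter

# Penn Treebank tags that occur in the output above, with short descriptions.
TAG_DESCRIPTIONS = {
    "NNP": "proper noun, singular",
    "NN": "noun, singular or mass",
    "NNS": "noun, plural",
    "DT": "determiner",
    "IN": "preposition or subordinating conjunction",
    "CC": "coordinating conjunction",
    "VBZ": "verb, 3rd person singular present",
    "VBN": "verb, past participle",
    "PRP$": "possessive pronoun",
}

# (word, tag) pairs copied from the output above.
tagged_words = [
    ("Elon", "NNP"), ("Musk", "NNP"), ("the", "DT"), ("CEO", "NNP"),
    ("of", "IN"), ("Tesla", "NNP"), ("and", "CC"), ("SpaceX", "NNP"),
    ("is", "VBZ"), ("known", "VBN"), ("for", "IN"), ("his", "PRP$"),
    ("groundbreaking", "NN"), ("innovations", "NNS"), ("in", "IN"),
    ("technology", "NN"),
]

# Count how many words received each tag, most frequent first.
tag_counts = Counter(tag for _, tag in tagged_words)
for tag, count in tag_counts.most_common():
    print(f"{tag:5} {count}  {TAG_DESCRIPTIONS[tag]}")
```

Note that "groundbreaking" was tagged `NN` although it is used as an adjective here, a reminder that automatic taggers make mistakes.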


In [18]:
def extract_entities_with_textblob(text):
    # Create a TextBlob object
    blob = TextBlob(text)

    # Extract noun phrases as a rough stand-in for entity recognition (no entity types are assigned)
    noun_phrases = blob.noun_phrases

    return noun_phrases

# Example text
text = """
Elon Musk, the CEO of Tesla and SpaceX, was born on June 28, 1971, in Pretoria, South Africa.
He founded companies like PayPal and Neuralink and is currently one of the richest people in the world.
"""

# Extract entities (noun phrases) from the text
entities = extract_entities_with_textblob(text)

# Display extracted noun phrases
print("Extracted Noun Phrases:")
for entity in entities:
    print(f" - {entity}")


Extracted Noun Phrases:
 - elon musk
 - ceo
 - tesla
 - spacex
 - june
 - pretoria
 - africa
 - paypal
 - neuralink


## Stemming and lemmatization with nltk

In [20]:
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /root/nltk_data...


True

In [21]:
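# Example of stemming and lemmatization
from nltk.stem.porter import PorterStemmer

# stemming: rule-based suffix stripping (the result need not be a real word)
st = PorterStemmer()

# lemmatization: lookup based on WordNet (words are treated as nouns by default)
wnl = nltk.WordNetLemmatizer()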

print('stemming sing:',st.stem("sing"))
print('lemmatize sing:',wnl.lemmatize("sing"))
print('')
print('stemming ponies:',st.stem("ponies"))
print('lemmatize ponies:',wnl.lemmatize("ponies"))
print('')
print('stemming example:',st.stem("example"))
print('lemmatize example:', wnl.lemmatize("example"))
print('')
print('stemming equivalent:', st.stem("equivalent"))
print('lemmatize equivalent:',wnl.lemmatize("equivalent"))




stemming sing: sing
lemmatize sing: sing

stemming ponies: poni
lemmatize ponies: pony

stemming example: exampl
lemmatize example: example

stemming equivalent: equival
lemmatize equivalent: equivalent
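
The output shows the key difference: a stemmer strips suffixes by rule and may produce non-words such as "poni", while a lemmatizer returns a dictionary form such as "pony". The sketch below illustrates this with a toy example. `porter_step_1a` implements only step 1a of the classic Porter algorithm (plural suffixes), not the full stemmer used above, and `TOY_LEMMAS` is a hypothetical one-entry dictionary standing in for a real vocabulary such as WordNet.

```python
def porter_step_1a(word):
    """Apply only step 1a of the Porter algorithm (plural suffixes)."""
    if word.endswith("sses"):
        return word[:-2]      # caresses -> caress
    if word.endswith("ies"):
        return word[:-2]      # ponies -> poni
    if word.endswith("ss"):
        return word           # caress -> caress
    if word.endswith("s"):
        return word[:-1]      # cats -> cat
    return word

# Hypothetical mini-dictionary; a real lemmatizer relies on a full vocabulary.
TOY_LEMMAS = {"ponies": "pony"}

def toy_lemmatize(word):
    """Return the dictionary form if known, otherwise the word unchanged."""
    return TOY_LEMMAS.get(word, word)

for w in ["ponies", "caresses", "cats", "sing"]:
    print(w, "-> stem:", porter_step_1a(w), "| lemma:", toy_lemmatize(w))
```

The later steps of the Porter algorithm remove further suffixes, which is how the full stemmer arrives at forms like "exampl" and "equival".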
